Channel: Planet Python

Shannon -jj Behrens: Python: My Favorite Python Tricks for LeetCode Questions


I've been spending a lot of time practicing on LeetCode recently, so I thought I'd share some of my favorite intermediate-level Python tricks. I'll also cover some newer features of Python you may not have started using yet. I'll start with basic tips and then move to more advanced ones.

Get help()

Python's documentation is pretty great, and some of these examples are taken from there.

For instance, if you just google "heapq", you'll see the official docs for heapq, which are often enough.

However, it's also helpful to sometimes just quickly use help() in the shell. Here, I can't remember that push() is actually called append().

>>> help([])

>>> dir([])

>>> help([].append)

enumerate()

If you need to loop over a list, you can use enumerate() to get both the item as well as the index. As a mnemonic, I like to think for (i, x) in enumerate(...):

for (i, x) in enumerate(some_list):
    ...

items()

Similarly, you can get both the key and the value at the same time when looping over a dict using items():

for (k, v) in some_dict.items():
    ...

[] vs. get()

Remember, when you use [] with a dict, if the value doesn't exist, you'll get a KeyError. Rather than see if an item is in the dict and then look up its value, you can use get():

val = some_dict.get(key)  # It defaults to None.
if val is None:
    ...

Similarly, .setdefault() is sometimes helpful.

Some people prefer to just use [] and handle the KeyError since exceptions aren't as expensive in Python as they are in other languages.
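
For example, here's a small sketch (my own, not from the original post) of get() with a default value and of setdefault():

counts = {}
val = counts.get('missing', 0)  # Returns 0 instead of raising a KeyError.

groups = {}
groups.setdefault('evens', []).append(2)  # Creates the list the first time.
groups.setdefault('evens', []).append(4)  # Reuses it afterward.
# groups == {'evens': [2, 4]}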

range() is smarter than you think

# Loop over the items directly.
for item in items:
    ...

# Or loop over the indexes.
for index in range(len(items)):
    ...

# Count by 2s.
for i in range(0, 100, 2):
    ...

# Count backward from 100 to 0 inclusive.
for i in range(100, -1, -1):
    ...

# Okay, Mr. Smarty Pants, I'm sure you knew all that, but did you know
# that you can pass a range object around, and it knows how to reverse
# itself via slice notation? :-P
r = range(100)
r = r[::-1] # range(99, -1, -1)

print(f'') debugging

Have you switched to Python's f-strings yet? They're more convenient and safer (from injection vulnerabilities) than % and .format(). They even have a syntax for outputting the expression as well as its value:

# Got 2+2=4
print(f'Got {2+2=}')

for else

Python has a feature that I haven't seen in other programming languages. Both for and while can be followed by an else clause, which is useful when you're searching for something.

for item in some_list:
    if is_what_im_looking_for(item):
        print(f"Yay! It's {item}.")
        break
else:
    print("I couldn't find what I was looking for.")

Use a list as a stack

The cost of using a list as a stack is (amortized) O(1):

elements = []
elements.append(element) # Not push
element = elements.pop()

Note that inserting something at the beginning of the list or in the middle is more expensive because it has to shift everything to the right--see deque below.

sort() vs. sorted()

# sort() sorts a list in place.
my_list.sort()

# Whereas sorted() returns a sorted *copy* of an iterable:
my_sorted_list = sorted(some_iterable)

And, both of these can take a key function if you need to sort objects.
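
For instance, here's a quick sketch (with made-up data) of sorting with a key function:

people = [('Alice', 31), ('Bob', 27)]

# Sort the list in place by age.
people.sort(key=lambda person: person[1])

# Or get a sorted copy, youngest first.
youngest_first = sorted(people, key=lambda person: person[1])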

set and frozenset

Sets are so useful for so many problems! Just in case you didn't know some of these tricks:

# There is now syntax for creating sets.
s = {'Von'}

# There are set "comprehensions" which are like list comprehensions, but for sets.
s2 = {f'{name} the III' for name in s}
{'Von the III'}

# If you can't remember how to use union, intersection, difference, etc.
help(set())

# If you need an immutable set, for instance, to use as a dict key, use frozenset.
frozenset((1, 2, 3))

deque

If you find yourself needing a queue or a list that you can push and pop from either side, use a deque:

>>> from collections import deque
>>>
>>> d = deque()
>>> d.append(3)
>>> d.append(4)
>>> d.appendleft(2)
>>> d.appendleft(1)
>>> d
deque([1, 2, 3, 4])
>>> d.popleft()
1
>>> d.pop()
4

Using a stack instead of recursion

Instead of using recursion (which is limited to a depth of about 1000 frames by default), you can use a while loop and manually manage a stack yourself. Here's a slightly contrived example:

work = [create_initial_work()]
while work:
    work_item = work.pop()
    result = process(work_item)
    if is_done(result):
        return result
    work.append(result.pieces[0])
    work.append(result.pieces[1])

Using yield from

If you don't know about yield, you can go spend some time learning about that. It's awesome.

Sometimes, when you're in one generator, you need to call another generator. Python now has yield from for that:

def my_generator():
    yield 1
    yield from some_other_generator()
    yield 6

So, here's an example of backtracking:

class Solution:
    def problem(self, digits: str) -> List[str]:
        def generate_possibilities(work_so_far, remaining_work):
            if not remaining_work:
                if work_so_far:
                    yield work_so_far
                return
            first_part, remaining_part = remaining_work[0], remaining_work[1:]
            for i in things_to_try:
                yield from generate_possibilities(work_so_far + i, remaining_part)

        output = list(generate_possibilities(no_work_so_far, its_all_remaining_work))
        return output

This is appropriate if you have fewer than 1000 "levels" but a ton of possibilities for each of those levels. It won't work if you're going to need more than 1000 layers of recursion. In that case, switch to "Using a stack instead of recursion".

Pre-initialize your list

If you know how long your list is going to be ahead of time, you can avoid needing to resize it multiple times by just pre-initializing it:

dp = [None] * len(items)

collections.Counter()

How many times have you used a dict to count up something? It's built-in in Python:

>>> from collections import Counter
>>> c = Counter('abcabcabcaaa')
>>> c
Counter({'a': 6, 'b': 3, 'c': 3})

defaultdict

Similarly, there's defaultdict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> d['girls'].append('Jocylenn')
>>> d['boys'].append('Greggory')
>>> d
defaultdict(<class 'list'>, {'girls': ['Jocylenn'], 'boys': ['Greggory']})

Notice that I didn't need to set d['girls'] to an empty list before I started appending to it.

heapq

I had heard of heaps in school, but I didn't really know what they were. Well, it turns out they're pretty helpful for several of the problems, and Python has a list-based heap implementation built-in.

If you don't know what a heap is, I recommend this video and this video. They'll explain what a heap is and how to implement one using a list.

The heapq module is a built-in module for managing a heap. It builds on top of an existing list:

import heapq

some_list = ...
heapq.heapify(some_list)

# The head of the heap is some_list[0].
# The len of the heap is still len(some_list).

heapq.heappush(some_list, item)
head_item = heapq.heappop(some_list)

The heapq module also has nlargest and nsmallest built-in so you don't have to implement those things yourself.
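
For instance (a quick sketch of my own):

import heapq

scores = [5, 1, 9, 3, 7]
print(heapq.nlargest(2, scores))   # [9, 7]
print(heapq.nsmallest(2, scores))  # [1, 3]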

Keep in mind that heapq is a minheap. Let's say that what you really want is a maxheap, and you're not working with ints, you're working with objects. Here's how to tweak your data to get it to fit heapq's way of thinking:

heap = []
heapq.heappush(heap, (-obj.value, obj))

(ignored, first_obj) = heapq.heappop(heap)

Here, I'm using - to make it a maxheap. I'm wrapping things in a tuple so that it's sorted by the obj.value, and I'm including the obj as the second value so that I can get it.

bisect

I'm sure you've implemented binary search before. Python has it built-in. It even has keyword arguments that you can use to search in only part of the list:

import bisect

insertion_point = bisect.bisect_left(sorted_list, some_item, lo=lo, hi=hi)

Pay attention to the key argument, which is sometimes useful but may take a little work to behave the way you want.

namedtuple and dataclasses

Tuples are great, but it can be a pain to deal with remembering the order of the elements or unpacking just a single element in the tuple. That's where namedtuple comes in.

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(5, 7)
>>> p
Point(x=5, y=7)
>>> p.x
5
>>> q = p._replace(x=92)
>>> p
Point(x=5, y=7)
>>> q
Point(x=92, y=7)

Keep in mind that tuples are immutable. I particularly like using namedtuples for backtracking problems. In that case, the immutability is actually a huge asset. I use a namedtuple to represent the state of the problem at each step. I have this much stuff done, this much stuff left to do, this is where I am, etc. At each step, you take the old namedtuple and create a new one in an immutable way.
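
Here's a tiny sketch (the field names are my own) of that pattern, using _replace() to derive a new state rather than mutating the old one:

from collections import namedtuple

State = namedtuple('State', ['done', 'remaining'])

def step(state):
    # Build a brand new State; the old one is untouched.
    return state._replace(done=state.done + state.remaining[:1],
                          remaining=state.remaining[1:])

start = State(done='', remaining='abc')
print(step(start))  # State(done='a', remaining='bc')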

If you need something mutable, use a dataclass instead:

from dataclasses import dataclass

@dataclass
class InventoryItem:
    """Class for keeping track of an item in inventory."""
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand

item = InventoryItem(name='Box', unit_price=19, quantity_on_hand=2)

dataclasses are great when you want a little class to hold some data, but you don't want to waste much time writing one from scratch.

int, decimal, math.inf, etc.

Thankfully, Python's int type supports arbitrarily large values by default:

>>> 1 << 128
340282366920938463463374607431768211456

There's also the decimal module if you need to work with things like money where a float isn't accurate enough or when you need a lot of decimal places of precision.
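
Here's a quick illustration (my own, not from the post) of why decimal helps:

from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004
print(Decimal('0.1') + Decimal('0.2'))  # 0.3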

Sometimes, they'll say the range is -2 ^ 32 to 2 ^ 32 - 1. You can get those values via bitshifting:

>>> -(2 ** 32) == -(1 << 32)
True
>>> (2 ** 32) - 1 == (1 << 32) - 1
True

Sometimes, it's useful to initialize a variable with math.inf (i.e. infinity) and then try to find new values less than that.
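
For example, a minimal sketch (with a made-up costs list) of the "start at infinity" pattern:

import math

costs = [7, 3, 9]
best = math.inf
for cost in costs:
    if cost < best:
        best = cost
# best is now 3.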

Closures

I'm not sure every interviewer is going to like this, but I tend to skip the OOP stuff and use a bunch of local helper functions so that I can access things via closure:

class Solution():  # This is what LeetCode gave me.
    def solveProblem(self, arg1, arg2):  # Why they used camelCase, I have no idea.

        def helper_function():
            # I have access to arg1 and arg2 via closure.
            # I don't have to store them on self or pass them around
            # explicitly.
            return arg1 + arg2

        counter = 0

        def can_mutate_counter():
            # By using nonlocal, I can even mutate counter.
            # I rarely use this approach in practice. I usually pass it in
            # as an argument and return a value.
            nonlocal counter
            counter += 1

        can_mutate_counter()
        return helper_function() + counter

match statement

Did you know Python now has a match statement?

# Taken from: https://learnpython.com/blog/python-match-case-statement/

>>> command = 'Hello, World!'
>>> match command:
...     case 'Hello, World!':
...         print('Hello to you too!')
...     case 'Goodbye, World!':
...         print('See you later')
...     case other:
...         print('No match found')

It's actually much more sophisticated than a switch statement, so take a look, especially if you've never used match in a functional language like Haskell.

OrderedDict

If you ever need to implement an LRU cache, it'll be quite helpful to have an OrderedDict.

Python's dicts are now ordered by default. However, the docs for OrderedDict say that there are still some cases where you might need to use OrderedDict; I can't remember the details. If you ever need your dicts to be ordered, just read the docs and figure out whether you need an OrderedDict or whether a normal dict will do.
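
As a rough sketch (my own, not a full LeetCode solution), an LRU cache built on OrderedDict might look like this:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)  # Mark as most recently used.
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # Evict the least recently used item.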

@functools.cache

If you need a cache, sometimes you can just wrap your code in a function and use functools.cache:

from functools import cache

@cache
def factorial(n):
    return n * factorial(n - 1) if n else 1

print(factorial(5))
...
factorial.cache_info()  # e.g. CacheInfo(hits=3, misses=8, maxsize=None, currsize=8)

Debugging ListNodes

A lot of the problems involve a ListNode class that's provided by LeetCode. It's not very "debuggable". Add this code temporarily to improve that:

def list_node_str(head):
    seen_before = set()
    pieces = []
    p = head
    while p is not None:
        if p in seen_before:
            pieces.append(f'loop at {p.val}')
            break
        pieces.append(str(p.val))
        seen_before.add(p)
        p = p.next
    joined_pieces = ', '.join(pieces)
    return f'[{joined_pieces}]'


ListNode.__str__ = list_node_str

Saving memory with the array module

Sometimes you need a really long list of simple numeric (or boolean) values. The array module can help with this, and it's an easy way to decrease your memory usage after you've already gotten your algorithm working.

>>> import array
>>> array_of_bytes = array.array('b')
>>> array_of_bytes.frombytes(b'\0' * (array_of_bytes.itemsize * 10_000_000))

Pay close attention to the type of values you configure the array to accept. Read the docs.

I'm sure there's a way to use individual bits for an array of booleans to save even more space, but it'd probably cost more CPU, and I generally care about CPU more than memory.

Using an exception for the success case rather than the error case

A lot of Python programmers don't like this trick because it's equivalent to goto, but I still occasionally find it convenient:

class Eureka(StopIteration):
    """Eureka means "I found it!" """
    pass


def do_something_else():
    some_value = 5
    raise Eureka(some_value)


def do_something():
    do_something_else()


try:
    do_something()
except Eureka as exc:
    print(f'I found it: {exc.args[0]}')

Using VS Code, etc.

VS Code has a pretty nice Python extension. If you highlight the code and hit shift-enter, it'll run it in a shell. That's more convenient than just typing everything directly in the shell. Other editors have something similar, or perhaps you use a Jupyter notebook for this.

Another thing that helps me is that I'll often have separate files open with separate attempts at a solution. I guess you can call this the "fast" approach to branching.

Conclusion

Well, those are my favorite tricks off the top of my head. I'll add more if I think of any.

This is just a single blog post, but if you want more, check out Python 3 Module of the Week.


Anwesha Das: How to add a renew hook for certbot?


After moving the foss.training server to a new location, we found that the TLS certificate had expired. I looked into it and figured out that although certbot had renewed the certificate, it never reloaded nginx.

Now, to make sure that nginx is reloaded next time, we must add a renew hook in /etc/letsencrypt/renewal/foss.training.conf under the [renewalparams] section:

renew_hook = service nginx reload

One must remember to update the path based on their own domain name. Thank you Saptak for pointing out the expired certificate and mentioning that it is a common pain point for people. I hope that this will be helpful for people in the future.

PyBites: The importance of practicing gratitude


Listen now:

This week we talk about gratitude.

Why? We spotted a trend that people are not saying thanks enough. We often assume things are "just" working, forgetting that there is actually a lot of work involved in keeping things running smoothly.

Expressing gratitude takes relatively little effort, yet it can have a big impact on the motivation of others, even their lives.

We hope you enjoy this exercise and don’t forget to practice gratitude (even if it’s only in your own journal, it can boost your happiness).

Bob & Julian

Links:

– For mindset and career tips, subscribe here.

– To become a better skilled, confident developer, check out our PDM program.

– Our previous podcast episodes are here.

– To give us feedback or share topics you’d like to hear discussed: send us an email.

John Ludhi/nbshare.io: PySpark Substr and Substring


PySpark Substr and Substring

substring(col_name, pos, len) - Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring)
In [24]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Let us load our initial records.

In [3]:
columns=["Full_Name","Salary"]data=[("John A Smith",1000),("Alex Wesley Jones",120000),("Jane Tom James",5000)]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|     John A Smith|  1000|
|Alex Wesley Jones|120000|
|   Jane Tom James|  5000|
+-----------------+------+

PySpark substring

1) Here we are taking a substring for the first name from the Full_Name Column. The Full_Name contains first name, middle name and last name. We are adding a new column for the substring called First_Name

In [7]:
# here we add a new column called 'First_Name' and use substring() to get partial string from 'Full_Name' column
modified_dfFromRDD2 = dfFromRDD2.withColumn("First_Name", substring('Full_Name', 1, 4))
In [8]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

2) We can also get a substring with select and alias to achieve the same result as above

In [9]:
modified_dfFromRDD3 = dfFromRDD2.select("Full_Name", 'Salary', substring('Full_Name', 1, 4).alias('First_Name'))
In [10]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

3) We can also use substring with selectExpr to get a substring of 'Full_Name' column. selectExpr takes SQL expression(s) in a string to execute. This way we can run SQL-like expressions without creating views.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.selectExpr("Full_Name", 'Salary', 'substring(Full_Name, 1, 4) as First_Name')
In [12]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD4.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

4) Here we are going to use substr function of the Column data type to obtain the substring from the 'Full_Name' column and create a new column called 'First_Name'

In [13]:
modified_dfFromRDD5 = dfFromRDD2.withColumn("First_Name", col('Full_Name').substr(1, 4))
In [14]:
# visualizing the modified dataframe yields the same output as seen for all previous examples.
modified_dfFromRDD5.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

5) Let us now consider an example of substring() when the indices go beyond the length of the column value. In that case, the substring() function only returns the characters that fall within the bounds, i.e. (start, start+len). This can be seen in the example below.

In [15]:
# In this example we are going to get the four characters of Full_Name column starting from position 14.
# As can be seen in the example, 4 or fewer characters are returned depending on the string length
modified_dfFromRDD6 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', 14, 4))
In [16]:
modified_dfFromRDD6.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|         |
|Alex Wesley Jones|120000|     ones|
|   Jane Tom James|  5000|        s|
+-----------------+------+---------+

The above method produces the wrong last name. We can fix it with the following approach.

6) Another example of substring() is when we want to get characters relative to the end of the string. In this example, we are going to extract the last name from the Full_Name column.

In [17]:
# In this example we are going to get the five characters of Full_Name column relative to the end of the string.
# As can be seen in the example, the last 5 characters are returned
modified_dfFromRDD7 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', -5, 5))
In [18]:
modified_dfFromRDD7.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|    Smith|
|Alex Wesley Jones|120000|    Jones|
|   Jane Tom James|  5000|    James|
+-----------------+------+---------+

Note that the above approach works only if the last name in each row has a constant character length. If the last names have different lengths, the solution is not that simple.
We will need the index at which the last name starts and also the length of 'Full_Name'. If you are curious, I have provided the solution below without the explanation.

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring, lit, substring_index, length)

Let us create an example with last names having variable character length.

In [20]:
columns=["Full_Name","Salary"]data=[("John A Smith",1000),("Alex Wesley leeper",120000),("Jane Tom kinderman",5000)]rdd=spark.sparkContext.parallelize(data)dfFromRDD2=spark.createDataFrame(rdd).toDF(*columns)dfFromRDD2.show()
+------------------+------+
|         Full_Name|Salary|
+------------------+------+
|      John A Smith|  1000|
|Alex Wesley leeper|120000|
|Jane Tom kinderman|  5000|
+------------------+------+

PySpark substr

In [21]:
dfFromRDD2.withColumn('Last_Name',
                      col("Full_Name").substr((length('Full_Name') - length(substring_index('Full_Name', " ", -1))),
                                              length('Full_Name'))).show()
+------------------+------+----------+
|         Full_Name|Salary| Last_Name|
+------------------+------+----------+
|      John A Smith|  1000|     Smith|
|Alex Wesley leeper|120000|    leeper|
|Jane Tom kinderman|  5000| kinderman|
+------------------+------+----------+

In [22]:
spark.stop()

Real Python: The Real Python Podcast – Episode #122: Configuring a Coding Environment on Windows & Using TOML With Python


Have you attempted to set up a Python development environment on Windows before? Would it be helpful to have an easy-to-follow guide to get you started? This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder's Weekly articles and projects.



Python for Beginners: Check For Superset in Python


In Python, we use sets to store unique immutable objects. In this article, we will discuss what a superset of a set is. We will also discuss ways to check for a superset in Python.

What is a Superset?

A superset of a set is another set that contains all the elements of the given set. In other words, If we have a set A and set B, and each element of set B belongs to set A, then set A is said to be a superset of set B.

Let us consider an example where we are given three sets A, B, and C as follows.

A={1,2,3,4,5,6,7,8}

B={2,4,6,8}

C={0,1,2,3,4}

Here, you can observe that all the elements in set B are present in set A. Hence, set A is a superset of set B. On the other hand, all the elements of set C do not belong to set A. Hence, set A is not a superset of set C.

You can observe that a superset will always have at least as many elements as the original set. Now, let us describe a step-by-step algorithm to check for a superset in Python.

Suggested Reading: Chat Application in Python

How to Check For Superset in Python?

Consider that we are given two sets A and B. Now, we have to check if set B is a superset of set A or not. For this, we will traverse all the elements of set A and check whether they are present in set B or not. If there exists an element in set A that doesn’t belong to set B, we will say that set B is not a superset of set A. Otherwise, set B will be a superset of set A. 

To implement this approach in Python, we will use a for loop and a flag variable isSuperset. We will initialize the isSuperset variable to True denoting that set B is a superset of set A. Now we will traverse set A using a for loop. While traversing the elements in set A, we will check if the element is present in set B or not. 

If we find any element in A that isn’t present in set B, we will assign False to isSuperset showing that set B is not a superset of the set A. 

If we do not find any element in set A that does not belong to set B, the isSuperset variable will contain the value True showing that set B is a superset of set A. The entire logic to check for superset  can be implemented in Python as follows.

def checkSuperset(set1, set2):
    isSuperset = True
    for element in set2:
        if element not in set1:
            isSuperset = False
            break
    return isSuperset


A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set A is superset of B :", checkSuperset(A, B))
print("Set A is superset of C :", checkSuperset(A, C))
print("Set B is superset of C :", checkSuperset(B, C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set A is superset of B : True
Set A is superset of C : False
Set B is superset of C : False

Check For Superset Using issuperset() Method

We can also use the issuperset() method to check for superset in python. The issuperset() method, when invoked on a set A, accepts a set B as input argument and returns True if set A is a superset of B. Otherwise, it returns False.

You can use the issuperset() method to check for superset in python as follows.

A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set A is superset of B :", A.issuperset(B))
print("Set A is superset of C :", A.issuperset(C))
print("Set B is superset of C :", B.issuperset(C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set A is superset of B : True
Set A is superset of C : False
Set B is superset of C : False
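
Note that Python's set type also overloads the comparison operators for this check. For example, A >= B is equivalent to A.issuperset(B):

A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
print(A >= B)  # True, same as A.issuperset(B)
print(A > B)   # True only if A is a proper superset of B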

Conclusion

In this article, we have discussed two ways to check for a superset in Python. To learn more about sets, you can read this article on set comprehension in Python. You might also like this article on list comprehension in Python.

The post Check For Superset in Python appeared first on PythonForBeginners.com.

eGenix.com: eGenix Antispam Bot for Telegram 0.4.0 GA


Introduction

eGenix has long been running a local user group meeting in Düsseldorf called Python Meeting Düsseldorf and we are using a Telegram group for most of our communication.

In the early days, the group worked well and we only had a few spammers joining it, which we could easily handle manually.

More recently, this has changed dramatically. We are seeing between 2 and 5 spam signups per day, often at night. Furthermore, the signup accounts are not always easy to spot as spammers, since they often come with profile images, descriptions, etc.

With the bot, we now have a more flexible way of dealing with the problem.

Please see our project page for details and download links.

Features

  • Low impact mode of operation: the bot tries to keep noise in the group to a minimum
  • Several challenge mechanisms to choose from, more can be added as needed
  • Flexible and easy to use configuration
  • Only needs a few MB of RAM, so can easily be put into a container or run on a Raspberry Pi
  • Can handle quite a bit of load due to the async implementation
  • Works with Python 3.9+
  • MIT open source licensed

News

The 0.4.0 release fixes a few bugs and adds more features:

  • Added new challenge MathMultiplyChallenge
  • Made the MathAddChallenge and MathMultiplyChallenge a little more difficult

It has been battle-tested in production for several months already and is proving to be a really useful tool to help with Telegram group administration.

More Information

For more information on the eGenix.com Python products, licensing and download instructions, please write to sales@egenix.com.

Enjoy !

Marc-Andre Lemburg, eGenix.com

Stack Abuse: RetinaNet Object Detection with PyTorch and torchvision


Introduction

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". On one end, it can be used to build autonomous systems that navigate agents through environments - be it robots performing tasks or self-driving cars, but this requires intersection with other fields. However, anomaly detection (such as defective products on a line), locating objects within images, facial detection and various other applications of object detection can be done without intersecting other fields.

Advice This short guide is based on a small part of a much larger lesson on object detection belonging to our "Practical Deep Learning for Computer Vision with Python" course.

Object detection isn't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification. One of the major benefits of working within an established ecosystem is that it spares you from having to search far for useful information on good practices, tools and approaches to use. With object detection - most have to do way more research on the landscape of the field to get a good grip.

Object Detection with PyTorch/TorchVision's RetinaNet

torchvision is PyTorch's Computer Vision project, and aims to make the development of PyTorch-based CV models easier, by providing transformation and augmentation scripts, a model zoo with pre-trained weights, datasets and utilities that can be useful for a practitioner.

While still in beta and very much experimental - torchvision offers a relatively simple Object Detection API with a few models to choose from:

  • Faster R-CNN
  • RetinaNet
  • FCOS (Fully Convolutional One-Stage detection)
  • SSD (VGG16 backbone... yikes)
  • SSDLite (MobileNetV3 backbone)

While the API isn't as polished or simple as some other third-party APIs, it's a very decent starting point for those who'd still prefer the safety of being in an ecosystem they're familiar with. Before going forward, make sure you install PyTorch and Torchvision:

$ pip install torch torchvision

Let's load in some of the utility functions, such as read_image(), draw_bounding_boxes() and to_pil_image() to make it easier to read, draw on and output images, followed by importing RetinaNet and its pre-trained weights (MS COCO):

from torchvision.io.image import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from torchvision.models.detection import retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights

import matplotlib.pyplot as plt

RetinaNet uses a ResNet50 backbone and a Feature Pyramid Network (FPN) on top of it. While the name of the class is verbose, it's indicative of the architecture. Let's fetch an image using the requests library and save it as a file on our local drive:

import requests
response = requests.get('https://i.ytimg.com/vi/q71MCWAEfL8/maxresdefault.jpg')
open("obj_det.jpeg", "wb").write(response.content)

img = read_image("obj_det.jpeg")

With an image in place - we can instantiate our model and weights:

weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT
model = retinanet_resnet50_fpn_v2(weights=weights, score_thresh=0.35)
# Put the model in inference mode
model.eval()
# Get the transforms for the model's weights
preprocess = weights.transforms()

The score_thresh argument defines the threshold at which an object is detected as an object of a class. Intuitively, it's the confidence threshold, and we won't classify an object to belong to a class if the model is less than 35% confident that it belongs to a class.

Let's preprocess the image using the transforms from our weights, create a batch and run inference:

batch = [preprocess(img)]
prediction = model(batch)[0]

That's it, our prediction dictionary holds the inferred object classes and locations! Now, the results aren't very useful for us in this form - we'll want to extract the labels with respect to the metadata from the weights and draw bounding boxes, which can be done via draw_bounding_boxes():

labels = [weights.meta["categories"][i] for i in prediction["labels"]]

box = draw_bounding_boxes(img, boxes=prediction["boxes"],
                          labels=labels,
                          colors="cyan",
                          width=2, 
                          font_size=30,
                          font='Arial')

im = to_pil_image(box.detach())

fig, ax = plt.subplots(figsize=(16, 12))
ax.imshow(im)
plt.show()

This results in:

RetinaNet actually classified the person peeking behind the car! That's a pretty difficult classification.

You can switch RetinaNet out for FCOS (a fully convolutional, anchor-free detector) by replacing retinanet_resnet50_fpn_v2 with fcos_resnet50_fpn, and use the FCOS_ResNet50_FPN_Weights weights:

from torchvision.io.image import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from torchvision.models.detection import fcos_resnet50_fpn, FCOS_ResNet50_FPN_Weights

import matplotlib.pyplot as plt
import requests
response = requests.get('https://i.ytimg.com/vi/q71MCWAEfL8/maxresdefault.jpg')
open("obj_det.jpeg", "wb").write(response.content)

img = read_image("obj_det.jpeg")
weights = FCOS_ResNet50_FPN_Weights.DEFAULT
model = fcos_resnet50_fpn(weights=weights, score_thresh=0.35)
model.eval()

preprocess = weights.transforms()
batch = [preprocess(img)]
prediction = model(batch)[0]

labels = [weights.meta["categories"][i] for i in prediction["labels"]]

box = draw_bounding_boxes(img, boxes=prediction["boxes"],
                          labels=labels,
                          colors="cyan",
                          width=2, 
                          font_size=30,
                          font='Arial')

im = to_pil_image(box.detach())

fig, ax = plt.subplots(figsize=(16, 12))
ax.imshow(im)
plt.show()

Going Further - Practical Deep Learning for Computer Vision

Your inquisitive nature makes you want to go further? We recommend checking out our Course: "Practical Deep Learning for Computer Vision with Python".

Another Computer Vision Course?

We won't be doing classification of MNIST digits or MNIST fashion. They served their part a long time ago. Too many learning resources are focusing on basic datasets and basic architectures before letting advanced black-box architectures shoulder the burden of performance.

We want to focus on demystification, practicality, understanding, intuition and real projects. Want to learn how you can make a difference? We'll take you on a ride from the way our brains process images to writing a research-grade deep learning classifier for breast cancer to deep learning networks that "hallucinate", teaching you the principles and theory through practical work, equipping you with the know-how and tools to become an expert at applying deep learning to solve computer vision.

What's inside?

  • The first principles of vision and how computers can be taught to "see"
  • Different tasks and applications of computer vision
  • The tools of the trade that will make your work easier
  • Finding, creating and utilizing datasets for computer vision
  • The theory and application of Convolutional Neural Networks
  • Handling domain shift, co-occurrence, and other biases in datasets
  • Transfer Learning and utilizing others' training time and computational resources for your benefit
  • Building and training a state-of-the-art breast cancer classifier
  • How to apply a healthy dose of skepticism to mainstream ideas and understand the implications of widely adopted techniques
  • Visualizing a ConvNet's "concept space" using t-SNE and PCA
  • Case studies of how companies use computer vision techniques to achieve better results
  • Proper model evaluation, latent space visualization and identifying the model's attention
  • Performing domain research, processing your own datasets and establishing model tests
  • Cutting-edge architectures, the progression of ideas, what makes them unique and how to implement them
  • KerasCV - a WIP library for creating state of the art pipelines and models
  • How to parse and read papers and implement them yourself
  • Selecting models depending on your application
  • Creating an end-to-end machine learning pipeline
  • Landscape and intuition on object detection with Faster R-CNNs, RetinaNets, SSDs and YOLO
  • Instance and semantic segmentation
  • Real-Time Object Recognition with YOLOv5
  • Training YOLOv5 Object Detectors
  • Working with Transformers using KerasNLP (industry-strength WIP library)
  • Integrating Transformers with ConvNets to generate captions of images
  • DeepDream

Conclusion

Object Detection is an important field of Computer Vision, and one that's unfortunately less approachable than it should be.

In this short guide, we've taken a look at how torchvision, PyTorch's Computer Vision package, makes it easier to perform object detection on images, using RetinaNet.


Mike Driscoll: Python 101 - Debugging Your Code with pdb (Video)


Learn how to debug your Python programs using Python's built-in debugger, pdb with Mike Driscoll

In this tutorial, you will learn the following:

  • Starting pdb in the REPL
  • Starting pdb on the Command Line
  • Stepping Through Code
  • Adding Breakpoints in pdb
  • Creating a Breakpoint with set_trace()
  • Using the built-in breakpoint() Function
  • Getting Help
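
As a quick taste of those last two topics, here is a minimal sketch (my own example, not from the video) that drops you into pdb using the built-in breakpoint() function:

def buggy_average(numbers):
    total = sum(numbers)
    breakpoint()  # Pauses execution here and opens the pdb prompt (Python 3.7+).
    return total / len(numbers)

buggy_average([2, 4, 6])
# Inside pdb you can print variables (p total), step (n), continue (c), or get help (h).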

This video is based on a chapter from the book, Python 101 by Mike Driscoll

Related Articles

The post Python 101 - Debugging Your Code with pdb (Video) appeared first on Mouse Vs Python.

Stack Abuse: Object Detection and Instance Segmentation in Python with Detectron2


Introduction

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". On one end, it can be used to build autonomous systems that navigate agents through environments - be it robots performing tasks or self-driving cars, but this requires intersection with other fields. However, anomaly detection (such as defective products on a line), locating objects within images, facial detection and various other applications of object detection can be done without intersecting other fields.

Advice This short guide is based on a small part of a much larger lesson on object detection belonging to our "Practical Deep Learning for Computer Vision with Python" course.

Object detection isn't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification. One of the major benefits of working within an established ecosystem is that it spares you from having to search far for useful information on good practices, tools and approaches to use. With object detection - most have to do way more research on the landscape of the field to get a good grip.

Meta AI's Detectron2 - Instance Segmentation and Object Detection

Detectron2 is Meta AI (formerly FAIR - Facebook AI Research)'s open source object detection, segmentation and pose estimation package - all in one. Given an input image, it can return the labels, bounding boxes, confidence scores, masks and skeletons of objects. This is well-represented on the repository's page:

It's meant to be used as a library on top of which you can build research projects. It offers a model zoo with most implementations relying on Mask R-CNN and R-CNNs in general, alongside RetinaNet. They also have pretty decent documentation. Let's run an example inference script!

First, let's install the dependencies:

$ pip install pyyaml==5.1
$ pip install 'git+https://github.com/facebookresearch/detectron2.git'

Next, we'll import the Detectron2 utilities - this is where framework-domain knowledge comes into play. You can construct a detector using the DefaultPredictor class, by passing in a configuration object that sets it up. The Visualizer offers support for visualizing results. MetadataCatalog and DatasetCatalog belong to Detectron2's data API and offer information on built-in datasets as well as their metadata.

Let's import the classes and functions we'll be using:

import torch, detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog

Using requests, we'll download an image and save it to our local drive:

import cv2
import matplotlib.pyplot as plt
import requests
response = requests.get('http://images.cocodataset.org/val2017/000000439715.jpg')
open("input.jpg", "wb").write(response.content)
    
im = cv2.imread("./input.jpg")
fig, ax = plt.subplots(figsize=(18, 8))
ax.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))

This results in:

Now, we load the configuration, enact changes if need be (the models run on GPU by default, so if you don't have a GPU, you'll want to set the device to 'cpu' in the config):

cfg = get_cfg()

cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
# If you don't have a GPU and CUDA enabled, the next line is required
# cfg.MODEL.DEVICE = "cpu"

Here, we specify which model we'd like to run from the model_zoo. We've imported an instance segmentation model, based on the Mask R-CNN architecture, and with a ResNet50 backbone. Depending on what you'd like to achieve (keypoint detection, instance segmentation, panoptic segmentation or object detection), you'll load in the appropriate model.

Finally, we can construct a predictor with this cfg and run it on the inputs! The Visualizer class is used to draw predictions on the image (in this case, segmented instances, classes and bounding boxes):

predictor = DefaultPredictor(cfg)
outputs = predictor(im)

v = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
fig, ax = plt.subplots(figsize=(18, 8))
ax.imshow(out.get_image()[:, :, ::-1])

Finally, this results in:

Going Further - Practical Deep Learning for Computer Vision

Your inquisitive nature makes you want to go further? We recommend checking out our Course: "Practical Deep Learning for Computer Vision with Python".

Another Computer Vision Course?

We won't be doing classification of MNIST digits or MNIST fashion. They served their part a long time ago. Too many learning resources are focusing on basic datasets and basic architectures before letting advanced black-box architectures shoulder the burden of performance.

We want to focus on demystification, practicality, understanding, intuition and real projects. Want to learn how you can make a difference? We'll take you on a ride from the way our brains process images to writing a research-grade deep learning classifier for breast cancer to deep learning networks that "hallucinate", teaching you the principles and theory through practical work, equipping you with the know-how and tools to become an expert at applying deep learning to solve computer vision.

What's inside?

  • The first principles of vision and how computers can be taught to "see"
  • Different tasks and applications of computer vision
  • The tools of the trade that will make your work easier
  • Finding, creating and utilizing datasets for computer vision
  • The theory and application of Convolutional Neural Networks
  • Handling domain shift, co-occurrence, and other biases in datasets
  • Transfer Learning and utilizing others' training time and computational resources for your benefit
  • Building and training a state-of-the-art breast cancer classifier
  • How to apply a healthy dose of skepticism to mainstream ideas and understand the implications of widely adopted techniques
  • Visualizing a ConvNet's "concept space" using t-SNE and PCA
  • Case studies of how companies use computer vision techniques to achieve better results
  • Proper model evaluation, latent space visualization and identifying the model's attention
  • Performing domain research, processing your own datasets and establishing model tests
  • Cutting-edge architectures, the progression of ideas, what makes them unique and how to implement them
  • KerasCV - a WIP library for creating state of the art pipelines and models
  • How to parse and read papers and implement them yourself
  • Selecting models depending on your application
  • Creating an end-to-end machine learning pipeline
  • Landscape and intuition on object detection with Faster R-CNNs, RetinaNets, SSDs and YOLO
  • Instance and semantic segmentation
  • Real-Time Object Recognition with YOLOv5
  • Training YOLOv5 Object Detectors
  • Working with Transformers using KerasNLP (industry-strength WIP library)
  • Integrating Transformers with ConvNets to generate captions of images
  • DeepDream

Conclusion

Instance segmentation goes one step beyond semantic segmentation, and notes the qualitative difference between individual instances of a class (person 1, person 2, etc...) rather than just whether they belong to one. In a way - it's pixel-level classification.

In this short guide, we've taken a quick look at how Detectron2 makes instance segmentation and object detection easy and accessible through their API, using a Mask R-CNN.

John Ludhi/nbshare.io: PySpark Replace Values In DataFrames


PySpark Replace Values In DataFrames Using regexp_replace(), translate() and Overlay() Functions

regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, regexp_replace, translate, overlay, when, expr)
In [25]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [3]:
columns=["Full_Name","Salary","Last_Name_Pattern","Last_Name_Replacement"]data=[('Sam A Smith','1,000.01','Sm','Griffi'),('Alex Wesley Jones','120,000.89','Jo','Ba'),('Steve Paul Jobs','5,000.90','Jo','Bo')]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Jones|120,000.89|               Jo|                   Ba|
|  Steve Paul Jobs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark regexp_replace

regexp_replace: we will use regexp_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value.

1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'

In [7]:
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
modified_dfFromRDD2 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', 'Jo', 'Ba'))
In [8]:
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteria
modified_dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use the when() function to replace column values conditionally.

In [9]:
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
# based on the conditions
modified_dfFromRDD3 = dfFromRDD2.withColumn("Full_Name", when(col('Full_Name').endswith('th'), regexp_replace('Full_Name', 'Smith', 'Griffith'))
                                                         .otherwise(regexp_replace('Full_Name', 'Jo', 'Ba')))
In [10]:
# visualizing the modified dataframe we see how all the column values are updated based on the conditions provided
modified_dfFromRDD3.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|   Sam A Griffith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

3) We can also use a regex to replace characters. As an example, we are changing the decimal digits in the Salary column to '00'.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.withColumn("Salary", regexp_replace('Salary', '\\.\d\d$', '.00 \\$'))
In [12]:
# visualizing the modified dataframe, we see how the Salary column is updated
modified_dfFromRDD4.show(truncate=False)
+-----------------+------------+-----------------+---------------------+
|Full_Name        |Salary      |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+------------+-----------------+---------------------+
|Sam A Smith      |1,000.00 $  |Sm               |Griffi               |
|Alex Wesley Jones|120,000.00 $|Jo               |Ba                   |
|Steve Paul Jobs  |5,000.00 $  |Jo               |Bo                   |
+-----------------+------------+-----------------+---------------------+

4) Now we will use another regex example to replace a variable number of characters wherever the pattern matches. Here we replace each run of lowercase characters in the Full_Name column with '--'.

In [13]:
# Replace only the lowercase characters in the Full_Name with --
modified_dfFromRDD5 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', '[a-z]+', '--'))
In [14]:
# visualizing the modified data frame. We see that all the lowercase characters are replaced.
# The uppercase characters are same as they were before
modified_dfFromRDD5.show()
+-----------+----------+-----------------+---------------------+
|  Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------+----------+-----------------+---------------------+
|  S-- A S--|  1,000.01|               Sm|               Griffi|
|A-- W-- J--|120,000.89|               Jo|                   Ba|
|S-- P-- J--|  5,000.90|               Jo|                   Bo|
+-----------+----------+-----------------+---------------------+

5) We can also use regexp_replace with expr to replace values in one column using a pattern from a second column and a replacement from a third column, i.e. 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1 that match the pattern in column 2 with the characters from column 3.

In [15]:
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values
# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'
modified_dfFromRDD6 = modified_dfFromRDD2.withColumn("Full_Name", expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
In [16]:
# visualizing the modified dataframe.
# The Full_Name column has been updated with some characters from Last_Name_Replacement
modified_dfFromRDD6.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|  Sam A Griffiith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark translate()

translate(): This function is used to do character by character replacement of column values

In [17]:
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:
# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9
alphabets = 'abcdefjhij'
digits = '0123456789'
modified_dfFromRDD7 = dfFromRDD2.withColumn("Full_Name", translate('Full_Name', alphabets, digits))
In [18]:
# visualizing the modified dataframe we see the replacements have been done character by character
modified_dfFromRDD7.show(truncate=False)
+-----------------+----------+-----------------+---------------------+
|Full_Name        |Salary    |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|S0m A Sm8t7      |1,000.01  |Sm               |Griffi               |
|Al4x W4sl4y Jon4s|120,000.89|Jo               |Ba                   |
|St4v4 P0ul Jo1s  |5,000.90  |Jo               |Bo                   |
+-----------------+----------+-----------------+---------------------+

PySpark overlay()

overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)

In [19]:
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement column
modified_dfFromRDD8 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 1, 2).alias("FullName_Overlayed"))
In [20]:
# Visualizing the modified dataframe
modified_dfFromRDD8.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|   Griffim A Smith|
|Alex Wesley Jones| Baex Wesley Jones|
|  Steve Paul Jobs|   Boeve Paul Jobs|
+-----------------+------------------+

In [21]:
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the
# length of the replacement string
modified_dfFromRDD9 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 5).alias("FullName_Overlayed"))
In [22]:
# Visualizing the modified dataframe
modified_dfFromRDD9.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|       Sam Griffih|
|Alex Wesley Jones| AlexBaesley Jones|
|  Steve Paul Jobs|   StevBoPaul Jobs|
+-----------------+------------------+

In [23]:
spark.stop()

Stack Abuse: Guide to the K-Nearest Neighbors Algorithm in Python and Scikit-Learn


Introduction

The K-nearest Neighbors (KNN) algorithm is a type of supervised machine learning algorithm used for classification, regression as well as outlier detection. It is extremely easy to implement in its most basic form but can perform fairly complex tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying (or regressing) a new data point or instance.

KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption e.g. linear separability, uniform distribution, etc.

In this guide, we will see how KNN can be implemented with Python's Scikit-Learn library. Before that, we'll first explore how we can use KNN and explain the theory behind it. After that, we'll take a look at the California Housing dataset we'll be using to illustrate the KNN algorithm and several of its variations. First, we'll look at how to implement the KNN algorithm for regression, followed by implementations of KNN classification and outlier detection. In the end, we'll conclude with some of the pros and cons of the algorithm.

When Should You Use KNN?

Suppose you wanted to rent an apartment and recently found out your friend's neighbor might put her apartment for rent in 2 weeks. Since the apartment isn't on a rental website yet, how could you try to estimate its rental value?

Let's say your friend pays $1,200 in rent. Your rent value might be around that number, but the apartments aren't exactly the same (orientation, area, furniture quality, etc.), so, it would be nice to have more data on other apartments.

By asking other neighbors and looking at the apartments from the same building that were listed on a rental website, the closest neighboring apartments rent for $1,200, $1,210, $1,210, and $1,215. Those apartments are on the same block and floor as your friend's apartment.

Other apartments, that are further away, on the same floor, but in a different block have rents of $1,400, $1,430, $1,500, and $1,470. It seems they are more expensive due to having more light from the sun in the evening.

Considering the apartment's proximity, it seems your estimated rent would be around $1,210. That is the general idea of what the K-Nearest Neighbors (KNN) algorithm does! It classifies or regresses new data based on its proximity to already existing data.

Translate the Example into Theory

When the estimated value is a continuous number, such as the rent value, KNN is used for regression. But we could also divide apartments into categories based on the minimum and maximum rent, for instance. When the value is discrete, making it a category, KNN is used for classification.

There is also the possibility of estimating which neighbors are so different from others that they will probably stop paying rent. This is the same as detecting which data points are so far away that they don't fit into any value or category, when that happens, KNN is used for outlier detection.

In our example, we also already knew the rents of each apartment, which means our data was labeled. KNN uses previously labeled data, which makes it a supervised learning algorithm.

KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification, regression, or outlier detection tasks.

Each time there is a new point added to the data, KNN uses just one part of the data for deciding the value (regression) or class (classification) of that added point. Since it doesn't have to look at all the points again, this makes it a lazy learning algorithm.

KNN also doesn't assume anything about the underlying data characteristics, it doesn't expect the data to fit into some type of distribution, such as uniform, or to be linearly separable. This means it is a non-parametric learning algorithm. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption.

Visualizing Different Uses of the KNN

As it has been shown, the intuition behind the KNN algorithm is one of the most direct of all the supervised machine learning algorithms. The algorithm first calculates the distance of a new data point to all other training data points.

Note: The distance can be measured in different ways. You can use a Minkowski, Euclidean, Manhattan, Mahalanobis or Hamming formula, to name a few metrics. With high dimensional data, Euclidean distance oftentimes starts failing (high dimensionality is... weird), and Manhattan distance is used instead.
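To make those distance formulas concrete, here is a minimal sketch (with two made-up points, not taken from the dataset used later) of the two most common metrics computed with NumPy:

import numpy as np

# Two hypothetical data points, just to illustrate the formulas
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of absolute coordinate differences: 7.0

print(euclidean, manhattan)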

After calculating the distance, KNN selects a number of nearest data points - 2, 3, 10, or really, any integer. This number of points (2, 3, 10, etc.) is the K in K-Nearest Neighbors!

In the final step, if it is a regression task, KNN will calculate the average weighted sum of the K-nearest points for the prediction. If it is a classification task, the new data point will be assigned to the class to which the majority of the selected K-nearest points belong.

Let's visualize the algorithm in action with the help of a simple example. Consider a dataset with two variables and a K of 3.

When performing regression, the task is to find the value of a new data point, based on the average weighted sum of the 3 nearest points.

KNN with K = 3, when used for regression:


The KNN algorithm will start by calculating the distance of the new point from all the points. It then finds the 3 points with the least distance to the new point. This is shown in the second figure above, in which the three nearest points, 47, 58, and 79, have been encircled. After that, it calculates the weighted sum of 47, 58 and 79 - in this case the weights are equal to 1 - we are considering all points as equals, but we could also assign different weights based on distance. After calculating the weighted sum, the new point's value is 61.33.
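As a quick sketch of that calculation - the neighbor values come from the example above, while the distances below are made up only to illustrate the distance-weighted variant mentioned in passing:

import numpy as np

neighbor_values = np.array([47, 58, 79])

# Equal weights: the prediction is a plain average
print(neighbor_values.mean())  # 61.33...

# Hypothetical distances to the new point; closer neighbors get larger weights
distances = np.array([1.0, 2.0, 4.0])
print(np.average(neighbor_values, weights=1 / distances))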

And when performing a classification, the KNN task is to classify a new data point into the "Purple" or "Red" class.

KNN with K = 3, when used for classification:


The KNN algorithm will start in the same way as before, by calculating the distance of the new point from all the points, finding the 3 nearest points with the least distance to the new point, and then, instead of calculating a number, it assigns the new point to the class to which the majority of the three nearest points belong - the red class. Therefore, the new data point will be classified as "Red".

The outlier detection process is different from both of the above; we will talk more about it when implementing it after the regression and classification implementations.

Note: The code provided in this tutorial has been executed and tested with the following Jupyter notebook.

The Scikit-Learn California Housing Dataset

We are going to use the California housing dataset to illustrate how the KNN algorithm works. The dataset was derived from the 1990 U.S. census. One row of the dataset represents the census of one block group.

In this section, we'll go over the details of the California Housing Dataset, so you can gain an intuitive understanding of the data we'll be working with. It's very important to get to know your data before you start working on it.

A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. Besides block group, another term used is household, a household is a group of people residing within a home.

The dataset consists of nine attributes:

  • MedInc - median income in block group
  • HouseAge - median house age in a block group
  • AveRooms - the average number of rooms (provided per household)
  • AveBedrms - the average number of bedrooms (provided per household)
  • Population - block group population
  • AveOccup - the average number of household members
  • Latitude - block group latitude
  • Longitude - block group longitude
  • MedHouseVal - median house value for California districts (hundreds of thousands of dollars)

The dataset is already part of the Scikit-Learn library; we only need to import it and load it as a DataFrame:

from sklearn.datasets import fetch_california_housing
# as_frame=True loads the data in a dataframe format, with other metadata besides it
california_housing = fetch_california_housing(as_frame=True)
# Select only the dataframe part and assign it to the df variable
df = california_housing.frame

Importing the data directly from Scikit-Learn imports more than only the columns and numbers - it also includes the data description as a Bunch object, so we've just extracted the frame. Further details of the dataset are available here.

Let's import Pandas and take a peek at the first few rows of data:

import pandas as pd
df.head()

Executing the code will display the first five rows of our dataset:

	MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup   Latitude  Longitude  MedHouseVal
0 	8.3252 	41.0 	  6.984127 	1.023810   322.0 	   2.555556   37.88 	-122.23    4.526
1 	8.3014 	21.0 	  6.238137 	0.971880   2401.0 	   2.109842   37.86 	-122.22    3.585
2 	7.2574 	52.0 	  8.288136 	1.073446   496.0 	   2.802260   37.85 	-122.24    3.521
3 	5.6431 	52.0 	  5.817352 	1.073059   558.0 	   2.547945   37.85 	-122.25    3.413
4 	3.8462 	52.0 	  6.281853 	1.081081   565.0 	   2.181467   37.85 	-122.25    3.422

In this guide, we will use MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude to predict MedHouseVal. Something similar to our motivation narrative.

Let's now jump right into the implementation of the KNN algorithm for the regression.

Regression with K-Nearest Neighbors with Scikit-Learn

So far, we got to know our dataset and now can proceed to other steps in the KNN algorithm.

Preprocessing Data for KNN Regression

The preprocessing is where the first differences between the regression and classification tasks appear. Since this section is all about regression, we'll prepare our dataset accordingly.

For the regression, we need to predict the median house value. To do so, we will assign MedHouseVal to y and all other columns to X just by dropping MedHouseVal:

y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis = 1)

By looking at our variables descriptions, we can see that we have differences in measurements. To avoid guessing, let's use the describe() method to check:

# .T transposes the results, transforming rows into columns
X.describe().T

This results in:

			count 	  mean 		   std 			min 		25% 		50% 		75% 		max
MedInc 		20640.0   3.870671 	   1.899822 	0.499900 	2.563400 	3.534800 	4.743250 	15.000100
HouseAge 	20640.0   28.639486    12.585558 	1.000000 	18.000000 	29.000000 	37.000000 	52.000000
AveRooms 	20640.0   5.429000 	   2.474173 	0.846154 	4.440716 	5.229129 	6.052381 	141.909091
AveBedrms 	20640.0   1.096675 	   0.473911 	0.333333 	1.006079 	1.048780 	1.099526 	34.066667
Population 	20640.0   1425.476744  1132.462122 	3.000000 	787.000000 	1166.000000 1725.000000 35682.000000
AveOccup 	20640.0   3.070655 	   10.386050 	0.692308 	2.429741 	2.818116 	3.282261 	1243.333333
Latitude 	20640.0   35.631861    2.135952 	32.540000 	33.930000 	34.260000 	37.710000 	41.950000
Longitude 	20640.0   -119.569704  2.003532    -124.350000 -121.800000 	-118.490000 -118.010000 -114.310000

Here, we can see that the mean value of MedInc is approximately 3.87 and the mean value of HouseAge is about 28.64, making it 7.4 times larger than MedInc. Other features also have differences in mean and standard deviation - to see that, look at the mean and std values and observe how distant they are from each other. For MedInc, std is approximately 1.9, for HouseAge, std is 12.59, and the same applies to the other features.

We're using an algorithm based on distance and distance-based algorithms suffer greatly from data that isn't on the same scale, such as this data. The scale of the points may (and in practice, almost always does) distort the real distance between values.

To perform feature scaling, we will use Scikit-Learn's StandardScaler class later. If we applied the scaling right now (before a train-test split), the calculation would include test data, effectively leaking test data information into the rest of the pipeline. This sort of data leakage is unfortunately commonly overlooked, resulting in irreproducible or illusory findings.

Advice: If you'd like to learn more about feature scaling - read our "Feature Scaling Data with Scikit-Learn for Machine Learning in Python"

Splitting Data into Train and Test Sets

To be able to scale our data without leakage, but also to evaluate our results and to avoid over-fitting, we'll divide our dataset into train and test splits.

A straightforward way to create train and test splits is the train_test_split method from Scikit-Learn. The split doesn't simply cut the data at some point; it samples X% and Y% of the rows randomly. To make this process reproducible (so that the method always samples the same data points), we'll set the random_state argument to a certain SEED:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

This piece of code samples 75% of the data for training and 25% of the data for testing. By changing the test_size to 0.3, for instance, you could train with 70% of the data and test with 30%.

By using 75% of the data for training and 25% for testing, out of 20640 records, the training set contains 15480 and the test set contains 5160. We can inspect those numbers quickly by printing the lengths of the full dataset and of split data:

len(X)       # 20640
len(X_train) # 15480
len(X_test)  # 5160

Great! We can now fit the data scaler on the X_train set, and scale both X_train and X_test without leaking any data from X_test into X_train.

Advice: If you'd like to learn more about the train_test_split() method, the importance of a train-test-validation split, and how to separate out validation sets, read our "Scikit-Learn's train_test_split() - Training, Testing and Validation Sets".

Feature Scaling for KNN Regression

By importing StandardScaler, instantiating it, fitting it according to our train data (preventing leakage), and transforming both train and test datasets, we can perform feature scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit only on X_train
scaler.fit(X_train)

# Scale both X_train and X_test
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Note: Since you'll oftentimes call scaler.fit(X_train) followed by scaler.transform(X_train) - you can call a single scaler.fit_transform(X_train) followed by scaler.transform(X_test) to make the call shorter!
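For instance, the shorter equivalent of the snippet above would look like this:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform() learns the statistics from X_train and scales it in one call
X_train = scaler.fit_transform(X_train)
# The test set is only transformed, using the statistics learned from X_train
X_test = scaler.transform(X_test)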

Now our data is scaled! The scaler maintains only the data points, and not the column names, when applied on a DataFrame. Let's organize the data into a DataFrame again with column names and use describe() to observe the changes in mean and std:

col_names=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
scaled_df = pd.DataFrame(X_train, columns=col_names)
scaled_df.describe().T

This will give us:

			count 		mean 			std 		min 		25% 		50% 		75% 		max
MedInc 		15480.0 	2.074711e-16 	1.000032 	-1.774632 	-0.688854 	-0.175663 	0.464450 	5.842113
HouseAge 	15480.0 	-1.232434e-16 	1.000032 	-2.188261 	-0.840224 	0.032036 	0.666407 	1.855852
AveRooms 	15480.0 	-1.620294e-16 	1.000032 	-1.877586 	-0.407008 	-0.083940 	0.257082 	56.357392
AveBedrms 	15480.0 	7.435912e-17 	1.000032 	-1.740123 	-0.205765 	-0.108332 	0.007435 	55.925392
Population 	15480.0 	-8.996536e-17 	1.000032 	-1.246395 	-0.558886 	-0.227928 	0.262056 	29.971725
AveOccup 	15480.0 	1.055716e-17 	1.000032 	-0.201946 	-0.056581 	-0.024172 	0.014501 	103.737365
Latitude 	15480.0 	7.890329e-16 	1.000032 	-1.451215 	-0.799820 	-0.645172 	0.971601 	2.953905
Longitude 	15480.0 	2.206676e-15 	1.000032 	-2.380303 	-1.106817 	0.536231 	0.785934 	2.633738

Observe how all standard deviations are now 1 and the means have become smaller. This is what makes our data more uniform! Let's train and evaluate a KNN-based regressor.

Training and Predicting KNN Regression

Scikit-Learn's intuitive and stable API makes training regressors and classifiers very straightforward. Let's import the KNeighborsRegressor class from the sklearn.neighbors module, instantiate it, and fit it to our train data:

from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X_train, y_train)

In the above code, the n_neighbors is the value for K, or the number of neighbors the algorithm will take into consideration for choosing a new median house value. 5 is the default value for KNeighborsRegressor(). There is no ideal value for K and it is selected after testing and evaluation, however, to start out, 5 is a commonly used value for KNN and was thus set as the default value.

The final step is to make predictions on our test data. To do so, execute the following script:

y_pred = regressor.predict(X_test)

We can now evaluate how well our model generalizes to new data that we have labels (ground truth) for - the test set!

Evaluating the Algorithm for KNN Regression

The most commonly used regression metrics for evaluating the algorithm are mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R2):

  1. Mean Absolute Error (MAE): We subtract the predicted values from the actual values to obtain the errors, sum the absolute values of those errors, and take their mean. This metric gives a notion of the overall error for each prediction of the model; the smaller (closer to 0) the better:

$$
mae = (\frac{1}{n})\sum_{i=1}^{n}\left | Actual - Predicted \right |
$$

Note: You may also encounter the y and ŷ (read as y-hat) notation in the equations. The y refers to the actual values and the ŷ to the predicted values.

  2. Mean Squared Error (MSE): It is similar to the MAE metric, but it squares the errors. As with MAE, the smaller, or closer to 0, the better. Squaring makes large errors even larger. One thing to pay close attention to is that it is usually a hard metric to interpret due to the size of its values and the fact that they aren't on the same scale as the data.

$$
mse = \frac{1}{n}\sum_{i=1}^{n}(Actual - Predicted)^2
$$

  3. Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised by the MSE by taking the square root of its final value, so as to scale it back to the same units as the data. It is easier to interpret and good when we need to display or report the error in the units of the data. It shows how much the predictions may vary: if we have an RMSE of 4.35, our model's predictions are typically off by about 4.35 above or below the actual value. The closer to 0, the better as well.

$$
rmse = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Actual - Predicted)^2}
$$

The mean_absolute_error() and mean_squared_error() methods of sklearn.metrics can be used to calculate these metrics as can be seen in the following snippet:

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')

The output of the above script looks like this:

mae: 0.4460739527131783 
mse: 0.4316907430948294 
rmse: 0.6570317671884894

The R2 can be calculated directly with the score() method:

regressor.score(X_test, y_test)

Which outputs:

0.6737569252627673

The results show that our KNN algorithm's mean absolute error and mean squared error are around 0.44 and 0.43, respectively. Also, the RMSE shows that our predictions can be above or below the actual value by about 0.65. How good is that?

Let's check what the prices look like:

y.describe()
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The mean is 2.06 and the standard deviation from the mean is 1.15, so our error of ~0.44 isn't really stellar, but isn't too bad.

With the R2, the closer to 1 (or 100%) we get, the better. The R2 tells us how much of the change in the data, or data variance, is being understood or explained by KNN.

$$
R^2 = 1 - \frac{\sum(Actual - Predicted)^2}{\sum(Actual - Actual \ Mean)^2}
$$
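If you'd rather compute it from the predictions directly, sklearn.metrics also provides r2_score(), which gives the same number as score() for a fitted regressor:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f'r2: {r2}')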

With a value of 0.67, we can see that our model explains 67% of the data variance. It is already more than 50%, which is ok, but not very good. Is there any way we could do better?

We have used a predetermined K with a value of 5, so, we are using 5 neighbors to predict our targets which is not necessarily the best number. To understand which would be an ideal number of Ks, we can analyze our algorithm errors and choose the K that minimizes the loss.

Finding the Best K for KNN Regression

Ideally, you would see which metric fits more into your context - but it is usually interesting to test all metrics. Whenever you can test all of them, do it. Here, we will show how to choose the best K using only the mean absolute error, but you can change it to any other metric and compare the results.

To do this, we will create a for loop and run models that have from 1 to X neighbors. At each iteration, we will calculate the MAE and plot the number of Ks along with the MAE result:

error = []

# Calculating MAE error for K values between 1 and 39
for i in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    mae = mean_absolute_error(y_test, pred_i)
    error.append(mae)

Now, let's plot the errors:

import matplotlib.pyplot as plt 

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
         
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

Looking at the plot, it seems the lowest MAE value is when K is 12. Let's get a closer look at the plot to be sure by plotting less data:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 15), error[:14], color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

You can also obtain the lowest error and the index of that point using the built-in min() function (works on lists) or convert the list into a NumPy array and get the argmin() (index of the element with the lowest value):

import numpy as np 

print(min(error))               # 0.43631325936692505
print(np.array(error).argmin()) # 11

We started counting neighbors at 1, while arrays are 0-based, so index 11 corresponds to 12 neighbors!

This means that we need 12 neighbors to be able to predict a point with the lowest MAE error. We can execute the model and metrics again with 12 neighbors to compare results:

knn_reg12 = KNeighborsRegressor(n_neighbors=12)
knn_reg12.fit(X_train, y_train)
y_pred12 = knn_reg12.predict(X_test)
r2 = knn_reg12.score(X_test, y_test) 

mae12 = mean_absolute_error(y_test, y_pred12)
mse12 = mean_squared_error(y_test, y_pred12)
rmse12 = mean_squared_error(y_test, y_pred12, squared=False)
print(f'r2: {r2}, \nmae: {mae12} \nmse: {mse12} \nrmse: {rmse12}')

The following code outputs:

r2: 0.6887495617137436, 
mae: 0.43631325936692505 
mse: 0.4118522151025172 
rmse: 0.6417571309323467

With 12 neighbors our KNN model now explains 69% of the variance in the data, and the errors have dropped slightly, going from 0.44 to 0.43 (MAE), 0.43 to 0.41 (MSE), and 0.65 to 0.64 (RMSE). It is not a very large improvement, but it is an improvement nonetheless.

Note: Going further in this analysis, doing an Exploratory Data Analysis (EDA) along with residual analysis may help to select features and achieve better results.

We have already seen how to use KNN for regression - but what if we wanted to classify a point instead of predicting its value? Now, we can look at how to use KNN for classification.

Classification using K-Nearest Neighbors with Scikit-Learn

In this task, instead of predicting a continuous value, we want to predict the class to which these block groups belong. To do that, we can divide the median house value for districts into groups with different house value ranges or bins.

When you want to use a continuous value for classification, you can usually bin the data. In this way, you can predict groups, instead of values.

Preprocessing Data for Classification

Let's create the data bins to transform our continuous values into categories:

# Creating 4 categories and assigning them to a MedHouseValCat column
df["MedHouseValCat"] = pd.qcut(df["MedHouseVal"], 4, retbins=False, labels=[1, 2, 3, 4])

Then, we can split our dataset into its attributes and labels:

y = df['MedHouseValCat']
X = df.drop(['MedHouseVal', 'MedHouseValCat'], axis = 1)

Since we have used the MedHouseVal column to create the bins, we need to drop both the MedHouseVal and MedHouseValCat columns from X. This way, the DataFrame will contain the first 8 columns of the dataset (i.e. attributes, features) while our y will contain only the MedHouseValCat assigned label.

Note: You can also select columns using .iloc instead of dropping them. When dropping, just be aware that you need to assign the y values before assigning the X values, because you can't assign a dropped column of a DataFrame to another object in memory.
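As a small sketch of that alternative (assuming the column order shown earlier, with the 8 feature columns first):

# Hypothetical equivalent using positional selection instead of drop()
y = df['MedHouseValCat']
X = df.iloc[:, 0:8]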

Splitting Data into Train and Test Sets

As it has been done with regression, we will also divide the dataset into training and test splits. Since we have different data, we need to repeat this process:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

We will use the standard Scikit-Learn value of 75% train data and 25% test data again. This means we will have the same train and test number of records as in the regression before.

Feature Scaling for Classification

Since we are dealing with the same unprocessed dataset and its varying measure units, we will perform feature scaling again, in the same way as we did for our regression data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Training and Predicting for Classification

After binning, splitting, and scaling the data, we can finally fit a classifier on it. For the prediction, we will use 5 neighbors again as a baseline. You can also instantiate the KNeighborsClassifier class without any arguments and it will automatically use 5 neighbors. Here, instead of importing the KNeighborsRegressor, we will import the KNeighborsClassifier class:

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)

After fitting the KNeighborsClassifier, we can predict the classes of the test data:

y_pred = classifier.predict(X_test)

Time to evaluate the predictions! Would predicting classes be a better approach than predicting values in this case? Let's evaluate the algorithm to see what happens.

Evaluating KNN for Classification

For evaluating the KNN classifier, we can also use the score method, but it executes a different metric since we are scoring a classifier and not a regressor. The basic metric for classification is accuracy - it describes how many predictions our classifier got right. The lowest accuracy value is 0 and the highest is 1. We usually multiply that value by 100 to obtain a percentage.

$$
accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$

Note: It is extremely hard to obtain 100% accuracy on any real data, if that happens, be aware that some leakage or something wrong might be happening - there is no consensus on an ideal accuracy value and it is also context-dependent. Depending on the cost of error (how bad it is if we trust the classifier and it turns out to be wrong), an acceptable error rate might be 5%, 10% or even 30%.

Let's score our classifier:

acc =  classifier.score(X_test, y_test)
print(acc) # 0.6191860465116279

By looking at the resulting score, we can deduce that our classifier got ~62% of our classes right. This already helps in the analysis, although by only knowing what the classifier got right, it is difficult to improve it.

There are 4 classes in our dataset - what if our classifier got 90% of classes 1, 2, and 3 right, but only 30% of class 4 right?

Either a systemic failure on some class, or a balanced failure shared between classes, can yield a 62% accuracy score. Accuracy isn't a really good metric for actual evaluation - but it does serve as a good proxy. More often than not, with balanced datasets, a 62% accuracy is relatively evenly spread. Also, more often than not, datasets aren't balanced, so we're back at square one with accuracy being an insufficient metric.

We can look deeper into the results using other metrics to be able to determine that. This step is also different from the regression, here we will use:

  1. Confusion Matrix: To know how much we got right or wrong for each class. The values that were positive and correctly predicted are called true positives; the ones that were predicted as positive but weren't are called false positives. The same nomenclature of true negatives and false negatives is used for negative values;
  2. Precision: To understand what proportion of the values predicted as positive by our classifier were actually correct. Precision divides the true positive values by everything that was predicted as positive;

$$
precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$

  3. Recall: To understand how many of the true positives were identified by our classifier. Recall is calculated by dividing the true positives by everything that should have been predicted as positive.

$$
recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$

  4. F1 score: The balanced or harmonic mean of precision and recall. The lowest value is 0 and the highest is 1. When the f1-score is equal to 1, it means all classes were correctly predicted - this is a very hard score to obtain with real data (exceptions almost always exist).

$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$

Note: A weighted F1 score also exists, and it's just an F1 that doesn't apply the same weight to all classes. The weight is typically dictated by the classes support - how many instances "support" the F1 score (the proportion of labels belonging to a certain class). The lower the support (the fewer instances of a class), the lower the weighted F1 for that class, because it's more unreliable.

The confusion_matrix() and classification_report() methods of the sklearn.metrics module can be used to calculate and display all these metrics. The confusion_matrix is better visualized using a heatmap. The classification report already gives us accuracy, precision, recall, and f1-score, but you could also import each of these metrics from sklearn.metrics.

To obtain metrics, execute the following snippet:

from sklearn.metrics import classification_report, confusion_matrix
# Importing Seaborn to use the heatmap
import seaborn as sns

# Adding classes names for better interpretation
classes_names = ['class 1','class 2','class 3', 'class 4']
cm = pd.DataFrame(confusion_matrix(y_test, y_pred), 
                  columns=classes_names, index = classes_names)
                  
# Seaborn's heatmap to better visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt='d');

print(classification_report(y_test, y_pred))

The output of the above script looks like this:

              precision    recall  f1-score   support

           1       0.75      0.78      0.76      1292
           2       0.49      0.56      0.53      1283
           3       0.51      0.51      0.51      1292
           4       0.76      0.62      0.69      1293

    accuracy                           0.62      5160
   macro avg       0.63      0.62      0.62      5160
weighted avg       0.63      0.62      0.62      5160

The results show that KNN was able to classify all the 5160 records in the test set with 62% accuracy, which is above average. The supports are fairly equal (even distribution of classes in the dataset), so the weighted F1 and unweighted F1 are going to be roughly the same.

We can also see the result of the metrics for each of the 4 classes. From that, we are able to notice that class 2 had the lowest precision, lowest recall, and lowest f1-score. Class 3 is right behind class 2 for having the lowest scores, and then, we have class 1 with the best scores followed by class 4.

By looking at the confusion matrix, we can see that:

  • class 1 was mostly mistaken for class 2 in 238 cases
  • class 2 for class 1 in 256 entries, and for class 3 in 260 cases
  • class 3 was mostly mistaken for class 2, in 374 entries, and for class 4, in 193 cases
  • class 4 was wrongly classified as class 3 for 339 entries, and as class 2 in 130 cases.

Also, notice that the diagonal displays the true positive values; when looking at it, it is plain to see that class 2 and class 3 have the fewest correctly predicted values.

With those results, we could go deeper into the analysis by further inspecting them to figure out why that happened, and also understanding if 4 classes are the best way to bin the data. Perhaps values from class 2 and class 3 were too close to each other, so it became hard to tell them apart.

Always try to test the data with a different number of bins to see what happens.
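For example, a quick sketch of re-binning into 3 or 5 quantile-based categories (the column names here are hypothetical) could look like:

df["MedHouseValCat3"] = pd.qcut(df["MedHouseVal"], 3, labels=[1, 2, 3])
df["MedHouseValCat5"] = pd.qcut(df["MedHouseVal"], 5, labels=[1, 2, 3, 4, 5])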

Besides the arbitrary number of data bins, there is also another arbitrary number that we have chosen, the number of K neighbors. The same technique we applied to the regression task can be applied to the classification when determining the number of Ks that maximize or minimize a metric value.

Finding the Best K for KNN Classification

Let's repeat what has been done for regression and plot the graph of K values and the corresponding metric for the test set. You can also choose which metric better fits your context, here, we will choose f1-score.

In this way, we will plot the f1-score for the predicted values of the test set for all the K values between 1 and 40.

First, we import the f1_score from sklearn.metrics and then calculate its value for all the predictions of a K-Nearest Neighbors classifier, where K ranges from 1 to 40:

from sklearn.metrics import f1_score

f1s = []

# Calculating f1 score for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # using average='weighted' to calculate a weighted average for the 4 classes 
    f1s.append(f1_score(y_test, pred_i, average='weighted'))

The next step is to plot the f1_score values against K values. The difference from the regression is that instead of choosing the K value that minimizes the error, this time we will choose the value that maximizes the f1-score.

Execute the following script to create the plot:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), f1s, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('F1 Score K Value')
plt.xlabel('K Value')
plt.ylabel('F1 Score')

The output graph looks like this:

From the output, we can see that the f1-score is the highest when the value of the K is 15. Let's retrain our classifier with 15 neighbors and see what it does to our classification report results:

classifier15 = KNeighborsClassifier(n_neighbors=15)
classifier15.fit(X_train, y_train)
y_pred15 = classifier15.predict(X_test)
print(classification_report(y_test, y_pred15))

This outputs:

              precision    recall  f1-score   support

           1       0.77      0.79      0.78      1292
           2       0.52      0.58      0.55      1283
           3       0.51      0.53      0.52      1292
           4       0.77      0.64      0.70      1293

    accuracy                           0.63      5160
   macro avg       0.64      0.63      0.64      5160
weighted avg       0.64      0.63      0.64      5160

Notice that our metrics have improved with 15 neighbors, we have 63% accuracy and higher precision, recall, and f1-scores, but we still need to further look at the bins to try to understand why the f1-score for classes 2 and 3 is still low.

Besides using KNN for regression (determining block values) and for classification (determining block classes), we can also use KNN for detecting which mean block values are different from most - the ones that don't follow what most of the data is doing. In other words, we can use KNN for detecting outliers.

Implementing KNN for Outlier Detection with Scikit-Learn

Outlier detection uses a method that differs from what we did previously for regression and classification.

Here, we will see how far each of the neighbors is from a data point. Let's use the default 5 neighbors. For a data point, we will calculate the distance to each of the K-nearest neighbors. To do that, we will import another KNN algorithm from Scikit-learn which is not specific for either regression or classification called simply NearestNeighbors.

After importing, we will instantiate a NearestNeighbors class with 5 neighbors - you can also instantiate it with 12 neighbors to identify outliers in our regression example or with 15, to do the same for the classification example. We will then fit our train data and use the kneighbors() method to find our calculated distances for each data point and neighbors indexes:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors = 5)
nbrs.fit(X_train)
# Distances and indexes of the 5 neighbors 
distances, indexes = nbrs.kneighbors(X_train)

Now we have 5 distances for each data point - the distance between itself and its 5 neighbors, and an index that identifies them. Let's take a peek at the first three results and the shape of the array to visualize this better.

To look at the first three rows of distances and the shape of the array, execute:

distances[:3], distances.shape
(array([[0.        , 0.12998939, 0.15157687, 0.16543705, 0.17750354],
        [0.        , 0.25535314, 0.37100754, 0.39090243, 0.40619693],
        [0.        , 0.27149697, 0.28024623, 0.28112326, 0.30420656]]),
 (3, 5))

Observe that there are 3 rows with 5 distances each, and that the first distance in each row is 0 - each point's nearest neighbor is itself. We can also look at the neighbors' indexes:

indexes[:3], indexes[:3].shape

This results in:

(array([[    0,  8608, 12831,  8298,  2482],
        [    1,  4966,  5786,  8568,  6759],
        [    2, 13326, 13936,  3618,  9756]]),
 (3, 5))

In the output above, we can see the indexes of each of the 5 neighbors. Now, we can continue to calculate the mean of the 5 distances and plot a graph that counts each row on the X-axis and displays each mean distance on the Y-axis:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point')
plt.xlabel('Count')
plt.ylabel('Mean Distances')

Notice that there is a part of the graph in which the mean distances have uniform values. That Y-axis point in which the means aren't too high or too low is exactly the point we need to identify to cut off the outlier values.

In this case, it is where the mean distance is 3. Let's plot the graph again with a horizontal dotted line to be able to spot it:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point with cut-off line')
plt.xlabel('Count')
plt.ylabel('Mean Distances')
plt.axhline(y = 3, color = 'r', linestyle = '--')

This line marks the mean distance above which points are considered outliers. This means that all points with a mean distance above 3 are our outliers. We can find the indexes of those points using np.where(), which returns the indices of the points whose mean distance is above 3:

import numpy as np

# Visually determine cutoff values > 3
outlier_index = np.where(dist_means > 3)
outlier_index

The above code outputs:

(array([  564,  2167,  2415,  2902,  6607,  8047,  8243,  9029, 11892,
        12127, 12226, 12353, 13534, 13795, 14292, 14707]),)

Now we have our outlier point indexes. Let's locate them in the dataframe:

# Filter outlier values
outlier_values = df.iloc[outlier_index]
outlier_values

This results in:

		MedInc 	HouseAge AveRooms 	AveBedrms 	Population 	AveOccup 	Latitude 	Longitude 	MedHouseVal
564 	4.8711 	27.0 	 5.082811 	0.944793 	1499.0 	    1.880803 	37.75 		-122.24 	2.86600
2167 	2.8359 	30.0 	 4.948357 	1.001565 	1660.0 	    2.597809 	36.78 		-119.83 	0.80300
2415 	2.8250 	32.0 	 4.784232 	0.979253 	761.0 	    3.157676 	36.59 		-119.44 	0.67600
2902 	1.1875 	48.0 	 5.492063 	1.460317 	129.0 	    2.047619 	35.38 		-119.02 	0.63800
6607 	3.5164 	47.0 	 5.970639 	1.074266 	1700.0 	    2.936097 	34.18 		-118.14 	2.26500
8047 	2.7260 	29.0 	 3.707547 	1.078616 	2515.0 	    1.977201 	33.84 		-118.17 	2.08700
8243 	2.0769 	17.0 	 3.941667 	1.211111 	1300.0 	    3.611111 	33.78 		-118.18 	1.00000
9029 	6.8300 	28.0 	 6.748744 	1.080402 	487.0 		2.447236 	34.05 		-118.78 	5.00001
11892 	2.6071 	45.0 	 4.225806 	0.903226 	89.0 		2.870968 	33.99 		-117.35 	1.12500
12127 	4.1482 	7.0 	 5.674957 	1.106998 	5595.0 		3.235975 	33.92 		-117.25 	1.24600
12226 	2.8125 	18.0 	 4.962500 	1.112500 	239.0 		2.987500 	33.63 		-116.92 	1.43800
12353 	3.1493 	24.0 	 7.307323 	1.460984 	1721.0 		2.066026 	33.81 		-116.54 	1.99400
13534 	3.7949 	13.0 	 5.832258 	1.072581 	2189.0 		3.530645 	34.17 		-117.33 	1.06300
13795 	1.7567 	8.0 	 4.485173 	1.120264 	3220.0 		2.652389 	34.59 		-117.42 	0.69500
14292 	2.6250 	50.0 	 4.742236 	1.049689 	728.0 		2.260870 	32.74 		-117.13 	2.03200
14707 	3.7167 	17.0 	 5.034130 	1.051195 	549.0 		1.873720 	32.80 		-117.05 	1.80400

Our outlier detection is finished. This is how we spot each data point that deviates from the general data trend. We can see that there are 16 points in our train data that should be looked at further, investigated, maybe treated, or even removed from our data (if they were erroneously input) to improve results. Those points might have resulted from typing errors, inconsistencies in the mean block values, or even both.
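If you decided to simply remove them, a minimal sketch might look like this - it assumes the positions in outlier_index refer to row positions in the scaled training arrays, and that y_train is the matching label Series:

import numpy as np

# Drop the detected outlier rows from the scaled training data and the matching labels
X_train_clean = np.delete(X_train, outlier_index[0], axis=0)
y_train_clean = y_train.drop(y_train.index[outlier_index[0]])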

Pros and Cons of KNN

In this section, we'll present some of the pros and cons of using the KNN algorithm.

Pros

  • It is easy to implement
  • It is a lazy learning algorithm and therefore doesn't require a training phase (it only uses the K-nearest neighbors at prediction time). This makes getting started with KNN and adding new data much faster than with algorithms that must be trained on the whole dataset, such as Support Vector Machines, linear regression, etc.
  • Since KNN requires no training before making predictions, new data can be added seamlessly
  • There are only two parameters required to work with KNN, i.e. the value of K and the distance function

Cons

  • The KNN algorithm doesn't work well with high dimensional data because with a large number of dimensions, the distance between points gets "weird", and the distance metrics we use don't hold up
  • Finally, the KNN algorithm doesn't work well with categorical features since it is difficult to find the distance between dimensions with categorical features

Going Further - Hand-Held End-to-End Project

Your inquisitive nature makes you want to go further? We recommend checking out our Guided Project "Hands-On House Price Prediction - Machine Learning in Python".

In this guided project - you'll learn how to build powerful traditional machine learning models as well as deep learning models, utilize Ensemble Learning and train meta-learners to predict house prices from a bag of Scikit-Learn and Keras models.

Using Keras, the deep learning API built on top of Tensorflow, we'll experiment with architectures, build an ensemble of stacked models and train a meta-learner neural network (level-1 model) to figure out the pricing of a house.

Deep learning is amazing - but before resorting to it, it's advised to also attempt solving the problem with simpler techniques, such as with shallow learning algorithms. Our baseline performance will be based on a Random Forest Regression algorithm. Additionally - we'll explore creating ensembles of models through Scikit-Learn via techniques such as bagging and voting.

This is an end-to-end project, and like all Machine Learning projects, we'll start out with Exploratory Data Analysis, followed by Data Preprocessing and finally Building Shallow and Deep Learning Models to fit the data we've explored and cleaned previously.

Conclusion

KNN is a simple yet powerful algorithm. It can be used for many tasks such as regression, classification, or outlier detection.

KNN has been widely used to find document similarity and pattern recognition. It has also been employed for developing recommender systems and for dimensionality reduction and pre-processing steps for computer vision - particularly face recognition tasks.

In this guide - we've gone through regression, classification and outlier detection using Scikit-Learn's implementation of the K-Nearest Neighbor algorithm.

Talk Python to Me: #378: Flet: Flutter apps in Python

Have you heard of Flutter? It's a modern and polished UI framework to write mobile apps, desktop apps, and even web apps. While interesting, you may have kept your distance because Flutter is a Dart language-based framework. But with the project we're covering today, Flet, many Flutter UIs can now be written in pure Python. Flet is a very exciting development in the GUI space for Python devs. And we have the creator, Feodor Fitsner, here to take us through it.

Links from the show:

  • Feodor on GitHub: https://github.com/FeodorFitsner
  • Flet: https://flet.dev
  • Flutter: https://flutter.dev
  • Dart: https://dart.dev
  • Flet Tutorials: https://flet.dev/docs/tutorials
  • It's All Widgets Showcase: https://itsallwidgets.com
  • Roadmap: https://flet.dev/docs/roadmap/
  • pglet: https://pglet.io/docs/tutorials/python/
  • Flutter Flow Designer: https://flutterflow.io
  • Fluent UI for Flutter Showcase App: https://bdlukaa.github.io/fluent_ui/
  • macOS UI: https://pub.dev/packages/macos_ui
  • Flet Mobile Strategy: https://flet.dev/blog/flet-mobile-strategy/
  • Michael's flutter doctor output: https://talk-python-shared.nyc3.digitaloceanspaces.com/flutter-doctor.png
  • Pyscript: https://pyscript.net
  • Watch this episode on YouTube: https://www.youtube.com/watch?v=kxsLRRY2xZA
  • Episode transcripts: https://talkpython.fm/episodes/transcript/378/flet-flutter-apps-in-python

Stay in touch with us:

  • Subscribe to us on YouTube: https://talkpython.fm/youtube
  • Follow Talk Python on Twitter: @talkpython (https://twitter.com/talkpython)
  • Follow Michael on Twitter: @mkennedy (https://twitter.com/mkennedy)

Sponsors:

  • Sentry's DEX Conference: https://talkpython.fm/sentry-dex-conf
  • IRL Podcast: https://talkpython.fm/irl
  • AssemblyAI: https://talkpython.fm/assemblyai
  • Talk Python Training: https://talkpython.fm/training

Python for Beginners: Insert Element in A Sorted List in Python


Normally, we add elements to the end of a list in Python. However, if we are given a sorted list and we are asked to maintain the order of the elements while inserting a new element, it might become a tedious task. In this article, we will discuss different approaches to insert an element in a sorted list in Python.

How to Insert an Element in a Sorted List?

If we are given a sorted list and we are asked to maintain the order of the elements while inserting a new element, we first need to find the position where the new element can be inserted. After that, we can insert the element into the list using slicing or the insert() method. 

Using slicing

To insert a new element in a sorted list using slicing, we will first find the position at which the element should be inserted. For this, we will find the index of the first element in the list that is greater than the element to be inserted. After that, we will slice the list into two parts in such a way that one slice contains all the elements smaller than the element to be inserted and the other slice contains all the elements greater than or equal to the element to be inserted.

After creating the slices, we will create a list with the element to be inserted as its only element. Thereafter, we will concatenate the slices. In this way, we can create a sorted list that also contains the new element. You can observe this in the following example.

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
l = len(myList)
index = l  # if no element greater than the new element is found, insert at the end
for i in range(l):
    if myList[i] > element:
        index = i
        break
myList = myList[:index] + [element] + myList[index:]
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Using The insert() Method

After finding the index of the element that is greater than the element to be inserted, we can use the insert() method to insert the element in the sorted list. The insert() method, when invoked on a list, takes the index as its first input argument and the element to be inserted as the second input argument. After execution, the element is inserted into the list. 

After finding the element that is greater than the element to be inserted, we will insert the element just before the element using the insert() method as shown below.

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
l = len(myList)
index = l  # if no element greater than the new element is found, insert at the end
for i in range(l):
    if myList[i] > element:
        index = i
        break
myList.insert(index, element)
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Suggested Reading: Regression in Machine Learning With Examples

Insert an Element in a Sorted List Using The bisect Module

The bisect module provides us with the insort() function, with which we can insert an element into a sorted list while keeping it sorted. The insort() function takes the sorted list as its first input argument and the element to be inserted as its second input argument. After execution, the element is inserted into the list. You can observe this in the following example.

import bisect

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
bisect.insort(myList, element)
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Conclusion

In this article, we have discussed different approaches to insert elements in a sorted list in python. To know more about lists, you can read this article on list comprehension in python. You might also like this article on dictionary comprehension in python.

The post Insert Element in A Sorted List in Python appeared first on PythonForBeginners.com.

Python Circle: The Rising Popularity of Python

The rising popularity of Python: why is Python so popular, what is the future of Python, and is Python going to be the top choice by 2025?

PyCharm: PyCharm 2022.2.1 Is Out


The first minor release for PyCharm 2022.2 is available with the following fixes:

  • We’ve enabled the new UI for setting up an interpreter via the Show all popup menu in the Python Interpreter popup window. [PY-53057]
  • Docker: Run options that were set up via a run/debug configuration are no longer ignored if other run options were set up during Docker interpreter configuration. [PY-53638], [PY-53116]
  • Docker: The console and debugger now connect when using a Docker interpreter on Linux. [PY-55338]
  • Docker Compose: Project interpreters configured with previous PyCharm versions now start as expected. [PY-55423]
  • Docker Compose: Port configuration now works for the Docker Compose interpreter. [PY-54302]
  • Docker Compose: Running Django with the Docker Compose interpreter no longer leads to an HTTP error. [PY-55394]
  • PyCharm now recognizes non-default Flask app names and sets up the Flask run configuration accordingly. [PY-55347]
  • Testing: Behave run configurations can run individual scenarios instead of running all available files from the target directory. [PY-55434]
  • The Emulate terminal in output console option no longer leads to unexpected indentations in the console or debugger. [PY-55322]
  • Debugger: The Python console now displays text containing ANSI color sequences correctly. [PY-54599]
  • Debugger: PyCharm no longer produces an error when debugging code containing non-ASCII encoding. [PY-55369]
  • Debugger: Debugging a multiprocessing script no longer leads to an IDE exception error. [PY-55104]

You can download the new version from our website, update directly from the IDE, update via the free Toolbox App, or use snaps for Ubuntu.

The major changes in 2022.2

With PyCharm 2022.2, we introduced an important change to how PyCharm works with remote targets such as the WSL, SSH, and Docker. This change enabled users to create virtual environments on the WSL and SSH directly in PyCharm. It also equipped PyCharm with a new unified wizard for managing remote interpreters, enhancements that were not possible with the old implementation.

Initially, PyCharm focused on local development. As PyCharm matured, it started to cover new approaches and technologies. At first, support was added for the basic scenarios of development with the code deployed on SSH servers. Then, Docker integration was implemented and WSL followed. While different technologies were supported, the codebase responsible for code execution remained the same in its essence, leaving several corner cases and individual features unsupported.

To implement this new functionality, we had to refactor a significant portion of our huge code base. We did internal dogfooding of the updated code for several months before we added the functionality to the Early Access Program builds for the 2022.1 release and then for the 2022.2 release. Unfortunately, there were still some regressions that we didn’t catch, and they ended up appearing in PyCharm 2022.2. This minor release brings fixes that allow you to work more efficiently with remote interpreters again.

What’s next

Our future plans include providing the ability to configure Conda interpreters in WSL and on SSH servers, and adding support for managing Python interpreters configured in Docker running on an SSH server.

For now, as we release PyCharm 2022.2.1, we are already working on the next minor iteration, and a number of reported issues are currently being prioritized for it.

If you notice any issues after upgrading to 2022.2.1, please contact our support team or create a ticket in our issue tracker.

Real Python: How to Check if a Python String Contains a Substring

If you’re new to programming or come from a programming language other than Python, you may be looking for the best way to check whether a string contains another string in Python.

Identifying such substrings comes in handy when you’re working with text content from a file or after you’ve received user input. You may want to perform different actions in your program depending on whether a substring is present or not.

In this tutorial, you’ll focus on the most Pythonic way to tackle this task, using the membership operator in. Additionally, you’ll learn how to identify the right string methods for related, but different, use cases.

Finally, you’ll also learn how to find substrings in pandas columns. This is helpful if you need to search through data from a CSV file. You could use the approach that you’ll learn in the next section, but if you’re working with tabular data, it’s best to load the data into a pandas DataFrame and search for substrings in pandas.
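The pandas part comes later in the full article, but as a rough sketch of the idea (the DataFrame and the column name "quote" below are made up for illustration, not taken from the article's data), pandas exposes the same kind of check through the .str.contains() string accessor:

import pandas as pd

# A toy DataFrame; the "quote" column is purely illustrative.
df = pd.DataFrame({"quote": ["It contains a SECRET secret", "nothing here", "the secret"]})

# .str.contains() returns a boolean Series; case=False ignores capitalization
# (by default the pattern is treated as a regular expression).
mask = df["quote"].str.contains("secret", case=False)

# Keep only the rows whose "quote" column contains the substring.
matches = df[mask]
print(matches)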

Free Download: Click here to download the sample code that you’ll use to check if a string contains a substring.

How to Confirm That a Python String Contains Another String

If you need to check whether a string contains a substring, use Python’s membership operator in. In Python, this is the recommended way to confirm the existence of a substring in a string:

>>> raw_file_content = """Hi there and welcome.
... This is a special hidden file with a SECRET secret.
... I don't want to tell you The Secret,
... but I do want to secretly tell you that I have one."""
>>> "secret" in raw_file_content
True

The in membership operator gives you a quick and readable way to check whether a substring is present in a string. You may notice that the line of code almost reads like English.

Note: If you want to check whether the substring is not in the string, then you can use not in:

>>> "secret" not in raw_file_content
False

Because the substring "secret" is present in raw_file_content, the not in operator returns False.

When you use in, the expression returns a Boolean value:

  • True if Python found the substring
  • False if Python didn’t find the substring

You can use this intuitive syntax in conditional statements to make decisions in your code:

>>> if "secret" in raw_file_content:
...     print("Found!")
...
Found!

In this code snippet, you use the membership operator to check whether "secret" is a substring of raw_file_content. If it is, then you’ll print a message to the terminal. Any indented code will only execute if the Python string that you’re checking contains the substring that you provide.

The membership operator in is your best friend if you just need to check whether a Python string contains a substring.

However, what if you want to know more about the substring? If you read through the text stored in raw_file_content, then you’ll notice that the substring occurs more than once, and even in different variations!

Which of these occurrences did Python find? Does capitalization make a difference? How often does the substring show up in the text? And what’s the location of these substrings? If you need the answer to any of these questions, then keep on reading.
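As a quick preview (the snippet below is just a small illustration, not taken from the article), the built-in str.count() and str.find() methods answer two of those questions:

>>> text = "a secret is a secret"
>>> text.count("secret")   # how many times the substring occurs
2
>>> text.find("secret")    # index of the first occurrence, or -1 if absent
2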

Generalize Your Check by Removing Case Sensitivity

Python strings are case sensitive. If the substring that you provide uses different capitalization than the same word in your text, then Python won’t find it. For example, if you check for the lowercase word "secret" on a title-case version of the original text, the membership operator check returns False:

>>> title_cased_file_content = """Hi There And Welcome.
... This Is A Special Hidden File With A Secret Secret.
... I Don't Want To Tell You The Secret,
... But I Do Want To Secretly Tell You That I Have One."""
>>> "secret" in title_cased_file_content
False

Despite the fact that the word secret appears multiple times in the title-case text title_cased_file_content, it never shows up in all lowercase. That’s why the check that you perform with the membership operator returns False. Python can’t find the all-lowercase string "secret" in the provided text.

Humans have a different approach to language than computers do. This is why you’ll often want to disregard capitalization when you check whether a string contains a substring in Python.

You can generalize your substring check by converting the whole input text to lowercase:

>>> file_content = title_cased_file_content.lower()
>>> print(file_content)
hi there and welcome.
this is a special hidden file with a secret secret.
i don't want to tell you the secret,
but i do want to secretly tell you that i have one.
>>> "secret" in file_content
True

Read the full article at https://realpython.com/python-string-contains-substring/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Anarcat: Alternatives MPD clients to GMPC

GMPC (GNOME Music Player Client) is an audio player based on MPD (Music Player Daemon) that I've been using as my main audio player for years now.

Unfortunately, it's marked as "unmaintained" in the official list of MPD clients, along with basically every client available in Debian. In fact, if you look closely, all but one of the 5 unmaintained clients are in Debian (ario, cantata, gmpc, and sonata), which is kind of sad. And none of the active ones are packaged.

GMPC status and features

GMPC, in particular, is basically dead. The upstream website domain has been lost and there has been no release in ages. It's built with GTK2 so it's bound to be destroyed in a fire at some point anyways.

Still: it's really an awesome client. It has:

  • cover support
  • lyrics and tabs lookups (although those typically fail now)
  • last.fm lookups
  • high performance: loading thousands of artists or tracks is almost instant
  • repeat/single/consume/shuffle settings (single is particularly nice)
  • (global) keyboard shortcuts
  • file, artist, genre, tag browser
  • playlist editor
  • plugins
  • multi-profile support
  • avahi support
  • shoutcast support

Regarding performance, the only thing that I could find to slow down gmpc is to make it load all of my 40k+ artists in a playlist. That's slow, but it's probably understandable.

It's basically impossible to find a client that satisfies all of those.

But here are the clients that I found, alphabetically. I restrict myself to Linux-based clients.

CoverGrid

CoverGrid looks real nice, but is sharply focused on browsing covers. It's explicitly "not to be a replacement for your favorite MPD client but an addition to get a better album-experience", so probably not good enough for a daily driver. I asked for a FlatHub package so it could be tested.

mpdevil

mpdevil is a nice little client. It supports:

  • repeat, shuffle, single, consume mode
  • playlist support (although it fails to load any of my playlist with a UnicodeDecodeError)
  • nice genre / artist / album cover based browser
  • fails to load "all artists" (or takes too long to (pre-?)load covers?)
  • keyboard shortcuts
  • no file browser

Overall pretty good, but performance issues with large collections, and needs a cleanly tagged collection (which is not my case).

QUIMUP

QUIMUP looks like a simple client, C++, Qt, and mouse-based. No Flatpak, not tested.

SkyMPC

SkyMPC is similar. Ruby, Qt, documentation in Japanese. No Flatpak, not tested.

Xfmpc

Xfmpc is the XFCE client. Minimalist, doesn't seem to have all the features I need. No Flatpak, not tested.

Ymuse

Ymuse is another promising client. It has trouble loading all my artists or albums (and that's without album covers), but it eventually does. It does have a Files browser which saves it... It's noticeably slower than gmpc but does the job.

Cover support is spotty: covers sometimes show up in notifications but not in the player, which is odd. I'm missing a "this track information" thing. It seems to support playlists okay.

I'm missing an album cover browser as well. Overall seems like the most promising.

Written in Golang. It crashed on a library update.

Conclusion

For now, I guess that ymuse is the most promising client, even though it's still lacking some features and performance is suffering compared to gmpc. I'll keep updating this page as I find more information about the projects. I do not intend to package anything yet, and will wait a while to see if a clear winner emerges.

Mike Driscoll: How to Convert Images to PDFs with Python and Pillow (Video)

PyBites: Making plots in your terminal with plotext

In this blog post we share a quick script to plot the frequency of our blog articles in the terminal. It’s good to see that we’re getting back on track 🙂

The code gist is here.

First we import the libraries we are going to use. As always we separate Standard Library modules from 3rd party ones as per PEP8:


from collections import Counter
from datetime import date

from dateutil.parser import parse
import plotext as plt
import requests

Then I define some constants:

API_URL = "https://codechalleng.es/api/articles/"
START_YEAR = 2017
THIS_YEAR = date.today().year
THIS_MONTH = date.today().month
MONTH_RANGE = range(1, 13)

I defined a year-month generator. Why? Because there are months in which we did not post at all, yet I still want to show them on the graph.

def _create_yymm_range():
    for year in range(START_YEAR, THIS_YEAR + 1):
        for month in MONTH_RANGE:
            yield f"{year}-{str(month).zfill(2)}"
            if year == THIS_YEAR and month == THIS_MONTH:
                break
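
A quick way to sanity-check the generator (illustrative REPL output, not part of the original gist; it assumes START_YEAR is 2017 as defined above):

>>> months = list(_create_yymm_range())
>>> months[:3]
['2017-01', '2017-02', '2017-03']
>>> months[-1] == f"{THIS_YEAR}-{str(THIS_MONTH).zfill(2)}"
True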

Then comes the workhorse function that calculates the number of posts per month. We conveniently use a Counter, the right abstraction here:

def get_articles_per_month(url=API_URL):
    ym_range = _create_yymm_range()
    cnt = Counter({ym: 0 for ym in ym_range})
    data = requests.get(url)
    for row in data.json():
        dt = parse(row["publish_date"])
        if dt.year < START_YEAR:
            continue
        ym = dt.strftime("%Y-%m")
        cnt[ym] += 1
    return cnt

We use the requests library to get all the articles which we expose using our platform’s API.

Again, using _create_yymm_range() we make sure that all months are covered, even if there were 0 posts.

(Looking back, cnt could be better named number_posts_per_month. Well, you hardly ever get it right the first time!)
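
To see why the Counter is pre-seeded with zeros, here is a tiny illustration (not from the gist): the pre-seeded keys survive even when nothing gets counted for them, so empty months still show up later in the plot.

>>> from collections import Counter
>>> seeded = Counter({"2022-07": 0, "2022-08": 0})
>>> seeded["2022-08"] += 1
>>> seeded
Counter({'2022-08': 1, '2022-07': 0})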

Making a plot with plotext is as simple as picking the type of graph and feeding it labels and values:

def show_plot(data):
    labels, values = zip(*data.items())
    plt.bar(labels, values)
    plt.title("Pybites articles published per month")
    plt.show()

zip(*data.items()) is a nice trick to extract labels and values from a list of tuples (which a dict’s .items() method gives us):

>>> d = {1: 'a', 2: 'b', 3: 'c'}
>>> list(zip(*d.items()))
[(1, 2, 3), ('a', 'b', 'c')]

We give the plot a title and show it, which will render it in the terminal.

Lastly, we make sure the two functions are called only when the script is run directly, not when it is imported.

You do that by using an if __name__ == "__main__" block:

if __name__ == "__main__":
    data = get_articles_per_month()
    show_plot(data)

And the result:

[Screenshot: Pybites # of blog posts per month since we started.]

 —

Thanks Russell for introducing me to this library the other day. Do you want to see a really cool use case? Check out his blog article 😍
