Channel: Planet Python

Made With Mu: Mu and PyGameZero Gamepad Demo


Anthony Briggs (who blogs at NeoLudic) got in touch via Twitter with a proof-of-concept for controlling PyGameZero games with a gamepad (such as the one you'd use with your Xbox or PlayStation).


The source code is nice and simple, although it relies on two patches which have not yet made it into a release of PyGameZero (so mark this as coming soon). One patch adds joystick support (i.e. what you do with the gamepad) and the other makes it easy to flip sprites.

Coming soon to a PyGameZero/Mu release near you!


Codementor: Dynamic Task Routing in Celery

This post was originally published on Celery. The Missing Tutorials (https://www.python-celery.com/) on June 5th, 2018. All source code examples used in this blog post can be found on GitHub: ...

Codementor: Writing a Simple Web Scraper using Scrapy

Web scrapers are a great way to collect data for projects. In this example I will use the @Scrapy Framework (https://scrapy.org/) to create a web scraper that gets the links of products when...

Bhishan Bhandari: Web Scraping using Golang


Web scraping can be beneficial to individuals and companies. The intention of this post is to host a set of examples on web scraping using Golang and goquery. I will be using GitHub's trending page https://github.com/trending throughout this post for the examples, mainly because it is well suited for applying the various goquery methods. There are two […]

The post Web Scraping using Golang appeared first on The Tara Nights.

Davy Wybiral: Learn to Solder Kits

These Learn to Solder kits from Rocket Dept. are a great way to teach your youngsters about soldering and basic electronics. One of them controls three LEDs with push buttons, one is a large RGB LED connected to three potentiometers so you can customize the color, one is a bug that vibrates to walk around, and the other is a firefly in a jar.


py.CheckIO: Using Regular Expressions in Python


One of the programmers’ favorite jokes about regular expressions goes: "There is a problem when working with strings? Excellent, in this situation I can use regular expressions." And now the developer has 2 problems... In this article you’ll get acquainted with the basics of working with regular expressions, and hopefully avoid becoming the programmer from that joke who, by using regular expressions, only complicated the task at hand.
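As a quick taste of the basics the article covers (this snippet is not from the article itself), named groups are one feature that keeps patterns readable:

import re

log_line = "2018-09-24 18:32:10 ERROR disk space low on /dev/sda1"

# Named groups document what each part of the pattern captures.
pattern = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:]+) (?P<level>\w+) (?P<message>.+)")

match = pattern.match(log_line)
if match:
    print(match.group("level"))    # ERROR
    print(match.group("message"))  # disk space low on /dev/sda1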

Stack Abuse: Creating a Simple Recommender System in Python using Pandas


Introduction

Have you ever wondered how Netflix suggests movies to you based on the movies you have already watched? Or how an e-commerce website displays options such as "Frequently Bought Together"? They may look like relatively simple features, but behind the scenes a complex statistical algorithm executes in order to predict these recommendations. Such systems are called Recommender Systems, Recommendation Systems, or Recommendation Engines. A Recommender System is one of the most famous applications of data science and machine learning.

A Recommender System employs a statistical algorithm that seeks to predict users' ratings for a particular entity, based on the similarity between the entities or similarity between the users that previously rated those entities. The intuition is that similar types of users are likely to have similar ratings for a set of entities.

Currently, almost all of the big companies employ Recommender Systems in one way or another. For instance, YouTube uses Recommender Systems for video suggestions, Amazon uses them for product recommendations, and Facebook uses Recommender Systems for recommending people to follow and pages to like.

In this article, we will see how we can build a simple recommender system in Python.

Types of Recommender Systems

There are two major approaches to build recommender systems: Content-Based Filtering and Collaborative Filtering:

Content-Based Filtering

In content-based filtering, the similarity between different products is calculated on the basis of the attributes of the products. For instance, in a content-based movie recommender system, the similarity between the movies is calculated on the basis of genres, the actors in the movie, the director of the movie, etc.
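The rest of this article implements the collaborative approach, but just to make the content-based idea concrete, here is a rough sketch (not part of the original walkthrough) that scores movie similarity purely on the pipe-separated genres column of the MovieLens "movies.csv" file, assuming the same file location used later in the article:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.read_csv(r"E:\Datasets\ml-latest-small\movies.csv")

# One-hot encode each movie's genres ("Adventure|Animation|..." etc.).
vectorizer = CountVectorizer(tokenizer=lambda genres: genres.split('|'))
genre_matrix = vectorizer.fit_transform(movies['genres'])

# Cosine similarity between the first movie (Toy Story) and every other movie.
scores = cosine_similarity(genre_matrix[0], genre_matrix)[0]
print(pd.Series(scores, index=movies['title']).sort_values(ascending=False).head())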

Collaborative Filtering

Collaborative filtering leverages the power of the crowd. The intuition behind collaborative filtering is that if a user A likes products X and Y, and another user B likes product X, there is a fair chance that user B will like product Y as well.

Take the example of a movie recommender system. Suppose a huge number of users have assigned the same ratings to movies X and Y. A new user comes along who has assigned the same rating to movie X but hasn't watched movie Y yet. A collaborative filtering system will recommend movie Y to that user.
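To make that intuition concrete, here is a tiny hand-made example (the numbers are invented purely for illustration):

import pandas as pd

# A toy user/movie ratings matrix; NaN means "not rated yet".
ratings = pd.DataFrame(
    {"Movie X": [5.0, 4.0, 5.0, 4.0],
     "Movie Y": [5.0, 4.0, 5.0, None]},
    index=["user_a", "user_b", "user_c", "new_user"])

# Users who rated both movies gave them identical scores, so the two
# columns are perfectly correlated here...
print(ratings["Movie X"].corr(ratings["Movie Y"]))  # 1.0

# ...which is exactly why a collaborative filter would suggest Movie Y
# to new_user, who liked Movie X but has not seen Movie Y.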

Movie Recommender System Implementation in Python

In this section, we'll develop a very simple movie recommender system in Python that uses the correlation between the ratings assigned to different movies, in order to find the similarity between the movies.

The dataset that we are going to use for this problem is the MovieLens dataset. To download it, go to the home page of the dataset and download the "ml-latest-small.zip" file, which contains a subset of the full movie dataset with 100,000 ratings for 9,000 movies by 700 users.

Once you unzip the downloaded file, you will see "links.csv", "movies.csv", "ratings.csv" and "tags.csv" files, along with the "README" document. In this article, we are going to use the "movies.csv" and "ratings.csv" files.

For the scripts in this article, the unzipped "ml-latest-small" folder has been placed inside the "Datasets" folder in the "E" drive.

Data Visualization and Preprocessing

The first step in every data science problem is to visualize and preprocess the data. We will do the same, so let's first import the "ratings.csv" file and see what it contains. Execute the following script:

import numpy as np  
import pandas as pd

ratings_data = pd.read_csv(r"E:\Datasets\ml-latest-small\ratings.csv")
ratings_data.head()  

In the script above we use the read_csv() method of the Pandas library to read the "ratings.csv" file. Next, we call the head() method from the dataframe object returned by the read_csv() function, which will display the first five rows of the dataset.

The output looks like this:

   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205

You can see from the output that the "ratings.csv" file contains the userId, movieId, rating, and timestamp attributes. Each row in the dataset corresponds to one rating. The userId column contains the ID of the user who left the rating, the movieId column contains the ID of the movie, and the rating column contains the rating left by the user. Ratings can have values between 0.5 and 5. Finally, the timestamp refers to the time at which the user left the rating.

There is one problem with this dataset. It contains the IDs of the movies but not their titles. We'll need movie names for the movies we're recommending. The movie names are stored in the "movies.csv" file. Let's import the file and see the data it contains. Execute the following script:

movie_names = pd.read_csv(r"E:\Datasets\ml-latest-small\movies.csv")
movie_names.head()  

The output looks like this:

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy

As you can see, this dataset contains movieId, the title of the movie, and its genres. We need a dataset that contains the userId, movie title, and its ratings. We have this information in two different dataframe objects: "ratings_data" and "movie_names". To get our desired information in a single dataframe, we can merge the two dataframe objects on the movieId column, since it is common to both.

We can do this using merge() function from the Pandas library, as shown below:

movie_data = pd.merge(ratings_data, movie_names, on='movieId')  

Now let's view our new dataframe:

movie_data.head()  

The output looks like this:

   userId  movieId  rating    timestamp                   title genres
0       1       31     2.5   1260759144  Dangerous Minds (1995)  Drama
1       7       31     3.0    851868750  Dangerous Minds (1995)  Drama
2      31       31     4.0  12703541953  Dangerous Minds (1995)  Drama
3      32       31     4.0    834828440  Dangerous Minds (1995)  Drama
4      36       31     3.0    847057202  Dangerous Minds (1995)  Drama

You can see our newly created dataframe contains userId, title, and rating of the movie as required.

Now let's take a look at the average rating of each movie. To do so, we can group the dataset by the title of the movie and then calculate the mean of the rating for each movie. We will then display the first five movies along with their average rating using the head() method. Look at the following script:

movie_data.groupby('title')['rating'].mean().head()  

The output looks like this:

title  
"Great Performances" Cats (1998)           1.750000
$9.99 (2008)                               3.833333
'Hellboy': The Seeds of Creation (2004)    2.000000  
'Neath the Arizona Skies (1934)            0.500000  
'Round Midnight (1986)                     2.250000  
Name: rating, dtype: float64  

You can see that the average ratings are not sorted. Let's sort the ratings in the descending order of their average ratings:

movie_data.groupby('title')['rating'].mean().sort_values(ascending=False).head()  

If you execute the above script, the output will look like this:

title  
Burn Up! (1991)                                     5.0  
Absolute Giganten (1999)                            5.0  
Gentlemen of Fortune (Dzhentlmeny udachi) (1972)    5.0  
Erik the Viking (1989)                              5.0  
Reality (2014)                                      5.0  
Name: rating, dtype: float64  

The movies have now been sorted in descending order of their average ratings. However, there is a problem: a movie can make it to the top of the above list even if only a single user has given it five stars. Therefore, the above stats can be misleading. Normally, a genuinely good movie gets a high rating from a large number of users.

Let's now look at the total number of ratings for each movie:

movie_data.groupby('title')['rating'].count().sort_values(ascending=False).head()  

Executing the above script returns the following output:

title  
Forrest Gump (1994)                          341  
Pulp Fiction (1994)                          324  
Shawshank Redemption, The (1994)             311  
Silence of the Lambs, The (1991)             304  
Star Wars: Episode IV - A New Hope (1977)    291  
Name: rating, dtype: int64  

Now you can see some really good movies at the top. The above list supports our point that good movies normally receive higher ratings. Now we know that both the average rating per movie and the number of ratings per movie are important attributes. Let's create a new dataframe that contains both of these attributes.

Execute the following script to create the ratings_mean_count dataframe, first adding the average rating of each movie to it:

ratings_mean_count = pd.DataFrame(movie_data.groupby('title')['rating'].mean())  

Next, we need to add the number of ratings for a movie to the ratings_mean_count dataframe. Execute the following script to do so:

ratings_mean_count['rating_counts'] = pd.DataFrame(movie_data.groupby('title')['rating'].count())  

Now let's take a look at our newly created dataframe.

ratings_mean_count.head()  

The output looks like this:

                                           rating  rating_counts
title
"Great Performances" Cats (1998)         1.750000              2
$9.99 (2008)                             3.833333              3
'Hellboy': The Seeds of Creation (2004)  2.000000              1
'Neath the Arizona Skies (1934)          0.500000              1
'Round Midnight (1986)                   2.250000              2

You can see each movie title, along with its average rating and number of ratings.

Let's plot a histogram for the number of ratings represented by the "rating_counts" column in the above dataframe. Execute the following script:

import matplotlib.pyplot as plt  
import seaborn as sns  
sns.set_style('dark')  
%matplotlib inline

plt.figure(figsize=(8,6))  
plt.rcParams['patch.force_edgecolor'] = True  
ratings_mean_count['rating_counts'].hist(bins=50)  

Here is the output of the script above:

Ratings histogram

From the output, you can see that most of the movies have received fewer than 50 ratings, while the number of movies with more than 100 ratings is very low.

Now we'll plot a histogram for average ratings. Here is the code to do so:

plt.figure(figsize=(8,6))  
plt.rcParams['patch.force_edgecolor'] = True  
ratings_mean_count['rating'].hist(bins=50)  

The output looks like this:

Average ratings histogram

You can see that the integer values have taller bars than the floating-point values, since most users assign a rating that is a whole number, i.e. 1, 2, 3, 4 or 5. Furthermore, it is evident that the data has a weakly normal distribution with a mean of around 3.5. There are a few outliers in the data.

Earlier, we said that movies with a higher number of ratings usually have a high average rating as well since a good movie is normally well-known and a well-known movie is watched by a large number of people, and thus usually has a higher rating. Let's see if this is also the case with the movies in our dataset. We will plot average ratings against the number of ratings:

plt.figure(figsize=(8,6))  
plt.rcParams['patch.force_edgecolor'] = True  
sns.jointplot(x='rating', y='rating_counts', data=ratings_mean_count, alpha=0.4)  

The output looks like this:

Average ratings vs number of ratings

The graph shows that, in general, movies with higher average ratings actually have a greater number of ratings than movies with lower average ratings.

Finding Similarities Between Movies

We spent quite a bit of time on visualizing and preprocessing our data. Now is the time to find the similarity between movies.

We will use the correlation between the ratings of a movie as the similarity metric. To find the correlation between the ratings of movies, we need to create a matrix where each column is a movie name and each row contains the rating assigned by a specific user to that movie. Bear in mind that this matrix will have a lot of null values, since not every movie is rated by every user.

To create the matrix of movie titles and corresponding user ratings, execute the following script:

user_movie_rating = movie_data.pivot_table(index='userId', columns='title', values='rating')  
user_movie_rating.head()  
title"Great Performances" Cats (1998)$9.99 (1998)'Hellboy': The Seeds of Creation (2008)'Neath the Arizona Skies (1934)'Round Midnight (1986)'Salem's Lot (2004)'Til There Was You (1997)'burbs, The (1989)'night Mother (1986)(500) Days of Summer (2009)...Zulu (1964)Zulu (2013)
userId
1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaN
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaN
3NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaN
4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaN
5NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaN

We know that each column contains all the user ratings for a particular movie. Let's find all the user ratings for the movie "Forrest Gump (1994)" and find the movies similar to it. We chose this movie since it has the highest number of ratings and we want to find the correlation between movies that have a higher number of ratings.

To find the user ratings for "Forrest Gump (1994)", execute the following script:

forrest_gump_ratings = user_movie_rating['Forrest Gump (1994)']  

The above script will return a Pandas series. Let's see how it looks.

forrest_gump_ratings.head()  
userId  
1    NaN  
2    3.0  
3    5.0  
4    5.0  
5    4.0  
Name: Forrest Gump (1994), dtype: float64  

Now let's retrieve all the movies that are similar to "Forrest Gump (1994)". We can find the correlation between the user ratings for "Forrest Gump (1994)" and all the other movies using the corrwith() function, as shown below:

movies_like_forest_gump = user_movie_rating.corrwith(forrest_gump_ratings)

corr_forrest_gump = pd.DataFrame(movies_like_forest_gump, columns=['Correlation'])  
corr_forrest_gump.dropna(inplace=True)  
corr_forrest_gump.head()  

In the above script, we first retrieved the list of all the movies related to "Forrest Gump (1994)" along with their correlation values, using the corrwith() function. Next, we created a dataframe that contains the movie title and correlation columns. We then removed all the NA values from the dataframe and displayed its first 5 rows using the head() function.

The output looks like this:

                                 Correlation
title
$9.99 (2008)                        1.000000
'burbs, The (1989)                  0.044946
(500) Days of Summer (2009)         0.624458
*batteries not included (1987)      0.603023
...And Justice for All (1979)       0.173422

Let's sort the movies in descending order of correlation to see highly correlated movies at the top. Execute the following script:

corr_forrest_gump.sort_values('Correlation', ascending=False).head(10)  

Here is the output of the script above:

                                    Correlation
title
$9.99 (2008)                                1.0
Say It Isn't So (2001)                      1.0
Metropolis (2001)                           1.0
See No Evil, Hear No Evil (1989)            1.0
Middle Men (2009)                           1.0
Water for Elephants (2011)                  1.0
Watch, The (2012)                           1.0
Cheech & Chong's Next Movie (1980)          1.0
Forrest Gump (1994)                         1.0
Warrior (2011)                              1.0

From the output you can see that the movies that have a high correlation with "Forrest Gump (1994)" are not very well known. This shows that correlation alone is not a good metric for similarity, because there can be a user who watched "Forrest Gump (1994)" and only one other movie and rated both of them as 5.

A solution to this problem is to retrieve only those correlated movies that have more than 50 ratings. To do so, we will add the rating_counts column from the ratings_mean_count dataframe to our corr_forrest_gump dataframe. Execute the following script to do so:

corr_forrest_gump = corr_forrest_gump.join(ratings_mean_count['rating_counts'])  
corr_forrest_gump.head()  

The output looks like this:

                                 Correlation  rating_counts
title
$9.99 (2008)                        1.000000              3
'burbs, The (1989)                  0.044946             19
(500) Days of Summer (2009)         0.624458             45
*batteries not included (1987)      0.603023              7
...And Justice for All (1979)       0.173422             13

You can see that the movie "$9.99 (2008)", which has the highest correlation, has only three ratings. This means that only three users gave the same ratings to "Forrest Gump (1994)" and "$9.99 (2008)", and a movie cannot be declared similar to another movie on the basis of just 3 ratings. This is why we added the "rating_counts" column. Let's now filter the movies correlated with "Forrest Gump (1994)" that have more than 50 ratings. The following code will do that:

corr_forrest_gump[corr_forrest_gump['rating_counts'] > 50].sort_values('Correlation', ascending=False).head()

The output of the script looks like this:

                                  Correlation  rating_counts
title
Forrest Gump (1994)                  1.000000            341
My Big Fat Greek Wedding (2002)      0.626240             51
Beautiful Mind, A (2001)             0.575922            114
Few Good Men, A (1992)               0.555206             76
Million Dollar Baby (2004)           0.545638             65

Now you can see from the output the movies that are highly correlated with "Forrest Gump (1994)". The movies in the list are some of the most famous Hollywood movies, and since "Forrest Gump (1994)" is also a very famous movie, there is a high chance that these movies are correlated.
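If you want to repeat this process for other movies, the steps above can be bundled into a small helper function. This is not part of the original walkthrough, just a convenience wrapper around the same logic, reusing the user_movie_rating and ratings_mean_count dataframes created earlier:

def get_similar_movies(movie_title, min_ratings=50):
    """Movies whose user ratings correlate most strongly with movie_title."""
    target_ratings = user_movie_rating[movie_title]
    correlations = user_movie_rating.corrwith(target_ratings)

    results = pd.DataFrame(correlations, columns=['Correlation']).dropna()
    results = results.join(ratings_mean_count['rating_counts'])

    # Only keep movies with enough ratings for the correlation to be meaningful.
    results = results[results['rating_counts'] > min_ratings]
    return results.sort_values('Correlation', ascending=False)

get_similar_movies('Pulp Fiction (1994)').head()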

Conclusion

In this article, we studied what a recommender system is and how we can create it in Python using only the Pandas library. It is important to mention that the recommender system we created is very simple. Real-life recommender systems use very complex algorithms and will be discussed in a later article.

If you want to learn more about recommender systems, I suggest checking out this very good course Building Recommender Systems with Machine Learning and AI. It goes much more in-depth and covers more complex and accurate methods than we did in this article.

Python Does What?!: kids these days think data structures grow on trees

*args and **kwargs are great features of Python. There is a measurable (though highly variable) cost to them, however:

>>> timeit.timeit(lambda: (lambda a, b: None)(1, b=2))
0.16460260000000204

>>> timeit.timeit(lambda: (lambda *a, **kw: None)(1, b=2))
0.21245309999999762


>>> timeit.timeit(lambda: (lambda *a, **kw: None)(1, b=2)) - timeit.timeit(lambda: (lambda a, b: None)(1, b=2))
0.14699769999992895


Constructing that dict and tuple doesn't happen for free:

>>> timeit.timeit(lambda: ((1,), {'b': 2})) - timeit.timeit(lambda: None)
0.16881599999999253


Specifically, it takes about 1/5,000,000th of a second.
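For context (not part of the original post), timeit.timeit() runs its statement 1,000,000 times by default, so the per-call overhead falls out of a simple division:

import timeit

plain = timeit.timeit(lambda: (lambda a, b: None)(1, b=2))
star = timeit.timeit(lambda: (lambda *a, **kw: None)(1, b=2))

# Each measurement covers 1,000,000 calls, so divide to get the per-call cost.
per_call_seconds = (star - plain) / 1_000_000
print(f"~{per_call_seconds * 1e9:.0f} ns of extra overhead per call")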

Python Piedmont Triad User Group: PYPTUG Monthly Meeting (September): Introduction to Packet Manipulation with Scapy


Details

Come join PYPTUG at our next monthly meeting (September 25th 2018) to learn more about the Python programming language, modules and tools. Python is the language to learn if you've never programmed before, and at the other end, it is also a tool that no expert would do without.


Main talk: Introduction to Packet Manipulation with Scapy


by Samuel Mitchell

Abstract:

Scapy is a very powerful framework written in Python that allows the forging and manipulation of packets. It's a Swiss Army knife of sorts for capturing, interacting with, and manipulating packets down to the packet frame itself. It sees a lot of usage in the security community as a result, but it can also be used by anyone who needs to reverse engineer odd/unique protocols, QA test products at the lowest levels of the OSI layers, or design and test their own communication protocols. This will be an intro to the usage and capabilities of Scapy, targeted at anyone familiar with the basics of networking at the OSI layers and basic Python programming.
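As a small taste of what the talk covers, here is a minimal Scapy sketch (the destination address is a documentation placeholder, and sending raw packets usually requires root/administrator privileges):

from scapy.all import IP, ICMP, sr1

# Build an ICMP echo request by stacking layers, send it, and wait for one reply.
packet = IP(dst="192.0.2.1") / ICMP()
reply = sr1(packet, timeout=2, verbose=False)

if reply is not None:
    reply.show()  # dump the decoded layers of the response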

Bio:

Samuel Mitchell lived a former life as a sysadmin and in the DevOps field. In his current life, he works as an Offensive Security Researcher/Tester for a large financial institution. He's also a husband/father of three kids, and dreams of some day finishing one of his multitude of side projects.


Lightning talks!


We will have some time for extemporaneous "lightning talks" of 5-10 minute duration. If you'd like to do one, some suggestions of talks were provided here, if you are looking for inspiration. Or talk about a project you are working on.

When:

Tuesday, September 25th 2018
Meeting starts at 6:00PM

Where:

Wake Forest University, close to Polo Rd and University Parkway:
Manchester Hall
room: Manchester 241
Wake Forest University, Winston-Salem, NC 27109

A note on parking: parking after 5pm is on a first-come, first-served basis. The official parking policy is:
"Visitors can park in any general parking lot on campus. Visitors should avoid reserved spaces, faculty/staff lots, fire lanes or other restricted area on campus. Frequent visitors should contact Parking and Transportation to register for a parking permit."

Mailing List:

Don't forget to sign up to our user group mailing list:
It is the only step required to become a PYPTUG member.
 
RSVP on meetup:  https://www.meetup.com/PYthon-Piedmont-Triad-User-Group-PYPTUG/events/ddlvxgyxmbhc/

Continuum Analytics Blog: Key Trends and Takeaways from Strata New York 2018


By Elizabeth Winkler Another Strata conference has come and gone. We had an incredible time meeting with a huge number of Anaconda users who came by our booth to chat! We also noticed some really interesting trends when it comes to the future of data science, machine learning, and AI. The future of ML/AI is containerized. …
Read more →

The post Key Trends and Takeaways from Strata New York 2018 appeared first on Anaconda.

Kay Hayen: Nuitka Release 0.5.33


This is to inform you about the new stable release of Nuitka. It is the extremely compatible Python compiler. Please see the page "What is Nuitka?" for an overview.

This release contains a bunch of fixes, most of which were previously released as part of hotfixes, and important new optimization.

Bug Fixes

  • Fix, nested functions with local classes using outside function closure variables were not registering their usage, which could lead to errors at C compile time. Fixed in 0.5.32.1 already.
  • Fix, usage of built-in calls in a class level could crash the compiler if a class variable was updated with its result. Fixed in 0.5.32.1 already.
  • Python 3.7: The handling of non-type base classes was not fully compatible and wrong usages were giving AttributeError instead of TypeError. Fixed in 0.5.32.2 already.
  • Python 3.5: Fix, await expressions didn't annotate their exception exit. Fixed in 0.5.32.2 already.
  • Python3: The enum module usages with __new__ in derived classes were not working, due to our automatic staticmethod decoration. Turns out, that was only needed for Python2 and can be removed, making enum work all the way. Fixed in 0.5.32.3 already.
  • Fix, recursion into __main__ was done and could lead to compiler crashes if the main module was named like that. This is now prevented. Fixed in 0.5.32.3 already.
  • Python3: The name for list contraction frames was wrong all along, and not just changed for 3.7, so that version check on it was dropped. Fixed in 0.5.32.3 already.
  • Fix, the hashing of code objects was creating a key that could produce more overlaps for the hash than necessary: a function C1 on line 29 and a function C on line 129 were considered the same, and that actually happened. Fixed in 0.5.32.3 already.
  • MacOS: Various fixes for newer Xcode versions to work as well.
  • Python3: Fix, the default __annotations__ was the empty dict and could be modified, leading to severe corruption potentially. Fixed in 0.5.32.4 already.
  • Python3: An exception thrown into a generator that is currently performing a yield from is not to be normalized.
  • Python3: Some exception handling cases of yield from were leaking references to objects. Fixed in 0.5.32.5 already.
  • Python3: Nested namespace packages were not working unless the directory continued to exist on disk. Fixed in 0.5.32.5 already.
  • Standalone: Do not include icuuc.dll which is a system DLL. Fixed in 0.5.32.5 already.
  • Standalone: Added hidden dependency of newer version of sip. Fixed in 0.5.32.5 already.
  • Standalone: Do not copy file permissions of DLLs and extension modules as that makes deleting and modifying them only harder.
  • Python 3.5: Fixed exception handling with coroutines and asyncgen throw to not corrupt exception objects.
  • Python 3.7: Added more checks to class creations that were missing for full compatibility.

Organizational

  • The issue tracker on Github is now the one that should be used with Nuitka, winning due to easier issue templating and integration with pull requests.
  • Document the threading model and exception model to use for MinGW64.
  • Removed the enum plug-in which is no longer useful after the improvements to the staticmethod handling for Python3.
  • Added Python 3.7 testing for Travis.
  • Make it clear in the documentation that pyenv is not supported.

New Features

  • Added support for MiniConda Python.

Optimization

  • Using goto based generators that return from execution and resume based on heap storage. This makes tests using generators twice as fast and they no longer use a full C stack of 2MB, but only 1K instead.
  • Put all statement related code and declarations for it in a dedicated C block, making things slightly more easy for the C compiler to re-use the stack space.
  • Avoid linking against libpython in module mode on everything but Windows where it is really needed. No longer check for static Python, not needed anymore.
  • More compact function, generator, and asyncgen creation code for the normal cases, avoid qualname if identical to name for all of them.
  • Python2 class dictionaries are now indeed directly optimized, giving more compact code.

Cleanups

  • Frame object and their cache declarations are now handled by the way of allocated variable descriptions, avoid special handling for them.
  • The interface to "forget" a temporary variable has been replaced with a new method that skips a number for it. This is done to let expressions keep the same indexes for all their child expressions, but more explicitly.
  • Instead of passing around C variables names for temporary values, we now have full descriptions, with C type, code name, storage location, and the init value to use. This makes the information more immediately available where it is needed.
  • Variable declarations are now created when needed and stored in dedicated variable storage objects, which then in can generate the code as necessary.
  • Module code generation has been enhanced to be closer to the pattern used by functions, generators, etc.
  • There is now only one spot that creates variable declarations, instead of the previous code duplication.
  • Code objects are now attached to functions, generators, coroutines, and asyncgen bodies, and not anymore to the creation of these objects. This allows for simpler code generation.
  • Removed fiber implementations, no more needed.

Tests

  • Finally the asyncgen tests can be enabled in the CPython 3.6 test suite as the corrupting crasher has been identified.
  • Cover ever more cases of spurious permission problems on Windows.

Summary

This release is huge in many ways.

First, finishing "goto generators" clears an old scalability problem of Nuitka that needed to be addressed. No more do generators/coroutines/asyncgen consume too much memory, but instead they become as lightweight as they ought to be.

Second, the use of variable declarations carrying type information all through the code generation is an important pre-condition for the "C types" work to resume and become possible.

Third, the improved generator performance will remove a lot of cases where Nuitka wasn't as fast as its current state (not yet using "C types") would allow.

Fourth, the fibers were a burden for the debugging and linking of Nuitka on various platforms, as they either provided deprecated interfaces or none at all. Now that they are gone, Nuitka ought to work on any platform where Python works.

From here on, C types work will resume and hopefully yield good results soon.

Stefan Behnel: Cython, pybind11, cffi – which tool should you choose?


In and after the conference talks that I give about Cython, I often get the question how it compares to other tools like pybind11 and cffi. There are others, but these are definitely the three that are widely used and "modern" in the sense that they provide an efficient user experience for today's real-world problems. And as with all tools from the same problem space, there are overlaps and differences. First of all, pybind11 and cffi are pure wrapping tools, whereas Cython is a Python compiler and a complete programming language that is used to implement actual functionality and not just bind to it. So let's focus on the area where the three tools compete: extending Python with native code and libraries.

Using native code from the Python runtime (specifically CPython) has been at the heart of the Python ecosystem since the very early days. Python is a great language for all sorts of programming needs, but it would not have the success and would not be where it is today without its great ecosystem that is heavily based on fast, low-level, native code. And the world of computing is full of such code, often decades old, heavily tuned, well tested and production proven. Looking at indicators like the TIOBE Index suggests that low-level languages like C and C++ are becoming more and more important again even in the last years, decades after their creation.

Today, no-one would attempt to design a (serious/practical) programming language anymore that does not come out of the box with a complete and easy to use way to access all that native code. This ability is often referred to as an FFI, a foreign function interface. Rust is an excellent example for a modern language that was designed with that ability in mind. The FFI in LuaJIT is a great design of a fast and easy to use FFI for the Lua language. Even Java and its JVM, which are certainly not known for their ease of code reuse, have provided the JNI (Java Native Interface) from the early days. CPython, being written in C, has made it very easy to interface with C code right from the start, and above all others the whole scientific computing and big data community has made great use of that over the past 25 years.

Over time, many tools have aimed to simplify the wrapping of external code. The venerable SWIG with its long list of supported target languages is clearly worth mentioning here. Partially a successor to SWIG (and sip), shiboken is a C++ bindings generator used by the PySide project to auto-create wrapper code for the large Qt C++ API.

A general shortcoming of all wrapper generators is that many users eventually reach the limits of their capabilities, be it in terms of performance, feature support, language integration to one side or the other, or whatever. From that point on, users start fighting the tool in order to make it support their use case at hand, and it is not unheard of that projects start over from scratch with a different tool. Therefore, most projects are better off starting directly with a manually written wrapper, at least when the part of the native API that they need to wrap is not prohibitively vast.

The lxml XML toolkit is an excellent example for that. It wraps libxml2 and libxslt with their huge low-level C-APIs. But if the project had used a wrapper generator to wrap it for Python, mapping this C-API to Python would have made the language integration of the Python-level API close to unusable. In fact, the whole project started because generated Python bindings for both already existed that were like the thrilling embrace of an exotic stranger (Mark Pilgrim). And beautifying the API at the Python level by adding another Python wrapper layer would have countered the advantages of a generated wrapper and also severely limited its performance. Despite the sheer vastness of the C-API that it wraps, the decision for manual wrapping and against a wrapper generator was the foundation of a very fast and highly pythonic tool.

Nowadays, three modern tools are widely used in the Python community that support manual wrapping: Cython, cffi and pybind11. These three tools serve three different sides of the need to extend (C)Python with native code.

  • Cython is Python with native C/C++ data types.

    Cython is a static Python compiler. For people coming from a Python background, it is much easier to express their coding needs in Python and then optimising and tuning them, than to rewrite them in a foreign language. Cython allows them to do that by automatically translating their Python code to C, which often avoids the need for an implementation in a low-level language.

    Cython uses C type declarations to mix C/C++ operations into Python code freely, be it the usage of C/C++ data types and containers, or of C/C++ functions and objects defined in external libraries. There is a very concise Cython syntax that uses special additional keywords (cdef) outside of Python syntax, as well as ways to declare C types in pure Python syntax. The latter allows writing type-annotated Python code that gets optimised into fast C code when compiled by Cython, but that remains entirely pure Python code that can be run, analysed and debugged with the usual Python tools.

    When it comes to wrapping native libraries, Cython has strong support for designing a Python API for them. Being Python, it really keeps the developer focussed on the usage from the Python side and on solving the problem at hand, and takes care of most of the boilerplate code through automatic type conversions and low-level code generation. Its usage is essentially writing C code without having to write C code, but remaining in the wonderful world of the Python language.

  • pybind11 is modern C++ with Python integration.

    pybind11 is the exact opposite of Cython. Coming from C++, and targeting C++ developers, it provides a C++ API that wraps native functions and classes into Python representations. For that, it makes good use of the compile time introspection features that were added to C++11 (hence the name). Thus, it keeps the user focussed on the C++ side of things and takes care of the boilerplate code for mapping it to a Python API.

    For everyone who is comfortable with programming in C++ and wants to make direct use of all C++ features, pybind11 is the easiest way to make the C++ code available to Python.

  • CFFI is Python with a dynamic runtime interface to native code.

    cffi then is the dynamic way to load and bind to external shared libraries from regular Python code. It is similar to the ctypes module in the Python standard library, but generally faster and easier to use. Also, it has very good support for the PyPy Python runtime, still better than what Cython and pybind11 can offer. However, the runtime overhead prevents it from coming any close in performance to the statically compiled code that Cython and pybind11 generate for CPython. And the dependency on a well-defined ABI (binary interface) means that C++ support is mostly lacking.

    As long as there is a clear API-to-ABI mapping of a shared library, cffi can directly load and use the library file at runtime, given a header file description of the API. In the more complex cases (e.g. when macros are involved), cffi uses a C compiler to generate a native stub wrapper from the description and uses that to communicate with the library. That raises the runtime dependency bar quite a bit compared to ctypes (and both Cython and pybind11 only need a C compiler at build time, not at runtime), but on the other hand also enables wrapping library APIs that are difficult to use with ctypes.
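    To illustrate the dynamic style cffi offers, here is a tiny sketch (POSIX only; on Windows you would pass an explicit DLL name to dlopen()) that declares and calls strlen() from the C library at runtime:

    from cffi import FFI

    ffi = FFI()

    # Declare the C signature (as it would appear in a header), then load the
    # C library's symbols and call the function directly from Python.
    ffi.cdef("size_t strlen(const char *s);")
    libc = ffi.dlopen(None)

    print(libc.strlen(b"hello"))  # 5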

This list shows the clear tradeoffs of the three tools. If performance is not important, if dynamic runtime access to libraries is an advantage, and if users prefer writing their wrapping code in Python, then cffi (or even ctypes) will do the job, nicely and easily. Otherwise, users with a strong C++ background will probably prefer pybind11 since it allows them to write functionality and wrapper code in C++ without switching between languages. For users with a Python background (or at least not with a preference for C/C++), Cython will be very easy to learn and use since the code remains Python, but gains the ability to do efficient native C/C++ operations at any point.

Python Celery - Weekly Celery Tutorials and How-tos: Concurrency and Parallelism


Concurrency is often misunderstood and mistaken for parallelism. However, concurrency and parallelism are not the same thing. But why should you care? All you want is to make your Python application fast and responsive. Which you can achieve by distributing it across many CPUs. What difference does it make whether you call it concurrency or parallelism?

Why should you care?

Turns out, a lot actually. What happens is that you want to make your application fast, so you run it on more processors and… it gets slower. And you think: this is broken, this doesn’t make any sense. But what is broken is the understanding of concurrency and parallelism. That’s why in this blog post I’ll explain the difference between concurrency and parallelism and what’s in it for you.

Parallelism is the simultaneous execution of multiple things. These things are possibly related, possibly not. Parallelism is about doing a lot of things at once. It is the act of running multiple computations simultaneously. Parallel computing can be of different types: bit-level, instruction-level, data and task parallelism. But it is entirely about execution.

Concurrency is the ability to deal with a lot of things at once. It is about the composition of independently executing things. Concurrency is a way to structure a program by breaking it into pieces that can be executed independently. This involves structuring but also coordinating these pieces, which is communication. Concurrency is the act of managing and running multiple computations at the same time.

For example, an operating system has multiple I/O devices: a keyboard driver, a mouse driver, a network driver and a screen driver. All these devices are managed by the operating system as independent things. They are concurrent things. But they are not necessarily parallel. In fact, it does not matter. If you only have one processor, only one process can ever run at a time. So, this is a concurrent model, but it is not necessarily parallel. It does not need to be parallel, but it is very beneficial for it to be concurrent.

Concurrency is a way to structure things so that you can (maybe) run these things in parallel to do a better or faster job. But parallelism is not the goal of concurrency, the goal of concurrency is a good structure. A structure that allows you to scale.

Concurrency and parallelism with Celery and Dask

An example where concurrency matters is one of my consulting projects. A hedge fund needed to fetch (a lot of) market data from an external data vendor (Bloomberg). This process would run once a day and write the data to an internal database. We broke the process down into small pieces: request the data from the Bloomberg webservice, poll the webservice at regular intervals to establish whether the data is ready for collection, collect the data, transform the data, write the data into the database. We defined the tasks in such a way that each of them can be executed independently. Which means they can run concurrently. And in order to run the entire process faster, we made it parallel by scaling up the number of workers (we used Celery). But this could not have happened without concurrency.
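A rough sketch of that task breakdown in Celery terms might look like the following (task names, arguments and the broker URL are invented for illustration; they are not taken from the actual project):

from celery import Celery, chain

app = Celery("marketdata", broker="redis://localhost:6379/0")

@app.task
def request_data(ticker):
    ...  # ask the vendor's webservice for the data, return a request id

@app.task
def poll_until_ready(request_id):
    ...  # poll at regular intervals until the response is ready

@app.task
def collect_and_store(response_id):
    ...  # collect, transform and write the data to the internal database

# Each task is independent, so many of these pipelines can run concurrently -
# and in parallel once more workers are added.
pipeline = chain(request_data.s("AAPL"), poll_until_ready.s(), collect_and_store.s())
pipeline.apply_async()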

An example where parallel computation matters - and concurrency not - is when dealing with large Pandas datasets in a Jupyter notebook. If your datasets are so large that they do not fit into your computer’s memory, you can solve that problem by parallelising the dataset across a cluster of machines. In this case, you want data parallelism. Concurrency does not matter. And this is a textbook example for Dask.
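As a flavour of that (the file and column names below are invented, not from any particular project), Dask keeps the familiar Pandas API while partitioning the data and the computation across cores or machines:

import dask.dataframe as dd

# Reads many CSV files as one logical dataframe, split into partitions.
df = dd.read_csv("trades-2018-*.csv")

# Pandas-style operations build a task graph; compute() runs it in parallel
# on a local scheduler or a distributed cluster.
result = df.groupby("symbol")["price"].mean().compute()
print(result.head())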

Talk Python to Me: #177 Flask goes 1.0

Flask is now 8 years old and until recently had been going along in a pretty steady state. It had been hanging around at versions 0.11 and 0.12 for some time. After a year-long effort, the web framework has now been updated to Flask 1.0.

Python Bytes: #95 Unleash the py-spy!


Weekly Python StackOverflow Report: (cxliii) stackoverflow python report


Codementor: Option Pricing With Monte Carlo & Celery

This post was originally published on Distributed Python (https://www.distributedpython.com/) on June 26th, 2018. All source code examples used in this blog post can be found on GitHub: ...

PyBites: PyBites Twitter Digest - Issue 29, 2018


A handy template to generate basic Python project structures.

Submitted by @Erik.

Ooo Language focused Docker images

Submitted by @dgjustice.

Hacktoberfest is almost here!!

A great instructional on Decorators!

Submitted by @Erik.

Given Hacktoberfest is on the way: 50 Popular Python open-source projects

Snowy: A Small Python 3 module for manipulating and generating images

Submitted by @Erik (thanks again mate!).

Simple Celery

Submitted by @dgjustice.

The UNIX Philosophy

Submitted by @Erik.

The man behind the core 100Daysofcode Challenge! A great listen

Python hits the Top 3 in popularity!

Learn to code Python with Project Python

Submitted by @Erik.

Awesome Talk Python episode on the Python Community

Sentdex is back with his self-driving cars in GTA V!

More Hettinger wisdom

Cool, plot graphs on the console!


>>> from pybites import Bob, Julian
Keep Calm and Code in Python!

Dusty Phillips: An Intermediate Guide To RSA

The venerable RSA public key encryption algorithm is very elegant. It requires a basic understanding of modular arithmetic, which may sound scary if you haven’t studied it. It reduces to taking the remainder after integer long division. The RSA Wikipedia article describes five simple steps to generate the keys. Encryption and decryption are a matter of basic exponentiation. There’s no advanced math, and it’s easy to understand their example of working with small numbers.
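For the curious, here is a toy sketch of those steps in Python, using the same small numbers as the Wikipedia example (purely illustrative; real RSA needs very large primes, padding and a vetted library):

# Key generation
p, q = 61, 53
n = p * q                    # 3233, the public modulus
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # 2753, the private exponent (Python 3.8+)

# Encryption and decryption are just modular exponentiation.
message = 65
ciphertext = pow(message, e, n)    # m^e mod n
recovered = pow(ciphertext, d, n)  # c^d mod n

assert recovered == message
print(ciphertext, recovered)       # 2790 65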

Bhishan Bhandari: Python, Boto and AWS EC2


Most if not all software companies have adopted cloud infrastructure and services, and AWS in particular is very popular. The intention of this post is to host a few examples of using boto to make use of one of the services available on AWS, i.e. EC2. It is more likely than not to […]
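As a minimal flavour of the topic (using boto3, the current AWS SDK for Python; the post itself may use the older boto API, and credentials/region are assumed to be configured already):

import boto3

ec2 = boto3.client("ec2")

# List the ID and state of every EC2 instance visible to these credentials.
response = ec2.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])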

The post Python, Boto and AWS EC2 appeared first on The Tara Nights.
