
Mike Driscoll: PyDev of the Week: Marlene Mhangami


This week we welcome Marlene Mhangami (@marlene_zw) as our PyDev of the Week! Marlene is the PyCon Africa (@pyconafrica) chair, the co-founder of @zimbopy and a director for the Python Software Foundation. Let’s spend some time getting to know her!

Marlene Mhangami

Can you tell us a little about yourself (hobbies, education, etc):

Sure, in college I studied molecular biology. I was actually in the school’s pre-medicine track because I initially thought I wanted to become a doctor. Looking back on it now I laugh because I hate blood, just the sight of it in movies makes me shut my eyes tightly, so I’m genuinely happy that didn’t work out! I went to a liberal arts college and appreciate that I had the space to take courses in other fields like philosophy and politics which I really enjoy.

I get asked about what hobbies I have quite often, and I’m not sure if I have anything I do consistently enough to call a hobby. I read, and sometimes run, and love to journal. I also occasionally paint, but the last time I told someone I painted they asked me where my studios were and started listing off artists that I had never heard of before, so I like to disclaimer that I don’t paint in a way that is cultured or sophisticated but just as a way to express myself and have fun.

Why did you start using Python?

For a good chunk of my college, I was studying in the United States. I remember coming home one summer and being really aware of how different Zimbabwe was from the U.S. From really small cultural differences like how people address conflict (which is something I’m still trying to figure out with my US friends) to much more impactful things like access to knowledge and education.

I decided that I wanted to stay in my country and start being more involved with empowering my local community. For a number of clear and obscure reasons, I also decided that I wanted to leverage technology to help me do that. As I’m writing this out I’m actually remembering a great conversation I had with one of my friends. She had initially been a math major and then suddenly switched to computer science. I remember her telling me how much she enjoyed building stuff with code and how useful it was. She had created a program that could predict the warmest pathway to walk through on campus (she was also an international student who struggled with the cold, so both of us agreed that this was an extremely useful invention.) I also had a really vivid dream that made me re-evaluate what I was doing with my life.

All of these things led me to start googling around and I actually ended up organizing a meetup. It was there that I got introduced to one of my co-founders, Ronald Maravanyika, who was at the time running a Django Girls workshop in Harare. He introduced me to Python as a great educational tool for teaching programming to people who don’t have a computer science background. I use Python primarily to help with teaching for our non-profit ZimboPy, so while I don’t use it in my day job, I really enjoy sharing it with the girls we teach.

What other programming languages do you know and which is your favorite?

Because I’m mainly self-taught I haven’t tried that many other languages. I know HTML, CSS and the basics of Javascript and PHP, but outside of that not much else. I’ve recently been feeling the need to increase my knowledge of some of the other languages, so I’ve been trying to work on some projects that could help me do this. I’m really interested in game development and VR and I know C++ plays a big part in that industry, so that’s something I’m wanting to learn for sure! I might take some formal classes on this if I can get to it. I would definitely say Python is still my favorite language, this might change as I go along though, who knows 😉

What projects are you working on now?

Like I mentioned earlier I’m really interested in game development so I’ve been using Pygame to work on a game as a fun project. It’s really in the early stages and is honestly not great, so I’m not willing to show anyone what I have yet, haha, but it’s been fun to do. I’ve also been playing around with Circuit Python and the Adafruit boards. I’m hoping to host a Zimbopy workshop using these soon. It’s been a super busy few months, but I’m really looking forward to getting that going. Finally, I’m trying to get a personal blog/website up. I’ve bought a domain name and everything! I’m just in the process of building it and thinking of relevant content that will make it as riveting as possible! Those are the main personal projects I’m working on right now.

Which Python libraries are your favorite (core or 3rd party)?

Well at this exact moment I’d have to say the Circuit Python libraries. I’m really enjoying playing around with hardware programming. The Adafruit boards I got at Pycon this year sent me into a spiral of lights and sound haha! I generally think using hardware makes learning and experimenting with Python really fun. I’m usually thinking about how I can make learning the basics of Python easier to teach to the people in my community so that’s why I’d say it’s my current favorite.

How did you get involved in organizing ZimboPy?

Zimbopy is a non-profit that my friends Ronald Maravanyika and Mike Place co-founded with me. The goal of it is to give girls in Zim access to knowledge and experiences they wouldn’t necessarily get anywhere else in the country. I’m really passionate about empowering African women, and I think this was one of the best ways I could think of tangibly doing so.

What are you hoping to accomplish while you are working with the Python Software Foundation?

The global Python community is growing at an exceptionally fast rate. I’d like to make sure that the community has resources and structures that can help that growth continue to be healthy and positive. The Python community is one of the kindest and most diverse communities I’ve ever encountered. I’m really excited about making sure that same kindness and diversity is translated well into new and upcoming communities around the world.

Is there anything else you’d like to say?

Well, I’ve just finished working with an amazing group of organizers to host the first-ever PyCon Africa. I think it went really well and we had such a great lineup of talks. We’ll have the talks uploaded on Youtube in a few days so I’d encourage anyone reading to follow our page @pyconafrica to see them. We’re also starting to plan for next year, so any Pythonistas interested should definitely think about attending.

Thanks for doing the interview, Marlene!

The post PyDev of the Week: Marlene Mhangami appeared first on The Mouse Vs. The Python.


Codementor: ML with Python: Part-3

In the previous post, we saw the various steps involved in creating a machine learning (ML) model. You might have noticed that in Building ML Model we consider multiple algorithms in a pipeline and then tune...

Stack Abuse: Analyzing API Data with MongoDB, Seaborn, and Matplotlib


Introduction

A commonly requested skill for software development positions is experience with NoSQL databases, including MongoDB. This tutorial will explore collecting data using an API, storing it in a MongoDB database, and doing some analysis of the data.

However, before jumping into the code let's take a moment to go over MongoDB and APIs, to make sure we understand how we'll be dealing with the data we're collecting.

MongoDB and NoSQL

MongoDB is a form of NoSQL database, enabling the storage of data in non-relational forms. NoSQL databases are best understood by comparing them to their progenitor/rivals - SQL databases.

SQL stands for Structured Query Language and it is a type of relational database management tool. A relational database is a database that stores data as a series of keys and values, with each row in a data table having its own unique key. Values in the database can be retrieved by looking up the corresponding key. This is how SQL databases store data, but NoSQL databases can store data in non-relational ways.

NoSQL stands for "Not Only SQL", which refers to the fact that although SQL-esque queries can be done with NoSQL systems, they can also do things SQL databases struggle with. NoSQL databases have a wider range of storage options for the data they handle, and because the data is less rigidly related it can be retrieved in more ways, making some operations quicker. NoSQL databases can make the addition of nodes or fields simpler in comparison to SQL databases.

There are many popular NoSQL frameworks, including MongoDB, OrientDB, InfinityDB, Aerospike, and CosmosDB. MongoDB is one specific NoSQL framework which stores data in the form of documents, acting as a document-oriented database.

MongoDB is popular because of its versatility and easy cloud integration, and it can be used for a wide variety of tasks. MongoDB stores data using the JSON format. Queries of MongoDB databases are also made in the JSON format, and because both the storage and retrieval commands are based on the JSON format, it is simple to remember and compose commands for MongoDB.

What are APIs?

APIs are Application Programming Interfaces, and their function is to make communications between clients and servers easier. APIs are often created to facilitate the collection of information by those who are less experienced with the language used by the application's developers.

APIs can also be helpful methods of controlling the flow of information from a server, encouraging those interested in accessing its information to use official channels to do so, rather than construct a web scraper. The most common APIs for websites are REST (Representational State Transfer) APIs, which make use of standard HTTP requests and responses to send, receive, delete, and modify data. We'll be accessing a REST API and making our requests in HTTP format for this tutorial.

What API will we be Using?

The API we'll be using is GameSpot's API. GameSpot is one of the biggest video game review sites on the web, and its API can be reached here.

Getting Set Up

Before we begin, you should be sure to get yourself an API key for GameSpot. You should also be sure to have MongoDB and its Python library installed. The installation instructions for Mongo can be found here.

The PyMongo library can be installed simply by running:

$ pip install pymongo

You may also wish to install the MongoDB Compass program, which lets you easily visualize and edit aspect of MongoDB databases with a GUI.


Creating the MongoDB Database

We can now start our project by creating the MongoDB database. First, we'll handle our imports. We'll import the MongoClient from PyMongo, as well as requests and pandas:

from pymongo import MongoClient
import requests
import pandas as pd

When creating a database with MongoDB, we first need to connect to the client and then use the client to create the database we desire:

client = MongoClient('127.0.0.1', 27017)
db_name = 'gamespot_reviews'

# connect to the database
db = client[db_name]

MongoDB can store multiple data collections within a single database, so we also need to define the name of the collection we want to use:

# open the specific collection
reviews = db.reviews

That's it. Our database and collection have been created and we are ready to start inserting data into it. That was pretty simple, wasn't it?

Using the API

We're now ready to make use of the GameSpot API to collect data. By taking a look at the documentation for the API here, we can determine the format that our requests need to be in.

We need to make our requests to a base URL that contains our API key. GameSpot's API has multiple resources of its own that we can pull data from. For instance, they have a resource that lists data about games like release date and consoles.

However, we're interested in their resource for game reviews, and we'll be pulling a few specific fields from the API resource. Also, GameSpot asks that you specify a unique user agent identifier when making requests, which we'll do by creating a header that we'll pass in to the requests function:

headers = {
    "user_agent": "[YOUR IDENTIFIER] API Access"
}

games_base = "http://www.gamespot.com/api/reviews/?api_key=[YOUR API KEY HERE]&format=json"

We'll want the following data fields: id, title, score, deck, body, good, bad:

review_fields = "id,title,score,deck,body,good,bad"

GameSpot only allows the return of 100 results at a time. For this reason, in order to get a decent number of reviews to analyze, we'll need to create a range of numbers and loop through them, retrieving 100 results at a time.

You can select any number you want. I chose to get all their reviews, which cap out at 14,900:

pages = list(range(0, 14900))
pages_list = pages[0:14900:100]
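
The resulting offsets simply step by 100 up to 14,800, which we can confirm quickly:

print(pages_list[:3], pages_list[-1])  # [0, 100, 200] 14800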

We're going to create a function that joins together the base URL, the list of fields we want to return, a sorting scheme (ascending or descending), and the offset for the query.

We'll take the number of pages we want to loop through, and then for every 100 entries we'll create a new URL and request the data:

def get_games(url_base, num_pages, fields, collection):

    field_list = "&field_list=" + fields + "&sort=score:desc" + "&offset="

    for page in num_pages:
        url = url_base + field_list + str(page)
        print(url)
        response = requests.get(url, headers=headers).json()
        print(response)
        video_games = response['results']
        for i in video_games:
            collection.insert_one(i)
            print("Data Inserted")

Recall that MongoDB stores data as JSON. For that reason we parse the response with the json() method, which gives us JSON-compatible Python dictionaries that PyMongo can insert as documents.

After the data has been converted into JSON, we'll get the "results" property from the response, as this is the portion which actually contains our data of interest. We'll then go through the 100 different results and insert each of them into our collection using the insert_one() command from PyMongo. You could also put them all in a list and use insert_many() instead.
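
For instance, a minimal sketch of that insert_many() variant might look like this (get_games_bulk is a hypothetical name, and it reuses the same headers and requests setup):

def get_games_bulk(url_base, num_pages, fields, collection):
    # Same idea as get_games(), but batches each page of results
    # into a single insert_many() call (one round trip per page)
    field_list = "&field_list=" + fields + "&sort=score:desc" + "&offset="

    for page in num_pages:
        url = url_base + field_list + str(page)
        response = requests.get(url, headers=headers).json()
        video_games = response['results']
        if video_games:
            collection.insert_many(video_games)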

Let's now call the function and have it collect the data:

get_games(games_base, pages_list, review_fields, reviews)

Why don't we check to see that our data has been inserted into our database as we expect it? We can view the database and its contents directly with the Compass program:


We can see the data has been properly inserted.

We can also make some database retrievals and print them. To do that, we'll just create an empty list to store our entries and use the .find() command on the "reviews" collection.

When using the find function from PyMongo, the retrieval needs to be formatted as JSON as well. The parameters given to the find function will have a field and value.

By default, MongoDB always returns the _id field (its own unique ID field, not the ID we pulled from GameSpot), but we can tell it to suppress this by specifying a 0 value. The fields we do want to return, like the score field in this case, should be given a 1 value:

scores = []

for score in list(reviews.find({}, {"_id":0, "score": 1})):
    scores.append(score)

print(scores[:900])

Here's what was successfully pulled and printed:

[{'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'}, {'score': '10.0'} ...

We can also convert the query results to a data-frame easily by using Pandas:

scores_data = pd.DataFrame(scores, index=None)
print(scores_data.head(20))

Here's what was returned:

   score
0   10.0
1   10.0
2   10.0
3   10.0
4   10.0
5   10.0
6   10.0
7   10.0
8   10.0
9   10.0
10  10.0
11  10.0
12  10.0
13  10.0
14  10.0
15  10.0
16  10.0
17   9.9
18   9.9
19   9.9

Before we start analyzing some of the data, let's take a moment to see how we could potentially join two collections together. As mentioned, GameSpot has multiple resources to pull data from, and we may want to get values from a second database like the Games database.

MongoDB is a NoSQL database, so unlike SQL it isn't intended to handle relations between databases and join data fields together. However, there is a function which can approximate a database join - lookup().

The lookup() function mimics a database join and it can be done by specifying a pipeline, which contains the database you want to join elements from, as well as the fields you want from both the input documents (localField) and the "from" documents (foreignField).

Finally, you choose a moniker to convert the foreign documents to and they will be displayed under this new name in our query response table. If you had a second database called games and wanted to join them together in a query, it could be done like this:

pipeline = [{
    '$lookup': {
        'from': 'reviews',
        'localField': 'id',
        'foreignField': 'score',
        'as': 'score'
    }
},]

for doc in (games.aggregate(pipeline)):
    print(doc)

Analyzing the Data

Now we can get around to analyzing and visualizing some of the data found within our newly created database. Let's make sure we have all the functions we'll need for analysis.

from pymongo import MongoClient
import pymongo
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import string
import en_core_web_sm
import seaborn as sns

Let's say we want to do some analysis of the words found in GameSpot's game reviews. We have that information in our database, we just have to get it.

We can start by collecting the top 40 (or whatever number you want) reviews from our database using the find() function like before, but this time we'll specify that we want to sort by the score variable and that we want to sort in descending order:

d_name = 'gamespot_reviews'
collection_name = 'gamespot'

client = MongoClient('127.0.0.1', 27017)
db = client[d_name]

reviews = db.reviews
review_bodies = []

for body in list(reviews.find({}, {"_id":0, "body": 1}).sort("score", pymongo.DESCENDING).limit(40)):
    review_bodies.append(body)

We'll turn that response into a Pandas data-frame and convert it into a string. Then we'll extract all the values within the <p> HTML tag containing the review text, which we'll do with BeautifulSoup:

reviews_data = pd.DataFrame(review_bodies, index=None)

def extract_comments(input):
    soup = BeautifulSoup(str(input), "html.parser")
    comments = soup.find_all('p')
    return comments

review_entries = extract_comments(str(review_bodies))
print(review_entries[:500])

The print statement shows that the review text has been collected:

[<p>For anyone who hasn't actually seen the game on a TV right in front of them, the screenshots look too good to be true. In fact, when you see NFL 2K for the first time right in front of you...]

Now that we have the review text data, we want to analyze it in several different ways. Let's try getting some intuition for what kinds of words are commonly used in the top 40 reviews. We can do this several different ways:

  • We can create a word cloud
  • We can count all of the words and sort by their number of occurrences
  • We can do named entity recognition

Before we can do any analysis of the data though, we have to preprocess it.

To preprocess the data, we want to create a function to filter the entries. The text data is still full of all kinds of tags and non-standard characters, and we want to remove those by getting the raw text of the review comments. We'll be using regular expressions to substitute the non-standard characters with blank spaces.

We'll also use some stop words from NTLK (highly common words that add little meaning to our text) and remove them from our text by creating a list to hold all the words and then appending words to that list only if they are not in our list of stop words.

Word Cloud

Let's grab a subset of the review words to visualize as a corpus. If the corpus is too large, generating the word cloud can cause some problems.

For example, I've filtered out the first 5000 words:

stop_words = set(stopwords.words('english'))

def filter_entries(entries, stopwords):

    text_entries = BeautifulSoup(str(entries), "lxml").text
    subbed_entries = re.sub('[^A-Za-z0-9]+', ' ', text_entries)
    split_entries = subbed_entries.split()

    stop_words = stopwords

    entries_words = []

    for word in split_entries:
        if word not in stop_words:
            entries_words.append(word)

    return entries_words

review_words = filter_entries(review_entries, stop_words)
review_words = review_words[5000:]

We can now make a word cloud very easily by using a pre-made WordCloud library found here.
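
A minimal sketch of how the cloud can be generated from review_words, assuming the WordCloud and matplotlib imports from earlier (the size and color values here are just illustrative):

# Join the filtered words back into one string and render the cloud
cloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(review_words))

plt.figure(figsize=(12, 6))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()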

This word cloud does give us some information on what kinds of words are commonly used in the top reviews:


It is unfortunately still full of common words, which is why it would be a good idea to do filtering of the review words with a tf-idf filtering scheme, but for the purposes of this simple demonstration, this is good enough.

We do, in fact, have some information about what kinds of concepts are talked about in game reviews: gameplay, story, characters, world, action, locations, etc.

We can confirm for ourselves that these words are commonly found in game reviews by looking at one of the top 40 reviews we selected: Mike Mahardy's review of Uncharted 4:


Sure enough, the review discusses action, gameplay, characters, and story.

The size of the words gives us intuition about how commonly words appear in these reviews, but we can also just count how often certain words show up.

Counter

We can get a list of the most common words by splitting the words up and adding them to a dictionary of words along with their individual count, which will be incremented every time the same word is seen.

We then just need to use Counter and the most_common() function:

# strip punctuation before counting
translator = str.maketrans('', '', string.punctuation)

def get_word_counts(words_list):
    word_count = {}

    for word in words_list:
        word = word.translate(translator).lower()
        if word not in stop_words:
            if word not in word_count:
                word_count[word] = 1
            else:
                word_count[word] += 1

    return word_count

review_word_count = get_word_counts(review_words)
review_word_count = Counter(review_word_count)
review_list = review_word_count.most_common()
print(review_list)

Here's the counts of some of the most common words:

[('game', 1231), ('one', 405), ('also', 308), ('time', 293), ('games', 289), ('like', 285), ('get', 278), ('even', 271), ('well', 224), ('much', 212), ('new', 200), ('play', 199), ('level', 195), ('different', 195), ('players', 193) ...]

Named Entity Recognition

We can also do named entity recognition using en_core_web_sm, a language model included with spaCy. The various concepts and linguistic features that it can detect are listed here.

We need to grab the list of detected named entities and concepts from the document (list of words):

# load the small English model we imported earlier
nlp = en_core_web_sm.load()

doc = nlp(str(review_words))
labels = [x.label_ for x in doc.ents]
items = [x.text for x in doc.ents]

We can print out the found entities as well as a count of the entities.

# Example of named entities and their categories
print([(X.text, X.label_) for X in doc.ents])

# All categories and their counts
print(Counter(labels))

# Most common named entities
print(Counter(items).most_common(20))

Here's what is printed:

[('Nintendo', 'ORG'), ('NES', 'ORG'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('15', 'CARDINAL'), ('Super', 'WORK_OF_ART'), ('Mario', 'PERSON'), ('Super', 'WORK_OF_ART') ...]

Counter({'PERSON': 1227, 'CARDINAL': 496, 'ORG': 478, 'WORK_OF_ART': 204, 'ORDINAL': 200, 'NORP': 110, 'PRODUCT': 88, 'GPE': 63, 'TIME': 12, 'DATE': 12, 'LOC': 12, 'QUANTITY': 4 ...})

[('first', 147), ('two', 110), ('Metal', 85), ('Solid', 82), ('GTAIII', 78), ('Warcraft', 72), ('2', 59), ('Mario', 56), ('four', 54), ('three', 42), ('NBA', 41) ...]

Let's say we wanted to plot the most common recognized terms for different categories, like persons and organizations. We just need to make a function to get the counts of the different classes of entities and then use it to get the entities we desire.

We'll get a list of named entities/people, organizations, and GPEs (locations):

def word_counter(doc, ent_name, col_name):
    ent_list = []
    for ent in doc.ents:
        if ent.label_ == ent_name:
            ent_list.append(ent.text)
    df = pd.DataFrame(data=ent_list, columns=[col_name])
    return df

review_persons = word_counter(doc, 'PERSON', 'Named Entities')
review_org = word_counter(doc, 'ORG', 'Organizations')
review_gpe = word_counter(doc, 'GPE', 'GPEs')

Now all we have to do is plot the counts with a function:

def plot_categories(column, df, num):
    sns.countplot(x=column, data=df,
                  order=df[column].value_counts().iloc[0:num].index)
    plt.xticks(rotation=-45)
    plt.show()

plot_categories("Named Entities", review_persons, 30)
plot_categories("Organizations", review_org, 30)
plot_categories("GPEs", review_gpe, 30)

Let's take a look at the plots that were generated.


As would be expected of named entities, most of the results returned are names of video game characters. This isn't perfect, as it does misclassify some terms like "Xbox" as being a named entity rather than an organization, but this still gives us some idea of what characters are discussed in the top reviews.


The organization plot shows some proper game developers and publishers like Playstation and Nintendo, but it also tags things like "480p" as being an organization.


Above is the plot for GPEs, or geographical locations. It looks like "Hollywood" and "Miami" pop up often in reviews of games. (Settings for games? Or maybe the reviewer is describing something in-game as Hollywood-style?)

As you can see, carrying out named entity recognition and concept recognition isn't perfect, but it can give you some intuition about what kinds of topics are discussed in a body of text.

Plotting Numerical Values

Finally, we can try plotting numerical values from the database. Let's get the score values from the reviews collection, count them up, and then plot them:

scores = []

for score in list(reviews.find({}, {"_id":0, "score": 1})):
    scores.append(score)
scores = pd.DataFrame(scores, index=None).reset_index()

counts = scores['score'].value_counts()

sns.countplot(x="score", data=scores)
plt.xticks(rotation=-90)
plt.show()


Above is the graph for the total number of review scores given, running from 0 to 9.9. It looks like the most commonly given scores were 7 and 8, which makes sense intuitively. Seven is often considered average on a ten point review scale.

Conclusion

Collecting, storing, retrieving, and analyzing data are skills that are highly in-demand in today's world, and MongoDB is one of the most commonly used NoSQL database platforms.

Knowing how to use NoSQL databases and how to interpret the data in them will equip you to carry out many common data analysis tasks.

Python Software Foundation: Grants Awarded for Python in Education

The Python Software Foundation has been asked about Python in education quite a bit recently. People have asked, “Is there an official curriculum we can use?”, “Are there online resources?”, “Are there efforts happening to improve Python on mobile?”, and so on.

9 years ago we instituted the Education Summit at PyCon US where educators as well as students work together on initiatives and obstacles. Earlier this year we decided we needed to do more. In November of 2018, the PSF created the Python in Education Board Committee and it was tasked with finding initiatives to fund to help improve the presence of Python in education.

In January of this year, the Python in Education Board Committee launched a “request for ideas” phase taking suggestions from the community on what we should focus our funding on. After the RFI period, we came up with 3 areas of education we wanted to focus on and asked to receive grant proposals on the following: resources (curriculums, evaluations, studies, multidisciplinary projects), localization (primarily translations), and mobile (development on mobile devices).

We are happy to publish more details on the grants the PSF approved from this initiative!

Beeware

The BeeWare Project wants to make it possible for all Python developers to write native apps for desktop and mobile platforms. Most desktop operating systems and iOS are supported already, but Android needs attention. Since Android users outnumber other mobile OS users worldwide by over 3 to 1, we determined it is important to fund this project. Beeware was awarded a $50,000 grant to help improve Python on Android. Phase one will be starting soon with this set of goals:

  1. A port of the CPython runtime to Android, delivered as a binary library ready to install into an Android project.
  2. A JNI-based library for bridging between the Android runtime and the CPython runtime.
  3. A template for a Gradle project that can be used to deploy Python code on Android devices. 

Beeware announced that they are looking for contractors to help with the work. Check out their blog post for more information.

Python in Education Website

Educational resources are in demand.  The PSF awarded a grant of $12,000 USD to Meg Ray, to work on creating a Python in Education website where we can curate educational information from all over the world. Meg will begin by collecting resources and after auditing the shared information, she will work on organizing it on an official PSF webpage. This work will begin in October of 2019 so please keep an eye out for updates via tweets and blogs!

Friendly-tracebacks

Lastly, there is a project called friendly-traceback. This project is not in need of financial support but is asking the PSF to help publicize it. Friendly-traceback aims to provide simplified tracebacks translated into as many languages as possible. The project maintainer is looking for volunteers to help with tasks such as documenting possible SyntaxError use cases and documenting exceptions that haven't already been covered. Read more on their blog for the full call to action from the maintainer.


We hope to continue this initiative yearly! Companies that are passionate about supporting Python in Education should get in touch; we can't continue our work without your support!  As a non-profit organization, the PSF depends on sponsorships and donations to support the Python community.

Donate to the PSF: https://www.python.org/psf/donations/
Sponsor the PSF: https://www.python.org/psf/sponsorship/

Real Python: Preventing SQL Injection Attacks With Python


Every few years, the Open Web Application Security Project (OWASP) ranks the most critical web application security risks. Since the first report, injection risks have always been on top. Among all injection types, SQL injection is one of the most common attack vectors, and arguably the most dangerous. As Python is one of the most popular programming languages in the world, knowing how to protect against Python SQL injection is critical.

In this tutorial, you’re going to learn:

  • What Python SQL injection is and how to prevent it
  • How to compose queries with both literals and identifiers as parameters
  • How to safely execute queries in a database

This tutorial is suited for users of all database engines. The examples here use PostgreSQL, but the results can be reproduced in other database management systems (such as SQLite, MySQL, Microsoft SQL Server, Oracle, and so on).


Understanding Python SQL Injection

SQL Injection attacks are such a common security vulnerability that the legendary xkcd webcomic devoted a comic to it:

"Exploits of a Mom", a humorous webcomic by xkcd about the potential effect of SQL injection (Image: xkcd)

Generating and executing SQL queries is a common task. However, companies around the world often make horrible mistakes when it comes to composing SQL statements. While the ORM layer usually composes SQL queries, sometimes you have to write your own.

When you use Python to execute these queries directly into a database, there’s a chance you could make mistakes that might compromise your system. In this tutorial, you’ll learn how to successfully implement functions that compose dynamic SQL queries without putting your system at risk for Python SQL injection.

Setting Up a Database

To get started, you’re going to set up a fresh PostgreSQL database and populate it with data. Throughout the tutorial, you’ll use this database to witness firsthand how Python SQL injection works.

Creating a Database

First, open your shell and create a new PostgreSQL database owned by the user postgres:

$ createdb -O postgres psycopgtest

Here you used the command line option -O to set the owner of the database to the user postgres. You also specified the name of the database, which is psycopgtest.

Note: postgres is a special user, which you would normally reserve for administrative tasks, but for this tutorial, it’s fine to use postgres. In a real system, however, you should create a separate user to be the owner of the database.

Your new database is ready to go! You can connect to it using psql:

$ psql -U postgres -d psycopgtest
psql (11.2, server 10.5)
Type "help" for help.

You’re now connected to the database psycopgtest as the user postgres. This user is also the database owner, so you’ll have read permissions on every table in the database.

Creating a Table With Data

Next, you need to create a table with some user information and add data to it:

psycopgtest=# CREATE TABLE users (username varchar(30), admin boolean);
CREATE TABLE
psycopgtest=# INSERT INTO users (username, admin) VALUES ('ran', true), ('haki', false);
INSERT 0 2
psycopgtest=# SELECT * FROM users;
 username | admin
----------+-------
 ran      | t
 haki     | f
(2 rows)

The table has two columns: username and admin. The admin column indicates whether or not a user has administrative privileges. Your goal is to target the admin field and try to abuse it.

Setting Up a Python Virtual Environment

Now that you have a database, it’s time to set up your Python environment. For step-by-step instructions on how to do this, check out Python Virtual Environments: A Primer.

Create your virtual environment in a new directory:

~/src $ mkdir psycopgtest
~/src $ cd psycopgtest
~/src/psycopgtest $ python3 -m venv venv

After you run this command, a new directory called venv will be created. This directory will store all the packages you install inside the virtual environment.

Connecting to the Database

To connect to a database in Python, you need a database adapter. Most database adapters follow version 2.0 of the Python Database API Specification PEP 249. Every major database engine has a leading adapter:

Database     Adapter
PostgreSQL   Psycopg
SQLite       sqlite3
Oracle       cx_oracle
MySQL        MySQLdb

To connect to a PostgreSQL database, you’ll need to install Psycopg, which is the most popular adapter for PostgreSQL in Python. Django ORM uses it by default, and it’s also supported by SQLAlchemy.

In your terminal, activate the virtual environment and use pip to install psycopg:

~/src/psycopgtest $ source venv/bin/activate
~/src/psycopgtest $ python -m pip install psycopg2>=2.8.0
Collecting psycopg2
  Using cached https://....
  psycopg2-2.8.2.tar.gz
Installing collected packages: psycopg2
  Running setup.py install for psycopg2 ... done
Successfully installed psycopg2-2.8.2

Now you’re ready to create a connection to your database. Here’s the start of your Python script:

import psycopg2

connection = psycopg2.connect(
    host="localhost",
    database="psycopgtest",
    user="postgres",
    password=None,
)
connection.set_session(autocommit=True)

You used psycopg2.connect() to create the connection. This function accepts the following arguments:

  • host is the IP address or the DNS of the server where your database is located. In this case, the host is your local machine, or localhost.

  • database is the name of the database to connect to. You want to connect to the database you created earlier, psycopgtest.

  • user is a user with permissions for the database. In this case, you want to connect to the database as the owner, so you pass the user postgres.

  • password is the password for whoever you specified in user. In most development environments, users can connect to the local database without a password.

After setting up the connection, you configured the session with autocommit=True. Activating autocommit means you won’t have to manually manage transactions by issuing a commit or rollback. This is the default behavior in most ORMs. You use this behavior here as well so that you can focus on composing SQL queries instead of managing transactions.
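
For contrast, here is a minimal sketch of what manual transaction management would look like with autocommit left disabled (the inserted user "bob" is purely illustrative):

try:
    with connection.cursor() as cursor:
        cursor.execute(
            "INSERT INTO users (username, admin) VALUES (%s, %s)",
            ("bob", False),
        )
    connection.commit()    # make the change permanent
except Exception:
    connection.rollback()  # undo the pending transaction on failure
    raise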

Note: Django users can get the instance of the connection used by the ORM from django.db.connection:

from django.db import connection

Executing a Query

Now that you have a connection to the database, you’re ready to execute a query:

>>> with connection.cursor() as cursor:
...     cursor.execute('SELECT COUNT(*) FROM users')
...     result = cursor.fetchone()
...     print(result)
(2,)

You used the connection object to create a cursor. Just like a file in Python, cursor is implemented as a context manager. When you create the context, a cursor is opened for you to use to send commands to the database. When the context exits, the cursor closes and you can no longer use it.

Note: To learn more about context managers, check out Python Context Managers and the “with” Statement.

While inside the context, you used cursor to execute a query and fetch the results. In this case, you issued a query to count the rows in the users table. To fetch the result from the query, you executed cursor.fetchone() and received a tuple. Since the query can only return one result, you used fetchone(). If the query were to return more than one result, then you’d need to either iterate over cursor or use one of the other fetch* methods.
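
For example, here is a quick sketch of those other retrieval styles against the same users table (the printed rows assume the two users created earlier):

with connection.cursor() as cursor:
    cursor.execute('SELECT username, admin FROM users')
    print(cursor.fetchall())   # all rows at once: [('ran', True), ('haki', False)]

    cursor.execute('SELECT username, admin FROM users')
    for username, admin in cursor:  # or iterate over the cursor row by row
        print(username, admin)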

Using Query Parameters in SQL

In the previous section, you created a database, established a connection to it, and executed a query. The query you used was static. In other words, it had no parameters. Now you’ll start to use parameters in your queries.

First, you’re going to implement a function that checks whether or not a user is an admin. is_admin() accepts a username and returns that user’s admin status:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()

    admin, = result
    return admin

This function executes a query to fetch the value of the admin column for a given username. You used fetchone() to return a tuple with a single result. Then, you unpacked this tuple into the variable admin. To test your function, check some usernames:

>>> is_admin('haki')
False
>>> is_admin('ran')
True

So far so good. The function returned the expected result for both users. But what about non-existing user? Take a look at this Python traceback:

>>> is_admin('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 12, in is_admin
TypeError: cannot unpack non-iterable NoneType object

When the user does not exist, a TypeError is raised. This is because .fetchone() returns None when no results are found, and unpacking None raises a TypeError. The failure happens at the line where you unpack result into admin.

To handle non-existing users, create a special case for when result is None:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()

    if result is None:
        # User does not exist
        return False

    admin, = result
    return admin

Here, you’ve added a special case for handling None. If username does not exist, then the function should return False. Once again, test the function on some users:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False

Great! The function can now handle non-existing usernames as well.

Exploiting Query Parameters With Python SQL Injection

In the previous example, you used string interpolation to generate a query. Then, you executed the query and sent the resulting string directly to the database. However, there’s something you may have overlooked during this process.

Think back to the username argument you passed to is_admin(). What exactly does this variable represent? You might assume that username is just a string that represents an actual user’s name. As you’re about to see, though, an intruder can easily exploit this kind of oversight and cause major harm by performing Python SQL injection.

Try to check if the following user is an admin or not:

>>> is_admin("'; select true; --")
True

Wait… What just happened?

Let’s take another look at the implementation. Print out the actual query being executed in the database:

>>> print("select admin from users where username = '%s'" % "'; select true; --")
select admin from users where username = ''; select true; --'

The resulting text contains three statements. To understand exactly how Python SQL injection works, you need to inspect each part individually. The first statement is as follows:

select admin from users where username = '';

This is your intended query. The semicolon (;) terminates the query, so the result of this query does not matter. Next up is the second statement:

select true;

This statement was constructed by the intruder. It’s designed to always return True.

Lastly, you see this short bit of code:

--'

This snippet defuses anything that comes after it. The intruder added the comment symbol (--) to turn everything you might have put after the last placeholder into a comment.

When you execute the function with this argument, it will always return True. If, for example, you use this function in your login page, an intruder could log in with the username '; select true; --, and they’ll be granted access.

If you think this is bad, it could get worse! Intruders with knowledge of your table structure can use Python SQL injection to cause permanent damage. For example, the intruder can inject an update statement to alter the information in the database:

>>> is_admin('haki')
False
>>> is_admin("'; update users set admin = 'true' where username = 'haki'; select true; --")
True
>>> is_admin('haki')
True

Let’s break it down again:

';

This snippet terminates the query, just like in the previous injection. The next statement is as follows:

update users set admin = 'true' where username = 'haki';

This section updates admin to true for user haki.

Finally, there’s this code snippet:

select true; --

As in the previous example, this piece returns true and comments out everything that follows it.

Why is this worse? Well, if the intruder manages to execute the function with this input, then user haki will become an admin:

psycopgtest=# select * from users;
 username | admin
----------+-------
 ran      | t
 haki     | t
(2 rows)

The intruder no longer has to use the hack. They can just log in with the username haki. (If the intruder really wanted to cause harm, then they could even issue a DROP DATABASE command.)

Before you forget, restore haki back to its original state:

psycopgtest=# update users set admin = false where username = 'haki';
UPDATE 1

So, why is this happening? Well, what do you know about the username argument? You know it should be a string representing the username, but you don’t actually check or enforce this assertion. This can be dangerous! It’s exactly what attackers are looking for when they try to hack your system.

Crafting Safe Query Parameters

In the previous section, you saw how an intruder can exploit your system and gain admin permissions by using a carefully crafted string. The issue was that you allowed the value passed from the client to be executed directly to the database, without performing any sort of check or validation. SQL injections rely on this type of vulnerability.

Any time user input is used in a database query, there’s a possible vulnerability for SQL injection. The key to preventing Python SQL injection is to make sure the value is being used as the developer intended. In the previous example, you intended for username to be used as a string. In reality, it was used as a raw SQL statement.

To make sure values are used as they’re intended, you need to escape the value. For example, to prevent intruders from injecting raw SQL in the place of a string argument, you can escape quotation marks:

>>> # BAD EXAMPLE. DON'T DO THIS!
>>> username = username.replace("'", "''")

This is just one example. There are a lot of special characters and scenarios to think about when trying to prevent Python SQL injection. Lucky for you, modern database adapters come with built-in tools for preventing Python SQL injection by using query parameters. These are used instead of plain string interpolation to compose a query with parameters.

Note: Different adapters, databases, and programming languages refer to query parameters by different names. Common names include bind variables, replacement variables, and substitution variables.
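
As an illustration of how the placeholder style varies between adapters, the standard library's sqlite3 module uses question-mark ("qmark") placeholders, while Psycopg uses %s or %(name)s as shown in this tutorial. A minimal, self-contained sketch:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, admin BOOLEAN)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("haki", False))

# qmark style: the driver binds the value safely, just like %s in Psycopg
row = conn.execute("SELECT admin FROM users WHERE username = ?", ("haki",)).fetchone()
print(row)  # (0,) -- SQLite stores the boolean as 0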

Now that you have a better understanding of the vulnerability, you’re ready to rewrite the function using query parameters instead of string interpolation:

 1 def is_admin(username: str) -> bool:
 2     with connection.cursor() as cursor:
 3         cursor.execute("""
 4             SELECT
 5                 admin
 6             FROM
 7                 users
 8             WHERE
 9                 username = %(username)s
10         """, {
11             'username': username
12         })
13         result = cursor.fetchone()
14
15     if result is None:
16         # User does not exist
17         return False
18
19     admin, = result
20     return admin

Here’s what’s different in this example:

  • In line 9, you used a named parameter username to indicate where the username should go. Notice how the parameter username is no longer surrounded by single quotation marks.

  • In line 11, you passed the value of username as the second argument to cursor.execute(). The connection will use the type and value of username when executing the query in the database.

To test this function, try some valid and invalid values, including the dangerous string from before:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False
>>> is_admin("'; select true; --")
False

Amazing! The function returned the expected result for all values. What’s more, the dangerous string no longer works. To understand why, you can inspect the query generated by execute():

>>> with connection.cursor() as cursor:
...     cursor.execute("""
...         SELECT
...             admin
...         FROM
...             users
...         WHERE
...             username = %(username)s
...     """, {
...         'username': "'; select true; --"
...     })
...     print(cursor.query.decode('utf-8'))
SELECT
    admin
FROM
    users
WHERE
    username = '''; select true; --'

The connection treated the value of username as a string and escaped any characters that might terminate the string and introduce Python SQL injection.

Passing Safe Query Parameters

Database adapters usually offer several ways to pass query parameters. Named placeholders are usually the best for readability, but some implementations might benefit from using other options.

Let’s take a quick look at some of the right and wrong ways to use query parameters. The following code block shows the types of queries you’ll want to avoid:

# BAD EXAMPLES. DON'T DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = '" + username + "'");
cursor.execute("SELECT admin FROM users WHERE username = '%s'" % username);
cursor.execute("SELECT admin FROM users WHERE username = '{}'".format(username));
cursor.execute(f"SELECT admin FROM users WHERE username = '{username}'");

Each of these statements passes username from the client directly to the database, without performing any sort of check or validation. This sort of code is ripe for inviting Python SQL injection.

In contrast, these types of queries should be safe for you to execute:

# SAFE EXAMPLES. DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = %s", (username,));
cursor.execute("SELECT admin FROM users WHERE username = %(username)s", {'username': username});

In these statements, username is passed as a named parameter. Now, the database will use the specified type and value of username when executing the query, offering protection from Python SQL injection.

Using SQL Composition

So far you’ve used parameters for literals. Literals are values such as numbers, strings, and dates. But what if you have a use case that requires composing a different query—one where the parameter is something else, like a table or column name?

Inspired by the previous example, let’s implement a function that accepts the name of a table and returns the number of rows in that table:

# BAD EXAMPLE. DON'T DO THIS!
def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                count(*)
            FROM
                %(table_name)s
        """, {
            'table_name': table_name,
        })
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

Try to execute the function on your users table:

>>> count_rows('users')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in count_rows
psycopg2.errors.SyntaxError: syntax error at or near "'users'"
LINE 5:                 'users'
                        ^

The command failed to generate the SQL. As you’ve seen already, the database adapter treats the variable as a string or a literal. A table name, however, is not a plain string. This is where SQL composition comes in.

You already know it’s not safe to use string interpolation to compose SQL. Luckily, Psycopg provides a module called psycopg2.sql to help you safely compose SQL queries. Let’s rewrite the function using psycopg2.sql.SQL():

from psycopg2 import sql

def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                count(*)
            FROM
                {table_name}
        """).format(
            table_name=sql.Identifier(table_name),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

There are two differences in this implementation. First, you used sql.SQL() to compose the query. Then, you used sql.Identifier() to annotate the argument value table_name. (An identifier is a column or table name.)

Note: Users of the popular package django-debug-toolbar might get an error in the SQL panel for queries composed with psycopg2.sql.SQL(). A fix is expected for release in version 2.0.

Now, try executing the function on the users table:

>>> count_rows('users')
2

Great! Next, let’s see what happens when the table does not exist:

>>> count_rows('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 11, in count_rows
psycopg2.errors.UndefinedTable: relation "foo" does not exist
LINE 5:                 "foo"
                        ^

The function throws the UndefinedTable exception. In the following steps, you’ll use this exception as an indication that your function is safe from a Python SQL injection attack.

Note: The exception UndefinedTable was added in psycopg2 version 2.8. If you’re working with an earlier version of Psycopg, then you’ll get a different exception.

To put it all together, add an option to count rows in the table up to a certain limit. This feature might be useful for very large tables. To implement this, add a LIMIT clause to the query, along with query parameters for the limit’s value:

from psycopg2 import sql

def count_rows(table_name: str, limit: int) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                COUNT(*)
            FROM (
                SELECT
                    1
                FROM
                    {table_name}
                LIMIT
                    {limit}
            ) AS limit_query
        """).format(
            table_name=sql.Identifier(table_name),
            limit=sql.Literal(limit),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

In this code block, you annotated limit using sql.Literal(). As in the previous example, psycopg will bind all query parameters as literals when using the simple approach. However, when using sql.SQL(), you need to explicitly annotate each parameter using either sql.Identifier() or sql.Literal().

Note: Unfortunately, the Python API specification does not address the binding of identifiers, only literals. Psycopg is the only popular adapter that added the ability to safely compose SQL with both literals and identifiers. This fact makes it even more important to pay close attention when binding identifiers.

Execute the function to make sure that it works:

>>> count_rows('users', 1)
1
>>> count_rows('users', 10)
2

Now that you see the function is working, make sure it’s also safe:

>>> count_rows("(select 1) as foo; update users set admin = true where name = 'haki'; --", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in count_rows
psycopg2.errors.UndefinedTable: relation "(select 1) as foo; update users set admin = true where name = '" does not exist
LINE 8:                     "(select 1) as foo; update users set adm...
                            ^

This traceback shows that psycopg escaped the value, and the database treated it as a table name. Since a table with this name doesn’t exist, an UndefinedTable exception was raised and you were not hacked!

Conclusion

You’ve successfully implemented a function that composes dynamic SQL without putting your system at risk for Python SQL injection! You’ve used both literals and identifiers in your query without compromising security.

You’ve learned:

  • What Python SQL injection is and how it can be exploited
  • How to prevent Python SQL injection using query parameters
  • How to safely compose SQL statements that use literals and identifiers as parameters

You’re now able to create programs that can withstand attacks from the outside. Go forth and thwart the hackers!



PyCharm: Webinar: “React+TypeScript+TDD in PyCharm” with Paul Everitt


ReactJS is wildly popular and thus wildly supported. TypeScript is increasingly popular, and thus increasingly supported.

The two together? Not as much. Given that they both change quickly, it’s hard to find accurate learning materials.

React+TypeScript, with PyCharm? That three-part combination is the topic of this webinar. We’ll show a little about a lot. Meaning, the key steps to getting productive, in PyCharm, for React projects using TypeScript. Along the way we’ll show test-driven development and emphasize tips-and-tricks in the IDE.

  • Wednesday, October 16
  • 6:00 PM – 7:00 PM CEST (12:00 PM – 1:00 PM EDT)
  • Aimed at intermediate web developers familiar with React
  • Register here

About This Webinar

This webinar is based on a 12-part tutorial with write-ups, videos, and working code for each step. The tutorial covers: getting started with Jest testing, debugging, TSX, functional components, sharing props with types, class-based components, interfaces, testing event handlers, and “dumb” components.

You could, of course, skip this webinar and go through the material. Or you could go through the material and use the webinar to ask questions. Either way, we’ll give a quick treatment of each topic.

As a note, I’ll be doing this same webinar, but in webinars with IntelliJ and Rider, later in the year.

One final point: the tutorial and this webinar teach React+TS while sitting in tests, rather than the browser. It’s a productive way to work and makes for a good learning experience.

Speaking To You

Paul is the PyCharm Developer Advocate at JetBrains. Before that, Paul was a co-founder of Zope Corporation, taking the first open source application server through $14M of funding. Paul has bootstrapped both the Python Software Foundation and the Plone Foundation. Paul was an officer in the US Navy, starting www.navy.mil in 1993.

Kumar Vipin Yadav: Python Pune Meetup September 2k19

“If there’s a book that you want to read, but it hasn’t been written yet, then you must write it.” … More

Evennia: Blackifying and fixing bugs

Since version 0.9 of Evennia, the MU*-creation framework, was released, work has mainly been focused on bug fixing. But a few new features have also already sneaked into the master branch, despite technically being changes slated for Evennia 1.0.




On Frontends

Contributor friarzen has chipped away at improving Evennia's HTML5 web client. It already had the ability to structure and spawn any number of nested text panes. In the future we want to extend the user's ability to save and restore its layouts and allow developers to offer pre-prepared layouts for their games. Already now though, it has gotten plugins for handling graphics, sound and video:

Inline image by me (griatch-art.deviantart.com)


A related fun development is Castlelore Studios' Unreal Engine Evennia plugin (this is unaffiliated with core Evennia development and I've not tried it, but it looks pretty nifty!):

Image ©Castlelore Studios
 
On Black

Evennia's source code is extensively documented and was loosely adhering to the Python formatting standard PEP8. But many places were hit-and-miss and others were formatted with slight variations depending on who wrote the code.
 
After pre-work and recommendation by Greg Taylor, Evennia has adopted the black autoformatter for its source code. I'm not really convinced that black produces the best output of all possible outputs every time, but as Greg puts it, it's at least consistent in style. We use a line width of 100.

I have set it up so that whenever a new commit is added to the repo, the black formatter will run on it. It may still produce line widths >100 at times (especially for long strings), but otherwise this reduces the number of different PEP8 infractions in the code a lot.

On Python3

Overall the move to Python3 appears to have been pretty uneventful for most users. I've heard almost no complaints or requests for help with converting an existing game.
The purely Python2-to-Python3 related bugs have been very limited after launch; almost all have been with unicode/bytes when sending data over the wire.

People have wholeheartedly adopted the new f-strings though, and some spontaneous PRs have already been made towards converting some of Evennia's existing code to use them.

Post-launch we moved to Django 2.2.2, but the Django 2+ upgrades have been pretty uneventful so far. Some people had issues installing Twisted on Windows since there was no py3.7 binary wheel (causing them to have to compile it from scratch). The rise of the Linux Subsystem on Windows has alleviated most of this though, and I've not seen any Windows install issues in a while.

On Future

For now we'll stay in bug-fixing mode, with the occasional new feature popping up here and there. In the future we'll move to the develop branch again. I have a slew of things in mind for 1.0.

Apart from bug fixing and cleaning up the API in several places, I plan to make use of the feedback received over the years to make Evennia a little more accessible for a new user. This means I'll also try reworking and consolidating the tutorials so one can follow them with a more coherent "red thread", as well as improving the documentation in various other ways to help newcomers with the common questions we hear a lot. 

The current project plan (subject to change) is found here. Lots of things to do!



Top image credit: https://www.goodfreephotos.com/ (public domain)

Podcast.__init__: Building A Modern Discussion Forum In Python To Support Healthy Communities


Summary

Building and sustaining a healthy community requires a substantial amount of effort, especially online. The design and user experience of the digital space can impact the overall interactions of the participants and guide them toward respectful conversation. In this episode Rafał Pitoń shares his experience building the Misago platform for creating community forums. He explains his motivation for creating the project, the lessons he has learned in the process, and how it is being used by himself and others. This was a great conversation about how technology is just a means, and not the end in itself.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Rafał Pitoń about Misago, a fully featured modern forum application that is fast, scalable, and responsive

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by explaining what Misago is and your motivation for creating it?
    • How does it compare to other modern forum options such as Discourse and Flarum?
  • How did you generate and prioritize the set of features that you have implemented and what are the main capabilities that are still on your roadmap?
  • Is Misago intended to be run in isolation, or does it allow for integrating into a larger Django project?
    • Is there any support for multi-tenancy?
  • How is Misago itself implemented and how has the architecture evolved since you first began working on it?
    • If you were to start it today, what are some of the choices that you would make differently?
  • What are the extension points that developers can hook into for adding custom functionality?
  • In addition to the technical challenges, managing a forum involves a fair amount of social challenges. How does Misago help with management of a healthy community?
    • How do different design elements factor into promoting healthy conversation and sustainable engagement?
    • What are some of the aspects of community management and the accompanying platform features that enable them which aren’t initially obvious?
  • For someone who wants to use Misago, what is involved in deploying and configuring it?
    • What are some of the routine maintenance tasks that they should be aware of?
  • What are some of the most interesting or unexpected ways that you have seen Misago used?
  • What have you found to be the most interesting, unexpected, and challenging aspects of building and maintaining a forum platform?
  • What do you have planned for the future of Misago?

Keep In Touch

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Codementor: Writing a simple Pytest hook

Write some basic pytest hooks to capture test failures in a text file.
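
The post walks through the details; as a rough sketch of the idea (the hook name and report attributes are standard pytest API, while the output filename is just an example), a conftest.py could look like this:

# conftest.py
def pytest_runtest_logreport(report):
    # Only record failures from the test call phase (not setup/teardown).
    if report.when == "call" and report.failed:
        with open("failures.txt", "a") as f:
            f.write(f"{report.nodeid}\n{report.longrepr}\n\n")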

Tryton News: Newsletter October 2019


@ced wrote:

During the last months before the release, we have made many improvements to the user experience.
Even though development is not yet frozen, you can already help translate the next release into your language.

Contents:

Changes For The User

Users are now able to set a default warehouse in their preferences.
This is useful for companies with multiple warehouses. It saves time for the users, as the warehouse they work from can be filled in automatically.

The web client now supports drag and drop to reorder list and tree rows, like the desktop client. There is one small difference when inserting a row inside a non-expanded row: the user must drop it below the row while pressing the CTRL key; otherwise the row is dropped next to it.

You can now use consumable products in an inventory if needed. They are still not required, and the inventory is not automatically filled with products of this type.

Rows and cells can now show a visual context, which can be muted, success, warning or danger. Many modules have been updated to use them, for example for the payable and receivable amounts due today on the party, or when an invoice is due.


The constraint that prevented using the same invoice sequence twice per fiscal year has been relaxed. Tryton now only checks that the sequence was not used to number an invoice with a later date.

The time-sheet and opportunity reports are now displaying the month name instead of the number.

We have reviewed all the list and tree views to expand the columns that need it. This was necessary to improve the user experience with the web client (because the web client does not allow resizing columns).

New Modules

Modules to manage amendments

The blueprint Amendment for Sale and Purchase has been implemented.
The amendment modules allow you to change sales and purchases that are being processed while keeping track of the changes. When an amendment is validated, the document is updated and given a new revision.

Changes For The Developer

We now prevent setting a value for an unknown field in proteus scripts and in Tryton module model definitions. For that, we add __slots__ automatically to each model. A positive side effect is that it also reduces the memory consumption of each instance.
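
This builds on standard Python behaviour; a minimal illustration of what __slots__ buys you (the class and field names here are placeholders, not an actual Tryton model):

class Party:
    # With __slots__, instances have no per-instance __dict__: assigning to a
    # name that is not listed raises AttributeError, and instances use less memory.
    __slots__ = ('name', 'code')

    def __init__(self, name, code):
        self.name = name
        self.code = code

party = Party('Acme', 'P-1')
try:
    party.nmae = 'typo'   # misspelled field name
except AttributeError as error:
    print(error)          # 'Party' object has no attribute 'nmae'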

The PYSON Eval now supports dotted notation. This is a common expectation from beginners, so we decided it was good to support it.

We already have a multiselection widget to use with a Many2Many field. But now we also have a MultiSelection field, which stores a list of values as a JSON list in the database. This is useful when the selection has only a few options. For now, the widget is also available on list views (but not editable there), and the field is usable in the search bar of the client.

You can now define a different start date when using PYSONDate or DateTime with delta.


Read full topic

Django Weblog: Django bugfix releases: 2.2.6, 2.1.13 and 1.11.25


Today we've issued the 2.2.6, 2.1.13, and 1.11.25 bugfix releases.

The release package and checksums are available from our downloads page, as well as from the Python Package Index. The PGP key ID used for this release is Carlton Gibson: E17DF5C82B4F9D00.

EuroPython: EuroPython 2019 - Videos for Friday available


We are pleased to announce the third and final batch of cut videos from EuroPython 2019 in Basel, Switzerland, with another 49 videos.


EuroPython 2019 on our YouTube Channel

In this batch, we have included all videos for Friday, July 12 2019, the third conference day.

In total, we now have 133 videos available for you to watch.

All EuroPython videos, including the ones from previous conferences, are available on our EuroPython YouTube Channel.

Enjoy,

EuroPython 2019 Team
https://ep2019.europython.eu/
https://www.europython-society.org/

Stack Abuse: Python for NLP: Deep Learning Text Generation with Keras


This is the 21st article in my series of articles on Python for NLP. In the previous article, I explained how to use Facebook's FastText library for finding semantic similarity and to perform text classification. In this article, you will see how to generate text via deep learning techniques in Python using the Keras library.

Text generation is one of the state-of-the-art applications of NLP. Deep learning techniques are being used for a variety of text generation tasks such as writing poetry, generating scripts for movies, and even composing music. However, in this article we will see a very simple example of text generation where, given an input string of words, we will predict the next word. We will use the raw text from Shakespeare's famous play "Macbeth" and will use that to predict the next word given a sequence of input words.

After completing this article, you will be able to perform text generation using the dataset of your choice. So, let's begin without further ado.

Importing Libraries and Dataset

The first step is to import the libraries required to execute the scripts in this article, along with the dataset. The following code imports the required libraries:

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.utils import to_categorical
from random import randint
import re

The next step is to download the dataset. We will use Python's NLTK library to download the dataset. We will be using the Gutenberg Dataset, which contains 3036 English books written by 142 authors, including the "Macbeth" by Shakespeare.

The following script downloads the Gutenberg dataset and prints the names of all the files in the dataset.

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg as gut

print(gut.fileids())

You should see the following output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

The file shakespeare-macbeth.txt contains raw text for the novel "Macbeth". To read the text from this file, the raw method from the gutenberg class can be used:

macbeth_text = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')

Let's print the first 500 characters from our dataset:

print(macbeth_text[:500])

Here is the output:

[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through

You can see that the text contains many special characters and numbers. The next step is to clean the dataset.

Data Preprocessing

To remove the punctuations and special characters, we will define a function named preprocess_text():

def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence.lower()

The preprocess_text function accepts a text string as a parameter and returns a cleaned text string in lower case.

Let's now clean our text and again print the first 500 characters:

macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]

Here is the output:

the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom

Convert Words to Numbers

Deep learning models are based on statistical algorithms. Hence, in order to work with deep learning models, we need to convert words to numbers.

In this article, we will be using a very simple approach where words will be converted into single integers. Before we could convert words to integers, we need to tokenize our text into individual words. To do so, the word_tokenize() method from the nltk.tokenize module can be used.

The following script tokenizes the text in our dataset and then prints the total number of words in the dataset, as well as the total number of unique words in the dataset:

from nltk.tokenize import word_tokenize

macbeth_text_words = (word_tokenize(macbeth_text))
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)

The output looks like this:

Total Words: 17250
Unique Words: 3436

Our text has 17250 words in total, out of which 3436 words are unique. To convert tokenized words to numbers, the Tokenizer class from the keras.preprocessing.text module can be used. You need to call the fit_on_texts method and pass it the list of words. A dictionary will be created in which the keys represent words and the values are the corresponding integers.

Look at the following script:

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)

To access the dictionary that contains words and their corresponding indexes, the word_index attribute of the tokenizer object can be used:

vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

If you check the length of the dictionary, it will contain 3436 words, which is the total number of unique words in our dataset.

Let's now print the 500th unique word along with its integer value from the word_2_index dictionary.

print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])

Here is the output:

comparisons
1456

Here the word "comparisons" is assigned the integer value of 1456.

Modifying the Shape of the Data

Text generation falls in the category of many-to-one sequence problems since the input is a sequence of words and output is a single word. We will be using the Long Short-Term Memory Network (LSTM), which is a type of recurrent neural network to create our text generation model. LSTM accepts data in a 3-dimensional format (number of samples, number of time-steps, features per time-step). Since the output will be a single word, the shape of the output will be 2-dimensional (number of samples, number of unique words in the corpus).

The following script modifies the shape of the input sequences and the corresponding outputs.

input_sequence = []
output_words = []
input_seq_length = 100

for i in range(0, n_words - input_seq_length , 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])

In the script above, we declare two empty lists input_sequence and output_words. The input_seq_length is set to 100, which means that our input sequence will consist of 100 words. Next, we execute a loop where in the first iteration, integer values for the first 100 words from the text are appended to the input_sequence list. The 101st word is appended to the output_words list. During the second iteration, a sequence of words that starts from the 2nd word in the text and ends at the 101st word is stored in the input_sequence list, and the 102nd word is stored in the output_words array, and so on. A total of 17150 input sequences will be generated since there are 17250 total words in the dataset (100 less than the total words).

Let's now print the value of the first sequence in the input_sequence list:

print(input_sequence[0])

Output:

[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]

Let's normalize our input sequences by dividing the integers in the sequences by the largest integer value. The following script also converts the output into 2-dimensional format.

X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)

y = to_categorical(output_words)

The following script prints the shape of the inputs and the corresponding outputs.

print("X shape:", X.shape)
print("y shape:", y.shape)

Output:

X shape: (17150, 100, 1)
y shape: (17150, 3437)

Training the Model

The next step is to train our model. There is no hard and fast rule as to what number of layers and neurons should be used to train the model. We will randomly select the layer and neuron sizes. You can play around with the hyper parameters to see if you can get better results.

We will create three LSTM layers with 800 neurons each. A final dense layer with a softmax activation and one neuron per unique word (the size of the one-hot output vector) is added to predict the next word, as shown below:

model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam')

Since the output word can be one of 3436 unique words, our problem is a multi-class classification problem, hence the categorical_crossentropy loss function is used. In case of binary classification, the binary_crossentropy function is used. Once you execute the above script, you should see the model summary:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 100, 800)          2566400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 800)          5123200
_________________________________________________________________
lstm_3 (LSTM)                (None, 800)               5123200
_________________________________________________________________
dense_1 (Dense)              (None, 3437)              2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0

To train the model, we can simply use the fit() method.

model.fit(X, y, batch_size=64, epochs=10, verbose=1)

Here again, you can play around with different values for batch_size and the epochs. The model can take some time to train.

Making Predictions

To make predictions, we will randomly select a sequence from the input_sequence list, convert it into a 3-dimensional shape and then pass it to the predict() method of the trained model. The model returns an array of probabilities over the vocabulary; the index with the highest probability is taken as the index value of the next word. The index value is then passed to the index_2_word dictionary, where the word index is used as a key. The index_2_word dictionary will return the word that belongs to the index that is passed as a key to the dictionary.

The following script randomly selects a sequence of integers and then prints the corresponding sequence of words:

random_seq_index = np.random.randint(0, len(input_sequence)-1)
random_seq = input_sequence[random_seq_index]

index_2_word = dict(map(reversed, word_2_index.items()))

word_sequence = [index_2_word[value] for value in random_seq]

print(' '.join(word_sequence))

For the script in this article, the following sequence was randomly selected. The sequence generated for you will most likely be different than this one:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane

In the above script, the index_2_word dictionary is created by simply reversing the word_2_index dictionary. In this case, reversing a dictionary refers to the process of swapping keys with values.

Next, we will print the next 100 words that follow the above sequence of words:

for i in range(100):
    # Reshape and normalize the current sequence just like the training data.
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

    # Predict the probability of each word in the vocabulary being next,
    # and pick the index with the highest probability.
    predicted_word_probs = model.predict(int_sample, verbose=0)
    predicted_word_id = np.argmax(predicted_word_probs)

    # Store the predicted word and slide the input window forward by one word.
    word_sequence.append(index_2_word[predicted_word_id])
    random_seq.append(predicted_word_id)
    random_seq = random_seq[1:len(random_seq)]

The word_sequence variable now contains our input sequence of words along with the next 100 predicted words, stored as a list. We can simply join the words in the list to get the final output string, as shown below:

final_output = ""
for word in word_sequence:
    final_output = final_output + " " + word

print(final_output)

Here is the final output:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and

The output doesn't look very good yet, and it seems that our model keeps predicting the same word, i.e. and. However, you get the idea of how to create a text generation model with Keras. To improve the results, I have the following recommendations for you:

  • Change the hyper parameters, including the size and number of LSTM layers and number of epochs to see if you get better results.
  • Try to remove the stop words like is, am, are from the training set to generate words other than stop words in the test set (although this will depend on the type of application); see the sketch after this list.
  • Create a character-level text generation model that predicts the next N characters.
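
As a rough sketch of the stop word suggestion (this reuses the macbeth_text_words list built earlier and assumes the NLTK stopwords corpus has been downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Filter the tokenized word list built earlier, before creating input sequences.
filtered_words = [word for word in macbeth_text_words if word not in stop_words]
print('Words before/after stop word removal:', len(macbeth_text_words), len(filtered_words))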

To practice further, I would recommend that you try to develop a text generation model with the other datasets from the Gutenberg corpus.

Conclusion

In this article, we saw how to create a text generation model using deep learning with Python's Keras library. Though the model developed in this article is not perfect, the article conveys the idea of how to generate text with deep learning.

Codementor: Test Driven Development with PyTest - Part 1

A 3 part series on how to get started with pytest and test-driven development practices.
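
The series covers the workflow in depth; purely as a flavour of the red/green cycle, a first failing test and the minimal code that makes it pass might look like this (file and function names are illustrative):

# test_slugify.py -- written first, so it fails until slugify() exists
from slugify_demo import slugify

def test_slugify_lowercases_and_joins_words():
    assert slugify("Hello World") == "hello-world"

# slugify_demo.py -- the minimal implementation that makes the test pass
def slugify(text):
    return "-".join(text.lower().split())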

PythonClub - A Brazilian collaborative blog about Python: Creating dicts from other dicts


In this tutorial, we will walk through the process of creating a dict, or dictionary, from one or more other dicts in Python.

As is usual in the language, this can be done in several different ways.

Initial approach

To start, let's assume we have the following dictionaries:

dict_1 = {'a': 1, 'b': 2}
dict_2 = {'b': 3, 'c': 4}

As an example, let's create a new dictionary called new_dict with the values of dict_1 and dict_2 above. A well-known approach is to use the update method.

new_dict = {}
new_dict.update(dict_1)
new_dict.update(dict_2)

This way, new_dict will be:

>>> print(new_dict)
{'a': 1, 'b': 3, 'c': 4}

This method works well, but we have to call update for every dict we want to merge into new_dict. Wouldn't it be nice if we could pass all the required dicts when initializing new_dict?

New in Python 3

Python 3 introduced a very interesting way of doing this, using the ** operator.

new_dict = {**dict_1, **dict_2}

Similarly to the previous example, new_dict will be:

>>> print(new_dict['a'])
1
>>> print(new_dict['b'])
3
>>> print(new_dict['c'])
4

Real copies of dicts

When using the initialization procedure above, we must take a few things into account: only the first-level values are actually duplicated in the new dictionary. As an example, let's change a key present in both dicts and check whether they hold the same value:

>>> dict_1['a'] = 10
>>> new_dict['a'] = 11
>>> print(dict_1['a'])
10
>>> print(new_dict['a'])
11

However, this changes when one of the values of dict_1 is a list, another dict, or some other complex object. For example:

dict_3 = {'a': 1, 'b': 2, 'c': {'d': 5}}

and now let's create a new dict from it:

new_dict = {**dict_3}

As in the previous example, we might imagine that a copy of every element of dict_3 was made, but that is not entirely true. What actually happened is that a shallow copy of the values of dict_3 was made, i.e. only the first-level values were duplicated. Watch what happens when we change the value of the dict stored under the key c.

>>> new_dict['c']['d'] = 11
>>> print(new_dict['c']['d'])
11
>>> print(dict_3['c']['d'])
11  # previous value was 5

The key c holds a reference to another data structure (a dict, in this case). When we change a value inside dict_3['c'], the change is reflected in every dict that was initialized from dict_3. In other words, care must be taken when initializing a dict from other dicts that contain complex values such as lists, dicts or other objects (the attributes of those objects will not be duplicated).

To work around this inconvenience, we can use the deepcopy function from the standard library module copy. Now, when initializing new_dict:

import copy

dict_3 = {'a': 1, 'b': 2, 'c': {'d': 5}}
new_dict = copy.deepcopy(dict_3)

The deepcopy function makes a recursive copy of every element of dict_3, which solves our problem. One more example:

>>> new_dict['c']['d'] = 11
>>> print(new_dict['c']['d'])
11
>>> print(dict_3['c']['d'])
5  # the value was not changed

Conclusion

This article tries to demonstrate, in a simple way, how to create dicts from other dicts, using the various features the language offers, as well as the pros and cons of each approach.

References

For more details and other examples, take a look at this post on the Python Brasil forum here.

That's it, folks. Thanks for reading!

Rene Dudfield: post modern C tooling - draft 2

DRAFT 1 - 9/16/19, 7:19 PM, I'm still working on this, but it's already useful and I'd like some feedback - so I decided to share it early.
DRAFT 2 - 10/1/19



This is a post about contemporary C tooling. Tooling for making higher quality C, faster.

In 2001 or so people started using the phrase "Modern C++". So now that it's 2019, I guess we're in the post modern era? Anyway, this isn't a post about C++ code, but some of this information applies there too.
The C language has no logo, but it's everywhere.

Welcome to the post modern era.

Some of the C++ people have pulled off one of the cleverest and sneakiest tricks ever. They required 'modern' C99 and C11 features in 'recent' C++ standards. Microsoft has famously still clung onto some 80s version of C with their compiler for the longest time. So it's been a decade of hacks for people writing portable code in C. For a while I thought we'd be stuck in the 80s with C89 forever. However, now that some C99 and C11 features are more widely available in the Microsoft compiler, we can use these features in highly portable code (but forget about C17/C18 ISO/IEC 9899:2018/C2X stuff!!).

So, we have some pretty modern language features in C with C11.  But what about tooling?

Tools and protection for our feet.

C, whilst a workhorse being used in everything from toasters, trains, phones, web browsers, ... (everything basically) - is also an excellent tool for shooting yourself in the foot.

Noun

footgun (plural footguns)
  1. (informal,humorous,derogatory) Any feature whose addition to a product results in the user shooting themselves in the foot. C.

Tools like linters, test coverage checkers, static analyzers, memory checkers, documentation generators, thread checkers, continuous integration, nice error messages, ... and such help protect our feet.

How do we do continuous delivery with a language that lets us do the most low level footgunie things ever? On a dozen CPU architectures, 32 bit, 64bit, little endian, big endian, 64 bit with 32bit pointers (wat?!?), with multiple compilers, on a dozen different OS, with dozens of different versions of your dependencies?

Surely there won't be enough time to do releases, and have time left to eat my vegan shaved ice dessert after lunch?



Debuggers

Give me 15 minutes, and I'll change your mind about GDB. --
https://www.youtube.com/watch?v=PorfLSr3DDI
Firstly, did you know gdb has a curses-based 'GUI' which works in a terminal? It's quite a bit easier to use than the command line text interface. It's called TUI. It's built in, and uses emacs key bindings.

But what if you are used to VIM key bindings? https://cgdb.github.io/

Also, there's a fairly easy to use web based front end for GDB called gdbgui (https://www.gdbgui.com/). For those who don't use an IDE with debugging support built in (such as Visual studio by Microsoft or XCode by Apple).





Reverse debugger

Normally a program runs forwards. But what about when you are debugging and you want to run the program backwards?

Set breakpoints and data watchpoints and quickly reverse-execute to where they were hit.

How do you tame non-determinism to allow a program to run the same way it did when it crashed? In C, and with threads, it's sometimes really hard to reproduce problems.

rr helps with this. It's actual magic.

https://rr-project.org/






LLDB - the LLVM debugger.

Apart from the ever improving gdb, there is a new debugger from the LLVM people - lldb ( https://lldb.llvm.org/ ).


IDE debugging

Visual Studio by Microsoft, and XCode by Apple are the two heavyweights here.

The free Visual Studio Code also supports debugging with GDB. https://code.visualstudio.com/docs/languages/cpp

Sublime is another popular editor, and there is good GDB integration for it too in the SublimeGDB package (https://packagecontrol.io/packages/SublimeGDB).



Portable building, and package management

C doesn't have a package manager... or does it?

Ever since Debian dpkg, Redhat rpm, and Perl started doing package management in the early 90s, people worldwide have been able to share pieces of software more easily. Following those systems, many other systems like Ruby gems, JavaScript npm, and Python's Cheese Shop came into being, allowing many to share code easily.

But what about C? How can we define dependencies on different 'packages' or libraries and have them compile on different platforms?

How do we build with Microsoft's compiler, with gcc, with clang, or Intel's C compiler? How do we build on Mac, on Windows, on Ubuntu, on Arch Linux?

Part of the answer to that is CMake. "Modern CMake" lets you define your dependencies and targets, and generates builds for different compilers and platforms.


Conan package manager

There are several packaging tools for C these days, but one of the top contenders is Conan.

https://conan.io/




Testing coverage.

Tests let us know that a certain function is running OK. But which code do we still need to test?

gcov, a tool you can use in conjunction with GCC to test code coverage in your programs.
lcov, LCOV is a graphical front-end for GCC's coverage testing tool gcov.


Instructions from codecov.io on how to use it with C, and clang or gcc. (codecov.io is free for public open source repos).
https://github.com/codecov/example-c


Here's documentation for how CPython gets coverage results for C.
 https://devguide.python.org/coverage/#measuring-coverage-of-c-code-with-gcov-and-lcov

Here is the CPython Travis CI configuration they use.
https://github.com/python/cpython/blob/master/.travis.yml#L69
    - os: linux
      language: c
      compiler: gcc
      env: OPTIONAL=true
      addons:
        apt:
          packages:
            - lcov
            - xvfb
      before_script:
        - ./configure
        - make coverage -s -j4
        # Need a venv that can parse covered code.
        - ./python -m venv venv
        - ./venv/bin/python -m pip install -U coverage
        - ./venv/bin/python -m test.pythoninfo
      script:
        # Skip tests that re-run the entire test suite.
        - xvfb-run ./venv/bin/python -m coverage run --pylib -m test --fail-env-changed -uall,-cpu -x test_multiprocessing_fork -x test_multiprocessing_forkserver -x test_multiprocessing_spawn -x test_concurrent_futures
      after_script:  # Probably should be after_success once test suite updated to run under coverage.py.
        # Make the `coverage` command available to Codecov w/ a version of Python that can parse all source files.
        - source ./venv/bin/activate
        - make coverage-lcov
        - bash <(curl -s https://codecov.io/bash)




Static analysis

"Static analysis has not been helpful in finding bugs in SQLite." -- https://www.sqlite.org/testing.html

According to David Wheeler in "How to Prevent the next Heartbleed" (https://dwheeler.com/essays/heartbleed.html#static-not-found), only one static analysis tool found the Heartbleed vulnerability (the security problem with a logo, a website, and a marketing team) before it was publicly known. This tool is called CQual++. One reason for projects not using these tools is that they have been (and some still are) hard to use. The LLVM project only started using the clang static analysis tool on its own projects recently, for example. However, since Heartbleed, tools have improved in both usability and their ability to detect issues.

I think it's generally accepted that static analysis tools are incomplete, in that each tool does not guarantee detecting every problem or even always detecting the same issues all the time. Using multiple tools can therefore be said to find multiple different types of problems.

Compilers are kind of smart

The most basic of static analysis tools are compilers themselves. Over the years they have been gaining more and more checks which used to only be available in dedicated static analyzers and lint tools.
Variable shadowing and format-string mismatches can be detected reliably and quickly because both gcc and clang do this detection as part of their regular compile. --  Bruce Dawson
Here we see two issues (which used to be) very common in C being detected by the two most popular C compilers themselves.

Compiling code with gcc "-Wall -Wextra -pedantic" options catches quite a number of potential or actual problems (https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html). Other compilers check different things as well. So using multiple compilers with their warnings can find plenty of different types of issues for you.

Compiler warnings should be turned into errors on CI.

By getting your warnings down to zero on Continuous Integration there is less chance of new warnings being introduced that are missed in code review. There are problems with distributing your code with warnings turned into errors, so that should only be done on CI and not in the default build.

Some points for people implementing this:
  • -Werror can be used to turn warnings into errors
  • -Wno-error=unknown-pragmas
  • should run only in CI, and not in the build by default. See werror-is-not-your-friend (https://embeddedartistry.com/blog/2017/5/3/-werror-is-not-your-friend).
  • Use most recent gcc, and most recent clang (change two travis linux builders to do this).
  • first have to fix all the warnings (and hopefully not break something in the process).
  • consider adding extra warnings to gcc: "-Wall -Wextra -Wpedantic" See C tooling
  • Also the Microsoft compiler MSVC on Appveyor can be configured to treat warnings as errors. The /WX argument option treats all warnings as errors. See MSVC warning levels
  • For MSVC on Appveyor, /wdnnnn Suppresses the compiler warning that is specified by nnnn. For example, /wd4326 suppresses compiler warning C4326.
If you run your code on different CPU architectures, these compilers can find even more issues. For example 32bit/64bit Big Endian, and Little Endian.

Static analysis tool overview.

Note, that static analysis can be much slower than the analysis usually provided by compilation. It trades off more CPU time for (perhaps) better results.

The talk "Clang Static Analysis" (https://www.youtube.com/watch?v=UcxF6CVueDM) talks about an LLVM tool called codechecker (https://github.com/Ericsson/codechecker). Clang's Static Analyzer, a free static analyzer based on Clang.  Not that XCode IDE on Mac includes the clang static analyser.

Visual studio by Microsoft can also do static code analysis too. ( https://docs.microsoft.com/en-us/visualstudio/code-quality/code-analysis-for-c-cpp-overview?view=vs-2017)

cppcheck focuses on low false positives and can find many actual problems.
Coverity, a commercial static analyzer, free for open source developers
CppDepend, a commercial static analyzer based on Clang
codechecker, https://github.com/Ericsson/codechecker
cpplint, Cpplint is a command-line tool to check C/C++ files for style issues following Google's C++ style guide.
Awesome static analysis, a page full of static analysis tools for C/C++. https://github.com/mre/awesome-static-analysis#cc
PVS-Studio, a commercial static analyzer, free for open source developers.




cppcheck 

Cppcheck is an analysis tool for C/C++ code. It provides unique code analysis to detect bugs and focuses on detecting undefined behaviour and dangerous coding constructs. The goal is to detect only real errors in the code (i.e. have very few false positives).

The quote below was particularly interesting to me because it echoes the sentiment of other developers that testing will find more bugs. But here is one of the static analysis tools saying so as well.
"You will find more bugs in your software by testing your software carefully, than by using Cppcheck."

To Install cppcheck

http://cppcheck.sourceforge.net/ and https://github.com/danmar/cppcheck
The manual can be found here: http://cppcheck.net/manual.pdf

brew install cppcheck bear
sudo apt-get install cppcheck bear

To run cppcheck on C code.

You can use the bear (Build EAR) tool to record a compilation database (compile_commands.json). cppcheck can then know which C files and header files you are using.

# call your build tool, like `bear make` to record. 
# See cppcheck manual for other C environments including Visual Studio.
bear python setup.py build
cppcheck --quiet --language=c --enable=all -D__x86_64__ -D__LP64__ --project=compile_commands.json

It does seem to find some errors and suggest style improvements that other tools do not. Note that you can control the level of issues reported, from errors only, to portability and style issues and more. See cppcheck --help and the manual for more details about --enable options.

For example these ones from the pygame code base:
[src_c/math.c:1134]: (style) The function 'vector_getw' is never used.
[src_c/base.c:1309]: (error) Pointer addition with NULL pointer.
[src_c/scrap_qnx.c:109]: (portability) Assigning a pointer to an integer is not portable.
[src_c/surface.c:832] -> [src_c/surface.c:819]: (warning) Either the condition '!surf' is redundant or there is possible null pointer dereference: surf.

cppcheck reports 942 things in the pygame codebase. (633 without cython related things).




Custom static analysis for API usage

Probably one of the most useful parts of static analysis is being able to write your own checks. This allows you to do checks specific to your code base for which general checks will not work. One example of this is the gcc cpychecker (https://gcc-python-plugin.readthedocs.io/en/latest/cpychecker.html). With this, gcc can find API usage issues within CPython extensions written in C, including reference counting bugs, NULL pointer dereferences, and other types of issues. You can write custom checkers with LLVM as well; see the "Checker Developer Manual" (https://clang-analyzer.llvm.org/checker_dev_manual.html)

There is a list of GCC plugins (https://gcc.gnu.org/wiki/plugins) among them are some Linux security plugins by grsecurity.




"Using SAL annotations to reduce code defects." (https://docs.microsoft.com/en-us/visualstudio/code-quality/using-sal-annotations-to-reduce-c-cpp-code-defects?view=vs-2019)

"In GNU C and C++, you can use function attributes to specify certain function properties that may help the compiler optimize calls or check code more carefully for correctness."
https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html




Performance profiling and measurement

“The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned.”  Michael Abrash. “Michael Abrash’s Graphics Programming Black Book.”
Reducing the energy usage and run-time requirements of apps can often be a requirement, or very necessary. For a mobile or embedded application it can mean the difference between being able to run the program at all or not. Performance can be directly related to user happiness, but also to the financial performance of a piece of software.

But how do we measure the performance of a program, and how do we know which parts of a program need improvement? Tooling can help.

Valgrind

Valgrind has its own section here because it does lots of different things for us. It's a great tool, or set of tools, for improving your programs. It used to be available only on Linux, but is now also available on macOS.

Apparently Valgrind would have caught the Heartbleed issue if it had been used with a fuzzer.

http://valgrind.org/docs/manual/quick-start.html

Apple Performance Tools

Apple provides many performance related development tools. Along with the gcc and llvm based tools, the main tool is called Instruments. Instruments (part of Xcode) allows you to record and analyse programs for lots of different aspects of performance - including graphics, memory activity, file system, energy and other program events. Being able to record and analyse different types of events together makes it convenient to find performance issues.

Many of the low level parts of the tools in XCode are made open source through the LLVM project. See "LLVM Machine Code Analyzer" ( https://llvm.org/docs/CommandGuide/llvm-mca.html) as one example.

Free and Open Source performance tools.



Microsoft performance tools.


Intel performance tools.

https://software.intel.com/en-us/vtune




Caching builds

https://ccache.samba.org/

ccache is very useful for reducing the compile time of large C projects, especially when you are doing a 'rebuild from scratch'. This is because ccache can cache the compilation of parts that have not changed.
http://itscompiling.eu/2017/02/19/speed-cpp-compilation-compiler-cache/

This is also useful for speeding up CI builds, and especially when large parts of the code base rarely change.


Distributed building.


distcc https://github.com/distcc/distcc
icecream https://github.com/icecc/icecream


Complexity of code.


How complex is your code?
http://www.gnu.org/software/complexity/

complexity src_c/*.c


Testing your code on different OS/architectures.

Sometimes you need to be able to fix an issue on an OS or architecture that you don't have access to. Luckily these days there are many tools available to quickly use a different system through emulation, or container technology.


Vagrant
Virtualbox
Docker
Launchpad, compile and run tests on many architectures.
Mini cloud (ppc machines for debugging)

If you pay Travis CI, they allow you to connect to the testing host with ssh when a test fails.


Code Formatting

clang-format

clang-format - rather than manually fix various formatting errors found with a linter, many projects are just using clang-format to format the code into some coding standard.



Services

LGTM is an 'automated code review tool' with github (and other code repos) support. https://lgtm.com/help/lgtm/about-automated-code-review

Coveralls provides a store for test coverage results with github (and other code repos) support. https://coveralls.io/




Coding standards for C

There are lots of coding standards for C, and there are tools to check them.

An older set of standards is MISRA C (https://en.wikipedia.org/wiki/MISRA_C), which aims to facilitate code safety, security, and portability for embedded systems.

The Linux Kernel Coding standard (https://www.kernel.org/doc/html/v4.10/process/coding-style.html) is well known mainly because of the popularity of the Linux Kernel. But this is mainly concerned with readability.

A newer one is the CERT C coding standard (https://wiki.sei.cmu.edu/confluence/display/seccode/SEI+CERT+Coding+Standards), and it is a secure coding standard (not a safety one).

The website for the CERT C coding standard is quite amazing. It links to tools that can detect each of the problems automatically (when they can be). It is very well researched, and links each problem to other relevant standards, and gives issues priorities. A good video to watch on CERT C is "How Can I Enforce the SEI CERT C Coding Standard Using Static Analysis?" (https://www.youtube.com/watch?v=awY0iJOkrg4). They do releases of the website, which is edited as a wiki. At the time of writing the last release into book form was in 2016.







How are other projects tested?

We can learn a lot from how other C projects are going about their business today.
Also, thanks to CI configuration being defined in code, we can see how automated tests are run on services like Travis CI and Appveyor.

SQLite

"How SQLite Is Tested"

Curl

"Testing Curl"
https://github.com/curl/curl/blob/master/.travis.yml

Python

"How is CPython tested?"
https://github.com/python/cpython/blob/master/.travis.yml

OpenSSL

"How is OpenSSL tested?"

https://github.com/openssl/openssl/blob/master/.travis.yml
They use Coverity too: https://github.com/openssl/openssl/pull/9805
https://github.com/openssl/openssl/blob/master/fuzz/README.md

libsdl

"How is SDL tested?" [No response]


Linux

As of early 2019, Linux used no unit testing within the kernel tree (some unit tests exist outside of the kernel tree).

There are no in-tree unit tests, but Linux is probably one of the most highly tested pieces of code there is.

Linux relies a lot on community testing. With thousands of developers working on Linux every day, that is a lot of people testing things out. Additionally, because all of the source code is available for Linux many more people are able to try things out, and test things on different systems.


https://stackoverflow.com/questions/3177338/how-is-the-linux-kernel-tested

https://www.linuxjournal.com/content/linux-kernel-testing-and-debugging


Haproxy

https://github.com/haproxy/haproxy/blob/master/.travis.yml







Real Python: Strings and Character Data in Python


In this course, you’ll learn about working with strings, which are objects that contain sequences of character data. Processing character data is integral to programming. It is a rare application that doesn’t need to manipulate strings to at least some extent.

Python provides a rich set of operators, functions, and methods for working with strings. When you’ve finished this course, you’ll know how to:

  • Use operators with strings
  • Access and extract portions of strings
  • Use built-in Python functions with characters and strings
  • Use methods to manipulate and modify string data

You’ll also be introduced to two other Python objects used to represent raw byte data: the bytes and bytearray types.
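
As a quick taste of what the course covers (a tiny illustrative snippet, not taken from the course itself):

s = "Real Python"

print("Py" in s)                   # operators: True
print(s[5:])                       # extracting a portion: 'Python'
print(len(s), s.upper())           # built-in functions and methods: 11 REAL PYTHON
print(s.replace("Real", "Rock"))   # modifying string data: 'Rock Python'

b = s.encode("utf-8")              # bytes holds raw byte data
print(b, bytearray(b)[:4])         # b'Real Python' bytearray(b'Real')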

Take the Quiz: Test your knowledge with our interactive “Python Strings and Character Data” quiz. Upon completion you will receive a score so you can track your learning progress over time:

Take the Quiz »



Dataquest: Tutorial: Transforming Data with Python Scripts and the Command Line

Python Insider: Python 3.8.0rc1 is now available

Python 3.8.0 is almost ready. After a rather tumultuous few days, we are very happy to announce the availability of the release candidate:
https://www.python.org/downloads/release/python-380rc1/ 

This release, 3.8.0rc1, is the final planned release preview. Assuming no critical problems are found prior to 2019-10-14, the scheduled release date for 3.8.0, no code changes are planned between this release candidate and the final release.

Please keep in mind that this is not the gold release yet and as such its use is not recommended for production environments.

Major new features of the 3.8 series, compared to 3.7

Some of the major new features and changes in Python 3.8 are listed below; two of the smaller ones are illustrated with a short example right after the list:
  • PEP 572, Assignment expressions
  • PEP 570, Positional-only arguments
  • PEP 587, Python Initialization Configuration (improved embedding)
  • PEP 590, Vectorcall: a fast calling protocol for CPython
  • PEP 578, Runtime audit hooks
  • PEP 574, Pickle protocol 5 with out-of-band data
  • Typing-related: PEP 591 (Final qualifier), PEP 586 (Literal types), and PEP 589 (TypedDict)
  • Parallel filesystem cache for compiled bytecode
  • Debug builds share ABI as release builds
  • f-strings support a handy = specifier for debugging
  • continue is now legal in finally: blocks
  • on Windows, the default asyncio event loop is now ProactorEventLoop
  • on macOS, the spawn start method is now used by default in multiprocessing
  • multiprocessing can now use shared memory segments to avoid pickling costs between processes
  • typed_ast is merged back to CPython
  • LOAD_GLOBAL is now 40% faster
  • pickle now uses Protocol 4 by default, improving performance
  • (Hey, fellow core developer, if a feature you find important is missing from this list, let Łukasz know.)
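
For a quick taste, assignment expressions and the f-string = specifier look like this (requires Python 3.8):

# Assignment expressions (PEP 572): name a value inside an expression.
data = [1, 2, 3, 4]
if (n := len(data)) > 3:
    print(f"list is too long ({n} elements)")

# The new f-string "=" specifier echoes both the expression and its value.
version = "3.8.0rc1"
print(f"{version=}")   # prints: version='3.8.0rc1'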
 