
Mike Driscoll: PyDev of the Week: Martin Uribe


This week we welcome Martin Uribe (@clamytoe) as our PyDev of the Week! Martin helps out at PyBites. You can find him on PyBites’ Slack channel answering lots of Python-related questions. You can also find out what Martin is up to via his GitHub or LinkedIn profiles. Let’s take a few moments to get to know Martin better!

Can you tell us a little about yourself (hobbies, education, etc):

I’m 46 and happily married with 8 kids. Born and raised in South Central L.A. I joined the California National Guard while I was still in high school. I went to Basic Training between my 11th and 12th grades; came back and graduated with honors and was gone within the month for Advanced Individual Training, where they taught me how to fix helicopter radios. After a couple of years I decided to enlist full-time in the regular Army and did a stint for another 8 years in Automated Logistics and got an honorable discharge as a Sergeant in 2001.

Before getting out, I got in a semester of full-time college as part of a re-enlistment bonus. I loved it and I hit the books pretty hard. I was so pumped to learn that I pushed myself to continue to grow when I went back to work. As a result, I was able to get my MCSE, MCP+I and A+ certifications, which allowed me to get into the role that I still hold as a Senior Field Engineer for Fidelity. I’m contracted out to one of our many customers, PNC Bank, at their Dallas lockbox location. The title has changed over the years but it entails a lot of hardware and software support. In case you don’t know, a lockbox is where everyone’s checks go for processing when they make a payment over snail mail. Everything gets imaged front and back and entered into the bank’s system, and the bank’s customers can access their documents through a secure proxy connection immediately. The money transfers are made the next day once the checks have cleared. At the end of the month, the customers’ images are placed on encrypted CDs or DVDs and mailed out to them.

To blow off some steam I like to play Minecraft with my kids, edit movies, play Beat Saber, take online courses, and do some Python coding.

Why did you start using Python?

While in the Army I got into the role of maintaining the 4th Infantry Division’s logistics database. Once I figured out that I could automate most of my work, I was hooked! I had this report that I had to generate daily. That thing was a beast and took several hours to put together. After doing it a couple of times, I decided to record a macro, and the next time it only took several minutes! I went from macros, to editing the VBScript code itself, to writing batch scripts on the NT servers. By the time that I left, the only thing I had to do was make sure the tape was in the tape drive for the nightly backups!

When I got into the role that I have now, it was a whole new ball game. Up to that point I was only familiar with Windows NT and Windows 95. I was plopped in front of a terminal on a FreeBSD network and told to take care of it! Trial by fire as they say! I soon got the hang of it and since our whole platform runs on Perl, I started to dabble a bit with that. Pretty soon I was writing Perl and shell scripts to make my job easier.

At this point of my life, I was into a bit of everything: from pentesting, web development, and database management, to 3D modeling/rigging/animation. I even got certified as a Macromedia Flash Designer! Boy was I wrong for betting on that platform… My interests were so scattered that I was good at a lot of things, but not an expert at any of them. I finally got fed up and decided that it was time to stick to one thing and become really good at it.

While pentesting I had come across several Python scripts and I was impressed with how easy they were to read compared to Perl and how powerful they were. I decided Python would be my train and I hopped on without a second thought.

What other programming languages do you know and which is your favorite?

While taking some college courses I learned Java, but I didn’t like it much. I know enough of the following to get things done: HTML, CSS, JavaScript, Perl, SQL, and BASH. Python is my favorite; I use it pretty much every day even though my job doesn’t require me to code.

What projects are you working on now?

Currently I’m working on some code to gather statistics on printer usage across PNC’s national network. They currently have a project to consolidate printing to one location and needed to get a good feel for how many documents are printed at each of their sites. Using threads, I’m basically just scraping the printers’ internal web servers with requests and BeautifulSoup for the page counts once per hour and dumping the results into a log file, then using another script to process the data with Pandas and generate charts with matplotlib. The bossman was so impressed with it that he wants me to expand on it so that it can be used at all of our customers’ locations.

On the side I’m continuing to learn more Python. I’m in the process of working through some asyncio material and also learning more about deploying API services. I usually put all of my stuff on GitHub, so that’s a good place to see what I’ve been up to.

Which Python libraries are your favorite (core or 3rd party)?

I’ve already mentioned a few of my favorites, requests, BeautifulSoup, Pandas, and matplotlib, but if I had to add more I would say the collections module, black, and cookiecutter. I have my toepack cookiecutter template on GitHub that I’m really proud of. I use it to automatically generate a lot of the boilerplate code and files that go into most of my new projects.

How did you get started at PyBites?

I had just finished some Python MOOC courses on Coursera and I didn’t want to lose what I had learned, so I was looking for some coding challenges. I came across pybit.es and I loved the variety of their blog challenges. I had already found some of the other sites, but those all seemed to be more algorithmic in nature and I wanted some real-world situational challenges, which PyBites serves up by the plateful! Back then, codechalleng.es didn’t even exist; it was all manual git commands, which in and of itself was a great challenge, but Bob Belderbos was very patient with me and held my hand through those dark times.

I had been messing around with Python for over six years and I really didn’t start to make much progress with it until I met Bob and Julian Sequeira. It’s due to them that I’ve gotten as far as I have and I’m really grateful.

What do you do at PyBites?

I help Bob and Julian by verifying the functionality of new features, and by testing, proofreading, and suggesting new Bites. I also suggest new features and UI enhancements, create new challenges, and am somewhat of a sounding board for Bob to bounce new ideas off of.

Is there anything else you’d like to say?

Since everybody asks: my online nickname, clamytoe, is one that I came up with for my first born. Over 19 years ago, my wife and I were in the process of trying to sell our home outside of Fort Hood. If you’ve ever sold or bought a house, you know that there is a ton of paperwork involved. My wife and I were looking over the contract and we had papers scattered all over the living room floor when my son came running through. Every sheet that he stepped on stuck to his little feet. I remember picking him up and saying something like, “Dang, you have two clamy toes!”. He giggled and was off again, but the name stuck around.

Thanks for doing the interview, Martin!

The post PyDev of the Week: Martin Uribe appeared first on The Mouse Vs. The Python.


Codementor: Cyber Discovery - What it is all about

Django Weblog: Django 3.0 release candidate 1 released


Django 3.0 release candidate 1 is the final opportunity for you to try out the raft of new features before Django 3.0 is released.

The release candidate stage marks the string freeze and the call for translators to submit translations. Provided no major bugs are discovered that can't be solved in the next two weeks, Django 3.0 will be released on or around December 2. Any delays will be communicated on the django-developers mailing list thread.

Please use this opportunity to help find and fix bugs (which should be reported to the issue tracker). You can grab a copy of the package from our downloads page or on PyPI.

The PGP key ID used for this release is Mariusz Felisiak: 2EF56372BA48CD1B.

PyBites: You can now hone your testing / pytest skills on our platform


Writing test code is an essential skill. At PyBites we believe writing code is the only way to become a master (Ninja) at programming, and the same applies to test code. For that reason we have extended our regular exercises with Test Bites.

In this article you will read about the new feature, showcased on our first ever Test Bite. We also share some details about the implementation and a challenge we hit getting it to work. Enjoy, and start honing your testing skills today!

Why did we add this?

It was one of the most requested features. Period.

It is also a logical progression. So far, coding our 200+ Bites of Py, you have looked at test code and analyzed its output when Bites did not pass. This is of course very useful, but it does not teach you how to write tests for your code. Many users expressed how cool it would be if they could learn how to write pytest code on our platform. Well, now you can! We just added 3 Test Bites - stay tuned for more soon ...

How to test test code?

This was a challenge, but somebody on Slack suggested mutation testing, and a quick Google search yielded MutPy:

MutPy is a mutation testing tool for Python 3.3+ source code. MutPy supports standard unittest module, generates YAML/HTML reports and has colorful output. It applies mutation on AST level. You could boost your mutation testing process with high order mutations (HOM) and code coverage analysis.
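To make the inputs concrete, here is a tiny made-up target module and test file (calc.py and test_calc.py are our own example names, not the Bite 05 files used below); you would point mut.py at them with --target and --unit-test exactly like in the run that follows:

    # calc.py -- a made-up "target" module that MutPy will mutate
    def add(a, b):
        """Return the sum of two numbers."""
        return a + b


    # test_calc.py -- the "unit-test" file whose job is to kill the mutants
    from calc import add

    def test_add():
        # each assert helps kill mutants injected into add(), e.g. a + b -> a - b
        assert add(2, 3) == 5
        assert add(-1, 1) == 0
        assert add(0, 0) == 0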

Here we run it manually on Bite 05:

    $ mut.py -h
    usage: mut.py [-h] [--version] [--target TARGET [TARGET ...]]
                [--unit-test UNIT_TEST [UNIT_TEST ...]] [--runner RUNNER]
                [--report REPORT_FILE] [--report-html DIR_NAME]
                [--timeout-factor TIMEOUT_FACTOR] [--show-mutants] [--quiet]
                [--debug] [--colored-output] [--disable-stdout]
                [--experimental-operators] [--operator OPERATOR [OPERATOR ...]]
                [--disable-operator OPERATOR [OPERATOR ...]] [--list-operators]
                [--path DIR] [--percentage PERCENTAGE] [--coverage]
                [--order ORDER] [--hom-strategy HOM_STRATEGY]
                [--list-hom-strategies] [--mutation-number MUTATION_NUMBER]

    $ mut.py --target names.py --unit-test test_names.py --runner pytest --coverage
    [*] Start mutation process:
    - targets: names.py
    - tests: test_names.py
    [*] 6 tests passed:
    - test_names [0.37057 s]
    [*] Start mutants generation and execution:
    - [#   1] CRP names: [0.13190 s] killed by test_names.py::test_sort_by_surname_desc
    - [#   2] CRP names: [0.09615 s] killed by test_names.py::test_sort_by_surname_desc
    - [#   3] CRP names: [0.09483 s] killed by test_names.py::test_sort_by_surname_desc
    - [#   4] CRP names: [0.09546 s] killed by test_names.py::test_sort_by_surname_desc
    - [#   5] CRP names: [0.09294 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#   6] CRP names: [0.09068 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#   7] CRP names: [0.08975 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#   8] CRP names: [0.09013 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#   9] CRP names: [0.12359 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  10] CRP names: [0.09420 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  11] CRP names: [0.09482 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  12] CRP names: [0.09506 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  13] CRP names: [0.09875 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  14] CRP names: [0.09385 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  15] CRP names: [0.09451 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  16] CRP names: [0.09528 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  17] CRP names: [0.09253 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  18] CRP names: [0.09329 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  19] CRP names: [0.09638 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  20] CRP names: [0.09449 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  21] CRP names: [0.09480 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  22] CRP names: [0.09092 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  23] CRP names: [0.09388 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  24] CRP names: [0.09356 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  25] CRP names: [0.09271 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  26] CRP names: [0.09341 s] killed by test_names.py::test_dedup_and_title_case_names
    - [#  27] CRP names: [0.09397 s] killed by test_names.py::test_sort_by_surname_desc
    - [#  28] CRP names: [0.10249 s] killed by test_names.py::test_shortest_first_name
    [*] Mutation score [8.05135 s]: 100.0%
    - all: 28
    - killed: 28 (100.0%)
    - survived: 0 (0.0%)
    - incompetent: 0 (0.0%)
    - timeout: 0 (0.0%)
    [*] Coverage: 113 of 113 AST nodes (100.0%)

To see the mutations it tried, run it with the -m switch:

    $ mut.py --target names.py --unit-test test_names.py --runner pytest --coverage -m
    [*] Start mutation process:
    - targets: names.py
    - tests: test_names.py
    [*] 6 tests passed:
    - test_names [0.32475 s]
    [*] Start mutants generation and execution:
    - [#   1] CRP names:
    --------------------------------------------------------------------------------
    -  1: NAMES = ['arnold schwarzenegger', 'alec baldwin', 'bob belderbos', \
    +  1: NAMES = ['mutpy', 'alec baldwin', 'bob belderbos', \
    2:     'julian sequeira', 'sandra bullock', 'keanu reeves', \
    3:     'julbob pybites', 'bob belderbos', 'julian sequeira', \
    4:     'al pacino', 'brad pitt', 'matt damon', 'brad pitt']
    5:
    --------------------------------------------------------------------------------
    [0.10319 s] killed by test_names.py::test_sort_by_surname_desc
    - [#   2] CRP names:
    --------------------------------------------------------------------------------
    -  1: NAMES = ['arnold schwarzenegger', 'alec baldwin', 'bob belderbos', \
    +  1: NAMES = ['', 'alec baldwin', 'bob belderbos', \
    2:     'julian sequeira', 'sandra bullock', 'keanu reeves', \
    3:     'julbob pybites', 'bob belderbos', 'julian sequeira', \
    4:     'al pacino', 'brad pitt', 'matt damon', 'brad pitt']
    5:
    --------------------------------------------------------------------------------
    [0.09149 s] killed by test_names.py::test_sort_by_surname_desc
    ...
    ... many more ...
    ...

Pretty cool eh? There is also a good explanation and simple example on MutPy's project page.

How it works on our platform

As previously mentioned, a Test Bite looks like a regular Bite but instead of a given set of tests, you are provided with a code module that gets imported. You are asked to write one or more test functions:

test bite interface

Let's write some code for this Bite and see the different checks (warning: contains spoilers, if you want to try it out now, go write code here!)

Bad syntax

It first checks if your test code runs. Here I forgot a colon after my function declaration:

syntax error

If you click Show pytest output you can see where you went wrong:

syntax error

Tests pass

Secondly it checks if the tests pass. Here I added a statement that does not pass:

pytest fail

And it doesn't like that:

pytest fail

Coverage

Next step is code coverage. Our initial thought was to use pytest-cov, but that would mean running two commands on the code, and if possible it's preferable to use one tool. Luckily MutPy had us covered (pun intended), as long as we run it with the --coverage flag (see above).

The code so far has a coverage of 50%:

coverage

Mutation testing

At this point we also see MutPy's mutation test output that we included (using the -m switch). Again use Show pytest output to reveal the full output.

As you can see, MutPy started the mutants generation and execution. Sounds like a game, no? It tries to kill the mutants using your test code.

mutpy output

More in a bit. Scrolling down, you see the coverage calculation, for which it uses Python's ast module (Abstract Syntax Trees).

coverage calculation

Deleting one of the lines we have written so far, we see that not all mutants get killed. Scrolling down, you see two survivors:

mutants fail

Here is a mutant our tests addressed:

mutant killed

And here are the two survivors we still need to address:

survived

survived2

Adding assert fib(1) == 1 earlier kills those survivors.
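For reference, here is a minimal sketch of what the test code might look like at this point; the module and function name (fib) are assumed from the screenshots above:

    from fib import fib  # assumed import, based on the Bite shown in the screenshots

    def test_fib_base_cases():
        assert fib(0) == 0
        assert fib(1) == 1  # the extra assert that kills the two surviving mutants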

Let's work on the 50% coverage percentage next. Here is another test function to test a higher number:

another test

This brings the coverage up to 93%! There is still one scenario we did not write tests for, negative numbers:

uncovered

I'll leave that as an exercise to the reader :)

Finally when you pass syntax + pytest run + 100% coverage + 100% mutation score you pass the Bite:

pass feedback

As we're still testing the waters of this feature / the MutPy module, the minimum required coverage and mutation scores might go down a bit if the Test Bites become too challenging ...

Remember MutPy == multiprocessing

As elaborated in PyBites Code Challenges behind the scenes we run our platform on Heroku and use AWS Lambda for code execution.

One tough nut to crack was AWS Lambda's lack of support for multiprocessing, more specifically the multiprocessing.Queue that MutPy uses (if you try it, you'll get OSError: [Errno 38] Function not implemented).

Luckily, thanks to this article, replacing multiprocessing.Queue with multiprocessing.Pipe in the MutPy module got it to work 🎉
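The sketch below shows the shape of that swap in isolation; it is not MutPy's actual code, just the Queue-versus-Pipe pattern that made the difference on Lambda:

    import multiprocessing

    def _worker(conn):
        # stand-in for the work a MutPy child process would do
        conn.send("mutation result")
        conn.close()

    def run_child_with_pipe():
        # multiprocessing.Queue needs shared-memory semaphores that AWS Lambda
        # does not provide (hence the OSError above); a Pipe is just a pair of
        # connected file descriptors, which Lambda does allow.
        parent_conn, child_conn = multiprocessing.Pipe()
        proc = multiprocessing.Process(target=_worker, args=(child_conn,))
        proc.start()
        result = parent_conn.recv()
        proc.join()
        return result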

Just leaving this note here if somebody runs into this.

Start today!

We hope this will not only improve your pytest skills, but also how you think about writing test code.

We are putting all Test Bites into a new / dedicated learning path soon so stay tuned ...

As always, have fun and let us know if you have any feedback on this new feature, preferably on Slack.

Special thanks to AJ, David, Martin and Harrison for testing out this feature!


Keep Calm and Code in Python!

-- Bob

Django Weblog: Introducing DjangoCon Africa


Following the huge success of PyCon Africa, the Django community in Africa is ready to bring a new major software event to the continent - the very first DjangoCon Africa! The Django Software Foundation is excited to endorse and support this initiative.

Plans are already in motion for a DjangoCon Africa to be held in Addis Ababa, Ethiopia in November 2020. Actual dates to be announced as soon as key details are in place.

DjangoCon Africa will include 3 days of single-track talks, 1 day of workshops and sprints, and another day for touring for international visitors.

The event will also include a Django Girls workshop to be held the weekend before DjangoCon Africa. To make the conference as inclusive as possible, the event will offer financial aid to members of under-represented communities in software to ensure they can also attend.

The CFP, which is open to all, will also be announced as soon as key details are in place.

About Ethiopia

Ethiopia is a country in the north-east of Africa, in the region commonly known as the Horn of Africa. It is a country with a rich history and many historical places to visit. The country is highly accessible to all: African Union members have the option of applying for a visa on arrival at Bole International Airport, while everyone else can apply for an e-visa before traveling to Ethiopia.

The country also boasts the largest airline in the whole of Africa: Ethiopian Airlines has 53 routes in Africa, 17 in Europe, 7 in the Americas, 14 in Asia and 10 in the Middle East. This makes the country very accessible from all of Africa and the rest of the world, and hence an ideal location for the first DjangoCon Africa.

See you in Addis Ababa in November 2020 for the first ever DjangoCon Africa!

Real Python: Pandas GroupBy: Your Guide to Grouping Data in Python


Whether you’ve just started working with Pandas and want to master one of its core facilities, or you’re looking to fill in some gaps in your understanding about .groupby(), this tutorial will help you to break down and visualize a Pandas GroupBy operation from start to finish.

This tutorial is meant to complement the official documentation, where you’ll see self-contained, bite-sized examples. Here, however, you’ll focus on three more involved walk-throughs that use real-world datasets.

In this tutorial, you’ll cover:

  • How to use Pandas GroupBy operations on real-world data
  • How the split-apply-combine chain of operations works
  • How to decompose the split-apply-combine chain into steps
  • How methods of a Pandas GroupBy object can be placed into different categories based on their intent and result

This tutorial assumes you have some experience with Pandas itself, including how to read CSV files into memory as Pandas objects with read_csv(). If you need a refresher, then check out Reading CSVs With Pandas.

You can download the source code for all the examples in this tutorial by clicking on the link below:

Download Datasets: Click here to download the datasets you'll use to learn about Pandas' GroupBy in this tutorial.

Housekeeping

All code in this tutorial was generated in a CPython 3.7.2 shell using Pandas 0.25.0. Before you proceed, make sure that you have the latest version of Pandas available within a new virtual environment:

$ python -m venv pandas-gb-tut
$ source ./pandas-gb-tut/bin/activate
$ python -m pip install pandas

The examples here also use a few tweaked Pandas options for friendlier output:

import pandas as pd

# Use 3 decimal places in output display
pd.set_option("display.precision", 3)

# Don't wrap repr(DataFrame) across additional lines
pd.set_option("display.expand_frame_repr", False)

# Set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)

You can add these to a startup file to set them automatically each time you start up your interpreter.
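For example, here is a minimal sketch of such a startup file; the path is just a suggestion, and you point the PYTHONSTARTUP environment variable at it so the plain CPython interactive shell runs it at launch (IPython has its own startup mechanism):

# ~/.pythonstartup.py -- example path; set PYTHONSTARTUP to this file in your shell profile
import pandas as pd

pd.set_option("display.precision", 3)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.max_rows", 25)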

In this tutorial, you’ll focus on three datasets:

  1. The U.S. Congress dataset contains public information on historical members of Congress and illustrates several fundamental capabilities of .groupby().
  2. The air quality dataset contains periodic gas sensor readings. This will allow you to work with floats and time series data.
  3. The news aggregator dataset holds metadata on several hundred thousand news articles. You’ll be working with strings and doing groupby-based text munging.

You can download the source code for all the examples in this tutorial by clicking on the link below:

Download Datasets: Click here to download the datasets you'll use to learn about Pandas' GroupBy in this tutorial.

Once you’ve downloaded the .zip, you can unzip it to your current directory:

$ unzip -q -d groupby-data groupby-data.zip

The -d option lets you extract the contents to a new folder:

./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv

With that set up, you’re ready to jump in!

Example 1: U.S. Congress Dataset

You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a Pandas DataFrame with read_csv():

import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}

df = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)

The dataset contains members’ first and last names, birth date, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:

>>> df.tail()
      last_name first_name   birthday gender type state       party
11970   Garrett     Thomas 1972-03-27      M  rep    VA  Republican
11971    Handel      Karen 1962-04-18      F  rep    GA  Republican
11972     Jones     Brenda 1959-10-24      F  rep    MI    Democrat
11973    Marino        Tom 1952-08-15      M  rep    PA  Republican
11974     Jones     Walter 1943-02-10      M  rep    NC  Republican

The DataFrame uses categorical dtypes for space efficiency:

>>> df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.
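If you want to check the savings yourself, one quick comparison (the df_raw name is ours, and exact byte counts will vary by machine and Pandas version) is to reload the CSV without the categorical dtypes and ask Pandas for its own memory report:

# Reload without dtype=dtypes and compare memory usage
df_raw = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"],
)
print(df_raw.memory_usage(deep=True).sum())  # plain object columns
print(df.memory_usage(deep=True).sum())      # categorical columns: noticeably smaller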

The “Hello, World!” of Pandas GroupBy

Now that you’re familiar with the dataset, you’ll start with a “Hello, World!” for the Pandas GroupBy operation. What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset? In SQL, you could find this answer with a SELECT statement:

SELECT state, count(name)
FROM df
GROUP BY state
ORDER BY state;

Here’s the near-equivalent in Pandas:

>>> n_by_state = df.groupby("state")["last_name"].count()
>>> n_by_state.head(10)
state
AK     16
AL    206
AR    117
AS      2
AZ     48
CA    361
CO     90
CT    240
DC      2
DE     97
Name: last_name, dtype: int64

You call .groupby() and pass the name of the column you want to group on, which is "state". Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation.

You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following:

  • A list of multiple column names
  • A dict or Pandas Series
  • A NumPy array or Pandas Index, or an array-like iterable of these

Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:

>>> df.groupby(["state", "gender"])["last_name"].count()
state  gender
AK     M          16
AL     F           3
       M         203
AR     F           5
       M         112
                 ...
WI     M         196
WV     F           1
       M         119
WY     F           2
       M          38
Name: last_name, Length: 104, dtype: int64

The analogous SQL query would look like this:

SELECT state, gender, count(name)
FROM df
GROUP BY state, gender
ORDER BY state, gender;

As you’ll see next, .groupby() and the comparable SQL statements are close cousins, but they’re often not functionally identical.

Note: There’s one more tiny difference in the Pandas GroupBy vs SQL comparison here: in the Pandas version, some states only display one gender. As we developed this tutorial, we encountered a small but tricky bug in the Pandas source that doesn’t handle the observed parameter well with certain types of data. Never fear! There are a few workarounds in this particular case.

Pandas GroupBy vs SQL

This is a good time to introduce one prominent difference between the Pandas GroupBy operation and the SQL query above. The result set of the SQL query contains three columns:

  1. state
  2. gender
  3. count

In the Pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:

>>> n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
>>> type(n_by_state_gender)
<class 'pandas.core.series.Series'>
>>> n_by_state_gender.index[:5]
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M')],
           names=['state', 'gender'])

To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you can use as_index=False:

>>> df.groupby(["state", "gender"], as_index=False)["last_name"].count()
    state gender  last_name
0      AK      F        NaN
1      AK      M       16.0
2      AL      F        3.0
3      AL      M      203.0
4      AR      F        5.0
..    ...    ...        ...
111    WI      M      196.0
112    WV      F        1.0
113    WV      M      119.0
114    WY      F        2.0
115    WY      M       38.0

[116 rows x 3 columns]

This produces a DataFrame with three columns and a RangeIndex, rather than a Series with a MultiIndex. In short, using as_index=False will make your result more closely mimic the default SQL output for a similar operation.

Note: In df.groupby(["state", "gender"])["last_name"].count(), you could also use .size() instead of .count(), since you know that there are no NaN last names. Using .count() excludes NaN values, while .size() includes everything, NaN or not.
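Here is the comparison spelled out as a small sketch; with this dataset the two calls return the same numbers because no last names are missing:

# .count() skips NaN values, .size() counts every row in the group
df.groupby(["state", "gender"])["last_name"].count()
df.groupby(["state", "gender"])["last_name"].size()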

Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() does this by default through its parameter sort, which is True unless you tell it otherwise:

>>> # Don't sort results by the sort keys
>>> df.groupby("state", sort=False)["last_name"].count()
state
DE      97
VA     432
SC     251
MD     305
PA    1053
      ...
AK      16
PI      13
VI       4
GU       4
AS       2
Name: last_name, Length: 58, dtype: int64

Next, you’ll dive into the object that .groupby() actually produces.

How Pandas GroupBy Works

Before you get any further into the details, take a step back to look at .groupby() itself:

>>> by_state = df.groupby("state")
>>> print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>

What is that DataFrameGroupBy thing? Its .__str__() doesn’t give you much information into what it actually is or how it works. The reason that a DataFrameGroupBy object can be difficult to wrap your head around is that it’s lazy in nature. It doesn’t really do any operations to produce a useful result until you say so.

One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:

  1. Split a table into groups
  2. Apply some operations to each of those smaller tables
  3. Combine the results

It can be difficult to inspect df.groupby("state") because it does virtually none of these things until you do something with the resulting object. Again, a Pandas GroupBy object is lazy. It delays virtually every part of the split-apply-combine process until you invoke a method on it.

So, how can you mentally separate the split, apply, and combine stages if you can’t see any of them happening in isolation? One useful way to inspect a Pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__() and produces an iterator of (group, DataFrame) pairs for DataFrames:

>>> for state, frame in by_state:
...     print(f"First 2 entries for {state!r}")
...     print("------------------------")
...     print(frame.head(2), end="\n\n")
...
First 2 entries for 'AK'
------------------------
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
------------------------
    last_name first_name   birthday gender type state       party
912   Crowell       John 1780-09-18      M  rep    AL  Republican
991    Walker       John 1783-08-12      M  sen    AL  Republican

If you’re working on a challenging aggregation problem, then iterating over the Pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.

There are a few other methods and properties that let you look into the individual groups and their splits. The .groups attribute will give you a dictionary of {group name: index labels} pairs. For example, by_state.groups is a dict with states as keys. Here’s the value for the "PA" key:

>>> by_state.groups["PA"]
Int64Index([    4,    19,    21,    27,    38,    57,    69,    76,    84,
               88,
            ...
            11842, 11866, 11875, 11877, 11887, 11891, 11932, 11945, 11959,
            11973],
           dtype='int64', length=1053)

Each value is a sequence of the index locations for the rows belonging to that particular group. In the output above, 4, 19, and 21 are the first indices in df at which the state equals “PA.”

You can also use .get_group() as a way to drill down to the sub-table from a single group:

>>> by_state.get_group("PA")
      last_name first_name   birthday gender type state                party
4        Clymer     George 1739-03-16      M  rep    PA                  NaN
19       Maclay    William 1737-07-20      M  sen    PA  Anti-Administration
21       Morris     Robert 1734-01-20      M  sen    PA   Pro-Administration
27      Wynkoop      Henry 1737-03-02      M  rep    PA                  NaN
38       Jacobs     Israel 1726-06-09      M  rep    PA                  NaN
...         ...        ...        ...    ...  ...   ...                  ...
11891     Brady     Robert 1945-04-07      M  rep    PA             Democrat
11932   Shuster       Bill 1961-01-10      M  rep    PA           Republican
11945   Rothfus      Keith 1962-04-25      M  rep    PA           Republican
11959  Costello       Ryan 1976-09-07      M  rep    PA           Republican
11973    Marino        Tom 1952-08-15      M  rep    PA           Republican

This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].

Note: I use the generic term Pandas GroupBy object to refer to both a DataFrameGroupBy object or a SeriesGroupBy object, which have a lot of commonalities between them.

It’s also worth mentioning that .groupby() does do some, but not all, of the splitting work by building a Grouping class instance for each key that you pass. However, many of the methods of the BaseGrouper class that holds these groupings are called lazily rather than at __init__(), and many also use a cached property design.

Next, what about the apply part? You can think of this step of the process as applying the same operation (or callable) to every “sub-table” that is produced by the splitting stage. (I don’t know if “sub-table” is the technical term, but I haven’t found a better one 🤷‍♂️)

From the Pandas GroupBy object by_state, you can grab the initial U.S. state and DataFrame with next(). When you iterate over a Pandas GroupBy object, you’ll get pairs that you can unpack into two variables:

>>> state, frame = next(iter(by_state))  # First tuple from iterator
>>> state
'AK'
>>> frame.head(3)
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent
7442   Grigsby     George 1874-12-02      M  rep    AK          NaN

Now, think back to your original, full operation:

>>> df.groupby("state")["last_name"].count()
state
AK      16
AL     206
AR     117
AS       2
AZ      48
...

The apply stage, when applied to your single, subsetted DataFrame, would look like this:

>>> frame["last_name"].count()  # Count for state == 'AK'
16

You can see that the result, 16, matches the value for AK in the combined result.

The last step, combine, is the most self-explanatory. It simply takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.
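To make that concrete, here is a hand-rolled sketch of the combine step (not how Pandas implements it internally): apply the count to every sub-table yourself and stitch the results back into one Series:

combined = pd.Series(
    {state: frame["last_name"].count() for state, frame in by_state},
    name="last_name",
)
# combined holds the same values as df.groupby("state")["last_name"].count(),
# just without the "state" index name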

Example 2: Air Quality Dataset

The air quality dataset contains hourly readings from a gas sensor device in Italy. Missing values are denoted with -200 in the CSV file. You can use read_csv() to combine two columns into a timestamp while using a subset of the other columns:

import pandas as pd

df = pd.read_csv(
    "groupby-data/airqual.csv",
    parse_dates=[["Date", "Time"]],
    na_values=[-200],
    usecols=["Date", "Time", "CO(GT)", "T", "RH", "AH"]
).rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    }
).set_index("tstamp")

This produces a DataFrame with a DatetimeIndex and four float columns:

>>> df.head()
                      co  temp_c  rel_hum  abs_hum
tstamp
2004-03-10 18:00:00  2.6    13.6     48.9    0.758
2004-03-10 19:00:00  2.0    13.3     47.7    0.726
2004-03-10 20:00:00  2.2    11.9     54.0    0.750
2004-03-10 21:00:00  2.2    11.0     60.0    0.787
2004-03-10 22:00:00  1.6    11.2     59.6    0.789

Here, co is that hour’s average carbon monoxide reading, while temp_c, rel_hum, and abs_hum are the average temperature in Celsius, relative humidity, and absolute humidity over that hour, respectively. The observations run from March 2004 through April 2005:

>>> df.index.min()
Timestamp('2004-03-10 18:00:00')
>>> df.index.max()
Timestamp('2005-04-04 14:00:00')

So far, you’ve grouped on columns by specifying their names as str, such as df.groupby("state"). But .groupby() is a whole lot more flexible than this! You’ll see how next.

Grouping on Derived Arrays

Earlier you saw that the first parameter to .groupby() can accept several different arguments:

  • A column or list of columns
  • A dict or Pandas Series
  • A NumPy array or Pandas Index, or an array-like iterable of these

You can take advantage of the last option in order to group by the day of the week. You can use the index’s .day_name() to produce a Pandas Index of strings. Here are the first ten observations:

>>> day_names = df.index.day_name()
>>> type(day_names)
<class 'pandas.core.indexes.base.Index'>
>>> day_names[:10]
Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
       'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday'],
      dtype='object', name='tstamp')

You can then take this object and use it as the .groupby() key. In Pandas-speak, day_names is array-like. It’s a one-dimensional sequence of labels.

Note: For a Pandas Series, rather than an Index, you’ll need the .dt accessor to get access to methods like .day_name(). If ser is your Series, then you’d need ser.dt.day_name().

Now, pass that object to .groupby() to find the average carbon monoxide (co) reading by day of the week:

>>> df.groupby(day_names)["co"].mean()
tstamp
Friday       2.543
Monday       2.017
Saturday     1.861
Sunday       1.438
Thursday     2.456
Tuesday      2.382
Wednesday    2.401
Name: co, dtype: float64

The split-apply-combine process behaves largely the same as before, except that the splitting this time is done on an artificially-created column. This column doesn’t exist in the DataFrame itself, but rather is derived from it.

What if you wanted to group not just by day of the week, but by hour of the day? That result should have 7 * 24 = 168 observations. To accomplish that, you can pass a list of array-like objects. In this case, you’ll pass Pandas Int64Index objects:

>>> hr = df.index.hour
>>> df.groupby([day_names, hr])["co"].mean().rename_axis(["dow", "hr"])
dow        hr
Friday     0     1.936
           1     1.609
           2     1.172
           3     0.887
           4     0.823
                 ...
Wednesday  19    4.147
           20    3.845
           21    2.898
           22    2.102
           23    1.938
Name: co, Length: 168, dtype: float64

Here’s one more similar case that uses .cut() to bin the temperature values into discrete intervals:

>>> bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
>>> df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])
       rel_hum        abs_hum
          mean median    mean median
temp_c
cool    57.651   59.2   0.666  0.658
warm    49.383   49.3   1.183  1.145
hot     24.994   24.1   1.293  1.274

In this case, bins is actually a Series:

>>> type(bins)
<class 'pandas.core.series.Series'>
>>> bins.head()
tstamp
2004-03-10 18:00:00    cool
2004-03-10 19:00:00    cool
2004-03-10 20:00:00    cool
2004-03-10 21:00:00    cool
2004-03-10 22:00:00    cool
Name: temp_c, dtype: category
Categories (3, object): [cool < warm < hot]

Whether it’s a Series, NumPy array, or list doesn’t matter. What’s important is that bins still serves as a sequence of labels, one of cool, warm, or hot. If you really wanted to, then you could also use a Categorical array or even a plain-old list:

  • Native Python list: df.groupby(bins.tolist())
  • Pandas Categorical array: df.groupby(bins.values)

As you can see, .groupby() is smart and can handle a lot of different input types. Any of these would produce the same result because all of them function as a sequence of labels on which to perform the grouping and splitting.

Resampling

You’ve grouped df by the day of the week with df.groupby(day_names)["co"].mean(). Now consider something different. What if you wanted to group by an observation’s year and quarter? Here’s one way to accomplish that:

>>> # See an easier alternative below
>>> df.groupby([df.index.year, df.index.quarter])["co"].agg(
...     ["max", "min"]
... ).rename_axis(["year", "quarter"])
               max  min
year quarter
2004 1         8.1  0.3
     2         7.3  0.1
     3         7.5  0.1
     4        11.9  0.1
2005 1         8.7  0.1
     2         5.0  0.3

This whole operation can, alternatively, be expressed through resampling. One of the uses of resampling is as a time-based groupby. All that you need to do is pass a frequency string, such as "Q" for "quarterly", and Pandas will do the rest:

>>> df.resample("Q")["co"].agg(["max", "min"])
             max  min
tstamp
2004-03-31   8.1  0.3
2004-06-30   7.3  0.1
2004-09-30   7.5  0.1
2004-12-31  11.9  0.1
2005-03-31   8.7  0.1
2005-06-30   5.0  0.3

Often, when you use .resample() you can express time-based grouping operations in a much more succinct manner. The result may be a tiny bit different than the more verbose .groupby() equivalent, but you’ll often find that .resample() gives you exactly what you’re looking for.

Example 3: News Aggregator Dataset

Now you’ll work with the third and final dataset, which holds metadata on several hundred thousand news articles and groups them into topic clusters:

import datetime as dt
import pandas as pd

def parse_millisecond_timestamp(ts: int) -> dt.datetime:
    """Convert ms since Unix epoch to UTC datetime instance."""
    return dt.datetime.fromtimestamp(ts / 1000, tz=dt.timezone.utc)

df = pd.read_csv(
    "groupby-data/news.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)

To read it into memory with the proper dtypes, you need a helper function to parse the timestamp column. This is because it’s expressed as the number of milliseconds since the Unix epoch, rather than fractional seconds, which is the convention. Similar to what you did before, you can use the Categorical dtype to efficiently encode columns that have a relatively small number of unique values relative to the column length.

Each row of the dataset contains the title, URL, publishing outlet’s name, and domain, as well as the publish timestamp. cluster is a random ID for the topic cluster to which an article belongs. category is the news category and contains the following options:

  • b for business
  • t for science and technology
  • e for entertainment
  • m for health

Here’s the first row:

>>> df.iloc[0]
title       Fed official says wea...
url         http://www.latimes.co...
outlet             Los Angeles Times
category                           b
cluster     ddUyU0VZz0BRneMioxUPQ...
host                 www.latimes.com
tstamp      2014-03-10 16:52:50.6...
Name: 1, dtype: object

Now that you’ve had a glimpse of the data, you can begin to ask more complex questions about it.

Using Lambda Functions in .groupby()

This dataset invites a lot more potentially involved questions. I’ll throw a random but meaningful one out there: which outlets talk most about the Federal Reserve? Let’s assume for simplicity that this entails searching for case-sensitive mentions of "Fed". Bear in mind that this may generate some false positives with terms like “Federal Government.”

To count mentions by outlet, you can call .groupby() on the outlet, and then quite literally .apply() a function on each group:

>>> df.groupby("outlet", sort=False)["title"].apply(
...     lambda ser: ser.str.contains("Fed").sum()
... ).nlargest(10)
outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: int64

Let’s break this down since there are several method calls made in succession. Like before, you can pull out the first group and its corresponding Pandas object by taking the first tuple from the Pandas GroupBy iterator:

>>> title, ser = next(iter(df.groupby("outlet", sort=False)["title"]))
>>> title
'Los Angeles Times'
>>> ser.head()
1       Fed official says weak data caused by weather,...
486            Stocks fall on discouraging news from Asia
1124    Clues to Genghis Khan's rise, written in the r...
1146    Elephants distinguish human voices by sex, age...
1237    Honda splits Acura into its own division to re...
Name: title, dtype: object

In this case, ser is a Pandas Series rather than a DataFrame. That’s because you followed up the .groupby() call with ["title"]. This effectively selects that single column from each sub-table.

Next comes .str.contains("Fed"). This returns a Boolean Series that is True when an article title registers a match on the search. Sure enough, the first row starts with "Fed official says weak data caused by weather,..." and lights up as True:

>>> ser.str.contains("Fed")
1          True
486       False
1124      False
1146      False
1237      False
          ...
421547    False
421584    False
421972    False
422226    False
422905    False
Name: title, Length: 1976, dtype: bool

The next step is to .sum() this Series. Since bool is technically just a specialized type of int, you can sum a Series of True and False just as you would sum a sequence of 1 and 0:

>>> ser.str.contains("Fed").sum()
17

The result is the number of mentions of "Fed" by the Los Angeles Times in the dataset. The same routine gets applied for Reuters, NASDAQ, Businessweek, and the rest of the lot.

Improving the Performance of .groupby()

Let’s backtrack again to .groupby(...).apply() to see why this pattern can be suboptimal. To get some background information, check out How to Speed Up Your Pandas Projects. What may happen with .apply() is that it will effectively perform a Python loop over each group. While the .groupby(...).apply() pattern can provide some flexibility, it can also inhibit Pandas from otherwise using its Cython-based optimizations.

All that is to say that whenever you find yourself thinking about using .apply(), ask yourself if there’s a way to express the operation in a vectorized way. In that case, you can take advantage of the fact that .groupby() accepts not just one or more column names, but also many array-like structures:

  • A 1-dimensional NumPy array
  • A list
  • A Pandas Series or Index

Also note that .groupby() is a valid instance method for a Series, not just a DataFrame, so you can essentially inverse the splitting logic. With that in mind, you can first construct a Series of Booleans that indicate whether or not the title contains "Fed":

>>> mentions_fed = df["title"].str.contains("Fed")
>>> type(mentions_fed)
<class 'pandas.core.series.Series'>

Now, .groupby() is also a method of Series, so you can group one Series on another:

>>> import numpy as np
>>> mentions_fed.groupby(
...     df["outlet"], sort=False
... ).sum().nlargest(10).astype(np.uintc)
outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: uint32

The two Series don’t need to be columns of the same DataFrame object. They just need to be of the same shape:

>>> mentions_fed.shape
(422419,)
>>> df["outlet"].shape
(422419,)

Finally, you can cast the result back to an unsigned integer with np.uintc if you’re determined to get the most compact result possible. Here’s a head-to-head comparison of the two versions that will produce the same result:

# Version 1: using `.apply()`
df.groupby("outlet", sort=False)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)

# Version 2: using vectorization
mentions_fed.groupby(df["outlet"], sort=False).sum().nlargest(10).astype(np.uintc)

On my laptop, Version 1 takes 4.01 seconds, while Version 2 takes just 292 milliseconds. This is an impressive 14x difference in CPU time for a few hundred thousand rows. Consider how dramatic the difference becomes when your dataset grows to a few million rows!

Note: This example glazes over a few details in the data for the sake of simplicity. Namely, the search term "Fed" might also find mentions of things like “Federal government.”

Series.str.contains() also takes a compiled regular expression as an argument if you want to get fancy and use an expression involving a negative lookahead.
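For example, here is one possible pattern (our own illustration) that matches "Fed" but skips "Federal", trimming false positives like "Federal government":

import re

fed_pattern = re.compile(r"Fed(?!eral)")  # "Fed" not followed by "eral"
mentions_fed_strict = df["title"].str.contains(fed_pattern)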

You may also want to count not just the raw number of mentions, but the proportion of mentions relative to all articles that a news outlet produced.
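Since the mean of a Boolean Series is the fraction of True values, that only takes one more line; a quick sketch:

# Share of each outlet's titles that mention "Fed", largest first
mentions_fed.groupby(df["outlet"], sort=False).mean().nlargest(10)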

Pandas GroupBy: Putting It All Together

If you call dir() on a Pandas GroupBy object, then you’ll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a Pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.

Broadly, methods of a Pandas GroupBy object fall into a handful of categories:

  1. Aggregation methods (also called reduction methods) “smush” many data points into an aggregated statistic about those data points. An example is to take the sum, mean, or median of 10 numbers, where the result is just a single number.

  2. Filter methods come back to you with a subset of the original DataFrame. This most commonly means using .filter() to drop entire groups based on some comparative statistic about that group and its sub-table. It also makes sense to include under this definition a number of methods that exclude particular rows from each group.

  3. Transformation methods return a DataFrame with the same shape and indices as the original, but with different values. With both aggregation and filter methods, the resulting DataFrame will commonly be smaller in size than the input DataFrame. This is not true of a transformation, which transforms individual values themselves but retains the shape of the original DataFrame.

  4. Meta methods are less concerned with the original object on which you called .groupby(), and more focused on giving you high-level information such as the number of groups and indices of those groups.

  5. Plotting methods mimic the API of plotting for a Pandas Series or DataFrame, but typically break the output into multiple subplots.

The official documentation has its own explanation of these categories. They are, to some degree, open to interpretation, and this tutorial might diverge in slight ways in classifying which method falls where.
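As a quick, informal illustration using the Congress dataset from Example 1, here is one representative method from each of the first four categories:

by_state = df.groupby("state")

by_state["last_name"].count()                # aggregation: one value per group
by_state.filter(lambda grp: len(grp) > 500)  # filter: keep only rows of large groups
by_state["birthday"].transform("min")        # transformation: same length as df
by_state.ngroups                             # meta: how many groups there are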

Note: There’s also yet another separate table in the Pandas docs with its own classification scheme. Pick whichever works for you and seems most intuitive!


There are a few methods of Pandas GroupBy objects that don’t fall nicely into the categories above. These methods usually produce an intermediate object that is not a DataFrame or Series. For instance, df.groupby(...).rolling(...) produces a RollingGroupby object, which you can then call aggregation, filter, or transformation methods on (see the sketch after this list):

  • .expanding()
  • .pipe()
  • .resample()
  • .rolling()

Conclusion

In this tutorial, you’ve covered a ton of ground on .groupby(), including its design, its API, and how to chain methods together to get data in an output that suits your purpose.

You’ve learned:

  • How to use Pandas GroupBy operations on real-world data
  • How the split-apply-combine chain of operations works and how you can decompose it into steps
  • How methods of a Pandas GroupBy can be placed into different categories based on their intent and result

There is much more to .groupby() than you can cover in one tutorial. Check out the resources below and use the example datasets here as a starting point for further exploration!

You can download the source code for all the examples in this tutorial by clicking on the link below:

Download Datasets: Click here to download the datasets you'll use to learn about Pandas' GroupBy in this tutorial.

More Resources on Pandas GroupBy

Pandas documentation guides are user-friendly walk-throughs of different aspects of Pandas; the user guide sections on groupby are a good place to learn more about Pandas GroupBy.

The API documentation is a fuller technical reference to methods and objects.



Codementor: Solving CAPTCHA with Web automation

All you need to know about CAPTCHA bypassing.

Python Software Foundation: Why Sponsor PyCon 2020?

Sponsors help keep PyCon affordable and accessible to the widest possible audience. Sponsors are what make this conference possible. From low ticket prices to financial aid, to video recording, the organizations who step forward to support PyCon, in turn, support the entire Python community. They make it possible for so many to attend, for so many to be presenters, and for the people at home to watch along.

As a PyCon sponsor, your outreach to attendees begins before the conference even starts and lasts throughout the year. Your reach isn’t limited to the number of people who attend the conference itself. Following PyCon, you’ll continue to connect with the Python community through many touch points:
  • Playback and Recorded PyCon Coverage: Over 17,000 people have subscribed to PyCon’s YouTube channel with over 723,000 views of either the keynotes or the recorded videos of PyCon 2018 sessions. Those videos continue to attract viewers today.
  • Conference Page and Announcement E-Lists: The PyCon home page has thousands of unique visitors every year and this year we’ll have an opt-in at signup for a newsletter to receive PyCon announcements.
  • Social Media: The PSF and PyCon Twitter accounts have more than 332,000 followers combined, and the PyCon speakers have thousands of followers of their own.

A Silver Sponsorship is a great low-cost option that includes two full passes and job listing(s) on the Jobs Fair page! We also offer organizations with fewer than 25 employees a 30% discount for gold and silver sponsorships.

Check out the new marketing and promotional opportunities for 2020!
    We are also happy to customize a sponsorship package to give you the freedom to choose what you think works best in order to meet your event participation goals. Our sponsorship prospectus can be found here.

    For more information please contact: pycon-sponsors@python.org





             

    PyCon: Attention! Attention! Tutorial Proposal Deadline Approaching


    Tutorial Proposal Deadline is this Friday, November 22, 2019

    If you have been considering submitting a proposal, don’t hesitate, don’t wait, now is the time to submit your proposal!

    How to Get Started?
    • First, sign up for an account 
    • Once you are logged in proceed to your account dashboard and create a speaker profile.
    • At this point, you can submit tutorials. Fill in the fields as follows:
    Title
    Give your tutorial a name that accurately describes the tutorial’s focus to potential students.

    Description
    A high-level description of the tutorial, limited to ~400 characters. The description is used to describe your tutorial online should it be selected. So we ask that you make it brief.

    Audience
    What level of Python and other topic-specific experience or expertise is the tutorial aimed at?

    ‘Advanced’, ‘Intermediate’, and ‘Beginner’ mean something different to everyone. Feel free to include additional detail regarding the sort of background expected, as well as who may benefit. Reviewers need to know what level of Python experience is targeted and also what level(s) of domain-specific expertise is targeted, for example networking, SQL, database, etc. See our sample tutorial proposals for details.

    Format
    Please describe what portion of the tutorial you plan to spend on student exercises, lecture, or other activities. We don’t want precision: we just want to know what teaching tools you’ll use, and how interactive your tutorial will be. If you want to describe this via other means, feel free.
    NOTE: In past years, we instead requested submitters categorize their tutorials as ‘labs’, ‘workshops’, or ‘lectures’, but found everyone’s definition of those terms varied.

    Outline
    Your outline should list the topics and activities you will guide your students through during your 3 hour tutorial. You may wish to consult the markdown guide for styling. Please err on the side of ‘too much detail’ rather than ‘not enough’.

    You should also include timing notes, estimating what portion of your tutorial you’ll devote to each major topic (usually there are 2-5 of those).

    The outline will not be shared with conference attendees.

    What should my timing notes look like? How precise do I need to be?

    We request you provide a rough estimate of how much time (or percentage of the talk) you’ll dedicate to each major topic (not subtopics). We recommend these timings be no more precise than 30-minute increments, but we’ll allow some leeway. Please don’t give your timings down to the minute!

    Alternatively, you are welcome to provide the portion of time you expect to spend on each major topic. Please indicate whether you are using percentages or minutes.

    Read through the full details on the tutorial information page and look through the sample submissions for more information and guidance on submitting a proposal.

    Python Software Foundation: Python Software Foundation Fellow Members for Q3 2019

    We are happy to announce our newest PSF Fellow Members for Q3!

    Q3 2019


    Abigail Mesrenyame Dogbe

    Anton Caceres

    Bruno Oliveira

    Gautier Hayoun

    Mahmoud Hashemi

    Manabu Terada

    Mannie Young

    Michael Young

    Noah Alorwu

    Paul Kehrer

    Tom Viner

    Valentin Dombrovsky


    Congratulations! Thank you for your continued contributions. We have added you to our Fellow roster online.

    The above members have contributed to the Python ecosystem by teaching Python, maintaining popular libraries/tools such as cryptography and pytest, helping with documentation on packaging.python.org, organizing Python events, starting Python communities in their home countries, and overall being great mentors in our community. Each of them continues to help make Python more accessible around the world. To learn more about the new Fellow members, check out their links above.

    Let's continue to recognize Pythonistas all over the world for their impact on our community. Here's the criteria our Work Group uses to review nominations:

    • For those who have served the Python community by creating and/or maintaining various engineering/design contributions, the following statement should be true:
      • Nominated Person has served the Python community by making available code, tests, documentation, or design, either in a Python implementation or in a Python ecosystem project, that 1) shows technical excellence, 2) is an example of software engineering principles and best practices, and 3) has achieved widespread usage or acclaim.
    • For those who have served the Python community by coordinating, organizing, teaching, writing, and evangelizing, the following statement should be true:
      • Nominated Person has served the Python community through extraordinary efforts in organizing Python events, publicly promoting Python, and teaching and coordinating others. Nominated Person's efforts have shown leadership and resulted in long-lasting and substantial gains in the number and quality of Python users, and have been widely recognized as being above and beyond normal volunteering.
    • If someone is not accepted to be a fellow in the quarter they were nominated for, they will remain an active nominee for 1 year for future consideration.
    • It is suggested/recommended that the nominee have wide Python community involvement. Examples would be (not a complete list - just examples):
      • Someone who has received a Community Service Award or Distinguished Service Award
      • A developer that writes (more than one) documentation/books/tutorials for wider audience
      • Someone that helps translate (more than one) documentation/books/tutorials for better inclusivity
      • An instructor that teaches Python related tutorials in various regions
      • Someone that helps organize local meet ups and also helps organize a regional conference
    • Nominees should be aware of the Python community’s Code of Conduct and should have a record of fostering the community.
    • Sitting members of the PSF Board of Directors can be nominated if they meet the above criteria.
    If you would like to nominate someone to be a PSF Fellow, please send a description of their Python accomplishments and their email address to psf-fellow at python.org. We are accepting nominations for quarter 4 through November 20, 2019. More information is available at: https://www.python.org/psf/fellows/.

    Stack Abuse: Quicksort in Python


    Introduction

    Quicksort is a popular sorting algorithm and is often used right alongside Merge Sort. It's a good example of an efficient sorting algorithm, with an average complexity of O(n log n). Part of its popularity also derives from the ease of implementation.

    We will use simple integers in the first part of this article, but we'll give an example of how to change this algorithm to sort objects of a custom class.

    Quicksort is a representative of three types of sorting algorithms: divide and conquer, in-place, and unstable.

    • Divide and conquer: Quicksort splits the array into smaller arrays until it ends up with an empty array, or one that has only one element, before recursively sorting the larger arrays.
    • In place: Quicksort doesn't create any copies of the array or any of its subarrays. It does however require stack memory for all the recursive calls it makes.
    • Unstable: A stable sorting algorithm is one in which elements with the same value appear in the same relative order in the sorted array as they do before the array is sorted. An unstable sorting algorithm doesn't guarantee this; it can of course happen, but it isn't guaranteed.

    This is something that becomes important when you sort objects instead of primitive types. For example, imagine you have several Person objects that have the same age, e.g. Dave aged 21 and Mike aged 21. If you were to use Quicksort on a collection that contains both Dave and Mike, sorted by age, there is no guarantee that Dave will come before Mike every time you run the algorithm, and vice versa.

    Quicksort

    The basic version of the algorithm does the following:

    Divide the collection in two (roughly) equal parts by taking a pseudo-random element and using it as a pivot.

    Elements smaller than the pivot get moved to the left of the pivot, and elements larger than the pivot to the right of it.

    This process is repeated for the collection to the left of the pivot, as well as for the array of elements to the right of the pivot until the whole array is sorted.

    When we describe elements as "larger" or "smaller" than another element - it doesn't necessarily mean larger or smaller integers, we can sort by any property we choose.

    If we have a custom class Person, and each person has a name and age, we can sort by name (lexicographically) or by age (ascending or descending).

    How Quicksort Works

    Quicksort will, more often than not, fail to divide the array into equal parts. This is because the whole process depends on how we choose the pivot. We need to choose a pivot so that it's roughly larger than half of the elements, and therefore roughly smaller than the other half of the elements. As intuitive as this process may seem, it's very hard to do.

    Think about it for a moment - how would you choose an adequate pivot for your array? A lot of ideas about how to choose a pivot have been presented in Quicksort's history - randomly choosing an element, which doesn't work because of how "expensive" choosing a random element is while not guaranteeing a good pivot choice; picking an element from the middle; picking a median of the first, middle and last element; and even more complicated recursive formulas.
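
    As a rough illustration of one of those ideas, here is a minimal sketch of picking the median of the first, middle and last elements as the pivot index. It is not part of this article's implementation, and the function name is just a placeholder:

    def median_of_three_index(array, start, end):
        # Return the index of the median of array[start], array[mid] and array[end].
        mid = (start + end) // 2
        candidates = [(array[start], start), (array[mid], mid), (array[end], end)]
        candidates.sort(key=lambda pair: pair[0])
        return candidates[1][1]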

    The most straight-forward approach is to simply choose the first (or last) element. This leads to Quicksort, ironically, performing very badly on already sorted (or almost sorted) arrays.

    This is how most people choose to implement Quicksort and, since it's simple and this way of choosing the pivot is a very efficient operation (and we'll need to do it repeatedly), this is exactly what we will do.

    Now that we have chosen a pivot - what do we do with it? Again, there are several ways of going about the partitioning itself. We will have a "pointer" to our pivot, and a pointer to the "smaller" elements and a pointer to the "larger" elements.

    The goal is to move the elements around so that all elements smaller than the pivot are to its left, and all larger elements are to its right. The smaller and larger elements don't necessarily end up sorted, we just want them on the proper side of the pivot. We then recursively go through the left and right side of the pivot.

    A step by step look at what we're planning to do will help illustrate the process. Using the array shown below, we've chosen the first element as the pivot (29), and the pointer to the smaller elements (called "low") starts right after, and the pointer to the larger elements (called "high") starts at the end.

    • 29 is the first pivot, low points to 99 and high points to 44

    29 | 99 (low),27,41,66,28,44,78,87,19,31,76,58,88,83,97,12,21,44 (high)

    • We move high to the left until we find a value that's lower than our pivot.

    29 | 99 (low),27,41,66,28,44,78,87,19,31,76,58,88,83,97,12,21 (high),44

    • Now that our high variable is pointing to 21, an element smaller than the pivot, we want to find a value near the beginning of the array that we can swap it with. It doesn't make any sense to swap with a value that's also smaller than the pivot, so if low is pointing to a smaller element we try and find one that's larger.
    • We move our low variable to the right until we find an element larger than the pivot. Luckily, low was already positioned on 99.
    • We swap places of low and high:

    29 | 21 (low),27,41,66,28,44,78,87,19,31,76,58,88,83,97,12,99 (high),44

    • Right after we do this, we move high to the left and low to the right (since 21 and 99 are now in their correct places)
    • Again, we move high to the left until we reach a value lower than the pivot, which we find right away - 12
    • Now we search for a value larger than the pivot by moving low to the right, and we find the first such value at 41

    This process is continued until the low and high pointers finally meet in a single element:

    29 | 21,27,12,19,28 (low/high),44,78,87,66,31,76,58,88,83,97,41,99,44

    • We've got no more use of this pivot so the only thing left to do is to swap pivot and high and we're done with this recursive step:

    28,21,27,12,19,29,44,78,87,66,31,76,58,88,83,97,41,99,44

    As you can see, we have achieved that all values smaller than 29 are now to the left of 29, and all values larger than 29 are to the right.

    The algorithm then does the same thing for the 28,21,27,12,19 (left side) collection and the 44,78,87,66,31,76,58,88,83,97,41,99,44 (right side) collection.

    Implementation

    Sorting Arrays

    Quicksort is a naturally recursive algorithm - divide the input array into smaller arrays, move the elements to the proper side of the pivot, and repeat.

    Let's go through how a few recursive calls would look:

    • When we first call the algorithm, we consider all of the elements - from indexes 0 to n-1 where n is the number of elements in our array.
    • If our pivot ended up in position k, we'd then repeat the process for elements from 0 to k-1 and from k+1 to n-1.
    • While sorting the elements from k+1 to n-1, the current pivot would end up in some position p. We'd then sort the elements from k+1 to p-1 and p+1 to n-1, and so on.

    That being said, we'll utilize two functions - partition() and quick_sort(). The quick_sort() function will first partition() the collection and then recursively call itself on the divided parts.

    Let's start off with the partition() function:

    def partition(array, start, end):
        pivot = array[start]
        low = start + 1
        high = end
    
        while True:
            # If the current value we're looking at is larger than the pivot
            # it's in the right place (right side of pivot) and we can move left,
            # to the next element.
            # We also need to make sure we haven't surpassed the low pointer, since that
            # indicates we have already moved all the elements to their correct side of the pivot
            while low <= high and array[high] >= pivot:
                high = high - 1
    
            # Opposite process of the one above
            while low <= high and array[low] <= pivot:
                low = low + 1
    
            # We either found a value for both high and low that is out of order
            # or low is higher than high, in which case we exit the loop
            if low <= high:
                array[low], array[high] = array[high], array[low]
                # The loop continues
            else:
                # We exit out of the loop
                break
    
        array[start], array[high] = array[high], array[start]
    
        return high
    

    And finally, let's implement the quick_sort() function:

    def quick_sort(array, start, end):
        if start >= end:
            return
    
        p = partition(array, start, end)
        quick_sort(array, start, p-1)
        quick_sort(array, p+1, end)
    

    With both of them implemented, we can run quick_sort() on a simple array:

    array = [29,99,27,41,66,28,44,78,87,19,31,76,58,88,83,97,12,21,44]
    
    quick_sort(array, 0, len(array) - 1)
    print(array)
    

    Output:

    [12, 19, 21, 27, 28, 29, 31, 41, 44, 44, 58, 66, 76, 78, 83, 87, 88, 97, 99]
    

    Since the algorithm is unstable, there's no guarantee that these two 44s kept their original order relative to each other. Maybe they were originally switched - though this doesn't mean much in an integer array.

    Sorting Custom Objects

    There are a few ways you can rewrite this algorithm to sort custom objects in Python. A very Pythonic way would be to implement the comparison operators for a given class, which means that we wouldn't actually need to change the algorithm implementation since >, ==, <=, etc. would also work on our class object.
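
    For example, here is a minimal sketch of that first approach, using functools.total_ordering so that only __eq__ and __lt__ have to be written by hand. This hypothetical class is a variant of the Person class defined below, not code from the original article:

    from functools import total_ordering

    @total_ordering
    class ComparablePerson:
        def __init__(self, name, age):
            self.name = name
            self.age = age

        # Defining __eq__ and __lt__ (total_ordering fills in <=, > and >=)
        # lets the unmodified quick_sort()/partition() compare instances directly.
        def __eq__(self, other):
            return self.age == other.age

        def __lt__(self, other):
            return self.age < other.age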

    Another option would be to allow the caller to supply a method to our algorithm which would then be used to perform the actual comparison of the objects. Rewriting the algorithm in this way for use with custom objects is fairly straight-forward. Keep in mind however that the algorithm isn't stable.

    Let's start off with a Person class:

    class Person:
        def __init__(self, name, age):
            self.name = name
            self.age = age
    
        def __str__(self):
            return self.name
    

    This is a pretty basic class with only two properties, name and age. We want to use age as our sorting key, which we'll do by providing a custom lambda function to the sorting algorithm.

    But first, let's see how this provided function is used within the algorithm. Instead of doing a direct comparison with the <= or >= operators, we instead call the function to tell us which Person is higher in age - that is, compare_func(x, y) returns True when x is at least as old as y:

    def partition(array, start, end, compare_func):
        pivot = array[start]
        low = start + 1
        high = end
    
        while True:
            while low <= high and compare_func(array[high], pivot):
                high = high - 1
    
            while low <= high and not compare_func(array[low], pivot):
                low = low + 1
    
            if low <= high:
                array[low], array[high] = array[high], array[low]
            else:
                break
    
        array[start], array[high] = array[high], array[start]
    
        return high
    
    def quick_sort(array, start, end, compare_func):
        if start >= end:
            return
    
        p = partition(array, start, end, compare_func)
        quick_sort(array, start, p-1, compare_func)
        quick_sort(array, p+1, end, compare_func)
    

    And now, let's sort a collection of these objects. You can see that the object comparison is provided to the quick_sort call via a lambda, which does the actual comparison of the age property:

    p1 = Person("Dave", 21)
    p2 = Person("Jane", 58)
    p3 = Person("Matthew", 43)
    p4 = Person("Mike", 21)
    p5 = Person("Tim", 10)
    
    array = [p1,p2,p3,p4,p5]
    
    quick_sort(array, 0, len(array) - 1, lambda x, y: x.age >= y.age)
    for person in array:
        print(person)
    

    The output is:

    Tim
    Dave
    Mike
    Matthew
    Jane
    

    By implementing the algorithm in this way, it can be used with any custom object we choose, just as long as we provide an appropriate comparison function.
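
    For instance, assuming the same comparison convention as above, sorting the same list by name (lexicographically) instead of by age only requires a different lambda:

    quick_sort(array, 0, len(array) - 1, lambda x, y: x.name >= y.name)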

    Optimizations of Quicksort

    Given that Quicksort sorts "halves" of a given array independently, it's very convenient for parallelization. We can have a separate thread that sorts each "half" of the array, and we could ideally halve the time needed to sort it.

    However, Quicksort can have a very deep recursive call stack if we are particularly unlucky in our choice of a pivot, and parallelization isn't as efficient as it is with Merge Sort.

    It's recommended to use a simple, non-recursive algorithm for sorting small arrays. Even something simple like insertion sort is more efficient on small arrays than Quicksort. So ideally we could check whether our subarray has only a small number of elements (most recommendations say about 10 or less), and if so, we'd sort it with Insertion Sort instead.
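
    Here is a minimal sketch of how that cutoff might be wired in, reusing the partition() function from earlier. The insertion_sort() helper, the hybrid_quick_sort() name, and the threshold of 10 are illustrative choices, not code from the original article:

    def insertion_sort(array, start, end):
        # Sort array[start..end] in place with insertion sort.
        for i in range(start + 1, end + 1):
            current = array[i]
            j = i - 1
            while j >= start and array[j] > current:
                array[j + 1] = array[j]
                j = j - 1
            array[j + 1] = current

    def hybrid_quick_sort(array, start, end):
        if start >= end:
            return
        # Fall back to insertion sort for small subarrays.
        if end - start + 1 <= 10:
            insertion_sort(array, start, end)
            return
        p = partition(array, start, end)
        hybrid_quick_sort(array, start, p - 1)
        hybrid_quick_sort(array, p + 1, end)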

    A popular variation of Quicksort is the Multi-pivot Quicksort, which breaks up the original array into n smaller arrays, using n-1 pivots. However, most of the time only two pivots are used, not more.

    Fun fact: Dual-pivot Quicksort, along with Insertion Sort for smaller arrays, was used in Java 7's sorting implementation.

    Conclusion

    As we have previously mentioned, the efficiency of Quicksort depends highly on the choice of pivot - it can "make or break" the algorithm's time (and stack space) complexity. The instability of the algorithm is also something that can be a deal breaker when using custom objects.

    However, despite all this, Quicksort's average time complexity of O(n log n), relatively low space usage, and simple implementation make it a very efficient and popular algorithm.

    If you want to learn more, check out our other article, Sorting Algorithms in Python, which covers more sorting algorithms in Python, but not as in-depth.

    Test and Code: 94: The real 11 reasons I don't hire you - Charity Majors


    You've applied for a job, maybe lots of jobs.
    Depending on the company, you've gotta get through:

    • a resume review
    • a coding challenge
    • a phone screen
    • maybe another code example
    • an in person interview

    If you get the job, and you enjoy the work, awesome, congratulations.

    If you don't get the job, it'd be really great to know why.

    Sometimes it isn't because you aren't a skilled engineer.

    What other reasons are there?

    Well, that's what we're talking about today.

    Charity Majors is the cofounder and CTO of Honeycomb.io, and we're going to talk about reasons for not hiring someone.

    This is a very informative episode both for people who job hunt in the future and for hiring managers and people on the interview team.

    Special Guest: Charity Majors.

    Sponsored By:

    • PyCharm Professional: Try PyCharm Pro before Nov 28 with a 4 month free trial. Promo Code: TESTNCODE2019

    Support Test & Code: Python Software Testing & Engineering

    Links:

    • The (Real) 11 Reasons I Don’t Hire You - the article

    Python Engineering at Microsoft: Python in Visual Studio Code – November 2019 Release


    We are pleased to announce that the November 2019 release of the Python Extension for Visual Studio Code is now available. You can  download the Python extension from the Marketplace, or install it directly from the extension gallery in Visual Studio Code. If you already have the Python extension installed, you can also get the latest update by restarting Visual Studio Code. You can learn more about  Python support in Visual Studio Code in the documentation.  

    In this release we focused mostly on product quality. We closed a total of 60 issues, 39 of them being bug fixes. However, we’re also pleased to deliver delightful features such as: 

    • Add imports “quick fix” when using the Python Language Server 
    • Altair plot support  
    • Line numbers in the Notebook Editor 

    If you’re interested, you can check the full list of improvements in our changelog. 

    Add Imports “Quick Fix” when using the Python Language Server 

    We’re excited to announce that we have brought the magic of automatic imports to Python developers in VS Code by way of an add imports quick fix. Automatic imports functionality was one of the most requested features on our GitHub repo (GH21), and when you enable the Microsoft Language Server, you will get this new functionality. To enable the Language Server, add the setting python.jediEnabled: false to your settings.json file. 

    The add imports quick fix within VS Code is triggered via a code action lightbulb. To use the quick fix, begin typing a package name within the editor for which you do not have an import statement at the header of the file. You will notice that if a code action is available for this package (i.e. you have a module installed within your environment with the name you’ve supplied), a yellow squiggle will appear. If you hover over that text, a code action lightbulb will appear, indicating that an import code action is available for the package. You’ll see a list of potential imports (again, based on what’s installed within your environment), allowing you to choose the package that you wish to import. 

    Example of auto import suggestion for path submodule

    The add imports code action will also recognize some of the most popular abbreviations for the following Python packages: numpy as np, tensorflow as tf, pandas as pd, matplotlib.pyplot as plt, matplotlib as mpl, math as m, scipy.io as spio, and scipy as sp. 

    Example of auto completions suggestions behaviour

    The import suggestion list is ordered such that all import statements that appear at the top of the list are package (or module) imports; those that appear lower in the list are import statements for additional modules and/or members (e.g. classes, objects, etc.) from specified packages. 

    Import suggestion for sys module

    Make sure you have linting enabled since this functionality is tied to the Language Server linting capability. You can enable linting by opening the Command Palette (View > Command Palette…), running the “Python: Enable Linting” command and selecting “On” in the drop-down menu. 

    Altair plots support

    The Notebook Editor and the Python Interactive window now both support rendering plots built with Altair,  a declarative statistical visualization library for Python.

    Jupyter Notebook example displaying Altair support

    Line Numbers in the Notebook Editor

    Line numbers are now supported in the notebook editor. On selected code cells, you can toggle the line numbers by pressing the “L” key.

    Other Changes and Enhancements

    We have also added small enhancements and fixed issues requested by users that should improve your experience working with Python in Visual Studio Code. Some notable changes include:

    • Fix running a unittest file to not execute only the first test. (thanks Nikolay Kondratyev). (#4567)
    • Added commands translation for Farsi and Turkish (thanks Nikronic). (#8092)
    • Added command translations for Turkish (thanks alioguzhan). (#8320)
    • Place all plots on a white background regardless of theme. (#8000)

    We are continuing to A/B test new features, so if you see something different that was not announced by the team, you may be part of the experiment! To see if you are part of an experiment, you can check the first lines in the Python extension output channel. If you wish to opt-out of A/B testing, you can open the user settings.json file (View > Command Palette… and run Preferences: Open Settings (JSON)) and set the “python.experiments.enabled” setting to false.

    Be sure to download the Python extension for Visual Studio Code now to try out the above improvements. If you run into any problems, please file an issue on the Python VS Code GitHub page.

    The post Python in Visual Studio Code – November 2019 Release appeared first on Python.

    Podcast.__init__: From Simple Script To Beautiful Web Application With Streamlit


    Summary

    Building well designed and easy to use web applications requires a significant amount of knowledge and experience across a range of domains. This can act as an impediment to engineers who primarily work in so-called back-end technologies such as machine learning and systems administration. In this episode Adrien Treuille describes how the Streamlit framework empowers anyone who is comfortable writing Python scripts to create beautiful applications to share their work and make it accessible to their colleagues and customers. If you have ever struggled with hacking together a simple web application to make a useful script self-service then give this episode a listen and then go experiment with how Streamlit can level up your work.

    Announcements

    • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
    • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
    • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at pythonpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
    • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
    • Your host as usual is Tobias Macey and today I’m interviewing Adrien Treuille about Streamlit, an open source app framework built for machine learning and data science teams

    Interview

    • Introductions
    • How did you get introduced to Python?
    • Can you start by explaining what Streamlit is and its origin story?
    • What are some of the types of applications that are commonly built by data teams and who are the typical consumers of those projects?
    • What are some of the challenges or complications that are unique to this problem space?
    • What are some of the complications or challenges that you have faced to integrate Streamlit with so many different machine learning frameworks?
    • Can you describe the technical implementation of Streamlit and how it has evolved since you began working on it?
      • How did you approach the design of the API and development workflow to tailor it for the needs and capabilities of machine learning engineers?
      • If you were to start the project from scratch today what would you do differently?
    • What is a typical workflow for someone working on a machine learning application and how does Streamlit fit in?
      • What are some of the types of tools or processes that it replaces?
    • What are some of the most interesting or unexpected ways that you have seen Streamlit used?
    • What have you found to be some of the most challenging or unexpected aspects of building and evolving Streamlit?
    • How do you see Python evolving in light of Streamlit and other work in the machine learning space?
    • What do you have in store for the future of Streamlit or any adjacent products and services?
    • How are you approaching the governance and sustainability of the Streamlit open source project?

    Keep In Touch

    Picks

    Closing Announcements

    • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
    • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
    • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

    Links

    The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

    The No Title® Tech Blog: Book review – Supercharged Python, by Brian Overland and John Bennet


    If you have been following beginner or even intermediate guides on Python and are starting to feel the need for more advanced learning, this book may be the one you have been looking for.


    Codementor: teach your kids to build their own game with Python - 1

    how to teach your kids to build their own game with Python

    Codementor: Lesson 101: Everything You Need To Learn About Programming Guidance

    Myhomeworkhelponline Experts provides the most trusted and reliable online Programming assignment help . Programming is one of the most widely taught subjects across the universities.

    Real Python: Threading in Python


    Python threading allows you to have different parts of your program run concurrently and can simplify your design. If you’ve got some experience in Python and want to speed up your program using threads, then this course is for you!

    In this article, you’ll learn:

    • What threads are
    • How to create threads and wait for them to finish
    • How to use a ThreadPoolExecutor
    • How to avoid race conditions
    • How to use the common tools that Python threading provides

    This course assumes you’ve got the Python basics down pat and that you’re using at least version 3.6 to run the examples. If you need a refresher, you can start with the Python Learning Paths and get up to speed.

    If you’re not sure if you want to use Python threading, asyncio, or multiprocessing, then you can check out Speed Up Your Python Program With Concurrency.
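
    As a tiny taste of the first two bullets above, here is a minimal sketch (assuming Python 3.6+ for the f-strings; the worker name is just a placeholder) of creating a thread and waiting for it to finish:

    import threading
    import time

    def worker(name):
        # Simulate some work happening in a separate thread.
        print(f"{name} starting")
        time.sleep(1)
        print(f"{name} finished")

    thread = threading.Thread(target=worker, args=("worker-1",))
    thread.start()  # worker() now runs concurrently with the main thread
    thread.join()   # wait for it to finish before moving on
    print("all done")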



    Sumana Harihareswara - Cogito, Ergo Sumana: My New Title, Improving pip, Availability For Work, And SSL (No, The Other One)

    A few professional announcements.

    Seeking developers for paid contract on pip; apply by Nov. 22

    One is that I helped the Packaging Working Group of the Python Software Foundation get funding for a long-needed improvement to pip. I led the writing of a few proposals -- grantwriting, to oversimplify -- and, starting possibly as soon as next month, contractors will start work. As Dustin Ingram explains:

    Big news: the Python Packaging Working Group has secured >$400K in grants from multiple funders (TBA) to improve one of the most fundamental parts of pip: its dependency resolver. https://pyfound.blogspot.com/2019/11/seeking-developers-for-paid-contract.html

    The dependency resolver is the algorithm which takes multiple constrained requirements (e.g. "some_package>=1.0,<=2.0") and finds a version of all dependencies (and sub-dependencies) which satisfy all the constraints.
    https://pip.pypa.io/en/stable/user_guide/#requirements-files

    Right now, pip's resolver mostly works for most use cases... However the algorithm it uses is naïve, and isn't always guaranteed to produce an optimal (or correct) result.

    .....

    These funds will pay multiple developers to work on completing the design, implementation and rollout of this new dependency resolver for pip, finally closing issue #988.

    Not only will this give pip a better resolver, but it will "enable us to untangle pip’s internals from the resolver, enabling pip to share code for dependency resolution with other packaging tooling". https://pradyunsg.me/blog/2019/06/23/oss-update-1/

    This is great news for pip and Python packaging in general. Huge shout out to @pradyunsg for his existing work on the resolver issue and guidance here, and to @brainwane for all her tireless work acquiring and directing funding for Python projects.

    If you or your organization is interested in participating in this project, we've just posted the RFP, which includes instructions for submitting proposals, evaluation criteria and scope of work.
    https://github.com/python/request-for/blob/master/2020-pip/RFP.md

    If you're interested, please apply by 22 November.

    NYU, Secure Systems Lab, and my new title

    [Photo by NYU publicity - working at the new space on NYU Tandon's campus, left to right: Sumana Harihareswara, a volunteer with the PSF's Packaging Working Group, a contracted project manager for the Python Packaging Index, and a visiting scholar in NYU Tandon Professor Justin Cappos's Secure Systems Lab; Stephanie Whited, communications director for the Tor Project and visiting researcher in the Secure System Lab; and Santiago Torres, a computer science doctoral candidate working in the Secure Systems Lab.]

    In further news: I am now a visiting scholar in Professor Justin Cappos's Secure Systems Lab at New York University's Tandon School of Engineering. And I get to use an office with a door, shelves, whiteboards, and so on (per the picture at right). If you contribute to Python packaging/distribution tools and live in/near or sometimes visit New York City, let me know and perhaps we could cowork a bit?

    The Secure Systems Lab stewards The Update Framework (TUF) and related projects, and works to improve the security of the software supply chain. The Python Package Index is likely going to implement TUF to add cryptographic signatures to packages on PyPI, and so I've gotten to give TUF's developers some advice to help that work move along. (I won't be the manager on that project but I'll be watching with great interest.) PyPA may also choose to use more of SSL's work in implementing further security improvements to the package distribution toolchain, and I'm learning more to work out whether and how that could happen. Also, Cappos's research on backtracking dependency resolvers has been helpful to the pip resolver work.

    Edited 19 Nov 2019 to clarify role.

    PSF projects

    I'm grateful to get to help connect the Python Software Foundation with more resources and volunteers. Changeset's current and recent projects have mostly been for the PSF. Last month we finished accessibility, security, and internationalization work on PyPI that was funded by the Open Technology Fund, and Changeset's work on communicating about the sunsetting of Python 2.x continues and will go through April 2020.

    Availability for one-day engagements in San Francisco in February

    But I am interested in taking on new clients for short engagements starting in February 2020. In particular, I will be in the San Francisco Bay Area in mid- to late February. If you're in SF or nearby, I could offer you a one-day engagement doing one of the following:

    • developing a contributor outreach/intake strategy
    • researching potential funders and writing a rough draft of a grant proposal
    • auditing and improving your developer onboarding documents

    I'd spend a little time talking with you, then sit in your office and finish the document before leaving that afternoon. (Photo at right provides a sample of how I look while sitting.) Drop me a line for a free initial 30-minute chat and we can talk pricing.

    Catalin George Festila: Python 3.7.5 : Display a file in the hexadecimal and binary output.

    This is an example with a few Python 3 modules that displays a file in hexadecimal or binary output:

    import sys
    import os.path
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("FILE", help="the file that you wish to dump to hexadecimal",
                        type=str)
    parser.add_argument("-b", "--binary",
                        help="display bytes in binary format instead of hexadecimal")
    args = parser.parse_args()
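
    The feed teaser cuts off before the part that actually prints the dump. A minimal, hypothetical continuation might look like the following; the formatting choices are assumptions, not the author's code:

    with open(args.FILE, "rb") as f:
        data = f.read()

    # "{:08b}" renders each byte as 8 binary digits, "{:02x}" as two hex digits.
    fmt = "{:08b}" if args.binary else "{:02x}"
    print(" ".join(fmt.format(byte) for byte in data))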