Last year, I poured my heart and soul into a 4-hour video series teaching machine learning in Python with scikit-learn. It has been popular among beginner and intermediate scikit-learn users, and has accumulated over one million minutes of "watch time" on YouTube.
This year, I decided to create a series covering another key library in the PyData ecosystem: pandas.
Why learn pandas?
pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. If you're working with data in Python and you're not using pandas, you're probably working too hard!
There are many things to like about pandas: It's well-documented, has a huge amount of community support, is under active development, and plays well with other Python libraries (such as matplotlib, scikit-learn, and seaborn).
There are also things you might not like: pandas has an overwhelming amount of functionality (so it's hard to know where to start), and it provides too many ways to accomplish the same task (so it's hard to figure out the best practices).
That's why I created this series. I've been using and teaching pandas for a long time, so I know how to explain it in a way that novices can understand.
About the video series
You don't need to have any pandas experience to benefit from this series, but you do need to know the basics of Python.
In each video, I answer a question from one of my students using a real dataset. Since I've posted the data online, and pandas can read files directly from a URL, you can follow along with every video at home!
Every video in the series is embedded below. New videos come out every Tuesday and Thursday, and there will be at least 30 videos. (Subscribe on YouTube for notifications.)
There's also a well-commented IPython/Jupyter notebook containing the code from every video, and a GitHub repository containing all of the datasets.
Do you have a question about pandas, or a task you would like to accomplish? Let me know in the comments section!
List of videos (as of May 10)
1. What is pandas? (Introduction to the Q&A series) (6:24)
pandas is a full-featured Python library for data analysis, manipulation, and visualization. This video series is for anyone who wants to work with data in Python, regardless of whether you are brand new to pandas or have some experience. Each video will answer a student question about pandas using a real dataset, which is available online so you can follow along!
2. How do I read a tabular data file into pandas? (8:54)
"Tabular data" is just data that has been formatted as a table, with rows and columns (like a spreadsheet). You can easily read a tabular data file into pandas, even directly from a URL! In this video, I'll walk you through how to do that, including how to modify some of the default arguments of the read_table function to solve common problems.
3. How do I select a pandas Series from a DataFrame? (11:10)
DataFrames and Series are the two main object types in pandas for data storage: a DataFrame is like a table, and each column of the table is called a Series. You will often select a Series in order to analyze or manipulate it. In this video, I'll show you how to select a Series using "bracket notation" and "dot notation", and will discuss the limitations of dot notation. I'll also demonstrate how to create a new Series in a DataFrame.
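As a rough illustration (the DataFrame below is made up, not the dataset from the video), the two notations and the main limitation of dot notation look like this:

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "State": ["NY", "CO"],
    "Colors Reported": ["RED", None],
})

city = ufo["City"]   # bracket notation always works
state = ufo.State    # dot notation works for simple column names

# Dot notation cannot be used when the name contains a space,
# so bracket notation is required here.
colors = ufo["Colors Reported"]

# Creating a new Series in the DataFrame also requires bracket notation.
ufo["Location"] = ufo["City"] + ", " + ufo["State"]
print(ufo)
```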
4. Why do some pandas commands end with parentheses (and others don't)? (8:45)
To access most of the functionality in pandas, you have to call the methods and attributes of DataFrame and Series objects. In this video, I'll discuss some common methods and attributes, and show you how to tell the difference between them. (Hint: It's all about the parentheses!)
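A small sketch of the difference, using a made-up DataFrame: attributes are looked up without parentheses, while methods are called with them.

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [142, 175, 200],
})

# Attributes describe the object: no parentheses.
n_rows, n_cols = movies.shape
column_types = movies.dtypes

# Methods perform an action: parentheses, with optional arguments.
first_rows = movies.head(2)
mean_duration = movies["duration"].mean()

print(n_rows, n_cols)
print(first_rows)
```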
5. How do I rename columns in a pandas DataFrame? (9:36)
You will often want to rename the columns of a DataFrame so that their names are descriptive, easy to type, and don't contain any spaces. In this video, I'll demonstrate three different strategies for renaming columns so that you can choose the best strategy to fit your particular situation.
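The sketch below shows three common ways to rename columns. These are not necessarily the same three strategies covered in the video, and the column names are made up.

```python
import pandas as pd

# Hypothetical DataFrame whose column names contain spaces.
df = pd.DataFrame({"Colors Reported": ["RED"], "Shape Reported": ["TRIANGLE"]})

# 1. Rename selected columns with a mapping of old name -> new name.
renamed = df.rename(columns={"Colors Reported": "colors_reported"})

# 2. Overwrite every column name at once.
overwritten = df.copy()
overwritten.columns = ["colors", "shape"]

# 3. Apply a string method to all column names (here: spaces -> underscores).
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.replace(" ", "_", regex=False)

print(list(renamed.columns))
print(list(overwritten.columns))
print(list(cleaned.columns))
```

The first approach is handy when only a few names need to change; the other two suit a wholesale cleanup.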
6. How do I remove columns from a pandas DataFrame? (6:35)
If you have DataFrame columns that you're never going to use, you may want to remove them entirely in order to focus on the columns that you do use. In this video, I'll show you how to remove columns (and rows), and will briefly explain the meaning of the "axis" and "inplace" parameters.
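Here's a minimal sketch of dropping columns and rows with the `drop` method. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Abilene"],
    "State": ["NY", "CO", "KS"],
    "Time": ["22:00", "02:00", "14:00"],
})

# axis=1 means "look for this label among the columns".
no_time = df.drop("Time", axis=1)

# axis=0 means "look for these labels among the rows (the index)".
no_first_rows = df.drop([0, 1], axis=0)

# inplace=True modifies the DataFrame itself instead of returning a new one.
trimmed = df.copy()
trimmed.drop(["City", "State"], axis=1, inplace=True)

print(no_time)
print(no_first_rows)
print(trimmed)
```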
7. How do I sort a pandas DataFrame or a Series? (8:56)
pandas allows you to sort a DataFrame by one of its columns (each of which is a Series), and also allows you to sort a Series on its own. The sorting API changed in pandas version 0.17, so in this video, I'll demonstrate both the "old way" and the "new way" to sort. I'll also show you how to sort a DataFrame by multiple columns at once!
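This sketch uses only the "new way" (`sort_values`), since the pre-0.17 style is not available in recent pandas releases. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["R", "PG", "R", "PG"],
    "duration": [120, 95, 100, 88],
})

# Sort the DataFrame by one column, longest first.
by_duration = movies.sort_values("duration", ascending=False)

# Sort a single Series on its own.
titles_sorted = movies["title"].sort_values()

# Sort by multiple columns: first by rating, then by duration within each rating.
by_rating_then_duration = movies.sort_values(["rating", "duration"])

print(by_duration)
print(by_rating_then_duration)
```

None of these calls change `movies` itself; each returns a sorted copy.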
8. How do I filter rows of a pandas DataFrame by column value? (13:44)
Let's say that you only want to display the rows of a DataFrame which have a certain column value. How would you do it? pandas makes it easy, but the notation can be confusing and thus difficult to remember. In this video, I'll work up to the solution step-by-step using regular Python code so that you can truly understand the logic behind pandas filtering notation.
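The step-by-step idea can be sketched roughly as follows, with made-up data: build a list of booleans with a plain Python loop, turn it into a Series, and then compare that with the shorter pandas notation.

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "duration": [142, 229, 96],
})

# Long way: one True/False value per row, built with a regular for loop.
booleans = []
for length in movies["duration"]:
    booleans.append(length >= 200)
is_long = pd.Series(booleans)

# Passing a boolean Series in brackets keeps only the rows marked True.
long_movies_v1 = movies[is_long]

# Short way: the comparison itself produces the same boolean Series.
long_movies_v2 = movies[movies["duration"] >= 200]

print(long_movies_v2)
```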
9. How do I apply multiple filter criteria to a pandas DataFrame? (9:51)
Let's say that you want to filter the rows of a DataFrame by multiple conditions. In this video, I'll demonstrate how to do this using two different logical operators. I'll also explain the special rules in pandas for combining filter criteria, and end with a trick for simplifying chained conditions!
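A short sketch with made-up data: `&` means "and", `|` means "or", and each condition needs its own parentheses. The `isin` call at the end is one common way to simplify a chain of "or" conditions, and is shown here as an assumption about the kind of trick that helps.

```python
import pandas as pd

# Hypothetical data for illustration.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Crime", "Drama", "Drama", "Action"],
    "duration": [175, 90, 207, 170],
})

# Both conditions must be true (note the parentheses around each one).
long_dramas = movies[(movies["duration"] >= 200) & (movies["genre"] == "Drama")]

# At least one condition must be true.
long_or_drama = movies[(movies["duration"] >= 200) | (movies["genre"] == "Drama")]

# Instead of chaining (genre == "Crime") | (genre == "Action"), use isin.
crime_or_action = movies[movies["genre"].isin(["Crime", "Action"])]

print(long_dramas)
print(long_or_drama)
print(crime_or_action)
```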
10. Your pandas questions answered! (9:06)
In this video, I'm answering a few of the pandas questions I've received in the YouTube comments:
- When reading from a file, how do I read in only a subset of the columns or rows?
- How do I iterate through a Series or a DataFrame?
- How do I drop all non-numeric columns from a DataFrame?
- How do I know whether I should pass an argument as a string or a list?
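The first three questions can be sketched as follows. The CSV text is made up, and these are just one reasonable answer to each question rather than the exact code from the video.

```python
import io
import pandas as pd

# Hypothetical CSV data (a stand-in for a real file or URL).
raw = (
    "City,State,Sightings,Hours\n"
    "Ithaca,NY,3,1.5\n"
    "Holyoke,CO,1,0.5\n"
    "Abilene,KS,2,2.0\n"
)

# Read only some of the columns and only the first two rows.
subset = pd.read_csv(io.StringIO(raw), usecols=["City", "State"], nrows=2)

full = pd.read_csv(io.StringIO(raw))

# Iterate through a Series like any other Python iterable.
cities = [city for city in full["City"]]

# Iterate through a DataFrame one row at a time.
rows = [(index, row["City"], row["State"]) for index, row in full.iterrows()]

# Keep only the numeric columns (which drops the non-numeric ones).
numeric_only = full.select_dtypes(include="number")

print(subset)
print(rows)
print(numeric_only)
```

Note that `usecols` takes a list here because it names several columns; as a loose rule of thumb, a parameter that can describe more than one thing will usually accept a list.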
11. How do I use the "axis" parameter in pandas? (8:33)
When performing operations on a pandas DataFrame, such as dropping columns or calculating row means, it is often necessary to specify the "axis". But what exactly is an axis? In this video, I'll help you to build a mental model for understanding the axis parameter so that you will know when and how to use it.
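One way to picture it, using made-up numbers: `axis=0` moves down the rows and produces one result per column, while `axis=1` moves across the columns and produces one result per row.

```python
import pandas as pd

# Hypothetical data for illustration.
drinks = pd.DataFrame({"beer": [0, 89, 25], "wine": [0, 54, 14]})

# axis=0: collapse the rows, leaving one mean per column.
column_means = drinks.mean(axis=0)

# axis=1: collapse the columns, leaving one mean per row.
row_means = drinks.mean(axis=1)

# The string alias "columns" means the same thing as axis=1.
also_row_means = drinks.mean(axis="columns")

print(column_means)
print(row_means)
```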
12. How do I use string methods in pandas? (coming May 12)
pandas includes powerful string manipulation capabilities that you can easily apply to any Series of strings. In this video, I'll show you how to access string methods in pandas (along with a few examples), and then end with two bonus tips to help you maximize your efficiency.
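As a quick sketch with made-up data, string methods are reached through the `.str` accessor on a Series, and they can be chained:

```python
import pandas as pd

# Hypothetical data for illustration.
orders = pd.DataFrame({"item_name": ["Chicken Bowl", "[Steak Burrito]", "Chips"]})

# Apply a string method to every value in the Series.
upper = orders["item_name"].str.upper()

# Some string methods return booleans, which are useful for filtering.
has_chicken = orders["item_name"].str.contains("Chicken")

# Chain string methods to remove both bracket characters.
cleaned = (
    orders["item_name"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
)

print(upper)
print(orders[has_chicken])
print(cleaned)
```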
13. How do I change the data type of a pandas Series? (coming May 17)
Have you ever tried to do math with a pandas Series that you thought was numeric, but it turned out that your numbers were stored as strings? In this video, I'll demonstrate two different ways to change the data type of a Series so that you can fix incorrect data types. I'll also show you the easiest way to convert a boolean Series to integers, which is useful for creating dummy/indicator variables for machine learning.
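Here's a rough sketch of two ways to fix a data type, plus the boolean-to-integer conversion. The CSV text is made up, and reading the prices as strings is forced on purpose to simulate the problem.

```python
import io
import pandas as pd

# Hypothetical CSV data (a stand-in for a real file or URL).
raw = "item,price\nBowl,8.49\nChips,2.39\n"

# Simulate the problem: prices read in as strings instead of numbers.
orders = pd.read_csv(io.StringIO(raw), dtype={"price": str})

# Way 1: convert an existing Series with astype.
orders["price"] = orders["price"].astype(float)

# Way 2: set the type while reading the file.
typed = pd.read_csv(io.StringIO(raw), dtype={"price": float})

# Convert a boolean Series to 0/1 integers (an indicator variable).
is_expensive = (orders["price"] > 5).astype(int)

print(orders.dtypes)
print(is_expensive)
```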
14. When should I use a "groupby" in pandas? (coming May 19)
The pandas "groupby" method allows you to split a DataFrame into groups, apply a function to each group independently, and then combine the results back together. This is called the "split-apply-combine" pattern, and is a powerful tool for analyzing data across different categories. In this video, I'll explain when you should use a groupby and then demonstrate its flexibility using four different examples.
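A minimal split-apply-combine sketch, using made-up numbers: split the rows by continent, apply an aggregation to each group, and get the results back in one object.

```python
import pandas as pd

# Hypothetical data for illustration.
drinks = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "beer": [200, 100, 20, 40],
})

# One aggregation per group: mean beer servings for each continent.
mean_by_continent = drinks.groupby("continent")["beer"].mean()

# Several aggregations at once, one column per function.
summary = drinks.groupby("continent")["beer"].agg(["count", "min", "max", "mean"])

print(mean_by_continent)
print(summary)
```

A groupby tends to be the right tool whenever the question is of the form "for each category, what is the ...?"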
P.S. Want to be the first to know when I launch an online pandas course? Subscribe to the Data School newsletter.