One of the most common questions you might have when entering the world of pandas is how to iterate over rows in a pandas DataFrame. If you’ve gotten comfortable using loops in core Python, then this is a perfectly natural question to ask.
While iterating over rows is relatively straightforward with .itertuples() or .iterrows(), that doesn’t necessarily mean iteration is the best way to work with DataFrames. In fact, while iteration may be a quick way to make progress, relying on iteration can become a significant roadblock when it comes to being effective with pandas.
In this tutorial, you’ll learn how to iterate over the rows in a pandas DataFrame, but you’ll also learn why you probably don’t want to. Generally, you’ll want to avoid iteration because it comes with a performance penalty and goes against the way of the panda.
To follow along with this tutorial, you can download the datasets and code samples from the following link:
Free Sample Code: Click here to download the free sample code and datasets that you’ll use to explore iterating over rows in a pandas DataFrame vs using vectorized methods.
The last bit of prep work is to spin up a virtual environment and install a few packages:
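The exact commands vary by platform, but on Linux or macOS a typical session might look something like this, where venv is just a conventional name for the environment directory:

$ python -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas httpx codetiming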
The pandas installation won’t come as a surprise, but you may wonder about the others. You’ll use the httpx package to carry out some HTTP requests as part of one example, and the codetiming package to make some quick performance comparisons.
With that, you’re ready to get stuck in and learn how to iterate over rows, why you probably don’t want to, and what other options to rule out before resorting to iteration.
How to Iterate Over DataFrame Rows in pandas
While uncommon, there are some situations in which you can get away with iterating over a DataFrame. These situations are typically ones where you:
- Need to feed the information from a pandas DataFrame sequentially into another API
- Need the operation on each row to produce a side effect, such as an HTTP request
- Have complex operations to carry out involving various columns in the DataFrame
- Don’t mind the performance penalty of iteration, maybe because working with the data isn’t the bottleneck, the dataset is very small, or it’s just a personal project
For instance, imagine you have a list of URLs in a DataFrame, and you want to check which URLs are online. In the downloadable materials, you’ll find a CSV file with some data on the most popular websites, which you can load into a DataFrame:
>>> import pandas as pd
>>> websites = pd.read_csv("resources/popular_websites.csv", index_col=0)
>>> websites
         name                              url   total_views
0      Google           https://www.google.com  5.207268e+11
1     YouTube          https://www.youtube.com  2.358132e+11
2    Facebook         https://www.facebook.com  2.230157e+11
3       Yahoo            https://www.yahoo.com  1.256544e+11
4   Wikipedia        https://www.wikipedia.org  4.467364e+10
5       Baidu            https://www.baidu.com  4.409759e+10
6     Twitter              https://twitter.com  3.098676e+10
7      Yandex               https://yandex.com  2.857980e+10
8   Instagram        https://www.instagram.com  2.621520e+10
9         AOL              https://www.aol.com  2.321232e+10
10   Netscape         https://www.netscape.com  5.750000e+06
11       Nope  https://alwaysfails.example.com  0.000000e+00
This data contains the website’s name, its URL, and the total number of views over an unspecified time period. In the example, pandas shows the number of views in scientific notation. You’ve also got a dummy website in there for testing purposes.
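If you’d rather read those view counts as plain numbers, you can tell pandas how to render floats. This is purely cosmetic, and the exact spacing of the output may differ slightly on your machine:

>>> pd.set_option("display.float_format", "{:,.0f}".format)
>>> websites["total_views"].head(3)
0    520,726,800,000
1    235,813,200,000
2    223,015,700,000
Name: total_views, dtype: float64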
You want to write a connectivity checker to test the URLs and provide a human-readable message indicating whether the website is online or whether it’s being redirected to another URL:
>>> import httpx
>>> def check_connection(name, url):
...     try:
...         response = httpx.get(url)
...         location = response.headers.get("location")
...         if location is None or location.startswith(url):
...             print(f"{name} is online!")
...         else:
...             print(f"{name} is online! But redirects to {location}")
...         return True
...     except httpx.ConnectError:
...         print(f"Failed to establish a connection with {url}")
...         return False
...
Here, you’ve defined a check_connection() function to make the request and print out messages for a given name and URL.
With this function, you’ll use both the url and the name columns. You don’t care much about the performance of reading the values from the DataFrame for two reasons: partly because the data is so small, but mainly because the real time sink is making HTTP requests, not reading from a DataFrame.
Additionally, you’re interested in inspecting whether any of the websites are down. That is, you’re interested in the side effect and not in adding information to the DataFrame.
For these reasons, you can get away with using .itertuples():
>>> for website in websites.itertuples():
...     check_connection(website.name, website.url)
...
Google is online!
YouTube is online!
Facebook is online!
Yahoo is online!
Wikipedia is online!
Baidu is online!
Twitter is online!
Yandex is online!
Instagram is online!
AOL is online!
Netscape is online! But redirects to https://www.aol.com/
Failed to establish a connection with https://alwaysfails.example.com
Here you use a for loop on the iterator that you get from .itertuples(). The iterator yields a namedtuple for each row. Using dot notation, you select the two columns to feed into the check_connection() function.
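To see exactly what .itertuples() hands you on each pass, you can pull a single row off the iterator yourself. By default, each row comes back as a namedtuple class called Pandas, whose fields are the index plus the column names, so the output looks something like this:

>>> next(websites.itertuples())
Pandas(Index=0, name='Google', url='https://www.google.com', total_views=520726800000.0)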