One of the most common questions you might have when entering the world of pandas is how to iterate over rows in a pandas DataFrame. If you’ve gotten comfortable using loops in core Python, then this is a perfectly natural question to ask.
While iterating over rows is relatively straightforward with .itertuples() or .iterrows(), that doesn’t necessarily mean iteration is the best way to work with DataFrames. In fact, while iteration may be a quick way to make progress, relying on iteration can become a significant roadblock when it comes to being effective with pandas.
In this tutorial, you’ll learn how to iterate over the rows in a pandas DataFrame, but you’ll also learn why you probably don’t want to. Generally, you’ll want to avoid iteration because it comes with a performance penalty and goes against the way of the panda.
To follow along with this tutorial, you can download the datasets and code samples from the following link:
Free Sample Code: Click here to download the free sample code and datasets that you’ll use to explore iterating over rows in a pandas DataFrame vs using vectorized methods.
The last bit of prep work is to spin up a virtual environment and install a few packages:
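The exact commands vary by platform, but on Linux or macOS a typical session might look something like this, where venv is just a conventional name for the environment directory:

$ python -m venv venv
$ source venv/bin/activate
(venv) $ python -m pip install pandas httpx codetiming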
The pandas installation won’t come as a surprise, but you may wonder about the others. You’ll use the httpx package to carry out some HTTP requests as part of one example, and the codetiming package to make some quick performance comparisons.
With that, you’re ready to get stuck in and learn how to iterate over rows, why you probably don’t want to, and what other options to rule out before resorting to iteration.
How to Iterate Over DataFrame Rows in pandas
While uncommon, there are some situations in which you can get away with iterating over a DataFrame. These situations are typically ones where you:
- Need to feed the information from a pandas DataFrame sequentially into another API
- Need the operation on each row to produce a side effect, such as an HTTP request
- Have complex operations to carry out involving various columns in the DataFrame
- Don’t mind the performance penalty of iteration, maybe because working with the data isn’t the bottleneck, the dataset is very small, or it’s just a personal project
For instance, imagine you have a list of URLs in a DataFrame, and you want to check which URLs are online. In the downloadable materials, you’ll find a CSV file with some data on the most popular websites, which you can load into a DataFrame:
>>> import pandas as pd
>>> websites = pd.read_csv("resources/popular_websites.csv", index_col=0)
>>> websites
         name                              url   total_views
0      Google           https://www.google.com  5.207268e+11
1     YouTube          https://www.youtube.com  2.358132e+11
2    Facebook         https://www.facebook.com  2.230157e+11
3       Yahoo            https://www.yahoo.com  1.256544e+11
4   Wikipedia        https://www.wikipedia.org  4.467364e+10
5       Baidu            https://www.baidu.com  4.409759e+10
6     Twitter              https://twitter.com  3.098676e+10
7      Yandex               https://yandex.com  2.857980e+10
8   Instagram        https://www.instagram.com  2.621520e+10
9         AOL              https://www.aol.com  2.321232e+10
10   Netscape         https://www.netscape.com  5.750000e+06
11       Nope  https://alwaysfails.example.com  0.000000e+00
This data contains the website’s name, its URL, and the total number of views over an unspecified time period. In the example, pandas shows the number of views in scientific notation. You’ve also got a dummy website in there for testing purposes.
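If you’d rather read those view counts as plain numbers, you can tell pandas how to render floats. This is purely cosmetic, and the exact spacing of the output may differ slightly on your machine:

>>> pd.set_option("display.float_format", "{:,.0f}".format)
>>> websites["total_views"].head(3)
0    520,726,800,000
1    235,813,200,000
2    223,015,700,000
Name: total_views, dtype: float64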
You want to write a connectivity checker to test the URLs and provide a human-readable message indicating whether the website is online or whether it’s being redirected to another URL:
>>> import httpx
>>> def check_connection(name, url):
...     try:
...         response = httpx.get(url)
...         location = response.headers.get("location")
...         if location is None or location.startswith(url):
...             print(f"{name} is online!")
...         else:
...             print(f"{name} is online! But redirects to {location}")
...         return True
...     except httpx.ConnectError:
...         print(f"Failed to establish a connection with {url}")
...         return False
...
Here, you’ve defined a check_connection() function to make the request and print out messages for a given name and URL.
With this function, you’ll use both the url and the name columns. You don’t care much about the performance of reading the values from the DataFrame for two reasons: partly because the data is so small, but mainly because the real time sink is making HTTP requests, not reading from a DataFrame.
Additionally, you’re interested in inspecting whether any of the websites are down. That is, you’re interested in the side effect and not in adding information to the DataFrame.
For these reasons, you can get away with using .itertuples():
>>> for website in websites.itertuples():
...     check_connection(website.name, website.url)
...
Google is online!
YouTube is online!
Facebook is online!
Yahoo is online!
Wikipedia is online!
Baidu is online!
Twitter is online!
Yandex is online!
Instagram is online!
AOL is online!
Netscape is online! But redirects to https://www.aol.com/
Failed to establish a connection with https://alwaysfails.example.com
Here you use a for loop on the iterator that you get from .itertuples(). The iterator yields a namedtuple for each row. Using dot notation, you select the two columns to feed into the check_connection() function.
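To see exactly what .itertuples() hands you on each pass, you can pull a single row off the iterator yourself. By default, each row comes back as a namedtuple class called Pandas, whose fields are the index plus the column names, so the output looks something like this:

>>> next(websites.itertuples())
Pandas(Index=0, name='Google', url='https://www.google.com', total_views=520726800000.0)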