Channel: Planet Python

David Amos: Method Chaining in Pandas: Bad Form Or a Recipe For Success?


This article contains affiliate links. See my affiliate disclosure for more information.

Python trainer Matt Harrison has been creating a bit of a stir.

Some of his pandas examples, like the one below, have elicited emotional responses from different folks in the Twitterverse:

[Embedded tweet with Matt's pandas example. View on Twitter]

One person commented: "Nobody in their right mind writes code like this. Right?" Someone Quote Tweeted it saying: "How not to write Python code."

The thing is, Matt's an experienced programmer with "years in the trenches," as he puts it. He's written best-selling books on Python and pandas, and regularly trains data science teams at top companies.

I knew there had to be more to the story.

Was there some disconnect between the folks criticizing Matt's code and the reality that has shaped Matt's lambda-laden, method-chaining style of writing pandas? I decided to reach out and see if Matt was interested in chatting with me about his code and the reactions to it on Twitter.

I'm so glad he said yes.

✉️
This article was originally published in my Curious About Code newsletter. Never miss an issue. Subscribe here →

David: I've just been absolutely fascinated by the discourse I've seen on your posts on Twitter. And, you know, it's been interesting to watch you handle that as well. But I get the sense that there's some context around the code examples that you post that some people might be missing. Is that the case?

Matt: Yeah, maybe I should talk quickly about my background so that people know where I'm coming from. I have a CS degree and, you know, my original thought out of school was I'd be a software engineer. My first job out of school was doing natural language processing, and I've been in data most of my career using Python. I'm not a statistician. I'm not an admin. My background is in writing code.

What I do now is train professionals in the Python data space, but I also do some consulting and advising for companies. I'm an educator, but I would say that I have time in the trenches. So I'm not reading slides or just making stuff up just because I think it's fun to harass people on the internet.

My goal as an educator is to, as I say, covertly teach best practices and, more generally, software engineering best practices to people who are using code as a tool. People who claim they don't want to be coders. They're just using code as a tool, right? But they're coding, and my goal is to help them write professional code that they won't hate themselves later for writing.

That said, I get that code is like everyone's baby, and people who are coders put a lot of time, sweat, and tears into their code. When you call someone's baby ugly or say that their baby isn't doing things right, that's like a personal offense to a lot of people and they take that the wrong way. So, I get that.

But you aren't your code, just like you aren't the company you work for. Your identity should be separate from your code. And so, if you're using code your goal would be that you would want to be able to use code better, or that's what I think your goal might be. If it's not, that's maybe weird to me.

Having said that, I don't think you'd ever get a hundred people to agree on the one true way or one right way to code. 70% of them will say you should do object-oriented programming. 20% of them might say you need to do Rust. And maybe 1% of them would say you need to do functional programming or something like that. So you're never going to get a consensus on that.

David: What's the elevator pitch for writing pandas code the way that you do?

Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled1.ipynb and Untitled2.ipynb. This notion that when data scientists go to work in the morning, they take whatever Jupyter notebook they worked on yesterday, they copy it and paste it and start off again. It's kind of like, "Well when I use Excel, I just save it as a new file and then I start from that," right?

My goal is to help with that so you don't have Untitled28.ipynb, you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive.

Your code will be easier to read. Others can use your code, you can test your code, you can come back to your code in a week and pick it up and others can do the same.

David: You've been called a gatekeeper by some people on Twitter. In one Tweet you said you write code for professionals. Is that where this idea of gatekeeping comes from? That if you don't write code a certain way it's not "professional?"

Matt: Yeah, I think so. You know, someone was upset because I said that I teach professionals. I asked them who their audience was and they said they work with teachers or grad students or something. And they thought that my claim that I teach professionals was a jab at them. That somehow they aren't professional.

I see what they're saying. But, I do teach professionals. Like, that's what I do. I go into big companies that you've heard of, whose platforms you watch shows on, and I teach them how to write code that will serve them well. A lot of these people aren't necessarily software engineers, but they are writing code.

David: I see your content as filling a gap, in a way. If you search for pandas you're going to find the pandas docs and a whole bunch of beginner-oriented content. That's true of almost anything in programming. So there's this gap between what's easy to find on search engines and what, I think, professionals actually need. Does that play into this at all?

Matt: The style of code that you find out there through search, I've written code like that. And it was painful. Most data science posts on Medium are like "Top Pandas Functions To Remember" sort of things. These posts teach operations in isolation, which is okay. But in practice, I've never had a data set where I've done one thing to it in isolation.

In my 20-plus years of working with data, I have multiple steps and I don't care about the intermediate steps. I care about the raw data, what's coming in, and I care about the clean data that I'm going to visualize or I'm going to send into a machine learning system. I care about the end results. Writing a chain is the recipe that gets me those end results.

David: What is it that separates beginner pandas code from professional pandas code?

Matt: I would say that if you want to write good pandas code (let's draw out the term professional and just say good pandas code), you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking, which is something that a lot of people probably don't use, is super useful in pandas world.
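To make those three tools concrete, here is a minimal sketch of how they can show up together in one chain. The DataFrame, column names, and conversions are made up for illustration and do not come from Matt's examples.

```python
import pandas as pd

# Hypothetical raw data; the column names are invented for this sketch.
raw = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "temp_f": [32.0, 212.0, 98.6],
    "wind_mph": [10.0, 0.0, 5.0],
})

numeric_cols = ["temp_f", "wind_mph"]

tidy = (
    raw
    # A lambda receives the DataFrame as it exists at this step of the chain.
    .assign(temp_c=lambda df_: (df_["temp_f"] - 32) * 5 / 9)
    # A dict comprehension builds one new column per numeric column, and **
    # unpacks the dict into keyword arguments for .assign().
    # The col=col default pins each lambda to its own column name.
    .assign(**{
        f"{col}_rounded": (lambda df_, col=col: df_[col].round())
        for col in numeric_cols
    })
    # A list comprehension builds the final column selection.
    .loc[:, ["city", "temp_c"] + [f"{col}_rounded" for col in numeric_cols]]
)

print(tidy)
```

The `col=col` default argument is a plain Python detail rather than a pandas one: without it, every lambda in the comprehension would look up `col` only when called and would all see the last column name.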

Some people say, "I want to write my code so that someone who's never used pandas or Python can look at it and use it." Well, if that's your audience, good luck with that. I don't want to cater to the lowest common denominator.

I assume that they understand what lambdas, comprehensions, and dictionary unpackings are. I also assume the audience has some minimum level of Python experience. I don't think lambdas and dictionary unpacking and list comprehensions are necessarily beginner-level Python code.

So I'm not saying that people who write code in this beginner style aren't professionals. I'm saying that their code is written in a naive way. I'm sorry if that offends them. I don't think it's their fault. I think it's due to, like you said, a lot of the content floating around the internet being written in this beginner style that shows you how to do something in isolation.

David: You've mentioned a couple of times that you work with people who don't necessarily consider themselves programmers, but are professional pandas users. Is there maybe some other style of writing pandas code that is appropriate in a more traditional software engineering setting?

Matt: No. I would say that if you're writing pandas code you should embrace the chain. Basically, it's a constraint. If you limit yourself to the constraint, it forces you to think about each step of what you're doing along the way.

David: Is the size of the project a factor at all? How big are the notebooks you see people work with, typically?

Matt: I have seen some notebooks where they do longer things. But, I mean, people complain about five lines of a chain like that's the worst thing ever.

David: Yeah, I've seen some comments where people object to just a few chains in a row. And it's like, come on, really?

Matt: Well, there's Demeter's Law, which is a general programming principle. If you have an object and you call an instance member of an object and you call an instance member of that object and you chain those operations, then that's a violation of Demeter's Law, which says that you shouldn't ask the internal part about the internal part. Whatever you're starting from should expose the functionality to do that.

I don't really buy that I'm violating Demeter's Law. We're not really dealing with internal parts of internal parts here. This is completely different from that. We have a DataFrame and we're returning another DataFrame. We're not digging into the DataFrame and pulling out this part and then pulling out some subpart of it.

David: A lot of software engineering best practices make sense at a large scale. But they don't necessarily make sense on a small scale. You end up with a lot of boilerplate. Like, you could have just done this in a couple of functions in one file, so why are you scattering things around and doing all this stuff?

But, you know, at a certain scale it makes sense because, if you don't do that, it's just too hard to deal with things. Are there scalability issues that might make chaining less desirable?

Matt: From my point of view, the scale here is you've got super complicated data. And maybe you end up with a very long chain. In that case, one of the things you can do is leverage .pipe(). Maybe these twenty lines are cleaning up the temperature data, or whatever. So you could say .pipe(clean_temperature_data) and just remove those twenty lines.
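As a rough sketch of that idea, the function name clean_temperature_data comes from the conversation, but its body, the column names, and the data below are assumptions made up for illustration.

```python
import pandas as pd


def clean_temperature_data(df):
    """Hypothetical cleaning step: drop missing readings, convert to Celsius."""
    return (
        df
        .dropna(subset=["temp_f"])
        .assign(temp_c=lambda df_: (df_["temp_f"] - 32) * 5 / 9)
        .drop(columns="temp_f")
    )


# Invented sample data.
raw = pd.DataFrame({
    "station": ["A", "B", "C"],
    "temp_f": [50.0, None, 212.0],
})

result = (
    raw
    # The longer sub-chain collapses to one named step in the main chain.
    .pipe(clean_temperature_data)
    .sort_values("temp_c", ascending=False)
    .reset_index(drop=True)
)

print(result)
```

DataFrame.pipe(func) simply calls func(df) and returns whatever the function returns, so the chain keeps flowing as long as the function returns a DataFrame.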

But I've seen people write a whole bunch of pipes that all dispatch to a single line of code. And my point is: you should be able to just read what's in the pipe. By separating things you're putting more cognitive overhead on yourself because now you have to scroll up and down to read what all these things are when previously the line in the chain told you what it was.

There are cases, though, where chaining legitimately makes things hard. Maybe you really do need an intermediate variable because you're calculating something that's derived from two different things. There are ways to do that without breaking a chain. You can use .pipe() to make an intermediate variable, and then you can refer to that later on if you need to. So, I haven't seen anything that tells me chaining doesn't scale as far as code size goes.
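One way to read that, sketched under assumptions: the piped function computes a value derived from two columns, keeps it in a local variable, and uses it in a later column. The function name, the columns, and the figures are hypothetical.

```python
import pandas as pd

# Invented sample data.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "revenue": [100.0, 300.0, 50.0, 150.0],
    "cost": [80.0, 120.0, 25.0, 75.0],
})


def add_margin_vs_overall(df):
    """Hypothetical step that needs a value derived from two columns."""
    # The intermediate variable lives only inside this function.
    overall_margin = (df["revenue"].sum() - df["cost"].sum()) / df["revenue"].sum()
    return df.assign(
        margin=lambda df_: (df_["revenue"] - df_["cost"]) / df_["revenue"],
        # Later keyword arguments to .assign() can use columns created earlier.
        margin_vs_overall=lambda df_: df_["margin"] - overall_margin,
    )


report = (
    sales
    .query("revenue > 0")
    .pipe(add_margin_vs_overall)
    .groupby("region", as_index=False)
    .agg(mean_margin_vs_overall=("margin_vs_overall", "mean"))
)

print(report)
```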

As far as the other scale that we might think about, which is data size, one thing to be aware of is that pandas is an in-memory tool. Your data needs to fit in memory. That's why the first thing I do when I load my data is set the correct types to shrink it down. I gave an example the other day where I went 95% smaller with a few lines of code.
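The example Matt mentions is not reproduced here. As a stand-in, this is a small sketch of the general technique on invented data: cast low-cardinality strings to category and numbers to the smallest type that holds them, then compare memory use. How much you save depends entirely on your data.

```python
import pandas as pd

# Invented data: repeated strings and small numbers stored in default dtypes.
n = 10_000
raw = pd.DataFrame({
    "state": ["UT", "CA", "NY", "TX"] * (n // 4),
    "age": [25, 40, 63, 18] * (n // 4),
    "score": [0.5, 1.5, 2.5, 3.5] * (n // 4),
})

slim = raw.astype({
    "state": "category",  # few distinct values, stored once plus small codes
    "age": "uint8",       # values fit in 0..255
    "score": "float32",   # half the width of the default float64
})

# deep=True also counts the memory held by Python string objects.
before = raw.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Narrower types trade range and precision for space, so it is worth checking the minimum and maximum of each column before casting.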

And the other thing is that pandas is not particularly smart about doing copy-on-write semantics. In fact, it doesn't really have that. For my clients, I usually recommend a 3X–10X memory overhead so that you have space to do these operations. So, yes, data size can be a problem.

One of the objections people have to chaining is that it's problematic with data size. But it's actually less problematic than what we might call the naive style. Maybe the word naive has a bad connotation, but it is kind of like that. You make all these intermediate variables, right? You're actually storing references to all of these intermediate objects that are all copies of your data.

When you chain there is often a copy being made, but there's no pointer or variable holding it. It gets garbage collected. You make an intermediate variable and then the next thing does something with it and, at that point, no one else is using that intermediate variable so it's gone. You don't have to worry about that or manage that.
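The two styles can be put side by side in a small sketch. The data and variable names are invented; the point is only where names keep intermediate objects alive.

```python
import pandas as pd

# Invented sample data.
raw = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [1, -2, 3, -4]})

# Intermediate-variable style: every step stays bound to a name, so each
# intermediate DataFrame stays alive for as long as that name exists.
positive = raw[raw["value"] > 0]
doubled = positive.assign(value=positive["value"] * 2)
named_result = doubled.sort_values("value", ascending=False)

# Chained style: the intermediates are never bound to a name, so nothing
# holds on to them once the next step has produced its own result.
chained_result = (
    raw
    .loc[lambda df_: df_["value"] > 0]
    .assign(value=lambda df_: df_["value"] * 2)
    .sort_values("value", ascending=False)
)

print(named_result.equals(chained_result))
```

Both versions compute the same result. The difference is that `positive` and `doubled` remain in the notebook's namespace until they are deleted or rebound.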

David: Switching gears a little bit, I'm curious to know what readability means to you. What is it that makes code readable?

Matt: I think that's a question that there's no standard answer for. If you ask ten people, you're going to get ten different answers. I will say that I don't just teach pandas, I also teach Python. A lot of people in my fundamentals of Python class will hear me say, "Your goal is not to write code that's easy to write. Your goal is to write code that's easy to read."

But beauty is in the eye of the beholder and so, again, you need to take into account who your audience is. I don't think a professional who's writing Python code or a professional who's writing pandas code should write code that's easily read by someone who's fresh and hasn't had any training or background.

I would say there's some baseline. Write Python in whatever idiomatic Python style looks like, rather than if it were C or Java. I see a lot of people coming out of university saying they learned Python, but really they learned C++ or Java. The instructor took their Java content and translated it into Python, which isn't particularly hard, and now the student knows Python, right? But they don't really know Python.

Having said that, just because I write pandas in this chaining style doesn't mean that when I write Python code I'm writing chains all over the place. Someone asked me where object-oriented programming is used with pandas. When I'm writing pandas code, I don't write classes. Does that mean that I never write classes? No, I write classes all the time. It's just that I don't really need to use classes when I'm writing pandas.

So I guess I'll answer this from a pandas point of view. For me, readable pandas code looks like a recipe. It's got steps in it. You can say this is what the first step is, this is what the second step is, and this is what the third step is. Chaining is the constraint that, if you follow that constraint, basically forces you to write your code as if it was a recipe.

David: To me, there's a kind of unsatisfyingly obvious answer to "what is readable code?" and that is any code that I can read and understand without having to work too hard. If I have to write things down and take notes in order to understand a few lines of code, then we're starting to leave the territory of readability.

One of the things that attracted me to the chaining style when I first saw it in your Twitter posts was that I only have to think about what each step is doing and not how it's done. I find myself writing more of my code like this. Not chaining, necessarily, but in a more declarative style. That's something I see a lot in pandas code that I come across, and, I think, especially in method chaining.

But there's another objection people have to chaining that we haven't talked about yet. I see people argue that the lack of access to intermediate steps in a chain makes the code hard to test and debug. Can you talk about that?

Matt: I post this code online and I think a lot of people are like, "Oh, you're just like walking up to a computer, typing in this whole piece of code, and then you're done? Where are the intermediate steps? How do I debug this?"

I recommend people go and search for my idiomatic pandas talks on YouTube where I show that this is the end result. It's not like I sat down on a computer and wrote this whole thing in one go. I sat down at a computer and made this line-by-line, testing it as I go.

I don't care about the intermediate results at the end. I'm inspecting those as I'm going through them and validating that what I'm doing actually works. I think a lot of it is just like they don't understand how I got to that endpoint because they just see the end.

And, you know, I generally challenge people and ask "How would you rewrite this?" Most people don't take that, but one person did take that and they're like, "I'm going to write this big long notebook where I declare some variables up here." And they're like, "You need to put some Markdown in there."

A lot of people say I need to use Markdown because you need comments. And so you need to have multiple cells because you need to have Markdown between them, right? And then each cell needs to have like some Markdown above it explaining what it's doing and that sort of thing.

I mean, if that's readable to you, that's great. But the problem is, when I come back to this tomorrow, I've got to find those cells and run them in order. And if you happen to make them out of order or something, then I'm kind of in a bad place.

In my notebooks, I put my chain into a function when I'm done, and then I just put that function at the very top of my code and I can come back tomorrow, load my raw data, and run that and I'm good to go.
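A minimal sketch of that workflow, assuming a hypothetical orders data set: the function name tweak_orders, the columns, and the inline CSV are all invented for illustration, with the CSV standing in for a raw file on disk.

```python
import io

import pandas as pd


def tweak_orders(df):
    """Hypothetical cleanup recipe, kept in one function near the top of the notebook."""
    return (
        df
        .rename(columns=str.lower)
        .assign(order_date=lambda df_: pd.to_datetime(df_["order_date"]))
        .query("amount > 0")
        .astype({"amount": "float32"})
        .reset_index(drop=True)
    )


# Stand-in for the raw file; in a real notebook this would be a path on disk.
RAW_CSV = io.StringIO(
    "Order_Date,Amount\n"
    "2022-01-03,10.5\n"
    "2022-01-04,-1.0\n"
    "2022-01-05,20.0\n"
)

# Tomorrow's session needs only these two lines to get back to clean data.
raw = pd.read_csv(RAW_CSV)
orders = tweak_orders(raw)

print(orders)
```

Because the whole recipe lives in one function, there is no cell order to remember: loading the raw data and calling the function is the entire setup.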

Having said that, if I do want the intermediate variables, I can use .pipe() and just make a global variable. I show examples of that in my Effective Pandas book. How do you debug that? You can comment out the pipe and walk through. That's really easy. People ask if you can use debug tools. Yes. You put a breakpoint at the line and you can step into the method at that point. The claim that you can't debug it is kind of bogus to me.
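The book's own examples are not reproduced here. The following is a hypothetical variation on the idea that stores snapshots in a module-level dict instead of creating individual global variables. The helper name stash and the data are assumptions.

```python
import pandas as pd

# Module-level store for intermediate DataFrames captured during a chain.
snapshots = {}


def stash(df, name):
    """Hypothetical helper: save the DataFrame seen at this step, then pass it on."""
    snapshots[name] = df
    return df


# Invented sample data.
raw = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo", "Lima"],
    "temp_c": [-3.0, 19.0, 35.0, 21.0],
})

result = (
    raw
    .query("temp_c > 0")
    .pipe(stash, "after_filter")  # comment this line out once the step is verified
    .groupby("city", as_index=False)
    .agg(mean_temp_c=("temp_c", "mean"))
)

print(snapshots["after_filter"])
print(result)
```

Extra positional arguments to .pipe() are forwarded to the function, which is how the snapshot name gets passed. Since stash is an ordinary function, it is also a convenient place to set a breakpoint.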

David: So what are your final thoughts on all of this? What should people take away from our conversation?

Matt: If people don't like chaining, that's probably bad news for them because my belief is that we're just going to see more of that with next-gen tools like Polars, which is a DataFrame implementation in Rust.

You kind of have to chain in Polars. You can do a .filter() at the very end of the chain and it will go back up to read the CSV file and limit which columns and rows it reads based on the filter. You can do query optimization from the chain, which you wouldn't get if you didn't chain in Polars.

And really, my advice would be just to try out chaining in pandas. I think a lot of people have an adverse reaction to it, but they never try it out. A lot of people who read my book say, "I was skeptical, but I tried it out and now it's changed how I write pandas code."


Want to learn how to write effective pandas code?

Check out Matt's latest book Effective Pandas. It shows you how to clean your data, create powerful visualizations, and write your own data recipes. And there's an entire chapter dedicated to debugging.

Get instant access to the eBook on Matt's website or order a print copy on Amazon.

My favorite part is how Matt uses diagrams to explain operations and help you build a mental model for working with pandas DataFrames:

Example diagram from Effective Pandas.

Follow Matt on Twitter (@__mharrison__) and see all of his books and training courses over at metasnake.com.


