Channel: Planet Python

Abhijeet Pal: Adding Custom Model Managers In Django


A manager is the interface through which database query operations are provided to Django models. At least one manager exists for every model in a Django application; objects is the default manager of every model and retrieves all objects in the database. However, a model can have multiple managers, and we can also build our own custom model managers by extending the base Manager class. In this article, we will learn how to create custom model managers in Django.

Creating Custom Model Managers

In this article, we will build a model manager for a blog application.

class Post(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)
    active = models.BooleanField(default=False)

    def __str__(self):
        return self.title

As we know, objects is the default model manager for every model, so Post.objects.all() will return all post objects. The default manager already covers the basic QuerySets, so why would we need a custom model manager? There are two reasons you might want to customize a Manager: to add extra Manager methods, or to modify the initial QuerySet the Manager returns. Let's …
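The excerpt is cut off before the manager itself is built; here is a minimal sketch of the kind of custom manager such a post typically ends up with (the manager name is illustrative, not necessarily the author's):

from django.db import models

class ActivePostManager(models.Manager):
    def get_queryset(self):
        # Modify the initial QuerySet: only return active posts
        return super().get_queryset().filter(active=True)

class Post(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField()
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)
    active = models.BooleanField(default=False)

    objects = models.Manager()          # the default manager
    active_posts = ActivePostManager()  # the custom manager

    def __str__(self):
        return self.title

With this in place, Post.active_posts.all() returns only posts whose active field is True, while Post.objects.all() still returns everything.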

The post Adding Custom Model Managers In Django appeared first on Django Central.


Paolo Amoroso: A List of Free Python Books

If you’re like me, you love learning by reading books.

So, when I set out to learn the Python programming language in the last days of 2018, I started looking for good books. I googled, browsed Reddit, checked major Python sites, and came up with a list of Python books, including several free ebooks. I shared the list of free books on Reddit, as I thought it might help others. Not only was the list a huge hit, but some users also suggested more great books.

Free Python Books GitHub repository
The GitHub repository of the list of free Python Books I maintain.

Given all the interest, I put together my initial list, integrated it with the suggestions, and published the list of free Python books.

Go check the list: there are good titles covering many topics, from introductory guides to advanced language features and techniques, from software engineering to game development, and more, including a few gems such as the unusual book Boxes: Your Second Python Book, which explores digital typesetting and text layout algorithms.

Catalin George Festila: Python 3.7.5 : Testing the PyMongo package - part 001.

MongoDB and PyMongo are not my priorities for the new year 2020, but because they have evolved quite a bit, I thought I would start presenting them in my free time. The PyMongo package is a Python distribution containing tools for working with MongoDB. The full documentation can be found on this webpage. You can see my tutorial about how to install MongoDB on Fedora 31 on this webpage. I used
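The excerpt stops before any code; a minimal first test with PyMongo, assuming a MongoDB server running locally on the default port (database and collection names are just examples):

from pymongo import MongoClient

# Connect to the local MongoDB server
client = MongoClient("mongodb://localhost:27017/")

# Databases and collections are created lazily, on first insert
db = client["test_database"]
collection = db["test_collection"]

collection.insert_one({"name": "mythcat", "year": 2020})
print(collection.find_one({"name": "mythcat"}))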

Weekly Python StackOverflow Report: (ccix) stackoverflow python report


Catalin George Festila: Python 3.7.5 : Testing cryptography python package - part 001.

There are many Python packages that present themselves as useful encryption and decryption solutions. Before testing and using them, I recommend spending time on a proper study of cryptography, because many problems can arise from source code that is not written correctly and safely. Today I will show you a simple example with the cryptography Python package. Let's
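A minimal sketch of the kind of example that follows, using the Fernet recipe from the cryptography package (the post's actual example may differ):

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # keep this key secret; losing it means losing the data
f = Fernet(key)

token = f.encrypt(b"A secret message")
print(token)                  # the encrypted token
print(f.decrypt(token))       # b'A secret message'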

Codementor: Getting started with Python - Reading Time: 6 Mins

Getting started with Python for absolute beginners in 2020

Podcast.__init__: Checking Up On Python's Role in DevOps


Summary

Python has been part of the standard toolkit for systems administrators since it was created. In recent years there has been a shift in how servers are deployed and managed, and how code gets released due to the rise of cloud computing and the accompanying DevOps movement. The increased need for automation and speed of iteration has been a perfect use case for Python, cementing its position as a powerful tool for operations. In this episode Moshe Zadka reflects on his experiences using Python in a DevOps context and the book that he wrote on the subject. He also discusses the difference in what aspects of the language are useful as an introduction for system operators and where they can continue their learning.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Moshe Zadka about his recent book DevOps In Python

Interview

  • Introductions
  • How did you get introduced to Python?
  • How did you gain experience in managing systems with Python?
  • What is DevOps?
  • What makes Python a good fit for managing systems?
  • What is unique to the devops/sysadmin domain in terms of what software is used and what aspects of the language are useful?
  • What are the main ways that Python is used for managing servers and infrastructure?
  • What are some of the most notable changes in the ways that Python is used for server administration over the past several years?
  • How has Python3 impacted the lives of operators?
  • What was your motivation for writing a book about Python focused specifically on DevOps and server automation?
  • What are some of the tools that have been replaced in your own workflow over the years?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat.

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA.

James Bennett: A Python Packaging Carol


I have endeavoured in this Ghostly little book, to raise the Ghost of an Idea, which shall not put my readers out of humour with themselves, with each other, with the season, or with me. May it haunt their houses pleasantly, and no one wish to lay it.

Every year around Christmas, I make a point of re-reading Charles Dickens’ A Christmas Carol. If you’ve only ever been exposed to the story through adaptations into …

Read full entry


Mike Driscoll: PyDev of the Week: Bryan Weber


This week we welcome Bryan Weber (@darthbith) as our PyDev of the Week! Bryan is a contributor for Real Python and a core developer for Cantera. If you’d like to learn more about Bryan, you can check out his website or his GitHub profile. Let’s take a few moments to get to know him better!

Can you tell us a little about yourself (hobbies, education, etc):

I am a teaching professor at the University of Connecticut, as well as the Director of Undergraduate Studies for Mechanical Engineering. This means that I focus mostly on improving the education of our undergraduate students. I teach a lot of thermodynamics and fluid mechanics courses, and I’ve developed a few Python packages to help with that.

I got my doctorate in Mechanical Engineering in 2014, also from the University of Connecticut. One of my favorite things about mechanical engineering is that it is a super broad field, covering everything from robotics to chemistry, cars and trucks to planes and rockets, and everything in between.

My hobbies are open source software, Ultimate Frisbee, and cooking. I have a daughter and I love spending time as a family. Aside from that, there isn’t much time for anything else!

Why did you start using Python?

While I was in grad school, my dissertation was focused on developing experimental data for biofuels. Originally, I wrote all of my data processing in MATLAB because that was the language I knew from undergrad. At some point, I realized that if I wanted to practice open science, that included sharing the data processing scripts as well as the raw data. Of course, MATLAB is proprietary software and is quite expensive. This means that my work would not be really open and free (as in speech).

So I rewrote everything in Python, so that I could share it all! I chose Python because another package that I wanted to use had a Python interface, and it made it easy to integrate everything together. The package I wrote for data processing is still on GitHub (it is called UConnRCMPy) although I’m not sure if anyone uses it at all.

What other programming languages do you know and which is your favorite?

I used to know FORTRAN and MATLAB pretty well, but those skills have mostly atrophied. I can read most C++ code, but can’t write it all that well. Python is by far my favorite language that I’ve learned so far. I’m also very interested to learn Julia and see how it compares!

What projects are you working on now?

I have developed a few software packages to use in my classes. I’ve written an extension for the Jupyter Notebook that allows one to download several Notebooks as PDFs. This uses nbconvert and pdfrw to automatically stitch together the converted Notebooks into one PDF. I use this in my classes for students to do their homework right in the Jupyter Notebook and then submit a PDF of their code + explanations. For my classes, autograding doesn’t work very well, so this is a nice substitute.

Aside from teaching, I spend most of the rest of my time working on Cantera, the open source package for thermodynamics, chemical kinetics, and transport phenomena. I’ve been one of the lead developers since about 2013, but Cantera was the package that inspired me to use Python in the first place, so I must have known about it in 2010 or 2011. The core of Cantera is written in C++ because it needs to be fast and we use Cython to generate a Python interface. Cython is an excellent package that makes it really easy to write code that looks a lot like Python but can be compiled to a C library. Since Cython generates a C library, it is pretty simple to link with other C or C++ code like the core of Cantera.

I’m a pretty heavy user of Conda, the cross-platform package manager as well. I’ve written Conda recipes to build packages for Cantera and the other libraries that I’ve developed, and even contributed a small change to conda-build itself! I’ve learned a lot about compiling and linking C/C++ code on all the major platforms and I have a ton of appreciation and respect for the people who really get this stuff.

Which Python libraries are your favorite (core or 3rd party)?

NumPy is so foundational to everything I do, that has to be one of my favorites. I also love the Pint library, which handles physical unit conversions (how many feet in a meter?). Pint was just featured on the Python Bytes podcast as well! I also use the Nikola and Pelican static HTML generators pretty frequently for Cantera and my personal website, respectively.

How did you end up writing for Real Python?

I saw a tweet from Real Python that they were looking for authors, and decided to apply! My first article was about how to write a main() function in Python, and I’ve since written an article about switching from MATLAB to Python, and I have a few more in the pipeline I’m excited to share!

How do you choose what articles to write?

I typically look for articles that cover something I already know pretty well, like switching from MATLAB to Python, or something that I want to learn about. My upcoming articles are about doing optimization with SciPy and getting started with the datetime library. I think time in programming is such a fascinating idea, especially because it’s so complicated!

Is there anything else you’d like to say?

If you don’t donate to the Python Software Foundation or other non-profit organizations in the open-source space, please consider doing so. Cantera is a fiscally sponsored project under the NumFOCUS umbrella, another non-profit that accepts donations on behalf of the projects. If you use open-source software in your work, please consider a program like Tidelift so your company can help support open source as well. And if you don’t have the financial resources to give back, and you do have the time to give back, contributing code or editing documentation is another great way to be involved!

Thanks for doing the interview, Bryan!

The post PyDev of the Week: Bryan Weber appeared first on The Mouse Vs. The Python.

Catalin George Festila: Python 3.7.5 : Set json with settings in Django project.

[mythcat@desk django]$ source env/bin/activate
(env) [mythcat@desk django]$ cd mysite/
(env) [mythcat@desk mysite]$ ls
db.sqlite3  manage.py  mysite  test001
(env) [mythcat@desk mysite]$ pwd
/home/mythcat/projects/django/mysite

Create a file named config.json in the folder django:

(env) [mythcat@desk mysite]$ vim /home/mythcat/projects/django/config.json

Open your settings.py file from your Django
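The post is cut off before the settings changes; a minimal sketch of the idea, assuming config.json holds values such as the secret key (the key names here are illustrative):

# mysite/settings.py
import json

with open('/home/mythcat/projects/django/config.json') as config_file:
    config = json.load(config_file)

# Read sensitive values from the JSON file instead of hard-coding them
SECRET_KEY = config['SECRET_KEY']
DEBUG = config.get('DEBUG', False)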

Julien Danjou: Atomic lock-free counters in Python


At Datadog, we're really into metrics. We love them, we store them, but we also generate them. To do that, you need to juggle with integers that are incremented, also known as counters.

While having an integer that changes its value sounds dull, it might not be without some surprises in certain circumstances. Let's dive in.

The Straightforward Implementation

class SingleThreadCounter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

Pretty easy, right?

Well, not so fast, buddy. As the class name implies, this works fine with a single-threaded application. Let's take a look at the instructions in the increment method:

>>> import dis
>>> dis.dis("self.value += 1")
  1           0 LOAD_NAME                0 (self)
              2 DUP_TOP
              4 LOAD_ATTR                1 (value)
              6 LOAD_CONST               0 (1)
              8 INPLACE_ADD
             10 ROT_TWO
             12 STORE_ATTR               1 (value)
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE

The self.value += 1 line of code compiles to several bytecode operations, and Python can switch to a different thread between any of them, a thread that could also be incrementing the counter.

Indeed, the += operation is not atomic: one needs a LOAD_ATTR to read the current value of the counter, then an INPLACE_ADD to add 1, and finally a STORE_ATTR to store the result in the value attribute.

If another thread executes the same code at the same time, you could end up adding 1 to a stale value:

  • Thread-1 reads the value as 23
  • Thread-1 adds 1 to 23 and gets 24
  • Thread-2 reads the value as 23
  • Thread-1 stores 24 in value
  • Thread-2 adds 1 to 23 and gets 24
  • Thread-2 stores 24 in value

Boom. Your Counter class is not thread-safe. 😭

The Thread-Safe Implementation

To make this thread-safe, a lock is necessary. We need a lock each time we want to increment the value, so we are sure the increments are done serially.

import threading

class FastReadCounter(object):
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()
        
    def increment(self):
        with self._lock:
            self.value += 1

This implementation is thread-safe. There is no way for multiple threads to increment the value at the same time, so there's no way that an increment is lost.

The only downside of this counter implementation is that you need to lock the counter each time you need to increment. There might be much contention around this lock if you have many threads updating the counter.

On the other hand, if it's barely updated and often read, this is an excellent implementation of a thread-safe counter.

A Fast Write Implementation

There's a way to implement a thread-safe counter in Python that does not need to be locked on write. It's a trick that only works on CPython because of the Global Interpreter Lock.

While everybody is unhappy with it, this time, the GIL is going to help us. When a C function is executed and does not do any I/O, it cannot be interrupted by any other thread. It turns out there's a counter-like class implemented in Python: itertools.count.

We can use this count class to our advantage, avoiding the need for a lock when incrementing the counter.

If you read the documentation for itertools.count, you'll notice that there's no way to read the current value of the counter. This is tricky, and this is where we'll need to use a lock to bypass this limitation. Here's the code:

import itertools
import threading

class FastWriteCounter(object):
    def __init__(self):
        self._number_of_read = 0
        self._counter = itertools.count()
        self._read_lock = threading.Lock()

    def increment(self):
        next(self._counter)

    def value(self):
        with self._read_lock:
            value = next(self._counter) - self._number_of_read
            self._number_of_read += 1
        return value

The increment code is quite simple in this case: the counter is just incremented without any lock. The GIL protects concurrent access to the internal data structure in C, so there's no need for us to lock anything.

On the other hand, Python does not provide any way to read the value of an itertools.count object. We need to use a small trick to get the current value. The value method increments the counter and then gets the value while subtracting the number of times the counter has been read (and therefore incremented for nothing).

This counter is, therefore, lock-free for writing, but not for reading: the opposite of our previous implementation.
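A quick sanity check, not from the article, showing that concurrent increments are not lost with this class:

import threading

counter = FastWriteCounter()

def worker():
    for _ in range(100_000):
        counter.increment()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value())  # 400000: no increment was lost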

Measuring Performance

After writing all of this code, I wanted to check how the different implementations impacted speed. Using the timeit module and my fancy laptop, I've measured the performance of reading and writing to this counter.
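The article doesn't reproduce the benchmark code itself; here is a rough sketch of how per-call timings like these can be gathered with timeit (the author's exact methodology may differ):

import timeit

counter = FastWriteCounter()
n = 1_000_000

# Average nanoseconds per call
print(timeit.timeit(counter.increment, number=n) / n * 1e9)
print(timeit.timeit(counter.value, number=n) / n * 1e9)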

Operation    SingleThreadCounter    FastReadCounter    FastWriteCounter
increment    176 ns                 390 ns             169 ns
value        26 ns                  26 ns              529 ns

I'm glad that the performance measurements in practice match the theory 😅. Both SingleThreadCounter and FastReadCounter have the same performance for reading. Since they use a simple variable read, it makes absolute sense.

The same goes for SingleThreadCounter and FastWriteCounter, which have the same performance for incrementing the counter. Again they're using the same kind of lock-free code to add 1 to an integer, making the code fast.

Conclusion

It's pretty obvious, but if you're using a single-threaded application and do not have to care about concurrent access, you should stick to using a simple incremented integer.

For fun, I've published a Python package named fastcounter that provides those classes. The sources are available on GitHub. Enjoy!

Techiediaries - Django: MyCLI: A MySQL CLI Based on Python with Auto-completion and Syntax Highlighting


If you prefer to work with MySQL via its command-line interface, you'll like mycli, a CLI tool with auto-completion and syntax highlighting. It's built with Python on top of prompt_toolkit, a library for building interactive command-line applications in Python.

It is cross-platform and is tested on Linux, macOS, and Windows.

According to the official website:

mycli is a command line interface for MySQL, MariaDB, and Percona with auto-completion and syntax highlighting.

MyCLI

Prerequisites

  • Python 2.7 or Python 3.4+.

How to Install MyCLI?

Assuming you have Python and pip installed, open a new terminal and run the following command:

$ pip install mycli

Check out the official website for instructions on how to install MyCLI on the other platforms.

You can check the source code of this tool on GitHub.
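Once installed, you connect much as you would with the standard mysql client; a typical invocation (assuming the usual MySQL client-style flags for user and host) looks like this:

$ mycli -u root -h localhost my_database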

Catalin George Festila: Python 3.7.5 : About Django REST framework.

First, let's activate my Python virtual environment:

[mythcat@desk django]$ source env/bin/activate

I updated my Django version from 3.0 to 3.0.1:

(env) [mythcat@desk django]$ pip3 install --upgrade django --user
Collecting django
...
Successfully uninstalled Django-3.0
Successfully installed django-3.0.1

The next step is the installation of the Python modules for Django and Django REST:

(env)
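The excerpt ends at the install step; as a brief hedged sketch, the usual Django REST framework setup is to install the package and register it in INSTALLED_APPS:

(env) [mythcat@desk mysite]$ pip3 install djangorestframework

# mysite/settings.py
INSTALLED_APPS = [
    # ...
    'rest_framework',
]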

Real Python: Using Pandas and Python to Explore Your Dataset


Do you have a large dataset that’s full of interesting insights, but you’re not sure where to start exploring it? Has your boss asked you to generate some statistics from it, but they’re not so easy to extract? These are precisely the use cases where Pandas and Python can help you! With these tools, you’ll be able to slice a large dataset down into manageable parts and glean insight from that information.

In this tutorial, you’ll learn how to:

  • Calculate metrics about your data
  • Perform basic queries and aggregations
  • Discover and handle incorrect data, inconsistencies, and missing values
  • Visualize your data with plots

You’ll also learn about the differences between the main data structures that Pandas and Python use. To follow along, you can get all of the example code in this tutorial at the link below:

Get Jupyter Notebook: Click here to get the Jupyter Notebook you'll use to explore data with Pandas in this tutorial.

Setting Up Your Environment

There are a few things you’ll need to get started with this tutorial. First is a familiarity with Python’s built-in data structures, especially lists and dictionaries. For more information, check out Lists and Tuples in Python and Dictionaries in Python.

The second thing you’ll need is a working Python environment. You can follow along in any terminal that has Python 3 installed. If you want to see nicer output, especially for the large NBA dataset you’ll be working with, then you might want to run the examples in a Jupyter notebook.

Note: If you don’t have Python installed at all, then check out Python 3 Installation & Setup Guide. You can also follow along online in a try-out Jupyter notebook.

The last thing you’ll need is the Pandas Python library, which you can install with pip:

$ python -m pip install pandas

You can also use the Conda package manager:

$ conda install pandas

If you’re using the Anaconda distribution, then you’re good to go! Anaconda already comes with the Pandas Python library installed.

Note: Have you heard that there are multiple package managers in the Python world and are somewhat confused about which one to pick? pip and conda are both excellent choices, and they each have their advantages.

If you’re going to use Python mainly for data science work, then conda is perhaps the better choice. In the conda ecosystem, you have two main alternatives:

  1. If you want to get a stable data science environment up and running quickly, and you don’t mind downloading 500 MB of data, then check out the Anaconda distribution.
  2. If you prefer a more minimalist setup, then check out the section on installing Miniconda in Setting Up Python for Machine Learning on Windows.

The examples in this tutorial have been tested with Python 3.7 and Pandas 0.25.0, but they should also work in older versions. You can get all the code examples you’ll see in this tutorial in a Jupyter notebook by clicking the link below:

Get Jupyter Notebook: Click here to get the Jupyter Notebook you'll use to explore data with Pandas in this tutorial.

Let’s get started!

Using the Pandas Python Library

Now that you’ve installed Pandas, it’s time to have a look at a dataset. In this tutorial, you’ll analyze NBA results provided by FiveThirtyEight in a 17MB CSV file. Create a script download_nba_all_elo.py to download the data:

import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()  # Check that the request was successful

with open(target_csv_path, "wb") as f:
    f.write(response.content)

print("Download ready.")

When you execute the script, it will save the file nba_all_elo.csv in your current working directory.

Note: You could also use your web browser to download the CSV file.

However, having a download script has several advantages:

  • You can tell where you got your data.
  • You can repeat the download anytime! That’s especially handy if the data is often refreshed.
  • You don’t need to share the 17MB CSV file with your co-workers. Usually, it’s enough to share the download script.

Now you can use the Pandas Python library to take a look at your data:

>>> import pandas as pd
>>> nba = pd.read_csv("nba_all_elo.csv")
>>> type(nba)
<class 'pandas.core.frame.DataFrame'>

Here, you follow the convention of importing Pandas in Python with the pd alias. Then, you use .read_csv() to read in your dataset and store it as a DataFrame object in the variable nba.

Note: Is your data not in CSV format? No worries! The Pandas Python library provides several similar functions like read_json(), read_html(), and read_sql_table(). To learn how to work with these file formats, check out Reading and Writing Files With Pandas or consult the docs.

You can see how much data nba contains:

>>> len(nba)
126314
>>> nba.shape
(126314, 23)

You use the Python built-in function len() to determine the number of rows. You also use the .shape attribute of the DataFrame to see its dimensionality. The result is a tuple containing the number of rows and columns.

Now you know that there are 126,314 rows and 23 columns in your dataset. But how can you be sure the dataset really contains basketball stats? You can have a look at the first five rows with .head():

>>> nba.head()

If you’re following along with a Jupyter notebook, then you’ll see a result like this:

Pandas DataFrame .head()

Unless your screen is quite large, your output probably won’t display all 23 columns. Somewhere in the middle, you’ll see a column of ellipses (...) indicating the omitted columns. If you’re working in a terminal, then that’s probably more readable than wrapping long rows. However, Jupyter notebooks will allow you to scroll. You can configure Pandas to display all 23 columns like this:

>>> pd.set_option("display.max.columns", None)

While it’s practical to see all the columns, you probably won’t need six decimal places! Change it to two:

>>> pd.set_option("display.precision", 2)

To verify that you’ve changed the options successfully, you can execute .head() again, or you can display the last five rows with .tail() instead:

>>> nba.tail()

Now, you should see all the columns, and your data should show two decimal places:

Pandas DataFrame .tail()

You can discover some further possibilities of .head() and .tail() with a small exercise. Can you print the last three lines of your DataFrame? Expand the code block below to see the solution:

Here’s how to print the last three lines of nba:

>>> nba.tail(3)

Your output should look something like this:

Pandas DataFrame .tail() with parameter

You can see the last three lines of your dataset with the options you’ve set above.

Similar to the Python standard library, functions in Pandas also come with several optional parameters. Whenever you bump into an example that looks relevant but is slightly different from your use case, check out the official documentation. The chances are good that you’ll find a solution by tweaking some optional parameters!

Getting to Know Your Data

You’ve imported a CSV file with the Pandas Python library and had a first look at the contents of your dataset. So far, you’ve only seen the size of your dataset and its first and last few rows. Next, you’ll learn how to examine your data more systematically.

Displaying Data Types

The first step in getting to know your data is to discover the different data types it contains. While you can put anything into a list, the columns of a DataFrame contain values of a specific data type. When you compare Pandas and Python data structures, you’ll see that this behavior makes Pandas much faster!

You can display all columns and their data types with .info():

>>> nba.info()

This will produce the following output:

Pandas DataFrame .info()

You’ll see a list of all the columns in your dataset and the type of data each column contains. Here, you can see the data types int64, float64, and object. Pandas uses the NumPy library to work with these types. Later, you’ll meet the more complex categorical data type, which the Pandas Python library implements itself.

The object data type is a special one. According to the Pandas Cookbook, the object data type is “a catch-all for columns that Pandas doesn’t recognize as any other specific type.” In practice, it often means that all of the values in the column are strings.

Although you can store arbitrary Python objects in the object data type, you should be aware of the drawbacks to doing so. Strange values in an object column can harm Pandas’ performance and its interoperability with other libraries. For more information, check out the official getting started guide.

Showing Basic Statistics

Now that you’ve seen what data types are in your dataset, it’s time to get an overview of the values each column contains. You can do this with .describe():

>>> nba.describe()

This function shows you some basic descriptive statistics for all numeric columns:

Pandas DataFrame .describe()

.describe() only analyzes numeric columns by default, but you can provide other data types if you use the include parameter:

>>> import numpy as np
>>> nba.describe(include=np.object)

.describe() won’t try to calculate a mean or a standard deviation for the object columns, since they mostly include text strings. However, it will still display some descriptive statistics:

Pandas DataFrame .describe() with include=np.object

Take a look at the team_id and fran_id columns. Your dataset contains 104 different team IDs, but only 53 different franchise IDs. Furthermore, the most frequent team ID is BOS, but the most frequent franchise ID is Lakers. How is that possible? You’ll need to explore your dataset a bit more to answer this question.

Exploring Your Dataset

Exploratory data analysis can help you answer questions about your dataset. For example, you can examine how often specific values occur in a column:

>>> nba["team_id"].value_counts()
BOS    5997
NYK    5769
LAL    5078
...
SDS      11
Name: team_id, Length: 104, dtype: int64
>>> nba["fran_id"].value_counts()
Lakers          6024
Celtics         5997
Knicks          5769
...
Huskies           60
Name: fran_id, dtype: int64

It seems that a team named "Lakers" played 6024 games, but only 5078 of those were played by the Los Angeles Lakers. Find out who the other "Lakers" team is:

>>> nba.loc[nba["fran_id"] == "Lakers", "team_id"].value_counts()
LAL    5078
MNL     946
Name: team_id, dtype: int64

Indeed, the Minneapolis Lakers ("MNL") played 946 games. You can even find out when they played those games:

>>> nba.loc[nba["team_id"] == "MNL", "date_game"].min()
'1/1/1949'
>>> nba.loc[nba["team_id"] == "MNL", "date_game"].max()
'4/9/1959'
>>> nba.loc[nba["team_id"] == "MNL", "date_game"].agg(("min", "max"))
min    1/1/1949
max    4/9/1959
Name: date_game, dtype: object

It looks like the Minneapolis Lakers played between the years of 1949 and 1959. That explains why you might not recognize this team!

You’ve also found out why the Boston Celtics team "BOS" played the most games in the dataset. Let’s also analyze their history a little bit. Find out how many points the Boston Celtics have scored during all matches contained in this dataset. Expand the code block below for the solution:

Similar to the .min() and .max() aggregate functions, you can also use .sum():

>>> nba.loc[nba["team_id"] == "BOS", "pts"].sum()
626484

The Boston Celtics scored a total of 626,484 points.

You’ve got a taste for the capabilities of a Pandas DataFrame. In the following sections, you’ll expand on the techniques you’ve just used, but first, you’ll zoom in and learn how this powerful data structure works.

Getting to Know Pandas’ Data Structures

While a DataFrame provides functions that can feel quite intuitive, the underlying concepts are a bit trickier to understand. For this reason, you’ll set aside the vast NBA DataFrame and build some smaller Pandas objects from scratch.

Understanding Series Objects

Python’s most basic data structure is the list, which is also a good starting point for getting to know pandas.Series objects. Create a new Series object based on a list:

>>> revenues = pd.Series([5555, 7000, 1980])
>>> revenues
0    5555
1    7000
2    1980
dtype: int64

You’ve used the list [5555, 7000, 1980] to create a Series object called revenues. A Series object wraps two components:

  1. A sequence of values
  2. A sequence of identifiers, which is the index

You can access these components with .values and .index, respectively:

>>> revenues.values
array([5555, 7000, 1980])
>>> revenues.index
RangeIndex(start=0, stop=3, step=1)

revenues.values returns the values in the Series, whereas revenues.index returns the positional index.

Note: If you’re familiar with NumPy, then it might be interesting for you to note that the values of a Series object are actually n-dimensional arrays:

>>> type(revenues.values)
<class 'numpy.ndarray'>

If you’re not familiar with NumPy, then there’s no need to worry! You can explore the ins and outs of your dataset with the Pandas Python library alone. However, if you’re curious about what Pandas does behind the scenes, then check out Look Ma, No For-Loops: Array Programming With NumPy.

While Pandas builds on NumPy, a significant difference is in their indexing. Just like a NumPy array, a Pandas Series also has an integer index that’s implicitly defined. This implicit index indicates the element’s position in the Series.

However, a Series can also have an arbitrary type of index. You can think of this explicit index as labels for a specific row:

>>> city_revenues = pd.Series(
...     [4200, 8000, 6500],
...     index=["Amsterdam", "Toronto", "Tokyo"]
... )
>>> city_revenues
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

Here, the index is a list of city names represented by strings. You may have noticed that Python dictionaries use string indices as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series:

  1. revenues: This Series behaves like a Python list because it only has a positional index.
  2. city_revenues: This Series acts like a Python dictionary because it features both a positional and a label index.

Here’s how to construct a Series with a label index from a Python dictionary:

>>> city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})
>>> city_employee_count
Amsterdam    5
Tokyo        8
dtype: int64

The dictionary keys become the index, and the dictionary values are the Series values.

Just like dictionaries, Series also support .keys() and the in keyword:

>>> city_employee_count.keys()
Index(['Amsterdam', 'Tokyo'], dtype='object')
>>> "Tokyo" in city_employee_count
True
>>> "New York" in city_employee_count
False

You can use these methods to answer questions about your dataset quickly.

Understanding DataFrame Objects

While a Series is a pretty powerful data structure, it has its limitations. For example, you can only store one attribute per key. As you’ve seen with the nba dataset, which features 23 columns, the Pandas Python library has more to offer with its DataFrame. This data structure is a sequence of Series objects that share the same index.

If you’ve followed along with the Series examples, then you should already have two Series objects with cities as keys:

  1. city_revenues
  2. city_employee_count

You can combine these objects into a DataFrame by providing a dictionary in the constructor. The dictionary keys will become the column names, and the values should contain the Series objects:

>>> city_data = pd.DataFrame({
...     "revenue": city_revenues,
...     "employee_count": city_employee_count
... })
>>> city_data
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN

Note how Pandas replaced the missing employee_count value for Toronto with NaN.

The new DataFrame index is the union of the two Series indices:

>>> city_data.index
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')

Just like a Series, a DataFrame also stores its values in a NumPy array:

>>> city_data.values
array([[4.2e+03, 5.0e+00],
       [6.5e+03, 8.0e+00],
       [8.0e+03,     nan]])

You can also refer to the 2 dimensions of a DataFrame as axes:

>>> city_data.axes
[Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'),
 Index(['revenue', 'employee_count'], dtype='object')]
>>> city_data.axes[0]
Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
>>> city_data.axes[1]
Index(['revenue', 'employee_count'], dtype='object')

The axis marked with 0 is the row index, and the axis marked with 1 is the column index. This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.

A DataFrame is also a dictionary-like data structure, so it also supports .keys() and the in keyword. However, for a DataFrame these don’t relate to the index, but to the columns:

>>> city_data.keys()
Index(['revenue', 'employee_count'], dtype='object')
>>> "Amsterdam" in city_data
False
>>> "revenue" in city_data
True

You can see these concepts in action with the bigger NBA dataset. Does it contain a column called "points", or was it called "pts"? To answer this question, display the index and the axes of the nba dataset, then expand the code block below for the solution:

Because you didn’t specify an index column when you read in the CSV file, Pandas has assigned a RangeIndex to the DataFrame:

>>> nba.index
RangeIndex(start=0, stop=126314, step=1)

nba, like all DataFrame objects, has two axes:

>>> nba.axes
[RangeIndex(start=0, stop=126314, step=1),
 Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
        'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
        'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
        'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes'],
       dtype='object')]

You can check the existence of a column with .keys():

>>> "points" in nba.keys()
False
>>> "pts" in nba.keys()
True

The column is called "pts", not "points".

As you use these methods to answer questions about your dataset, be sure to keep in mind whether you’re working with a Series or a DataFrame so that your interpretation is accurate.

Accessing Series Elements

In the section above, you’ve created a Pandas Series based on a Python list and compared the two data structures. You’ve seen how a Series object is similar to lists and dictionaries in several ways. A further similarity is that you can use the indexing operator ([]) for Series as well.

You’ll also learn how to use two Pandas-specific access methods:

  1. .loc
  2. .iloc

You’ll see that these data access methods can be much more readable than the indexing operator.

Using the Indexing Operator

Recall that a Series has two indices:

  1. A positional or implicit index, which is always a RangeIndex
  2. A label or explicit index, which can contain any hashable objects

Next, revisit the city_revenues object:

>>> city_revenues
Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

You can conveniently access the values in a Series with both the label and positional indices:

>>> city_revenues["Toronto"]
8000
>>> city_revenues[1]
8000

You can also use negative indices and slices, just like you would for a list:

>>> city_revenues[-1]
6500
>>> city_revenues[1:]
Toronto    8000
Tokyo      6500
dtype: int64
>>> city_revenues["Toronto":]
Toronto    8000
Tokyo      6500
dtype: int64

If you want to learn more about the possibilities of the indexing operator, then check out Lists and Tuples in Python.

Using .loc and .iloc

The indexing operator ([]) is convenient, but there’s a caveat. What if the labels are also numbers? Say you have to work with a Series object like this:

>>> colors = pd.Series(
...     ["red", "purple", "blue", "green", "yellow"],
...     index=[1, 2, 3, 5, 8]
... )
>>> colors
1       red
2    purple
3      blue
5     green
8    yellow
dtype: object

What will colors[1] return? For a positional index, colors[1] is "purple". However, if you go by the label index, then colors[1] is referring to "red".

The good news is, you don’t have to figure it out! Instead, to avoid confusion, the Pandas Python library provides two data access methods:

  1. .loc refers to the label index.
  2. .iloc refers to the positional index.

These data access methods are much more readable:

>>> colors.loc[1]
'red'
>>> colors.iloc[1]
'purple'

colors.loc[1] returned "red", the element with the label 1. colors.iloc[1] returned "purple", the element with the index 1.

The following figure shows which elements .loc and .iloc refer to:

Pandas Series iloc vs loc

Again, .loc points to the label index on the right-hand side of the image. Meanwhile, .iloc points to the positional index on the left-hand side of the picture.

It’s easier to keep in mind the distinction between .loc and .iloc than it is to figure out what the indexing operator will return. Even if you’re familiar with all the quirks of the indexing operator, it can be dangerous to assume that everybody who reads your code has internalized those rules as well!

Note: In addition to being confusing for Series with numeric labels, the Python indexing operator has some performance drawbacks. It’s perfectly okay to use it in interactive sessions for ad-hoc analysis, but for production code, the .loc and .iloc data access methods are preferable. For further details, check out the Pandas User Guide section on indexing and selecting data.

.loc and .iloc also support the features you would expect from indexing operators, like slicing. However, these data access methods have an important difference. While .iloc excludes the closing element, .loc includes it. Take a look at this code block:

>>> # Return the elements with the implicit index: 1, 2
>>> colors.iloc[1:3]
2    purple
3      blue
dtype: object

If you compare this code with the image above, then you can see that colors.iloc[1:3] returns the elements with the positional indices of 1 and 2. The closing item "green" with a positional index of 3 is excluded.

On the other hand, .loc includes the closing element:

>>> # Return the elements with the explicit index between 3 and 8
>>> colors.loc[3:8]
3      blue
5     green
8    yellow
dtype: object

This code block says to return all elements with a label index between 3 and 8. Here, the closing item "yellow" has a label index of 8 and is included in the output.

You can also pass a negative positional index to .iloc:

>>> colors.iloc[-2]
'green'

You start from the end of the Series and return the second element.

Note: There used to be an .ix indexer, which tried to guess whether it should apply positional or label indexing depending on the data type of the index. Because it caused a lot of confusion, it has been deprecated since Pandas version 0.20.0.

It’s highly recommended that you do not use .ix for indexing. Instead, always use .loc for label indexing and .iloc for positional indexing. For further details, check out the Pandas User Guide.

You can use the code blocks above to distinguish between two Series behaviors:

  1. You can use .iloc on a Series similar to using [] on a list.
  2. You can use .loc on a Series similar to using [] on a dictionary.

Be sure to keep these distinctions in mind as you access elements of your Series objects.

Accessing DataFrame Elements

Since a DataFrame consists of Series objects, you can use the very same tools to access its elements. The crucial difference is the additional dimension of the DataFrame. You’ll use the indexing operator for the columns and the access methods .loc and .iloc on the rows.

Using the Indexing Operator

If you think of a DataFrame as a dictionary whose values are Series, then it makes sense that you can access its columns with the indexing operator:

>>> city_data["revenue"]
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
>>> type(city_data["revenue"])
pandas.core.series.Series

Here, you use the indexing operator to select the column labeled "revenue".

If the column name is a string, then you can use attribute-style accessing with dot notation as well:

>>> city_data.revenue
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

city_data["revenue"] and city_data.revenue return the same output.

There’s one situation where accessing DataFrame elements with dot notation may not work or may lead to surprises. This is when a column name coincides with a DataFrame attribute or method name:

>>> toys = pd.DataFrame([
...     {"name": "ball", "shape": "sphere"},
...     {"name": "Rubik's cube", "shape": "cube"}
... ])
>>> toys["shape"]
0    sphere
1      cube
Name: shape, dtype: object
>>> toys.shape
(2, 2)

The indexing operation toys["shape"] returns the correct data, but the attribute-style operation toys.shape still returns the shape of the DataFrame. You should only use attribute-style accessing in interactive sessions or for read operations. You shouldn’t use it for production code or for manipulating data (such as defining new columns).

Using .loc and .iloc

Similar to Series, a DataFrame also provides .loc and .iloc data access methods. Remember, .loc uses the label and .iloc the positional index:

>>> city_data.loc["Amsterdam"]
revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64
>>> city_data.loc["Tokyo":"Toronto"]
        revenue employee_count
Tokyo   6500    8.0
Toronto 8000    NaN
>>> city_data.iloc[1]
revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64

Each line of code selects a different row from city_data:

  1. city_data.loc["Amsterdam"] selects the row with the label index "Amsterdam".
  2. city_data.loc["Tokyo": "Toronto"] selects the rows with label indices from "Tokyo" to "Toronto". Remember, .loc is inclusive.
  3. city_data.iloc[1] selects the row with the positional index 1, which is "Tokyo".

Alright, you’ve used .loc and .iloc on small data structures. Now, it’s time to practice with something bigger! Use a data access method to display the second-to-last row of the nba dataset. Then, expand the code block below to see a solution:

The second-to-last row is the row with the positional index of -2. You can display it with .iloc:

>>> nba.iloc[-2]
gameorder               63157
game_id          201506170CLE
lg_id                     NBA
_iscopy                     0
year_id                  2015
date_game           6/16/2015
seasongame                102
is_playoffs                 1
team_id                   CLE
fran_id             Cavaliers
pts                        97
elo_i                 1700.74
elo_n                 1692.09
win_equiv             59.2902
opp_id                    GSW
opp_fran             Warriors
opp_pts                   105
opp_elo_i             1813.63
opp_elo_n             1822.29
game_location               H
game_result                 L
forecast              0.48145
notes                     NaN
Name: 126312, dtype: object

You’ll see the output as a Series object.

For a DataFrame, the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the indices, the second parameter selects the columns. You can use these parameters together to select a subset of rows and columns from your DataFrame:

>>> city_data.loc["Amsterdam":"Tokyo", "revenue"]
Amsterdam    4200
Tokyo        6500
Name: revenue, dtype: int64

Note that you separate the parameters with a comma (,). The first parameter, "Amsterdam":"Tokyo", says to select all rows between those two labels. The second parameter comes after the comma and says to select the "revenue" column.

It’s time to see the same construct in action with the bigger nba dataset. Select all games between the labels 5555 and 5559. You’re only interested in the names of the teams and the scores, so select those elements as well. Expand the code block below to see a solution:

First, define which rows you want to see, then list the relevant columns:

>>> nba.loc[5555:5559, ["fran_id", "opp_fran", "pts", "opp_pts"]]

You use .loc for the label index and a comma (,) to separate your two parameters.

You should see a small part of your quite huge dataset:

Pandas DataFrame .loc

The output is much easier to read!

With data access methods like .loc and .iloc, you can select just the right subset of your DataFrame to help you answer questions about your dataset.

Querying Your Dataset

You’ve seen how to access subsets of a huge dataset based on its indices. Now, you’ll select rows based on the values in your dataset’s columns to query your data. For example, you can create a new DataFrame that contains only games played after 2010:

>>> current_decade = nba[nba["year_id"] > 2010]
>>> current_decade.shape
(12658, 23)

You still have all 23 columns, but your new DataFrame only consists of rows where the value in the "year_id" column is greater than 2010.

You can also select the rows where a specific field is not null:

>>> games_with_notes = nba[nba["notes"].notnull()]
>>> games_with_notes.shape
(5424, 23)

This can be helpful if you want to avoid any missing values in a column. You can also use .notna() to achieve the same goal.
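For example, the same filter written with .notna() looks like this:

>>> games_with_notes = nba[nba["notes"].notna()]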

You can even access values of the object data type as str and perform string methods on them:

>>> ers = nba[nba["fran_id"].str.endswith("ers")]
>>> ers.shape
(27797, 23)

You use .str.endswith() to filter your dataset and find all games where the home team’s name ends with "ers".

You can combine multiple criteria and query your dataset as well. To do this, be sure to put each one in parentheses and use the logical operators | and & to separate them.

Note: The operators and, or, &&, and || won’t work here. If you’re curious as to why then check out the section on how the Pandas Python library uses Boolean operators in Python Pandas: Tricks & Features You May Not Know.

Do a search for Baltimore games where both teams scored over 100 points. In order to see each game only once, you’ll need to exclude duplicates:

>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["pts"] > 100) &
...     (nba["opp_pts"] > 100) &
...     (nba["team_id"] == "BLB")
... ]

Here, you use nba["_iscopy"] == 0 to include only the entries that aren’t copies.

Your output should contain five eventful games:

Pandas DataFrame query with multiple criteria

Try to build another query with multiple criteria. In the spring of 1992, both teams from Los Angeles had to play a home game at another court. Query your dataset to find those two games. Both teams have an ID starting with "LA". Expand the code block below to see a solution:

You can use .str to find the team IDs that start with "LA", and you can assume that such an unusual game would have some notes:

>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["team_id"].str.startswith("LA")) &
...     (nba["year_id"] == 1992) &
...     (nba["notes"].notnull())
... ]

Your output should show two games on the day 5/3/1992:

Pandas DataFrame query with multiple criteria: solution of the exercise

Nice find!

When you know how to query your dataset with multiple criteria, you’ll be able to answer more specific questions about your dataset.

Grouping and Aggregating Your Data

You may also want to learn other features of your dataset, like the sum or the mean of a group of elements. Luckily, the Pandas Python library offers grouping and aggregation functions to help you accomplish this task.

A Series has more than twenty different methods for calculating descriptive statistics. Here are some examples:

>>> city_revenues.sum()
18700
>>> city_revenues.max()
8000

The first method returns the total of city_revenues, while the second returns the max value. There are other methods you can use, like .min() and .mean().

Remember, a column of a DataFrame is actually a Series object. For this reason, you can use these same functions on the columns of nba:

>>> points = nba["pts"]
>>> type(points)
<class 'pandas.core.series.Series'>
>>> points.sum()
12976235

A DataFrame can have multiple columns, which introduces new possibilities for aggregations, like grouping:

>>> nba.groupby("fran_id", sort=False)["pts"].sum()
fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387
...

By default, Pandas sorts the group keys during the call to .groupby(). If you don’t want to sort, then pass sort=False. This parameter can lead to performance gains.

You can also group by multiple columns:

>>> nba[
...     (nba["fran_id"] == "Spurs") &
...     (nba["year_id"] > 2010)
... ].groupby(["year_id", "game_result"])["game_id"].count()
year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64

You can practice these basics with an exercise. Take a look at the Golden State Warriors’ 2014-15 season (year_id: 2015). How many wins and losses did they have during the regular season and the playoffs? Expand the code block below for the solution:

First, you can group by the "is_playoffs" field, then by the result:

>>> nba[
...     (nba["fran_id"] == "Warriors") &
...     (nba["year_id"] == 2015)
... ].groupby(["is_playoffs", "game_result"])["game_id"].count()
is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16

is_playoffs=0 shows the results for the regular season, and is_playoffs=1 shows the results for the playoffs.

In the examples above, you’ve only scratched the surface of the aggregation functions that are available to you in the Pandas Python library. To see more examples of how to use them, check out Pandas GroupBy: Your Guide to Grouping Data in Python.

Manipulating Columns

You’ll need to know how to manipulate your dataset’s columns in different phases of the data analysis process. You can add and drop columns as part of the initial data cleaning phase, or later based on the insights of your analysis.

Create a copy of your original DataFrame to work with:

>>> df = nba.copy()
>>> df.shape
(126314, 23)

You can define new columns based on the existing ones:

>>> df["difference"] = df.pts - df.opp_pts
>>> df.shape
(126314, 24)

Here, you used the "pts" and "opp_pts" columns to create a new one called "difference". This new column supports the same methods as the existing ones:

>>> df["difference"].max()
68

Here, you used an aggregation function .max() to find the largest value of your new column.

You can also rename the columns of your dataset. It seems that "game_result" and "game_location" are too verbose, so go ahead and rename them now:

>>> renamed_df = df.rename(
...     columns={"game_result": "result", "game_location": "location"}
... )
>>> renamed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
...
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB

Note that there’s a new object, renamed_df. Like several other data manipulation methods, .rename() returns a new DataFrame by default. If you want to manipulate the original DataFrame directly, then .rename() also provides an inplace parameter that you can set to True.
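As a minimal sketch of the in-place variant, shown on a throwaway copy (temp_df is a hypothetical name) so that the column names used later in this tutorial stay intact:

>>> temp_df = df.copy()
>>> temp_df.rename(
...     columns={"game_result": "result", "game_location": "location"},
...     inplace=True
... )
>>> "result" in temp_df.columns
True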

Your dataset might contain columns that you don’t need. For example, Elo ratings may be a fascinating concept to some, but you won’t analyze them in this tutorial. You can delete the four columns related to Elo:

>>> df.shape
(126314, 24)
>>> elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
>>> df.drop(elo_columns, inplace=True, axis=1)
>>> df.shape
(126314, 20)

Remember, you added the new column "difference" in a previous example, bringing the total number of columns to 24. When you remove the four Elo columns, the total number of columns drops to 20.

Specifying Data Types

When you create a new DataFrame, either by calling a constructor or reading a CSV file, Pandas assigns a data type to each column based on its values. While it does a pretty good job, it’s not perfect. If you choose the right data type for your columns upfront, then you can significantly improve your code’s performance.
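One way to get the types right upfront is to declare them when you read the file. The sketch below assumes the CSV is named nba_all_elo.csv, which may differ from your download, and uses the dtype and parse_dates parameters of pd.read_csv():

>>> nba_typed = pd.read_csv(
...     "nba_all_elo.csv",
...     dtype={"game_location": "category", "game_result": "category"},
...     parse_dates=["date_game"]
... )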

Take another look at the columns of the nba dataset:

>>> df.info()

You’ll see the same output as before:

Pandas DataFrame .info()

Ten of your columns have the data type object. Most of these object columns contain arbitrary text, but there are also some candidates for data type conversion. For example, take a look at the date_game column:

>>> df["date_game"] = pd.to_datetime(df["date_game"])

Here, you use .to_datetime() to specify all game dates as datetime objects.
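You can confirm that the conversion worked by checking the column’s data type; once the column holds datetime values, the .dt accessor also becomes available for extracting parts like the year or weekday:

>>> df["date_game"].dtype
dtype('<M8[ns]')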

Other columns contain text that’s a bit more structured. The game_location column can have only three different values:

>>> df["game_location"].nunique()
3
>>> df["game_location"].value_counts()
A    63138
H    63138
N       38
Name: game_location, dtype: int64

Which data type would you use in a relational database for such a column? You would probably not use a varchar type, but rather an enum. Pandas provides the categorical data type for the same purpose:

>>> df["game_location"] = pd.Categorical(df["game_location"])
>>> df["game_location"].dtype
CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)

Categorical data has a few advantages over unstructured text. When you specify the categorical data type, you make validation easier and save a ton of memory, as Pandas will only store the unique values internally. The higher the ratio of total values to unique values, the more space savings you’ll get.

Run df.info() again. You should see that changing the game_location data type from object to categorical has decreased the memory usage.
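If you want to quantify the effect for a single column rather than eyeball the .info() summary, you can compare .memory_usage(deep=True) for the categorical column against an object copy of it. This is a minimal sketch; the exact byte counts depend on your data and Pandas version:

>>> as_object = df["game_location"].astype("object").memory_usage(deep=True)
>>> as_category = df["game_location"].memory_usage(deep=True)
>>> as_object > as_category
True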

Note: The categorical data type also gives you access to additional methods through the .cat accessor. To learn more, check out the official docs.
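For example, .cat.categories lists the category labels, and .cat.codes exposes the underlying integer codes that make the memory savings possible:

>>> df["game_location"].cat.categories
Index(['A', 'H', 'N'], dtype='object')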

You’ll often encounter datasets with too many text columns. An essential skill for data scientists to have is the ability to spot which columns they can convert to a more performant data type.

Take a moment to practice this now. Find another column in the nba dataset that has a generic data type and convert it to a more specific one. You can expand the code block below to see one potential solution:

game_result can take only two different values:

>>> df["game_result"].nunique()
2
>>> df["game_result"].value_counts()
L    63157
W    63157

To improve performance, you can convert it into a categorical column:

>>> df["game_result"] = pd.Categorical(df["game_result"])

You can use df.info() to check the memory usage.

As you work with more massive datasets, memory savings becomes especially crucial. Be sure to keep performance in mind as you continue to explore your datasets.

Cleaning Data

You may be surprised to find this section so late in the tutorial! Usually, you’d take a critical look at your dataset to fix any issues before you move on to a more sophisticated analysis. However, in this tutorial, you’ll rely on the techniques that you’ve learned in the previous sections to clean your dataset.

Missing Values

Have you ever wondered why .info() shows how many non-null values a column contains? The reason is that this is vital information. Null values often indicate a problem in the data-gathering process. They can make several analysis techniques, like different types of machine learning, difficult or even impossible.

When you inspect the nba dataset with nba.info(), you’ll see that it’s quite neat. Only the column notes contains null values for the majority of its rows:

Pandas DataFrame .info()

This output shows that the notes column has only 5424 non-null values. That means that over 120,000 rows of your dataset have null values in this column.

Sometimes, the easiest way to deal with records containing missing values is to ignore them. You can remove all the rows with missing values using .dropna():

>>> rows_without_missing_data = nba.dropna()
>>> rows_without_missing_data.shape
(5424, 23)

Of course, this kind of data cleanup doesn’t make sense for your nba dataset, because it’s not a problem for a game to lack notes. But if your dataset contains a million valid records and a hundred where relevant data is missing, then dropping the incomplete records can be a reasonable solution.

You can also drop problematic columns if they’re not relevant for your analysis. To do this, use .dropna() again and provide the axis=1 parameter:

>>> data_without_missing_columns = nba.dropna(axis=1)
>>> data_without_missing_columns.shape
(126314, 22)

Now, the resulting DataFrame contains all 126,314 games, but not the sometimes empty notes column.

If there’s a meaningful default value for your use case, then you can also replace the missing values with that:

>>> data_with_default_notes = nba.copy()
>>> data_with_default_notes["notes"].fillna(
...     value="no notes at all",
...     inplace=True
... )
>>> data_with_default_notes["notes"].describe()
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object

Here, you fill the empty notes rows with the string "no notes at all".
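The same approach works for numeric columns. As a hedged sketch of a hypothetical case (the points columns in this dataset don’t actually contain missing values), you could fill gaps in a numeric column with a neutral default or a summary statistic instead of a string:

>>> filled_pts = nba["pts"].fillna(value=nba["pts"].mean())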

Invalid Values

Invalid values can be even more dangerous than missing values. Often, you can perform your data analysis as expected, but the results you get are peculiar. This is especially important if your dataset is enormous or relied on manual data entry. Invalid values are often more challenging to detect, but you can implement some sanity checks with queries and aggregations.

One thing you can do is validate the ranges of your data. For this, .describe() is quite handy. Recall that it returns the following output:

Pandas DataFrame .describe()

The year_id varies between 1947 and 2015. That sounds plausible.

What about pts? How can the minimum be 0? Let’s have a look at those games:

>>> nba[nba["pts"] == 0]

This query returns a single row:

Pandas DataFrame query

It seems the game was forfeited. Depending on your analysis, you may want to remove it from the dataset.
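As a minimal sketch, you could exclude the forfeited game by keeping only rows with a positive score (nba_clean is a hypothetical name); since the query above returned a single row, exactly one row disappears:

>>> nba_clean = nba[nba["pts"] > 0]
>>> nba_clean.shape
(126313, 23)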

Inconsistent Values

Sometimes a value would be entirely realistic in and of itself, but it doesn’t fit with the values in the other columns. You can define some query criteria that are mutually exclusive and verify that these don’t occur together.

In the NBA dataset, the values of the fields pts, opp_pts and game_result should be consistent with each other. You can check this using the .empty attribute:

>>> nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
True
>>> nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty
True

Fortunately, both of these queries return an empty DataFrame.

Be prepared for surprises whenever you’re working with raw datasets, especially if they were gathered from different sources or through a complex pipeline. You might see rows where a team scored more points than their opponent, but still didn’t win—at least, according to your dataset! To avoid situations like this, make sure you add further data cleaning techniques to your Pandas and Python arsenal.

Combining Multiple Datasets

In the previous section, you’ve learned how to clean a messy dataset. Another aspect of real-world data is that it often comes in multiple pieces. In this section, you’ll learn how to grab those pieces and combine them into one dataset that’s ready for analysis.

Earlier, you combined two Series objects into a DataFrame based on their indices. Now, you’ll take this one step further and use .concat() to combine city_data with another DataFrame. Say you’ve managed to gather some data on two more cities:

>>> further_city_data = pd.DataFrame(
...     {"revenue": [7000, 3400], "employee_count": [2, 2]},
...     index=["New York", "Barcelona"]
... )

This second DataFrame contains info on the cities "New York" and "Barcelona".

You can add these cities to city_data using .concat():

>>> all_city_data = pd.concat([city_data, further_city_data], sort=False)
>>> all_city_data
Amsterdam   4200    5.0
Tokyo       6500    8.0
Toronto     8000    NaN
New York    7000    2.0
Barcelona   3400    2.0

Now, the new variable all_city_data contains the values from both DataFrame objects.

Note: As of Pandas version 0.25.0, the sort parameter’s default value is True, but this will change to False soon. It’s good practice to provide an explicit value for this parameter to ensure that your code works consistently in different Pandas and Python versions. For more info, consult the Pandas User Guide.

By default, concat() combines along axis=0. In other words, it appends rows. You can also use it to append columns by supplying the parameter axis=1:

>>> city_countries = pd.DataFrame({
...     "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
...     "capital": [1, 1, 0, 0, 0]},
...     index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
... )
>>> cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
>>> cities
           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0

Note how Pandas added NaN for the missing values. If you want to combine only the cities that appear in both DataFrame objects, then you can set the join parameter to inner:

>>> pd.concat([all_city_data, city_countries], axis=1, join="inner")
           revenue  employee_count  country  capital
Amsterdam     4200             5.0  Holland        1
Tokyo         6500             8.0    Japan        1
Toronto       8000             NaN   Canada        0
Barcelona     3400             2.0    Spain        0

While it’s most straightforward to combine data based on the index, it’s not the only possibility. You can use .merge() to implement a join operation similar to the one from SQL:

>>> countries = pd.DataFrame({
...     "population_millions": [17, 127, 37],
...     "continent": ["Europe", "Asia", "North America"]
... }, index=["Holland", "Japan", "Canada"])
>>> pd.merge(cities, countries, left_on="country", right_index=True)

Here, you pass the parameter left_on="country" to .merge() to indicate what column you want to join on. The result is a bigger DataFrame that contains not only city data, but also the population and continent of the respective countries:

Pandas merge

Note that the result contains only the cities where the country is known and appears in the joined DataFrame.

.merge() performs an inner join by default. If you want to include all cities in the result, then you need to provide the how parameter:

>>> pd.merge(
...     cities,
...     countries,
...     left_on="country",
...     right_index=True,
...     how="left"
... )

With this left join, you’ll see all the cities, including those without country data:

Pandas merge left join

Welcome back, New York & Barcelona!

Visualizing Your Pandas DataFrame

Data visualization is one of the things that works much better in a Jupyter notebook than in a terminal, so go ahead and fire one up. If you need help getting started, then check out Jupyter Notebook: An Introduction. You can also access the Jupyter notebook that contains the examples from this tutorial by clicking the link below:

Get Jupyter Notebook: Click here to get the Jupyter Notebook you'll use to explore data with Pandas in this tutorial.

Include this line to show plots directly in the notebook:

>>> %matplotlib inline

Both Series and DataFrame objects have a .plot() method, which is a wrapper around matplotlib.pyplot.plot(). By default, it creates a line plot. Visualize how many points the Knicks scored throughout the seasons:

>>> nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()

This shows a line plot with several peaks and two notable valleys around the years 2000 and 2010:

Pandas plot line

You can also create other types of plots, like a bar plot:

>>> nba["fran_id"].value_counts().head(10).plot(kind="bar")

This will show the franchises with the most games played:

Pandas plot bar

The Lakers are leading the Celtics by a minimal edge, and there are six further teams with a game count above 5000.

Now try a more complicated exercise. In 2013, the Miami Heat won the championship. Create a pie plot showing the count of their wins and losses during that season. Then, expand the code block to see a solution:

First, you define criteria to include only the Heat’s games from 2013. Then, you create a plot in the same way as you’ve seen above:

>>> nba[
...     (nba["fran_id"] == "Heat") &
...     (nba["year_id"] == 2013)
... ]["game_result"].value_counts().plot(kind="pie")

Here’s what a champion pie looks like:

Pandas plot pie

The slice of wins is significantly larger than the slice of losses!

Sometimes, the numbers speak for themselves, but often a chart helps a lot with communicating your insights. To learn more about visualizing your data, check out Interactive Data Visualization in Python With Bokeh.

Conclusion

In this tutorial, you’ve learned how to start exploring a dataset with the Pandas Python library. You saw how you could access specific rows and columns to tame even the largest of datasets. Speaking of taming, you’ve also seen multiple techniques to prepare and clean your data, by specifying the data type of columns, dealing with missing values, and more. You’ve even created queries, aggregations, and plots based on those.

Now you can:

  • Work with Series and DataFrame objects
  • Subset your data with .loc, .iloc, and the indexing operator
  • Answer questions with queries, grouping, and aggregation
  • Handle missing, invalid, and inconsistent data
  • Visualize your dataset in a Jupyter notebook

This journey using the NBA stats only scratches the surface of what you can do with the Pandas Python library. You can power up your project with Pandas tricks, learn techniques to speed up Pandas in Python, and even dive deep to see how Pandas works behind the scenes. There are many more features for you to discover, so get out there and tackle those datasets!

You can get all the code examples you saw in this tutorial by clicking the link below:

Get Jupyter Notebook: Click here to get the Jupyter Notebook you'll use to explore data with Pandas in this tutorial.



Vinta Software: Counting Queries: Basic Performance Testing in Django

It's very common to read about testing techniques such as TDD and how to test application business logic. But testing the performance of an application is a whole different issue. There are many ways you can do it, but a common approach is to set up an environment where you can DDoS your application and watch how it behaves. This is a very interesti

Catalin George Festila: Python 3.7.5 : Post class and migration process.

Today I will solve some issues with the Django framework: create a new class for posts, explain how the migration process works, and use the database with the Django shell. Let's activate the environment:

[mythcat@desk django]$ source env/bin/activate

I used my old project django-chart, see my old tutorials. Let's add some source code to models.py to create a class for posts on my website:
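A minimal sketch of what such a models.py class might look like (the field names here are illustrative, not taken from the original post):

from django.db import models

class Post(models.Model):
    # Illustrative fields for a simple blog post
    title = models.CharField(max_length=200)
    body = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title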

Kushal Das: 5 months of Internet shutdown in Kashmir and more fascist attacks in India


Since 5th August 2019, Kashmir has been under a communication shutdown. SMS service for one particular connection provider is now available for postpaid users, but the Internet is still down for all Indian citizens of Kashmir.

That is more than 155 days of Internet shutdown. If you are reading this blog post, it means you have an active Internet connection, and you can connect to the different servers/services that are essential to modern life. Now, think about all of those citizens of India living in Kashmir. Think about the problems they face when they have to access a website for job/medical/banking/travel or any other necessary work.

The current fascist regime of India has kept shouting about “Digital India” for the last few years, while at the same time using Internet shutdowns as a tool of oppression. By imposing a complete communication shutdown and blocking reporters, they made sure that only the false stories from the state could reach readers and viewers across the world. But a few brave outside journalists, and many brave local journalists from Kashmir, kept pushing the real news from the ground. They tried their best to record atrocities.

This story in the New Yorker by Dexter Filkins should be the one for everyone to read. Take your time to read how brave Rana Ayyub and the author managed to sneak into Kashmir and report from there.

Internet shutdowns across India

Now, if you think that the Indian government is doing this only in Kashmir, then you are totally wrong. In the last few years, India saw the highest number of Internet shutdowns across the country. The government did not care about the reason; given any chance, they shut down the Internet. During the current protests against the regime, they shut down the Internet in parts of Delhi, the capital of India. The BBC did another story on why India has the highest number of Internet shutdowns.

To find all the instances of the shutdown, have a look at this site from SFLC India team.

Latest attack on students and professors of JNU

Jawaharlal Nehru University (JNU) is India’s topmost university, a place where leaders in many different fields got their education, including Nobel laureates. Yesterday evening a bunch of goons from the student wing (ABVP) of the party in power (BJP) went inside the campus (with the full support of the Delhi Police, who waited outside) and started attacking students and professors with rods and other weapons. They turned off all the street lights, but, as they forgot to shut down the Internet in the area, students managed to send out SOS messages. Search #SOSJNU on Twitter to see the scale of the atrocity. Now, think for a second: what if they had managed to shut down the Internet before the attack, just like they are doing now in Kashmir and many other parts of India? Economist and Nobel laureate Abhijit Banerjee commented that this has “Echoes of Germany moving towards Nazi rule”.

Why should this matter to you, the technologist?

The Internet is one of the major binding materials of the modern world and of all the technologies we enjoy today. Think about the pain and oppression people have to go through when this basic necessity is cut off from their lives.

Most people do not have a voice to raise for themselves. If we don’t take notice, then the whole country will be lost. And we know from history what happens next.

People still count India as a democracy, actually the largest in the world. But unless we rise up, this so-called democracy will be crushed by the fascist regime in no time.

Quick point about different mesh-network and other solutions available at Internet shutdown time

We need more documentation and examples (also translated into local languages) of the different tools available, which can help citizens when the regime is trying its best to shut down the Internet. India is also known for random blocking of sites, and this is where free software like the Tor Project becomes so essential.

Reuven Lerner: Looking back at 2019, looking forward to 2020


Hi, and welcome to 2020! The last year (2019) was quite a whirlwind for me and my work — and I thus wanted to take a few minutes to summarize what I’ve done over the past year. But the coming year looks like it’ll be just as exciting, if not more so, and I wanted to fill you in on what you can expect.

Let me start off by saying that I’m extremely grateful to have the opportunity to teach Python to so many people around the world, both in person and online. Thanks so much for your interest in my writing and work, and (for so many of you) for taking the time to e-mail me with corrections and suggestions. It means a lot to me.

Summary of 2019

  • On-site training: I traveled quite a bit in 2019, teaching in-person courses at companies in the US, Europe, India, and China. (And of course, I’m teaching quite a bit in Israel, where I live.) 
  • Conferences: I attended PyCon in Cleveland, Ohio, where I also gave a talk on “Practical Decorators” and sponsored a booth, where I gave away more than 800 “Weekly Python Exercise” T-shirts.  I also attended Euro Python in Basel, Switzerland, where I gave my “Practical Decorators” talk a second time, and met lots of great Python developers.
  • Local talks: I gave talks to local Python user groups in Beijing, China and Hyderabad, India.  I also met some subscribers to my “Better developers” list in San Jose, California when I was there!
  • Online courses: I released three new paid courses in 2019:  Intro Python Functions, NumPy, and Pandas.  All three courses include many exercises, as well as video lectures.
  • Weekly Python Exercise: There are now six distinct versions of Weekly Python Exercise, three for beginners, and three for intermediate/advanced Python developers.  Each cohort has been larger than the previous one.
  • Book: My book, “Python Workout,” was released in early edition (MEAP) format by Manning, and is slated to be complete within the next two months. It includes 50 Python exercises to improve your fluency, as well as a lot of background material, additional exercises, and insights that I’ve gained in teaching over the years. I have been very impressed with Manning and all they’ve done to make the book far better than I could have done on my own.
  • Free online course: I also released a new, free online course, aimed at helping people who are interviewing for Python programming positions, called “Ace Python Interviews.” So far, the response has been overwhelming; I hope to get this course out to as many people as possible, to help them get the Python job of their dreams.
  • YouTube: I started a series of videos, walking through the Python standard library.  I had to pause that series in order to do a few other projects, but hope to get back to it within the coming weeks, and thus explain more about the standard library to the world. Subscribe to my YouTube channel to get regular updates!
  • Twitter: I recently started tweeting interesting questions that I get in my courses, along with their answers.  I hope to keep doing this a few times a week, to give short insights about Python based on real-world questions and problems.  Follow me on Twitter to get the latest!
  • Blogging: I wrote a number of Python-related articles on my blog this year, including one about the search path for attributes in Python, which I call ICPO — instance, class, parents, and object. 
  • Trainer Weekly: I continue to write my newsletter for trainers, about training.  If you’re interested in the business, logistics, and pedagogy of the training industry, then feel free to sign up!
  • Better developers: My free, weekly list about Python has grown to more than 14,000 subscribers from around the world, up from about 8,000 subscribers on January 1st of 2019. You can subscribe here: https://lerner.co.il/newsletter.
  • Podcast hosting: After several years of co-hosting the Freelancers Show podcast, I left, along with my co-panelists.  We’ll almost certainly be starting a new podcast in the near future to help people with their freelancing/consulting issues.
  • Podcast appearances: I appeared on a whole bunch of podcasts in 2019, including Talk Python (twice), Test & Code, Teaching Python, and Profitable Python.  I also appeared on the “You Can Learn Chinese” podcast, where I talked about my journey learning to speak, read, and write Chinese. 

What’s planned for 2020

  • On-site training: I’m already booked solid through March of this year, and partly through September. I already expect to return to the US, UK, India, and China, and will try to announce it when I’m in town, to meet up with subscribers.  If you want to book me for in-person training at your company, please reply to this message — we can chat about your needs!  You can even book time to speak with me via this link: https://calendly.com/reuvenlerner/corporate-training-needs
  • New courses: I’m adding two new courses to my existing list. First, I have a new one-day course in “pytest” testing, which I’m very excited to start offering. I’ve also decided to go back to my roots in Web development, and I’m developing a new course in creating Web applications using Flask. The first course is already being taught, and the second will be ready by the spring. 
  • Conferences: I’ll once again be sponsoring a booth at PyCon 2020, which will take place this year in Pittsburgh, PA. (And yes, I’ll again be giving out T-shirts!)  If you plan to be at PyCon, please let me know; I’d love to meet you in person.  (I’ve applied to speak, and hope that I’ll manage to get a slot there, as well.)  I am also planning to attend Euro Python in Dublin, Ireland toward the end of July.  I’m open to attending other conferences; if you are running a conference and would like to have me speak there, please drop me a line.
  • Online courses: I’m planning to release 3-5 new courses during 2020. At this point, I’m going to divide things up between introductory and advanced courses. The beginner courses will likely be about working with files and modules, while the advanced courses will likely be about iterators/generators, decorators, and advanced object-oriented programming.
  • Weekly Python Exercise: All six existing WPE cohorts are already scheduled for 2020, with WPE A1: Data structures for beginners starting on January 14th.
  • New Weekly Python Exercise courses: I’m planning to start new cohorts of WPE on specific topics, such as Web development, design patterns, and data science. These will follow the same WPE format, but be on particular topics.  I’m hoping to start at least 1-2 of these by the summer.
  • Certification: A number of people have asked me about certification for my courses. I have some ideas for how I’ll do that, and I’m going to try some experiments with WPE cohorts early this year to see how it goes. I realize that getting a certificate at the conclusion of a course is worth quite a bit, and want to help people to that end.
  • More “workout” books: I’m already speaking with Manning about producing additional “workout” books.  I hope to start on at least one by the end of 2020, and to have a MEAP available for people to start looking, tinkering, and responding.
  • Podcast: As I mentioned above, my former “Freelancers Show” panelists and I are looking to start a new podcast about freelancing in the near future. I’ll announce more details here.  I’m also toying with the idea of starting a Python-related podcast — if you have thoughts about this, please let me know!
  • China: I’m setting up a new company in China to distribute my online courses there, with Chinese subtitles and mobile payment support (aka AliPay and WeChat wallet). I hope to have more details in the coming months, but if you’re based in China and have insights into what people might want, I would be happy to hear from you.  (For now, my Chinese isn’t nearly good enough to teach in the language, so the lectures will continue to be in English.)
  • PySpa training: Earlier this year, my family and I took a vacation to Rhodes, a Greek island in the Mediterranean. It was October, but still more than nice enough to go in the water and enjoy the weather. I’m thinking of offering one or more of my 4-day Python courses at a similar venue, with intensive training during the day — and free time for swimming, eating, and touring at night.  If this sounds interesting to you, then please tell me what you think!

I hope that you also have some big plans for 2020. Best of luck with them, and I hope to see you at one or more of my courses and visits during the coming year.

The post Looking back at 2019, looking forward to 2020 appeared first on Reuven Lerner.

Erik Marsja: Pandas drop_duplicates(): How to Drop Duplicated Rows



In this post, we will learn how to use Pandas drop_duplicates() to remove duplicate records and combinations of columns from a Pandas dataframe. That is, we will delete duplicate data and only keep the unique values.

This Pandas tutorial will cover the following: what’s needed to follow the tutorial, importing Pandas, and how to create a dataframe from a dictionary. After this, we will get into how to use Pandas drop_duplicates() to drop duplicate rows and duplicate columns.

Note that we will drop duplicates using both Pandas and Pyjanitor, a Python package that extends Pandas with an API based on verbs. It’s much like working with the Tidyverse packages in R. See this post for more about working with Pyjanitor. However, we will only use Pyjanitor to drop duplicate columns from a Pandas dataframe.

Prerequisites

As usual in the Python tutorials focusing on Pandas, we need to have both Python 3 and Pandas installed. Python can be installed by downloading it here or by installing a Python distribution such as Anaconda or Canopy. If we install Anaconda, for instance, we’ll also get Pandas. Obviously, if we want to use Pyjanitor, we also need to install it: pip install pyjanitor. Note, this post explains how to install, use, and upgrade Python packages using pip or conda. That said, now we can continue with the Pandas drop duplicates tutorial.

Creating a Dataframe from a Dictionary

In this section of the Pandas drop_duplicates() tutorial, we are going to create a Pandas dataframe from a dictionary. First, we are going to create a dictionary, and then we use pd.DataFrame() to create a Pandas dataframe.

import pandas as pd

data = {'FirstName':['Steve', 'Steve', 'Erica',
                      'John', 'Brody', 'Lisa', 'Lisa'],
        'SurName':['Johnson', 'Johnson', 'Ericson',
                  'Peterson', 'Stephenson', 'Bond', 'Bond'],
        'Age':[34, 34, 40, 
                44, 66, 51, 51],
        'Sex':['M', 'M', 'F', 'M',
               'M', 'F', 'F']}

df = pd.DataFrame(data)

Now, before removing duplicate rows, we are going to print the dataframe with the duplicated rows highlighted. Here, we use the Styling API, which enables us to apply conditional styling. In the code chunk below, we create a function that checks for duplicate rows and then sets the background color of those rows to red.

def highlight_dupes(x):
    df = x.copy()
    
    df['Dup'] = df.duplicated(keep=False)
    mask = df['Dup'] == True
    
    df.loc[mask, :] = 'background-color: red'
    df.loc[~mask,:] = 'background-color: ""'
    return df.drop('Dup', axis=1) 

df.style.apply(highlight_dupes, axis=None)
Duplicated rows highlighted in red.

Note, the above code was found on Stack Overflow and adapted to the problem of the current post.

Finally, before going on and deleting duplicate rows we can use Pandas groupby() and size() to count the duplicated rows:

df.groupby(df.columns.tolist(),as_index=False).size()
Duplicated rows counted

Learn more about grouping categorical data and describing the data using the groupby() method.

Pandas drop_duplicates(): Deleting Duplicate Rows

In this section, we are going to drop rows that are the same using Pandas drop_duplicates(). This is very simple, we just run the code example below.

df_new = df.drop_duplicates()
df_new

Now, in the image above we can see that the duplicate rows were removed from the Pandas dataframe but we can count the rows, again to double-check.

df_new.groupby(df_new.columns.tolist(), as_index=False).size()

Pandas Drop Duplicates with Subset

If we want to remove duplicates from a Pandas dataframe where only one column, or a subset of columns, contains the same data, we can use the subset argument. When using the subset argument with Pandas drop_duplicates(), we tell the method which column, or list of columns, we want to be unique. Now, let’s use Pandas drop_duplicates() again, this time on a subset of columns:

df_new = df.drop_duplicates(subset=['FirstName', 'SurName'])
df_new
Pandas dataframe with duplicated rows

Furthermore, we can also specify which duplicated row we want to keep with the keep argument. If we don’t change it, the default keep='first' retains the first occurrence and removes the later duplicates:

df_new = df.drop_duplicates(subset=['FirstName', 'SurName'])
df_new

If we, on the other hand, set it to “last” we will see a different result:

df_new = df.drop_duplicates(keep='last', subset=['FirstName', 'SurName'])
df_new

Pandas Drop Duplicated Columns using Pyjanitor

In the last section, of this Pandas drop duplicated data tutorial, we will work with Pyjanitor to remove duplicated columns. This is very easy and we do this with the drop_duplicate_columns() method and the column_name argument. Now, before dropping duplicated columns we create a dataframe from a dictionary:

import pandas as pd
import janitor  # pyjanitor: registers extra DataFrame methods such as drop_duplicate_columns()

data = {'FirstName':['Steve', 'Steve', 'Erica',
                      'John', 'Brody', 'Lisa', 'Lisa'],
        'SurName':['Johnson', 'Johnson', 'Ericson',
                  'Peterson', 'Stephenson', 'Bond', 'Bond'],
        'Age':[34, 34, 40, 
                44, 66, 51, 51],
        'Sex':['M', 'M', 'F', 'M',
               'M', 'F', 'F'],
        'S':['M', 'M', 'F', 'M',
               'M', 'F', 'F']}

df = pd.DataFrame(data)

df

As can be seen in the image above, we have the duplicated columns “Sex” and “S”, and we now remove the duplicate column:

df = df.drop_duplicate_columns(column_name='S')

Conclusion: Using Pandas drop_duplicates()

In conclusion, using Pandas drop_duplicates() was very easy and in this post, we have learned how to:

  • Delete duplicated rows
  • Delete duplicated rows using a subset of columns
  • Delete duplicated columns

The first two things we learned were accomplished using drop_duplicates() in Pandas, and the third was done with Pyjanitor.

The post Pandas drop_duplicates(): How to Drop Duplicated Rows appeared first on Erik Marsja.

Mike Driscoll: Top 10 Most Read Mouse vs Python Articles of 2019


2019 was a good year for my blog. While we didn’t end up getting a lot of new readers, we did receive a small bump. There has also been a lot more interest in the books that are available on this site.

For the year 2019, these are the top ten most read:

Note that none of these articles were actually written in 2019. Half of them were written in 2018, and one of them dates all the way back to 2010. Interestingly enough, my most popular article actually written in 2019 is about using Python to take a photo of the black hole. That article ranks way down at #28 overall.

For 2020, I am going to work hard at creating new content and tutorials that you will find useful in your Python journey. In the meantime, I hope you’ll enjoy reading the archives while I work on some new ones!

The post Top 10 Most Read Mouse vs Python Articles of 2019 appeared first on The Mouse Vs. The Python.
