Django Weblog: Django 3.0 Released

The Django team is happy to announce the release of Django 3.0.

The release notes cover the raft of new features in detail, but a few highlights are:

  • Django 3.0 begins our journey to making Django fully async-capable by providing support for running as an ASGI application.
  • Django now officially supports MariaDB 10.1 and higher.
  • Custom enumeration types TextChoices, IntegerChoices, and Choices are now available as a way to define model field choices (see the sketch after this list).
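
As a rough sketch of the new enumeration types (the Student model and its field names here are hypothetical, not taken from the release notes), field choices can now be defined like this:

from django.db import models

class Student(models.Model):
    class YearInSchool(models.TextChoices):
        FRESHMAN = 'FR', 'Freshman'
        SENIOR = 'SR', 'Senior'

    year_in_school = models.CharField(
        max_length=2,
        choices=YearInSchool.choices,
        default=YearInSchool.FRESHMAN,
    )

The .choices attribute expands to the familiar list of (value, label) tuples that CharField expects.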

You can get Django 3.0 from our downloads page or from the Python Package Index. The PGP key ID used for this release is Carlton Gibson: E17DF5C82B4F9D00.

With the release of Django 3.0, Django 2.2 has reached the end of mainstream support. The final minor bug fix release (which is also a security release), 2.2.8, was issued today. Django 2.2 is an LTS release and will receive security and data loss fixes until April 2022. All users are encouraged to upgrade before then to continue receiving fixes for security issues.

See the downloads page for a table of supported versions and the future release schedule.

Chris Moffitt: Building a Windows Shortcut with Python

Introduction

I prefer to use miniconda for installing a lightweight python environment on Windows. I also like to create and customize Windows shortcuts for launching different conda environments in specific working directories. This is an especially useful tip for new users who are not as familiar with the command line on Windows.

After spending way too much time trying to get the shortcuts setup properly on multiple Windows machines, I spent some time automating the link creation process. This article will discuss how to use python to create custom Windows shortcuts to launch conda environments.

Launching Windows Environments

miniconda is great for streamlining the install of packages on Windows and using conda for environment management.

By default, miniconda tries to have as minimal an impact on your system as possible. For example, a default install will not add any python information to your default path, nor will it require admin privileges for installation. This is “a good thing” but it means that you need to do a couple of extra steps to get your python environment working from a standard Windows prompt. For new users this is just one more step in the python installation process.

Fortunately, Anaconda (fka Continuum) provides all the foundations to launch a powershell or command prompt with everything setup for your environment. In fact, the default install will create some shortcuts to do exactly that.

However, I had a hard time modifying these shortcuts to customize the working directory. Additionally, it’s really useful to automate a new user setup instead of trying to walk someone through this tedious process by hand. Hence the need for this script to automate the process.

For the purposes of this article, I am only going to discuss using the command prompt approach to launching python. There is also a powershell option, which is a little more complex, but the same principles apply to both.

Once miniconda is installed, the preferred way to launch a python shell is to use miniconda’s activate.bat file to configure the shell environment. On my system (with a default miniconda install), the file is stored here: C:/Users/CMoffitt/AppData/Local/Continuum/miniconda3/Scripts/activate.bat

In addition, I recommend that you keep your conda base environment relatively lightweight and use another environment for your actual work. On my system, I have a work environment that I want to start up with this shortcut.

When conda creates a new environment on Windows, the default directory location for the environment looks like this: C:/Users/CMoffitt/AppData/Local/Continuum/miniconda3/envs/work . You can pass this full path to the activate.bat file and it will launch for you and automatically start with the work environment activated.

The final piece of the launch puzzle is to use cmd.exe /K to run a command shell and return to a prompt once the shell is active.

The full command, if you were to type it, would look something like this:

cmd.exe /K C:/Users/CMoffitt/AppData/Local/Continuum/miniconda3/Scripts/activate.bat C:/Users/CMoffitt/AppData/Local/Continuum/miniconda3/envs/work

The overall concept is pretty straightforward. The challenge is that the paths get pretty long and we want to be smart about making sure we make this as future-proof and portable as possible.

Special Folders

The winshell module makes the process of working with Windows shortcuts a lot easier. This module has been around for a while and has not been updated recently but it worked just fine for me. Since it is a relatively thin wrapper over pywin32 there’s not much need to keep updating winshell.

For the purposes of this article, I used winshell to access special folders, create shortcuts and read shortcuts. The documentation is straightforward but still uses os.path for file path manipulations so I decided to update my examples to use pathlib. You can refer to my previous post for an intro to pathlib.

One of the useful aspects of winshell is that it gives you shortcuts to access special directories on Windows. It’s a best practice not to hard code paths but use the aliases that Windows provides. This way, your scripts should work seamlessly on someone else’s machine and work across different versions of Windows.

As shown above, the paths to our miniconda files are buried pretty deep and are dependent on the logged in user’s profile. Trying to hard code all this would be problematic. Talking a new user through the process can be challenging as well.

In order to demonstrate winshell, let’s get the imports in place:

import winshell
from pathlib import Path

If we want to get the user’s profile directory, we can use the folder function:

profile = winshell.folder('profile')

Which automatically figures out that it is:

'C:\\Users\\CMoffitt'

Winshell offers access to many different folders that can be accessed via their CSIDL (constant special item ID list) names. Here is a list of CSIDLs for reference. As a side note, it looks like the CSIDL has been replaced with KNOWNFOLDERID, but in my limited testing, the CSIDLs I’m using in this article are supported for backwards compatibility.
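
For instance, the two CSIDL values used later in this article typically resolve to paths like the following on a default Windows setup (these example paths are illustrative only; they vary by machine and user):

# Local application data, where a default miniconda install lives
winshell.folder('CSIDL_LOCAL_APPDATA')  # e.g. 'C:\\Users\\CMoffitt\\AppData\\Local'

# The My Documents folder
winshell.folder('CSIDL_PERSONAL')       # e.g. 'C:\\Users\\CMoffitt\\Documents'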

One of the things I like to do is use Pathlib to make some of the needed manipulations a little bit easier. In the example above, the profile variable is a string. I can pass the string to Path() which will make subsequent operations easier when building up our paths.

Let’s illustrate by getting the full path to my desktop using the convenience function available for the desktop folder:

desktop = Path(winshell.desktop())

Which looks like this now:

WindowsPath('C:/Users/CMoffitt/OneDrive-Desktop')

We can combine these folder approaches to get a location of the miniconda base directory.

miniconda_base = Path(winshell.folder('CSIDL_LOCAL_APPDATA')) / 'Continuum' / 'miniconda3'

If we want to validate that this is a valid directory:

miniconda_base.is_dir()
True

In my opinion this is much cleaner than trying to do a lot of os.path.join calls to build up the directory structure.

The other location we need is cmd.exe which we can get with CSIDL_SYSTEM .

win32_cmd = str(Path(winshell.folder('CSIDL_SYSTEM')) / 'cmd.exe')

You will notice that I converted the Path to a string by using str . I did this because winshell expects all of its inputs to be strings. It does not know how to handle a pathlib object directly. This is important to keep in mind when creating the actual shortcut in the code below.

Working with Shortcuts

When working with shortcuts on Windows, you can right click on the shortcut icon and view the properties. Most people have probably seen something like this:

[Screenshot: the Windows shortcut Properties dialog]

As you get really long command strings, it can be difficult to view in the GUI. Editing them can also get a little challenging when it comes to making sure quotes and escape characters are used correctly.

Winshell provides a dump function to make the actual shortcut properties easier to review.

For example, if we want to look at the existing shortcut in our start menu, we need to get the full path to the .lnk file, then create a shortcut object and display the values using dump .

lnk = Path(winshell.programs()) / "Anaconda3 (64-bit)" / "Anaconda Prompt (miniconda3).lnk"
shortcut = winshell.shortcut(str(lnk))
shortcut.dump()
{
C:\Users\CMoffitt\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Anaconda3 (64-bit)\Anaconda Prompt (miniconda3).lnk -> C:\Windows\System32\cmd.exe

arguments: "/K" C:\Users\CMoffitt\AppData\Local\Continuum\miniconda3\Scripts\activate.bat C:\Users\CMoffitt\AppData\Local\Continuum\miniconda3
description: Anaconda Prompt (miniconda3)
hotkey: 0
icon_location: ('C:\\Users\\CMoffitt\\AppData\\Local\\Continuum\\miniconda3\\Menu\\Iconleak-Atrous-Console.ico', 0)
path: C:\Windows\System32\cmd.exe
show_cmd: normal
working_directory: %HOMEPATH%
}

This is a simple representation of all the information we need to use to create a new shortcut link. In my experience this view can make it much easier to understand how to create your own.

Now that we know the information we need, we can create our own shortcut.

We will create our full argument string which includes cmd.exe /K followed by the activate.bat then the environment we want to start in:

arg_str = "/K " + str(miniconda_base / "Scripts" / "activate.bat") + " " + str(miniconda_base / "envs" / "work")

We also have the option of passing in an icon which needs to include a full path as well as the index for the icon.

For this example, I’m using the default icon that miniconda uses. Feel free to modify for your own usage.

icon = str(miniconda_base / "Menu" / "Iconleak-Atrous-Console.ico")

The final portion is to start in a specified working directory.

In my case, I have a My Documents/py_work directory that contains all my python code. We can use CSIDL_PERSONAL to access My Documents and build the full path to py_work .

my_working = str(Path(winshell.folder('CSIDL_PERSONAL')) / "py_work")

Now that all the variables are defined, we create a shortcut link on the desktop:

link_filepath = str(desktop / "python_working.lnk")

with winshell.shortcut(link_filepath) as link:
    link.path = win32_cmd
    link.description = "Python(work)"
    link.arguments = arg_str
    link.icon_location = (icon, 0)
    link.working_directory = my_working

You should now see something like this on your desktop:

[Screenshot: the Properties dialog for the new python_working shortcut]

You can easily customize it to use your own directories and environments. It’s a short bit of code but in my opinion it is a lot easier to understand and customize than dealing with Windows shortcut files by hand.

Summary

Here is the full example for creating a simple shortcut on your desktop that activates a working conda environment and starts in a specific working directory.

import winshell
from pathlib import Path

# Define all the file paths needed for the shortcut
# Assumes default miniconda install
desktop = Path(winshell.desktop())
miniconda_base = Path(winshell.folder('CSIDL_LOCAL_APPDATA')) / 'Continuum' / 'miniconda3'
win32_cmd = str(Path(winshell.folder('CSIDL_SYSTEM')) / 'cmd.exe')
icon = str(miniconda_base / "Menu" / "Iconleak-Atrous-Console.ico")

# This will point to My Documents/py_work. Adjust to your preferences
my_working = str(Path(winshell.folder('CSIDL_PERSONAL')) / "py_work")

link_filepath = str(desktop / "python_working.lnk")

# Build up all the arguments to cmd.exe
# Use /K so that the command prompt will stay open
arg_str = "/K " + str(miniconda_base / "Scripts" / "activate.bat") + " " + str(miniconda_base / "envs" / "work")

# Create the shortcut on the desktop
with winshell.shortcut(link_filepath) as link:
    link.path = win32_cmd
    link.description = "Python(work)"
    link.arguments = arg_str
    link.icon_location = (icon, 0)
    link.working_directory = my_working

I hope this script will save you just a little bit of time when you are trying to get your Windows system setup to run various conda environments. If you have any other favorite tips you use, let me know in the comments.

Real Python: Pandas: How to Read and Write Files

Pandas is a powerful and flexible Python package that allows you to work with labeled and time series data. It also provides statistics methods, enables plotting, and more. One crucial feature of Pandas is its ability to write and read Excel, CSV, and many other types of files. Functions like the Pandas read_csv() method enable you to work with files effectively. You can use them to save the data and labels from Pandas objects to a file and load them later as Pandas Series or DataFrame instances.

In this tutorial, you’ll learn:

  • What the Pandas IO tools API is
  • How to read and write data to and from files
  • How to work with various file formats
  • How to work with big data efficiently

Let’s start reading and writing files!

Free Bonus: 5 Thoughts On Python Mastery, a free course for Python developers that shows you the roadmap and the mindset you'll need to take your Python skills to the next level.

Installing Pandas

The code in this tutorial is executed with CPython 3.7.4 and Pandas 0.25.1. It would be beneficial to make sure you have the latest versions of Python and Pandas on your machine. You might want to create a new virtual environment and install the dependencies for this tutorial.

First, you’ll need the Pandas library. You may already have it installed. If you don’t, then you can install it with pip:

$ pip install pandas

Once the installation process completes, you should have Pandas installed and ready.

Anaconda is an excellent Python distribution that comes with Python, many useful packages like Pandas, and a package and environment manager called Conda. To learn more about Anaconda, check out Setting Up Python for Machine Learning on Windows.

If you don’t have Pandas in your virtual environment, then you can install it with Conda:

$ conda install pandas

Conda is powerful as it manages the dependencies and their versions. To learn more about working with Conda, you can check out the official documentation.

Preparing Data

In this tutorial, you’ll use the data related to 20 countries. Here’s an overview of the data and sources you’ll be working with:

  • Country is denoted by the country name. Each country is in the top 10 list for either population, area, or gross domestic product (GDP). The row labels for the dataset are the three-letter country codes defined in ISO 3166-1. The column label for the dataset is COUNTRY.

  • Population is expressed in millions. The data comes from a list of countries and dependencies by population on Wikipedia. The column label for the dataset is POP.

  • Area is expressed in thousands of kilometers squared. The data comes from a list of countries and dependencies by area on Wikipedia. The column label for the dataset is AREA.

  • Gross domestic product is expressed in millions of U.S. dollars, according to the United Nations data for 2017. You can find this data in the list of countries by nominal GDP on Wikipedia. The column label for the dataset is GDP.

  • Continent is either Africa, Asia, Oceania, Europe, North America, or South America. You can find this information on Wikipedia as well. The column label for the dataset is CONT.

  • Independence day is a date that commemorates a nation’s independence. The data comes from the list of national independence days on Wikipedia. The dates are shown in ISO 8601 format. The first four digits represent the year, the next two numbers are the month, and the last two are for the day of the month. The column label for the dataset is IND_DAY.

This is how the data looks as a table:

     COUNTRY      POP      AREA      GDP       CONT       IND_DAY
CHN  China        1398.72  9596.96   12234.78  Asia
IND  India        1351.16  3287.26   2575.67   Asia       1947-08-15
USA  US           329.74   9833.52   19485.39  N.America  1776-07-04
IDN  Indonesia    268.07   1910.93   1015.54   Asia       1945-08-17
BRA  Brazil       210.32   8515.77   2055.51   S.America  1822-09-07
PAK  Pakistan     205.71   881.91    302.14    Asia       1947-08-14
NGA  Nigeria      200.96   923.77    375.77    Africa     1960-10-01
BGD  Bangladesh   167.09   147.57    245.63    Asia       1971-03-26
RUS  Russia       146.79   17098.25  1530.75              1992-06-12
MEX  Mexico       126.58   1964.38   1158.23   N.America  1810-09-16
JPN  Japan        126.22   377.97    4872.42   Asia
DEU  Germany      83.02    357.11    3693.20   Europe
FRA  France       67.02    640.68    2582.49   Europe     1789-07-14
GBR  UK           66.44    242.50    2631.23   Europe
ITA  Italy        60.36    301.34    1943.84   Europe
ARG  Argentina    44.94    2780.40   637.49    S.America  1816-07-09
DZA  Algeria      43.38    2381.74   167.56    Africa     1962-07-05
CAN  Canada       37.59    9984.67   1647.12   N.America  1867-07-01
AUS  Australia    25.47    7692.02   1408.68   Oceania
KAZ  Kazakhstan   18.53    2724.90   159.41    Asia       1991-12-16

You may notice that some of the data is missing. For example, the continent for Russia is not specified because it spreads across both Europe and Asia. There are also several missing independence days because the data source omits them.

You can organize this data in Python using a nested dictionary:

data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96, 'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26, 'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52, 'GDP': 19_485.39, 'CONT': 'N.America', 'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93, 'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77, 'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91, 'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    'NGA': {'COUNTRY': 'Nigeria', 'POP': 200.96, 'AREA': 923.77, 'GDP': 375.77, 'CONT': 'Africa', 'IND_DAY': '1960-10-01'},
    'BGD': {'COUNTRY': 'Bangladesh', 'POP': 167.09, 'AREA': 147.57, 'GDP': 245.63, 'CONT': 'Asia', 'IND_DAY': '1971-03-26'},
    'RUS': {'COUNTRY': 'Russia', 'POP': 146.79, 'AREA': 17_098.25, 'GDP': 1_530.75, 'IND_DAY': '1992-06-12'},
    'MEX': {'COUNTRY': 'Mexico', 'POP': 126.58, 'AREA': 1_964.38, 'GDP': 1_158.23, 'CONT': 'N.America', 'IND_DAY': '1810-09-16'},
    'JPN': {'COUNTRY': 'Japan', 'POP': 126.22, 'AREA': 377.97, 'GDP': 4_872.42, 'CONT': 'Asia'},
    'DEU': {'COUNTRY': 'Germany', 'POP': 83.02, 'AREA': 357.11, 'GDP': 3_693.20, 'CONT': 'Europe'},
    'FRA': {'COUNTRY': 'France', 'POP': 67.02, 'AREA': 640.68, 'GDP': 2_582.49, 'CONT': 'Europe', 'IND_DAY': '1789-07-14'},
    'GBR': {'COUNTRY': 'UK', 'POP': 66.44, 'AREA': 242.50, 'GDP': 2_631.23, 'CONT': 'Europe'},
    'ITA': {'COUNTRY': 'Italy', 'POP': 60.36, 'AREA': 301.34, 'GDP': 1_943.84, 'CONT': 'Europe'},
    'ARG': {'COUNTRY': 'Argentina', 'POP': 44.94, 'AREA': 2_780.40, 'GDP': 637.49, 'CONT': 'S.America', 'IND_DAY': '1816-07-09'},
    'DZA': {'COUNTRY': 'Algeria', 'POP': 43.38, 'AREA': 2_381.74, 'GDP': 167.56, 'CONT': 'Africa', 'IND_DAY': '1962-07-05'},
    'CAN': {'COUNTRY': 'Canada', 'POP': 37.59, 'AREA': 9_984.67, 'GDP': 1_647.12, 'CONT': 'N.America', 'IND_DAY': '1867-07-01'},
    'AUS': {'COUNTRY': 'Australia', 'POP': 25.47, 'AREA': 7_692.02, 'GDP': 1_408.68, 'CONT': 'Oceania'},
    'KAZ': {'COUNTRY': 'Kazakhstan', 'POP': 18.53, 'AREA': 2_724.90, 'GDP': 159.41, 'CONT': 'Asia', 'IND_DAY': '1991-12-16'},
}

columns = ('COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY')

Each row of the table is written as an inner dictionary whose keys are the column names and values are the corresponding data. These dictionaries are then collected as the values in the outer data dictionary. The corresponding keys for data are the three-letter country codes.

You can use this data to create an instance of a Pandas DataFrame. First, you need to import Pandas:

>>> import pandas as pd

Now that you have Pandas imported, you can use the DataFrame constructor and data to create a DataFrame object.

data is organized in such a way that the country codes correspond to columns. You can reverse the rows and columns of a DataFrame with the property .T:

>>> df = pd.DataFrame(data=data).T
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.8       Asia         NaN
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.4  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77  2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77   375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.2  1530.75        NaN  1992-06-12
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
JPN       Japan   126.22   377.97  4872.42       Asia         NaN
DEU     Germany    83.02   357.11   3693.2     Europe         NaN
FRA      France    67.02   640.68  2582.49     Europe  1789-07-14
GBR          UK    66.44    242.5  2631.23     Europe         NaN
ITA       Italy    60.36   301.34  1943.84     Europe         NaN
ARG   Argentina    44.94   2780.4   637.49  S.America  1816-07-09
DZA     Algeria    43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia    25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.9   159.41       Asia  1991-12-16

Now you have your DataFrame object populated with the data about each country.

Note: You can use .transpose() instead of .T to reverse the rows and columns of your dataset. If you use .transpose(), then you can set the optional parameter copy to specify if you want to copy the underlying data. The default behavior is False.
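
For example, continuing with the same data dictionary, the following is equivalent to using .T while explicitly requesting a copy of the underlying data (a small sketch, not a required step):

>>> df = pd.DataFrame(data=data).transpose(copy=True)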

Versions of Python older than 3.6 did not guarantee the order of keys in dictionaries. To ensure the order of columns is maintained for older versions of Python and Pandas, you can specify index=columns:

>>> df = pd.DataFrame(data=data, index=columns).T

Now that you’ve prepared your data, you’re ready to start working with files!

Using the Pandas read_csv() and .to_csv() Functions

A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.

Write a CSV File

You can save your Pandas DataFrame as a CSV file with .to_csv():

>>> df.to_csv('data.csv')

That’s it! You’ve created the file data.csv in your current working directory. You can expand the code block below to see how your CSV file should look:

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

This text file contains the data separated with commas. The first column contains the row labels. In some cases, you’ll find them irrelevant. If you don’t want to keep them, then you can pass the argument index=False to .to_csv().
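
For instance, a call like this one would write the same table to a hypothetical file without the row labels (you would then lose the country codes):

>>> df.to_csv('data-no-index.csv', index=False)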

Read a CSV File

Once your data is saved in a CSV file, you’ll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:

>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.

The parameter index_col specifies the column from the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should determine the value of index_col when the CSV file contains the row labels to avoid loading them as data.

You’ll learn more about using Pandas with CSV files later on in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.

Using Pandas to Write and Read Excel Files

Microsoft Excel is probably the most widely used spreadsheet software. While older versions used binary .xls files, Excel 2007 introduced the new XML-based .xlsx file. You can read and write Excel files in Pandas, similar to CSV files. However, you’ll need to install the following Python packages first:

  • xlwt to write to .xls files
  • openpyxl or XlsxWriter to write to .xlsx files
  • xlrd to read Excel files

You can install them using pip with a single command:

$ pip install xlwt openpyxl xlsxwriter xlrd

You can also use Conda:

$ conda install xlwt openpyxl xlsxwriter xlrd

Please note that you don’t have to install all these packages. For example, you don’t need both openpyxl and XlsxWriter. If you’re going to work just with .xls files, then you don’t need any of them! However, if you intend to work only with .xlsx files, then you’re going to need at least one of them, but not xlwt. Take some time to decide which packages are right for your project.

Write an Excel File

Once you have those packages installed, you can save your DataFrame in an Excel file with .to_excel():

>>> df.to_excel('data.xlsx')

The argument 'data.xlsx' represents the target file and, optionally, its path. The above statement should create the file data.xlsx in your current working directory. That file should look like this:

[Screenshot: data.xlsx opened in Excel]

The first column of the file contains the labels of the rows, while the other columns store data.
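
If you have more than one of the writer packages installed, you can also be explicit about which one Pandas should use through the optional engine parameter. This is only a sketch; omitting engine and letting Pandas pick a default is usually fine:

>>> df.to_excel('data.xlsx', engine='xlsxwriter')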

Read an Excel File

You can load data from Excel files with read_excel():

>>> df = pd.read_excel('data.xlsx', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

read_excel() returns a new DataFrame that contains the values from data.xlsx. You can also use read_excel() with OpenDocument spreadsheets, or .ods files.
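
For example, if you have the odfpy package installed, then reading a hypothetical OpenDocument file should look roughly like this:

>>> df = pd.read_excel('data.ods', engine='odf', index_col=0)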

You’ll learn more about working with Excel files later on in this tutorial. You can also check out Using Pandas to Read Large Excel Files in Python.

Understanding the Pandas IO API

Pandas IO Tools is the API that allows you to save the contents of Series and DataFrame objects to the clipboard, objects, or files of various types. It also enables loading data from the clipboard, objects, or files.
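
For example, the clipboard round trip looks like this. It relies on a system clipboard being available, so it may not work in a headless environment, and the snippet is only a sketch:

>>> df.to_clipboard()
>>> df = pd.read_clipboard(index_col=0)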

Write Files

Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They’re named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.

You’ve learned about .to_csv() and .to_excel(), but there are others, including:

  • .to_json()
  • .to_html()
  • .to_sql()
  • .to_pickle()

There are still more file types that you can write to, so this list is not exhaustive.

Note: To find similar methods, check the official documentation about serialization, IO, and conversion related to Series and DataFrame objects.

These methods have parameters specifying the target file path where you saved the data and labels. This is mandatory in some cases and optional in others. If this option is available and you choose to omit it, then the methods return the objects (like strings or iterables) with the contents of DataFrame instances.

The optional parameter compression decides how to compress the file with the data and labels. You’ll learn more about it later on. There are a few other parameters, but they’re mostly specific to one or several methods. You won’t go into them in detail here.

Read Files

Pandas functions for reading the contents of files are named using the pattern .read_<file-type>(), where <file-type> indicates the type of the file to read. You’ve already seen the Pandas read_csv() and read_excel() functions. Here are a few others:

  • read_json()
  • read_html()
  • read_sql()
  • read_pickle()

These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.

The optional parameter compression determines the type of decompression to use for the compressed files. You’ll learn about it later on in this tutorial. There are other parameters, but they’re specific to one or several functions. You won’t go into them in detail here.
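
As a quick preview, both .to_csv() and read_csv() infer the compression from the file extension by default, so a round trip through a gzipped CSV can be as simple as this (the file name is just an example):

>>> df.to_csv('data.csv.gz')
>>> df = pd.read_csv('data.csv.gz', index_col=0)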

Working With Different File Types

The Pandas library offers a wide range of possibilities for saving your data to files and loading data from files. In this section, you’ll learn more about working with CSV and Excel files. You’ll also see how to use other types of files, like JSON, web pages, databases, and Python pickle files.

CSV Files

You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buff to specify the path, name, and extension of the target file.

path_or_buff is the first argument .to_csv() will get. It can be any string that represents a valid file path that includes the file name and its extension. You’ve seen this in a previous example. However, if you omit path_or_buff, then .to_csv() won’t create any files. Instead, it’ll return the corresponding string:

>>> df = pd.DataFrame(data=data).T
>>> s = df.to_csv()
>>> print(s)
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

Now you have the string s instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries (China, Japan, and so on) are not available. In data science and machine learning, you must handle missing values carefully. Pandas excels here! By default, Pandas uses the NaN value to replace the missing values.

Note: nan, which stands for “not a number,” is a particular floating-point value in Python.

You can get a nan value with any of the following functions:

  • float('nan')
  • math.nan
  • numpy.nan

The continent that corresponds to Russia in df is nan:

>>> df.loc['RUS', 'CONT']
nan

This example uses .loc[] to get data with the specified row and column names.

When you save your DataFrame to a CSV file, empty strings ('') will represent the missing data. You can see this both in your file data.csv and in the string s. If you want to change this behavior, then use the optional parameter na_rep:

>>> df.to_csv('new-data.csv', na_rep='(missing)')

This code produces the file new-data.csv where the missing values are no longer empty strings. You can expand the code block below to see how this file should look:

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,(missing)
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,(missing),1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,(missing)
DEU,Germany,83.02,357.11,3693.2,Europe,(missing)
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,(missing)
ITA,Italy,60.36,301.34,1943.84,Europe,(missing)
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,(missing)
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

Now, the string '(missing)' in the file corresponds to the nan values from df.

When Pandas reads files, it considers the empty string ('') and a few others as missing values by default:

  • 'nan'
  • '-nan'
  • 'NA'
  • 'N/A'
  • 'NaN'
  • 'null'

If you don’t want this behavior, then you can pass keep_default_na=False to the Pandas read_csv() function. To specify other labels for missing values, use the parameter na_values:

>>> pd.read_csv('new-data.csv', index_col=0, na_values='(missing)')
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

Here, you’ve marked the string '(missing)' as a new missing data label, and Pandas replaced it with nan when it read the file.

When you load data from a file, Pandas assigns the data types to the values of each column by default. You can check these types with .dtypes:

>>> df = pd.read_csv('data.csv', index_col=0)
>>> df.dtypes
COUNTRY     object
POP        float64
AREA       float64
GDP        float64
CONT        object
IND_DAY     object
dtype: object

The columns with strings and dates ('COUNTRY', 'CONT', and 'IND_DAY') have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers (float64).

You can use the parameter dtype to specify the desired data types and parse_dates to force use of datetimes:

>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df['IND_DAY']
CHN          NaT
IND   1947-08-15
USA   1776-07-04
IDN   1945-08-17
BRA   1822-09-07
PAK   1947-08-14
NGA   1960-10-01
BGD   1971-03-26
RUS   1992-06-12
MEX   1810-09-16
JPN          NaT
DEU          NaT
FRA   1789-07-14
GBR          NaT
ITA          NaT
ARG   1816-07-09
DZA   1962-07-05
CAN   1867-07-01
AUS          NaT
KAZ   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]

Now, you have 32-bit floating-point numbers (float32) as specified with dtype. These differ slightly from the original 64-bit numbers because of their lower precision. The values in the last column are considered as dates and have the data type datetime64. That’s why the NaN values in this column are replaced with NaT.

Now that you have real dates, you can save them in the format you like:

>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.to_csv('formatted-data.csv', date_format='%B %d, %Y')

Here, you’ve specified the parameter date_format to be '%B %d, %Y'. You can expand the code block below to see the resulting file:

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,"August 15, 1947"
USA,US,329.74,9833.52,19485.39,N.America,"July 04, 1776"
IDN,Indonesia,268.07,1910.93,1015.54,Asia,"August 17, 1945"
BRA,Brazil,210.32,8515.77,2055.51,S.America,"September 07, 1822"
PAK,Pakistan,205.71,881.91,302.14,Asia,"August 14, 1947"
NGA,Nigeria,200.96,923.77,375.77,Africa,"October 01, 1960"
BGD,Bangladesh,167.09,147.57,245.63,Asia,"March 26, 1971"
RUS,Russia,146.79,17098.25,1530.75,,"June 12, 1992"
MEX,Mexico,126.58,1964.38,1158.23,N.America,"September 16, 1810"
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,"July 14, 1789"
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,"July 09, 1816"
DZA,Algeria,43.38,2381.74,167.56,Africa,"July 05, 1962"
CAN,Canada,37.59,9984.67,1647.12,N.America,"July 01, 1867"
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,"December 16, 1991"

The format of the dates is different now. The format '%B %d, %Y' means the date will first display the full name of the month, then the day followed by a comma, and finally the full year.

There are several other optional parameters that you can use with .to_csv():

  • sep denotes a values separator.
  • decimal indicates a decimal separator.
  • encoding sets the file encoding.
  • header specifies whether you want to write column labels in the file.

Here’s how you would pass arguments for sep and header:

>>> s = df.to_csv(sep=';', header=False)
>>> print(s)
CHN;China;1398.72;9596.96;12234.78;Asia;
IND;India;1351.16;3287.26;2575.67;Asia;1947-08-15
USA;US;329.74;9833.52;19485.39;N.America;1776-07-04
IDN;Indonesia;268.07;1910.93;1015.54;Asia;1945-08-17
BRA;Brazil;210.32;8515.77;2055.51;S.America;1822-09-07
PAK;Pakistan;205.71;881.91;302.14;Asia;1947-08-14
NGA;Nigeria;200.96;923.77;375.77;Africa;1960-10-01
BGD;Bangladesh;167.09;147.57;245.63;Asia;1971-03-26
RUS;Russia;146.79;17098.25;1530.75;;1992-06-12
MEX;Mexico;126.58;1964.38;1158.23;N.America;1810-09-16
JPN;Japan;126.22;377.97;4872.42;Asia;
DEU;Germany;83.02;357.11;3693.2;Europe;
FRA;France;67.02;640.68;2582.49;Europe;1789-07-14
GBR;UK;66.44;242.5;2631.23;Europe;
ITA;Italy;60.36;301.34;1943.84;Europe;
ARG;Argentina;44.94;2780.4;637.49;S.America;1816-07-09
DZA;Algeria;43.38;2381.74;167.56;Africa;1962-07-05
CAN;Canada;37.59;9984.67;1647.12;N.America;1867-07-01
AUS;Australia;25.47;7692.02;1408.68;Oceania;
KAZ;Kazakhstan;18.53;2724.9;159.41;Asia;1991-12-16

The data is separated with a semicolon (';') because you’ve specified sep=';'. Also, since you passed header=False, you see your data without the header row of column names.

The Pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and more. For instance, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True to read_csv(). You’ll learn later on about data compression and decompression, as well as how to skip rows and columns.
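
For example, assuming a hypothetical file one-column.csv that holds a single data column, something like this should give you a Series rather than a DataFrame:

>>> s = pd.read_csv('one-column.csv', squeeze=True)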

JSON Files

JSON stands for JavaScript object notation. JSON files are plaintext files used for data interchange, and humans can read them easily. They follow the ISO/IEC 21778:2017 and ECMA-404 standards and use the .json extension. Python and Pandas work well with JSON files, as Python’s json library offers built-in support for them.

You can save the data from your DataFrame to a JSON file with .to_json(). Start by creating a DataFrame object again. Use the dictionary data that holds the data about countries and then apply .to_json():

>>> df = pd.DataFrame(data=data).T
>>> df.to_json('data-columns.json')

This code produces the file data-columns.json. You can expand the code block below to see how this file should look:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":"1947-08-15","USA":"1776-07-04","IDN":"1945-08-17","BRA":"1822-09-07","PAK":"1947-08-14","NGA":"1960-10-01","BGD":"1971-03-26","RUS":"1992-06-12","MEX":"1810-09-16","JPN":null,"DEU":null,"FRA":"1789-07-14","GBR":null,"ITA":null,"ARG":"1816-07-09","DZA":"1962-07-05","CAN":"1867-07-01","AUS":null,"KAZ":"1991-12-16"}}

data-columns.json has one large dictionary with the column labels as keys and the corresponding inner dictionaries as values.

You can get a different file structure if you pass an argument for the optional parameter orient:

>>> df.to_json('data-index.json', orient='index')

The orient parameter defaults to 'columns'. Here, you’ve set it to 'index'.

You should get a new file data-index.json. You can expand the code block below to see the changes:

{"CHN":{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},"IND":{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},"USA":{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},"IDN":{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},"BRA":{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},"PAK":{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},"NGA":{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},"BGD":{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},"RUS":{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},"MEX":{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},"JPN":{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},"DEU":{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},"FRA":{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},"GBR":{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},"ITA":{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},"ARG":{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},"DZA":{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},"CAN":{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},"AUS":{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},"KAZ":{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}}

data-index.json also has one large dictionary, but this time the row labels are the keys, and the inner dictionaries are the values.

There are a few more options for orient. One of them is 'records':

>>> df.to_json('data-records.json', orient='records')

This code should yield the file data-records.json. You can expand the code block below to see the content:

[{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}]

data-records.json holds a list with one dictionary for each row. The row labels are not written.

You can get another interesting file structure with orient='split':

>>> df.to_json('data-split.json', orient='split')

The resulting file is data-split.json. You can expand the code block below to see how this file should look:

{"columns":["COUNTRY","POP","AREA","GDP","CONT","IND_DAY"],"index":["CHN","IND","USA","IDN","BRA","PAK","NGA","BGD","RUS","MEX","JPN","DEU","FRA","GBR","ITA","ARG","DZA","CAN","AUS","KAZ"],"data":[["China",1398.72,9596.96,12234.78,"Asia",null],["India",1351.16,3287.26,2575.67,"Asia","1947-08-15"],["US",329.74,9833.52,19485.39,"N.America","1776-07-04"],["Indonesia",268.07,1910.93,1015.54,"Asia","1945-08-17"],["Brazil",210.32,8515.77,2055.51,"S.America","1822-09-07"],["Pakistan",205.71,881.91,302.14,"Asia","1947-08-14"],["Nigeria",200.96,923.77,375.77,"Africa","1960-10-01"],["Bangladesh",167.09,147.57,245.63,"Asia","1971-03-26"],["Russia",146.79,17098.25,1530.75,null,"1992-06-12"],["Mexico",126.58,1964.38,1158.23,"N.America","1810-09-16"],["Japan",126.22,377.97,4872.42,"Asia",null],["Germany",83.02,357.11,3693.2,"Europe",null],["France",67.02,640.68,2582.49,"Europe","1789-07-14"],["UK",66.44,242.5,2631.23,"Europe",null],["Italy",60.36,301.34,1943.84,"Europe",null],["Argentina",44.94,2780.4,637.49,"S.America","1816-07-09"],["Algeria",43.38,2381.74,167.56,"Africa","1962-07-05"],["Canada",37.59,9984.67,1647.12,"N.America","1867-07-01"],["Australia",25.47,7692.02,1408.68,"Oceania",null],["Kazakhstan",18.53,2724.9,159.41,"Asia","1991-12-16"]]}

data-split.json contains one dictionary that holds the following lists:

  • The names of the columns
  • The labels of the rows
  • The inner lists (two-dimensional sequence) that hold data values

If you don’t provide the value for the optional parameter path_or_buf that defines the file path, then .to_json() will return a JSON string instead of writing the results to a file. This behavior is consistent with .to_csv().
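
For example, omitting the path gives you the JSON text back as a plain Python string:

>>> json_str = df.to_json(orient='records')
>>> type(json_str)
<class 'str'>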

There are other optional parameters you can use. For instance, you can set index=False to forego saving row labels. You can manipulate precision with double_precision, and dates with date_format and date_unit. These last two parameters are particularly important when you have time series among your data:

>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP                object
AREA               object
GDP                object
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df.to_json('data-time.json')

In this example, you’ve created the DataFrame from the dictionary data and used to_datetime() to convert the values in the last column to datetime64. You can expand the code block below to see the resulting file:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":-706320000000,"USA":-6106060800000,"IDN":-769219200000,"BRA":-4648924800000,"PAK":-706406400000,"NGA":-291945600000,"BGD":38793600000,"RUS":708307200000,"MEX":-5026838400000,"JPN":null,"DEU":null,"FRA":-5694969600000,"GBR":null,"ITA":null,"ARG":-4843411200000,"DZA":-236476800000,"CAN":-3234729600000,"AUS":null,"KAZ":692841600000}}

In this file, you have large integers instead of dates for the independence days. That’s because the default value of the optional parameter date_format is 'epoch' whenever orient isn’t 'table'. This default behavior expresses dates as an epoch in milliseconds relative to midnight on January 1, 1970.
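
You can check one of these numbers yourself. For instance, the value stored for India converts back to its independence day:

>>> pd.to_datetime(-706320000000, unit='ms')
Timestamp('1947-08-15 00:00:00')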

However, if you pass date_format='iso', then you’ll get the dates in the ISO 8601 format. In addition, date_unit decides the units of time:

>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.to_json('new-data-time.json', date_format='iso', date_unit='s')

This code produces the following JSON file:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":"1947-08-15T00:00:00Z","USA":"1776-07-04T00:00:00Z","IDN":"1945-08-17T00:00:00Z","BRA":"1822-09-07T00:00:00Z","PAK":"1947-08-14T00:00:00Z","NGA":"1960-10-01T00:00:00Z","BGD":"1971-03-26T00:00:00Z","RUS":"1992-06-12T00:00:00Z","MEX":"1810-09-16T00:00:00Z","JPN":null,"DEU":null,"FRA":"1789-07-14T00:00:00Z","GBR":null,"ITA":null,"ARG":"1816-07-09T00:00:00Z","DZA":"1962-07-05T00:00:00Z","CAN":"1867-07-01T00:00:00Z","AUS":null,"KAZ":"1991-12-16T00:00:00Z"}}

The dates in the resulting file are in the ISO 8601 format.

You can load the data from a JSON file with read_json():

>>> df = pd.read_json('data-index.json', orient='index',
...                   convert_dates=['IND_DAY'])

The parameter convert_dates has a similar purpose as parse_dates when you use it to read CSV files. The optional parameter orient is very important because it specifies how Pandas understands the structure of the file.

There are other optional parameters you can use as well:

  • Set the encoding with encoding.
  • Manipulate dates with convert_dates and keep_default_dates.
  • Impact precision with dtype and precise_float.
  • Decode numeric data directly to NumPy arrays with numpy=True.

Note that you might lose the order of rows and columns when using the JSON format to store your data.
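For instance, here's a minimal sketch that reads back the new-data-time.json file created above, combining a few of these parameters (the exact combination is only illustrative):

import pandas as pd

# Read the ISO-formatted JSON file, converting only IND_DAY to datetimes
# and leaving the other columns alone.
df = pd.read_json('new-data-time.json', encoding='utf-8',
                  convert_dates=['IND_DAY'], keep_default_dates=False)
print(df.dtypes)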

HTML Files

An HTML file is a plaintext file that uses hypertext markup language to help browsers render web pages. The extensions for HTML files are .html and .htm. You’ll need to install an HTML parser library like lxml or html5lib to be able to work with HTML files:

$ pip install lxml html5lib

You can also use Conda to install the same packages:

$ conda install lxml html5lib

Once you have these libraries, you can save the contents of your DataFrame as an HTML file with .to_html():

>>> df = pd.DataFrame(data=data).T
>>> df.to_html('data.html')

This code generates a file data.html. You can expand the code block below to see how this file should look:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"><th></th><th>COUNTRY</th><th>POP</th><th>AREA</th><th>GDP</th><th>CONT</th><th>IND_DAY</th></tr>
  </thead>
  <tbody>
    <tr><th>CHN</th><td>China</td><td>1398.72</td><td>9596.96</td><td>12234.8</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>IND</th><td>India</td><td>1351.16</td><td>3287.26</td><td>2575.67</td><td>Asia</td><td>1947-08-15</td></tr>
    <tr><th>USA</th><td>US</td><td>329.74</td><td>9833.52</td><td>19485.4</td><td>N.America</td><td>1776-07-04</td></tr>
    <tr><th>IDN</th><td>Indonesia</td><td>268.07</td><td>1910.93</td><td>1015.54</td><td>Asia</td><td>1945-08-17</td></tr>
    <tr><th>BRA</th><td>Brazil</td><td>210.32</td><td>8515.77</td><td>2055.51</td><td>S.America</td><td>1822-09-07</td></tr>
    <tr><th>PAK</th><td>Pakistan</td><td>205.71</td><td>881.91</td><td>302.14</td><td>Asia</td><td>1947-08-14</td></tr>
    <tr><th>NGA</th><td>Nigeria</td><td>200.96</td><td>923.77</td><td>375.77</td><td>Africa</td><td>1960-10-01</td></tr>
    <tr><th>BGD</th><td>Bangladesh</td><td>167.09</td><td>147.57</td><td>245.63</td><td>Asia</td><td>1971-03-26</td></tr>
    <tr><th>RUS</th><td>Russia</td><td>146.79</td><td>17098.2</td><td>1530.75</td><td>NaN</td><td>1992-06-12</td></tr>
    <tr><th>MEX</th><td>Mexico</td><td>126.58</td><td>1964.38</td><td>1158.23</td><td>N.America</td><td>1810-09-16</td></tr>
    <tr><th>JPN</th><td>Japan</td><td>126.22</td><td>377.97</td><td>4872.42</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>DEU</th><td>Germany</td><td>83.02</td><td>357.11</td><td>3693.2</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>FRA</th><td>France</td><td>67.02</td><td>640.68</td><td>2582.49</td><td>Europe</td><td>1789-07-14</td></tr>
    <tr><th>GBR</th><td>UK</td><td>66.44</td><td>242.5</td><td>2631.23</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ITA</th><td>Italy</td><td>60.36</td><td>301.34</td><td>1943.84</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ARG</th><td>Argentina</td><td>44.94</td><td>2780.4</td><td>637.49</td><td>S.America</td><td>1816-07-09</td></tr>
    <tr><th>DZA</th><td>Algeria</td><td>43.38</td><td>2381.74</td><td>167.56</td><td>Africa</td><td>1962-07-05</td></tr>
    <tr><th>CAN</th><td>Canada</td><td>37.59</td><td>9984.67</td><td>1647.12</td><td>N.America</td><td>1867-07-01</td></tr>
    <tr><th>AUS</th><td>Australia</td><td>25.47</td><td>7692.02</td><td>1408.68</td><td>Oceania</td><td>NaN</td></tr>
    <tr><th>KAZ</th><td>Kazakhstan</td><td>18.53</td><td>2724.9</td><td>159.41</td><td>Asia</td><td>1991-12-16</td></tr>
  </tbody>
</table>

This file shows the DataFrame contents nicely. However, notice that you haven’t obtained an entire web page. You’ve just output the data that corresponds to df in the HTML format.

.to_html() won’t create a file if you don’t provide the optional parameter buf, which denotes the buffer to write to. If you leave this parameter out, then your code will return a string as it did with .to_csv() and .to_json().

Here are some other optional parameters:

  • header determines whether to save the column names.
  • index determines whether to save the row labels.
  • classes assigns cascading style sheet (CSS) classes.
  • render_links specifies whether to convert URLs to HTML links.
  • table_id assigns the CSS id to the table tag.
  • escape decides whether to convert the characters <, >, and & to HTML-safe strings.

You use parameters like these to specify different aspects of the resulting files or strings.
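As an illustration, here's a hedged sketch that combines several of these parameters; the CSS class name and table id are made-up values for the example:

# Return an HTML string (buf omitted), attach a CSS class and id, and
# keep the default escaping of <, >, and &.
html_string = df.to_html(header=True, index=True,
                         classes='countries-table', table_id='countries',
                         render_links=True, escape=True)
print(html_string[:200])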

You can create a DataFrame object from a suitable HTML file using read_html(), which will return a DataFrame instance or a list of them:

>>> df = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY'])

This is very similar to what you did when reading CSV files. You also have parameters that help you work with dates, missing values, precision, encoding, HTML parsers, and more.
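As a small sketch, you can also be explicit about which parser read_html() should use; note that it always returns a list of DataFrames, even when the page contains a single table (flavor='lxml' assumes the lxml parser installed earlier is available):

tables = pd.read_html('data.html', index_col=0,
                      parse_dates=['IND_DAY'], flavor='lxml')
df = tables[0]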

Excel Files

You’ve already learned how to read and write Excel files with Pandas. However, there are a few more options worth considering. For one, when you use .to_excel(), you can specify the name of the target worksheet with the optional parameter sheet_name:

>>> df = pd.DataFrame(data=data).T
>>> df.to_excel('data.xlsx', sheet_name='COUNTRIES')

Here, you create a file data.xlsx with a worksheet called COUNTRIES that stores the data. The string 'data.xlsx' is the argument for the parameter excel_writer that defines the name of the Excel file or its path.

The optional parameters startrow and startcol both default to 0 and indicate the upper left-most cell where the data should start being written:

>>> df.to_excel('data-shifted.xlsx', sheet_name='COUNTRIES',
...             startrow=2, startcol=4)

Here, you specify that the table should start in the third row and the fifth column. You also used zero-based indexing, so the third row is denoted by 2 and the fifth column by 4.

Now the resulting worksheet looks like this:

[Image: the data-shifted.xlsx worksheet, with the table starting at cell E3]

As you can see, the table now starts in the third row of the worksheet and in the fifth column, column E.

.read_excel() also has the optional parameter sheet_name that specifies which worksheets to read when loading data. It can take on one of the following values:

  • The zero-based index of the worksheet
  • The name of the worksheet
  • The list of indices or names to read multiple sheets
  • The value None to read all sheets

Here’s how you would use this parameter in your code:

>>> df = pd.read_excel('data.xlsx', sheet_name=0, index_col=0,
...                    parse_dates=['IND_DAY'])
>>> df = pd.read_excel('data.xlsx', sheet_name='COUNTRIES', index_col=0,
...                    parse_dates=['IND_DAY'])

Both statements above create the same DataFrame because the sheet_name parameters have the same values. In both cases, sheet_name=0 and sheet_name='COUNTRIES' refer to the same worksheet. The argument parse_dates=['IND_DAY'] tells Pandas to try to consider the values in this column as dates or times.

There are other optional parameters you can use with .read_excel() and .to_excel() to determine the Excel engine, the encoding, the way to handle missing values and infinities, the method for writing column names and row labels, and so on.
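As a minimal sketch of a couple of those options (the openpyxl engine and the '(missing)' placeholder are assumptions for illustration, not requirements):

# Write missing values as a visible placeholder and pick the engine
# explicitly, then read the file back treating that placeholder as NaN.
df.to_excel('data-options.xlsx', sheet_name='COUNTRIES',
            na_rep='(missing)', engine='openpyxl')
df = pd.read_excel('data-options.xlsx', sheet_name='COUNTRIES',
                   index_col=0, na_values=['(missing)'])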

SQL Files

Pandas IO tools can also read and write databases. In this next example, you’ll write your data to a database called data.db. To get started, you’ll need the SQLAlchemy package. To learn more about it, you can read the official ORM tutorial. You’ll also need the database driver. Python has a built-in driver for SQLite.

You can install SQLAlchemy with pip:

$ pip install sqlalchemy

You can also install it with Conda:

$ conda install sqlalchemy

Once you have SQLAlchemy installed, import create_engine() and create a database engine:

>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///data.db', echo=False)

Now that you have everything set up, the next step is to create a DataFrame object. It’s convenient to specify the data types and apply .to_sql().

>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

.astype() is a very convenient method you can use to set multiple data types at once.

Once you’ve created your DataFrame, you can save it to the database with .to_sql():

>>> df.to_sql('data.db', con=engine, index_label='ID')

The parameter con is used to specify the database connection or engine that you want to use. The optional parameter index_label specifies how to call the database column with the row labels. You’ll often see it take on the value ID, Id, or id.

You should get the database data.db with a single table that looks like this:

[Image: the resulting table in the data.db database]

The first column contains the row labels. To omit writing them into the database, pass index=False to .to_sql(). The other columns correspond to the columns of the DataFrame.

There are a few more optional parameters. For example, you can use schema to specify the database schema and dtype to determine the types of the database columns. You can also use if_exists, which says what to do if a database with the same name and path already exists:

  • if_exists='fail' raises a ValueError and is the default.
  • if_exists='replace' drops the table and inserts new values.
  • if_exists='append' inserts new values into the table.
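For instance, here's a short sketch of rerunning the export against the table created above:

# The table 'data.db' already exists, so the default if_exists='fail'
# would raise a ValueError; 'replace' drops and recreates it instead.
df.to_sql('data.db', con=engine, index_label='ID', if_exists='replace')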

You can load the data from the database with read_sql():

>>> df = pd.read_sql('data.db', con=engine, index_col='ID')
>>> df
        COUNTRY      POP      AREA       GDP       CONT    IND_DAY
ID
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

The parameter index_col specifies the name of the column with the row labels. Note that this inserts an extra row after the header that starts with ID. You can fix this behavior with the following line of code:

>>> df.index.name = None
>>> df
        COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

Now you have the same DataFrame object as before.

Note that the continent for Russia is now None instead of nan. If you want to fill the missing values with nan, then you can use .fillna():

>>> df.fillna(value=float('nan'), inplace=True)

.fillna() replaces all missing values with whatever you pass to value. Here, you passed float('nan'), which says to fill all missing values with nan.

Also note that you didn’t have to pass parse_dates=['IND_DAY'] to read_sql(). That’s because your database was able to detect that the last column contains dates. However, you can pass parse_dates if you’d like. You’ll get the same results.

There are other functions that you can use to read databases, like read_sql_table() and read_sql_query(). Feel free to try them out!
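As a rough sketch (the SQL string here is just an illustration, and the table name matches the one created above):

# read_sql_table() loads a whole table, read_sql_query() runs arbitrary SQL.
df_table = pd.read_sql_table('data.db', con=engine, index_col='ID')
df_query = pd.read_sql_query('SELECT ID, COUNTRY, GDP FROM "data.db"',
                             con=engine, index_col='ID')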

Pickle Files

Pickling is the act of converting Python objects into byte streams. Unpickling is the inverse process. Python pickle files are the binary files that keep the data and hierarchy of Python objects. They usually have the extension .pickle or .pkl.

You can save your DataFrame in a pickle file with .to_pickle():

>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.to_pickle('data.pickle')

Like you did with databases, it can be convenient first to specify the data types. Then, you create a file data.pickle to contain your data. You could also pass an integer value to the optional parameter protocol, which specifies the protocol of the pickler.
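For example, here's a minimal sketch of setting the protocol explicitly (recent pandas versions already default to the highest protocol, so this is mostly useful for pinning a specific one for compatibility):

import pickle

# Write the same DataFrame with an explicitly chosen pickle protocol.
df.to_pickle('data-proto.pickle', protocol=pickle.HIGHEST_PROTOCOL)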

You can get the data from a pickle file with read_pickle():

>>> df = pd.read_pickle('data.pickle')
>>> df
        COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

read_pickle() returns the DataFrame with the stored data. You can also check the data types:

>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

These are the same ones that you specified before using .to_pickle().

As a word of caution, you should always beware of loading pickles from untrusted sources. This can be dangerous! When you unpickle an untrustworthy file, it could execute arbitrary code on your machine, gain remote access to your computer, or otherwise exploit your device in other ways.

Working With Big Data

If your files are too large for saving or processing, then there are several approaches you can take to reduce the required disk space:

  • Compress your files
  • Choose only the columns you want
  • Omit the rows you don’t need
  • Force the use of less precise data types
  • Split the data into chunks

You’ll take a look at each of these techniques in turn.

Compress and Decompress Files

You can create an archive file like you would a regular one, with the addition of a suffix that corresponds to the desired compression type:

  • '.gz'
  • '.bz2'
  • '.zip'
  • '.xz'

Pandas can deduce the compression type by itself:

>>> df = pd.DataFrame(data=data).T
>>> df.to_csv('data.csv.zip')

Here, you create a compressed .csv file as an archive. The size of the regular .csv file is 1048 bytes, while the compressed file only has 766 bytes.

You can open this compressed file as usual with the Pandas read_csv() function:

>>> df = pd.read_csv('data.csv.zip', index_col=0,
...                  parse_dates=['IND_DAY'])
>>> df
        COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

read_csv() decompresses the file before reading it into a DataFrame.

You can specify the type of compression with the optional parameter compression, which can take on any of the following values:

  • 'infer'
  • 'gzip'
  • 'bz2'
  • 'zip'
  • 'xz'
  • None

The default value compression='infer' indicates that Pandas should deduce the compression type from the file extension.

Here’s how you would compress a pickle file:

>>> df = pd.DataFrame(data=data).T
>>> df.to_pickle('data.pickle.compress', compression='gzip')

You should get the file data.pickle.compress that you can later decompress and read:

>>> df = pd.read_pickle('data.pickle.compress', compression='gzip')

df again corresponds to the DataFrame with the same data as before.

You can give the other compression methods a try, as well. If you’re using pickle files, then keep in mind that the .zip format supports reading only.
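Here's a brief sketch of trying one of the other methods explicitly with CSV files:

# Write and read a gzip-compressed CSV; compression could also be left
# as 'infer' since the .gz suffix gives it away.
df.to_csv('data.csv.gz', compression='gzip')
df = pd.read_csv('data.csv.gz', index_col=0, compression='gzip',
                 parse_dates=['IND_DAY'])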

Choose Columns

The Pandas read_csv() and read_excel() functions have the optional parameter usecols that you can use to specify the columns you want to load from the file. You can pass the list of column names as the corresponding argument:

>>> df = pd.read_csv('data.csv', usecols=['COUNTRY', 'AREA'])
>>> df
       COUNTRY      AREA
0        China   9596.96
1        India   3287.26
2           US   9833.52
3    Indonesia   1910.93
4       Brazil   8515.77
5     Pakistan    881.91
6      Nigeria    923.77
7   Bangladesh    147.57
8       Russia  17098.25
9       Mexico   1964.38
10       Japan    377.97
11     Germany    357.11
12      France    640.68
13          UK    242.50
14       Italy    301.34
15   Argentina   2780.40
16     Algeria   2381.74
17      Canada   9984.67
18   Australia   7692.02
19  Kazakhstan   2724.90

Now you have a DataFrame that contains less data than before. Here, there are only the names of the countries and their areas.

Instead of the column names, you can also pass their indices:

>>> df = pd.read_csv('data.csv', index_col=0, usecols=[0, 1, 3])
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90

Expand the code block below to compare these results with the file 'data.csv':

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

You can see the following columns:

  • The column at index 0 contains the row labels.
  • The column at index 1 contains the country names.
  • The column at index 3 contains the areas.

Similarly, read_sql() has the optional parameter columns that takes a list of column names to read:

>>> df = pd.read_sql('data.db', con=engine, index_col='ID',
...                  columns=['COUNTRY', 'AREA'])
>>> df.index.name = None
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90

Again, the DataFrame only contains the columns with the names of the countries and areas. If columns is None or omitted, then all of the columns will be read, as you saw before. The default behavior is columns=None.

Omit Rows

When you test an algorithm for data processing or machine learning, you often don’t need the entire dataset. It’s convenient to load only a subset of the data to speed up the process. The Pandas read_csv() and read_excel() functions have some optional parameters that allow you to select which rows you want to load:

  • skiprows: either the number of rows to skip at the beginning of the file if it’s an integer, or the zero-based indices of the rows to skip if it’s a list-like object
  • skipfooter: the number of rows to skip at the end of the file
  • nrows: the number of rows to read

Here’s how you would skip rows with odd zero-based indices, keeping the even ones:

>>> df = pd.read_csv('data.csv', index_col=0, skiprows=range(1, 20, 2))
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
DEU     Germany    83.02   357.11  3693.20     Europe         NaN
GBR          UK    66.44   242.50  2631.23     Europe         NaN
ARG   Argentina    44.94  2780.40   637.49  S.America  1816-07-09
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
KAZ  Kazakhstan    18.53  2724.90   159.41       Asia  1991-12-16

In this example, skiprows is range(1, 20, 2) and corresponds to the values 1, 3, …, 19. The instances of the Python built-in class range behave like sequences. The first row of the file data.csv is the header row. It has the index 0, so Pandas loads it in. The second row with index 1 corresponds to the label CHN, and Pandas skips it. The third row with the index 2 and label IND is loaded, and so on.

If you want to choose rows randomly, then skiprows can be a list or NumPy array with pseudo-random numbers, obtained either with pure Python or with NumPy.
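Here's a minimal sketch of that idea with NumPy (the seed and the 50% sample size are arbitrary choices):

import numpy as np

# Row 0 is the header, so only data rows 1-20 are candidates for skipping.
rng = np.random.default_rng(seed=0)
random_skips = rng.choice(np.arange(1, 21), size=10, replace=False)
df_sample = pd.read_csv('data.csv', index_col=0, skiprows=random_skips)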

Force Less Precise Data Types

If you’re okay with less precise data types, then you can potentially save a significant amount of memory! First, get the data types with .dtypes again:

>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

The columns with the floating-point numbers are 64-bit floats. Each number of this type float64 consumes 64 bits or 8 bytes. Each column has 20 numbers and requires 160 bytes. You can verify this with .memory_usage():

>>> df.memory_usage()
Index      160
COUNTRY    160
POP        160
AREA       160
GDP        160
CONT       160
IND_DAY    160
dtype: int64

.memory_usage() returns an instance of Series with the memory usage of each column in bytes. You can conveniently combine it with .loc[] and .sum() to get the memory for a group of columns:

>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
480

This example shows how you can combine the numeric columns 'POP', 'AREA', and 'GDP' to get their total memory requirement. The argument index=False excludes data for row labels from the resulting Series object. For these three columns, you’ll need 480 bytes.

You can also extract the data values in the form of a NumPy array with .to_numpy() or .values. Then, use the .nbytes attribute to get the total bytes consumed by the items of the array:

>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
480

The result is the same 480 bytes. So, how do you save memory?

In this case, you can specify that your numeric columns 'POP', 'AREA', and 'GDP' should have the type float32. Use the optional parameter dtype to do this:

>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])

The dictionary dtypes specifies the desired data types for each column. It’s passed to the Pandas read_csv() function as the argument that corresponds to the parameter dtype.

Now you can verify that each numeric column needs 80 bytes, or 4 bytes per item:

>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object

>>> df.memory_usage()
Index      160
COUNTRY    160
POP         80
AREA        80
GDP         80
CONT       160
IND_DAY    160
dtype: int64

>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
240

>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
240

Each value is a floating-point number of 32 bits or 4 bytes. The three numeric columns contain 20 items each. In total, you’ll need 240 bytes of memory when you work with the type float32. This is half the size of the 480 bytes you’d need to work with float64.

In addition to saving memory, you can significantly reduce the time required to process data by using float32 instead of float64 in some cases.

Use Chunks to Iterate Through Files

Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time. If you use read_csv(), read_json() or read_sql(), then you can specify the optional parameter chunksize:

>>> data_chunk = pd.read_csv('data.csv', index_col=0, chunksize=8)
>>> type(data_chunk)
<class 'pandas.io.parsers.TextFileReader'>
>>> hasattr(data_chunk, '__iter__')
True
>>> hasattr(data_chunk, '__next__')
True

chunksize defaults to None and can take on an integer value that indicates the number of items in a single chunk. When chunksize is an integer, read_csv() returns an iterable that you can use in a for loop to get and process only a fragment of the dataset in each iteration:

>>> for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
...     print(df_chunk, end='\n\n')
...     print('memory:', df_chunk.memory_usage().sum(), 'bytes',
...           end='\n\n\n')
...
        COUNTRY      POP     AREA       GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.78       Asia         NaN
IND       India  1351.16  3287.26   2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57    245.63       Asia  1971-03-26

memory: 448 bytes


       COUNTRY     POP      AREA      GDP       CONT     IND_DAY
RUS     Russia  146.79  17098.25  1530.75        NaN  1992-06-12
MEX     Mexico  126.58   1964.38  1158.23  N.America  1810-09-16
JPN      Japan  126.22    377.97  4872.42       Asia         NaN
DEU    Germany   83.02    357.11  3693.20     Europe         NaN
FRA     France   67.02    640.68  2582.49     Europe  1789-07-14
GBR         UK   66.44    242.50  2631.23     Europe         NaN
ITA      Italy   60.36    301.34  1943.84     Europe         NaN
ARG  Argentina   44.94   2780.40   637.49  S.America  1816-07-09

memory: 448 bytes


        COUNTRY    POP     AREA      GDP       CONT     IND_DAY
DZA     Algeria  43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada  37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia  25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan  18.53  2724.90   159.41       Asia  1991-12-16

memory: 224 bytes

In this example, the chunksize is 8. The first iteration of the for loop returns a DataFrame with the first eight rows of the dataset only. The second iteration returns another DataFrame with the next eight rows. The third and last iteration returns the remaining four rows.

Note: You can also pass iterator=True to force the Pandas read_csv() function to return an iterator object instead of a DataFrame object.

In each iteration, you get and process the DataFrame with the number of rows equal to chunksize. It’s possible to have fewer rows than the value of chunksize in the last iteration. You can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.
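For example, here's a small sketch that accumulates a single statistic chunk by chunk, so the full dataset never has to be in memory at once:

# Sum the POP column across all chunks of eight rows each.
total_pop = 0.0
for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
    total_pop += df_chunk['POP'].sum()
print(total_pop)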

Conclusion

You now know how to save the data and labels from Pandas DataFrame objects to different kinds of files. You also know how to load your data from files and create DataFrame objects.

You’ve used the Pandas read_csv() and .to_csv() methods to read and write CSV files. You also used similar methods to read and write Excel, JSON, HTML, SQL, and pickle files. These functions are very convenient and widely used. They allow you to save or load your data in a single function or method call.

You’ve also learned how to save time, memory, and disk space when working with large data files:

  • Compress or decompress files
  • Choose the rows and columns you want to load
  • Use less precise data types
  • Split data into chunks and process them one by one

You’ve mastered a significant step in the machine learning and data science process! If you have any questions or comments, then please put them in the comments section below.



Mike C. Fletcher: Seems SimpleParse needs work for 3.8


So as I work through all the OpenGLContext projects to get automatic (or near automatic) releasing, SimpleParse wound up failing on the 3.x branches with a weird xml test failure. But with Python 3.8 the C code just won't import at all. Seems there was a change in Python 3.8 where it does a load-time test for functions in the module and the hand-coded C module triggers it. So I'll have to spend some time on that before I can get the whole stack releasing.

Janusworx: #100DaysOfCode, Day 012 – Data Structures Video Refresher


Extremely tight and busy day today.
But still made the time to watch the videos for the next challenge in the course.
As a matter of fact, I think this is what I should do.
If I am in office, I ought to watch videos.
And spend the time at home when I can coding.

What videos?
The next little project they have involves data structure basics.
So today’s videos took me through lists and tuples and dictionaries.
And with this, I call it a day.

Mahmoud Hashemi: Thanks, 201X!


Thought I'd take a Sunday afternoon to reflect on, oh I don't know, a decade.

Been a long ten years, but it's flown past. This particular decade happens to coincide with my first years of full-time professional software engineering.

The Quantity

I can't possibly summarize it all, and if I tried, it'd still be colored by what's on my mind right now. But I can point to the artifacts I tried to leave along the way:

Taking a chronological look at each of the above, I'm relieved to see obvious growth.

If I were to highlight one resource, it would probably be the talks. Despite the stress of preparation and delivery, I'm least concerned with having a massive miscommunication when we're all in the room and I can see the points hitting home. It's impossible to pick a favorite, but Ask the Ecosystem (2019), the Restructuring Data lightning talk (2018), and The Packaging Gradient (2017) seem like audience faves from where I'm sitting.

The Quality

Each project, post, and talk had its own reward, but I guess I've got more than just those to show for the decade.

On the more profit-driven side, I built tools and teams at PayPal, but once I could manage the risk, I got to dip into startups for the last few years. Lucky for me, it wasn't a total bust, and the wife and I bought a place in my favorite neighborhood (in the USA). Not a millionaire, but I'm hoping and working for a world where no one has to be.

More recently, the Python Software Foundation made me a Fellow. This isn't something I can be nonchalant about, and I'm not going to understate how much this means, to me, working in a field like software, where concrete symbols of progress are alternatingly elusive and vanishing. Plus it's Python, and reciprocated love is nice. I have hundreds of people to thank for helping me reach this point, and I have to thank the PSF for dedicating the time to ramping up these awards. They've convinced me more than ever that we need more institutions to build this sort of advancement.

To all of you, thank you.

The Struggle

I like to think I managed to do all of the above while staying away from industry hype, on the principle that massive speculative capital influx isn't where real value is added to society, and doesn't generate the kind of innovation that excites me.

I may have been naïve, but I came to Silicon Valley with an idea about the transformative power of software. Changing times may illustrate a grittier interpretation than the one I had and have, but I continue to hold dear software's potential for positive impact. If you've felt that vision waver, let me tell you, you're not alone.

In the past decade, I've seen too many engineers sucked in by new technologies and ventures, only to find themselves alienated from their work. Episodes ranging from an afternoon lost to debugging Docker/k8s clusters, to years of work disappearing at the end of a VC runway. Nothing has been harder to watch than those bedraggled-but-persistent idealists regroup, each time a bit more cynical than the last.

Even if its seeming intractability has taken it from the center stage, the burnout conversation continues to smolder, because there's no issue realer. I know; I released more ceramics than software back in 2014.

Some problems can be solved by paying the maintainers, but I think the vastly bigger issue is around losing the human connection between the real effort software takes and the real benefits it brings, combined with FOSS's dearth of collaborators in supporting roles (QA, product/project/release management).

That's why I'm incredibly thankful for the Wikimedia community for always being there, patient with schedules and issues, as long as the software got the job done. It can be a challenge to juggle projects, but I tell every budding engineer: find that direct connection to people who will appreciate your work, and avoid cynicism at all costs.

There are some interesting prospects in the works, but I'm keeping this post retro. Besides, if 2029 rolls around and all I did was break even with 2009-19, I don't see how I can be disappointed.

Thanks again for everything in 201X, and for sticking with me in 202X.


  1. Despite using Twitter for over a decade, the process of tweeting feels so perfunctory, and the service itself so tenuous, that I still can't bring myself to invest the time. I mostly use it to crosspost my blog posts or help friends promote their posts/projects.

    But until I start an email newsletter, or really get on top of yak.party, it's still the best I got for announcing where I'm speaking next. 


Python Software Foundation: Giving Tuesday 2019


For the first time the PSF is participating in Giving Tuesday! This event is held annually the Tuesday after Thanksgiving - this year on December 3rd, 2019. The global celebration runs for 24 hours and begins at midnight local time.

Please donate on December 3rd and help us meet our goal of $10,000!


Donations support sprints, meetups, community events, Python documentation, fiscal sponsorships, software development, and community projects. Your contributions help fund the critical tools you use every day.

What if everyone around the world, gave together, on one day? Please consider supporting the PSF on Giving Tuesday, December 3rd, 2019. 


Your donations have IMPACT

----------------------------------------

Our Annual Report will show just a few ways your support has made a difference, thanks to the generous support from our partners and friends. Some highlights are below:

  • Over $137,200 was awarded in financial aid to 143 PyCon attendees in 2019.
  • $324,000 was paid in grants to recipients in 51 different countries.
  • Donations and fundraisers resulted in $489,152 of revenue. This represents 15% of our total 2018 revenue. PSF and PyCon sponsors contributed over $1,071K in revenue!

How your donation dollars are spent:

  • $99 pays for 6 months of Python meetup subscriptions
  • $60 a month ($2.00 a day) pays for one workshop, impacting over 250 people
  • $.50 a day ($15 a month) pays for a meetup.com subscription for one Python group
  • $1 a day ($30 a month) supports a regional workshop, impacting over 200 people.
  • The PSF meetup.com network currently supports 68 groups with 89,000 members in 16 countries. It costs $.60 per member per month to support these worldwide meetups. 

Comments from grant recipients:

"The PSF Fiscal Sponsorship allows us to focus on building community, while they handle our non-profit status, accounting, and back office." - Eric Holscher, PyCascades Conference Organizer

"The PSF was North Bay Python's first sponsor. Their early financial support for our mission helped kickstart what has become one of the most well-regarded regional conferences in our community." - Christopher Neugebauer, Conference Organizer

"Without the support of the PSF, our events would not have been possible. Many of our attendees are now working or interning as Python or Django Developers." - Jeel Mehta, Django Girls Bhavnagar, India Conference Volunteer

"The PSF grant allowed us to run an all day workshop for women. More organizations should apply for a PSF grant so they can enable and motivate more people, especially minorities, to start their great journey into programming." - Women in Technology, Peru


From the team at the PSF!



______________________________



PyCharm: PyCharm 2019.3 is out now


Interactive widgets for Jupyter notebooks, MongoDB support, and code assistance for all Python 3.8 features. Download the new version now, or upgrade from within your IDE.

New in PyCharm

  • Interactive Widgets for Jupyter Notebooks. A picture is worth a thousand words, but making it interactive really makes your data come to life. Interactive widgets are now supported in PyCharm.
  • MongoDB Support. One of the most commonly used NoSQL databases is now supported by the database tools included in PyCharm Professional Edition.
  • We’ve completed support for all Python 3.8 features: you can now expect PyCharm-grade code completion when you use TypedDicts. We’ve also added support for Literal type annotations, and more.
  • Why is it that when you get a CSV file, it always has a very long name that's prone to typos? We've now added a small convenience feature that'll save you some keystrokes: completion for filenames in methods like open and Pandas' read_csv.

And many more improvements, like faster indexing of interpreters, read about them all on our What’s New page.

 

NumFOCUS: Stepping Onto a New Path…

Podcast.__init__: Making Complex Software Fun And Flexible With Plugin Oriented Programming


Summary

Starting a new project is always exciting because the scope is easy to understand and adding new features is fun and easy. As it grows, the rate of change slows down and the amount of communication necessary to introduce new engineers to the code increases along with the complexity. Thomas Hatch, CTO and creator of SaltStack, didn’t want to accept that as an inevitable fact of software, so he created a new paradigm and a proof-of-concept framework to experiment with it. In this episode he shares his thoughts and findings on the topic of plugin oriented programming as a way to build and scale complex projects while keeping them fun and flexible.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host as usual is Tobias Macey and today I’m interviewing Thomas Hatch about his work on the POP library and how he is using plugin oriented programming in his work at SaltStack

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you start by giving your definition of Plugin Oriented Programming and your thoughts on what benefits it provides?
  • You created the POP library as a framework for enabling developers to incorporate this pattern into their own projects. What capabilities does that framework provide and what was your motivation for creating it?
    • How has your work on Salt influenced your thinking on how to implement plugins for software projects?
    • How does POP fit into the future of the SaltStack project?
  • What are some of the advanced patterns or paradigms that the POP model allows for?
  • Can you describe how the POP library itself is implemented and some of the ways that its design has evolved since you first began experimenting with it?
    • What are some of the languages or libraries that you have looked at for inspiration in your design and philosophy around this development pattern?
  • For someone who is building a project on top of POP what does their workflow look like and what are some of the up-front design considerations they should be thinking of?
  • How do you define and validate the contract exposed by or expected from a plugin subsystem?
  • One of the interesting capabilities that you highlight in the documentation is the concept of merging applications. What are your thoughts on the challenges that an engineer might face when merging library or microservice applications built with POP into a single deployable artifact?
    • What would be involved in going the other direction to split a single application into independently runnable microservices?
  • When extracting common functionality from a group of existing applications, what are the relative merits of creating a plugin sybsystem vs writing a library?
  • How does the system design of a POP application impact the available range of communication patterns for software and the teams building it?
  • What are some antipatterns that you anticipate for teams building their projects on top of POP?
  • In the documentation you mention that POP is just an example implementation of the broader pattern and that you hope to see other languages and developer communities adopt it. What are some of the barriers to adoption that you foresee?
  • What are some of the limitations of POP or cases where you would recommend against following this paradigm?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen POP used?
  • What have been some of the most interesting, unexpected, or challenging aspects of building POP?
  • What do you have planned for the future of the POP library, or any applications where you plan to employ this pattern?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

S. Lott: Functional programming design pattern: Nested Iterators == Flattening

Here's a functional programming design pattern I uncovered. This may not be news to you, but it was a surprise to me. It cropped up when looking at something that needs parallelization to reduced the elapsed run time.

Consider this data collection process.

for h in some_high_level_collection(arg1):
    for l in h.some_low_level_collection(arg2):
        if some_filter(l):
            logger.info("Processing %s %s", h, l)
            some_function(h, l)

This is pretty common in devops world. You might be looking at all repositories of in all github organizations. You might be looking at all keys in all AWS S3 buckets under a specific account. You might be looking at all tables owned by all schemas in a database.

It's helpful -- for the moment -- to stay away from taller tree structures like the file system. Traversing the file system involves recursion, and the pattern is slightly different there. We'll get to it, but what made this clear to me was a "simpler" walk through a two-layer hierarchy. 

The nested for-statements aren't really ideal. We can't apply any itertools techniques here. We can't trivially change this to a multiprocessing.map()

In fact, the more we look at this, the worse it is.

Here's something that's a little easier to work with:

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            if some_filter(l):
                logger.info("Processing %s %s", h, l)
                yield h, l

itertools.starmap(some_function, h_l_iter(arg1, arg2))

The data gathering has expanded to a few more lines of code. It gained a lot of flexibility. Once we have something that can be used with starmap, it can also be used with other itertools functions to do additional processing steps without breaking the loops into horrible pieces.

I think the pattern here is a kind of "Flattened Map" transformation. The initial design, with nested loops wrapping a process wasn't a good plan. A better plan is to think of the nested loops as a way to flatten the two tiers of the hierarchy into a single iterator. Then a mapping can be applied to process each item from that flat iterator.

Extracting the Filter

We can now tease apart the nested loops to expose the filter. In the version above, the body of the h_l_iter() function binds log-writing with the yield. If we take those two apart, we gain the flexibility of being able to change the filter (or the logging) without an awfully complex rewrite.

T = TypeVar('T')

def logging_iter(source: Iterable[T]) -> Iterator[T]:
    for item in source:
        logger.info("Processing %s", item)
        yield item

def h_l_iter(arg1, arg2):
    for h in some_high_level_collection(arg1):
        for l in h.some_low_level_collection(arg2):
            yield h, l

raw_data = h_l_iter(arg1, arg2)
filtered_subset = logging_iter(filter(some_filter, raw_data))
itertools.starmap(some_function, filtered_subset)

Yes, this is still longer, but all of the details are now exposed in a way that lets me change filters without further breakage.

Now, I can introduce various forms of multiprocessing to improve concurrency.
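As a sketch of what that could look like (my addition, not from the original post, and assuming some_function is a picklable, module-level function), the flattened, filtered iterator can feed a process pool directly:

from multiprocessing import Pool

if __name__ == "__main__":
    raw_data = h_l_iter(arg1, arg2)
    filtered_subset = logging_iter(filter(some_filter, raw_data))
    with Pool(processes=4) as pool:
        # Pool.starmap() consumes the (h, l) pairs and fans the work
        # out across worker processes.
        pool.starmap(some_function, filtered_subset)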

This transformed a hard-wired set of nested loops, an if-statement, and a function evaluation into a "Flattener" that can be combined with off-the-shelf filtering and mapping functions.

I've snuck in a kind of "tee" operation that writes an iterable sequence to a log. This can be injected at any point in the processing.

Logging the entire "item" value isn't really a great idea. Another mapping is required to create sensible log messages from each item. I've left that out to keep this exposition more focused.

I'm sure others have seen this pattern, but it was eye-opening to me.

Full Flattening

The h_l_iter() function can be rewritten as a generator expression. A function isn't needed.

h_l_iter = (
    (h, l)
    for h in some_high_level_collection(arg1)
    for l in h.some_low_level_collection(arg2)
)

This simplification doesn't add much value, but it seems to be general truth. In Python, it's a small change in syntax and therefore, an easy optimization to make.

What About The File System?

When we're working with some a more deeply-nested structure, like the File System, we'll make a small change. We'll replace the h_l_iter() function with a recursive_walk() function.

def recursive_walk(path: Path) -> Iterator[Path]:
    for item in path.glob('*'):
        if item.is_file():
            yield item
        elif item.is_dir():
            yield from recursive_walk(item)

This function has, effectively the same signature as h_l_iter(). It walks a complex structure yielding a flat sequence of items. The other functions used for filtering, logging, and processing don't change, allowing us to build new features from various combinations of these functions.

tl;dr

The too-long version of this is:

Replace for item in iter: process(item) with map(process, iter).

This pattern works for simple, flat items, nested structures, and even recursively-defined trees. It introduces flexibility with no real cost.

The other pattern in play is:

Any for item in iter: for sub-item in item: processing is "flattening" a hierarchy into a sequence. Replace it with (sub-item for item in iter for sub-item in item).

These felt like blinding revelations to me.

Janusworx: #100DaysOfCode, Day 013 – Test code using Pytest


Watched the video on the upcoming challenge to learn testing code.
Sounds challenging.
There are classes and decorators, which I have not worked with, but just read about.

Will see how it goes.
The idea of test driven development though, seems right up my alley.

Learn PyQt: Creating scrollable GUIs with QScrollArea in PyQt5


When you start building apps that display long documents, large amounts of data or large numbers of widgets, it can be difficult to arrange things within a fixed-size window. Resizing the window beyond the size of the screen isn't an option, and shrinking widgets to fit can make the information unreadable.

To illustrate the problem below is a window in which we've created a large number of QLabel widgets. These widgets have the size Vertical Policy set to Preferred which automatically resizes the widgets down to fit the available space. The results are unreadable.

Problem of Too Many Widgets

Settings the Vertical Policy to Fixed keeps the widgets at their natural size, making them readable again.

Problem of Too Many Widgets With Fixed Heights

However, while we can still add as many labels as we like, eventually they start to fall off the bottom of the layout.

To solve this problem GUI applications can make use of scrolling regions to allow the user to move around within the bounds of the application window while keeping widgets at their usual size. By doing this an almost unlimited amount of data or widgets can be shown, navigated and viewed within a window — although care should be taken to make sure the result is still usable!

In this tutorial, we'll cover adding a scrolling region to your PyQt application using QScrollArea.

Adding a QScrollArea in Qt Designer

First we'll look at how to add a QScrollArea from Qt Designer.

We'll start from the standard empty app template, which imports a .ui file designed in Qt Designer.

python
from PyQt5 import QtWidgets, uic
import sys


class MainWindow(QtWidgets.QMainWindow):

    def __init__(self, *args, **kwargs):
        super(MainWindow, self).__init__(*args, **kwargs)

        # Load the UI Page
        uic.loadUi('mainwindow.ui', self)


def main():
    app = QtWidgets.QApplication(sys.argv)
    main = MainWindow()
    main.show()
    sys.exit(app.exec_())


if __name__ == '__main__':
    main()
Qt Creator — Select MainWindow for widget type

We will choose the Scroll Area widget and add it to our layout as below.

First, create an empty MainWindow in Qt Designer and save it as mainwindow.ui

Add Scroll Area

Next choose to lay out the QScrollArea vertically or horizontally, so that it scales with the window.

Lay Out The Scroll Area Vertically Or Horizontally

Voila, we now have a completed scroll area that we can populate with anything we need.

The Scroll Area Is Created

Inserting Widgets

We will now add labels to that scroll area. Let's take two labels and place them inside the QScrollArea. Then right-click inside the scroll area and select Lay Out Vertically so that our labels are stacked vertically.

Add Labels to The Scroll Area And Set the Layout

We've set the background to blue so it's clear how this works. We can now add more labels to the QScrollArea and see what happens. By default, the Vertical Policy of a label is set to Preferred, which means the label's size is adjusted according to the constraints of the widgets above and below.

Next, we'll add a bunch of widgets.

Adding More Labels to QScrollArea

Any widget can be added into a QScrollArea, although some make more sense than others. For example, it's a great way to show multiple widgets containing data in an expansive dashboard, but less appropriate for control widgets — scrolling around to control an application can get frustrating.

Note that the scroll functionality has not been triggered, and no scrollbar has appeared on the right-hand side. Instead, the labels keep getting progressively shorter to accommodate the new widgets.

However, if we set the Vertical Policy to Fixed and set the minimum height to 100px, the labels will no longer be able to shrink vertically into the available space. As the layout overflows, the QScrollArea will display a scrollbar.

Setting Fixed Heights for Labels

With that, our scrollbar appears on the right-hand side. What has happened is that the scroll bar only appears when necessary. Without a fixed height constraint on the widgets, Qt assumes the most logical way to handle the many widgets is to resize them to fit. But by imposing size constraints on our widgets, the scroll bar appears so that all widgets can keep their fixed sizes.
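If you're building the UI in code rather than in Qt Designer, the same constraint can be applied per label. Here's a rough sketch of the idea (not part of the original tutorial):

python
import sys
from PyQt5.QtWidgets import QApplication, QLabel, QSizePolicy

app = QApplication(sys.argv)  # widgets need a QApplication to exist first

label = QLabel("TextLabel")
# A Fixed vertical policy stops the layout from squashing the label...
label.setSizePolicy(QSizePolicy.Preferred, QSizePolicy.Fixed)
# ...and a minimum height mirrors the 100px minimum set in Qt Designer.
label.setMinimumHeight(100)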

Another important thing to note is the properties of the scroll area itself. Instead of giving the labels fixed heights, we can keep their Vertical Policy as Preferred and set the scroll area's verticalScrollBarPolicy to ScrollBarAlwaysOn, which keeps the scroll bar visible at all times, as below.

ScrollArea Properties

Saving the .ui file and running it with the code from the start of this tutorial gives us the scroll area app we wanted.

App With Scroll Bar

Adding a QScrollArea from code

As with all widgets you can also add a QScrollArea directly from code. Below we repeat the above example, with a flexible scroll area for a given number of widgets, using code.

python
from PyQt5.QtWidgets import (QWidget, QSlider, QLineEdit, QLabel,
                             QPushButton, QScrollArea, QApplication,
                             QHBoxLayout, QVBoxLayout, QMainWindow)
from PyQt5.QtCore import Qt, QSize
from PyQt5 import QtWidgets, uic
import sys


class MainWindow(QMainWindow):

    def __init__(self):
        super().__init__()
        self.initUI()

    def initUI(self):
        self.scroll = QScrollArea()  # Scroll Area which contains the widgets, set as the centralWidget
        self.widget = QWidget()      # Widget that contains the collection of Vertical Box
        self.vbox = QVBoxLayout()    # The Vertical Box that contains the Horizontal Boxes of labels and buttons

        for i in range(1, 50):
            object = QLabel("TextLabel")
            self.vbox.addWidget(object)

        self.widget.setLayout(self.vbox)

        # Scroll Area Properties
        self.scroll.setVerticalScrollBarPolicy(Qt.ScrollBarAlwaysOn)
        self.scroll.setHorizontalScrollBarPolicy(Qt.ScrollBarAlwaysOff)
        self.scroll.setWidgetResizable(True)
        self.scroll.setWidget(self.widget)

        self.setCentralWidget(self.scroll)

        self.setGeometry(600, 100, 1000, 900)
        self.setWindowTitle('Scroll Area Demonstration')
        self.show()


def main():
    app = QtWidgets.QApplication(sys.argv)
    main = MainWindow()
    sys.exit(app.exec_())


if __name__ == '__main__':
    main()

If you run the above code you should see the output below, with a custom widget repeated multiple times down the window, and navigable using the scrollbar on the right.

Scroll Area App

Next, we'll step through the code to explain how this view is constructed.

First we create our layout hierarchy. At the top level we have our QMainWindow, onto which we set the QScrollArea using .setCentralWidget. This places the QScrollArea in the window, taking up the entire area.

To add content to the QScrollArea we need to add a widget using .setWidget, in this case we are adding a custom QWidget onto which we have applied a QVBoxLayout containing multiple sub-widgets.

python
def initUI(self):
    self.scroll = QScrollArea()  # Scroll Area which contains the widgets, set as the centralWidget
    self.widget = QWidget()      # Widget that contains the collection of Vertical Box
    self.vbox = QVBoxLayout()    # The Vertical Box that contains the Horizontal Boxes of labels and buttons

    for i in range(1, 50):
        object = QLabel("TextLabel")
        self.vbox.addWidget(object)

This gives us the following hierarchy in the window:

Scroll Area: the scroll area, added as the centralWidget to the QMainWindow
    Widget: the placeholder widget onto which we've applied the Vertical Layout below
        Vbox: the Vertical Layout containing all the QLabel widgets

Finally, we set up properties on the QScrollArea, setting the vertical scrollbar to Always On and the horizontal to Always Off. We allow the widget to be resized, and then add the central placeholder widget to complete the layout.

python
# Scroll Area Properties
self.scroll.setVerticalScrollBarPolicy(Qt.ScrollBarAlwaysOn)
self.scroll.setHorizontalScrollBarPolicy(Qt.ScrollBarAlwaysOff)
self.scroll.setWidgetResizable(True)
self.scroll.setWidget(self.widget)

Lastly, we add the QScrollArea as the central widget for our QMainWindow, set the window dimensions and title, and show the window.

python
self.setCentralWidget(self.scroll)
self.setGeometry(600, 100, 1000, 900)
self.setWindowTitle('Scroll Area Demonstration')
self.show()

Conclusion

In this tutorial we've learned how to add a scrolling region with an unlimited number of widgets, programmatically or using Qt Designer. Adding a QScrollArea is a good way to include multiple widgets, especially in apps that are data-intensive and need to display objects as lists.

Have a go at making your own apps with QScrollArea and share with us what you have made!

For more information about using QScrollArea check out the PyQt5 documentation.

Continuum Analytics Blog: Get Python Package Download Statistics with Condastats


Real Python: Python, Boto3, and AWS S3: Demystified


Amazon Web Services (AWS) has become a leader in cloud computing. One of its core components is S3, the object storage service offered by AWS. With its impressive availability and durability, it has become the standard way to store videos, images, and data. You can combine S3 with other services to build infinitely scalable applications.

Boto3 is the name of the Python SDK for AWS. It allows you to directly create, update, and delete AWS resources from your Python scripts.
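As a taste of what that looks like, here is a minimal sketch using Boto3's S3 resource API (it assumes AWS credentials are already configured, the bucket name and file are made up, and it is not an excerpt from the course):

python
import boto3

# Create a high-level S3 resource using credentials from the environment or ~/.aws.
s3 = boto3.resource('s3')

# List every bucket in the account.
for bucket in s3.buckets.all():
    print(bucket.name)

# Upload a local file to a (hypothetical) bucket under a chosen key.
s3.Bucket('my-example-bucket').upload_file('report.csv', 'reports/report.csv')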

If you’ve had some AWS exposure before, have your own AWS account, and want to take your skills to the next level by starting to use AWS services from within your Python code, then keep watching.

By the end of this course, you’ll:

  • Be confident working with buckets and objects directly from your Python scripts
  • Know how to avoid common pitfalls when using Boto3 and S3
  • Understand how to set up your data from the start to avoid performance issues later
  • Learn how to configure your objects to take advantage of S3’s best features

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Catalin George Festila: Python 3.7.5 : The new Django version 3.0 .

On December 2, 2019, Django 3.0 was released. This new release comes with many changes and features; let's see: Django 3.0 supports Python 3.6, 3.7, and 3.8 (the old Django 2.2.x series is the last to support Python 3.5); Django now officially supports MariaDB 10.1 and higher; Django 3.0 begins the journey toward a fully async-capable Django by providing support for running as an ASGI application; The

Dataquest: Excel vs Python: How to Do Common Data Analysis Tasks


In this tutorial, we’ll compare Excel and Python by looking at how to perform basic analysis tasks across both platforms. Excel is the most commonly used data analysis software in the world. Why? It’s easy to get the hang of and fairly powerful once you master it. In contrast, Python’s reputation is that it’s more […]

The post Excel vs Python: How to Do Common Data Analysis Tasks appeared first on Dataquest.

PyCoder’s Weekly: Issue #397 (Dec. 3, 2019)


#397 – DECEMBER 3, 2019
View in Browser »

The PyCoder’s Weekly Logo


Guido van Rossum Withdraws From the Python Steering Council

“Part of my reason is that in the end, SC duty feels more like a chore to me than fun, and one of the things I’m trying to accomplish in my life post Dropbox retirement is to have more fun. To me, fun includes programming in and contributing to Python, for example the PEG parser project.”
PYTHON.ORG

Python Descriptors: An Introduction

Learn what Python descriptors are and how they’re used in Python’s internals. You’ll learn about the descriptor protocol and how the lookup chain works when you access an attribute. You’ll also see a few practical examples where Python descriptors can come in handy.
REAL PYTHON

Become a Python Guru With PyCharm


PyCharm is the Python IDE for Professional Developers by JetBrains providing a complete set of tools for productive Python, Web and scientific development. Be more productive and save time while PyCharm takes care of the routine →
JETBRAINS sponsor

PSF Giving Tuesday Fundraiser

When you support the Python Software Foundation on Giving Tuesday, you'll support organizations like the Cameroon Digital Skills Campaign. The global donation drive runs for 24 hours starting December 3.
PYTHON.ORG

PIP_REQUIRE_VIRTUALENV: Requiring an Active Virtual Environment for Pip

Once the PIP_REQUIRE_VIRTUALENV environment variable is set, Pip will no longer let you accidentally install packages when you are not in a virtual environment.
PYTHON-GUIDE.ORG

Testing Your Python Package as Installed

How to test Python packages as they will be installed on your users’ systems to avoid subtle bugs.
PAUL GANSSLE

Flattening Nested Loops for Parallel Processing Speed Gains

A functional programming pattern you can use to parallelize the processing of nested loops.
S. LOTT

Python Jobs

Senior Python Engineer (Munich, Germany)

Stylight GmbH

Senior Python/Django Developer (Eindhoven, Netherlands)

Sendcloud

Django Full Stack Web Developer (Austin, TX)

Zeitcode

Contract Python / RaspPi / EPICS (Remote)

D-Pace Inc

More Python Jobs >>>

Articles & Tutorials

Reducing Pandas Memory Usage With Lossy Compression

“If you want to process a large amount of data with Pandas, there are various techniques you can use to reduce memory usage without changing your data. But what if that isn’t enough? What if you still need to reduce memory usage? Another technique you can try is lossy compression: drop some of your data in a way that doesn’t impact your final results too much.”
ITAMAR TURNER-TRAURING

Pandas: How to Read and Write Files

In this tutorial, you’ll learn about the Pandas IO tools API and how you can use it to read and write files. You’ll use the Pandas read_csv() function to work with CSV files. You’ll also cover similar methods for efficiently working with Excel, CSV, JSON, HTML, SQL, pickle, and big data files.
REAL PYTHON

Python Developers Are in Demand on Vettery


Vettery is an online hiring marketplace that’s changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today →
VETTERY sponsor

Python, Boto3, and AWS S3: Demystified

Get started working with Python, Boto3, and AWS S3. Learn how to create objects, upload them to S3, download their contents, and change their attributes directly from your script, all while avoiding common pitfalls.
REAL PYTHON

“Python Already Replaced Excel in Banking”

“You can already walk across the trading floor and see people writing Python code…it will become much more common in the next three to four years.”
SARAH BUTCHER opinion

A Simple Explanation of the Bag-Of-Words Model Using Python

The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears.
VICTOR ZHOU
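To make the idea concrete, here is a tiny bag-of-words sketch in plain Python (an illustration of the general technique, not code from the linked article):

python
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat"]

# The fixed vocabulary determines the length and ordering of every vector.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

print(vocab)                # ['cat', 'mat', 'on', 'sat', 'the']
print(bow_vector(docs[1]))  # [1, 1, 1, 1, 2]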

Need Python, UX, and Front-End Help for a Custom App or Design System?

As core developers on Django and Sass, OddBird provides integrated training and consulting for resilient web applications and infrastructure.
ODDBIRD sponsor

Property Based Testing for Scientific Code in Python

How to write better tests in less time by using property based testing with the hypothesis package.
MARKUS KONRAD • Shared by Markus Konrad

Projects & Code

pywonderland: Tour in the Wonderland of Math With Python

A collection of Python scripts for drawing beautiful figures and animating interesting algorithms in mathematics.
GITHUB.COM/NEOZHAOLIANG

Events

PiterPy Meetup

December 10, 2019
PITERPY.COM


Happy Pythoning!
This was PyCoder’s Weekly Issue #397.
View in Browser »


[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

Andre Roberge: Friendly-traceback, Real Python, Pycon, and more

After an interruption that lasted a few months, I've finally been able to return to programming, more specifically working mostly on Friendly-traceback. For those that do not know Friendly-traceback: it aims to replace the sometimes obscure traceback generated by Python with something easier to understand. Furthermore, Friendly-traceback is designed with support for languages other than English so that, in theory, beginners (who are the main target audience for Friendly-traceback) could benefit no matter what their native language is ... provided someone would have done the translation into that language, of course.

As of now, 75 different cases have been tested; you can find them in the documentation.  [If you have suggestions for improvements, please do not hesitate to let me know.]

Recently, a post by Real Python on SyntaxError has given me added impetus to work on Friendly-traceback. I'm happy to report that, other than the cases mentioned dealing with misspelled or missing keywords, all of the other examples mentioned in that post can be analyzed by Friendly-traceback with an appropriate explanation provided. Note that these are not hard-coded examples from that post, so that any similar cases should be correctly identified.

Friendly-traceback works with Python 3.6, 3.7 and 3.8.  As I included support for 3.8, I found that some error messages given by Python changed in this newer version, and were generally improved. However, this meant that I had to change a few things to support all three versions.

Working on Friendly-traceback, and on AvantPy, has so far been a fun learning experience for me. I was hoping and looking forward to submitting a talk proposal dealing with both these projects to the Pycon Education Summit, as I thought that both projects would be of interest to Python educators. However, the call for proposals is focused on people's experience with actual teaching case studies about how teachers and Python programmers have implemented Python instruction in their schools, communities, and other places of learning ... So, definitely no interest in talks about tools like those I create. I certainly understand the reason for this choice, but I cannot help feeling disappointed, as I was definitely hoping to get an opportunity to give a talk on these projects and exchange ideas with interested people afterwards.

I did submit a proposal for a reasonably advanced and more technical talk dealing with import hooks and exception hooks, to share what I have learned (while working on Friendly-traceback and AvantPy) with the Pycon crowd. The last time I gave a talk at Pycon was in 2009 and the "competition" to have a talk accepted was much less than what it is now.  Giving a talk is the only way that I can justify taking a leave from my day job to attend Pycon, something I really miss.

Back to Real Python ... I remember purchasing some books from them some time in 2014, and, more recently, I did the same for the "course" on virtual environments. I had never bothered with virtual environments until recently and thought that, if I actually paid to get some proper tutorial, I would have no excuse not to start using virtual environments properly.  The "course" that I bought was well put together.  Compared to standard books, I find it a bit overpriced for the amount of material included. 

As a pure Python hobbyist, I appreciate the material Real Python make freely available, but do find their membership price rather steep.  However, I did note that their tutorial writers could get free access to their entire collection ... 

;-) Perhaps I should offer to write tutorials on 1) using import hooks; 2) using exception hooks; 3) designing libraries with support for translations in a way that they "play well together" -- all topics I had to figure out on my own.  While there are tutorials about translation support, I found that all of them give the same gettext-based approach of defining a global function named "_" which works very well for isolated packages, but can fail spectacularly in some corner cases as I found out while developing Friendly-traceback and AvantPy. 
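For readers unfamiliar with that pattern, this is roughly what the conventional gettext setup looks like (a minimal sketch of the approach being referred to, not code from Friendly-traceback or AvantPy; the domain name and locale directory are made up):

python
import gettext

# install() puts a global _() into builtins for the whole process.
# That's convenient for a single application, but if two libraries each
# install their own translations, the last one wins -- the kind of corner
# case where this approach can break down.
gettext.install("myapp", localedir="locales")

print(_("Hello, world!"))  # translated if a matching .mo file exists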

However, writing clear tutorials takes a lot of time and effort, and is not as fun to me as writing code. So, I think that, for now, I'll just go back to add support for more Python exceptions in Friendly-traceback - and hope that I will have soon to focus my entire free time in putting together material for a Pycon talk.