Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help you do so without first having to learn the details of programming with the Pandas library.
The Data Import Tool leverages the power of Pandas while providing an interactive UI, allowing you to visually explore and experiment with the DataFrame (the Pandas equivalent of a spreadsheet or a SQL table), without having to know the details of the Pandas-specific function calls and arguments. The Data Import Tool keeps track of all of the changes you make (in the form of Python code). That way, when you are done finding the right workflow for your data set, the Tool has a record of the series of actions you performed on the DataFrame, and you can apply them to future data sets for even faster data wrangling in the future.
At the same time, the Tool can help you pick up how to use the Pandas library, while still getting work done. For every action you perform in the graphical interface, the Tool generates the appropriate Pandas/Python code, allowing you to see and relate the tasks to the corresponding Pandas code.
With the Data Import Tool, loading data is as simple as choosing a file or pasting a URL. If a file is chosen, the Tool automatically determines the format of the file, detects whether the file is compressed, and intelligently loads its contents into a Pandas DataFrame. It does so while taking into account the various possibilities that often throw a monkey wrench into initial data loading: the file might contain comment lines, it might or might not contain a header row, the values in different columns could be of different types (e.g., DateTime or Boolean), and more.
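As a small taste of one of those possibilities, Pandas itself can infer compression from a file's extension (compression='infer' is the default for pandas.read_csv). A minimal, self-contained sketch; the file name here is hypothetical:

```python
import gzip
import pandas as pd

# Write a small gzip-compressed CSV (hypothetical file name), then let
# pandas infer the compression from the .gz extension when reading it.
with gzip.open('example.csv.gz', 'wt') as f:
    f.write("a,b\n1,2\n3,4\n")

df = pd.read_csv('example.csv.gz')   # no compression argument needed
print(df.shape)   # (2, 2)
```

Detecting comment characters, header rows, and column types, however, is where the Tool goes beyond what these defaults give you.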
A Glimpse into Loading Data into Pandas DataFrames (The Hard Way)
The following four “inconvenience” examples show typical problems (and their manual solutions) that arise when writing Pandas code to load data. The Data Import Tool solves them automatically, saving you time and frustration and letting you get to the important work of data analysis more quickly.
Let’s say you want to load data from a file yourself. After searching the Pandas documentation a bit, you will come across the pandas.read_table function, which loads the contents of a file into a Pandas DataFrame. But it’s never so easy in practice: pandas.read_table and the other functions you might find assume certain defaults, which might be at odds with the data in your file.
Inconvenience #1: Data in the first row is automatically used as a header. Let’s say that your file (like this one: [wind.data]) uses whitespace as the separator between columns and doesn’t have a row containing column names. By default, pandas.read_table assumes that your file contains a header row and uses tabs as delimiters. If you don’t tell it otherwise, Pandas will use the data from the first row in your file as column names, which is clearly wrong in this case.
From the docs, you can discover that this behavior can be turned off by passing header=None and sep='\s+' to pandas.read_table, to treat runs of whitespace as the separator and to inform Pandas that a header row doesn’t exist. First, here’s what happens if you load the file with only sep='\s+':
In [1]: df = pandas.read_table('wind.data', sep='\s+')
In [2]: df.head()
Out[2]:
61 1 1.1 15.04 14.96 13.17 9.29 13.96 9.87 13.67 10.25 10.83 \
0 61 1 2 14.71 16.88 10.83 6.50 12.62 7.67 11.50 10.04 9.79
1 61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 8.04 8.50
12.58 18.50 15.04.1
0 9.67 17.54 13.83
1 7.67 12.75 12.71
Without the header=None kwarg, the first row of data is treated as column names, as seen above. Adding header=None gives the correct result:
In [3]: df = pandas.read_table('wind.data', header=None, sep='\s+')
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 \
0 61 1 1 15.04 14.96 13.17 9.29 13.96 9.87 13.67 10.25 10.83
1 61 1 2 14.71 16.88 10.83 6.50 12.62 7.67 11.50 10.04 9.79
12 13 14
0 12.58 18.50 15.04
1 9.67 17.54 13.83
This is the behavior we expected, once we tell Pandas via header=None that the file does not contain a row of column names.
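Once the header problem is fixed, a natural next step is to supply real column names through the names= parameter. A self-contained sketch using the two rows shown above, embedded inline so it runs without wind.data on disk; the column names are illustrative guesses, not from the original file:

```python
import io
import pandas as pd

# The two rows from the output above, inlined; names= replaces the
# default integer labels with descriptive (here, guessed) column names.
raw = """\
61 1 1 15.04 14.96 13.17 9.29 13.96 9.87 13.67 10.25 10.83 12.58 18.50 15.04
61 1 2 14.71 16.88 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
"""
cols = ['year', 'month', 'day'] + ['station_%d' % i for i in range(1, 13)]
df = pd.read_table(io.StringIO(raw), sep=r'\s+', header=None, names=cols)
print(df.shape)              # (2, 15)
print(df['year'].tolist())   # [61, 61]
```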
[File : test_data_comments.txt]
Inconvenience #2: Commented lines cause the data load to fail. Next, let’s say that your file contains commented lines which start with a #. Pandas doesn’t understand this by default, and trying to load the data into a DataFrame will either fail with an error or, worse, succeed without notifying you that rows in the DataFrame might contain erroneous data from the commented lines. (This might also prevent correct inference of column types.)
Again, you can tell pandas.read_table that commented lines exist in your file and should be skipped by passing comment='#':
In [1]: df = pandas.read_table('test_data_comments.txt', sep=',', header=None)
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-10-b5cd8eee4851> in <module>()
----> 1 df = pandas.read_table('catalyst/tests/data/test_data_comments.txt', sep=',', header=None)
(traceback)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5
As mentioned earlier, if you are lucky, Pandas fails with a CParserError, complaining that rows in the data file contain different numbers of columns. Needless to say, the error message alone gives no hint that an unrecognized comment line is the culprit:
In [2]: df = pandas.read_table('test_data_comments.txt', sep=',', comment='#', header=None)
In [3]: df
Out[3]:
0 1 2 3 4
0 1 False 1.0 one 2015-01-01
1 2 True 2.0 two 2015-01-02
2 3 False 3.0 three 2015-01-03
3 4 True 4.0 four 2015-01-04
And we can read the file contents correctly when we tell Pandas that ‘#’ is the character that commented lines in the file start with.
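A self-contained variant of the same fix, with hypothetical contents mimicking test_data_comments.txt, also shows that skipping comments lets column types be inferred correctly:

```python
import io
import pandas as pd

# Hypothetical contents mimicking test_data_comments.txt; comment='#'
# drops both the leading and the mid-file comment lines.
raw = """\
# a header comment that would otherwise break parsing
1,False,1.0,one,2015-01-01
2,True,2.0,two,2015-01-02
# comments can appear mid-file too
3,False,3.0,three,2015-01-03
"""
df = pd.read_csv(io.StringIO(raw), comment='#', header=None, parse_dates=[4])
print(df.shape)       # (3, 5)
print(df[4].dtype)    # datetime64[ns]
```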
[File : ncaa_basketball_2016.txt]
Inconvenience #3: Fixed-width formatted data will cause the data load to fail. Now let’s say that your file contains data in a fixed-width format. Trying to load this data using pandas.read_table will fail. Dig around a little and you will come across the function pandas.read_fwf, which, not pandas.read_table, is the suggested way to load data from fixed-width files.
In [1]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None)
In [2]: df.head()
Out[2]:
0
0 2016-02-25 @Ark Little Rock 72 UT Ar...
1 2016-02-25 ULM 66 @South...
Those of you familiar with Pandas will recognize that the above DataFrame, created from the file, contains only one column, labelled 0, which is clearly wrong because there are several distinct columns in the file.
In [3]: df = pandas.read_table('ncaa_basketball_2016.txt', header=None, sep='\s+')
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-28-db4f2f128b37> in <module>()
----> 1 df = pandas.read_table('functional_tests/data/ncaa_basketball_2016.txt', header=None, sep='\s+')
(Traceback)
CParserError: Error tokenizing data. C error: Expected 8 fields in line 55, saw 9
If we didn’t know better, we might assume that the delimiter used in the file was whitespace, and tell Pandas to load the file using sep='\s+'. But, as you can clearly see above, that raises a CParserError, complaining that one row contains more columns of data than a previous one.
In [4]: df = pandas.read_fwf('ncaa_basketball_2016.txt', header=None)
In [5]: df.head()
Out[5]:
0 1 2 3 4 5
0 2016-02-25 @Ark Little Rock 72 UT Arlington 60 NaN
1 2016-02-25 ULM 66 @South Alabama 59 NaN
And finally, using pandas.read_fwf instead of pandas.read_table gives us a DataFrame that is close to what we expected, given the data in the file.
Inconvenience #4: NA is not recognized as text; it is automatically converted to a missing value. Finally, let’s assume that you have raw data containing the string NA, which in this specific case is used to represent North America. By default, pandas.read_csv interprets these strings as missing values and silently converts them to NaN, all under the hood, without informing the user. One of the things the Zen of Python says is that explicit is better than implicit. In that spirit, the Tool explicitly lists the values which will be interpreted as None/NaN.
The user can remove NA (or any of the other values) from this list to prevent it from being interpreted as a missing value, as shown with the following file:
[File : test_data_na_values.csv]
In [2]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None)
In [3]: df
Out[3]:
0 1 2
0 NaN 1 True
1 NaN 2 False
2 NaN 3 False
3 NaN 4 True
In [4]: df = pandas.read_table('test_data_na_values.csv', sep=',', header=None, keep_default_na=False, na_values=[])
In [5]: df
Out[5]:
0 1 2
0 NA 1 True
1 NA 2 False
2 NA 3 False
3 NA 4 True
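The same fix in self-contained form, with a couple of hypothetical rows where NA means North America rather than missing data:

```python
import io
import pandas as pd

# Hypothetical rows where 'NA' is a real value (North America), not a
# marker for missing data.
raw = "NA,1,True\nNA,2,False\n"

df_default = pd.read_csv(io.StringIO(raw), header=None)
df_kept = pd.read_csv(io.StringIO(raw), header=None,
                      keep_default_na=False, na_values=[])
print(df_default[0].isna().all())   # True: 'NA' silently became NaN
print(df_kept[0].tolist())          # ['NA', 'NA']
```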
If your intention was to jump straight into data exploration and manipulation, then the above points are some of the inconveniences you will have to deal with, requiring you to learn the various arguments that need to be passed to pandas.read_table before you can load your data correctly and get to your analysis.
Loading Data with the Data Import Tool (The Easy Way)
The Data Import Tool takes care of all of these problems for you, allowing you to fast forward to the important work of data exploration and manipulation. It automatically:
- Infers if your file contains a row of column names or not;
- Intelligently infers if your file contains any commented lines and what the comment character is;
- Infers what delimiter is used in the file or if the file contains data in a fixed-width format.
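The Tool’s detection logic is internal to Canopy, but Python’s standard library offers a rough analogue of delimiter and header inference in csv.Sniffer; the sample data here is hypothetical:

```python
import csv

# Rough stdlib analogue of the inference described above: Sniffer
# guesses the delimiter (and can guess at a header row) from a sample.
sample = "name;score;passed\nAda;91;yes\nBob;78;no\n"
sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)
print(dialect.delimiter)          # ';'
print(sniffer.has_header(sample))
```

Sniffer’s heuristics are far cruder than the Tool’s, but they illustrate the kind of inference at work.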
Download Canopy (free) and start a free trial of the Data Import Tool to see just how much time and frustration you can save!
The Data Import Tool as a Learning Resource: Using Auto-Generated Python/Pandas code
So far, we talked about how the Tool can help you get started with data exploration, without the need for you to understand the Pandas library and its intricacies. But, what if you were also interested in learning about the Pandas library? That’s where the Python Code pane in the Data Import Tool can help.
As you can see from the screenshot below, the Data Import Tool generates Pandas/Python code for every command you execute. This way, you can explore and learn about the Pandas library using the Tool.
Finally, once you are done loading data from the file and manipulating the DataFrame, you can export the DataFrame to Canopy’s IPython console for further analysis and visualization. Simply click Use DataFrame at the bottom-right corner and the Tool will export the DataFrame to Canopy’s IPython pane, as you can see below.
Ready to try the Canopy Data Import Tool?
Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today
Additional resources:
- Download Enthought’s set of 8 Pandas Cheat Sheets, a quick reference tool for Pandas code and concepts
- Download Canopy and start a free 7 day trial of the data import tool
- Canopy Data Import Tool Product Page
- Related Blog post: Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features
Watch a 2-minute demo video to see how the Canopy Data Import Tool works:
See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging: