In November 2016, we released Version 1.0.6 of the Data Import Tool (DIT), an addition to the Canopy data analysis environment. With the Data Import Tool, you can quickly import structured data files as Pandas DataFrames, clean and manipulate the data using a graphical interface, and create reusable Python scripts to speed future data wrangling.
For example, the Data Import Tool lets you delete rows and columns containing Null values or replace the Null values in the DataFrame with a specific value. It also allows you to create new columns from existing ones. All operations are logged and are reversible in the Data Import Tool so you can experiment with various workflows with safeguards against errors or forgetting steps.
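For readers who want to see what these operations look like in plain pandas, here is a minimal sketch. The tiny DataFrame and its column names are made up for illustration, and this is not necessarily the code the Data Import Tool generates.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "salary": [100.0, np.nan, 300.0],
    "overtime": [10.0, 20.0, np.nan],
})

# Replace Null (NaN) values with a specific value.
filled = df.fillna(0.0)

# Create a new column from existing ones.
filled["total"] = filled["salary"] + filled["overtime"]

# Alternatively, delete every row that contains a Null value.
complete_rows = df.dropna()

print(filled)
print(complete_rows)
```

Because `fillna` and `dropna` return new DataFrames here, the original `df` stays untouched, which makes it easy to compare the two approaches side by side.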
What’s New in the Data Import Tool November 2016 Release
Pandas 0.19 support, re-usable templates for data munging, and more.
Over the last couple of releases, we added a number of new features and enhanced a number of existing ones. A few notable changes are:
- The Data Import Tool now supports the recently released Pandas version 0.19.0. With this update, the Tool now supports Pandas versions 0.16 through 0.19.
- The Data Import Tool now allows you to delete empty columns in the DataFrame, similar to the existing option to delete empty rows.
- The Data Import Tool allows you to choose how to delete rows or columns containing Null values: “Any” or “All” methods are available.
- Every time you successfully import a DataFrame, the Data Import Tool automatically saves a generated Python script in your home directory. This way, you can easily review and reproduce your earlier work.
- The Data Import Tool generates a Template with every successful import. A Template is a file that contains all of the commands or actions you performed on the DataFrame and a unique Template file is generated for every unique data file. With this feature, when you load a data file, if a Template file exists corresponding to the data file, the Data Import Tool will automatically perform the operations you performed the last time. This way, you can save progress on a data file and resume your work.
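The “Any” and “All” options in the list above map naturally onto the `how` argument of pandas' `DataFrame.dropna`. The sketch below uses a made-up DataFrame to show the difference; whether the Tool calls `dropna` in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "c" is entirely empty, row 2 is entirely empty.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

# Delete empty columns: drop a column only if all of its values are Null.
no_empty_cols = df.dropna(axis=1, how="all")

# "All": drop a row only if every value in it is Null.
rows_all = no_empty_cols.dropna(axis=0, how="all")

# "Any": drop a row if at least one value in it is Null.
rows_any = no_empty_cols.dropna(axis=0, how="any")

print(list(no_empty_cols.columns))
print(len(rows_all), len(rows_any))
```

“Any” is the stricter choice and can discard far more data than “All”, so it is worth checking the row count before and after.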
Along with the feature additions discussed above, based on continued user feedback, we implemented a number of UI/UX improvements and bug fixes in this release. For a complete list of changes introduced in Version 1.0.6 of the Data Import Tool, please refer to the Release Notes page in the Tool’s documentation.
Example Use Case: Using the Data Import Tool to Speed Data Cleaning and Transformation
Now let’s take a look at how the Data Import Tool can be used to speed up the process of cleaning up and transforming data sets. As an example data set, we’ll use the Employee Compensation data from the city of San Francisco.
NOTE: You can follow the example step by step by downloading Canopy and starting a free 7-day trial of the Data Import Tool.
Step 1: Load data into the Data Import Tool
First we’ll download the data as a .csv file from the San Francisco Government data website, then open it from the File -> Import Data -> From File… menu item in the Canopy Editor (see screenshot at right).
After loading the file, you should see the DataFrame below in the Data Import Tool:
As you can see at the right, the Data Import Tool automatically detected the columns “Job Code”, “Job Family Code” and “Union Code” and converted them to an Integer column type. If the Tool inferred a type incorrectly, you can remove a specific column conversion by deleting it from the Edit Command window, or remove all conversions by clicking on the “X” for that command in the Command History window.
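If you prefer to do the loading and type conversion by hand, a rough pandas equivalent might look like the sketch below. Only the three code column names come from this example; the rows are made up, and the in-memory CSV stands in for the downloaded file.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded .csv file.
csv_text = """Job Code,Job Family Code,Union Code,Job
1001,1000,21,Job A
8102,8100,311,Job B
"""

# Read every column as text first, so that no conversion happens implicitly.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert only the code columns to integers, leaving the rest as text.
code_columns = ["Job Code", "Job Family Code", "Union Code"]
converted = raw.astype({col: int for col in code_columns})

print(raw.dtypes)
print(converted.dtypes)
```

Keeping `raw` around plays the same role as removing a conversion in the Tool: if an integer conversion turns out to be wrong for a column, you can drop it from `code_columns` and convert again.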
Step 2: Use the Data Import Tool to quickly assess data by sorting in the GUI
Using the Employee Compensation data set, let’s answer a few questions. For example, let’s see which Job Families get the highest Salary, the highest Overtime, the highest Total Salary and the highest Total Compensation. Further, let’s also determine what the highest and median Total Compensation for a Job Family is.
Let’s start with the question “Which Job Family contains the highest Salary?” We can get this information easily by clicking on the right end of the “Salaries” column to sort the column in ascending or descending order. Doing so, we can see that the highest paid Job Family is “Administrative & Mgmt (Unrep)” and, specifically, the Job is Chief Investment Officer. In fact, 4 of the top 5 Salaries are paid to Chief Investment Officers.
Similarly, we can sort the “Overtime” column (see screenshot at right) to see which Job Family gets paid the most Overtime (turns out to be the “Deputy Sheriff” job family).
Sort the Total Salary and Total Compensation columns to find out which Job and Job Family had the highest salary and highest overall compensation.
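The same sorting can be done in code with `DataFrame.sort_values`. The sketch below uses made-up rows and placeholder Job Family names, so the results do not reflect the real data set.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Employee Compensation data.
df = pd.DataFrame({
    "Job Family": ["Family A", "Family B", "Family C"],
    "Salaries": [50000.0, 120000.0, 80000.0],
    "Overtime": [9000.0, 0.0, 2500.0],
})

# Sort in descending order and take the first row to get the largest value.
top_salary = df.sort_values("Salaries", ascending=False).iloc[0]
top_overtime = df.sort_values("Overtime", ascending=False).iloc[0]

print(top_salary["Job Family"])
print(top_overtime["Job Family"])
```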
[Note: While sorting the data set, you may have noticed that there are negative values in the Salaries column. Yup. And hey! Don’t ask us. We don’t know why there are negative Salaries values either. If you know why or if you can figure out why, we would love to know! Comment below and tell us!]
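One way to dig into such values is to pull out the affected rows with a boolean mask. This is a minimal sketch on made-up rows, not an analysis of the actual data set.

```python
import pandas as pd

# Made-up rows, including one negative Salaries value.
df = pd.DataFrame({
    "Job": ["Job A", "Job B", "Job C"],
    "Salaries": [52000.0, -150.0, 0.0],
})

# Keep only the rows where Salaries is below zero.
negative = df[df["Salaries"] < 0]

print(len(negative))
print(negative)
```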
Step 3: Simplify and Clean Data
Let’s now look at the second question we mentioned earlier: “What is the median income for different Job Families?” But before we get to that, let’s first remove a few columns with data not relevant to the questions we’re trying to answer (or you may choose to ask different questions and keep the columns instead). Here we delete columns by clicking on the “Delete” menu item after right-clicking on a column name.
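In pandas, the equivalent of deleting columns is `DataFrame.drop` with `axis=1`. The sketch below assumes, purely for illustration, that the code columns are the ones being removed; the rows are made up.

```python
import pandas as pd

# Made-up rows; the choice of columns to delete is an assumption.
df = pd.DataFrame({
    "Job Family": ["Family A", "Family B"],
    "Job Code": [1, 2],
    "Union Code": [10, 20],
    "Total Compensation": [1000.0, 2000.0],
})

# drop returns a new DataFrame without the listed columns.
trimmed = df.drop(["Job Code", "Union Code"], axis=1)

print(list(trimmed.columns))
```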
When you are satisfied with how the DataFrame looks, click on the “Use DataFrame” button to push the DataFrame to Canopy’s IPython Console, where we can further analyze the data set. In Canopy’s IPython console, you can see what the final columns in the DataFrame are, which can be accessed using DataFrame.columns.
[u'Year Type', u'Year', u'Organization Group', u'Department', u'Job Family', u'Job', u'Salaries', u'Overtime', u'Other Salaries', u'Total Salary', u'Retirement', u'Health/Dental', u'Other Benefits', u'Total Benefits', u'Total Compensation']
Let’s now use pandas’ DataFrame.groupby method to calculate the median Total Compensation of different Job Families over the years. Passing both Job Family and Year segments the original DataFrame by Job Family first and Year next. This way, we can see how the median Total Compensation differs between Job Families and how it changed within a Job Family over the years.
# Employee_Compensation is assumed to be the DataFrame pushed to the console in the previous step.
grouped_df = Employee_Compensation.groupby(['Job Family', 'Year'])
# Each group name is a (Job Family, Year) tuple; print the Year first.
for name, df in grouped_df:
    print("{} - {}: median={:9.2f}, n={}".format(name[-1], name[0],
                                                 df['Total Compensation'].median(),
                                                 df['Total Compensation'].count()))
2013 - Administrative & Mgmt (Unrep): median=65154.66, n=9
2014 - Administrative & Mgmt (Unrep): median=189534.965, n=12
2015 - Administrative & Mgmt (Unrep): median=352931.01, n=13
2016 - Administrative & Mgmt (Unrep): median=351961.28, n=9
2013 - Administrative Secretarial: median=122900.205, n=22
2014 - Administrative Secretarial: median=130164.525, n=20
2015 - Administrative Secretarial: median=127206.02, n=19
2016 - Administrative Secretarial: median=137861.05, n=9
2013 - Administrative-DPW/PUC: median=164535.52, n=89
2014 - Administrative-DPW/PUC: median=172906.585, n=82
2015 - Administrative-DPW/PUC: median=180582.9, n=85
2016 - Administrative-DPW/PUC: median=180095.54, n=44
. . .
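As an alternative to looping over the groups, the same statistics can be collected into a single table with `agg`. This is a sketch on made-up rows with placeholder Job Family names, so the numbers are not comparable to the output above.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Employee Compensation data.
df = pd.DataFrame({
    "Job Family": ["Family A", "Family A", "Family A", "Family B", "Family B"],
    "Year": [2015, 2015, 2016, 2015, 2015],
    "Total Compensation": [100.0, 300.0, 500.0, 40.0, 60.0],
})

# One row per (Job Family, Year) pair, with the median and the group size.
summary = (
    df.groupby(["Job Family", "Year"])["Total Compensation"]
      .agg(["median", "count"])
)

print(summary)
```

The result is itself a DataFrame indexed by Job Family and Year, so it can be sorted, filtered or plotted like any other.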
We hope that this gives you a small idea of what can be done using the Data Import Tool and the Python Pandas library. If you analyzed this data set in a different way, comment below and tell us about it.
BTW, if you are interested in honing your data analysis skills in Python, check out our Virtual Pandas Crash Course or join the Pandas Mastery Workshop for a more comprehensive introduction to Pandas and data analysis using it.
If you have any feedback regarding the Data Import Tool, we’d love to hear from you at canopy.support@enthought.com.
Additional resources:
- Download Canopy here and start a free 7-day trial of the Data Import Tool
- Canopy Data Import Tool Product Page
Watch a 2-minute demo video to see how the Canopy Data Import Tool works:
See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging: