Channel: Planet Python

François Dion: Reproducible research from a book


Preamble

Sometimes, you don't have direct access to the data, or the data changes over time. 
 
Yeah, I know, scary. That's the point of this post: provide a URL to a "frozen" version of your data, if at all possible. Toward the end of the article I link to the notebook. The same repo also holds the data I used for the visualization.
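One way to make a frozen copy checkable is to publish a checksum next to its URL. This is a minimal sketch using only the standard library, assuming the data sits in a local file. The helper name `sha256_of` is hypothetical and not part of the original notebook:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest of a data file next to its download link lets a reader confirm they are working from the same bytes.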
 
Let's get right into it...
 
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("talk")

 

Reproducible visualization

In "The Functional Art: An introduction to information graphics and visualization" by Alberto Cairo, on page 12 we are presented with a visualization of UN data time series of Fertility rate (average number of children per woman) per country:

Figure 1.6 Highlighting the relevant, keeping the secondary in the background.

Book url:
The Functional Art


Let's try to reproduce this.

 

Getting the data

The visualization was made in 2012, but it only covers data up to 2010.

In theory this should make the data easy to get, since it is historical. The series are now directly available as Excel spreadsheets; we'll just ignore the last bucket (2010-2015).
Pandas can load an Excel spreadsheet straight from a URL, but here we download it first so we have a local copy.

In [3]:
!wget 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'
--2015-12-29 16:57:23--  http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS
Resolving esa.un.org... 157.150.185.69
Connecting to esa.un.org|157.150.185.69|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 869376 (849K) [application/vnd.ms-excel]
Saving to: 'WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

WPP2015_FERT_F04_TO 100%[=====================>] 849.00K 184KB/s in 4.6s

2015-12-29 16:57:28 (184 KB/s) - 
'WPP2015_FERT_F04_TOTAL_FERTILITY.XLS' saved [869376/869376]
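The same download can be done from Python, skipping the request when a local copy already exists. This is a minimal sketch using only the standard library; `fetch_once` is a hypothetical helper, not something from the original notebook:

```python
import urllib.request
from pathlib import Path


def fetch_once(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # Only hit the network when there is no local copy yet.
        with urllib.request.urlopen(url) as response:
            dest.write_bytes(response.read())
    return dest
```

Because an existing file is never overwritten, a later change to the upstream spreadsheet does not silently change the local copy the analysis runs on.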

 

World Population Prospects: The 2015 Revision

File FERT/4: Total fertility by major area, region and country, 1950-2100 (children per woman)
Estimates, 1950 - 2015                                
POP/DB/WPP/Rev.2015/FERT/F04
July 2015 - Copyright © 2015 by United Nations. All rights reserved
Suggested citation: United Nations, Department of Economic and Social Affairs,
Population Division (2015). 
World Population Prospects: The 2015 Revision, DVD Edition.
 
In [2]:
df=pd.read_excel('WPP2015_FERT_F04_TOTAL_FERTILITY.XLS',skiprows=16, 
                   index_col='Country code')
df=df[df.index<900]
 
In [3]:
len(df)
Out[3]:
201
 
In [4]:
df.head()
Out[4]:

Country code  Index  Variant    Major area, region, country or area *  Notes  1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985  1985-1990  1990-1995  1995-2000  2000-2005  2005-2010  2010-2015
108              15  Estimates  Burundi                                NaN       6.8010     6.8570     7.0710     7.2680     7.3430     7.4760     7.4280     7.5920     7.4310     7.1840      6.908      6.523     6.0756
174              16  Estimates  Comoros                                NaN       6.0000     6.6010     6.9090     7.0500     7.0500     7.0500     7.0500     6.7000     6.1000     5.6000      5.200      4.900     4.6000
262              17  Estimates  Djibouti                               NaN       6.3120     6.3874     6.5470     6.7070     6.8450     6.6440     6.2570     6.1810     5.8500     4.8120      4.210      3.700     3.3000
232              18  Estimates  Eritrea                                NaN       6.9650     6.9650     6.8150     6.6990     6.6200     6.6200     6.7000     6.5100     6.2000     5.6000      5.100      4.800     4.4000
231              19  Estimates  Ethiopia                               NaN       7.1696     6.9023     6.8972     6.8691     7.1038     7.1838     7.4247     7.3673     7.0888     6.8335      6.131      5.258     4.5889

First problem... The book states on page 8:
--
Yet we have 201 countries (codes 900+ are regions) with complete data. We do not have an easy way to identify which countries were added to this. Still, let's move forward and prep our data.
In [5]:
df.rename(columns={df.columns[2]:'Description'},inplace=True)
In [6]:
df.drop(df.columns[[0,1,3,16]],axis=1,inplace=True)  # drop what we don't need
In [7]:
df.head()
Out[7]:

Country code  Description  1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985  1985-1990  1990-1995  1995-2000  2000-2005  2005-2010
108           Burundi         6.8010     6.8570     7.0710     7.2680     7.3430     7.4760     7.4280     7.5920     7.4310     7.1840      6.908      6.523
174           Comoros         6.0000     6.6010     6.9090     7.0500     7.0500     7.0500     7.0500     6.7000     6.1000     5.6000      5.200      4.900
262           Djibouti        6.3120     6.3874     6.5470     6.7070     6.8450     6.6440     6.2570     6.1810     5.8500     4.8120      4.210      3.700
232           Eritrea         6.9650     6.9650     6.8150     6.6990     6.6200     6.6200     6.7000     6.5100     6.2000     5.6000      5.100      4.800
231           Ethiopia        7.1696     6.9023     6.8972     6.8691     7.1038     7.1838     7.4247     7.3673     7.0888     6.8335      6.131      5.258
In [8]:
highlight_countries=['Niger','Yemen','India',
'Brazil','Norway','France','Sweden','United Kingdom',
'Spain','Italy','Germany','Japan','China'
]
In [9]:
# Subset only countries to highlight, transpose for timeseries
df_high=df[df.Description.isin(highlight_countries)].T[1:]
In [10]:
# Subset the rest of the countries, transpose for timeseries
df_bg=df[~df.Description.isin(highlight_countries)].T[1:]
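The subset-and-transpose step can be seen on a toy frame. This is only an illustration: the frame `toy`, its country names and its numbers are all made up, and only the layout mirrors `df`:

```python
import pandas as pd

# Made-up numbers for three made-up countries, same layout as df.
toy = pd.DataFrame(
    {
        "Description": ["A", "B", "C"],
        "1950-1955": [6.0, 5.0, 3.0],
        "1955-1960": [5.5, 4.0, 2.5],
    },
    index=pd.Index([1, 2, 3], name="Country code"),
)

mask = toy.Description.isin(["A", "C"])

# .T turns periods into rows and countries into columns;
# [1:] drops the first row, which holds the Description strings.
toy_high = toy[mask].T[1:]
toy_bg = toy[~mask].T[1:]

print(toy_high)
```

The transpose matters because pandas draws one line per column, so each country has to become a column and the periods have to become the index.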

Let's make some art

In [11]:
# background
ax=df_bg.plot(legend=False,color='k',alpha=0.02,figsize=(12,12))
ax.xaxis.tick_top()

# highlighted countries
df_high.plot(legend=False,ax=ax)

# replacement level line
ax.hlines(y=2.1,xmin=0,xmax=12,color='k',alpha=1,linestyle='dashed')

# Average over time on all countries (numeric columns only, skipping Description)
df.mean(numeric_only=True).plot(ax=ax,color='k',label='World\naverage')

# labels for highlighted countries on the right side
for country in highlight_countries:
    ax.text(11.2,df[df.Description==country].values[0][12],country)

# start y axis at 1
ax.set_ylim(bottom=1)
Out[11]:
(1, 9.0)
For one thing, the line for China doesn't look like the one in the book. Concerning. The other issue is that some lines dip below Italy or Spain in 1995-2000 and in 2000-2005 (the majority in the Balkans), and these were not on the graph in the book, AFAICT:
In [12]:
df.describe()
Out[12]:

        1950-1955   1955-1960   1960-1965   1965-1970   1970-1975   1975-1980   1980-1985   1985-1990   1990-1995   1995-2000   2000-2005   2005-2010
count   201.00000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000  201.000000
mean      5.45045    5.495005    5.491424    5.265483    4.994911    4.657349    4.403227    4.122837    3.762972    3.412293    3.141556    2.992349
std       1.64388    1.674181    1.734726    1.849984    1.944553    2.039995    2.033660    1.952100    1.849278    1.791151    1.701363    1.562150
min       1.98000    1.950000    1.850000    1.810000    1.623000    1.407900    1.427300    1.349700    1.240000    0.870000    0.825200    0.937900
25%       4.27700    4.201000    4.273100    3.447000    2.990000    2.540200    2.301500    2.230000    2.050000    1.889100    1.806100    1.818200
50%       5.99500    6.134100    6.129700    5.950000    5.470000    4.974900    4.370000    3.800000    3.343000    2.941500    2.600000    2.479300
75%       6.70000    6.764000    6.800000    6.707000    6.700000    6.525000    6.315000    5.900000    5.217000    4.637000    4.210000    3.980000
max       8.00000    8.150000    8.200000    8.200000    8.284000    8.500000    8.800000    8.800000    8.200000    7.746600    7.720900    7.678700
In [13]:
df[df['1995-2000']<1.25]
Out[13]:

Country code  Description           1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985  1985-1990  1990-1995  1995-2000  2000-2005  2005-2010
344           China, Hong Kong SAR     4.4400     4.7200     5.3100     3.6450     3.2900     2.3100     1.7150     1.3550     1.2400     0.8700     0.9585     1.0257
446           China, Macao SAR         4.3858     5.1088     4.4077     2.7367     1.7930     1.4079     1.9769     1.9411     1.4050     1.1160     0.8252     0.9379
100           Bulgaria                 2.5264     2.2969     2.2171     2.1304     2.1573     2.1927     2.0149     1.9458     1.5527     1.2008     1.2404     1.5005
203           Czech Republic           2.7383     2.3765     2.2088     1.9573     2.2108     2.3588     1.9660     1.9008     1.6455     1.1670     1.1870     1.4286
643           Russian Federation       2.8500     2.8200     2.5500     2.0200     2.0300     1.9400     2.0400     2.1210     1.5450     1.2470     1.2980     1.4389
804           Ukraine                  2.8100     2.7000     2.1346     2.0204     2.0789     1.9798     2.0040     1.8968     1.6208     1.2404     1.1455     1.3828
428           Latvia                   2.0000     1.9500     1.8500     1.8100     2.0000     1.8745     2.0293     2.1309     1.6322     1.1722     1.2856     1.4926
380           Italy                    2.3550     2.2900     2.5040     2.4989     2.3227     1.8856     1.5245     1.3497     1.2715     1.2239     1.2974     1.4169
705           Slovenia                 2.6800     2.3833     2.3354     2.2650     2.1999     2.1632     1.9280     1.6517     1.3335     1.2483     1.2114     1.3841
724           Spain                    2.5300     2.7000     2.8100     2.8400     2.8500     2.5500     1.8800     1.4600     1.2800     1.1900     1.2900     1.3904
In [14]:
df[df['2000-2005']<1.25]
Out[14]:

Country code  Description             1950-1955  1955-1960  1960-1965  1965-1970  1970-1975  1975-1980  1980-1985  1985-1990  1990-1995  1995-2000  2000-2005  2005-2010
344           China, Hong Kong SAR       4.4400     4.7200     5.3100     3.6450     3.2900     2.3100     1.7150     1.3550     1.2400     0.8700     0.9585     1.0257
446           China, Macao SAR           4.3858     5.1088     4.4077     2.7367     1.7930     1.4079     1.9769     1.9411     1.4050     1.1160     0.8252     0.9379
410           Republic of Korea          5.0500     6.3320     5.6300     4.7080     4.2810     2.9190     2.2340     1.6010     1.6960     1.5140     1.2190     1.2284
100           Bulgaria                   2.5264     2.2969     2.2171     2.1304     2.1573     2.1927     2.0149     1.9458     1.5527     1.2008     1.2404     1.5005
203           Czech Republic             2.7383     2.3765     2.2088     1.9573     2.2108     2.3588     1.9660     1.9008     1.6455     1.1670     1.1870     1.4286
498           Republic of Moldova        3.5000     3.4400     3.1500     2.6600     2.5600     2.4400     2.5500     2.6400     2.1110     1.7000     1.2378     1.2704
703           Slovakia                   3.5022     3.2427     2.9110     2.5410     2.5067     2.4640     2.2710     2.1537     1.8667     1.4010     1.2205     1.3100
804           Ukraine                    2.8100     2.7000     2.1346     2.0204     2.0789     1.9798     2.0040     1.8968     1.6208     1.2404     1.1455     1.3828
70            Bosnia and Herzegovina     4.7700     3.9086     3.6830     3.1372     2.7332     2.1900     2.1200     1.9100     1.6500     1.6261     1.2155     1.2845
705           Slovenia                   2.6800     2.3833     2.3354     2.2650     2.1999     2.1632     1.9280     1.6517     1.3335     1.2483     1.2114     1.3841

The other thing that I really need to address is the labeling. Clearly we need a way to move labels up and down to keep them readable. Collision detection, basically. I'm surprised this functionality doesn't exist, because I keep bumping into this problem. Usually I can tweak the y position by a few pixels, but in this specific case there is no way to do that.
So, I guess I have a project for 2016...
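As a rough idea of what such a helper could look like, here is a minimal greedy sketch. It is an assumption for illustration, not an existing matplotlib feature and not the project alluded to above: `spread_labels` is a hypothetical name, and the pass only ever pushes labels upward, so it can drift away from the data when many labels crowd together.

```python
def spread_labels(positions, min_gap):
    """Nudge label y-positions upward so neighbours are at least min_gap apart.

    positions maps label -> desired y. Returns a new dict label -> adjusted y.
    """
    adjusted = {}
    previous = None
    # Walk labels from the lowest to the highest desired position.
    for label, y in sorted(positions.items(), key=lambda item: item[1]):
        if previous is not None and y - previous < min_gap:
            y = previous + min_gap  # too close: push above the previous label
        adjusted[label] = y
        previous = y
    return adjusted


# 2005-2010 values for two highlighted countries, taken from the tables above.
labels = {"Spain": 1.3904, "Italy": 1.4169}
print(spread_labels(labels, min_gap=0.15))
```

The adjusted values could then replace the raw y coordinate passed to `ax.text` in the labeling loop, with `min_gap` chosen in data units to roughly match the text height.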
 

The original jupyter notebook can be downloaded here:
 
Francois Dion
@f_dion
 
