Channel: Planet Python

Codementor: Data Science with Python & R: Exploratory Data Analysis


Introduction and Getting Data refresh

Here we are again, with a new episode in our series about doing data science with the two most popular open-source platforms you can use for the job nowadays. In this case we will have a look at a crucial step of the data analytics process, that of the Exploratory Data Analysis.

Exploratory data analysis takes place after gathering and cleaning data, and before any modeling and visualisation/presentation of results. However, it is part of an iterative process. After doing some EDA we can try to build some models or present some visualisations. At the same time, based on the results of the latter we can perform some more EDA, and so on. It is all about quickly finding clues and not so much about details or aesthetics. Among the main purposes of this type of analysis are, of course, getting to know our data, its tendencies and its quality, and also checking or even starting to formulate our hypotheses.

And with that idea in mind we will explain how to use descriptive statistics and basic plotting, together with data frames, in order to answer some questions and guide our further data analysis.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Getting data

We will continue using the same datasets we already loaded in the part introducing data frames. So you can either continue where you left off in that tutorial, or re-run the section that gets and prepares the data.

Questions we want to answer

In any data analysis process, there is one or more questions we want to answer. That is the most basic and important step in the whole process, to define these questions. Since we are going to perform some Exploratory Data Analysis in our TB dataset, these are the questions we want to answer:

  • Which are the countries with the highest prevalence and incidence of infectious TB?
  • What is the general world tendency in the period from 1990 to 2007?
  • What countries don’t follow that tendency?
  • What other facts about the disease do we know that we can check with our data?

Descriptive Statistics

Python

The basic data descriptive statistics method for a pandas.DataFrame is describe(). It is the equivalent to R data.frame function summary().

df_summary = existing_df.describe()
df_summary
country  Afghanistan     Albania     Algeria  American Samoa     Andorra
count      18.000000   18.000000   18.000000       18.000000   18.000000
mean      353.333333   36.944444   47.388889       12.277778   25.277778
std        64.708396    6.915220    4.487091        9.886447    7.274497
min       238.000000   22.000000   42.000000        0.000000   17.000000
25%       305.000000   32.000000   44.000000        6.000000   19.250000
50%       373.500000   40.500000   45.500000        9.000000   22.500000
75%       404.500000   42.000000   50.750000       16.250000   31.500000
max       436.000000   44.000000   56.000000       42.000000   39.000000

country      Angola    Anguilla  Antigua and Barbuda   Argentina     Armenia
count     18.000000   18.000000            18.000000   18.000000   18.000000
mean     413.444444   35.611111            10.833333   61.222222   74.944444
std       97.751318    1.243283             2.812786   20.232634   16.129885
min      281.000000   34.000000             7.000000   35.000000   49.000000
25%      321.250000   35.000000             9.000000   41.250000   62.000000
50%      399.000000   35.000000            10.000000   60.500000   77.000000
75%      512.000000   36.000000            12.750000   77.000000   85.750000
max      530.000000   38.000000            16.000000   96.000000   99.000000

...

country     Uruguay  Uzbekistan     Vanuatu   Venezuela    Viet Nam
count     18.000000   18.000000   18.000000   18.000000   18.000000
mean      28.055556  128.888889  186.000000   40.888889  282.666667
std        3.717561   15.911109   62.027508    2.422660   57.322616
min       23.000000  102.000000  102.000000   38.000000  220.000000
25%       25.000000  116.500000  128.750000   39.000000  234.250000
50%       27.500000  131.500000  185.000000   41.000000  257.000000
75%       30.750000  143.000000  240.000000   42.000000  349.000000
max       35.000000  152.000000  278.000000   46.000000  365.000000

country  Wallis et Futuna  West Bank and Gaza       Yemen      Zambia    Zimbabwe
count           18.000000           18.000000   18.000000   18.000000   18.000000
mean           126.222222           43.388889  194.333333  535.277778  512.833333
std             86.784083            8.332353   52.158131   91.975576  113.411925
min             13.000000           31.000000  130.000000  387.000000  392.000000
25%             63.250000           36.250000  146.750000  459.000000  420.750000
50%            106.000000           43.000000  184.500000  521.500000  466.000000
75%            165.750000           51.500000  248.500000  620.000000  616.750000
max            352.000000           55.000000  265.000000  680.000000  714.000000

8 rows × 207 columns

There is a lot of information there. We can access individual summaries as follows.

df_summary[['Spain','United Kingdom']]
country      Spain  United Kingdom
count    18.000000       18.000000
mean     30.666667        9.611111
std       6.677442        0.916444
min      23.000000        9.000000
25%      25.250000        9.000000
50%      29.000000        9.000000
75%      34.750000       10.000000
max      44.000000       12.000000

There is a plethora of descriptive statistics methods in Pandas (check the documentation). Some of them are already included in our summary object, but there are many more. In following tutorials we will make good use of them in order to better understand our data.

For example, we can obtain the percentage change over the years for the number of tuberculosis cases in Spain.

tb_pct_change_spain = existing_df.Spain.pct_change()
tb_pct_change_spain
year
    1990         NaN
    1991   -0.045455
    1992   -0.047619
    1993   -0.075000
    1994   -0.054054
    1995   -0.028571
    1996   -0.029412
    1997   -0.090909
    1998    0.000000
    1999   -0.066667
    2000   -0.035714
    2001   -0.037037
    2002    0.000000
    2003   -0.038462
    2004   -0.040000
    2005    0.000000
    2006    0.000000
    2007   -0.041667
    Name: Spain, dtype: float64

And from there get the maximum value.

tb_pct_change_spain.max()
0.0

And do the same for the United Kingdom.

existing_df['United Kingdom'].pct_change().max()
0.11111111111111116

If we want to know the index value (year) we use idxmax (called argmax in older versions of Pandas; in current versions argmax returns the integer position rather than the label) as follows.

existing_df['Spain'].pct_change().idxmax()
'1998'
existing_df['United Kingdom'].pct_change().idxmax()
'1992'

That is, 1998 and 1992 were the worst years in Spain and the UK respectively regarding the increase of infectious TB cases.
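As a small illustration of how pct_change() and idxmax() fit together, here is a sketch on a made-up series. The values and the name toy_cases are assumptions for illustration only; they are not taken from the TB dataset.

```python
import pandas as pd

# Invented yearly values, indexed by year strings like the tutorial's data
toy_cases = pd.Series([44, 42, 40, 40, 37],
                      index=['1990', '1991', '1992', '1993', '1994'])

# pct_change() computes (current - previous) / previous; the first entry is NaN
changes = toy_cases.pct_change()

# max() skips NaN by default; idxmax() returns the label of the first maximum
largest_change = changes.max()
year_of_largest_change = changes.idxmax()
print(largest_change, year_of_largest_change)
```

When every year shows a decrease or no change, as in this toy series, the "largest increase" is simply the year with no change.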

R

The basic descriptive statistics method in R is, as we said, the function summary().

existing_summary <- summary(existing_df)
str(existing_summary)
##  'table' chr [1:6, 1:207] "Min.   :238.0  " "1st Qu.:305.0  " ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "" "" "" "" ...
##   ..$ : chr [1:207] " Afghanistan" "   Albania" "   Algeria" "American Samoa" ...

It returns a table object where we have summary statistics for each of the columns in a data frame. A table object is good for visualising data, but not so good for accessing and indexing it as a data frame. Basically we use positional indexing to access it as a matrix. This way, if we want the first column, that corresponding to Afghanistan, we do:

existing_summary[,1]
##                                                                         
## "Min.   :238.0  " "1st Qu.:305.0  " "Median :373.5  " "Mean   :353.3  " 
##                                     
## "3rd Qu.:404.5  " "Max.   :436.0  "

A trick we can use to access by column name is use the column names in the original data frame together with which(). We also can build a new data frame with the results.

data.frame(
    Spain=existing_summary[,which(colnames(existing_df)=='Spain')],
    UK=existing_summary[,which(colnames(existing_df)=='United Kingdom')])
##             Spain               UK
## 1 Min.   :23.00   Min.   : 9.000  
## 2 1st Qu.:25.25   1st Qu.: 9.000  
## 3 Median :29.00   Median : 9.000  
## 4 Mean   :30.67   Mean   : 9.611  
## 5 3rd Qu.:34.75   3rd Qu.:10.000  
## 6 Max.   :44.00   Max.   :12.000

Since R is a functional language, we can apply functions such as sum, mean, sd, etc. to vectors. Remember that a data frame is a list of vectors (i.e. each column is a vector of values), so we can easily use these functions with columns. We can finally combine these functions with lapply or sapply and apply them to multiple columns in a data frame.

However, there is a family of functions in R that can be applied to columns or rows in order to get means and sums directly. These are more efficient than using apply functions, and also allow us to work not just by column but also by row. If you type ?colSums, for example, the help page describes all of them.

Let’s say we want to obtain the average number of existing cases per year. We need a single function call.

rowMeans(existing_df)
##    X1990    X1991    X1992    X1993    X1994    X1995    X1996    X1997 
## 196.9662 196.4686 192.8116 191.1739 188.7246 187.9420 178.8986 180.9758 
##    X1998    X1999    X2000    X2001    X2002    X2003    X2004    X2005 
## 178.1208 180.4734 177.5217 177.7971 179.5169 176.4058 173.9227 171.1836 
##    X2006    X2007 
## 169.0193 167.2560
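For comparison with the Python side, a rough Pandas counterpart of rowMeans() and colMeans() is mean() with the axis argument. The toy data frame below is invented for illustration; with the tutorial's data the call would presumably be existing_df.mean(axis=1).

```python
import pandas as pd

# Invented values: rows are years, columns are (placeholder) countries
toy_df = pd.DataFrame(
    {'A': [10, 20, 30], 'B': [30, 40, 50]},
    index=['1990', '1991', '1992'])

mean_per_year = toy_df.mean(axis=1)     # one value per row, like rowMeans()
mean_per_country = toy_df.mean(axis=0)  # one value per column, like colMeans()
print(mean_per_year)
print(mean_per_country)
```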

Plotting

In this section we will take a look at the basic plotting functionality in Python/Pandas and R. However, there are more powerful alternatives like ggplot2, which was originally created for R but also has a Python implementation from the Yhat guys.

Python

Pandas DataFrames implement up to three plotting methods out of the box (check the documentation). The first one is a basic line plot for each of the series we include in the indexing. The %matplotlib inline line below might be needed when plotting inside an IPython notebook.

%matplotlib inline

existing_df[['United Kingdom', 'Spain', 'Colombia']].plot()


Or we can use box plots to obtain a summarised view of a given series as follows.

existing_df[['United Kingdom', 'Spain', 'Colombia']].boxplot()


There is also a hist() method, but we can’t use it with this type of data right now.

R

Base plotting in R is not very sophisticated when compared with ggplot2, but still is powerful and handy because many data types have implemented custom plot() methods that allow us to plot them with a single method call. However this is not always the case, and more often than not we will need to pass the right set of elements to our basic plotting functions.

Let’s start with a basic line chart like we did with Python/Pandas.

uk_series <- existing_df[,c("United Kingdom")]
spain_series <- existing_df[,c("Spain")]
colombia_series <- existing_df[,c("Colombia")]
xrange <- 1990:2007
plot(xrange, uk_series, 
     type='l', xlab="Year", 
     ylab="Existing cases per 100K", 
     col = "blue", 
     ylim=c(0,100))
lines(xrange, spain_series, 
      col = "darkgreen")
lines(xrange, colombia_series,
      col = "red")
legend(x=2003, y=100, 
       lty=1, 
       col=c("blue","darkgreen","red"), 
       legend=c("UK","Spain","Colombia"))


You can compare how easy it was to plot three series in Pandas, and how doing the same thing with basic plotting in R gets more verbose. We need at least three function calls, those for plot and lines, and then we have the legend, etc. The base plotting in R is really intended to make quick and dirty charts.

Let’s now use box plots.

boxplot(uk_series, spain_series, colombia_series, 
        names=c("UK","Spain","Colombia"),
        xlab="Country", 
        ylab="Existing cases per 100K")


This one was way shorter, and we don’t even need colours or a legend.

Answering Questions

Let’s now start with the real fun. Once we know our tools (from the previous tutorial about data frames and this one), let’s use them to answer some questions about the incidence and prevalence of infectious tuberculosis in the world.

Question: We want to know, per year, what country has the highest number of existing and new TB cases.

Python

If we want just the top ones we can make use of apply and idxmax (called argmax in older versions of Pandas). Remember that, by default, apply works with columns (the countries in our case), and we want to apply it to each year. Therefore we need to transpose the data frame before using it, or we can pass the argument axis=1.

existing_df.apply(pd.Series.idxmax, axis=1)
year
    1990            Djibouti
    1991            Djibouti
    1992            Djibouti
    1993            Djibouti
    1994            Djibouti
    1995            Djibouti
    1996            Kiribati
    1997            Kiribati
    1998            Cambodia
    1999    Korea, Dem. Rep.
    2000            Djibouti
    2001           Swaziland
    2002            Djibouti
    2003            Djibouti
    2004            Djibouti
    2005            Djibouti
    2006            Djibouti
    2007            Djibouti
    dtype: object
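In recent Pandas versions the same per-year lookup can also be written without apply, since DataFrame.idxmax accepts an axis argument. A minimal sketch on invented numbers (the column labels A and B are placeholders, not real countries):

```python
import pandas as pd

# Invented values: rows are years, columns are placeholder countries
toy_df = pd.DataFrame(
    {'A': [5, 9, 2], 'B': [7, 3, 8]},
    index=['1990', '1991', '1992'])

# For each year (row), the label of the column holding the largest value
top_per_year = toy_df.idxmax(axis=1)

# The apply-based form used in the tutorial gives the same answer
top_per_year_via_apply = toy_df.apply(pd.Series.idxmax, axis=1)
print(top_per_year)
```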

But this is too simplistic. Instead, we want to get those countries that are in the fourth quartile. But first we need to find out the world general tendency.

In order to explore the world’s general trend, we need to sum up every country’s values for the three datasets, per year.

deaths_total_per_year_df = deaths_df.sum(axis=1)
existing_total_per_year_df = existing_df.sum(axis=1)
new_total_per_year_df = new_df.sum(axis=1)

Now we will create a new DataFrame with each sum in a series that we will plot using the data frame plot() method.

world_trends_df = pd.DataFrame({
           'Total deaths per 100K' : deaths_total_per_year_df, 
           'Total existing cases per 100K' : existing_total_per_year_df, 
           'Total new cases per 100K' : new_total_per_year_df}, 
       index=deaths_total_per_year_df.index)
world_trends_df.plot(figsize=(12,6)).legend(
    loc='center left', 
    bbox_to_anchor=(1, 0.5))


It seems that the general tendency is for a decrease in the total number of existing cases per 100K. However, the number of new cases has been increasing, although it seems to revert from 2005. So how is it possible that the total number of existing cases is decreasing if the total number of new cases has been growing? One of the reasons could be the observed increase in the number of deaths per 100K, but the main reason we have to consider is that people recover from tuberculosis thanks to treatment. The sum of the recovery rate plus the death rate is greater than the new cases rate. In any case, it seems that there are more new cases, but also that we cure them better. We need to improve prevention and epidemics control.
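The argument above is a simple balance: next year's existing cases are this year's plus new cases, minus deaths and recoveries. The sketch below works through that arithmetic with invented numbers (none of the rates come from the datasets, and next_existing is a hypothetical helper) to show that existing cases can fall while new cases rise.

```python
def next_existing(existing, new_cases, deaths, recoveries):
    """One year of a simple stock-and-flow balance, all figures per 100K."""
    return existing + new_cases - deaths - recoveries

existing = 200.0
history = []
for new_cases in (100.0, 105.0, 110.0):  # new cases rise every year
    # Deaths plus recoveries (115) exceed new cases in each of these years
    existing = next_existing(existing, new_cases, deaths=20.0, recoveries=95.0)
    history.append(existing)
print(history)
```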

Countries out of tendency

So the previous was the general trend of the world as a whole. So which countries follow a different trend (for the worse)? In order to find this out, first we need to know the distribution of deaths by countries in an average year.

deaths_by_country_mean = deaths_df.mean()
deaths_by_country_mean_summary = deaths_by_country_mean.describe()
existing_by_country_mean = existing_df.mean()
existing_by_country_mean_summary = existing_by_country_mean.describe()
new_by_country_mean = new_df.mean()
new_by_country_mean_summary = new_by_country_mean.describe()

We can plot these distributions to have an idea of how the countries are distributed in an average year.

deaths_by_country_mean.sort_values().plot(kind='bar', figsize=(24,6))


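We want those countries beyond 1.5 times the median (the 50% entry of each summary). Note that this is a rough threshold of our own rather than the usual outlier rule based on the interquartile range. We have these values in: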

deaths_outlier = deaths_by_country_mean_summary['50%']*1.5
existing_outlier = existing_by_country_mean_summary['50%']*1.5
new_outlier = new_by_country_mean_summary['50%']*1.5

Now we can use these values to get those countries that, across the period 1990-2007, have exceeded those levels.

# Now compare with the outlier threshold
outlier_countries_by_deaths_index = (
    deaths_by_country_mean > deaths_outlier)
outlier_countries_by_existing_index = (
    existing_by_country_mean > existing_outlier)
outlier_countries_by_new_index = (
    new_by_country_mean > new_outlier)
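To see how a median-multiple threshold differs from the more conventional rule based on the interquartile range (third quartile plus 1.5 times the IQR), here is a sketch on invented per-country means. The series toy_means and its labels are assumptions for illustration only.

```python
import pandas as pd

# Invented per-country means; the last value plays the role of an extreme country
toy_means = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100],
                      index=list('abcdefghi'), dtype=float)
summary = toy_means.describe()

# Threshold in the style of this tutorial: a multiple of the median
median_threshold = summary['50%'] * 1.5

# Conventional rule: third quartile plus 1.5 times the interquartile range
iqr = summary['75%'] - summary['25%']
iqr_threshold = summary['75%'] + 1.5 * iqr

above_median_rule = toy_means[toy_means > median_threshold]
above_iqr_rule = toy_means[toy_means > iqr_threshold]
print(median_threshold, iqr_threshold)
print(list(above_median_rule.index), list(above_iqr_rule.index))
```

On skewed data the median-based threshold tends to flag many more entries, which is worth keeping in mind when reading the proportions that follow.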

What proportion of countries do we have out of trend? For deaths:

num_countries = len(deaths_df.T)
sum(outlier_countries_by_deaths_index)/num_countries
0.39613526570048307

For existing cases (prevalence):

sum(outlier_countries_by_existing_index)/num_countries
0.39613526570048307

For new cases (incidence):

sum(outlier_countries_by_new_index)/num_countries
0.38647342995169082

Now we can use these indices to filter our original data frames.

outlier_deaths_df = deaths_df.T[ outlier_countries_by_deaths_index ].T
outlier_existing_df = existing_df.T[ outlier_countries_by_existing_index ].T
outlier_new_df = new_df.T[ outlier_countries_by_new_index ].T
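The double transpose above selects columns with a boolean index. An equivalent and arguably more direct spelling is .loc with the mask in the column position. The sketch below uses invented data and placeholder column names to show both forms side by side.

```python
import pandas as pd

# Invented values: rows are years, columns are placeholder countries
toy_df = pd.DataFrame(
    {'A': [1, 2], 'B': [10, 20], 'C': [100, 200]},
    index=['1990', '1991'])

# Boolean Series indexed by column name: True where the column mean exceeds 5
mask = toy_df.mean() > 5

via_transpose = toy_df.T[mask].T   # the form used in the tutorial
via_loc = toy_df.loc[:, mask]      # same selection without transposing

# The proportion of selected columns is just the mean of the mask
proportion = mask.mean()
print(list(via_loc.columns), proportion)
```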

This is serious stuff. We have more than one third of the world being outliers on the distribution of existing cases, new cases, and deaths by infectious tuberculosis. But what if we consider an outlier to be 5 times the median? Let’s repeat the previous process.

deaths_super_outlier = deaths_by_country_mean_summary['50%']*5
existing_super_outlier = existing_by_country_mean_summary['50%']*5
new_super_outlier = new_by_country_mean_summary['50%']*5

super_outlier_countries_by_deaths_index = (
    deaths_by_country_mean > deaths_super_outlier)
super_outlier_countries_by_existing_index = (
    existing_by_country_mean > existing_super_outlier)
super_outlier_countries_by_new_index = (
    new_by_country_mean > new_super_outlier)

What proportion do we have now?

sum(super_outlier_countries_by_deaths_index)/num_countries
0.21739130434782608

Let’s get the data frames.

super_outlier_deaths_df = (
    deaths_df.T[ super_outlier_countries_by_deaths_index ].T)
super_outlier_existing_df = (
    existing_df.T[ super_outlier_countries_by_existing_index ].T)
super_outlier_new_df = (
    new_df.T[ super_outlier_countries_by_new_index ].T)

Let’s concentrate on epidemics control and have a look at the new cases’ data frame.

super_outlier_new_df
year  Bhutan  Botswana  Cambodia  Congo, Rep.  Cote d’Ivoire  Korea, Dem. Rep.  Djibouti  Kiribati  Lesotho  Malawi
1990     540       307       585          169            177               344       582       513      184     258
1991     516       341       579          188            196               344       594       503      201     286
1992     492       364       574          200            209               344       606       493      218     314
1993     470       390       568          215            224               344       618       483      244     343
1994     449       415       563          229            239               344       630       474      280     373
1995     428       444       557          245            255               344       642       464      323     390
1996     409       468       552          258            269               344       655       455      362     389
1997     391       503       546          277            289               344       668       446      409     401
1998     373       542       541          299            312               344       681       437      461     412
1999     356       588       536          324            338               344       695       428      519     417
2000     340       640       530          353            368               344       708       420      553     425
2001     325       692       525          382            398               344       722       412      576     414
2002     310       740       520          408            425               344       737       403      613     416
2003     296       772       515          425            444               344       751       396      635     410
2004     283       780       510          430            448               344       766       388      643     405
2005     270       770       505          425            443               344       781       380      639     391
2006     258       751       500          414            432               344       797       372      638     368
2007     246       731       495          403            420               344       813       365      637     346

...

year  Philippines  Rwanda  Sierra Leone  South Africa  Swaziland  Timor-Leste  Togo  Uganda  Zambia  Zimbabwe
1990          393     167           207           301        267          322   308     163     297       329
1991          386     185           220           301        266          322   314     250     349       364
1992          380     197           233           302        260          322   320     272     411       389
1993          373     212           248           305        267          322   326     296     460       417
1994          366     225           263           309        293          322   333     306     501       444
1995          360     241           279           317        337          322   339     319     536       474
1996          353     254           297           332        398          322   346     314     554       501
1997          347     273           315           360        474          322   353     320     576       538
1998          341     294           334           406        558          322   360     326     583       580
1999          335     319           355           479        691          322   367     324     603       628
2000          329     348           377           576        801          322   374     340     602       685
2001          323     376           400           683        916          322   382     360     627       740
2002          317     402           425           780        994          322   389     386     632       791
2003          312     419           451           852       1075          322   397     396     652       825
2004          306     423           479           898       1127          322   405     385     623       834
2005          301     418           509           925       1141          322   413     370     588       824
2006          295     408           540           940       1169          322   421     350     547       803
2007          290     397           574           948       1198          322   429     330     506       782

18 rows × 22 columns

Let’s make some plots to get a better impression.

super_outlier_new_df.plot(figsize=(12,4)).legend(loc='center left', bbox_to_anchor=(1, 0.5))


We have 22 countries where the number of new cases during an average year is greater than 5 times the median value of the distribution. Let’s create a country that represents the average of these 22 countries.

average_super_outlier_country = super_outlier_new_df.mean(axis=1)
average_super_outlier_country
year
    1990    314.363636
    1991    330.136364
    1992    340.681818
    1993    352.909091
    1994    365.363636
    1995    379.227273
    1996    390.863636
    1997    408.000000
    1998    427.000000
    1999    451.409091
    2000    476.545455
    2001    502.409091
    2002    525.727273
    2003    543.318182
    2004    548.909091
    2005    546.409091
    2006    540.863636
    2007    535.181818
    dtype: float64

Now let’s create a country that represents the rest of the world.

avearge_better_world_country = (
    new_df.T[ ~super_outlier_countries_by_new_index ].T.mean(axis=1))
avearge_better_world_country
year
    1990    80.751351
    1991    81.216216
    1992    80.681081
    1993    81.470270
    1994    81.832432
    1995    82.681081
    1996    82.589189
    1997    84.497297
    1998    85.189189
    1999    86.232432
    2000    86.378378
    2001    86.551351
    2002    89.848649
    2003    87.778378
    2004    87.978378
    2005    87.086022
    2006    86.559140
    2007    85.605405
    dtype: float64
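Splitting the columns into an outlier group and "the rest" comes down to inverting a boolean mask, which in Pandas is done with the ~ operator. A minimal sketch with invented values and placeholder column names:

```python
import pandas as pd

# Invented values: rows are years, columns are placeholder countries
toy_df = pd.DataFrame(
    {'A': [1, 3], 'B': [3, 5], 'C': [100, 200]},
    index=['1990', '1991'])

# True for columns whose mean exceeds an arbitrary illustrative threshold
is_outlier = toy_df.mean() > 50

# Average "country" of each group, per year; ~ flips True and False
outlier_average = toy_df.loc[:, is_outlier].mean(axis=1)
rest_average = toy_df.loc[:, ~is_outlier].mean(axis=1)
print(outlier_average.tolist(), rest_average.tolist())
```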

Now let’s plot this country with the average world country.

two_world_df = pd.DataFrame({
        'Average Better World Country': avearge_better_world_country,
        'Average Outlier Country' : average_super_outlier_country},
    index = new_df.index)
two_world_df.plot(title="Estimated new TB cases per 100K",figsize=(12,8))


The increase in new cases’ tendency is really stronger in the average super outlier country, so much stronger that it is difficult to perceive that same tendency in the better world country. The 90’s decade brought a terrible increase in the number of TB cases in those countries. But let’s have a look at the exact numbers.

two_world_df.pct_change().plot(title="Percentage change in estimated new TB cases", figsize=(12,8))


Based on this plot, the deceleration and reversion of that tendency seem to happen at the same time in both average better and outlier countries, and something happened around 2002. We will try to find out what’s going on in the next section.

R

We already know that we can use max with a data frame column in R and get the maximum value. Additionally, we can use which.max in order to get its position (similarly to the use of argmax in Pandas). If we use the transposed data frame, we can use lapply or sapply to perform this operation on every year column, getting then either a list or a vector of indices (we will use sapply, which returns a vector). We just need a little tweak: we index a vector of country names to get the country name instead of the position as a result.

country_names <- rownames(existing_df_t)
sapply(existing_df_t, function(x) {country_names[which.max(x)]})
##              X1990              X1991              X1992 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X1993              X1994              X1995 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X1996              X1997              X1998 
##         "Kiribati"         "Kiribati"         "Cambodia" 
##              X1999              X2000              X2001 
## "Korea, Dem. Rep."         "Djibouti"        "Swaziland" 
##              X2002              X2003              X2004 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X2005              X2006              X2007 
##         "Djibouti"         "Djibouti"         "Djibouti"

Again, in order to explore the world general tendency, we need to sum up every country’s values for the three datasets, per year.

But first we need to load the other two datasets for number of deaths and number of new cases.

# Download files
deaths_file <- getURL("https://docs.google.com/spreadsheets/d/12uWVH_IlmzJX_75bJ3IH5E-Gqx6-zfbDKNvZqYjUuso/pub?gid=0&output=CSV")
new_cases_file <- getURL("https://docs.google.com/spreadsheets/d/1Pl51PcEGlO9Hp4Uh0x2_QM0xVb53p2UDBMPwcnSjFTk/pub?gid=0&output=csv")


# Read into data frames
deaths_df <- read.csv(
    text = deaths_file, 
    row.names=1, 
    stringsAsFactors=FALSE)
new_df <- read.csv(
    text = new_cases_file, 
    row.names=1, 
    stringsAsFactors=FALSE)


# Cast data to int (deaths doesn't need it)
new_df[1:18] <- lapply(
    new_df[1:18], 
    function(x) { as.integer(gsub(',', '', x) )})


# Transpose
deaths_df_t <- deaths_df
deaths_df <- as.data.frame(t(deaths_df))
new_df_t <- new_df
new_df <- as.data.frame(t(new_df))

And now the sums by row. We need to convert this to a data frame since the function returns a numeric vector.

deaths_total_per_year_df <- data.frame(total=rowSums(deaths_df))
existing_total_per_year_df <- data.frame(total=rowSums(existing_df))

# We pass na.rm = TRUE in order to ignore missing values in the new
# cases data frame when summing (no missing values in other dataframes though)
new_total_per_year_df <- data.frame(total=rowSums(new_df, na.rm = TRUE))

Now we can plot each line using what we have learnt so far. In order to get a vector with the counts to pass to each plotting function, we use R data frame indexing by column name.

xrange <- 1990:2007
plot(xrange, deaths_total_per_year_df$total, 
     type='l', xlab="Year", 
     ylab="Count per 100K", 
     col = "blue", 
     ylim=c(0,50000))
lines(xrange, existing_total_per_year_df$total,
      col = "darkgreen")
lines(xrange, new_total_per_year_df$total, 
      col = "red")
legend(x=1992, y=52000, 
       lty=1, 
       cex = .7,
       ncol = 3,
       col=c("blue","darkgreen","red"), 
       legend=c("Deaths","Existing cases","New cases"))


The conclusions are obviously the same as when using Python.

Countries out of tendency

So what countries are outliers of the trend (for the worse)? Again, in order to find this out, first we need to know the distribution of countries in an average year. We use colMeans for that purpose.

deaths_by_country_mean <- data.frame(mean=colMeans(deaths_df))
existing_by_country_mean <- data.frame(mean=colMeans(existing_df))
new_by_country_mean <- data.frame(mean=colMeans(new_df, na.rm=TRUE))

We can plot these distributions to have an idea of how the countries are distributed in an average year. We are not so interested about the individual countries but about the distribution itself.

barplot(sort(deaths_by_country_mean$mean))


Again we can see there are three sections in the plot, with a slowly rising part at the beginning, a second steeper section, and a final peak that is clearly apart from the rest.

Let’s skip this time the 1.5-outlier part and go directly to the 5.0-outliers. In R we will use a different approach. We will use the quantile() function in order to get the median and determine the outlier threshold.

Since we already know the results from our Python section, let’s do it just for the new cases, so we generate also the plots we did before.

new_super_outlier <- 
    quantile(new_by_country_mean$mean, probs = c(.5)) * 5.0
super_outlier_countries_by_new_index <- 
    new_by_country_mean > new_super_outlier

And the proportion is.

sum(super_outlier_countries_by_new_index)/208
## [1] 0.1057692

Let’s obtain a data frame from this, with just those countries we consider to be outliers.

super_outlier_new_df <- 
    new_df[, super_outlier_countries_by_new_index ]

Now we are ready to plot them.

xrange <- 1990:2007
plot(xrange, super_outlier_new_df[,1], 
     type='l', xlab="Year", 
     ylab="New cases per 100K", 
     col = 1, 
     ylim=c(0,1800))
for (i in 2:ncol(super_outlier_new_df)) {
    lines(xrange, super_outlier_new_df[,i],
    col = i)
}
legend(x=1990, y=1800, 
       lty=1, cex = 0.5,
       ncol = 7,
       col=1:22,
       legend=colnames(super_outlier_new_df))


Definitely we can see here an advantage of using Pandas basic plotting versus R basic plotting!

So far our results match. We have 22 countries where the number of new cases on an average year is greater than 5 times the median value of the distribution. Let’s create a country that represents on average these 22. We will use rowMeans() here.

average_countries_df <- 
    data.frame(
        averageOutlierMean=rowMeans(super_outlier_new_df, na.rm=T)
    )
average_countries_df
##       averageOutlierMean
## X1990           314.3636
## X1991           330.1364
## X1992           340.6818
## X1993           352.9091
## X1994           365.3636
## X1995           379.2273
## X1996           390.8636
## X1997           408.0000
## X1998           427.0000
## X1999           451.4091
## X2000           476.5455
## X2001           502.4091
## X2002           525.7273
## X2003           543.3182
## X2004           548.9091
## X2005           546.4091
## X2006           540.8636
## X2007           535.1818

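Now let’s create a country that represents the rest of the world. We invert the logical index with !, since in R a minus sign in front of a logical index does not negate it (it would only drop the first column). Note that the figures printed below were obtained with the minus sign, so they are close to the average over all countries rather than over the non-outlier ones.

average_countries_df$averageBetterWorldMean <- 
    rowMeans(new_df[ , !super_outlier_countries_by_new_index ], na.rm=T)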
average_countries_df
##       averageOutlierMean averageBetterWorldMean
## X1990           314.3636               105.2767
## X1991           330.1364               107.3786
## X1992           340.6818               108.0243
## X1993           352.9091               110.0388
## X1994           365.3636               111.6942
## X1995           379.2273               113.9369
## X1996           390.8636               115.0971
## X1997           408.0000               118.6408
## X1998           427.0000               121.2913
## X1999           451.4091               124.8350
## X2000           476.5455               127.6505
## X2001           502.4091               130.5680
## X2002           525.7273               136.0194
## X2003           543.3182               136.0388
## X2004           548.9091               136.8155
## X2005           546.4091               135.5121
## X2006           540.8636               134.4493
## X2007           535.1818               133.2184

Now let’s plot the outlier country with the average world country.

xrange <- 1990:2007
plot(xrange, average_countries_df$averageOutlierMean, 
     type='l', xlab="Year", 
     ylab="New cases per 100K", 
     col = "darkgreen", 
     ylim=c(0,600))
lines(xrange, average_countries_df$averageBetterWorldMean, col = "blue")
legend(x=1990, y=600, 
       lty=1, cex = 0.7,
       ncol = 2,
       col=c("darkgreen","blue"),
       legend=c("Average outlier country", "Average World Country"))


Googling about events and dates in Tuberculosis

We will use just Python in this section. About googling, actually we just went straight to Wikipedia’s entry about the disease. In the epidemics sections we found the following:

  • The total number of tuberculosis cases has been decreasing since 2005, while new cases have decreased since 2002.
  • This is confirmed by our previous analysis.

  • China has achieved particularly dramatic progress, with about an 80% reduction in its TB mortality rate between 1990 and 2010. Let’s check it:
existing_df.China.plot(title="Estimated existing TB cases in China")


  • In 2007, the country with the highest estimated incidence rate of TB was Swaziland, with 1,200 cases per 100,000 people.
new_df.apply(pd.Series.idxmax, axis=1)['2007']
'Swaziland'

There are many more findings on Wikipedia that we can confirm with these or other datasets from Gapminder World. For example, TB and HIV are frequently associated, together with poverty levels. It would be interesting to join datasets and explore tendencies in each of them. We challenge the reader to give them a try and share with us their findings.

Other web pages to explore

Some interesting resources about tuberculosis apart from the Gapminder website:

  • Gates foundation:
  • http://www.gatesfoundation.org/What-We-Do/Global-Health/Tuberculosis
  • http://www.gatesfoundation.org/Media-Center/Press-Releases/2007/09/New-Grants-to-Fight-Tuberculosis-Epidemic

Conclusions

Exploratory data analysis is a key step in data analysis. It is during this stage that we start shaping any later work. It precedes any data visualisation or machine learning work, by showing us how good or bad our data and our hypotheses are.

Traditionally, R has been the weapon of choice for most EDA work, although the use of a more expressive plotting library such as ggplot2 is quite convenient. In fact, the base plotting functionality incorporated in Pandas makes the process cleaner and quicker when using Python. However, the questions we have answered here were very simple and didn’t include multiple variables and encodings. In such cases an advanced library like ggplot2 will shine. Apart from providing nicer charts, it will save us quite a lot of time due to its expressiveness and reusability.

But as simple as our analysis and charts are, we have been able to make the point about how serious the humanitarian crisis is regarding a disease like tuberculosis, especially when considering that the disease is relatively well controlled in more developed countries. We have seen how some coding skills and a good amount of curiosity allow us to create awareness of these and other world issues.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

