
Codementor: Data Science with Python & R: Data Frames II


We continue here our tutorial on data frames with Python and R. The first part introduced the concept of a data frame and explained how to create and index data frames in Python and R. This part concentrates on data selection and function mapping.

All the source code for the different parts of this series of tutorials and applications can be found on GitHub. Feel free to get involved and share your progress with us!

Data Selection

In this section we will show how to select data from data frames based on their values, by using logical expressions.

Python

With Pandas, we can use logical expressions to select just the data that satisfies certain conditions. So first, let’s see what happens when we use logical operators with data frames or series objects.

existing_df>10
country  Afghanistan  Albania  Algeria  American Samoa  Andorra  Angola  Anguilla  Antigua and Barbuda  Argentina  Armenia  ...  Uruguay  Uzbekistan  Vanuatu  Venezuela  Viet Nam  Wallis et Futuna  West Bank and Gaza  Yemen  Zambia  Zimbabwe
year
1990            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1991            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1992            True     True     True           False     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1993            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1994            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1995            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1996            True     True     True           False     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1997            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1998            True     True     True            True     True    True      True                 True       True     True  ...     True        True     True       True      True              True                True   True    True      True
1999            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2000            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2001            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2002            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2003            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2004            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2005            True     True     True            True     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2006            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
2007            True     True     True           False     True    True      True                False       True     True  ...     True        True     True       True      True              True                True   True    True      True
18 rows × 207 columns

The same happens when the comparison is applied to an individual series.

existing_df['United Kingdom'] > 10
year
    1990    False
    1991    False
    1992    False
    1993    False
    1994    False
    1995    False
    1996    False
    1997    False
    1998    False
    1999    False
    2000    False
    2001    False
    2002    False
    2003    False
    2004    False
    2005     True
    2006     True
    2007     True
    Name: United Kingdom, dtype: bool

The result of these expressions can be used as an indexing vector (with [] or .loc) as follows.

existing_df.Spain[existing_df['United Kingdom'] > 10]
year
    2005    24
    2006    24
    2007    23
    Name: Spain, dtype: int64
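
To see the same pattern in isolation, here is a minimal sketch on a tiny frame. The frame toy_df, its column names, and its values are made up for illustration and are not part of the TB data set.

```python
import pandas as pd

# Toy data (made-up numbers), indexed by year strings like existing_df
toy_df = pd.DataFrame(
    {"A": [8, 9, 11, 12], "B": [30, 28, 24, 23]},
    index=["2004", "2005", "2006", "2007"],
)

mask = toy_df["A"] > 10              # boolean Series aligned on the index
by_brackets = toy_df["B"][mask]      # plain [] indexing with the mask
by_loc = toy_df.loc[mask, "B"]       # label-based indexing with the same mask

print(mask.tolist())         # [False, False, True, True]
print(by_brackets.tolist())  # [24, 23]
```

Both forms keep only the rows where the mask is True, so by_brackets and by_loc hold the same values.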

An interesting case arises when we index the whole data frame with a boolean data frame, so that a given row has True for some columns and False for others. For example:

existing_df[ existing_df > 10 ]
country  Afghanistan  Albania  Algeria  American Samoa  Andorra  Angola  Anguilla  Antigua and Barbuda  Argentina  Armenia  ...  Uruguay  Uzbekistan  Vanuatu  Venezuela  Viet Nam  Wallis et Futuna  West Bank and Gaza  Yemen  Zambia  Zimbabwe
year
1990             436       42       45              42       39     514        38                   16         96       52  ...       35         114      278         46       365               126                  55    265     436       409
1991             429       40       44              14       37     514        38                   15         91       49  ...       34         105      268         45       361               352                  54    261     456       417
1992             422       41       44             NaN       35     513        37                   15         86       51  ...       33         102      259         44       358                64                  54    263     494       415
1993             415       42       43              18       33     512        37                   14         82       55  ...       32         118      250         43       354               174                  52    253     526       419
1994             407       42       43              17       32     510        36                   13         78       60  ...       31         116      242         42       350               172                  52    250     556       426
1995             397       43       42              22       30     508        35                   12         74       68  ...       30         119      234         42       346                93                  50    244     585       439
1996             397       42       43             NaN       28     512        35                   12         71       74  ...       28         111      226         41       312               123                  49    233     602       453
1997             387       44       44              25       23     363        36                   11         67       75  ...       27         122      218         41       273               213                  46    207     626       481
1998             374       43       45              12       24     414        36                   11         63       74  ...       28         129      211         40       261               107                  44    194     634       392
1999             373       42       46             NaN       22     384        36                  NaN         58       86  ...       28         134      159         39       253               105                  42    175     657       430
2000             346       40       48             NaN       20     530        35                  NaN         52       94  ...       27         139      143         39       248               103                  40    164     658       479
2001             326       34       49             NaN       20     335        35                  NaN         51       99  ...       25         148      128         41       243                13                  39    154     680       523
2002             304       32       50             NaN       21     307        35                  NaN         42       97  ...       27         144      149         41       235               275                  37    149     517       571
2003             308       32       51             NaN       18     281        35                  NaN         41       91  ...       25         152      128         39       234               147                  36    146     478       632
2004             283       29       52             NaN       19     318        35                  NaN         39       85  ...       23         149      118         38       226                63                  35    138     468       652
2005             267       29       53              11       18     331        34                  NaN         39       79  ...       24         144      131         38       227                57                  33    137     453       680
2006             251       26       55             NaN       17     302        34                  NaN         37       79  ...       25         134      104         38       222                60                  32    135     422       699
2007             238       22       56             NaN       19     294        34                  NaN         35       81  ...       23         140      102         39       220                25                  31    130     387       714
18 rows × 207 columns

Those cells where existing_df does not have more than 10 cases per 100K give False for indexing, and the resulting data frame has a NaN value in those cells. A way of solving that (if we need to) is the where() method which, apart from providing a more expressive way of reading data selection, accepts a second argument that we can use to impute the NaN values. For example, if we want 0 as the replacement value:

existing_df.where(existing_df > 10, 0)
country  Afghanistan  Albania  Algeria  American Samoa  Andorra  Angola  Anguilla  Antigua and Barbuda  Argentina  Armenia  ...  Uruguay  Uzbekistan  Vanuatu  Venezuela  Viet Nam  Wallis et Futuna  West Bank and Gaza  Yemen  Zambia  Zimbabwe
year
1990             436       42       45              42       39     514        38                   16         96       52  ...       35         114      278         46       365               126                  55    265     436       409
1991             429       40       44              14       37     514        38                   15         91       49  ...       34         105      268         45       361               352                  54    261     456       417
1992             422       41       44               0       35     513        37                   15         86       51  ...       33         102      259         44       358                64                  54    263     494       415
1993             415       42       43              18       33     512        37                   14         82       55  ...       32         118      250         43       354               174                  52    253     526       419
1994             407       42       43              17       32     510        36                   13         78       60  ...       31         116      242         42       350               172                  52    250     556       426
1995             397       43       42              22       30     508        35                   12         74       68  ...       30         119      234         42       346                93                  50    244     585       439
1996             397       42       43               0       28     512        35                   12         71       74  ...       28         111      226         41       312               123                  49    233     602       453
1997             387       44       44              25       23     363        36                   11         67       75  ...       27         122      218         41       273               213                  46    207     626       481
1998             374       43       45              12       24     414        36                   11         63       74  ...       28         129      211         40       261               107                  44    194     634       392
1999             373       42       46               0       22     384        36                    0         58       86  ...       28         134      159         39       253               105                  42    175     657       430
2000             346       40       48               0       20     530        35                    0         52       94  ...       27         139      143         39       248               103                  40    164     658       479
2001             326       34       49               0       20     335        35                    0         51       99  ...       25         148      128         41       243                13                  39    154     680       523
2002             304       32       50               0       21     307        35                    0         42       97  ...       27         144      149         41       235               275                  37    149     517       571
2003             308       32       51               0       18     281        35                    0         41       91  ...       25         152      128         39       234               147                  36    146     478       632
2004             283       29       52               0       19     318        35                    0         39       85  ...       23         149      118         38       226                63                  35    138     468       652
2005             267       29       53              11       18     331        34                    0         39       79  ...       24         144      131         38       227                57                  33    137     453       680
2006             251       26       55               0       17     302        34                    0         37       79  ...       25         134      104         38       222                60                  32    135     422       699
2007             238       22       56               0       19     294        34                    0         35       81  ...       23         140      102         39       220                25                  31    130     387       714
18 rows × 207 columns
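
The difference between the two forms is easier to see on a frame small enough to read at a glance. This is a minimal sketch; toy_df and its values are made up for illustration.

```python
import pandas as pd

# Toy data (made-up numbers)
toy_df = pd.DataFrame({"A": [5, 20], "B": [15, 3]}, index=["1990", "1991"])

masked = toy_df[toy_df > 10]            # cells failing the condition become NaN
filled = toy_df.where(toy_df > 10, 0)   # same selection, failing cells become 0

print(masked)
print(filled)
```

Here masked has one NaN per column (A in 1990 and B in 1991), while filled has a 0 in those same two cells.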

R

As we did with Pandas, let’s check the result of using a data.frame in a logical or boolean expression.

existing_df_gt10 <- existing_df>10
head(existing_df_gt10,2) # check just a couple of rows
##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla

## X1990        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE

## X1991        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE

##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan

## X1990                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE

## X1991                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE

##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin

## X1990    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE

## X1991    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE

##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil

## X1990   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE

## X1991   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE

##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso

## X1990                   TRUE              TRUE     TRUE         TRUE

## X1991                   TRUE              TRUE     TRUE         TRUE

##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands

## X1990    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE

## X1991    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE

##       Central African Republic Chad Chile China Colombia Comoros

## X1990                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE

## X1991                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE

##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus

## X1990        TRUE        FALSE       TRUE    TRUE TRUE   TRUE

## X1991        TRUE        FALSE       TRUE    TRUE TRUE   TRUE

##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.

## X1990           TRUE          TRUE             TRUE             TRUE

## X1991           TRUE          TRUE             TRUE             TRUE

##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt

## X1990    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE

## X1991    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE

##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland

## X1990        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE

## X1991        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE

##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece

## X1990   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE

## X1991   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE

##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras

## X1990   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE

## X1991   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE

##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy

## X1990    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE   TRUE  TRUE

## X1991    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE  FALSE FALSE

##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan

## X1990   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE

## X1991   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE

##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania

## X1990 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE

## X1991 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE

##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania

## X1990       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE

## X1991       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE

##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat

## X1990      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE

## X1991      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE

##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands

## X1990    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE        TRUE

## X1991    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE       FALSE

##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger

## X1990                 TRUE          TRUE       FALSE      TRUE  TRUE

## X1991                 TRUE          TRUE       FALSE      TRUE  TRUE

##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau

## X1990    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE

## X1991    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE

##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal

## X1990   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE

## X1991   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE

##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation

## X1990        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE

## X1991        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE

##       Rwanda Saint Kitts and Nevis Saint Lucia

## X1990   TRUE                  TRUE        TRUE

## X1991   TRUE                  TRUE        TRUE

##       Saint Vincent and the Grenadines Samoa San Marino

## X1990                             TRUE  TRUE      FALSE

## X1991                             TRUE  TRUE      FALSE

##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone

## X1990                  TRUE         TRUE    TRUE       TRUE         TRUE

## X1991                  TRUE         TRUE    TRUE       TRUE         TRUE

##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa

## X1990      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE

## X1991      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE

##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland

## X1990  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE

## X1991  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE

##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste

## X1990                 TRUE       TRUE     TRUE           TRUE        TRUE

## X1991                 TRUE       TRUE     TRUE           TRUE        TRUE

##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan

## X1990 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE

## X1991 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE

##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates

## X1990                     TRUE   TRUE   TRUE    TRUE                 TRUE

## X1991                     TRUE   TRUE   TRUE    TRUE                 TRUE

##       United Kingdom Tanzania Virgin Islands (U.S.)

## X1990          FALSE     TRUE                  TRUE

## X1991          FALSE     TRUE                  TRUE

##       United States of America Uruguay Uzbekistan Vanuatu Venezuela

## X1990                    FALSE    TRUE       TRUE    TRUE      TRUE

## X1991                    FALSE    TRUE       TRUE    TRUE      TRUE

##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe

## X1990     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE

## X1991     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE

In this case we get a matrix variable with boolean values. The same happens when the comparison is applied to an individual column.

existing_df['United Kingdom'] > 10
##       United Kingdom

## X1990          FALSE

## X1991          FALSE

## X1992          FALSE

## X1993          FALSE

## X1994          FALSE

## X1995          FALSE

## X1996          FALSE

## X1997          FALSE

## X1998          FALSE

## X1999          FALSE

## X2000          FALSE

## X2001          FALSE

## X2002          FALSE

## X2003          FALSE

## X2004          FALSE

## X2005           TRUE

## X2006           TRUE

## X2007           TRUE

The result (and the syntax) is equivalent to that of Pandas, and can be used for indexing as follows.

existing_df$Spain[existing_df['United Kingdom'] > 10]
## [1] 24 24 23

As we did in Python/Pandas, let’s use the whole boolean matrix we got before.

head(existing_df[ existing_df_gt10 ]) # check first few elements
## [1] 436 429 422 415 407 397

But hey, the results are quite different from what we would expect coming from Pandas: we got a long vector of values, not a data frame. The reason is that the [ ] operator, when passed a logical matrix, first coerces the data frame to a matrix and then returns the selected elements as a vector. In other words, we cannot work seamlessly with R data.frames and boolean matrices as we did with Pandas. Instead, we should index rows and columns separately.
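
For readers coming from Python, NumPy behaves in a similar way when a 2-D array is indexed with a boolean matrix: the result is flattened into a one-dimensional vector. The following is a minimal sketch with made-up numbers (the frame toy and its values are assumptions for illustration, not the TB data). Transposing first reproduces the column-by-column order that we saw in the R output above.

```python
import numpy as np
import pandas as pd

# Toy data (made-up numbers)
toy = pd.DataFrame({"A": [5, 20], "B": [15, 3]}, index=["X1990", "X1991"])

values = toy.to_numpy()
mask = values > 10                # 2-D boolean array, same shape as values

row_major = values[mask]          # NumPy walks the array row by row
col_major = values.T[mask.T]      # walking the transpose gives column-by-column order

print(row_major)  # [15 20]
print(col_major)  # [20 15]
```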

But still, we can use matrix indexing with a data frame to replace elements.

existing_df_2 <- existing_df
existing_df_2[ existing_df_gt10 ] <- -1
head(existing_df_2,2) # check just a couple of rows
##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla

## X1990          -1      -1      -1             -1      -1     -1       -1

## X1991          -1      -1      -1             -1      -1     -1       -1

##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan

## X1990                  -1        -1      -1         7      -1         -1

## X1991                  -1        -1      -1         7      -1         -1

##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin

## X1990      -1      -1         -1        8      -1      -1     -1    -1

## X1991      -1      -1         -1        8      -1      -1     -1    -1

##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil

## X1990      10     -1      -1                     -1       -1     -1

## X1991      10     -1      -1                     -1       -1     -1

##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso

## X1990                     -1                -1       -1           -1

## X1991                     -1                -1       -1           -1

##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands

## X1990      -1       -1       -1      7         -1             10

## X1991      -1       -1       -1      7         -1             10

##       Central African Republic Chad Chile China Colombia Comoros

## X1990                       -1   -1    -1    -1       -1      -1

## X1991                       -1   -1    -1    -1       -1      -1

##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus

## X1990          -1            0         -1      -1   -1     -1

## X1991          -1           10         -1      -1   -1     -1

##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.

## X1990             -1            -1               -1               -1

## X1991             -1            -1               -1               -1

##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt

## X1990      -1       -1       -1                 -1      -1    -1

## X1991      -1       -1       -1                 -1      -1    -1

##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland

## X1990          -1                -1      -1      -1       -1   -1      -1

## X1991          -1                -1      -1      -1       -1   -1      -1

##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece

## X1990     -1               -1    -1     -1      -1      -1    -1     -1

## X1991     -1               -1    -1     -1      -1      -1    -1     -1

##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras

## X1990       7   -1        -1     -1            -1     -1    -1       -1

## X1991       7   -1        -1     -1            -1     -1    -1       -1

##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy

## X1990      -1       5    -1        -1   -1   -1      -1     -1    -1

## X1991      -1       4    -1        -1   -1   -1      -1     10    10

##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan

## X1990      10    -1     -1         -1    -1       -1     -1         -1

## X1991      10    -1     -1         -1    -1       -1     -1         -1

##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania

## X1990   -1     -1      -1      -1      -1                     -1        -1

## X1991   -1     -1      -1      -1      -1                     -1        -1

##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania

## X1990         -1         -1     -1       -1       -1   -1    10         -1

## X1991         -1         -1     -1       -1       -1   -1     9         -1

##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat

## X1990        -1     -1                    -1      3       -1         -1

## X1991        -1     -1                    -1      3       -1         -1

##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands

## X1990      -1         -1      -1      -1    -1    -1          -1

## X1991      -1         -1      -1      -1    -1    -1          10

##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger

## X1990                   -1            -1          10        -1    -1

## X1991                   -1            -1          10        -1    -1

##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau

## X1990      -1   -1                       -1      8   -1       -1    -1

## X1991      -1   -1                       -1      8   -1       -1    -1

##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal

## X1990     -1               -1       -1   -1          -1     -1       -1

## X1991     -1               -1       -1   -1          -1     -1       -1

##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation

## X1990          -1    -1          -1      -1      -1                 -1

## X1991          -1    -1          -1      -1      -1                 -1

##       Rwanda Saint Kitts and Nevis Saint Lucia

## X1990     -1                    -1          -1

## X1991     -1                    -1          -1

##       Saint Vincent and the Grenadines Samoa San Marino

## X1990                               -1    -1          9

## X1991                               -1    -1          9

##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone

## X1990                    -1           -1      -1         -1           -1

## X1991                    -1           -1      -1         -1           -1

##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa

## X1990        -1       -1       -1              -1      -1           -1

## X1991        -1       -1       -1              -1      -1           -1

##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland

## X1990    -1        -1    -1       -1        -1      5          -1

## X1991    -1        -1    -1       -1        -1      5          -1

##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste

## X1990                   -1         -1       -1             -1          -1

## X1991                   -1         -1       -1             -1          -1

##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan

## X1990   -1      -1    -1                  -1      -1     -1           -1

## X1991   -1      -1    -1                  -1      -1     -1           -1

##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates

## X1990                       -1     -1     -1      -1                   -1

## X1991                       -1     -1     -1      -1                   -1

##       United Kingdom Tanzania Virgin Islands (U.S.)

## X1990              9       -1                    -1

## X1991              9       -1                    -1

##       United States of America Uruguay Uzbekistan Vanuatu Venezuela

## X1990                        7      -1         -1      -1        -1

## X1991                        7      -1         -1      -1        -1

##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe

## X1990       -1               -1                 -1    -1     -1       -1

## X1991       -1               -1                 -1    -1     -1       -1

We can see that many of the elements, those where we had more than 10 cases, were assigned a -1 value.

The most expressive way of selecting from a data.frame in R is the subset function (type ?subset in your R console to read about it). The function is applied by row in the data frame. The second argument can include any condition using column names. The third argument can include a list of columns. The resulting data frame will contain those rows that satisfy the condition in the second argument, including just those columns listed in the third argument (all of them by default). For example, if we want to select those years when the United Kingdom had more than 10 cases, and list the resulting rows for three countries (UK, Spain, and Colombia), we will use:

# If a column name contains blanks, we have to wrap it in backticks (` `)
subset(existing_df,  `United Kingdom`>10, c('United Kingdom', 'Spain','Colombia'))
##       United Kingdom Spain Colombia

## X2005             11    24       53

## X2006             11    24       44

## X2007             12    23       43

We can do the same thing using [ ] as follows.

existing_df[existing_df["United Kingdom"]>10, c('United Kingdom', 'Spain','Colombia')]
##       United Kingdom Spain Colombia

## X2005             11    24       53

## X2006             11    24       44

## X2007             12    23       43
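
For comparison, the same row-and-column selection can be sketched in Pandas with .loc, passing the row condition first and the list of columns second. The frame toy_df below uses made-up numbers and an extra made-up column named Other, so it is only an illustration of the pattern.

```python
import pandas as pd

# Toy data (made-up numbers); "Other" is a hypothetical extra column
toy_df = pd.DataFrame(
    {
        "United Kingdom": [9, 11, 12],
        "Spain": [30, 24, 23],
        "Colombia": [60, 44, 43],
        "Other": [1, 2, 3],
    },
    index=["2000", "2006", "2007"],
)

# Rows where the condition holds, restricted to the listed columns
selected = toy_df.loc[
    toy_df["United Kingdom"] > 10,
    ["United Kingdom", "Spain", "Colombia"],
]
print(selected)
```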

Function mapping and data grouping

Python

The pandas.DataFrame class defines several ways of applying functions, both index-wise and element-wise. Some of them are predefined and are part of the descriptive statistics methods we will talk about when performing exploratory data analysis.

existing_df.sum()
country
    Afghanistan            6360
    Albania                 665
    Algeria                 853
    American Samoa          221
    Andorra                 455
    Angola                 7442
    Anguilla                641
    Antigua and Barbuda     195
    Argentina              1102
    Armenia                1349
    Australia               116
    Austria                 228
    Azerbaijan             1541
    Bahamas                 920
    Bahrain                1375
    ...
    United Arab Emirates         577
    United Kingdom               173
    Tanzania                    5713
    Virgin Islands (U.S.)        367
    United States of America      88
    Uruguay                      505
    Uzbekistan                  2320
    Vanuatu                     3348
    Venezuela                    736
    Viet Nam                    5088
    Wallis et Futuna            2272
    West Bank and Gaza           781
    Yemen                       3498
    Zambia                      9635
    Zimbabwe                    9231
    Length: 207, dtype: int64

We have just calculated, for each country, the sum over 1990 to 2007 of the existing TB cases per 100K. We can do the same by year if we pass axis=1, which sums across the columns instead of down the index.

existing_df.sum(axis=1)
year
    1990    40772
    1991    40669
    1992    39912
    1993    39573
    1994    39066
    1995    38904
    1996    37032
    1997    37462
    1998    36871
    1999    37358
    2000    36747
    2001    36804
    2002    37160
    2003    36516
    2004    36002
    2005    35435
    2006    34987
    2007    34622
    dtype: int64

It looks like there is a decrease in the existing number of TB cases per 100K across the world.
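
The effect of the axis argument is easy to check on a frame where the totals can be worked out by hand. A minimal sketch, with a made-up frame toy_df:

```python
import pandas as pd

# Toy data (made-up numbers): rows are years, columns are countries
toy_df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [10, 20, 30]},
    index=["1990", "1991", "1992"],
)

per_column = toy_df.sum()        # one total per column: A=6, B=60
per_row = toy_df.sum(axis=1)     # one total per row: 11, 22, 33

print(per_column.tolist())  # [6, 60]
print(per_row.tolist())     # [11, 22, 33]
```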

Pandas also provides methods to apply other functions to data frames. There are three: apply, applymap, and groupby.

apply and applymap

By using apply() we can apply a function along an input axis of a DataFrame. Objects passed to the functions we apply are Series objects having as index either the DataFrame’s index (axis=0) or the columns (axis=1). The return type depends on whether the passed function aggregates, or on the reduce argument if the DataFrame is empty. For example, if we want to obtain the number of existing cases per 10K (instead of 100K) we can use the following.

from __future__ import division # needed in Python 2 to get float division without using a cast
existing_df.apply(lambda x: x/10)
country  Afghanistan  Albania  Algeria  American Samoa  Andorra  Angola  Anguilla  Antigua and Barbuda  Argentina  Armenia  ...  Uruguay  Uzbekistan  Vanuatu  Venezuela  Viet Nam  Wallis et Futuna  West Bank and Gaza  Yemen  Zambia  Zimbabwe
year
1990            43.6      4.2      4.5             4.2      3.9    51.4       3.8                  1.6        9.6      5.2  ...      3.5        11.4     27.8        4.6      36.5              12.6                 5.5   26.5    43.6      40.9
1991            42.9      4.0      4.4             1.4      3.7    51.4       3.8                  1.5        9.1      4.9  ...      3.4        10.5     26.8        4.5      36.1              35.2                 5.4   26.1    45.6      41.7
1992            42.2      4.1      4.4             0.4      3.5    51.3       3.7                  1.5        8.6      5.1  ...      3.3        10.2     25.9        4.4      35.8               6.4                 5.4   26.3    49.4      41.5
1993            41.5      4.2      4.3             1.8      3.3    51.2       3.7                  1.4        8.2      5.5  ...      3.2        11.8     25.0        4.3      35.4              17.4                 5.2   25.3    52.6      41.9
1994            40.7      4.2      4.3             1.7      3.2    51.0       3.6                  1.3        7.8      6.0  ...      3.1        11.6     24.2        4.2      35.0              17.2                 5.2   25.0    55.6      42.6
1995            39.7      4.3      4.2             2.2      3.0    50.8       3.5                  1.2        7.4      6.8  ...      3.0        11.9     23.4        4.2      34.6               9.3                 5.0   24.4    58.5      43.9
1996            39.7      4.2      4.3             0.0      2.8    51.2       3.5                  1.2        7.1      7.4  ...      2.8        11.1     22.6        4.1      31.2              12.3                 4.9   23.3    60.2      45.3
1997            38.7      4.4      4.4             2.5      2.3    36.3       3.6                  1.1        6.7      7.5  ...      2.7        12.2     21.8        4.1      27.3              21.3                 4.6   20.7    62.6      48.1
1998            37.4      4.3      4.5             1.2      2.4    41.4       3.6                  1.1        6.3      7.4  ...      2.8        12.9     21.1        4.0      26.1              10.7                 4.4   19.4    63.4      39.2
1999            37.3      4.2      4.6             0.8      2.2    38.4       3.6                  0.9        5.8      8.6  ...      2.8        13.4     15.9        3.9      25.3              10.5                 4.2   17.5    65.7      43.0
2000            34.6      4.0      4.8             0.8      2.0    53.0       3.5                  0.8        5.2      9.4  ...      2.7        13.9     14.3        3.9      24.8              10.3                 4.0   16.4    65.8      47.9
2001            32.6      3.4      4.9             0.6      2.0    33.5       3.5                  0.9        5.1      9.9  ...      2.5        14.8     12.8        4.1      24.3               1.3                 3.9   15.4    68.0      52.3
2002            30.4      3.2      5.0             0.5      2.1    30.7       3.5                  0.7        4.2      9.7  ...      2.7        14.4     14.9        4.1      23.5              27.5                 3.7   14.9    51.7      57.1
2003            30.8      3.2      5.1             0.6      1.8    28.1       3.5                  0.9        4.1      9.1  ...      2.5        15.2     12.8        3.9      23.4              14.7                 3.6   14.6    47.8      63.2
2004            28.3      2.9      5.2             0.9      1.9    31.8       3.5                  0.8        3.9      8.5  ...      2.3        14.9     11.8        3.8      22.6               6.3                 3.5   13.8    46.8      65.2
2005            26.7      2.9      5.3             1.1      1.8    33.1       3.4                  0.8        3.9      7.9  ...      2.4        14.4     13.1        3.8      22.7               5.7                 3.3   13.7    45.3      68.0
2006            25.1      2.6      5.5             0.9      1.7    30.2       3.4                  0.9        3.7      7.9  ...      2.5        13.4     10.4        3.8      22.2               6.0                 3.2   13.5    42.2      69.9
2007            23.8      2.2      5.6             0.5      1.9    29.4       3.4                  0.9        3.5      8.1  ...      2.3        14.0     10.2        3.9      22.0               2.5                 3.1   13.0    38.7      71.4
18 rows × 207 columns

We have seen how apply can produce an element-wise result. It passes each column to our function as a Series; if the function is applicable to single elements (e.g. division), pandas broadcasts it over every element of that Series and returns a Series again, hence a data frame as the overall result in our case. However, the method intended for element-wise maps is applymap.
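
The following minimal sketch contrasts the two on a made-up frame (toy_df and its values are assumptions for illustration). Recent pandas releases rename applymap to DataFrame.map, so the sketch picks whichever method is available.

```python
import pandas as pd

# Toy data (made-up numbers)
toy_df = pd.DataFrame({"A": [10, 20], "B": [30, 40]}, index=["1990", "1991"])

# apply: the lambda receives a whole column (a Series) at a time
by_column = toy_df.apply(lambda col: col / 10)

# Element-wise map: the lambda receives one scalar at a time.
# Use DataFrame.map when it exists, otherwise fall back to applymap.
elementwise = toy_df.map if hasattr(toy_df, "map") else toy_df.applymap
by_cell = elementwise(lambda value: value / 10)

# Plain arithmetic on the frame gives the same numbers without any lambda
vectorized = toy_df / 10

print(by_column)
```

All three results hold the same values here; for simple arithmetic like this, the plain vectorized expression is usually the most direct choice.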

groupby

Grouping is a powerful and important data frame operation in Exploratory Data Analysis. In Pandas we can do this easily. For example, imagine we want the mean number of existing cases per year in two different periods, before the year 2000 and from 2000 onwards. We can do the following.

mean_cases_by_period = existing_df.groupby(lambda x: int(x)>1999).mean()
mean_cases_by_period.index = ['1990-1999', '2000-2007']
mean_cases_by_period
country    Afghanistan  Albania  Algeria  American Samoa  Andorra  Angola  Anguilla  Antigua and Barbuda  Argentina  Armenia  ...  Uruguay  Uzbekistan  Vanuatu  Venezuela  Viet Nam  Wallis et Futuna  West Bank and Gaza    Yemen   Zambia  Zimbabwe
1990-1999      403.700     42.1    43.90          16.200     30.3  474.40    36.400               12.800       76.6   64.400  ...   30.600      117.00  234.500     42.300   323.300           152.900              49.800  234.500  557.200    428.10
2000-2007      290.375     30.5    51.75           7.375     19.0  337.25    34.625                8.375       42.0   88.125  ...   24.875      143.75  125.375     39.125   231.875            92.875              35.375  144.125  507.875    618.75
2 rows × 207 columns

The groupby method accepts different types of grouping keys, including a mapping function (as in our example), a dictionary, a Series, or a tuple / list of column names. A mapping function is called on each element of the object's .index (the year string in our case) to determine the groups. If a dict or Series is passed, its values are used to determine the groups (e.g. we can pass a column that contains categorical values).

We can index the resulting data frame as usual.

mean_cases_by_period[['United Kingdom', 'Spain', 'Colombia']]
country    United Kingdom   Spain  Colombia
1990-1999           9.200  35.300     75.10
2000-2007          10.125  24.875     53.25
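
As a minimal sketch of two of the grouping keys mentioned above, the following groups a made-up frame first with a mapping function and then with a dictionary. The frame toy_df, the dictionary periods, and the labels 1990s and 2000s are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up numbers), indexed by year strings
toy_df = pd.DataFrame(
    {"A": [10, 20, 30, 50]},
    index=["1998", "1999", "2000", "2001"],
)

# 1) Mapping function: called on each index label, returns the group key
by_function = toy_df.groupby(lambda year: int(year) > 1999).mean()

# 2) Dictionary: maps each index label to a group name
periods = {"1998": "1990s", "1999": "1990s", "2000": "2000s", "2001": "2000s"}
by_dict = toy_df.groupby(periods).mean()

print(by_function)  # groups keyed False / True
print(by_dict)      # groups keyed '1990s' / '2000s'
```

With the dictionary, the group labels are readable from the start, so there is no need to overwrite the index afterwards.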

R

lapply

R has a large collection of apply functions that can be used to apply functions to elements within vectors, matrices, lists, and data frames. The one we will introduce here is lapply (type ?lapply in your R console). It is the one we use with lists and, since a data frame is a list of column vectors, it works with data frames as well.

For example, we can repeat the per-country sum over all years that we did with Pandas as follows.

existing_df_sum_years <- lapply(existing_df, function(x) { sum(x) })
existing_df_sum_years <- as.data.frame(existing_df_sum_years)
existing_df_sum_years
##   Afghanistan Albania Algeria American.Samoa Andorra Angola Anguilla

## 1        6360     665     853            221     455   7442      641

##   Antigua.and.Barbuda Argentina Armenia Australia Austria Azerbaijan

## 1                 195      1102    1349       116     228       1541

##   Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin Bermuda

## 1     920    1375       9278       95    1446     229    864  2384     133

##   Bhutan Bolivia Bosnia.and.Herzegovina Botswana Brazil

## 1  10579    4806                   1817     8067   1585

##   British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi

## 1                    383              1492      960         5583    8097

##   Cambodia Cameroon Canada Cape.Verde Cayman.Islands

## 1    14015     3787     92       6712            129

##   Central.African.Republic Chad Chile China Colombia Comoros Congo..Rep.

## 1                     7557 7316   452  4854     1177    2310        6755

##   Cook.Islands Costa.Rica Croatia Cuba Cyprus Czech.Republic Cote.d.Ivoire

## 1          357        349    1637  295    163            304          7900

##   Korea..Dem..Rep. Congo..Dem..Rep. Denmark Djibouti Dominica

## 1            12359             9343     151    19155      375

##   Dominican.Republic Ecuador Egypt El.Salvador Equatorial.Guinea Eritrea

## 1               2252    3676   700        1483              5303    3181

##   Estonia Ethiopia Fiji Finland France French.Polynesia Gabon Gambia

## 1    1214     8432  811     153    263              974  5949   6700

##   Georgia Germany Ghana Greece Grenada Guam Guatemala Guinea Guinea.Bissau

## 1    1406     180  7368    380     125 1340      1716   5853          6207

##   Guyana Haiti Honduras Hungary Iceland India Indonesia Iran Iraq Ireland

## 1   1621  7428     1756     930      58  8107      6131  789 1433     233

##   Israel Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait

## 1    138   139     142   822    236       2249  5117    12652    928

##   Kyrgyzstan Laos Latvia Lebanon Lesotho Liberia Libyan.Arab.Jamahiriya

## 1       2354 6460   1351     783    6059    7707                    559

##   Lithuania Luxembourg Madagascar Malawi Malaysia Maldives  Mali Malta

## 1      1579        233       6691   6290     2615     1638 10611   120

##   Mauritania Mauritius Mexico Micronesia..Fed..Sts. Monaco Mongolia

## 1      10698       817    978                  3570     44     6127

##   Montserrat Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands

## 1        227    1873       7992    5061    9990  2860  7398         138

##   Netherlands.Antilles New.Caledonia New.Zealand Nicaragua Niger Nigeria

## 1                  355          1095         176      1708  5360    7968

##   Niue Northern.Mariana.Islands Norway Oman Pakistan Palau Panama

## 1 1494                     3033    103  337     6889  2258   1073

##   Papua.New.Guinea Paraguay Peru Philippines Poland Portugal Puerto.Rico

## 1             8652     1559 4352       11604   1064      677         206

##   Qatar Korea..Rep. Moldova Romania Russian.Federation Rwanda

## 1  1380        2353    2781    2891               2170   7216

##   Saint.Kitts.and.Nevis Saint.Lucia Saint.Vincent.and.the.Grenadines Samoa

## 1                   259         371                              709   568

##   San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Seychelles

## 1        118                  5129         1171    7423       1347

##   Sierra.Leone Singapore Slovakia Slovenia Solomon.Islands Somalia

## 1        11756       751      700      639            6623    8128

##   South.Africa Spain Sri.Lanka Sudan Suriname Swaziland Sweden Switzerland

## 1        10788   552      1695  7062     1975     11460     82         149

##   Syrian.Arab.Republic Tajikistan Thailand Macedonia..FYR Timor.Leste

## 1                  986       3438     4442           1108       10118

##    Togo Tokelau Tonga Trinidad.and.Tobago Tunisia Turkey Turkmenistan

## 1 12111    1283   679                 282     685   1023         1866

##   Turks.and.Caicos.Islands Tuvalu Uganda Ukraine United.Arab.Emirates

## 1                      485   7795   7069    1778                  577

##   United.Kingdom Tanzania Virgin.Islands..U.S.. United.States.of.America

## 1            173     5713                   367                       88

##   Uruguay Uzbekistan Vanuatu Venezuela Viet.Nam Wallis.et.Futuna

## 1     505       2320    3348       736     5088             2272

##   West.Bank.and.Gaza Yemen Zambia Zimbabwe

## 1                781  3498   9635     9231

What did we do there? Very simple. The lapply function takes a list and a function that is applied to each element, and it returns the results as a list. The function is defined in-line (much like a lambda in Python): for a given x, it sums its elements. We then convert the resulting list back to a data frame with as.data.frame.
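
To make the comparison with Python concrete, here is a rough sketch of the same idea in plain Python. The helper named lapply and the toy_columns data are hypothetical and exist only for this illustration; the numbers are made up.

```python
# Hypothetical toy "data frame": a dict mapping column names to value lists.
# The numbers are made up for illustration only.
toy_columns = {
    "country_a": [10, 20, 30],
    "country_b": [1, 2, 3],
}

def lapply(columns, func):
    """Apply func to every column and return the results keyed by column name."""
    return {name: func(values) for name, values in columns.items()}

# The lambda plays the role of function(x) { sum(x) } in the R code.
column_sums = lapply(toy_columns, lambda x: sum(x))
print(column_sums)
```

As in R, the result has one entry per column, each holding the value returned by the in-line function.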

If we instead want one total per year, summed over all countries, we can use the transposed data frame we stored before.

existing_df_sum_countries <- lapply(existing_df_t, function(x) { sum(x) })
existing_df_sum_countries <- as.data.frame(existing_df_sum_countries)
existing_df_sum_countries
##   X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001

## 1 40772 40669 39912 39573 39066 38904 37032 37462 36871 37358 36747 36804

##   X2002 X2003 X2004 X2005 X2006 X2007

## 1 37160 36516 36002 35435 34987 34622

aggregate

R provides basic grouping functionality through aggregate. Another option is the powerful dplyr library, which I highly recommend.

But aggregate is quite powerful as well. It accepts a data frame, a list of grouping elements, and a function to apply to each group. First we need to define a grouping vector.

before_2000 <- c('1990-99','1990-99','1990-99','1990-99','1990-99',
                 '1990-99','1990-99','1990-99','1990-99','1990-99',
                 '2000-07','2000-07','2000-07','2000-07','2000-07',
                 '2000-07','2000-07','2000-07')
before_2000
##  [1] "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99"

##  [8] "1990-99" "1990-99" "1990-99" "2000-07" "2000-07" "2000-07" "2000-07"

## [15] "2000-07" "2000-07" "2000-07" "2000-07"

Then we can use that vector as the grouping element and apply the function mean.

mean_cases_by_period <- aggregate(existing_df, list(Period = before_2000), mean)
mean_cases_by_period
##    Period Afghanistan Albania Algeria American Samoa Andorra Angola

## 1 1990-99     403.700    42.1   43.90         16.200    30.3 474.40

## 2 2000-07     290.375    30.5   51.75          7.375    19.0 337.25

##   Anguilla Antigua and Barbuda Argentina Armenia Australia Austria

## 1   36.400              12.800      76.6  64.400       6.8  14.500

## 2   34.625               8.375      42.0  88.125       6.0  10.375

##   Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize

## 1     75.600  52.700  95.600     571.20    6.400  80.500  14.000  54.60

## 2     98.125  49.125  52.375     445.75    3.875  80.125  11.125  39.75

##     Benin Bermuda  Bhutan Bolivia Bosnia and Herzegovina Botswana  Brazil

## 1 131.300   8.400 699.600   308.2                  132.9  356.400 103.400

## 2 133.875   6.125 447.875   215.5                   61.0  562.875  68.875

##   British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso Burundi

## 1                 24.600             90.60   57.700        239.9  332.30

## 2                 17.125             73.25   47.875        398.0  596.75

##   Cambodia Cameroon Canada Cape Verde Cayman Islands

## 1    835.9  201.400  5.900    409.500          8.400

## 2    707.0  221.625  4.125    327.125          5.625

##   Central African Republic    Chad Chile  China Colombia Comoros

## 1                  360.000 330.300  32.0 300.00    75.10 152.500

## 2                  494.625 501.625  16.5 231.75    53.25  98.125

##   Congo, Rep. Cook Islands Costa Rica Croatia  Cuba Cyprus Czech Republic

## 1     322.200       23.400       24.5 110.000 21.70  10.90           20.8

## 2     441.625       15.375       13.0  67.125  9.75   6.75           12.0

##   Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep. Denmark Djibouti

## 1        331.00          794.400           393.30    9.70 1145.000

## 2        573.75          551.875           676.25    6.75  963.125

##   Dominica Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea

## 1   22.000             148.20 236.700  45.6       101.9            206.50

## 2   19.375              96.25 163.625  30.5        58.0            404.75

##   Eritrea Estonia Ethiopia  Fiji Finland France French Polynesia   Gabon

## 1 221.200  77.700  382.900 54.50  10.400  16.90           70.900 330.800

## 2 121.125  54.625  575.375 33.25   6.125  11.75           33.125 330.125

##   Gambia Georgia Germany   Ghana Greece Grenada   Guam Guatemala  Guinea

## 1 352.20    68.2    12.8 450.100 24.300   7.000 100.20   101.500 274.200

## 2 397.25    90.5     6.5 358.375 17.125   6.875  42.25    87.625 388.875

##   Guinea-Bissau  Guyana   Haiti Honduras Hungary Iceland   India Indonesia

## 1        394.10  61.800 438.100  118.900  68.300   3.700 533.200    387.70

## 2        283.25 125.375 380.875   70.875  30.875   2.625 346.875    281.75

##     Iran   Iraq Ireland Israel Italy Jamaica  Japan Jordan Kazakhstan

## 1 52.000 85.800    14.9   8.80 8.800     8.6 53.700 16.300      107.3

## 2 33.625 71.875    10.5   6.25 6.375     7.0 35.625  9.125      147.0

##   Kenya Kiribati Kuwait Kyrgyzstan   Laos Latvia Lebanon Lesotho Liberia

## 1 208.9  874.900  69.40    118.700 393.40 75.400    57.9   271.5   444.7

## 2 378.5  487.875  29.25    145.875 315.75 74.625    25.5   418.0   407.5

##   Libyan Arab Jamahiriya Lithuania Luxembourg Madagascar Malawi Malaysia

## 1                 40.200     94.10      15.10      359.5  355.0   158.90

## 2                 19.625     79.75      10.25      387.0  342.5   128.25

##   Maldives    Mali Malta Mauritania Mauritius Mexico Micronesia, Fed. Sts.

## 1  105.500 595.200  7.80    600.700    50.200  72.40                246.80

## 2   72.875 582.375  5.25    586.375    39.375  31.75                137.75

##   Monaco Mongolia Montserrat Morocco Mozambique Myanmar Namibia   Nauru

## 1    2.8   412.50       13.5 116.600    368.300  352.70 566.900 216.500

## 2    2.0   250.25       11.5  88.375    538.625  191.75 540.125  86.875

##     Nepal Netherlands Netherlands Antilles New Caledonia New Zealand

## 1 523.300        8.80                 22.7          83.1      10.100

## 2 270.625        6.25                 16.0          33.0       9.375

##   Nicaragua  Niger Nigeria  Niue Northern Mariana Islands Norway   Oman

## 1    113.40 308.60 361.500 98.80                  228.200    6.7 23.200

## 2     71.75 284.25 544.125 63.25                   93.875    4.5 13.125

##   Pakistan   Palau Panama Papua New Guinea Paraguay   Peru Philippines

## 1  423.400 164.100 68.800          494.900   89.400 297.40       726.4

## 2  331.875  77.125 48.125          462.875   83.125 172.25       542.5

##   Poland Portugal Puerto Rico Qatar Korea, Rep. Moldova Romania

## 1 77.100    43.90      15.300    78     141.600 140.000   153.1

## 2 36.625    29.75       6.625    75     117.125 172.625   170.0

##   Russian Federation Rwanda Saint Kitts and Nevis Saint Lucia

## 1             107.20 274.20                  15.1       22.50

## 2             137.25 559.25                  13.5       18.25

##   Saint Vincent and the Grenadines Samoa San Marino Sao Tome and Principe

## 1                            42.30 35.00      7.500                 306.1

## 2                            35.75 27.25      5.375                 258.5

##   Saudi Arabia Senegal Seychelles Sierra Leone Singapore Slovakia Slovenia

## 1       67.000 385.000     91.400      531.900     49.70   49.700   47.800

## 2       62.625 446.625     54.125      804.625     31.75   25.375   20.125

##   Solomon Islands Somalia South Africa  Spain Sri Lanka   Sudan Suriname

## 1         469.600 521.100        569.2 35.300      99.1 401.100     95.1

## 2         240.875 364.625        637.0 24.875      88.0 381.375    128.0

##   Swaziland Sweden Switzerland Syrian Arab Republic Tajikistan Thailand

## 1   527.900  4.900       10.30               72.300     134.00    288.6

## 2   772.625  4.125        5.75               32.875     262.25    194.5

##   Macedonia, FYR Timor-Leste   Togo Tokelau Tonga Trinidad and Tobago

## 1         80.100       662.6 650.10   105.9  39.9              16.100

## 2         38.375       436.5 701.25    28.0  35.0              15.125

##   Tunisia Turkey Turkmenistan Turks and Caicos Islands Tuvalu Uganda

## 1  46.400 68.800      105.900                   32.200 511.30 352.70

## 2  27.625 41.875      100.875                   20.375 335.25 442.75

##   Ukraine United Arab Emirates United Kingdom Tanzania

## 1   81.60               37.400          9.200  279.200

## 2  120.25               25.375         10.125  365.125

##   Virgin Islands (U.S.) United States of America Uruguay Uzbekistan

## 1                23.000                      6.0  30.600     117.00

## 2                17.125                      3.5  24.875     143.75

##   Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza   Yemen

## 1 234.500    42.300  323.300          152.900             49.800 234.500

## 2 125.375    39.125  231.875           92.875             35.375 144.125

##    Zambia Zimbabwe

## 1 557.200   428.10

## 2 507.875   618.75

Of course, we can pass a subset of the data frame as the first argument to aggregate. We can also pass multiple grouping elements and supply our own functions (either anonymous or predefined). And again, the result is a data frame that we can index as usual.

mean_cases_by_period[,c('United Kingdom','Spain','Colombia')]
##   United Kingdom  Spain Colombia

## 1          9.200 35.300    75.10

## 2         10.125 24.875    53.25
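
For comparison, a similar grouping by an explicit label vector can be sketched in Pandas by passing an array of labels, one per row, to groupby. This is only an illustrative sketch: toy_df and period are hypothetical names and the values are made up, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: year strings as index, made-up values.
toy_df = pd.DataFrame(
    {
        "country_a": [10.0, 20.0, 30.0, 40.0],
        "country_b": [1.0, 3.0, 5.0, 7.0],
    },
    index=["1998", "1999", "2000", "2001"],
)

# One label per row, playing the role of the before_2000 vector in R.
period = np.array(["1990-99", "1990-99", "2000-07", "2000-07"])

# Rows sharing a label form a group; mean is applied to each column per group.
mean_by_period = toy_df.groupby(period).mean()
print(mean_by_period)

# The result is a data frame, so we can select columns as usual.
print(mean_by_period[["country_a"]])
```

Passing a NumPy array whose length matches the number of rows makes Pandas use the values directly as group labels, much like the list(Period = before_2000) argument to aggregate.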

Conclusions

This two-part tutorial has introduced the concept of a data frame, together with how to use it in the two most popular data science ecosystems nowadays, R and Python. We have seen how Pandas is inspired by R, and how Python/Pandas offers constructs very similar to those present in the R language. Python is also a general-purpose language widely used by software developers of all kinds, which arguably gives Pandas a more consistent programming interface, and one that is more efficient in many situations. It is often said in the community that, if you come from a software development background, you will feel more comfortable with a language like Python and with the way DataFrame is defined as an object-oriented concept. If you come instead from a maths and statistics background, you will appreciate a language like R, very interactive and function-based, with libraries made by statisticians for statisticians. It is not a language meant to be used in complex software architectures on its own, but to be used in a powerful dialog with data.

Additionally, we have introduced a few datasets from Gapminder World related to infectious tuberculosis, a very serious epidemic disease that is sometimes forgotten in developed countries but that nowadays is the second leading cause of death of its kind, just after HIV (and is often associated with HIV). In the next tutorial in the series, we will use these datasets to perform some exploratory analysis in both Python and R, to better understand the world situation regarding the disease.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

