Channel: Planet Python

François Dion: Bond. James "import pandas" Bond


It all started when...

    [friend] I'm trying to get this table on wikipedia from python...

[me] Sure. What module are you using?

    [friend] BeautifulSoup, but man, this is hard. It's this url...

[me] Wait, this is not a Coursera assignment you are asking me to do, is it?

    [friend] No, no. I saw this thing using a different programming language and I want to do it in Python.

[me] Ok, sounds reasonable.

The URL

The basic URL that documents James Bond movies on Wikipedia is https://en.wikipedia.org/wiki/List_of_James_Bond_films, but the URL he sent me was https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363, which is why it looked like an assignment.


Let me pause for a brief second on this subject. I'm a big fan of reproducible research, and selecting a specific revision of a document is an excellent idea. A specific revision will never change, whereas the page behind a normal Wikipedia URL changes all the time.
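As a small illustration of pinning a revision, here is a sketch that builds the same kind of permalink from a page title and a revision id. The helper name permalink is made up for this example; the title and oldid values are the ones from the URL above.

```python
from urllib.parse import urlencode

BASE = "https://en.wikipedia.org/w/index.php"


def permalink(title, oldid):
    """Return the URL of one fixed revision of a Wikipedia page.

    Hypothetical helper, not part of any library.
    """
    return BASE + "?" + urlencode({"title": title, "oldid": oldid})


url = permalink("List_of_James_Bond_films", 688916363)
print(url)
```

Keeping the oldid next to your analysis code means anyone rerunning it later pulls the same tables you did.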

I'll have some of that BeautifulSoup

My friend mentioned he was trying to use BeautifulSoup but was facing some challenges. BeautifulSoup and lxml are the usual suspects when it comes to web scraping (with requests to pull the data in). But I have to admit, most of the time I don't use any of these. You see, I'm lazy, and typically these solutions require too much work. If you want to see what I'm talking about, you can check using-python-beautifulsoup-to-scrape-a-wikipedia-table.

I don't like to type more code than I need to. At any rate, the goal was to get the web page, parse two tables and then load the data in a pandas data frame to do further analysis, plots etc.
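To give a rough feel for the manual route, here is a sketch that pulls rows out of a table by hand. It is not BeautifulSoup: it uses the standard library's html.parser as a stand-in so it runs with no extra installs, and the HTML snippet and the TableRows class are made up for illustration. A real Wikipedia table, with its nested markup, would need more code than this.

```python
from html.parser import HTMLParser

# Made-up snippet standing in for a fetched page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Dr. No</td><td>1962</td></tr>
  <tr><td>Goldfinger</td><td>1964</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

That is a few dozen lines before any data frame exists, which is the kind of typing I'd rather avoid.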

Enter the Pandas

And it's not even the Kung Fu Panda, just good old Pandas, the data wrangling tool par excellence (https://pypi.python.org/pypi/pandas/0.17.1).

Everybody knows, I hope, that it has superb support for loading Excel and CSV files. It's why Python is the number 1 data wrangling programming language.

But what about loading tables from Wikipedia web pages? Surely there is nothing that can simplify this, is there? If you've attended all PYPTUG meetings, you already know the answer.

    import pandas as pd

    # One call fetches the page and parses every HTML table on it.
    wiki_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_James_Bond_films&oldid=688916363", header=0)

read_html returns a list of DataFrames, one per table found on the web page. To get the box office table on this page, we want the second DataFrame, the first being the warning table at the top of the page. Since the list is 0-indexed, we refer to it with wiki_df[1]. We don't want row 0 because it holds sub headers, and we don't want the last two rows: one is a movie that's just been released and whose numbers are not in yet, and the other is a total row. How do we do this? Good old Python slices:

    df = wiki_df[1][1:24]

And that's it, seriously. One line to ingest, one line to clean up.
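If you want to try the indexing and the slice without hitting the network, here is a minimal sketch. The two small frames are made up and only stand in for the list that read_html returns; the names warning, box_office and tables are assumptions for this example.

```python
import pandas as pd

# Stand-in for the list read_html returns: one DataFrame per table.
# Both frames are invented for illustration, not the Wikipedia data.
warning = pd.DataFrame({"Notice": ["This is an old revision"]})
box_office = pd.DataFrame(
    {
        "Title": ["(sub header)", "Dr. No", "Goldfinger", "Spectre", "Total"],
        "Year": ["(sub header)", "1962", "1964", "2015", ""],
    }
)
tables = [warning, box_office]

# tables[1] is the second table. The slice [1:3] keeps the rows at
# positions 1 and 2, dropping the sub-header row at position 0 and
# the two trailing rows (unreleased movie and total).
df = tables[1][1:3]
print(df)
```

The slice keeps the original row labels, which is why the result starts at 1 rather than 0.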

The result

|  | Title | Year | Bond actor | Director | Box office | Budget | Salary of Bond actor | Box office.1 | Budget.1 | Salary of Bond actor.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dr. No | 1962 | Connery, SeanSean Connery | Young, TerenceTerence Young | 59.5 | 1.1 | 0.1 | 448.8 | 7.0 | 0.6 |
| 2 | From Russia with Love | 1963 | Connery, SeanSean Connery | Young, TerenceTerence Young | 78.9 | 2.0 | 0.3 | 543.8 | 12.6 | 1.6 |
| 3 | Goldfinger | 1964 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 124.9 | 3.0 | 0.5 | 820.4 | 18.6 | 3.2 |
| 4 | Thunderball | 1965 | Connery, SeanSean Connery | Young, TerenceTerence Young | 141.2 | 6.8 | 0.8 | 848.1 | 41.9 | 4.7 |
| 5 | You Only Live Twice | 1967 | Connery, SeanSean Connery | Gilbert, LewisLewis Gilbert | 101.0 | 10.3 | 0.8 + 25% net merch royalty | 514.2 | 59.9 | 4.4 excluding profit participation |
| 6 | On Her Majesty's Secret Service | 1969 | Lazenby, GeorgeGeorge Lazenby | Hunt, Peter R.Peter R. Hunt | 64.6 | 7.0 | 0.1 | 291.5 | 37.3 | 0.6 |
| 7 | Diamonds Are Forever | 1971 | Connery, SeanSean Connery | Hamilton, GuyGuy Hamilton | 116.0 | 7.2 | 1.2 + 12.5% of gross (14.5) | 442.5 | 34.7 | 5.8 excluding profit participation |
| 8 | Live and Let Die | 1973 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 126.4 | 7.0 | n/a | 460.3 | 30.8 | n/a |
| 9 | man with !The Man with the Golden Gun | 1974 | Moore, RogerRoger Moore | Hamilton, GuyGuy Hamilton | 98.5 | 7.0 | n/a | 334.0 | 27.7 | n/a |
| 10 | spy who !The Spy Who Loved Me | 1977 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 185.4 | 14.0 | n/a | 533.0 | 45.1 | n/a |
| 11 | Moonraker | 1979 | Moore, RogerRoger Moore | Gilbert, LewisLewis Gilbert | 210.3 | 34.0 | n/a | 535.0 | 91.5 | n/a |
| 12 | For Your Eyes Only | 1981 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 194.9 | 28.0 | n/a | 449.4 | 60.2 | n/a |
| 13 | Octopussy | 1983 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 183.7 | 27.5 | 4.0 | 373.8 | 53.9 | 7.8 |
| 14 | view !A View to a Kill | 1985 | Moore, RogerRoger Moore | Glen, JohnJohn Glen | 152.4 | 30.0 | 5.0 | 275.2 | 54.5 | 9.1 |
| 15 | living !The Living Daylights | 1987 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 191.2 | 40.0 | 3.0 | 313.5 | 68.8 | 5.2 |
| 16 | Licence to Kill | 1989 | Dalton, TimothyTimothy Dalton | Glen, JohnJohn Glen | 156.2 | 36.0 | 5.0 | 250.9 | 56.7 | 7.9 |
| 17 | GoldenEye | 1995 | Brosnan, PiercePierce Brosnan | Campbell, MartinMartin Campbell | 351.9 | 60.0 | 4.0 | 518.5 | 76.9 | 5.1 |
| 18 | Tomorrow Never Dies | 1997 | Brosnan, PiercePierce Brosnan | Spottiswoode, RogerRoger Spottiswoode | 338.9 | 110.0 | 8.2 | 463.2 | 133.9 | 10.0 |
| 19 | world !The World Is Not Enough | 1999 | Brosnan, PiercePierce Brosnan | Apted, MichaelMichael Apted | 361.8 | 135.0 | 12.4 | 439.5 | 158.3 | 13.5 |
| 20 | Die Another Day | 2002 | Brosnan, PiercePierce Brosnan | Tamahori, LeeLee Tamahori | 431.9 | 142.0 | 16.5 | 465.4 | 154.2 | 17.9 |
| 21 | Casino Royale | 2006 | Craig, DanielDaniel Craig | Campbell, MartinMartin Campbell | 594.2 | 150.0 | 3.4 | 581.5 | 145.3 | 3.3 |
| 22 | Quantum of Solace | 2008 | Craig, DanielDaniel Craig | Forster, MarcMarc Forster | 576.0 | 200.0 | 8.9 | 514.2 | 181.4 | 8.1 |
| 23 | Skyfall | 2012 | Craig, DanielDaniel Craig | Mendes, SamSam Mendes | 1108.6[20] | 150.0[21][22]—200.0[20] | 17.0[23] | 879.8 | 158.1 | 13.5 |


Francois Dion
@f_dion
