Channel: Planet Python

Ian Ozsvald: Convert London Oyster (Travel) PDFs to Pandas DataFrames

February 8, 2016, 2:02 am

As part of analysing Emily’s allergic rhinitis, we want to test whether using the London Underground (notoriously dirty!) increases the likelihood of sneezing. The “black snot” phenomenon is well known to Londoners; possibly the particulates (from oil and metal) cause irritation. You can get updates via our allergic rhinitis analysis mailing list (very, very low volume).

Transport for London lets us download a log of journeys, either as a CSV file (just dates and costs, no details) or as a PDF file (containing full details of each journey and its time). It would be much nicer if they made the detailed data available in a cleanly formatted open format (at least CSV, preferably HDF5).

The goal is to take the detail-rich PDFs and to build a DataFrame like:

                             from is_train                to
date                                                        
2016-01-30  Bus Journey, Route 46    False                  
2016-01-28           Kentish Town     True  Leicester Square
2016-01-28             Old Street     True      Kentish Town
2016-01-28       Leicester Square     True        Old Street
2016-01-27                  Angel     True      Kentish Town
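To make the target shape concrete, here is a minimal sketch that builds the same table by hand with Pandas. The records are simply copied from the sample output above; the variable names are illustrative and are not taken from the actual parser.

```python
import pandas as pd

# Journey records copied from the sample table above (illustrative only).
records = [
    {"date": "2016-01-30", "from": "Bus Journey, Route 46", "is_train": False, "to": ""},
    {"date": "2016-01-28", "from": "Kentish Town", "is_train": True, "to": "Leicester Square"},
    {"date": "2016-01-28", "from": "Old Street", "is_train": True, "to": "Kentish Town"},
    {"date": "2016-01-28", "from": "Leicester Square", "is_train": True, "to": "Old Street"},
    {"date": "2016-01-27", "from": "Angel", "is_train": True, "to": "Kentish Town"},
]

df = pd.DataFrame.from_records(records)
# Parse the date strings and use them as the index, as in the table above.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")[["from", "is_train", "to"]]
print(df)
```

With a date index like this, later analysis (for example, counting train journeys per day) becomes a simple group-by on the index.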

Using textract (see these Python 3.4 install notes; I also use pdftotext) and a very hacky parser (written this evening, it really is a stateful, messy hack, sorry), I can parse a single PDF or a folder of PDFs to build a Pandas DataFrame of journeys. You’ll find the London Oyster PDF to DataFrame Parser here. The output is an HDF5 file which can be loaded by Python into Pandas (or by R or Matlab or whatever).
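To illustrate what "stateful" means here, the following is a rough sketch and not the actual parser. It assumes a made-up text layout in which a date line is followed by one journey per line; the real text extracted from an Oyster PDF will look different. The names `SAMPLE_TEXT`, `parse_date` and `parse_journeys` are hypothetical, as is the rule that a line containing " to " is a train journey.

```python
from datetime import datetime

# Hypothetical extracted text: a date header followed by that day's journeys.
# The real pdftotext/textract output for an Oyster PDF will differ.
SAMPLE_TEXT = """\
2016-01-30
Bus Journey, Route 46
2016-01-28
Kentish Town to Leicester Square
Old Street to Kentish Town
"""


def parse_date(line):
    """Return a date if the line is a date header, otherwise None."""
    try:
        return datetime.strptime(line, "%Y-%m-%d").date()
    except ValueError:
        return None


def parse_journeys(text):
    """Walk the lines once, remembering the most recent date header."""
    current_date = None  # the parser's state
    journeys = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        date = parse_date(line)
        if date is not None:
            current_date = date  # update state; not a journey line
            continue
        if current_date is None:
            continue  # journey line before any date header: skip it
        if " to " in line:
            # Assumption: "A to B" lines are train journeys.
            origin, destination = line.split(" to ", 1)
            journeys.append({"date": current_date, "from": origin,
                             "is_train": True, "to": destination})
        else:
            # Assumption: anything else (e.g. a bus) has no destination.
            journeys.append({"date": current_date, "from": line,
                             "is_train": False, "to": ""})
    return journeys


for journey in parse_journeys(SAMPLE_TEXT):
    print(journey)
```

The resulting list of dicts could be handed to `pandas.DataFrame.from_records` and saved with `DataFrame.to_hdf` (which needs the PyTables package installed); the file can then be read back with `pandas.read_hdf`.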

Ian applies Data Science as an AI/Data Scientist for companies at ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs in Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.
