I’m at EuroSciPy 2015, where we have 2 days of Pythonistic Science in Cambridge. I spoke this morning on “Data Cleaning on Text to Prepare for Data Analysis and Machine Learning” (which is a terribly verbose title, sorry!). I covered 10 years of lessons learned working with NLP on (often crappy) text data, and ways to clean it up to make it easy to work with. Topics covered:
- decoding bytes into unicode (including chardet, ftfy, chromium language detector) to step past the UnicodeDecodeError
- validating that a new dataset looks like a previous+trusted dataset (I’m thinking of writing a tool for this – would that be useful to you?)
- automatically transforming data from “what I have” to “what I want” with annotate.io without writing regexps (now public)!
- manual approaches to normalisation (the stuff I do that started me thinking on annotate.io)
- visualisation with GlueViz, Seaborn and csv-fingerprint
- starting your first ML project
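
As a rough illustration of the first topic (this is not the code from the talk), here is a minimal sketch of stepping past a UnicodeDecodeError using only the standard library. The function name `decode_bytes` and the candidate encoding list are assumptions made for the example; in practice a detector such as chardet would suggest the candidates rather than a hard-coded tuple.

```python
def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption, not a
    recommendation from the talk.
    """
    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never
    # raises, though the resulting text may be wrong for the real encoding
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for sample in ("café".encode("utf-8"), b"caf\xe9", b"\x81"):
        text, used = decode_bytes(sample)
        print(repr(sample), "->", repr(text), "via", used)
```

Returning the encoding alongside the text makes it possible to log which inputs needed a fallback, which is often the first hint that a data source is inconsistent.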
Here are the slides:
Ian applies Data Science as an AI/Data Scientist for companies at ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.