This is an attempt to provide attendees of PyCon AU 2015 with a guide to getting set up ahead of the tutorial. Getting set up in advance will help you get the most out of the tutorial: you can focus on the slides and the worked examples rather than on struggling through an installation process.
There will be USB keys on the day with the data sets and some of the software libraries included, in case the network breaks. However, things will go more smoothly for everyone if some of these hurdles can be cleared out of the way in advance.
What it's like installing software during a tutorial session
The Software You Will Need
- Python 3.4, with NumPy, SciPy, scikit-learn, pandas, xray, Pillow -- install via Anaconda
- IPython Notebook, matplotlib, seaborn -- install via Anaconda
- Theano, Keras -- install via pip
- Word2Vec (https://github.com/danielfrg/word2vec) -- avoid pip, install from source
- https://github.com/danieldiekmeier/memegenerator -- just drop it in the notebook folder
- https://github.com/tweepy/tweepy -- install via pip
I have only had success installing word2vec by cloning the repository and installing locally, using the old-school 'python setup.py install'. For whatever reason, the version on PyPI doesn't work for me.
I've noted the easiest path for installing each package in the list above.
The Data You Will Need
- MNIST: https://github.com/tleeuwenburg/stml/blob/master/mnist/mnist.pkl.gz
- Kaggle Otto competition data: https://www.kaggle.com/c/otto-group-product-classification-challenge
- "Text8": http://mattmahoney.net/dc/text8.zip
- For a stretch, try the larger data sets from http://mattmahoney.net/dc/textdata
An Overview of the Tutorial
The tutorial will include an introduction, a mini-installfest, and then three problem walkthroughs. There will be some general tips, plus time for discussion.

Entree: Problem Walkthrough One: MNIST Digit Recognition
Compute time: around 3 to 5 minutes for a random forest approach

Digit recognition is most obviously used when decoding postcode numbers on envelopes. It's also relevant to general handwriting recognition, and to non-handwritten recognition such as OCR of scanned documents or license plate recognition.
Attendees will be able to run the supplied, worked solution on the spot. We'll step through the implementation stages to talk about how to apply similar solutions to other problems. If time is available, we will include alternative machine learning techniques and other data sets.
Data for this problem will be available on USB.
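To give a feel for the approach before the day, here is a minimal random-forest sketch. It uses scikit-learn's small built-in digits set as a stand-in for the full MNIST file the tutorial loads from mnist.pkl.gz, so the shapes and timings differ, but the pipeline is the same.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# 1,797 8x8 digit images, each flattened to 64 pixel features.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]
X_test, y_test = digits.data[1500:], digits.target[1500:]

# Fit a forest of 100 trees and score it on the held-out images.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy: %.3f" % clf.score(X_test, y_test))
```

On the full MNIST data the same few lines take the minutes quoted above; on this toy set they run in seconds.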
Main: Otto Shopping Category Challenge
Compute time: 1 minute for a random forest, 7 minutes for deep learning
Data for this problem can be downloaded only through the Kaggle site due to the terms of use.
This is a real-world, commercial problem. The "Otto Group" sell stuff, and for this problem they have put that stuff into nine classes. Each thing they sell has 93 features. The sample data set has 200k individual products, each of which has somehow been scored against these 93 features. The problem definition is to go from 93 input numbers to a category id between 1 and 9.
{ 93 features } --> some kind of machine learning --> { number between 1 and 9 }
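That pipeline can be sketched in a few lines. Since the real data is only available via Kaggle, the features and labels below are random placeholders that merely match the competition's shapes (fake products, 93 features, classes 1 to 9) -- the numbers are meaningless, only the plumbing is real.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 2,000 fake products x 93 count-like features,
# with category ids drawn at random from 1..9.
rng = np.random.RandomState(0)
X = rng.poisson(2, size=(2000, 93))
y = rng.randint(1, 10, size=2000)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Kaggle scores this competition on predicted class probabilities,
# so predict_proba is the shape of an actual submission.
probs = clf.predict_proba(X[:5])
print(probs.shape)  # one row per product, one column per class
```

Swap in the real Kaggle CSV for X and y and this is essentially the random-forest baseline from the walkthrough.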
Dessert: A Twitter Memebot in Word2Vec
Compute time: around 4 minutes for Word2Vec training, plus 2 minutes for meme generation

This is something fun based on Word2Vec. We'll scrape Twitter for some text to process, then use Word2Vec to look at some of the word relationships in the timelines.
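The relationship trick at the heart of this is plain vector arithmetic. The sketch below fakes it with four hand-made 3-dimensional vectors (real Word2Vec embeddings are learned from text and have hundreds of dimensions, and this deliberately avoids the library's API); it only illustrates the famous 'king - man + woman ≈ queen' idea.

```python
import numpy as np

# Toy, hand-made "embeddings" -- purely illustrative, not learned vectors.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, exclude=()):
    """Word whose vector is most similar to target, skipping the query words."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(target, exclude=("king", "man", "woman")))  # prints "queen"
```

With a trained model you do the same arithmetic over a vocabulary of tens of thousands of words, which is where the interesting relationships appear.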
Visualisation, Plotting and Results Analysis
No data science tutorial would be complete without data visualisation and plotting of results. Rather than have a separate problem for this, we will include them in each problem walkthrough. We will also consider how to determine whether your model is 'good', and how to convince both yourself and your customers / managers of that fact!

Bring Your Own Data
If you have a data problem of your own, you can bring it along to the tutorial and work on that instead. As time allows, I'll endeavour to assist with any questions you might have about working with your own data. Alternatively, you can just come up to me during the conference and we can take a look! There's nothing more interesting than looking at data that inherently matters to you.

I hope to see you at the conference!!