Thursday, June 25, 2015

Setup for PyCon AU Tutorial

This guide is intended to help attendees of PyCon AU 2015 get set up ahead of the tutorial. Getting set up in advance will help greatly in getting the most out of the tutorial. It will let attendees focus on the slides and the problem examples rather than on working through an installation process.

What it's like installing software during a tutorial session
There will be USB keys on the day with the data sets and some of the software libraries included, in case the network breaks. However, things will go more smoothly for everyone if some of these hurdles can be cleared out of the way in advance.

The Software You Will Need

  1. Python 3.4, with NumPy, SciPy, Scikit-Learn, Pandas, Xray, Pillow -- install via Anaconda
  2. IPython Notebook, Matplotlib, Seaborn -- install via Anaconda
  3. Theano, Keras -- install via pip
  4. Word2Vec (https://github.com/danielfrg/word2vec) -- avoid pip, install from source
  5. https://github.com/danieldiekmeier/memegenerator -- just drop in the notebook folder
  6. https://github.com/tweepy/tweepy -- install via pip
I recommend using Anaconda as it ships with prebuilt binaries for O/S dependencies, for a variety of platforms. It's also possible to get this all working with pip and your O/S package manager. It should be fine to use Windows, but OSX or Linux are likely to be easier to use. Because IPython Notebook is the primary environment, the choice of operating system is not likely to be a major limiting factor.

I have only had success installing word2vec by cloning the repository and installing locally. I went with the old-school 'python setup.py install'. For whatever reason, what's in PyPI doesn't work for me.

I've noted the easiest path for installing each package in the list above.
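Once everything is installed, a quick import check can confirm the environment is ready. The following is a minimal sketch, not part of the tutorial material: the import names are my assumption (for example, Scikit-Learn imports as "sklearn" and Pillow as "PIL"), and memegenerator is left out because it is simply dropped into the notebook folder.

```python
import importlib.util

# Assumed import names for the packages in the list above. Some differ from
# the package name: Scikit-Learn imports as "sklearn", Pillow as "PIL".
REQUIRED_MODULES = [
    "numpy", "scipy", "sklearn", "pandas", "xray", "PIL",
    "IPython", "matplotlib", "seaborn",
    "theano", "keras", "word2vec", "tweepy",
]


def find_missing(module_names):
    """Return the names from module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when no top-level module of that name exists.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Still to install: " + ", ".join(missing))
else:
    print("All required modules were found.")
```

Run it with the same Python interpreter that the notebook will use, since Anaconda and the system Python can have different packages installed.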

The Data You Will Need 

  1. MNIST: https://github.com/tleeuwenburg/stml/blob/master/mnist/mnist.pkl.gz 
  2. Kaggle Otto competition data: https://www.kaggle.com/c/otto-group-product-classification-challenge
  3. "Text8": http://mattmahoney.net/dc/text8.zip
  4. For a stretch, try the larger data sets from http://mattmahoney.net/dc/textdata
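To check that a download such as mnist.pkl.gz is intact, it can be opened from Python. This is a hedged sketch: the helper name `load_pickle_gz` is hypothetical, and I am assuming the file is a gzip-compressed pickle that may have been written by Python 2 (hence the `encoding` argument). If the file contains NumPy arrays, NumPy must be installed for the load to succeed.

```python
import gzip
import pickle


def load_pickle_gz(path):
    """Load one pickled object from a gzip-compressed file."""
    with gzip.open(path, "rb") as handle:
        # encoding="latin1" lets Python 3 read pickles written by Python 2.
        # That the MNIST file needs this is an assumption, not a known fact.
        return pickle.load(handle, encoding="latin1")


# Example usage, assuming the file sits in the current directory:
# data = load_pickle_gz("mnist.pkl.gz")
# print(type(data))
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.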

An Overview of the Tutorial

The tutorial will include an introduction, a mini-installfest, and then three problem walkthroughs. There will be some general tips, plus time for discussion. 

Entree: Problem Walkthrough One: MNIST Digit Recognition

Compute Time: Around 3 to 5 minutes for a random forest approach

Digit recognition is most obviously used when decoding postcode numbers on envelopes. It's also relevant to general handwriting recognition, and to recognition of printed text, such as OCR of scanned documents or license plate recognition.

Attendees will be able to run the supplied, worked solution on the spot. We'll step through the implementation stages to talk about how to apply similar solutions to other problems. If time is available, we will include alternative machine learning techniques and other data sets.

Data for this problem will be available on USB.

Main: Otto Shopping Category Challenge

Compute time: 1 minute for random forest
Compute time: 7 minutes for deep learning
Data for this problem can be downloaded only through the Kaggle site due to the terms of use.

This is a real-world, commercial problem. The "Otto Group" sell stuff, and they put that stuff into nine classes for this problem. Each thing they sell has 93 features. The sample data set has 200k individual products which have each been somehow scored against these 93 features. The problem definition is to go from 93 input numbers to a category id between 1 and 9.

{ 93 features } --> some kind of machine learning --> { number between 1 and 9 }
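The shape of that pipeline can be sketched in plain Python. This is only an illustration under loud assumptions: it uses a nearest-centroid classifier as a stand-in (not the random forest or deep learning models used in the tutorial), and it runs on synthetic data, not the Kaggle data. The function names are hypothetical.

```python
import random

N_FEATURES = 93  # from the problem description
N_CLASSES = 9    # category ids run from 1 to 9


def fit_centroids(rows, labels):
    """Average the feature vectors of each class (nearest-centroid training)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((x - c) ** 2 for x, c in zip(row, centroids[label]))
    return min(centroids, key=distance)


# Synthetic stand-in data, NOT the Kaggle data: class k has features near k.
rng = random.Random(0)
rows, labels = [], []
for label in range(1, N_CLASSES + 1):
    for _ in range(20):
        rows.append([label + rng.random() for _ in range(N_FEATURES)])
        labels.append(label)

centroids = fit_centroids(rows, labels)
prediction = predict(centroids, rows[0])  # 93 numbers in, one category id out
print(prediction)
```

Whatever model sits in the middle, the interface stays the same: a list of 93 numbers goes in and a single category id comes out.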

Dessert: A Twitter Memebot in Word2Vec

Compute Time: Around 4 minutes for Word2Vec training, plus 2 minutes for meme generation

This is something fun based on Word2Vec. We'll scrape Twitter for some text to process, then use Word2Vec to look at some of the word relationships in the timelines.
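Word relationships in vector models are commonly measured with cosine similarity between word vectors. The sketch below shows the idea in plain Python. The three-dimensional vectors are made up for illustration (real Word2Vec vectors are learned from text and have many more dimensions), and `most_similar` is a hypothetical helper, not the word2vec library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word's."""
    candidates = (w for w in vectors if w != word)
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("cat", toy_vectors))
```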

Visualisation, Plotting and Results Analysis

No data science tutorial would be complete without data visualisation and plotting of results. Rather than have a separate problem for this, we will include them in each problem. We will also be considering how to determine whether your model is 'good', and how to convince both yourself and your customers / managers of that fact!
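One simple way to start judging whether a model is 'good' is to compare its accuracy with a trivial baseline that always predicts the most common class. This is a minimal sketch with made-up labels. The function names are hypothetical, and accuracy is only one of several measures that could be used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy achieved by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 3]

print("model accuracy:   ", accuracy(y_true, y_pred))
print("baseline accuracy:", majority_baseline_accuracy(y_true))
```

A model that cannot beat this baseline has not learned anything useful from the features, however impressive its raw accuracy looks.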

Bring Your Own Data

If you have a data problem of your own, you can bring it along to the tutorial and work on that instead. As time allows, I'll endeavour to assist with any questions you might have about working with your own data. Alternatively, you can just come up to me during the conference and we can take a look! There's nothing more interesting than looking at data that inherently matters to you.

I hope to see you at the conference!!