Wednesday, March 25, 2015

My Data Science / Machine Learning Technology Stack

The Setup

Let's talk briefly about my technology stack. I work in Python, for a number of reasons. Focusing allows me to learn more about the nuances of a single language. Many of the techniques here are well-supported in other languages, but I trip over my feet more when using them. I'm better off going with my strengths, even if that means forgoing some great tools.

So first up, here's the list:
  • A 2011 Macbook Air laptop for my main computing system
  • Python 2.7, using pandas, numpy, sqlite3 and theano (not a full list)
  • IPython Notebook for putting together walkthroughs and experiments
  • Access to a machine with a CUDA-enabled NVidia GPU for performance work
  • An external two-terabyte hard drive for much needed disk space
  • A copy of several significant machine learning databases, including
    • MNIST data
    • Two kaggle competition datasets (ocean plankton and diabetic retinopathy)
    • The Enron mail archive
Side note -- If you want to get your local environment set up, a great goal is to follow along with the walkthroughs here. This will exercise your setup using most of the tools and libraries that you will be interacting with.
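Before following along, it's worth confirming the libraries above actually import. Here's a minimal sketch of an environment check; the library names are just the ones from the list above, so add or remove to taste:

```python
# Quick environment check: report which stack libraries are importable.
import importlib

def check_stack(libraries):
    """Return a dict mapping each library name to True if it imports."""
    status = {}
    for name in libraries:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

if __name__ == "__main__":
    for lib, ok in sorted(check_stack(["numpy", "pandas", "sqlite3", "theano"]).items()):
        print("%-10s %s" % (lib, "OK" if ok else "missing"))
```

Anything reported missing can be installed with pip before going further.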

So far, this has proved to be a tolerably effective setup. By far, the two hardest aspects of the problem are leveraging the GPU and accessing disk space at a reasonable speed. GPU-based libraries are quite disconnected from general libraries, and require a different approach. There are compatibility elements (such as conversion to and from numpy arrays), but essentially the GPU is a different beast to the CPU and can't simply be abstracted away ... yet.

The Macbook Air is limited in its amount of RAM and on-board storage. That said, almost any laptop is going to be very limited compared to almost any desktop for the purposes of machine learning. Basic algorithms such as linear regression are easy enough that the laptop is no problem; intensive algorithms such as deep learning are hard enough that you will want to farm the processing out to a dedicated machine. You'll basically want a gaming setup.

If I'm lucky, I may be upgrading to a 13" 2015 Macbook Pro shortly. While this will ease the constraints, the fundamental dynamics will remain unchanged: it is still not sufficient for deep learning purposes, and won't really change the balance all that much. Unless you have a beast of a laptop with an on-board NVidia GPU, you'll be looking at a multi-machine setup. Note - the 15" MBP does have an NVidia GPU, but I don't want the size.

Working on Multiple Machines

Some people will have easy access to a higher-end machine; some won't. Some will be dismayed not to be able to work easily on a single machine; others will be fine with it. Unlike many, I won't be recommending AWS or another cloud provider. They are fine if the charges represent small change to you, but given the disk and GPU requirements of these machines, you're not going to be looking at entry-level prices. You would be much better off spending $200 on the NVidia Jetson TK1 and using that as a modular processing engine for your laptop.

For small data sets like MNIST, your laptop will be totally fine. For medium-size sets, like the ocean plankton data, you will be completely unable to complete the tasks on a CPU in a tolerable time-frame.
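To put "small" in perspective, a back-of-envelope calculation shows why MNIST is laptop-friendly (image counts and dimensions are those of the standard MNIST training set):

```python
# Back-of-envelope memory estimate for the MNIST training set:
# 60,000 images of 28x28 pixels, one byte per greyscale pixel.
images = 60000
pixels = 28 * 28
raw_bytes = images * pixels      # stored as uint8
float_bytes = raw_bytes * 8      # cast to float64, numpy's default dtype
print("raw:     %.1f MB" % (raw_bytes / 1e6))
print("float64: %.1f MB" % (float_bytes / 1e6))
```

Even carelessly cast to float64, the whole training set is a few hundred megabytes, well within a Macbook Air's RAM. The plankton and retinopathy sets are orders of magnitude larger.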

For large data sets, you may well be unable to fit the data onto the onboard storage of a single machine, especially a laptop. For most of us, that will mean using an external USB drive.

There are many ways to effectively utilise multiple machines. Some are based on making those machines available as "slaves" to a master or controller system; others are based on remote code execution, perhaps through shell scripts. Future blog posts will cover the how and why of this. For the moment, the important thing to take away from this is that you'll either be working on small data sets or on multiple machines.
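As a taste of the remote-execution flavour, here's a minimal sketch that shells out over SSH from the controller laptop. The host name "gpubox" and the training script are hypothetical placeholders, and it assumes passwordless SSH keys are already set up:

```python
# Run a command on a remote worker machine over SSH and capture its output.
import subprocess

def build_remote_command(host, command):
    """Build the argv list for running `command` on `host` via ssh."""
    return ["ssh", host, command]

def run_remote(host, command):
    """Execute the command remotely; raises CalledProcessError on failure."""
    output = subprocess.check_output(build_remote_command(host, command))
    return output.decode("utf-8")

# Example usage (hypothetical host and script):
# print(run_remote("gpubox", "python train_plankton.py"))
```

This is deliberately bare-bones; dedicated tools handle retries, file transfer and job queues, but the underlying idea is the same.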

Why NVidia? Isn't that favouritism?

Short version: It's what the library ecosystem supports best right now. NVidia GPUs support CUDA, which is the only GPU controller language the best Python libraries support properly. If you want to work on machine learning problems, rather than on adapting libraries to new GPU platforms, you'll need to go down the proprietary path. In a couple of years, I expect that the software world will have caught up to more architectures. If you're working on problems which can be reasonably solved on a CPU, then no problem. If you're an amazing software wizard who can comfortably modify the back-end architecture, then you're probably reading the wrong blog anyway.
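For what it's worth, pointing Theano at the GPU is mostly a matter of configuration. A minimal `~/.theanorc` along these lines tells it to use CUDA (exact device names vary by Theano version):

```
# ~/.theanorc -- minimal GPU configuration
[global]
device = gpu
floatX = float32
```

Setting floatX to float32 matters because consumer NVidia cards of this era are far faster at single precision than double.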