Thursday, May 28, 2015

An environment to share...

I'm a terrible blogger. I just ground to a halt and got overwhelmed by real life. I ran out of good ideas. I hadn't finished things I was in the middle of and had no results. AAAAAGH. Here's something vague about repeatable data science environments for tutorials...

I am working on the background material to support tutorial sessions. If there's one hard thing about giving a tutorial, it's getting everyone on the same page without anyone being left behind. All of a sudden, you need to think about *everyone's* environment. Not just yours -- not even just most people's, but everyone's.

There are a few technologies for setting up environments, plus some entirely different approaches. My goal is to present people with multiple paths to success, without having to think of everything.

I'll be looking at:
  -- Virtualenv and Python packages
  -- Virtual machine images
  -- Vagrant automatic VM provisioning
  -- Alternative Python distributions
  -- Using web-based environments rather than your own installation

Why is this all so complicated? Well, without pointing any fingers, Python packaging alone won't get the job done for scientific packages. There's no getting around the fact that you will need to install some packages into the base operating system, and there is no good, well-supported path to make that easy, particularly if you would prefer to do this without modifying the base system. Then, there's the layer of being able to help a room full of people all get the job done in about twenty minutes. Even with a room full of developers, it's going to be a challenge.

Let's take a tour of the options.

One -- Virtualenv and Python Packages

This option is the most 'pythonic' but also, by far, the least likely to get the job done. The reason is the dependency on complex scientific libraries, which the user is then going to have to install by hand, following a list of instructions. It's doable, but I won't know up front what the differences between, say, yum and apt are going to be, let alone the potential differences between operating system versions. Then, there will be some users on OS X (hopefully using either MacPorts or Homebrew) and potentially some on Windows. In my experience, package names differ across those systems, and at times there may be critical gaps or versioning differences. Furthermore, the relevant Python 3 vs 2.7 packages may differ. It is basically just too hard to use Python's inbuilt packaging mechanism to handle a whole room full of individual development platform differences.
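To get a feel for how varied a room is before the tutorial starts, attendees could run something like the following. This is only a rough sketch: the function name and the list of package managers are my own choices for illustration, and it assumes Python 3. It reports the interpreter version, the operating system, and which of the package managers mentioned above are on the PATH.

```python
import platform
import shutil
import sys

# Candidate system package managers mentioned above ('port' is the MacPorts
# command). This list is an assumption for illustration; extend it as needed.
CANDIDATE_MANAGERS = ["apt-get", "yum", "brew", "port"]


def describe_platform():
    """Summarise the interpreter, the OS and any package managers found."""
    found = [name for name in CANDIDATE_MANAGERS if shutil.which(name)]
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "system": platform.system(),  # e.g. 'Linux', 'Darwin' or 'Windows'
        "package_managers": found,
    }


if __name__ == "__main__":
    for key, value in describe_platform().items():
        print(key, "->", value)
```

It won't fix anything by itself, but it does tell you which set of install instructions each person is going to need.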

Two -- Virtual Machine Images

This approach is fairly reliable, but feels kind of clumsy to me, and isn't necessarily very repeatable. While not everyone is going to have VirtualBox (or any other major virtualiser) installed, most people will be able to use this technology on their systems. There may be a few who will need to install VirtualBox, but from there it really should 'just work'.

VM files can be shared with USB keys or over the network. So long as you bring along a good number of keys it should be mostly okay. A good tip though -- bring along keys of a couple of different brands. I have had firsthand experience of specific brands of USB key and computer just not getting along.
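Since large files and cheap USB keys don't always get along, it may be worth checking that each copy of the image matches the original. Here is a minimal sketch using a SHA-256 checksum from the standard library. The function names are made up for illustration, and the paths you pass in would be wherever your image and the key happen to be mounted.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte image never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original_path, copied_path):
    """True if the copy hashes to the same value as the original."""
    return sha256_of_file(original_path) == sha256_of_file(copied_path)
```

Printing the checksum on a handout also gives attendees a way to check their own copy once it's on their machine.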

The downside is that while this will work in a tutorial setting, virtual machines can be slow, and don't necessarily set up the attendees with the technology they should be using going forward. They may find themselves left short of being able to work effectively in their own environments later.

Three -- Vagrant Automatic VM Provisioning

The level up from supplying a base VM is using Vagrant. It allows you to specify the configuration of the base machine and its packages through a configuration file. Rather than having to share large virtual machine files, which are also hard to version, the only thing you need to share with people is that one simple file. That's something that can be easily versioned, and is lightweight to send around. The only downside is that each attendee will need to download the base VM image through the Vagrant system, which will hit the local network. Running a tutorial is an exercise in digital survivalism. It's best not to rely on any aspect of supporting technology.

I have also had a lot of trouble trying to install Vagrant boxes. I'm not really sure what the source of the issues was. I'm not really sure why it started working either. I just know I'm not going to trust it in a live tutorial environment. Crossing it off for now.

Four -- Alternative Python Distributions

This could be a really good option. The two main distributions that I'm aware of are Python(x,y) and Anaconda. Both seem solid, but Anaconda probably has more mindshare, particularly for scientific packages. For the purposes of machine learning and data science, that is going to be very relevant. Many people support using the Anaconda distribution by default, but that's not my first option.

I would recommend Anaconda in corporate environments, where it's useful to have a single, multi-capable installation which is going to be installable onto enterprise Linux distributions but still have all the relevant scientific libraries. I would recommend against it on your own machine, because despite its general excellence, I have found occasional issues when trying to install niche packages.

Anaconda would probably also be good in the data centre or in a cloud environment, where things can also get a little wild when installing software. It's probably a good choice for system administrators as well. Firstly, installed packages won't interfere with the system Python. Secondly, it allows users to create user-space deployments with their own libraries that are isolated from the central libraries. This helps with managing robustness. Standard Python will do this with the 'virtualenv' package, so there are multiple ways to achieve this goal.
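When helping someone debug their setup, it's handy to know which of these isolation mechanisms their interpreter is actually running under. The sketch below is a best-effort guess rather than anything authoritative. It relies on two things I'm fairly confident of: conda environments keep a 'conda-meta' directory under their prefix, and virtual environments either set sys.real_prefix (older virtualenv releases) or make sys.prefix differ from sys.base_prefix. The function name is my own.

```python
import os
import sys


def environment_kind():
    """Best-effort guess at what sort of Python environment is running."""
    # conda keeps its package records in a 'conda-meta' directory under the prefix.
    if os.path.isdir(os.path.join(sys.prefix, "conda-meta")):
        return "conda"
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    if hasattr(sys, "real_prefix") or base_prefix != sys.prefix:
        return "virtualenv"
    return "system"


if __name__ == "__main__":
    print("Running under:", environment_kind())
```

If this prints 'system' on a machine where someone believes they activated an environment, that is usually the first thing to sort out.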

Five -- Using Web-Based Environments Rather Than Your Own Installation

This is really about side-stepping the issue, rather than fixing it as such. It's not free from pitfalls, because there are still browser incompatibilities to consider. That said, if your audience can't manage to have an up-to-date version of either Firefox or Chrome installed, then things are likely to be tricky anyway. Up-to-date versions of Internet Explorer are likely to work as well, although I haven't tested that to any degree. You'll also need to understand the local networking environment to make sure you can run a server and that attendees are going to be able to access it. You could host an ad-hoc network on your own hardware, but I'm a bit nervous about that approach.
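One cheap way to find out whether the local network will cooperate is to try a plain TCP connection to the server from an attendee's machine. The following is a minimal sketch with the standard library. The host and port in the usage line are hypothetical placeholders, not a real server, and a successful connection only shows the port is reachable, not that the web application behind it is healthy.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical address of the tutorial server; replace with the real one.
    print(can_reach("192.168.0.10", 8888))
```

Running this from a couple of machines on the venue's network before the session would catch the most obvious firewall and routing surprises.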

Perhaps, if I have some spare hardware, I'll expose something through a web server. Another alternative is to demonstrate (for example) the Kaggle scripting environment.


I think I have talked myself around to providing a virtual machine image via USB keys. I can build the environment on my own machine, verify exactly how it is set up, then provide something controlled to participants. 

In addition, I'll supply the list of packages that are in use, so that people can install them directly onto their own system if desired. This will be particularly relevant to those looking to exploit their GPUs to the maximum. 
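Alongside the package list, a small check script lets people confirm their own installation before the session. This is a sketch only: the module names in REQUIRED_MODULES are placeholders I picked for illustration, not the actual tutorial list, and the check only handles top-level module names (not dotted ones). It assumes Python 3.

```python
import importlib.util

# Hypothetical list for illustration; substitute the tutorial's real packages.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Note that this only confirms a module can be located; it doesn't check versions, or that compiled extensions load correctly.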

Finally, I'll include a demo of the Kaggle scripting environment for those who don't really have an appropriate platform themselves.

I'd appreciate comments from anyone who has run or attended tutorials and has an opinion about how best to get everyone up and running...