Thursday, April 9, 2015

STML: Loading and working with data

Loading Data Into Memory

I've had some Real Life Interruptions preventing me from taking the MNIST data further. Here's a generic post on working with data that I prepared earlier.

It may sound a little ridiculous, but sometimes even loading data into memory can be a challenge. These are the main hurdles I've come across when loading data (and subsequently working with it):
  1. Picking an appropriate in-memory structure
  2. Dealing with very large data sets
  3. Dealing with a very large number of files
  4. Picking data structures that are appropriate for your software libraries
Some of these have obvious answers; some do not. Most of them will change depending on the technology stack you choose. Many of them will be different between Python 2.7 and Python 3.x. Let's take it step-by-step, in detail.

Update -- after beginning this post intending to address all four points, I found that I exceeded my length limit with even a brief statement on the various categories of input data with respect to choosing an appropriate in-memory data structure. I will deal with all these topics in detail later, when we actually focus on a specific machine learning problem. For now, this blog post is just a brief statement on picking appropriate in-memory data structures.

Picking an appropriate in-memory structure

Obviously, this will depend very much on your data set. Very common data formats are:
  1. Images, either grayscale or RGB, varying in size from icons to multi-megapixel photos
  2. Plain text (with many kinds of internal structure)
  3. Semantic text (xml, json, markdown, html...)
  4. Graphs (social graphs, networks, object-oriented databases)
  5. Relational databases
  6. Spatial data / geographically-aware data
Two additional complicating factors are dealing with time-series of data, and dealing with streaming data.

This isn't the place to exhaustively cover this topic. Simply covering how to load these files into memory could comfortably take a couple of posts per data type, and that is without going into cleaning data, scraping data or otherwise getting data into fit shape. Nonetheless, this is a topic often skipped over by machine learning tutorials, and it has been the cause of many a stubbed toe on my walk through a walkthrough.

Loading Images into Memory

Python has quite a few options when it comes to loading image data. When it comes to machine learning, the first step is typically to deal with image intensity (e.g. grayscale) rather than utilising the colour values. Further, the data is typically treated as normalised values between zero and one. Image coordinates are typically based at the top left, array coordinates vary, graphs are bottom-left, screen coordinates vary, blah blah blah. Within a single data set this probably doesn't matter too much, as the algorithms don't tend to care whether images are flipped upside down, just so long as things are all internally consistent. Where it matters is when you bring together disparate data sources.

The super-super short version is:
from PIL import Image   # provided by the Pillow package
import numpy
# open a random image small enough for memory
img = Image.open('filename.jpg')
img = numpy.asarray(img, dtype='float32') / 256.
Note the use of float32, the division by 256, and the decimal point on the divisor to force floating-point division (under Python 2.7, integer division would otherwise truncate everything to zero). You'll need to install numpy and something called "Pillow". Pillow is the maintained successor to the PIL library and is compatible with any tutorials you're likely to find.

Loading Plain Text Into Memory

Plain text is usually pretty easy to read. There are some minor platform differences between Windows and the rest of the known universe, but by and large getting text into memory isn't a problem. The main issue is dealing with it after that point.

A simple
thestring = open('filename.txt').read()
will typically do this just fine for now. The challenge will come when you need to understand its contents. Breaking up the raw text is called tokenization, and most language processing will use something called 'part-of-speech tagging' to start building up a representation of the underlying language concepts. A library called NLTK is the current best place to start with this, although Word2Vec is also a very interesting starting point for language processing. Don't mention unicode -- that's for later.
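
To make that concrete, here's a minimal sketch of tokenization and part-of-speech tagging with NLTK. It assumes you've installed NLTK and downloaded its tokenizer and tagger data (via nltk.download()); the filename is just a placeholder.
import nltk
# read the raw text, then break it into tokens and tag each token
thestring = open('filename.txt').read()
tokens = nltk.word_tokenize(thestring)   # tokenization
tagged = nltk.pos_tag(tokens)            # part-of-speech tagging
print(tagged[:10])                       # e.g. [('The', 'DT'), ('cat', 'NN'), ...]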

Loading Semantic Text into Memory

Let's start with HTML data, because I have a gripe about this. There's a library out there called Beautiful Soup, which has the explicit goal of being able to process pretty much any html page you can throw at it. However, for some incomprehensible reason, I have found it is actually less tolerant than the lxml library, which is better able to handle nonconformant HTML input (when using its forgiving HTML parser). First up, I really don't work with HTML much. As a result, you're going to be mostly on your own when it comes to parsing the underlying document, but you'll be using one of these two libraries depending on your needs and your source data.
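
For illustration, here's a minimal sketch of each approach. The HTML string is a deliberately broken stand-in for whatever page you're actually scraping.
from bs4 import BeautifulSoup
from lxml import html
page = '<html><body><p>Hello <b>world</p></body></html>'   # note the unclosed <b>
# Beautiful Soup builds a navigable tree even from broken markup
soup = BeautifulSoup(page, 'html.parser')
print(soup.get_text())
# lxml.html is similarly forgiving and exposes an ElementTree-style API
tree = html.fromstring(page)
print(tree.text_content())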

Loading Graphs Into Memory

I also haven't worked all that much with graph structures. Figuring out how to load and work with this kind of data is a challenge. Firstly, there appear to be few commonly-accepted graph serialisation formats. Secondly, the term "graph" is massively overloaded, so you'll find a lot of search results for libraries that are actually pretty different. There are three broad categories of graph library: graphical tools for plotting charts (e.g. X vs Y plots a la matplotlib and graphite); tools for drawing graph structures (e.g. dot, graphviz); and tools for actually working with graph structures in memory (like networkX). There is in fact some overlap between all of these tools, and there are sure to be more out there. It is also possible to represent graphs through Python dictionaries and custom data structures.
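
As a minimal sketch of the in-memory option with networkX (the node names here are made up for illustration):
import networkx as nx
# build a tiny social graph in memory
G = nx.Graph()
G.add_edge('alice', 'bob')
G.add_edge('bob', 'carol')
print(G.number_of_nodes(), G.number_of_edges())   # 3 nodes, 2 edges
print(list(nx.connected_components(G)))           # a single connected component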

Working with Relational Data

Relational database queries are absolutely fantastic tools. Don't underestimate the value of Ye Olde Database structure. They are also a wonderful way to work with data locally. The two magic libraries here are sqlite3 and pandas. Here's the short recipe for reading out from a local sqlite3 database:
import sqlite3
import pandas
# pull a whole table into a pandas DataFrame
conn = sqlite3.connect('../database/demo.db')
query = 'select * from TABLE'
df = pandas.read_sql(query, conn)
Connecting to a remote database simply involves using the appropriate library to create the connection object. Pandas is the best way to deal with the data once in memory, bar none. It is also the best way to read CSV files, and is readily usable with other scientific libraries due to the ability to retrieve data as numpy arrays. Many, many interesting machine learning problems can and should be approached with these tools.
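
As an aside on the CSV point, a sketch of that round trip might look like this (the file name is hypothetical):
import pandas
# read a CSV into a DataFrame, then hand the numbers to numpy-based code
df = pandas.read_csv('measurements.csv')
matrix = df.values                 # a plain numpy array of the data
print(matrix.shape, matrix.dtype)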

More recently, I have found it more effective to use HDF5 as the local data storage format, even for relational data. Go figure.
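
A minimal sketch of that, assuming pandas with the PyTables backend installed (the file and key names are placeholders):
import pandas
df = pandas.DataFrame({'reading': [0.1, 0.5, 0.9]})
# write the frame to a local HDF5 store, then read it straight back
df.to_hdf('store.h5', 'readings')
df2 = pandas.read_hdf('store.h5', 'readings')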

Spatial Data / GIS Data

Abandon all hope, ye who enter here.

Spatial / GIS data is a minefield of complexity seemingly designed to entrap, confuse and beguile even the experts. I'm not going to even try to talk about it in a single paragraph. Working with spatial data is incredibly powerful, but it will require an enormous rant-length blog post all of its own to actually treat properly.

Dealing with Time Series

Machine learning problems frequently ignore the time dimension of our reality. This can be a very useful simplification, but feels unsatisfying. I haven't seen any good libraries which really nail this. Let's talk about some common approaches:

Markov Modelling. The time dimension is treated as a sequence of causally-linked events. Frequency counting is used to infer the probability of an outcome given a small number of antecedent conditions. The key concept here is the n-gram: a sequence of n consecutive items whose occurrences are counted.
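
A minimal sketch of bigram (2-gram) frequency counting over a token sequence, using nothing but the standard library:
from collections import Counter
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# count each adjacent pair of tokens (the 2-grams)
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(3))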

Four-Dimensional Arrays. Time can be treated as a sequence of events at either regular or irregular intervals. Arrays can be sparse or dense.
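
For the dense, regular-interval case, the sketch is as simple as allocating a numpy array with a leading time axis (the shapes here are arbitrary):
import numpy
# 100 time steps of 32x32 RGB frames: (time, height, width, channel)
frames = numpy.zeros((100, 32, 32, 3), dtype='float32')
frames[0] = 0.5          # set every pixel of the first frame
print(frames.shape)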

Date-Time Indexing. This is where, for example, a 2d matrix or SQL table has date and time references recorded with the entries.
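
In pandas terms, a minimal sketch of this looks like the following (the dates and values are invented):
import pandas
# a daily series indexed by date
index = pandas.date_range('2015-01-01', periods=90, freq='D')
series = pandas.Series(range(90), index=index)
# select all of February by label
print(series.loc['2015-02'])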

Animation (including video). Video and sound formats both include time as a very significant concept. While most operating systems don't provide true realtime guarantees, their components (sound cards and video cards) provide the ability to buffer data and then write it out in something very close to true realtime. These will in general be based on an implicit time interval between data values.

Dealing with Streaming Data

Algorithms which deal with streaming data are typically referred to as "online" algorithms. In development mode, your data will mostly be stored on a disk somewhere you can re-process it easily and at will. However, not all data sources can be so readily handled. Examples where you might use an online algorithm include video processing and data firehose scenarios (satellite data, Twitter, sound processing). Such data sources are often of only ephemeral value and not worth permanently storing; you may be working on a lower-capability compute platform; or you just might not have that much disk free.

You can think of these data streams as something where you store the output of your algorithm, but throw out the source data. An easy example might include feature detection on a network router. You might monitor the network stream for suspicious packets, but not store the full network data of every packet going across your network. Another might be small robots or drones which have limited on-board capacity for storage, but can run relevant algorithms on the data in realtime.
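
As a toy sketch of the "keep the output, throw out the source" idea, here is an online running mean that holds only a count and the current mean, no matter how long the stream is:
def running_mean(stream):
    # consume an arbitrarily long stream one value at a time,
    # keeping only the count and the current mean as state
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
    return mean

print(running_mean(iter([3.0, 5.0, 7.0])))   # 5.0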

Next Steps

I am not sure exactly what to make of this blog post. The commentary is only slightly useful for people with a specific problem to solve, but perhaps it serves to explain just how complex an area machine learning can be. The only practical way forward that I can think of is to get into actual problem-solving on non-trivial problems.

I think the main thing to take away from this is: Don't Feel Stupid If You Find It Hard. There really really are a lot of factors which go into solving a machine learning problem, and you're going to have to contend with them all. These problems haven't all been solved, and they don't have universal solutions. Your libraries will do much of the heavy lifting, but then leave you without a safety net when you suddenly find yourself leaving the beaten path.