Thursday, June 17, 2010

'Python' vs 'python' @ (Lexical Dispersion Plot)

Something I love about mucking about with language is finding neat features that look interesting even when it's hard to figure out what, if anything, they mean. I downloaded 50m of HTML files from the site, took the text out and then loaded it into the nltk library. I then did a lexical dispersion plot on two keywords, "Python" and "python". A lexical dispersion plot shows where in a text a word occurs: the x-axis is the offset into the text, counted in words, and every occurrence gets a mark. If the files were organised by time, for example, the left part of this plot (near x=0) would be the oldest text, and the right part would be the newest text. My files weren't organised along any meaningful axis, so the offset only really shows how the data clusters internally. Ideally I'd order the text files by last-modified date so that we could do a time analysis on the dataset.
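
For the curious, here is a minimal sketch of the data a dispersion plot encodes: the word offsets at which each keyword turns up. The crude regex tokeniser and the function name dispersion_offsets are assumptions made for illustration, not part of nltk. In nltk itself, wrapping a token list in nltk.Text and calling its dispersion_plot method with the list of keywords draws the chart.

```python
import re


def dispersion_offsets(text, targets):
    """Map each target word to the token offsets where it occurs (case-sensitive)."""
    # Crude tokeniser: runs of word characters. An assumption for this sketch;
    # nltk ships more careful tokenisers.
    tokens = re.findall(r"\w+", text)
    offsets = {word: [] for word in targets}
    for i, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(i)
    return offsets


# Made-up sample text, purely for illustration.
sample = "Python is fun. I wrote python code, then more Python."
print(dispersion_offsets(sample, ["Python", "python"]))
```

Because the match is case-sensitive, "Python" and "python" end up as separate rows, which is exactly what makes the comparison possible.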

Clearly, "Python" is used far more often than "python" within the site. However, near offset=0 there's a distinct cluster of "python" uses. Maybe there's one rogue wiki editor who just likes using the lower case. Maybe it's used in links. Who knows? What is clear, though, is that the Python Software Foundation is very active in its pursuit of proper capitalisation of its trademark!
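
One way to put a number on "there's a cluster near offset=0" rather than eyeballing it is to chop the text into equal slices and count the hits in each. This is only a sketch: bin_counts is a made-up helper, and the offsets in the example are invented, not taken from my dataset.

```python
from collections import Counter


def bin_counts(offsets, total_tokens, bins=10):
    """Count how many offsets fall in each of `bins` equal-width slices of the text."""
    # Integer arithmetic maps an offset to its slice; min() keeps an offset
    # equal to total_tokens from spilling past the last slice.
    counts = Counter(min(o * bins // total_tokens, bins - 1) for o in offsets)
    return [counts.get(b, 0) for b in range(bins)]


# Hypothetical offsets for a lower-case keyword in a 1000-token text.
print(bin_counts([3, 10, 42, 57, 640], total_tokens=1000, bins=5))
```

A lopsided first slice would back up the impression from the plot, although it still wouldn't say why the cluster is there.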

I'd have some more stuff to post, but I got sidetracked by fine-tuning my wget arguments to get a better dataset...