Thursday, June 17, 2010

'Python' vs 'python' @ python.org (Lexical Dispersion Plot)


Something I love about mucking about with language is the neat features that seem interesting, even when it's hard to figure out what, if anything, they mean. I downloaded 50 MB of HTML files from http://www.python.org, extracted the text and then loaded it into the NLTK library. I then did a lexical dispersion plot on two keywords, "Python" and "python". A lexical dispersion plot shows where in a text a word occurs: each occurrence is marked at its word offset from the start of the text. If the files were organised by time, for example, the left part of this plot (near x=0) would be the oldest text, and the right part would be the newest text. My files weren't organised along any meaningful vector, so the offset only really shows how the data clusters internally. Ideally I'd order the text files by last-modified dates so that we could do a time analysis on the database.
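For anyone who wants to try something similar, here is a minimal sketch of the data behind a dispersion plot, using only the standard library. It is not the script I actually ran: the names `TextExtractor`, `html_to_text`, `tokenize` and `dispersion_offsets` are made up for illustration, and the tokenizer is a deliberately crude assumption (runs of ASCII letters only).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of one HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Crude tokenizer (an assumption): keep runs of ASCII letters, case preserved."""
    return re.findall(r"[A-Za-z]+", text)


def dispersion_offsets(tokens, targets):
    """Map each target word to the token offsets where it occurs (case-sensitive)."""
    offsets = {word: [] for word in targets}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets


# Two tiny stand-in pages; the real input would be the downloaded files.
pages = [
    "<p>Python is fun. I run python scripts.</p>",
    "<script>var python = 1;</script><p>Python rocks</p>",
]
tokens = []
for page in pages:
    tokens.extend(tokenize(html_to_text(page)))

offsets = dispersion_offsets(tokens, ["Python", "python"])
```

Each list in `offsets` is one row of the plot: draw a tick at every offset on the x axis. With NLTK installed, `nltk.Text(tokens).dispersion_plot(["Python", "python"])` should draw the same thing directly. If the files' modification times are meaningful (an assumption, it depends on how they were downloaded), reading them in `sorted(paths, key=os.path.getmtime)` order before concatenating would turn the offset into a rough time axis.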

Clearly, "Python" is used far more often than "python" within the site. However, near offset=0 there's a noticeable cluster of "python" uses. Maybe there's one rogue wiki editor who just likes using the lower case. Maybe it's used in links. Who knows? It's clear, though, that the Python Software Foundation is very active in its pursuit of proper capitalisation of its trademark!

I'd have some more stuff to post, but I got sidetracked by fine-tuning my wget arguments to get a better dataset... 

3 comments:

Anonymous said...

To refer to the CPython interpreter, lower case "python" is recommended -- this might be one reason.

Anonymous said...

Can't you download python.org's content as plain source code? Wouldn't that make a lot of things easier? For example, if it's kept in a revision control system, you could follow the evolution of Python vs python over time.

Tennessee Leeuwenburg said...

That could explain it, for sure...

As for downloading the source code, you could definitely use revision control to more finely track language evolution over time.

If you wanted to get really serious (say if you were a marketing firm or similar) you could put a weekly snapshot of your firms of interest into a repository to get great historical information.

The sky is really the limit with this stuff if you are happy to put the time in. But you can get some basic impressions for very little effort, even if those impressions raise more questions than they answer!