Thursday, June 17, 2010

'Python' vs 'python' @ python.org (Lexical Dispersion Plot)


Something I love about mucking about with language is the neat features that seem interesting but are hard to pin down in terms of what, if anything, they mean. I downloaded 50 MB of HTML files from http://www.python.org, extracted the text and then loaded it into the nltk library. I then did a lexical dispersion plot on two keywords, "Python" and "python". A lexical dispersion plot basically shows where in a text a word occurs. If the files were organised by time, for example, the left part of this plot (near x=0) would be the oldest text, and the right part would be the newest text. My files weren't organised along any meaningful axis, so the offset only really shows how the data clusters internally. Ideally I'd order the text files by last-modified date so that a time analysis could be done on the collection.
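For anyone wanting to reproduce the plot, here's a minimal sketch. It assumes the scraped text has already been saved as plain .txt files in a corpus/ directory (that directory name is just illustrative):

    # Minimal sketch: load the scraped plain-text files and draw the
    # lexical dispersion plot. Assumes the BeautifulSoup stage has already
    # written one .txt file per page into corpus/ (name is illustrative).
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus = PlaintextCorpusReader('corpus/', r'.*\.txt')
    text = nltk.Text(corpus.words())

    # One vertical stripe per occurrence, word offset along the x-axis
    # (requires matplotlib)
    text.dispersion_plot(["Python", "python"])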

Clearly, "Python" is used far more often than "python" within the site. However, round near offset=0, there's clearly a cluster of "python" uses. Maybe there's one rogue wiki editor who just likes using the lower case. Maybe it's used in links. Who knows? It's clear though, that the Python Software Foundation is very active in its pursuit of proper capitalisation of its trademark!

I'd have some more stuff to post, but I got sidetracked by fine-tuning my wget arguments to get a better dataset... 

Wednesday, June 16, 2010

Basic web-to-language-stats processing done

Following process completed:
-- Use wget to capture a web corpus
-- Use BeautifulSoup to pull out the language bits (note: discovered nltk has some web-scraping capabilities; need to compare)
-- Use nltk to read in plain text from web corpus
-- Print a few organisational language stats and plot a couple of neat graphs (sketched below)
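For the record, the stats step is roughly the following sketch; the corpus/ directory name is illustrative, and it assumes the BeautifulSoup stage has already written one plain-text file per page:

    # Sketch of the stats step: read the plain-text corpus into nltk and
    # print/plot some basic numbers. The corpus/ path is illustrative.
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus = PlaintextCorpusReader('corpus/', r'.*\.txt')
    words = [w.lower() for w in corpus.words() if w.isalpha()]

    print("Total words:", len(words))
    print("Vocabulary size:", len(set(words)))
    print("Lexical diversity:", len(set(words)) / float(len(words)))

    # 50 most common words as a frequency plot (requires matplotlib)
    nltk.FreqDist(words).plot(50)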

Need to develop:
-- Automated downloading through code rather than manual wget stage
-- Possibly add PDF scraping
-- Store files so they can go into the text database organised by creation date/time
-- Make the nltk processing time-aware so that time charts can be plotted
-- Clean up the class design so it's (more) elegant and useful

Need to document:
-- Technology stack involved

Possible future developments:
-- Build web page out of the information found
-- Try out on a few different organisations
-- Add some metadata / parameters, build some more interesting info
-- Extend capability to show where the organisation is cited on mainstream news websites

Tuesday, June 15, 2010

Organisational language analysis tools

Okay, so I'm just self-documenting some home hacking I'm doing at the moment to try to add some spice to my upcoming PyCon AU presentation. I thought I'd try to put together some code anyone could use to do some basic language analysis on their organisation. So far, I'm still building the language corpus which I'll be using. Tonight I've been using BeautifulSoup to parse web pages and extract the language components (defined for the time being as the string part of all the tags in a page).

This should be more than enough to create a plain text language corpus that I can throw at nltk in order to get some scatter plots of word frequency, maybe most-common trigrams, most-common sentences etc.
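A rough sketch of the trigram and sentence counts (again, the corpus/ path is illustrative):

    # Sketch: most common trigrams and most common verbatim-repeated
    # sentences. The corpus/ path is illustrative.
    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus = PlaintextCorpusReader('corpus/', r'.*\.txt')

    # Ten most common trigrams across the corpus
    trigram_counts = nltk.FreqDist(nltk.trigrams(corpus.words()))
    for trigram, count in trigram_counts.most_common(10):
        print(count, ' '.join(trigram))

    # Five most common sentences (repeated boilerplate shows up here)
    sentence_counts = nltk.FreqDist(' '.join(s) for s in corpus.sents())
    for sentence, count in sentence_counts.most_common(5):
        print(count, sentence)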

Step 1: Make a mirror of your organisation's website
Tool: Wget
Ease-of-use: Really easy
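I'm still fiddling with the exact arguments, but wrapped in Python (with a view to automating the download stage later), the call looks something like the sketch below. The flags shown are my guess at a sensible starting point rather than the final set:

    # Hedged sketch: kick off the wget mirror from Python rather than by
    # hand. The flags are a reasonable starting point, not necessarily the
    # exact arguments I'll end up using.
    import subprocess

    subprocess.call([
        "wget",
        "--mirror",         # recurse and only re-fetch newer files
        "--no-parent",      # stay below the starting path
        "--convert-links",  # rewrite links for local browsing
        "--wait=1",         # be polite to the server
        "http://www.python.org/",
    ])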

Step 2: Scrape out the language parts
Tool: Beautiful Soup (Python)
Ease-of-use: Straightforward
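Roughly, the extraction step looks like the sketch below. The file paths are illustrative, and it's shown with the modern bs4 import (the BeautifulSoup 3 import I'm actually using is slightly different):

    # Sketch of the scraping step: pull the string parts out of every tag
    # and write them to one plain-text file per page. Paths are
    # illustrative, and this uses the modern bs4 import.
    from bs4 import BeautifulSoup

    def extract_text(html_path, txt_path):
        with open(html_path, encoding='utf-8', errors='ignore') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')

        # Keep every non-blank text node whose parent isn't a script or
        # style tag.
        chunks = [t.strip() for t in soup.find_all(text=True)
                  if t.parent.name not in ('script', 'style') and t.strip()]

        with open(txt_path, 'w', encoding='utf-8') as out:
            out.write('\n'.join(chunks))

    extract_text('mirror/index.html', 'corpus/index.txt')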


BeautifulSoup is currently pushing my laptop CPU to 100% scraping 250 MB of web pages, and it seems to be doing a good job. I really love the description of the software on its home page:

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

For this kind of work, you really want something that is robust to minor quirks and hassles. The aim is to get the most information for the least amount of work, and time spent polishing a data set is wasted effort for most purposes.

Anyway, enough for now...