Tuesday, June 15, 2010

Organisational language analysis tools

Okay, so I'm just self-documenting some home hacking I'm doing at the moment to try to add some spice to my upcoming PyCon AU presentation. I thought I'd try to put together some code anyone could use to do some basic language analysis on their organisation. So far I'm still building the language corpus I'll be using. Tonight I've been using BeautifulSoup to parse web pages and extract the language components (defined, for the time being, as the string parts of all the tags in a page).

This should be more than enough to create a plain-text language corpus that I can throw at nltk to get some scatter plots of word frequency, maybe the most common trigrams, most common sentences, etc.
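As a taste of what that analysis might look like, here's a minimal nltk sketch. It assumes the scraped text has already landed in a single file called corpus.txt (the step-two sketch below writes one; the name is just a placeholder), and it uses the current nltk API, where FreqDist supports most_common; older releases spelt this differently.

    import nltk
    from nltk import FreqDist, word_tokenize, trigrams

    # nltk.download('punkt')  # word_tokenize needs the punkt tokeniser data

    with open('corpus.txt') as f:
        tokens = [t.lower() for t in word_tokenize(f.read())]

    fd = FreqDist(tokens)
    print(fd.most_common(20))        # the organisation's favourite words
    fd.plot(20)                      # frequency plot (needs matplotlib)

    tri = FreqDist(trigrams(tokens))
    print(tri.most_common(10))       # its favourite three-word runs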

Step 1: Make a mirror of your organisation's website
Tool: Wget
Ease-of-use: Really easy
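
For reference, a single command does the job; example.org below is just a stand-in for your own site:

    wget --mirror --convert-links --no-parent http://example.org/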

Step 2: Scrape out the language parts
Tool: Beautiful Soup (Python)
Ease-of-use: Straightforward
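
Here's a minimal sketch of the scraping step, assuming wget has left its mirror in a directory named example.org and that we want everything in a single corpus.txt (both names are placeholders). The import is the old BeautifulSoup 3 style; with bs4 it's `from bs4 import BeautifulSoup`.

    import os
    from BeautifulSoup import BeautifulSoup  # bs4 users: from bs4 import BeautifulSoup

    def page_text(path):
        # Parse one mirrored page and join the string parts of all its tags.
        with open(path) as f:
            soup = BeautifulSoup(f.read())
        # findAll(text=True) yields every text node in the document; note it
        # also sweeps up <script>/<style> contents, which you may want to filter.
        return ' '.join(s.strip() for s in soup.findAll(text=True) if s.strip())

    with open('corpus.txt', 'w') as out:
        for dirpath, _, files in os.walk('example.org'):  # wget's mirror directory
            for name in files:
                if name.endswith(('.html', '.htm')):
                    out.write(page_text(os.path.join(dirpath, name)) + '\n')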


BeautifulSoup is currently pushing my laptop CPU to 100% scraping about 250 MB of web pages, and it seems to be doing a good job. I really love the description of the software on the home page:

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

For this kind of work, you really want something that is robust to minor quirks and hassles. The goal is to get the most information for the least work; spending a lot of time polishing a data set is wasted effort for most people's purposes.

Anyway, enough for now...