This should be more than enough to create a plain-text language corpus that I can throw at nltk to get some scatter plots of word frequency, maybe the most common trigrams, most common sentences, and so on.
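The nltk side of things might look roughly like this sketch (the toy sentence and the whitespace tokenisation are my own simplifications; a real corpus would use a proper tokeniser):

```python
import nltk

# Toy stand-in for the scraped corpus text
text = "the cat sat on the mat and the cat slept"

# Plain whitespace split: nltk.word_tokenize works too, but needs
# the extra 'punkt' data download, which this sketch avoids
tokens = text.split()

# Word frequencies, and trigram frequencies via nltk.ngrams
freq = nltk.FreqDist(tokens)
trigram_freq = nltk.FreqDist(nltk.ngrams(tokens, 3))

print(freq.most_common(3))
print(trigram_freq.most_common(2))
```

`FreqDist` is a `Counter` subclass, so `most_common` gives exactly the ranked lists you would feed into a frequency plot.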
Step 1: Make a mirror of your organisation's website
Ease-of-use: Really easy
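The post doesn't name the mirroring tool, but wget's mirror mode is the usual choice for this job; something along these lines (the flags and the example.org URL are my assumptions, not the author's exact command):

```shell
# Recursively mirror the site, rewriting links so the copy
# browses locally, and staying within the starting host.
wget --mirror \
     --convert-links \
     --adjust-extension \
     --no-parent \
     https://www.example.org/
```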
Step 2: Scrape out the language parts
Tool: Beautiful Soup (Python)
Beautiful Soup is currently pushing my laptop's CPU to 100% scraping 250m of web pages, and it seems to be doing a good job. I really love the description of the software on the home page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser.
For this kind of work, you really want something that is robust to minor quirks and hassles. The goal is to get the most information for the least amount of work; time spent polishing the data set is wasted effort for most people's purposes.
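As a minimal sketch of that "robust extraction" step, this is roughly how Beautiful Soup pulls the language parts out of a messy page (the HTML snippet is a made-up example, not from the corpus):

```python
from bs4 import BeautifulSoup

# Deliberately scruffy HTML, with an unclosed <p> — the kind of
# page the parser is happy to shrug at
html = """
<html><head><title>Demo</title><style>p { color: red }</style></head>
<body><script>var tracking = 1;</script>
<p>First paragraph.<p>Second paragraph.</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove script and style blocks so their contents don't
# pollute the language corpus
for tag in soup(["script", "style"]):
    tag.decompose()

# One line of text per element, whitespace trimmed
text = soup.get_text(separator="\n", strip=True)
print(text)
```

Running each mirrored file through something like this and concatenating the results gives the plain-text corpus for step 3.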
Anyway, enough for now...