Wednesday, June 16, 2010

Basic web-to-language-stats processing done

The following process is now complete:
-- Use wget to capture a web corpus
-- Use BeautifulSoup to pull out the language bits (note: discovered that nltk has some web scraping capabilities of its own; need to compare the two)
-- Use nltk to read in plain text from web corpus
-- Print a few organisational language stats and plot a couple of neat graphs
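
As a rough illustration of the extract-then-count steps above, here is a minimal sketch. It is not the code used for this post: to keep it dependency-free it swaps in the standard library's html.parser for BeautifulSoup and collections.Counter for nltk's frequency tools. The names TextExtractor, extract_text and word_stats are made up for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the page's text with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def word_stats(text):
    """Very crude tokenisation (hypothetical stand-in for nltk tokenising)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "tokens": len(words),            # total words
        "types": len(counts),            # distinct words
        "top": counts.most_common(5),    # most frequent words
    }
```

Running word_stats(extract_text(page)) over each file that wget saved would give per-page numbers, which could then be summed across the corpus before plotting.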

Need to develop:
-- Automated downloading through code rather than manual wget stage
-- Possibly add PDF scraping
-- Store files so they can go into the text database organised by creation date/time
-- Make the nltk processing time-aware so that charts over time can be plotted
-- Clean up the class design so it's (more) elegant and useful
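
A possible starting point for the first and third items, sketched with the standard library only. Everything here is an assumption rather than existing code: the function names, the year/month/day folder layout, and the use of a caller-supplied timestamp (fetch time by default) in place of a real creation date, which would still need to be worked out per page.

```python
import os
from datetime import datetime
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def dated_path(root, url, when):
    """Build root/YYYY/MM/DD/<host_and_path>.html for a given timestamp."""
    parsed = urlparse(url)
    name = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "index"
    return os.path.join(
        root,
        when.strftime("%Y"),
        when.strftime("%m"),
        when.strftime("%d"),
        name + ".html",
    )


def save_page(root, url, when=None, fetcher=fetch):
    """Fetch a page and store it under a date-based folder.

    `when` defaults to the current time (an assumption: fetch time is
    standing in for the document's creation date).
    """
    when = when or datetime.now()
    path = dated_path(root, url, when)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(fetcher(url))
    return path
```

Because the date is encoded in the folder names, a later time-aware step could walk the tree and group pages by day or month without needing a separate index. The fetcher argument is there so the download step can be replaced or stubbed out.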

Need to document:
-- Technology stack involved

Possible future developments:
-- Build web page out of the information found
-- Try out on a few different organisations
-- Add some metadata / parameters, build some more interesting info
-- Extend capacity to show information where the organisation is cited on standard news websites