Thursday, December 15, 2016

Wrangling downloading my own blog articles

99.9% of everything at the moment appears to be wrangling basic data ingest.

In a fit of semi-directionless curiosity, I decided to try expanding on my previous post by wiring up some kind of automatic blogging tool just to see if I could. This problem has many aspects. Let the yak shaving begin.

The first thing I did was start thinking about requirements a bit. I didn't, you know, write any down. But I did think about it.

I decided I needed an AI bot which could automatically write blog posts for me. I assume many people have tried and failed, or possibly even tried and succeeded, but I didn't want to let knowledge get in the way of the outcome and just ran at it.

Here is my design:
  -- A module for downloading input source data to feed into the AI
  -- A module for running learning jobs based on source data
  -- A module to write blog articles
  -- A module to publish / print out the blog articles

I thought I'd wrangle some kind of thing which would create text of the required length using some kind of minimal implementation like a markov model trained on my old blog posts. I should have just trained it on gutenberg text or something, because it turns out that downloading my old blog posts is Harder Than It First Appears. Not really difficult, but longer than the hour or so I thought it would take.

There are basically two options: web scraping or the blogger API. The tool du jour for scraping in Python is called scrapy. I decided to use the blogger API, not because I thought it was inherently better than scraping, but rather because I expect I'll get a reasonably structured object in memory at the other end and won't have to do a lot of page interpretation. Also, if I want to publish directly to blogger later down the track, it will probably use the same mechanism.

Possibly because I'm bad at searching and terrible at web programming, this took me ages to get going. In fact, I haven't gotten going yet. I have merely jumped the first couple of an unknown number of hurdles. Programming is basically like playing an infinite scrolling computer game of randomly varying difficulty until you get to the end of the current level.

The plan was to create a minimal implementation which could then be bootstrapped into a smarter implementation. I thought people might like seeing it come together (I'd open source it, and then write about the process of building it here).

Here's where I've gotten:
  -- I've installed the requisite API
  -- I've discovered that I don't just need a 'project key', I actually a full Oauth2 secret and key
  -- I read the source code to figure out that the filename that is supplied into the sample tool method in the library is actually only used to derive a directory to look for the "client_secrets.json" file, so I can actually supply a fake filename in order to tell it which directory to look for the file in. It should have just asked for a directory.

So, instead of a really cool Markov model, instead I have gotten three lines into reproducing the online walkthrough. Only took a couple hours...

More later! Wish me luck...

No comments:

Post a Comment