Monday, June 15, 2015

A Twitter Memebot in Word2Vec

I wanted to explore some ideas with Word2Vec to see how it could potentially be applied in practice. I thought I would take a run at a Twitter bot that tries to do something semantically interesting and create new content. New ideas, from code.


Here's an example. Word2Vec is all about finding some kind of underlying representation of the semantics of words, and allowing some kind of traversal of that semantic space in a reliable fashion. It's about other things too, but what gets me really excited is that it seems to mirror the way we humans tend to form word relationships.

Let's just say I was partially successful. The meme I've chosen above is one of the better results from the work, but there were many somewhat-interesting outputs. I refrained from making the Twitter bot autonomous, as it had an unfortunate tendency to lock onto the most controversial tweets in my timeline, make some hilarious but often unfortunate inferences from them, and then meme them. Yeah, I'm not flicking that particular switch, thanks very much!

The libraries in use for this tutorial can be found at:

  • https://github.com/danieldiekmeier/memegenerator
  • https://github.com/danielfrg/word2vec
  • https://github.com/tweepy/tweepy

I recommend tweepy over other Twitter API libraries, at least for Python 3.4, as it was the only one that worked for me on the first try. I never went back to the others for a second attempt, because I already had a working solution.

You'll need to go and get some Twitter API keys. I don't remember all the steps for this; I just kind of did it on instinct. There's a Stack Overflow question on the topic that may help (http://stackoverflow.com/questions/1808855/getting-new-twitter-api-consumer-and-secret-keys), though it's not what I used. Good luck :)
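
Once you have the four values (consumer key and secret, access token and secret), wiring them into tweepy is only a few lines. This is just tweepy's standard OAuth setup rather than anything specific to the bot, and the key values below are placeholders:

import tweepy

# Placeholders -- substitute the values from your Twitter app's settings page.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

# An authenticated client for the rest of the bot to use.
api = tweepy.API(auth)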

This particular Twitter bot will select a random tweet from your timeline, then comment on it in the form of a meme. The relevance of the output to those tweets is a bit hit-and-miss, to be honest. This could probably be solved by using topic-modelling rather than random selection to find the most relevant keywords from the tweet.
public_tweets = api.home_timeline() 
will fetch the most recent tweets from the current user's timeline. The code then chooses a random tweet and focuses on words that are longer than 3 characters (a rough proxy for 'interesting' or 'relevant'). From these, we extract up to four words. The goal is to produce a meme of the form "A is to B as C is to D". A, B and C are random words chosen from the tweet; D is a word found using word2vec. The fourth word is used to choose the image background by doing a Flickr search.
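
Roughly, that selection step looks like the sketch below. The punctuation stripping and the shuffle are my own guesses at the details rather than a copy of the original code, and api is the authenticated tweepy client from earlier:

import random

public_tweets = api.home_timeline()   # most recent tweets from the home timeline
tweet = random.choice(public_tweets)  # pick one at random

# Lower-case, strip basic punctuation, and keep words longer than 3 characters
# as a rough proxy for 'interesting'.
words = [w.strip('.,!?:;#@').lower() for w in tweet.text.split()]
candidates = [w for w in words if len(w) > 3]

random.shuffle(candidates)
ws = candidates[:4]   # up to four words: three for the analogy, one for the Flickr search

With a handful of words in hand, the word2vec part is just a couple of lines: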
indexes, metrics = model.analogy(pos=[ws[0], ws[1]], neg=[ws[2]], n=10)
ws.append(model.vocab[indexes][0])
The first line gets a list of ten candidate words for the end of our analogy; the second just picks the first (best-scoring) one.
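
For context, here's roughly how that step hangs together end to end. The model path is a placeholder, the example words stand in for whatever was pulled from the tweet, and the vocabulary filtering is an addition of mine to avoid lookups failing on words the model has never seen:

import word2vec

# Load a pre-trained model from disk; 'vectors.bin' is a placeholder path.
model = word2vec.load('vectors.bin')

# Stand-in for words pulled from a tweet (the classic king/man/woman example).
ws = ['king', 'woman', 'man', 'castle']

# Only keep words the model actually knows about.
ws = [w for w in ws if w in model.vocab]

if len(ws) >= 3:
    # Ten candidate words to complete the analogy, best matches first.
    indexes, metrics = model.analogy(pos=[ws[0], ws[1]], neg=[ws[2]], n=10)
    ws.append(model.vocab[indexes][0])   # with a decent model this lands on something like 'queen'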

For example, a human might approach this as follows. Suppose the tweet is:

"I really love unit testing. It makes my job so much easier when doing deployments."

The word selection might result in "Testing, Easier, Deployments, Job". The goal would be to come up with a word for "Testing is to easier as Deployments is to X" (over an image of a job). I might come up with the word "automatic". Who knows -- it's kind of hard to relate all of those things. 
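
In code, that worked example boils down to a single analogy query. What comes back depends entirely on the model you've trained, so treat this as illustrative only (and it assumes all three words are in the model's vocabulary):

import word2vec

model = word2vec.load('vectors.bin')   # placeholder path to a pre-trained model

# "Testing is to easier as deployments is to X", phrased the same way the bot does it.
indexes, metrics = model.analogy(pos=['testing', 'easier'], neg=['deployments'], n=10)
print(model.vocab[indexes])   # candidate words for X, best match first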

Here's an image showing another dubious set of relationships.


There's some sense in it -- it certainly seems that breaking things can be trivial, that flicking a switch is easy, and that questioning is a bit like testing. The background evokes both randomness and goal-seeking. However, getting any more precise than that is drawing a long bow, and a lot of those relationships came up pretty randomly.

I could imagine this approach being used to create suggested memes for a human reviewer, in a supervised setup. However, it's not really ready for autonomous use, falling short on both semantic coherence and sensitivity to content. That said, I do think it shows that it's pretty easy to pick these technologies up and put them to basic use.

There are a bunch of potential improvements I can think of which should result in better output. Focusing the word search towards the topic of the tweet is one. Selecting analogy words that are reasonably closely related to each other would be another, and that's quite doable with word2vec itself.
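
As one sketch of that second idea: word2vec's cosine query pulls back the nearest neighbours of a word, which could be used to keep only tweet words that sit close to each other in the vector space. The helper function and the neighbourhood size below are my own invention, not part of the original code:

import word2vec

model = word2vec.load('vectors.bin')   # placeholder path to a pre-trained model

def related_words(anchor, candidates, n=50):
    """Return the candidate words that appear among the anchor's n nearest neighbours."""
    indexes, metrics = model.cosine(anchor, n=n)   # assumes anchor is in the vocabulary
    neighbours = set(model.vocab[indexes])
    return [w for w in candidates if w in neighbours]

# e.g. keep only tweet words reasonably close to the first word we picked:
# related_words('testing', ['easier', 'deployments', 'job'])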

Understanding every step of this system requires a more involved explanation of what's going on, so I think the next few posts might be targeted at the intermediate steps and how they were arrived at, plus a walkthrough of each part of the code (which will be made available at that point).

Until next time, happy coding!