Friday, December 10, 2010

Service-based URL shorteners

Generic URL-shortening services should be made service-specific, to avoid malware site redirects.

It would work like this. A hypothetical image-based URL shortener would check that the site pointed to met any of the following qualifiers:
  -- Content of URL is a recognised image format. Require proper mime-type and image headers.
  -- Content of URL is a recognised image service provider, e.g. Facebook photo, Flickr photo etc
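
The qualifier check above could be sketched like this (a minimal illustration; the mime types, host list and function name are all hypothetical):

```python
# Sketch of the qualifier check a hypothetical image-based URL
# shortener might run before minting a short link.
from urllib.parse import urlparse

RECOGNISED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif"}
RECOGNISED_IMAGE_HOSTS = {"www.flickr.com", "www.facebook.com"}

def qualifies_as_image(url, content_type):
    """Accept if the content is a recognised image format, or the
    URL points at a recognised image service provider."""
    host = urlparse(url).hostname
    return (content_type in RECOGNISED_IMAGE_TYPES
            or host in RECOGNISED_IMAGE_HOSTS)
```

In a real service you'd fetch the URL and verify the image headers too, not just trust the declared mime-type.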

This would prevent people doing something like the following:
"Hey check out this awesome photo of a rabbit SCUBA diving"

which actually redirects you to something else entirely, like a shopping website or, worse, something seedy: a malware site, or actual criminal content.

A similar hypothetical tube-based URL shortener would do the same thing, but check that the content was hosted on YouTube, or a variety of other family-friendly video hosting sites.

That way, you still get URL shortened goodness, but also content safety. 

That's all!

Tuesday, December 7, 2010

Idea: Crowdcasting News

Okay, here's how I want to get my news now. Think of it as Digg meets the social graph.

I have some kind of trust network of friends. A vanilla graph would do, but let's face it, I value news from some sources more than others. Anyone who stumbles on some news (or generates some) might tweet it at the moment, or Facebook it. I want them to amplify it instead.

There is a database somewhere that tracks all the amplification clicks against URLs, keeping, say, 2-3 days' worth of information. It then extracts those clicks which originated from my trust network, then weights all the URLs according to the number of clicks. If a URL meets the threshold, I get it as news. In fact, what I specify is my desired number of news items per day, and the threshold adapts.
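
A minimal sketch of that weighting step (all names hypothetical; fixing the number of items per day is what makes the threshold adaptive):

```python
from collections import Counter

def todays_news(clicks, trust_weights, items_per_day):
    """Score each URL by trust-weighted amplification clicks from
    my network, then surface the top N as today's news."""
    scores = Counter()
    for url, clicker in clicks:
        # Clicks from outside my trust network carry zero weight.
        scores[url] += trust_weights.get(clicker, 0.0)
    return [url for url, score in scores.most_common(items_per_day)
            if score > 0]
```
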

If I had that, I would completely replace newspapers. I would have an aggregated news feed and an aggregated comment feed (currently newspapers+slashdot+digg, and facebook+twitter+email).

I think that would be cool.

Friday, November 19, 2010

Daft ideas on social graphs

Here's my issue with Facebook security. It's not information bleed to third parties, which I agree is important, but information bleed amongst social networks that I would like to keep separate. Work/Social being the primary division I would like to maintain. It's a lot *less* important to me if some marketing firm has my information than if my workmates know I faked sick leave or my friends are showered with geek updates. I would like multiple social graphs please. And I would like to give my blessing on a post or other Facebook object to a graph at a time, not a person at a time or an application at a time. I'm happy with connecting with applications and groups using "Like" or "Join", and I'm happy with friends-of-friends seeing most of my photos/updates. Within a single graph.

I would like the following graphs: 

- The supergraph. Everyone in all of my graphs.
- A professional subgraph. Workplace + geek.
- A social subgraph: Friends and family.
  - A family sub-subgraph
  - Logically, I may also need a friends sub-subgraph
  - A dark subgraph: My connections are invisible to others in the graph

Maybe I also want a public broadcast faux graph.

That way, for example, my "Farmville Lame App" could suck down my social subgraph permissive fields, but be denied information on how much I earn. Why? Because it only gets fed what I feed my social graph. 
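
A toy version of that permission model (graph names and field names are made up):

```python
# Per-graph field permissions: an app attached to a graph sees only
# what that graph is allowed to see.
GRAPH_PERMITTED_FIELDS = {
    "professional": {"name", "employer", "salary"},
    "social": {"name", "photos", "status_updates"},
}

def fields_visible_to_app(app_graph, requested_fields):
    """The app gets the intersection of what it asked for and what
    its graph permits -- salary never leaves the professional graph."""
    permitted = GRAPH_PERMITTED_FIELDS.get(app_graph, set())
    return set(requested_fields) & permitted
```
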

Thursday, July 29, 2010

+1 for xmind... download and use it, it's awesome!

I've just downloaded this, and I've used it to create about 9 diagrams in two days. It's totally awesome. It can be used to create the following charts with no fuss at all, with nice icons if you would like to include that sort of thing:
  -- Brainstorming map
  -- Organisation chart
  -- Spreadsheet-style chart
  -- Fishbone chart

The concept map style chart really helped me to capture the major tasks, goals, responsibilities and relationships for me and my team, bringing together kinds of information that are otherwise quite difficult to relate to one another. It really helped me talk to them about how things tie together.

I've also used the other chart styles to map out processes, systems and challenges, showing my boss and various colleagues how things hang together.

Tuesday, July 13, 2010

Review of the Nexus One

Okay, so the Nexus One has just been released in Australia, and I am now a proud owner. This is my first foray into Phones 2.0, so I wasn't quite sure what to expect. I also have insanely high standards, and the quality of the computing platform comes into my evaluation.

Review: Three and a Half Stars / Pretty good / Somewhat exceeds expectations

Based on the iPhone4 articles I've read, antennae issues aside, the iPhone4 would rank at 4 stars. (out of 5 in both cases).

The Nexus One has a good enough processor. It successfully gets over the bar of being an acceptable computing environment. I believe it's a 1GHz chip, and I would say this is about the minimum you'd want before regarding a phone as a proper computing platform.

The best thing about the Android is that it is open source. This means you can install stuff yourself, use the phone to do what you want, access any websites you want, and basically freely use your own property. I've already used this to my own advantage and this was an overriding concern for me. That's where my perception of the iPhone lost a star.

There are a few negatives I can report. The battery life is lousy. I'd expect this to be true of all the competitors, but it's just annoying to have to recharge twice a day (if you are using it somewhat actively). I don't think it's Google/HTC's fault as such, but it's an annoyance which I expect is present on every new gen phone.

It doesn't have a front-facing camera, which means no video calls and no potential for cool face-recognition software. It could do with an extra camera.

It doesn't do such a great job of flash. It will play YouTube (thank goodness) but if you stray to other sites with flash payloads, you'll be in trouble. I know "it's coming" but it's not there yet.

The side buttons are a bit annoying. I accidentally push one of them every time I'm putting the phone back in its carry-case, usually the volume. The buttons at the bottom of the phone are also a bit random. There are four permanent buttons, plus a trackball. Unlike other users that I've read comments from, I don't find the trackball to be completely useless, because it lets you be more accurate about what you are selecting with the cursor. You wouldn't type with it, but it's not too bad for careful navigation or some games. That said, it does seem to be taking up some valuable real estate. There's a permanent "search" button which is basically pointless.

The voice recognition is far from perfect, but it's also much quicker than typing. If they can figure it out, then it will be a great input mechanism. Don't talk for too long or it will stop listening and abandon the attempt.

The keyboard is okay, and probably comparable to other phones. No comment really.

I haven't gotten Python up and running yet. I failed on the first attempt, but haven't started to analyse the problem, so I expect it will be working soon.

Anyway, that's my 2c!

Thursday, June 17, 2010

'Python' vs 'python' (Lexical Dispersion Plot)

Something I love about mucking about with language is the neat features that seem interesting, but where it's hard to figure out what, if anything, they mean. I downloaded 50m of html files, took the text out and then loaded it into the nltk library. I then did a lexical dispersion plot on two keywords, "Python" and "python". A lexical dispersion plot basically shows where in a text file a word occurs. If the files were organised by time, for example, the left part of this plot (near x=0) would be the oldest text, and the right part would be the newest text. My files weren't organised along any meaningful vector, so the offset only really shows how the data clusters internally. Ideally I'd order the text files by last-modified dates so that we could do a time analysis on the database.
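
Under the hood a dispersion plot is just word offsets; here's a minimal sketch of the computation (nltk's `Text.dispersion_plot` does the actual drawing):

```python
def dispersion_offsets(tokens, word):
    """The x-positions a lexical dispersion plot marks: every offset
    in the token stream where the word occurs, case-sensitively."""
    return [i for i, token in enumerate(tokens) if token == word]
```

Because the match is case-sensitive, "Python" and "python" produce two separate stripes on the chart.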

Clearly, "Python" is used far more often than "python" within the site. However, round near offset=0, there's clearly a cluster of "python" uses. Maybe there's one rogue wiki editor who just likes using the lower case. Maybe it's used in links. Who knows? It's clear though, that the Python Software Foundation is very active in its pursuit of proper capitalisation of its trademark!

I'd have some more stuff to post, but I got sidetracked by fine-tuning my wget arguments to get a better dataset... 

Wednesday, June 16, 2010

Basic web-to-language-stats processing done

Following process completed:
-- Use wget to capture a web corpus
-- Use BeautifulSoup to pull out the language bits (note, discovered nltk has some web scraping capabilities, need to compare)
-- Use nltk to read in plain text from web corpus
-- Print a few organisational language stats and plot a couple of neat graphs
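
The BeautifulSoup step above in miniature, sketched here with the stdlib html.parser so the example is self-contained (the real pipeline uses BeautifulSoup):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```
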

Need to develop:
-- Automated downloading through code rather than manual wget stage
-- Possibly add PDF scraping
-- Store files so they can go into the text database organised by creation date/time
-- Use nltk to be time-aware so that time charts can be plotted
-- Clean up the class design so it's (more) elegant and useful

Need to document:
-- Technology stack involved

Possible future developments:
-- Build web page out of the information found
-- Try out on a few different organisations
-- Add some metadata / parameters, build some more interesting info
-- Extend capacity to show information where the organisation is cited in standard news websites

Tuesday, June 15, 2010

Organisational language analysis tools

Okay, so I'm just self-documenting some home hacking I'm doing at the moment to try to add some spice to my upcoming PyCon AU presentation. I thought I'd try to put together some code anyone could use to do some basic language analysis on their organisation. So far, I'm still building the language corpus which I'll be using. Tonight I've been using BeautifulSoup to parse web pages and extract the language components (defined for the time being as the string part of all the tags in a page).

This should be more than enough to create a plain text language corpus that I can throw at nltk in order to get some scatter plots of word frequency, maybe most-common trigrams, most-common sentences etc.
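
Those stats are cheap once the corpus is plain text; most-common trigrams, for example, can be sketched with the stdlib (nltk's FreqDist over nltk.trigrams gives you the same thing with less code):

```python
from collections import Counter

def most_common_trigrams(tokens, n=3):
    """Count adjacent word triples across the token stream and
    return the n most frequent."""
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return Counter(trigrams).most_common(n)
```
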

Step 1: Make a mirror of your organisation's website
Tool: Wget
Ease-of-use: Really easy

Step Two: Scrape out the language parts
Tool: Beautiful Soup (Python)
Ease-of-use: Straightforward

BeautifulSoup is currently pushing my laptop CPU up to 100% scraping 250m of web pages and seems to be doing a good job. I really love the description of the software on the home page:

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

For this kind of work, you really want something which is robust to minor quirks and hassles. The purpose is to get the most information for the least amount of work, and spending a lot of time polishing a data set is going to be wasted for most people's purposes.

Anyway, enough for now...

Monday, May 31, 2010

Facebook group for PyCon AU 2010

PyCon AU Facebook Group

For those who are keen on Facebook, I've created a group for PyCon AU 2010, to be held in Sydney on the 26th and 27th of June (with the organiser's blessing of course). I'll paste in any event updates, and it would be great to get a good number of people joined up in order to increase the visibility of the event.

Yes, you too can be part of the message that PyCon AU is going to be absolutely fantastic. Please leave a comment on the wall and let us know what you're looking forward to.


Tuesday, March 30, 2010

Creative Edge, Toronto: Article on Listening

"My old pal Harry and I are walking in the park, improvising like two jazz musicians - except we're playing with words not melodies. He throws out a line. I have a comeback. He does a riff on my response. Pretty soon we're laughing so hard we're crying. Eventually we collapse, exhausted, on a park bench.

"That was amazing, Harry," I say, "Why don't we do it more often?".

I'm just being social, not really expecting a response, but Harry takes my question seriously. He leans closer, lowering his voice, like he's confiding in me."

Thursday, March 18, 2010

Incoherent Ramblings on Imagination, Semantic Nets, AI

I have been reading my AI textbook again, plus Daniel Dennett's frankly *brilliant* book "Darwin's Dangerous Idea" and I'm afraid it has sparked my overactive imagination!! If you read one serious book this year, make it that one.

Suppose you tried to build an AI entity suchly...

Suppose you start with a network of nodes. Each node is either an axiom node or a composite node. An axiom node is a node which represents something believed to be absolutely true, and un-decomposable at the current level of abstraction. A composite node is one which is not an axiom node. Composite nodes may be grounded (derived from only axiom nodes and other composite nodes) or un-grounded (only partially derived from axiom nodes and other composite nodes). The 'somethings' in the nodes may be facts as we know them (propositions about the objective world, say) or may be undescribed non-propositional nodes which are a part of learned relationships.
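
One way to sketch those node types (all names illustrative; a missing parent stands in for the unknown part of an un-grounded derivation):

```python
class Node:
    """parents=None marks an axiom node; a composite node lists the
    nodes it derives from, with None standing in for an unknown
    (underived) part of its derivation."""
    def __init__(self, name, parents=None):
        self.name = name
        self.parents = parents

    def is_axiom(self):
        return self.parents is None

    def is_grounded(self):
        # Grounded = every ancestry path bottoms out in axiom nodes.
        if self.is_axiom():
            return True
        return all(p is not None and p.is_grounded()
                   for p in self.parents)
```
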

One could "imagine" things by overlaying a supposition network over the knowledge network, perhaps using an inheritance structure (i.e. the imagined world derives from the real world, but some subnetworks' truth propositions are deemed to be different).
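
The overlay could be as simple as a chained mapping, with imagined differences shadowing the real-world beliefs beneath (a sketch of the inheritance idea, not a design):

```python
from collections import ChainMap

# The "real world" knowledge network, reduced to propositions.
knowledge = {"sky_is_blue": True, "pigs_can_fly": False}

# The imagined world inherits everything from the real one, but
# overrides the truth of some propositions without mutating it.
imagined = ChainMap({"pigs_can_fly": True}, knowledge)
```
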

Input streams and output streams are used to embody the agent. As such, the agent is constantly writing new observation nodes into its knowledge network.

Over this knowledge network, rule processes run. These rule processes are also stored in the knowledge network, but are marked as rule processes. A simple rule process would be one which could enforce simple truth relationships. (i.e. If A --> B and A is true, then make B true).
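
That simple rule process is classic forward chaining; a sketch:

```python
def run_rule_process(facts, implications):
    """Enforce 'if A -> B and A is true, make B true' repeatedly
    until no new truths appear (a fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in implications:
            if a in facts and b not in facts:
                facts.add(b)
                changed = True
    return facts
```
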

Nodes would be scanned for identity (Node A is Node B if node A and node B are sufficiently indistinguishable with high confidence) representing network simplification.

More complex rule processes could be "imagined" and run over the imagination network to evaluate their performance. In this way, imagination allows the entity to test potential rules for their truth value.

The model scales to the extent that the network can be decomposed (i.e. the 'locality' factor of the network with respect to the problem at hand). For example, the application of simple rules which operate on, say, 3 nodes would be parallelisable and scale well, while complex rules which, say, operate on all nodes to a depth of N and branching factor of B would not parallelise as effectively.

The fundamental mode of processing would be a build-and-test model of imagination where new knowledge networks are imagined then evaluated, and then incorporated if they perform better. This makes the system a memetic evolutionary system, which seems to me to be a prerequisite for sustainable machine learning.

Learning success comes through untrained reinforcement based on assessment of expectations of future observational input. Because the entity is based on input *streams* and output *streams*, even non-action is a choice of the agent.

All concepts, then, are either grounded concepts (fully understood and linked to base phenomena) or ungrounded concepts (not fully decomposable). I would also suggest that truth generally be a real-valued quantity, but that Absolutely False and Absolutely True also be allowable values.

This model supports "making assumptions" by building an imagination network where, for example, all things which are believed to be 98% true are taken to be absolutely true, then the results evaluated for performance. Experiments can be run in which nodes can be tested for unification, or a node could be split into two nodes then learning and assessment re-performed.
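
The 98% assumption step, as a one-liner over a belief map (hypothetical representation: proposition -> confidence in [0, 1]):

```python
def make_assumptions(beliefs, threshold=0.98):
    """Build an imagination overlay in which anything believed at or
    above the threshold is promoted to absolutely true (1.0)."""
    return {prop: (1.0 if conf >= threshold else conf)
            for prop, conf in beliefs.items()}
```
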

Of course, I haven't even addressed any questions about the initial state of the system, how the learning algorithms actually work (how knowledge is propagated and how reinforcement is applied), whether the AI needs multiple conceptual subnets, how current standard problem-solving techniques might be integrated etc etc. But that's okay, this is my blog and I'm just rambling.