Friday, October 31, 2008

Efficient encoding of URL arguments

I would like to have URLs which look like:
"http://web.host/queryhandler/Tnidf3445ssx34" or something similar. That last part should be some reasonably space-efficient encoding of something like "SELECT * from FOO where CONDITION"

I have found I can get part-way using Python's encode and decode functions, e.g.
encoded = "SELECT * from FOO where CONDITION".encode("hex")
decoded = encoded.decode("hex")

However, the resulting values are still a bit long. This method quite nearly removes nonprintable characters, spaces etc from the URL. However, I have this nagging feeling that I could be doing some compression along the way to make the encoded part shorter. For example, HEX uses only part of the alphanumeric space. I'd like to use the digits 0-9 and letters a-z and A-Z ideally.

Let's just make this clear -- for no very good reason that I can justify, I want to encode a query as the shortest possible string using printable, alphanumeric characters (and no spaces) and be able to simply decode the string also.

"The Internet" has told me about md5 hashing, but I need a two-way function here. The help on string.encode in Python simply says I can specify the name of any encoder, but I can't find a list of the default encoders available.

Any ideas, internet?

Cheers and thanks in advance,
-T

10 comments:

Anonymous said...

Just to check: you are planning on doing extremely robust sanitisation on those querystrings aren't you?

If you go for a two-way function you're opening the door for people to figure out what you're doing and encode some DROP statement before you know it.

In terms of the compression, representing the various SQL keywords with one letter codes is probably good place to start, followed by other compression and encoding of course.

Marius Gedminas said...

I can't believe nobody mentioned the base64 encoding yet.

The list of some of the encoding names accepted by Python's str.encode is here: http://www.python.org/doc/current/library/codecs.html#standard-encodings

jelen said...

isn't base64 about 30% less effecient (size) than that HEX variant?

Wikipedia says: Thus, the actual length of MIME-compliant base64-encoded binary data is usually about 137% of the original data length, though for very short messages the overhead can be a lot higher because of the overhead of the headers. Very roughly, the final size of base64-encoded binary data is equal to 1.37 times the original data size + 814 bytes (for headers).

jelen said...

now I see :) Hex encoding is actually less effective (it's 200% of length)

sorry

Anonymous said...

ASCII85 is the most efficient of these encodings. It is used in PostScript and in PDF files. See the Wikipedia article for more:
http://en.wikipedia.org/wiki/Ascii85

TimOttinger said...

Produce the code on your side, where it's clean and intentional and not open to exploit and send a small unique id.

Think "tinyurl"

ltbarcly said...

The key here is that you don't want a way to convert generic binary strings to url's, you only need to encode printable characters. However, there are about 85 or so characters you will have to consider. The number of bits necessary to encode 85 characters is 2**7, unless you do something weird like ascii85 (ascii 85 uses roughly 6.4 bits per character). So really, the best you are going to do with an encoding is roughly 1/8th smaller than the initial string.

I would just use urllib.quote_plus. It won't obfuscate, but it also won't be a complicated mess. Once the queries are around 200 characters long, you could switch to using:

astring.encode('zlib').encode('base64') using zlib makes strings a few bytes larger if they are small due to overhead, but at about 230 characters you will see some strings that have repetative elements get smaller. In reality you need about a 27% reduction in size at the zlib step to break even in the end, since base64 will expand that by about 37%. Since queries are made up of repetitive tokens, zlib should be especially efficient on them, and the break even point may be as few as 100 characters.

So I guess I'm saying, for your purposes, .encode('zlib').encode('base64') might be the best you can do short of something esoteric like zlib and then base85.

Really though, putting the queries in the url is usually very stupid.

Samat Jain said...

If you want your URLs to be case-insensitive, take a look at Base32 encoding. It also has the benefit of being representable in URLs without further escaping of characters.id

AngelBlaZe said...

What about a simple cypher like rot13 with space replacement?
SELECT * FROM USERS
FRYRPG__*__SEBZ__HFREF

or Base36 encoding:
SELECT * FROM USERS
1D8QHR28CL3YDLG5KVYOJIKMVP64PF

AFAIK TinyURL gives every link a large number which is then encoded in Base36 for compactness.

Tennessee Leeuwenburg said...

Thanks very much all for your comments! It's fantastic to have so much information contributed to this post. For the time being, I am using a hex encoding, however I may review this in future if it is warranted. Right now I'm concentrating on other aspects of functionality but I will certainly keep this thread as a pocket-reference for string encoding!