Update, 4/2017: See this StackOverflow answer for a different (and probably shorter) approach to this problem.
URL shorteners have become a hot commodity in the age of Twitter, where every byte counts. Shorteners have their uses, but they can also be potentially dangerous, since they mask the true destination of a link from users until it’s too late (shorteners are a malware installer’s wet dream). In addition, they work almost as a second layer of DNS on top of the internet, and a fragile one at that – if a shortening company goes out of business, all the links they handle could potentially break.
On bucketlist.org, a Django site that lets users catalog life goals, I’ve been using numerical IDs in URLs. As the number of items stored started to rise, I watched my URLs getting longer. Thinking optimistically about a hypothetical future with tens of millions of records to serve, and inspired by the URL structure at the Django-powered photo-sharing site Instagr.am, decided to do some trimming now, while the site’s still young. Rather than rely on a shortening service, decided to switch to a native Base 62 URL schema, with goal page URIs consisting of characters from this set:
BASE62 = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
rather than just the digits 0-9. The compression is significant. Car license plates use just seven characters and no lower-case letters (base 36), and are able to represent tens of millions of cars without exhausting the character space. With base 62, the namespace is far larger. Here are some sample encodings – watch as the number of characters saved increases as the length of the encoded number rises:
Numeric | Base 62 |
---|---|
1 | b |
22 | w |
333 | fx |
4444 | bjG |
55555 | o2d |
666666 | cN0G |
7777777 | 6Dwb |
88888888 | gaYdK |
999999999 | bfFTGp |
1234567890 | bv8h5u |
I was able to find several Django-based URL shortening apps, but I didn’t want redirection – I wanted native Base62 URLs. Fortunately, it wasn’t hard to roll up a system from scratch. Started by finding a python function to do the basic encoding – this one did the trick. I saved that in a utils.py in my app’s directory.
Of course we need a new field to store the hashed strings in – I created a 5-character varchar called “urlhash” … but there’s a catch – we’ll come back to this.
The best place to call the function is from the Item model’s save()
method. Any time an Item is saved, we grab the record ID, encode it, and store the return value in urlhash. By putting it on the save()
method, we know we’ll never end up with an empty urlhash field if the item gets stored in an unpredictable way (site users can either create new items, or copy items from other people’s lists into their own, for example, and there may be other ways in the future — we don’t want to have to remember to call the baseconvert()
function from everywhere when a single place will do — keep it DRY!)).
Generating hashes
So in models.py:
from bucket.utils import BASE10, BASE62, baseconvert ... def save(self): # Do a bunch of stuff not relevant here... # Initial save so the record gets an ID returned from the db super(Item, self).save() if not self.urlhash: self.urlhash = baseconvert(str(self.id),BASE10,BASE62) self.save()
Now create a new record in the usual way and verify that it always gets an accompanying urlhash stored. We also need to back-fill all the existing records. Easy enough via python manage.py shell
:
from bucket.models import Item from bucket.utils import BASE10, BASE62, baseconvert items = Item.objects.all() for i in items: print i.id i.urlhash = baseconvert(str(i.id),BASE10,BASE62) print i.urlhash print i.save()
Examine your database to make sure all fields have been populated.
About that MySQL snag
About that “snag” I mentioned earlier: The hashes will have been stored with mixed-case letters (and numbers), and they’re guaranteed to be unique if the IDs you generated them from were. But if you have two records in your table with urlhashes ‘U3b’ and ‘U3B’, and you do a Django query like :
urlhash = 'U3b' item = Item.objects.get(urlhash__exact=urlhash)
Django complains that it finds two records rather than one. That’s because the default collation for MySQL tables is case-insensitive, even when specifying case-sensitive queries with Django! This issue is described in the Django documentation and there’s nothing Django can do about it – you need to change the collation of the urlhash column to utf8_bin. You can do this easily with a good database GUI, or with a query similar to this:
ALTER TABLE `db_name`.`db_table_name` CHANGE COLUMN `urlhash` `urlhash` VARCHAR(5) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL COMMENT '' AFTER `id`;
or, if you’re creating the column fresh on an existing table:
ALTER TABLE `bucket_item` ADD `urlhash` VARCHAR( 5 ) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL AFTER `id` , ADD INDEX ( `urlhash` )
Season to taste. It’s important to get that index in there for performance reasons, since this will be your primary lookup field from now on.
Tweak URL patterns and views
Since the goal is to keep URLs as short as possible, you have two options. You could put a one-character preface on the URL to prevent it from matching other word-like URL strings, like:
foo.org/i/B3j
but I wanted the shortest URLs possible, with no preface, just:
foo.org/B3j
Since I have lots of other word-like URLs, and can’t know in advance how many characters the url hashes will be, I simply moved the regex to the very last position in urls.py – this becomes the last pattern matched before handing over to 404.
url(r'^(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),
Unfortunately, I quickly discovered that this removed the site’s ability to use Flat Pages, which rely on the same fall-through mechanism, so I switched to the “/i/B3j” technique instead.
url(r'^i/(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),
Now we need to tweak the view that handles the item details a bit, to query for the urlhash rather than the record ID:
from django.shortcuts import get_object_or_404 ... def item_view(request,urlhash): item = get_object_or_404(Item,urlhash=urlhash) ...
It’s important to use get_object_or_404
here rather than objects.get()
. That way we can still return 404 if someone types in a word-like URL string that the regex in urls.py can’t catch due to its open-endedness. Note also that we didn’t specify urlhash__exact=urlhash
— case-sensitive lookups are the default in Django queries, and there’s no need to specify the default.
If you’ve been using something like {% url item_view item.id %}
in your templates, you’ll obviously need to change all instances of that to {% url item_view item.urlhash %}
(you may have to make similar changes in your view code if you’ve been using reverses with HttpResponseRedirect).
Handling the old URLs
Of course we still want to handle all of those old incoming links to the numeric URLs. We just need a variant of the original ID-matching pattern:
url(r'^(?P
which points to a simple view item_view_redirect
that does the redirection:
def item_view_redirect(request,item_id): ''' Handle old numeric URLs by redirecting to new hashed versions ''' item = get_object_or_404(Item,id=item_id) return HttpResponseRedirect(reverse('item_view',args=[item.urlhash]))
Bingo – all newly created items get the new, permanently shortened URLs, and all old incoming links are handled transparently.
You can get nearly the same benefit, plus anti-ambiguity readability, by going with a base-58 that leaves out lowercase “l”, uppercase “O”, zero and one!
Good tip Glenn! Wish I had thought of that. Not anxious to upset the apple cart right now, but will keep in mind for future projects.
What happens when you have a hash that is the same as a url that your using, IE, links, are you going to check to see if the object id matches your current url table every time you insert an object?
After writing the original post, I realized I had lost the ability to use flat pages, for exactly this reason. I then changed the site to use the 2nd method above, with URLs starting with /i/xyz. I’ll update the post now, thx.
I was doing something very similar with converting sha hashes to base 62 to shorten them, but I am glad I came across your blog entry to learn about the MySQL case sensitivity issue. I think I might switch to base 36 (or perhaps a little higher with some url-friendly characters like dash and dot) and avoid the case issue entirely. Thanks!
Hey Scot, cool trick!
I’d suggest using a post_save signal rather than a custom save method, though. That way it wouldn’t even have to be in models.py — you could attach the handler to a set of models of your choosing without altering the model source.
Good point Philip – I’ll think about switching that over.
Good meeting you tonight. My ears perked up when I saw that Django slide :)
I happened to stumble upon this post while searching for something complete unrelated (you’re number 8 on Google for “foobar save query,” by the way), but I happen to be a professional web developer and DBA, and I wanted to chime in. To best honest, this approach seems a bit misguided to me. Here are the problems I see:
First, you’re storing calculated values; values that can be trivially calculated, at that. Of course, this is bad because it can lead to inconsistencies in the database, but on top of that the calculated value isn’t even particularly useful at the database level. Searching on a varchar (you don’t even mention an index) is orders of magnitude slower that searching on a simple int column. I would suggest instead performing the base conversion back to int on every request and fetching the record by id. It’s probably too late to change your approach now, but at least it’s very easy to add the index to the urlhash column. Otherwise the very situation you’re trying to improve (a large number of records would make your URLs too long) could end up creating a new problem (a large number of records would slow down response time).
Besides the database-level concerns, I’d also be worried about the URLs themselves. Some others pointed out that base 58 and base 36 have some advantages over base 62. Imagine reading a base 58 or base 62 URL over the phone. Having to specify the capitalization of each letter undoes all of the gains created by the shorter length of the URL. Case-insensitive URLs are much more user friendly in general, especially given that DNS is case-insensitive. Besides that, considering that the longest 32-bit integer (signed or unsigned) is 10 base 10 digits long, and the smallest such number (1 billion) compresses by a mere 4 characters, it doesn’t really seem too beneficial.
I don’t mean to be too disparaging. I just felt like no one in the comments stopped to weigh the costs against the benefits. Personally, this is not an approach that I would take.
That said, I admit that I might be overlooking something in my analysis, and I’d be happy to be corrected.
Hey William, thanks for chiming in. Lots of great feedback there. My instructions are pretty clear about the necessity of having an index (and the SQL statement even shows the index portion). You’re right though, it might be more performant to convert back to ID at lookup time and query with that instead. That would be trivial to do without messing anything up. I guess I’d need metrics on query time for an indexed varchar compared to query time for record ID, taking into account the tiny hit of converting the urlhash back to int at request time.
I think you’re right on about case insensitivity, and in retrospect wish I’d gone that way, though the compression wouldn’t be as great. Unfortunately that would be quite difficult to change at this point, and the use case of someone trying to write down or read a URL over the phone is mostly academic I think – or at least pretty darned infrequent compared to linking, etc. (i.e. it might be a fairly small price to pay).
Thanks again William!
Ah, yes, somehow I did miss the bit about the index. And of course, you’re one hundred percent right about collecting performance metrics before making any changes. If the current setup is fast enough, then by definition, it’s fast enough.
That’s a good point about the prevalence of links to share URLs, and the whole scheme does make quite a bit of sense when you consider the size of the subset of those links shared over Twitter.
As for the trade-off between base 36 and base 62, it’s worth point out that the maximum 32-bit integer (again, signed or unsigned) is only one digit longer is base 36, and the two or three characters it will save on average might just make a difference in a highly terse medium like Twitter.
In fact, on second thought, if I thought that that would be a major use case for my URLs, I would consider base 34 (skip the O and L for clarity, maybe convert them to 0 and 1 using a regex). It might be worth considering a small number of common, URL-safe symbols, as well.
At any rate, I certainly see the point now, I’m sorry I was so harsh at first!
Thanks Stenius and William! This post and William’s comment helped me a lot to decide on an ideal strategy for my use case (discussed in detail here: http://stackoverflow.com/questions/43253602/django-short-non-linear-non-predictable-id-in-the-url)