Update, 4/2017: See this StackOverflow answer for a different (and probably shorter) approach to this problem.
URL shorteners have become a hot commodity in the age of Twitter, where every byte counts. Shorteners have their uses, but they can also be potentially dangerous, since they mask the true destination of a link from users until it’s too late (shorteners are a malware installer’s wet dream). In addition, they work almost as a second layer of DNS on top of the internet, and a fragile one at that – if a shortening company goes out of business, all the links they handle could potentially break.
On bucketlist.org, a Django site that lets users catalog life goals, I’ve been using numerical IDs in URLs. As the number of items stored started to rise, I watched my URLs getting longer. Thinking optimistically about a hypothetical future with tens of millions of records to serve, and inspired by the URL structure at the Django-powered photo-sharing site Instagr.am, decided to do some trimming now, while the site’s still young. Rather than rely on a shortening service, decided to switch to a native Base 62 URL schema, with goal page URIs consisting of characters from this set:
BASE62 = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
rather than just the digits 0-9. The compression is significant. Car license plates use just seven characters and no lower-case letters (base 36), and are able to represent tens of millions of cars without exhausting the character space. With base 62, the namespace is far larger. Here are some sample encodings – watch as the number of characters saved increases as the length of the encoded number rises:
Numeric |
Base 62 |
1 |
b |
22 |
w |
333 |
fx |
4444 |
bjG |
55555 |
o2d |
666666 |
cN0G |
7777777 |
6Dwb |
88888888 |
gaYdK |
999999999 |
bfFTGp |
1234567890 |
bv8h5u |
I was able to find several Django-based URL shortening apps, but I didn’t want redirection – I wanted native Base62 URLs. Fortunately, it wasn’t hard to roll up a system from scratch. Started by finding a python function to do the basic encoding – this one did the trick. I saved that in a utils.py in my app’s directory.
Of course we need a new field to store the hashed strings in – I created a 5-character varchar called “urlhash” … but there’s a catch – we’ll come back to this.
The best place to call the function is from the Item model’s save()
method. Any time an Item is saved, we grab the record ID, encode it, and store the return value in urlhash. By putting it on the save()
method, we know we’ll never end up with an empty urlhash field if the item gets stored in an unpredictable way (site users can either create new items, or copy items from other people’s lists into their own, for example, and there may be other ways in the future — we don’t want to have to remember to call the baseconvert()
function from everywhere when a single place will do — keep it DRY!)).
Generating hashes
So in models.py:
from bucket.utils import BASE10, BASE62, baseconvert
...
def save(self):
# Do a bunch of stuff not relevant here...
# Initial save so the record gets an ID returned from the db
super(Item, self).save()
if not self.urlhash:
self.urlhash = baseconvert(str(self.id),BASE10,BASE62)
self.save()
Now create a new record in the usual way and verify that it always gets an accompanying urlhash stored. We also need to back-fill all the existing records. Easy enough via python manage.py shell
:
from bucket.models import Item
from bucket.utils import BASE10, BASE62, baseconvert
items = Item.objects.all()
for i in items:
print i.id
i.urlhash = baseconvert(str(i.id),BASE10,BASE62)
print i.urlhash
print
i.save()
Examine your database to make sure all fields have been populated.
About that MySQL snag
About that “snag” I mentioned earlier: The hashes will have been stored with mixed-case letters (and numbers), and they’re guaranteed to be unique if the IDs you generated them from were. But if you have two records in your table with urlhashes ‘U3b’ and ‘U3B’, and you do a Django query like :
urlhash = 'U3b'
item = Item.objects.get(urlhash__exact=urlhash)
Django complains that it finds two records rather than one. That’s because the default collation for MySQL tables is case-insensitive, even when specifying case-sensitive queries with Django! This issue is described in the Django documentation and there’s nothing Django can do about it – you need to change the collation of the urlhash column to utf8_bin. You can do this easily with a good database GUI, or with a query similar to this:
ALTER TABLE `db_name`.`db_table_name` CHANGE COLUMN `urlhash` `urlhash` VARCHAR(5) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL COMMENT '' AFTER `id`;
or, if you’re creating the column fresh on an existing table:
ALTER TABLE `bucket_item` ADD `urlhash` VARCHAR( 5 ) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL AFTER `id` , ADD INDEX ( `urlhash` )
Season to taste. It’s important to get that index in there for performance reasons, since this will be your primary lookup field from now on.
Tweak URL patterns and views
Since the goal is to keep URLs as short as possible, you have two options. You could put a one-character preface on the URL to prevent it from matching other word-like URL strings, like:
foo.org/i/B3j
but I wanted the shortest URLs possible, with no preface, just:
foo.org/B3j
Since I have lots of other word-like URLs, and can’t know in advance how many characters the url hashes will be, I simply moved the regex to the very last position in urls.py – this becomes the last pattern matched before handing over to 404.
url(r'^(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),
Unfortunately, I quickly discovered that this removed the site’s ability to use Flat Pages, which rely on the same fall-through mechanism, so I switched to the “/i/B3j” technique instead.
url(r'^i/(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),
Now we need to tweak the view that handles the item details a bit, to query for the urlhash rather than the record ID:
from django.shortcuts import get_object_or_404
...
def item_view(request,urlhash):
item = get_object_or_404(Item,urlhash=urlhash)
...
It’s important to use get_object_or_404
here rather than objects.get()
. That way we can still return 404 if someone types in a word-like URL string that the regex in urls.py can’t catch due to its open-endedness. Note also that we didn’t specify urlhash__exact=urlhash
— case-sensitive lookups are the default in Django queries, and there’s no need to specify the default.
If you’ve been using something like {% url item_view item.id %}
in your templates, you’ll obviously need to change all instances of that to {% url item_view item.urlhash %}
(you may have to make similar changes in your view code if you’ve been using reverses with HttpResponseRedirect).
Handling the old URLs
Of course we still want to handle all of those old incoming links to the numeric URLs. We just need a variant of the original ID-matching pattern:
url(r'^(?P\d+)/$', 'bucket.views.item_view_redirect', name="item_view_numeric"),
which points to a simple view item_view_redirect
that does the redirection:
def item_view_redirect(request,item_id):
'''
Handle old numeric URLs by redirecting to new hashed versions
'''
item = get_object_or_404(Item,id=item_id)
return HttpResponseRedirect(reverse('item_view',args=[item.urlhash]))
Bingo – all newly created items get the new, permanently shortened URLs, and all old incoming links are handled transparently.