Building mod_wsgi with EasyApache for WHM/cPanel

Note: These instructions are for root owners of WHM/cPanel systems, not end users.

If you want to run Django sites on a cPanel server, you’ll probably want to use the mod_wsgi Apache module. There are plenty of instructions out there on compiling mod_wsgi, but if you create it outside of the cPanel system, mod_wsgi.so will vanish each time you run easy_apache to upgrade your apache and php.

The key is to install this mod_wsgi for cPanel module. But before you go there, you’re going to want a more recent version of Python installed, since RedHat and CentOS still ship with Python 2.4, which will be deprecated by Django soon. However, you can’t overwrite the system-provided Python because yum and Mailman depend on it.

Download Python 2.7 (or whatever the latest is) into /usr/local/src. It’s critical that you build Python with shared libraries enabled, since mod_wsgi will be wanting to use them. So unpack the Python archive and cd into it, then:

./configure --enable-shared
make install

You’ll get a new build of python in /usr/local/bin, without disrupting the native version in /usr/bin. Any user wanting python2.7 to be their default can add this to their .bash_profile:

PATH=/usr/local/bin:$PATH:$HOME/bin

You’ll also get new libpython shared objects in /usr/local/lib. When you go to build mod_wsgi, easy_apache will need to look for python libs in that location. I found that copying the libs into standard library locations such as /lib and /usr/lib as suggested here didn’t do the trick. What did work was to add a system configuration file pointing to the new libs. Do this:

cd /etc/ld.so.conf.d
echo "/usr/local/lib/" > python27.conf
ldconfig

Now you’re ready to build mod_wsgi through easy_apache. Download custom_opt_mod-mod_wsgi.tar.gz from this ticket at google code and run:

tar -C /var/cpanel/easy/apache/custom_opt_mods -xzf custom_opt_mod-mod_wsgi.tar.gz

That unpacks the module into the right location so that easy_apache will find it and present it as a build option. Run easy_apache as usual (either via script or through WHM) and select the mod_wsgi option. When complete, you’ll find mod_wsgi.so along with all your other modules in /usr/local/apache/modules. The best part is, this will now become part of the default easy_apache build process, so Django sites won’t break when you rebuild apache+php in the future.

Many thanks to challgren for creating the module and to Graham Dumpleton for all of his mod_wsgi evangelism and support.

Migrating from Django-Tagging to Taggit

When Bucketlist launched a year ago and I needed a good app to let users create a taxonomy for their life goals, django-tagging was the main contender, and that’s what we went with.

Django-tagging worked pretty well overall, but had one critical bug: Because it only had a tag “name” field but no slug field, users could enter tags with slashes in them. Accessing lists of those tags would then generate a 500 error – a bad user experience, unclean, and I was getting tired of seeing the error reports. Unfortunately, django-tagging hasn’t been been updated in quite a while – starting to look like abandon-ware.

At Djangocon 2010, buzz was that Alex Gaynor’s django-taggit was picking up the slack and becoming the go-to tagging library for Django. Unfortunately, Taggit provides no migration strategy to move your existing tag base over. I held off on migration hoping one would appear, then finally decided this week to try it myself. Thought I’d document the process for others in the same boat.
Continue reading

Shorter URLs with Base62 in Django

URL shorteners have become a hot commodity in the age of Twitter, where every byte counts. Shorteners have their uses, but they can also be potentially dangerous, since they mask the true destination of a link from users until it’s too late (shorteners are a malware installer’s wet dream). In addition, they work almost as a second layer of DNS on top of the internet, and a fragile one at that – if a shortening company goes out of business, all the links they handle could potentially break.

On bucketlist.org, a Django site that lets users catalog life goals, I’ve been using numerical IDs in URLs. As the number of items stored started to rise, I watched my URLs getting longer. Thinking optimistically about a hypothetical future with tens of millions of records to serve, and inspired by the URL structure at the Django-powered photo-sharing site Instagr.am, decided to do some trimming now, while the site’s still young. Rather than rely on a shortening service, decided to switch to a native Base 62 URL schema, with goal page URIs consisting of characters from this set:

BASE62 = "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

rather than just the digits 0-9. The compression is significant. Car license plates use just seven characters and no lower-case letters (base 36), and are able to represent tens of millions of cars without exhausting the character space. With base 62, the namespace is far larger. Here are some sample encodings – watch as the number of characters saved increases as the length of the encoded number rises:

Numeric Base 62
1 b
22 w
333 fx
4444 bjG
55555 o2d
666666 cN0G
7777777 6Dwb
88888888 gaYdK
999999999 bfFTGp
1234567890 bv8h5u

I was able to find several Django-based URL shortening apps, but I didn’t want redirection – I wanted native Base62 URLs. Fortunately, it wasn’t hard to roll up a system from scratch. Started by finding a python function to do the basic encoding – this one did the trick. I saved that in a utils.py in my app’s directory.

Of course we need a new field to store the hashed strings in – I created a 5-character varchar called “urlhash” … but there’s a catch – we’ll come back to this.

The best place to call the function is from the Item model’s save() method. Any time an Item is saved, we grab the record ID, encode it, and store the return value in urlhash. By putting it on the save() method, we know we’ll never end up with an empty urlhash field if the item gets stored in an unpredictable way (site users can either create new items, or copy items from other people’s lists into their own, for example, and there may be other ways in the future — we don’t want to have to remember to call the baseconvert() function from everywhere when a single place will do — keep it DRY!)).

Generating hashes

So in models.py:

from bucket.utils import BASE10, BASE62, baseconvert

...

def save(self):

    # Do a bunch of stuff not relevant here...

    # Initial save so the record gets an ID returned from the db
    super(Item, self).save()

    if not self.urlhash:
        self.urlhash = baseconvert(str(self.id),BASE10,BASE62)
        self.save()     

Now create a new record in the usual way and verify that it always gets an accompanying urlhash stored. We also need to back-fill all the existing records. Easy enough via python manage.py shell:

from bucket.models import Item
from bucket.utils import BASE10, BASE62, baseconvert

items = Item.objects.all()
for i in items:
    print i.id
    i.urlhash = baseconvert(str(i.id),BASE10,BASE62)
    print i.urlhash
    print
    i.save()

Examine your database to make sure all fields have been populated.

About that MySQL snag

About that “snag” I mentioned earlier: The hashes will have been stored with mixed-case letters (and numbers), and they’re guaranteed to be unique if the IDs you generated them from were. But if you have two records in your table with urlhashes ‘U3b’ and ‘U3B’, and you do a Django query like :


urlhash = 'U3b'
item = Item.objects.get(urlhash__exact=urlhash)

Django complains that it finds two records rather than one. That’s because the default collation for MySQL tables is case-insensitive, even when specifying case-sensitive queries with Django! This issue is described in the Django documentation and there’s nothing Django can do about it – you need to change the collation of the urlhash column to utf8_bin. You can do this easily with a good database GUI, or with a query similar to this:

ALTER TABLE `db_name`.`db_table_name` CHANGE COLUMN `urlhash` `urlhash` VARCHAR(5) CHARACTER SET utf8 COLLATE utf8_bin NULL DEFAULT NULL COMMENT '' AFTER `id`;

or, if you’re creating the column fresh on an existing table:

ALTER TABLE `bucket_item` ADD `urlhash` VARCHAR( 5 ) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL AFTER `id` , ADD INDEX ( `urlhash` )

Season to taste. It’s important to get that index in there for performance reasons, since this will be your primary lookup field from now on.

Tweak URL patterns and views

Since the goal is to keep URLs as short as possible, you have two options. You could put a one-character preface on the URL to prevent it from matching other word-like URL strings, like:

foo.org/i/B3j

but I wanted the shortest URLs possible, with no preface, just:

foo.org/B3j

Since I have lots of other word-like URLs, and can’t know in advance how many characters the url hashes will be, I simply moved the regex to the very last position in urls.py – this becomes the last pattern matched before handing over to 404.

url(r'^(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),

Unfortunately, I quickly discovered that this removed the site’s ability to use Flat Pages, which rely on the same fall-through mechanism, so I switched to the “/i/B3j” technique instead.

url(r'^i/(?P<urlhash>\w+)/$', 'bucket.views.item_view', name="item_view"),

Now we need to tweak the view that handles the item details a bit, to query for the urlhash rather than the record ID:


from django.shortcuts import get_object_or_404
...

def item_view(request,urlhash):        
    item = get_object_or_404(Item,urlhash=urlhash)
	...

It’s important to use get_object_or_404 here rather than objects.get(). That way we can still return 404 if someone types in a word-like URL string that the regex in urls.py can’t catch due to its open-endedness. Note also that we didn’t specify urlhash__exact=urlhash — case-sensitive lookups are the default in Django queries, and there’s no need to specify the default.

If you’ve been using something like {% url item_view item.id %} in your templates, you’ll obviously need to change all instances of that to {% url item_view item.urlhash %} (you may have to make similar changes in your view code if you’ve been using reverses with HttpResponseRedirect).

Handling the old URLs

Of course we still want to handle all of those old incoming links to the numeric URLs. We just need a variant of the original ID-matching pattern:

url(r'^(?P\d+)/$', 'bucket.views.item_view_redirect', name="item_view_numeric"),

which points to a simple view item_view_redirect that does the redirection:


def item_view_redirect(request,item_id):
    '''
    Handle old numeric URLs by redirecting to new hashed versions
    '''
    item = get_object_or_404(Item,id=item_id)
    return HttpResponseRedirect(reverse('item_view',args=[item.urlhash]))

Bingo – all newly created items get the new, permanently shortened URLs, and all old incoming links are handled transparently.

Encouraging users to add avatars to profiles

One of the things that has vexed me since launching bucketlist.org a few months ago is the fact that most users don’t enter any sort of profile information whatsoever – not even an icon/avatar to represent themselves. In fact, I did a quick query the other night and discovered that only 1/4 of users had set up an avatar. This realization was both surprising and disappointing to me — surprising because most users of other social networks (Twitter, Facebook, etc.) go to lengths to make sure their profile info is complete and up to date. People on Twitter know that most people won’t even bother following people who don’t have personal icons.

Why was bucketlist being viewed differently by its users? And what could I do to encourage users to add profile info, or at least images of themselves?

One problem, I realized, was that the default avatar I was using on the site to represent avatar-less users was too bland. It didn’t bother users to be represented like this:

Toyed briefly with the idea of replacing the generic icon with something ridiculous, to motivate people to change it as soon as possible. But I don’t want to annoy or embarrass users. Also contemplated using some kind of Ajax-y banner thing to gently remind users to set up an avatar. Then it hit me last night – I don’t have to show the same image to everyone – why not do it like this:

if showing a bucketlist or goal whose owning user has an avatar, show that
if showing someone else’s list or item with no avatar, show the usual generic avatar
if showing your own list or item and you dont have an avatar, show something else

This trick replies on a bit of psychology – since the user probably assumes that everyone sees their lists and items with same icon they’re currently seeing, there’s a strong incentive to change it. Here’s what I came up with, based somewhat on a similar approached used for new Twitter accounts:

The other difference is that, while most avatars on the site link to the item owner’s main list page, this one links to the user’s own profile editing page. I suspect that part of the problem was that many users just didn’t notice or care that they even could edit their profiles, despite the presence of a giant “Edit Your Profile” button. Now there’s no mistaking the option.

After a week with this system, we found little to no increase in the number of users adding avatars to their profiles, so I upped the ante a bit by throwing up a friendly splash screen when the following conditions were true:

  • User has been logged in for three minutes
  • User is currently adding an item
  • User has no avatar
  • User has not yet been “nagged”

After two weeks with this system in effect, I crunched some numbers (using querysets in the Django ORM) and discovered that the new “nag” system raised the percentage of users adding avatars from 24% to 33% – a measurable difference, but still nowhere near the increase I was hoping for.

I’m not willing to nag any more than that – the real key is getting users to see the site as a social site, not just a personal list repository. I think deeper integration with social networks will make a greater difference.

Building a Bucketlist Site with Django

Half a year ago, I got this crazy idea to build a site where people could log and record all the things they wanted to accomplish before they died. But more than just simple list-making, I wanted to make it easy for people to tell stories about their goals, and to add images and video. I wanted to let people “follow” other people’s lists, to receive email when their friends accomplished their goals, to start discussions about getting the most out of life. I wanted it to be a place where people could get inspired by the goals of others, and to easily make copies of those goals in their own bucketlists.

The result is bucketlist.org.

I had a pre-existing love affair with the Python-based Django framework – there was never a question of what platform to build on. But no matter how good the platform, the devil’s in the details.
Continue reading

Allowing Secure User Input with Django

Building a site that needs to accept formatted user input? There’s no way you’re going to let random users input any old HTML – you’d open the door to all kinds of cross-site-scripting attacks and other nastiness. Nor can you just filter out the tags you consider dangerous – that road is fraught with peril. The only solution is to white-list a small subset of tags and unceremoniously drop the rest.

There are two layers to the problem – how to support formatted text on the front-end, and how to process submitted text on the back-end.

For the front-end, some developers are drawn to the Markdown syntax – a supposedly user-friendly wiki-like syntax that can be re-rendered as safe HTML. But while Markdown may look friendly to developers, it doesn’t to normal users – trust me on this. Even for tech-savvy users, Markdown requires that you place syntax instructions on your site (inelegant). A better solution is to use a rich text editor for the web, like TinyMCE or WYMEditor.

Ever notice that you often see rich text editors in content management systems run by trusted users, but seldom on public-facing web pages? That’s because it’s tricky to do securely, and without giving users enough rope to hang themselves formatting-wise.

With a bit of configuration though, you can deploy public-facing rich textareas securely, allowing only the input of tags you specify. But you can’t stop there – all the user has to do is disable Javascript in the browser to bypass your rich text editor. You must process submitted text on the back-end with the same set of rules in your view logic.

Continue reading

Reading .rst doc files with sphinx

Quite a few re-usable Django apps and Python modules come with documentation in text files ending with a .rst extension. The formatting in them is odd, but they’re more-or-less readable.

To this day, I haven’t encountered a single package that explained why docs were formatted this way. I knew there had to be an explanation, but hadn’t gotten around to looking it up, and basically just waded through. Finally went looking for an answer today. Turns out .rst files use a simple markup syntax called restructured text and you can generate nicely formatted HTML (and other documentation formats) out of them if you have python’s sphinx module installed. For the benefit of future googlers, here’s how to get up and running quickly:

$ pip install sphinx
$ cd docs
$ mkdir out
$ sphinx-build . out

Now take a look in the “out” directory and you’ll find the same set of files as a collection of handsomely formatted HTML docs.

Sphinx goes pretty deep, and I’m looking forward to exploring it for future documentation projects. For now I’m just happy to have an alternative to squinting.

django-treedata: DataSF Contest Winner

treewordle-150x150Recently I was invited to participate in the California Data Camp and DataSF App Contest hosted by California Watch and spot.us. The unconference would feature lots of discussion about making use of publicly available data sets to improve quality of life. The App Contest challenged developers to choose one of the many data sets available at DataSF.org and build something cool with it in a relatively short period of time.

Long story short — my contest entry, which explores San Francisco’s database of publicly maintained trees and plants, won the competition! Full details, and downloadable source code, available at my Scripts and Utilities site.

Thanks so much to David Cohn of Spot.us and all of the conference organizers and supporters. Thanks also to J-School webmaster for Chuck Harris for his contributions to the project. It was a great day, and winning the competition was a total surprise. Now I just need a city to take the source code and run with it.

spot.us has covered the event live throughout the day.

Huffington Post mentioned django-treedata in Sophisticated Tree Hugging: the Pure Joy of Public Data

Generating RSS Mashups from Django

I recently got to work on an interesting Django side project: the Bay News Network – a directory of Bay Area bloggers and hyperlocal news sites. The goal of the site was three-fold:

  1. To create a many-to-many directory of local sites that matched our editorial criteria
  2. To let site owners log in and edit their own listings
  3. To both consume and produce RSS feeds from the listed sites

The first two were pretty standard Django approaches – develop data models and editing interfaces using Django forms and re-usable apps like django-profiles and django-registration. The third goal turned out to be more interesting. We not only had to gather RSS feeds from more than 100 external sites several times per day, we needed to re-mix them (e.g. provide an integrated feed representing all blogs that cover Food, or all blogs that cover Oakland).

“Consuming” RSS feeds meant we needed to integrate feeds from the external sites into our own site. At the most basic level, this was pretty straightforward using Mark Pilgrim’s excellent Universal Feed Parser, which turns the real-world’s tag soup of disparate, incompatible RSS formats  into a reliable data format you can step through in your code or templates. This worked well enough until I realized that grabbing and parsing external feeds in real-time was just not going to scale, performance-wise. Plus, we still had the RSS mashups to build, and would clearly need to be storing feed entries in our own database in order to sort them by category, etc.

Thus began the hunt for good feed aggregation systems for Django. Most roads pointed to django-planet, planet planet, and FeedJack, which are systems for gathering content from external sites and importing it into a single aggregated site. These were close to what I wanted, but weren’t great on the re-usability side. Since I already had  existing models to define the sites, their owners, and their feeds, I didn’t want to rewrite all my models to work with another system’s conception of how things should be laid out. I also didn’t feel like plowing through their source code to chop out and rewrite just the bits I wanted. Eventually realized that I was looking for a few lines of code to work with my system, not a whole external system.

The surprising solution came from the Community section of the official Django project web site. The Django developers keep the code that drives djangoproject.com in subversion along with the source code to Django itself. And the code that drives that section of the site is really lightweight. So I did a subversion checkout of the Aggregator app, and found that all I really needed from it was its update_feeds.py script, which itself is a wrapper around Universal Feed Parser, tweaked to talk to my own models.

Two gotchas to be aware of:

  1. The app includes a bundled templatetags directory with a file called aggregator.py. But the name of the app itself is “aggregator.” I was getting strange import errors in various places before I discovered on the django-users mailing list that Django doesn’t like it when an app name matches a templatetag name. Easily fixed by renaming the templatetag.
  2. My first runs of update_feeds.py went fine, but later started erroring out with database integrity errors. The GUID field on the FeedItem model is set to unique=True, which prevents your database from storing any one FeedItem more than once. That’s great, but it was dishing up integrity errors for some reason. I fixed this by changing this line in update_feeds.py:
feed.feeditem_set.get(guid=guid)

to:

FeedItem.objects.get(guid=guid)

Once I was able to get the updater to run consistently without error, I needed to get it running via cron. The trick to running a Python script that talks to the Django ORM from a crontab is that you must supply the full Python paths in the environment to cron – it doesn’t pick them up automatically from the environment of the user that runs the cron job. This worked for me:

PYTHONPATH=/home/bnn/projects:/home/bnn/projects/bnn
DJANGO_SETTINGS_MODULE=bnn.settings
20 15 * * * python /home/bnn/projects/bnn/scripts/update_feeds.py 2>&1

Producing Feeds

With the harvesting system up and running, and all content coming into the datbase associated with blogs that were in turn categorized by “beat” and geographical area, outputting aggregated RSS feeds was a simple matter of using Django’s native syndication framework as documented. This went into urls.py:

feeds = {
    'all': AllFeeds,
    'cat': CategoryFeeds,
    'area': BeatFeeds,
}

# Feeds
url(r'^feeds/(?P.*)/$', 'django.contrib.syndication.views.feed', {'feed_dict': feeds}),

… and I created a file feedgenerator.py to contain the three corresponding classes and their querysets, using Holovaty’s sample code from chicagocrime.org as a starting point.

Python-MySQL Connections and Snow Leopard

Apparently I’m not the only one having trouble getting MySQL and Python to play nice under OS X — last February’s post on getting the two to cooperate under OS X has generated a ton of traffic. Now I’ve upgraded to Snow Leopard and faced a handful of new challenges (but eventually got it working). Rather than scatter my notes, I’ve updated the original post with Snow Leopard instructions.