Allowing Secure User Input with Django

Building a site that needs to accept formatted user input? There’s no way you’re going to let random users input any old HTML – you’d open the door to all kinds of cross-site-scripting attacks and other nastiness. Nor can you just filter out the tags you consider dangerous – a blacklist only stops the attacks you thought of, and there is always another tag, attribute, or encoding trick you missed. The only reliable solution is to whitelist a small subset of tags and unceremoniously drop the rest.

There are two layers to the problem – how to support formatted text on the front-end, and how to process submitted text on the back-end.

For the front-end, some developers are drawn to the Markdown syntax – a supposedly user-friendly wiki-like syntax that can be re-rendered as safe HTML. But while Markdown may look friendly to developers, it doesn’t look that way to normal users – trust me on this. Even for tech-savvy users, Markdown requires that you place syntax instructions on your site, which is inelegant. A better solution is to use a rich text editor for the web, like TinyMCE or WYMEditor.

Ever notice that you often see rich text editors in content management systems run by trusted users, but seldom on public-facing web pages? That’s because it’s tricky to do securely, and tricky to do without giving users enough rope to hang themselves formatting-wise.

With a bit of configuration though, you can deploy public-facing rich textareas securely, allowing only the input of tags you specify. But you can’t stop there – all the user has to do is disable JavaScript in the browser to bypass your rich text editor. You must also process submitted text on the back-end, applying the same set of rules in your view logic.

Configuring TinyMCE

Let’s say you’ve already got TinyMCE installed in your Django project, and you want users to be able to use the following tags – nothing more:

p i strong b u a h1 h2 h3 blockquote br ul ol li

django-tinymce defaults to the “simple” theme, which allows a few of these, but not all of them. The “simple” theme cannot be modified – it is what it is – so you’ll need to start by switching to the “advanced” theme. django-tinymce accepts arguments passed in from your project’s settings file, so add something like:

TINYMCE_DEFAULT_CONFIG = {
  'theme': "advanced",
  'remove_linebreaks': False,
  'convert_urls': False,
  'width': '100%',
  'height': '300px',
  'paste_auto_cleanup_on_paste': True,
  # Adjacent string literals are joined, so the button list can span lines
  'theme_advanced_buttons1': ("formatselect,separator,"
                              "bold,italic,hr,separator,"
                              "link,unlink,separator,bullist,"
                              "numlist,separator,undo,redo"),
  'theme_advanced_buttons2': "",
  'theme_advanced_buttons3': "",
  'theme_advanced_blockformats': "p,h1,h2,h3,blockquote",
  'theme_advanced_toolbar_location': "top",
  'content_css': "/media/css/tiny_editor.css",
}

Most of this is self-explanatory, but a few notes:

theme_advanced_buttons1: By default, the advanced theme will have three rows of buttons. We only want one, so we put all the buttons and options we want to appear on this one line. Notice that the buttons2 and buttons3 lines are empty – required if you just want a one-row toolbar.

theme_advanced_blockformats: This is the list of block-level containers that will appear on the formatting picklist. I removed several of the containers that were there by default.

convert_urls: By default, TinyMCE rewrites the URLs it finds in the content (for example, turning absolute URLs into relative ones). This won’t work for our purposes (see oEmbed, below). Fortunately, setting this option to False asks the editor to leave URLs as the user entered them.

paste_auto_cleanup_on_paste: If users paste out of Word or from a rich web page, all of the formatting crap will be stripped out automatically, so only your site’s styles are in play.

That takes care of the front-end nicely. Using just the WYSIWYG editor, users can’t inject anything dangerous into your site. But it’s far from secure – anyone who turns off JavaScript in the browser will still find a plain old textarea, happy to accept any random crap code they drop in.
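To see why the editor alone protects nothing, here is a small sketch of what the server actually receives. It assumes the form field is called notes (the name used in the view code later in this post) and only encodes and decodes a request body locally; nothing is sent anywhere.

```python
from urllib.parse import urlencode, parse_qs

# Markup that the rich text editor would never produce on its own.
payload = {"notes": '<p>Hello</p><script>alert("xss")</script>'}

# This is the form body a script, or a browser with JavaScript disabled,
# could submit directly to the form's URL.
body = urlencode(payload)

# Decoding the body the way a web framework would gives back the exact
# markup: the editor's restrictions never ran.
received = parse_qs(body)["notes"][0]
print(received)
```

Whatever rules the editor enforces, the back-end has to enforce them again.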

Whitelisting tags on submit

Yes, there are Django template filters that can strip out HTML or allow only certain tags to be displayed. But that’s not good enough. We want to prevent bad stuff from ever getting into the database to begin with, so template tags and filters (which only work with output, not input) aren’t the answer. There are a few functions suitable for the job on djangosnippets.org, but the best solution I found was this function on Stack Overflow. The nice thing about this version is that it lets you also whitelist attributes (so links can keep their href, e.g., while event handlers like onclick are dropped), and it strips things like “javascript:” out of an href (so naughty code can’t be inlined).

import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    # Match "javascript:" / "vbscript:" even when the letters are separated
    # by whitespace or hex character entities
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags = 'p i strong b u a h1 h2 h3 blockquote br ul ol li'.split()
    validAttrs = 'href src width height'.split()
    urlAttrs = 'href src'.split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True # Drop the tag itself but keep its contents
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
                # Remove scripts (vbs & js); repeat so that nested input such
                # as "jajavascript:vascript:" cannot reassemble the scheme
                while re_scripts.search(val):
                    val = re_scripts.sub('', val)
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))

    return soup.renderContents().decode('utf8')

Obviously, your whitelist of allowable tags is stored in validTags.

Save the code above as an importable function somewhere in your project – I placed it in a utils.py. You’ll need to make sure you’ve got the amazing BeautifulSoup parser installed and importable as well: pip install BeautifulSoup. Then, in the view function that handles the form submission:

from appname.utils import sanitizeHtml
# ....
item.notes = sanitizeHtml(request.POST['notes'])

That’s it! Now test your form with JavaScript disabled in the browser – no matter how hard you try, no tags beyond the ones you specified will make it into the database.
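The function above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same whitelist idea can be sketched with only the standard library. This is not the Stack Overflow function and not a hardened sanitizer: the names WhitelistSanitizer and sanitize_html, the list of safe URL schemes, and the choice to drop the contents of script and style elements are all my own assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

VALID_TAGS = set("p i strong b u a h1 h2 h3 blockquote br ul ol li".split())
VALID_ATTRS = {"a": {"href"}}            # attributes allowed per tag
SAFE_SCHEMES = {"", "http", "https", "mailto"}  # "" means a relative URL
VOID_TAGS = {"br"}                       # tags that take no closing tag
DROP_CONTENT = {"script", "style"}       # discard these tags and their text


def _safe_url(value):
    # Browsers ignore whitespace and control characters inside a scheme,
    # so remove them before looking at the scheme.
    cleaned = "".join(ch for ch in value if ch > " ")
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        return False
    return scheme in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in VALID_TAGS:
            return  # unknown tag: drop the tag, keep any text inside it
        kept = []
        for name, value in attrs:
            allowed = name in VALID_ATTRS.get(tag, ())
            if allowed and value is not None and _safe_url(value):
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in VALID_TAGS or tag in VOID_TAGS:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_html(value):
    parser = WhitelistSanitizer()
    parser.feed(value)
    parser.close()
    return "".join(parser.out)


print(sanitize_html('<p onclick="x()">Hi <b>there</b></p><script>alert(1)</script>'))
# prints: <p>Hi <b>there</b></p>
```

Unlike the regex approach, this sketch drops a URL attribute outright when its scheme is not on the safe list, rather than trying to cut the dangerous part out of the value. Comments and doctype declarations vanish because the parser's default handlers ignore them.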

What about the embed tag?

So much great cross-pollination of content between sites is enabled by the embed and object tags – remote video on YouTube and Vimeo, Google Maps and mashups, widgets and feeds from dozens of providers… but if you allow “embed” or “object” on a public-facing form, you’re opening yourself up for a world of hurt.

An excellent alternative is to take advantage of the oEmbed standard. oEmbed is a straightforward API agreed upon by dozens of content distributors. The idea is simple – rather than allowing “embed” code, allow the user to simply paste in the URL of the page on which the content appears. Your site can parse submitted text for these URLs, then reach out and grab the correct embeddables from the remote host. This prevents users from messing with embed sizes, fat-fingering blocks of intimidating code, or embedding content from unapproved sites. If you’re working with Django, django-oembed provides a complete toolkit.
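To make the request side of that exchange concrete, here is a minimal sketch of how a consumer builds an oEmbed request for a pasted page URL. It is not django-oembed's API: the helper name oembed_request_url and the OEMBED_ENDPOINTS mapping are made up for illustration, and the two endpoint URLs are the ones I understand those providers to publish, so check their documentation before relying on them. The url, format, and maxwidth parameters come from the oEmbed specification. The sketch only builds the URL; it does not fetch anything.

```python
from urllib.parse import urlencode, urlsplit

# Approved providers, keyed by host name (an illustrative whitelist).
OEMBED_ENDPOINTS = {
    "www.youtube.com": "https://www.youtube.com/oembed",
    "vimeo.com": "https://vimeo.com/api/oembed.json",
}


def oembed_request_url(page_url, maxwidth=None):
    """Return the oEmbed request URL for page_url, or None if the
    page is not hosted by an approved provider."""
    host = urlsplit(page_url).netloc.lower()
    endpoint = OEMBED_ENDPOINTS.get(host)
    if endpoint is None:
        return None  # unapproved site: leave the URL as plain text
    params = {"url": page_url, "format": "json"}
    if maxwidth is not None:
        params["maxwidth"] = maxwidth  # your site, not the user, picks the size
    return endpoint + "?" + urlencode(params)


print(oembed_request_url("https://www.youtube.com/watch?v=abc123", maxwidth=480))
print(oembed_request_url("https://evil.example/video"))
```

The provider's JSON response carries the embed markup, so the only thing the user ever submits is a URL, and the size and the list of allowed hosts stay under your control.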

But django-oembed is a bit finicky – it wants to see plain old URLs, not URLs wrapped in other tags. And that’s why we disabled convert_urls in the TinyMCE config above.

5 Replies to “Allowing Secure User Input with Django”

  1. Great writeup Scot, this will surely come in useful for me in the future. Thanks for this! I wonder if there is an alternative to sanitizeHtml or a separate python module for it, something you can just import and sanitize(html) that is also somewhat of a “de-facto” standard so it’s more secure since many sites use it. Man it’s 2010 how can this wheel not be de-facto yet? People don’t want formatting in their comments? W3C help?

  2. Thanks Milan. And I totally agree – not only should this be a standard library, it should be part of the framework. A strange omission when you think about it.

  3. Depends on the use case, but I always do HTML and Javascript tag stripping in the view. BTW this post was written six years ago and I would never sanitize with a manual method like this – these days I just use bleach, like this. Way simpler.
