Loose notes from the SXSW 2008 panel "Scalable Web Ventures," with:
Chris Lea, Media Temple
Joe Stump, Lead Architect, Digg.com Inc
Cal Henderson, Badass MC, Flickr
Matt Mullenweg, Founding Dev, Automattic/WordPress
Kevin Rose, Founder, Diggnation/Digg Inc
This session was about much more than load balancing – scaling orgs in all directions (personnel, technique, communication) – but stayed focused on technical scaling techniques. Amazing to see how some of the internet’s most popular properties have faced the problem in completely different ways, and how all of them basically learned by doing. You can throw money, software, hardware, or brains at the problem, in various combinations… and these orgs have tried everything. Juicy stuff.
When do you need to worry about scaling? Ahead of time, or once you start to see problems?
First year of Flickr, didn’t know what they were doing at all… didn’t think about this kind of thing. If they had tried to deal with scaling in the first year, they wouldn’t have been able to move at the same pace. A lot of people can ignore scale forever.
Web 2.0: You want me to store your data, you want it stored forever, and you want it free. Flickr can’t ever “sunset” a photo just because it hasn’t been seen in 4 years. You’re permanently committed.
Brad Fitzpatrick put a lot of good documentation out for LiveJournal, and Flickr looked at their scaling docs – very helpful.
Look on Amazon for Cal Henderson’s book on scaling Flickr (Building Scalable Web Sites, O’Reilly).
If you have a plan in place, at least you can cover your butt quickly. StumbleUpon kept finding situations where something would break but they had no idea why.
Wikipedia (runs on MySQL) uses NetScaler boxes – an appliance for removing money from small places. Digg uses them too. Linux Virtual Server (LVS) is a great product for balancing – much more affordable. Hardware load balancers are slightly easier to set up than LVS, but that advantage is marginal. WordPress.com uses all open source. If you’re doing an insane amount of traffic, LVS is hard to configure to handle more than 10 gigabits.
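As a loose illustration of the decision a balancer like LVS makes (LVS actually schedules inside the kernel, with several algorithms; this sketches only least-connections, and the backend names are made up):

```python
# Toy sketch of least-connections scheduling: route the next request
# to whichever backend currently has the fewest active connections.
# Hypothetical host names; real LVS does this in kernel space.

def pick_backend(backends):
    """backends: dict mapping host name -> active connection count."""
    return min(backends, key=backends.get)
```

The same idea scales to weighted variants by dividing each count by the backend's capacity weight before comparing.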
WordPress.com is now using nginx (applause from audience – look into this).
Scalability equals specialization. If you think you’re going to scale a generalized solution to a very large size, you’re pretty much wrong. You have to start breaking services apart (db servers separate from file servers separate from platform, etc.)
WordPress.com – it’s good to start everything on the cheapest hardware possible. They scale (like Google) with lots of cheap boxes; at that scale the economics don’t favor big iron. In fact they’re now moving to leasing very cheap boxes from other providers – hardware as a service. Reduces total cost of ownership.
The easiest way to solve scaling problems is by throwing money at them.
“sharding” your databases – partitioning data. Digg is working on this now.
Flickr uses MySQL too. The MySQL Proxy project embeds the Lua language – send it whatever you want, then just talk to the proxy, and the sharding gets taken care of automatically. WordPress wrote an open source database class that does something similar – openly available.
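The core of sharding can be sketched in a few lines. This is a generic illustration, not how Digg, Flickr, or MySQL Proxy actually route queries; the shard count and host names are made up:

```python
# Sketch of sharding: partition rows across databases by a shard key
# (here, user ID) so no single MySQL box holds everything.

SHARD_COUNT = 4  # assumed number of database shards

def shard_for(user_id: int) -> int:
    """Map a user to a shard. Modulo is the simplest scheme; real
    deployments often use a lookup table so shards can be rebalanced
    without rehashing every user."""
    return user_id % SHARD_COUNT

def dsn_for(user_id: int) -> str:
    """Build a connection string for the shard that owns this user's
    rows. Host names here are purely illustrative."""
    return f"mysql://db-shard-{shard_for(user_id)}.internal/app"
```

The appeal of a proxy layer is that it hides this routing behind a single endpoint, so application code never has to call anything like `dsn_for` itself.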
What about scaling people? When you have two developers it’s easy to generate consensus. When you get to 7, people start to disagree and you need to appoint leadership. Perfect team size is 3-6 developers (Digg).
StumbleUpon – as a project gets bigger, well-documented code becomes more critical. Consider designing software in a wiki before actually writing code – this is great for new devs coming on board.
WordPress – employees are 100% remote – no in-office workers. They get together 2x/year. People are more isolated, but they get a ton done b/c far fewer distractions.
Language zealots: your language will never be your bottleneck. It will never be whether you’re using single or double quotes, or how you indent (joking). It will usually be your db (first), then disk I/O (second).
The first obvious bottleneck for Flickr is running out of disk space. With a tiny bit of foresight, you can avoid that.
Digg has over 1 billion comments in their comments table – the hottest table for both reads and writes. If you have 20 tables, chances are 19 of them are not a problem – only one or two of them are.
There’s stuff in Flickr that will never scale well, but they’d have to be 10,000 times the current size for that to matter. But if someone found out about it and DoS’d it, it would be a big problem.
On several sites, there are admin pages that take far more resources than public pages. When a site goes down, the first question is “What were the admins doing?”
ganglia.sourceforge.net is an awesome way to graph and monitor thousands of stats across lots of servers.
Monitoring tools used by the panel: nagios, mrtg, puppet, munin.
Give users the tools to moderate themselves. Digg added ability to “bury” comments and that made a HUGE difference in the moderation load.
Make it easy for communities to segregate themselves into sub-communities.
Digg releases between once and 40 times per day! Even eBay pushes 3-4x per week.
Break down a problem into the smallest possible pieces and push those.
WP: the most pain they’ve ever had was when they waited a month between releases of the site. WP also doesn’t like svn branches.
flickr doesn’t do local development environments – there are too many moving parts. Too much needs to be installed locally to get it working. They just keep one big dev environment that all devs share.
Everyone agrees caching is the bomb. If your whole site is in memory, look at varnish.
memcache is incredibly important. Flickr sets cache expiry to forever, so cached data only changes when the underlying data changes.
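That "cache forever, invalidate on write" pattern can be sketched as follows. A plain dict stands in for memcached here, and all function names are hypothetical; a real app would use a memcached client with the same get/set/delete shape:

```python
# Sketch of the "cache forever, invalidate on write" pattern: no TTL,
# so a cached entry lives until a write explicitly evicts it.

cache = {}  # stand-in for memcached

def load_photo_from_db(photo_id):
    # Placeholder for the real database query.
    return {"id": photo_id, "title": "untitled"}

def get_photo(photo_id):
    """Read-through: on a miss, load from the db and cache forever."""
    key = f"photo:{photo_id}"
    if key not in cache:
        cache[key] = load_photo_from_db(photo_id)
    return cache[key]

def update_photo(photo_id, title):
    """Write to the database, then invalidate the cached entry, so the
    cache only ever changes when the data changes."""
    # ... write the new title to the database here ...
    cache.pop(f"photo:{photo_id}", None)
```

The next `get_photo` after an update repopulates the entry from the database, so readers never see stale data and hot entries never expire on a timer.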
WordPress: both memcaching and page caching. In a slashdotting, even flushing once per second will reduce your effective requests by 20x.
Flickr does a one minute cache on all photo view pages. That saved them a ton.
How long can you get away with a dirty cache? Longer than you might think. Do it smartly – dirty cache lengths for different parts of your site.
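A minimal sketch of per-section dirty-cache TTLs, in the spirit of the advice above. The section names and TTL values are illustrative, not anyone's real config:

```python
import time

# Sketch of a "dirty cache" with a different staleness budget per
# section of the site. TTLs are made-up examples: even 1 second on a
# hot page collapses a slashdotting's requests into one rebuild/sec.

TTLS = {"front_page": 1, "photo_page": 60, "profile": 300}

_cache = {}  # (section, key) -> (expires_at, value)

def cached(section, key, compute):
    """Serve a possibly-stale value while it's within the section's
    TTL; otherwise call compute() and cache the fresh result."""
    now = time.time()
    entry = _cache.get((section, key))
    if entry and entry[0] > now:
        return entry[1]
    value = compute()
    _cache[(section, key)] = (now + TTLS[section], value)
    return value
```

The point is the per-section table: you tune how dirty each part of the site is allowed to get, instead of one global expiry.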
Flickr: There’s something about providing an API that invites idiots to scrape a billion of your pages in 20 minutes. It brings out the idiots. They block IPs that try to consume all their resources and suck all their data out. It’s always research students who “had no idea.”
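One common defense against that kind of scraping is per-IP rate limiting. This is a generic sliding-window sketch, not Flickr's actual blocking mechanism, and the window/limit numbers are made up:

```python
import time
from collections import defaultdict, deque

# Sketch of per-IP sliding-window rate limiting: at most MAX_REQUESTS
# per WINDOW_SECONDS from any one address. Limits are illustrative.

WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return True if this request is within the ip's budget; a caller
    would typically answer False with an error or a temporary block."""
    now = time.time() if now is None else now
    q = _hits[ip]
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()  # drop requests that fell out of the window
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```

A deque keeps both the expiry sweep and the append O(1), which matters when the limiter itself sits on every API request.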
Digg: trac, svn. Main code plus custom PEAR packages (Digg is a PHP/MySQL shop). Big believer in unit testing.
Recommended blog: High Scalability – excellent resource for all aspects of scalability problem.