What to Optimize
By Jon Saints - 17 Apr 2014
If you hang out with developers enough, you will hear this refrain:
“Premature optimization is the root of all evil”
The idea is: don’t spend time fixing problems you don’t have, because doing so tends to make your code clumsier and more complicated than it needs to be.
In my recent work on www.tomnod.com, we were hit with a huge amount of traffic during the search for missing Malaysia Airlines flight MH370. From that experience, I can tell you that if premature optimization is evil…
“Optimizing under fire is only slightly more fun”
The question then becomes: what design decisions, what foundations, can I lay today that will let me scale my app to 100x or 1000x the load it handles now? Here is what we learned scaling Tomnod in March 2014.
It is never premature to optimize these things:
- Create a Method for Sharding Your Users: During our traffic onslaught, an engineer from Facebook suggested that the #1 thing we should do was “hash or shard our users”. By sharding, we mean the ability to show a different version of your app to a fraction of your users. In our case, we needed to bring a bigger database online to support the load. We knew the database needed time to warm up its in-memory cache, and the only way to warm it was to run queries against the data. If we had allowed all of our users to hammer the database before the cache was ready, it would have crashed the system. So, we sharded our users: we let 5% of users into the app while the other 95% saw our productive error page. Once the cache had warmed a bit, we let in 10%… 20%… 50%… 75%… and finally 100% of users. It turns out sharding is useful in other ways, too. You can use it to launch and load test new features: show a new feature to x% of users and watch for bug reports or unexpected load spikes. When things look good with your sample users, roll the feature out to everyone.
- Know Your Outs: For every component in the system, know and test the next size up. In some cases, this might mean testing that a migration to the next bigger instance size actually works. In other cases, it might mean testing that adding nodes to your horizontal worker or web farms works well. Test this frequently, and always make sure you know your next step for each component of your app. This is easy to fit into your routine, since you likely already run environments of different sizes: your dev environment might be small, and production a bit bigger with more nodes. Moving between sizes should be push-button, with no crazy deployment hoops to jump through to reach a larger system.
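The post doesn’t say how Tomnod actually implemented its sharding. One common approach is to hash each user ID into a fixed set of 100 buckets and admit only users whose bucket falls below the current percentage. The Python sketch below assumes that approach; the function names (`user_bucket`, `is_admitted`, `handle_request`), the `salt` value, and the `RAMP_STAGES` list are hypothetical, chosen only to mirror the 5% to 100% ramp described above.

```python
import hashlib

# Hypothetical ramp schedule mirroring the stages described in the post.
RAMP_STAGES = [5, 10, 20, 50, 75, 100]


def user_bucket(user_id: str, salt: str = "db-warmup") -> int:
    """Map a user ID to a stable bucket in the range 0-99.

    Hashing (rather than picking randomly per request) keeps each user on
    the same side of the cutoff from one request to the next.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_admitted(user_id: str, percent: int, salt: str = "db-warmup") -> bool:
    """Return True if this user falls inside the currently admitted percentage."""
    return user_bucket(user_id, salt) < percent


def handle_request(user_id: str, percent: int) -> str:
    # Users outside the shard get the error page and never touch the database.
    if is_admitted(user_id, percent):
        return "app"
    return "error_page"


if __name__ == "__main__":
    # Simulate the ramp with synthetic user IDs.
    users = [f"user-{i}" for i in range(10_000)]
    for percent in RAMP_STAGES:
        admitted = sum(is_admitted(u, percent) for u in users)
        print(f"{percent:3d}% stage: {admitted} of {len(users)} users admitted")
```

Because a user’s bucket never changes, raising the percentage only ever adds users: anyone admitted at 5% stays admitted at 10%, so nobody bounces between the app and the error page as the cache warms. Using a different salt for a feature launch reshuffles which users land in the first few percent, so the same people aren’t always your test group.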
In sum: with a productive error page, the ability to shard users, and knowledge of your outs, you should be ready to scale your app to the next order of magnitude. Doing much beyond that might be premature optimization. Doing less might mean a huge lost opportunity when the hungry crowd comes knocking.