Evolving your product isn't easy. Best case, nothing breaks, users love it, and you high-five until your hand hurts. Worst case, Digg v4.
But it doesn't have to be hard. There are ways to make the transition smoother. For instance, Adzerk recently migrated from Reporting Backend 1.0 to 2.0 without any major incidents or late nights. How'd we do it? Read on!
Here's a technical diagram of what our Reporting 1.0 Backend looked like:
Joking aside, our 1.0 backend had reached its scalability limits, making it a constant source of firefighting. Not particularly liking this, we assembled a team to rewrite it, and soon we had a shiny new Reporting 2.0 Backend. The next step was decommissioning 1.0, but we didn't want to switch it over without further testing. This is where we used our good friend, The Four Step Process™:
The Four Step Process™
Step 1 - Write to both places.
Step 2 - Read from the new place, fall back to the old.
Step 3 - Stop reading from the old place.
Step 4 - Stop writing to the old place.
"Place" here refers to an address for data. It could be database columns, cache keys, a variable name, a function call, a path on a filesystem, a URI, or just about anything!
Four Step Example
Say you have a system that writes a 'name' value that is first name and last name concatenated. Well, that's no good; we want 'first-name' and 'last-name' written individually. This will speed up our alphabetizer service tenfold. Time to 4-step this bad boy!
We write to the new places of 'first-name' and 'last-name', while still writing to 'name'.
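As a rough sketch of what this dual write could look like (illustrative only: the dict-backed store and the write_user helper are made up for this example, not taken from a real system):

```python
def write_user(store: dict, user_id: str, first: str, last: str) -> None:
    """Step 1: write the new places while still writing the old one."""
    record = store.setdefault(user_id, {})
    record["name"] = f"{first} {last}"  # old place: readers still depend on it
    record["first-name"] = first        # new places: written, but not read yet
    record["last-name"] = last


# Hypothetical in-memory store standing in for a real database.
store = {}
write_user(store, "u1", "Ada", "Lovelace")
print(store["u1"])
```

Nothing reads the new fields yet, so a bug in the new write path can't hurt anyone at this stage.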
Alright, the new data looks legit, so we start reading from 'first-name' and 'last-name'; if they don't exist, we fall back to reading from 'name'. No problem.
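A minimal sketch of the read-with-fallback, assuming records are plain dicts and that the last space-separated token of the old value is the last name (a naive assumption that real names will happily violate):

```python
def read_last_name(record: dict) -> str:
    """Step 2: prefer the new place, fall back to the old one."""
    if "last-name" in record:
        return record["last-name"]
    # Fallback for records written before Step 1: derive the value
    # from the old concatenated field.
    return record["name"].rsplit(" ", 1)[-1]


# Hypothetical records: one written after Step 1, one written before it.
new_record = {"name": "Ada Lovelace", "first-name": "Ada", "last-name": "Lovelace"}
old_record = {"name": "Grace Hopper"}

print(read_last_name(new_record))
print(read_last_name(old_record))
```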
Let's say our data is not ephemeral and we care about keeping old values. To stop reading from the old place, we need to modify the old data. We do so, making sure there is a 'first-name' and 'last-name' for each 'name'. This step can be tricky, which is why it's important that we're still writing to 'name'.
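The backfill might look something like the sketch below. It is hypothetical: backfill_names and the dict store are invented for illustration, and the split on the last space is a deliberately naive assumption.

```python
def backfill_names(store: dict) -> int:
    """Fill in 'first-name' and 'last-name' for records written before Step 1."""
    updated = 0
    for record in store.values():
        if "first-name" in record and "last-name" in record:
            continue  # already written by the dual-write path
        # Naive split: everything before the last space is the first name.
        first, _, last = record["name"].rpartition(" ")
        record["first-name"] = first
        record["last-name"] = last
        updated += 1
    return updated


# Hypothetical store with one new-style and one old-style record.
store = {
    "u1": {"name": "Ada Lovelace", "first-name": "Ada", "last-name": "Lovelace"},
    "u2": {"name": "Grace Hopper"},
}
updated = backfill_names(store)
print(updated, store["u2"])
```

Note that the backfill never touches 'name'. If the split logic turns out to be wrong, the old value is still there to recover from.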
Step 3 has survived production for some time, so we stop writing to 'name' and then celebrate total victory.
Real World Example
While the above example was a real rollercoaster of emotion, below is how we actually used the Four Step Process™ to replace our backend infrastructure.
Step 1 - Write to both places
Step 1 was building the Reporting 2.0 Pipeline, which is a story for another post.
Step 2 - Read from the new place, fall back to the old
At this point, we hadn't load-tested our new Redshift cluster. It could ingest data, but would it survive our users' queries? To test this, we:
Built a 1.0->2.0 request translation service, as well as a 2.0->1.0 response translation service.
Wrote code to put incoming requests behind a feature flag, letting us safely stress-test our Redshift cluster and compare data.
Slowly added customers until all reporting queries were running successfully against the new backend.
Gave beta testers the option to use the translated 2.0 output, with 1.0 output as the default option. Our beta testers gave us great feedback, and on a per-customer basis we changed the default to 2.0.
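To give a feel for the feature-flag approach, here is a simplified sketch. It is not the actual Adzerk code: run_report, the translation callables, and the toy request/response shapes are all hypothetical. The idea is that flagged customers also exercise the 2.0 path so results can be compared, while everyone keeps getting the 1.0 answer.

```python
def run_report(customer_id, request, *, flagged, query_v1, query_v2,
               to_v2_request, to_v1_response, mismatches):
    """Serve from 1.0; for flagged customers, also query 2.0 and compare."""
    old = query_v1(request)
    if customer_id not in flagged:
        return old
    try:
        new = to_v1_response(query_v2(to_v2_request(request)))
    except Exception as exc:
        # A 2.0 failure is recorded but must never break the customer's report.
        mismatches.append((customer_id, request, repr(exc)))
        return old
    if new != old:
        mismatches.append((customer_id, request, new))
    return old


# Toy stand-ins for the two backends and the two translation services.
mismatches = []
result = run_report(
    "acme", {"metric": "clicks"},
    flagged={"acme"},
    query_v1=lambda req: {"clicks": 10},
    query_v2=lambda req: {"rows": [["clicks", 10]]},
    to_v2_request=lambda req: {"select": [req["metric"]]},
    to_v1_response=lambda resp: dict(resp["rows"]),
    mismatches=mismatches,
)
print(result, mismatches)
```

Growing the flagged set slowly ramps load on the new cluster, and the mismatch list shows where the two backends disagree.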
Step 3 - Stop reading from the old place
By this point, we had ironed out 99% of the kinks. We were confident enough to roll out a final feature flag: stop sending requests to 1.0. Woo!
However, as ready as we were to kill 1.0, we were still ingesting data into it... juuuuust in case. Patience is hard here, but so, so important.
Step 4 - Stop writing to the old place
A few more months passed without any major incidents, so we terminated our oldest EC2 instances. Reporting 1.0 was officially dead. It felt good, but also somehow sad. That said, we overcame our grief pretty quickly after seeing the next month's Amazon cost graphs.
Burning down our Reporting 1.0 Backend was a fun and stressful adventure. We expected many late nights, but, surprisingly, the whole process went smoothly. I attribute this to how much time we spent planning our four steps (as well as having the patience to not immediately switch over).
So, next time you're in a position like this, consider using the Four Step Process™!