A narrowly averted disaster

Stellar was down for most of yesterday. The outage happened due to the confluence of at least three mistakes, A Perfect Storm of Idiocy, if you will, for which I take full responsibility. I sincerely apologize for the outage and am beyond apologetic about what almost happened to all of your faves…even now, the thought of it makes my stomach churn. Here’s what happened.

I have a friend named Adam. I’ve known Adam since the mid-90s when we were on an email list together. Upon returning from a long weekend with my family late on Monday night, I noticed that Adam had faved quite a lot of items on Stellar. 7.8 million of them, actually. Upon further investigation, I determined that every single thing ever faved by any Stellar user was now, instead, faved by Adam. OH. SHIT.

But let me back up. After I got home, I checked my email right before bed and got a notification that Stellar was acting weird. And it was…people’s faves were all out of whack and messed up. When I went to check it out, the first thing I noticed was that my RDS instance at Amazon was running out of memory. Assuming that was the cause of the issue, I raised the memory allocation from 20 GB to 25 GB. That triggered a backup of the database. A backup full of bad all-Adam-faves data that replaced the backup full of good data made earlier in the day. Or so I thought.

Around this time, I messaged my friend Greg. Greg has helped me a bunch with Stellar’s back-end setup. We found the millions of Adam-faves, worked through a bunch of scenarios, read a bunch of docs, and determined that 1) a rogue UPDATE query had changed all those DB entries, and 2) because raising the memory allocation had triggered a backup, the good data was gone. For good.
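For anyone wondering how a single UPDATE can do that much damage, here is a minimal sketch in Python with an in-memory SQLite database. To be clear, this is not Stellar's actual code or schema: the `faves` table, its columns, and the `reassign_faves` helper are all made up for illustration. It shows an UPDATE with no WHERE clause rewriting every row, and one possible guard that rolls the transaction back when more rows change than expected.

```python
import sqlite3

# Hypothetical schema for illustration only; not Stellar's real tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faves (id INTEGER PRIMARY KEY, user_id TEXT)")
conn.executemany(
    "INSERT INTO faves (user_id) VALUES (?)",
    [("alice",), ("bob",), ("carol",)],
)
conn.commit()

# The rogue case: no WHERE clause, so every row is reassigned.
rogue = conn.execute("UPDATE faves SET user_id = ?", ("adam",))
rogue_count = rogue.rowcount  # number of rows the statement touched
conn.rollback()               # undo it; nothing was committed


def reassign_faves(conn, fave_ids, new_user, max_rows=100):
    """Reassign specific faves, rolling back if too many rows change.

    Hypothetical helper: max_rows is an assumed sanity limit, not a
    feature of SQLite or of any real Stellar code.
    """
    if not fave_ids:
        return 0  # an empty filter must never mean "update everything"
    placeholders = ",".join("?" for _ in fave_ids)
    sql = f"UPDATE faves SET user_id = ? WHERE id IN ({placeholders})"
    try:
        cur = conn.execute(sql, [new_user, *fave_ids])
        if cur.rowcount > max_rows:
            raise RuntimeError(f"refusing to change {cur.rowcount} rows")
        conn.commit()
        return cur.rowcount
    except Exception:
        conn.rollback()  # leave the data exactly as it was
        raise


changed = reassign_faves(conn, [1], "adam")
owners = [row[0] for row in conn.execute("SELECT user_id FROM faves ORDER BY id")]
```

A row-count check like this would not replace real backups, but it turns a silent everything-changed update into a loud failure that leaves the data alone.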

I didn’t sleep at all that night. I cried. I nearly threw up. The thought of something I’d worked so hard on for more than three years disappearing, of being responsible for everyone’s data vanishing, of it being all my own stupid fault…it was all too much for me to sleep. By around 8am the next morning, I felt like that raccoon that Kevin Rose threw down the stairs.

After running an essential errand, I sat down to start writing this post (or, rather, the Bizarro universe version of this post where Stellar is dead), to explain to everyone what had happened and what the scenarios were for moving forward. I was interrupted by an email from my friend Mark offering his help. After I explained the situation and how, with no backup, there was likely no hope, he offered the assistance of Buzzfeed’s ops team. I mean, I say “team” but they are actually ninjas. BF, being a large operation with many, many servers, has AWS contacts and resources that a little site like Stellar does not. Raymond got in contact with those resources and eventually an answer came back: regardless of the bad all-Adam-faves backup, you should be able to restore the database back to a point in time within the past 24 hours, perhaps even further. [I don’t want to ding Amazon here, because I love AWS and their support people were very helpful in resolving this matter, but the docs could be clearer on this point. Everything I read (or remember reading at 2am after a long day of hiking and driving) gives the impression that, with the backup settings I was using, the data was unrecoverable.]

After a few tries (and after tricking RDS into thinking it had a longer backup window than it actually did), we successfully restored the DB back to before the favepocalypse. With help from Adam in retracing his steps on Stellar the evening before, I located the bug, which turned out to be a small line of code I forgot to check regarding The Great YouTube Debacle of 2013. I fixed the code, pushed it to the server, pointed it at the recovered database, and got the site back up again.

So, the worst thing that happened in the end was that everything anyone did on Stellar for about 12 hours (from noon to midnight on Monday) was discarded. That’s still a serious thing and I am sorry that it happened. If you followed someone, changed your information, or authed/deauthed an account during that time, you’ll need to do it again. In addition, some faves were probably missed while the site was down. Many of the missing faves have already been picked up by the crawlers, so hopefully the impact will be minimal.

What happens now? Stellar’s backend infrastructure must improve. By a wide margin, the biggest mistake I made was not having an adequate backup plan in place. This was a total idiot maneuver on my part and is in the process of being corrected, big time. The plan is better as of this writing (longer backup window, frequent snapshots) and will improve significantly (daily offsite backups, etc.) by the end of the week. There are other aspects of the site’s workings that I’ve been a little lazy about…those will be improved upon as well.

And that’s only the beginning…lots more to come. The idea of learning from your mistakes has been flogged to beyond cliche, but in this case, I really feel hopeful this will make Stellar better. It’s also reinforced how much I love doing the site and how important it is to me. Having so many people reach out to help and express concern and offer their best wishes was really touching. So thanks for that, extra-special thanks to Raymond, Greg, Mark, & Eugene, and I’ll work harder on screwing up less spectacularly. Fave on, you crazy diamonds.
