A Proper Post-Mortem
There are three companies that I really enjoy doing business with. They are USAA, Amazon (on the consumer side), and Internap (on the tech side).
Let’s skip USAA, because it isn’t tech oriented, and look at Internap first. When there is a failure in Internap’s service, 9 times out of 10 they tell me before I realize it. Most of the time, these failures are transient, and I would never have known there was a problem had Internap not emailed me the details.
Here’s an email I got from Internap on May 30th, 2012:
At approximately 12:18 EDT on May 30, 2012 we were notified that the BGP session for our Verio provider in the ACS PNAP (Atlanta, GA) was in an active (down) state. The session recovered at 12:22 EDT and has been stable since that time.
During this time period, you may have noticed some sub-optimal routing and slight latency or packet loss as traffic destined for the Verio network was re-routed through our other providers in the PNAP. Once the session recovered, you may have noticed sub-optimal routing and slight latency again, as traffic was re-routed back onto Verio.
This type of outage is routine and isn’t a big deal. Internap lost an upstream provider at their PNAP. So what? I don’t really care. I pay them to deal with this and I experienced no downtime. But what happens when Internap themselves fuck up? We’ve had two major Internap outages. One was caused by an internal error from a sysadmin and the other by a faulty Cisco command module. Most importantly, we received a full RFO (reason for outage) each time.
Now that we’ve moved to AWS, Amazon’s RFO for the June 14th, 2012 outage makes me incredibly happy. From Amazon:
We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.
This is the most beautiful thing I can imagine. They are not hiding the failures. They are admitting that it failed and they are giving both the reason why it failed and what they’re going to do to prevent future failures. The best part of this is that I didn’t have to wake up and deal with this all night. This is why IaaS is a good idea… as long as this communication continues.