Amazon’s Cloud Downtime Cost Estimated at Two Million Dollars

April 22, 2011

Quick math shows that Amazon might need to pay $2M to its customers due to its SLA policy.

Amazon’s service in Northern Virginia has been down, to some degree, for more than 20 hours.

Assuming Amazon’s EC2 revenues in 2011 reach $750M, I’m guessing that the current monthly rate is $50M.

Let’s assume 40% of all EC2 servers are in the Northern Virginia region (I’m too lazy to do the exact calculations).

Amazon’s SLA states that it will credit 10% of the bill for the affected month (April :)) if availability is less than 99.95%. 20 hours of downtime means availability for this year would be below 99.8%.

If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill (excluding one-time payments made for Reserved Instances) for the Eligible Credit Period. To file a claim, a customer does not have to wait 365 days from the day they started using the service or 365 days from their last successful claim. A customer can file a claim any time their Annual Uptime Percentage over the trailing 365 days drops below 99.95%.
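Here is a minimal sketch of that threshold check. It assumes, generously, that all 20 hours count as downtime under the SLA's definition; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a trailing 365-day window


def annual_uptime_percentage(downtime_hours: float) -> float:
    """Uptime over the trailing 365 days, as a percentage."""
    return 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)


def eligible_for_credit(downtime_hours: float, threshold: float = 99.95) -> bool:
    """True when annual uptime falls below the SLA threshold."""
    return annual_uptime_percentage(downtime_hours) < threshold


uptime = annual_uptime_percentage(20)
print(f"Uptime after 20 hours down: {uptime:.2f}%")  # 99.77%
print(f"Eligible for credit: {eligible_for_credit(20)}")  # True
```

At 99.95%, the yearly downtime budget is only about 4.4 hours, so a 20-hour outage clears the bar several times over.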

Therefore Amazon has to pay 10% of the monthly bill to each relevant customer.

$50M × 40% × 10% = $2M. By the way, it seems this is not “real money” but only credit toward EC2 usage, so the real cost to Amazon is probably 70% of that (their margins are probably much nicer than they claim :)).
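The same back-of-the-envelope estimate as a small script, so the guesses are easy to change. Every input is one of the guesses from this post ($50M per month, a 40% regional share, a 70% real-cost factor), not a figure published by Amazon.

```python
def estimated_sla_credit(monthly_revenue: float, region_share: float,
                         credit_rate: float) -> float:
    """Total service credits if every affected customer files a claim."""
    return monthly_revenue * region_share * credit_rate


# All inputs are rough guesses from the post, not published numbers.
credit = estimated_sla_credit(
    monthly_revenue=50_000_000,  # guessed current monthly EC2 revenue
    region_share=0.40,           # guessed share of servers in Northern Virginia
    credit_rate=0.10,            # 10% service credit from the SLA
)
real_cost = credit * 0.70        # guess: credits cost Amazon ~70% of face value

print(f"Estimated credits: ${credit:,.0f}")
print(f"Estimated real cost: ${real_cost:,.0f}")
```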

The other interesting question is: what was the cost of the downtime for Amazon’s customers? Would I get free badges from Foursquare? That could cost them millions of free points…

Some disclaimers:

  • Customers have to ask for their money back, and I’m not sure all of them will remember to do it
  • The SLA states that more than one availability zone needs to be unavailable in order to count as “downtime”. At 1:48 PM PDT the status update mentions that only a single availability zone is unavailable, so maybe the relevant downtime is not 20 hours
  • I’m not a lawyer, and I did not do a thorough analysis of the agreement in any way. I’m just trying to understand the economics of SLAs
  • I still think EC2 is very cool, and they seem to be very honest about what’s going on. God knows most internal enterprise apps are in no better state. It will be interesting to see how the market leader behaves in this case and how much they will actually pay their customers back.
  • 40% might be a bit high, but it does seem most servers are still in the US

Does SLA really mean anything?

January 31, 2011

I believe most SLAs (Service Level Agreements) are meaningless.

SLAs have become a very popular topic in the world of Software as a Service and cloud computing, but the reality is very different from the theory.

In theory, every service provider promises 99.999% availability, which means less than 6 minutes of downtime per year.

In reality, even the best services (Amazon, Google, Rackspace) have had availability problems lasting 8 hours, which puts them at 99.9% availability at best.

High Availability 99.999 Downtime Table from Wikipedia
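The numbers behind the “nines” are easy to reproduce. This is a rough sketch that assumes a 365-day year and treats any outage as full downtime; the function names are mine.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability if the only downtime is one outage of this length."""
    return 100 * (1 - outage_hours * 60 / MINUTES_PER_YEAR)


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% -> {minutes:,.1f} minutes of downtime per year")

print(f"One 8-hour outage -> {availability_after_outage(8):.2f}% availability")
```

Five nines leaves a budget of roughly 5.3 minutes a year, while a single 8-hour incident already drops a provider to about 99.91%.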

Moreover, the economics just don’t make any sense. SLAs cannot replace insurance.

Imagine the following scenario.

E-commerce site “MyCatsAndSnakes.com” hosts its consumer site at “BestAvailabilityHosting”, which uses networking equipment from “VeryExpensiveMonopoly, Inc.”

If MyCatsAndSnakes is unavailable, the site owner, “Rich Bastardy”, loses $100,000 per hour of downtime.

Rich pays BAHosting $20,000 per month, and they promise him 99.999% availability.

BAHosting bought two core routers in high-availability mode, connected to three different ISPs. Each router costs $50,000, and Platinum support is another 30% per year, so the total cost is $130,000 for the first year.

One horrible day, the core routers hit a software bug and traffic to MyCatsAndSnakes is dead.

Since both routers run the same software, high availability does not help resolve the issue, and VeryExpensiveMonopoly’s top developers have to debug the problem on site. After 8 hours of brave efforts, cats and snakes are being sold online again.

Try to guess the answers to the following questions:

  • How much money did Rich lose? (Hint: $100,000 × 8 = $800,000)
  • How much money would Rich get from BestAvailabilityHosting? (Hint: (8/(24*30)) × $20,000 ≈ $222)
  • How much money would BAHosting get back from VeryExpensiveMonopoly? (Hint: $0)
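The hints can be checked with a few lines of code. This is only a sketch of the made-up scenario, assuming a 30-day month and a purely pro rata credit; the function names are hypothetical.

```python
def downtime_loss(hours: float, loss_per_hour: float) -> float:
    """Revenue the site owner loses while the site is down."""
    return hours * loss_per_hour


def pro_rata_credit(hours: float, monthly_fee: float, days_in_month: int = 30) -> float:
    """Refund proportional to the share of the month the service was down."""
    return monthly_fee * hours / (24 * days_in_month)


loss = downtime_loss(hours=8, loss_per_hour=100_000)
credit = pro_rata_credit(hours=8, monthly_fee=20_000)
vendor_compensation = 0  # the router vendor pays nothing in this scenario

print(f"Rich loses:          ${loss:,.0f}")
print(f"Hosting refunds:     ${credit:,.2f}")
print(f"Vendor compensates:  ${vendor_compensation:,.0f}")
print(f"Loss is {loss / credit:,.0f}x the refund")
```

The refund works out to a couple of hundred dollars against a six-figure loss, which is the whole point of the example.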

The networking vendor, VeryExpensiveMonopoly, does not give any compensation for equipment failure. This is true for all hardware and software vendors.

They don’t even have an SLA for resolution time. The best you can get with platinum support is a guaranteed “response time”, which is not a great help.

As a result, the hosting provider cannot get a back-to-back guarantee or insurance for networking failures.

The hosting provider limits its liability to the amount of money it receives from Rich ($20,000 per month), which makes sense.

Moreover, the service provider would only compensate pro rata, so the sum becomes even more negligible.

But that does not help Rich at all, as his losses are far bigger. He lost $800,000 of cat and snake deliveries to young teenagers across Ohio.

The real answer, IMO, is “insurance”. If Rich really wants to mitigate his risk, he can buy insurance for such cases.

The insurance company should be able to assess the risk and apply the right statistical cost model. Asking a service provider to do it is useless.

SLAs might be a good way to set mutual expectations, but they are certainly not a replacement for a good insurance policy or a disaster recovery plan (DRP).

Here is an interesting review of CRM and Salesforce.com’s (lack of?) SLA. And here are the SLAs for Amazon EC2 and Rackspace.

Amazon: “If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill”

GoGrid promises 10,000% but “No credit will exceed one hundred percent (100%) of Customer’s fees for the Service feature in question in the Customer’s then-current billing month”

Rackspace promises 100% availability, but “Rackspace Guaranty: We will credit your account 5% of the monthly fee for each 30 minutes of network downtime, up to 100% of your monthly fee for the affected server.”
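To see how the cap plays out, here is a rough sketch of a credit rule shaped like the Rackspace quote. The quote does not say how partial 30-minute blocks are treated, so counting only complete blocks is my assumption, and the $20,000 fee is borrowed from the made-up scenario above.

```python
def capped_credit(downtime_minutes: int, monthly_fee: float,
                  pct_per_block: int = 5, block_minutes: int = 30) -> float:
    """Credit of pct_per_block% per block of downtime, capped at 100% of the fee.

    Assumption: only complete blocks count; the quoted text does not specify.
    """
    blocks = downtime_minutes // block_minutes
    credit_pct = min(pct_per_block * blocks, 100)
    return monthly_fee * credit_pct / 100


# Hypothetical $20,000 monthly fee, as in the MyCatsAndSnakes example.
print(capped_credit(8 * 60, 20_000))   # 8 hours  -> 16 blocks -> 80% of the fee
print(capped_credit(10 * 60, 20_000))  # 10 hours -> cap reached at 100%
print(capped_credit(48 * 60, 20_000))  # 2 days   -> still capped at 100%
```

Even under this far more generous rule, the credit can never exceed one month's fee, no matter how much the customer lost.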

Again, I don’t think one can blame these service providers, but the gap between perception and reality seems large.

There are three real answers for customers who want an SLA from a service provider:

1) It would be better than on-premises

2) How much are you willing to pay for extra availability?

3) We have a great insurance agent 🙂