S3 in Business: 9 – Technical risk

[Complete series]

Last time, I covered the commercial risks associated with Amazon’s continued existence and continued offering of Amazon S3. Now it’s time to turn to technical risks. How reliable is the Amazon S3 service? Can we trust our backups to it? Can we build a business on it?

Computers go wrong. That is what they do. The job of a service provider is to build a reliable service on top of unreliable machinery. Amazon don’t say much about how they achieve reliability except that they have multiple data centres in different locations and S3 doesn’t even acknowledge receipt of a data object until it has been successfully stored in at least two geographically separate places. The idea is that a failure in one place is improbable but a failure in two places is inconceivable.
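The "improbable in one place, inconceivable in two" argument is just multiplication of probabilities. The sketch below is illustrative only: the failure probability is a made-up number (Amazon publish no such figure), and the calculation assumes that the two sites fail independently, which is exactly the assumption that geographical separation is meant to justify.

```python
def probability_all_copies_lost(p_single_site: float, copies: int) -> float:
    """Chance that every copy is lost, ASSUMING sites fail independently."""
    if not 0.0 <= p_single_site <= 1.0:
        raise ValueError("p_single_site must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single_site ** copies


# Hypothetical figure: a 1-in-1000 chance of losing an object at one site.
p = 0.001
print(f"one copy:   {probability_all_copies_lost(p, 1):.6f}")
print(f"two copies: {probability_all_copies_lost(p, 2):.6f}")
```

If failures are correlated (a bug that corrupts both copies in the same way, say), the real figure will be worse than this.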

One can measure reliability in two ways: the percentage of time that the service is available, and the probability of a stored data item being lost or corrupted. In the case of Tunesafe, an occasional loss of service doesn’t matter much (the Tunesafe program can just go to sleep for a bit and try again later) but the loss or corruption of data is very serious indeed. It is in the nature of backups that no-one looks at them for years and years, and this means that any errors that have been introduced into the backup are discovered far too late for anything to be done about it.

Amazon make a claim about service availability (at least 99.99%, which allows about 53 minutes’ downtime per year) but they make no claims about the integrity of the stored data. And they make no contractual promises at all about anything.
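For anyone who wants to check the arithmetic behind an availability percentage, here is a small sketch. The function name is my own, and the calculation ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_minutes_per_year(availability_percent: float) -> float:
    """Longest total outage per year consistent with the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


for pct in (99.0, 99.9, 99.99):
    minutes = max_downtime_minutes_per_year(pct)
    print(f"{pct}% available -> up to {minutes:.1f} minutes down per year")
```

At 99.99% the answer is a little under an hour.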

Amazon make a great thing of the fact that they use Amazon S3 themselves for their own business. This is a great recommendation but it isn’t actually a guarantee. Quite apart from anything else, we can’t compare Amazon’s tolerance for data loss with our own. If we use S3 as our only backup we may be far more sensitive to losses than Amazon are. Amazon with one or two books missing is still Amazon: a backup with one or two files missing is practically useless.

Why data loss is an easy risk

Data loss has one encouraging property: it is easy to tell whether it has happened. With S3 it’s easy to prove whose fault it is, as well: if Amazon maintain a tamper-proof log of every PUT and DELETE request to S3 then anyone can tell, objectively and beyond dispute, whether any object stored on S3 has (a) disappeared without a DELETE request or (b) changed since the PUT request that created it.
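To make the audit concrete, here is a minimal sketch of how such a log could be used. Everything in it is hypothetical: S3 does not expose a log in this form, and the `LogEntry` layout and the use of a SHA-256 digest to detect changes are my own assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LogEntry:
    operation: str          # "PUT" or "DELETE"
    key: str                # name of the stored object
    sha256: Optional[str]   # digest of the content for a PUT, None for a DELETE


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit(log: Sequence[LogEntry], key: str, stored: Optional[bytes]) -> str:
    """Compare what is stored now with the last logged request for `key`.

    `stored` is the object's current content, or None if it is missing.
    """
    last = None
    for entry in log:                     # the log is in chronological order
        if entry.key == key:
            last = entry
    if last is None or last.operation == "DELETE":
        if stored is not None:
            return "unexplained object"
        return "never stored" if last is None else "deleted by request"
    if stored is None:
        return "lost"                     # (a) disappeared without a DELETE
    if digest(stored) != last.sha256:
        return "corrupted"                # (b) changed since the PUT
    return "intact"


log = [LogEntry("PUT", "song.mp3", digest(b"original bytes"))]
print(audit(log, "song.mp3", b"original bytes"))
print(audit(log, "song.mp3", None))
```

Only "lost" and "corrupted" would be the storage provider's fault; "deleted by request" is the customer's own doing.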

So we have something that

  1. is a rare event
  2. causes harm
  3. can be identified with certainty.

An event like that is a prime candidate for insurance.

Suppose that Amazon paid compensation for each lost or corrupted S3 object – perhaps a thousand dollars per object, perhaps one cent per byte – how much would that cost them? We don’t know, because we don’t know the quantity of data that S3 loses; but Amazon do know, or they can get a mathematically inclined intern to find out.

Suppose that Amazon divide that total cost of compensation by the quantity of data stored on S3. Does the result come out at one cent per gigabyte? More? Less? Again, we don’t know but Amazon do.
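The calculation itself is a single division. In the sketch below every number is invented for illustration, since only Amazon know the real ones, and I divide by gigabytes stored so that the result can be compared directly with a per-gigabyte storage price.

```python
def break_even_premium_per_gb(objects_lost_per_year: float,
                              compensation_per_object: float,
                              gigabytes_stored: float) -> float:
    """Annual premium per gigabyte that would exactly cover compensation."""
    if gigabytes_stored <= 0:
        raise ValueError("gigabytes_stored must be positive")
    total_compensation = objects_lost_per_year * compensation_per_object
    return total_compensation / gigabytes_stored


# Hypothetical inputs: 100 objects lost a year, $1,000 paid for each,
# and 10 million gigabytes in storage.
premium = break_even_premium_per_gb(100, 1000.0, 10_000_000)
print(f"break-even premium: ${premium:.4f} per gigabyte per year")
```

A real insurer would add a margin for administration and profit on top of the break-even figure.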

A revolution in insurance

Traditional IT insurance is cumbersome and expensive. It requires the insurers to assess the degree of risk beforehand, with experts going through every bit of software, hardware and operating procedure. Handling claims is expensive as well, since the insurer has to assess exactly how much harm has been done to the business by an IT failure.

Things don’t have to be this way. To see this, let’s take the second point first. The “purest” form of insurance is pluvius insurance – you are having an open-air event on a certain date and you pay a premium so that you can be compensated if it’s rained off. Pluvius insurance is pure because (a) the insured person has no influence over whether the loss happens and (b) consequently the insurance can be for any amount that the insured person feels like. If I insure my local school’s fund-raising fair for $10m, no-one cares as long as I pay the premium: I can’t make it rain. So assessing the risk is as simple as reading weather records, and paying out the claim takes no time at all.
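"Reading weather records" can be taken almost literally. The sketch below estimates the chance of rain from past years and multiplies by the sum insured; the record is invented, and a real insurer would add a loading for costs and profit.

```python
from typing import Sequence


def fair_premium(rained_on_this_date: Sequence[bool], sum_insured: float) -> float:
    """Expected payout: historical frequency of rain times the sum insured."""
    if not rained_on_this_date:
        raise ValueError("need at least one year of weather records")
    p_rain = sum(rained_on_this_date) / len(rained_on_this_date)
    return p_rain * sum_insured


# Hypothetical record: it rained on the date of the fair in 1 year out of 4.
history = [True, False, False, False]
print(fair_premium(history, 1000.0))
```

Notice that the sum insured is simply an input: nothing in the calculation depends on what the event is actually worth to the person insuring it.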

The kinds of insurance we’re familiar with are “impure” by comparison, because the insured person has a strong influence over the probability of loss. My car insurers will want to know in advance what sort of driver I am and whether I live in a rough neighbourhood; and if my car is wrecked or stolen they won’t pay me any more than they say it’s worth. If it were otherwise, I could turn my car into money a lot more easily than by selling it second-hand. (I am told this used to happen in parts of Australia where they had fixed-value car insurance: you’d park your car in a certain spot with A$200 on the front seat and the keys inside the exhaust pipe, go for a drink and come back to find the car gone.)

Our risk of data loss and corruption is more like rain than like car theft. It is easy to assess in advance (if you know the error rates) and because we cannot influence the outcome, it can be insured for any amount that we feel like insuring it for, without any need for tedious calculation of the exact value of the loss suffered. All this would make S3-based data loss insurance very, very cheap to administer. It would lead a revolution in insurance just as S3 itself is leading a revolution in data storage.

Who should offer this insurance? The obvious answer is Amazon. They have the exact figures for the probabilities of data loss and no-one else has. It is like offering life insurance when you are the only one with accurate mortality tables. How they would do this would follow the pattern of many other industries around the world: they set up a wholly-owned insurance subsidiary, which offers data loss insurance for data stored on Amazon S3. That subsidiary underwrites part of the risk and reinsures the rest of it in the wholesale insurance market.

Amazon can thus give us a choice: the current S3 storage service at the current low cost, complete with all the empty guarantees about “we use our own service so it must be safe”; or an insured service for a few cents extra.

A cautious user of S3 could pay the extra premium; a braver user could use the size of the premium as an indication of how risky S3 storage actually is.

A revolution in IT responsibility

There is a Peanuts cartoon in which Lucy goes round getting everyone to sign a piece of paper that absolves her from all blame. “No matter what happens any place or any time in the world, this absolves me from all blame!”

When Lucy grew up she went into IT.

Every software and hardware supplier has contracts absolving itself from all blame if its products go wrong. As a result, IT remains an infantile industry. Because blame is disclaimed, there is no need for anyone to take responsibility. Moreover, where there is no blame there is no apportionment of blame. For example, if someone has difficulty printing, there is no way of finding out whether the cause is (a) the application program, (b) the operating system, (c) the printer driver, (d) the print spooler, (e) the printer firmware, or (f) one of the fonts. And that is assuming that you’re not printing across a network!

As suppliers of Cardbox we spend a lot of time pinning down the causes of problems suffered by our users. We always hope that the cause is a bug in Cardbox, because then we can cure it; but all too often the cause is somewhere in the thicket of letters listed above. And there is no quick way of pinning it down, because none of the components has been designed to keep a proper audit trail to help with troubleshooting. So the user suffers; and every day computing becomes less like engineering and more like medicine.

Suppose that Tunesafe wanted to offer insured backups to its users. It could use insured S3 storage, which accounts for most of the risk. The remaining risk would be that data loss or corruption was being caused by bugs in Tunesafe. Tunesafe being small and simple, that risk wouldn’t be too hard to assess and the company could decide to carry it itself. Of course there is also the risk of a greedy user trying to make Tunesafe go wrong so that he can get his hands on the compensation, so the Tunesafe program will have to have a secure log so that it can be proved whether (for example) a backup has been deleted because the user asked for it to be deleted rather than because something went wrong with the program.
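One common way of making a log tamper-evident is a hash chain, in which each record's hash covers the record before it. The sketch below is my own illustration rather than anything Tunesafe actually does; the event fields and function names are invented.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # stands in for "the hash before the first record"


def _entry_hash(prev_hash: str, event: Dict) -> str:
    # sort_keys makes the serialisation, and hence the hash, repeatable
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: List[Dict], event: Dict) -> None:
    """Add an event, chaining its hash to the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "hash": _entry_hash(prev_hash, event)})


def verify(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or removed record breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["hash"] != _entry_hash(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True


log: List[Dict] = []
append(log, {"action": "backup", "file": "song.mp3"})
append(log, {"action": "delete", "file": "song.mp3", "requested_by": "user"})
print(verify(log))
```

A chain like this only detects tampering if the latest hash is also lodged somewhere the user cannot reach, for instance with the insurer; otherwise a determined user could rewrite the whole log and recompute every hash.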

So now blame can be reliably apportioned between “Amazon S3”, “Tunesafe”, and “elsewhere”; and the logs that Tunesafe keeps in self-defence will also be a valuable tool for debugging the rest of the complete system. Thus, one slice at a time, IT can be de-Lucified, with each component in a complex system able to identify whether it is to be blamed for an error or whether the error is someone else’s fault – and, if so, whose. It will take a long time, but S3 is a very good place to start.

Next time, contractual risk: will reading Amazon’s contract with us cure insomnia, or cause it?

