Tuesday, January 04, 2011

The Universal Pattern of Huge Software Losses

What Do Failures Cost?
Some perfectionists in software engineering are overly preoccupied with failure, and most others don't rationally analyze the value they place on failure-free operation. Nonetheless, when we do measure the cost of failure carefully, we generally find that great value can be added by producing more reliable software. In Responding to Significant Software Events, I give five examples that should convince you.

The national bank of Country X issued loans to all the banks in the country. A tiny error in the interest rate calculation added up to more than a billion dollars that the national bank could never recover.

A utility company was changing its billing algorithm to accommodate rate changes (a utility company euphemism for "rate increases"). All this involved was updating a few numerical constants in the existing billing program. A slight error in one constant was multiplied by millions of customers, adding up to X dollars that the utility could never recover. The reason I say "X dollars" is that I've heard this story from four different clients, with different values of X. Estimated losses ranged from a low of $42 million to a high of $1.1 billion. Given that this happened four times to my clients, and given how few public utilities are clients of mine, I'm sure it's actually happened many more times.

I know of the next case through the public press, so I can tell you that it's about the New York State Lottery. The New York State legislature authorized a special lottery to raise extra money for some worthy purpose. As this special lottery was a variant of the regular lottery, the program to print the lottery tickets had to be modified. Fortunately, all this involved was changing one digit in the existing program. A tiny error caused duplicate tickets to be printed, and public confidence in the lottery plunged with a total loss of revenue estimated between $44 million and $55 million.

I know the next story from the outside, as a customer of a large brokerage firm:
One month, a spurious line of $100,000.00 was printed on the summary portion of 1,500,000 accounts, and nobody knew why it was there. The total cost of this failure was at least $2,000,000, and the failure resulted from one of the simplest known errors in COBOL coding: failing to clear a blank line in a printing area.

I know this story, too, from the outside, as a customer of a mail-order company, and also from the inside, as their consultant. One month, a new service phone number for customer inquiries was printed on each bill. Unfortunately, the phone number had one digit incorrect, producing the number of a local doctor instead of the mail-order company. The doctor's phone was continuously busy for a week until he could get it disconnected. Many patients suffered, though I don't know if anyone died as a result of not being able to reach the doctor. The total cost of this failure would have been hard to calculate except for the fact that the doctor sued the mail-order company and won a large settlement.

The Pattern of Large Failures
Every such case that I have investigated follows a universal pattern:

1. There is an existing system in operation, and it is considered reliable and crucial to the operation.

2. A quick change to the system is desired, usually from very high in the organization.

3. The change is labeled "trivial."

4. Nobody notices that statement 3 is a statement about the difficulty of making the change, not the consequences of making it, or of making it wrong.

5. The change is made without any of the usual software engineering safeguards, however minimal, that the organization has in place.

6. The change is put directly into the normal operations.

7. The individual effect of the change is small, so that nobody notices immediately.

8. This small effect is multiplied by many uses, producing a large consequence.

Whenever I have been able to trace management action subsequent to the loss, I have found that the universal pattern continues. After the failure is spotted:

9. Management's first reaction is to minimize its magnitude, so the consequences are continued for somewhat longer than necessary.

10. When the magnitude of the loss becomes undeniable, the programmer who actually touched the code is fired—for having done exactly what the supervisor said.

11. The supervisor is demoted to programmer, perhaps because of a demonstrated understanding of the technical aspects of the job. [not]

12. The manager who assigned the work to the supervisor is slipped sideways into a staff position, presumably to work on software engineering practices.

13. Higher managers are left untouched. After all, what could they have done?

The First Rule of Failure Prevention
Once you understand the Universal Pattern of Huge Losses, you know what to do whenever you hear someone say things like:

• "This is a trivial change."

• "What can possibly go wrong?"

• "This won't change anything."

When you hear someone express the idea that something is too small to be worth observing, always take a look. That's the First Rule of Failure Prevention:

Nothing is too small to be unworthy of observing.

It doesn't have to be that way
Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story such as Financial World telling about Charles T. Fisher III of NBD Corporation, one of their award-winning CEO's for the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention
One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observed them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson—the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others
Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:
What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over my half-century in the information systems business, there have been many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because they are such enormous losses that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)"?

(Adapted from Responding to Significant Software Events )


Matisse Enzer said...

Sounds like a summary of the pattern is that in all cases the relationship(s) between the management and the implementors sucked.

Anders Dinsen said...

Thanks for an inspiring and great post!

I particularly like the example of rewarding people for finding "bugs" - thereby assigning value to prevention.

Alsom, it reminds me of the story of the Discovery shuttle disaster in the 80's and particularly Feynmans statement that it was "An accident rooted in history". This is not only happening in computers.

Aleš Potočnik Hahonina said...

Great post. Good morning reading to give the brain a kick start.