Tuesday, February 10, 2015

The Eight Fs of Software Failure

It doesn't have to be that way

Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story such as Financial World's story about Charles T. Fisher III of NBD Corporation, one of their award-winning CEOs for the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention

One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observe them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson—the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others

Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:

What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over more than a half-century in the information systems business, I've encountered many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is this: why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because the losses are so enormous that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)"?

The Significance of Failure Sources

If we're to prevent failures, then we must observe the conditions that generate them. In searching out conditions that breed failures, I find it useful to consider that failures may come from the following eight F's: frailty, folly, fatuousness, fun, fraud, fanaticism, failure, and fate. The following is a brief discussion of each source of failure, along with ways of interpreting its significance when observed.

But before getting into this subject, a warning. You can read these sources of failure as passing judgment on human beings, or you can read them as simply describing human beings. For instance, when a perfectionist says "people aren't perfect," that's a condemnation, with the hidden implication that "people should be perfect." Frankly, I don't think I'd enjoy being around a perfect person, though I don't know, because I've never met one. So, when I say, "people aren't perfect," I really mean two things:

"People aren't perfect, which is a great relief to me, because I'm not perfect."

"People aren't perfect, which can be rather annoying when I'm trying to build information system. But it will be even more annoying if I build my information system without taking this wonderful imperfection into account."

It may help you, when reading the following sections, to do what I did when writing them. For each source, ask yourself, "When have I done the same stupid thing?" I was able to find many examples of times when I made mistakes, made foolish blunders, made fatuous boo boos, had fun playing with a system and caused it to fail, did something fraudulent (though not, I hope, illegal or immoral), acted with fanaticism, or blamed fate for my problems. Once, I actually even experienced a hardware failure when I hadn't backed up my data. If you haven't done these things yourself (or can't remember or admit doing them), I'd suggest that you stay out of the business of managing other people until you've been around the real world a bit longer.

Frailty
Frailty means that people aren't perfect. They can't do what the design calls for, whether it's the program design or the process design. Frailty is the ultimate source of software failure. The Second Law of Thermodynamics says nothing can be perfect. Therefore, the observation that someone has made a mistake is no observation at all. It was already predicted by the physicists.

It was also measured by the psychologists. Recall case history 5, the buying club statement with the incorrect telephone number. When copying a phone number, the programmer got one digit incorrect. Simple psychological studies demonstrate that when people copy 10-digit numbers, they inevitably make some mistakes. But everybody knows this. Haven't you ever copied a phone number incorrectly?

The direct observation of a mistake has no significance, but the meta-observation of how people prepare for mistakes does. It's a management job to design procedures for updating code that acknowledge these facts of nature, and to see that the procedures are carried out. The significant observation in this case, then, is that the managers of the mail-order company failed to establish or enforce such procedures.
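
As one concrete illustration of such a procedure (my sketch, not anything the mail-order company actually used): a number that carries a check digit can be verified mechanically at the point of entry, so a transcription slip is caught by the procedure rather than by exhorting the copier to be perfect. Here the generic Luhn check-digit test stands in for whatever scheme you'd actually adopt.

    # A generic check-digit routine (the Luhn test familiar from credit-card
    # numbers), shown only to illustrate the idea of a mechanical check.
    def luhn_valid(number: str) -> bool:
        """Return True if the digit string passes the Luhn check-digit test."""
        digits = [int(d) for d in number if d.isdigit()]
        total = 0
        for i, d in enumerate(reversed(digits)):
            if i % 2 == 1:      # double every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    # Any single mistyped digit changes the checksum, so the bad copy is
    # flagged when it's keyed in, not after the statements are printed.
    assert luhn_valid("79927398713")        # faithful transcription
    assert not luhn_valid("79927398718")    # one digit copied wrong

The point isn't this particular algorithm; it's that the procedure, not the person, absorbs the imperfection.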

In Pattern 1 and Pattern 2 organizations, for instance, most of the hullaballoo in failure prevention is directed at imploring or threatening people not to make mistakes. This is equivalent to trying to build a kind of perpetual motion machine—which is impossible. Trying to do what you know is impossible is fatuousness, which we will discuss in a moment.

After a mistake happens, the meta-observation of the way people respond to it can also be highly significant. In Pattern 1 and Pattern 2 organizations, most of the response is devoted to establishing blame and then punishing the identified "culprit." This reaction has several disadvantages:

• It creates an environment in which people hide mistakes, rather than airing them out.

• It wastes energy searching for culprits, energy that could be put to better use.

• It distracts attention from management responsibility for procedures that catch failures early and prevent dire consequences.

The third point, of course, is the reason many managers favor this way of dealing with failure. As the Chinese sage said,

When you point a finger at somebody, notice where the other three fingers are pointing.

Folly

Frailty is failing to do what you intended to do. Folly is doing what you intended, but intending the wrong thing. People not only make mistakes, they also do dumb things. For example, it's not a mistake to hard code numerical billing constants into a program, as was done in the public utility billing cases. The programs may indeed work perfectly. It's not a mistake, but it is ignorant, because it invites mistakes later on.

Folly is based on ignorance, not stupidity. Folly is correctable, whereas frailty is not. For instance, it is folly to pretend not to be frail, that is, to pretend to be perfect. Either theoretical physics or experience in the world can teach you that nobody is perfect.

In the same way, program design courses can teach you not to hard code numerical constants. Or, you can learn this practice as an apprentice to a mentor, or from participating in code reviews where you can observe good coding practices. But it's management's job to establish and support training, mentoring, and technical review programs. If these aren't done, or aren't done effectively, then you have a significant meta-observation about the management of failure.
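
To make the lesson concrete, here is a minimal sketch (my illustration; the rate name and configuration file are hypothetical, not taken from the utility cases) of the difference between a hard coded billing constant and one read from an external configuration file:

    import json

    # Folly: the billing rate is frozen into the program, so every tariff
    # change means editing, retesting, and re-releasing the code itself.
    KWH_RATE = 0.0825   # dollars per kilowatt-hour, hard coded

    def bill_hardcoded(kwh_used: float) -> float:
        return round(kwh_used * KWH_RATE, 2)

    # Less foolish: the same rate lives in a configuration file, where it
    # can be reviewed, audited, and changed without touching the program.
    def bill_from_config(kwh_used: float, config_path: str = "billing_rates.json") -> float:
        with open(config_path) as f:
            rates = json.load(f)            # e.g. {"kwh_rate": 0.0825}
        return round(kwh_used * rates["kwh_rate"], 2)

Either version may work perfectly today; the folly shows up only when the rate changes and someone has to find every place it was baked in.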

Fatuousness
It is worse than folly to manage a foolish person and not provide the training or experience needed to eradicate the foolishness. We call such behavior "fatuousness." ("Utter stupidity" would be better, but it doesn't start with F.) Fatuousness is utter stupidity, or being incapable of learning. Fatuous people—which occasionally includes each of us—actively do stupid things and continue to do them, time after time. For example,

Ralston, a programmer, figures out how to bypass the configuration control system and zaps the "platinum" version of an about-to-be-released system. The zap corrects the situation he was working on, but results in a side-effect that costs the company several hundred thousand dollars.

The loophole in configuration control is fixed, but on the next release, Ralston figures out a new way to beat it. He zaps the platinum code again, producing another 6-figure side effect.

Once again, the new loophole is fixed. Then, on the third release, Ralston beats it again, although this time the cost is only $45,000.

The moral of this story is clear. The fatuous person will work very hard to beat any "idiot-proof" system. Indeed, there is no such thing as an "idiot-proof" system, because some of the idiots out there are unbelievably intelligent.

There's no protection against fatuous people in a software engineering organization except to move them into some other profession. What significance do you make of this typical situation?

Suppose you were Ralston's manager's manager. Hunt, his immediate manager, complains to you, "This wouldn't have happened if Ralston hadn't covertly bypassed our configuration control system. I don't know what to do about Ralston. He goes out of his way to do the wrong thing, beating all our systems of protection. And he's done this at least three times."

What was the significant part of this story? Ralston, of course, has to be moved out, but that's only the second most important part. Hunt—who has identified a fatuous employee and hasn't done anything about it—is doubly fatuous. Hunt needs to be recycled out of management into some profession where his utter stupidity doesn't carry such risk. If you delay in removing Hunt until he's done this with three employees, what does that make you?

Fun

Ralston's story also brings up the subject of fun. Some readers will rise to the defense of poor Ralston, saying, "He was only trying to have a little fun by beating the configuration control system." Well, I'm certainly not against fun, and if Ralston wants to have fun, he's entitled to it. But the question Ralston's manager has to ask is, "What business are we in?" If you're in the business of entertaining your employees at the cost of millions, then Ralston should stay. Otherwise, he'll have to have his fun hacking somewhere else.

In the actual situation, Ralston wasn't trying to have fun—at least that wasn't his primary motivation. He was, in fact, trying to be helpful by putting in a last-minute "fix." Well-intentioned but fatuous people like Ralston are not as dangerous as people who are just trying to have a good time. Hunt could have predicted that Ralston was going to try to be helpful, but

Nobody can predict what somebody else will consider "fun."

Here are some items from my collection of "fun" things that people have done, each of which has resulted in costs greater than their annual salary:

• created a subroutine that flashed all the lights on the mainframe console for 20 seconds, then shut down the entire operating system.

• created a virus that displayed a screen with Alfred E. Neuman saying "What, me worry?" in every program that was infected.

• altered the pointing finger in a Macintosh application so that it pointed with the middle finger rather than the index finger. The testers didn't notice this obscene gesture, but thousands of customers did.

• diddled the print spooler so that in December, "Merry Christmas" was printed across a few tens of thousands of customer bills, as well as all other reports. The sentiment was nice, but the greeting happened to obliterate the amount due, so customers had to call individually to find out how much to pay.

The list is endless and unpredictable, which is why fun is the most dangerous of all sources of failure. There are only two preventives: open, visible systems and work that is sufficient fun in and of itself. That's why fun is primarily a problem of Pattern 2 organizations, which seldom meet either of those conditions.

Fraud

Although fun costs more, software engineering managers are far more afraid of fraud. Fraud occurs when someone illegally extracts personal gain from a system. Although I don't mean to minimize fraud as a source of failure, it's an easier problem to solve than either fun or fatuousness. That's because it's clear what kind of thing people are after. There are an infinite number of ways to have fun with a system, but only a few things worth stealing.

I suggest that any software engineering manager be well read on the subject of information systems fraud, and take all reasonable precautions to prevent it. The subject has been well covered in other places, so I will not cover it further.

I will confess, however, to a little fraud of my own. I have often used the (very real but minimal) threat of fraud to motivate managers to introduce systematic technical reviews. I generally do this after failing to motivate them using the (very real and significant) threat of failure, folly, fatuousness, or fun.

Fanaticism

Very infrequently, people try to destroy or disrupt a system, but not for direct gain. Sometimes they are seeking revenge against the company, the industry, or the country for real or imagined wrongs done to them. Fanaticism like this is very hard to stop if the fanatic is determined, especially because, as with "fun," you never know what someone will consider an offense that requires revenge.

Fanaticism, like fraud, has a way of getting the attention of management. With reasonable precautions, however, the threat of terrorism can be reduced far below that of frailty. Frailty, though, lacks drama. In any case, many of the actions that protect you against frailty will also reduce the impact of terrorism. Besides, I cannot offer you any useful advice on how to observe potential terrorists in your organization. That would be "profiling."

Failure (of Hardware)

When the hardware in a system doesn't do what it's designed to do, failures may result. To a great extent, these can be overcome by software, but that is beyond the scope of this book. Fifty years ago, when programmers complained about hardware failures, they had a 50/50 chance of being right. Not today. So, if you hear people blaming hardware failures for their problems, this is significant information. What it signifies can be chosen from this list, for starters:

1. There really aren't significant hardware failures, but your programmers need an alibi. Where there's an alibi, start looking for what it's trying to conceal.

2. There really are hardware failures, but they are within the normally expected range. Your programmers, however, may not be taking the proper precautions, such as backing up their source code and test scripts.

3. There really are hardware failures, and you are not doing a good job managing your relationship with your hardware supplier.

4. Failures attributed to hardware may actually be caused by human error—unexpected actions on the part of the user. These are really system failures.

Fate

This is what most bad managers think is happening to them. It isn't. When you hear a manager talking about "bad luck," substitute the word "manager" for "luck." As they say in the Army,

There are no bad soldiers, only bad officers.

What's Next?

This three-part essay is now finished, but the topic is far from complete. If you want more, note that the essay is adapted from a portion of Chapter 2 of Responding to Significant Software Events.

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).

I'm sure you can figure out what to do next. Good luck!

2 comments:

Anssi Lehtelä said...

Thanks for the interesting read.

I'd have a question with regard to the responsibilities in preventing these failures. It seems that the writing emphasizes the managers' role in preventing, or minimizing, the cost and frequency of these failures. Would you think that in many cases it would be easier, and even preferable, for the responsibility to improve to rest with the teams and individuals who actually make these mistakes? Whereas the manager's role would be more to create an atmosphere where this can happen?

Gerald M. Weinberg said...

Yes, that's the manager's role, all right. And that's why it's the manager's responsibility to the organization. What Anssi is talking about is the worker's responsibility to their manager. If your s/w error leads to a billion dollar loss, the workers don't owe the organization a billion dollars.

If a technician fails to sharpen a scalpel and the patient dies, the ultimate responsibility is the doctor's: the doctor set up or approved the system that allowed a dull scalpel to reach the doctor's hand. The doctor hired the tech and trusted their work, and so is responsible. The tech didn't hire the doctor.