Monday, February 23, 2015

Errors Versus Quality


Are Errors a Moral Issue?


NOTE: This essay is the second of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns

    

Errors in software used to be a moral issue for me, and still are for many writers. Perhaps that's why these writers have asserted that "quality is the absence of errors." 

It must be a moral issue for them, because otherwise it would be a grave error in reasoning. Here's how their reasoning might have gone wrong. Perhaps they observed that when their work is interrupted by numerous software errors, they can't appreciate any other good software qualities. From this observation, they can conclude that many errors will make software worthless—i.e., zero quality.
But here's the fallacy: Though copious errors guarantee worthlessness, zero errors guarantee nothing at all about the value of software.
Let's take one example. Would you offer me $100 for a zero defect program to compute the horoscope of Philip Amberly Warblemaxon, who died in 1927 after a 37-year career as a filing clerk in a hat factory in Akron? I doubt you would even offer me fifty cents, because to have value, software must be more than perfect. It must be useful to someone.
Still, I would never deny the importance of errors. First of all, if I did, people in Routine, Oblivious and Variable organizations would stop reading this book. To them, chasing errors is as natural as chasing sheep is to a German Shepherd Dog. And, as we've seen, when they are shown the rather different life of a Steering organization, they simply don't believe it.
Second of all, I do know that when errors run away from us, that's one of the ways to lose quality. Perhaps our customers will tolerate 10,000 errors, but, as Tom DeMarco asked me, will they tolerate 10,000,000,000,000,000,000,000,000,000? In this sense, errors are a matter of quality. Therefore, we must train people to make fewer errors, while at the same time managing the errors they do make, to keep them from running away.

The Terminology of Error

I've sometimes found it hard to talk about the dynamics of error in software because there are many different ways of talking about errors themselves. One of the best ways for a consultant to assess the software engineering maturity of an organization is by the language they use, particularly the language they use to discuss error. To take an obvious example, those who call everything "bugs" are a long way from taking responsibility for controlling their own process. Until they start using precise and accurate language, there's little sense in teaching such people about basic dynamics.

Faults and failures. First of all, it pays to distinguish between failures (the symptoms) and faults (the diseases). Musa gives these definitions:

A failure "is the departure of the external results of program operation from requirements." A fault "is the defect in the program that, when executed under particular conditions, causes a failure." For example:
An accounting program had an incorrect instruction (fault) in the formatting routine that inserts commas in large numbers such as "$4,500,000". Any time a user prints a number of more than six digits, a comma may be missing (a failure). Many failures resulted from this one fault.
How many failures result from a single fault? That depends on
• the location of the fault
• how long the fault remains before it is removed
• how many people are using the software.
The comma-insertion fault led to millions of failures because it was in a frequently used piece of code, in software that has thousands of users, and it remained unresolved for more than a year.
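To make the distinction concrete, here is a minimal sketch in Python. The essay does not show the original instruction, so the faulty formatter below is hypothetical: it assumes the fault was a routine that inserts only the last comma. The function names and the sample amounts are also invented for illustration. The point is only that one fault in the code shows up as a separate failure every time an affected number is printed.

```python
def format_amount_faulty(amount: int) -> str:
    """Hypothetical formatter containing a single fault.

    It inserts the comma before the last three digits but never adds
    the comma for millions and above.
    """
    digits = str(amount)
    if len(digits) > 3:
        digits = digits[:-3] + "," + digits[-3:]  # the one fault: no repetition
    return "$" + digits


def format_amount_expected(amount: int) -> str:
    """What the requirements call for: a comma every three digits."""
    return f"${amount:,}"


def count_failures(printed_amounts) -> int:
    """Count failures: printed results that depart from the requirements."""
    return sum(
        1
        for amount in printed_amounts
        if format_amount_faulty(amount) != format_amount_expected(amount)
    )


# Each printed number is one more chance for the single fault to show itself.
printed = [950, 4_500, 999_999, 1_000_000, 4_500_000, 12_345_678]
print("faults: 1, failures:", count_failures(printed))
```

Multiply the list of printed amounts by thousands of users and more than a year of daily use, and the same single fault accounts for millions of failures.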
When studying error reports in various clients, I often find that they mix failures and faults in the same statistics, because they don't understand the distinction. If these two different measures are mixed into one, the clients will find it difficult to understand their own experiences. For instance, because a single fault can lead to many failures, it would be impossible to compare failures between two organizations that aren't careful in making this "semantic" distinction.
Organization A has 100,000 customers who use their software product, each for an average of 3 hours a day. Organization B has a single customer who uses one software system once a month. Organization A produces 1 fault per thousand lines of code, and receives over 100 complaints a day. Organization B produces 100 faults per thousand lines of code, but receives only one complaint a month.
Organization A claims they are better software developers than Organization B. Organization B claims they are better software developers than Organization A. Perhaps they're both right.
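The arithmetic behind the two claims can be sketched as follows. The figures are the ones given above; the 30-day month, the class, and the helper function are assumptions added only to put both organizations on the same time scale. "Over 100 complaints a day" is taken at its lower bound of 100.

```python
from dataclasses import dataclass


@dataclass
class Organization:
    name: str
    customers: int
    faults_per_kloc: float       # a fault measure
    complaints_per_month: float  # a failure measure


DAYS_PER_MONTH = 30  # assumption, used only to convert A's daily figure

org_a = Organization("A", 100_000, 1, 100 * DAYS_PER_MONTH)
org_b = Organization("B", 1, 100, 1)


def complaints_per_customer(org: Organization) -> float:
    """Monthly complaints divided by the number of customers."""
    return org.complaints_per_month / org.customers


for org in (org_a, org_b):
    print(
        org.name,
        "faults/KLOC:", org.faults_per_kloc,
        "complaints/month:", org.complaints_per_month,
        "complaints/customer/month:", complaints_per_customer(org),
    )
```

Counting raw complaints, B looks far better than A. Counting faults per thousand lines, or complaints per customer, A looks far better than B. Which organization "wins" depends entirely on whether the statistic measures faults or failures, and on how much the software is used.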
The System Trouble Incident (STI). Because of the important distinction between faults and failures, I encourage my clients to keep at least two different statistics. The first of these is a database of "system trouble incidents," or STIs. In my books, I mean an STI to be an "incident report of one failure as experienced by a customer or simulated customer (such as a tester)."
I know of no industry standard nomenclature for these reports—except that they invariably take the form of TLAs (Three Letter Acronyms). The TLAs I have encountered include:
- STR, for "software trouble report"
- SIR, for "software incident report," or "system incident report"
- SPR, for "software problem report," or "software problem record"
- MDR, for "malfunction detection report"
- CPI, for "customer problem incident"
- SEC, for "significant error case"
- SIR, for "software issue report"
- DBR, for "detailed bug report," or "detailed bug record"
- SFD, for "system failure description"
- STD, for "software trouble description," or "software trouble detail"
I generally try to follow my client's naming conventions, but try hard to find out exactly what is meant. I encourage them to use unique, descriptive names. It tells me a lot about a software organization when they use more than one TLA for the same item. Workers in that organization are confused, just as my readers would be confused if I kept switching among ten TLAs for STIs. The reasons I prefer STI to some of the above are as follows:
1. It makes no prejudgment about the fault that led to the failure. For instance, it might have been a misreading of the manual, or a mistyping that wasn't noticed. Calling it a bug, an error, a failure, or a problem, tends to mislead.
2. Calling it a "trouble incident" implies that once upon a time, somebody, somewhere, was sufficiently troubled by something that they happened to bother making a report. Since our definition of quality is "value to some person," someone being troubled implies that it's worth something to look at the STI.
3. The words "software" and "code" also contain a presumption of guilt, which may unnecessarily restrict location and correction activities. We might correct an STI with a code fix, but we might also change a manual, upgrade a training program, change our ads or our sales pitch, furnish a help message, change the design, or let it stand unchanged. The word "system" says to me that any part of the overall system may contain the fault, and any part (or parts) may receive the corrective activity.
4. The word "customer" excludes troubled people who don't happen to be customers, such as programmers, analysts, salespeople, managers, hardware engineers, or testers. We should be so happy to receive reports of troublesome incidents before they get to customers that we wouldn't want to discourage anybody.
Similar principles of semantic precision might guide your own design of TLAs, to remove one more source of error, or one more impediment to their removal. Pattern 3 organizations always use TLAs more precisely than do Pattern 1 and 2 organizations.
System Fault Analysis (SFA). The second statistic is a database of information on faults, which I call SFA, for System Fault Analysis. Few of my clients initially keep such a database separate from their STIs, so I haven't found such a diversity of TLAs. Ed Ely tells me, however, that he has seen the name RCA, for "Root Cause Analysis." Since RCA would never do, the name SFA is a helpful alternative because:
1. It clearly speaks about faults, not failures. This is an important distinction. No SFA is created until a fault has been identified. When an SFA is created, it is tied back to as many STIs as possible. The time lag between the earliest STI and the SFA that clears it up can be an important dynamic measure.
2. It clearly speaks about the system, so the database can contain fault reports for faults found anywhere in the system.
3. The word "analysis" correctly implies that data is the result of careful thought, and is not to be completed unless and until someone is quite sure of their reasoning.
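One way to picture the two databases is as two kinds of records, with each SFA pointing back to the STIs it explains. The sketch below is only an illustration: the field names, the sample incidents, and the dates are all hypothetical, and a real STI or SFA would carry far more information.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class STI:
    """System Trouble Incident: one failure as experienced by someone."""
    sti_id: int
    reported_on: date
    description: str


@dataclass
class SFA:
    """System Fault Analysis: created only once a fault has been identified."""
    sfa_id: int
    identified_on: date
    fault: str
    sti_ids: List[int] = field(default_factory=list)  # the STIs tied back to it


def lag_days(sfa: SFA, stis: List[STI]) -> int:
    """Days between the earliest linked STI and the SFA that clears it up."""
    linked_dates = [s.reported_on for s in stis if s.sti_id in sfa.sti_ids]
    return (sfa.identified_on - min(linked_dates)).days


# Hypothetical data: three incidents, two of which trace to the same fault.
stis = [
    STI(1, date(2015, 1, 5), "comma missing in printed total"),
    STI(2, date(2015, 1, 19), "amount hard to read on invoice"),
    STI(3, date(2015, 2, 2), "manual unclear about report codes"),
]
sfa = SFA(10, date(2015, 2, 23), "formatting routine drops a comma", [1, 2])

print("STIs cleared by this SFA:", len(sfa.sti_ids))
print("lag in days:", lag_days(sfa, stis))
```

Keeping the records separate is what makes the lag measurable at all. If failures and faults share one record, there is no "earliest STI" to measure from.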
"Fault" does not imply blame. One deficiency with the semantics of the term"fault" is the possible implication of blame, as opposed to information. In an SFA, we must be careful to distinguish two places associated with a fault, neither of these implies anything about whose "fault" it was:
a. origin: at what stage in our process the fault originated
b. correction: what part(s) of the system will be changed to remedy the fault
Routine, Oblivious and Variable organizations tend to equate these two notions, but the motto, "you broke it, you fix it," often leads to an unproductive "blame game." "Correction" tells us where it was wisest, under the circumstances, to make the changes, regardless of what put the fault there in the first place. For example, we might decide to change the documentation—not because the documentation was bad, but because the design is so poor it needs more documenting and the code is so tangled we don't dare try to fix it there.

If Steering organizations are not heavily into blaming, why would they want to record "origin" of a fault? To these organizations, "Origin" merely suggests where action might be taken to prevent a similar fault in the future, not which employee is to be taken out and crucified. Analyzing origins, however, requires skill and experience to determine the earliest possible prevention moment in our process. For instance, an error in the code might have been prevented if the requirements document were more clearly written. In that case, we should say that the "origin" was in the requirements stage.
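As a rough illustration, an SFA might record origin and correction as two independent fields, with no field at all for a person's name. The stage names and the sample records below are hypothetical; the sketch only shows that tallying origins answers a different question from tallying corrections.

```python
from collections import Counter
from enum import Enum


class Stage(Enum):
    REQUIREMENTS = "requirements"
    DESIGN = "design"
    CODE = "code"
    DOCUMENTATION = "documentation"


# Hypothetical (origin, correction) pairs, one per SFA.
# Deliberately no "who" field: the record describes the process, not a person.
fault_records = [
    (Stage.REQUIREMENTS, Stage.CODE),
    (Stage.DESIGN, Stage.DOCUMENTATION),
    (Stage.REQUIREMENTS, Stage.CODE),
    (Stage.CODE, Stage.CODE),
]

# Where might similar faults be prevented in the future?
origins = Counter(origin for origin, _ in fault_records)

# Where was it wisest, under the circumstances, to make the change?
corrections = Counter(correction for _, correction in fault_records)

print("most common origin:", origins.most_common(1)[0][0].value)
print("most common correction:", corrections.most_common(1)[0][0].value)
```

In this invented sample, most corrections land in the code, yet the most common origin is the requirements stage, which is where prevention effort would pay off first.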


(to be continued)

Tuesday, February 17, 2015

Why You Should Love Errors

Observing and Reasoning About Errors


NOTE: This essay is the first of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns
Three of the great discoveries of our time have to do with programming: programming computers, the programming of inheritance through DNA, and programming of the human mind (psychoanalysis). In each, the idea of error is central.
The first of these discoveries was psychoanalysis, with which Sigmund Freud opened the Twentieth Century and set a tone for the other two. In his introductory lectures, Freud opened the human mind to inspection through the use of errors—what we now call "Freudian slips."
The second of these discoveries was DNA. Once again, key clues to the workings of inheritance were offered by the study of errors, such as mutations, which were mistakes in transcribing the genetic code from one generation to the next.
The third of these discoveries was the stored program computer. From the first, the pioneers considered error a central concern. von Neumann noted that the largest part of natural organisms was devoted to the problem of survival in the face of error, and that the programmer of a computer need be similarly concerned.
In each of these great discoveries, errors were treated in a new way: not as lapses in intelligence, or moral failures, or insignificant trivialities—all common attitudes in the past. Instead, errors were treated as sources of valuable information, on which great discoveries could be based.
The treatment of error as a source of valuable information is precisely what distinguishes the feedback (error-controlled) system from its less capable predecessors—and thus distinguishes Pattern 3 (Steering) software cultures from Patterns 0 (Oblivious), 1 (Variable), and 2 (Routine). Organizations in those patterns have more traditional—and less productive—attitudes about the role of errors in software development, attitudes that they will have to change if they are to transform themselves into Steering organizations.
So, in the following blog entries, we'll explore what happens to people in Oblivious, Routine, and especially Variable organizations as they battle those "inevitable" errors in their software. After reading these chapters, perhaps they'll appreciate that they can never move to a Steering pattern until they learn how to use the information in the errors they make.
One of my editors complained that the first sections of this essay spend "an inordinate amount of time on semantics, relative to the thorny issues of software failures and their detection."
What I wanted to say to her, and what I will say to you, is that such "semantics" form one of the roots of "the thorny issues of software failures and their detection." Therefore, to build on a solid foundation, I need to start this discussion by clearing up some of the most subversive ideas and definitions about failure. If you already have a perfect understanding of software failure, then skim quickly, and please forgive me.

Errors Are Not A Moral Issue

"What do you do with a person who is 900 pounds overweight that approaches the problem without even wondering how a person gets to be 900 pounds overweight?"—Tom DeMarco
This is the question Tom asked when he read an early version of this blog. He was exasperated about clients who were having trouble managing more than 10,000 error reports per product. So was I.
Over fifty years ago, in my first book on computer programming, Herb Leeds and I emphasized what we then considered the first principle of programming:
The best way to deal with errors is not to make them in the first place.
Not all wisdom was born in the Computer Age. Thousands of years before computers, Epictetus said, "Men are not moved by things, but by the views which they take of them."
Like many hotshot programmers half a century ago, my view of "best" was then a moral stance:
Those of us who don't make errors are better programmers than those of you who do.
I still consider this a statement of the first principle of programming, but somehow I no longer apply any moral sense to the principle. Instead, I mean "best" only in an economic sense, because,
Most errors cost more to handle than they cost to prevent.
This, I believe, is part of what Crosby means when he says "quality is free." But even if it were a moral question, I don't think that Steering cultures (which do a great deal to prevent errors) can claim any moral superiority over Oblivious, Routine and Variable cultures (which do not). You cannot say that someone is morally inferior because they don't do something they cannot do. And Oblivious, Routine and Variable software cultures, where most programmers operate these days, are culturally incapable of preventing large numbers of errors. Why incapable? Let me put Tom's question another way:
"What do you do with a person who is rich, admired by thousands, overloaded with exciting work, 900 pounds overweight, and has 'no problem' except for occasional work lost because of back problems?"
Tom's question presumes that the thousand pound person perceives a weight problem, but what if they perceive a back problem instead? My Oblivious, Routine or Variable clients with tens of thousands of errors in their software do not perceive they have a serious problem with errors. Why not? First of all, they are making money. Secondly, they are winning the praise of their customers. Customer complaints are generally at a tolerable level on every two products out of three. With their rate of profit, why should they care if a third of their projects have to be written off as total losses?
If I attempt to discuss these mountains of errors with Oblivious, Routine or Variable clients, they reply, "In programming, errors are inevitable. Even so, we've got our errors more or less under control. So don't worry about errors. We want you to help us get things out on schedule."
Such clients see no more connection between enormous error rates and two-year schedule slippages than the obese person sees between 900 pounds of body fat and pains in the back. Would it do any good to accuse them of having the wrong moral attitude about errors? Not likely. I might just as well accuse a blind person of having the wrong moral attitude about rainbows.

But their errors do create a moral question—for me, their consultant. If my thousand-pound client is happy, it's not my business to tell him how to lose weight. If he comes to me complaining of back problems, I can step him through a diagram of effects showing how weight affects his back. Then it's up to him to decide how much pain is worth how many chocolate cakes.
Similarly, if he comes to me complaining about error problems, I can ... (you finish the sentence)
(to be continued)

Tuesday, February 10, 2015

The Eight Fs of Software Failure

It doesn't have to be that way

Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story, such as the one Financial World told about Charles T. Fisher III of NBD Corporation, one of its award-winning CEOs for the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention

One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observe them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson, the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others

Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:

What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over more than a half-century in the information systems business, there have been many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because they are such enormous losses that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)."

The Significance of Failure Sources

If we're to prevent failures, then we must observe the conditions that generate them. In searching out conditions that breed failures, I find it useful to consider that failures may come from the following eight F's: frailty, folly, fatuousness, fun, fraud, fanaticism, failure, and fate. The following is a brief discussion of each source of failure, along with ways of interpreting its significance when observed.

But before getting into this subject, a warning. You can read these sources of failure as passing judgment on human beings, or you can read them as simply describing human beings. For instance, when a perfectionist says "people aren't perfect," that's a condemnation, with the hidden implication that "people should be perfect." Frankly, I don't think I'd enjoy being around a perfect person, though I don't know, because I've never met one. So, when I say, "people aren't perfect," I really mean two things:

"People aren't perfect, which is a great relief to me, because I'm not perfect."

"People aren't perfect, which can be rather annoying when I'm trying to build information system. But it will be even more annoying if I build my information system without taking this wonderful imperfection into account."

It may help you, when reading the following sections, to do what I did when writing them. For each source, ask yourself, "When have I done the same stupid thing?" I was able to find many examples of times when I made mistakes, made foolish blunders, made fatuous boo boos, had fun playing with a system and caused it to fail, did something fraudulent (though not, I hope, illegal or immoral), acted with fanaticism, or blamed fate for my problems. Once, I actually even experienced a hardware failure when I hadn't backed up my data. If you haven't done these things yourself (or can't remember or admit doing them), I'd suggest that you stay out of the business of managing other people until you've been around the real world a bit longer.

Frailty
Frailty means that people aren't perfect. They can't do what the design calls for, whether it's the program design or the process design. Frailty is the ultimate source of software failure. The Second Law of Thermodynamics says nothing can be perfect. Therefore, the observation that someone has made a mistake is no observation at all. It was already predicted by the physicists.

It was also measured by the psychologists. Recall case history 5, the buying club statement with the incorrect telephone number. When copying a phone number, the programmer got one digit incorrect. Simple psychological studies demonstrate that when people copy 10-digit numbers, they invariably make mistakes. But everybody knows this. Haven't you ever copied a phone number incorrectly?

The direct observation of a mistake has no significance, but the meta-observation of how people prepare for mistakes does. It's a management job to design procedures for updating code, acknowledging facts of nature, and seeing that the procedures are carried out. The significant observation in this case, then, is that the managers of the mail-order company failed to establish or enforce such procedures.

In Pattern 1 and Pattern 2 organizations, for instance, most of the hullaballoo in failure prevention is directed at imploring or threatening people not to make mistakes. This is equivalent to trying to build a kind of perpetual motion machine—which is impossible. Trying to do what you know is impossible is fatuousness, which we will discuss in a moment.

After a mistake happens, the meta-observation of the way people respond to it can also be highly significant. In Pattern 1 and Pattern 2 organizations, most of the response is devoted to establishing blame and then punishing the identified "culprit." This reaction has several disadvantages:

• It creates an environment in which people hide mistakes, rather than airing them out.

• It wastes energy searching for culprits that could be put to better use.

• It distracts attention from management responsibility for procedures that catch failures early and prevent dire consequences.

The third point, of course, is the reason many managers favor this way of dealing with failure. As the Chinese sage said,

When you point a finger at somebody, notice where the other three fingers are pointing.

Folly

Frailty is failing to do what you intended to do. Folly is doing what you intended, but intending the wrong thing. People not only make mistakes, they also do dumb things. For example, it's not a mistake to hard code numerical billing constants into a program as was done in the public utility billing cases. The programs may indeed work perfectly. It's not a mistake, but it is ignorant because it may cause mistakes later on.

Folly is based on ignorance, not stupidity. Folly is correctable, whereas frailty is not. For instance, it is folly to pretend not to be frail, that is, to be perfect. Either theoretical physics or experience in the world can teach you that nobody is perfect.

In the same way, program design courses can teach you not to hard code numerical constants. Or, you can learn this practice as an apprentice to a mentor, or from participating in code reviews where you can observe good coding practices. But it's management's job to establish and support training, mentoring, and technical review programs. If these aren't done, or aren't done effectively, then you have a significant meta-observation about the management of failure.
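For readers who have not seen the practice in question, here is a minimal sketch of the difference. The billing routines, the rate values, and the year-keyed table are all hypothetical; the utility billing cases did not necessarily look like this. Both versions produce the same bill today, which is exactly why the hard-coded one is folly rather than a mistake.

```python
from decimal import Decimal


# Folly: the rate is buried in the code, so every rate change means
# finding and editing every routine that repeats the number.
def bill_hardcoded(kwh: int) -> Decimal:
    return Decimal(kwh) * Decimal("0.12")  # hypothetical rate per kWh


# Alternative: the rate is named data, supplied from outside the routine.
def bill(kwh: int, rate_per_kwh: Decimal) -> Decimal:
    return Decimal(kwh) * rate_per_kwh


# Hypothetical rate table; in practice this might live in a file or database.
rates = {"2015": Decimal("0.12"), "2016": Decimal("0.13")}

print(bill_hardcoded(500))        # works perfectly, for now
print(bill(500, rates["2015"]))   # same result today
print(bill(500, rates["2016"]))   # a rate change needs no code change
```

When the rate changes, the second version needs only a new table entry, while the first invites someone to copy a number by hand into the code, with all the frailty that implies.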

Fatuousness
It is worse than folly to manage a foolish person and not provide the training or experience needed to eradicate the foolishness. We call such behavior "fatuousness." ("Utter stupidity" would be better, but it doesn't start with F.) Fatuousness is utter stupidity, or being incapable of learning. Fatuous people—which occasionally includes each of us—actively do stupid things and continue to do them, time after time. For example,

Ralston, a programmer, figures out how to bypass the configuration control system and zaps the "platinum" version of an about-to-be-released system. The zap corrects the situation he was working on, but results in a side-effect that costs the company several hundred thousand dollars.

The loophole in configuration control is fixed, but on the next release, Ralston figures out a new way to beat it. He zaps the platinum code again, producing another 6-figure side effect.

Once again, the new loophole is fixed. Then, on the third release, Ralston beats it again, although this time the cost is only $45,000.

The moral of this story is clear. The fatuous person will work very hard to beat any "idiot-proof" system. Indeed, there is no such thing as an "idiot-proof" system, because some of the idiots out there are unbelievably intelligent.

There's no protection against fatuous people in a software engineering organization except to move them into some other profession. What significance do you make of this typical situation?

Suppose you were Ralston's manager's manager. Hunt, his immediate manager, complains to you, "This wouldn't have happened if Ralston hadn't covertly bypassed our configuration control system. I don't know what to do about Ralston. He goes out of his way to do the wrong thing, beating all our systems of protection. And he's done this three times before, at least."

What was the significant part of this story? Ralston, of course, has to be moved out, but that's only the second most important part. Hunt—who has identified a fatuous employee and hasn't done anything about it—is doubly fatuous. Hunt needs to be recycled out of management into some profession where his utter stupidity doesn't carry such risk. If you delay in removing Hunt until he's done this with three employees, what does that make you?

Fun

Ralston's story also brings up the subject of fun. Some readers will rise to the defense of poor Ralston, saying, "He was only trying to have a little fun by beating the configuration control system." Well, I'm certainly not against fun, and if Ralston wants to have fun, he's entitled to it. But the question Ralston's manager has to ask is, "What business are we in?" If you're in the business of entertaining your employees at the cost of millions, then Ralston should stay. Otherwise, he'll have to have his fun hacking somewhere else.

In the actual situation, Ralston wasn't trying to have fun; at least, that wasn't his primary motivation. He was, in fact, trying to be helpful by putting in a last minute "fix." Well-intentioned, but fatuous, people like Ralston are not as dangerous as people who are just trying to have a good time. Hunt could have predicted what Ralston was going to do to be helpful, but

Nobody can predict what somebody else will consider "fun."

Here are some items from my collection of "fun" things that people have done, each of which has resulted in costs greater than their annual salary:

• created a subroutine that flashed all the lights on the mainframe console for 20 seconds, then shut down the entire operating system.

• created a virus that displayed a screen with Alfred E. Neuman saying "What, me worry?" in every program that was infected.

• altered the pointing finger in a Macintosh application to point with the second finger, rather than the index finger. The testers didn't notice this obscene gesture, but thousands of customers did.

• diddled the print spooler so that in December, "Merry Christmas" was printed across a few tens of thousands of customer bills, as well as all other reports. The sentiment was nice, but happened to obliterate the amount due, so that customers had to call individually to find out how much to pay.

The list is endless and unpredictable, which is why fun is the most dangerous of all sources of failure. There are only two preventives: open, visible systems and work that is sufficient fun in and of itself. That's why fun is primarily a problem of Pattern 2 organizations, which seldom meet either of those conditions.

Fraud

Although fun costs more, software engineering managers are far more afraid of fraud. Fraud occurs when someone illegally extracts personal gain from a system. Although I don't mean to minimize fraud as a source of failure, it's an easier problem to solve than either fun or fatuousness. That's because it's clear what kind of thing people are after. There are an infinite number of ways to have fun with a system, but only a few things worth stealing.

I suggest that any software engineering manager be well read on the subject of information systems fraud, and take all reasonable precautions to prevent it. The subject has been well covered in other places, so I will not cover it further.

I will confess, however, to a little fraud of my own. I have often used the (very real but minimal) threat of fraud to motivate managers to introduce systematic technical reviews. I generally do this after failing to motivate them using the (very real and significant) threat of failure, folly, fatuousness, or fun.

Fanaticism

Very infrequently, people try to destroy or disrupt a system, but not for direct gain. Sometimes they are seeking revenge against the company, the industry, or the country for real or imagined wrongs done to them. Fanaticism like this is very hard to stop if the fanatic is determined, especially because, as with "fun," you never know what someone will consider an offense that requires revenge.

Fanaticism, like fraud, is a way of getting the attention of management. With reasonable precautions, however, the threat of terrorism can be reduced far below that of frailty. Frailty, however, lacks drama. In any case, many of the actions that protect you against frailty will also reduce the impact of terrorism. Besides, I cannot offer you any useful advice on how to observe potential terrorists in your organization. That would be "profiling."

Failure (of Hardware)

When the hardware in a system doesn't do what it's designed to do, failures may result. To a great extent, these can be overcome by software, but that is beyond the scope of this book. Fifty years ago, when programmers complained about hardware failures, they had a 50/50 chance of being right. Not today. So, if you hear people blaming hardware failures for their problems, this is significant information. What it signifies can be chosen from this list, for starters:

1. There really aren't significant hardware failures, but your programmers need an alibi. Where there's an alibi, start looking for what it's trying to conceal.

2. There really are hardware failures, but they are within the normally expected range. Your programmers, however, may not be taking the proper precautions, such as backing up their source code and test scripts.

3. There really are hardware failures, and you are not doing a good job managing your relationship with your hardware supplier.

4. Failure attributed to hardware may actually be caused by human error—unexpected actions on the part of the user. These are really system failures.

Fate

This is what most bad managers think is happening to them. It isn't. When you hear a manager talking about "bad luck," substitute the word "manager" for "luck." As they say in the Army,


There are no bad soldiers, only bad officers.

What's Next?

This three-part essay is now finished, but the topic is far from complete. If you want more, note that the essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events. 

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).

I'm sure you can figure out what to do next. Good luck!

Thursday, January 22, 2015

The Universal Pattern of Huge Software Losses

Apology: This post was supposed to appear automatically as a follow-up to my three cases of large, costly software failures, but evidently I had a software failure of my own, so the Google scheduler didn't do what I thought I asked for. So, here is the first follow-up, a bit late. I hope it's worth waiting for.

To complete my data gathering, I'll present two more loss cases, then proceed to describing the pattern that governs all of these cases, followed by the number one rule for preventing such losses in the first place.

Case history 4: A broker's statement

I know this story from the outside, as a customer of a large brokerage firm:
One month, a spurious line of $100,000.00 was printed on the summary portion of 1,500,000 accounts, and nobody knew why it was there. Twenty percent of clients called about it, using perhaps 50,000 hours of account representative time, or $1,000,000 at least. An unknown amount of customer time was used, and the effect on customer confidence was unknown. The total cost of this failure was at least $2,000,000, and the failure resulted from one of the simplest known errors in COBOL coding: failing to clear a blank line in a printing area.
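
As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. The time per call and the hourly cost are not stated in the story; they are simply what the quoted totals imply, so treat them as back-of-the-envelope values.

```python
# Figures quoted in the case history.
accounts = 1_500_000
percent_calling = 20            # "twenty percent of clients called"
rep_hours = 50_000              # estimated account representative time
rep_cost_dollars = 1_000_000    # "at least"

calls = accounts * percent_calling // 100

# Implied by the totals above, not stated in the story.
minutes_per_call = rep_hours * 60 / calls
cost_per_hour = rep_cost_dollars / rep_hours

print(f"{calls:,} calls, about {minutes_per_call:.0f} minutes each, "
      f"at roughly ${cost_per_hour:.0f} per representative hour")
```

Nothing in the individual call is alarming. The size of the loss comes entirely from the multiplication.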

Case history 5: A buying club statement

I know this story, too, from the outside, as a customer of a mail-order company, and also from the inside, as their consultant:
One month, a new service phone number for customer inquiries was printed on each bill. Unfortunately, the phone number had one digit incorrect, producing the number of a local doctor instead of the mail-order company. The doctor's phone was continuously busy for a week until he could get it disconnected. Many patients suffered, though I don't know if anyone died as a result of not being able to reach the doctor. The total cost of this failure would have been hard to calculate except for the fact that the doctor sued the mail-order company and won a large settlement. One of the terms of the settlement was that the doctor not reveal the amount, but I presume it was big enough. The failure resulted from an even simpler error in COBOL coding: copying a constant wrong.

The Universal Pattern of Huge Losses

I'll stop here, because I suspect you are getting bored with reading all these cases. Let me assure you, however, that they were anything but boring to the top management of the organizations involved. Rather than give a number of similar cases I have in my files, let's consider each case as a data point and try to extract some generalized meaning.

Every such case that I have investigated follows a universal pattern:

1. There is an existing system in operation, and it is considered reliable and crucial to the operation.

2. A quick change to the system is desired, usually from very high in the organization.

3. The change is labeled "trivial."

4. Nobody notices that statement 3 is a statement about the difficulty of making the change, not the consequences of making it, or of making it wrong.

5. The change is made without any of the usual software engineering safeguards, however minimal, that the organization has in place.

6. The change is put directly into the normal operations.

7. The individual effect of the change is small, so that nobody notices immediately.

8. This small effect is multiplied by many uses, producing a large consequence.

The Universal Pattern of Management Coping With a Large Loss

Whenever I have been able to trace management action subsequent to the loss, I have found that the universal pattern continues. After the failure is spotted:

9. Management's first reaction is to minimize its magnitude, so the consequences are continued for somewhat longer than necessary.

10. When the magnitude of the loss becomes undeniable, the programmer who actually touched the code is fired—for having done exactly what the supervisor said.

11. The supervisor is demoted to programmer, perhaps because of a demonstrated understanding of the technical aspects of the job.

12. The manager who assigned the work to the supervisor is slipped sideways into a staff position, presumably to work on software engineering practices.

13. Higher managers are left untouched. After all, what could they have done?

The First Rule of Failure Prevention

Once you understand the Universal Pattern of Huge Losses, you know what to do whenever you hear someone say things like:

• "This is a trivial change."

• "What can possibly go wrong?"

• "This won't change anything."

When you hear someone express the idea that something is too small to be worth observing, always take a look. That's the First Rule of Failure Prevention.


Nothing is too small to be worth observing.
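
One inexpensive way to "take a look" is to compare a system's output before and after a change, field by field, and ask whether anything moved that was not supposed to. The following is only a minimal sketch in Python: the record layout, the field names, and the sample records are all hypothetical, invented for illustration.

```python
# Hypothetical fixed-width layout: field name -> (start, end) columns.
LAYOUT = {"rate": (0, 6), "serial": (6, 12)}

def changed_fields(before, after, layout=LAYOUT):
    """Return the names of fields whose text differs between two records."""
    return [
        name
        for name, (start, end) in layout.items()
        if before[start:end] != after[start:end]
    ]

# Sample output from the old and the changed program (made-up records).
before = "073845001234"
after = "073800001235"

# The change was only supposed to affect the serial number.
expected = {"serial"}
surprises = [name for name in changed_fields(before, after) if name not in expected]

print("Fields that changed unexpectedly:", surprises)
```

A check like this does not prove a change is safe. It only makes a small, unexpected difference visible before it is multiplied by many uses.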


What's Next?

Now that you're familiar with the pattern, we'll take a breather until the next post. There I'll provide other guides for preventing such failures.

Note

This essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events.

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).


Thursday, January 15, 2015

Some Very Expensive Software Failures

Why Concentrate on Failure?

"So long as a man attends to his business the public does not count his drinks. When he fails they notice if he takes even a glass of root beer." - Corra May Harris

Logically, direct measurement of value should be the first place an organization starts to look at itself, but that's not how it usually happens. Instead, the trigger for most organizations to embark on some self-examination is failure—either one whale of a failure or thousands of annoying failure mosquitoes.

This concentration on failure may seem illogical—and may be illogical in many circumstances—but it does fit with our understanding of quality as subjective value. Of all the troublesome aspects of using computers, failures are by far the most annoying to the most people. Without ever conducting a detailed impact case study, or even a greatest single benefit study, people know that they don't like it when their computer fails. Thus, customers heap abundant praise and appreciation on the software organization that doesn't fail them.

Of course, the definition of failure changes with time, as expectations change. Once customers become accustomed to a certain level of service, a lapse from that level becomes a failure. Some customers have come to expect a succession of "breakthroughs" in software, so that achieving only a modest gain is seen as a failure. Thus, the first step in managing failures is to manage customer expectations—but that's always the first step in managing quality.

What Do Failures Cost?

Some perfectionists in software engineering are overly preoccupied with failure, and most others don't rationally analyze the value they place on failure-free operation. Nonetheless, when we do measure the cost of failure carefully, we generally find that great value can be added by producing more reliable software. In this section, we'll take a look at a few examples that should convince you.

Case history 1: A national bank

The national bank of Country X issued loans to all the banks in the country. Each loan was confirmed by a telegram showing the amount of the loan, the repayment conditions, and the interest rate. The telegram became the legal document for the loan. The COBOL program that composed and sent these telegrams had been in operation for almost 15 years, and had worked flawlessly. Somebody noticed, however, that the serial number field would run out of digits and begin duplicating serial numbers within a few months. As each loan was legally identified by the serial number on the telegram, duplication could not be allowed.

Management directed that the serial number field be expanded. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Expand the serial number field by two digits." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.

Some time later, a financial analyst noticed a slight discrepancy between estimated loan receipts and actual loan receipts. After much searching, it was discovered that the serial number expansion had overlaid the low order digits of the interest rate field, causing the final two digits of every interest rate to be truncated to "00." Although the difference between 7.3845% and 7.3800% is quite small, when you are lending hundreds of billions of dollars, it quickly adds up to something significant. In this case, it added up to more than a billion dollars that the national bank could never recover.
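
The story does not give the actual record layout, so the following Python sketch shows only one way such an overlay could happen. It assumes a hypothetical fixed-width record in which the interest rate sits immediately before the serial number, and a "fix" that widens the serial number by starting it two columns earlier. Under that assumption, the leading zeros of the padded serial number land on the last two digits of the rate.

```python
# Hypothetical layout (an assumption, not the bank's real record):
#   columns 0-5  : interest rate as digits, "073845" meaning 7.3845%
#   columns 6-11 : serial number, zero-padded
RECORD_LENGTH = 12
RATE_START, RATE_WIDTH = 0, 6

def build_record(rate_digits, serial, serial_start, serial_width):
    """Write the rate, then write the zero-padded serial on top of the buffer."""
    buf = [" "] * RECORD_LENGTH
    buf[RATE_START:RATE_START + RATE_WIDTH] = rate_digits
    buf[serial_start:serial_start + serial_width] = str(serial).zfill(serial_width)
    return "".join(buf)

def rate_digits_of(record):
    """Read the rate field back out of a finished record."""
    return record[RATE_START:RATE_START + RATE_WIDTH]

# Original program: six-digit serial starting at column 6.
old_record = build_record("073845", 1234, serial_start=6, serial_width=6)

# "Trivial" change: eight-digit serial, made to fit by starting at column 4.
new_record = build_record("073845", 1234, serial_start=4, serial_width=8)

print(rate_digits_of(old_record))   # rate as originally written
print(rate_digits_of(new_record))   # rate after the serial overlays it
```

In this sketch the record is still the same length and still looks well formed, which is part of why a few quick tests would not reveal the damage.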

Case history 2: A public utility

A utility company was changing its billing algorithm to accommodate rate changes (a utility company euphemism for "rate increases"). All this involved was updating a few numerical constants in the existing billing program.

Management directed that the constants be updated. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Replace these constants in the program." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.

Some time later, the Comptroller's office noticed a slight discrepancy between estimated receipts and actual receipts. After much searching, it was discovered that two low order digits in one of the constants had been entered with "75" transposed to "57", causing a number of the bills to be short by a small amount. With millions of customers being billed, this small difference added up to X dollars that the utility could never recover.

The reason I say "X dollars" is that I've heard this story from four different clients, with different values of X. Estimated losses ranged from a low of $42 million to a high of $1.1 billion. Given that this happened four times to my clients, and given how few public utilities are clients of mine, I'm sure it's actually happened many more times.
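
To see how a two-digit transposition can grow to that scale, here is a small Python sketch. Every number in it is invented for illustration: the story gives neither the constant involved, nor the usage per bill, nor the customer count, so the result shows the shape of the multiplication and nothing more.

```python
# All values below are hypothetical, chosen only to illustrate the effect.
INTENDED_RATE = 1275    # hundredths of a cent per kWh, i.e. 12.75 cents
ENTERED_RATE = 1257     # the same constant with "75" entered as "57"

KWH_PER_BILL = 600
CUSTOMERS = 2_000_000
BILLS_PER_YEAR = 12

# Work in integer hundredths of a cent to avoid rounding surprises.
shortfall_per_bill = (INTENDED_RATE - ENTERED_RATE) * KWH_PER_BILL
yearly_shortfall = shortfall_per_bill * CUSTOMERS * BILLS_PER_YEAR
yearly_shortfall_dollars = yearly_shortfall // 10_000

print(f"Short per bill: ${shortfall_per_bill / 10_000:.2f}")
print(f"Short per year: ${yearly_shortfall_dollars:,}")
```

With these made-up inputs, a shortfall of about a dollar per bill becomes tens of millions of dollars in a year, and nobody looking at a single bill would see anything wrong.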

Case history 3: A state lottery

I know of this one through the public press, so I can tell you that it's about the New York State Lottery:

A few years ago, the New York State legislature authorized a special lottery to raise extra money for some worthy purpose. As this special lottery was a variant of the regular lottery, the program to print the lottery tickets had to be modified. Fortunately, all this involved was changing one digit in the existing program.

Management directed that the change be made. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Change this digit to a five." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.


A few weeks later, when ticket sales were in full swing, one of the players bought two tickets and noticed that they had identical numbers. As there were supposed to be no duplicates in this lottery, he brought his tickets to the Daily News, which printed a photo of him and his two tickets on the front page. Public confidence in the lottery plunged, and the explanation that the error was "trivial" did not restore public confidence. In order to satisfy the public outcry, all lotteries were shut down pending the report of a blue ribbon investigating committee (this is government, after all). Altogether, it took 11 months for the matter to be resolved and the lotteries to be reestablished. At that time, the lotteries had been netting the state about $4 million to $5 million per month, so the total loss of revenue was estimated between $44 million and $55 million.

What's Next?

I have many more cases of failure, but to keep this blog short, I'll pause here. In my next blog essay, I'll give a few more cases, then describe the universal pattern of huge losses. After that, I'll provide some guides for preventing such failures.

Note

This essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events. 

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).