Sunday, April 26, 2015

Ending the Requirements Process

This essay is the entire chapter 5 of the second volume of Exploring Requirements.
The book can be purchased one volume at a time, as a bundle of volumes 1 and 2, or as part of the People Skills bundle. It's also available from various vendors with both volumes as one bound book.



The Book of Life begins with a man and woman in a garden. It ends with Revelations.—Oscar Wilde, A Woman of No Importance
25.1 The Fear of Ending
The requirements process begins with ambiguity. It ends not with revelations, but with agreement. More important, it ends.
But how does it end? At times, the requirements process seems like Oscar Wilde when he remarked, "I was working on the proof of one of my poems all the morning and took out a comma. In the afternoon, I put it back again."
A certain percentage, far too large, of development efforts never emerge from the requirements phase. After two, or five, or even ten years, you can dip into the ongoing requirements process and watch them take out a comma in the morning and put it back again in the afternoon. Far better the comma should be killed during requirements than allowed to live such a lingering death.
Paradoxically, it is the attempt to finish the requirements work that creates this endless living death. The requirements phase ends with agreement, but the requirements work never ends until the product is finished. There simply comes a moment when you decide you have enough agreement to risk moving on into the full design phase.
25.2 The Courage to End It All
Nobody can tell you just when to step off the cliff. It's simply a matter of courage. Whenever an act requires courage, you can be sure that people will invent ways of reducing the courage needed. The requirements process is no different, and several inventions are available to diminish the courage needed to end it all.
25.2.1 Automatic design and development
One of the persistent "inventions" to substitute for courage is some form of automatic design and/or development.
In automatic development, the finished requirements are the input to an automatic process, the output of which is the finished product. There are, today, a few simple products that can be produced in approximately this way. For instance, certain optical lenses can be manufactured automatically starting from a statement of requirements.
In terms of the decision tree, automatic development is like a tree with a trunk, one limb, no branches or twigs, and a single leaf (Figure 25-1). With such a system, finishing requirements is not a problem. When you think you might be finished, you press the button and take a look at the product that emerges. If it's not right, then you weren't finished with the requirements process.
Figure 25-1. Automatic development is like a tree with a trunk, one limb, no branches or twigs, and a single leaf.
No wonder this appealing dream keeps recurring. It's exactly like the age-old story of the genie in the bottle, with no limit on the number of wishes. For those few products where such a process is in place, the only advantage of a careful requirements process is to save the time and money of wasted trial products. If trial products are cheap, then we can be quite casual about finishing requirements work.
25.2.2 Hacking
Automatic development is, in effect, nothing but requirements work—all trunk and limb. At the other end of the spectrum is hacking, development work with no explicit requirements work. When we hack, we build something, try it out, modify it, and try it again. When we like what we have, we stop. In terms of the decision tree model, hacking is not a tree at all, but a bush: all branches, twigs, and leaves, with no trunk or major limb (Figure 25-2).
Figure 25-2. Hacking is not a tree at all, but a bush—all branches, twigs, and leaves, with no trunk or major limb.
Pure hacking eliminates the problem of ending requirements work, because there is no requirements work. On the other hand, we could conceive of hacking as pure requirements work—each hack is a way of finding out what we really want.
Almost any real project, no matter how well planned and managed, contains a certain amount of hacking, because the real world always plays tricks on our assumptions. People who abhor the idea of hacking may try to create a perfect requirements process. They are the very people who create "living death" requirements processes.
25.2.3 Freezing requirements
Paradoxically, hacking and automatic development are exactly the same process from the point of view of requirements. They are the same because they do not distinguish requirements work from development. Most real-world product development falls somewhere between pure hacking and automatic development, which is why we have to wrestle with the problem of ending.
Even when we have every requirement written down in the form of an agreement, we cannot consider the requirements process to be ended. We know those agreements will have to change because in the real world, assumptions will change. Some people have tried to combat their fear of changing assumptions by imposing a freeze on requirements. They move into the design phase with the brave declaration,
No changes to requirements will be allowed.
Those of us who understand the nature of the real world will readily see why a freeze simply cannot work. We know of only one product development effort in which it was even possible to enforce the freeze. In that instance, a software services company took a contract to develop an inventory application for a manufacturer. The requirements were frozen by being made part of the contract, and eighteen months later, the application was delivered. Although it met all the contracted requirements, the application was rejected.
This frozen product was totally unusable because in eighteen months it had become totally removed from what was really required in the here-and-now. The customer refused to pay, and the software services company threatened legal action. The customer pointed out how embarrassing it would be to a professional software company to have its "freeze fantasy" exposed in a public courtroom.
Eventually, the two parties sat down to negotiate, and the customer paid about one-fourth of the software firm's expenses. At the end of this bellicose negotiation, someone pointed out that, with far less time spent negotiating, they could have renegotiated the requirements as they went along, and both parties would have been happy.
25.2.4 The renegotiation process
The freeze idea is just a fantasy, designed to help us cope with our fear of closure. But we cannot fearlessly close the requirements phase unless we know there is some renegotiation process available. That's why agreeing to a renegotiation process is the last step in the requirements process.
Working out the renegotiation process is also a final test of the requirements process itself. If the requirements process has been well done, the foundation has been laid for the renegotiation process. The people have all been identified, and they know how to work well together. They have mastered all the techniques they will need in renegotiation, because they are the same techniques the participants needed to reach agreement in the first place.
25.2.5 The fear of making assumptions explicit
The agreement about renegotiation must, of course, be written down, and this act itself may strike fear in some hearts. Another way to avoid ending requirements is to avoid written agreements at the end of the requirements phase. Some designers are afraid of ending requirements because the explicit agreements would make certain assumptions explicit. An example may serve to make this surprising observation more understandable.
While working with the highway department of a certain state, we encountered the problem of what to do about a particularly dangerous curve on one of the state highways (see Figure 25-3). In an average year, about six motorists missed the curve and went to their death over a cliff. Because it was a scenic highway, it was neither practical nor desirable to eliminate the curve, but highway design principles indicated that a much heavier barrier would prevent wayward cars from going over the cliff.
Figure 25-3. What should be done about this dangerous curve?
Building the barrier seems an obvious decision, but there was another factor to consider. Perhaps once every three years, with a heavy barrier in place, one of these wayward cars would bounce off the barrier into a head-on collision with an oncoming vehicle. The collision would likely be fatal for all involved.
Now, on average, the number of people killed with the barrier in place would be perhaps one-fifth of those killed without the barrier, but the highway designers had to think of another factor. When a solo driver goes over the cliff, the newspapers will probably blame the driver. But what if a drunk driver bounces off the barrier and kills a family of seven who just happened to be driving in the wrong place at the wrong time? The headlines would shout about how the barrier caused the deaths of innocent people, and the editorials would scream for heads to roll at the highway department.
So, it's no wonder the highway engineers didn't want anything documented about their decision not to build the barrier. They were making life-and-death decisions in a way that covered their butts, and they could protect themselves by taking the position, "If I never thought about it, I'm not responsible for overlooking it in the design." And, if it was never written down, who could say they had thought of it?
25.3 The Courage to Be Inadequate
Most engineers and designers react to this story by citing similar stories of their own to prove, yes, indeed, there are decisions it's better not to write down. We believe, however, this kind of pretense is an abuse of professional power, an abuse not necessary if we remember the proper role of the requirements process.
It's not for the designers to decide what is wanted, but only to assist the customers in discovering what they want. The highway designers should have documented the two sides of the issue, then gone to the elected authorities for resolution of this open requirements question. With guidance from those charged with such responsibilities, the engineers could have designed an appropriate solution.
But suppose the politicians came back with an impossible requirement, such as,
The highway curve must be redesigned so there will be no fatalities in five years.
Then the engineers would simply go back to their customers and state they knew of no solution that fit this requirement, except perhaps for a barricade preventing cars from using the highway. Yes, they might lose their jobs, but that's what it means to be a professional—never to promise what you know in advance you can't deliver.

The purpose of requirements work is to avoid making mistakes, and to do a complete job. In the end, however, you can't avoid all mistakes, and you can't be omniscient. If you can't risk being wrong—if you can't risk being inadequate to the task you've taken on—you will never succeed in requirements work. If you want the reward, you will have to take the risk.

Tuesday, March 31, 2015

Three Samples to Read: People Skills—Soft but Difficult

Over more than half-a-century as a "technical consultant," I've learned that most of the problems I've seen are not fundamentally technical problems, but problems arising from people—problems solved or prevented by the application of so-called "soft" skills.
When we call these people-skills "soft," we imply that they are "easy" problems. Not so. If anything, they are hard problems—especially for those of us who have chosen some "technical" profession.

I've gradually developed some skill at handling such "soft" problems. I've also developed some skill at teaching others to improve their ability to solve such problems. "Teaching," of course, is another one of these "soft" skills. Writing is yet another "soft" (but not easy) skill, and I have written a number of books on a variety of "people skills."

My clients have often asked me to write one giant book covering "all the people skills," but such a book would have to have more than a thousand pages, so I have not done so. (One of my people skills is understanding that there are few people who would even attempt to read a thousand-page book.)

Recently, though, one of my clients pointed out that I had already written much more than a thousand pages on people skills (duh!). Once I let that "obvious" information into my thick skull, I realized that I knew a way to make some of those pages more readily available to my readers. And so I created a bundle of seven e-books anyone can buy at a saving of 25% of the original prices.

To help you decide if this bundle is for you, I've created the following sample of small lessons clipped from three of my books.  You can read and enjoy these samples and decide if your own "people skills" could be improved by reading dozens of such little lessons, plus a number of large lessons in the form of pragmatic models of human behavior. If your answer is "yes," then I invite you to purchase the bundle—risk-free, with Leanpub's "100% happiness guarantee."

So, here they are. Enjoy!


Sample #1: Discussing the Indiscussible 

(from More Secrets of Consulting)

Over the years I’ve come to believe that the key moment in a relationship occurs when one or both partners feels there’s something that can’t be talked about. This could be one thing, or many things, and for a lot of different reasons. When that moment arrives, the one thing that must happen is that the two partners talk about that thing.

As a consultant, one of the most important things I can do to improve relationships is get that indiscussible subject out and on the table—but that seems risky. When I begin to fear this risk, I remind myself of the terrible consequences I’ve seen when the partners don’t discuss the taboo subject.

For instance, I was asked to work with Bill and Sherman, the co-developers of a software product who weren’t talking with one another. In this case, the thing that Bill and Sherman needed to talk about was that Sherman didn’t want to talk to Bill. He didn’t even want to talk about the fact that he didn’t want to talk to Bill, so I decided to approach the subject indirectly, to reduce risk. I went to see Sherman and showed him my Courage Stick, letting him hold it and feel its smoothness.

“It’s nice,” Sherman said. “What is it?” 

“It’s my Courage Stick. I brought it with me because I wanted to talk to you about something and I was a little afraid you might not like it.” 

“You, afraid of me? I’ve never seen you afraid to say anything.”

“Oh, I get scared about lots of things I need to say, but my Courage Stick reminds me that there are also fearful consequences when important things aren’t said.”

“Like what?” 

“Like if I don’t tell you what will happen to your company if you and Bill don’t talk about some essential subjects. Like how you’re going to have a product that sucks, and how everybody is going to infer that Sherman is a lousy software architect.” 

You see, I didn’t know why Sherman was afraid to talk to Bill, but I knew that this would tap into one of Sherman’s greatest fears and change the Fraidycat Formula. I never tried to get Sherman to admit that he was afraid of talking with Bill, and after a little more coaxing, I led him by the elbow down to Bill’s office. I stayed for a while to act as referee, but soon Sherman’s fear was a thing of the past. Later, Bill told me that it took a lot of courage for Sherman to approach him. I didn’t bother to correct his impression. 

Sample #2: Changing Geography

(A case study from Becoming a Change Artist)

Change is a long-term process, but a living organization lives in the immediate present. Thus, without careful management, long-term change is invariably sacrificed to short-term expedience. And short-term expedience is taking place all the time, everywhere in the organization, essentially out of the view of high-level management. That's why change artists have to be in every nook and cranny, as the following case illustrates.

DeMarco and Lister have given us much useful information on the effects of geography on software development effectiveness. One manager, Ruben, reading Peopleware, was inspired to change the seating geography to improve customer relations. His semi-annual customer satisfaction survey had given the developers very low marks on communication. So, instead of seating people for efficient performance of today's work, he wanted to seat them to encourage communication, by putting eight developers in the customers' offices. However, when he surveyed customers six months later, Ruben found a large decline in their satisfaction with communication, precisely in those offices into which he had moved a developer.

What had typically happened was this. The developer would move to the customer's office and set up shop. Communication improved, but the first time there was a software emergency, the developer would rush back "home" to get the problem solved. Emergencies were rather frequent, and soon the developer would find it expedient to establish a "temporary" office in the developer area.

After a while, the temporary office was occupied 99% of the time. In terms of the Satir Change Model, the foreign element—Ruben's move—had been rejected. Even worse, the customers were left staring at an empty office they were paying for, which reminded them of how hard it was to communicate with developers. 

Ruben noticed, however, that in one customer's office, satisfaction increased. In that office, the scenario had been different. Polly, a customer and change artist, sat in the office next door to Lyle, the developer, and noticed how often he was missing. She listened to comments made by other customers, and then she took action. 

Interviewing Lyle, she discovered that there were two main reasons why he kept leaving to solve problems: 

  • The PCs in the customer office had less capacity than the ones in the developers' office, and that capacity was needed to run debugging tools effectively. 
  • There were two other developers Lyle needed to consult on most of these problems, and they still resided in the developers' office. 
Polly knew these were typical problems of the Integration phase of the Satir Change Model, so she simply arranged to have Customer Service upgrade Lyle's PC. She then explained the benefits of having the two developers come to the customers' office, which was, after all, only two floors away. They were only too glad to come, and the customers were fascinated to learn how much work it was to fix "simple" problems in software.

With Polly as Lyle's neighbor, Ruben's strategic concept was implemented. Without using his survey to connect the strategic and the tactical, he would never have learned that it was possible to make the new geography work, and the success would have been limited to Lyle's area. Polly was sent around to the other developers who had moved, and she managed to get five out of seven working smoothly by similarly upgrading their PCs and encouraging developers to solve problems close to where they occurred. 

Polly also had the change-artist's skill to recognize that the remaining two departments were having deeper problems with information systems—problems that wouldn't be solved by upgrading PCs or encouraging proximity. Indeed, she saw that proximity was only making matters worse because the people were unskilled in handling interpersonal conflict. Thus, instead of blindly applying the same solution to everyone, she suggested to the managers that certain people could benefit from training in teamwork skills. 


Sample #3: Context-Free Questions

(from Exploring Requirements)

Once we have a starting point and a working title for a project, our next step in the project is invariably to ask some context-free questions. These are high-level questions posed early in a project to obtain information about global properties of the design problem and potential solutions.

Context-free questions are completely appropriate for any product to be designed, whether it's an automobile, airplane, ship, house, a jingle for a one-minute television commercial for chewing gum, a lifetime light bulb, or a trek in Nepal.

In terms of the decision tree, context-free questions help you decide which limb to climb, rather than which branch or twig. Because context-free questions are independent of the specific design task, they form a part of the designer's toolkit most useful before getting involved in too many enticing details.

Context-Free Process Questions
Some context-free questions relate to the process of design, rather than the designed product. Here are some typical context-free process questions, along with possible answers for the Elevator Information Device Project. (Notice that because these questions are context-free, we don't need to understand anything about the Elevator Information Device Project, an example started earlier in the Exploring Requirements book).

To appreciate how these questions can always be asked, regardless of the product being developed, also try answering them for some current project of yours, like "Trekking in Nepal."

Q: Who is the client for the Elevator Information Device?
A: The client is the Special Products Division of HAL, the world's largest imaginary corporation.

Q: What is a highly successful solution really worth to this client?
A: A highly successful solution to the problem as stated would be worth $10 million to $100 million in increased annual net profit for a period of five to ten years. The product should start earning revenue at this rate two years from now.

Q: (Ask the following context-free question if the answer to the previous context-free question does not seem to justify the effort involved.) What is the real reason for wanting to solve this problem?
A: The Elevator Information Device Project is a pilot effort for a range of commercial (and possibly even home or personal) information transfer devices. If we can demonstrate success during development and early marketing phases, this project is expected to spawn an independent business unit with gross revenues in seven years of $2 billion per year.

Q: Should we use a single design team, or more than one? Who should be on the team(s)?
A: You may choose whatever team structure you desire, but include someone on the team who knows conventional elevator technology, and someone who understands building management.

Q: How much time do we have for this project? What is the trade-off between time and value?
A: We don't need the device before two years from now because we won't be ready to market it, but every year we delay after that will probably reduce our market share by half.

Q: Where else can the solution to this design problem be obtained? Can we copy something that already exists?
A: To our knowledge, nowhere. Although many solutions exist in the form of control and information display panels for elevators, the approach adopted here should exploit the latest in sensing, control, information display, and processing—so copying doesn't seem appropriate. We have no objection to your copying features existing elsewhere, and you should be aware of what else has been done in the field, so you can surpass it.

Potential Impact of a Context-Free Question
Are context-free questions really worth such a fuss, and why do they need to be asked so early? Let's look at another example—extreme but real—of an answer to the second question, "What is a highly successful solution really worth to this client?"

At 3 a.m., a man in dirty jeans and cowboy boots showed up at the service bureau operation of a large computer manufacturer. Through the locked door, he asked if he could buy three hours' worth of computer time on their largest machine—that night. The night-shift employees were about to turn him away when one of them said, "Well, it costs $800 an hour. Is it worth $2,400?"

"Absolutely," said the cowboy, who emphasized the urgency by pulling a large wad. of $100 bills from his pocket and waving them at the employees on the other side of the glass door. They let him enter, took his payment in cash, and allowed him run his job on the machine. It turned out he owned a number of oil wells. As result of his computations, and especially the courteous treatment he received, he bought three of the giant machines, at a cost of some $10 million. If the employees had assumed the answer to the "What's it worth?" question based on his appearance, there would have been no sale.

The People-Skills Bundle


To learn more about all 7 books in the bundle, take a look at 


If you like what you see, buy the bundle with its 100% happiness guarantee. And perhaps you have a friend or two who could profit from reading their own copy.

Wednesday, March 04, 2015

The Art of Bugging: or Where Do Bugs Come From


Author's Note: When I first published this list in 1981, it was widely distributed by fans. Then for a while, it disappeared. Every year, I received a request or three for a copy, but by 2000, I'd lost the original. This month, while closing up our cabin in the woods, I found it in the back of a file cabinet. Inasmuch as I've been writing recently about testing (as in Perfect Software), and as it seems still up-to-date and relevant, I decided to republish it here for the grandchildren of the original readers.

Not so long ago, it wasn't considered in good taste to speak of errors in computer systems, but fashions change. Today, articles and books on software errors are outnumbered only by those on sex, cooking, and astrology. But fashion still rules. Everybody talks about debugging—how to remove errors—but it's still of questionable taste to speak of how the bugs get there in the first place. In many ways, the word "debugging" has injured our profession.

"Bug" sounds like something that crawled in under the woodwork or flew through a momentary opening of the screen door. Misled by this terminology, people have shied away from the concept of software errors as things people do, and which, therefore, people might learn not to do.

In this column, I'd like to shift the focus from debugging—the removal of bugs—to various forms of putting them in the software to begin with. For ease of reference, I'll simply list the various types of bugging alphabetically.

Be-bugging is the intentional putting of bugs in software for purposes of measuring the effectiveness of debugging efforts. By counting the percentage of intentional bugs that were found in testing, we get an estimate of the number of unintentional bugs that might still remain. Hopefully, we remember to remove all the be-bugs when we're finished with our measurements.
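
The arithmetic behind be-bugging is simple enough to sketch. Here's a minimal illustration, assuming—unrealistically—that seeded and natural bugs are equally easy to find:

    # Minimal sketch of the be-bugging (defect seeding) estimate.
    # Assumption: seeded and natural bugs are equally likely to be found.
    def estimate_remaining_bugs(seeded, seeded_found, natural_found):
        """Roughly estimate how many natural bugs are still lurking."""
        if seeded_found == 0:
            raise ValueError("No seeded bugs found yet; can't estimate.")
        detection_rate = seeded_found / seeded          # fraction of seeded bugs caught
        estimated_total = natural_found / detection_rate
        return estimated_total - natural_found

    # Example: seed 20 bugs; testing finds 15 of them plus 30 natural bugs.
    # A 75% detection rate suggests about 40 natural bugs, so about 10 remain.
    print(estimate_remaining_bugs(seeded=20, seeded_found=15, natural_found=30))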

Fee-bugging is the insertion of bugs by experts you've paid high fees to work on your system. Contract programmers are very skilled at fee-bugging, especially if they're treated like day laborers on a worm farm.

Gee-bugging is the grafting of bugs into a program as part of a piece of "gee-whiz" coding—fancy frills that are there to impress the maintenance programmers rather than meet the specifications.

Knee-bugging is thrusting bugs into a program when you're in too much of a hurry to bend your knee and sit down to think about what you're doing. We have a motto—"Never debug standing up." We could well extend that motto to bugging: "Never bug standing up."

Me-bugging may involve numerous forms of bug installation, for it refers to the method by which the bugs are protected from assault. The programmer regards any attempt to question the correctness of the code as an assault on his or her value as a human being. Those who practice "egoless programming" are seldom guilty of me-bugging.

Pea-bugging can be understood by reference to the story of The Princess and the Pea. The princess showed her true royalty by being able to feel a pea through a hundred mattresses, illustrating that no bug is too small to be noticed by computers or other royalty. The pea-bugger, however, pops in little bugs with the thought, "Oh, nobody will ever notice this tiny glitch."

Pee-bugging is the rushing in of bugs when the programmer is impatient to attend to other matters, not necessarily the call of nature. The pee-bugger is the one who is always heard to contribute: "Let's just get this part coded up so we can get on to more important matters."

Plea-bugging is the legalistic method of forcing bugs into software. The plea-bugger can argue anyone out of any negative feeling for a piece of code, and is especially dangerous when walking through code, leading the committee around by its collective nose.

Pre-bugging is the art of insinuating bugs into programs before any code is written, as in the specification or design stages. We often hear the pre-bugger saying, "Oh, that's clear enough for the coders. Any moron could understand what we mean."

Re-bugging is the practice of re-introducing bugs that were already removed, or failing to remove bugs that were found and supposedly removed. Re-bugging is especially prevalent in on-line work, and some on-line programmers have been known to collapse from malnutrition when caught in an endless loop of debugging and rebugging a related pair of bugs.

Sea-bugging is named for the state of mind in which it is done—"all at sea." The typical sea-bug splashes in when everyone is making waves and shouting, "We've got to do something!" That something is invariably a bug—unless it's two bugs.

See-bugging is the implantation of bugs that "everyone can see" are correct. See-bugging is done in a state of mass hypnosis, often when too many susceptible minds are pooled on the same team or project. This unhealthy state prevails when management values harmony over quality, thus eliminating anyone who might make waves. Of course, too many wave makers leads to sea-bugging, so programming teams have to be constituted as a compromise between harmony and healthy discord.

S/he-bugging is done all the time, though nobody likes to talk about it. S/he-bugs have a way of infusing your code when your mind is on sex, or similar topics. Because sex is a topic unique unto itself, all s/he bugs originate by what might be referred to as sexual reproduction. That's why this is such a populous class of bugs.

Tea-bugging is the introduction of bugs when problems are solved during tea and coffee break conversations and then not checked before passing directly into the code.

The-bugging (pronounced thee-bugging) is crowding multiple bugs into a program under the "one-program-one-bug" fallacy. The addicted the-bugger can invariably be heard ejaculating: "I've just found the bug in my program."

We-bugging is the ordination of bugs by majority rule. When someone doubts the efficacy of a piece of code, as in a review, the majority votes down this disturbing element, for how can three out of five programmers be wrong? Thus are born we-bugs.

Whee-bugging is a symptom of boredom, and frequently arises when Agile programming and other bug-prevention schemes have become well-established. Programmers reminisce about "the good old days," when programmers proved their machismo by writing code in which nobody could find bugs. One thing leads to another, and eventually someone says, "Let's do something exciting in this piece of code—wheeeee!"

Ye-bugging is a mysterious process by which bugs appear in code touched by too many hands. Nobody knows much about ye-bugging because every time you ask how some bug got into the code, every programmer claims somebody else did it.

Z-bugging is the programmer's abbreviation for zap-bugging. The zap-bugger is in such a hurry that it's unthinkable to use three letters when one will do. Similarly, it's unthinkable to submit to any time-wasting controls on changes to working code, so the Z-bugger rams bugs into the operating system with a special program called ZAP, or SUPERZAP. Although Z-bugs are comparatively rare, they are frequently seen because they always affect large numbers of people. There's a saying among programmers, perhaps sarcastic, "All the world loves a Z-bugger."

So there they are, in all their glory, nineteen ways that people put errors in programs. There are others. I've heard rumors of tree-bugs, spree-bugs, agree-bugs, tee-hee-hee-bugs, and fiddle-dee-dee-bugs. I'm tracking down reports of the pernicious oh-promise-me-bug. Entomologists estimate that only one-tenth of all the world's insect species have been discovered and classified, so we can look forward to increasing this bugging list as our research continues. Too bad, for we seem to have enough already.

Note added in 2015: Yes, we had enough, but many have been added. If you know of a bug we didn't list here, I invite you to submit a comment about it.

Monday, February 23, 2015

Errors Versus Quality


Are Errors a Moral Issue?


NOTE: This essay is the second of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns

    

Errors in software used to be a moral issue for me, and still are for many writers. Perhaps that's why these writers have asserted that "quality is the absence of errors." 

It must be a moral issue for them, because otherwise it would be a grave error in reasoning. Here's how their reasoning might have gone wrong. Perhaps they observed that when their work is interrupted by numerous software errors, they can't appreciate any other good software qualities. From this observation, they can conclude that many errors will make software worthless—i.e., zero quality.
But here's the fallacy: though copious errors guarantee worthlessness, zero errors guarantees nothing at all about the value of software.
Let's take one example. Would you offer me $100 for a zero defect program to compute the horoscope of Philip Amberly Warblemaxon, who died in 1927 after a 37-year career as a filing clerk in a hat factory in Akron? I doubt you would even offer me fifty cents, because to have value, software must be more than perfect. It must be useful to someone.
Still, I would never deny the importance of errors. First of all, if I did, people in Routine, Oblivious and Variable organizations would stop reading this book. To them, chasing errors is as natural as chasing sheep is to a German Shepherd Dog. And, as we've seen, when they are shown the rather different life of a Steering organization, they simply don't believe it.
Second of all, I do know that when errors run away from us, that's one of the ways to lose quality. Perhaps our customers will tolerate 10,000 errors, but, as Tom DeMarco asked me, will they tolerate 10,000,000,000,000,000,000,000,000,000? In this sense, errors are a matter of quality. Therefore, we must train people to make fewer errors, while at the same time managing the errors they do make, to keep them from running away.

The Terminology of Error

I've sometimes found it hard to talk about the dynamics of error in software because there are many different ways of talking about errors themselves. One of the best ways for a consultant to assess the software engineering maturity of an organization is by the language they use, particularly the language they use to discuss error. To take an obvious example, those who call everything "bugs" are a long way from taking responsibility for controlling their own process. Until they start using precise and accurate language, there's little sense in teaching such people about basic dynamics.

Faults and failures. First of all, it pays to distinguish between failures (the symptoms) and faults (the diseases). Musa gives these definitions:

A failure "is the departure of the external results of program operation from requirements." A fault "is the defect in the program that, when executed under particular conditions, causes a failure." For example:
An accounting program had an incorrect instruction (fault) in the formatting routine that inserts commas in large numbers such as "$4,500,000". Any time a user printed a number of more than six digits, a comma could be missing (a failure). Many failures resulted from this one fault.
How many failures result from a single fault? That depends on
• the location of the fault
• how long the fault remains before it is removed
• how many people are using the software.
The comma-insertion fault led to millions of failures because it was in a frequently used piece of code, in software that had thousands of users, and it remained unresolved for more than a year.
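
To make the fault/failure distinction concrete, here is a hypothetical sketch of such a single formatting fault—not the actual accounting code, just the flavor of it. Every sufficiently large number a user prints is a separate failure, yet all of them trace back to the one fault:

    # Hypothetical sketch of one fault that produces many failures.
    def format_dollars(amount):
        digits = str(amount)
        # FAULT: only the rightmost comma is ever inserted, so any amount of
        # seven or more digits prints wrong ("$4500,000" instead of "$4,500,000").
        if len(digits) > 3:
            digits = digits[:-3] + "," + digits[-3:]
        return "$" + digits

    print(format_dollars(4500000))   # "$4500,000" -- a failure
    print(format_dollars(12000))     # "$12,000"   -- no visible failure
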
When studying error reports at various clients, I often find that they mix failures and faults in the same statistics because they don't understand the distinction. If these two different measures are mixed into one, it becomes difficult for them to understand their own experience. For instance, because a single fault can lead to many failures, it is impossible to compare failure counts between two organizations that aren't careful in making this "semantic" distinction.
Organization A has 100,000 customers who use their software product, each for an average of 3 hours a day. Organization B has a single customer who uses one software system once a month. Organization A produces 1 fault per thousand lines of code, and receives over 100 complaints a day. Organization B produces 100 faults per thousand lines of code, but receives only one complaint a month.
Organization A claims they are better software developers than Organization B. Organization B claims they are better software developers than Organization A. Perhaps they're both right.
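
A back-of-the-envelope sketch shows why neither claim settles anything until faults and failures are kept separate. (The usage hours assumed for Organization B are only an illustration; the fault densities and complaint rates come from the example above.)

    # Back-of-the-envelope comparison; usage hours for B are assumed.
    org_a = {"fault_density": 1,          # faults per thousand lines of code
             "users": 100_000,
             "hours_per_user_per_day": 3,
             "complaints_per_day": 100}
    org_b = {"fault_density": 100,
             "users": 1,
             "hours_per_user_per_day": 1 / 30,   # assume one hour of use per month
             "complaints_per_day": 1 / 30}       # one complaint a month

    for name, org in (("A", org_a), ("B", org_b)):
        user_hours = org["users"] * org["hours_per_user_per_day"]
        rate = org["complaints_per_day"] / user_hours    # failures per user-hour
        print(name, "faults/KLOC:", org["fault_density"],
              "| failures per user-hour: %.6f" % rate)

    # A looks worse on raw complaint counts but far better per user-hour;
    # B looks worse on fault density. "Better" depends on which measure you mix.
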
The System Trouble Incident (STI). Because of the important distinction between faults and failures, I encourage my clients to keep at least two different statistics. The first of these is a database of "system trouble incidents," or STIs. In my books, I mean an STI to be an "incident report of one failure as experienced by a customer or simulated customer (such as a tester)."
I know of no industry standard nomenclature for these reports—except that they invariably take the form of TLAs (Three Letter Acronyms). The TLAs I have encountered include:
- STR, for "software trouble report"
- SIR, for "software incident report," or "system incident report"
- SPR, for "software problem report," or "software problem record"
- MDR, for "malfunction detection report"
- CPI, for "customer problem incident"
- SEC, for "significant error case,"
- SIR, for "software issue report"- DBR, for "detailed bug report," or "detailed bug record"- SFD, for "system failure description"- STD, for "software trouble description," or "software trouble detail"
I generally try to follow my client's naming conventions, but try hard to find out exactly what is meant. I encourage them to use unique, descriptive names. It tells me a lot about a software organization when they use more than one TLA for the same item. Workers in that organization are confused, just as my readers would be confused if I kept switching among ten TLAs for STIs. The reasons I prefer STI to some of the above are as follows:
1. It makes no prejudgment about the fault that led to the failure. For instance, it might have been a misreading of the manual, or a mistyping that wasn't noticed. Calling it a bug, an error, a failure, or a problem, tends to mislead.
2. Calling it a "trouble incident" implies that once upon a time, somebody, somewhere, was sufficiently troubled by something that they happened to bother making a report. Since our definition of quality is "value to some person," someone being troubled implies that it's worth something to look at the STI.
3. The words "software" and "code" also contain a presumption of guilt, which may unnecessarily restrict location and correction activities. We might correct an STI with a code fix, but we might also change a manual, upgrade a training program, change our ads or our sales pitch, furnish a help message, change the design, or let it stand unchanged. The word "system" says to me that any part of the overall system may contain the fault, and any part (or parts) may receive the corrective activity.
4. The word "customer" excludes troubled people who don't happen to be customers, such as programmers, analysts, salespeople, managers, hardware engineers, or testers. We should be so happy to receive reports of troublesome incidents before they get to customers that we wouldn't want to discourage anybody.
Similar principles of semantic precision might guide your own design of TLAs, to remove one more source of error, or one more impediment to their removal. Pattern 3 organizations always use TLAs more precisely than do Pattern 1 and 2 organizations.
System Fault Analysis (SFA). The second statistic is a database of information on faults, which I call SFA, for System Fault Analysis. Few of my clients initially keep such a database separate from their STIs, so I haven't found such a diversity of TLAs. Ed Ely tells me, however, that he has seen the name RCA, for "Root Cause Analysis." Since RCA would never do, the name SFA is a helpful alternative because:
1. It clearly speaks about faults, not failures. This is an important distinction. No SFA is created until a fault has been identified. When an SFA is created, it is tied back to as many STIs as possible. The time lag between the earliest STI and the SFA that clears it up can be an important dynamic measure.
2. It clearly speaks about the system, so the database can contain fault reports for faults found anywhere in the system.
3. The word "analysis" correctly implies that data is the result of careful thought, and is not to be completed unless and until someone is quite sure of their reasoning.
"Fault" does not imply blame. One deficiency with the semantics of the term"fault" is the possible implication of blame, as opposed to information. In an SFA, we must be careful to distinguish two places associated with a fault, neither of these implies anything about whose "fault" it was:
a. origin: at what stage in our process the fault originated
b. correction: what part(s) of the system will be changed to remedy the fault
Routine, Oblivious and Variable organizations tend to equate these two notions, but the motto, "you broke it, you fix it," often leads to an unproductive "blame game." "Correction" tells us where it was wisest, under the circumstances, to make the changes, regardless of what put the fault there in the first place. For example, we might decide to change the documentation—not because the documentation was bad, but because the design is so poor it needs more documenting and the code is so tangled we don't dare try to fix it there.
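
Here is a minimal sketch of what keeping the two databases separate might look like. The field names are illustrative assumptions, not a standard, but they show how one SFA ties back to many STIs and records both origin and correction:

    # Minimal sketch; field names are illustrative assumptions, not a standard.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class STI:                 # one reported incident = one failure someone experienced
        sti_id: int
        reported_by: str       # customer, tester, programmer, salesperson, ...
        description: str       # what troubled them, with no prejudgment of the cause

    @dataclass
    class SFA:                 # one analyzed fault, created only after analysis
        sfa_id: int
        origin: str            # stage where the fault originated, e.g. "requirements"
        correction: str        # part(s) of the system wisest to change, e.g. "manual"
        related_stis: List[int] = field(default_factory=list)   # tied back to many STIs

    # One fault, many incidents: the lag between the earliest STI and the
    # SFA that clears it up is itself a useful dynamic measure.
    fault = SFA(sfa_id=1, origin="requirements", correction="documentation",
                related_stis=[101, 102, 117])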

If Steering organizations are not heavily into blaming, why would they want to record the "origin" of a fault? To these organizations, "origin" merely suggests where action might be taken to prevent a similar fault in the future, not which employee is to be taken out and crucified. Analyzing origins, however, requires skill and experience to determine the earliest possible prevention moment in our process. For instance, an error in the code might have been prevented if the requirements document had been more clearly written. In that case, we should say that the "origin" was in the requirements stage.


(to be continued)

Tuesday, February 17, 2015

Why You Should Love Errors

Observing and Reasoning About Errors


NOTE: This essay is the first of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns
Three of the great discoveries of our time have to do with programming: programming computers, the programming of inheritance through DNA, and programming of the human mind (psychoanalysis). In each, the idea of error is central.
The first of these discoveries was psychoanalysis, with which Sigmund Freud opened the Twentieth Century and set a tone for the other two. In his introductory lectures, Freud opened the human mind to inspection through the use of errors—what we now call "Freudian slips."
The second of these discoveries was DNA. Once again, key clues to the workings of inheritance were offered by the study of errors, such as mutations, which were mistakes in transcribing the genetic code from one generation to the next.
The third of these discoveries was the stored program computer. From the first, the pioneers considered error a central concern. von Neumann noted that the largest part of natural organisms was devoted to the problem of survival in the face of error, and that the programmer of a computer need be similarly concerned.
In each of these great discoveries, errors were treated in a new way: not as lapses in intelligence, or moral failures, or insignificant trivialities—all common attitudes in the past. Instead, errors were treated as sources of valuable information, on which great discoveries could be based.
The treatment of error as a source of valuable information is precisely what distinguishes the feedback (error-controlled) system from its less capable predecessors—and thus distinguishes Pattern 3 (Steering) software cultures from Patterns 0 (Oblivious), 1 (Variable), and 2 (Routine). Organizations in those patterns have more traditional—and less productive—attitudes about the role of errors in software development, attitudes that they will have to change if they are to transform themselves into Steering organizations.
So, in the following blog entries, we'll explore what happens to  people in Oblivious, Routine, and especially Variable organizations as they battle those "inevitable" errors in their software. After reading these chapters, perhaps they'll appreciate that they can never move to a Steering pattern until they learn how to use the information in the errors they make.
One of my editors complained that the first sections of this essay spend "an inordinate amount of time on semantics, relative to the thorny issues of software failures and their detection."
What I wanted to say to her, and what I will say to you, is that such "semantics" form one of the roots of "the thorny issues of software failures and their detection." Therefore, to build on a solid foundation, I need to start this discussion by clearing up some of the most subversive ideas and definitions about failure. If you already have a perfect understanding of software failure, then skim quickly, and please forgive me.

Errors Are Not A Moral Issue

"What do you do with a person who is 900 pounds overweight that approaches the problem without even wondering how a person gets to be 900 pounds overweight?"—Tom DeMarco
This is the question Tom asked when he read an early version of this blog. He was exasperated about clients who were having trouble managing more than 10,000 error reports per product. So was I.
Over fifty years ago, in my first book on computer programming, Herb Leeds and I emphasized what we then considered the first principle of programming:
The best way to deal with errors is not to make them in the first place.
Not all wisdom was born in the Computer Age. Thousands of years before computers, Epictetus said,  "Men are not moved by things, but by the views which they take of them." 
Like many hotshot programmers half a century ago, my view of "best" was then a moral stance:
Those of us who don't make errors are better programmers than those of you who do.
I still consider this a statement of the first principle of programming, but somehow I no longer apply any moral sense to the principle. Instead, I mean "best" only in an economic sense, because,
Most errors cost more to handle than they cost to prevent.
This, I believe, is part of what Crosby means when he says "quality is free." But even if it were a moral question, I don't think that Steering cultures—which do a great deal to prevent errors—can claim any moral superiority over Oblivious, Routine, and Variable cultures—which do not. You cannot say that someone is morally inferior because they don't do something they cannot do. Oblivious, Routine, and Variable software cultures simply cannot—yet these days most programmers operate in such organizations, which are culturally incapable of preventing large numbers of errors. Why incapable? Let me put Tom's question another way:
"What do you do with a person who is rich, admired by thousands, overloaded with exciting work, 900 pounds overweight, and has 'no problem' except for occasional work lost because of back problems?"
Tom's question presumes that the thousand-pound person perceives a weight problem, but what if they perceive a back problem instead? My Oblivious, Routine, or Variable clients with tens of thousands of errors in their software do not perceive that they have a serious problem with errors. Why not? First of all, they are making money. Secondly, they are winning the praise of their customers. Customer complaints are generally at a tolerable level on two products out of three. With their rate of profit, why should they care if a third of their projects have to be written off as total losses?
If I attempt to discuss these mountains of errors with Oblivious, Routine or Variable clients, they reply, "In programming, errors are inevitable. Even so, we've got our errors more or less under control. So don't worry about errors. We want you to help us get things out on schedule."
Such clients see no more connection between enormous error rates and two-year schedule slippages than the obese person sees between 900 pounds of body fat and pains in the back. Would it do any good to accuse them of having the wrong moral attitude about errors? Not likely. I might just as well accuse a blind person of having the wrong moral attitude about rainbows.

But their errors do create a moral question—for me, their consultant. If my thousand-pound client is happy, it's not my business to tell him how to lose weight. If he comes to me complaining of back problems, I can step him through a diagram of effects showing how weight affects his back. Then it's up to him to decide how much pain is worth how many chocolate cakes.
Similarly, if he comes to me complaining about error problems, I can ... (you finish the sentence)
(to be continued)

Tuesday, February 10, 2015

The Eight Fs of Software Failure

It doesn't have to be that way

Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story, such as Financial World's account of Charles T. Fisher III of NBD Corporation, one of their award-winning CEOs for the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention

One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observe them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson—the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others

Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:

What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over more than a half-century in the information systems business, there have been many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because they are such enormous losses that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)."

The Significance of Failure Sources

If we're to prevent failures, then we must observe the conditions that generate them. In searching out conditions that breed failures, I find it useful to consider that failures may come from the following eight F's: frailty, folly, fatuousness, fun, fraud, fanaticism, failure, and fate. The following is a brief discussion of each source of failure, along with ways of interpreting its significance when observed.

But before getting into this subject, a warning. You can read these sources of failure as passing judgment on human beings, or you can read them as simply describing human beings. For instance, when a perfectionist says "people aren't perfect," that's a condemnation, with the hidden implication that "people should be perfect." Frankly, I don't think I'd enjoy being around a perfect person, though I don't know, because I've never met one. So, when I say, "people aren't perfect," I really mean two things:

"People aren't perfect, which is a great relief to me, because I'm not perfect."

"People aren't perfect, which can be rather annoying when I'm trying to build information system. But it will be even more annoying if I build my information system without taking this wonderful imperfection into account."

It may help you, when reading the following sections, to do what I did when writing them. For each source, ask yourself, "When have I done the same stupid thing?" I was able to find many examples of times when I made mistakes, made foolish blunders, made fatuous boo boos, had fun playing with a system and caused it to fail, did something fraudulent (though not, I hope, illegal or immoral), acted with fanaticism, or blamed fate for my problems. Once, I actually even experienced a hardware failure when I hadn't backed up my data. If you haven't done these things yourself (or can't remember or admit doing them), I'd suggest that you stay out of the business of managing other people until you've been around the real world a bit longer.

Frailty
Frailty means that people aren't perfect. They can't do what the design calls for, whether it's the program design or the process design. Frailty is the ultimate source of software failure. The Second Law of Thermodynamics says nothing can be perfect. Therefore, the observation that someone has made a mistake is no observation at all. It was already predicted by the physicists.

It was also measured by the psychologists. Recall case history 5, the buying club statement with the incorrect telephone number. When copying a phone number, the programmer got one digit incorrect. Simple psychological studies demonstrate that when people copy 10-digit numbers, they invariably make mistakes. But everybody knows this. Haven't you ever copied a phone number incorrectly?

The direct observation of a mistake has no significance, but the meta-observation of how people prepare for mistakes does. It's a management job to design procedures for updating code that acknowledge these facts of nature, and to see that the procedures are carried out. The significant observation in this case, then, is that the managers of the mail-order company failed to establish or enforce such procedures.
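As a small illustration of what such a procedure might look like (my own sketch, not anything from the case history), a data-entry step can require a copied number to be keyed twice and refuse to accept it until the two copies agree. The check assumes nothing about the person being careful; it simply expects the mistake:

# A minimal double-entry check for hand-copied data (illustrative sketch only).
# The procedure, not the typist's perfection, is what catches the slip.
def confirm_copy(prompt):
    """Ask for the same value twice; accept it only when the two copies match."""
    while True:
        first = input(prompt + ": ").strip()
        second = input(prompt + " (re-enter to confirm): ").strip()
        if first == second:
            return first
        print("The two entries differ; please try again.")

if __name__ == "__main__":
    phone = confirm_copy("Customer service phone number")
    print("Recorded:", phone)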

In Pattern 1 and Pattern 2 organizations, for instance, most of the hullaballoo in failure prevention is directed at imploring or threatening people not to make mistakes. This is equivalent to trying to build a kind of perpetual motion machine—which is impossible. Trying to do what you know is impossible is fatuousness, which we will discuss in a moment.

After a mistake happens, the meta-observation of the way people respond to it can also be highly significant. In Pattern 1 and Pattern 2 organizations, most of the response is devoted to establishing blame and then punishing the identified "culprit." This reaction has several disadvantages:

• It creates an environment in which people hide mistakes, rather than airing them out.

• It wastes energy searching for culprits, energy that could be put to better use.

• It distracts attention from management's responsibility for procedures that catch failures early and prevent dire consequences.

The third point, of course, is the reason many managers favor this way of dealing with failure. As the Chinese sage said,

When you point a finger at somebody, notice where the other three fingers are pointing.

Folly

Frailty is failing to do what you intended to do. Folly is doing what you intended, but intending the wrong thing. People not only make mistakes, they also do dumb things. For example, it's not a mistake to hard-code numerical billing constants into a program, as was done in the public utility billing cases. The programs may indeed work perfectly. It's not a mistake, but it is ignorant, because it invites mistakes later on.
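To make the folly concrete, here is a minimal sketch of my own (the rate, the file name, and the function names are all invented for illustration) contrasting a hard-coded billing constant with one read from a configuration file that can be reviewed and changed without touching the program:

import json

# Folly: the rate is welded into the code, so every rate change means
# editing, retesting, and re-releasing the program itself.
def monthly_charge_hardcoded(kwh_used):
    return round(kwh_used * 0.0831, 2)   # invented rate, for illustration only

# Less foolish: the same rate lives in a small configuration file,
# e.g. billing_rates.json containing {"kwh_rate": 0.0831}, which can be
# reviewed and updated on its own schedule.
def monthly_charge(kwh_used, config_path="billing_rates.json"):
    with open(config_path) as f:
        rates = json.load(f)
    return round(kwh_used * rates["kwh_rate"], 2)

Neither version protects against entering the wrong rate, of course; the second merely keeps a routine business change from requiring a change to the code.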

Folly is based on ignorance, not stupidity. Folly is correctable, whereas frailty is not. For instance, it is folly to pretend not to be frail, that is, to pretend to be perfect. Either theoretical physics or experience in the world can teach you that nobody is perfect.

In the same way, program design courses can teach you not to hard-code numerical constants. Or you can learn the practice as an apprentice to a mentor, or by participating in code reviews where you can observe good coding practices. But it's management's job to establish and support training, mentoring, and technical review programs. If these aren't in place, or aren't effective, then you have a significant meta-observation about the management of failure.

Fatuousness
It is worse than folly to manage a foolish person and not provide the training or experience needed to eradicate the foolishness. We call such behavior "fatuousness." ("Utter stupidity" would be better, but it doesn't start with F.) Fatuousness is utter stupidity, or being incapable of learning. Fatuous people—a group that occasionally includes each of us—actively do stupid things and continue to do them, time after time. For example,

Ralston, a programmer, figures out how to bypass the configuration control system and zaps the "platinum" version of an about-to-be-released system. The zap corrects the situation he was working on, but results in a side effect that costs the company several hundred thousand dollars.

The loophole in configuration control is fixed, but on the next release, Ralston figures out a new way to beat it. He zaps the platinum code again, producing another six-figure side effect.

Once again, the new loophole is fixed. Then, on the third release, Ralston beats it again, although this time the cost is only $45,000.

The moral of this story is clear. The fatuous person will work very hard to beat any "idiot-proof" system. Indeed, there is no such thing as an "idiot-proof" system, because some of the idiots out there are unbelievably intelligent.

There's no protection against fatuous people in a software engineering organization except to move them into some other profession. What significance do you attach to this typical situation?

Suppose you were Ralston's manager's manager. Hunt, his immediate manager, complains to you, "This wouldn't have happened if Ralston hadn't covertly bypassed our configuration control system. I don't know what to do about Ralston. He goes out of his way to do the wrong thing, beating all our systems of protection. And he's done this three times before, at least."

What was the significant part of this story? Ralston, of course, has to be moved out, but that's only the second most important part. Hunt—who has identified a fatuous employee and hasn't done anything about it—is doubly fatuous. Hunt needs to be recycled out of management into some profession where his utter stupidity doesn't carry such risk. If you delay in removing Hunt until he's done this with three employees, what does that make you?

Fun

Ralston's story also brings up the subject of fun. Some readers will rise to the defense of poor Ralston, saying, "He was only trying to have a little fun by beating the configuration control system." Well, I'm certainly not against fun, and if Ralston wants to have fun, he's entitled to it. But the question Ralston's manager has to ask is, "What business are we in?" If you're in the business of entertaining your employees at the cost of millions, then Ralston should stay. Otherwise, he'll have to have his fun hacking somewhere else.

In the actual situation, Ralston wasn't trying to have fun—at least that wasn't his primary motivation. He was, in fact, trying to be helpful by putting in a last-minute "fix." Well-intentioned but fatuous people like Ralston are not as dangerous as people who are just trying to have a good time. Hunt could have predicted that Ralston would try to be helpful, but

Nobody can predict what somebody else will consider "fun."

Here are some items from my collection of "fun" things that people have done, each of which has resulted in costs greater than their annual salary:

• created a subroutine that flashed all the lights on the mainframe console for 20 seconds, then shut down the entire operating system.

• created a virus that displayed a screen with Alfred E. Neuman saying "What, me worry?" in every program that was infected.

• altered the pointing finger in a Macintosh application to point with the second finger, rather than the index finger. The testers didn't notice this obscene gesture, but thousands of customers did.

• diddled the print spooler so that in December, "Merry Christmas" was printed across a few tens of thousands of customer bills, as well as all other reports. The sentiment was nice, but it happened to obliterate the amount due, so customers had to call individually to find out how much to pay.

The list is endless and unpredictable, which is why fun is the most dangerous of all sources of failure. There are only two preventives: open, visible systems and work that is sufficient fun in and of itself. That's why fun is primarily a problem of Pattern 2 organizations, which seldom meet either of those conditions.

Fraud

Although fun costs more, software engineering managers are far more afraid of fraud. Fraud occurs when someone illegally extracts personal gain from a system. Although I don't mean to minimize fraud as a source of failure, it's an easier problem to solve than either fun or fatuousness. That's because it's clear what kind of thing people are after. There are an infinite number of ways to have fun with a system, but only a few things worth stealing.

I suggest that any software engineering manager be well read on the subject of information systems fraud, and take all reasonable precautions to prevent it. The subject has been well covered in other places, so I will not cover it further.

I will confess, however, to a little fraud of my own. I have often used the (very real but minimal) threat of fraud to motivate managers to introduce systematic technical reviews. I generally do this after failing to motivate them using the (very real and significant) threat of failure, folly, fatuousness, or fun.

Fanaticism

Very infrequently, people try to destroy or disrupt a system, but not for direct gain. Sometimes they are seeking revenge against the company, the industry, or the country for real or imagined wrongs done to them. Fanaticism like this is very hard to stop if the fanatic is determined, especially because, as with "fun," you never know what someone will consider an offense that requires revenge.

Fanaticism, like fraud, is a way of getting the attention of management. With reasonable precautions, however, the threat of such terrorism can be reduced far below that of frailty. Frailty just lacks the drama. In any case, many of the actions that protect you against frailty will also reduce the impact of terrorism. Besides, I cannot offer you any useful advice on how to observe potential terrorists in your organization. That would be "profiling."

Failure (of Hardware)

When the hardware in a system doesn't do what it's designed to do, failures may result. To a great extent, these can be overcome by software, but that is beyond the scope of this book. Fifty years ago, when programmers complained about hardware failures, they had a 50/50 chance of being right. Not today. So, if you hear people blaming hardware failures for their problems, that is significant information. What it signifies may be any of the following, for starters:

1. There really aren't significant hardware failures, but your programmers need an alibi. Where there's an alibi, start looking for what it's trying to conceal.

2. There really are hardware failures, but they are within the normally expected range. Your programmers, however, may not be taking the proper precautions, such as backing up their source code and test scripts.

3. There really are hardware failures, and you are not doing a good job managing your relationship with your hardware supplier.

4. Failures attributed to hardware may actually be caused by human error—unexpected actions on the part of the user. These are really system failures.

Fate

This is what most bad managers think is happening to them. It isn't. When you hear a manager talking about "bad luck," substitute the word "manager" for "luck." As they say in the Army,


There are no bad soldiers, only bad officers.

What's Next?

This three-part essay is now finished, but the topic is far from complete. If you want more, note that the essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events. 

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).

I'm sure you can figure out what to do next. Good luck!