
Sunday, October 29, 2017

My most challenging experience as a software developer

Here is my detailed answer to the question, "What is the most challenging experience you encountered as a software developer?":

We were developing the tracking system for Project Mercury, to put a person in space and bring them back alive. The “back alive” was the challenging part, but not the only one. Some other challenges were as follows:

- The system was based on a world-wide network of fairly unreliable teletype connections. 

- We had to determine the touchdown in the Pacific to within a small radius, which meant we needed accurate and perfectly synchronized clocks on the computer and space capsule.

- We also needed to know exactly where our tracking stations were, but it turned out nobody knew where Australia's two stations were with sufficient precision. We had to create an entire sub-project to locate Australia.

- We needed information on the launch rocket, but because it was also a military rocket, that information was classified. We eventually found a way to work around that.

- Our computers were a pair of IBM 7090s, plus a 709 at a critical station in Bermuda. In those days, the computers were not built for on-line real-time work. For instance, there was no standard interrupt clock. We actually built our own for the Bermuda machine.

- Also, there were no disk drives yet, so everything had to be based on a tape drive system, but the tape drives were not sufficiently reliable for our specs. We beat this problem by building software error-correcting codes into the tape drive system.
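
(For curious readers: I won't reconstruct our actual code here, but a minimal sketch of the general idea, a Hamming-style single-error-correcting code, shows how software alone can repair a single flipped bit per block, the kind of protection an unreliable tape drive needs. The Python below is purely illustrative, with names of my own choosing; it is not what we ran on the 7090s.)

    # Illustrative sketch only, not the Mercury code: a Hamming(7,4)
    # encoder/decoder that corrects any single flipped bit per 7-bit block.

    def hamming74_encode(data):
        """Encode 4 data bits (a list of 0/1) into 7 bits with 3 parity bits."""
        d1, d2, d3, d4 = data
        p1 = d1 ^ d2 ^ d4              # covers bit positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4              # covers bit positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4              # covers bit positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(block):
        """Correct a single-bit error, if any, then return the 4 data bits."""
        b = list(block)
        s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
        s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
        s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
        syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit; 0 means clean
        if syndrome:
            b[syndrome - 1] ^= 1          # flip the corrupted bit back
        return [b[2], b[4], b[5], b[6]]

    word = [1, 0, 1, 1]
    block = hamming74_encode(word)
    block[5] ^= 1                          # simulate one bit dropped by the tape
    assert hamming74_decode(block) == word # the software recovers the data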

We worked our way through all these problems and many more smaller ones, but the most challenging problem was the “back alive” requirement. Once we had the hardware and network reliability up to snuff, we still had the problem of software errors. To counter this problem, we created a special test group, something that had never been done before. Then we set a standard that any error detected by the test group and not explicitly corrected would stop any launch.

Our tests revealed that the system could crash for unknown reasons at random times, so it would be unable to bring down the astronaut safely at a known location. When the crash occurred in testing, the two on-line printers simultaneously printed a 120-character line of random garbage. The line was identical on the two printers, indicating that this was not some kind of machine error on one of the 7090s. It could have been a hardware design error or a coding error. We had to investigate both possibilities, but the second possibility was far more likely.

We struggled to track down the source of the crash, but after a fruitless month, the project manager wanted to drop it as a “random event.” We all knew it wasn’t random, but he didn’t want to be accused of delaying the first launch.

To us, however, it was endangering the life of the astronaut, so we pleaded for time to continue trying to pinpoint the fault. “We should think more about this,” we said, to which he replied (standing under an IBM THINK sign), “Thinking is a luxury we can no longer afford.”

We believed (and still believe) that thinking is not a luxury for software developers, so we went underground. After much hard work, Marilyn pinpointed the fault and we corrected it just before the first launch. We may have saved an astronaut’s life, but we’ll never get any credit for it.

Moral: We may think that hardware and software errors are challenging, but nothing matches the difficulty of confronting human errors—especially when those humans are managers willing to hide errors in order to make schedules.



Sunday, July 09, 2017

Survey: The Worst Error Message


I participated in a recent survey. The question was:

What's the Worst Error Message You've Ever Seen?

This was my contribution, which received thousands of up-votes:

A graduate student brought me an error message that said:

THIS IS AN IMPOSSIBLE ERROR. THIS ERROR CANNOT OCCUR.

IF THIS ERROR OCCURS, SEE YOUR IBM REPRESENTATIVE.

I told the student I wasn’t the right person to see, but he should see his IBM representative.

He said, “But I’m my IBM representative.”

—————————-

I'd like to collect some of these lousy error messages for my book, ERRORS.

If you've got a good (i.e. bad) one, please share it with us in a comment to this post.


Thanks for contributing.

Sunday, March 12, 2017

Classifying Errors

I received an email the other day from Giorgio Valoti in Italy, but when I wrote a response, it bounced back with "recipient unknown." It may have been a transient error, but it made me think that others besides Giorgio might be interested in discussing the issue of classifying errors, so I'll put my answer here and hope Giorgio will see it.

Here's the letter: 

Dear Mr. Weinberg,
My problem is that I’m looking for a good way — maybe a standard, more likely a set of guidelines — to classify and put some kind of label on software defects.

Is there such a thing? Does it even make sense trying to classify software defects?

And here's my reply:

Hello, Giorgio

It can certainly make sense to classify errors/defects, but there are many ways to classify, depending on what you're trying to accomplish. So, that's where you start, by answering "What's my purpose in classifying?"

For instance, here are a few ways my clients have classified errors, and for what purposes:


- cost to fix: to estimate future costs

- costs to customers: to estimate impact on product sales, or product market penetration

- place of origin in the development cycle: to decide where to concentrate quality efforts

- type of activity that led to the error: to improve the training of developers

- type of activity that led to detecting the error: to improve the training of testers

- number of places that had to be fixed to correct the error: to estimate the quality of the design

- and so on and on

I hope this helps ... and thanks for asking.

--------------end of letter-----------

As the letter says, there are numerous ways to classify errors, so I think my readers would love to read about some other ways from other readers. Care to comment?
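
To make the "purpose first" advice concrete: if each classification axis becomes a field on the defect record, every purpose above reduces to a simple tally. Here is a minimal sketch in Python; every field name is my own invention for illustration, not any standard.

    # Minimal sketch: one defect record carrying several classification
    # axes at once. All names here are illustrative, not a standard.
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class DefectRecord:
        defect_id: str
        cost_to_fix_hours: float   # for estimating future costs
        origin_stage: str          # e.g. "requirements", "design", "coding"
        detecting_activity: str    # e.g. "review", "unit test", "customer report"
        places_fixed: int          # spots changed to fix it: a design-quality hint

    defects = [
        DefectRecord("D-101", 2.0, "requirements", "customer report", 5),
        DefectRecord("D-102", 0.5, "coding", "unit test", 1),
        DefectRecord("D-103", 8.0, "design", "review", 3),
    ]

    # "Where should we concentrate quality efforts?" becomes a one-line tally:
    print(Counter(d.origin_stage for d in defects))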

Wednesday, January 11, 2017

Foreword and Introduction to ERRORS book

Foreword


Ever since this book came out, people have been asking me how I came to write on such an unusual topic. I've pondered their question and decided to add this foreword as an answer.

As far as I can remember, I've always been interested in errors. I was a smart kid, but didn't understand why I made mistakes. And why other people made more.

I yearned to understand how the brain, my brain, worked, so I studied everything I could find about brains. And then I heard about computers.

Way back then, computers were called "Giant Brains." Edmund Berkeley wrote a book by that title, which I read voraciously.

Those giant brains were "machines that think" and "didn't make errors." Neither turned out to be true, but back then, I believed them. I knew right away, deep down—at age eleven—that I would spend my life with computers.

Much later, I learned that computers didn't make many errors, but their programs sure did.

I realized when I worked on this book that it more or less summarizes my life's work, trying to understand all about errors. That's where it all started.

I think I was upset when I finally figured out that I wasn't going to find a way to perfectly eliminate all errors, but I got over it. How? I think it was my training in physics, where I learned that perfection simply violates the laws of thermodynamics.

Then I was upset when I realized that when a computer program had a fault, the machine could turn out errors millions of times faster than any human or group of humans.

I could actually program a machine to make more errors in a day than all human beings had made in the last 10,000 years. Not many people seemed to understand the consequences of this fact, so I decided to write this book as my contribution to a more perfect world.

Not perfect, of course, but more perfect. I hope it helps.


Introduction


For more than a half-century, I’ve written about errors: what they are, their importance, how we think about them, our attempts to prevent them, and how we deal with them when those attempts fail. People tell me how helpful some of these writings have been, so I felt it would be useful to make them more widely known. Unfortunately, the half-century has left them scattered among several dozen books, so I decided to consolidate some of the more helpful ones in this book.

I’m going to start, though, where it all started, with my first book, where Herb Leeds and I made our first public mention of error. Back in those days, Herb and I both worked for IBM. As employees, we were not allowed to write about computers making mistakes, but we knew how important the subject was. So we wrote our book and didn’t ask IBM’s permission.

Computer errors are far more important today than they were back in 1960, but many of the issues haven’t changed. That’s why I’m introducing this book with some historical perspective: reprinting some of that old text about errors along with some notes with the perspective of more than half a century.

1960’s Forbidden Mention of Errors
From: Chapter 10, PROGRAM TESTING
Leeds and Weinberg, Computer Programming Fundamentals
When we approach the subject of program testing, we might almost conclude the whole subject immediately with the anecdote about the mathematics professor who, when asked to look at a student’s problem, replied, “If you haven’t made any mistakes, you have the right answer.” He was, of course, being only slightly facetious. We have already stressed this philosophy in programming, where the major problem is knowing when a program is “right.”

In order to be sure that a program is right, a simple and systematic approach is undoubtedly best. However, no approach can assure correctness without adequate testing for verification. We smile when we read the professor’s reply because we know that human beings seldom know immediately when they have made errors—although we know they will at some time make them. The programmer must not have the view that, because he cannot think of any error, there must not be one. On the contrary, extreme skepticism is the only proper attitude. Obviously, if we can recognize an error, it ceases to be an error.

If we had to rely on our own judgment as to the correctness of our programs, we would be in a difficult position. Fortunately the computer usually provides the proof of the pudding. It is such a proper combination of programmer and computer that will ultimately determine the means of judging the program. We hope to provide some insight into the proper mixture of these ingredients. An immediate problem that we must cope with is the somewhat disheartening fact that, even after carefully eliminating clerical errors, experienced programmers will still make an average of approximately one error for every thirty instructions written.

We make errors quite regularly
This statement is still true after half a century—unless it’s actually worse nowadays. (I have some data from Capers Jones suggesting one error in fewer than ten instructions may be typical for very large, complex projects.) It will probably be true after ten centuries, unless by then we’ve made substantial modifications to the human brain. It’s a characteristic of humans that would have been true a hundred centuries ago—if we’d had computers then.

1960’s Cost of errors
These errors range from minor misunderstandings of instructions to major errors of logic or problem interpretation. Strangely enough, the trivial errors often lead to spectacular results, while the major errors initially are usually the most difficult to detect.

“Trivial” errors can have great consequences
We knew about large errors way back then, but I suspect we didn’t imagine just how much errors could cost. For examples of some billion dollar errors along with explanations, read the chapter “Some Very Expensive Software Errors.”

Back to 1960 again
Of course, it is possible to write a program without errors, but this fact does not obviate the need for testing. Whether or not a program is working is a matter not to be decided by intuition. Quite often it is obvious when a program is not working. However, situations have occurred where a program which has been apparently successful for years has been exposed as erroneous in some part of its operation.

Errors can escape detection for years
With the wisdom of time, we now have quite specific examples of errors lurking in the background for thirty years or more. For example, read the chapter on “predicting the number of errors.”

How was it tested in 1960
Consequently, when we use a program, we want to know how it was tested in order to give us confidence in—or warning about—its applicability. Woe unto the programmer with “beginner’s luck” whose first program happens to have no errors. If he takes success in the wrong way, many rude shocks may be needed to jar his unfounded confidence into the shape of proper skepticism.

Many people are discouraged by what to them seems the inordinate amount of effort spent on program testing. They rightly indicate that a human being can often be trained to do a job much more easily than a computer can be programmed to do it. The rebuttal to this observation may be one or more of the following statements:
  1. All problems are not suitable for computers. (We must never forget this one.)
  2. The computer, once properly programmed, will give a higher level of performance, if, indeed, the problem is suited to a computer approach.
  3. All the human errors are removed from the system in advance, instead of being distributed throughout the work like bits of shell in a nutcake. In such instances, unfortunately, the human errors will not necessarily repeat in identical manner. Thus, anticipating and catching such errors may be exceedingly difficult. Often in these cases the tendency is to overcompensate for such errors, resulting in expense and time loss.
  4. The computer is often doing a different job than the man is doing, for there is a tendency—usually a good one—to enlarge the scope of a problem at the same time it is first programmed for a computer. People are often tempted to “compare apples with houses” in this case.
  5. The computer is probably a more steadfast employee, whereas human beings tend to move on to other responsibilities and must be replaced by other human beings who must, in turn, be trained.
In other words, if a job is worth doing, it is worth doing right.

Sometimes the error is creating a program at all.
Unfortunately, the cost of developing, supporting, and maintaining a program frequently exceeds the value it produces. In any case, no amount of fixing small program errors can eliminate the big error of writing the program in the first place. For examples and explanations, read the chapter on “it shouldn’t even be done.”

The full process, 1960
If a job is a computer job, it should be handled as such without hesitation. Of course, we are obligated to include the cost of programming and testing in any justification of a new computer application. Furthermore we must not be tempted to cut costs at the end by skimping on the testing effort. An incorrect program is indeed worth less than no program at all because the false conclusions it may inspire can lead to many expensive errors.

We must not confuse cost and value.
Even after all this time, some managers still believe they can get away with skimping on the testing effort. For examples and explanations, read the section on “What Do Errors Cost?”

Coding is not the end, even in 1960
A greater danger than false economy is ennui. Sometimes a programmer, upon finishing the coding phase of a problem, feels that all the interesting work is done. He yearns to move on to the next problem.

Programs can become erroneous without changing a bit.
You may have noticed the consistent use of “he” and “his” in this quoted passage from an ancient book. These days, this would be identified as “sexist writing,” but it wasn’t called “sexist” way back then. This is an example of how something that wasn’t an error in the past becomes an error with changing culture, changing language, changing hardware, or perhaps new laws. We don’t have to do anything to make an error, but we have to do a whole lot not to make an error.

We keep learning, but is it enough?
Thus as soon as the program looks correct—or, rather, does not look incorrect—he convinces himself it is finished and abandons it. Programmers at this time are much more fickle than young lovers.

Such actions are, of course, foolish. In the first place, we cannot so easily abandon our programs and relieve ourselves of further obligation to them. It is very possible under such circumstances that in the middle of a new problem we shall be called upon to finish our previous shoddy work—which will then seem even more dry and dull, as well as being much less familiar. Such unfamiliarity is no small problem. Much grief can occur before the programmer regains the level of thought activity he achieved in originally writing the program. We have emphasized flow diagramming and its most important assistance to understanding a program, but no flow diagram guarantees easy reading of a program. The proper flow diagram does guarantee the correct logical guide through the program and a shorter path to correct understanding.

It is amazing how one goes about developing a coding structure. Often the programmer will review his coding with astonishment. He will ask incredulously, “How was it possible for me to construct this coding logic? I never could have developed this logic initially.” This statement is well-founded. It is a rare case where the programmer can immediately develop the final logical construction. Normally programming is a series of attempts, of two steps forward and one step backward. As experience is gained in understanding the problem and applying techniques—as the programmer becomes more immersed in the program’s intricacies—his logic improves. We could almost relate this logical building to a pyramid. In testing out the problem we must climb the same pyramid as in coding. In this case, however, we must take care to root out all misconstructed blocks, being careful not to lose our footing on the slippery sides. Thus, if we are really bored with a problem, the smartest approach is to finish it as correctly as possible so we shall never see it again.

In the second place, the testing of a program, properly approached, is by far the most intriguing part of programming. Truly the mettle of the programmer is tested along with the program. No puzzle addict could experience the miraculous intricacies and subtleties of the trail left by a program gone wrong. In the past, these interesting aspects of program testing have been dampened by the difficulty in rigorously extracting just the information wanted about the performance of a program. Now, however, sophisticated systems are available to relieve the programmer of much of this burden.

Testing for errors grows more difficult every year.
The previous sentence was an optimistic statement a half-century ago, but not because it was wrong. Over all these years, hundreds of tools have been built attempting to simplify the testing burden. Some of them have actually succeeded. At the same time, however, we’ve never satisfied our hunger for more sophisticated applications. So, though our testing tools have improved, our testing tasks have outpaced them. For examples and explanations, read about “preventing testing from growing more difficult.” 

If you're as interested in errors as I am, you can obtain a copy of Errors here:

ERRORS, bugs, boo-boos, blunders

Wednesday, July 27, 2016

Common Mistakes in Managing Testing


In my book, Perfect Software and Other Illusions About Testing, I present a list of a dozen common mistakes made when managing testing.

Here are the first six items on the list.
    1. Not honoring testers: If you aren’t going to believe what the testers uncover, either you have the wrong people or you need to get serious about helping them build their credibility. Testing is never just a matter of hiring and hoping.

    2. Over-honoring testers: If you let them make your decisions for you, then you should step down and let them earn the manager’s salary.

    3. Scapegoating testers: Even worse than over-honoring testers is pretending to honor them—manipulating them to give you the answers you want, appearing to act on that “independent” assessment, then blaming them later if that assessment turns out to be damagingly wrong (or taking credit if it turns out to be right).

    4. Not using the information gleaned from testing or other sources: If you’re going to ignore information or go ahead with predetermined plans despite what the tests turn up, don’t bother testing. (Actually, it can’t really be considered information if you don’t use it.)

    5. Making decisions that are emotional, not rational: Not that you can be perfectly rational, but you can at least try to be calm and in control of your emotions when you make decisions.

    6. Not evaluating the quality of test data: Numbers are just numbers. Learn to ask, “What was the process used to obtain this number?” “What does this number mean?”


The remaining six mistakes can be found in the second chapter of the book.

Or, you can purchase the book as part of the entire eight-book set of The Tester's Library at a bargain price.



Thursday, June 30, 2016

The Most Important Aspect of Software Testing

On a public forum, the question was, "What are the most important aspects of software testing?"

A number of good answers were given, but all tended to emphasize finding errors. To me, that's a limited view of testing. I would say, instead, that the most important aspect of software testing is to provide information about the state of a software product. That is, not just errors, but what's good and bad about the product. And, by the way, what's just mediocre.

For instance, the speed of the product might not be in error, but could be good, bad, mediocre, or perhaps acceptable to some or all possible users. It's the tester's job to document that.

Or, there may be a feature that works okay, but is not so easy to use. Is that an error? Who knows, but in any case, it's a fact that the sponsor of the project will usually want to know about, and providing that information is the tester's job.

One more example: if a tester merely finds and reports errors, small wonder that developers feel that testers are nattering nabobs of negativity. Test reports should describe what's good—what's wonderful—about a product, and testers must never forget that part of their job.


You can read more about important aspects of testing in my book, Perfect Software And Other Illusions About Testing.

Sunday, May 22, 2016

How can I be a faster programmer?

Recently, "Tommy" posted the following question on Quora, a question-answer site with lots of good stuff (and some really cheesy stuff) on a variety of topics:

How can I be a faster programmer?

Tommy received a goodly portion of excellent answers, with lots of details of tips and techniques—great stuff. Reading them over, I saw that a few of the answers offered some good high-level advice, too, but that advice might be lost in all the details, so I decided to offer a summary high-level view of what I've learned about speedy programming in my 60+ years in the computer business.

First of all, read all the great answers here on Quora and incorporate these ideas, one at a time, into your programming practices. In other words, to become faster or better in some other way, KEEP LEARNING. That’s the number one way to become faster, if you want to be fast.

Second, ask yourself, with each job, “Why is programming speed so important for this job?” Often, it’s an unskilled manager who knows only a dangerous amount about programming, but can always say “work faster.” So ask yourself what an appropriate speed is for this job, given that speed and error are tightly related.

Third, ask yourself, “Do I really need a program for this problem?” If you don’t have to write a program at all, that gives you infinite speed. Sometimes you don’t have to write a program because one already exists that will do an adequate job. Sometimes you don’t have to write a program because a back-of-the-envelope estimate will give all the answers needed. Sometimes you don’t have to write a program because there’s no real need for the function you’ve been asked to implement.

Fourth, don’t skimp on up front work because you’re rushing to produce code. If you’re working with the wrong requirements, any nifty code you write will be worthless, so any time you spend on it will be wasted. If your design is poor, you’ll write lots of wasted code, which means more wasted time.

Fifth, when you finally have a good design for the right requirements, slow down your coding in order to go faster. We say, “Code in haste; debug at leisure.” Take whatever time you need to think clearly, to review with your colleagues, to find solid code you can reuse, to test as you build incrementally so you don’t have to backtrack.

Sixth, develop in a team, and develop your team. If you think teammates slow you down, then it’s because you haven’t put the effort into building your team. So, in fact, we’re back to the first point, but now at the team level. Keep learning as a team. Your first job is not to build code, but to build the team that builds code.

Many of these principles are developed in much more detail in my many books on software development, so if you want to learn more, take a look at my Leanpub.com author page and make some choices among about fifty books and a bundle of bundles of books. Good reading!

Wednesday, March 04, 2015

The Art of Bugging: or Where Do Bugs Come From


Author's Note: When I first published this list in 1981, it was widely distributed by fans. Then for a while, it disappeared. Every year, I received a request or three for a copy, but by 2000, I'd lost the original. This month, when we were closing up our cabin in the woods, it turned up in the back of a file cabinet. Inasmuch as I've been writing recently about testing (as in Perfect Software), and as it seems still up-to-date and relevant, I decided to republish it here for the grandchildren of the original readers.

Not so long ago, it wasn't considered in good taste to speak of errors in computer systems, but fashions change. Today articles and books on software errors are outnumbered only by those on sex, cooking, and astrology. But fashion still rules. Everybody talks about debugging—how to remove errors—but it's still of questionable taste to speak of how the bugs get there in the first place. In many ways, the word "debugging" has injured our profession.

"Bug" sounds like something that crawled in under the woodwork or flew through a momentary opening of the screen door. Misled by this terminology, people have shied away from the concept of software errors as things people do, and which, therefore, people might learn not to do.

In this column, I'd like to shift the focus from debugging—the removal of bugs—to various forms of putting them in the software to begin with. For ease of reference, I'll simply list the various types of bugging alphabetically.

Be-bugging is the intentional putting of bugs in software for purposes of measuring the effectiveness of debugging efforts. By counting the percentage of intentional bugs that were found in testing, we get an estimate of the number of unintentional bugs that might still remain. Hopefully, we remember to remove all the be-bugs when we're finished with our measurements.
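
(The arithmetic behind be-bugging deserves a line or two. If testing caught a known fraction of the seeded bugs, assume it caught roughly the same fraction of the real ones. A small sketch in Python, with invented numbers:)

    # The classic bug-seeding estimate, with invented numbers.
    # If testing found 20 of 25 seeded bugs (an 80% detection rate),
    # assume it also found about 80% of the real bugs.

    def estimate_remaining(seeded_total, seeded_found, real_found):
        detection_rate = seeded_found / seeded_total   # 20 / 25 = 0.8
        estimated_real_total = real_found / detection_rate
        return estimated_real_total - real_found

    print(estimate_remaining(seeded_total=25, seeded_found=20, real_found=40))
    # 10.0 real bugs estimated still lurking; and remember to remove the 25!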

Fee-bugging is the insertion of bugs by experts you've paid high fees to work on your system. Contract programmers are very skilled at fee-bugging, especially if they're treated like day laborers on a worm farm.

Gee-bugging is the grafting of bugs into a program as part of a piece of "gee-whiz" coding—fancy frills that are there to impress the maintenance programmers rather than meet the specifications.

Knee-bugging is thrusting bugs into a program when you're in too much of a hurry to bend your knee and sit down to think about what you're doing. We have a motto—"Never debug standing up." We could well extend that motto to bugging: "Never bug standing up."

Me-bugging may involve numerous forms of bug installation, for it refers to the method by which the bugs are protected from assault. The programmer regards any attempt to question the correctness of the code as an assault on his or her value as a human being. Those who practice "egoless programming" are seldom guilty of me-bugging.

Pea-bugging can be understood by reference to the story of The Princess and the Pea. The princess showed her true royalty by being able to feel a pea through a hundred mattresses, illustrating that no bug is too small to be noticed by computers or other royalty. The pea-bugger, however, pops in little bugs with the thought, "Oh, nobody will ever notice this tiny glitch."

Pee-bugging is the rushing in of bugs when the programmer is impatient to attend to other matters, not necessarily the call of nature. The pee-bugger is the one who is always heard to contribute: "Let's just get this part coded up so we can get on to more important matters."

Plea-bugging is the legalistic method of forcing bugs into software. The plea-bugger can argue anyone out of any negative feeling for a piece of code, and is especially dangerous when walking through code, leading the committee around by its collective nose.

Pre-bugging is the art of insinuating bugs into programs before any code is written, as in the specification or design stages. We often hear the pre-bugger saying, "Oh, that's clear enough for the coders. Any moron could understand what we mean."

Re-bugging is the practice of re-introducing bugs that were already removed, or failing to remove bugs that were found and supposedly removed. Re-bugging is especially prevalent in on-line work, and some on-line programmers have been known to collapse from malnutrition when caught in an endless loop of debugging and rebugging a related pair of bugs.

Sea-bugging is named for the state of mind in which it is done—"all at sea." The typical sea-bug splashes in when everyone is making waves and shouting, "We've got to do something!" That something is invariably a bug—unless it's two bugs.

See-bugging is the implantation of bugs that "everyone can see" are correct. See-bugging is done in a state of mass hypnosis, often when too many susceptible minds are pooled on the same team or project. This unhealthy state prevails when management values harmony over quality, thus eliminating anyone who might make waves. Of course, too many wave makers leads to sea-bugging, so programming teams have to be constituted as a compromise between harmony and healthy discord.

S/he-bugging is done all the time, though nobody likes to talk about it. S/he-bugs have a way of infusing your code when your mind is on sex, or similar topics. Because sex is a topic unique unto itself, all s/he bugs originate by what might be referred to as sexual reproduction. That's why this is such a populous class of bugs.

Tea-bugging is the introduction of bugs when problems are solved during tea and coffee break conversations and then not checked before passing directly into the code.

The-bugging (pronounced thee-bugging) is crowding multiple bugs into a program under the "one-program-one-bug" fallacy. The addicted the-bugger can invariably be heard ejaculating: "I've just found the bug in my program."

We-bugging is the ordination of bugs by majority rule. When someone doubts the efficacy of a piece of code, as in a review, the majority votes down this disturbing element, for how can three out of five programmers be wrong? Thus are born we-bugs.

Whee-bugging is a symptom of boredom, and frequently arises when Agile programming and other bug-prevention schemes have become well-established. Programmers reminisce about "the good old days," when programmers proved their machismo by writing code in which nobody could find bugs. One thing leads to another, and eventually someone says, "Let's do something exciting in this piece of code—wheeeee!"

Ye-bugging is a mysterious process by which bugs appear in code touched by too many hands. Nobody knows much about ye-bugging because every time you ask how some bug got into the code, every programmer claims somebody else did it.

Z-bugging is the programmer's abbreviation for zap-bugging. The zap-bugger is in such a hurry that it's unthinkable to use three letters when one will do. Similarly, it's unthinkable to submit to any time-wasting controls on changes to working code, so the Z-bugger rams bugs into the operating system with a special program called ZAP, or SUPERZAP. Although Z-bugs are comparatively rare, they are frequently seen because they always affect large numbers of people. There's a saying among programmers, perhaps sarcastic, "All the world loves a Z-bugger."

So there they are, in all their glory, nineteen ways that people put errors in programs. There are others. I've heard rumors of tree-bugs, spree-bugs, agree-bugs, tee-hee-hee-bugs, and fiddle-dee-dee-bugs. I'm tracking down reports of the pernicious oh-promise-me-bug. Entomologists estimate that only one-tenth of all the world's insect species have been discovered and classified, so we can look forward to increasing this bugging list as our research continues. Too bad, for we seem to have enough already.

Note added in 2015: Yes, we had enough, but many have been added. If you know of a bug we didn't list here, I invite you to submit a comment about it.

Monday, February 23, 2015

Errors Versus Quality


Are Errors a Moral Issue?


NOTE: This essay is the second of a series of blogs adapted from my Quality Software, Volume 2, Why Software Gets in Trouble: Fault Patterns

    

Errors in software used to be a moral issue for me, and still are for many writers. Perhaps that's why these writers have asserted that "quality is the absence of errors." 

It must be a moral issue for them, because otherwise it would be a grave error in reasoning. Here's how their reasoning might have gone wrong. Perhaps they observed that when their work is interrupted by numerous software errors, they can't appreciate any other good software qualities. From this observation, they can conclude that many errors will make software worthless—i.e., zero quality.

But here's the fallacy: though copious errors guarantee worthlessness, zero errors guarantees nothing at all about the value of software.

Let's take one example. Would you offer me $100 for a zero-defect program to compute the horoscope of Philip Amberly Warblemaxon, who died in 1927 after a 37-year career as a filing clerk in a hat factory in Akron? I doubt you would even offer me fifty cents, because to have value, software must be more than perfect. It must be useful to someone.

Still, I would never deny the importance of errors. First of all, if I did, people in Routine, Oblivious, and Variable organizations would stop reading this book. To them, chasing errors is as natural as chasing sheep is to a German Shepherd Dog. And, as we've seen, when they are shown the rather different life of a Steering organization, they simply don't believe it.

Second of all, I do know that when errors run away from us, that's one of the ways to lose quality. Perhaps our customers will tolerate 10,000 errors, but, as Tom DeMarco asked me, will they tolerate 10,000,000,000,000,000,000,000,000,000? In this sense, errors are a matter of quality. Therefore, we must train people to make fewer errors, while at the same time managing the errors they do make, to keep them from running away.

The Terminology of Error

I've sometimes found it hard to talk about the dynamics of error in software because there are many different ways of talking about errors themselves. One of the best ways for a consultant to assess the software engineering maturity of an organization is by the language they use, particularly the language they use to discuss error. To take an obvious example, those who call everything "bugs" are a long way from taking responsibility for controlling their own process. Until they start using precise and accurate language, there's little sense in teaching such people about basic dynamics.

Faults and failures. First of all, it pays to distinguish between failures (the symptoms) and faults (the diseases). Musa gives these definitions:

A failure "is the departure of the external results of program operation from requirements." A fault "is the defect in the program that, when executed under particular conditions, causes a failure." For example:

An accounting program had an incorrect instruction (fault) in the formatting routine that inserts commas in large numbers such as "$4,500,000". Any time a user prints a number greater than six digits, a comma may be missing (a failure). Many failures resulted from this one fault.
How many failures result from a single fault? That depends on
• the location of the fault
• how long the fault remains before it is removed
• how many people are using the software.
The comma-insertion fault led to millions of failures because it was in a frequently used piece of code, in software that had thousands of users, and it remained unresolved for more than a year.
When studying error reports at various clients, I often find that they mix failures and faults in the same statistics, because they don't understand the distinction. If these two different measures are mixed into one, it will be difficult for an organization to understand its own experience. For instance, because a single fault can lead to many failures, it would be impossible to compare failures between two organizations that aren't careful in making this "semantic" distinction.

Organization A has 100,000 customers who use their software product, each for an average of 3 hours a day. Organization B has a single customer who uses one software system once a month. Organization A produces 1 fault per thousand lines of code, and receives over 100 complaints a day. Organization B produces 100 faults per thousand lines of code, but receives only one complaint a month.
Organization A claims they are better software developers than Organization B. Organization B claims they are better software developers than Organization A. Perhaps they're both right.
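
To make the fault/failure distinction concrete in code, here is a reconstruction in Python of the kind of comma-insertion fault described above (not the actual accounting routine), along with the arithmetic of how one fault piles up failures; the traffic figures are invented.

    # One fault, many failures: a reconstruction of the kind of
    # comma-insertion bug described above, not the actual routine.

    def format_dollars(n):
        s = str(n)
        if len(s) > 6:
            # THE FAULT: only one comma is inserted, so every amount
            # over six digits comes out with a comma missing.
            return "$" + s[:-6] + s[-6:-3] + "," + s[-3:]
        if len(s) > 3:
            return "$" + s[:-3] + "," + s[-3:]
        return "$" + s

    print(format_dollars(450000))   # $450,000  (correct)
    print(format_dollars(4500000))  # $4500,000 (a failure: should be $4,500,000)

    # One fault in a much-used routine, left in place for a year:
    large_prints_per_day = 10_000   # invented traffic figure
    days_unfixed = 365
    print(large_prints_per_day * days_unfixed, "failures from a single fault")
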
The System Trouble Incident (STI). Because of the important distinction between faults and failures, I encourage my clients to keep at least two different statistics. The first of these is a database of "system trouble incidents," or STIs. In my books, I mean an STI to be an "incident report of one failure as experienced by a customer or simulated customer (such as a tester)."
I know of no industry standard nomenclature for these reports—except that they invariably take the form of TLAs (Three Letter Acronyms). The TLAs I have encountered include:
- STR, for "software trouble report"
- SIR, for "software incident report," or "system incident report"
- SPR, for "software problem report," or "software problem record"
- MDR, for "malfunction detection report"
- CPI, for "customer problem incident"
- SEC, for "significant error case"
- SIR, for "software issue report"
- DBR, for "detailed bug report," or "detailed bug record"
- SFD, for "system failure description"
- STD, for "software trouble description," or "software trouble detail"
I generally try to follow my client's naming conventions, but try hard to find out exactly what is meant. I encourage them to use unique, descriptive names. It tells me a lot about a software organization when they use more than one TLA for the same item. Workers in that organization are confused, just as my readers would be confused if I kept switching among ten TLAs for STIs. The reasons I prefer STI to some of the above are as follows:
1. It makes no prejudgment about the fault that led to the failure. For instance, it might have been a misreading of the manual, or a mistyping that wasn't noticed. Calling it a bug, an error, a failure, or a problem tends to mislead.
2. Calling it a "trouble incident" implies that once upon a time, somebody, somewhere, was sufficiently troubled by something that they happened to bother making a report. Since our definition of quality is "value to some person," someone being troubled implies that it's worth something to look at the STI.
3. The words "software" and "code" also contain a presumption of guilt, which may unnecessarily restrict location and correction activities. We might correct an STI with a code fix, but we might also change a manual, upgrade a training program, change our ads or our sales pitch, furnish a help message, change the design, or let it stand unchanged. The word "system" says to me that any part of the overall system may contain the fault, and any part (or parts) may receive the corrective activity.
4. The word "customer" excludes troubled people who don't happen to be customers, such as programmers, analysts, salespeople, managers, hardware engineers, or testers. We should be so happy to receive reports of troublesome incidents before they get to customers that we wouldn't want to discourage anybody.
Similar principles of semantic precision might guide your own design of TLAs, to remove one more source of error, or one more impediment to their removal. Pattern 3 organizations always use TLAs more precisely than do Pattern 1 and 2 organizations.
System Fault Analysis (SFA). The second statistic is a database of information on faults, which I call SFA, for System Fault Analysis. Few of my clients initially keep such a database separate from their STIs, so I haven't found such a diversity of TLAs. Ed Ely tells me, however, that he has seen the name RCA, for "Root Cause Analysis." Since RCA would never do, the name SFA is a helpful alternative because:
1. It clearly speaks about faults, not failures. This is an important distinction. No SFA is created until a fault has been identified. When an SFA is created, it is tied back to as many STIs as possible. The time lag between the earliest STI and the SFA that clears it up can be an important dynamic measure.
2. It clearly speaks about the system, so the database can contain fault reports for faults found anywhere in the system.
3. The word "analysis" correctly implies that data is the result of careful thought, and is not to be completed unless and until someone is quite sure of their reasoning.
"Fault" does not imply blame. One deficiency with the semantics of the term"fault" is the possible implication of blame, as opposed to information. In an SFA, we must be careful to distinguish two places associated with a fault, neither of these implies anything about whose "fault" it was:
a. origin: at what stage in our process the fault originated
b. correction: what part(s) of the system will be changed to remedy the fault
Routine, Oblivious and Variable organizations tend to equate these two notions, but the motto, "you broke it, you fix it," often leads to an unproductive "blame game." "Correction" tells us where it was wisest, under the circumstances, to make the changes, regardless of what put the fault there in the first place. For example, we might decide to change the documentation—not because the documentation was bad, but because the design is so poor it needs more documenting and the code is so tangled we don't dare try to fix it there.

If Steering organizations are not heavily into blaming, why would they want to record "origin" of a fault? To these organizations, "Origin" merely suggests where action might be taken to prevent a similar fault in the future, not which employee is to be taken out and crucified. Analyzing origins, however, requires skill and experience to determine the earliest possible prevention moment in our process. For instance, an error in the code might have been prevented if the requirements document were more clearly written. In that case, we should say that the "origin" was in the requirements stage.
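
For readers who want to see the two databases side by side, here is a minimal sketch in Python of what an STI record and an SFA record might carry, including the origin and correction places just described, and the time-lag measure mentioned earlier. All field names are mine; the text prescribes concepts, not a schema.

    # Minimal sketch: STIs (failure reports) and SFAs (analyzed faults)
    # kept as separate records, linked many-to-one. Field names are
    # illustrative only.
    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class STI:                 # one trouble incident: a failure someone reported
        sti_id: str
        reported: date
        description: str

    @dataclass
    class SFA:                 # one analyzed fault
        sfa_id: str
        created: date
        origin: str            # stage where the fault originated
        correction: str        # part(s) of the system actually changed
        resolves: list = field(default_factory=list)  # STIs this fault explains

    stis = [
        STI("STI-1", date(2014, 1, 10), "comma missing in $4500,000"),
        STI("STI-2", date(2014, 6, 2), "comma missing in $12500,000"),
    ]
    fault = SFA("SFA-1", date(2015, 2, 1),
                origin="coding", correction="formatting routine",
                resolves=stis)

    # The dynamic measure mentioned above: lag from earliest STI to its SFA.
    lag = fault.created - min(s.reported for s in fault.resolves)
    print(lag.days, "days from first report to analyzed fault")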


(to be continued)