Sunday, October 11, 2015

One Little Change

Note: From time to time, I will be adding material to my new book, Errors: Bugs, Boo-boos, Blunders. All purchasers of the book from Leanpub.com will receive all of the new material free of additional charge. The following chapter will be added soon, along with some feedback from my earliest readers.

Dani, my wife, is an anthropologist by profession but has now become a world-class dog trainer. <http://home.earthlink.net/~hardpretzel/DaniDogPage.html> The combination of the two produces some interesting ideas. For instance, she told me how attack dogs are trained in ways that keep them from being dangerous. As usual, the big problem with attack dogs is not the dogs, but the people.

When an untrained person hears that a dog is attack-trained, chances are about one in three that they'll turn to the dog and command, "Kill!" As a joke. Or just to see what the dog will do. To protect against this idiotic human behavior, trainers never use command words like "kill." Instead, they use innocent words, like "breathe," that would never be given in jest in a command voice.

This kind of protection is needed because a trained dog is an information-processing machine, in some ways very much like a computer. A single arbitrary command could mean anything to a dog, depending on how it was trained—or programmed. The arbitrariness does not matter much if it's not an attack dog. The owner may be embarrassed when Rover heels on the Stay command, but nothing much is lost. If, however, Rover is trained to go for the throat, it's an entirely different matter.

It's the same with computers. Because they are programmed, and because the many meanings in a program are arbitrary, a single mistake can turn a helpful computer into one that can attack and kill an entire enterprise. That's why I've never understood managers who take a casual approach to software maintenance. Time and again I hear managers explain that maintenance can be done by less intelligent people operating without all the formal controls of development—because it's not very critical. And no amount of argument seems able to convince them otherwise—until they have a costly maintenance blunder.

Fortunately, costly maintenance blunders are rather common, so some managers are learning—but the tuition is enormous. I keep a list of the world's most expensive programming errors, and all of the top ten are maintenance blunders. Some have cost over a billion dollars each, and some have led to deaths. Often, the blunder involved changing a single digit in a previously functioning program.

In those horrendous losses, the change was deemed "so trivial" it was instituted casually by a supervisor telling a low-level maintenance programmer to "change that digit"—with no written instructions, no test plan, nobody to review the change, and, indeed, no controls whatsoever between the one programmer and the organization's day-to-day operations. It was exactly like having an attack dog trained to respond to KILL—or even HELLO.
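
To make the danger concrete, here is a hypothetical sketch (in Python) of the kind of one-digit edit described above. The function, the threshold, and the business rule are all invented for illustration; none of them come from the actual incidents on my list.

    # Hypothetical example: an inventory rule in a previously functioning program.
    # The function name and the numbers are invented for illustration only.
    def should_reorder(units_on_hand: int) -> bool:
        """Trigger an automatic purchase order when stock runs low."""
        # Original, long-working line:
        #     return units_on_hand < 500
        # The "trivial" change the supervisor asks for over someone's shoulder:
        return units_on_hand < 5000  # one extra digit; reordering now kicks in ten times sooner

With no written instruction, no test plan, and no reviewer, nothing stands between that single keystroke and the day-to-day purchasing it now drives.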

I've done some studies, confirmed by others, about the chances of maintenance changes being done incorrectly. Contrary to simple-minded intuition, it turns out that "tiny" changes are more likely than larger ones to be flawed. Roughly, a one-line change has about a 50/50 chance of producing an error, while a 20-line change produces an error only about one-third of the time.

Developers are often shocked to see such a high error rate for one-line changes, and there are two reasons their own experience misleads them. In the first place, changes made during development are simpler because they are made to cleaner, smaller, better-structured code—code that has not been changed many times before and so does not harbor unexpected linkages. Such linkages were involved in many of my top-ten disasters.

In the second place, the consequences of an erroneous change during development are smaller because the error can be corrected without affecting real operations. Thus, developers don't take much notice of their errors, so they tend to underestimate their frequency. In development, you simply fix errors and go on your merry way. Not so in maintenance, where you must mop up the damage the error causes. Then you spend countless hours in meetings explaining why such an error will never happen again—until the next time.

For these two reasons, developers interpret such high rates of maintenance errors as indications of the ignorance or inexperience of maintenance programmers. They're wrong. Maintenance programmers are perfectly capable of doing better work than their record with tiny changes seems to indicate. Their competence is proved by the decrease in error rates as the size of the change increases.

If tiny changes are not taken seriously, they are done carelessly and without proper controls. A higher rate of error is an inevitable consequence.

How many times have you heard a developer say, "No problem! It's just a small change. All I have to do is change one line!"?

That statement would be sensible if "small" changes were truly small—if software maintenance were actually like the maintenance of an apartment building. The janitor can change one washer in a dripping sink without much risk of causing the building to collapse. It's not safe to make the same assumption for a program once it's in production.

Whoever coined the term "maintenance" for computer programs was as careless and unthinking as the person who trains an attack dog to kill on the command, KILL. With the wisdom of hindsight, I would suggest that a maintenance programmer is more like a brain surgeon than a janitor. Would maintenance be easier to manage well if it were called "software brain surgery"?


Think about it this way. Suppose you had a bad habit—like saying KILL to attack dogs. Would you go to a brain surgeon and say, "Just open up my skull, Doc, and remove that one little habit. It's just a small change! Just a little maintenance job!"?

1 comment:

windchill said...

See https://en.wikipedia.org/wiki/Knight_Capital_Group