
Sunday, October 29, 2017

My most challenging experience as a software developer

Here is my detailed answer to the question, "What is the most challenging experience you encountered as a software developer?":

We were developing the tracking system for Project Mercury, to put a person in space and bring them back alive. The “back alive” part was the great challenge, but not the only one. Some other challenges were as follows:

- The system was based on a world-wide network of fairly unreliable teletype connections. 

- We had to determine the touchdown point in the Pacific to within a small radius, which meant we needed accurate, perfectly synchronized clocks on the computer and in the space capsule.

- We also needed to know exactly where our tracking stations were, but it turned out nobody knew where Australia's two stations were with sufficient precision. We had to create an entire sub-project to locate Australia.

- We needed information on the launch rocket, but because it was also a military rocket, that information was classified. We eventually found a way to work around that.

- Our computers were a pair of IBM 7090s, plus a 709 at a critical station in Bermuda. In those days, the computers were not built for on-line real-time work. For instance, there was no standard interrupt clock. We actually built our own for the Bermuda machine.

- Also, there were no disk drives yet, so everything had to be based on a tape drive system, but the tape drives were not sufficiently reliable for our specs. We beat this problem by building software error-correcting codes into the tape drive system.
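The Mercury scheme itself isn't described here, but to give a feel for what "software error-correcting codes" can mean, here is a minimal sketch (in modern Python, purely for illustration) of a Hamming(7,4) code, the textbook single-error-correcting code. Any scheme in this family lets software recover from an occasional flipped bit read from an unreliable device:

```python
# Illustrative sketch only - not the actual Mercury code.
# Hamming(7,4): 4 data bits + 3 parity bits; any single flipped
# bit in the 7-bit codeword can be located and corrected.

def encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4,5,6,7
    pos = 4 * s3 + 2 * s2 + s1       # 0 means no single-bit error detected
    if pos:
        c[pos - 1] ^= 1              # flip the bad bit back
    return c[2], c[4], c[5], c[6]    # the recovered data bits

word = encode(1, 0, 1, 1)
word[5] ^= 1                         # simulate a bit garbled by a tape drive
assert decode(word) == (1, 0, 1, 1)  # the original data comes back intact
```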

We worked our way through all these problems and many more smaller ones, but the most challenging problem was the “back alive” requirement. Once we had the hardware and network reliability up to snuff, we still had the problem of software errors. To counter this problem, we created a special test group, something that had never been done before. Then we set a standard that any error detected by the test group and not explicitly corrected would stop any launch.

Our tests revealed that the system could crash for unknown reasons at random times, so it would be unable to bring down the astronaut safely at a known location. When the crash occurred in testing, the two on-line printers simultaneously printed a 120-character line of random garbage. The line was identical on the two printers, indicating that this was not some kind of machine error on one of the 7090s. It could have been a hardware design error or a coding error. We had to investigate both possibilities, but the second possibility was far more likely.

We struggled to track down the source of the crash, but after a fruitless month, the project manager wanted to drop it as a “random event.” We all knew it wasn’t random, but he didn’t want to be accused of delaying the first launch.

To us, however, it was endangering the life of the astronaut, so we pleaded for time to continue trying to pinpoint the fault. “We should think more about this,” we said, to which he replied (standing under an IBM THINK sign), “Thinking is a luxury we can no longer afford.”

We believed (and still believe) that thinking is not a luxury for software developers, so we went underground. After much hard work, Marilyn pinpointed the fault and we corrected it just before the first launch. We may have saved an astronaut’s life, but we’ll never get any credit for it.

Moral: We may think that hardware and software errors are challenging, but nothing matches the difficulty of confronting human errors—especially when those humans are managers willing to hide errors in order to make schedules.



Tuesday, January 04, 2011

The Universal Pattern of Huge Software Losses


What Do Failures Cost?
Some perfectionists in software engineering are overly preoccupied with failure, and most others don't rationally analyze the value they place on failure-free operation. Nonetheless, when we do measure the cost of failure carefully, we generally find that great value can be added by producing more reliable software. In Responding to Significant Software Events, I give five examples that should convince you.

The national bank of Country X issued loans to all the banks in the country. A tiny error in the interest rate calculation added up to more than a billion dollars that the national bank could never recover.

A utility company was changing its billing algorithm to accommodate rate changes (a utility company euphemism for "rate increases"). All this involved was updating a few numerical constants in the existing billing program. A slight error in one constant was multiplied by millions of customers, adding up to X dollars that the utility could never recover. The reason I say "X dollars" is that I've heard this story from four different clients, with different values of X. Estimated losses ranged from a low of $42 million to a high of $1.1 billion. Given that this happened four times to my clients, and given how few public utilities are clients of mine, I'm sure it's actually happened many more times.
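The arithmetic behind such losses is worth making explicit. With invented numbers (none of the actual figures are public), a sketch like this shows how a "slight" error in one constant becomes millions:

```python
# Invented numbers, for illustration only: a slightly wrong rate
# constant, multiplied across every customer's bill, month after month.
rate_error = 0.0004        # constant off by four hundredths of a percent
avg_bill = 120.00          # average monthly bill, in dollars
customers = 3_000_000
months = 18                # how long the error goes unnoticed

loss = rate_error * avg_bill * customers * months
print(f"${loss:,.0f}")     # roughly $2.6 million, from one wrong constant
```

Nudge the assumed inputs and you can reproduce any value in the reported $42 million to $1.1 billion range.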

I know of the next case through the public press, so I can tell you that it's about the New York State Lottery. The New York State legislature authorized a special lottery to raise extra money for some worthy purpose. As this special lottery was a variant of the regular lottery, the program to print the lottery tickets had to be modified. Fortunately, all this involved was changing one digit in the existing program. A tiny error caused duplicate tickets to be printed, and public confidence in the lottery plunged with a total loss of revenue estimated between $44 million and $55 million.

I know the next story from the outside, as a customer of a large brokerage firm:
One month, a spurious line of $100,000.00 was printed on the summary portion of 1,500,000 accounts, and nobody knew why it was there. The total cost of this failure was at least $2,000,000, and the failure resulted from one of the simplest known errors in COBOL coding: failing to clear a line in the print area.
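For readers who don't read COBOL, the bug pattern translates to any language: a print-line buffer that one code path fills and another path forgets to clear, so stale data leaks into later output. A hypothetical sketch (all names invented):

```python
# Hypothetical sketch of the "uncleared print area" bug pattern.
line_buffer = ""   # shared print area, as in the COBOL original

def print_account_summary(name, special_fee, buggy=True):
    global line_buffer
    if special_fee:
        line_buffer = f"FEE     ${special_fee:>12,.2f}"
    elif not buggy:
        line_buffer = ""    # the missing one-line fix: clear the area
    print(name)
    if line_buffer:
        print(line_buffer)  # stale line from a previous account leaks out

print_account_summary("ACCOUNT A", 100000.00)  # legitimately shows the fee
print_account_summary("ACCOUNT B", 0)          # spuriously shows $100,000.00
```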

I know this story, too, from the outside, as a customer of a mail-order company, and also from the inside, as their consultant. One month, a new service phone number for customer inquiries was printed on each bill. Unfortunately, the phone number had one digit incorrect, producing the number of a local doctor instead of the mail-order company. The doctor's phone was continuously busy for a week until he could get it disconnected. Many patients suffered, though I don't know if anyone died as a result of not being able to reach the doctor. The total cost of this failure would have been hard to calculate except for the fact that the doctor sued the mail-order company and won a large settlement.

The Pattern of Large Failures
Every such case that I have investigated follows a universal pattern:

1. There is an existing system in operation, and it is considered reliable and crucial to the operation.

2. A quick change to the system is desired, usually from very high in the organization.

3. The change is labeled "trivial."

4. Nobody notices that statement 3 is a statement about the difficulty of making the change, not the consequences of making it, or of making it wrong.

5. The change is made without any of the usual software engineering safeguards, however minimal, that the organization has in place.

6. The change is put directly into the normal operations.

7. The individual effect of the change is small, so that nobody notices immediately.

8. This small effect is multiplied by many uses, producing a large consequence.

Whenever I have been able to trace management action subsequent to the loss, I have found that the universal pattern continues. After the failure is spotted:

9. Management's first reaction is to minimize its magnitude, so the consequences continue for somewhat longer than necessary.

10. When the magnitude of the loss becomes undeniable, the programmer who actually touched the code is fired—for having done exactly what the supervisor said.

11. The supervisor is demoted to programmer, perhaps because of a demonstrated "understanding" of the technical aspects of the job.

12. The manager who assigned the work to the supervisor is slipped sideways into a staff position, presumably to work on software engineering practices.

13. Higher managers are left untouched. After all, what could they have done?

The First Rule of Failure Prevention
Once you understand the Universal Pattern of Huge Losses, you know what to do whenever you hear someone say things like:

• "This is a trivial change."

• "What can possibly go wrong?"

• "This won't change anything."

When you hear someone express the idea that something is too small to be worth observing, always take a look. That's the First Rule of Failure Prevention:

Nothing is too small to be worth observing.

It doesn't have to be that way
Disaster stories always make good news, but as observations, they distort reality. If we consider only software engineering disasters, we omit all those organizations that are managing effectively. But good management is so boring! Nothing ever happens worth putting in the paper. Or almost nothing. Fortunately, we occasionally get a heart-warming story, such as Financial World's account of Charles T. Fisher III of NBD Corporation, one of its award-winning CEOs of the Eighties:

"When Comerica's computers began spewing out erroneous statements to its customers, Fisher introduced Guaranteed Performance Checking, promising $10 for any error in an NBD customer's monthly statement. Within two months, NBD claimed 15,000 new customers and more than $32 million in new accounts."

What the story doesn't tell is what happened inside the Information Systems department when they realized that their CEO, Charles T. Fisher III, had put a value on their work. I wasn't present, but I could guess the effect of knowing each prevented failure was worth $10 cash.

The Second Rule of Failure Prevention
One moral of the NBD story is that those other organizations do not know how to assign meaning to their losses, even when they finally observe them. It's as if they went to school, paid a large tuition, and failed to learn the one important lesson—the First Principle of Financial Management, which is also the Second Rule of Failure Prevention:

A loss of X dollars is always the responsibility of an executive whose financial responsibility exceeds X dollars.

Will these other firms ever realize that exposure to a potential billion dollar loss has to be the responsibility of their highest ranking officer? A programmer who is not even authorized to make a long distance phone call can never be responsible for a loss of a billion dollars. Because of the potential for billion dollar losses, reliable performance of the firm's information systems is a CEO level responsibility.

Of course I don't expect Charles T. Fisher III or any other CEO to touch even one digit of a COBOL program. But I do expect that when the CEOs realize the value of trouble-free operation, they'll take the right CEO-action. Once this happens, this message will then trickle down to the levels that can do something about it—along with the resources to do something about it.

Learning from others
Another moral of all these stories is that by the time you observe failures, it's much later than you think. Hopefully, your CEO will read about your exposure in these case studies, not in a disaster report from your office. Better to find ways of preventing failures before they get out of the office.

Here's a question to test your software engineering knowledge:
What is the earliest, cheapest, easiest, and most practical way to detect failures?

And here's the answer that you may not have been expecting:

The earliest, cheapest, easiest, and most practical way to detect failures is in the other guy's organization.

Over my half-century in the information systems business, there have been many unsolved mysteries. For instance, why don't we do what we know how to do? Or, why don't we learn from our mistakes? But the one mystery that beats all the others is this: why don't we learn from the mistakes of others?

Cases such as those cited above are in the news every week, with strong impact on the general public's attitudes about computers. But they seem to have no impact at all on the attitudes of software engineering professionals. Is it because they are such enormous losses that the only safe psychological reaction is, "It can't happen here (because if it did, I would lose my job, and I can't afford to lose my job, therefore I won't think about it)"?

(Adapted from Responding to Significant Software Events)
http://www.smashwords.com/books/view/35783

Thursday, December 09, 2010

Testing Without Testing


The job of software testing, we know, is to provide information about the quality of a product. Many people, however, believe that the only way to get such information is to execute the software on computers, or at least review code. But such a belief is extremely limiting. Why? Because there's always lots of other information about product quality just lying around for the taking - if only it were recognized as relevant information.



Because of their psychological distance from the problems, external consultants are often able to see information that escapes the eyes and ears of their clients. Dani (Jerry's wife) tells this story about one of her consulting triumphs in the dog world:



======================================================



A woman with a Sheltie puppy was at her wits' end because "he always poops on the living room rug." She loved the little guy, so before giving up and taking him to the Humane Society, she came to Dani for a consultation. Dani listened to the woman describe the problem, then asked, "Are there any other behavior problems you've noticed?"



The woman thought about that for a while, then said, "Well, yes, there is one. He has this annoying habit of scratching on the front door and whining."



======================================================



It's easy to laugh at this woman's inability to see the connection between the two "problems," but that sort of blindness is typical of people who are too close to, and too emotionally involved with, a situation. Learning to recognize such free information is one of the secrets of successful testing. Using it, you can learn quickly about the quality of an organization's products or the quality of the information they obtain from machine testing.

Here are some "Sheltie stories" from our consulting practices. Test yourself by seeing what information you can derive from them about the quality of a product or the quality of the information that's been obtained about the product through testing:



======================================================



1. Jerry is asked to help an organization assess its development processes, including testing. Jerry asks the test manager, "Do you use specs to test against?"



Test manager replies, "Yes, we certainly do."



Jerry: "May I see them?"



Test manager: "Sure, but we don’t know where they are."



======================================================



2. Jerry is called in to help a product development organization's testing group. He learns that they are testing a product that has about 40,000 lines of code.



"The problem," says the test manager, "is with our bug database."



"What's wrong with it," Jerry asks. "Is it buggy?"



"No, it's very reliable, but once it holds more than about 14,000 bug reports, its performance starts to degrade very rapidly. We need you to show us how to improve the performance so we can handle more bug reports."



======================================================



3. Bunny is asked to help improve the testing of a large product that has 22 components. The client identifies "the worst three components" by the high number of bugs found per line of code. Bunny asks which are the best components and is given the very low bugs per line of code figures for each.



She then examines the configuration management logs and discovers that for each of these three "best" components, more than 70 per cent of their code has not yet successfully been checked in for a build.



======================================================



4. Trudy is invited to help a development manager evaluate his testing department's work. She starts by looking at the reports he receives on the number and severity of bugs found each week. The reports are signed by the test manager, but Trudy notices that the severity counts have been whited out and written over.

"Those are corrections by the product development manager," the development manager explains. "Sometimes the testers don't assign the proper severity, so the development manager corrects them."



Using her fingernail, Trudy scratches off the whiteout. Under each highest severity count is a higher printed number. Under each lowest severity count is a lower printed number.



======================================================



5. Jerry is watching a tester running tests on one component of a software product. As the tester is navigating to the target component, Jerry notices an error in one of the menus. The tester curses and navigates around the error by using a direct branch.



Jerry compliments the tester on his ingenuity and resourcefulness, then asks how the error will be documented. "Oh, I don't have to document it," the tester says. "It's not in my component."



======================================================


6. Fanny watched one tester spend the better part of several hours testing the scroll bars on a web-based enterprise system. The scroll bars were, of course, part of the web browser, not the system being tested.



======================================================



7. Jerry asked a development manager if the developers on her project unit tested their code.

"Absolutely," she said. "We unit test almost all the code."



"Almost?" Jerry asked. "Which code don't you unit test?"



"Oh, some of the code is late, so of course we don't have time to unit test that or we'd have to delay the start of systems test."



======================================================



8. One of Christine's clients conducted an all-day Saturday BugFest, where developers were rewarded with cash for finding bugs in their latest release candidate. They found 282 bugs, which convinced them they were “close to shipping.”



They were so happy with the result that they did it again. This time, they found 343 new bugs - which convinced them they were “on the verge” of shipping.



======================================================



9. A general manager was on the carpet because a recently shipped product was proving terribly buggy in the hands of customers. Jerry asked him why he allowed the product to ship, and he said, "Because our tests proved that it was correct."



======================================================



10. Another manager claimed to Noreen that he knew that their product was ready to ship because "we've run 600,000 test cases and nothing crashed the system."



======================================================



11. When Jerry asked about performance testing, one of his clients said, "We've already done that."



"Really?" said Jerry. "What exactly have you done?"



"We ran the system with one user, and the response time was about ten milliseconds. Then we ran it with two users and the response time was twenty milliseconds. With three users, it was thirty milliseconds."



"Interesting, Jerry responded. "But the system is supposed to support at least a hundred simultaneous users. So what response time did you get when it was fully loaded?"



"Oh, that test would have been too hard to set up, and anyway, it's not necessary. Obviously, the response time would be one second - ten milliseconds per user times one-hundred users."



======================================================



12. Jerry's client calls an emergency meeting to find out "why testing is holding up our product shipment." In the meeting, the testers present 15 failed tests that show that the product did not even build correctly, so it couldn't be tested. They discuss each of the 15 problems with the developers, after which the development lead writes an email summary of the meeting reporting that there are only two "significant" problems.



The email goes to the development manager, the marketing manager, and all the developers who attended the meeting. None of the testers present are included in the cc-list, so none of them even know that the email was sent.



======================================================



13. Johnson watches a tester uncover five bugs in a component, but instead of logging the bugs in the bug database, the tester goes directly to the developer of the component and reports them orally. Johnson asks why the tester didn't record the bugs, and he replies, "If I do that, she (the developer) screams at me because it makes her look bad."



======================================================



14. Tim reviews a client's test plan and notices that there is no plan to test one of the components. When he asks the test manager (who is new to the company) why it's missing, he's told, "We don't need to worry about that. The development manager assures me that this developer is very careful and conscientious."



======================================================
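Before the wrap-up, here is the footnote promised in story 11: response time does not scale linearly with user count. A standard single-server queueing model (M/M/1), with assumed numbers since the client's system isn't specified, shows why light-load measurements extrapolate so badly:

```python
# Back-of-envelope M/M/1 queueing sketch - assumed numbers, not the
# client's real system. One server, 10 ms average service time,
# each user submitting one request per second.
SERVICE_RATE = 100.0   # requests per second the server can handle

def response_time_ms(users, requests_per_user=1.0):
    arrival_rate = users * requests_per_user
    if arrival_rate >= SERVICE_RATE:
        return float("inf")   # saturated: the queue grows without bound
    return 1000.0 / (SERVICE_RATE - arrival_rate)

for n in (1, 2, 3, 50, 90, 99, 100):
    print(f"{n:3d} users: {response_time_ms(n):8.1f} ms")
# At 1-3 users the times look flat and harmless (about 10 ms each);
# at 99 users the response time is about 1,000 ms, and at 100 the
# model diverges entirely. A straight line drawn through the first
# few points predicts none of this.
```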



So, were you able to extract meta-information from these fourteen examples? What did each of them tell you (or hint to you) about the quality of the information from testing - its accuracy, relevance, clarity, and comprehensiveness?

Remember, though, that these are merely hints. Perhaps the Sheltie pooped on the rug because he had some medical problem that would need a veterinarian's help.

Perhaps there's a non-obvious explanation behind each of these Sheltie Situations, too, so you always have to follow up on the hints they provide to validate your intuition or find another interpretation. And, even if your intuition is right on target, you probably won't have an easy time convincing others that you're correct, so you will have to gather other evidence, or other allies, to influence the people who can influence the situation.



When Dani asked the Sheltie's owner why she thought the pup was whining at the front door, the woman said, "I think I've spoiled him. He just wants to go out and play all the time, but I have too much housework to do - especially since I spend so much time cleaning up the mess on the rug."



When a problem persists in spite of how obvious the solution is to you, you aren't going to be able to convince others to solve the problem until you find out how they are rationalizing away the evidence that's so apparent to you. So we need to know how people immunize themselves against information, and what we can do about it.

(Adapted from Perfect Software: and Other Illusions about Testing)
http://www.smashwords.com/books/view/25400

Wednesday, September 22, 2010

Have S/W Projects Hit a Wall?

I recently received an interesting set of questions about software projects from a French science journalist. I thought my readers would like to see those questions and my answers, so here goes:

Q: Do large software projects fail at a rate significantly higher than other engineering projects in the physical world (which are also quite complex)?

A: Well, as far as total failure goes, yes, I think s/w projects fail outright more often than, say, shipbuilding projects.

OTOH, the US Navy reported a few years ago that every ship built since World War II has been late and over budget, so that type of "failure" is 100%, even though we've been building ships for hundreds of years. The Wasa (or Vasa) Ship in Sweden is a good historical example of one reason for failure: the piling on of requirements until complexity is too great.

See http://en.wikipedia.org/wiki/Vasa_(ship)

Q: Do we know exactly why? Is it a management problem or a theoretical problem?

A: All the failures I have studied have been management failures. In some cases, the theory might have been wrong, but management failed to notice the signs of impending failure, or noticed them but failed to act in time to prevent the project from becoming a death march.

Q: Have we now reached a critical size of dependable, verifiable code, something like a "wall of complexity"?

A: Such a wall definitely exists, though its thickness is somewhat fuzzy. Some well-managed projects can surpass the "wall" that stops poorly managed projects. But eventually, at any given state of the art, there will be a "wall," and as the project approaches its particular wall, progress becomes sluggish and expensive. When that happens, good managers recognize what's happening and take action--generally pulling back on some of the excess "requirements."

Q: Does this mean that the "Internet of things," or other big real-time systems (like FAA air traffic control) that we would like to build with high reliability, are not really possible at the moment?

A: I first worked with the FAA in the late 1950s, trying to help them build the air traffic control system of their dreams. It wasn't possible then to implement their dreams, and it's still not possible. Why? One reason is that their dreams keep growing faster than our ability to implement them. They could build a better system, but when they try to build "the best system for all time," they collapse like the Wasa Ship.

Q: Is there a lack of a theory of building large-scale software (like the rules governing civil engineering in the physical world)? Is it because computer science is still a relatively young science?

A: There is a total lack of theory, but there are some empirical principles gained from experience. I've tried to catalog these principles in my Quality Software Management series (see below).

The trouble with computer science is that it's not a science, but generally a kind of mathematics without connection to the empirical world.

Rules governing civil engineering come largely from real-world experience, followed up by some scientific work in a few areas, like the properties of building materials. In computing, much of what we do know is simply not known to most developers, who are too busy trying to salvage poorly managed projects.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, how do my answers compare with yours? Am I being too pessimistic, or too optimistic?

Friday, November 14, 2008

What is an Alcoholic?

A commenter describes his drinking habits and says they don't interfere with work. He wants to know how I define "alcoholic."

I think he and I agree that the question turns on how your drinking affects your work, short term and long term. Since everybody's physiology is unique, I don't think the definition can depend on how much someone drinks. What would put me in the hospital (I have a severe reaction to alcohol), might just be a thirst-quencher for someone else.

One problem with this approach is that alcohol definitely tends to take the edge off one's judgment. I have dealt with a number of worker/drinkers who believe their work is not affected by their drinking--but the reason I was asked to speak with them was that their work effectiveness had been dropping noticeably, according to their managers.

I see lots of moral judging about alcohol consumption, which in turn leads to a lot of defensiveness on the part of those who enjoy drinking. For me, I don't care if your work is affected by alcohol, M&Ms, or listening to too much opera. If something is affecting your work, then that's an issue for your manager and possibly your co-workers.

If it's affecting your health (long or short term), that's your business, and perhaps the business of your family. Of course, if it's affecting your driving, then it's my business as long as we're sharing a road. I won't accept rides from people who have been drinking--or who might be otherwise impaired.

In short, I'm concerned about your alcohol habits only insofar as they affect me. It's an individual judgment, not a stereotype--but it's definitely a stereotype for lots of people, something alcoholics have to live with in today's society.

Wednesday, January 30, 2008

How Can You Recognize Alcoholism in a Service Provider?

Jeff wrote: "Do you have a specific test for that [alcoholism in service providers]? Many of the alcoholics I've known hid their problem well, at least for some period of time."

Good question, and one I couldn't answer at the time. After being placed in jeopardy with the government as a result of Provider's actions, I began to study the problem with great interest. Here are some signs I now recognize that I didn't pay attention to at the time:

Late for Appointments and Missing Deadlines

This one I definitely noticed, but didn't recognize the possible significance. I just told myself that Provider was a person who was "habitually late." Like many of the other signs, lateness could be attributed to many things besides alcoholism, so I let it pass without comment.

Depression and Mood Fluctuations

Again, I noticed this, but didn't appreciate the possible significance. I believe I thought, "Well, such providers aren't the most sparkling of personalities." Actually, my next provider proved even that assumption wrong. She's terrific.

Mistakes

Everybody makes mistakes, and I tend to be pretty generous in allowing for them. Some of Provider's mistakes were hidden, and that was my fault for not having a reasonable feedback mechanism. But I had noticed a rather higher level of mistakes than I'd like to see in a provider, and I just let it pass.


Personal Problems as Excuses for Mistakes and Lateness

Provider was never short of excuses for mistakes and lateness. Health, problems at home, "the dog ate my calendar"--he was very creative.

Choosing Lunch Dates in Drinking Places

I didn't have lunch with Accountant very often (taking lunch alone may be another sign), but looking back, I realize that he always insisted on restaurants that had a bar.

Showing Up Intoxicated

I never noticed this with Provider, but in subsequent years, I've noticed it with other service people. Whether they're alcoholics or not, this is unacceptable. For example, someone operating a power lawnmower while drinking is a risk to his own life and limb--and to my entire business if he sues me after cutting off his foot while working for me.

Health Problems

Anybody can have health problems--I'm a prime example. But someone who is consistently coming down with one misery after another might be showing symptoms typical of alcoholics. Same is true for frequent injuries and accidents. But, of course, they might just be a natural klutz.

Speaking Affectionately About Drinking

I should have recognized this one, for my mother was an alcoholic. She often spoke lovingly about her Southern Comfort. Provider's drink of choice was different, but he seemed to have the same love affair. Affairs, really. He loved 'em all.

Signs, not Proof

None of these signs proves that someone is an alcoholic. They could be signs of other things--other addictions or something quite innocent. But my job is not to prove that some provider is an alcoholic, which can be incredibly difficult. Alcoholics are experts at denial, rationalization, dreaming up excuses, blaming others, manipulating you, or hooking into your caretaker needs. Besides, their alcoholism is none of my business.

Your Responsibility

What is your business is your business. You hire a provider to do a particular service. If they don't do that service well enough, it's your responsibility to replace them, not to make excuses for them. And especially not to fix them. Set performance criteria. Communicate those criteria. Observe performance relative to those criteria, and take action when performance doesn't measure up. Why it doesn't measure up is not your job.

I didn't do those things with Provider, so I got snagged into his drinking problem. It was his problem, but it was my responsibility to protect myself. I now do a better job of fulfilling my responsibilities as a business owner. Overall, I've protected myself not just from alcoholism, but from other problems that are not my problems.

But She's My Friend

Does this sound heartless and cold? Maybe you're good friends with your service provider? Can you treat your friends like this?

To take just one example, I had a copy editor who had trouble getting to work in a timely manner. We had flexible working hours, but I couldn't depend on her for any schedule. Turns out, she was not an alcoholic, but was depressed over her mother's death three years earlier. She tended to sleep 12 or 14 hours a day. After causing me to miss an important mailing deadline one afternoon, she said, "Oh, if you need me and I'm not here, you can just call me and wake me up."

Not my job. Not as her client. I replaced her with an editor who could wake herself up.

Then, as a friend, not a client, I helped the copy editor find a really good therapist, just as I had helped Provider work through his AA twelve steps. It turns out that if you want to help people with such personal problems, it's easier if you're not hiring them to do a job.

Wednesday, November 28, 2007

Where Does the Magic Come From?

Any sufficiently advanced technology is indistinguishable from magic.
- Clarke's Third Law


In my career, I have at times run a successful project, built a high-performing team, or conducted a stunning class. Each time, though, I knew that my technology seemed like magic even to me, because I didn't really know how I did it. I do like to succeed, so perhaps I should be content with success alone. But I always worry:

"If it's indistinguishable from magic,
how do I know it won't go away next time?"


The Double Bind

When I worry, I'm reluctant to change anything, no matter how small, for fear that the magic will flee. I feel trapped between the fear of losing the magic by change and the fear of losing the magic by failing to change - a classic example of the trap known as a "double bind" (damned if you do, damned if you don't).

Double binds often result in paralysis or ritualized behavior. For example, I'm often called upon to improve meetings, but then find it difficult to persuade my clients to change anything about the meeting. "If we move to another room, it might not be as good as this one." "If we don't invite Jack to the next meeting, we might need something he knows." "If we change the order of the agenda, we might not get through on time." "If we vote in a different way, we might make a poor decision." "We must order our donuts from Sally's Bakery or we won't have a successful meeting."

The "Magic" of PSL

I'd find this behavior even more frustrating if I hadn't experienced the same double bind myself–for example, when the faculty considers some potential improvements to our Problem Solving Leadership workshop (PSL). Over the years, lots of people have experienced what they call "the magic of PSL," and we're proud of that. But each time we consider a change, someone raises the fear that the change might make the magic disappear. Fortunately, each time we do this, someone is able to prove that the magic is not tied to the factor under consideration.

For instance, we've worried about changing the hotel or city where PSL is held. We do attempt to find magical sites, but then we remember that many PSLs have transformed mundane hotels in mundane cities into magical sites. This proves to us that the magic can't be in the site, and frees us from that double bind.

Or, we've worried about changing the faculty who teach PSL. We certainly don't choose faculty members at random, but every faculty member has led many, many magical PSLs. So the magic can't be in any particular faculty members.

Or, we've worried about the combination of faculty members. We don't choose our co-training teams at random, either, but all combinations experience magic. So the magic can't be in the faculty combination.

Again, we've worried about the materials we use. We certainly don't choose materials at random, but we do change materials from class to class, and each class deviates from the "standard" materials in a variety of ways. Indeed, there is no single item of material that's in common between the very first PSL (back in 1974) and the most recent one. So the magic can't be in particular materials, either.

Breaking the Bind

The same approach can be used to break other double binds - by finding a counter-example to match each objection:

- "If we move to another room, it might not be as good as this one." "Ah, but remember when they were painting this room and we met downstairs? We had a good meeting then."

- "If we don't use Microsoft Project, this project might fail." "Could be, but we did project X with other tracking software, and we did a fine job."

- "If we change to a new version of the operating system, we might have crashes." "True. But we had a few crashes the last time we upgraded, and though it was some trouble, we dealt with them."

- "If I clean up that code, the system might fail." "That could happen, but the previous three times we cleaned up some code, we caught all the failures in our technical reviews and regression testing. So let's do it, but let's be careful."

The Effective Use of Failure

What can you do if you don't have a counter-example and can't create one in a safe way? In that case, it helps if you can demystify the magic and understand its underlying structure. To do this, you need examples where the magic didn't happen. In social engineering, as in all engineering, failures teach you more than successes.

For instance, the PSL faculty became more aware of the source of PSL magic by observing a few times that the magic didn't "work." Usually, people come to PSL voluntarily–but not always. Once in a while, someone is forced to come to PSL to be "fixed," but people who have been labeled as "broken" may resent the whole experience, and may not feel much PSL magic at all.

From these rare failures of PSL magic, we have identified one key component of the magic of PSL:

People are there because they have chosen to be there.

Curiously, the same component works in creating magical meetings, magical projects, and magical teams. When people are given a choice, they are the magic. Or, more precisely, they create the magic.

When people choose to attend a workshop, to participate in a project, or to join a team, they plunge themselves fully into the experience, rather than simply going through the motions. Consultants can thus have a "magic" advantage over employees: They always know that they've chosen this assignment, so they can always throw themselves into it without reservation. Employees can have this choice, too, but they often forget–just as some consultants forget when they feel forced to take an assignment out of economic necessity.

Keep this in mind the next time you choose an assignment. If you feel forced, you won't do your magical best. You won't have access to the magic that lives inside of yourself.

Do You Want to Experience the Magic of PSL?

Esther Derby, Johanna Rothman, and I will be leading another PSL (Problem Solving Leadership) March 16-21, 2008, in Albuquerque, NM.

PSL is experiential training for learning and practicing a leader's most valuable asset: the ability to think and act creatively. PSL is the gold standard for leadership training, and I'm thrilled to be teaching again with Esther and Johanna.

See <http://www.jrothman.com/syllabus/PSL.html> for the syllabus. If you're interested, please send Esther an email, [Esther Derby <derby@estherderby.com>]. We'd love to have you help us create some more magic.

Saturday, June 02, 2007

The Exception is the Rule

Recently, I was trying to help a client (let me call them "StartupCompany") mired in conflicts, exceptions, errors, anomalies, lapses, modifications and other deviations from the norm. These annoying exceptions were playing tricks with my blood pressure, so I had to be wired to a wearable blood pressure computer for twenty-four hours. As if StartupCompany didn't have enough interruptions, now my wearable computer was inflating a blood pressure cuff at random intervals throughout the day.

Every time the cuff inflated, I petulantly asked myself: Why can't they run a project like real people living run-of-the-mill, low-blood-pressure lives?

That night I was using the Yellow Pages, and in the A categories of its index I chanced to notice a curious pattern. Here are the first few items:

Abortion Services and Alternatives. These were the first two entries in the index. I decided to skip them both, so as not to take sides in the pro-choice/pro-life conflict. I had enough conflicts within StartupCompany.

Abuse - Men, Women, Children. I decided to continue my scan of the index, and this was the next entry. The normal process of family living involves people loving and respecting each other, communicating well, and behaving appropriately according to societal norms. But when people start behaving inappropriately, they need Abuse Services. In StartupCompany, people normally respected one another, communicated well, and behaved appropriately according to societal norms. But they sometimes didn't, and they lacked "abuse services" for coping.

Academies (including private schools and special education). When the formal education system doesn't provide special knowledge or handle special cases, private academies and special education are called for. People within StartupCompany often needed to know things they hadn't learned in the public schools, but StartupCompany had no provision for special education.

Accident Prevention. Accidents aren't "supposed" to happen, but StartupCompany had accidents. In order to improve, they needed processes to prevent accidents and to mitigate their consequences.

Accordions. Despite what some people think, accordions are perfectly normal, though not everybody learns to play them or appreciate them. Still, StartupCompany could have used some entertainment to lighten the mood once in a while.

Accountants. Accounting is also normal, but, if everything always went according to plan, we wouldn't need to account for things so carefully. We have to protect our financial well-being from mistakes and misbehavior, and that's what accountants do - and also what they should have been doing in StartupCompany.

Acetylene Welding. Some welding is normal, and some is for repairing things that are not supposed to break - but do anyway. StartupCompany lacked a "welding team" to handle lots of stuff that broke.

Acrylic Nails. Most normal people have fingernails, so why is there a nail business? Oh, yes, it's the human interface, and StartupCompany had to cope with conflicting ideas of what made a system beautiful - but they had no special beauty experts to resolve the conflicts.

Acting Instruction. We all need to "put on an act" now and then when we're caught by surprise. StartupCompany's people certainly needed training in how to behave in improvisational situations, but there was no acting instruction.

Acupressure/Acupuncture. If we were all healthy all the time, we wouldn't need medical services, and if "normal" Western medical services worked all the time, we wouldn't need acupressure and acupuncture. So, there are not only abnormal services, but meta-abnormal services - the services when the normal abnormal services fail - certainly true in StartupCompany.

Addressing Service. Have you ever tried to maintain a mailing list? Almost all the work is not the mailing itself, but maintaining the addresses. It's even worse for email, because email services haven't yet evolved "normal" ways of dealing with changes. Gee, neither had StartupCompany.

Adjusters. Adjusters, of course, are an abnormal service from the get-go. Without accidents, we wouldn't need insurance, and if things stayed on course, StartupCompany wouldn't have needed risk analysis. But they did.

Adobe Materials and Contractors. Adobe materials may not be "normal" where you live, but here in New Mexico, adobe is a normal building method. StartupCompany, too, had its idiosyncratic processes that were not normal in other projects - and newcomers had to learn about them or pay the price. But StartupCompany had no special services to bring newcomers up to speed.

Adoption Services. Yes, sometimes people are not wanted by their parents, and StartupCompany certainly had some unwanted people. But, they lacked "adoption" services for moving unwanted people around.

Adult Supervisory Care. "Normal" adults can take care of themselves without supervision, and normal workers wouldn't need much managing at all. But StartupCompany had two adults who could not take proper care of themselves, and the managers spent an inordinate amount of time on these two out of a hundred.

I stopped there, sobered by my reading. It was now clear to me that StartupCompany, being a startup, had an overly simplistic picture of what it takes to run a company. I needed an adjuster to adjust my blood pressure - I needed to see that my job as their consultant was to teach them that deviations are normal, and that they (and I) could do what real people do:

• stop whining and deal with them

• create systems to deal with them

• create systems to prevent them

And, of course, I have to do these three things in my own company - like not whining about my blood pressure.

Monday, March 12, 2007

Innocent but Dangerous Language

To be successful as a consultant, you need to pay attention to seemingly innocent language. The computer software field is filled with such language booby traps, but let me introduce the subject by citing a field that might be more familiar to more people—dog training.

My wife, Dani, is an anthropologist by profession, and so naturally is a skilled listener. She's now retired from her anthropology career, but brings all her skills and experience as an anthropologist and management consultant to her new career of behavioral consulting with dog owners and dog trainers. The combination produces many interesting ideas, like what she told me about the way attack dogs are trained. As usual, the big problem with attack dogs is not the dogs, but the people.

When someone hears that a dog is attack trained, chances are about one in three that they'll turn to the dog and command: KILL!

As a joke.

Or just to see what the dog will do.

To protect against this idiotic human behavior, this carelessness with words, attack-dog trainers never use words like "kill" as the attack command. Instead, they use innocent words like "health" that would never be given in a command voice.

This kind of protection is needed because a trained dog is an information processing machine, in some ways very much like a computer with big teeth. A single arbitrary command could mean anything to a dog, depending on how it was trained—or programmed.

This arbitrariness doesn't matter much if it's not an attack dog. The handler might be embarrassed when Rover runs out to fetch a ball on the command ROLL OVER, but nothing much is lost. But if the dog were trained to respond to ROLL OVER by going for the throat, it's an entirely different matter.

Maintenance, or Computers with Teeth


It's the same with computers. Because computers are programmed, and because the meanings of many words in programs are arbitrary, a single mistake can turn a helpful computer into one that can attack and kill an entire enterprise. That's why I've never understood why some of my clients take such a casual attitude toward software maintenance. Time and again, I hear managers explain that maintenance can be done by less intelligent (and cheaper) people, operating without all the formal controls and discipline of development—because it's not so critical. And no amount of argument seems able to convince them differently—until they experience a costly maintenance blunder.

Fortunately (or unfortunately), costly maintenance blunders are rather common, so managers have many lessons, even though the tuition is high. I keep a confidential list of expensive programming errors committed by my clients, and all of the most costly ones are maintenance errors. And almost all of those involve the change of a single digit in a previously operating program.

In all these cases, the change was called, innocently, "trivial," so it was instituted casually by a supervisor telling a low-level maintenance programmer to "change that digit"—with no written instructions, no test plan, nobody to review the change, and, indeed, no controls whatsoever between that one programmer and the day-to-day operations of the organization. It was exactly like having an attack dog trained to respond to KILL—or perhaps HELLO.

Just Change One Line


I've done studies, confirmed by others, about the chances of a maintenance change being done incorrectly, depending on the size of the change. Here's the first part of the table:

     Lines Changed    Chance of Error
           1               .50
           2               .60
           3               .65
           4               .70
           5               .75


Developers are often shocked to see this high rate, for two reasons. In the first place, development changes are simpler than maintenance changes because they are being applied to cleaner, smaller, better-structured code. Usually, the code has not been changed many times in the remote past by unknown hands, so does not contain many unexpected linkages. Such linkages were involved in each of my most costly disasters.

Secondly, the consequences of an erroneous change during development are usually smaller, because the error can be corrected without affecting real operations. Thus, developers don't take that much notice of their errors, and thus tend to underestimate their frequency.

In development, you simply fix errors and go on your merry way. Not so in maintenance, where you must mop up the damage the error caused, then spend countless hours in meetings explaining why it will never happen again—until the next time.

For these two reasons, developers interpret these high rates of maintenance errors as indicative of the ignorance or inexperience of maintenance programmers. But if we continue down the table a few lines, we can see that the cause cannot be either ignorance or inexperience:

     Lines Changed    Chance of Error
          10               .50
          20               .35


The decrease in error rate as the size of the change increases shows that maintenance programmers are perfectly capable of doing better work than their record with small changes seems to indicate. That's because these "trivial" changes are not taken seriously, and so are done carelessly and without controls. How many times have you heard a programmer say, "No problem! All I have to do is change one line"?
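To make those rates concrete, here's a quick expected-value sketch using the table above; the monthly change counts are invented for illustration:

```python
# Error rates from the table above, keyed by lines changed.
p_error = {1: .50, 2: .60, 3: .65, 4: .70, 5: .75, 10: .50, 20: .35}

# A hypothetical month of maintenance: thirty "trivial" one-line
# changes and five ten-line changes.
changes = [1] * 30 + [10] * 5
expected_bad = sum(p_error[n] for n in changes)
print(expected_bad)              # 17.5 expected erroneous changes

# The chance that all thirty one-liners are correct:
print((1 - p_error[1]) ** 30)    # about 9e-10 - effectively zero
```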

Who Coined These Innocent-Sounding Words?


And how many times have you heard these programmers' managers agree with them? Or even encourage them to work "quick and dirty" because "it's only a minor change"?

This carefree attitude would be sensible if "minor" changes were truly minor—if maintenance of a program were actually like maintenance of an apartment building. Not that janitorial maintenance can't be dangerous, but the janitor can assume that changing one washer in the kitchen sink won't incur great risk of causing the building to collapse and bury all the occupants. It's not safe to make the same assumption for a program used every day to run a business, but because we are so free and arbitrary with words, the word "maintenance" has been misappropriated from the one circumstance to the other.

Whoever coined the word "maintenance" for computer programs was as careless and unthinking as the person who trains an attack dog to kill on the command KILL or HELLO. With the wisdom of hindsight, I would suggest that the "maintenance" programmer is more like a brain surgeon than a janitor—because opening up a working system is more like opening up a human brain and replacing a nerve than opening up a sink and replacing a washer. Would maintenance be easier to manage if it were called "software brain surgery"?

Think about it this way. Suppose you had a bad habit—like saying KILL to attack dogs. Would you go to a brain surgeon and say, "Just open up my skull, Doc, and remove that one little habit. And please do a quick and dirty job—it's only a small change! Just a little maintenance job!"?

The Moral


Of course, you as a consultant would never be this careless with language, would you? But when you're called in by a client who's having trouble—like disastrous small maintenance changes—listen to their "innocent" language. It may contain just the clue you need to make one small change and fix the problem.

Oh, wait a minute!