Gerald Weinberg's Secrets of Writing and Consulting: My most insidious bug

Tuesday, February 13, 2018

My most insidious bug

I was asked, "As a coder, what is the most insidious bug you have ever come across, and how did you find it?"

It’s really hard to pick one error out of hundreds I’ve encountered in my long career, but some of the toughest have been:

compiler errors, where the compiler has created object code incorrectly. We usually found these by hacking around, changing the source code to express the program in different ways, or by examining the object code the compiler had produced.

hardware errors, both from the failure of a component and an actual design error in the hardware. Such errors are not as frequent today as they were in the old days of vacuum tubes (or relays), but in a way that infrequency makes them all the more difficult when they do occur, because we have so little experience with them.

requirements errors, where the program has actually solved the wrong problem. These errors can usually be found only after users have been in contact with the code for some time, and only when there is some communication channel between the users and the programmers.

So, what were your most insidious errors?

You can read more about errors and their consequences in

Errors: Bugs, Boo-Boos, and Blunders (https://leanpub.com/errors)

8 comments:

Hardware Junkie said...: I had a stone-age era hardware problem. A spanking new-tech "smart" graphics workstation needed a device driver... that was my job. The bug? Certain commands took far longer to process than others, and much longer than as spec'ed. Commands and data arriving when processing was making the machine too busy were simply ignored and discarded, and processing could resume at an arbitrary point in the incoming command and data stream. Debugging slowed everything down enough such that the command stream was fully processed. Running things normally triggered the problem. That one took a while to figure out.

Then there was a shared memory bug in VMS... different processes would see the same memory differently. That one required an OS patch and some carefully crafted sample code to prove the problem was DEC's, not mine.; Tuesday, 13 February, 2018
LN Mishra said...: Gerald - Looks like requirements errors are the hardest ones to catch.; Tuesday, 13 February, 2018
Gerald M. Weinberg said...: I agree that requirements errors are typically hardest, largely because finding them requires accurate human communication, which is much harder than accurate coding.; Tuesday, 13 February, 2018
Greg said...: Discovering - the hard way - that the VAX uninterruptible doubly linked list enqueue instruction wasn’t.

Required a VAX 11/780 microcode patch to make our training simulator software run for more than a few hours without crashing.

My friend R. had the most embarrassing bug. He wrote a paging data gathering probe on Brown’s 360/67 which did some work then executed a floating point add every second. Unfortunately, he didn’t remember that CP/67 did not save and restore floating point registers across the interrupt service routine used by the probe.

So for about six months, floating point register 0 was clobbered about once a second. The physics folk started to complain about strange results, but no-one believed them - including R. who was the senior systems programmer on the Comp Lab staff.

All the standalone hardware diagnostics ran perfectly. Eventually they gathered enough evidence to call in a senior IBM SE and schedule downtime for a full weekend of testing. R. was reviewing results when the penny dropped, and he realized what had been happening.

He fessed up and volunteered to write the memo announcing the resolution of the problem. He suggested that physics folk shooting for a Nobel prize based on anything run over the past six months try again.; Tuesday, 13 February, 2018
windchill said...: Working on an embedded system in the '80s, we had an error in pointer arithmetic that would change a branch instruction. The program would run normally until the affected if-then would quit working. Rebooting would fix it for a while and then it would start acting up again. We found it using an HP logic analyzer on a running system at a factory in a foreign country.; Thursday, 15 February, 2018
Gerald M. Weinberg said...: Windchill's story is a familiar one to me, from the old, old days of vacuum tubes and relays. A little gas in a tube might make it behave differently at different temperatures. Or, relays would sometimes stick, but work the next time. I remember one case where the field engineer said we were having "intermittent errors." That meant he couldn't solve them, but further investigation showed that someone had removed one of the vacuum tubes. "Intermittent" meant "only when the program tries to execute an instruction that uses that missing tube." But it took us at least a week to find the empty socket.; Thursday, 15 February, 2018
Jim Bullock said...: Net:
The most insidious one was this ginormous, very active data-wrangling system that started intermittently producing inconsistent results. It turned out to be a consequence of a silent, unmanaged dependency between how one component used a resource and how another supplied it.

It was insidious because that played out in almost fractal-like recursive complexity.

The bad returned result - fault - turned out to be the correct output from a persisted data index error, created by a data indexing failure, produced by a(n index data page) cache coherency failure, itself caused by a (lock pool / lock management) cache coherency failure, produced by (silently) dropped lock management messages between nodes, undetected because the implementation (knowingly, "as designed") dropped reported "enqueue" failures on the floor because "that could never happen", except nobody knew about the install-time check that "guaranteed" (demonstrably untrue.)

There were three other ways the assumption that lock requests would always enqueue could get blown, in addition to **the standard maintenance instructions from the vendor** to "bump up the lock pool size" under many, common, run-time situations.

Politically, the DBMS wants its lock pool to work one way, the platform implements slightly different semantics, and the poor schmuck doing the port doesn't have the resources to reconcile the two. And this is what you get.; Wednesday, 21 February, 2018
Gerald M. Weinberg said...: Great example, Jim. A lesson when we hear or read "that could never happen" in a program. If it can't happen, then it wouldn't hurt anything to add a clear error message. Also when you see or hear "guaranteed."; Wednesday, 21 February, 2018
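
The lesson in that last exchange can be sketched in a few lines of Python. This is only an illustration, not the system Jim describes: `LockPool`, `LockEnqueueError`, and `request_lock` are hypothetical names invented here, and the pool is a toy with a fixed number of slots. The point is simply that the "could never happen" branch raises a clear error instead of dropping the failure on the floor.

```python
class LockPool:
    """Hypothetical fixed-size lock pool, for illustration only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []

    def enqueue(self, request):
        """Return True if the request was queued, False if the pool is full."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(request)
        return True


class LockEnqueueError(RuntimeError):
    """Raised when a lock request could not be enqueued."""


def request_lock(pool, request):
    # The "that could never happen" branch: report it loudly
    # rather than silently ignoring the failed enqueue.
    if not pool.enqueue(request):
        raise LockEnqueueError(
            f"lock request {request!r} was not enqueued: "
            f"pool is full ({pool.capacity} slots)"
        )
```

With a pool of one slot, a second call to `request_lock` raises `LockEnqueueError` naming the request and the pool size. If the branch truly can never be reached, the check costs nothing; if it can, the message points straight at the dropped request.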