Comments on Gerald Weinberg's Secrets of Writing and Consulting: My most insidious bug
Feed author: Gerald M. Weinberg (http://www.blogger.com/profile/05902673055244863609)
Updated: 2024-03-25T22:12:49.064-07:00
8 comments

2018-02-21T20:50:48.026-08:00
Great example, Jim. A lesson for when we hear or read "that could never happen" in a program. If it can't happen, then it wouldn't hurt anything to add a clear error message. Also when you see or hear "guaranteed."
Gerald M. Weinberg (https://www.blogger.com/profile/05902673055244863609)

2018-02-21T14:55:21.872-08:00
Net:
The most insidious one was this ginormous, very active data-wrangling system that started intermittently producing inconsistent results. It turned out to be a consequence of a silent, unmanaged dependency between how one component used a resource and how another supplied it.

It was insidious because that played out in almost fractal-like recursive complexity.

The bad returned result - fault - turned out to be the correct output from a persisted data index error, created by a data indexing failure, produced by a(n index data page) cache coherency failure, itself caused by a (lock pool / lock management) cache coherency failure, produced by (silently) dropped lock management messages between nodes, undetected because the implementation (knowingly, "as designed") dropped reported "enqueue" failures on the floor because "that could never happen", except nobody knew about the install-time check that "guaranteed" it (demonstrably untrue).

There were three other ways the assumption that lock requests would always enqueue could get blown, in addition to **the standard maintenance instructions from the vendor** to "bump up the lock pool size" under many common run-time situations.

Politically, the DBMS wants its lock pool to work one way, the platform implements slightly different semantics, and the poor schmuck doing the port doesn't have the resources to reconcile the two. And this is what you get.
Jim Bullock (https://www.blogger.com/profile/02824981079885202606)

2018-02-15T15:22:25.134-08:00
Windchill's story is a familiar one to me, from the old, old days of vacuum tubes and relays. A little gas in a tube might make it behave differently at different temperatures.
Or, relays would sometimes stick, but work the next time. I remember one case where the field engineer said we were having "intermittent errors." That meant he couldn't solve them, but further investigation showed that someone had removed one of the vacuum tubes. "Intermittent" meant "only when the program tries to execute an instruction that uses that missing tube." But it took us at least a week to find the empty socket.
Gerald M. Weinberg (https://www.blogger.com/profile/05902673055244863609)

2018-02-15T14:31:30.990-08:00
Working on an embedded system in the '80s, we had an error in pointer arithmetic that would change a branch instruction. The program would run normally until the affected if-then quit working. Rebooting would fix it for a while and then it would start acting up again. We found it using an HP logic analyzer on a running system at a factory in a foreign country.
windchill (https://www.blogger.com/profile/06073174669836699575)

2018-02-13T21:15:38.713-08:00
Discovering - the hard way - that the VAX uninterruptible doubly linked list enqueue instruction wasn’t.

Required a VAX 11/780 microcode patch to make our training simulator software run for more than a few hours without crashing.

My friend R. had the most **embarrassing** bug. He wrote a paging data gathering probe on Brown’s 360/67 which did some work then executed a floating point add every second. Unfortunately, he didn’t remember that CP/67 did not save and restore floating point registers across the interrupt service routine used by the probe.

So for about six months, floating point register 0 was clobbered about once a second.
The physics folk started to complain about strange results, but no-one believed them - including R., who was the senior systems programmer on the Comp Lab staff.

All the standalone hardware diagnostics ran perfectly. Eventually they gathered enough evidence to call in a senior IBM SE and schedule downtime for a full weekend of testing. R. was reviewing results when the penny dropped, and he realized what had been happening.

He fessed up and volunteered to write the memo announcing the resolution of the problem. He suggested that physics folk shooting for a Nobel prize based on anything run over the past six months try again.
Greg (https://www.blogger.com/profile/08257554085677136566)

2018-02-13T20:30:46.336-08:00
I agree that requirements errors are typically hardest, largely because finding them requires accurate human communication, which is much harder than accurate coding.
Gerald M. Weinberg (https://www.blogger.com/profile/05902673055244863609)

2018-02-13T19:59:35.535-08:00
Gerald - Looks like requirements errors are the hardest ones to catch.
LN Mishra (https://www.blogger.com/profile/03164810449200435832)

2018-02-13T16:33:56.400-08:00
I had a stone-age era hardware problem. A spanking new-tech "smart" graphics workstation needed a device driver... that was my job. The bug? Certain commands took far longer to process than others, and much longer than spec'ed.
Commands and data arriving when processing was making the machine too busy were simply ignored and discarded, and processing could resume at an arbitrary point in the incoming command and data stream. Debugging slowed everything down enough that the command stream was fully processed. Running things normally triggered the problem. That one took a while to figure out.

Then there was a shared memory bug in VMS... different processes would see the same memory differently. That one required an OS patch and some carefully crafted sample code to prove the problem was DEC's, not mine.
Hardware Junkie (https://www.blogger.com/profile/13596159685015426241)