The 0th rule of bug assessment

Larry Osterman’s WebLog at Microsoft refers to an article by Eric Sink called My Life as a Code Economist. Sink’s thesis (endorsed by Osterman) is that it’s not always worth correcting bugs, because it may be cheaper or safer to leave them uncorrected.

Sink constructs a beautiful matrix of characteristics that determine whether you should try to correct a bug.

  1. Severity – When this bug happens, how bad is the impact?
  2. Frequency – How often does this bug happen?
  3. Cost – How much effort would be required to fix this bug?
  4. Risk – What is the risk of fixing this bug?

In good business school style, you plot each bug on a four-dimensional chart and use the chart to determine which bugs you should correct and which you shouldn’t.

But the most important question of all is missing:

Question 0. Do you know exactly what is causing the bug, and how?

If the answer is No, do not rest until the answer is Yes.
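One way to picture Question 0 is as a gate that stands in front of Sink's four characteristics. The sketch below is only an illustration: the `Bug` class, the 1-to-5 scales and the scoring formula are all assumptions made up for this example, not anything Sink or Osterman proposes.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    severity: int      # 1 (trivial) .. 5 (catastrophic); hypothetical scale
    frequency: int     # 1 (rare) .. 5 (constant); hypothetical scale
    cost: int          # 1 (cheap to fix) .. 5 (expensive); hypothetical scale
    risk: int          # 1 (safe to fix) .. 5 (dangerous); hypothetical scale
    cause_known: bool  # Question 0: do we know exactly what causes it, and how?


def triage(bug: Bug) -> str:
    # Question 0 comes first. Without a known cause, severity and
    # frequency describe only the visible symptom, so the four-way
    # trade-off cannot be trusted yet.
    if not bug.cause_known:
        return "investigate"
    # Illustrative scoring only: weigh impact against effort and danger.
    benefit = bug.severity * bug.frequency
    expense = bug.cost * bug.risk
    return "fix" if benefit >= expense else "defer"


pixel = Bug("wrong shade of gray on splash screen", 1, 5, 1, 1, cause_known=False)
print(triage(pixel))
```

The point of the gate is that the economic comparison never runs on a bug whose cause is still a mystery.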

Here’s a really, really low-severity bug according to Sink: “One of the pixels on the splash screen is the wrong shade of gray.” Too trivial to think about, except… why is it wrong? Perhaps someone clicked the wrong button when designing it. In that case the bug is genuinely unimportant. But suppose that you open up Photoshop and find that the splash screen picture is OK. In that case something is modifying the splash screen after it’s loaded from the disk but before it’s displayed on the screen. Some part of your program is modifying a byte in memory that it shouldn’t be modifying. In that case you have been extraordinarily lucky. A bug that modifies an arbitrary byte could in principle modify anything, anywhere. It could modify a data structure deep inside your program that could make large chunks of user data inaccessible and you wouldn’t discover the damage for months. You are lucky because right now this bug happens to be modifying something that you can see on the screen. You can notice it and reproduce it now; and having reproduced it, you can kill it.
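A minimal sketch of how the "picture is fine on disk but wrong on screen" case might be confirmed: record a checksum when the image is loaded and compare it just before display. Everything here is hypothetical, including the 16-byte stand-in for the image and the helper names `digest` and `check_unchanged`; Python is also not a language in which stray memory writes normally occur, so the corruption is simulated by hand.

```python
import hashlib


def digest(data: bytes) -> str:
    # Fingerprint of the buffer contents at a given moment.
    return hashlib.sha256(data).hexdigest()


def check_unchanged(buffer: bytearray, expected: str) -> bool:
    # True if the buffer still matches the fingerprint taken at load time.
    return digest(bytes(buffer)) == expected


# Stand-in for the splash image as read from disk (hypothetical data).
splash = bytearray([128] * 16)
loaded_digest = digest(bytes(splash))

# Simulate a stray write from elsewhere in the program: one pixel
# becomes a slightly different shade of gray.
splash[5] = 127

if not check_unchanged(splash, loaded_digest):
    print("splash buffer was modified in memory after loading")
```

A check like this only tells you that the buffer changed, not who changed it; finding the culprit is still the real work.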

In the late 1980s there was a bug in one release of Cardbox-Plus that manifested itself only when you were using a template to create mail-merged letters and the last letter of the template was a lower-case letter ‘e’ – in that case, one character might be lost in the mail-merged letter. We could have applied Sink’s rules and ignored the bug. But we hunted for the underlying bug and it turned out that in other circumstances it could have had much more serious effects.

The 0th law of bugs is that most bugs, most of the time, do their damage invisibly.

It follows that Rule 0 of quality control is: if you see something go wrong, drop everything and find the underlying bug. If the problem disappears before the bug is found (for instance, if a modification elsewhere in the program makes it go away), then panic.

What about Sink’s rules in general? He has to make sure he’s got his economics right. It’s easy to forget that software lasts for decades and the cost of correcting a bug is incurred only once, while the cost of giving users extra support, and explaining to them what is wrong, and helping them out of trouble, is incurred over and over and over again.
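The one-off versus recurring comparison can be made concrete with a few lines of arithmetic. The figures below (a 40-hour fix, 10 support incidents a month at half an hour each, a ten-year product life) are invented purely for illustration, and the function names are hypothetical.

```python
import math


def cumulative_support_hours(incidents_per_month, hours_per_incident, months):
    # Recurring cost: paid again every month the bug stays in the product.
    return incidents_per_month * hours_per_incident * months


def break_even_month(fix_hours, incidents_per_month, hours_per_incident):
    # First month in which accumulated support time reaches the one-off fix cost.
    monthly_hours = incidents_per_month * hours_per_incident
    return math.ceil(fix_hours / monthly_hours)


# Hypothetical figures, chosen only to show the shape of the calculation.
fix_hours = 40
incidents_per_month = 10
hours_per_incident = 0.5

print(break_even_month(fix_hours, incidents_per_month, hours_per_incident))
print(cumulative_support_hours(incidents_per_month, hours_per_incident, 120))
```

With these made-up numbers the fix pays for itself after 8 months, and leaving the bug in place for ten years costs 600 support hours against 40 hours of fixing.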

Sink’s question 4 mentions the risks arising from correcting bugs. No doubt he’s right. But if your program code is so fragile that the smallest correction will make it collapse then you should knock it down and rebuild it now, before a mouse making its nest in the rafters suddenly makes the whole thing collapse in dust and rubble.


One Response to “The 0th rule of bug assessment”

  1. Allen Moore Says:

    I’ve experienced the wisdom of this rule firsthand while working on an embedded system. One day I noticed two garbled characters in some diagnostic text output that I rarely looked at anymore. Obviously something was overwriting those strings in RAM. I spent some time looking for the problem but a major deadline was looming and there were still features to add. The instant I modified the code, the bug disappeared, only to re-emerge–somewhere else. For the next several months, mysterious faults plagued the system. Features that worked flawlessly became problematic. It was frustrating and embarrassing. I finally was able to go back and nail the bug, but oh the heartache it had caused. Never again!
