The Anatomy of a Bug
Know Thy Enemy
As software developers, we deal with bugs all the time. We know one when we see one, right? Is there any value, then, in looking at all of the different kinds of information that make up a software bug? Since we spend so much time preventing, identifying, and correcting bugs, I think it is valuable to spend a bit of time thinking about the component parts of a bug. If we do, we may find ourselves talking and reasoning more precisely about how to approach debugging, about the effort that goes into it, and about how to plan our processes and procedures so that, as individuals or teams, we can be more effective at creating bug-free software.
What is a Software Bug?
Bugs generally have the following features:
Identifying a Software Bug
Sighting — The sighting is the event that lets you know that the bug exists; it could be a test failure, a customer report of a problem, a crash, or a hang. The information that is captured when the bug is first sighted is almost always insufficient to know very much about the cause or behavior of the defect itself.
Symptom — The symptom is the specific way the program isn't behaving as expected. I think of it as "the program should do X and instead it does Y." It is more specific than the sighting, because when a program first fails, the person who makes the sighting often isn't watching closely enough to state a clear symptom. It may take two or three attempts before the symptom becomes clear.
Reproducer — This is the set of steps necessary for an arbitrary user to reproduce the symptom with at least some probability. It can include manual inputs and settings, data files or database contents, or configuration details.
Description — The description is the full write-up of the bug. It should include the symptom as well as some articulation of the context in which the symptom can be seen. If a full reproducer is available, including it is ideal. Usually, the more precise the information the better, but in practice you sometimes have to work with less-than-perfect bug descriptions. The description often starts with minimal information and gets more precise as more is learned.
Failure — This is usually the point in the program's execution where it actually does the wrong thing that is observed as the symptom. A program may, for example, crash because it dereferences an invalid memory address. Often it isn't too hard to find the failure part of a bug, but then you start looking for the cause (where did the invalid address come from?).
Cause-Effect Chain — One or more cause-and-effect steps may separate the initial defect in the code from the final failure that leads to the symptom.
Defect — This is the actual mistake in the program itself. It is the cause at the beginning of the effect chain. Sometimes it is a single line, word, or even character. Generally, this can only be determined through a process of analyzing the behavior of the program to find each link in the cause-effect chain.
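To make the distance between defect and failure concrete, here is a minimal Python sketch. Everything in it (the function names `last_index`, `pick_latest`, and `fetch`, and the sample readings) is hypothetical and invented for illustration; it is not taken from any real code base. The defect is a single wrong token in one function, the bad value travels through a second function, and the failure only happens in a third.

```python
import traceback


def last_index(items):
    # Defect: this should be len(items) - 1. One missing token.
    return len(items)


def pick_latest(readings):
    # Cause-effect chain: the bad index is produced here and handed on
    # without anything visibly going wrong yet.
    idx = last_index(readings)
    return fetch(readings, idx)


def fetch(readings, idx):
    # Failure: the invalid index is finally used here, away from the defect.
    return readings[idx]


try:
    pick_latest([20.5, 21.0, 21.4])
except IndexError as exc:
    # Symptom, in "should do X, instead does Y" form.
    symptom = f"expected 21.4, got {type(exc).__name__}"
    # The innermost traceback frame names the failure site, not the defect.
    failure_site = traceback.extract_tb(exc.__traceback__)[-1].name
```

In this sketch the traceback ends in `fetch`, which contains no mistake at all. Getting from there back to `last_index` is the work of following the cause-effect chain one link at a time.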
Resolving a Software Bug
Let's go beyond the defect and look at how we get to the resolution of the bug. Understanding this part of the process is just as valuable for talking precisely and reasoning effectively when we plan and execute our bug reduction activities.
Trigger — Many times defects will exist in the code and simply not cause any effects that are noticeable to the user. The defect may be in a bit of code that is only executed in unusual circumstances, such as a confluence of multiple input values or settings, or in dead code that is never executed at all. The trigger is the set of all the conditions that are necessary for the defect and the effect chain to cause the symptom.
Workaround — Sometimes during the bug analysis process, one or more techniques will be discovered that can prevent the symptom from occurring without actually addressing the defect. Classic examples are restarting or resetting a program before a resource leak reaches the point where it causes termination, or following a constrained series of steps that avoids setting up the trigger conditions. Workarounds may be very helpful in the short term but should never be confused with resolutions.
Resolution — Once the defect is identified, one or more resolutions may be proposed. This could be a one-line change or a refactoring of the entire program.
Verification — These are the steps that can be used to verify that the bug has been resolved. These can be used in turn to inform the creation of a regression test to quickly detect this defect or a similar defect if it is re-introduced at some later date.
Root Cause — The root cause goes back earlier than the defect. What is it about the design, communication, documentation, or software development process that allowed the defect to be introduced in the first place?
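The difference between trigger, workaround, resolution, and verification can be sketched in a few lines of Python. This is only an illustration under assumed requirements: the functions `mean_buggy`, `report_with_workaround`, `mean_fixed`, and `verify` are hypothetical, and the choice to return `None` for an empty input is one possible resolution, not the only reasonable one.

```python
def mean_buggy(values):
    # Defect: silently assumes there is at least one value.
    return sum(values) / len(values)


# Trigger: an empty input. Every non-empty input hides the defect.


def report_with_workaround(values):
    # Workaround: the caller steers around the trigger.
    # The defect is still present in mean_buggy.
    if not values:
        return "no data"
    return f"{mean_buggy(values):.1f}"


def mean_fixed(values):
    # One possible resolution: define the behaviour for the empty case.
    if not values:
        return None
    return sum(values) / len(values)


def verify(mean_fn):
    # Verification: exercise the trigger and a normal case.
    # The same checks can be kept as a regression test.
    try:
        return mean_fn([]) is None and mean_fn([2.0, 4.0]) == 3.0
    except ZeroDivisionError:
        return False


workaround_hides_symptom = report_with_workaround([]) == "no data"
buggy_passes = verify(mean_buggy)   # False: the defect is still there
fixed_passes = verify(mean_fixed)   # True: the trigger no longer fails
```

Note that the workaround makes the symptom disappear for one caller while `verify(mean_buggy)` still fails, which is exactly why a workaround should not be recorded as a resolution.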
Tools for Debugging
As developers, we frequently talk about bugs because, let's face it, writing software is hard and we don’t always get it 100% right the first time. But we often do so in a way that minimizes or directs attention away from the hard and important work of discovering, properly specifying, analyzing, resolving, and testing bugs. If we can agree that the components and artifacts I have outlined above are important and relevant to most or all bugs, then we can start asking ourselves how we can work better as software organizations to tackle bugs. For example, I’ve highlighted the difference between a sighting, a symptom, a reproducer, and a description. Often these get conflated. When we fail to recognize that each of them is important, we can miss bugs. If we insist on a reproducer before we even start talking about a bug, we may miss sightings that are “sporadic” or that occur to a user who is unlikely or ill-equipped to create a careful write-up of a reproducer. That means we can see (sight) bugs but fail to record them, and so fail to resolve them.
By recognizing the difference between these parts of a bug, we can start to ask ourselves “what can we do to make sure that we capture all the sightings?” This might mean creating descriptions that don’t quite rise to the standard of reproducers and having a process in place to monitor these and create more detailed descriptions (including full reproducers) for some or all of the bugs that are sighted.
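One way to picture such a process is a bug record in which only the sighting is mandatory and everything else is filled in later. The following is a rough sketch, not a recommendation for any particular bug tracker; the `BugRecord` class, its field names, the `needs_follow_up` helper, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BugRecord:
    # Only the sighting is required, so a vague report can still be filed.
    sighting: str
    # "Should do X, instead does Y", filled in once it is known.
    symptom: Optional[str] = None
    # Steps to reproduce, filled in once they are known.
    reproducer: List[str] = field(default_factory=list)

    def has_reproducer(self) -> bool:
        return bool(self.reproducer)


def needs_follow_up(records: List[BugRecord]) -> List[BugRecord]:
    # Sightings that have been recorded but not yet turned into reproducers.
    return [r for r in records if not r.has_reproducer()]


records = [
    BugRecord(sighting="Customer says export 'sometimes hangs'"),
    BugRecord(
        sighting="Nightly test test_export failed",
        symptom="Export should write 3 rows and instead writes 2",
        reproducer=["load sample.csv", "click Export", "count rows in output"],
    ),
]

backlog = needs_follow_up(records)
```

Here the sporadic customer report is kept rather than rejected for lacking a reproducer, and it shows up in `backlog` as something to monitor and refine.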
Similarly, it's important to distinguish between the failure and the defect, and to recognize the value of workarounds while keeping clear the distinction between a workaround and a resolution. As software programmers and engineers, we have a responsibility to identify and resolve bugs in the software we create. This is a non-trivial task and one that is worth approaching methodically.
What do you think about this taxonomy?
Do the bugs you deal with resemble what I am describing?
Do these features help you see any way your tech support, quality assurance, and software development processes could be improved?
One option may be to start using a debugging tool, like TotalView for HPC. Designed for high-performance computing (HPC) environments, TotalView provides powerful functionality to make debugging as easy as possible. Correct bugs, memory issues, and crashes in your high-scale parallel and multicore C, C++, and Fortran applications. With TotalView, you get unparalleled visibility into running programs, unmatched control over thread states, and a unique conceptual view to aid analysis.