Some bugs in software are extremely hard to track down. One keeps trying to “kill” them and fail for a long time. Much like the guys in this video.
The most difficult bugs in the world can take months to track down and isolate. While not always deadly, Sometimes even when the bug location is known (like George Bush in Washington DC) the fix itself can be extremely difficult.
Bugs are hard to solve when they are non deterministic, non reproducible or out of control. They happen only in production system, or only during load, only at certain places, or just on the customer desktop.
Here are few examples of elusive software bugs, their characteristics and how to turn them into butterflies of code .
Bugs in other people code
We just found a bug in the way Firefox loads plug-ins. It appears that Firefox tries to cache a plug-in DLL so it does not have to load it many times. While the feature is useful there is a flaw in the caching mechanisms that causes Java plug-in to load instead of our plug-in for our object.
Why was it hard to find ? The bug only happens at very specific conditions, and not all the time. It only happens on Firefox 3.6.3. It only happens when the Plug-In uses the Object HTML tag and not the “SRC” attribute. It only happens when Java and our Plug-In are loaded in the same page, in a certain order …
What did we do ? we (which means Leeor) compiled Firefox and ran the source code with a debugger . Thanks god for Open Source. If we had the same issue in IE, we would not have had much of a chance to solve it. We submitted the bug to Firefox,but since we needed a solution right away, we were able to find a workaround, by finding the root cause.
Why was it hard to find ? Memory leaks are extremely hard to find because they tend to be non deterministic. When the memory of the process just keeps growing, it is relatively easy to find the problems source. But in many operating systems the memory management and garbage collection have become so sophisticated, it is not trivial to know if there is a leak or not. Memory goes up and down , or just stays still.
In theory, memory management problems were supposed to go away in Java,C# and Python. While most of them have, the ones that remain are the hardest one to solve. For a while we kept hunting an “Out Of memory’ problem in IronPython. In the good (?) old days of C++ we could have used a memory profiler to locate our lost memory chunks. In IronPython this is next to impossible, since the .Net object are so mangled it is not possible to correlate them to original language objects. therefore, traditional tools like Quantify have little value.
What did we do ? Sometime the best resolution is to write a custom memory management library. We used this trick in Check Point when we needed to debug memory leaks in the kernel, where no standard tool works. This approach works especially well when this infrastructure is written form the ground up.
A similar approach can be used in Python, but the performance implications are too hard to run it in production. Unfortunately , the memory leaks only happens in production …
If the problem can be reproduced with unit tests, life is a bit better. One innovative idea that Idan came up with is Binary Search over the code. Since we moved to GIT we can now change the past retroactively. In other words, we perform a binary search on the code, to track down the line of code in which the memory leak started. We “Pretend” the unit test was written in the past and run it against the old branch. Using binary search we can locate the exact commit in which the problem arose.
Another option is to look for the usual suspects – unmanaged code .In one case we used NetApp SDK that was written in C++ in our .Net code. It took three iterations to resolve all the memory leaks caused by their library. Pretty much a trail an error process.