TL;DR: How do you find out whether your favorite deep learning library is occasionally giving you wrong results? Such bugs happen from time to time, and they are extremely difficult to notice, report, and debug.
Three years ago, I wrote an article Unawareness of Deep Learning Mistakes: buggy code can still train and appear to work, so it's difficult for users to realize that their code is wrong.
What is apparently even more difficult to find out is when the bug comes from the deep learning library we use. Imagine: what if the library computes wrong results for certain parts of our model during training? Training will probably still work to some extent thanks to the magic of SGD, so how could we ever possibly find out about such bugs? I'll share some experience and lessons.
"Bugs" in this article specifically refer to silent bugs that lead to wrong computation results,but no errors.
Such bugs exist in deep learning libraries and will continue to exist, because these libraries are young, and new features such as operators and training paradigms will continue to emerge in them as research develops.
Such bugs in deep learning are very hard to notice. A model typically contains billions of floating point operations (FLOPs) grouped into hundreds of operators. Even with small bugs, it may still train, converge, and appear to work well. Maybe it works slightly worse, or it fails occasionally, but it's extremely difficult for a user to associate a suspicious result with a concrete bug in a library. After all, there are many other explanations for a bad result that need to be ruled out: the model simply does not work; the model implementation is incorrect; the hyperparameters are bad; there is a bug in the user's training code; etc.
The situation gets worse when the buggy part of the computation is not even explicitly written by users, but implicitly generated. Auto-generated computation such as automatic differentiation and graph optimization is often not well exposed to users at all, which makes such bugs even harder to observe. For example, pytorch/5801 is a bug in gradient computation found during the development of ELF OpenGO at FAIR. Models can still work to some extent despite the bug, which hid it for a long time and unfortunately wasted many months of the project.
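This class of bugs is also where routine numerical checks help the most: for any hand-written operator, the analytical gradient can be compared against finite differences. A minimal sketch in PyTorch using the built-in torch.autograd.gradcheck (the operator shown is just a stand-in):

```python
import torch

# gradcheck compares autograd's gradient with a finite-difference estimate.
# Double precision keeps the numerical estimate accurate enough to be meaningful.
x = torch.randn(8, 5, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(
    lambda t: torch.nn.functional.log_softmax(t, dim=1), (x,)
)
```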
PyTorch has a "silent correctness" issue label, which shows many bugs of this kind. Most of these issues are also labeled as "high priority", which says a lot about the severity of such bugs.
Compared to users' training code, which may also have many silent bugs, deep learning libraries have some advantage in testability: they provide well-defined small building blocks (e.g. operators and their gradients), so they are more testable than an end-to-end training. But I've seen a few limitations of unittests in the context of deep learning:
A test only covers a tiny input space, but other inputs may cause bugs.
As an example, pytorch/36485 computes softmax incorrectly only when the number of classes C satisfies (C > 1024) && (C % 4 != 0), which is rare in real applications. It was found in the development of MoCo, which uses 65537 classes: after noticing a regression in the model's accuracy, the root cause was tracked down by bisection.
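In hindsight, a regression test at such "unusual" shapes is cheap. A sketch of the kind of check that could catch this class of bug, comparing against a float64 reference (the shape and tolerance here are assumptions, not the test PyTorch actually added):

```python
import torch

# The bug only appeared for C > 1024 with C % 4 != 0, e.g. MoCo's 65537 classes.
x = torch.randn(4, 65537, device="cuda")
out = torch.softmax(x, dim=1)
ref = torch.softmax(x.double(), dim=1)  # float64 reference on the same data
assert torch.allclose(out.double(), ref, atol=1e-5)
```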
Behaviors under combinations of context are hard to test exhaustively.
Deep learning libraries usually separate the definition of computation from its execution. As a result, a computation may run under different combinations of runtime context: graph/eager mode (TensorFlow), eager/tracing/scripting mode (PyTorch), eager/jit/pjit mode (JAX), fusion with other computations, the device to run on, the level of parallelism to use, the underlying compute library and algorithm to choose from, etc. Unittests are often insufficient to cover such a huge space.
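A cheap, if partial, defense that users can apply on their side is to run the same module under more than one execution context and compare the results. A sketch for PyTorch's eager vs. scripting modes (the module is a hypothetical stand-in for whatever is under test):

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    """Stand-in for whatever module is under test."""
    def forward(self, x):
        return torch.softmax(x.relu(), dim=-1)

model = TinyModel().eval()
scripted = torch.jit.script(model)   # same module, different execution context
x = torch.randn(16, 1000)
with torch.no_grad():
    assert torch.allclose(model(x), scripted(x))
```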
This issue gets worse in higher-level interfaces (e.g. Keras). TensorFlow is well known for its many high-level ways to do the same thing: users can write a model under graph or eager mode, using either object-oriented or functional style, with either raw TF APIs or the Keras/Estimator interface, and Keras has many more modes within itself. Handling these combinations is more challenging, because a high-level component has much richer semantics (therefore more side effects) that are often not strictly defined and are harder to test than a pure-math operator.
For example, tensorflow/25175 and tensorflow/40638 are two silent bugs in Keras that cause models to not train properly. Both are due to unconventional combinations in the ways TensorFlow and Keras interact with each other.
Concurrency bugs that happen nondeterministically.
Deep learning software and hardware stacks are highly parallel by design, which leaves room for concurrency bugs. Concurrency bugs such as race conditions may only happen with certain programs or hardware, or may not be reproducible at all. They are difficult to notice, report, and debug.
As an example, pytorch/18465 is a use-after-free concurrency bug I found. The only symptom I observed was that some tensor values in my model were unexpectedly modified. Drawing any conclusions beyond that was challenging, because any simplification I applied to the model could make the bug disappear. Many hours were put into tracking it down and reproducing it with a minimal example. And there is little chance that a unittest can guard against such bugs.
I'll share the stories of two more silent bugs that I found in TensorFlow and PyTorch, where both compute wrong gradients for some operators. Both bugs stayed in the codebase for more than a year, presumably because users can hardly blame bad training on wrong gradients rather than on their own models.
nn.SyncBatchNorm
Notice the bug
I started to try out PyTorch's nn.SyncBatchNorm in the summer of 2019 due to the need for this layer in the MoCo project. To gain some trust in this layer (I knew that BatchNorm is often implemented wrong; see this later paper of mine), the first thing I did was to try it on some baselines I'm familiar with: Mask R-CNN in detectron2.
Luckily, this was before TensorFlow introduced the next bug I would find later. So when I compared it with my TensorFlow implementation of Mask R-CNN, which also supports SyncBatchNorm, I could see that most results in detectron2 were a few AP (average precision) worse.
I knew every detail of the two implementations since I wrote both of them, and their gap is negligible when not using SyncBatchNorm. So I was relatively confident that such a large gap was a library bug in PyTorch.
Confirm the bug
Next, we decided to just reimplement a correct SyncBatchNorm. It turned out to be quite easy, and this was later released in detectron2. Comparing the results of the two implementations further confirmed that the bug is related to nn.SyncBatchNorm.
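The idea behind such a reimplementation is simple: synchronize the batch statistics with an all-reduce, then apply ordinary normalization, so every step is built from well-tested primitives. Below is a minimal sketch in that spirit (not detectron2's exact code; it assumes torch.distributed is already initialized, NCHW inputs, equal batch sizes across workers, and it omits the running-stats update):

```python
import torch
import torch.distributed as dist
from torch import nn

class _AllReduceSum(torch.autograd.Function):
    """Sum across workers; also sum the gradient in backward so autograd stays correct."""
    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad

class NaiveSyncBatchNorm(nn.BatchNorm2d):
    def forward(self, x):
        if not self.training:
            return super().forward(x)
        mean = x.mean(dim=[0, 2, 3])
        meansqr = (x * x).mean(dim=[0, 2, 3])
        # Average per-GPU statistics across all workers (assumes equal batch sizes).
        stats = torch.cat([mean, meansqr]) / dist.get_world_size()
        mean, meansqr = torch.split(_AllReduceSum.apply(stats), x.shape[1])
        var = meansqr - mean * mean
        scale = self.weight * torch.rsqrt(var + self.eps)
        shift = self.bias - mean * scale
        # Running-stats update omitted for brevity.
        return x * scale.reshape(1, -1, 1, 1) + shift.reshape(1, -1, 1, 1)
```

Because the only cross-worker operation is a plain sum over a small vector of statistics, the result is easy to reason about and to compare against nn.SyncBatchNorm.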
Narrow down the bug
From experiments on various models, I noticed that suboptimal results only appear if SyncBN is added to Mask R-CNN's mask head; adding it to all other components is OK. I therefore hypothesized that the computation results are wrong when the batch size differs across workers, since that's where the mask head differs from the other components. This hypothesis can be verified quite easily. After sharing our findings with the code owner, the root cause in gradient computation was fixed.
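A sketch of such a verification, as a hypothetical standalone script (say check_syncbn.py, launched with torchrun --nproc_per_node=2 check_syncbn.py): give each worker a different batch size, then compare nn.SyncBatchNorm's outputs and input gradients against a plain BatchNorm2d applied to the concatenated global batch, which is what a correct SyncBN should be equivalent to.

```python
import torch
import torch.distributed as dist
from torch import nn

def make_input(rank):
    # Deterministic per-rank input so every rank can rebuild the global batch locally.
    g = torch.Generator().manual_seed(rank)
    return torch.randn([2, 5, 3][rank % 3], 8, 7, 7, generator=g)

def main():
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    # SyncBatchNorm on this rank's (uneven) batch.
    x = make_input(rank).cuda().requires_grad_()
    out = nn.SyncBatchNorm(8).cuda()(x)
    out.sum().backward()

    # Reference: plain BatchNorm2d over the whole global batch on one device.
    gx = torch.cat([make_input(r) for r in range(world)]).cuda().requires_grad_()
    ref_out = nn.BatchNorm2d(8).cuda()(gx)
    ref_out.sum().backward()

    # Compare this rank's slice of the reference outputs and input gradients.
    start = sum(make_input(r).shape[0] for r in range(rank))
    sl = slice(start, start + x.shape[0])
    print(rank,
          torch.allclose(out, ref_out[sl], atol=1e-5),
          torch.allclose(x.grad, gx.grad[sl], atol=1e-5))

if __name__ == "__main__":
    main()
```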
nccl_ops.all_sum
NCCL is widely used to reduce gradients among GPUs. However, it turns out that TensorFlow can do it wrong sometimes. This bug may affect all NCCL-based multi-GPU data-parallel training. Interestingly, it also affects SyncBatchNorm in TensorFlow if using NCCL.
Notice the bug
In the summer of 2020 I gave TF v1.15 a try. I planned to just do some basic benchmarks of my code, but a few Mask R-CNN trainings blew up with NaNs after 10~20 minutes of training. This had not happened before.
Confirm the bug
My first thought was that I had broken my Mask R-CNN implementation at some commit. But after trying a few combinations of code versions, it became clear that TensorFlow was to blame, because the same code could train in TF v1.14, even when I made sure both used identical versions of CUDA/cuDNN.
Narrow down the bug
I knew that no one on the TF team would use my entire training code to debug, so I had to narrow it down myself. But this was never easy, because wrong results in any step of the whole training system can lead to NaNs, and there was nowhere to start looking. Moreover, the bug does not happen deterministically, and when I tried to simplify my code, it started to happen less frequently.
Luckily, there was still a painful but practical way to go: bisection between the two TensorFlow versions, which eventually pointed to an offending commit.
Unfortunately, the offending commit looked correct to me. This means that the commit, which increases parallelism in NCCL, probably triggers a bug that dates back even earlier.
Further narrow down the bug
After playing with the offending commit a bit, given the non-deterministic behavior of the bug and the content of the commit, my hypothesis was that the way TensorFlow uses NCCL contains a concurrency bug.
My original code only uses NCCL's all_sum to all-reduce gradients. To add a simple check of its results, I used tf.add_n to all-reduce the gradients again, and added tf.debugging.Assert to ensure that the two results match. Unsurprisingly, the results don't always match: a large discrepancy appears once in a while between the results of tf.add_n and nccl_ops.all_sum.
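The check itself fits in a few lines. A sketch in TF1-style graph code (not my actual training code; the helper name and tolerance are assumptions):

```python
import tensorflow as tf
from tensorflow.python.ops import nccl_ops  # TF 1.x

def crosschecked_all_sum(per_gpu_grads):
    """per_gpu_grads: one gradient tensor per GPU, each placed on its own device."""
    nccl_sums = nccl_ops.all_sum(per_gpu_grads)   # NCCL result: one summed copy per GPU
    ref_sum = tf.add_n(per_gpu_grads)             # reference sum, computed without NCCL
    checked = []
    for s in nccl_sums:
        ok = tf.reduce_all(tf.abs(s - ref_sum) < 1e-4)
        assert_op = tf.debugging.Assert(ok, [s, ref_sum], summarize=10)
        with tf.control_dependencies([assert_op]):
            checked.append(tf.identity(s))        # use the checked tensors downstream
    return checked
```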
This is where the heavy lifting ended: I had turned the silent training bug into an obvious error. The bug is no longer about a failed training that "I think" should succeed, but about something that's obviously wrong in TensorFlow: we added tensors in two different ways and the results don't match! No one is obligated to trust the correctness of my training code, but everyone has to admit that nccl_ops.all_sum and tf.add_n must not produce different results.
The rest was easy: I started to simplify my training code for a better understanding of the bug, removed all dependencies, and eventually made a small, self-contained reproduction script and reported a bug. Beyond that, it was no longer my responsibility.
Summarizing from my own experience, the following are important to fight silent bugs in deep learning libraries:
Reproducing known results is the only way to discover silent bugs in model training. This is how we have an "expected output", so that we can notice if anything unexpected is happening.
Narrowing down is necessary, at least in the open-source environment. Unless a small enough piece of code clearly demonstrates a bug in the library, it's not the library owners' responsibility to understand and debug user code. After all, a bug often lives in user code rather than in the library. The general guidelines on how to ask good questions and write good bug reports apply to deep learning as well.
Bisection is slow and costly, but effective. When there are no obvious clues and its cost is affordable, do a bisection (see the sketch after this list). If anything can beat bisection, it would be a trisection or k-section to reduce its latency, because verifying whether a commit works may require training a model for quite a while.
Bisection is not always applicable. If there isn't a good historical version to use as a reference, other more creative debugging methods will be needed.
Know the library well and understand its internals, so we can make reasonable hypotheses and investigate them. It's often helpful to dig into library code: a few lines of debugging code at the right place can provide valuable information that cannot be easily obtained from user code.
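For completeness, here is a sketch of how a bisection can be semi-automated with git bisect run; every name in it (the build script, the short training job, the NaN check) is a hypothetical placeholder for project-specific steps:

```python
# bisect_check.py -- used as: git bisect run python bisect_check.py
# git bisect interprets exit codes: 0 = good, 1..124 = bad, 125 = skip this commit.
import subprocess
import sys

# Rebuild the library at the current commit (slow and project-specific; placeholder name).
if subprocess.run(["./build_at_this_commit.sh"]).returncode != 0:
    sys.exit(125)  # cannot build here: skip

# Run a short training job (placeholder) and look for NaNs in its output.
result = subprocess.run(["python", "train_short.py"], capture_output=True, text=True)
bad = "nan" in (result.stdout + result.stderr).lower()
sys.exit(1 if bad else 0)
```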
Silent bugs exist in deep learning libraries, and are extremely hard to find. What does this mean for everyone working on deep learning?
As an average user, follow what the experts are using. Silent bugs exist but are hard to find. Without enough confidence in our own ability to always discover such bugs, follow the experts.
A library without years of battle testing may have many sharp edges or hidden bugs. With a mature library like PyTorch or TensorFlow, a bug you run into is more likely to have been discovered by others already. This applies not only to libraries as a whole, but also to different features of a library, modules within a library, extensions of a library, etc.
This is not to say we should use the most popular thing. On the contrary, high-level frameworks that build over-simplified APIs to gain popularity among non-experts (e.g. Keras) are something a serious researcher would rather avoid: they may have silent bugs buried underneath, simply because the intended user group is not capable of noticing them.
To make your code/library popular, reproduce known results to increase credibility. "Following the experts" tends to create monopolies. To break that, a deep learning training library can earn trust by reproducing known results, rather than just providing examples of arbitrary toy models. This is a core principle of tensorpack that I have followed since the beginning, and it is probably the most effective way to convince a user that your library/implementation does not have hidden silent bugs.