After joining Google Brain, I brought PyTorch to Google's internal infra and owned its maintenance. Google is well known as a "tech island" where almost everything works differently from the outside world, and that creates many challenges when building a massive library like PyTorch.
Among those challenges, a few tricky bugs were related to the static initialization order fiasco (SIOF) and to the destruction of static objects. This time I was forced to learn a lot more details than I'd like to know about these topics, so it's good to write them down before I forget.
"Static initialization" is an ambiguous term because "static" is very overloaded in C++. In our context, it is supposed to mean "initialization of objects that have static storage duration", i.e. objects that live through the lifetime of a program. The word "static" actually talks about the object lifetime, not about initialization.
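Meanwhile, initialization of such objects can have two steps:

Static initialization: zero initialization or constant initialization, which the compiler can typically perform at compile time, so no code needs to run at startup.

Dynamic initialization: any initialization that has to execute code at runtime, such as calling a non-constexpr constructor.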
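The two steps, static initialization (zero or constant initialization) followed by dynamic initialization, can be sketched roughly as below. All names are hypothetical and exist only for this illustration. The sketch relies on the guarantee that all static initialization completes before any dynamic initialization starts.

```cpp
#include <cassert>

// Hypothetical names, for illustration only.
extern int constant_initialized;  // defined at the bottom of this snippet

// Not a constant expression, so its caller needs dynamic initialization.
int read_constant() { return constant_initialized; }

int zero_initialized;                           // static: zero initialization
int dynamically_initialized = read_constant();  // dynamic initialization
int constant_initialized = 7;                   // static: constant initialization

void demo_two_step_initialization() {
  // Zero initialization happened without running any code.
  assert(zero_initialized == 0);
  // Although `constant_initialized` is defined later in the file, its static
  // initialization finished before the dynamic initializer above ran.
  assert(dynamically_initialized == 7);
}
```

Calling `demo_two_step_initialization()` from `main()` should pass both checks.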
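Objects with static storage duration can be categorized into the following two types, based on when their "dynamic initialization" happens:

Non-local variables (e.g. globals and static class members): their dynamic initialization happens at program startup (in practice, before `main()`).

Function-local static variables: their dynamic initialization happens the first time control passes through their declaration.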
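As a rough sketch of the difference in timing, the snippet below logs when a non-local object and a function-local static are constructed. `Tracer`, `events()` and the other names are hypothetical helpers for this illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log. It is itself a function-local static, so it is
// constructed the first time anything logs to it.
std::vector<std::string>& events() {
  static std::vector<std::string> log;
  return log;
}

struct Tracer {
  explicit Tracer(const std::string& name) { events().push_back("init " + name); }
};

// Type 1: non-local variable, dynamically initialized at startup.
Tracer global_tracer("global");

// Type 2: function-local static, dynamically initialized on the first call.
Tracer& get_local_tracer() {
  static Tracer local_tracer("function-local");
  return local_tracer;
}

// Call once from main().
void demo_initialization_timing() {
  events().push_back("demo started");
  get_local_tracer();  // first use: constructs local_tracer now
  get_local_tracer();  // later calls reuse the same object
  const std::vector<std::string> expected = {
      "init global", "demo started", "init function-local"};
  assert(events() == expected);
}
```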
SIOF typically refers to the problem that the dynamic initialization order of objects from different translation units is unspecified. For example, suppose `a` and `b` are globals defined in two different translation units. If `a` and `b` have non-trivial constructors, and the constructor of `b` somehow needs to access `a`, the program may crash or behave unexpectedly because `a` may be initialized after `b`.
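A real SIOF needs two source files, and which order you get is up to the toolchain. The sketch below is a single-file simulation under stated assumptions: it places `b` before `a` in one translation unit (where the order is well defined) to reproduce the "unlucky" order deterministically. It uses plain `int`s rather than class objects, because reading a class object before its constructor runs would be undefined behavior, while reading a zero-initialized `int` is well defined. `runtime_input` and `compute_a` are hypothetical helpers.

```cpp
#include <cassert>

// A volatile read cannot be folded at compile time, so `a` below really
// needs dynamic initialization.
volatile int runtime_input = 41;
int compute_a() { return runtime_input + 1; }

extern int a;         // stands in for a global defined in another file
int b = a + 1;        // runs first: `a` is only zero-initialized at this point
int a = compute_a();  // runs second

void demo_siof() {
  assert(a == 42);  // by now `a` is fully initialized
  assert(b == 1);   // but `b` captured the zero-initialized `a` (0 + 1, not 43)
}
```

With class types instead of `int`s, the same ordering would not produce a merely wrong value but could crash.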
PyTorch heavily uses registrations, which are all objects with static storage duration. A few SIOF bugs were found when I tried to build PyTorch at Google. As an example, when an ATen operator has many overloads, initialization order affects which overload is called, because an overload that's initialized earlier is preferred over those initialized later.
Standard ways to avoid SIOF problems are:
Avoid dynamic initialization: change the object's type to something that can be zero/const-initialized. totw/140 shows a few examples of how to replace `std::string` with non-dynamic counterparts.
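As a small sketch of this first approach (not taken from totw/140 itself), a string constant can be stored in a `constexpr` character array instead of a `std::string`. The names `kOpName` and `kMaxOverloads` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A global `const std::string kOpName = "demo::add";` would typically need a
// constructor call at startup and a destructor call at exit.

// Constant-initialized alternatives: no constructor or destructor ever runs.
constexpr char kOpName[] = "demo::add";
constexpr int kMaxOverloads = 16;

// Both are usable at compile time, which shows they need no runtime setup.
static_assert(sizeof(kOpName) == 10, "9 characters plus the terminating NUL");
static_assert(kMaxOverloads == 16, "usable in constant expressions");

std::size_t op_name_length() { return std::strlen(kOpName); }
```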
Use well-defined initialization order: objects within the same translation unit are guaranteed to be dynamically initialized in the order of their definitions. So we can sometimes just move code into the same translation unit. In another PyTorch bug where one global depends on another, I simply merged two files so that their constructors are properly sequenced.
Construct on first use: it's often not practical to merge files. A better solution is the "construct on first use" idiom: ❌ don't define `a` as a global; ✅ wrap it in a function-local static that is returned by an accessor function such as `get_a()`.
By doing this, anyone that needs to access `a` will have to call `get_a()`. Because a function-local static is guaranteed to be initialized on first use, we can rest assured that `a` will not be used before initialization.
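A minimal sketch of the idiom follows. `Registry` and `Client` are hypothetical types for illustration. `get_a()` is the accessor named in the text.

```cpp
#include <cassert>

struct Registry {
  Registry() : ready(true) {}
  bool ready;
};

// ❌ A global `Registry a;` may not be constructed yet when an initializer in
// another translation unit touches it.

// ✅ Function-local static: constructed the first time get_a() is called.
Registry& get_a() {
  static Registry a;
  return a;
}

struct Client {
  // Accessing `a` through get_a() forces its construction first.
  Client() : saw_ready_registry(get_a().ready) {}
  bool saw_ready_registry;
};

// Safe regardless of which translation unit `b` is defined in.
Client b;

void demo_construct_on_first_use() {
  assert(b.saw_ready_registry);
  assert(&get_a() == &get_a());  // every call returns the same object
}
```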
The "construct on first use" idiom may look different in practice, because sometimes we don't need to use `a` directly but do need to observe the side effects of its constructor. In such cases we just manually call `get_a()` to make sure `a` is constructed. I used this to fix another PyTorch bug.
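A sketch of this variant, assuming a hypothetical registration table (this is not PyTorch's actual registration code): the object returned by `get_a()` is never read, and the call exists only to make sure its constructor has run.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical table that registrations write into.
std::map<std::string, int>& op_table() {
  static std::map<std::string, int> table;
  return table;
}

// Its constructor has side effects: it registers two entries.
struct RegisterDefaults {
  RegisterDefaults() {
    op_table()["add"] = 2;
    op_table()["neg"] = 1;
  }
};

RegisterDefaults& get_a() {
  static RegisterDefaults a;
  return a;
}

int arity_of(const std::string& name) {
  get_a();  // result unused: we only need the constructor's side effects
  return op_table().at(name);
}
```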
There are more ways things can go wrong in the destruction of objects with static storage duration.
In general, we have to carefully avoid use-after-free, i.e. accessing a global or function-local static variable after it has been destructed. This is typically prevented by this rule:
Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor.
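A small sketch of this rule in action, using a hypothetical `Tracked` type that records its construction order and prints a line from its destructor:

```cpp
#include <cassert>
#include <cstdio>

int construction_count = 0;  // constant-initialized before any constructor runs

struct Tracked {
  explicit Tracked(const char* n) : name(n), index(construction_count++) {}
  ~Tracked() { std::printf("destroy %s\n", name); }
  const char* name;
  int index;  // position in the construction order
};

// Same translation unit, so construction follows definition order.
Tracked first("first");
Tracked second("second");

void demo_construction_order() {
  assert(first.index == 0);
  assert(second.index == 1);
}
// At program exit the destructors run in reverse order of construction:
// "destroy second" is printed before "destroy first".
```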
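Given this rule, we can deduce that if the constructor of `b` accesses `a` (e.g. through `get_a()`), then `a` finishes construction before `b` does, so `a` is destroyed after `b` and the destructor of `b` can still safely access `a`.

The above result sounds nice and is often enough protection, but people tend to overlook a few ways things can still go wrong: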
`b`. This should be discouraged, but it means that technically ANY object could access `b` in its destructor. If any of these objects is destructed after `b`, we're doomed.
Given the above issues, the Google C++ style guide bluntly forbids such destructions:
Objects with static storage duration are forbidden unless they are trivially destructible.
This "no destruction" rule implies that code such as `static Object a; return a;` is illegal if `Object` is not trivially destructible. The C++ FAQ advises the same.
Writing `static Object* a = new Object; return *a;` is safe as long as we never call `delete`, but this introduces a heap-allocation overhead. The last trick is to use a `NoDestructor` wrapper class to bypass RAII (the trick is the placement new operator). The `new Object` version is safe but has heap allocation overhead; the `NoDestructor` version is safe and has low overhead.
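Below is a minimal sketch of what such a `NoDestructor` wrapper could look like, shown next to the heap-allocating version. It is a simplified illustration, not the implementation from any particular library. `Object`, `get_heap_object` and `get_object` are hypothetical names. The wrapper stores the pointer returned by placement new, which costs one extra pointer but keeps the sketch simple.

```cpp
#include <cassert>
#include <new>
#include <type_traits>
#include <utility>

template <typename T>
class NoDestructor {
 public:
  // Placement new constructs T inside our own buffer: no heap allocation.
  template <typename... Args>
  explicit NoDestructor(Args&&... args)
      : ptr_(new (storage_) T(std::forward<Args>(args)...)) {}

  NoDestructor(const NoDestructor&) = delete;
  NoDestructor& operator=(const NoDestructor&) = delete;
  // No user-defined destructor: ~T() is never called.

  T& get() { return *ptr_; }

 private:
  alignas(T) unsigned char storage_[sizeof(T)];
  T* ptr_;
};

struct Object {
  ~Object() { ++destructor_calls; }  // non-trivial destructor
  static int destructor_calls;
};
int Object::destructor_calls = 0;

// Safe, but has heap allocation overhead.
Object& get_heap_object() {
  static Object* a = new Object;
  return *a;
}

// Safe and low overhead.
Object& get_object() {
  static NoDestructor<Object> a;
  return a.get();
}

void demo_no_destructor() {
  // The wrapper is trivially destructible even though Object is not.
  static_assert(std::is_trivially_destructible<NoDestructor<Object>>::value,
                "wrapper must be trivially destructible");
  assert(&get_object() == &get_object());
  assert(&get_heap_object() == &get_heap_object());
  assert(Object::destructor_calls == 0);
}
```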
Finally, as an alternative to "no destruction", another way to safely run destructors is to ref-count all such objects, but it's perhaps not worth the complexity. "No destruction" is usually a good enough solution.
In conclusion, to safely construct and destruct objects with static storage duration + dynamic initialization, follow these rules of thumb: