Building a library for research and experiments is quite different from building other types of software. A key challenge is that, in research, abstractions and APIs are rarely set in stone: users may want to propose a slight variant or modification literally ANYWHERE in the program, just because they have a new idea.
In deep learning libraries, these variants can be a different implementation of a layer, a change in the optimization algorithm, a small modification to the training logic, etc.
When users want to make such changes, they often implement variants by simply adding features to the target API they want to modify, e.g. by adding a new flag to the API plus some control statements, or by adding a new abstraction that generalizes the target API towards the users’ use case.
However, when maintaining a generic, core library meant to be adopted by diverse use cases over the long term, the above approach does not scale and poses many problems (discussed more below).
This document lists a few principles about:
Researchers' needs are so diverse that a core library should not aim to include or implement features for all possible use cases. It should aim to include only the most popular and standardized features (more on the criteria later).
For features not included in the core, ideally there should be a way for users to implement them out-of-core as extensions, without too much overhead / repetition.
This requires a continuous design evolution to make the core more modular and composable, so that core code can be reused in users’ new implementations.
A good sanity check for library maintainers is to ask the following question: for any feature currently in the core library, suppose we remove it today, how much effort would it take for users to reimplement it out-of-core? A well-designed library should be decoupled such that most of its features are just extensions of itself, and they can be implemented out-of-core the same way as they are in the core.
There are 3 criteria for feature inclusion in core, ordered by their importance: popularity, standardization, and simplicity.
To understand the criteria more, let’s ask: what if the feature is —
Popular but not standardized: sometimes a feature is popular, but its users don’t yet align on the proper parameterization, its API, or the subtle implementation details. Including such features is risky, as it may create unclear semantics or impede its standardization in the future. It’s still OK to include it if it’s very popular (popularity is the #1 most important criterion), but try to do it in a composable way and with warning signs.
As a negative example, "Transformer" is a popular but not standardized feature. It's included in PyTorch, but received many complaints, and many projects (e.g. fairseq, detr) eventually had to fork and reimplement their own Transformer.
Simple but not popular/standardized: simplicity alone is not sufficient for inclusion, no matter how simple the feature is, because if everyone adds a simple feature they need, together it becomes complex.
Popular, standardized but not simple: simplicity is only the #3 most important factor. If something is complex but very popular & standardized (e.g. BatchNorm being a headache for DL library developers), it should be included. In fact this is where a library could provide a lot of value to users.
Suppose a user wants to change the behavior of a function `def func()` defined in core. Based on assessment of the above 3 criteria, this new behavior may be determined to be implemented in one of the following ways:
1. A new function `def func_v2()` in user code. (Or a class `ClassV2` for classes).
2. A new function `def func_v2()` in core.
3. A new option on the existing function, `def func(option)`.
We recommend that methods (1) and (2), i.e. adding a separate implementation `func_v2()`, should generally be preferred over (3).
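As a rough illustration of the three choices above, here is a minimal sketch. The function bodies are invented for this example, and `func_with_option` is a hypothetical name standing in for `def func(option)` so that all variants can coexist in one file.

```python
# Hypothetical stand-in for a function that lives in the core library.
def func(x: float) -> float:
    """Core behavior: scale the input by 2."""
    return 2.0 * x


# Methods (1) and (2): a separate implementation of the variant.
# The code is the same either way; only its location differs
# (user code for (1), the core library for (2)).
def func_v2(x: float) -> float:
    """Variant behavior: scale by 2, then add a bias of 1."""
    return 2.0 * x + 1.0


# Method (3): the existing function grows an option that selects the variant.
# Every future variant adds another branch to this one function.
def func_with_option(x: float, option: str = "v1") -> float:
    if option == "v1":
        return 2.0 * x
    if option == "v2":
        return 2.0 * x + 1.0
    raise ValueError(f"unknown option: {option!r}")
```

With (1) or (2), `func()` is not touched at all, so its existing users cannot be affected by the variant.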
For features to be included in core, adding them like (3) is often a quick way to get the job done, but could lead to long-term issues. To show why, let’s look at the two typical ways options are added:
New flags / arguments that control the behavior, in the form of either a new flag or some other new argument.
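The two forms of option can be sketched as follows. This is only an assumed example: `normalize`, `use_max`, `normalize_eps`, and `eps` are hypothetical names chosen for illustration, not APIs from any library.

```python
from typing import List


def normalize(values: List[float], use_max: bool = False) -> List[float]:
    """New flag: `use_max` switches the denominator from the sum to the max."""
    if use_max:
        denom = max(values)
    else:
        denom = sum(values)
    return [v / denom for v in values]


def normalize_eps(values: List[float], eps: float = 0.0) -> List[float]:
    """Other new argument: `eps` is added to the denominator.

    The default of 0.0 reproduces the original behavior.
    """
    denom = sum(values) + eps
    return [v / denom for v in values]
```

In both cases the variant logic now lives inside the core function, and every caller has to read past it even if they never use the option.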
This is OK if we determine the new option is very clear and popular. But be aware of the potential problems:
New logic encapsulated in new abstractions, by adding either an object or a callback to control behavior.
This may appear nice, since the variant logic is not implemented in core, but in a user-provided `obj` or `callback`. However, it’s very easy to create premature abstractions this way.
For example, the callback-based interface needs to make assumptions/constraints on where the callback is triggered, what arguments it needs and what it returns. A single use case may not be sufficient to make good assumptions on them.
Sometimes callbacks are good and useful abstractions. But they are often abused to alter a behavior in existing code into something that's strongly overfitted to a small number of use cases. In code reviews, I often frown upon APIs that contain callbacks/user-defined functions.
Other than the potentially premature abstraction, the extra redirection caused by the new abstraction also makes code harder to read and maintain.
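To make the assumptions baked into such interfaces concrete, here is a small sketch. Everything in it is hypothetical (`Clipper`, `step_with_obj`, `step_with_callback`); it only illustrates how an object-based or callback-based option fixes where the user logic runs, what it receives, and what it must return.

```python
from typing import Callable, List


class Clipper:
    """Hypothetical user-provided object that core code consults."""

    def __init__(self, limit: float) -> None:
        self.limit = limit

    def apply(self, grad: float) -> float:
        # Clamp the gradient into [-limit, limit].
        return max(-self.limit, min(self.limit, grad))


def step_with_obj(params: List[float], grads: List[float],
                  lr: float, obj: Clipper) -> List[float]:
    # Core assumes `obj` exposes `apply(grad)`, called once per gradient.
    return [p - lr * obj.apply(g) for p, g in zip(params, grads)]


def step_with_callback(params: List[float], grads: List[float], lr: float,
                       callback: Callable[[float], float]) -> List[float]:
    # Core fixes where the callback runs (right before the update), what it
    # receives (a single gradient) and what it returns (a single gradient).
    # A variant that needs the parameters, or all gradients at once,
    # does not fit this interface.
    return [p - lr * callback(g) for p, g in zip(params, grads)]
```

A user who wants, say, clipping by the global norm of all gradients cannot express it through either interface without changing core again.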
Therefore, the recommendation is, for variants to be added in core:
Users/developers may find that the core design is not good enough yet, and implementing a variant of `func_core()` without touching it may lead to too much code duplication. For example, some logic `...` may be duplicated between the existing API in core and the new variant.
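A minimal sketch of this kind of duplication is shown below. The function bodies are assumptions made up for the example; only the names `func_core()` and `func_v2()` come from the text.

```python
def func_core(xs):
    # Shared logic (this is the part that ends up duplicated).
    cleaned = [x for x in xs if x is not None]
    total = sum(cleaned)
    # Core-specific logic: plain mean.
    return total / len(cleaned)


def func_v2(xs):
    # Same shared logic, copied verbatim from func_core().
    cleaned = [x for x in xs if x is not None]
    total = sum(cleaned)
    # Variant-specific logic: smoothed mean with one extra count.
    return total / (len(cleaned) + 1)
```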
Such duplication is acceptable in the short term. This also echoes the Flax philosophy that says "prefer duplication over adding options / bad abstractions".
We do NOT mean to encourage users to heavily fork core code. Instead, users and core developers should engage and aim to evolve the core design to reduce duplication, but design change takes time to happen, and duplication is preferred before a good design is found.
The most risk-free way to reduce duplication is to move it into shared reusable code (e.g. `_reusable_parts()`) used by both the existing API in core and the new variant.
This should be the preferred way to reduce duplication. The benefits are:
- No change to the API of `func_core()`, hence little risk.
However, there are also challenges:
- There is a new API (`_reusable_parts()`) to maintain.
The above challenges are less significant if `_reusable_parts()` is private. Therefore:
- If `func_v2()` is in core, make `_reusable_parts()` private.
- If `func_v2()` is out-of-core, consider `_reusable_parts()` as "internal/experimental APIs".
Inheritance, e.g. `class ModuleV2(ModuleCore)`, may also reduce duplication between two variants. However, this is generally less preferable than composition as above. The reason is similar to why callbacks are not preferred: overriding methods is like passing callbacks. Both are user-defined functions and suffer from the same limitations: users are constrained by the assumptions about when/where/how the methods/callbacks are triggered.
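The contrast between inheritance and composition can be sketched like this. The class names `ModuleCore` and `ModuleV2` come from the text; the methods, the helper `_add_one`, and `ModuleV2Composed` are hypothetical and exist only for illustration.

```python
def _add_one(h: float) -> float:
    """Reusable piece shared through composition."""
    return h + 1.0


class ModuleCore:
    def forward(self, x: float) -> float:
        # The base class decides when `transform` is called and how its
        # result is used, exactly as core would with a callback.
        return _add_one(self.transform(x))

    def transform(self, x: float) -> float:
        return 2.0 * x


class ModuleV2(ModuleCore):
    """Inheritance: the override is constrained by ModuleCore.forward()."""

    def transform(self, x: float) -> float:
        return 3.0 * x


class ModuleV2Composed:
    """Composition: the variant owns its control flow and reuses helpers."""

    def forward(self, x: float) -> float:
        return _add_one(3.0 * x)
```

Both variants give the same result here, but only the composed one could, for example, skip `_add_one` or call it twice without changing `ModuleCore`.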
We generally prefer adding a new implementation over adding new conditional branches to the existing implementation, but branches will probably happen somewhere anyway. After all, the new feature variant probably ends up as a new option/argument in the end-users' config.
If branching has to happen, we prefer it at an earlier, shallower code path: branch earlier rather than later.
By branching earlier, we keep a clean `func_core()` unaffected by the new variant. This recommendation is a natural consequence of our preference for a new implementation `func_v2()` (vs. adding a `flag` to `func_core()`).
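The difference between branching earlier and branching later might look like the following sketch. The bodies and the names `run_branch_earlier`, `run_branch_later`, and `func_core_with_flag` are assumptions for illustration only.

```python
def func_core(x: float) -> float:
    """Clean core implementation, unaware of any variant."""
    return 2.0 * x


def func_v2(x: float) -> float:
    """Separate implementation of the variant."""
    return 2.0 * x + 1.0


def run_branch_earlier(x: float, flag: bool) -> float:
    # Branch earlier: the choice is made at the shallow call site,
    # e.g. where the end-user's config is read.
    if flag:
        return func_v2(x)
    return func_core(x)


def func_core_with_flag(x: float, flag: bool) -> float:
    # Branch later: the flag is threaded down into the core function,
    # which now mixes both behaviors.
    y = 2.0 * x
    if flag:
        y += 1.0
    return y


def run_branch_later(x: float, flag: bool) -> float:
    return func_core_with_flag(x, flag)
```

Both entry points compute the same values; the difference is where the `if` lives and which functions have to know about the variant.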