"Loss function" is one of the most basic concepts today in deep learning.Despite that,it is actually not necessarily a good programming abstraction whendesigning general-purpose systems. A system should not assume thata model always comes together with a "loss function".
"Loss function" may mean different things in different systems.The version I'm going to criticize is the most common one that looks like below:
It comes in a "bad" and a "worse" flavor: the trainer may take a standalone loss function as an argument, or the split may be baked directly into the model's interface as a separate loss-computation method.
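A minimal sketch of both flavors, assuming a PyTorch-style training loop (the batch format, the optimizer calls, and names like `trainer_with_loss_method` are illustrative, not taken from any particular library):

```python
# Flavor 1: the trainer takes a standalone loss_func argument,
# executed after the "model / forward logic".
def trainer_bad(model, loss_func, optimizer, data_loader):
    for batch in data_loader:
        outputs = model(batch)            # "model / forward logic"
        loss = loss_func(outputs, batch)  # separate "loss function"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Flavor 2: the split is part of the model interface itself;
# the trainer calls the two halves in a fixed order.
def trainer_with_loss_method(model, optimizer, data_loader):
    for batch in data_loader:
        outputs = model.forward(batch)
        loss = model.compute_loss(outputs, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```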
The key property of the bad "loss function" abstraction is: users are asked to provide a "loss function" that's executed after the "model / forward logic". Such an abstraction appears in a few open source systems: Keras `model.compile(loss=)`, fast.ai `Learner(loss_func=)`, and Lingvo `BaseModel.ComputeLoss`.
The main problem is not with the function itself, but that the users' algorithm logic is forced to separate into two parts: `model` and `loss_func`.
As an alternative, `trainer_good` below no longer separates `loss_func` from the model, and provides the same functionality as `trainer_bad`.
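A matching sketch of `trainer_good`, under the same illustrative assumptions as above:

```python
def trainer_good(model, optimizer, data_loader):
    for batch in data_loader:
        loss = model(batch)   # the model itself returns the loss in training mode
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```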
In this article, I want to argue that this is a better design, because it imposes fewer constraints: users can still split their `model` into two parts if they like, but they don't have to. (Apparently, `trainer_good == partial(trainer_bad, loss_func=lambda x, y: x)`, so `trainer_bad` can still be used - we just set `loss_func` to a no-op if we don't like it. But `trainer_good` is cleaner.)
It's true that the separation can be useful for certain types of models. But that is not always the case, and enforcing it can be harmful instead.
The separation is not convenient for a model with many optional losses. Take a multi-task model, written with and without the separation, for example:
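A sketch of the two options (PyTorch-style; `backbone`, `head1`, `task1_loss`, and the `with_task*` flags are illustrative placeholders assumed to be set up elsewhere):

```python
import torch.nn as nn

# With separation: the branching over optional tasks must appear in two places.
class MultiTaskModel(nn.Module):
    def forward(self, inputs):
        feats = self.backbone(inputs)      # backbone/head*/with_task* assumed set in __init__
        outputs = {}
        if self.with_task1:
            outputs["task1"] = self.head1(feats)
        if self.with_task2:
            outputs["task2"] = self.head2(feats)
        return outputs

def multi_task_loss_func(outputs, targets, with_task1, with_task2):
    losses = {}
    if with_task1:                         # the same branching, duplicated
        losses["task1"] = task1_loss(outputs["task1"], targets["task1"])
    if with_task2:
        losses["task2"] = task2_loss(outputs["task2"], targets["task2"])
    return losses

# Without separation: each branch appears exactly once.
class MultiTaskModelNoSep(nn.Module):
    def forward(self, inputs, targets):
        feats = self.backbone(inputs)
        losses = {}
        if self.with_task1:
            losses["task1"] = task1_loss(self.head1(feats), targets["task1"])
        if self.with_task2:
            losses["task2"] = task2_loss(self.head2(feats), targets["task2"])
        return losses
```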
The version without separation is simpler in that it does not duplicate the branches that enable different tasks/losses. In reality, these conditions can be more complex than a simple `if`, and branching is generally less straightforward to maintain. So it's beneficial to not have to repeat the logic.
Note: if you think a wrapper like `multi_loss_func({"task1": loss_func1, "task2": loss_func2})` will help (like what Keras supports), it is not going to work well, because it doesn't know how to route the inputs/outputs to the loss functions.
One may argue that separating "loss" from "model" is nice because then we can easily switch between different loss functions independently of the "model". However, in many algorithms, loss computation is simply not independent of the model and should not be switched arbitrarily. This could be due to:
* Loss computation may depend on internal states computed during `model.forward` - for example, intermediate activations or other values that exist only inside `forward`. In these cases, forcing a separation of "loss" and "model" will let the "model" return its internal states, causing an abstraction leak.
* Different loss functions expect different representations of the model's predictions - for example, logits vs. normalized probabilities, or boxes encoded in absolute coordinates vs. relative to anchors.
Since conversion between representations may be expensive or lossy, we'd like the model to produce the exact representation needed by loss computation. Therefore, a separation would not make the model independent of losses. On the contrary, it's even worse, because loss-related logic will be unnaturally split like this:
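For instance, consider box regression in a detector, where the loss operates on anchor-relative deltas. A sketch (`encode_boxes`, `smooth_l1`, and the module attributes are illustrative placeholders):

```python
import torch.nn as nn

# With separation: the "model" must already emit the loss-specific representation
# (anchor-relative deltas), and the external loss must mirror that encoding.
class BoxModel(nn.Module):
    def forward(self, images):
        feats = self.backbone(images)       # assumed attributes, set in __init__
        return self.box_head(feats)         # anchor-relative deltas

def box_loss_func(pred_deltas, gt_boxes, anchors):
    gt_deltas = encode_boxes(gt_boxes, anchors)   # must match the model's encoding exactly
    return smooth_l1(pred_deltas, gt_deltas)

# Without separation: the encoding and the loss that depends on it live in one place.
class BoxModelNoSep(nn.Module):
    def forward(self, images, gt_boxes):
        pred_deltas = self.box_head(self.backbone(images))
        gt_deltas = encode_boxes(gt_boxes, self.anchors)
        return smooth_l1(pred_deltas, gt_deltas)
```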
One may argue that the separation is helpful because it's nice to let the "model" return the same data in training and inference. This makes sense for simple models where training and inference share most of the logic. For example, in a standard classification model like the one sketched below, we can let the "model" object return logits, which are useful in both training and inference.
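A minimal sketch (PyTorch-style; the tiny two-layer network is just a stand-in for a real backbone):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.backbone = nn.Linear(dim, 128)
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.head(F.relu(self.backbone(x)))   # logits, same in both modes

model = Classifier(dim=32, num_classes=10)
images, labels = torch.randn(4, 32), torch.randint(0, 10, (4,))
loss = F.cross_entropy(model(images), labels)    # training: an external loss works fine
preds = model(images).argmax(dim=1)              # inference: reuse the same logits
```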
But many models don't have a clean separation like this. In theory, training and inference only have to share (some) trained weights, but don't necessarily have to share any logic. Many object detection models, for example, do not compute "predictions" in training and do not compute losses in inference. The training-time computation of the Region Proposal Network (RPN) of a two-stage detector, heavily simplified, looks like this:
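A rough pseudo-Python sketch of the training-time path (helper names such as `label_and_sample_anchors` are illustrative; real implementations differ):

```python
def rpn_forward_training(images, gt_boxes):
    features = backbone(images)                      # shared feature extractor
    objectness, anchor_deltas = rpn_head(features)   # raw per-anchor outputs
    # Training-only steps: label anchors against ground truth and sample a balanced subset.
    labels, matched_gt_deltas = label_and_sample_anchors(anchors, gt_boxes)
    return {
        "rpn_cls_loss": objectness_loss(objectness, labels),
        "rpn_reg_loss": box_regression_loss(anchor_deltas, matched_gt_deltas, labels),
    }
# At inference time, none of the anchor-labeling or loss steps exist; the same head
# outputs are instead decoded into proposals and filtered with NMS.
```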
Any attempt to split a complicated algorithm like this into a "model" and a "loss function" forces an awkward choice: either the "model" returns a pile of training-only internal states for the "loss function" to consume (an abstraction leak), or a large chunk of the algorithm migrates into the "loss function".
Therefore, it's unrealistic to expect that there is a nice separation, or that the "model" can produce a consistent format in both training and inference. A better design is to include loss computation in the model's training-mode `forward`, i.e., let the model output losses in training, but predictions in inference.
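A sketch of what the two designs tend to look like for such a detector (method names like `compute_raw_outputs`, `compute_losses`, and `predict` are illustrative):

```python
import torch.nn as nn

# With separation: the "model" has to return whatever the external loss needs in
# training, yet something entirely different (decoded boxes) in inference.
class Detector(nn.Module):
    def forward(self, images):
        raw = self.compute_raw_outputs(images)   # features, per-anchor outputs, ...
        if self.training:
            return raw                           # leaked internals, only meaningful to loss_func
        return self.decode_predictions(raw)      # boxes after decoding + NMS

def detector_loss_func(raw, targets):
    ...  # has to understand the model's internal representation

# Without separation: losses in training, predictions in inference.
class DetectorNoSep(nn.Module):
    def forward(self, images, targets=None):
        if self.training:
            return self.compute_losses(images, targets)   # e.g. {"rpn_cls": ..., "rpn_reg": ...}
        return self.predict(images)                        # decoded boxes
```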
In the "no separation" design, users provide a "model" that returns losses.This model internally can still use separation of "loss function" and "forward logic"as long as it makes sense.However, this trainer is no longer aware of the separation,and the trainer can no longer obtain the "outputs".
Will this become a limitation of the "no separation" design? What if we'd like to do something with the "outputs"? My answer is that such needs can be served inside the model itself: for example, users can call `write_summary(outputs)` in their model.
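For instance (a sketch; `write_summary`, `compute_outputs`, and `compute_losses` are placeholder names):

```python
import torch.nn as nn

class ModelThatLogs(nn.Module):
    def forward(self, batch):
        outputs = self.compute_outputs(batch)
        if self.training:
            write_summary(outputs)   # logging lives inside the model, not the trainer
            return self.compute_losses(outputs, batch)
        return outputs
```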
Design is always a trade-off. Adding assumptions to a system might bring some benefits, but it can also cause trouble when the assumptions aren't true. Finding a balance in between is difficult and often subjective.
The assumption that models have to come together with a separate "loss function", in my opinion, brings more trouble than it's worth.