In this post, we show how different use cases require different model diagnostics. In short, we compare (statistical) inference and prediction.
As an example, we use a simple linear model for the Munich rent index dataset, which was kindly provided by the authors of Regression – Models, Methods and Applications, 2nd ed. (2021). This dataset contains monthly rents in EUR (`rent`) for about 3000 apartments in Munich, Germany, from 1999. The apartments have several features such as living area in square meters (`area`), year of construction (`yearc`), quality of location (`location`, 0: average, 1: good, 2: top), quality of bathroom (`bath`, 0: standard, 1: premium), quality of kitchen (`kitchen`, 0: standard, 1: premium), and an indicator for central heating (`cheating`).
The target variable is $Y = \text{rent}$ and the goal of our model is to predict the mean rent, $E[Y]$ (we omit the conditioning on $X$ for brevity).
Disclaimer: Before presenting the use cases, let me clearly state that I am not in the apartment rent business and everything here is merely for the purpose of demonstrating statistical good practice.
The first use case is about inference of the effect of the features. Imagine the point of view of an investor who wants to know whether the installation of a central heating is worth it (financially). To lay the ground on which to base a decision, a statistician must be able to answer two questions: how large is the effect of `cheating` on the rent, and is it statistically significant?

The second use case is about prediction. This time, we take the point of view of someone looking for a new apartment to rent. In order to know whether the rent proposed by the landlord is about right or too high, a reference value would be very convenient. One can either ask the neighbors or ask a model to predict the rent of the apartment in question.
Before answering the above questions and doing some key diagnostics, we must load the data and fit a model. We choose a simple linear model and directly model `rent`.
Notes:

- We use the `GeneralizedLinearRegressor` from the package glum. We could just as well have chosen other implementations such as `statsmodels.regression.linear_model.OLS`. This way, we have to implement the residual diagnostics ourselves, which makes it clear what is plotted.
- For brevity, we skip imports and data loading.

Our model is then fit by:
```python
lm = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,  # this is very important if alpha=0
    formula="bs(area, degree=3, df=4) + yearc"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)",
)
lm.fit(X_train, y_train)
```
```r
model = lm(
  formula = rent ~ bs(area, degree = 3, df = 4) + yearc +
    location + bath + kitchen + cheating,
  data = df_train
)
```
The coefficient table already tells us the effect of the `cheating` variable. For more involved models like gradient boosted trees or neural nets, one can use partial dependence and SHAP values to assess the effect of features.
```python
lm.coef_table(X_train, y_train)
```
```r
summary(model)
confint(model)
```
Variable | coef | se | p_value | ci_lower | ci_upper |
---|---|---|---|---|---|
intercept | -3682.5 | 327.0 | 0.0 | -4323 | -3041 |
bs(area, ..)[1] | 88.5 | 31.3 | 4.6e-03 | 27 | 150 |
bs(area,..)[2] | 316.8 | 24.5 | 0.0 | 269 | 365 |
bs(area, ..)[3] | 547.7 | 62.8 | 0.0 | 425 | 671 |
bs(area, ..)[4] | 733.7 | 91.7 | 1.3e-15 | 554 | 913 |
yearc | 1.9 | 0.2 | 0.0 | 1.6 | 2.3 |
C(location)[2] | 48.2 | 5.9 | 4.4e-16 | 37 | 60 |
C(location)[3] | 137.9 | 27.7 | 6.6e-07 | 84 | 192 |
C(bath)[1] | 50.0 | 16.5 | 2.4e-03 | 18 | 82 |
C(kitchen)[1] | 98.2 | 18.5 | 1.1e-07 | 62 | 134 |
C(cheating)[1] | 107.8 | 10.6 | 0.0 | 87.0 | 128.6 |
We see that, ceteris paribus (meaning all else equal), a central heating increases the monthly rent by about 108 EUR. It is not the size of the effect (108 EUR), but the fact that there is an effect of central heating on the rent at all, that is statistically significant: this is indicated by the very low probability, i.e. p-value, for the null hypothesis of `cheating` having a coefficient of zero. We also see the confidence interval at the default confidence level of 95%: [`ci_lower`, `ci_upper`] = [87, 129]. This shows the uncertainty of the estimated effect.
For a building with 10 apartments and an investment horizon of about 10 years, the estimated effect gives roughly a budget of 13,000 EUR per apartment (108 EUR × 12 months × 10 years), or about 130,000 EUR for the whole building; the 95% confidence interval translates to roughly 10,500 to 15,500 EUR per apartment.
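To make the arithmetic transparent, here is the calculation spelled out as a small sketch (the variable names are ours for illustration; the numbers come from the coefficient table above):

```python
# Back-of-the-envelope budget implied by the cheating coefficient.
months = 12 * 10        # investment horizon of about 10 years
n_apartments = 10

effect, ci_lower, ci_upper = 107.8, 87.0, 128.6  # EUR per month, from the coefficient table

per_apartment = effect * months                            # about 12,900 EUR
whole_building = per_apartment * n_apartments              # about 129,000 EUR
ci_per_apartment = (ci_lower * months, ci_upper * months)  # about (10,440, 15,430) EUR
print(per_apartment, whole_building, ci_per_apartment)
```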
A good statistician should ask several further questions:

- Are there dependencies or interactions between `cheating` and the other features?
- Are the model assumptions underlying this inference on `cheating` valid?

Here, we will only address the last question, and even that one only partially. Which assumptions were made? The error term, $\epsilon = Y - E[Y]$, should be homoscedastic and normally distributed. As the error is not observable (because the true model for $E[Y]$ is unknown), one replaces $E[Y]$ by the model prediction $\hat{E}[Y]$; this gives the residuals, $\hat{\epsilon} = Y - \hat{E}[Y] = y - \text{fitted values}$, instead. For homoscedasticity, the residuals should look like white (random) noise. Normality, on the other hand, becomes less of a concern with larger data thanks to the central limit theorem. With about 3000 data points, we are far from small data, but it might still be a good idea to check for normality.
The diagnostic tools to check this are residual plots and quantile-quantile (QQ) plots.
```python
# See notebook for a definition of residual_plot.
import seaborn as sns

fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
ax = residual_plot(model=lm, X=X_train, y=y_train, ax=axes[0])
sns.kdeplot(
    x=lm.predict(X_train),
    y=residuals(lm, X_train, y_train, kind="studentized"),
    thresh=0.02,
    fill=True,
    ax=axes[1],
).set(
    xlabel="fitted",
    ylabel="studentized residuals",
    title="Contour Plot of Residuals",
)
```
```r
autoplot(model, which = c(1, 2))  # from library ggfortify

# density plot of residuals
ggplot(model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_density_2d() +
  geom_density_2d_filled(alpha = 0.5)
```
The more data points one has, the less informative a scatter plot becomes. Therefore, we put a contour plot on the right.
Visual insights:
```python
# See notebook for a definition of qq_plot.
qq_plot(lm, X_train, y_train)
```
```r
autoplot(model, which = 2)
```
The QQ plot shows the quantiles of the theoretically assumed distribution of the residuals on the x-axis and the ordered values of the residuals on the y-axis. In the Python version, we decided to use the studentized residuals because normality of the error implies a Student's t distribution for these residuals.
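The `residuals(..., kind="studentized")` helper used above is defined in the accompanying notebook. As a self-contained illustration of the concept (not the notebook's implementation), internally studentized residuals can also be obtained by refitting the same formula with statsmodels, assuming `X_train` is a pandas DataFrame:

```python
# Sketch: internally studentized residuals r_i = (y_i - fit_i) / (sigma_hat * sqrt(1 - h_ii)),
# obtained by refitting the same formula with statsmodels (illustration only).
import statsmodels.formula.api as smf

ols = smf.ols(
    "rent ~ bs(area, degree=3, df=4) + yearc"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)",
    data=X_train.assign(rent=y_train),  # assumes X_train is a pandas DataFrame
).fit()
stud_resid = ols.get_influence().resid_studentized_internal
```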
Concluding remarks:
If we are only interested in predictions of the mean rent, $\hat{E}[Y]$, we don't care much about the probability distribution of $Y$. We just want to know whether the predictions are close enough to the real mean of the rent, $E[Y]$. By a similar argument as for the error term and the residuals, we have to accept that $E[Y]$ is not observable (it is the quantity that we want to predict). So we have to fall back on the observations of $Y$ in order to judge whether our model is well calibrated, i.e., close to the ideal $E[Y]$.
Very importantly, we make use of the test sample in all of the following diagnostics because we fear in-sample bias.
We start simply with a look at the unconditional calibration, that is, the average (negative) residual $\frac{1}{n}\sum_i(\hat{E}[Y_i]-Y_i)$.
```python
compute_bias(
    y_obs=np.concatenate([y_train, y_test]),
    y_pred=lm.predict(pd.concat([X_train, X_test])),
    feature=np.array(["train"] * X_train.shape[0] + ["test"] * X_test.shape[0]),
)
```
```r
print(paste("Train set mean residual:", mean(resid(model))))
print(paste("Test set mean residual: ", mean(df_test$rent - predict(model, df_test))))
```
set | mean bias | count | stderr | p-value |
---|---|---|---|---|
train | -3.2e-12 | 2465 | 2.8 | 1.0 |
test | 2.1 | 617 | 5.8 | 0.72 |
It is no surprise that the mean bias on the train set is almost zero. This is the balance property of (generalized) linear models (with intercept term). On the test set, however, we detect a small bias of about 2 EUR per apartment on average.
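The balance property is easy to verify directly with the objects from above: on the training data, the mean of the predictions equals the mean of the observations (up to numerical precision), and the same difference on the test set is exactly the unconditional bias reported above.

```python
# Check the balance property: mean(prediction) - mean(observation).
import numpy as np

print(np.mean(lm.predict(X_train)) - np.mean(y_train))  # ~0 on the training set
print(np.mean(lm.predict(X_test)) - np.mean(y_test))    # ~2 EUR on the test set
```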
Next, we have a look at reliability diagrams, which contain much more information about calibration and bias of a model than the unconditional calibration above. In fact, they assess auto-calibration, i.e., how well the model uses its own information.
An ideal model would lie on the dotted diagonal line.
```python
fig, axes = plt.subplots(ncols=2, figsize=(4.8 * 2.1, 6.4))
plot_reliability_diagram(
    y_obs=y_train, y_pred=lm.predict(X_train), n_bootstrap=100, ax=axes[0]
)
axes[0].set_title(axes[0].get_title() + f" train set (n={X_train.shape[0]})")
plot_reliability_diagram(
    y_obs=y_test, y_pred=lm.predict(X_test), n_bootstrap=100, ax=axes[1]
)
axes[1].set_title(axes[1].get_title() + f" test set (n={X_test.shape[0]})")
```
```r
iso_train = isoreg(x = model$fitted.values, y = df_train$rent)
iso_test = isoreg(x = predict(model, df_test), y = df_test$rent)

bind_rows(
  tibble(set = "train", x = iso_train$x[iso_train$ord], y = iso_train$yf),
  tibble(set = "test", x = iso_test$x[iso_test$ord], y = iso_test$yf)
) |>
  ggplot(aes(x = x, y = y, color = set)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  ggtitle("Reliability Diagram")
```
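For intuition about what these reliability diagrams estimate: the curve is essentially an isotonic (monotone) regression of the observed rents on the predicted rents, as the R code above makes explicit. A minimal Python sketch of the same idea with scikit-learn (not necessarily identical to what `plot_reliability_diagram` does internally) could look like this:

```python
# Sketch: reliability curve as isotonic regression of observed rent on predicted rent (test set).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.isotonic import IsotonicRegression

y_pred_test = lm.predict(X_test)
iso = IsotonicRegression(out_of_bounds="clip").fit(y_pred_test, y_test)

grid = np.sort(y_pred_test)
plt.plot(grid, iso.predict(grid), label="isotonic fit (reliability curve)")
plt.plot(grid, grid, linestyle="--", label="perfect calibration")
plt.xlabel("predicted rent")
plt.ylabel("estimated E[Y | prediction]")
plt.legend()
plt.show()
```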
Visual insights:
Finally, we assess conditional calibration, i.e., the calibration with respect to the features. To this end, we plot one of our favorite graphs for each feature. It shows, for each (binned) value of the feature:

- the average observed $Y$,
- the average prediction,
- the partial dependence.

```python
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 5 * 4), sharey=True)
for i, col in enumerate(["area", "yearc", "bath", "kitchen", "cheating"]):
    plot_marginal(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        X=X_train,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][0],
    )
    plot_marginal(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        X=X_test,
        feature_name=col,
        predict_function=lm.predict,
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
    if i != 0:
        axes[i][0].get_legend().remove()
        axes[i][1].get_legend().remove()
fig.tight_layout()
```
```r
xvars = c("area", "yearc", "bath", "kitchen", "cheating")
m_train = feature_effects(model, v = xvars, data = df_train, y = df_train$rent)
m_test = feature_effects(model, v = xvars, data = df_test, y = df_test$rent)

c(m_train, m_test) |>
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean", "pd"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test"
  )
```
Visual insights:
Next, we draw a bias plot, which shows the average difference of predicted minus observed values per (binned) feature value. These values should be around zero, so we can zoom in on the y-axis. This is very similar to the residual plot, but the information is better condensed for its purpose.
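To make explicit what such a bias plot aggregates, here is a minimal sketch for a single categorical feature, computed by hand with pandas; the `plot_bias` helper used below additionally bins numerical features and, as in the R code with `interval = "ci"`, adds uncertainty intervals.

```python
# Sketch: mean bias (predicted minus observed) per level of cheating on the test set.
import numpy as np
import pandas as pd

df_bias = pd.DataFrame({
    "cheating": np.asarray(X_test["cheating"]),
    "bias": lm.predict(X_test) - np.asarray(y_test),
})
print(df_bias.groupby("cheating")["bias"].agg(["mean", "count"]))
```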
```python
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 2 * 4), sharey=True)
axes[0, 0].set_ylim(-150, 150)
for i, col in enumerate(["area", "yearc"]):
    plot_bias(
        y_obs=y_train,
        y_pred=lm.predict(X_train),
        feature=X_train[col],
        ax=axes[i][0],
    )
    plot_bias(
        y_obs=y_test,
        y_pred=lm.predict(X_test),
        feature=X_test[col],
        ax=axes[i][1],
    )
    axes[i][0].set_title("Train")
    axes[i][1].set_title("Test")
fig.tight_layout()
```
```r
c(m_train[c("area", "yearc")], m_test[c("area", "yearc")]) |>
  plot(
    ylim = c(-150, 150),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )
```
Visual insights:

- For large values of `area` and for `yearc` in the 1940s and 1950s, there are only few observations available. Still, the model might be improved for those regions.
- The bias for `yearc` shows a parabolic curve. The simple linear effect in our model seems too simplistic. A refined model could use splines instead, as for `area` (see the sketch below).
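Such a refinement could, for instance, give `yearc` the same spline treatment as `area` in the glum formula from above. A possible sketch (the degrees of freedom are an arbitrary, untuned choice):

```python
# Sketch of a refined model: spline also for yearc (df/degree chosen ad hoc, not tuned).
lm_refined = glum.GeneralizedLinearRegressor(
    alpha=0,
    drop_first=True,
    formula="bs(area, degree=3, df=4) + bs(yearc, degree=3, df=4)"
    " + C(location) + C(bath) + C(kitchen) + C(cheating)",
)
lm_refined.fit(X_train, y_train)
```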
Concluding remarks:

- Predictions for `area` larger than around 120 square meters and for year of construction around the Second World War are less reliable.

The full Python and R code is available under: