What to Include in a Model and Why
By Myfanwy Johnston
How do you know what predictors to include in a model? Here are my notes on the topic from a conversation with Dr. Matt Espe, as well as from Dr. Richard McElreath’s Statistical Rethinking textbook (1st Edition).
When you include a predictor in a model, you are assuming that the samples at different levels of that predictor are not exchangeable: different levels within the predictor have different effects on the response. When you exclude a predictor from a model, you are saying that the samples of different groups or levels in that predictor are exchangeable, as far as the response is concerned. This is a strong assumption.
For example, let’s say in a model of eDNA concentration, you include a predictor variable for species, with two levels (two different species). In this case, you are saying that it matters to the response that they are different species; the two species are not exchangeable in the data-generating process. If you exclude species as a predictor in this model, you are saying that one species has exactly the same effect on the response as the other species.
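To make this concrete, here is a minimal Python sketch (standard library only) using simulated data. The species names, true means, and noise level are invented for illustration and do not come from a real eDNA dataset. The model without species gives every sample one shared mean; the model with species gives each level its own mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical simulated data: log eDNA concentration for two species
# whose true means differ (all values are invented for illustration).
true_mean = {"species_A": 2.0, "species_B": 3.0}
samples = [(sp, random.gauss(true_mean[sp], 0.5))
           for sp in true_mean
           for _ in range(50)]

# Model WITHOUT species: all samples share one mean (treated as exchangeable).
pooled_mean = statistics.fmean(y for _, y in samples)
rss_pooled = sum((y - pooled_mean) ** 2 for _, y in samples)

# Model WITH species: each level gets its own mean (not exchangeable).
group_mean = {sp: statistics.fmean(y for s, y in samples if s == sp)
              for sp in true_mean}
rss_species = sum((y - group_mean[sp]) ** 2 for sp, y in samples)

print("residual sum of squares, no species predictor:", round(rss_pooled, 2))
print("residual sum of squares, with species predictor:", round(rss_species, 2))
```

Separate group means can never fit worse than one pooled mean, so the interesting part is the size of the drop in residual sum of squares. A large drop suggests that treating the two species as exchangeable was a poor assumption for these (simulated) data.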
From this perspective, the natural tendency (certainly mine) is to just include everything. That is okay if all the predictor variables are orthogonal (uncorrelated with one another). But correlated variables carry overlapping information, and the model will struggle to say which one is needed for the rest to be exchangeable.
For example, let’s say that in a model of fish survival, you include fork length as a predictor variable. This makes sense, as larger fish usually survive better; fish with different fork lengths are not exchangeable. But if we include fork length AND fish weight, the model will probably struggle. Length and weight are strongly correlated, so it might be that once we know a fish’s fork length, knowing its weight doesn’t change predicted survival, and vice versa: conditional on one, the samples become exchangeable with respect to the other. The model cannot resolve which predictor deserves the credit, so the estimates for both become unstable and highly uncertain, and the model may not fit properly.
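The fork length and weight problem can be sketched with a small simulation. This is an illustration under stated assumptions, not a survival analysis: the response is a made-up continuous "survival score" (so that ordinary least squares applies), the predictors are standardized, weight is simulated as fork length plus a little noise, and only fork length truly affects the response. The helper names `centered_sums`, `slope_one`, and `slopes_two` are my own.

```python
import random
import statistics

def centered_sums(a, b):
    """Sum of cross-products of a and b after centering each on its mean."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def slope_one(x, y):
    """Least-squares slope for a model with a single predictor."""
    return centered_sums(x, y) / centered_sums(x, x)

def slopes_two(x1, x2, y):
    """Least-squares slopes for a model with two predictors (normal equations)."""
    s11, s22 = centered_sums(x1, x1), centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2  # approaches 0 as the predictors become collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

random.seed(42)
length_only, length_with_weight = [], []
for _ in range(200):  # 200 simulated datasets of 100 fish each
    length = [random.gauss(0, 1) for _ in range(100)]
    weight = [l + random.gauss(0, 0.2) for l in length]       # nearly collinear
    score = [0.5 * l + random.gauss(0, 1) for l in length]    # only length matters
    length_only.append(slope_one(length, score))
    length_with_weight.append(slopes_two(length, weight, score)[0])

# Spread of the fork-length coefficient across the simulated datasets.
sd_alone = statistics.stdev(length_only)
sd_with_weight = statistics.stdev(length_with_weight)
print("SD of length coefficient, length only:", round(sd_alone, 3))
print("SD of length coefficient, length + weight:", round(sd_with_weight, 3))
```

In this setup both models estimate the fork-length effect without bias on average, but the estimate swings much more from dataset to dataset once weight is added. That extra spread is what the struggle looks like in a simple linear model.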
There are other considerations (post-treatment bias, overfitting) that go beyond the scope of this post, but general best practice is to include every predictor, as long as all predictors are orthogonal to each other. In a tightly controlled experiment, we can design for completely orthogonal predictors, because treatments are randomly assigned and similarities are blocked. In observational data, that does not happen, and we need to either structure the model so it does not blow up, or choose which predictors to include based on their independence/collinearity.
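One simple way to screen observational predictors for collinearity is to look at pairwise correlations before fitting anything. The sketch below is a rough first pass, not a complete diagnostic. The data are simulated, `release_temp` is a hypothetical third predictor added for contrast, and the 0.7 cutoff is an arbitrary assumption rather than a rule.

```python
import itertools
import math
import random

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

random.seed(7)
n = 200
fork_length = [random.gauss(0, 1) for _ in range(n)]

# Hypothetical candidate predictors (simulated, standardized units).
predictors = {
    "fork_length": fork_length,
    "weight": [l + random.gauss(0, 0.2) for l in fork_length],  # tracks length
    "release_temp": [random.gauss(0, 1) for _ in range(n)],     # independent
}

THRESHOLD = 0.7  # arbitrary cutoff, chosen only for this illustration

# Flag every pair of predictors whose absolute correlation exceeds the cutoff.
flagged = []
for a, b in itertools.combinations(predictors, 2):
    r = pearson_r(predictors[a], predictors[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b))

print("strongly correlated pairs:", flagged)
```

A flagged pair is a prompt to think about the data-generating process, for example whether to keep only one of the two predictors or to restructure the model, rather than an automatic instruction to drop a variable. Pairwise correlations can also miss collinearity that only appears among three or more predictors together.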