One of the greatest joys in science is discovering something that is consistently true: an effect that appears again and again when an experiment is repeated under identical conditions. Unfortunately, such reproducibility is rarer than we would like, a problem widely known as the reproducibility crisis.
Detecting true effects is difficult for many reasons, including limited sample sizes, sampling variability, and methodological differences that influence results more strongly than expected.
One of the most important reasons, however, is that almost any measurement in biomedical research is influenced by factors other than the one under study. Ignoring these dependencies can lead to false-positive as well as false-negative findings.
In this post, we introduce covariates and confounders and discuss how to deal with them in data analysis.
What are covariates and confounders?
We define a covariate as any variable that is included in an analysis in addition to the variable of interest. For example, age may be a useful covariate when studying associations between biological markers and diagnoses.
Confounders are variables that are associated with both the outcome and the variable of interest. If not accounted for, they can induce spurious associations that are not reproducible.
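To make the idea of a spurious association concrete, here is a minimal simulation sketch. All variable names and effect sizes are assumptions chosen for illustration: a confounder drives both the variable of interest (called exposure here) and the outcome, while the exposure itself has no effect on the outcome.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the given predictors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: the confounder influences both variables,
# but the exposure has NO direct effect on the outcome.
confounder = rng.normal(size=n)
exposure = confounder + rng.normal(size=n)
outcome = confounder + rng.normal(size=n)

# Ignoring the confounder yields a clearly non-zero slope for the exposure.
naive_slope = ols_slopes(outcome, exposure)[0]

# Including the confounder moves the exposure slope close to zero.
adjusted_slope = ols_slopes(outcome, exposure, confounder)[0]

print(f"naive slope:    {naive_slope:.2f}")
print(f"adjusted slope: {adjusted_slope:.2f}")
```

In this toy setting the naive model reports an association that exists only because both variables depend on the confounder.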
How to deal with covariates and confounders?
Covariates are commonly included in statistical models to account for variance in the outcome that is unrelated to the variable of interest. Each added covariate costs a degree of freedom, which may slightly reduce statistical power, but the benefits usually outweigh the drawbacks.
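As a rough illustration of that benefit, the sketch below compares the standard error of a group effect with and without a covariate. The data are simulated, and the names (diagnosis, age, marker) and effect sizes are assumptions made for this example, not values from a real study.

```python
import numpy as np

def ols_with_se(y, predictors):
    """Ordinary least squares; returns coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: age is independent of diagnosis but strongly affects the marker.
diagnosis = rng.integers(0, 2, size=n).astype(float)
age = rng.normal(size=n)  # standardized age
marker = 0.3 * diagnosis + 1.0 * age + rng.normal(size=n)

# Model 1: diagnosis only. Model 2: diagnosis plus age as a covariate.
coef_u, se_u = ols_with_se(marker, diagnosis)
coef_a, se_a = ols_with_se(marker, np.column_stack([diagnosis, age]))

print(f"diagnosis effect without age: {coef_u[1]:.2f} (SE {se_u[1]:.3f})")
print(f"diagnosis effect with age:    {coef_a[1]:.2f} (SE {se_a[1]:.3f})")
```

Because age absorbs outcome variance that would otherwise count as noise, the diagnosis effect is estimated more precisely in the second model.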
Confounding is more difficult to handle. Including a confounder in a model may remove variance that overlaps with the effect of interest, potentially masking true effects.
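The masking problem can be sketched with simulated data in which the variable of interest shares most of its variance with the confounder. The correlation of roughly 0.95 and the effect size of 0.3 are assumptions picked to make the effect visible.

```python
import numpy as np

def slope_and_se(y, predictors):
    """Slope and standard error of the first predictor in an OLS fit."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef[1], se[1]

rng = np.random.default_rng(2)
n = 300

# Hypothetical data: the exposure is almost a copy of the confounder,
# and only the exposure truly affects the outcome.
confounder = rng.normal(size=n)
exposure = 0.95 * confounder + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
outcome = 0.3 * exposure + rng.normal(size=n)

slope_u, se_u = slope_and_se(outcome, exposure)
slope_a, se_a = slope_and_se(outcome, np.column_stack([exposure, confounder]))

print(f"unadjusted: slope {slope_u:.2f}, SE {se_u:.3f}")
print(f"adjusted:   slope {slope_a:.2f}, SE {se_a:.3f}")
```

After adjustment, little independent variance is left in the exposure, so its standard error grows severalfold and a real effect can fail to reach significance.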
Whenever possible, confounding should be avoided through appropriate study design, exploratory data analysis, or matching strategies.
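One simple form of matching is greedy one-to-one nearest-neighbour matching on a single variable. The sketch below is only one of many possible strategies. The function name, the age distributions, and the caliper of two years are assumptions for illustration.

```python
import numpy as np

def greedy_match(case_values, control_values, caliper):
    """Pair each case with the closest unused control within `caliper`."""
    available = np.ones(len(control_values), dtype=bool)
    pairs = []
    for i, value in enumerate(case_values):
        distances = np.abs(control_values - value)
        distances[~available] = np.inf  # controls are used at most once
        j = int(np.argmin(distances))
        if distances[j] <= caliper:
            pairs.append((i, j))
            available[j] = False
    return pairs

rng = np.random.default_rng(3)

# Hypothetical cohort: cases are older on average than the available controls.
case_age = rng.normal(65, 8, size=50)
control_age = rng.normal(55, 12, size=300)

pairs = greedy_match(case_age, control_age, caliper=2.0)
matched_cases = case_age[[i for i, _ in pairs]]
matched_controls = control_age[[j for _, j in pairs]]

print(f"matched pairs: {len(pairs)} of {len(case_age)} cases")
print(f"age gap before matching: {case_age.mean() - control_age.mean():.1f}")
print(f"age gap after matching:  {matched_cases.mean() - matched_controls.mean():.1f}")
```

Cases without a sufficiently close control are dropped, which trades sample size for groups that are comparable in age.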
Which covariates are important?
The selection of covariates is typically guided by expert knowledge and prior literature. Visual data inspection and dimensionality reduction techniques such as PCA can help identify unexpected sources of variance.
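As a sketch of how PCA can surface an unexpected source of variance, the example below simulates a hidden batch effect and checks whether the first principal component tracks it. The batch label, the feature counts, and the size of the shift are assumptions. PCA is computed here via a plain singular value decomposition rather than a dedicated library.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_features = 120, 30

# Hypothetical data: half of the samples come from a second batch
# that shifts the first ten features.
batch = np.repeat([0, 1], n_samples // 2)
data = rng.normal(size=(n_samples, n_features))
data[batch == 1, :10] += 2.0

# PCA via SVD of the column-centered data matrix.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * S                      # sample coordinates on each component
explained = S**2 / np.sum(S**2)     # fraction of variance per component

# Correlate the first component with the candidate covariate.
# The sign of a principal component is arbitrary, so use the absolute value.
pc1_batch_corr = abs(np.corrcoef(scores[:, 0], batch)[0, 1])

print(f"variance explained by PC1: {explained[0]:.2f}")
print(f"|correlation| of PC1 with batch: {pc1_batch_corr:.2f}")
```

In practice, one would plot the leading components and colour the points by each recorded variable (site, scanner, batch, and so on) to see which of them lines up with the dominant axes of variation.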
Covariates in machine learning studies
In machine learning, covariate adjustment is often performed prior to model training, for example by regressing the covariates out of the features. However, residual confounding may remain, particularly for non-linear effects.
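A minimal sketch of linear covariate adjustment before training is shown below, assuming a single covariate (age) and two simulated features. The adjustment is estimated on the training split only and then applied to the test split, so that no information leaks from test to training data. The function names and the data are hypothetical.

```python
import numpy as np

def fit_covariate_model(features, covariates):
    """Estimate linear covariate effects on every feature (training data only)."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return coef

def remove_covariates(features, covariates, coef):
    """Subtract the previously estimated covariate effects from the features."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    return features - design @ coef

rng = np.random.default_rng(5)
n = 1000
age = rng.normal(size=n)  # standardized age

# Feature 0 depends linearly on age, feature 1 depends on age squared.
features = np.column_stack([
    age + rng.normal(size=n),
    age**2 + rng.normal(size=n),
])

train, test = np.arange(700), np.arange(700, n)
coef = fit_covariate_model(features[train], age[train])
train_adj = remove_covariates(features[train], age[train], coef)
test_adj = remove_covariates(features[test], age[test], coef)

# The linear age effect is removed from the training features ...
linear_left = np.corrcoef(train_adj[:, 0], age[train])[0, 1]
# ... but the quadratic dependence of feature 1 on age survives the adjustment.
nonlinear_left = np.corrcoef(train_adj[:, 1], age[train] ** 2)[0, 1]

print(f"feature 0 vs age after adjustment:   {linear_left:.2f}")
print(f"feature 1 vs age^2 after adjustment: {nonlinear_left:.2f}")
```

The second correlation stays high because a linear model cannot capture the squared term, which is one way residual confounding can slip through to the trained model.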
Careful validation and collaboration-wide harmonization of preprocessing steps are therefore essential.
Summary
Variables rarely exist in isolation. Understanding covariates and confounders is essential for producing reliable and reproducible scientific results.