Statistical modeling

Multiple regression analysis is often used to fit statistical models in analyses involving multiple variables. The use of such models is often problematic, both terminologically (as discussed here: 2. Terminology) and in terms of the purpose of the analysis. The British statistician George Box coined the phrase "All models are wrong, but some are useful". In clinical and epidemiological research, three main uses are common.

First, in observational studies, multiple variables are included in a statistical model to adjust effect size estimates for confounding bias. This is an explanatory analysis, which requires assumptions about cause-effect relationships between the variables included in the analysis to produce valid estimates. Which variables to include in the analysis depends on what is known or suspected about the disease being studied, and developing the statistical model can be methodologically complicated (see Shrier I, Platt RW. Reducing bias with directed acyclic graphs. BMC Medical Research Methodology 2008 Oct 30;8:70). An alternative method is to develop a propensity score that predicts treatment allocation and stratify on this instead of individual variables. This alternative also requires careful variable selection (see, for example, Sjölander A. Propensity Scores and M-Structures. Statistics in Medicine 28, no. 9 (30 April 2009):1416-20). The problem to be avoided is residual confounding.
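The effect of adjusting for a confounder can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a recommended analysis: the data are simulated, there is a single binary confounder, the true treatment effect is set to 1.0, and all names (simulate, diff_in_means, stratified_estimate) and numbers are hypothetical. The crude comparison mixes the treatment effect with the effect of the confounder, while the estimate stratified on the confounder should land close to the true value.

```python
import random
from statistics import mean

def simulate(n=20000, true_effect=1.0, seed=1):
    """Simulated observational data with one binary confounder (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random() < 0.5                    # confounder, e.g. disease severity
        t = rng.random() < (0.8 if c else 0.2)    # treatment is more likely when c is present
        y = true_effect * t + 2.0 * c + rng.gauss(0, 1)  # outcome depends on both
        rows.append((c, t, y))
    return rows

def diff_in_means(rows):
    """Mean outcome among treated minus mean outcome among untreated."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

def stratified_estimate(rows):
    """Weighted average of the within-stratum differences, weights = stratum size."""
    total = len(rows)
    estimate = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        estimate += len(stratum) / total * diff_in_means(stratum)
    return estimate

rows = simulate()
crude = diff_in_means(rows)           # biased by confounding
adjusted = stratified_estimate(rows)  # adjusted for the measured confounder
print(f"crude: {crude:.2f}, adjusted: {adjusted:.2f}")
```

The adjustment works here only because the confounder is measured and the assumed causal structure is correct. If a confounder is omitted or measured poorly, the adjusted estimate remains biased, which is the residual confounding mentioned above.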

Second, statistical models are also used to analyse randomised trials, but not to adjust for confounding, as this is dealt with in the study design. Instead, the model is used to adjust for the stratification factors used in the randomisation, to analyse centre-specific effects in multicentre trials, and to estimate change from baseline in continuous variables. The study design and the trial protocol define the variables to be included in these statistical models. The problem to avoid is unnecessarily low precision, i.e. p-values that are too high and confidence intervals that are too wide.
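The gain in precision from using baseline information can be sketched with repeated simulated trials. The example below is an illustration under assumptions, not a prescribed analysis: the outcome is continuous and strongly correlated with its baseline value, the true effect is set to 2.0, and the function name one_trial and all numbers are hypothetical. It compares the spread of two estimates over many trials: the difference in follow-up means and the difference in mean change from baseline.

```python
import random
from statistics import mean, stdev

def one_trial(rng, n_per_arm=50, true_effect=2.0):
    """Simulate one two-arm trial and return two estimates of the treatment effect."""
    followup = {0: [], 1: []}
    change = {0: [], 1: []}
    # Participants are exchangeable, so fixed arm sizes stand in for randomisation.
    for arm in (0, 1):
        for _ in range(n_per_arm):
            baseline = rng.gauss(50, 10)
            y = baseline + true_effect * arm + rng.gauss(0, 5)
            followup[arm].append(y)
            change[arm].append(y - baseline)
    return (mean(followup[1]) - mean(followup[0]),   # ignores baseline
            mean(change[1]) - mean(change[0]))       # uses change from baseline

rng = random.Random(2)
results = [one_trial(rng) for _ in range(500)]
followup_estimates = [r[0] for r in results]
change_estimates = [r[1] for r in results]

# Both estimates are unbiased under randomisation; they differ in precision.
sd_followup = stdev(followup_estimates)
sd_change = stdev(change_estimates)
print(f"SD of estimate, follow-up only: {sd_followup:.2f}")
print(f"SD of estimate, change from baseline: {sd_change:.2f}")
```

The change-from-baseline estimate is the more precise one here because baseline and follow-up values were simulated to be strongly correlated. With weak correlation the ordering can reverse, which is one reason the choice of analysis model belongs in the trial protocol.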

Third, if the focus is not on parameter estimates but on prediction, for example, in developing a prognostic score, data-driven modelling (e.g., forward or backward stepwise regression or lasso regression) can be used. In this case, the goal is not valid and precise effect size estimates but optimal predictive accuracy, for example in terms of sensitivity and specificity. The problem to avoid is overfitting: the model adapts to random variation and shows high predictive accuracy in the dataset used to develop it but low accuracy in other datasets.
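Overfitting can be demonstrated with data that contain no signal at all. The sketch below is a deliberately simplified stand-in for data-driven selection, not stepwise or lasso regression: the outcome and the 50 candidate predictors are independent coin flips, and the "model" is simply the single predictor (possibly inverted) that classifies best in the development data. All names and numbers are hypothetical.

```python
import random

def make_data(rng, n, n_predictors):
    """Outcome and predictors are independent coin flips: there is no true signal."""
    X = [[rng.random() < 0.5 for _ in range(n_predictors)] for _ in range(n)]
    y = [rng.random() < 0.5 for _ in range(n)]
    return X, y

def accuracy(X, y, j, flip):
    """Proportion correct when predicting the outcome from predictor j (inverted if flip)."""
    hits = sum((row[j] != flip) == outcome for row, outcome in zip(X, y))
    return hits / len(y)

def select_best_rule(X, y):
    """Data-driven selection: keep the rule with the highest accuracy in this dataset."""
    candidates = [(j, flip) for j in range(len(X[0])) for flip in (False, True)]
    return max(candidates, key=lambda rule: accuracy(X, y, *rule))

rng = random.Random(3)
X_dev, y_dev = make_data(rng, n=60, n_predictors=50)    # development data
X_val, y_val = make_data(rng, n=5000, n_predictors=50)  # independent validation data

rule = select_best_rule(X_dev, y_dev)
apparent = accuracy(X_dev, y_dev, *rule)    # optimistic: same data used for selection
validated = accuracy(X_val, y_val, *rule)   # expected to be close to chance
print(f"apparent accuracy: {apparent:.2f}, validated accuracy: {validated:.2f}")
```

The selected rule looks useful in the development data only because it was chosen for fitting the random variation in those data. In the independent data its accuracy falls back towards 50%, which is why predictive accuracy needs to be assessed in data not used for model development.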

Many publications confuse the purpose of modelling and the presentation of results. The most common problem is probably the combination of data-driven model development and presentation of effect size estimates. See Ramspek CL, Steyerberg EW, Riley RD, Rosendaal FR, Dekkers OM, Dekker FW, et al. Prediction or causality? A scoping review of their conflation in current observational research. Eur J Epidemiol. 2021 Sep;36(9):889-98.

