Predictive inference

A special form of statistical inference is known as predictive inference. Instead of learning about unknown parameters, sampled data are used to predict new observations. For example, data on the treatment outcomes of existing patients can be used to develop a statistical model that facilitates clinical decision-making for new patients.

Predictive inference, often discussed alongside machine learning and data science, is expanding rapidly. Its potential is exemplified by the development of large language models like ChatGPT, which are essentially statistical models that predict words.

The same statistical methods (e.g. various forms of regression analysis) can be used irrespective of purpose, but there are three major differences between explanatory models (models used for learning about parameters) and prediction models.

First, prediction models are developed for optimal prediction accuracy, not optimal validity. For example, when modelling the risk of lung cancer in relation to cigarette smoking, including a variable indicating exposure to safety matches may increase the predictive accuracy but reduce the internal validity of the model. In this example, the inclusion can induce confounding bias in the estimated effect of smoking. Consequently, the parameters of a prediction model cannot be causally interpreted.

Second, while the parameter estimates of a prediction model are technically necessary for calculating predictions, confidence intervals and p-values for these parameters are irrelevant. The performance of a prediction model is measured in terms of sensitivity and specificity or the area under the ROC curve, and the uncertainty of specific predictions can be measured using prediction intervals. None of these measures are relevant for explanatory models.
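To make these measures concrete, here is a minimal Python sketch of how sensitivity, specificity and the area under the ROC curve could be calculated from predicted risks. The eight outcomes and risks, the 0.5 threshold and the function names are made up for illustration and do not come from any real study.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Classify as positive when the predicted risk reaches the threshold."""
    tp = fn = tn = fp = 0
    for y, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_score):
    """AUC as the share of (case, non-case) pairs in which the case has the
    higher predicted risk; tied pairs count as one half."""
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    non_cases = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Hypothetical observed outcomes (1 = event) and predicted risks.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

sens, spec = sensitivity_specificity(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(sens, spec, auc)
```

For these eight made-up patients, the sensitivity and the specificity are both 0.75 and the area under the ROC curve is 0.9375. Note that sensitivity and specificity depend on the chosen threshold, while the area under the ROC curve summarises all thresholds.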

Third, while explanatory models are developed for optimal validity in their parameter estimates and need to include adjustments to reduce bias, the main problem for prediction models is overfitting, i.e., adaptation to random variability in a dataset. Prediction models, therefore, need validation, both internal and external. Without it, the calculated prediction accuracy appears too promising.
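The effect of overfitting can be illustrated with a small simulation. The following Python sketch uses an entirely hypothetical setup: one predictor, an assumed true risk of 0.2 + 0.6x, and a deliberately overfitting nearest-neighbour rule that simply memorises the development data. It compares the apparent accuracy in the development data with the accuracy in new data generated the same way, which is closer to a split-sample check than to a true external validation.

```python
import random


def simulate(n, rng):
    """Simulate n patients: predictor x in [0, 1], assumed risk 0.2 + 0.6 * x."""
    data = []
    for _ in range(n):
        x = rng.random()
        risk = 0.2 + 0.6 * x
        y = 1 if rng.random() < risk else 0
        data.append((x, y))
    return data


def predict_1nn(development, x):
    """Predict the outcome of the development patient with the closest x."""
    nearest = min(development, key=lambda patient: abs(patient[0] - x))
    return nearest[1]


def accuracy(development, evaluation):
    """Proportion of evaluation patients whose outcome is predicted correctly."""
    hits = sum(1 for x, y in evaluation if predict_1nn(development, x) == y)
    return hits / len(evaluation)


rng = random.Random(1)            # fixed seed so the run is reproducible
development = simulate(200, rng)  # data used to "fit" the model
new_patients = simulate(200, rng) # data the model has never seen

apparent = accuracy(development, development)
validated = accuracy(development, new_patients)
print(f"apparent accuracy:  {apparent:.2f}")
print(f"validated accuracy: {validated:.2f}")
```

Because every development patient is its own nearest neighbour, the apparent accuracy is perfect, while the accuracy among new patients is clearly lower. The gap between the two is the optimism that validation is meant to reveal.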


Jonas Ranstam