Statistical reviewing is a complex task that requires theoretical education as well as experience from practical medical research. My ambition with this brief presentation is not to provide detailed suggestions on how to perform a general statistical review, but to explain some major principles that, in my experience, are often forgotten or overlooked in favour of more technical details. I also suggest a number of references that describe the discussed phenomena in more depth and can be recommended for further reading.

## Background

The cells, animals, limbs, patients, etc. that are studied or experimented on in medical research typically represent greater populations. The definition of these depends on what the investigator wishes to generalise the findings to. The purpose of studying the treatments among 179 patients with osteoporosis, for example, is usually to learn something that can be of benefit for all osteoporosis patients, including future ones.

A population including all future osteoporosis patients would, however, be infinitely large, not restricted in time or space, and it would be practically impossible to study such a population directly. The only possibility is to study samples and infer the findings to the population. This, however, makes the findings uncertain, because different samples from the same patient population can have different properties. To avoid being misled by this sampling variation, the inferential uncertainty of the findings must be quantified.
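Sampling variation can be made concrete with a small simulation. The sketch below is only an illustration: the population mean and SD are hypothetical values chosen for the example, and the sample size of 179 simply echoes the osteoporosis example above.

```python
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
POP_MEAN, POP_SD = 100.0, 15.0
SAMPLE_SIZE = 179

def draw_sample_means(n_samples, sample_size, seed=1):
    """Draw repeated random samples and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = draw_sample_means(n_samples=1000, sample_size=SAMPLE_SIZE)

# The spread of the sample means is the sampling variation. Its theoretical
# value is the standard error, SD / sqrt(n).
observed_spread = statistics.stdev(means)
theoretical_se = POP_SD / SAMPLE_SIZE ** 0.5

print(f"smallest mean = {min(means):.1f}, largest mean = {max(means):.1f}")
print(f"observed spread = {observed_spread:.2f}, theoretical SE = {theoretical_se:.2f}")
```

Every sample comes from the same population, yet no two sample means coincide. Quantifying that spread is what the inferential methods discussed below are for.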

The word “statistics” has two different meanings. It refers, in the plural, to summarised data (tables and figures), and, in the singular, to a methodology for quantifying sampling uncertainty. This methodology is based on probability theory. It is often confused with mathematics, but statistics and mathematics are fundamentally different. Mathematics is about deduction and statistics about inference.

Most medical research manuscripts rely heavily on statistics, both for communicating what has been observed in a study and for the evaluation of the presented findings’ empirical support.

The statistical reviewer’s task is to make sure that the author’s descriptions are clear and that his or her findings do have the empirical support that is claimed by the author. This includes considerations regarding the operational research question, the study design, the data collection, the statistical analysis, the results presentation, and the conclusions.

Good authors recognise and acknowledge both the strengths and the limitations of their studies and explain these to the reader. Good statistical reviewers help the authors to improve their manuscripts in this respect.

## Descriptive statistics

Describing a small sample (say n=3) is often easier than describing a large one. The description can, for example, just consist of a listing of the observed values. A larger sample (say n=17897) requires some form of aggregation of the data into summary measures. These usually describe the observations’ central tendency, their dispersion and the number of observations.

Some information will undoubtedly be lost when data are summarised, and it is important not to delete what is necessary for the readers’ understanding of the study, the methodology used and the results. For example, the number of observations and their distribution are usually important to present because this information can reveal whether or not underlying statistical assumptions are fulfilled, especially regarding independence and distribution (1).

Presenting a variable from a laboratory experiment (with n=3) in terms of mean and SD, or as a bar chart, obscures more information than it reveals. A dot plot describes the information much more clearly. With greater sample sizes dot plots become less useful and box-and-whisker plots are a better alternative. These and similar problems are discussed in an excellent article by Weissgerber et al. (2).
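A minimal sketch of why mean and SD can hide what a dot plot shows: the three samples below are constructed (they are not real data) so that each has n=3, mean 2 and SD 1, yet the individual values differ. The text "dot plot" is a crude stand-in for a real graph; a digit marks the number of observations at a position.

```python
import math
import statistics

d = 1 / math.sqrt(3)

# Three constructed samples with identical mean (2) and SD (1).
samples = {
    "evenly spread": [1.0, 2.0, 3.0],
    "two high, one low": [2 - 2 * d, 2 + d, 2 + d],
    "two low, one high": [2 - d, 2 - d, 2 + 2 * d],
}

def dot_row(values, lo=0.0, hi=4.0, width=41):
    """Render values on a text axis; each digit counts the observations there."""
    counts = [0] * width
    for v in values:
        counts[round((v - lo) / (hi - lo) * (width - 1))] += 1
    return "".join("." if c == 0 else str(c) for c in counts)

summaries = {
    label: (statistics.mean(values), statistics.stdev(values))
    for label, values in samples.items()
}

for label, values in samples.items():
    mean, sd = summaries[label]
    print(f"{label:18s} mean={mean:.2f} SD={sd:.2f}  {dot_row(values)}")
```

The summary columns are identical for all three rows; only the plotted values reveal that the samples differ.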

## Inferential statistics

Statistical methods are primarily used for quantifying uncertainty. Observed data, considered to be random samples from greater populations, are used to test hypotheses about these populations and to estimate parameters in them. The results, p-values and confidence intervals, describe the uncertainty in two different ways. The p-value is defined as the probability of getting the observed or an even more extreme observation in a random sample drawn from a population with the properties specified by a null hypothesis. A p-value lower than 5% is generally considered statistically significant, which means that the null hypothesis is rejected.

The confidence interval describes the inferential uncertainty of a parameter estimate in the form of a range of likely values. A confidence level of 95%, corresponding to a 5% significance level, is often used.
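The correspondence between a p-value and a confidence interval can be sketched as follows. This is a simplified illustration that assumes a normally distributed estimate with a known standard error; the function name and the numbers (a mean difference of 2.0 with standard error 0.8) are hypothetical.

```python
from statistics import NormalDist

def z_test_and_ci(estimate, standard_error, null_value=0.0, level=0.95):
    """Two-sided p-value and confidence interval under a normal approximation."""
    std_normal = NormalDist()
    z = (estimate - null_value) / standard_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    ci = (estimate - z_crit * standard_error, estimate + z_crit * standard_error)
    return p_value, ci

# Hypothetical result: mean difference 2.0 units, standard error 0.8.
p_value, (ci_low, ci_high) = z_test_and_ci(estimate=2.0, standard_error=0.8)
print(f"p = {p_value:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

With these numbers the p-value is about 0.012 and the 95% interval runs from roughly 0.43 to 3.57. In this construction the interval excludes the null value exactly when the p-value is below 5%, but the interval also shows which effect sizes remain compatible with the data.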

### P-value misunderstandings

Misunderstandings about p-values and confidence intervals are, unfortunately, very common. While confidence intervals often are confused with dispersion measures, e.g. standard deviation, p-values are often believed to represent the practical importance of a finding. Statistical significance is then interpreted as an indication of clinical importance and statistical non-significance as an indication that no such importance exists.

Both interpretations are fundamentally wrong. The p-value just measures inferential uncertainty; a practically trivial finding may be statistically significant and a clinically important observation can be statistically non-significant. The only rational conclusion from a statistically non-significant finding is that the sample size is too small for a statistically significant outcome.
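The distinction can be illustrated numerically. The sketch below assumes two equally sized groups, a common known SD and a normal approximation; the effect sizes and sample sizes are hypothetical and chosen only to make the contrast visible.

```python
from math import sqrt
from statistics import NormalDist

def two_group_p_value(mean_difference, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A trivial difference (0.02 SD) in a very large study.
p_trivial = two_group_p_value(mean_difference=0.02, sd=1.0, n_per_group=100_000)

# A substantial difference (0.5 SD) in a very small study.
p_important = two_group_p_value(mean_difference=0.5, sd=1.0, n_per_group=10)

print(f"trivial effect, n=100000 per group: p = {p_trivial:.6f}")
print(f"large effect,   n=10 per group:     p = {p_important:.2f}")
```

Under these assumptions the trivial difference is highly statistically significant while the substantial one is not, which is why the p-value cannot stand in for an assessment of practical importance.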

The importance of an observed finding must, therefore, be interpreted both with regard to the uncertainty of its existence in the population and to its practical consequences. Just considering statistical significance is a major mistake. As explained by Wasserstein (3): The p-value was never intended to be a substitute for scientific reasoning.

Because of these common misunderstandings, the use of p-values has been criticised by statisticians for many years. Confidence intervals have often been suggested as a better alternative for describing inferential uncertainty (4).

### Method description

The International Committee of Medical Journal Editors’ (ICMJE’s) recommendations (5), also known as the Vancouver Convention, are followed by most medical journals. These are clear with regard to the importance of presenting the statistical methods in detail: “Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results.”

However, even with the best intentions, this recommendation is not easy to comply with. Many statistical terms are frequently misunderstood and used incorrectly by authors (6). The term “independent groups t-test” is, for example, commonly presented, but this is just the SPSS term for Student’s t-test. About a dozen different tests that could be called “independent groups t-test” have been developed.

For a statistical reviewer, it is therefore important to scrutinise the terminology used. One suggestion is to use the Oxford Dictionary of Statistical Terms (7), which represents the official view of the International Statistical Institute, as a norm.

It is also important that the authors describe whether or not the assumptions underlying the statistical tests used are fulfilled, and how this has been evaluated. Without this information, it is impossible to identify how much of the authors’ conclusions has empirical support and how much is just based on assumptions.

Again, the common p-value misunderstandings sometimes make the description of the evaluation of fulfilled assumptions difficult. For example, in laboratory experiments with n=3 observations, the variables’ Gaussian distribution is typically tested using the Shapiro-Wilk test of normality. The statistical power to detect a non-Gaussian distribution with such a small sample size is, however, negligible. Hence, a conclusion based only on the statistical significance of a finding in the experiment may seem to have empirical support but can in practice be explained more or less entirely by the author’s assumption. Why, then, perform an experiment? Had it not been cheaper, much quicker, and much less work just to assume the outcome of the experiment?
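The low power at n=3 can be checked by simulation. The sketch below is an illustration under stated assumptions, not a general implementation: it relies on the fact that for exactly three observations the Shapiro-Wilk statistic W has a closed-form null distribution, so the p-value can be computed with the standard library alone (in practice a statistical package would be used). The exponential distribution is chosen here as one example of clearly non-Gaussian data.

```python
import math
import random

def shapiro_wilk_p_n3(x):
    """Shapiro-Wilk p-value for a sample of exactly three values (closed form)."""
    lo, _, hi = sorted(x)
    mean = sum(x) / 3
    ss = sum((v - mean) ** 2 for v in x)
    # W statistic for n=3; it lies between 0.75 and 1. Clamp rounding error.
    w = min(1.0, (hi - lo) ** 2 / (2 * ss))
    p = 6 / math.pi * (math.asin(math.sqrt(w)) - math.pi / 3)
    return max(0.0, p)

def rejection_rate(draw, n_sim=20_000, alpha=0.05, seed=1):
    """Share of simulated n=3 samples in which normality is rejected."""
    rng = random.Random(seed)
    rejected = sum(
        shapiro_wilk_p_n3([draw(rng) for _ in range(3)]) < alpha
        for _ in range(n_sim)
    )
    return rejected / n_sim

rate_gaussian = rejection_rate(lambda rng: rng.gauss(0, 1))
rate_exponential = rejection_rate(lambda rng: rng.expovariate(1.0))

print(f"rejection rate, Gaussian data:    {rate_gaussian:.3f}")
print(f"rejection rate, exponential data: {rate_exponential:.3f}")
```

For Gaussian data the rejection rate should be close to the 5% significance level, as expected. For the strongly skewed exponential data it stays below 10%, so a non-significant normality test at n=3 says very little about whether the Gaussian assumption holds.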

### Multiplicity issues and Bonferroni correction

While a significance level of 5% implies a false positive rate of 5% when testing one null hypothesis, performing several tests at a 5% significance level increases the overall false positive rate as each test has a risk of 5%. This problem is often referred to as a mass-significance phenomenon, which requires correction of the significance level. The Bonferroni method (dividing the significance level by the number of tested null hypotheses) is one such method.
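The growth of the overall false positive rate is easy to compute when the tests are assumed to be independent (an assumption made here for simplicity; correlated tests behave differently). The function names in this sketch are illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_level(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

for k in (1, 5, 10, 20):
    uncorrected = familywise_error_rate(0.05, k)
    corrected = familywise_error_rate(bonferroni_level(0.05, k), k)
    print(f"{k:2d} tests: uncorrected = {uncorrected:.3f}, "
          f"Bonferroni-corrected = {corrected:.3f}")
```

With 10 independent tests at the 5% level, the chance of at least one false positive is about 40%. After Bonferroni correction the overall rate stays just below 5%.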

The correction is, however, in practice often too simplistic. First, multiplicity issues are a problem when performing confirmatory tests. They do not affect the interpretation of outcomes from a hypothesis-generating test, as these outcomes need to be confirmed in new studies. Second, Bonferroni correction reduces the false positive rate at the expense of an increased false negative rate. To maintain the statistical power of the study, the Bonferroni correction needs to be accounted for when designing the study, i.e. when calculating sample size. Bonferroni correction always increases the necessary sample size.
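The effect on sample size can be sketched with the usual normal-approximation formula for comparing two means, n per group = 2(z₁₋α/₂ + z_power)² / d², where d is the standardised effect size. The effect size of 0.5, the power of 80% and the five tests are hypothetical choices for the example.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample comparison."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_size ** 2)

n_single = n_per_group(0.5)                     # one test at the 5% level
n_bonferroni = n_per_group(0.5, alpha=0.05 / 5)  # five tests, Bonferroni-corrected

print(f"single test:            n = {n_single} per group")
print(f"Bonferroni, five tests: n = {n_bonferroni} per group")
```

Under these assumptions the requirement rises from 63 to 94 patients per group, roughly a 50% increase, to keep the same power after correcting for five tests.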

Furthermore, the correction must be performed in accordance with a particular strategy for addressing multiplicity issues. Otherwise, the relationship between results and conclusions may be inconsistent. For example, when testing null hypotheses for pairwise comparisons of 3 groups with 5 different endpoints, both the number of groups and the number of endpoints must be corrected for if a single significant test is considered sufficient for the conclusion. In practice, the number of groups is often corrected for but not the number of endpoints.
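The arithmetic of this example can be written out as a short sketch, assuming that every pairwise comparison is tested for every endpoint and that any single significant test would be taken as sufficient for the conclusion.

```python
from math import comb

n_groups, n_endpoints, alpha = 3, 5, 0.05

n_pairs = comb(n_groups, 2)        # pairwise comparisons between groups
n_tests = n_pairs * n_endpoints    # every pair tested on every endpoint

level_groups_only = alpha / n_pairs   # correcting for groups only
level_full = alpha / n_tests          # correcting for groups and endpoints

print(f"{n_pairs} pairwise comparisons x {n_endpoints} endpoints = {n_tests} tests")
print(f"per-test level, groups only:          {level_groups_only:.4f}")
print(f"per-test level, groups and endpoints: {level_full:.4f}")
```

Correcting for the three pairwise comparisons alone gives a per-test level of about 0.017, whereas the full set of 15 tests calls for about 0.003.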

It is often a better alternative to avoid multiplicity issues in the design of the study, for example by defining one endpoint as primary, for confirmatory testing, and by interpreting the tests of other endpoints as hypothesis generating.

Third, the decision to perform a confirmatory study and the strategy for addressing multiplicity issues must be pre-specified. It is therefore generally not possible to perform confirmatory testing in observational studies because validity issues need to be prioritised in the statistical analysis, and this makes it practically impossible to pre-specify a detailed analysis plan (8).

Some authors do, however, use Bonferroni correction *post hoc* or without specifying any strategy or primary endpoint. This may not be an incorrect procedure in itself, and the author may be able to provide a rational explanation for the approach. However, this does not change the status of the results. Without pre-specification, they will still be hypothesis generating.

### Statistical modelling

One of the most common problem areas in the statistical analysis of observational and experimental studies is the use of statistical models. First, the terminology is somewhat confusing. Modelling is usually performed using multiple regression models (a model with one regressand and multiple regressors), but the models are often incorrectly described as multivariate (9). The term multivariate, however, indicates that the model is based on a multivariate probability distribution. This is not the case with standard multiple regression models, which are based on the assumption that the model residual has a univariate Gaussian distribution.

A better description is multivariable, which simply means that the model is based on multiple variables. By analogy, a simple regression model (with one regressand and one regressor) should then be described as bivariable. Some authors do use this description, but others describe such a model as univariable, referring to the number of regressors only.

Second, the use of statistical models in medical research has two principally different main purposes, inference and prediction. Inferential models can focus either on validity or on precision. For example, in observational studies, the purpose is usually to adjust parameter estimates for the influence of confounding factors. In the analysis of randomised trials, statistical models are instead used to improve the statistical precision by conditioning the outcomes on randomisation stratification factors and on endpoint baseline imbalance when analysing change from baseline.

### Inferential models

Inferential models are sometimes developed using various data-driven methods, such as stepwise regression (which uses p-values as a model development criterion). This is a mistake because p-values and statistical significance are irrelevant for the development of inferential models.

The statistical models used in randomised trials are defined by the trial design, and the development of models for confounding adjustment depends on known or assumed cause-effect relations (10). Not including a non-significant covariate can lead to residual bias, and including a significant covariate can lead to adjustment bias, depending on the cause-effect relations between the variables.
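Both kinds of bias can be demonstrated with simulated data in which the true effect of x on y is zero. The sketch below is a simplified illustration with hypothetical variables: in the first scenario z causes both x and y (a confounder), in the second c is caused by both x and y (a collider). Adjustment for a single covariate is done by regressing it out of both x and y, which gives the same coefficient as a two-regressor linear model.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def adjusted_slope(x, y, covariate):
    """Slope of y on x after regressing the covariate out of both variables."""
    bx, by = slope(covariate, x), slope(covariate, y)
    rx = [a - bx * c for a, c in zip(x, covariate)]
    ry = [b - by * c for b, c in zip(y, covariate)]
    return slope(rx, ry)

rng = random.Random(1)
n = 20_000

# Scenario 1: z is a confounder; x has no effect on y.
z = [rng.gauss(0, 1) for _ in range(n)]
x1 = [zi + rng.gauss(0, 1) for zi in z]
y1 = [zi + rng.gauss(0, 1) for zi in z]
confounder_crude = slope(x1, y1)
confounder_adjusted = adjusted_slope(x1, y1, z)

# Scenario 2: c is a collider; x and y are independent.
x2 = [rng.gauss(0, 1) for _ in range(n)]
y2 = [rng.gauss(0, 1) for _ in range(n)]
c = [a + b + rng.gauss(0, 1) for a, b in zip(x2, y2)]
collider_crude = slope(x2, y2)
collider_adjusted = adjusted_slope(x2, y2, c)

print(f"confounder: crude = {confounder_crude:.2f}, adjusted = {confounder_adjusted:.2f}")
print(f"collider:   crude = {collider_crude:.2f}, adjusted = {collider_adjusted:.2f}")
```

In the first scenario the crude estimate is biased (about 0.5) and adjustment removes the bias. In the second the crude estimate is correct and adjustment introduces a bias of about -0.5. Both covariates would be statistically significant in a stepwise procedure, so significance cannot tell the two situations apart; only the assumed cause-effect relations can.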

### Prediction models

In contrast to inferential models, which usually concentrate on average properties, prediction models focus on individual events. The main task is to find a structure of known factors, predictors, that provides an optimal prediction of a future event (e.g. death, relapse, etc.) or can be used to classify the patient in terms of diagnosis or prognostic group.

The main problem in the development of a prediction model is that the development is based on maximising the model’s goodness-of-fit. Adapting a model to the random variation of a development dataset improves the goodness-of-fit, but it does not improve the model’s predictive accuracy when used with new data. The problem is known as overfitting. Validation plays an important part in the development of prediction models (11).
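Overfitting can be shown with a deliberately extreme example. In the sketch below, which uses simulated data with a hypothetical linear relation plus noise, a nearest-neighbour model reproduces the development data perfectly, while a simple straight-line model does not. Both are then evaluated on new data from the same source.

```python
import random

def make_data(n, rng):
    """Simulate a linear relation y = 2x with Gaussian noise (SD 3)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 3.0) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Simple least-squares regression line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest_neighbour(xs, ys):
    """Predict the outcome of the closest development observation."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
dev_x, dev_y = make_data(200, rng)   # development data
new_x, new_y = make_data(200, rng)   # new data for validation

line = fit_line(dev_x, dev_y)
nn = fit_nearest_neighbour(dev_x, dev_y)

line_apparent, line_validation = mse(line, dev_x, dev_y), mse(line, new_x, new_y)
nn_apparent, nn_validation = mse(nn, dev_x, dev_y), mse(nn, new_x, new_y)

print(f"line:              apparent MSE = {line_apparent:.1f}, validation MSE = {line_validation:.1f}")
print(f"nearest neighbour: apparent MSE = {nn_apparent:.1f}, validation MSE = {nn_validation:.1f}")
```

The nearest-neighbour model has the better apparent fit because it has memorised the noise, but it predicts new observations worse than the straight line. A goodness-of-fit measure computed on the development data alone cannot reveal this, which is why validation on other data is needed.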

## Final comments

A scientific manuscript typically starts with a question and ends with an answer, and the author usually claims that the answer is based on the presented study. A statistical reviewer should review this claim critically. The issues to address are whether the limitations of the study design, data collection and statistical analysis have been adequately described and accounted for, and whether the uncertainty of the results has been clearly presented. A statistical review can thus not be restricted to the statistical methods. The whole chain from question to answer needs to be reviewed. What is the actual empirical support for the presented findings? Does the manuscript provide a consistent and transparent description of the performed study? Does the author’s claim have any substance?

## References

1. Ranstam J. Repeated measurements, bilateral observations and pseudoreplicates, why does it matter? Osteoarthritis Cartilage 2012;20:473-475.

2. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLOS Biology 2015 DOI:10.1371/journal.pbio.1002128 April 22.

3. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. The American Statistician 2016 doi: 10.1080/00031305.2016.1154108.

4. Ranstam J. Why the p-value culture is bad and confidence intervals a better alternative. Osteoarthritis Cartilage 2012 April 11.

5. International Committee of Medical Journal Editors. Recommendations for the Conduct, Reporting, Editing and Publication of Scholarly Work in Medical Journals. 28 July 2017 Available from http://www.ICMJE.org.

6. Ranstam J. The importance of clear language. Acta Orthop 2013;84(6).

7. The International Statistical Institute. The Oxford Dictionary of Statistical Terms. Oxford University Press, New York 2003.

8. Ranstam J. Multiple p-values and Bonferroni correction. Osteoarthritis Cartilage 2016;24:763–764.

9. Peters T. Multifarious terminology: multivariable or multivariate? univariable or univariate? Paediatric Perinatal Epidemiol 2008;22:506.

10. Cook JA, Ranstam J. Statistical models and confounding adjustment. The British journal of surgery 2017;104:786-787.

11. Steyerberg EW, Harrell Jr. FE. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 2015, April 18.