Jonas Ranstam

Independent medical statistician

Statistical programming

Statistical analyses are almost always performed on a computer. This has several advantages, not least that it reduces the risk of calculation errors. Statistical software usually offers two modes of performing the calculations: working through a menu system in a graphical user interface, or running a statistical program that gives the software its instructions. The former mode is generally considered easier and quicker. The latter is harder, as it requires programming experience and knowledge of the specific software's syntax.

However, while menu-driven analyses may be a good way to learn about statistical methods, publishing research reports requires statistical programming. There are several reasons for this. Program code is, for example, necessary for reproducing analyses, for pre-specifying analyses in detail in a statistical analysis plan, and for documenting and communicating how an analysis has actually been performed.

It is not uncommon for editors or reviewers to ask to see the statistical program code behind the results of a randomised trial. The movement towards reproducible research is growing stronger with time. Data analyses and scientific claims are increasingly published with the data and the statistical program code made available, so that others can verify the findings.

When writing a statistical program, much time is spent on finding and correcting program errors. Three kinds of program errors can be distinguished: parse-time errors, run-time errors and logical errors. Parse-time errors are caused by erroneous syntax, e.g. by forgetting a comma between variable names or by placing a comma between variable names where none belongs. Run-time errors occur when a program is run, e.g. when a matrix becomes singular and cannot be inverted. The most difficult errors to detect, however, are logical errors: the program runs without complaint but produces wrong analysis results.
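
The three kinds of error can be illustrated with a minimal sketch. It is written in Python purely so that it is easy to run; Python is not one of the packages discussed here, and the helper names (invert_2x2, sample_variance_wrong) and the data are hypothetical, chosen only for illustration.

```python
import statistics

# 1. Parse-time error: the code is rejected before any of it is executed.
#    The commas between the list elements are missing.
bad_source = "values = [1 2 3]"
try:
    compile(bad_source, "<example>", "exec")
    parse_error = None
except SyntaxError as err:
    parse_error = err


# 2. Run-time error: the syntax is valid, but execution fails for this input.
def invert_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]] (hypothetical helper)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


singular = [[1, 2], [2, 4]]  # second row is twice the first, so det = 0
try:
    invert_2x2(singular)
    runtime_error = None
except ZeroDivisionError as err:
    runtime_error = err


# 3. Logical error: the code runs without any message but the answer is wrong.
def sample_variance_wrong(xs):
    """Intended to return the sample variance, but divides by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)  # should be len(xs) - 1


data = [2, 4, 4, 4, 5, 5, 7, 9]
wrong = sample_variance_wrong(data)   # sum of squares 32 divided by 8
right = statistics.variance(data)     # sum of squares 32 divided by 7

print("parse-time:", type(parse_error).__name__)
print("run-time:  ", type(runtime_error).__name__)
print("logical:    got", wrong, "but the sample variance is", right)
```

Only the first two errors announce themselves. The third is found only by comparing the result with a value obtained independently, here from the standard library.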

It is, therefore, important to develop statistical programs in a way that reduces the risk of logical errors and makes them easier to find. What is considered Good Programming Practice (GPP) differs between software packages (R, SAS, S-plus, SPSS, Stata, etc.), but some recommendations are common: design the program intellectually before writing any code; split the programming problem into a series of smaller steps; test the program on cases with known outcome; and make the program listing easy to read by structuring the code, indenting systematically, using a consistent naming convention for variables and functions, and including plenty of explanatory comments.
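
Two of these recommendations, splitting the problem into small steps and testing on cases with known outcome, can be sketched as follows. The example is in Python for ease of running, not in any of the packages named above, and the function names and the odds ratio calculation are assumptions made for illustration rather than anything prescribed by a particular GPP document.

```python
def odds(events, non_events):
    """Step 1: odds of the event within one group."""
    if non_events == 0:
        raise ValueError("odds are undefined when there are no non-events")
    return events / non_events


def odds_ratio(exposed_events, exposed_non_events,
               unexposed_events, unexposed_non_events):
    """Step 2: ratio of the odds in the exposed and the unexposed group."""
    odds_exposed = odds(exposed_events, exposed_non_events)
    odds_unexposed = odds(unexposed_events, unexposed_non_events)
    if odds_unexposed == 0:
        raise ValueError("odds ratio is undefined when the reference odds are 0")
    return odds_exposed / odds_unexposed


def run_known_outcome_checks():
    """Cases worked out by hand before the program is trusted on real data."""
    # Identical groups must give an odds ratio of exactly 1.
    assert odds_ratio(10, 20, 10, 20) == 1.0
    # By hand: (10/20) / (5/40) = 0.5 / 0.125 = 4.
    assert odds_ratio(10, 20, 5, 40) == 4.0
    # Swapping the groups must give the reciprocal, 1/4.
    assert odds_ratio(5, 40, 10, 20) == 0.25
    return True


checks_passed = run_known_outcome_checks()
print("known-outcome checks passed:", checks_passed)
```

Because each step is a small function with a single task, a failing check points to a specific step rather than to the program as a whole.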

An example of a brief SAS program (Good Programming Practice for Clinical Trials) with an incomplete program header:

**********************************************************;
* Program name      :
* Author            :
* Date created      :
* Study             : (Study number)
* Purpose           :
* Template          :
* Inputs            :
* Outputs           :
* Program completed : Yes/No
* Updated by        : (Name) – (Date): 
**********************************************************;

data test01;
  do patno=1 to 40;  /* cycle through patients */
    do visit=1 to 3; /* cycle through visits */
      output; 
    end;             /* cycle through visits */
  end;               /* cycle through patients */
run;
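
The data step above writes one observation for every combination of patient and visit, so the size of test01 is known in advance: 40 patients times 3 visits gives 120 observations. As a rough illustration of checking a program against such a known outcome, the same loops can be rendered in Python. This rendering is an assumption made for illustration and is not part of the SAS example.

```python
# Build the same skeleton as the SAS data step: one row per patient and visit.
test01 = [
    {"patno": patno, "visit": visit}
    for patno in range(1, 41)   # patients 1 to 40
    for visit in range(1, 4)    # visits 1 to 3
]

# Known outcome: 40 x 3 = 120 rows, with each combination appearing once.
n_rows = len(test01)
n_unique = len({(row["patno"], row["visit"]) for row in test01})

assert n_rows == 40 * 3, "unexpected number of rows"
assert n_unique == n_rows, "duplicate patient/visit combinations"
print("rows:", n_rows, "unique combinations:", n_unique)
```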

Complying with the relevant GPP also facilitates independent validation of the program and makes collaborative use of it possible. Not least, it allows the program to be developed further months after the initial development, when the programmer has forgotten the details of the program.