Jonas Ranstam

Independent medical statistician

Stata, R and SQL

A statistical project often includes data from a number of datasets in different formats, e.g. text-files, Stata data files, SAS data files, SPSS data files and Excel workbooks. Irrespective of whether these data are analysed using R or Stata, it can be a logistic advantage to collect all the data fomr all the files into one database before developing programs for the statistical analysis. SQL is a very useful language for storing, changing, linking and retrieving data in databases, and SQL databases can, in addition to data, also include documents like case record forms, study protocols, analysis plans, internal reports and scientific manuscripts.

Stata does not support SQL directly (apart from via ODBC), but an SQLite database can easily be generated using the RSQLite package in R, which also has other packages that can read and write data files in Stata format. SQLite provides a full-featured SQL implementation and is probably the most widely-used database software in the world.

A simple example of how to create an SQLite database, db.sqlite, with data from an Excel workbook with two data sheets and a Stata data file is presented here.

library(RSQLite)
library(readstata13)
library(gdata)

data1 <- read.xls("~/path/data.xlsx", sheet=1)
data2 <- read.xls("~/path/data.xlsx", sheet=2)
data3 <- read.dta13("~/path/data.dta")

db <- dbConnect(SQLite(), dbname="~/path/db.sqlite")

dbWriteTable(db, "sheet1", data1)
dbWriteTable(db, "sheet2", data2)
dbWriteTable(db, "stata", data3)

dbDisconnect(db)

With these files, the SQLite database includes these tables.

sheet1 (from the Excel file ~/path/data.xls)
+-----------+-----------+-----------+
| seqnr     | id        | test      |
+-----------+-----------+-----------+
| 1         | 1         | 12        |
| 2         | 1         | 23        |
| 3         | 2         | 21        |
| 4         | 2         | 44        |
| 5         | 3         | 23        |
| 6         | 3         | 41        |
| 7         | 4         | 23        |
| 8         | 4         | 42        |
| 9         | 5         | 11        |
| 10        | 5         | 12        |
+-----------+-----------+-----------+

sheet2 (from the Excel file ~/path/data.xls)
+-----------+-----------+-----------+
| seqnr     | id        | name      |
+-----------+-----------+-----------+
| 1         | 1         | A         |
| 2         | 2         | B         |
| 3         | 3         | C         |
| 4         | 4         | D         | 
| 5         | 5         | E         |
+-----------+-----------+-----------+

stata (from the Stata file ~/path/data.dta)
+-----------+-----------+
| id        | mmol      |
+-----------+-----------+
| 1         | 0.12      |
| 2         | 0.14      |
| 3         | 0.18      |
| 4         | 0.19      |
| 5         | 0.09      |
| 6         | 0.13      |
| 7         | 0.11      |
| 8         | 0.08      |
| 9         | 0.14      |
| 10        | 0.11      |
+-----------+-----------+

The data in the different tables of the database db.sqlite can now be linked and retrieved for statistical analysis by executing an SQL query. The data can then be exported as a data file in Stata’s format, expdata.dta.

library(RSQLite)
library(readstata13)

db <- dbConnect(SQLite(), dbname="~/path/db.sqlite")

sql <- "SELECT sheet1.id, sheet2.name, sheet1.test, stata.mmol
        FROM sheet1 
        LEFT JOIN sheet2 ON sheet1.id=sheet2.id
        LEFT JOIN stata ON sheet1.id=stata.id"

query <- dbGetQuery(db, sql)

save.dta13(query, "~/path/expdata.dta")

dbDisconnect(db)

The scripts can be run from inside Stata using an R source file using rsource.ado.

The exported Stata data file, expdata.dta, would with the above SQL query include these data.

. list, clean

       id   name   test   mmol  
  1.    1      A     12    .12  
  2.    1      A     23    .12  
  3.    2      B     21    .14  
  4.    2      B     44    .14  
  5.    3      C     23    .18  
  6.    3      C     41    .18  
  7.    4      D     23    .19  
  8.    4      D     42    .19  
  9.    5      E     11    .09  
 10.    5      E     12    .09 

An alternative way of retrieving the data is to run a simple shell command in Stata executing an SQLite query and saving the output in a csv-file that can be directly imported into Stata.

local query="SELECT sheet1.id, sheet2.name, sheet1.test, stata.mmol FROM sheet1 LEFT JOIN sheet2 ON sheet1.id=sheet2.id LEFT JOIN stata ON sheet1.id=stata.id;"
!(sqlite3 ~/path/db.sqlite -header -separator ", " "`query'" > ~/path/expdata.csv)

The command for importing this csv-file into Stata is “import delimited tmp.csv, varnames(1) clear”.