Jonas Ranstam

Statistical consultant

Stata, R and SQLite

A statistical project often includes data from a number of datasets, sometimes in different formats, e.g. ascii-files, Stata data files, SAS data files, SPSS data files and Excel workbooks. Irrespective of whether these data are analysed using R or Stata, it can be an advantage to collect all the data from all the files into a single database before developing programs for the statistical analysis. SQL is a powerful language for storing, changing, linking and retrieving data in databases, and SQL databases can, in addition to more conventional data files, also include documents like case record forms, photos, study protocols, analysis plans, internal reports and scientific manuscripts.

Stata does not support SQL-queries directly (apart from ODBC), but a Stata ado- (and hlp-file) for querying, importing and exporting SQLite3 databases have been developed. The script currently supports querying/importing and exporting of databases on Linux and Mac OS X but only querying/importing on Windows. To export Stata data to an SQLite3 database on Windows, first export a csv-file from Stata, then import this csv file into an SQLite3’s database using SQLite3’s internal .import command. The ado- and hlp-files can be installed to Stata by downloading them to a Stata ado directory, e.g. ~/ado/personal.

An SQLite database can also easily be handled using the RSQLite package in R, which also can read and write other data files in Stata format. SQLite provides a full-featured SQL implementation and is probably the most widely-used database.

A simple example of how to create an SQLite database, db.sqlite, with data from an Excel workbook with two data sheets and a Stata data file is presented here.

library(RSQLite)
library(readstata13)
library(gdata)

data1 <- read.xls("~/path/data.xlsx", sheet=1)
data2 <- read.xls("~/path/data.xlsx", sheet=2)
data3 <- read.dta13("~/path/data.dta")

db <- dbConnect(SQLite(), dbname="~/path/db.sqlite")

dbWriteTable(db, "sheet1", data1)
dbWriteTable(db, "sheet2", data2)
dbWriteTable(db, "stata", data3)

dbDisconnect(db)

With these files, the SQLite database includes these tables.

sheet1 (from the Excel file ~/path/data.xls)
+-----------+-----------+-----------+
| seqnr     | id        | test      |
+-----------+-----------+-----------+
| 1         | 1         | 12        |
| 2         | 1         | 23        |
| 3         | 2         | 21        |
| 4         | 2         | 44        |
| 5         | 3         | 23        |
| 6         | 3         | 41        |
| 7         | 4         | 23        |
| 8         | 4         | 42        |
| 9         | 5         | 11        |
| 10        | 5         | 12        |
+-----------+-----------+-----------+

sheet2 (from the Excel file ~/path/data.xls)
+-----------+-----------+-----------+
| seqnr     | id        | name      |
+-----------+-----------+-----------+
| 1         | 1         | A         |
| 2         | 2         | B         |
| 3         | 3         | C         |
| 4         | 4         | D         | 
| 5         | 5         | E         |
+-----------+-----------+-----------+

stata (from the Stata file ~/path/data.dta)
+-----------+-----------+
| id        | mmol      |
+-----------+-----------+
| 1         | 0.12      |
| 2         | 0.14      |
| 3         | 0.18      |
| 4         | 0.19      |
| 5         | 0.09      |
| 6         | 0.13      |
| 7         | 0.11      |
| 8         | 0.08      |
| 9         | 0.14      |
| 10        | 0.11      |
+-----------+-----------+

The data in the different tables of the database db.sqlite can now be linked and retrieved for statistical analysis by executing an SQL query. The data can then be exported as a data file in Stata’s format, expdata.dta.

library(RSQLite)
library(readstata13)

db <- dbConnect(SQLite(), dbname="~/path/db.sqlite")

sql <- "SELECT sheet1.id, sheet2.name, sheet1.test, stata.mmol
        FROM sheet1 
        LEFT JOIN sheet2 ON sheet1.id=sheet2.id
        LEFT JOIN stata ON sheet1.id=stata.id"

query <- dbGetQuery(db, sql)

save.dta13(query, "~/path/expdata.dta")

dbDisconnect(db)

The scripts can be run from inside Stata using an R source file using rsource.ado.

The exported Stata data file, expdata.dta, would with the above SQL query include these data.

. list, clean

       id   name   test   mmol  
  1.    1      A     12    .12  
  2.    1      A     23    .12  
  3.    2      B     21    .14  
  4.    2      B     44    .14  
  5.    3      C     23    .18  
  6.    3      C     41    .18  
  7.    4      D     23    .19  
  8.    4      D     42    .19  
  9.    5      E     11    .09  
 10.    5      E     12    .09 

An alternative way of retrieving the data is to run a simple shell command in Stata executing an SQLite query and saving the output in a csv-file that can be directly imported into Stata.

local query="SELECT sheet1.id, sheet2.name, sheet1.test, stata.mmol FROM sheet1 LEFT JOIN sheet2 ON sheet1.id=sheet2.id LEFT JOIN stata ON sheet1.id=stata.id;"
!(sqlite3 ~/path/db.sqlite -header -separator ", " "`query'" > ~/path/expdata.csv)

The command for importing this csv-file into Stata is “import delimited tmp.csv, varnames(1) clear”.