The (often) overlooked problem of research reproducibility in economics

 

Over the last 20 years, economics journals’ requirements for reproducibility, defined as the ability to “duplicate the results of a prior study using the same materials and procedures as were used by the original investigator,” have evolved from “data availability” to “data and code availability” to “data, code, and proper data archiving.” Enforcement of these policies has also intensified. The American Economic Review, for example, requires that authors of accepted papers provide “(a) the data set(s), (b) description sufficient to access all data at their original source location, (c) the programs used to create any final and analysis data sets from raw data, (d) programs used to run the final models, and (e) description sufficient to allow all programs to be run.”

 

This post argues that modern computing changes the conditions necessary and sufficient for reproducibility in economics from “code and data” to “code, data, and environment.”

 

All leading statistical packages have evolved from self-contained, closed products to open systems that rely, to a large degree, on user-written routines and other dependencies. An example of almost closed software is Microsoft Excel. Most Excel users rely on built-in functionality, and Microsoft provides a tool (Compatibility Checker) to ensure that workbooks created with one version of Excel produce the same results when loaded into a different version. Similar approaches, such as virtual environments in Python or the version command in Stata, ensure that different software versions execute the code consistently.
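
To make the idea of a pinned environment more concrete, here is a minimal Python sketch that records the interpreter version and the version of every installed package, using only the standard library. The function name snapshot_environment and the layout of the dictionary it returns are assumptions made for this illustration; they are not part of any tool mentioned in this post.

```python
import platform
from importlib import metadata


def snapshot_environment():
    """Return the interpreter version and a name -> version map of installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    snapshot = snapshot_environment()
    print("Python", snapshot["python"])
    for name, version in snapshot["packages"].items():
        print(f"{name}=={version}")
```

A record like this, saved next to the code and data, tells a later reader which versions produced the published numbers, although it does not by itself reinstall them.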

The reproducibility problem becomes more complicated if researchers rely on third-party routines. The authors of user-written programs (e.g., ado-files and plug-ins in Stata or packages in R) can change their code after a paper or a report is published. These changes could lead to different results even if the core code and the data used for the analysis remain unchanged. For example, summing the distinct values of the vector (1, 1, 4, 4, 5) returned an incorrect 19 under version 0.4.0 of the dplyr package in R and the correct 10 after the bug was fixed in version 0.5.0.

 

> df <- data.frame(x = c(1, 1, 4, 4, 5))

> library('dplyr', lib.loc="altloc/v4")

> print(packageVersion('dplyr'))

[1] '0.4.0'

> print(sum(distinct(df, x)))

[1] 19


> library('dplyr', lib.loc="altloc/v5")

> print(packageVersion('dplyr'))

[1] '0.5.0'

> print(sum(distinct(df, x)))

[1] 10

 

Simply providing both the core code and the user-written routines used in a program is not a solution because, in some cases, user-written routines rely on other routines that developers could change at any time. Identifying these dependencies can be time-consuming, given that the number of dependency layers (one program uses another program that relies on a third program, etc.) could be large. Using several software programs to work with Big Data—combining Stata or R with Python or C, for example—makes the problem of reproducibility even more complex.
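
The layering problem can be sketched in a few lines of Python. The example below walks a dependency graph breadth-first and collects everything a script relies on, directly or indirectly. The graph and all package names in it are hypothetical and exist only for illustration; a real project would have to obtain this information from its package manager.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package that `root` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        package = queue.popleft()
        if package in seen:
            continue  # already visited; also guards against circular dependencies
        seen.add(package)
        queue.extend(graph.get(package, []))
    return seen


# Hypothetical dependency graph: each key lists the packages it calls directly.
EXAMPLE_GRAPH = {
    "analysis_script": ["estimator", "plotter"],
    "estimator": ["matrix_utils"],
    "plotter": ["matrix_utils", "color_maps"],
    "matrix_utils": ["numeric_core"],
}

if __name__ == "__main__":
    direct = EXAMPLE_GRAPH["analysis_script"]
    everything = transitive_dependencies(EXAMPLE_GRAPH, "analysis_script")
    print(f"{len(direct)} direct dependencies, {len(everything)} in total")
    print(sorted(everything))
```

Even in this toy graph, a script with two direct dependencies ends up relying on five packages, any of which could change after publication.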

Several user-written guides for Stata and for R help researchers improve the reproducibility of their results. But these solutions are convoluted, require multiple external tools (e.g., GitHub), and target the authors, not the reviewers. Even if all components on which the core code depends are identified, updating a statistical package to use the correct versions of these components is beyond the capacities and resources available to most journal editors or other researchers who want to reproduce a paper’s results. Neither R nor Stata provides users, out of the box, with automated tools to track and manage such dependencies.

It is hard to assess how often and to what degree changes in auxiliary routines affect the reproducibility of economic results. What is clear is that when such issues occur, finding the source of the discrepancy between the original and reproduced results can be challenging and resource-consuming. These costs tend to rise steeply as more time passes after publication.
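
If the package versions used for the original analysis had been recorded, the first step of that search could be automated. The Python sketch below compares two records of package versions and lists what changed. The function name diff_manifests and the record format (a plain mapping from package name to version string) are assumptions for illustration; the dplyr versions come from the example above, and the other package names are hypothetical.

```python
def diff_manifests(original, current):
    """Compare two name -> version maps and report what differs."""
    changed = {}
    missing = []
    for name, version in original.items():
        if name not in current:
            missing.append(name)  # used originally, absent now
        elif current[name] != version:
            changed[name] = (version, current[name])
    added = sorted(set(current) - set(original))  # present now, not used originally
    return {"changed": changed, "missing": sorted(missing), "added": added}


if __name__ == "__main__":
    at_publication = {"dplyr": "0.4.0", "helper_pkg": "1.0"}
    at_reproduction = {"dplyr": "0.5.0", "other_pkg": "2.0"}
    report = diff_manifests(at_publication, at_reproduction)
    for name, (old, new) in report["changed"].items():
        print(f"{name}: {old} -> {new}")
    print("missing:", report["missing"])
    print("added:", report["added"])
```

A report like this narrows the search to the packages that actually differ, but it does not say which of those differences changed the results.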

Many researchers have experienced an uneasy feeling after receiving an email from an aspiring graduate student who could not reproduce the results of a paper published five years earlier. For the reasons described above, the original researchers often cannot reproduce those results either.

A technological solution for the reproducibility problem exists. It is to use an isolated environment that can be shared as an image. That image could be created through virtualization or “containerization,” an approach in which software, its dependencies, and configurations are packed together as a container image so that the application can run reliably on different computers. For example, a container can contain RStudio with all the dependencies, environment variables, user-written packages, and data. Deploying that image on any computer at any time will restore the snapshot of the software environment at the moment the image was created. Executing the code in that environment will guarantee reproducibility.
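
One simple way for a reviewer to confirm that a rerun inside such an environment matches the original is to compare file checksums. The Python sketch below is only an illustration under stated assumptions: the authors are assumed to have recorded a SHA-256 digest of each output file, and the file names and numbers are made-up placeholders, not results from any paper.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def results_match(path, recorded_digest):
    """Check a reproduced output file against the digest recorded by the authors."""
    return file_digest(path) == recorded_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Placeholder files standing in for the original and the reproduced output.
        original = Path(folder) / "table1_original.csv"
        rerun = Path(folder) / "table1_rerun.csv"
        original.write_bytes(b"coef,se\n0.42,0.07\n")
        rerun.write_bytes(b"coef,se\n0.42,0.07\n")
        recorded = file_digest(original)
        print("match:", results_match(rerun, recorded))
```

A byte-for-byte comparison is strict: outputs that embed timestamps or differ only in formatting would fail it even when the numbers agree, so a real check may need to compare parsed values instead.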

Open-source software such as R or Python and operating systems like Linux allow containers to be built with a preloaded operating system and statistical package(s) (the so-called base image). Researchers unpack such containers on their computers, add data and code, ensure that their results are reproduced, and send this image to a journal or other researcher. The Rocker Project, for example, maintains images with various R configurations on GitHub and Docker Hub.

Distributing container images with proprietary software, such as Stata and Windows, requires licensing. With bring-your-own-license (BYOL) containers, users can provide licenses they own for the software and OS in the container. The American Economic Association provides a BYOL container base image for Stata. Leading cloud service providers such as AWS and Azure also support pay-as-you-go licensing, in which users can purchase limited-time software licenses to deploy container images on the cloud just to run the code for the results they want to reproduce.

A container solves two main problems. It allows authors to generate the packages needed to reproduce their results, and it allows researchers who want to reproduce that research to deploy these packages reliably and efficiently on their computers. However, creating and deploying containers requires a relatively high level of technical expertise. As such, containers may not be optimal for all empirical research projects. They may, however, be the only reliable solution for mission-critical, large-scale projects in which a failure to replicate the results might create liabilities or reputational risks for the agencies and individual researchers.

The reproducibility crisis in economics may be related, to some degree, to the prohibitively high costs of organizing and maintaining statistical computing environments that allow researchers to store and share reproducible results. Reducing these costs by developing automation tools or creating dedicated services within large organizations and universities to help researchers with, for example, virtualization or containerization might improve the reproducibility of economic research results in the longer term.


A follow-up post, to be published in a few weeks, will look at a different aspect of the research reproducibility problem: the importance of writing high-quality code.
