Fast and wild: new paper on my “boottest” program

Three coauthors and I just released a working paper that explains what the wild cluster bootstrap is, how to extend it to various econometric contexts, how to make it go really fast, and how to do it all with my “boottest” program for Stata. The paper is meant to be pedagogic, as most of the methodological ideas are not new. The novel ideas pertain mainly to techniques for speeding up the bootstrap, and to something called Restricted Limited-Information Maximum Likelihood estimation. The title is “Fast and Wild: Bootstrap Inference in Stata Using boottest.”

A few years ago I read the clever study by Kevin Croke that turned a short-term deworming impact study into a long-term one. Back in 2006, Harold Alderman and coauthors reported on a randomized study in Uganda of whether routinely giving children albendazole, a deworming pill, increased their weight. (Most of these children were poorly enough off that any weight gain was probably a sign of improved health.) In that study, the average lag from treatment to follow-up was 16.6 months. But randomized trials, as I like to say, are like the drop of a pebble in a pond: their ripples continue to radiate. Kevin followed up much later on the experiment by linking it to survey data from Uwezo on the ability of Ugandan children to read and do math, gathered in 2010–11. He obtained reading and math scores for some 700 children in parishes (groups of villages) that had been part of the experiment. This let him turn a study of short-term effects on weight gain into one of long-term effects on academic ability.

In a standard move, the Croke paper clusters standard errors by parish, to combat the false precision that might arise if outcomes are correlated for children within a parish for unmeasured reasons. And because there are relatively few parishes—10 in the treatment group, 12 in the control—the paper uses the “wild cluster bootstrap” to interpret the results. This method has become popular since Cameron, Gelbach, and Miller proposed it about 10 years ago.
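For readers who want to see the mechanics, here is a minimal sketch of a wild cluster bootstrap test in plain Python. It is not how boottest works internally and it makes no attempt at speed. It assumes a single treatment regressor plus an intercept, tests the null of no effect, imposes that null when forming residuals, and flips the sign of each cluster's residuals with Rademacher (+1/-1) weights. The function names and the simulated data (10 treated and 12 control clusters, echoing the parish counts above) are made up for illustration and are not from the Croke study.

```python
import random


def cluster_t(x, y, cluster):
    """Cluster-robust t statistic for the slope b in y = a + b*x + e, under H0: b = 0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    xd = [xi - xbar for xi in x]              # regressor with the intercept partialled out
    sxx = sum(d * d for d in xd)
    b = sum(d * yi for d, yi in zip(xd, y)) / sxx
    # Sum the score contributions (demeaned x times OLS residual) within each cluster.
    scores = {}
    for d, yi, g in zip(xd, y, cluster):
        resid = (yi - ybar) - b * d
        scores[g] = scores.get(g, 0.0) + d * resid
    # Sandwich standard error, without a small-sample correction. The correction
    # would scale every t statistic equally, so it does not change the p-value.
    se = sum(s * s for s in scores.values()) ** 0.5 / sxx
    return b / se


def wild_cluster_boot_p(x, y, cluster, reps=999, seed=1):
    """Two-sided wild cluster bootstrap p-value for H0: slope = 0 (null imposed)."""
    rng = random.Random(seed)
    t_obs = cluster_t(x, y, cluster)
    # Under H0 the restricted model is intercept-only, so the restricted
    # residuals are just deviations of y from its mean.
    ybar = sum(y) / len(y)
    u_r = [yi - ybar for yi in y]
    groups = sorted(set(cluster))
    extreme = 0
    for _ in range(reps):
        # One Rademacher draw per cluster, shared by every observation in it.
        w = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [ybar + w[g] * u for g, u in zip(cluster, u_r)]
        if abs(cluster_t(x, y_star, cluster)) >= abs(t_obs):
            extreme += 1
    return extreme / reps


def make_fake_trial(effect, seed=0, n_treated=10, n_control=12, per_cluster=30):
    """Simulated data (hypothetical): treatment assigned by cluster, plus a cluster-level shock."""
    rng = random.Random(seed)
    x, y, cluster = [], [], []
    for g in range(n_treated + n_control):
        treated = 1.0 if g < n_treated else 0.0
        shock = rng.gauss(0.0, 1.0)           # unmeasured factor common to the cluster
        for _ in range(per_cluster):
            x.append(treated)
            y.append(effect * treated + shock + rng.gauss(0.0, 1.0))
            cluster.append(g)
    return x, y, cluster


x, y, parish = make_fake_trial(effect=5.0)
print(wild_cluster_boot_p(x, y, parish))
```

Each replication here re-estimates the regression from scratch, so the cost grows with the number of replications times the number of observations. That is the kind of brute-force loop that makes naive implementations slow.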

Kevin’s paper introduced me to this method. As a part of my effort to understand it, I wrote a code fragment to apply it. I quickly saw that the available programs for wild bootstrapping in Stata, cgmreg and cgmwildboot, were useful but could be dramatically improved upon, at least in speed. And so I wrote my own program, boottest, and shared it with the community of Stata users. As programs often do, this one grew in features and complexity, largely in response to feedback from users. In standard applications, like Kevin’s, the program is so damn fast it must seem like alchemy to new users, instantly returning results that would once have taken long enough that you could get a cup of coffee while you waited.

The new paper offers a pedagogic introduction to wild (cluster) bootstrapping. I’m pleased and honored to have coauthored it with James MacKinnon, Morten Nielsen, and Matthew Webb. James in particular is a giant in the field; he coauthored many of the papers that led to the development of the wild cluster bootstrap (among numerous other methods), as well as leading textbooks on econometrics.

The new paper also divulges the secrets of boottest’s speed. I think there’s a lesson here about just how much more efficiently mathematical code can sometimes be made to run when you carefully state and analyze the algorithm. And in computationally intensive techniques such as bootstraps, speed can matter.