Who has the power to count chickens? 

In March I recounted how former colleagues Michael Clemens, Steven Radelet, and Rikhil Bhavnani wrote an excellent paper in 2004 on the impact of foreign aid on economic growth, “Counting Chickens When They Hatch.” The idea captured in that title is that it is important to think about the likely timing of the impacts of aid. Don’t design your analysis as if you expect that funding for teaching 6-year-olds will raise economic growth in four years. Match the follow-up period to the type of aid. Count your chickens only when they hatch.

Some years later, and joined by another CGD recruit, Samuel Bazzi, those authors overhauled their paper and published it in the top-flight Economic Journal (ungated version). The final version is quite different but also excellent (it won the journal’s best-article prize). Instead of doing its own econometrics afresh, it modifies the three most-cited studies in the aid-growth literature in light of the “counting chickens” insight. Although those studies disagree on whether and when aid “works,” in the sense of boosting growth, Clemens, Radelet, Bhavnani, and Bazzi (CRBB) conclude that revising the studies to take timing into account causes all results to converge, to a ginger but positive appraisal. (Listen to Michael speak cautionarily about “Chickens” in this Library of Economics and Liberty podcast.)

I say “excellent” and I mean it. But, true to type, I actually doubt the econometric reasoning. I am not persuaded by these results that “aid inflows are systematically associated with modest, positive subsequent growth.”

In 2010, the journal Public Finance Review inaugurated a section for replication studies, in order to increase incentives to check existing work, not just produce new work. The journal recently accepted a “Chickens” replication analysis from me (pre-publication version here, older and longer one here). The journal also solicited a reply from CRBB. The junior authors, Bhavnani and Bazzi, responded. PFR posted their response a month ago and will soon post my article.

I learned from Bhavnani & Bazzi’s reply and appreciate its simplicity and straightforward tone. But I think it is easily rebutted… And now I must get technical.

I make a few points in my replication analysis. One is that if you replace CRBB’s “early-impact” aid variable—the one counting activities that can reasonably be expected to affect growth within a few years—with its complement, non-early-impact aid, the statistical results don’t change much. Seemingly, aid not expected to have early impact has early impact too. B&B pass over my variable swap as “atheoretical (or countertheoretical)” because no theory predicts that non-early-impact aid has early impact. But I think the test has relevance, because more than one theory is available to explain correlations found between aid/GDP and growth, and empirics is supposed to discriminate between theories. For example, if we thought growth were reverse-causing aid/GDP, we might well expect its associations with early and non-early aid to be about the same. CRBB create the impression that this is not the case, thus that they are tending to rule out that competing theory. I argue the opposite: both are correlated with growth. This observation does not contradict the hypothesis that early aid causes growth. But it should weaken our confidence that competing theories have been ruled out.

Also, I point out that CRBB’s preferred regressions don’t suggest a statistically strong aid-growth association, even at the estimated point of maximum impact. (CRBB allow impact to be quadratic.) As far as I can see, the reply from B&B does not address this.

Rather, B&B focus on something else.

In addition to narrowing the aid variable to activities that can reasonably be expected to affect growth within a few years (“early-impact aid”), CRBB take two other timing-related steps to combat endogeneity. They lag early aid/GDP by one period. (The period lengths of the three panels analyzed are four, five, and ten years.) And they difference the data to remove omitted variable bias from fixed factors. I point out that the second step partly undoes the first, reopening the door to contemporaneous endogeneity. If we fear that aid/GDP in the late 1990s is endogenous to growth in the late 1990s, then we could well suspect the same of the change in aid/GDP from the early 1990s to the late 1990s and the change in growth from the late 1990s to the early 2000s. The timeframes of the two differenced variables overlap.
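The overlap is easy to see in symbols. Writing \(g\) for growth and \(t\) for the period index (my notation, for illustration only), one lag followed by first-differencing gives:

```latex
\underbrace{\Delta \mathrm{aid}_{t-1}}_{\text{regressor}}
  = \mathrm{aid}_{t-1} - \mathrm{aid}_{t-2},
\qquad
\underbrace{\Delta g_{t}}_{\text{outcome}}
  = g_{t} - g_{t-1}.
```

Both differences contain a period-\(t-1\) term, so if \(\mathrm{aid}_{t-1}\) responds to \(g_{t-1}\), contemporaneous endogeneity survives the lag-and-difference combination. Lagging twice instead makes the regressor \(\mathrm{aid}_{t-2} - \mathrm{aid}_{t-3}\), which shares no period with the outcome difference.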

To eliminate the overlap, I twice-lag aid/GDP in differences. This turns out to flip the signs on the association of aid/GDP with growth while hardly affecting standard errors.

B&B argue that I made an elementary mistake:

We learn in introductory statistics that a test must have sufficient power to reject the null when the null is false. If power is too low, a failure to reject the null conveys little information. A common standard in empirical economics is a minimum power of 0.8 to 0.9….All of Roodman’s modified regressions of growth on early impact aid have been constructed in such a way that their power to reject the null is very low—around 0.1 to 0.2.

In other words, yes, twice-lagged aid is more exogenous, but since it’s from longer ago, the signal of its impact on growth is harder to detect. My null results can be easily explained by lack of power rather than lack of impact. B&B quote an e-mail from Michael to me: “You’re introducing tests that arguably have less bias, but clearly have less power….Claiming that a null result is meaningful requires [you] to demonstrate it has power. This is not optional.”

I tried to take the reminder about the bias-efficiency trade-off to heart. My longer original submission to PFR includes regressions in levels, which not only retain more information, not being differenced, but are exempt from the above argument for twice-lagging. The un-differenced, once-lagged regressions were stripped out of the journal article for concision, being seen as less rigorous.

Actually, I never took statistics, so I didn’t learn about power calculation methods like the one B&B deploy. But at some point I was taught about double standards. So that one doesn’t slip in here, I have augmented B&B’s table of power calculations for my regressions with parallel calculations for theirs. (Code here, starting at line 224. Results in table below.) It turns out that according to the B&B method, my twice-lagging does not deplete power. If my regressions “have been constructed in such a way that their power to reject the null is very low” then so have the corresponding ones in CRBB. For example, if the true coefficient on early-impact aid/GDP is 0.3 in the specification modeled on Burnside and Dollar (2000), then the CRBB version of it has a 19.9% chance of detecting the association at p = 0.05 while my versions have a 15.9–24.2% shot at detection.

| Regression | Sample size | Power if true coefficient = 0.15 | Power if true coefficient = 0.30 | Standard error on early aid/GDP |
|---|---|---|---|---|
| Regressions based on Burnside and Dollar (2000) | | | | |
| CRBB Table 7, col 11: once-lag aid/GDP | 323 | 0.086 | 0.199 | 0.316 |
| Roodman Table 5, col 1: twice-lag aid/GDP | 276 | 0.079 | 0.168 | 0.307 |
| Roodman Table 6, col 1: + quadratic controls | 276 | 0.076 | 0.159 | 0.338 |
| Roodman Table 6, col 2: + more data | 381 | 0.096 | 0.242 | 0.255 |
| Regressions based on Rajan and Subramanian (2008) | | | | |
| CRBB Table 9, col 11: once-lag aid/GDP | 268 | 0.092 | 0.224 | 0.384 |
| Roodman Table 5, col 3: twice-lag aid/GDP | 215 | 0.086 | 0.202 | 0.365 |
| Roodman Table 5, col 5: + quadratic controls | 215 | 0.085 | 0.195 | 0.373 |
| Roodman Table 5, col 6: + more data | 269 | 0.101 | 0.261 | 0.356 |

Note: Power is calculated according to Dupont and Plummer (1998), following Bazzi and Bhavnani (2014). The last column contains reported standard errors from associated regressions. Associated regressions are all run in differences, with Anderson-Hsiao instrumentation for initial GDP/capita, and (early-impact aid/GDP)² as a regressor too.
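For intuition about where numbers like these come from, here is a rough sketch of the kind of power calculation involved. It uses a plain normal approximation rather than the exact Dupont-Plummer formula (which works with the t distribution and adjusts for covariates), so it only lands in the same ballpark as the table; the 0.30 coefficient and 0.316 standard error come from the first Burnside-and-Dollar row above.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(true_coef, se, z_crit=1.959964):
    """Approximate power of a two-sided 5%-level test for a coefficient
    with true value `true_coef` and standard error `se` (normal
    approximation; Dupont-Plummer's exact formula differs)."""
    z = true_coef / se
    # probability the estimate lands beyond either critical value
    return phi(z - z_crit) + phi(-z - z_crit)

# Reported SE on early aid/GDP in the CRBB Burnside-and-Dollar row: 0.316
power = approx_power(0.30, 0.316)
print(round(power, 3))  # same low-power ballpark as the table
```

With a true coefficient of 0.3 and a standard error around 0.32, the estimate is simply too noisy for a 5% test to reject zero much more often than one time in five, whichever paper’s specification produced it.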

That the CRBB regressions also clock in as under-powered poses a paradox: if the association is so hard to discern, how do CRBB discern it? I see two answers. First, as I mentioned, in CRBB’s preferred regressions, the association isn’t reliably there. So in my view, CRBB read too much into their results. (For the OLS-in-levels regressions dropped from my final paper, the story is largely the same, especially after adding more data. See the “Early vs. non-early rev ed” tab of this results spreadsheet.)

Second, the invoked Dupont-Plummer power calculation for detecting a linear relationship looks inappropriate for all of these regressions, which are quadratic in aid. The B&B calculations include \(x_1^2\), i.e., (early-impact aid/GDP)², among the controls. So we are estimating the power to detect the influence of \(x_1\) on \(y\) while fixing \(x_1^2\). Conceptually, I think this doesn’t make sense: conditioning on its square, \(x_1\) doesn’t vary, so it can have no detectable impact. If we restrict ourselves, as in the regressions, to conditioning linearly, then we still face the challenge of estimating the coefficient on \(x_1\) while controlling for the nearly collinear \(x_1^2\). That challenge is substantial. And that is what all the low power values in the above table are telling us. It is true, but kind of irrelevant. We need a power calculation designed for testing for nonlinear associations.
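A quick simulation illustrates the collinearity point. The numbers here are made up for illustration (uniform draws standing in for aid/GDP shares), not CRBB’s data; the point is only that for a positive regressor, \(x_1\) and \(x_1^2\) move almost in lockstep, so the variance of the estimated coefficient on \(x_1\) is inflated by roughly the variance inflation factor \(1/(1-r^2)\):

```python
import random
from math import sqrt

random.seed(1)
# Stand-in draws for a positive share like aid/GDP (illustrative only)
x = [random.uniform(0.01, 0.15) for _ in range(300)]
x2 = [v * v for v in x]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sqrt(sum((u - ma) ** 2 for u in a))
    sb = sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

r = corr(x, x2)
vif = 1 / (1 - r * r)  # variance inflation on x1 when x1^2 is also a regressor
print(round(r, 3), round(vif, 1))
```

In draws like these the VIF comes out large, meaning the standard error on \(x_1\) is inflated several-fold. That precision loss, not anything special about twice-lagging, is what the low power figures in the table are registering.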

So the argument in B&B should be set aside. Instead, I suggest, look at the standard errors in the various regressions. (Last column above.) The story stays the same: contra B&B, the twice-lagging modification doesn’t deplete power, even as it produces opposite results.

Come to think of it, I think power calculations are supposed to be run before analysis, in order to judge the likely benefit of an investment in data collection. Once data are collected and regressions run, the regressions themselves become better guides to statistical power, because they incorporate more information. I’m not sure it makes sense to judge regressions with power calculations.

I only saw the B&B response after it was frozen in journal type. Which makes me wonder: is there some more efficient and productive way for journals to encourage back-and-forth before publishing replications, so that authors and readers are served by stronger, debate-winnowed arguments? I well know that debate can spiral, to the bewilderment of editors, beleaguerment of authors, and befuddlement of readers, all of whom face many demands on their time. But maybe more can be done to improve matters. I’ll try to blog about that soon.
