The April issue of the Journal of Development Studies includes the final version of my article with Jonathan Morduch replicating the study of the impact of microcredit in Bangladesh by Mark Pitt and Shahidur Khandker. Properly, the journal also carries a reply from Mark Pitt. (Ungated versions of the dueling documents are here and here.) To my surprise, JDS did not solicit a rejoinder from us the way they did in a nearly identical situation involving a JDS editor as replicating author. Perhaps this is a sign of the strength the editors see in our paper…which is to say, maybe I should have just chilled.
But as usual, Pitt’s arguments are strongly worded even as their subject remains technical. So the average reader will absorb the style more than the substance, and wonder, I fear, who are these fools Roodman and Morduch? So for the public record, here is a rejoinder from yours truly. It’s a quote-and-response.
“RM [Roodman & Morduch] have backed off many of their prior claims and methods.”
No. The first version of our paper questions the exogeneity of the core intent-to-treat variables; highlights that an asserted discontinuity in treatment, central to Pitt & Khandker’s (PK’s) claim to quasi-experimental status, is absent from the data; observes that the magnitude of the impact estimates depends on an arbitrary censoring choice for the “log of 0”; and demonstrates that a more-robust linear estimator produces no evidence of impact. Those arguments stand.
Of course, I have also learned from the debate with PK. RM (2009) failed to replicate PK’s original impact estimates, getting opposite signs. Pitt (2011) showed us how to match, by swapping in a missing control and adopting a different censoring value. As it happens, these corrections added to our doubts, because their ability to flip our impact estimates turned out to manifest another problem, bimodal instability.
Other than that, there have been minor corrections and improvements. E.g., RM (2009), Table 4, applies Sargan along with Hansen instrument validity tests to regressions for individual survey rounds; but I realized, along with Pitt, that the Sargan tests are invalid because the observation weighting of the regressions induces heteroskedasticity. And I came to appreciate that a measure of instrument weakness cited in RM (2011) was distorted by a high instrument count, so the final paper does not rely upon that.
The PK nonlinear estimator, “contrary to the claim of RM, is not complex.”
This is a surprising thing to dispute. Here, complexity is a secondary, subjective notion, not an econometric one. If you want to form your own opinion on the complexity question, browse the couple-dozen formulas in the PK (1998) appendix or my coding thereof. I know a brilliant economist who spent a week in graduate school puzzling over PK.
“The RM likelihood is…”; “Both PK and RM estimate their models by maximising their respective likelihoods.” “There are two differences between the PK log-likelihood…and the RM log-likelihood.” “…the RM approach…”
The suggestion of a dichotomy between PK and RM methods is deeply misleading. RM use two estimators: a nonlinear one proposed and implemented in PK (1998), and a linear one proposed in PK (1998, note 16) and implemented in Pitt (1999). Neither constitutes a distinctive “RM approach.” We deploy the nonlinear one in 20 variants (different samples, dependent variables, etc.) and the linear one in seven. We interpret the results through hypotheses about mathematical mechanisms and give less weight to some in light of specification tests. As best I can determine, “the RM approach” perceived by Pitt encompasses two of the seven linear regressions (our Table 5, columns 1 and 2). 1 Yet as our text makes plain, those two least resemble our preferred one (column 7).
“RM estimate their models using linear limited information maximum likelihood (LIML), which is particularly unstable compared to plain 2SLS…”; “…compounds the bias by using linear LIML rather than 2SLS.”
No. Our preferred variants of the PK/Pitt linear estimator are exactly identified. For them, LIML and 2SLS coincide.
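The coincidence of LIML and 2SLS under exact identification can be checked numerically. What follows is a minimal sketch on simulated data, not the RM or PK code: it assumes one endogenous regressor, one instrument, and a constant as the only control, and the function names (`liml_and_2sls`, `simulate`) are made up for illustration. LIML is computed as the k-class estimator with kappa set to the smallest root of det(A - kappa·B) = 0, which should come out as 1 when there are exactly as many instruments as endogenous regressors.

```python
import random


def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual(v, z):
    """Residual of v after regressing it on z (both already demeaned)."""
    b = dot(z, v) / dot(z, z)
    return [a - b * c for a, c in zip(v, z)]


def liml_and_2sls(y, x, z):
    """Return (kappa, LIML slope, 2SLS slope) for one regressor and one instrument."""
    y, x, z = demean(y), demean(x), demean(z)
    ry, rx = residual(y, z), residual(x, z)
    # A: cross-products of (y, x) after partialling out the constant only.
    a11, a12, a22 = dot(y, y), dot(y, x), dot(x, x)
    # B: cross-products after partialling out the constant and the instrument.
    b11, b12, b22 = dot(ry, ry), dot(ry, rx), dot(rx, rx)
    # det(A - kappa*B) = 0 is a quadratic in kappa for this 2x2 case.
    qa = b11 * b22 - b12 ** 2
    qb = -(a11 * b22 + a22 * b11 - 2 * a12 * b12)
    qc = a11 * a22 - a12 ** 2
    disc = max(qb * qb - 4 * qa * qc, 0.0)
    kappa = 2 * qc / (-qb + disc ** 0.5)  # smaller root, numerically stable form
    liml = (a12 - kappa * b12) / (a22 - kappa * b22)  # k-class estimator
    tsls = dot(z, y) / dot(z, x)  # 2SLS with one instrument is the simple IV ratio
    return kappa, liml, tsls


def simulate(n=500, seed=1):
    """Hypothetical data: x is endogenous because its error v also enters y."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        vi = rng.gauss(0, 1)
        ui = 0.6 * vi + rng.gauss(0, 1)
        xi = 0.8 * zi + vi
        y.append(0.5 * xi + ui)
        x.append(xi)
        z.append(zi)
    return y, x, z


kappa, liml, tsls = liml_and_2sls(*simulate())
print(f"kappa = {kappa:.8f}, LIML = {liml:.6f}, 2SLS = {tsls:.6f}")
```

With kappa equal to 1 up to rounding, the k-class formula reduces algebraically to the 2SLS one, so any instability specific to LIML has nowhere to enter.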
“In what follows, it is useful to keep in mind that it is the specification of the first-stage equation that is the key difference between RM and PK…”
What follows is an enumeration of ways in which the first stage of the PK/Pitt linear estimator is unrealistic. Which distracts from a fundamental point: linearizing a two-stage nonlinear model sacrifices efficiency for consistency. The structural inaccuracy that comes with linearity costs efficiency, not consistency. As we confirm with simulations, the linear model is robust to deviations from normality in the errors of the sort present in the PK nonlinear regressions, whereas the nonlinear one is not. The linear model is therefore a useful check on the nonlinear one.
Conceivably, Pitt’s confusion about the relative unimportance of structural accuracy in the first stage is exacerbated by never having formally analyzed the consistency requirements of the PK nonlinear estimator (at least, never in a public document). So far, the best shot at that is in RM’s sections 1.2–1.3.
“RM arbitrarily select one way to trim the data—dropping the 16 largest values for consumption—with no consideration of how other trimming strategies affect the results.”
We show that a) the errors in the headline (nonlinear) PK regression are right-skewed, violating the estimation model; that b) such a violation renders the estimator inconsistent; that c) the PK likelihood has two modes, the reported positive-impact one and a somewhat-lower-likelihood negative-impact one; and that d) reducing the skewness by dropping a handful of observations seemingly most responsible for it also causes the two modes to collapse to one, centered around zero impact. We interpret these facts as circumstantial evidence that skewness is destabilizing the PK estimator.
It is not arbitrary when discovering right skewness in a distribution assumed to be symmetric to drop rightmost observations.
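To make the logic concrete, here is a small sketch of what dropping the rightmost observations does to measured skewness. It uses simulated residuals, not the PK data: 5,202 standard-normal draws plus 16 large positive values, a contamination pattern assumed purely for illustration. The helper names are hypothetical.

```python
import random


def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def drop_largest(values, k):
    """Return the sample with its k largest values removed."""
    return sorted(values)[: len(values) - k]


def simulate_residuals(n=5218, n_outliers=16, seed=0):
    """Hypothetical residuals: a symmetric body plus a few right-tail outliers."""
    rng = random.Random(seed)
    body = [rng.gauss(0, 1) for _ in range(n - n_outliers)]
    tail = [rng.uniform(5, 9) for _ in range(n_outliers)]
    return body + tail


resid = simulate_residuals()
print(f"skewness, full sample:    {skewness(resid):.3f}")
print(f"skewness, 16 largest cut: {skewness(drop_largest(resid, 16)):.3f}")
```

In this stylized setup a handful of observations accounts for nearly all of the skewness, which is why trimming on the right is the natural diagnostic when the model assumes symmetry. Whether the same holds in the actual data is an empirical matter that the sketch does not settle.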
That said, there are many ways to select outliers, so any given way is open to debate. Pitt advocates an alternative, which is to sort the sample by the residuals from the full-sample fit, then symmetrically trim the top and bottom 16, 100, 500, or 1,000 observations (out of 5,218). Pitt also points out that estimation on three of these four subsamples suggests that credit is not after all endogenous (the estimated correlations between first- and second-stage errors are not statistically different from zero), and that therefore OLS can be relied upon.
This argument is twice flawed. First, it engages in massively endogenous sample selection, since it is based on regression residuals. Dropping the 40% of observations that least fit a model predictably reduces standard errors and increases coefficients as attenuation bias is itself attenuated. Second, Pitt finds that credit appears endogenous when 0 or 2,000 observations are dropped, but not when 32, 200, or 1,000 are. To instrument regressions in the first group and not in the second is to assume that the underlying structural model oscillates as the sample varies. Otherwise, the indication of endogeneity when 0 or 2,000 observations are dropped argues for instrumenting all the regressions.
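The first flaw can be illustrated with a toy simulation. The sketch below is plain OLS on made-up data, not the PK estimator or Pitt's procedure; the sample size of 5,218 and the trimming counts echo the numbers above, while the true slope of 0.05 and the function names are assumptions chosen for illustration. It sorts observations by full-sample residual, drops equal numbers from each end, and refits.

```python
import random


def ols(x, y):
    """Slope, intercept, and conventional slope standard error for y = a + b*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, intercept, se


def trim_by_residual(x, y, k):
    """Drop the k most negative and k most positive full-sample residuals."""
    slope, intercept, _ = ols(x, y)
    ranked = sorted(zip(x, y), key=lambda p: p[1] - intercept - slope * p[0])
    kept = ranked[k: len(ranked) - k]
    return [p[0] for p in kept], [p[1] for p in kept]


def simulate(n=5218, slope=0.05, seed=0):
    """Hypothetical data with a weak true relationship and normal errors."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, 1) for xi in x]
    return x, y


x, y = simulate()
for k in (0, 16, 100, 500, 1000):
    xs, ys = trim_by_residual(x, y, k)
    b, _, se = ols(xs, ys)
    print(f"trim {2 * k:>4} obs: slope = {b:.4f}, se = {se:.4f}, t = {b / se:.2f}")
```

Because the dropped observations are by construction the ones the model fits worst, the residual variance of what remains shrinks mechanically, and the reported standard error shrinks with it even though the sample is smaller. The tighter fit says nothing about the underlying relationship.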
“Essentially, RM demonstrate that when they drop out observations that identify the model, the model is less well identified.”
No. This comment refers to sample modification of a different sort than just discussed. In Table 4, we restrict the PK nonlinear estimator to subsamples defined by patterns of credit availability by sex. The last column shows that restricting to villages where both sexes had access—where the intent-to-treat–based instruments therefore have the least ability to distinguish impacts by sex—leads to impact estimates that are strongest. They are the only subsample estimates in the table to come close to the full-sample, headline results. This is further circumstantial evidence that statistical degeneracy is a source of the headline PK finding.
“RM…omit the vital first step…” “Pitt (2013) documents various other failings in the way RM test for the existence of a discontinuity.”
The linchpin of PK’s claim to high-quality identification is an “exogenous rule,…the restriction that households owning more than one-half acre of land are precluded from joining any of the three credit programs.” Morduch (1998) looks for evidence of that exogenous rule in the data and doesn’t find it. RM’s Figure 2 corroborates with two smoothed plots that allow a break in the male and female borrowing levels at 0.5 acres and find none.
Pitt says this detective work is inadequate. We must also do the same graph for the outcome variable, household consumption, even though this would in no way demonstrate the occurrence of a microcredit quasi-experiment. And in all three graphs, we should change the cut-off level from 0.5 to 0.55 acres, and add “controls for the actual forcing variable, all other independent variables, and a sixth degree polynomial for landholding.”
So, I did all that. 2 I took Stata’s defaults for kernel and bandwidth in order to minimize arbitrariness (which Pitt calls “arbitrary” 3 ). Still no discontinuities. Except at the under-sampled extremes, the graphs are pretty damn flat 4:
“Whether a land discontinuity exists in the data is irrelevant.”
If a study claims high-quality identification on the basis of a discontinuity, the invisibility of the discontinuity is relevant.
That the results of the supposedly quasi-experimental regressions can be matched by ones not avowedly built on such a quality foundation is not very reassuring.
“The RM simulation without any exogenous slope variables is a seriously deficient model from which to make claims about the PK and RM approaches, particularly without making any mention of the crucial importance of slope variables.”
Sorry, but this is bombast. Our appendix simulations show that the PK nonlinear estimator is inconsistent in the presence of 2nd-stage skewness. They do so in a simple context without controls. That falsifies any theoretical claim of consistency. Pitt continues to defend the estimator not by proving that it is consistent in the presence of violations of its own assumptions but by arguing that no one else has proved it is inconsistent. One can admire the chutzpah.
“RM fail to realize that by controlling for credit choice in the second stage, PK simulations have made the estimates of programme effects invariant to linear translation.”
He’s right. The point is arcane, so I won’t explain it here. It serves to defend a particular way that PK (2012) modify their nonlinear estimator when testing it on simulated data. As Pitt notes, that modification is absent from the actual PK regressions. So it leads to no change in our argument. 5
“…the results are barely changed when credit is rescaled by dividing all credit by 10 (or 20), which is equivalent to increasing the credit assigned to non-participators from log (α=1) to log (α=10) , a change in scale responsive to the untested view of RM that log (α=1) is (implausibly low). Doing so modestly increases the estimated female credit effect…”
The condensed argument here is a defense against our charge that the magnitude, as distinct from statistical significance, of microcredit’s impacts cannot be determined by the PK nonlinear estimator, because the mathematical structure of the estimator requires us to treat non-borrowers as borrowing an indeterminate, non-zero amount. Whether that little bit is 1 taka or 10 taka or 100 taka ($0.025 or $0.25 or $2.50) affects the apparent size of the impacts. PK (1998) choose 1 taka, which I learned from Pitt (2012). Since the least a person could borrow was 1,000, this means that PK (1998) assume that going from 0 to 1,000 in borrowings (treated as going from 1 to 1,000) affected poverty as much as going from 1,000 to 1 million ($25 to $25,000). On the other hand, the RM (2009) choice to censor with 1,000, made out of confusion about what PK did and a desire to minimize arbitrariness, also seems problematic, since it implies no impact in moving from no borrowing to minimal borrowing.
There are obviously-worse choices for the censoring value but no obviously best. A generous lower bound looks to be 100, which would imply that going from 0 to 1,000 (treated as going from 100 to 1,000) had the same impact as multiplying borrowings tenfold, as from 1,000 to 10,000. A reasonable upper bound looks to be 500, which would imply approximately linear impact at low borrowing: going from 0 to 1,000, now treated as a doubling from 500 to 1,000, would be assumed to affect poverty the same as going from 1,000 to 2,000.
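The arithmetic behind these comparisons is short enough to spell out. In a log specification, moving a non-borrower from the censoring value c to the 1,000-taka minimum loan is a step of log(1000/c), which is the same step as moving from 1,000 to 1,000²/c. The sketch below simply tabulates that identity for the censoring values discussed above; the function names are illustrative.

```python
import math


def equivalent_step(censor, minimum_loan=1000.0):
    """Borrowing level B such that log(B) - log(minimum_loan)
    equals log(minimum_loan) - log(censor)."""
    return minimum_loan ** 2 / censor


def log_step(censor, minimum_loan=1000.0):
    """Size, in logs, of the move from the censoring value to the minimum loan."""
    return math.log(minimum_loan / censor)


for c in (1, 100, 500, 1000):
    print(f"censor at {c:>4} taka: log step = {log_step(c):.3f}, "
          f"same as going from 1,000 to {equivalent_step(c):,.0f} taka")
```

Censoring at 1,000 gives a log step of zero, which is the sense in which that choice implies no impact from taking a minimal loan.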
But this Pitt quote too is right: censoring with 10 or 20 instead of 1 taka—a move toward realism in my view—increases the apparent impact of microcredit under PK’s nonlinear estimator. Yet the graph below (code) provides more context by looking across a wider range of possible censoring values. Like our Figure 5, it traces how both peaks of the bimodal likelihood vary with the choice of censoring value. The black contour touches the higher mode at each step, which is the Maximum Likelihood estimate. The contour plunges negative around 480 taka. That’s where the negative-impact mode becomes taller:
This is a strange state of affairs. If we assume, not unreasonably, that moving from 0 to 1,000 in borrowing has the same proportional impact as moving from 1,000 to 3,000 (i.e., censor with 333), then we conclude that microcredit reduces poverty. But if we assume, also not unreasonably, that moving from 0 to 1,000 equates in proportional impact to moving from 1,000 to 2,000 (censor with 500), then we conclude that microcredit increases poverty. This is one way of seeing the instability of the PK nonlinear estimator.
“Roodman and Morduch (2009) erroneously claimed statistically significant and negative female credit effects, the opposite of the PK findings, a finding that they advertised widely. This finding came about because of what can only be called a typographical error on their part.”
My public discussion deemphasized the failure to match on sign and emphasized the doubts about whether endogeneity had been ruled out. I also continue to feel that complexity in econometrics is dangerous because it can obscure problems as well as solve them. One virtue of randomized studies is their mathematical simplicity. My thinking and writing on these themes have been stable, and I see no need for change now.
And as I have explained before, RM (2009) contained two key discrepancies, with different provenances, neither pure carelessness. One was a missing control variable, which was indeed documented in PK (1998) and was theoretically important. I missed it in part because Pitt had provided us with an undocumented, cryptically labelled data set that pointed in a different direction. Columns xw1–xw15, xm1–xm15, and xb1–xb25 of this file line up perfectly with a detailed variable list embedded in regression results in the working paper version of PK, a list that lacks an analog in the tighter 1998 journal paper. Alas, xb25, matching “Participated but did not take credit” in the working paper, was a ringer. This history does not excuse my getting it wrong, but it does show the challenges of replication without proper supply of data and code. And it rebuts any implication of simple carelessness. With incomplete data and code, replication is like forensics with fragmentary information.
The second discrepancy was the choice of censoring value just discussed: ignorant of PK’s undocumented choice, we used what seemed least arbitrary, which was the censoring threshold of 1,000 listed in PK. (If you’re confused by that sentence, then you glimpse why I was confused too.) That put us on the right edge of the graph above. That alone prevented us from matching PK on sign.
As I have written before, this experience illustrates twice over the value of data and code transparency. However normal in the late 1990s, the opacity of PK hindered our efforts to understand it. The original code has never been seen in public and a full data set did not appear till 2011. Inevitably, there were subtle errors, misunderstandings, and undocumented choices all around, which were hard to detect without the Rosetta Stone of the original data and code. In contrast, our transparency allowed Pitt to find our errors.
“This brief response shows that all of the criticisms are readily refuted.”
Our paper concludes with these words:
Our work replicating PK (1998) has left us with great admiration for its sophistication and creativity. But its econometric sophistication obscures problems:
- an imputation for the log of the treatment variable when it is zero that is undocumented, influential, and arbitrary at the margin, making the impact size essentially [un]identified;
- the absence of a discontinuity that is asserted as central to the identification strategy;
- a reclassification of formally ineligible but borrowing households as eligible, which presumably introduces endogeneity into the asserted quasi-experiment;
- a linear relationship between the instruments and the error;
- disappearance of the results when villages where both genders could borrow are excluded;
- instability in the estimator;
- disappearance of the results after dropping 16 outliers, 0.3 per cent of the sample, that especially violate a modelling assumption.
Pitt (2014) directly contradicts none of those points.
- Only they instrument with “interactions of x with a dummy variable for having the choice to participate,” where “choice” is defined as in PK.
- Pitt doesn’t do all of these things. Which reminds me of footnote 11 of Pitt (1999): “It strikes me that if one wishes to assert that a certain model may fit the data better than another, and one has the data, one should go out and test if that assertion is true. To claim ‘this could be’ and ‘that could be’, and then make no effort to examine whether these speculations are true when they are easily tested with the data in one’s possession, strikes me as unfairly redistributing the burden of criticizing another’s work.”
- Pitt cites Imbens and Lemieux (2008) on the importance of giving careful attention to the choice of bandwidth, as opposed, in effect, to taking software defaults. But this is in reference to performing formal regressions, not graphical preliminaries. They write, “The formal statistical analyses…are essentially just sophisticated versions of [graphical analysis], and if the basic plot does not show any evidence of a discontinuity, there is relatively little chance that the more sophisticated analyses will lead to robust and credible estimates with statistically and substantially significant magnitudes.” The clear thrust is that if data have to be tortured to reveal a discontinuity, further analysis is probably pointless.
- Unlike Pitt, I collapse the data set to one observation per household to produce these graphs. Since observations are assumed correlated over time, not collapsing would cause the plots to treat each household-season observation as independent, leading to misleadingly narrow confidence intervals. The confidence intervals are probably still too narrow because they do not factor in the uncertainty in the regressions used for partialling out controls.
- In particular, the criticisms of the PK (2012) simulations in our footnote 22 stand.