New package for extreme value analysis in Stata 

One topic I’m studying for my main client, the Open Philanthropy Project, is the risk of geomagnetic storms. I hadn’t heard of them either. Actually, they originate as solar storms, which hurl magnetically charged matter toward earth, jostling its magnetic field. Routine-sized storms cause the Auroras Borealis and Australis. Big ones happen roughly once a decade (1972, 1982, 1989, 2003, a near-miss in 2012…) and also mostly hit high latitudes. The worry: a really big one could send currents surging through long-distance power lines, frying hundreds of major transformers, knocking out power to continent-scale regions for months or even years, and causing an economic or humanitarian catastrophe.

My best assessment at this point is that if one extrapolates properly from the available modern data, the risk is much lower than the 12%-chance-per-decade cited by the Washington Post last summer. But that’s a preliminary judgment, and I’m not a seasoned expert. And even if the risk is only 1%, it almost certainly deserves more attention. More from me on that in time. (For a mathematician’s doubts about the 12% figure see Stephen Parrott.)

You can see “geomagnetic storms” beneath Cari Tuna’s elbow in this photo from a recent Post story about the Open Philanthropy Project:


Geomagnetic storms constitute an extremely rich topic, encompassing (ha ha) solar physics, geophysics, the fundamentals of electromagnetism, dendrochronology, power system dynamics, transformer engineering…and statistics. The statistical question is: given the historical data on the severity and frequency of geomagnetic disruptions, what can we say about the probability per unit time of one at or beyond the frontier of historical experience?

And that leads into the branch of statistics called extreme value theory. I think of it this way. A foundational result in mainstream statistics is the Central Limit Theorem. If you draw a sample from a population and take an average, as presidential pollsters do, and then do that again and again and again, you’ll get a somewhat different answer each time for how many will vote for Hillary. But the individual polling results will cluster around the true number according a bell curve. Few will be in the tails of the bell, far from the real number. This is an extremely powerful fact because it holds almost no matter the underlying distribution, which could be as ragged as the Rocky Mountains. Take enough averages from a sea of numbers and the bell curve will rise like a phoenix. And it’s a shape statisticians know how to work with. When pollsters quote margins of error, this is the statistical law they are invoking.

Extreme value theory forks from that line of logic in one way: it takes the maximum rather than the average. Example: the highest daily rainfall for each year over the last century, at a given spot. It turns out that almost regardless of the pattern of rainfall over the course of the year–rainy only in October, rainy only on Thursdays, rainy everyday–those  100 numbers will also tend to spread according to a particular curve (or family of curves). These curves are hump-shaped too but they usually have one tail that extends to infinity and one that is chopped short. With enough historical data on rainfall or meteor showers or stock price changes, you can estimate which member of this family governs the distribution of maxima, then follow the tails of that curve to compute the probabilities of extremes. You can also model how probabilities of extremes will shift in response, for example, to climate change.

Extreme value theory (EVT) is not magic. Properly done, it too yields margins of error, and the farther you project beyond historical experience–what’s the chance of a year with 100 inches of rain in a day?–the wider those margins should be. And no statistical system can do better than extrapolate from the available data. Past may not be prologue. The sun could unleash a cataclysm tomorrow the likes of which has not been seen in ten thousand years–earthly statisticians and their historical record be damned.

Me being me, as I entered into orbit around EVT, I felt pulled, gravitationally, to code it (even though I couldn’t in good conscience charge my client for that time). I found many sophisticated programs for EVT, but not much for Stata, my familiar statistical home. By coding EVT, I came to understand it better. My new software package, called “extreme” is free, but requires Stata, which is not. The package incorporates a lot of features from the popular ismev package for R, which is closely associated with the excellent introductory text by Stuart Coles. As a result, examples in my documentation can reproduce most of the statistical and graphical examples in that book.

The one novelty in the package (for the technically inclined) is an emphasis on small-sample bias correction. Maximum likelihood estimation has bias of order 1/N, which can be significant in the small samples common in EVT applications. Inspired by a new paper by David Giles and colleagues, I implemented the analytical Cox-Snell correction not only for the generalized Pareto distribution (which they derive) but also for the generalized extreme value distribution and its extension to multiple order statistics (which I derived at great effort and am writing up). I also implemented a parametric bootstrap-based bias correction. This is a bit slower but was a hell of a lot easier to code, and works very well too, indeed much better for negative values of the shape parameter ξ…to the point that my great mathematical effort may have been largely unnecessary! (See newer post.)

This was also my closest encounter with R, the rapidly rising open software environment for statistics. R influenced my Stata development in two ways. First, it led me to integrate capabilities to produce canned diagnostic graphs. Putting graphics on a par with numerical results is normal in R, but not so much in Stata. Second, it led me to construct an interface in Mata, Stata’s matrix programming language, so that, as in R, one can directly access the estimation functionality from a real programming language. (Not that I’ve documented that yet.)

Whether extreme will get used as much as xtabond2 and cmp, I don’t know. But I have learned, and will bring that understanding to the analysis of geomagnetic storm risks.