Resources

Correction to Conley Standard Errors

Darin Christensen and Thiemo Fetzer


Jordan Adamson found an error in the code that Solomon Hsiang developed to compute Conley standard errors in Stata. Unfortunately, we transcribed this error when we implemented Hsiang’s code in C++ and R. These errors happen, and Hsiang clearly warns users at the top of his code.

The problem is a single misplaced parenthesis in the line calculating the weight for the Bartlett kernel when correcting for temporal auto-correlation: weight = (1:-abs(time1[t,1] :- time1))/(lag_cutoff+1) (line 430 in the original ado file, version dated 4/29/2013).

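Per Newey and West (1987), the Bartlett kernel weight for two observations that are $|t - s|$ periods apart, given a lag cutoff of $L$, is

$$w(t, s) = 1 - \frac{|t - s|}{L + 1}$$

However, the line above instead computes:

$$w(t, s) = \frac{1 - |t - s|}{L + 1}$$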

The fix is simple: the third parenthesis needs to be moved to the end of the line, so that it reads weight = (1:-abs(time1[t,1] :- time1)/(lag_cutoff+1)). Unfortunately, the fix is also consequential, as the uncorrected code can deliver negative weights and lead to standard errors that are too small when there is positive temporal auto-correlation.
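To see why the position of the parenthesis matters, the two weight formulas can be tabulated for a lag cutoff of 5 (the value used in the examples on this page). This is a minimal sketch in R rather than an excerpt from either implementation; the variable names are illustrative and do not come from the ado file or our C++ code.

```r
# Illustrative only: Bartlett weights for time lags 0..lag_cutoff under the
# intended formula and under the formula with the misplaced parenthesis.
lag_cutoff <- 5
lags <- 0:lag_cutoff

w_intended  <- 1 - lags / (lag_cutoff + 1)    # Newey-West (1987) Bartlett kernel
w_misplaced <- (1 - lags) / (lag_cutoff + 1)  # what the uncorrected line computes

weights <- data.frame(lag = lags,
                      intended = round(w_intended, 3),
                      misplaced = round(w_misplaced, 3))
print(weights)
```

The intended weights decline linearly and stay positive within the cutoff, whereas the misplaced-parenthesis version is zero at a lag of one and negative at every lag of two or more.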

Our old and new code is now posted in a public GitHub repo.


Original Code

Here’s the original Stata implementation.

clear
use "data/new_testspatial.dta"

tab year, gen(yy_)
tab FIPS, gen(FIPS_)

ols_spatial_HAC EmpClean00 HDD CDD yy_* FIPS_2-FIPS_362, ///
    lat(lat ) lon(lon ) t(year) p(FIPS) dist(500) lag(5) bartlett disp

This code delivers the following standard errors:

-----------------------------------------------
Variable | OLS spatial spatHAC
-------------+---------------------------------
HDD | 0.650 0.886 0.894
CDD | 1.493 4.068 4.388

And our original C++/R implementation:

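# Packages used below (all listed under sessionInfo()):
library(foreign)     # read.dta()
library(data.table)  # data.table(), setnames()
library(dplyr)       # %>%
library(lfe)         # felm()

# Loading sample data:
dt <- read.dta("data/new_testspatial.dta") %>% data.table()
setnames(dt, c("latitude", "longitude"), c("lat", "lon"))

# Loading R function to compute Conley SEs:
source("code/archived-code/deprecated-conley.R")

m <- felm(EmpClean00 ~ HDD + CDD | year + FIPS | 0 | lat + lon,
          data = dt[!is.na(EmpClean00)], keepCX = TRUE)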

SE <- ConleySEs(reg = m,
                unit = "FIPS",
                time = "year",
                lat = "lat", lon = "lon",
                dist_fn = "SH", dist_cutoff = 500,
                lag_cutoff = 5,
                cores = 1,
                verbose = FALSE)

sapply(SE, function(x) diag(sqrt(x))) %>% round(3)
      OLS Spatial Spatial_HAC
HDD 0.650 0.886 0.895
CDD 1.493 4.065 4.386

This matches the standard errors from the Stata output, up to small differences in the third decimal place.


Corrected Code

Jordan caught the transcribed error on line 183 of our C++ code. Per Newey and West (1987), we correct (1 - t_diff[j]) / (cutoff + 1) to (1 - t_diff[j] / (cutoff + 1)) and recompute the standard errors.

source("code/conley.R")

SE <- ConleySEs(reg = m,
                unit = "FIPS",
                time = "year",
                lat = "lat", lon = "lon",
                dist_fn = "SH", dist_cutoff = 500,
                lag_cutoff = 5,
                cores = 1,
                verbose = FALSE)

sapply(SE, function(x) diag(sqrt(x))) %>% round(3)
      OLS Spatial Spatial_HAC
HDD 0.650 0.886 0.721
CDD 1.493 4.065 3.631

As is apparent from the final column, correcting the error meaningfully changes the spatial HAC standard errors: they fall from 0.895 to 0.721 for HDD and from 4.386 to 3.631 for CDD. Thiemo’s data is a bit unusual in this respect; in other applications with positive temporal auto-correlation, we find that the standard errors tend to increase with the corrected code.
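The direction of the bias under positive auto-correlation can be illustrated with a stylized calculation. The sketch below is not taken from our code: it assumes autocovariances that decay geometrically (as for an AR(1) process with coefficient 0.5, normalized so that the variance is one) and assumes the lag-zero term enters with a weight of one under both formulas. It then compares the implied Newey-West long-run variances.

```r
# Stylized illustration (assumptions: AR(1)-type autocovariances with
# rho = 0.5 and unit variance; lag-0 term weighted by one in both cases).
lag_cutoff <- 5
rho <- 0.5
lags <- 1:lag_cutoff
gamma <- rho^lags                      # autocovariances at lags 1..lag_cutoff

w_corrected   <- 1 - lags / (lag_cutoff + 1)
w_uncorrected <- (1 - lags) / (lag_cutoff + 1)

# Long-run variance: gamma_0 + 2 * sum_l w_l * gamma_l, with gamma_0 = 1.
lrv_corrected   <- 1 + 2 * sum(w_corrected * gamma)
lrv_uncorrected <- 1 + 2 * sum(w_uncorrected * gamma)

c(corrected = lrv_corrected, uncorrected = lrv_uncorrected)
```

Because the uncorrected weights are zero or negative at every lag from one onward, positive autocovariances are subtracted rather than added, so the uncorrected long-run variance in this example is smaller than the corrected one.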


sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.6 (unknown)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] RcppArmadillo_0.7.400.2.0 Rcpp_0.12.12
[3] geosphere_1.5-5 sp_1.2-5
[5] lfe_2.5-1998 Matrix_1.2-7.1
[7] ggplot2_2.2.0 foreign_0.8-67
[9] data.table_1.9.6 dplyr_0.7.2
[11] knitr_1.14

loaded via a namespace (and not attached):
[1] formatR_1.4 plyr_1.8.4 bindr_0.1
[4] tools_3.3.0 digest_0.6.12 evaluate_0.10
[7] tibble_1.3.3 gtable_0.2.0 lattice_0.20-34
[10] pkgconfig_2.0.1 rlang_0.1.1 yaml_2.1.13
[13] bindrcpp_0.2 stringr_1.2.0 rprojroot_1.1
[16] grid_3.3.0 glue_1.1.1 R6_2.2.2
[19] rmarkdown_1.2 Formula_1.2-1 magrittr_1.5
[22] backports_1.0.4 scales_0.4.1.9002 htmltools_0.3.5
[25] assertthat_0.2.0 colorspace_1.2-6 xtable_1.8-2
[28] sandwich_2.3-4 stringi_1.1.5 lazyeval_0.2.0
[31] munsell_0.4.3 chron_2.3-47 zoo_1.7-13

Conley Standard Errors in R

Correcting for Spatial and Temporal Auto-Correlation in Panel Data:

Using R to Estimate Spatial HAC Errors per Conley (1999, 2008)

Darin Christensen and Thiemo Fetzer


tl;dr: Fast computation of standard errors that allows for serial and spatial auto-correlation.


Economists and political scientists often employ panel data that track units (e.g., firms or villages) over time. When estimating regression models using such data, we often need to be concerned about two forms of auto-correlation: serial (within units over time) and spatial (across nearby units). As Cameron and Miller (2013) note in their excellent guide to cluster-robust inference, failure to account for such dependence can lead to incorrect conclusions: “[f]ailure to control for within-cluster error correlation can lead to very misleadingly small standard errors…” (p. 4).

Conley (1999, 2008) develops one commonly employed solution. His approach allows for serial correlation over all (or a specified number of) time periods, as well as spatial correlation among units that fall within a certain distance of each other. For example, we can account for correlated disturbances within a particular village over time, as well as between that village and every other village within one hundred kilometers.
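As a rough sketch of the idea (not our implementation), the example below builds the weight that each pair of observations would receive in a toy panel. Pairs observed in the same period get a weight of one when the units are within the distance cutoff, and pairs from the same unit get a Bartlett weight when they are within the lag cutoff. The toy coordinates, the planar distance, the uniform spatial kernel, and the cutoffs are all assumptions made for illustration.

```r
# Toy panel: 3 units observed in 3 periods (coordinates in km on a plane;
# a real application would use a geodesic distance instead).
panel <- expand.grid(unit = 1:3, time = 1:3)
coords <- data.frame(unit = 1:3, x = c(0, 60, 500), y = c(0, 80, 0))
panel <- merge(panel, coords, by = "unit")

dist_cutoff <- 100   # km (hypothetical)
lag_cutoff  <- 1     # periods (hypothetical)

n <- nrow(panel)
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    same_unit <- panel$unit[i] == panel$unit[j]
    same_time <- panel$time[i] == panel$time[j]
    d   <- sqrt((panel$x[i] - panel$x[j])^2 + (panel$y[i] - panel$y[j])^2)
    lag <- abs(panel$time[i] - panel$time[j])
    if (same_time && d <= dist_cutoff) {
      W[i, j] <- 1                                # spatial: uniform kernel
    } else if (same_unit && lag <= lag_cutoff) {
      W[i, j] <- 1 - lag / (lag_cutoff + 1)       # serial: Bartlett kernel
    }
  }
}
W
```

Units 1 and 2 are 100 km apart and so are treated as spatially correlated within each period, while unit 3 is too far from both; observations of different units in different periods receive a weight of zero.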

We provide a new function that allows R users to more easily estimate these corrected standard errors. (Solomon Hsiang (2010) provides code for STATA, which we used to test our estimates and benchmark speed.) Moreover, using the excellent lfe, Rcpp, and RcppArmadillo packages (and Tony Fischetti’s Haversine distance function), our function is roughly 20 times faster than the STATA equivalent and can scale to handle panels with more units. (We have used it on panel data with over 100,000 units observed over 6 years.)

This demonstration employs data from Fetzer (2014), who uses a panel of U.S. counties from 1999-2012. The data and code can be downloaded here.


STATA Code:

We first use Hsiang’s STATA code to compute the corrected standard errors (spatHAC in the output below).

cd "~/Dropbox/ConleySEs/Data"
clear
use "new_testspatial.dta"

tab year, gen(yy_)
tab FIPS, gen(FIPS_)

timer clear 1
timer on 1
ols_spatial_HAC EmpClean00 HDD CDD yy_* FIPS_2-FIPS_362, lat(lat ) lon(lon ) t(year) p(FIPS) dist(500) lag(5) bartlett disp

* -----------------------------------------------
*     Variable |   OLS      spatial    spatHAC
* -------------+---------------------------------
*          HDD |   -0.283     -0.283     -0.283
*              |    0.650      0.886      0.894
*          CDD |    2.497      2.497      2.497
*              |    1.493      4.068      4.388

timer off 1
timer list 1
*    1:     25.42 /        1 =      25.4170

R Code:

Using the same data and options as the STATA code, we then estimate the adjusted standard errors using our new R function. This requires us to first estimate our regression model using the felm function from the lfe package.

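# Packages used below (all listed under sessionInfo()):
library(foreign)     # read.dta()
library(data.table)  # data.table(), setnames()
library(dplyr)       # %>% and mutate()
library(lfe)         # felm()

# Loading sample data:
dta_file <- "~/Dropbox/ConleySEs/Data/new_testspatial.dta"
DTA <- data.table(read.dta(dta_file))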

setnames(DTA, c("latitude", "longitude"), c("lat", "lon"))

# Loading R function to compute Conley SEs:
source("~/Dropbox/ConleySEs/ConleySEs_17June2015.R")

ptm <- proc.time()

We use felm() from the lfe package to estimate a model with year and county fixed effects.

Two important points:

  1. We specify our latitude and longitude coordinates as the cluster variables, so that they are included in the output (m).
  2. We specify keepCX = TRUE, so that the centered data is included in the output (m).

m <- felm(EmpClean00 ~ HDD + CDD |
          year + FIPS | 0 | lat + lon,
          data = DTA[!is.na(EmpClean00)], keepCX = TRUE)

# Same as the STATA result:
coefficients(m) %>% round(3)
   HDD    CDD
-0.283 2.497

We then feed this model to our function, as well as the cross-sectional unit (county FIPS codes), time unit (year), geo-coordinates (lat and lon), the cutoff for serial correlation (5 years), the cutoff for spatial correlation (500 km), and the number of cores to use.

SE <- ConleySEs(reg = m,
                unit = "FIPS",
                time = "year",
                lat = "lat", lon = "lon",
                dist_fn = "SH", dist_cutoff = 500,
                lag_cutoff = 5,
                cores = 1,
                verbose = FALSE)

sapply(SE, function(x) diag(sqrt(x))) %>% round(3) # Same as the STATA results.
      OLS Spatial Spatial_HAC
HDD 0.650 0.886 0.895
CDD 1.493 4.065 4.386
proc.time() - ptm
   user  system elapsed
1.046 0.016 1.116

Estimating the model and computing the standard errors takes just over one second (1.1 seconds elapsed), compared with 25.4 seconds for the comparable STATA routine, a roughly 20-fold speedup.


R Using Multiple Cores:

Even with a single core, we realize significant speed improvements. However, the gains are even more dramatic when we employ multiple cores. Using 4 cores, we can cut the estimation of the standard errors down to under half a second (roughly 0.46 seconds per run). (These replications employ the Haversine distance formula, which is more time-consuming to compute.)

pkgs <- c("rbenchmark", "lineprof")
invisible(sapply(pkgs, require, character.only = TRUE))

bmark <- benchmark(replications = 25,
  columns = c('replications','elapsed','relative'),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 1, verbose = FALSE),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 2, verbose = FALSE),
  ConleySEs(reg = m,
    unit = "FIPS", time = "year", lat = "lat", lon = "lon",
    dist_fn = "Haversine", lag_cutoff = 5, cores = 4, verbose = FALSE))
bmark %>% mutate(avg_elapsed = elapsed / replications, cores = c(1, 2, 4))
  replications elapsed relative avg_elapsed cores
1           25   26.43    2.298      1.0572     1
2           25   16.78    1.459      0.6714     2
3           25   11.50    1.000      0.4602     4
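For readers unfamiliar with it, the Haversine formula gives the great-circle distance between two points from their latitudes and longitudes. The function below is a minimal R sketch of that formula, not the C++ routine called by ConleySEs; the name haversine_km and the Earth radius of 6371 km are assumptions made for illustration.

```r
# Great-circle distance in kilometers via the Haversine formula.
# Hypothetical helper for illustration; inputs are in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# One degree of longitude along the equator is roughly 111 km.
haversine_km(0, 0, 0, 1)
```

Each pairwise distance requires several trigonometric evaluations, which is consistent with this formula being more costly to compute than simpler approximations.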

Given the prevalence of panel data that exhibits both serial and spatial dependence, we hope this function will be a useful tool for applied econometricians working in R.


Feedback Appreciated: Memory vs. Speed Tradeoff

This was Darin’s first foray into C++, so we welcome feedback on how to improve the code. In particular, we would appreciate thoughts on how to overcome a memory vs. speed tradeoff we encountered. (You can email Darin at darinc[at]luskin.ucla.edu)

The most computationally intensive chunk of our code computes the distance from each unit to every other unit. To cut down on the number of distance calculations, we can fill the upper triangle of the distance matrix and then copy it to the lower triangle. With $N$ units, this requires only $N (N-1) /2$ distance calculations.
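A minimal R sketch of this strategy follows (our C++ code differs in its details). It uses planar Euclidean distance on made-up coordinates as a stand-in for the actual distance function, and it counts the number of pairwise distance evaluations.

```r
# Fill the upper triangle of the distance matrix, then mirror it.
# Coordinates and the Euclidean distance are placeholders for illustration.
set.seed(1)
N <- 6
xy <- cbind(x = runif(N), y = runif(N))

D <- matrix(0, N, N)
n_calcs <- 0
for (i in seq_len(N - 1)) {
  for (j in (i + 1):N) {
    D[i, j] <- sqrt(sum((xy[i, ] - xy[j, ])^2))
    n_calcs <- n_calcs + 1
  }
}
D[lower.tri(D)] <- t(D)[lower.tri(D)]   # copy the upper triangle to the lower

n_calcs == N * (N - 1) / 2              # check the count of distance calculations
```

The full $N \times N$ matrix D is what must be held in memory under this approach.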

However, as the number of units grows, this distance matrix becomes too large to store in memory, especially when executing the code in parallel. (We tried to use a sparse matrix, but this was extremely slow to fill.) To overcome this memory issue, we can avoid constructing a distance matrix altogether. Instead, for each unit, we compute the vector of distances from that unit to every other unit. We then only need to store that vector in memory. While that cuts down on memory use, it requires us to make twice as many ($N(N-1)$) distance calculations.
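This memory-saving alternative can be sketched as follows. It is an illustration in R with placeholder coordinates, Euclidean distance, and a hypothetical cutoff, not an excerpt from ConleySE.cpp. Only one vector of distances is held at a time, and the example simply tallies how many other units fall within the cutoff of each unit.

```r
# For each unit, compute its distances to all other units, use them, and
# discard them; no N x N matrix is ever stored.
set.seed(1)
N <- 6
xy <- cbind(x = runif(N), y = runif(N))
cutoff <- 0.5            # hypothetical distance cutoff

n_calcs <- 0
neighbors <- integer(N)  # number of other units within the cutoff
for (i in seq_len(N)) {
  others <- setdiff(seq_len(N), i)
  d_i <- sqrt((xy[others, "x"] - xy[i, "x"])^2 +
              (xy[others, "y"] - xy[i, "y"])^2)   # vector of length N - 1
  n_calcs <- n_calcs + length(d_i)
  neighbors[i] <- sum(d_i <= cutoff)
}

n_calcs == N * (N - 1)   # twice as many calculations as the triangular fill
```

Memory use here grows linearly in the number of units, at the cost of computing every pairwise distance twice.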

As the number of units grows, we are forced to perform more duplicate distance calculations to avoid memory constraints – an unfortunate tradeoff. (See the functions XeeXhC and XeeXhC_Lg in ConleySE.cpp.)


sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12.3 (unknown)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] RcppArmadillo_0.7.400.2.0 Rcpp_0.12.7
[3] geosphere_1.5-5 sp_1.2-3
[5] lfe_2.5-1998 Matrix_1.2-7.1
[7] ggplot2_2.2.1 foreign_0.8-67
[9] data.table_1.9.6 dplyr_0.5.0
[11] knitr_1.14

loaded via a namespace (and not attached):
[1] Formula_1.2-1 magrittr_1.5 munsell_0.4.3
[4] xtable_1.8-2 lattice_0.20-34 colorspace_1.2-6
[7] R6_2.1.3 stringr_1.1.0 plyr_1.8.4
[10] tools_3.3.0 grid_3.3.0 gtable_0.2.0
[13] DBI_0.5-1 htmltools_0.3.5 lazyeval_0.2.0
[16] yaml_2.1.13 assertthat_0.1 rprojroot_1.1
[19] digest_0.6.10 tibble_1.2 formatR_1.4
[22] evaluate_0.10 rmarkdown_1.2 sandwich_2.3-4
[25] stringi_1.1.1 scales_0.4.1 backports_1.0.4
[28] chron_2.3-47 zoo_1.7-13