[acs-r] warning / guidance: working with combined geo.sets

Ezra Haber Glenn eglenn at mit.edu
Sun Apr 3 21:08:11 EDT 2016


Dear acs-R users:

If you have been using the acs package to create custom geo.sets which
*combine* existing census geographies (i.e., geo.sets with
"combine=T"), please read on.

It has come to my attention that some users working with custom
*combined* geo.sets may be introducing errors into their data if they
attempt to combine census variables dealing with medians, percentages,
or similar derived summary data.

Most of the data available through the package (via the ACS and the
Decennial Census APIs) comes in the form of *raw counts* -- numbers of
people, households, commuters, etc.  When a geo.set includes multiple
elements and "combine=T", the package will fetch the data, and then
combine the geographies by (1) adding the estimates and (2)
calculating the standard errors of these aggregate estimates.  This
procedure is absolutely proper for count-data, but it is *not*
appropriate for median incomes (or median ages, or mean incomes, or
mean travel times, or derived percentages, etc.).
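
To make the count case concrete, here is a minimal base-R sketch of
what combining amounts to for count data.  The tract counts and
standard errors are invented for illustration.  The root-sum-of-squares
formula is the usual approximation for the standard error of a sum of
independent estimates and is assumed here as a stand-in; the package's
internal handling may differ in detail.

```r
# Hypothetical household counts and standard errors for three tracts
# (made-up numbers, not real ACS data).
estimates  <- c(1200, 950, 1810)
std.errors <- c(85, 70, 110)

# (1) Counts are additive, so the combined estimate is simply the sum.
combined.estimate <- sum(estimates)

# (2) Assumed approximation: the standard error of a sum of independent
# estimates is the square root of the sum of the squared standard errors.
combined.se <- sqrt(sum(std.errors^2))

combined.estimate
combined.se
```

Both steps make sense only because counts add up across geographies;
neither step has a meaningful analogue for a median or a percentage.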

For example, if you attempt to aggregate three tracts with median
incomes of $25,000, $35,000, and $50,000 into a single neighborhood,
acs.fetch will return a neighborhood with an "aggregate" median
income of $110,000: wrong.

A quick demonstration:

> library(acs)
> all.us=geo.make(state=fips.state[1:51,2], combine=T)
> median.income=acs.fetch(geography=all.us, table.number="B06011", endyear=2014, span=1)
> median.income

Try this and you'll see that the country's "median income" is
$1,394,002...

In the package's defense, there really *isn't* a proper way to
aggregate median incomes like this.  Since medians -- or means, or
percentages -- are *derived* from underlying data, they are really
"summaries," and without at least some more info about the underlying
data you can't always properly combine them.  So, in the example
above, we know that the median income for the neighborhood is
somewhere between $25,000 and $50,000, but not really where.  We can
take a median of the medians ($35,000), or a mean of the medians
($110,000 / 3 = $36,667), but these are just guesses as well: without
knowing how many observations there were in each tract and what their
incomes were, we simply can't calculate it.  (This is why I didn't
think it would be an issue -- but now I'm thinking at least a stronger
warning somewhere would be a good idea, hence this email and some new
language I'll add to the guidance docs.)
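
The point about missing information can be illustrated with a small
base-R sketch.  The household incomes below are entirely hypothetical;
they are chosen so that both scenarios have the same three tract
medians ($25,000, $35,000, and $50,000) but different pooled medians.

```r
# Scenario A: three households per tract.
scenario.a <- list(
  tract1 = c(20000, 25000, 30000),
  tract2 = c(30000, 35000, 40000),
  tract3 = c(45000, 50000, 55000)
)

# Scenario B: same tract medians, but tract1 has many more households.
scenario.b <- list(
  tract1 = c(23000, 24000, 24500, 25000, 25500, 26000, 27000),
  tract2 = c(34000, 35000, 36000),
  tract3 = c(49000, 50000, 90000)
)

# Median within each tract, and median of all households pooled together.
tract.medians <- function(x) sapply(x, median)
pooled.median <- function(x) median(unlist(x))

tract.medians(scenario.a)        # 25000 35000 50000
tract.medians(scenario.b)        # 25000 35000 50000

sum(tract.medians(scenario.a))   # 110000: what adding the medians gives
pooled.median(scenario.a)        # 35000
pooled.median(scenario.b)        # 27000
```

The tract-level medians are identical in both scenarios, yet the true
neighborhood median differs, so no formula that sees only the three
medians can recover it.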

Please note that this issue only occurs when users create geo.sets
with multiple elements and then *combine* them (by setting "combine=T"
in the geo.set) before passing them to acs.fetch to download data.  As
long as you are not combining multiple tracts, counties, blockgroups,
etc., the package is still fine for fetching and working with median
incomes, percentages, and the like.  (But be careful: your own code may
slip in similar mistakes if you combine this sort of data.)
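
As an illustration of how the same mistake can creep into one's own
code, here is a hedged base-R sketch with made-up counts for two
tracts.  It contrasts averaging derived percentages with recomputing
the percentage from combined counts.  The variable names are
hypothetical, and the standard error of the recomputed percentage
needs its own calculation, which is not shown.

```r
# Hypothetical counts for two tracts (not real ACS data).
transit.commuters <- c(200, 1800)   # workers commuting by transit
total.workers     <- c(1000, 3000)  # all workers in each tract

# Derived percentage in each tract: 0.20 and 0.60.
tract.pct <- transit.commuters / total.workers

# Mistake: combining the derived percentages directly.
naive.pct <- mean(tract.pct)        # 0.40

# Sounder: combine the underlying counts first, then derive.
combined.pct <- sum(transit.commuters) / sum(total.workers)  # 0.50

naive.pct
combined.pct
```

The two answers differ because the tracts are of different sizes, and
a plain average of percentages ignores that.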

Please pass on this info to your colleagues who may be using the
package, and be sure to check your code if it (a) deals with
*combined* geo.sets and (b) downloads non-count data.  If you have any
questions or concerns, by all means contact me and I'll be happy
to discuss more.

Thanks, and sorry if this wasn't clear.

--Ezra

--
Ezra Haber Glenn, AICP
Department of Urban Studies and Planning
Massachusetts Institute of Technology
77 Massachusetts Ave., Room 7-337
Cambridge, MA 02139
eglenn at mit.edu 
http://dusp.mit.edu/faculty/ezra-glenn | http://eglenn.scripts.mit.edu/citystate/
617.253.2024 (w)
617.721.7131 (c)

