
Translating Data Between Different Administrative Geographies:
An Online Geographic Correspondence Engine

Hendrik K. Meij
Consortium for International Earth Science Information Network
Socioeconomic Data and Applications Center
2250 Pierce Road
University Center, MI 48710

Presentation and paper prepared for the Conference on Scientific and Technical Data Exchange and Integration, sponsored by the U.S. National Committee for CODATA,
Bethesda, MD, December 15-17, 1997.



ABSTRACT:
The appropriate exchange or integration of data often depends on a clear understanding of the underlying administrative geography to which the data correspond and on the ability to transform data consistently between different geographies. The Geographic Correspondence (Geocorr) Engine is an online tool for obtaining a "correlation list" that shows how one geography relates to another. For example, Geocorr permits a user to create a report that shows how ZIP codes correspond to counties and cities (places) for a user-defined group of counties, states, or the U.S. as a whole. More advanced options include the specification of bounding boxes or concentric circles around a specified location (latitude/longitude). Geocorr presently permits two-way correlations between a variety of administrative geographies, including the 1980 and 1990 U.S. Census geographies (from TIGER 1992), hydrologic unit codes (watersheds), labor market areas, commuting zones, 1990 voting districts, 1990 Public Use Microdata Areas (PUMAs), and mid-census geographies.

1.   Introduction

For the 1990 decennial census, the geography underlying enumeration was presented in a geospatial database nicknamed TIGER (Topologically Integrated Geographic Encoding and Referencing system). Subsequent versions of this database have replaced earlier ones; each time, geocodes describing certain features were added to or deleted from TIGER.

It soon became clear that the development of a database preserving these geocodes (such as county FIPS codes or tract suffix and postfix identifications) at the census block level would be of historical value. Such a database would not only preserve the relationships but also provide ready access to them: the user would not have to reload all of the original geospatial information to resolve certain queries.

The Master Area Block Level Equivalency (MABLE) file is such a database, with a record for each census block in the nation. Each record contains all the information necessary to uniquely identify the block (state, county, tract, blockgroup, and block identifications), as well as the total count values for population (pop), housing unit structures (hus), land area as reported by the Census Bureau (landarea), and the spatial coordinates (latitude, longitude) of the block's "spatial centroid", or internal point. Note that the internal point is usually the centroid; if the centroid falls outside the perimeter of the block in question, the point has been adjusted to fall within it.

The advantage of such a database is that other geocodes can now be added to each row at the lowest level of geography shared by all other geographies. In other words, the census blocks form the building blocks from which other geographies can be assembled. Geocodes were collected from a variety of sources and added to the database. These sources included the "ZIPEQ" file (ZIP Equivalency File, also known as the STF3B header file) and TIGER 1992 (which contains both 1980 and 1990 geocodes). Other geocodes were added in response to requests by users (such as labor market areas and the 1% and 5% PUMA geographies). The screen depicting the "Source" and "Target" options presents the entire list currently available.

In some cases, census blocks can be split by another geography. For example, both zipcodes and watersheds can cross census block perimeters. In such cases, rules of association are used to assign a split census block uniquely to a target unit. For example, a block can be assigned to the zipcode with which it shares the most population.
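The "most population" rule of association can be sketched as follows. This is a minimal illustration, not Geocorr's actual code: the function name `assign_split_blocks`, the input layout (one tuple per block/zipcode overlap), the tie-breaking behavior, and the identifiers in the usage note are all assumptions made for the example.

```python
def assign_split_blocks(pieces):
    """Assign each split block to the one zipcode holding most of its population.

    `pieces` is an iterable of (block_id, zip_code, pop) tuples, one tuple
    per block/zipcode overlap (hypothetical layout, assumed for this sketch).
    On a tie, the overlap encountered first is kept (an assumption).
    """
    best = {}  # block_id -> (zip_code, pop) of the largest overlap seen so far
    for block_id, zip_code, pop in pieces:
        current = best.get(block_id)
        if current is None or pop > current[1]:
            best[block_id] = (zip_code, pop)
    return {block_id: zip_code for block_id, (zip_code, _) in best.items()}
```

Called with the made-up overlaps `[("B1", "Z1", 30), ("B1", "Z2", 12), ("B2", "Z3", 5)]`, the sketch assigns block `B1` to `Z1`, because 30 of its residents fall in that zipcode compared with 12 in `Z2`.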

Once the database was built, access to it and the ability to resolve queries rapidly were implemented using World Wide Web-based tools. Geocorr is the application that provides access to MABLE, via a form that has few required fields and many optional fields. Geocorr enables the user to pose a question, specify the needed criteria, submit the query online, and receive results rapidly. The application processes about 100,000 blocks in 5-10 seconds. The results are "correlation files", which define the degree of intersection between the selected source and target geographies. Note that either the source or the target selection can itself be a combination of geographies. For example, Geocorr can present the results of a query asking what population of zipcodes within counties (source) falls within which Congressional districts (perhaps divided further into urban versus rural portions).

Geocorr includes an extensive "Examples" document online that describes simple and complex queries and approaches to solving specific problems. Documentation is also online regarding the usefulness of such a tool in continuous measurement efforts. Finally, the Master Area Geographic Glossary of Terms (MAGGOT) document describes in detail the geographic codes and their meanings and idiosyncrasies.

2.   Inner Workings

So how does the generation of correlation files work? The screen shots presented in this paper show the Geocorr page on which parameters need to be specified. An example illustrating certain functions of this tool has been visually depicted using the Four Corners Area, located at approximately 109 degrees west longitude and 37 degrees north latitude. Four states are involved, with the San Juan River flowing diagonally through the area.

Four Corners Area GIF

In the "Four Corners Area" visualization, all census blocks (black area outlines) are depicted within a 25-mile radius of the point where the four states meet; the counties involved are "San Juan" in both Utah and New Mexico, "Montezuma" in Colorado, and "Apache" in Arizona. Since census blocks are delineated by visible features in the landscape (such as roads and rivers), the San Juan River can be seen in the outline of the blocks. For each census block in the entire viewing scene (inside and outside the previously mentioned radius), the census block spatial centroid (internal point) is depicted in blue if the block has zero population and zero housing units (the "zph" blocks) and in red if there are housing unit structures or people living within the block. Since there are 1,940 zph blocks as compared to 1,211 non-zph blocks, the red dots are drawn slightly larger (only for ease of viewing). The green circles have radii of 2, 4, 6, 8, and 20 miles. A bounding box is visually depicted inside the 20-mile circle (it could have clipped any circle). Finally, the locations of towns have been deduced from the density of red dots and are labeled.

Tables

A simple query is "What is the total population of each state within the largest circle?" Table 1 presents the results. In this instance, the universe is defined as the area within the 20-mile circle for each state; the counties themselves are therefore clipped. To complete this query, Geocorr calculates internal points for each target output area. These are new internal points for each "pie slice". Because the weighting variable is population (pop), the calculated internal points become the "population centroids" for each "pie slice". [Note: changing the weighting variable will generate different answers for pop, hus, or land area internal points.] The distance variable is the distance from that internal point to the specified point on which the radius is anchored, in this case the Four Corners location. Finally, the allocation factor (afact) depicted in this scenario is always 1.0, meaning that 100% of all population counts fall within the target area. This is expected, since we have queried the universe itself. We can also request afact2, the portion of the target geocode that falls in the source geocodes. Since the target is the universe, afact2 essentially depicts the distribution of population across all the "pie slices".
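The weighted internal point and the distance variable can be illustrated with a short sketch. It assumes that each block record is a dictionary with `lat`, `lon`, and a weight field such as `pop` (names chosen to echo the MABLE description above), that the internal point is a simple weighted mean of block coordinates, and that distance is measured along a great circle with the haversine formula. The paper does not state which averaging or distance method Geocorr actually uses, so both are assumptions.

```python
import math

def weighted_internal_point(blocks, weight="pop"):
    """Weighted mean of block internal points; None if the total weight is zero."""
    total = sum(b[weight] for b in blocks)
    if total == 0:
        return None
    lat = sum(b["lat"] * b[weight] for b in blocks) / total
    lon = sum(b["lon"] * b[weight] for b in blocks) / total
    return lat, lon

def great_circle_miles(lat1, lon1, lat2, lon2, radius_miles=3958.8):
    """Haversine distance in miles (mean Earth radius assumed to be 3958.8 mi)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_miles * math.asin(math.sqrt(a))
```

Switching `weight` from `"pop"` to `"hus"` or `"landarea"` moves the computed point, which mirrors the note above that a different weighting variable yields different internal points.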

A more complicated query is "What is the distribution of housing units within the 2-, 4-, 6-, and 8-mile concentric circles by zipcode?" Table 2 indicates that within the 8-mile radius (which now represents the outer limit of the universe), each area within each state is covered by only one zipcode. In some states, there are no zipcodes for these areas, in which case the ring entry does not appear (for example, the 2- and 4-mile radii inside Utah). The ring "geocodes" are created on the fly, as opposed to being stored in the MABLE database, and are assigned the maximum ring radius. Fractional radii are allowed, and up to 10 rings may be defined by the user (i.e., with some precalculation, ring radii can be entered that yield equal areas). Notice the abundance of housing units in ring 8 in Arizona (the "Teec Nos Pos" zipcode). Afact indicates that 93.5% of all Arizona housing units fall within this area, while at the same time afact2 indicates that the 143 housing units represent 96% of the total number of housing units within this ring (or better, concentric circle). (Note: the query depicted in the screen shots is similar to this query, except that the largest ring was included and 'pop' is the weighting variable.)
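The on-the-fly ring geocodes and the equal-area "precalculation" can be sketched as below. The function names are hypothetical. Placing a block that lies exactly on a boundary in the inner ring is an assumption, as is enforcing the 10-ring limit with an exception. The equal-area radii follow from requiring each ring to cover the same area, which gives r_k = R * sqrt(k / n) for outer radius R and n rings.

```python
import bisect
import math

def ring_code(distance_miles, radii):
    """Return the outer radius of the ring containing the distance, or None if outside.

    A point exactly on a ring boundary goes to the inner ring (an assumption).
    """
    if len(radii) > 10:
        raise ValueError("at most 10 rings may be defined")
    ordered = sorted(radii)
    i = bisect.bisect_left(ordered, distance_miles)
    return ordered[i] if i < len(ordered) else None

def equal_area_radii(outer_radius, n_rings):
    """Radii r_k = R * sqrt(k / n), so that every ring covers the same area."""
    return [outer_radius * math.sqrt(k / n_rings) for k in range(1, n_rings + 1)]
```

With radii of 2, 4, 6, and 8 miles, a block whose internal point lies 2.5 miles from the anchor receives ring code 4, and a block 8.1 miles away falls outside the universe. For an 8-mile universe split into four equal-area rings, the first radius is 4 miles, since a 4-mile circle covers one quarter of the area of an 8-mile circle.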

Another query might be "What is the relationship of 1% Public Use Microdata Areas (PUMAs) with the watersheds inside the 20-mile radius around the Four Corners location?" Table 4 indicates that these 1% PUMAs are bounded by their state lines (which is not always the case) and that each PUMA readily crosses two or more watersheds. The watersheds and PUMAs are themselves clipped by the 20-mile radius. Allocation factors are calculated by consulting the geocodes stored on each record for each census block in the MABLE database.
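To make the allocation-factor calculation concrete, the following sketch builds a small correlation list from block-level records. It assumes that each block is a dictionary carrying its geocodes and a weighting variable, that afact is the share of a source unit's weight falling in the target unit, and that afact2 is the share of a target unit's weight falling in the source unit, as described in the discussion of Table 1. The function name, record layout, and field names are hypothetical.

```python
from collections import defaultdict

def correlation_list(blocks, source, target, weight="pop"):
    """Aggregate block records into source/target rows with afact and afact2."""
    cell = defaultdict(float)       # weight in each (source, target) intersection
    src_total = defaultdict(float)  # total weight of each source unit
    tgt_total = defaultdict(float)  # total weight of each target unit
    for b in blocks:
        s, t, w = b[source], b[target], b[weight]
        cell[(s, t)] += w
        src_total[s] += w
        tgt_total[t] += w
    rows = []
    for (s, t), w in sorted(cell.items()):
        rows.append({
            source: s,
            target: t,
            weight: w,
            # share of the source unit that falls in the target unit
            "afact": w / src_total[s] if src_total[s] else 0.0,
            # share of the target unit that falls in the source unit
            "afact2": w / tgt_total[t] if tgt_total[t] else 0.0,
        })
    return rows
```

Because every block carries all of its geocodes, any pair of stored geographies can be correlated in a single pass over the selected blocks. For a given source unit, the afact values across its rows sum to 1.0 whenever the unit has nonzero weight.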

3.   Conclusions

With the Geocorr application, the user fills out a form specifying the geographic coverages of interest as well as the states and other universe-limiting geographies. The output is returned to the user via the browser, where it can be viewed and/or saved to a local disk. The generated "correlation lists" define the degree of intersection between two geographic code systems defined by the user.

The correlation files generated by Geocorr can support statistical sampling schemes, identify the scope of study areas, and form the basis of data transformations between different geographic layers. An application such as this, with the incorporation of historical layers, can function as a reference tool for future or past geographic changes as well as perform simple "lookup" queries. Geocorr could be connected to a "Geoagg" module that would house census data at the least common geography: the 1990 split blockgroups (slvl=090). The output of Geocorr would then become the input to Geoagg, delivering to the user not only the allocation factor(s) but also the transformed (aggregated) data.
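As an illustration of such a data transformation, the sketch below applies allocation factors to move a count from source units to target units. It is only a guess at what a Geoagg-style step could look like, since the module is proposed rather than described in detail. The function name, the row layout (dictionaries with an `afact` field), and the choice to skip source units that have no data are assumptions.

```python
from collections import defaultdict

def reallocate(source_values, correlation_rows, source_key, target_key):
    """Distribute each source unit's value to target units in proportion to afact."""
    out = defaultdict(float)
    for row in correlation_rows:
        value = source_values.get(row[source_key])
        if value is None:
            continue  # no data for this source unit (assumed behavior: skip it)
        out[row[target_key]] += value * row["afact"]
    return dict(out)
```

For example, if a source unit holding a count of 200 has afact values of 0.6 and 0.4 against two target units, the sketch contributes 120 and 80 to them respectively. This rests on the further assumption that the quantity being transformed is distributed like the weighting variable used to compute afact.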

Another powerful use of a tool like Geocorr is tracking changes in geography in projects such as the American Community Survey (a continuous measurement survey). This would be of considerable value in the interpretation of moving averages.

4.   Figures

Screen Shot 1
Screen Shot 2
Screen Shot 3
Screen Shot 4
Screen Shot 5
Screen Shot 6
Screen Shot 7
Screen Shot 8
Screen Shot 9


© 1998 National Research Council. CIESIN and the U.S. Government retain the right to use this work for their respective nonprofit and governmental purposes.

The name CIESIN and the world map logo are both registered trademarks of the Consortium for International Earth Science Information Network.

This work was supported through CIESIN under NASA Contract NAS5-32632 for the development and operation of the Socioeconomic Data and Applications Center (SEDAC). The opinions, conclusions, and recommendations contained herein represent those of the author and are not necessarily those of CIESIN, SEDAC, NASA, or the National Research Council.