Generation of 5% and 1% PUMA Boundaries
Since the early 1990s, the absence of an authoritative boundary layer for the geographies associated with the Public Use Micro Sample (PUMS) data files has been a problem. The PUMS data files represent the decennial "long-form" questionnaire and are frequently used for national research. However, mapping these data or comparing them to other data sources has been tedious, to say the least. From the very onset of our design of a geographic correspondence engine (Geocorr), we explored the option of generating PUMA boundary layers. This engine provides the ability to query the relationship of geographic layers, or a combination of layers, to each other. It performs this function primarily by accessing the Master Area Block Level Equivalency (MABLE) database. This database is essentially a collection of census block records identifying the geographic relationship of each census block to other geographic layers. From this database, equivalency files (also called correspondence files) can be created from any source geographic layer to any target layer selected. The generation of the areas representing Public Use Micro Sample units (PUMAs) started with the addition of the 5% and 1% PUMA geographies to the MABLE database. In effect, the equivalency files of any PUMA type to all geographies listed in the source/target geography selection windows of Geocorr are now provided online. Simply load the Geocorr URL and follow the "Samples" and "PUMA Matrixes" links, or create your own.
The general idea of this project is to create these PUMA boundaries and post them on the "Archive of Census Related Products" (ACRP). Once they are available, we will attempt to aggregate census summary data and create standard extract files pre-linked to these boundaries. These files consist of about 225 frequently requested variables that describe an area in demographic and economic terms. For more information, consult the supporting documentation on the ACRP and the "Basic Tables" reports at the Urban Information Center (UIC). There are also cross tabulation engines accessible via the internet (see "Information Access" below) which can generate custom tabulations of the PUMS microdata. Any tabulations generated by state for PUMA areas can also be linked to these boundaries for mapping purposes.
Building The Equivalency Files
The only geographic areas identified on the 1990 PUMS files are states and PUMAs (and some metropolitan areas). PUMS data files are released in three samples, each named for the share of the total population it represents: 5%, 1% and 3%. On the 5% sample ("A" Sample; we'll refer to its geography as APUMA), every effort was made to keep meaningful socioeconomic or planning areas together. On the 1% sample ("B" Sample; we'll refer to its geography as BPUMA), every effort was made to separate metropolitan areas from non-metropolitan areas. In both geographies, PUMAs may contain noncontiguous parts in order to meet the minimum population requirements. In addition, BPUMAs may span state lines, whereas APUMAs are always bounded by state lines. The data records for the BPUMAs that span state lines carry the code "99" as their state identification. Related to this discussion: the "elderly" PUMS, the 3% sample ("E" Sample; we'll refer to its geography as EPUMA), has the same geography as the 5% sample.
PUMA boundaries were proposed by state or local officials within each state, with final approval by the Census Bureau. Boundaries of PUMA areas had to be defined in terms of counties, places, county subdivisions or census tracts. In the large majority of cases, PUMAs consist of one or more counties. In larger metro counties, they are frequently broken down along the lines of the smaller geographic areas. A strict guideline for defining a PUMA is that it had to have a minimum population of 100,000 persons (75,000 in New England) in the 1990 census. The Census Bureau has distributed several products in an attempt to define the boundaries of these entities. None of them is complete, and in many cases they obscure the fairly simple nature of the PUMA assignment, especially in metropolitan areas. From the "third generation" equivalency file we received (from Carmen Campbell at the Census Bureau), known as "pumgef", an attempt was made to assign every census block to a PUMA. Establishing such relationships would allow for the creation of any equivalency file using the census blocks as atomic units. Approximately 1,300 census blocks remained unassigned (down from 70,000 in the "second generation" file). Some detective work was required to fill in the holes. Once the census block to APUMA/BPUMA relationships were established in the MABLE database, Geocorr was invoked to generate a national equivalency file. Merging the geography of census blocks into APUMA and BPUMA coverages was deemed too large a project for the resources available, so a coarser geography was selected. The geography selected was census tracts: the "140" census tracts within the summary level designations of the Summary Tape Files (STF). Level "140" tracts may be split by PUMA geography. An analysis of a Geocorr generated national equivalency file, weighted by land area, revealed:
of 60,897 tracts, 2,736 were split by PUMAs (4.5%)
the mean overlap for all tracts was 99.37%
of the 2,736 split tracts, the mean overlap was 86.08%
the range of overlap spans from 37% to 99.9%
of 51 states, 20 contained no split tracts at all
of 60,897 tracts, 3,521 were split by PUMAs (5.8%)
the mean overlap for all tracts was 99.16%
of 3,521 split tracts, the mean overlap was 85.53%
the range of overlap spans from 37% to 99.9%
of 51 states, 14 contained no split tracts at all
These statistics were deemed robust enough, considering the land area overlap between census tract and PUMA geography, to proceed with merging some 60,000 census tracts into coarser levels of geographies. In effect, the PUMA boundaries were "rounded off" to census tract. For the 1990 PUMS files, there are a total of 1,726 APUMAs and 1,760 BPUMAs (including unique "99" areas).
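The kind of analysis summarized above can be sketched in a few lines of Python. This is only an illustration, not the Geocorr implementation: the record layout (tract id, PUMA id, land area per census block), the function names, and the reading of "overlap" as the largest land-area share a tract has in any single PUMA are all assumptions made for the example.

```python
from collections import defaultdict


def tract_puma_shares(block_records):
    """Aggregate block land area into tract-to-PUMA land-area shares.

    block_records: iterable of (tract_id, puma_id, land_area) tuples,
    one per census block (hypothetical layout, not the MABLE format).
    Returns {tract_id: {puma_id: share of the tract's land area}}.
    """
    area = defaultdict(lambda: defaultdict(float))
    for tract, puma, land in block_records:
        area[tract][puma] += land

    shares = {}
    for tract, by_puma in area.items():
        total = sum(by_puma.values())
        if total == 0:
            continue  # water-only tracts carry no land-area weight
        shares[tract] = {p: a / total for p, a in by_puma.items()}
    return shares


def split_statistics(shares):
    """Summarize how well tracts nest inside PUMAs.

    Assumption: a tract's "overlap" is its largest share in one PUMA.
    """
    largest = {t: max(s.values()) for t, s in shares.items()}
    split = {t: v for t, v in largest.items() if len(shares[t]) > 1}
    return {
        "tracts": len(largest),
        "split": len(split),
        "mean_overlap_all": sum(largest.values()) / len(largest),
        "mean_overlap_split": sum(split.values()) / len(split) if split else None,
    }


# Made-up blocks: tract T1 lies wholly in P1, tract T2 is split 75/25.
blocks = [("T1", "P1", 2.0), ("T1", "P1", 1.0),
          ("T2", "P1", 3.0), ("T2", "P2", 1.0)]
stats = split_statistics(tract_puma_shares(blocks))
```

With real data, the block records would come from an equivalency file generated by Geocorr rather than from a hand-built list.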
Merging the Geographies
The census tract boundaries were retrieved from the publicly accessible ACRP data archive. The equivalency files obtained through Geocorr were processed one additional step to resolve the assignment of split tracts. The arbitrary rule followed was to assign each split census tract to the PUMA with the largest land area overlap. This ensured that a census tract was always uniquely assigned within a PUMA type. Atlas*GIS desktop software was used for manipulating the geographies and the equivalency files. Several problems were resolved manually using Atlas*GIS, visual verification, and the set of PUMA maps provided as Appendix G of the Public Use Micro Sample technical documentation. These maps, although helpful, do not show any detail for PUMAs in urban areas; in essence they are really "3-digit PUMA" maps. The PUMA identification we adopted is composed of a 5-digit character code (with trailing and leading zeroes) prefixed with the 2-digit character FIPS code for each state. Generally, when the last two digits are not "00", it indicates a county that has been split into subareas.
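The two rules described above (largest land-area overlap wins, and a 7-character identifier built from state FIPS code plus PUMA code) can be illustrated with a minimal Python sketch. The function names and the input layout are hypothetical, and the tie-breaking behavior is an assumption: the text does not say how exact ties were handled.

```python
def assign_split_tracts(shares):
    """Assign every tract to the PUMA holding its largest land-area share.

    shares: {tract_id: {puma_id: share}} (hypothetical layout).
    Ties go to whichever PUMA appears first in the inner dict; this is
    an assumption of the sketch, not a documented rule.
    """
    return {tract: max(by_puma, key=by_puma.get)
            for tract, by_puma in shares.items()}


def puma_polygon_id(state_fips, puma_code):
    """Build the 7-character id: 2-digit state FIPS + 5-digit PUMA code."""
    return f"{int(state_fips):02d}{int(puma_code):05d}"


# Made-up shares: T2 is split between two PUMAs and goes to the larger part.
shares = {"T1": {"00100": 1.0}, "T2": {"00100": 0.25, "00200": 0.75}}
assignment = assign_split_tracts(shares)

# Zero padding keeps leading zeroes in both parts of the identifier.
nh_id = puma_polygon_id("33", "601")     # "3300601"
ar_id = puma_polygon_id("05", "92500")   # "0592500"
```

Keeping the identifier as a character string rather than a number matters here, since states such as Arkansas ("05") would otherwise lose their leading zero.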
"Holes" in PUMA boundary
Usually these are fairly large holes: census tracts consisting only of water areas, which did not get a PUMA code assigned. In all cases, the census tracts constituting these holes were queried and the islands deleted. Certain holes did not have census tract boundaries underneath. This is the result of a failure in the chaining process used to create the ACRP. In such cases, the failure of the tracts (within county) was verified (see the documentation on the ACRP) and the geography was corrected using the equivalency file and the hardcopy maps.
"Holes" on PUMA boundary
These holes are a fairly infrequent occurrence and were resolved using the approach mentioned above. Noteworthy in these cases is that tracts along water areas are often coded differently. For example, the Carolinas show both inclusion and exclusion of the barrier islands, and in the "mitten" of Michigan there are examples of tracts which contain small areas of land and large areas of water. Overall, though, the coast and shore lines are pretty well defined.
Number of vertices per PUMA
The objective was to make these boundary layers compatible with Atlas*GIS for DOS. For very large PUMAs, the maximum number of vertices per PUMA (about 4,000 points) can be exceeded. In such cases, the tracts which compose the target PUMA were first thinned (points were removed if not significant within 0.1 miles) and then unioned. This means that the target PUMA boundary does not exactly match the surrounding PUMAs if viewed at high resolution scales. Rather than lose information, it was decided to create these polygons this way. For users who want to bypass these restrictions, the boundaries of the surrounding polygons define the exact outline of the target PUMA. As an example, view the large PUMA in the city of Denver, Colorado.
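To give a feel for what thinning within a tolerance does, here is a small Python sketch of the Douglas-Peucker line simplification method. This is a swapped-in, commonly used technique: the text above does not say which algorithm Atlas*GIS applies, so the choice of method, the function names, and the assumption that coordinates are planar and expressed in miles are all illustrative.

```python
import math


def _distance_to_segment(p, a, b):
    """Distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def thin(points, tolerance=0.1):
    """Drop points lying within `tolerance` of the simplified line.

    Iterative Douglas-Peucker; a stack is used instead of recursion so
    long boundaries do not hit Python's recursion limit.
    """
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        index, worst = -1, 0.0
        for i in range(lo + 1, hi):
            d = _distance_to_segment(points[i], points[lo], points[hi])
            if d > worst:
                index, worst = i, d
        if worst > tolerance:
            keep[index] = True  # significant point: keep it, split there
            stack.append((lo, index))
            stack.append((index, hi))
    return [p for p, k in zip(points, keep) if k]


# Made-up boundary: the 0.05-mile wiggle is dropped, the 1-mile spike stays.
line = [(0, 0), (1, 0.05), (2, 0), (3, 1.0), (4, 0)]
thinned = thin(line, tolerance=0.1)
```

Because each tract is thinned independently of its neighbors, shared edges can end up with slightly different vertices, which is consistent with the mismatch at high resolution noted above.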
PUMAs are not compact contiguous areas. They are frequently "gerrymandered" to make them fall within the Bureau's guidelines for size and homogeneity of population. As a result, it is not uncommon to find PUMAs that are made up of multiple parts ("islands") and/or contain embedded PUMAs ("lakes"). Such oddities could also be caused by errors in the geographic equivalency file. We are confident that we have researched those equivalencies and eliminated such errors, but we are very interested in feedback from any users who think they have spotted an "island" or a "lake" that was not part of the definition of a PUMA.
The number of BPUMAs extracted from the PUMS data matches the total number of unique BPUMAs in the geographic file. The data file BPUMAs prefixed with "99" must, however, be carefully mapped to all contributing states by matching only on the 5-digit BPUMA. These BPUMAs are listed at the end of this file. For each one, the list gives not only all states contributing to the BPUMA but also the counties involved. For the APUMAs, one appears to be missing in the geographic files. This APUMA (3300601 in NH) did not appear on our pumgef equivalency file. Consult the detailed listing at the end of this file.
Note: The missing apuma was fixed and added to all boundary files posted on 1.17.97. The 00602 tracts carried the 00601 codes.
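The "99" matching rule described above can be sketched in Python as follows. The function names are hypothetical; the polygon identifiers for BPUMAs 85800 and 87900 are taken from the list at the end of this file, while "0100100" is a made-up single-state identifier used only to show the ordinary case.

```python
from collections import defaultdict


def index_by_bpuma(polygon_ids):
    """Group 7-character polygon ids (state FIPS + BPUMA) by 5-digit BPUMA."""
    index = defaultdict(list)
    for pid in polygon_ids:
        index[pid[2:]].append(pid)
    return index


def polygons_for_record(record_id, index):
    """Return the polygon ids that a PUMS data record id maps to.

    Records whose state code is "99" are matched on the 5-digit BPUMA
    only, so they can map to polygons in several states. All other
    records must match the full 7-character id.
    """
    state, bpuma = record_id[:2], record_id[2:]
    candidates = index.get(bpuma, [])
    if state == "99":
        return sorted(candidates)
    return [record_id] if record_id in candidates else []


polygons = ["1085800", "2485800", "3485800", "2787900", "5587900",
            "0100100"]  # last id is made up for illustration
index = index_by_bpuma(polygons)

multi_state = polygons_for_record("9985800", index)
single_state = polygons_for_record("0100100", index)
unmatched = polygons_for_record("9900600", index)  # no polygon with this code
```

An empty result, as for "9900600" in this sketch, is the kind of case that should be reviewed by hand rather than silently dropped.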
Readme files are present to guide you along. All the information in all the directories for the PUMAs (agf/ shp/ and bna/) are identical, just presented differently. The "master file", containing all apumas, bpumas and state outlines is the "agf/agfpumas.zip" file. A brief outline (using URLs):
The geographic correspondence engine, known as "Geocorr", is publicly available at:
Bureau of the Census
Office of Socioeconomic Data and Analysis (OSEDA)
Socioeconomic Data and Applications Center
For all general questions and problems contact SEDAC User Services at http://sedac.uservoice.com/knowledgebase
Atlas*GIS and ArcInfo are registered trademarks of Environmental Systems Research Institute (ESRI), Redlands, CA.
Explanation: This section shows any discrepancies between the data files, the 'pumgef' equivalency file, and the polygon identifications we ended up with after merging. It also details the areas for the "99" BPUMAs, which can span state lines.
extract: programs written by Al Anderson that access the 5% and 1% PUMS files.
equiv: "Geocorr equivalency file"; programs written by John Blodgett based on the 'pumgef' equivalency file, obtained via Carmen Campbell from the Census Bureau.
polid: programs written by Henk Meij merging census tract (rounding off) geography to PUMA geography (bna files with polid identification).
5% APUMA RESULTS
Summary: Found one APUMA which exists in the PUMS data files but not in the 'pumgef' equivalency file (thus it is not in the bna files).
APUMA  ssAPUMA  ss  PERSONS  COUNTIES
-----  -------  --  -------  --------
00601  3300601  NH  142,537  unknown
only exists in extract (not in equiv and not in polid)
Note: Fixed and new files posted on 1.17.97.
1% BPUMA RESULTS
Summary: all polygons matched. Since the data files use "99" as the state code, care must be taken when matching data records to geographic records. In the latter, the polygon identification carries the actual state FIPS code, not the "99" designation. Here is a list of them. We did find two phantom BPUMAs whose existence we could not verify. Those are the ones labeled "unknown" below.
BPUMA  ssBPUMA  ssPUMS   ss  PERSONS  COUNTIES
-----  -------  -------  --  -------  --------
00600  ??00600  9900600  ??   11,528  unknown
00700  ??00700  9900700  ??   40,810  unknown
85800  1085800  9985800  DE  549,777  10003
       2485800           MD           24015
       3485800           NJ           34033
87900  2787900  9987900  MN  105,339  27025 27059
       5587900           WI           55109
89500  1889500  9989500  IN  209,728  18029
       3989500           OH           39061
92500  0592500  9992500  AR  150,726  05035
       2892500           MS           28033
       4792500           TN           47167
93500  3493500  9993500  NJ  144,684  34041
       4293500           PA           42025
94000  1994000  9994000  IA   95,214  19155
       3194000           NE           31055 31177
94500  4794500  9994500  TN  127,635  47073
       5194500           VA           51169 51191 51520
95900  1795900  9995900  IL  165,303  17073 17161
       1995900           IA           19163
96000  1796000  9996000  IL  175,677  17161
       1996000           IA           19163
96300  2196300  9996300  KY  305,022  21019 21043 21089
       3996300           OH           39087
       5496300           WV           54011 54099
96600  1896600  9996600  IN  113,354  18129 18173
       2196600           KY           21101
97200  0197200  9997200  AL  225,530  01113
       1397200           GA           13053 13215
97500  2797500  9997500  MN  227,667  27137
       5597500           WI           55031
97800  2397800  9997800  ME  207,507  23031
       3397800           NH           33015
98100  0598100  9998100  AR  171,652  05033 05131
       4098100           OK           40135
98300  2198300  9998300  KY  161,865  21047
       4798300           TN           47125
98500  3998500  9998500  OH  147,134  39013
       5498500           WV           54051 54069
98700  2798700  9998700  MN  142,890  27027
       3898700           ND           38017
98900  3998900  9998900  OH  141,306  39167
       5498900           WV           54107
99100  3999100  9999100  OH  136,712  39081
       5499100           WV           54009 54029
99300  0599300  9999300  AR  114,174  05091
       4899300           TX           48037
99500  1999500  9999500  IA  112,266  19193
       3199500           NE           31043
99700  2499700  9999700  MD  122,884  24001 24023
       5499700           WV           54057