darrell/tiger_geocoder
TIGER Geocoder
2004/10/28
A plpgsql based geocoder written for TIGER census data.
Design:
There are two components to the geocoder, the address normalizer and the
address geocoder. These two components are described separately below.
The goal of this project is to build a fully functional geocoder that can
process an arbitrary address string and, using normalized TIGER census data,
produce a point geometry reflecting the location of the given address.
- The geocoder should be simple for anyone familiar with PostGIS to install
and use.
- It should be robust enough to function properly despite formatting and
spelling errors.
- It should be extensible enough to be used with future data updates, or
alternate data sources with a minimum of coding changes.
Installation:
Refer to the INSTALL file for installation instructions.
Usage:
refcursor geocode(refcursor, 'address string');
Notes:
- The assumed format for the address is the US Postal Service standard:
() indicates a field required by the geocoder, [] indicates an optional field.
(address) [dirPrefix] (streetName) [streetType] [dirSuffix]
[internalAddress] [location] [state] [zipCode]
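
As an illustration only, the fields above could be held in a record like the
following Python sketch. The class name NormalizedAddress and its attribute
names are hypothetical and are not part of the geocoder, which is written in
plpgsql; the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizedAddress:
    """Hypothetical container mirroring the fields of the assumed format."""
    address: int                             # required: house number
    street_name: str                         # required
    dir_prefix: Optional[str] = None         # optional, e.g. 'E'
    street_type: Optional[str] = None        # optional, e.g. 'Dr'
    dir_suffix: Optional[str] = None         # optional
    internal_address: Optional[str] = None   # optional, e.g. a unit number
    location: Optional[str] = None           # optional
    state: Optional[str] = None              # optional
    zip_code: Optional[str] = None           # optional


# Made-up example: only the two required fields plus a few optional ones.
example = NormalizedAddress(address=2554, street_name="Highland",
                            dir_prefix="E", street_type="Dr",
                            location="Seattle", state="WA")
```

Fields that were not supplied stay None, which matches the NULL values shown
in the parsing examples later in this file.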
Address Normalizer:
The goal of the address normalizer is to provide a robust function to break a
given address string down into the components of an address. While the
normalizer is built specifically for the normalized US TIGER Census data, it
has been designed to be reasonably extensible to other data sets and localities.
Usage:
normalize_address('address string');
Support functions:
location_extract_countysub_exact('partial address string', 'state abbreviation')
location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
location_extract_place_exact('partial address string', 'state abbreviation')
location_extract_place_fuzzy('partial address string', 'state abbreviation')
cull_null('string')
count_words('string')
get_last_words('string')
state_extract('partial address string')
levenshtein_ignore_case('string', 'string')
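
The body of levenshtein_ignore_case is not shown in this file. Judging by its
name alone, it compares two strings by edit distance after folding case. The
Python sketch below illustrates that idea under this assumption; it is not
the plpgsql implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ignore_case(a: str, b: str) -> int:
    """Assumed behaviour: fold both strings to upper case, then compare."""
    return levenshtein(a.upper(), b.upper())
```

With this sketch, levenshtein_ignore_case('wa', 'WA') is 0, so differences in
capitalization alone never count as spelling errors.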
Notes:
- A set of lookup tables, listed below, is used to provide street type,
secondary unit and direction abbreviation standards for a given set
of data. These are provided with the geocoder, but will need to be
customized for the data used.
direction_lookup
secondary_unit_lookup
street_type_lookup
- Additional lookup tables are required to perform matching for state
and location extraction. The state lookup is derived from the
US Postal Service standards, while the place and county subdivision
lookups are generated from the dataset. The creation statements for
the place and countysub tables are given in the INSTALL file.
state_lookup
place_lookup
countysub_lookup
- The use of lookup tables is intended to provide a versatile way of applying
the normalizer to data sets and localities other than the US Census TIGER
data. However, due to the need for matching based extraction in the event
of poorly formatted or incomplete address strings, assumptions are made about
the data available. Most notably the division of place and county
subdivision. For data sets without exactly two logical divisions in location
precision, code changes will be required.
- The normalizer will perform better the more information is provided.
- The process for normalization is roughly as follows:
    Extract the address from the beginning.
    Extract the zipCode from the end.
    Extract the state, using a fuzzy search if exact matching fails.
    Attempt to extract the location by parsing the punctuation
      of the address.
    Find and remove any internal address.
    If internal address was found:
      Set location as everything between internal address and state.
    Extract the street type from the string.
    If multiple potential street types are found:
      If internal address was found:
        Extract the last street type that precedes the internal address.
      Else:
        Extract the last street type.
    If street type was found:
      If a word beginning with a number follows the street type:
        This indicates the street type is part of the street name,
          e.g. 'State Hwy 92a'.
        Set street type to NULL.
      Else if location not yet found:
        Set location as everything between street type and state.
      Extract direction prefix from start of street name.
      If internal address was found:
        Extract direction suffix from end of street name.
      Else:
        Extract direction suffix from start of location.
      Set street name as everything that is not the address, direction
        prefix or suffix, internal address, location, state or
        zip code.
    Else:
      If internal address was found:
        Extract direction prefix from beginning of string.
        Extract direction suffix before internal address.
        Set street name as everything that is not the address, direction
          prefix or suffix, internal address, location, state or
          zip code.
      Else:
        Extract direction suffix.
        If direction suffix is found:
          Set location as everything between direction suffix and state,
            zip or end of string as appropriate.
          Extract direction prefix from beginning of string.
          Set street name as everything that is not the address, direction
            prefix or suffix, internal address, location, state or
            zip code.
        Else:
          Attempt to determine the location via exact comparison against
            the places lookup.
          Attempt to determine the location via exact comparison against
            the countysub lookup.
          Attempt to determine the location via fuzzy comparison against
            the places lookup.
          Attempt to determine the location via fuzzy comparison against
            the countysub lookup.
          Extract direction prefix.
          Set street name as everything that is not the address, direction
            prefix or suffix, internal address, location, state or
            zip code.
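
The first two steps of the process above, and the rule of taking the last
street type, can be sketched in Python as follows. This is a simplified
illustration, not the plpgsql code: the STREET_TYPES dictionary is a tiny
hypothetical stand-in for street_type_lookup, the function names are
invented for the example, and only 5-digit zip codes are handled.

```python
import re

# Hypothetical stand-in for the street_type_lookup table.
STREET_TYPES = {"STREET": "St", "ST": "St", "AVENUE": "Ave", "AVE": "Ave",
                "DRIVE": "Dr", "DR": "Dr", "WAY": "Way"}


def extract_address_and_zip(raw):
    """Peel the leading house number and a trailing 5-digit zip off the string.

    Returns (address, zip_code, remaining text); missing parts are None.
    """
    text = raw.strip()
    address = zip_code = None
    match = re.match(r"(\d+)\s+", text)
    if match:
        address = int(match.group(1))
        text = text[match.end():]
    match = re.search(r"\s+(\d{5})$", text)
    if match:
        zip_code = match.group(1)
        text = text[:match.start()]
    return address, zip_code, text


def last_street_type(text):
    """Return (word index, abbreviation) of the last street type word, or None."""
    words = text.split()
    for index in range(len(words) - 1, -1, -1):
        abbreviation = STREET_TYPES.get(words[index].upper())
        if abbreviation:
            return index, abbreviation
    return None
```

For '29645 7th Street SW Federal Way 98023' the sketch returns 29645 and
'98023', leaving '7th Street SW Federal Way'; last_street_type then selects
'Way' rather than 'Street', because the last candidate wins.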
Address Geocoder:
The goal of the address geocoder is to provide a robust means of searching
the database for a match to whatever data the user provides. To accomplish
this, the geocoder uses a series of checks and fallthrough cases. Starting with
the most specific combination of parameters, the algorithm works outwards
towards the vaguest combination until valid results are found. As a result,
the more accurate the information provided, the sooner a match is found and
the faster the algorithm returns.
Usage:
refcursor geocode(refcursor, 'address string');
Support functions:
geocode_address(cursor, address, 'dirPrefix', 'streetName', 'streetType',
'dirSuffix', 'location', 'state', zipCode)
geocode_address_zip(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', zipCode)
geocode_address_countysub_exact(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_countysub_fuzzy(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_place_exact(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_place_fuzzy(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB')
rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB',
'locationA', 'locationB')
location_extract_countysub_exact('partial address string', 'state abbreviation')
location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
location_extract_place_exact('partial address string', 'state abbreviation')
location_extract_place_fuzzy('partial address string', 'state abbreviation')
cull_null('string')
count_words('string')
get_last_words('string')
state_extract('partial address string')
levenshtein_ignore_case('string', 'string')
interpolate_from_address(given address, from address L, to address L,
from address R, to address R, street segment)
interpolate_from_address(given address, 'from address L', 'to address L',
'from address R', 'to address R', street segment)
includes_address(given address, from address L, to address L,
from address R, to address R)
includes_address(given address, 'from address L', 'to address L',
'from address R', 'to address R')
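
The bodies of includes_address and interpolate_from_address are not shown in
this file. The Python sketch below illustrates one plausible reading of their
signatures: test whether the given house number falls within the left or
right address range, then place a point at the matching fraction of the way
along the street segment. It assumes the segment is a list of (x, y) vertices
ordered from the 'from' end to the 'to' end, and it ignores odd/even side
parity; the helper names in_range and point_along are hypothetical.

```python
import math


def in_range(given, start, end):
    """True if given lies between start and end, in either order."""
    if start is None or end is None:
        return False
    return min(start, end) <= given <= max(start, end)


def includes_address(given, from_l, to_l, from_r, to_r):
    """True if either side's address range contains the given number."""
    return in_range(given, from_l, to_l) or in_range(given, from_r, to_r)


def point_along(segment, fraction):
    """Point at the given fraction (0..1) of the polyline's total length."""
    lengths = [math.dist(p, q) for p, q in zip(segment, segment[1:])]
    target = fraction * sum(lengths)
    for (x0, y0), (x1, y1), length in zip(segment, segment[1:], lengths):
        if length > 0 and target <= length:
            t = target / length
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        target -= length
    return segment[-1]


def interpolate_from_address(given, from_l, to_l, from_r, to_r, segment):
    """Linear interpolation of a house number along a street segment."""
    if in_range(given, from_l, to_l):
        start, end = from_l, to_l
    elif in_range(given, from_r, to_r):
        start, end = from_r, to_r
    else:
        return None
    fraction = 0.0 if start == end else (given - start) / (end - start)
    return point_along(segment, fraction)
```

With a left range of 100 to 200 and a straight segment from (0, 0) to
(10, 0), house number 150 lands halfway along the segment.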
Notes:
- The geocoder is quite dependent on the address normalizer. The direction
prefix and suffix, streetType and state are all expected to be standard
abbreviations that will match exactly to the database.
- Either a zip code or a location must be provided. If neither is given, no
exception is thrown, but the result is null. The result is also null if the
zip code or location, combined with the other information provided, cannot
be matched against the database.
- The process is as follows:
    If a zipCode is provided:
      Check if the zipCode, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
            interpolated geographic point.
    If location exactly matches a place:
      Check if the place, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
            interpolated geographic point.
    If location exactly matches a countySubdivision:
      Check if the countySubdivision, streetName and optionally state
        match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
            interpolated geographic point.
    If location approximately matches a place:
      Check if the place, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
            interpolated geographic point.
    If location approximately matches a countySubdivision:
      Check if the countySubdivision, streetName and optionally state
        match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
            interpolated geographic point.
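
The fallthrough order above can be sketched in Python as an ordered list of
searches, where the first one that yields a match wins. This is only an
illustration of the control flow: the ROADS list is made-up data standing in
for the TIGER tables, the names search_by, STRATEGIES and geocode_cascade are
hypothetical, and only the zip and exact place steps are filled in.

```python
# Made-up road segments standing in for the TIGER tables.
ROADS = [
    {"street_name": "Main", "zip": "00001", "place": "Exampleville",
     "from_addr": 100, "to_addr": 198},
]


def search_by(key):
    """Build a search matching roads on `key`, street name and address range."""
    def search(addr):
        if addr.get(key) is None:
            return []
        return [road for road in ROADS
                if road[key] == addr[key]
                and road["street_name"] == addr["street_name"]
                and road["from_addr"] <= addr["address"] <= road["to_addr"]]
    return search


# Ordered from most to least specific, following the process above.
STRATEGIES = [
    ("zip", search_by("zip")),
    ("place_exact", search_by("place")),
    # countysub_exact, place_fuzzy and countysub_fuzzy would follow here.
]


def geocode_cascade(addr):
    """Return (strategy name, matches) for the first strategy that hits."""
    for name, search in STRATEGIES:
        matches = search(addr)
        if matches:
            return name, matches
    return None
```

An address with a wrong zip code but a correct place falls through the zip
step and is matched by the place step; an address with neither a zip code nor
a place yields None, mirroring the null result described in the notes.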
Current Issues / Known Failures:
- If a location starts with a direction, e.g. East Seattle, and no suffix
direction is given, the direction from the location will be interpreted
as the street's suffix direction.
'18196 68th Ave East Seattle Washington'
address = 18196
dirPrefix = NULL
streetName = '68th'
streetType = 'Ave'
dirSuffix = 'E'
location = 'Seattle'
state = 'WA'
zip = NULL
- The last possible street type in the string is interpreted as the street type
to allow street names to contain type words. As a result, any location
containing a street type will have the type interpreted as the street type.
'29645 7th Street SW Federal Way 98023'
address = 29645
dirPrefix = NULL
streetName = 7th Street SW Federal
streetType = Way
dirSuffix = NULL
location = NULL
state = NULL
zip = 98023
- While some state misspellings will be picked up by the fuzzy searches,
misspelled or non-standard abbreviations may not be picked up, because
they are too short to produce a matching soundex code (soundex uses an
initial character plus three codeable characters).
'2554 E Highland Dr Seatel Wash'
address = 2554
dirPrefix = 'E'
streetName = 'Highland'
streetType = 'Dr'
dirSuffix = NULL
location = 'Seatel Wash'
state = NULL
zip = NULL
- If neither a location nor a zip code is found by the normalizer, no search
is performed.
- If no street type, direction suffix or location is given in the
address string, the street name is generally misclassified as the
location.
'98 E Main Washington 98012'
address = 98
dirPrefix = 'E'
streetName = NULL
streetType = NULL
dirSuffix = NULL
location = 'Main'
state = 'WA'
zip = 98012
- If no street type is given and the street name contains a type word, then the
type in the street name is interpreted as the street type.
'1348 SW Orchard Seattle wa 98106'
1348::SW:Orch::Seattle:WA:98106
address = 1348
dirPrefix = NULL
streetName = SW
streetType = Orch
dirSuffix = NULL
location = Seattle
state = WA
zip = 98106
- Misspellings of words are only handled so far as their soundex values match.
'Hiland' will not be matched with 'Highland'
soundex('Hiland') = 'H453'
soundex('Highland') = 'H245'
- Missing words in location or street name are not handled.
'Redmond Fall' will not be matched with 'Redmond Fall City'
- Unacceptable failure cases:
The street name is parsed out as 'West Central Park'
'500 South West Central Park Ave Chicago Illinois 60624'
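
Several of the failures above come down to how soundex codes are built. The
Python sketch below implements the standard American Soundex rules (an
initial letter plus up to three digit codes, zero-padded) to make the
'Hiland' and 'Highland' values reproducible. It is an independent
illustration, not the soundex function the geocoder calls.

```python
# Digit codes for consonants; vowels and H, W, Y have no code.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
            if len(result) == 4:
                break
        if letter not in "HW":   # H and W do not separate repeated codes
            previous = code
    return result.ljust(4, "0")


print(soundex("Hiland"), soundex("Highland"))   # H453 H245
```

The extra 'g' in 'Highland' contributes the code 2, which shifts every later
digit, so the two codes differ from the second character onward. Short
abbreviations fail for a related reason: 'Wash' has only one codeable
consonant after its initial letter, so its code cannot equal that of
'Washington'.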
About
Updated version of the PostGIS Geocoder for TIGER 2008.