postgis/extras/tiger_geocoder/tiger_2006andbefore
2011-01-28 15:01:30 +00:00
..
geocode reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
import reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
normalize reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
orig reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
tables reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
utility reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
create_geocode.sql reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
INSTALL reshuffle in preparation for merging in tiger 2010 support version 2011-01-25 06:31:28 +00:00
README some other minor doc corrections. Stamp files with svn author, revision etc keywords 2011-01-28 15:01:30 +00:00

$Id: README 6731 2011-01-25 18:08:57Z robe $
TIGER Geocoder

  2004/10/28

  A plpgsql based geocoder written for TIGER census data.

Design:

There are two components to the geocoder, the address normalizer and the
address geocoder.  These two components are described seperatly below.

The goal of this project is to build a fully functional geocoder that can 
process an arbitrary address string and, using normalized TIGER censes data,
produce a point geometry reflecting the location of the given address.

- The geocoder should be simple for anyone familiar with PostGIS to install
  and use.
- It should be robust enough to function properly despite formatting and 
  spelling errors.
- It should be extensible enough to be used with future data updates, or
  alternate data sources with a minimum of coding changes.

Installation:

	Refer to the INSTALL file for installation instructions.

Usage:

	refcursor geocode(refcursor, 'address string');

Notes:

- The assumed format for the address is the US Postal Service standard:
  () indicates a field required by the geocoder, [] indicates an optional field.

	(address) [dirPrefix] (streetName) [streetType] [dirSuffix]
	[internalAddress] [location] [state] [zipCode]



Address Normalizer:

The goal of the address normalizer is to provide a robust function to break a
given address string down into the components of an address.  While the
normalizer is built specifically for the normalized US TIGER Census data,  it
has been designed to be reasonably extensible to other data sets and localities.

Usage:

	normalize_address('address string');

Support functions:

	location_extract_countysub_exact('partial address string', 'state abbreviation')
	location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
	location_extract_place_exact('partial address string', 'state abbreviation')
	location_extract_place_fuzzy('partial address string', 'state abbreviation')
	cull_null('string')
	count_words('string')
	get_last_words('string')
	state_extract('partial address string')
	levenshtein_ignore_case('string', 'string')

Notes:

- A set of lookup tables, listed below, is used to provide street type,
  secondary unit and direction abbreviation standards for a given set
  of data.  These are provided with the geocoder, but will need to be
  customized for the data used.

	direction_lookup
	secondary_unit_lookup
	street_type_lookup

- Additional lookup tables are required to perform matching for state
  and location extraction.  The state lookup is derived from the
  US Postal Service standards, while the place and county subdivision
  lookups are generated from the dataset.  The creation statements for
  the place and countysub tables are given in the INSTALL file.

	state_lookup
	place_lookup
	countysub_lookup

- The use of lookup tables is intended to provide a versatile way of applying
  the normalizer to data sets and localities other than the US Census TIGER
  data.  However, due to the need for matching based extraction in the event
  of poorly formatted or incomplete address strings, assumptions are made about
  the data available.  Most notably the division of place and county
  subdivision.  For data sets without exactly two logical divisions in location
  precision, code changes will be required.

- The normalizer will perform better the more information is provided.

- The process for normalization is roughly as follows:

	Extract the address from the beginning.
	Extract the zipCode from the end.
	Extract the state, using a fuzzy search if exact matching fails.
	Attempt to extract the location by parsing the punctuation 
			of the address.
	Find and remove any internal address.
	If internal address was found:
		Set location as everything between internal address and state.
	Extract the street type from the string.
	If multiple potential street types are found:
	    If internal address was found:
			Extract the last street type that preceeds the internal address.
		Else:
			Extract the last street type.
	If street type was found:
		If a word beginning with a number follows the street type.
			This indicates the street type is part of the street name,
					eg. 'State Hwy 92a'.
			Set street type to NULL.
		Else if location not yet found:
			Set location as everything between street type and state.
		Extract direction prefix from start of street name.
		If internal address was found:
			Extract direction suffix from end of street name.
		Else:
			Extract direction suffix from start of location.
		Set street name as everything that is not the address, direction
				prefix or suffix, internal address, location, state or
				zip code.
	Else:
		If internal address was found:
			Extract direction prefix from beginning of string.
			Extract direction suffix before internal address.
			Set street name as everything that is not the address, direction
					prefix or suffix, internal address, location, state or
					zip code.
		Else:
			Extract direction suffix.
			If direction suffix is found:
				Set location as everything between direction suffix and state,
						zip or end of string as appropriate.
				Extract direction prefix from beginning of string.
				Set street name as everything that is not the address, direction
						prefix or suffix, internal address, location, state or
						zip code.
			Else:
				Attempt to determine the location via exact comparison against 
						the places lookup.
				Attempt to determine the location via exact comparison against
						the countysub lookup.
				Attempt to determine the location via fuzzy comparison against
						the places lookup.
				Attempt to determine the location via fuzzy comparison against
						the countysub lookup.
				Extract direction prefix.
				Set street name as everything that is not the address, direction
						prefix or suffix, internal address, location, state or
						zip code.



Address Geocoder:

The goal of the address geocoder is to provide a robust means of searching
the database for a match to whatever data the user provides.  To accomplish
this, the coder uses a series of checks and fallthrough cases.  Starting with
the most specific combination of parameters, the algorithm works outwards
towards the most vague combination, until valid results are found.  The result
of this is that the more accurate information that is provided, the faster the
algorithm will return.

Usage:

	normalize_address('address string');

Support functions:

	geocode_address(cursor, address, 'dirPrefix', 'streetName', 'streetType',
			'dirSuffix', 'location', 'state', zipCode)
	geocode_address_zip(cursor, address, 'dirPrefix', 'streetName',
			'streetType', 'dirSuffix', zipCode)
	geocode_address_countysub_exact(cursor, address, 'dirPrefix', 'streetName',
			'streetType', 'dirSuffix', 'location', 'state')
	geocode_address_countysub_fuzzy(cursor, address, 'dirPrefix', 'streetName',
			'streetType', 'dirSuffix', 'location', 'state')	
	geocode_address_place_exact(cursor, address, 'dirPrefix', 'streetName',
			'streetType', 'dirSuffix', 'location', 'state')
	geocode_address_place_fuzzy(cursor, address, 'dirPrefix', 'streetName',
			'streetType', 'dirSuffix', 'location', 'state')
	rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
			'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB')
	rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
			'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB',
			'locationA', 'locationB')
	location_extract_countysub_exact('partial address string', 'state abbreviation')
	location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
	location_extract_place_exact('partial address string', 'state abbreviation')
	location_extract_place_fuzzy('partial address string', 'state abbreviation')
	cull_null('string')
	count_words('string')
	get_last_words('string')
	state_extract('partial address string')
	levenshtein_ignore_case('string', 'string')
	interpolate_from_address(given address, from address L, to address L,
			from address R, to address R, street segment)
	interpolate_from_address(given address, 'from address L', 'to address L',
			'from address R', 'to address R', street segment)
	includes_address(given address, from address L, to address L,
			from address R, to address R)
	includes_address(given address, 'from address L', 'to address L',
			'from address R', 'to address R')

Notes:

- The geocoder is quite dependent on the address normalizer.  The direction
  prefix and suffix, streetType and state are all expected to be standard
  abbreviations that will match exactly to the database.

- Either a zip code, or a location must be provided.  No exception will be
  thrown, but the result will be null.  If the zip code or location cannot
  be matched, with the other information provided, against the database
  the result is null.

- The process is as follows:

	If a zipCode is provided:
		Check if the zipCode, streetName and optionally state match any roads.
		If they do:
			Check if the given address fits any of the roads.
			If it does:
				Return the matching road segment information, rating and
						interpolated geographic point.
	If location exactly matches a place:
		Check if the place, streetName and optionally state match any roads.
		If they do:
			Check if the given address fits any of the roads.
			If it does:
				Return the matching road segment information, rating and
						interpolated geographic point.
	If location exactly matches a countySubdivision:
		Check if the countySubdivision, streetName and optionally state
				match any roads.
		If they do:
			Check if the given address fits any of the roads.
			If it does:
				Return the matching road segment information, rating and
						interpolated geographic point.
	If location approximately matches a place:
		Check if the place, streetName and optionally state match any roads.
		If they do:
			Check if the given address fits any of the roads.
			If it does:
				Return the matching road segment information, rating and
						interpolated geographic point.
	If location approximately matches a countySubdivision:
		Check if the countySubdivision, streetName and optionally state
				match any roads.
		If they do:
			Check if the given address fits any of the roads.
			If it does:
				Return the matching road segment information, rating and
						interpolated geographic point.


Current Issues / Known Failures:

- If a location starts with a direction, eg. East Seattle, and no suffix
  direction is given, the direction from the location will be interpreted
  as the streets suffix direction.

		'18196 68th Ave East Seattle Washington'
			address = 18196
			dirPrefix = NULL
			streetName = '68th'
			streetType = 'Ave'
			dirSuffix = 'E'
			location = 'Seattle'
			state = 'WA'
			zip = NULL

- The last possible street type in the string is interpreted as the street type
  to allow street names to contain type words.  As a result, any location
  containing a street type will have the type interpreted as the street type.

		'29645 7th Street SW Federal Way 98023'
			address = 29645
			dirPrefix = NULL
			streetName = 7th Street SW Federal
			streetType = Way
			dirSuffix = NULL
			location = NULL
			state = NULL
			zip = 98023

- While some state misspellings will be picked up by the fuzzy searches,
  misspelled or non-standard abbreviations may not be picked up, due to
  the length (soundex uses an intial character plus three codeable
  characters)

		'2554 E Highland Dr Seatel Wash'
			address = 2554
			dirPrefix = 'E'
			streetName = 'Highland'
			streetType = 'Dr'
			dirSuffix = NULL
			location = 'Seatel Wash'
			state = NULL
			zip = NULL

- If neither a location or a zip code are found by the normalizer, no search
  is performed.

- If neither street type, direction suffix nor location are given in the
  address string, the street name is generally misclassified as the
  location.

			'98 E Main Washington 98012'
				address = 98
				dirPrefix = 'E'
				streetName = NULL
				streetType = NULL
				dirSuffix = NULL
				location = 'Main'
				state = 'WA'
				zip = 98012

- If no street type is given and the street name contains a type word, then the
  type in the street name is interpreted as the street type.

			'1348 SW Orchard Seattle wa 98106'
				1348::SW:Orch::Seattle:WA:98106
				address = 1348
				dirPrefix = NULL
				streetName = SW
				streetType = Orch
				dirSuffix = NULL
				location = Seattle
				state = WA
				zip = 98106

- Misspellings of words are only handled so far as their soundex values match.

			'Hiland' will not be matched with 'Highland'
				soundex('Hiland') = 'H453'
				soundex('Highland') = 'H245'

- Missing words in location or street name are not handled.

			'Redmond Fall' will not be matched with 'Redmond Fall City'

- Unacceptable failure cases:
		The street name is parsed out as 'West Central Park'
		'500 South West Central Park Ave Chicago Illinois 60624'