Intro
In a previous post I wrote about PostGIS and ways of querying geographical data.
This post will focus on building a system that queries free geolocation services 1 and aggregates their results.
Overview
In summary, we'll make requests to different web services (or APIs), then reverse geocode the results and aggregate them.
Comparison between geonames and openstreetmap
To relate to the previous post, here are some differences between geonames and openstreetmap:
| criterion | OSM | geonames |
|---|---|---|
| size | 49.1 GB compressed | 309 MB compressed |
| has administrative area data | yes | yes |
| has lat/long city data | yes | yes |
| has region/area polygonal areas | yes | no |
| has neighbourhood/district data | yes | no |
| has intracity-level metadata | yes | no |
| has terrain metadata | yes | no |
The two are meant for different purposes: geonames focuses on city/administrative area/country data, while openstreetmap has much more detailed data.
Asynchronous requests to geolocation services
We're using the gevent library to make asynchronous requests to the geolocation services.
```python
import urllib2

import gevent
import gevent.greenlet
from gevent import monkey; monkey.patch_all()

geoip_service_urls = [
    ['geoplugin',   'http://www.geoplugin.net/json.gp?ip={ip}'],
    ['ip-api',      'http://ip-api.com/json/{ip}'],
    ['nekudo',      'https://geoip.nekudo.com/api/{ip}'],
    ['geoiplookup', 'http://api.geoiplookup.net/?query={ip}'],
]

# fetch url in asynchronous mode (makes use of gevent)
def fetch_url_async(url, tag, timeout=2.0):
    data = None
    try:
        opener = urllib2.build_opener(urllib2.HTTPSHandler())
        opener.addheaders = [('User-agent', 'Mozilla/')]
        urllib2.install_opener(opener)
        data = urllib2.urlopen(url, timeout=timeout).read()
    except Exception:
        pass
    return [tag, data]

# expects req_data to be in this format: [ ['tag', url], ['tag', url], .. ]
def fetch_multiple_urls_async(req_data):
    # start the threads (greenlets)
    threads_ = []
    for u in req_data:
        (tag, url) = u
        new_thread = gevent.spawn(fetch_url_async, url, tag)
        threads_.append(new_thread)

    # wait for threads to finish
    gevent.joinall(threads_)

    # retrieve threads' return values
    results = []
    for t in threads_:
        results.append(t.get(block=True, timeout=5.0))
    return results

def geolocate_ip(ip):
    urls = []
    for grp in geoip_service_urls:
        tag, url = grp
        urls.append([tag, url.format(ip=ip)])
    results = fetch_multiple_urls_async(urls)
    processed_answers = process_service_answers(results)
    aggregate = aggregate_answers(processed_answers)
    return aggregate

def process_service_answers(location_data):
    # 1) extract lat/long data from responses
    # 2) reverse geocoding using geonames
    # 3) aggregate location data
    #    (for example, one way of doing this would
    #    be to choose the location that most services
    #    agree on)
    pass
```
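The last two steps are left as stubs above. As a rough sketch of how they could be filled in (the JSON field names and the `reverse_geocode` helper below are hypothetical; each service formats its response differently, and the reverse geocoding would be backed by the PostgreSQL function described later in this post):

```python
import json

def process_service_answers(location_data):
    # keep only the answers that parsed and contained coordinates
    answers = []
    for tag, raw in location_data:
        if raw is None:
            continue
        try:
            parsed = json.loads(raw)
            # placeholder keys -- the real field names differ per service
            lat = float(parsed.get('latitude', parsed.get('lat')))
            lng = float(parsed.get('longitude', parsed.get('lon')))
            answers.append([tag, lat, lng])
        except (ValueError, TypeError, AttributeError):
            continue
    return answers

def aggregate_answers(processed_answers):
    # one possible strategy: reverse geocode each lat/long pair
    # (reverse_geocode is a hypothetical helper backed by the
    # geo_find_nearest_city_and_region SQL function below)
    # and keep the city that most services agree on
    votes = {}
    for tag, lat, lng in processed_answers:
        city = reverse_geocode(lat, lng)
        if city is not None:
            votes[city] = votes.get(city, 0) + 1
    if not votes:
        return None
    return max(votes.items(), key=lambda kv: kv[1])[0]
```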
City name ambiguity
Cities with the same name within the same country
There are many cities with the same name within a country, in different states/administrative regions. There are also cities with the same name in different countries.
For example, according to Geonames, there are 24 cities named Clinton in the US (across 23 different states, with two of them in Michigan).
```sql
WITH duplicate_data AS (
    SELECT
        city_name,
        array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
    city_name,
    ARRAY_LENGTH(dupes, 1) AS duplicity,
    (
        CASE WHEN ARRAY_LENGTH(dupes, 1) > 9
             THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes, ','), 1, 50), '...')
             ELSE ARRAY_TO_STRING(dupes, ',')
        END
    ) AS sample
FROM duplicate_data
LIMIT 5;
```
| city_name | duplicity | sample |
|---|---|---|
| Clinton | 24 | (US,NY),(US,AR),(US,NC),(US,MA),(US,MD),(US,OH),(U… |
| Franklin | 19 | (US,ME),(US,MA),(US,NC),(US,TX),(US,NC),(US,LA),(U… |
| Springfield | 19 | (US,MN),(US,KY),(US,SD),(US,MI),(US,VA),(US,IL),(U… |
| Madison | 18 | (US,CT),(US,MN),(US,NJ),(US,ME),(US,SD),(US,FL),(U… |
| Greenville | 18 | (US,NC),(US,SC),(US,MS),(US,KY),(US,RI),(US,ME),(U… |
Cities with the same name in the same country and region
Worldwide, even in the same region of a country, there can be multiple cities with the exact same name.
Take for example Georgetown, in Indiana. Geonames says there are 3 towns with that name in Indiana. Wikipedia says there are even more:
- Georgetown, Floyd County, Indiana
- Georgetown Township, Floyd County, Indiana
- Georgetown, Cass County, Indiana
- Georgetown, Randolph County, Indiana
```sql
WITH duplicate_data AS (
    SELECT
        city_name,
        array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, region_code, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
    city_name,
    ARRAY_LENGTH(dupes, 1) AS duplicity,
    (
        CASE WHEN ARRAY_LENGTH(dupes, 1) > 9
             THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes, ','), 1, 50), '...')
             ELSE ARRAY_TO_STRING(dupes, ',')
        END
    ) AS sample
FROM duplicate_data
LIMIT 4;
```
| city_name | duplicity | sample |
|---|---|---|
| Plantation | 3 | (US,FL),(US,FL),(US,FL) |
| Georgetown | 3 | (US,IN),(US,IN),(US,IN) |
| Robinwood | 3 | (US,MD),(US,MD),(US,MD) |
| Prospect Park | 2 | (US,NJ),(US,NJ) |
Reverse geocoding
Both `(city_name, country_code)` and `(city_name, country_code, region_name)` tuples have failed as candidates to uniquely identify a location.
Zip codes or postal codes would be another option, but most geolocation services don't provide them.
But most geolocation services do offer longitude and latitude, and we can make use of those to eliminate ambiguity.
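To illustrate the idea before moving it into PostgreSQL: given several candidate cities sharing a name, we pick the one whose coordinates lie closest to the lat/long reported by the service. A minimal sketch (the candidate coordinates below are rough, illustration-only values):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, long) points, in km
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# hypothetical candidates: the Georgetowns in Indiana (approximate coordinates)
candidates = [
    ('Georgetown, Floyd County',    38.30, -85.97),
    ('Georgetown, Cass County',     40.87, -86.51),
    ('Georgetown, Randolph County', 40.05, -84.97),
]

# lat/long reported by a geolocation service (example value)
service_lat, service_long = 38.28, -85.95

nearest = min(candidates,
              key=lambda c: haversine_km(service_lat, service_long, c[1], c[2]))
# nearest resolves to the Floyd County Georgetown in this example
```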
Geometric data types in PostgreSQL
I looked further into the PostgreSQL docs and found that it also has geometric data types and functions for 2D geometry. Out of the box you can model, store and query points, boxes, paths, polygons and circles.
PostgreSQL has some additional modules in the contrib directory. They are available out of the box in most Linux and Windows distributions.
In this situation we're interested in the cube and earthdistance modules 2. The cube extension allows you to model n-dimensional vectors, and the earthdistance extension uses 3-cubes to store vectors and represent points on the surface of the Earth.
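Both modules ship as extensions with the contrib packages, so (assuming the packages are installed on the system) enabling them only takes two statements; note that earthdistance depends on cube:

```sql
CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS earthdistance;
```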
We'll be using the following:
- the `earth_distance` function, which computes the great-circle distance between two points
- the `earth_box` function, to check if a point is within a certain distance of a reference point
- a GiST expression index on the expression `ll_to_earth(lat, long)`, to make fast spatial queries and find nearby points
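As a quick sanity check of `earth_distance` and `ll_to_earth` (the coordinates and the resulting figure are approximate), the Paris-London great-circle distance should come out to roughly 340 km; `earth_distance` returns meters:

```sql
SELECT round(
    earth_distance(
        ll_to_earth(48.8566, 2.3522),    -- Paris
        ll_to_earth(51.5074, -0.1278)    -- London
    )::numeric / 1000, 1
) AS distance_km;
-- distance_km ≈ 340
```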
Designing a view for city & region data
Geonames data was imported into 3 tables:
- `geo_geoname` (data from cities1000.zip)
- `geo_admin1` (data from admin1CodesASCII.txt)
- `geo_countryinfo` (data from countryInfo.txt)
Then we create a view that pulls everything together 3. We now have population data, city/region/country data, and lat/long data, all in one place.
```sql
CREATE OR REPLACE VIEW city_region_data AS (
    SELECT
        b.country    AS country_code,
        b.asciiname  AS city_name,
        a.name       AS region_name,
        b.region_code,
        b.population,
        b.latitude   AS city_lat,
        b.longitude  AS city_long,
        c.name       AS country_name
    FROM geo_admin1 a
    JOIN (
        SELECT
            *,
            (country || '.' || admin1) AS country_region,
            admin1 AS region_code
        FROM geo_geoname
        WHERE fclass = 'P'
    ) b ON a.code = b.country_region
    JOIN geo_countryinfo c ON b.country = c.iso_alpha2
);
```
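A quick way to check the view (the exact rows and population figures depend on the geonames dump that was imported) is to look up a well-known city name:

```sql
SELECT country_code, city_name, region_name, population, city_lat, city_long
FROM city_region_data
WHERE country_code = 'US'
  AND city_name = 'Springfield'
ORDER BY population DESC
LIMIT 3;
```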
Designing a nearby-city query and function
In the most nested `SELECT`, we only keep the cities within a 23 km radius around the reference point, then we apply a country filter and a city pattern filter (both filters are optional), and we only keep the 50 results closest to the reference point (the `@>` check will make use of the GiST index we're creating).
Next, we reorder by population, because geonames sometimes also lists districts/neighbourhoods around bigger cities 4 and does not mark them in a specific way, so we just want to pick the larger city. For example, if the geolocation service returned a lat/long that resolves to a district of a larger metropolitan area, I'd like to resolve it to the larger city it belongs to instead of getting back the district/neighbourhood.
```sql
CREATE INDEX geo_geoname_latlong_idx
    ON geo_geoname
    USING gist(ll_to_earth(latitude, longitude));

CREATE OR REPLACE FUNCTION geo_find_nearest_city_and_region(
    latitude double precision,
    longitude double precision,
    filter_countries_arr varchar[],
    filter_city_pattern varchar
)
RETURNS TABLE(
    country_code varchar,
    city_name varchar,
    region_name varchar,
    region_code varchar,
    population bigint,
    _lat double precision,
    _long double precision,
    country_name varchar,
    distance numeric
) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (
        SELECT *
        FROM (
            SELECT
                *,
                ROUND(earth_distance(
                    ll_to_earth(c.city_lat, c.city_long),
                    ll_to_earth(latitude, longitude)
                )::numeric, 3) AS distance_
            FROM city_region_data c
            WHERE earth_box(ll_to_earth(latitude, longitude), 23000) @> ll_to_earth(c.city_lat, c.city_long)
              AND (filter_countries_arr IS NULL OR c.country_code = ANY(filter_countries_arr))
              AND (filter_city_pattern IS NULL OR c.city_name LIKE filter_city_pattern)
            ORDER BY distance_ ASC
            LIMIT 50
        ) d
        -- prefer the most populous of the nearby matches (see the note above)
        ORDER BY d.population DESC
    ) e
    LIMIT 1;
END;
$$ LANGUAGE plpgsql;
```
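Calling the function looks like this (the coordinates below are roughly midtown Manhattan's; the exact row returned depends on the imported geonames data, but with the 23 km radius and the population reordering it should resolve to New York City rather than to one of its neighbourhoods):

```sql
-- no country or city-name filter
SELECT city_name, region_code, country_code, population, distance
FROM geo_find_nearest_city_and_region(40.7580, -73.9855, NULL, NULL);

-- restrict the match to US cities whose name starts with 'New'
SELECT city_name, region_code, distance
FROM geo_find_nearest_city_and_region(40.7580, -73.9855, ARRAY['US']::varchar[], 'New%');
```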
Conclusion
We started from the design of a system that queries multiple geoip services, gathers their answers and then aggregates them to get a more reliable result.
We first looked at some ways of uniquely identifying locations and then picked one that eliminates ambiguity.
In the second half, we looked at different ways of structuring, storing and querying geographical data in PostgreSQL. Then we built a view and a function to find cities near a reference point, which allowed us to do reverse geocoding.
Footnotes:
1. By using multiple services (and assuming they use different data sources internally), after aggregation we can get a more reliable answer than if we were using just one. Another advantage is that these are free services: no setup is required and we don't have to take care of updates, since the services are maintained by their owners. However, querying all these web services will be slower than querying a local geoip database; there are city/country/region geolocation databases out there, such as geoip2 from maxmind, ip2location or db-ip.
2. There's a nice post here using the `earthdistance` module to compute distances to nearby or far away pubs.
3. Geonames has geonameIds as well, which are geonames-specific ids we can use to accurately refer to locations.
4. geonames does not have polygonal data about cities/neighbourhoods or metadata about the type of urban area (like openstreetmap does), so you can't query for all the city polygons (as opposed to districts/neighbourhoods) that contain a given point.