Channel: Planet Python

Stefan Petrea: Geolocation using multiple services

Intro

In a previous post I wrote about PostGIS and ways of querying geographical data.

This post will focus on building a system that queries free geolocation services 1 and aggregates their results.

Overview

In summary, we'll make requests to different web services (or APIs), reverse geocode the results, and then aggregate them.

Comparison between geonames and openstreetmap

To relate to the previous post, here are some differences between geonames and openstreetmap:

criterion                       | OSM                | geonames
--------------------------------+--------------------+------------------
size                            | 49.1 GB compressed | 309 MB compressed
has administrative area data    | yes                | yes
has lat/long city data          | yes                | yes
has region/area polygonal areas | yes                | no
has neighbourhood/district data | yes                | no
has intracity-level metadata    | yes                | no
has terrain metadata            | yes                | no

They are meant for different purposes. Geonames is meant for city/administrative area/country data. Openstreetmap has much more detailed data than geonames.

Asynchronous requests to geolocation services

We're using the gevent library to make asynchronous requests to the geolocation services.

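# patch the standard library before anything else imports it
from gevent import monkey; monkey.patch_all()
import gevent
import gevent.greenlet

try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3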

geoip_service_urls=[
        ['geoplugin'    , 'http://www.geoplugin.net/json.gp?ip={ip}'                    ],
        ['ip-api'       , 'http://ip-api.com/json/{ip}'                                 ],
        ['nekudo'       , 'https://geoip.nekudo.com/api/{ip}'                           ],
        ['geoiplookup'  , 'http://api.geoiplookup.net/?query={ip}'                      ],
        ]

# fetch url in asynchronous mode (makes use of gevent)
def fetch_url_async(url, tag, timeout=2.0):
    data = None
    try:
        opener = urllib2.build_opener(urllib2.HTTPSHandler())
        opener.addheaders = [('User-agent', 'Mozilla/')]
        urllib2.install_opener(opener)
        data = urllib2.urlopen(url, timeout=timeout).read()
    except Exception:
        # a failed or timed-out request leaves data as None
        pass
    return [tag, data]

# expects req_data to be in this format: [ ['tag', url], ['tag', url], .. ]
def fetch_multiple_urls_async(req_data):

    # start the threads (greenlets)
    threads_ = []
    for u in req_data:
        (tag, url) = u
        new_thread = gevent.spawn(fetch_url_async, url, tag)
        threads_.append(new_thread)

    # wait for threads to finish
    gevent.joinall(threads_)

    # retrieve threads return values
    results = []
    for t in threads_:
        results.append(t.get(block=True, timeout=5.0))

    return results

def geolocate_ip(ip):
    urls = []
    for grp in geoip_service_urls:
        tag, url = grp
        urls.append([tag, url.format(ip=ip)])
    results = fetch_multiple_urls_async(urls)
    processed_answers = process_service_answers(results)
    aggregate = aggregate_answers(processed_answers)
    return aggregate

def process_service_answers(location_data):
    # 1) extract lat/long data from responses
    # 2) reverse geocoding using geonames
    # 3) aggregate location data
    #    (for example, one way of doing this would
    #     be to choose the location that most services
    #     agree on)
    pass

City name ambiguity

Cities with the same name within the same country

There are many cities with the same name within a country, in different states/administrative regions. There are also cities with the same name in different countries.

For example, according to Geonames, there are 24 cities named Clinton in the US (in 23 different states, with two cities named Clinton in the same state of Michigan).

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 5;
city_name   | duplicity | sample
------------+-----------+----------------------------------------------------
Clinton     | 24        | (US,NY),(US,AR),(US,NC),(US,MA),(US,MD),(US,OH),(U…
Franklin    | 19        | (US,ME),(US,MA),(US,NC),(US,TX),(US,NC),(US,LA),(U…
Springfield | 19        | (US,MN),(US,KY),(US,SD),(US,MI),(US,VA),(US,IL),(U…
Madison     | 18        | (US,CT),(US,MN),(US,NJ),(US,ME),(US,SD),(US,FL),(U…
Greenville  | 18        | (US,NC),(US,SC),(US,MS),(US,KY),(US,RI),(US,ME),(U…
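
For readers who prefer Python, the grouping that the query performs can be sketched in memory as follows. The rows are a small hypothetical sample shaped like (city_name, country_code, region_code), not real geonames output, and group_duplicates is an illustrative name.

```python
from collections import defaultdict

# Hypothetical sample rows: (city_name, country_code, region_code)
rows = [
    ('Clinton', 'US', 'NY'),
    ('Clinton', 'US', 'AR'),
    ('Clinton', 'US', 'MI'),
    ('Clinton', 'US', 'MI'),
    ('Franklin', 'US', 'ME'),
]


def group_duplicates(rows):
    """Group region codes by (city_name, country_code), most duplicated first."""
    groups = defaultdict(list)
    for city_name, country_code, region_code in rows:
        groups[(city_name, country_code)].append(region_code)
    # same idea as ORDER BY COUNT(...) DESC in the query
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Adding region_code to the dictionary key gives the in-memory equivalent of grouping by region as well.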

Cities with the same name in the same country and region

Worldwide, even in the same region of a country, there can be multiple cities with the exact same name.

Take for example Georgetown, in Indiana. Geonames says there are 3 towns with that name in Indiana. Wikipedia says there are even more:

WITH duplicate_data AS (
    SELECT
    city_name,
    array_agg(ROW(country_code, region_code)) AS dupes
    FROM city_region_data
    WHERE country_code = 'US'
    GROUP BY city_name, region_code, country_code
    ORDER BY COUNT(ROW(country_code, region_code)) DESC
)
SELECT
city_name,
ARRAY_LENGTH(dupes, 1) AS duplicity,
( CASE WHEN ARRAY_LENGTH(dupes,1) > 9
  THEN CONCAT(SUBSTRING(ARRAY_TO_STRING(dupes,','), 1, 50), '...')
  ELSE ARRAY_TO_STRING(dupes,',') END
) AS sample
FROM duplicate_data
LIMIT 4;
city_name     | duplicity | sample
--------------+-----------+------------------------
Plantation    | 3         | (US,FL),(US,FL),(US,FL)
Georgetown    | 3         | (US,IN),(US,IN),(US,IN)
Robinwood     | 3         | (US,MD),(US,MD),(US,MD)
Prospect Park | 2         | (US,NJ),(US,NJ)

Reverse geocoding

Both (city_name, country_code) and (city_name, country_code, region_name) tuples have failed as candidates to uniquely identify a location.

Zip codes or postal codes would be another option, but we can't rely on them since most geolocation services don't return them.

But most geolocation services do offer longitude and latitude, and we can make use of those to eliminate ambiguity.

Geometric data types in PostgreSQL

I looked further into the PostgreSQL docs and found that it also has geometric data types and functions for 2D geometry. Out of the box you can model points, boxes, paths, polygons and circles, store them, and query them.

PostgreSQL has some additional modules in the contrib directory. They are available out of the box in most Linux and Windows distributions.

In this situation we're interested in the cube and earthdistance modules 2. The cube extension allows you to model n-dimensional vectors, and the earthdistance extension uses 3-cubes to store vectors and represent points on the surface of the Earth.

We'll be using the following:

  • the earth_distance function, to compute the great-circle distance between two points
  • the earth_box function, to check if a point is within a certain distance of a reference point
  • a gist expression index on the expression ll_to_earth(lat, long), to make fast spatial queries and find nearby points
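
To make these functions less abstract, here is a rough Python model of what ll_to_earth and earth_distance compute. It is a sketch under stated assumptions, not the module's code: it treats the Earth as a perfect sphere and assumes a radius of 6378168 meters, which is intended to match the default returned by the module's earth() function.

```python
import math

# Assumed sphere radius in meters (meant to match earthdistance's earth()).
EARTH_RADIUS_M = 6378168.0


def ll_to_earth(lat, lon):
    """Map latitude/longitude in degrees to a 3D point on the sphere."""
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (
        EARTH_RADIUS_M * math.cos(lat_r) * math.cos(lon_r),
        EARTH_RADIUS_M * math.cos(lat_r) * math.sin(lon_r),
        EARTH_RADIUS_M * math.sin(lat_r),
    )


def earth_distance(p, q):
    """Great-circle distance in meters between two points on the sphere."""
    # straight-line (chord) distance through the sphere
    chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # convert the chord length to an arc length along the surface
    ratio = min(1.0, chord / (2.0 * EARTH_RADIUS_M))
    return 2.0 * EARTH_RADIUS_M * math.asin(ratio)
```

Storing points as 3D coordinates is what lets a box test on those coordinates act as a cheap "is it nearby" filter before the exact distance is computed.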

Designing a view for city & region data

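Geonames data was imported into 3 tables:

  • geo_geoname (places; the rows with fclass = 'P' are the populated places)
  • geo_admin1 (names of administrative regions, keyed by a country.admin1 code)
  • geo_countryinfo (country names, keyed by the ISO alpha-2 country code)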

Then we create a view that pulls everything together 3. We now have population data, city/region/country data, and lat/long data, all in one place.

CREATE OR REPLACE VIEW city_region_data AS (
    SELECT
        b.country AS country_code,
        b.asciiname AS city_name,
        a.name AS region_name,
        b.region_code,
        b.population,
        b.latitude AS city_lat,
        b.longitude AS city_long,
        c.name AS country_name
    FROM geo_admin1 a
    JOIN (
        SELECT *, (country || '.' || admin1) AS country_region, admin1 AS region_code
        FROM geo_geoname
        WHERE fclass = 'P'
    ) b ON a.code = b.country_region
    JOIN geo_countryinfo c ON b.country = c.iso_alpha2
);
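
The two joins in the view can be pictured with plain Python dicts, as in the sketch below. The function name city_region_rows and the dict-based inputs are assumptions made for illustration; the field names follow the columns used in the view.

```python
def city_region_rows(geonames, admin1_names, country_names):
    """Mirror the view's two inner joins on plain Python data.

    geonames:      list of dicts with the geo_geoname columns used by the view
    admin1_names:  dict mapping 'COUNTRY.ADMIN1' codes to region names
    country_names: dict mapping ISO alpha-2 codes to country names
    """
    rows = []
    for g in geonames:
        if g['fclass'] != 'P':  # keep populated places only
            continue
        region_key = g['country'] + '.' + g['admin1']  # e.g. 'US.IN'
        # inner joins: skip places without a matching region or country
        if region_key not in admin1_names or g['country'] not in country_names:
            continue
        rows.append({
            'country_code': g['country'],
            'city_name': g['asciiname'],
            'region_name': admin1_names[region_key],
            'region_code': g['admin1'],
            'population': g['population'],
            'city_lat': g['latitude'],
            'city_long': g['longitude'],
            'country_name': country_names[g['country']],
        })
    return rows
```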

Designing a nearby-city query and function

In the innermost SELECT, we keep only the cities within a 23 km radius of the reference point, then apply a country filter and a city pattern filter (both optional), and take only the 50 results closest to the reference point. The @> check will make use of the gist index we're creating.

Next, we reorder by population, because geonames sometimes lists districts/neighbourhoods around bigger cities as cities too 4 and does not mark them in a specific way, so we just want to select the larger city. For example, say the geolocation service returned a lat/long that resolves to a district of a larger metropolitan area. In my case, I'd like to resolve this to the larger city it's associated with, instead of getting back the district/neighbourhood.

CREATE INDEX geo_geoname_latlong_idx ON geo_geoname USING gist(ll_to_earth(latitude,longitude));
CREATE OR REPLACE FUNCTION geo_find_nearest_city_and_region(
    latitude double precision,
    longitude double precision,
    filter_countries_arr varchar[],
    filter_city_pattern  varchar
) RETURNS TABLE(
    country_code varchar,
    city_name varchar,
    region_name varchar,
    region_code varchar,
    population bigint,
    _lat double precision,
    _long double precision,
    country_name varchar,
    distance numeric
    ) AS $$
BEGIN
    RETURN QUERY
    SELECT *
    FROM (
        SELECT
        *
        FROM (
            SELECT
            *,
            ROUND(earth_distance(
                   ll_to_earth(c.city_lat, c.city_long),
                   ll_to_earth(latitude, longitude)
                  )::numeric, 3) AS distance_
            FROM city_region_data c
            WHERE earth_box(ll_to_earth(latitude, longitude), 23000) @> ll_to_earth(c.city_lat, c.city_long) AND
                  (filter_countries_arr IS NULL OR c.country_code = ANY(filter_countries_arr)) AND
                  (filter_city_pattern  IS NULL OR c.city_name LIKE filter_city_pattern)
            ORDER BY distance_ ASC
            LIMIT 50
        ) d
        -- qualified so it is not confused with the output column of the same name
        ORDER BY d.population DESC
    ) e
    LIMIT 1;
END;
$$
LANGUAGE plpgsql;
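
The selection logic of this function can also be followed in a small in-memory Python sketch. It is an approximation and not a replacement for the SQL: it uses the haversine formula with an assumed Earth radius of 6378168 meters, filters on an exact circle rather than on the bounding box that earth_box provides, and leaves out the city pattern filter. The names great_circle_m and find_nearest_city are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378168.0  # assumed sphere radius in meters


def great_circle_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def find_nearest_city(cities, lat, lon, radius_m=23000, countries=None, limit=50):
    """cities: dicts with city_name, country_code, population, city_lat, city_long."""
    nearby = []
    for c in cities:
        # optional country filter
        if countries is not None and c['country_code'] not in countries:
            continue
        d = great_circle_m(lat, lon, c['city_lat'], c['city_long'])
        if d <= radius_m:
            nearby.append((d, c))
    # keep the `limit` closest candidates, then prefer the most populous one
    closest = sorted(nearby, key=lambda dc: dc[0])[:limit]
    if not closest:
        return None
    return max(closest, key=lambda dc: dc[1]['population'])[1]
```

Sorting by distance first and by population second is what makes a small district right next to the reference point lose to the larger city a few kilometres away.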

Conclusion

We started from the design of a system that queries multiple geoip services, gathers the data and then aggregates it to get a more reliable result.

We first looked at some ways of uniquely identifying locations.

We then picked a way that would eliminate ambiguity in identifying them. In the second half, we looked at different ways of structuring, storing and querying geographical data in PostgreSQL.

Then we built a view and a function to find cities near a reference point, which allowed us to do reverse geocoding.

Footnotes:

1

By using multiple services (and assuming they use different data sources internally) after aggregation, we can have a more reliable answer than if we were using just one.

Another advantage here is that we're using free services: no setup is required, and we don't have to take care of updates, since these services are maintained by their owners.

However, querying all these web services will be slower than querying a local geoip data structure. For that case, there are city/country/region geolocation databases out there, such as geoip2 from maxmind, ip2location or db-ip.

2

There's a nice post here using the earthdistance module to compute distances to nearby or far away pubs.

3

Geonames has geonameIds as well, which are geonames-specific ids we can use to accurately refer to locations.

4

Geonames does not have polygonal data about cities/neighbourhoods or metadata about the type of urban area (like openstreetmap does), so you can't query all city polygons (not districts/neighbourhoods) that contain that point.

