(For more resources related to this topic, see here.)
The basic question that we seek to answer in this article is the fundamental distance question, “Which are the closest (name what you are searching for) to me?”, for example, “Which are the five coffee shops closest to me?” It turns out that while it is a fundamental question, it’s not always easy to answer, though we will make this possible in this article. We will approach this with two approaches. The first way in which we’ll approach this is in a simple heuristic, which will allow us to come to a solution quickly. Then, we’ll take advantage of the deeper PostGIS functionality to make the solution faster and more general with a K-Nearest Neighbor (KNN) approach.
A concept that we need to understand from the outset is that of a spatial index. A spatial index, like other database indexes, functions like a book index. It is a special construct to make looking for things inside our table easier, much in the way a book index helps us find content in a book faster. In the case of a spatial index, it helps us find faster where things are in space. Therefore, by using a spatial index in our geographic searches, we can speed up our searches by many orders of magnitude.
To learn more about spatial indexes, see http://en.wikipedia.org/wiki/Spatial_index#Spatial_index.
We will start by loading our data. Our data are the address records from Cuyahoga County, Ohio, USA.
shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom CUY_ADDRESS_POINTS chp04.knn_addresses | psql -U me -d postgis_cookbook
As this dataset may take a while to load, you can alternatively load a subset.
shp2pgsql -s 3734 -d -i -I -W LATIN1 -g the_geom CUY_ADDRESS_POINTS_ subset chp04.knn_addresses | psql -U me -d postgis_cookbook
We specified the -I flag in order to request a spatial index be created upon the import of this data.
Let us start by seeing how many records we are dealing with:
SELECT COUNT(*) FROM chp04.knn_addresses; --484958
We have, in this address table, almost half a million address records, which is not an insubstantial number against which to perform a query.
How to do it…
KNN is an approach to searching for an arbitrary number of points closest to a given point. Without the right tools, this can be a very slow process that requires testing the distance between the point of interest and all the possible neighbors. The problem with this approach is that the search becomes exponentially slower as the number of points increases. Let’s start with this naïve approach and then improve upon it.
Suppose we were interested in finding 10 records closest to the geographic location, -81.738624, 41.396679. The naïve approach would be to transform this value into our local coordinate system and compare the distance to each point in the database from the search point, order those values by distance, and limit the search to the first 10 closest records (it is not recommended that you run the following query—it could run indefinitely)
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint ORDER BY ST_Distance(searchpoint.the_geom, addr.the_geom) LIMIT 10;
This is a fine approach for smaller datasets. This is a logical, simple, fast approach for relatively small numbers of records. This approach scales very poorly, however, getting exponentially slower with the addition of records, and with 500,000 points, this would take a very long time.
An alternative is to only compare my point to the ones I know are close by setting a search distance. So, for example, in the following diagram, we have a star that represents my current location, and I want to know the 10 closest addresses. The grid in the diagram is a 100 foot grid, so I can search for the points within 200 feet, then measure the distance to each of these points, and return the closest 10 points to my search location.
So far, our approach to answering this question is to limit the search using the ST_DWithin operator to only search for records within a certain distance. ST_DWithin uses our spatial index, so the initial distance search is fast and the list of returned records should be short enough to do the same pair-wise distance comparison we did earlier in this section. In our case here, we could limit the search to within 200 feet as follows.
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint WHERE ST_DWithin(searchpoint.the_geom, addr.the_geom, 200) ORDER BY ST_Distance(searchpoint.the_geom, addr.the_geom) LIMIT 10;
This approach performs well so long as our search window, ST_DWithin, is the right size for the data. The problem with this approach is that, in order to optimize it, we need to know how to set a search window that is about the right size. Any larger than the right size and the query will run more slowly than we’d like. Any smaller than the right size and we might not get all the points back that we need. Inherently, we don’t know this ahead of time, so we can only hope for the best guess.
In this same dataset, if we apply the same query in another location, the output will return no points because the 10 closest points are further than 200 feet away. We can see this in the following diagram:
Fortunately, for PostGIS 2.0, we can leverage the distance operators (<-> and <#>) to do indexed nearest-neighbor searches. This makes for very fast KNN searches that don’t require us to guess ahead of time how far away we need to search. Why are the searches fast? The spatial index helps, of course, but in the case of the distance operator, we are using the structure of the index itself, which is hierarchical, to very quickly sort our neighbors.
When used in an ORDER BY clause, the distance operator uses the index:
SELECT ST_Distance(searchpoint.the_geom, addr.the_geom) AS dist, * FROM chp04.knn_addresses addr, (SELECT ST_Transform(ST_SetSRID(ST_MakePoint(-81.738624, 41.396679), 4326), 3734) AS the_geom) searchpoint ORDER BY addr.the_geom <-> searchpoint.the_geom LIMIT 10;
This approach requires no a priori knowledge of how far the nearest neighbors might be. It also scales very well, returning thousands of records in not more than the time it takes to return a few records. It is sometimes slower than using ST_DWithin, depending on how small our search distance is and how large the dataset we are dealing with. But the tradeoff is that we don’t need to make a guess as to our correct search distance and for large queries, it can be much faster than the naïve approach.
How it works…
What makes this magic possible is that PostGIS uses an R-Tree index. This means that the index itself is sorted hierarchically based on spatial information. As demonstrated, we can leverage the structure of the index in sorting distances from a given arbitrary location and, thus, use the index to directly return the sorted records. This means that the structure of the spatial index itself helps us answer such fundamental questions quickly and inexpensively.
More information about KNN and R-tree can be found at http://workshops.boundlessgeo.com/postgis-intro/knn.html and https://en.wikipedia.org/wiki/R-tree.
This article covered the introduction and learning of KNN filters to increase the performance of proximity queries. Now you must be quite familiar with the usage of KNN filters to increase the performance of proximity queries.
Resources for Article:
- PostgreSQL: Tips and Tricks [article]
- Backup in PostgreSQL 9 [article]
- Python Libraries for Geospatial Development [article]