Executing a spatial query in Spark

This topic contains examples that illustrate the most commonly used spatial functions.

Common types of spatial queries are:
  • In a set of points, find all points that are withing a certain distance of a particular point. For example, find all hospitals that are within a certain distance of a particular location.
  • In a set of polygons, find all polygons that contain a particular point. For example, find all risk areas (fire, flood, hurricane, etc.) that contain a particular location.
  • In a set of points, find all points that are contained within a particular polygon. For example, find all retail outlets in a particular region.

Often, a spatial function has one parameter that refers to a spatial column in one table and a second parameter that refers to a spatial constant or to a spatial column in another table.

Determine which points are close to another point

A simple example is to find the hospitals that are within a certain distance of a given location (which is constructed using the ST_Point constructor).
// register the data frame  as a table
hospitalsDf.createOrReplaceTempView("hospitals")
The following query returns a list of the hospitals within 10 km of Grand Central Station in NYC.
SELECT name, city, state
FROM hospitals
WHERE ST_Distance(location, ST_Point(-77.574722, 43.146732)) < 10000.0

Determine which polygon contains a point

Examples of spatial functions that determine which polygon contains a point are:
  1. ST_Contains(geom1, geom2) -- returns TRUE if geom2 values are completely contained by the polygons identified by geom1.
  2. ST_Within(geom1, geom2) -- returns TRUE if geom1 values are within the polygons identified by geom2.
  3. ST_Intersects(geom1, geom2) -- returns TRUE if geom1 and geom2 intersect spatially in any way, touching, crossing, or containing each other
Each of the following queries uses one of these functions, and returns the name of the county that contains the specified point:
SELECT NAME 
FROM counties 
WHERE 
ST_Contains(shape, ST_Point(-74.237, 42.037))

SELECT NAME
FROM counties
WHERE
ST_Within(ST_Point(-74.237, 42.037), shape)

SELECT NAME
FROM counties
WHERE
ST_Intersects(shape, ST_Point(-74.237, 42.037))

Each query returns the answer Ulster.

Determine which points are in a polygon

Each of the following queries determines which hospitals are located within the specified polygon, which is defined as a constant using well-known text (WKT) representation. The polygon definition consists of the character string POLYGON followed by a pair of x,y coordinates for each vertex, separated by a comma. The individual x and y values are separated by a space. The entire list of coordinate pairs must be in parentheses.

SELECT name
FROM hospitals
WHERE
ST_Contains(ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'), location)

SELECT name 
FROM hospitals 
WHERE ST_Within(location, ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))

SELECT name 
FROM hospitals 
WHERE ST_Intersects(location, ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))
Each query returns a list of 21 hospitals. The first five hospitals are:
  1. Marshall Hospital
  2. Southwestern Medical Center
  3. Hillcrest Hospital
  4. Saint Lukes Hospital
  5. Adventist Home

Determine which points are in a polygon using a spatial join

Just as a regular join function can join two tables based on the values in columns that contain character or numeric data, spatial join functions can be used to join tables based on the values in columns that contain spatial data. The following examples use the counties and hospitals tables.

You can use the spatial join function to find the hospitals located within a specific county. For example, the following query returns a list of all the hospitals in Dutchess county:
SELECT c.NAME, h.name 
FROM counties AS c, hospitals AS h 
WHERE c.NAME = 'Dutchess' 
AND ST_Intersects(c.shape, h.location)
This query produces the following result set:
Hospital Name County Name
Hudson River State Hospital Dutchess
Vassar Brothers Hospital Dutchess
Bowne Hospital Dutchess
Harlem Valley State Hospital Dutchess
Matteawan State Hospital Dutchess
New York State Hospital Dutchess
Saint Francis Hospital Dutchess
United States Veterans Hospital Dutchess
Alternatively, you can use the SQL JOIN ... ON notation, which is equivalent to a spatial predicate in the WHERE clause. For example, the following query produces the same result set as the previous query:
SELECT h.name, c.NAME
FROM counties AS c
JOIN hospitals AS h
ON c.NAME = 'Dutchess'
AND ST_Intersects(h.location, c.shape)
The following query returns the name of the county in which a particular hospital is located:
SELECT c.NAME, h.name
FROM hospitals AS h, counties AS c
WHERE ST_Intersects(h.location, c.shape)
AND h.name = 'Vassar Brothers Hospital'
This query produces the following result set:
"Vassar Brothers Hospital", "Dutchess"

Additional predicates and aggregation

This example that uses spatial joins in conjunction with additional predicates and aggregation, which can address business problems. These examples continue to use the hospitals and counties tables, but the same principles can be applied to any type of data.

This example queries the hospitals within each county in New York state, qualifying by the state name in the counties table.
SELECT c.NAME, h.name
FROM counties AS c, hospitals AS h
WHERE ST_Intersects(h.location, c.shape)
AND c.STATE_NAME='New York'
ORDER BY c.NAME, h.name
The result shows the first five of the 230 hospitals in New York.
Hospital Name County Name
Albany Hospital Albany
Albany Hospital for Incurables Albany
Albany Hospital Sanatorium Albany
Cohoes Memorial Hospital Albany
Saint Peters Hospital Albany
The same results can be obtained by rewriting the above query with using the field from hospitals table:
SELECT c.NAME, h.name
FROM hospitals AS h, counties AS c
WHERE ST_Intersects(h.location, c.shape)
AND h.state='NY'
ORDER BY c.NAME, h.name
The following example summarizes the number of hospitals per county in New York.
SELECT c.NAME, COUNT(h.name) AS hospital_count
FROM counties AS c, hospitals AS h
WHERE ST_Intersects(h.location, c.shape)
AND c.STATE_NAME='New York'
GROUP BY c.NAME
This generates a total of 45 results with the first five results (order may not be as shown below always) shown in the below table.
County_name Hospital_count
Essex 3
Livingston 1
Orange 4
Herkimer 1
Queens 19
To identify counties where the population is underserved by hospitals, an interesting metric might be the number of people per hospital in each county. Using the population for each county in the year 2000, you can calculate this number.
SELECT c.NAME, 
COUNT(h.name) AS hospital_count, 
c.POP2000 AS Population, 
c.POP2000/COUNT(h.name) AS people_per_hospital
FROM counties AS c, hospitals AS h
WHERE c.STATE_NAME='New York'
AND ST_Intersects(h.location, c.shape)
GROUP BY c.NAME, c.POP2000
ORDER BY people_per_hospital DESC
The results of this (a total of 45 records) are shown for the first five.
County_name Hospital_Count Population People_per_hospital
Bronx 9 1332650 148072
Chautauqua 1 139750 139750.0
Oswego 1 122377 122377.0
Nassau 11 1334544 121322
Queens 19 2229379 117335
If one had additional details such as number of beds, number of doctors per hospital, one can determine a better measure of health care capability.

Window queries

A common use case for mapping applications and in particular for web mapping is to select objects that fall within a rectangular region. This can be done by creating a polygon to represent the rectangle and the ST_Intersects spatial predicate.
SELECT name
FROM hospitals
WHERE ST_Intersects(location, ST_WKTToSQL(
 'POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))
Another spatial predicate that does the same is EnvelopesIntersect, which can be used to select objects whose envelope intersects a rectangular region. EnvelopesIntersect takes as a parameter the name of the spatial column, four Double values representing the lower-left, and upper-right corners of the rectangle. This spatial predicate is simpler to use and is more efficient than ST_Intersects for rectangular windows.
SELECT name
FROM hospitals
WHERE EnvelopesIntersect(location, -74.0, 42.0, -73.0, 43.0)

Because this predicate is true if any portion of a line or polygon geometry falls within the specified window, parts of the line or polygon might lie outside of the window. This is generally not a problem with mapping applications, which will discard geometries that lie outside the display window.

When building web-mapping applications with widely used web-mapping APIs from providers such as Google Maps, Yahoo! Maps, Bing Maps, and others, it is necessary to provide the longitude and latitude values to be used in placing custom markers on the map. Obtain this information with a query such as this one:
SELECT name, ST_X(location) AS longitude, ST_Y(location) AS latitude
FROM hospitals
WHERE EnvelopesIntersect(location, -74.0, 42.0, -73.0, 43.0)

Distance queries

Another common type of spatial query is to find things within a specified distance of a particular location. You have probably used web mapping applications to get this kind of information. You can issue SQL queries from your application for queries like:
  • Finding customers within 10 miles of a store
  • Finding ATMs within 500 meters of the current location
  • Finding competitive stores within 10 kilometers of a proposed store location
The primary spatial function here is ST_Distance, which computes the distance between spatial values and returns a result in meters. The following query generates eight results:
SELECT name
FROM hospitals
WHERE ST_Distance(location, ST_Point(-74.237, 42.037)) < 46800
ORDER BY name
Results
Name
Adventist Home
Bowne Hospital
Columbia Memorial Hospital
Firemens Home
Greene County Memorial Hospital
Hudson River State Hospital
Saint Francis Hospital
Vassar Brothers Hospital
A different way of querying the same as above is to use the ST_Buffer, where a circular buffer is created around the given geometry and the desired geometries within that buffer are determined. The ST_Buffer function takes as parameters a spatial geometry and a distance in meters to buffer around this spatial value. The results are the same.
SELECT name
FROM hospitals
WHERE
ST_Intersects(location,
  ST_Buffer(ST_Point(-74.237, 42.037), 46800.0))
ORDER BY name
Continuing the previous examples, the distance from the specified point to each object within a 30 mile (or approximately 46800m) radius can be found with this query, the results of which are shown below.
SELECT name, ST_Distance(location, ST_Point(-74.237, 42.037)) AS distance
FROM hospitals
WHERE ST_Distance(location, ST_Point(-74.237, 42.037)) < 46800.0
ORDER BY distance
Name distance (meters)
Greene County Memorial Hospital 36634.88
Adventist Home 39014.09
Hudson River State Hospital 43362.54
Saint Francis Hospital 43786.58
Bowne Hospital 44319.48
Columbia Memorial Hospital 44533.58
Firemens Home 45066.09
Vassar Brothers Hospital 45525.13
Or, ST_Buffer can be used to compute the spatial relation and then determine the distance.
SELECT name, ST_Distance(location, ST_Point(-74.237, 42.037)) AS distance
FROM hospitals
WHERE
  ST_Intersects(location,
  ST_Buffer(ST_Point(-74.237, 42.037), 46800.0))
ORDER BY distance
A key difference to be noted here is that the ST_Buffer in this package supports buffering of arbitrary geometries and can be used to compute in that manner. Note that:
  • The ST_Buffer query on large geometries can be expensive.
  • For a large number of geometries, the user is advised to calculate the buffers separately, store the buffers in columns, and operate on the stored buffers.
SELECT name, ST_Distance(location, ST_WKTToSQL(
 'LINESTRING (-74.0 42.0, -73.0 42.0)'))
FROM hospitals
WHERE ST_Intersects(location, ST_Buffer(ST_WKTToSQL(
 'LINESTRING (-74.0 42.0, -73.0 42.0)'), 46800.0))
This returns 31 results, the first 5 of which are:
Name Distance
Avery Hospital 38777.96
Hartford Hospital 38204.014
Springfield Municipal Hospital 39761.18
The Noble Hospital 23831.78
Hudson River State Hospital 29315.15