Executing a spatial query in Spark
This topic contains examples that illustrate the most commonly used spatial functions.
- In a set of points, find all points that are withing a certain distance of a particular point. For example, find all hospitals that are within a certain distance of a particular location.
- In a set of polygons, find all polygons that contain a particular point. For example, find all risk areas (fire, flood, hurricane, etc.) that contain a particular location.
- In a set of points, find all points that are contained within a particular polygon. For example, find all retail outlets in a particular region.
Often, a spatial function has one parameter that refers to a spatial column in one table and a second parameter that refers to a spatial constant or to a spatial column in another table.
Determine which points are close to another point
ST_Point
constructor).// register the data frame as a table
hospitalsDf.createOrReplaceTempView("hospitals")
The
following query returns a list of the hospitals within 10 km of Grand Central Station in
NYC.SELECT name, city, state
FROM hospitals
WHERE ST_Distance(location, ST_Point(-77.574722, 43.146732)) < 10000.0
Determine which polygon contains a point
ST_Contains(geom1, geom2)
-- returnsTRUE
ifgeom2
values are completely contained by the polygons identified bygeom1
.ST_Within(geom1, geom2)
-- returnsTRUE
ifgeom1
values are within the polygons identified bygeom2
.ST_Intersects(geom1, geom2)
-- returnsTRUE
ifgeom1
andgeom2
intersect spatially in any way, touching, crossing, or containing each other
SELECT NAME
FROM counties
WHERE
ST_Contains(shape, ST_Point(-74.237, 42.037))
SELECT NAME
FROM counties
WHERE
ST_Within(ST_Point(-74.237, 42.037), shape)
SELECT NAME
FROM counties
WHERE
ST_Intersects(shape, ST_Point(-74.237, 42.037))
Each query returns the answer Ulster.
Determine which points are in a polygon
Each of the following queries determines which hospitals are located within the specified
polygon, which is defined as a constant using well-known text (WKT) representation. The polygon
definition consists of the character string POLYGON
followed by a pair of
x,y coordinates for each vertex, separated by a comma. The individual x
and
y
values are separated by a space. The entire list of coordinate pairs must be in
parentheses.
SELECT name
FROM hospitals
WHERE
ST_Contains(ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'), location)
SELECT name
FROM hospitals
WHERE ST_Within(location, ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))
SELECT name
FROM hospitals
WHERE ST_Intersects(location, ST_WKTToSQL('POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))
- Marshall Hospital
- Southwestern Medical Center
- Hillcrest Hospital
- Saint Lukes Hospital
- Adventist Home
Determine which points are in a polygon using a spatial join
Just as a regular join function can join two tables based on the values in columns that contain character or numeric data, spatial join functions can be used to join tables based on the values in columns that contain spatial data. The following examples use the counties and hospitals tables.
SELECT c.NAME, h.name
FROM counties AS c, hospitals AS h
WHERE c.NAME = 'Dutchess'
AND ST_Intersects(c.shape, h.location)
This
query produces the following result set:
Hospital Name | County Name |
---|---|
Hudson River State Hospital | Dutchess |
Vassar Brothers Hospital | Dutchess |
Bowne Hospital | Dutchess |
Harlem Valley State Hospital | Dutchess |
Matteawan State Hospital | Dutchess |
New York State Hospital | Dutchess |
Saint Francis Hospital | Dutchess |
United States Veterans Hospital | Dutchess |
SQL JOIN ... ON
notation, which is equivalent to
a spatial predicate in the WHERE
clause. For example, the following query produces
the same result set as the previous
query:SELECT h.name, c.NAME
FROM counties AS c
JOIN hospitals AS h
ON c.NAME = 'Dutchess'
AND ST_Intersects(h.location, c.shape)
SELECT c.NAME, h.name
FROM hospitals AS h, counties AS c
WHERE ST_Intersects(h.location, c.shape)
AND h.name = 'Vassar Brothers Hospital'
This
query produces the following result
set:"Vassar Brothers Hospital", "Dutchess"
Additional predicates and aggregation
This example that uses spatial joins in conjunction with additional predicates and aggregation, which can address business problems. These examples continue to use the hospitals and counties tables, but the same principles can be applied to any type of data.
SELECT c.NAME, h.name
FROM counties AS c, hospitals AS h
WHERE ST_Intersects(h.location, c.shape)
AND c.STATE_NAME='New York'
ORDER BY c.NAME, h.name
The
result shows the first five of the 230 hospitals in New York.
Hospital Name | County Name |
---|---|
Albany Hospital | Albany |
Albany Hospital for Incurables | Albany |
Albany Hospital Sanatorium | Albany |
Cohoes Memorial Hospital | Albany |
Saint Peters Hospital | Albany |
SELECT c.NAME, h.name
FROM hospitals AS h, counties AS c
WHERE ST_Intersects(h.location, c.shape)
AND h.state='NY'
ORDER BY c.NAME, h.name
The
following example summarizes the number of hospitals per county in New
York.SELECT c.NAME, COUNT(h.name) AS hospital_count
FROM counties AS c, hospitals AS h
WHERE ST_Intersects(h.location, c.shape)
AND c.STATE_NAME='New York'
GROUP BY c.NAME
This
generates a total of 45 results with the first five results (order may not be as shown below always)
shown in the below table.
County_name | Hospital_count |
---|---|
Essex | 3 |
Livingston | 1 |
Orange | 4 |
Herkimer | 1 |
Queens | 19 |
SELECT c.NAME,
COUNT(h.name) AS hospital_count,
c.POP2000 AS Population,
c.POP2000/COUNT(h.name) AS people_per_hospital
FROM counties AS c, hospitals AS h
WHERE c.STATE_NAME='New York'
AND ST_Intersects(h.location, c.shape)
GROUP BY c.NAME, c.POP2000
ORDER BY people_per_hospital DESC
The
results of this (a total of 45 records) are shown for the first five.
County_name | Hospital_Count | Population | People_per_hospital |
---|---|---|---|
Bronx | 9 | 1332650 | 148072 |
Chautauqua | 1 | 139750 | 139750.0 |
Oswego | 1 | 122377 | 122377.0 |
Nassau | 11 | 1334544 | 121322 |
Queens | 19 | 2229379 | 117335 |
Window queries
ST_Intersects
spatial
predicate.SELECT name
FROM hospitals
WHERE ST_Intersects(location, ST_WKTToSQL(
'POLYGON ((-74.0 42.0, -73.0 42.0, -73.0 43.0, -74.0 43.0, -74.0 42.0))'))
EnvelopesIntersect
, which can be
used to select objects whose envelope intersects a rectangular region.
EnvelopesIntersect
takes as a parameter the name of the spatial column, four
Double
values representing the lower-left, and upper-right corners of the
rectangle. This spatial predicate is simpler to use and is more efficient than
ST_Intersects
for rectangular
windows.SELECT name
FROM hospitals
WHERE EnvelopesIntersect(location, -74.0, 42.0, -73.0, 43.0)
Because this predicate is true if any portion of a line or polygon geometry falls within the specified window, parts of the line or polygon might lie outside of the window. This is generally not a problem with mapping applications, which will discard geometries that lie outside the display window.
SELECT name, ST_X(location) AS longitude, ST_Y(location) AS latitude
FROM hospitals
WHERE EnvelopesIntersect(location, -74.0, 42.0, -73.0, 43.0)
Distance queries
- Finding customers within 10 miles of a store
- Finding ATMs within 500 meters of the current location
- Finding competitive stores within 10 kilometers of a proposed store location
ST_Distance
, which computes the distance
between spatial values and returns a result in meters. The following query generates eight
results:SELECT name
FROM hospitals
WHERE ST_Distance(location, ST_Point(-74.237, 42.037)) < 46800
ORDER BY name
Name |
---|
Adventist Home |
Bowne Hospital |
Columbia Memorial Hospital |
Firemens Home |
Greene County Memorial Hospital |
Hudson River State Hospital |
Saint Francis Hospital |
Vassar Brothers Hospital |
ST_Buffer
,
where a circular buffer is created around the given geometry and the desired geometries within that
buffer are determined. The ST_Buffer
function takes as parameters a spatial
geometry and a distance in meters to buffer around this spatial value. The results are the
same.SELECT name
FROM hospitals
WHERE
ST_Intersects(location,
ST_Buffer(ST_Point(-74.237, 42.037), 46800.0))
ORDER BY name
Continuing
the previous examples, the distance from the specified point to each object within a 30 mile (or
approximately 46800m) radius can be found with this query, the results of which are shown
below.SELECT name, ST_Distance(location, ST_Point(-74.237, 42.037)) AS distance
FROM hospitals
WHERE ST_Distance(location, ST_Point(-74.237, 42.037)) < 46800.0
ORDER BY distance
Name | distance (meters) |
---|---|
Greene County Memorial Hospital | 36634.88 |
Adventist Home | 39014.09 |
Hudson River State Hospital | 43362.54 |
Saint Francis Hospital | 43786.58 |
Bowne Hospital | 44319.48 |
Columbia Memorial Hospital | 44533.58 |
Firemens Home | 45066.09 |
Vassar Brothers Hospital | 45525.13 |
ST_Buffer
can be used to compute the spatial relation and then
determine the
distance.SELECT name, ST_Distance(location, ST_Point(-74.237, 42.037)) AS distance
FROM hospitals
WHERE
ST_Intersects(location,
ST_Buffer(ST_Point(-74.237, 42.037), 46800.0))
ORDER BY distance
A
key difference to be noted here is that the ST_Buffer
in this package supports
buffering of arbitrary geometries and can be used to compute in that manner. Note that:- The
ST_Buffer
query on large geometries can be expensive. - For a large number of geometries, the user is advised to calculate the buffers separately, store the buffers in columns, and operate on the stored buffers.
SELECT name, ST_Distance(location, ST_WKTToSQL(
'LINESTRING (-74.0 42.0, -73.0 42.0)'))
FROM hospitals
WHERE ST_Intersects(location, ST_Buffer(ST_WKTToSQL(
'LINESTRING (-74.0 42.0, -73.0 42.0)'), 46800.0))
This
returns 31 results, the first 5 of which are:
Name | Distance |
---|---|
Avery Hospital | 38777.96 |
Hartford Hospital | 38204.014 |
Springfield Municipal Hospital | 39761.18 |
The Noble Hospital | 23831.78 |
Hudson River State Hospital | 29315.15 |