Querying Geospatial Data Using IBM SQL Query
As a business’s requirements and competition evolve, it is important to be able to access and analyze the data that drives organizational decisions. Geospatial data plays a crucial role in forecasting, spatial analytics, and reporting, especially in fields like logistics and finance. Additionally, technologies like artificial intelligence are increasingly being used in tandem with geospatial applications.
With IBM Cloud SQL Query (SQL Query), you can now run SQL queries on geospatial data in CSV, Parquet, or JSON files stored in IBM Cloud Object Storage (COS) using IBM’s geospatial toolkit. The toolkit includes many specialized geospatial functions and spatial data types, such as point, area, disjoint, azimuth, length, intersect, bounding box, and distance. There are also functions that convert SQL geometries to GeoJSON and vice versa, encode and decode Geohashes, and convert WKB and WKT geometries. A full list of functions can be found in the documentation.
In this article, we’ll show you how to create and run geospatial queries with SQL Query on bicycle location data we’ve downloaded and stored in COS, using IBM Watson Studio and PixieDust.
Getting started with cycle data
This example assumes that you have a running instance of SQL Query and COS. For a primer on getting set up with SQL Query and COS, see my article, “Query and Analyze Call Logs with IBM Cloud SQL Query,” which takes you through a step-by-step process of setting up COS and SQL Query. Also, check out my introduction to SQL Query, “Analyzing Data with IBM Cloud SQL Query,” where I show you how to set up SQL Query in an IBM Watson Studio notebook and how to get the IBM Cloud API Key, SQL Query CRN number, and SQL Query COS bucket URL, which we’ll need to run queries inside the notebook.
The data we’ll use for this article comes from the Seattle cycle share company Pronto, which operated from 2014 to 2017. It comprises two data sets as CSV files: one shows the locations of each cycle station, and the other shows the data for each bicycle trip, including trip duration and the gender and birth year of each subscriber. Download the two CSV files and store them in a COS bucket. Make sure that the bucket name you create is unique, since bucket names must be unique across all of COS.
Next, you’ll need a Jupyter notebook to run the queries for this example, since we’re using PixieDust to visualize the data. I’ll be using IBM Watson Studio, a suite of data analysis tools that makes it easy to analyze data in a notebook using Python or R. If you don’t have a subscription, you can sign up for a trial account.
Once you’re logged into Watson Studio, set up a project by clicking on the New project tile.
It will then ask you to select a project tile, which will pre-load a set of tools based on the type of project you want to create. The default selection is Complete, and we’ll just use that one.
Since you set up a COS service beforehand, Watson Studio will use that COS service to save all the project’s metadata and assets. Click on Assets, scroll down that page to the Notebooks section, and click New notebook.
Once you’re in a new notebook, give it a name and use the default runtime (Python 3.5 at the time of writing). Click Create at the bottom of the page.
When the notebook is created, it will redirect you to your Watson Studio Notebook.
Setting up SQL Query and PixieDust in the Notebook
Since we’re using a Python notebook, we’ll use ibmcloudsql, the open-source Python library for SQL Query. This library lets us use SQL Query programmatically inside the notebook, or in any Python application you create. We’ll also install PixieDust and use its chart and map features to visualize our data in the notebook. Install and import both in your notebook as follows:
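A minimal setup cell, assuming a fresh notebook environment, might look like this:

```python
# Install the SQL Query client library and PixieDust (run once per environment)
!pip install --upgrade ibmcloudsql
!pip install --upgrade pixiedust

import getpass      # to enter the API key without echoing it
import ibmcloudsql  # SQL Query client library
import pixiedust    # charting and mapping in the notebook
```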
Then, after getting your IBM Cloud API Key, SQL Query CRN number, and SQL Query COS URL, add them to your notebook, for example like this:
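Here’s a sketch of the first cell, using getpass so the API key isn’t stored in the notebook in plain text (the variable name is my own choice):

```python
# Prompt for the IBM Cloud API Key; getpass hides the input
cloud_api_key = getpass.getpass('Enter your IBM Cloud API Key: ')
```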
When you run getpass, it will prompt you to copy and paste your IBM Cloud API Key into the box it generates. Do that and press Return. Then enter your SQL Query CRN number and the endpoint of the COS bucket created by SQL Query. Your bucket will be created in us-south and its name will start with sql-.
When that’s done, you should have two notebook cells similar to the following:
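Mine look roughly like this; the CRN and bucket name below are placeholders rather than real values:

```python
# Cell 1: securely capture the IBM Cloud API Key
cloud_api_key = getpass.getpass('Enter your IBM Cloud API Key: ')

# Cell 2: the SQL Query instance CRN and the COS endpoint for query results
sql_crn = 'crn:v1:bluemix:public:sql-query:us-south:a/<account-id>:<instance-id>::'
sql_cos_endpoint = 'cos://us-south/sql-<your-results-bucket>/cycle'
```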
Note that I’ve added the suffix /cycle at the end of my SQL Query endpoint URL. This is so we can distinguish these queries from others that we might have made previously and stored within the SQL Query results bucket.
With the API Key, CRN, and COS endpoint added, all we need to do is run SQL Query. To do that, we’ll invoke the ibmcloudsql library’s SQLQuery function and provide the API Key, CRN, and COS endpoint.
We can then use the run_sql function provided by the library to run a SQL query. The following code shows one way to do that:
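A minimal sketch, assuming the station CSV has been uploaded to a bucket of your own (the bucket and file names below are placeholders):

```python
# Create the SQL Query client with the credentials entered above
sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

# Placeholder data location -- substitute your own bucket and file name
data_source = 'cos://us-south/<your-data-bucket>/2015_station_data.csv'

# Run a simple query against the station CSV stored in COS
stations = sqlClient.run_sql(
    'SELECT * FROM {} STORED AS CSV'.format(data_source))
```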
Next, we’ll view the first five results:
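run_sql returns a pandas DataFrame, so the usual head call works:

```python
# Show the first five rows of the query result
stations.head(5)
```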
Now that SQL Query is giving us back the results of the query, we’ll look at using SQL Query with PixieDust to query and visualize the data in a chart or on a map.
Sampling the data
As mentioned above, SQL Query has a geospatial toolkit that lets you query data in COS using an array of geospatial functions and data types.
In order to work with the geospatial functions, we’ll need the longitude and latitude coordinates for each bike station. Since the coordinates are in a (47.618549, -122.317017) format, we’ll split them into separate columns with a common table expression, or CTE:
The CTE uses the station_id as a reference so that we can join the station data set with the new longitude and latitude columns. So, an example query in the notebook might look like this:
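Here’s a sketch of such a query; the coordinates column name and the exact string cleanup are assumptions that depend on how the CSV is laid out:

```python
# The coordinates come as "(lat, long)", so index [0] is latitude
# and index [1] is longitude after stripping the parentheses
query = """
WITH coords AS (
    SELECT station_id,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[1]) AS DOUBLE) AS longitude,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[0]) AS DOUBLE) AS latitude
    FROM {0} STORED AS CSV
)
SELECT s.station_id, s.name, c.longitude, c.latitude
FROM {0} STORED AS CSV s
JOIN coords c ON s.station_id = c.station_id
""".format(data_source)

stations_geo = sqlClient.run_sql(query)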
Since the query references the COS bucket with our data twice, we need to indicate that within the query string. We do that using the {0} placeholder so that format substitutes data_source into both the SELECT statement inside the CTE and the SELECT statement outside it. After running the query, a sample of results looks like this:
The station locations alone don’t provide much useful data beyond each station’s coordinates and how many bicycles it contains. So, to give us a more meaningful example, let’s also examine the subscriber data and view how it is organized and how we might use it in conjunction with our station data set.
This data set provides us with an overview of each trip grouped by birth year, the average duration of the trips, and the total number of trips within each birth year. To view the trend of average trip duration and total number of trips by subscriber birth year, we’ll query the data in COS using SQL Query and then use PixieDust to display the data as two line charts: one showing the average trip duration for subscribers by birth year, and another showing the total number of trips by birth year:
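A sketch of that query, assuming the trip CSV uses the birthyear and tripduration column names and stored at a placeholder path:

```python
# Placeholder data location for the trip CSV
trip_data_source = 'cos://us-south/<your-data-bucket>/2015_trip_data.csv'

query = """
SELECT birthyear,
       AVG(tripduration) AS avg_cycle_duration,
       COUNT(*) AS count
FROM {0} STORED AS CSV
WHERE birthyear IS NOT NULL
GROUP BY birthyear
HAVING COUNT(*) > 1
ORDER BY birthyear
""".format(trip_data_source)

trips = sqlClient.run_sql(query)
```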
This query filters out any row that doesn’t contain a birthyear and only returns birth-year groups containing more than one subscriber.
After running the query, use the display method to view the avg_cycle_duration, and select Line Chart from the top menu. This will open a popup window where you can select the data to be plotted.
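The call itself is a single line; PixieDust renders the chart controls around the result:

```python
# PixieDust's display() is available after importing pixiedust
display(trips)
```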
Select birthyear as the key and avg_cycle_duration and count as the values.
That will plot both values in a single chart. Since the number of total subscribers skews the chart, we should separate the values into individual charts. To do that, click on Type on the right of the chart and select subplots. This will split the values into separate charts that look like this:
You can change the renderer for the chart in the top-right corner. I chose bokeh for this example.
Querying and visualizing geospatial data
Now that we’ve looked at the data, let’s move on to combining the data sets using the geospatial functions. To get the distance between stations, we’ll use the ST_Distance function. It takes the origin and destination as points, built from longitude and latitude coordinates with the ST_Point function, and calculates the distance between them in meters.
An example SQL query to calculate the distances for each subscriber’s trip might look like the following:
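Here’s a sketch, reusing the coordinate-splitting CTE from above and assuming the trip CSV’s trip_id, from_station_id, and to_station_id column names; check the toolkit documentation for ST_Point’s longitude/latitude argument order:

```python
query = """
WITH coords AS (
    SELECT station_id,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[1]) AS DOUBLE) AS longitude,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[0]) AS DOUBLE) AS latitude
    FROM {0} STORED AS CSV
)
SELECT t.trip_id,
       t.from_station_id,
       t.to_station_id,
       ST_Distance(ST_Point(o.longitude, o.latitude),
                   ST_Point(d.longitude, d.latitude)) AS distance_meters
FROM {1} STORED AS CSV t
JOIN coords o ON t.from_station_id = o.station_id
JOIN coords d ON t.to_station_id = d.station_id
""".format(data_source, trip_data_source)

trip_distances = sqlClient.run_sql(query)
```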
Running this query, we’ll get a sample of results like the following:
Now, let’s try running another query, this time using PixieDust’s display function to visualize the stations on a Mapbox map inside the notebook cell. To do that, let’s get the top five bicycle stations that subscribers rent bicycles from with the following query:
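One way to write it; the usertype value that marks annual subscribers is an assumption about this data set:

```python
query = """
WITH coords AS (
    SELECT station_id,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[1]) AS DOUBLE) AS longitude,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[0]) AS DOUBLE) AS latitude
    FROM {0} STORED AS CSV
)
SELECT t.from_station_id AS station_id,
       FIRST(c.longitude) AS longitude,
       FIRST(c.latitude) AS latitude,
       COUNT(*) AS trip_count
FROM {1} STORED AS CSV t
JOIN coords c ON t.from_station_id = c.station_id
WHERE t.usertype = 'Member'  -- assumed label for annual subscribers
GROUP BY t.from_station_id
ORDER BY trip_count DESC
LIMIT 5
""".format(data_source, trip_data_source)

top_stations = sqlClient.run_sql(query)
```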
Below are the locations along with the total number of trips from each location in descending order:
Using the PixieDust display function, we can see where those stations are located on the map. In the image below, I’m highlighting station WF-01, the station from which the most subscribers rented their bikes:
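As before, the call is just display; note that PixieDust’s Mapbox renderer asks for a (free) Mapbox access token in the chart options:

```python
display(top_stations)
# In the chart options, choose Map, set longitude and latitude as keys,
# and use trip_count as the value.
```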
Knowing that station WF-01 is the most popular station for renting bikes, we can then figure out the destinations of each subscriber’s trip from that station. Instead of looking at all destinations throughout the city, let’s confine our search to destination stations in Downtown Seattle. We can do that using the geospatial function ST_Contains to get only points contained within an area, and we can define the area of Downtown Seattle as a bounding box using the ST_BoundingBox function and the area’s coordinates.
A really simple way to find bounding box coordinates is with GeoFabrik’s Tile Calculator. All you need to do is select the city and then the area within it, and the calculator will provide you with the bounding box coordinates. The bounding box coordinates for Downtown Seattle can be copied from the Simple Copy box. Make sure to separate each coordinate with commas in ST_BoundingBox.
With those coordinates, we can write the following SQL query using ST_Contains and ST_BoundingBox:
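Here’s a sketch of that query; the bounding-box values below are approximate stand-ins for the ones you copy from the Tile Calculator, and the assumed argument order is (min longitude, min latitude, max longitude, max latitude):

```python
query = """
WITH coords AS (
    SELECT station_id,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[1]) AS DOUBLE) AS longitude,
           CAST(TRIM(SPLIT(REGEXP_REPLACE(coordinates, '[()]', ''), ',')[0]) AS DOUBLE) AS latitude
    FROM {0} STORED AS CSV
)
SELECT t.to_station_id AS station_id,
       FIRST(c.longitude) AS longitude,
       FIRST(c.latitude) AS latitude,
       COUNT(*) AS trip_count
FROM {1} STORED AS CSV t
JOIN coords c ON t.to_station_id = c.station_id
WHERE t.from_station_id = 'WF-01'
  -- keep only destinations inside the (approximate) Downtown Seattle box
  AND ST_Contains(ST_BoundingBox(-122.3602, 47.5985, -122.3278, 47.6161),
                  ST_Point(c.longitude, c.latitude))
GROUP BY t.to_station_id
""".format(data_source, trip_data_source)

downtown_stations = sqlClient.run_sql(query)
```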
Then, using PixieDust’s display() function, we can verify that we correctly confined our search to Downtown Seattle and returned only stations within the defined bounding box:
These examples are meant only as a primer to using geospatial functions with SQL Query. For a more complex example using New York City taxicab data, check out the NYC Taxicab Analysis notebook on GitHub, which includes a multitude of examples using geospatial functions and data types with SQL Query.
Summary
This article focused on getting started with Watson Studio and using SQL Query geospatial functions within a notebook. Our queries included getting the distance between bicycle rental stations for each bicycle trip and finding the top five rental stations throughout Seattle. Using the top station, we then found all the Downtown Seattle stations where trips originating from that point ended. We encourage you to give SQL Query a try and to discover new ways to query the geospatial data you have stored in COS.