
Querying Geospatial Data Using IBM SQL Query


As a business’s requirements and competition evolve, it is important to be able to access and analyze the data that drives organizational decisions. Geospatial data plays a crucial role in forecasting, spatial analytics, and reporting, especially in the fields of logistics and finance. Additionally, a growing number of use cases combine technologies like artificial intelligence with geospatial applications.

With IBM Cloud SQL Query (SQL Query), you can now run SQL queries on geospatial data in files stored as CSV, Parquet, or JSON in IBM Cloud Object Storage (COS) using IBM’s geospatial toolkit. The toolkit provides many specialized geospatial functions and spatial data types, such as point, area, disjoint, azimuth, length, intersect, bounding box, and distance. There are also functions that convert SQL geometries to GeoJSON and vice versa, encode and decode Geohashes, and convert WKB and WKT geometries. A full list of functions can be found in the documentation.

In this article, we’ll show you how to create and run geospatial queries with SQL Query on bicycle location data we’ve downloaded and stored in IBM Cloud Object Storage (COS) using IBM Watson Studio and PixieDust.

Getting started with cycle data

This example assumes that you have a running instance of SQL Query and COS. For a primer on getting set up with SQL Query and COS, see my article, “Query and Analyze Call Logs with IBM Cloud SQL Query.” This article takes you through a step-by-step process of setting up COS and SQL Query. Also, check out my introduction to SQL Query, “Analyzing Data with IBM Cloud SQL Query,” where I show you how to set up SQL Query in an IBM Watson Studio notebook, and how to get the IBM Cloud API Key, SQL Query CRN number, and SQL Query COS bucket URL which we’ll need to run queries inside the notebook.

The data we’ll use for this article comes from the Seattle cycle share company Pronto, which operated from 2014 to 2017. It comprises two data sets as CSV files: one lists the locations of each cycle station, and the other contains the data for each bicycle trip, including trip duration and the gender and birth year of each subscriber. Download the two CSV files and store them in a COS bucket. Make sure the bucket name you choose is unique, since bucket names must be unique across all of COS.
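If you prefer to upload the files programmatically rather than through the COS console, a minimal sketch using the ibm-cos-sdk (ibm_boto3) library might look like the following. The API key, service instance CRN, endpoint URL, and the cycle-data bucket name here are placeholders; substitute your own values:

# A minimal sketch for uploading the CSV files to COS with ibm_boto3 (ibm-cos-sdk);
# the credentials, endpoint, and bucket name below are placeholders.
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<your_ibm_cloud_api_key>",
    ibm_service_instance_id="<your_cos_instance_crn>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-geo.cloud-object-storage.appdomain.cloud",
)

for filename in ["Pronto_Cycle_Share_Station_Data.csv",
                 "Pronto_Cycle_Share_Trip_Data.csv"]:
    # upload_file reads the local file and stores it under the same key in the bucket
    cos.upload_file(Filename=filename, Bucket="cycle-data", Key=filename)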

Next, you’ll need a Jupyter Notebook to run the queries for this example, since we’re using PixieDust to visualize the data. I’ll be using IBM Watson Studio, which provides a suite of data analysis tools that make it easy to analyze data in a notebook using Python or R. If you don’t have a subscription, you can sign up for a trial account.

Once you’re logged into Watson Studio, set up a project by clicking on the New project tile.

It will then ask you to select a project tile, which will pre-load a set of tools based on the type of project you want to create. The default selection is Complete, and we’ll just use that one.

Since you set up a COS service beforehand, Watson Studio will use that COS service to save all the project’s metadata and assets. Click on Assets, scroll down that page to the Notebooks section, and click New notebook.


Once you’re on the new notebook page, give the notebook a name and use the default runtime (which is Python 3.5 at the time of writing). Click Create at the bottom of the page.


When the notebook is created, you’ll be redirected to it in Watson Studio.

Setting up SQL Query and PixieDust in the Notebook

Since we’re using a Python notebook, we’ll use ibmcloudsql, the open-source Python library for SQL Query. This library lets us use SQL Query programmatically inside the notebook or in any Python application we create. We’ll also install PixieDust and use its chart and map features to visualize our data in the notebook. Install and import both in your notebook as follows:

!pip install --upgrade ibmcloudsql
!conda install -c conda-forge pyarrow
!pip install --upgrade pixiedust
import ibmcloudsql
import pixiedust

Then, after getting your IBM Cloud API Key, SQL Query CRN, and SQL Query COS URL, add them to your notebook. For example:

import getpass
cloud_api_key = getpass.getpass('Enter your IBM Cloud API Key')

When you run getpass, it will prompt you to copy and paste your IBM Cloud API Key into the box it generates. Do that and press Return. Then enter your SQL Query CRN and the endpoint of the COS bucket created by SQL Query. Your bucket will be created in us-south and its name will start with sql-.

sql_crn = '<your_crn>'

sql_cos_endpoint = 'cos://us-south/sql-0ad3cb02-59f7-474f-a895-8a697890a063/cycle'

When that’s done, you should have two notebook cells similar to the following:

Note that I’ve added the suffix /cycle at the end of my SQL Query endpoint URL. This is so we can distinguish these queries from others that we might have made previously and stored within the SQL Query results bucket.

With the API Key, CRN, and COS endpoint added, all we need to do is run SQL Query. To do that, we’ll invoke the ibmcloudsql library’s SQLQuery function and provide the API Key, CRN, and COS endpoint.

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

We can then use the run_sql function provided by the library to run a SQL query. The following code example shows one way to do that:

# create the SQL Query client
sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

# COS URI of the station data CSV
data_source = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv"

query = """
SELECT 
    *
FROM  {}
""".format(data_source)

# run_sql submits the query and returns the results as a pandas DataFrame
cycle_locations = sqlClient.run_sql(query)

Next, we’ll view the first five results:

cycle_locations.head()

Now that SQL Query is giving us back the results of the query, we’ll look at using SQL Query with PixieDust to query and visualize the data in a chart or on a map.

Sampling the data

As mentioned above, SQL Query has a geospatial toolkit that lets you query data in COS using an array of geospatial functions and data types.

In order to work with the geospatial functions, we’ll need the longitude and latitude coordinates for each bike station. Since the coordinates are stored in a (47.618549, -122.317017) format, we’ll split them into separate latitude and longitude columns with a common table expression, or CTE:

WITH location AS ( 
    SELECT 
        station_id, 
        cast(split(coordinates, ',')[0] as float) as latitude, 
        cast(split(coordinates, ',')[1] as float) as longitude 
    FROM (SELECT 
            station_id, 
            regexp_replace(`Location 1`, '[()]', '') as coordinates 
        FROM cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv
    ) 
)

The CTE uses the station_id as a reference so that we can join the station data set with the new longitude and latitude columns. So, an example query in the notebook might look like this:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

data_source = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv"

query = """
WITH location AS ( 
    SELECT 
        station_id, 
        cast(split(coordinates, ',')[0] as float) as latitude, 
        cast(split(coordinates, ',')[1] as float) as longitude 
    FROM (SELECT 
            station_id, 
            regexp_replace(`Location 1`, '[()]', '') as coordinates 
        FROM {0}
    ) 
)
SELECT 
    t.station_id,
    t.name,
    t.install_date,
    t.install_dockcount,
    t.current_dockcount,
    l.latitude,
    l.longitude
FROM {0} AS t
    JOIN 
    location AS l
    ON t.station_id = l.station_id
""".format(data_source)

cycle_geo_locations = sqlClient.run_sql(query)

Since we reference the COS data source twice in the query, we need to indicate that within the query string. We do that with the positional placeholder {0}, so that format() substitutes data_source both in the CTE’s SELECT statement and in the SELECT statement outside the CTE. After running the query, a sample of the results looks like this:

The station locations on their own don’t give us much useful data beyond each station’s coordinates and dock counts. To build a more meaningful example, let’s also examine the subscriber trip data and see how it’s organized and how we might use it in conjunction with our station data set.

This data set lets us look at trips grouped by subscriber birth year: the average trip duration and the total number of trips for each birth year. To view how these vary, we’ll query the data in COS using SQL Query and then use PixieDust to display the results as two line charts: one showing the average trip duration by birth year, and another showing the total number of trips by birth year:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

data_source = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Trip_Data.csv"

query = """
SELECT
    birthyear,
    AVG(tripduration) as avg_trip_duration,
    count(birthyear) as count
FROM  {}
where birthyear > 0
group by birthyear
having count > 1
order by birthyear asc
""".format(data_source)

avg_cycle_duration = sqlClient.run_sql(query)

This query filters out any row that doesn’t contain a birthyear and only returns birth years with more than one trip.

After running the query, use the display function to view avg_cycle_duration, and select Line Chart from the top menu. This will open a popup window where you select the data to plot.
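The call itself is a single line; display() is made available by the pixiedust import earlier, and the chart type, keys, and values are chosen in the PixieDust UI:

# display() comes from the pixiedust import at the top of the notebook
display(avg_cycle_duration)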

Select birthyear as the key and avg_trip_duration and count as the values.

That will plot both values in a single chart. Since the trip counts are on a much larger scale than the average durations, the chart is skewed, so we should separate the values into individual charts.

To do that, click on Type on the right of the chart and select subplots. This will split the values into individual charts that look like this:

You can change the chart’s renderer in the top right corner; I chose bokeh for this example.

Querying and visualizing geospatial data

Now that we’ve looked at the data, let’s move on to combining the data sets using the geospatial functions. To get the distance between stations, we’ll use the ST_Distance function. It takes two points, the origin and the destination, each built from longitude and latitude coordinates with the ST_Point function, and calculates the distance between them in meters.

An example SQL query to calculate the distances for each subscriber’s trip might look like the following:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

stations = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv"
trips = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Trip_Data.csv"

query = """
WITH location AS ( 
    SELECT 
        station_id,
        cast(split(coordinates, ',')[0] as float) as latitude, 
        cast(split(coordinates, ',')[1] as float) as longitude 
    FROM (SELECT 
            station_id, 
            regexp_replace(`Location 1`, '[()]', '') as coordinates 
        FROM {0}
    ) 
)
SELECT 
    trip.trip_id,
    trip.tripduration,
    trip.from_station_id,
    trip.to_station_id,
    origin.latitude as from_lat,
    origin.longitude as from_lon,
    destination.latitude as to_lat,
    destination.longitude as to_lon,
    st_distance(st_point(origin.longitude, origin.latitude), st_point(destination.longitude, destination.latitude)) as distance
FROM {1} AS trip
    JOIN 
    location AS origin
    ON trip.from_station_id = origin.station_id
    JOIN
    location AS destination
    ON trip.to_station_id = destination.station_id
WHERE trip.from_station_id != trip.to_station_id 
    AND trip.birthyear > 1
""".format(stations, trips)

cycle_geo_trips_stations = sqlClient.run_sql(query)

Running this query, we get back each trip with its origin and destination coordinates and the distance between them.
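As before, a quick way to view a sample of the results in the notebook is the returned DataFrame’s head() method:

# show the first five rows of the joined trip and station data
cycle_geo_trips_stations.head()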

Now, let’s run another query, this time using PixieDust’s display function to visualize the stations on a Mapbox map inside the notebook cell. To do that, let’s get the top five stations that subscribers rent bicycles from with the following query:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

stations = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv"
trips = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Trip_Data.csv"

query = """
WITH location AS ( 
    SELECT 
        station_id,
        cast(split(coordinates, ',')[0] as float) as latitude, 
        cast(split(coordinates, ',')[1] as float) as longitude 
    FROM (SELECT 
            station_id, 
            regexp_replace(`Location 1`, '[()]', '') as coordinates 
        FROM {0}
    ) 
)
SELECT
    *
FROM
    (SELECT 
        trip.from_station_id as station_id,
        origin.longitude as lon,
        origin.latitude as lat,
        count(*) as count
    FROM {1} AS trip
        JOIN 
        location AS origin
        ON trip.from_station_id = origin.station_id
    group by trip.from_station_id, origin.longitude, origin.latitude
    order by count desc
    limit 5)

""".format(stations, trips)

top_locations = sqlClient.run_sql(query)

Below are the locations along with the total number of trips from each location in descending order:

Using the PixieDust display function, we can see where those stations are located on the map. In the image below, I’ve highlighted station WF-01, the station from which the most subscribers rented their bikes.
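To reproduce the map in your own notebook, a minimal sketch is again a single display() call; choose Map as the chart type, use the lat and lon columns as keys and count as the value, and note that PixieDust’s Mapbox renderer expects a Mapbox access token, which you can paste into the chart’s options dialog:

# select Map as the chart type in the PixieDust UI and add a Mapbox access token in Options
display(top_locations)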

Knowing that station WF-01 is the most popular station for renting bikes, we can then figure out the destinations of each subscriber’s trip from that station. Instead of looking at all destinations throughout the city, let’s confine our search to destination stations in Downtown Seattle. We can do that using the geospatial function ST_Contains to get only points contained within an area, and we can define the area of Downtown Seattle with a bounding box using the function ST_BoundingBox and the area’s coordinates.

A really simple way to find bounding box coordinates is with GeoFabrik’s Tile Calculator. All you need to do is select the city, and then select the area within that city, and the calculator will provide you with the bounding box coordinates. The bounding box coordinates for Downtown Seattle can be copied from the Simple Copy box. Make sure to separate each coordinate with commas in ST_BoundingBox:

With those coordinates, we can write the following SQL Query using ST_Contains and ST_BoundingBox:

sqlClient = ibmcloudsql.SQLQuery(cloud_api_key, sql_crn, sql_cos_endpoint)

stations = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Station_Data.csv"
trips = "cos://us-geo/cycle-data/Pronto_Cycle_Share_Trip_Data.csv"

query = """
WITH location AS ( 
    SELECT 
        station_id,
        cast(split(coordinates, ',')[0] as float) as latitude, 
        cast(split(coordinates, ',')[1] as float) as longitude 
    FROM (SELECT 
            station_id, 
            regexp_replace(`Location 1`, '[()]', '') as coordinates 
        FROM {0}
    ) 
)
SELECT
    to_station_id,
    lon,
    lat,
    count(*) as count
FROM 
    (SELECT  
        trip.to_station_id,
        destination.longitude as lon, 
        destination.latitude as lat  
    FROM {1} AS trip
        JOIN 
        location AS destination
        ON trip.to_station_id = destination.station_id
    WHERE 
        trip.from_station_id = 'WF-01'
        AND
        trip.from_station_id != trip.to_station_id
        AND
        st_contains(ST_BoundingBox(-122.3587, 47.6016, -122.3273, 47.6185), st_point(destination.longitude, destination.latitude)))
group by to_station_id, lon, lat
order by count desc
""".format(stations, trips)

top_downtown_locations = sqlClient.run_sql(query)

Then, using PixieDust’s display() function, we can verify that we correctly confined our search to Downtown Seattle and returned only stations within the defined bounding box:

These examples are meant only as a primer on using geospatial functions with SQL Query. For a more complex example using New York City taxicab data, check out the NYC Taxicab Analysis notebook on GitHub, which includes a multitude of examples using SQL Query’s geospatial functions and data types.

Summary

This article focused on getting started with Watson Studio and using SQL Query’s geospatial functions within a notebook. Our queries covered getting the distance between bicycle rental stations for each trip and finding the top five stations throughout Seattle. Using the top station, we then found the destination stations in Downtown Seattle for trips that originated from it. We encourage you to give SQL Query a try and to discover new ways to query the geospatial data you have stored in COS.
