Community

dashDB functions for spatial analysis are all about location

Share this post:

When you buy a house, what are the 3 top most things to look for? The answer is LOCATION, LOCATION, and LOCATION! When you open a business, you want to avoid traffic hotspots and accident hotspots. You want to explore nearby places and know navigation information about streets and intersections. You’ll also want to know information like where is the nearest airport.

Who can answer all these questions? The dashDB functions for spatial analysis!

Simple scenario – where to locate new office?

In this post, we will walk through a simple scenario to introduce some spatial functions. You will need a Bluemix account. If you don’t have one, go to bluemix.net/ and click on sign up in the upper right corner. Then, log into a dashDB instance, and follow the example step-by-step.

In this scenario, a small company (MYCO) has two offices, but business has been growing and there are now customers across the country. Many of the customers have expressed a preference to meet company representatives in person. The company owners want to explore where to open a new office.

Some of the questions in MYCO company owners want to answer are:

  • We already have some ideas where to open a new office. How can we find out which of these potential locations can serve the most customers?
  • How can we reach the customers with the highest business volume?
  • Are there other locations that should be considered?

Spatial analysis functions can help find the answers

We will use some of the geospatial data in the SAMPLE schema in dashDB to bring this example to life. It contains data about customers in the GEO_CUSTOMER table and county data in the GEO_COUNTIES table. You can use the Tables menu to view the structure and browse the content of these tables:

tables

In the sample county data, the shape of a county is stored in the spatial data type ST_MULTIPOLYGON, which accepts a series of coordinates, while the customer addresses are stored as ST_POINT, which holds the coordinates of specific locations.

MYCO already has two offices, so let’s create a table and add the locations using the RunSQL option in the dashDB menu:

CREATE TABLE offices
(
id integer not null primary key,
name varchar(30) not null,
location db2gse.st_point not null,
contact varchar(128),
status integer not null default 0
) organize by row;

The current offices are in Pittsburgh and Atlanta. The locations are inserted with the datatype ST_POINT and, as in the following SQL statements, the spatial reference system with the id 1005 is used, which refers to the most popular coordinate system WGS1984:

insert into offices
values (1, 'Pittsburgh', db2gse.st_point(-79.5835, 40.2623, 1005), 'Dan Smith',0),
(2, 'Atlanta', db2gse.st_point(-84.2324, 33.4518, 1005), 'Jane Miller',0);

The customer base has been growing in the west and southwest, so the MYCO owners consider opening an office in Austin, Dallas, Las Vegas, or San Jose. The question now is: Which of these locations is the best suited location as far as customers in its vicinity are concerned.

To answer this question, the potential locations are added to the OFFICE table with status ‘1’ to indicate ‘planned’ status:

insert into offices
values (11, 'Las Vegas', db2gse.st_point(-115.1739, 36.1215, 1005), 'tbd', 1),
(12, 'San Jose', db2gse.st_point(-121.924 , 37.3591, 1005), 'tbd', 1),
(13, 'Austin', db2gse.st_point(-97.76, 30.3208, 1005), 'tbd', 1),
(14, 'Dallas', db2gse.st_point(-96.974, 32.6406, 1005), 'tbd', 1);

It is mainly the customers with high business volume that are interested in a local contact. If we define 200 miles as the maximum distance for this scenario, then the first step is to find out how many customers with a given business volume are located closer than 200 miles to existing and planned offices.

DashDB Spatial supports two options to find distances, one is using the st_distance function and the other is a combination of st_buffer and st_intersects. The second option is recommended when using a unit, here the ‘STATUTE MILE’, so we use the second option:

select off.id as office_id, off.name as office_name, count(*) as count_customers
from offices off, samples.geo_customer cust
where cust.insurance_value > 200000 and
db2gse.st_intersects(db2gse.st_buffer(off.location,200,'STATUTE MILE'), cust.shape) = 1
group by off.id, off.name
order by count_customers desc;

customer count per office

Clearly an office in Austin or Dallas will reach more customers than one in Las Vegas or San Jose, but still considerably less than in the already existing offices in Atlanta and Pittsburgh.

In the following picture we have visualized this result on a map drawn with ArcGIS for Desktop.

Map with planned office locations

In a data layer, ArcGIS accesses the database and returns the office locations along with the number of customers. The number of customers is then represented by the size of the data point.

From this initial exploration or analysis, a location in Dallas looks most promising. However, is this really the best place? We should probably open up the list of potential future office locations and find out whether there are more suitable locations. For instance, we may want to look at counties with the largest number of high business-volume customers. Since one office might serve multiple counties, the customers in neighboring counties should also be considered.

For comparison, let’s first get the count for the existing and planned offices. To optimize the query runtime we use a two stage approach: in the first pass the st_envIntersects function reduces the candidates. The second pass finally checks which county contains the location:

--  find id of the counties where the offices are located
with offloc(officeid, countyid) as
(select off.id, gc.objectid
from offices off, samples.geo_county gc
where db2gse.st_envIntersects(gc.shape, off.location) = 1 and
db2gse.st_contains(gc.shape, off.location) = 1
),
-- count customers in these counties
custcount (officeid, count_customers) as
(
select officeid, count(*)
from offloc ol, samples.geo_county gc, samples.geo_customer cust
where ol.countyid = gc.objectid and
cust.insurance_value > 200000 and
db2gse.st_envIntersects(gc.shape, cust.shape) = 1 and
db2gse.st_contains(gc.shape, cust.shape) = 1
group by officeid
),
-- find id of the neighboring counties
offnei(officeid, countyid) as
(select officeid, gnc.objectid
from offloc ol, samples.geo_county gc, samples.geo_county gnc
where ol.countyid = gc.objectid and
gc.objectid <> gnc.objectid and
db2gse.st_touches(gc.shape, gnc.shape) = 1
),
-- count customers in neighboring counties
neigborcount (officeid, count_neighbors) as
(
select officeid, count(*)
from offnei on, samples.geo_county gc, samples.geo_customer cust
where on.countyid = gc.objectid and
cust.insurance_value > 200000 and
db2gse.st_envIntersects(gc.shape, cust.shape) = 1 and
db2gse.st_contains(gc.shape, cust.shape) = 1
group by officeid
)
-- put it all together
select cc.officeid, off.name, off.status, cc.count_customers, nc.count_neighbors
from custcount cc, neigborcount nc, offices off
where cc.officeid = nc.officeid and off.id = cc.officeid;

count customers in county and neighboring counties

We are now ready to explore other locations and query for the 10 counties with the highest customer count adding the customers in neighboring counties (be patient, this might take a minute to complete):

--  get the counties with most customers and their customer count
with
candidates(objectid, name, count_customers) as
(
select county.objectid, county.name, count(*) as count_customers
from samples.geo_customer cust, samples.geo_county county
where insurance_value > 200000 and
db2gse.st_contains(county.shape, cust.shape) = 1
group by county.name, county.objectid
order by count_customers desc
fetch first 10 rows only
),
-- get the neighboring counties
neighbors(candid, neighbor_id) as
(
select cand.objectid, gc2.objectid
from candidates cand, samples.geo_county gc1, samples.geo_county gc2
where cand.objectid = gc1.objectid and
gc1.objectid <> gc2.objectid and
db2gse.st_touches(gc1.shape, gc2.shape) = 1
),
-- count customers in neighboring counties
neighborcount(candid, count_neighbors) as
(
select candid, count(*)
from neighbors n, samples.geo_county co, samples.geo_customer cust
where n.neighbor_id = co.objectid and
insurance_value > 200000 and
db2gse.st_contains(co.shape, cust.shape) = 1
group by n.candid
)
-- put it all together
select c.name, c.objectid, c.count_customers, nc.count_neighbors
from candidates c, neighborcount nc
where c.objectid = nc.candid
order by c.count_customers
+ nc.count_neighbors desc
;

10 counties with most customers

Interesting. According to this analysis, a location in county Carroll would be best.

See the following image that displays the results of the query that color-codes the number of customers in the county and its neighboring counties (red means many customers, blue means fewer customers). You can also see the neighboring counties that contribute to the total number of customers.

Map showing 10 counties with most customers

But let’s also check customers in a specified distance like we did earlier, using the center of the county:

--  use id to search as county names occur multiple times
select cc.objectid as county_id, cc.name as county_name, count(*) as count_customers
from samples.geo_county cc, samples.geo_customer cust
where
cc.objectid in (821, 646, 542, 698, 183) and
cust.insurance_value > 200000 and
db2gse.st_intersects(db2gse.st_buffer(cc.shape,200,'STATUTE MILE'), cust.shape) = 1
group by cc.objectid,cc.name
order by count_customers desc;

That changes the sequence slightly…

Customer count in radius of 200 miles

So let’s see where the center of Lincoln, Carroll, and Douglas county are:


select substr(db2gse.st_astext(db2gse.st_centroid(shape)),1,50) as county_center
from samples.geo_county where objectid in (542, 821, 183);

Center of selected counties

According to this exploration, with both Lincoln and Douglas county located in Missouri, it seems we should look more closely at places in Missouri for our next planned office location.

Map showing customer count

To learn more about spatial features, review the topics on Spatial Extender in the IBM Knowledge Center:
www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.db2.luw.spatial.topics.doc/doc/csbp1001.html, but also watch the “Analyzing Geospatial Data with IBM dashDB and ESRI ArcGIS for Desktop” video.

More stories
May 7, 2019

We’ve Moved! The IBM Cloud Blog Has a New URL

In an effort better integrate the IBM Cloud Blog with the IBM Cloud web experience, we have migrated the blog to a new URL: www.ibm.com/cloud/blog.

Continue reading

May 1, 2019

Two Tutorials: Plan, Create, and Update Deployment Environments with Terraform

Multiple environments are pretty common in a project when building a solution. They support the different phases of the development cycle and the slight differences between the environments, like capacity, networking, credentials, and log verbosity. These two tutorials will show you how to manage the environments with Terraform.

Continue reading

April 29, 2019

Transforming Customer Experiences with AI Services (Part 1)

This is an experience from a recent customer engagement on transcribing customer conversations using IBM Watson AI services.

Continue reading