Mine spatial data with space-time-boxes in IBM SPSS Modeler and visualize the data with R
Analyze traditional, unstructured, and now also spatial data from multiple sources and build powerful views using R
IBM® SPSS® Modeler is a powerful, versatile data and text analytics workbench that helps analysts build accurate, predictive models quickly and intuitively, without programming. IBM SPSS Modeler can deal with various data formats, including unstructured data, with the text analytics component included in SPSS Modeler Premium and SPSS Modeler Gold. But what about geospatial data?
Mobile devices are estimated to generate 600 billion geospatially tagged transactions a day. Every call, text message, email, and data transfer creates a data point with space and time coordinates. Radio-frequency identification (RFID) and sensor data can be even larger. At IBM Research, projects are underway to explore efficient algorithms for finding frequent patterns in spatial context from data in relational databases and from web pages that contain spatial information, such as addresses. Spatial databases are too large to analyze manually. Spatial data mining can help to discover interesting, useful, non-trivial patterns from large spatial data sets.
IBM SPSS Modeler V16 makes it possible to work with geospatial data by applying a spatial data mining function called space-time-boxes.
Marketers, intelligence analysts, customer satisfaction representatives, shippers, supply chain and logistics experts, and others who use this information to improve their ROI are a step ahead of colleagues who don't. Most events are associated with a particular space and time, which makes space and time important variables in many scenarios and industries.
Space-time-boxes can help resolve problems, such as:
- How can I track individuals that are suspected of terrorist activity?
- How can I anticipate demand for a product at the right time and place and for the right person?
- How can I identify when a bad guy is logging on to the network?
- How can I tell if two mobile phones might belong to the same person?
To take advantage of this information, you need to understand geospatial and time data and how to analyze the data by using space-time-boxes.
What is a geohash?
Space-time-boxes use geohashes and timestamps to understand where and when entities exist. A geohash is a unique identifier in which the latitude and longitude of a location are converted into an alphanumeric string (a hash). It has a hierarchical data structure that divides space into a bounded, rectangular grid. The precision of a geohash depends on the length of the string: the longer the string, the more precise the location.
For example, the geohash dr5ru7 is New York City (Midtown Manhattan, Hell's Kitchen, to be more precise), but how do you know?
The Earth can be put on a grid of 32 boxes, as shown in Figure 1. For the geohash dr5ru7, find the grid section D, which includes parts of eastern North America and northern South America. This location is not precise.
Figure 1. Earth is divided into a grid of 32 boxes
Each of the 32 boxes on the map in Figure 1 can be further divided into 32 boxes, as shown in Figure 2. For geohash dr5ru7, the DR indicates the northeastern part of the United States. The location is still not precise but is narrower than box D.
Figure 2. Box D is divided into another 32 boxes
The DR box of Figure 2 can be further divided into another 32 boxes, and so on. For geohash dr5ru7, the DR5 narrows the location down within the northeastern part of the United States. DR5R indicates New York City, as shown in Figure 3.
Figure 3. Geohash DR5 and DR5R point to New York City
Finally, each box is subdivided until the geohash reaches the precision that is required. In this case, dr5ru7 points to midtown Manhattan at coordinates (40.76, -74.0).
Figure 4. Final representation of geohash DR5RU7
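The subdivision that Figures 1 through 4 walk through can be sketched in a few lines of Python. This is a minimal illustration of the standard public geohash encoding (alternating longitude and latitude bisections, five bits per base-32 character), not the code that SPSS Modeler runs internally. The function name geohash_encode and the sample point (40.7565, -73.9874), chosen to fall inside the dr5ru7 cell, are assumptions made for the example.

```python
# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, length=6):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # bits collected for the current character
    bit_count = 0
    use_lon = True    # geohash starts by halving the longitude range
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits select one of the 32 boxes
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# Hypothetical point in midtown Manhattan, inside the dr5ru7 cell.
print(geohash_encode(40.7565, -73.9874, length=1))  # d
print(geohash_encode(40.7565, -73.9874, length=3))  # dr5
print(geohash_encode(40.7565, -73.9874))            # dr5ru7
```

Each shorter result is a prefix of the longer one, which is the hierarchical property that the figures illustrate: every extra character picks one of 32 sub-boxes inside the previous box.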
What is a space-time-box node?
Space-time-boxes are an extension of geohash locations that include a third dimension, time, as shown in Listing 1. Analysts can use space-time-boxes to verify that two entities are the same because they're in the same place at the same time. Analysts can also use space-time-boxes to verify proximity between entities. This precision enhances your understanding of potential relationships. To use this extension, look for a new node that is called Space-Time-Boxes in the Record Ops palette of SPSS Modeler V16, as shown in Figure 5.
Figure 5. Space-time-box node in SPSS Modeler V16
Listing 1. Example of space-time-box geohash
| Geohash | Start timestamp     | End timestamp       |
| dr5ru7  | 2013-01-01 00:00:00 | 2013-01-01 00:15:00 |
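To make the structure in Listing 1 concrete, the following Python sketch attaches a time window to a geohash. It assumes that windows have a fixed length and are aligned to midnight; the article does not describe how SPSS Modeler aligns its time intervals, so treat that alignment and the function name space_time_box as assumptions for illustration.

```python
from datetime import datetime, timedelta


def space_time_box(geohash, timestamp, minutes=15):
    """Return (geohash, window_start, window_end) for one event.

    Assumption: windows are `minutes` long and aligned to midnight, so
    `minutes` should divide evenly into 24 hours.
    """
    window = timedelta(minutes=minutes)
    midnight = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = timestamp - midnight
    # Integer division of two timedeltas gives the number of whole windows.
    start = midnight + (elapsed // window) * window
    return (geohash, start, start + window)


box = space_time_box("dr5ru7", datetime(2013, 1, 1, 0, 7, 30))
print(box[0], box[1], box[2])
# dr5ru7 2013-01-01 00:00:00 2013-01-01 00:15:00
```

Two events that produce the same tuple fall in the same space-time-box, which is the basis for the proximity and identity checks described above.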
In SPSS Modeler, space-time-boxes include two modes:
- Individual records mode: Identifies the location of an entity at a specific time and can be used to analyze proximity and verify identity of entities.
- Hangouts mode: A hangout is a location or time (or both) where an entity is continually or repeatedly found. This mode accounts for these hangouts.
Required fields for space-time-boxes for individual records mode
The following fields are required: latitude, longitude, and timestamp.
The space-time-box calculates the geohash from these fields. However, the Density must be selected. The Density is the size of the space-time-box of interest and includes the physical area and the elapsed time.
Figure 6. Parameters to configure space-time-boxes in individual records mode
Example of individual records
For example, a taxi cab company needs to understand demand on New Year's Eve. It has geospatial and time data for phones from people who opted in to the smart-taxi phone application. Each record is a geospatial ping from the application, as shown in Figure 7.
Figure 7. Ping from the smart-taxi mobile application
Space-time-boxes can be used to see the number of phones within a specific density (for example, 2.4 km and 1 hour, as shown in Figure 8). For a density of GH5 (2.4 km) and one hour, you can aggregate to see how many devices are located within this space-time-box.
Figure 8. Configuration of the space-time-box density
Figure 9 shows the stream in SPSS Modeler.
Figure 9. Stream in SPSS by using space-time boxes
The aggregation exercise in Figure 8 is a simple demand calculation in a specific space-time-box. You can see the results in Figure 10.
Figure 10. Devices that are located in the same space-time box
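Outside of SPSS Modeler, the same demand calculation can be approximated as a group-and-count. The Python sketch below assumes that each ping already carries a five-character geohash (the GH5 density) and counts distinct devices per geohash and hour. The record layout, device names, and the function devices_per_box are hypothetical; they are not taken from the stream in Figure 9.

```python
from collections import defaultdict
from datetime import datetime


def devices_per_box(pings):
    """Count distinct devices per (geohash, hour) space-time-box.

    pings: iterable of (device_id, geohash, timestamp) tuples.
    """
    boxes = defaultdict(set)
    for device_id, geohash, ts in pings:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        boxes[(geohash, hour)].add(device_id)
    return {box: len(devices) for box, devices in boxes.items()}


# Hypothetical New Year's Eve pings from the smart-taxi application.
pings = [
    ("phone_a", "dr5ru", datetime(2013, 12, 31, 23, 5)),
    ("phone_b", "dr5ru", datetime(2013, 12, 31, 23, 40)),
    ("phone_a", "dr5ru", datetime(2013, 12, 31, 23, 50)),  # same device again
    ("phone_c", "dr5rv", datetime(2013, 12, 31, 23, 10)),
]

for (geohash, hour), count in sorted(devices_per_box(pings).items()):
    print(geohash, hour, count)
```

Using a set per box means that a phone that pings several times within the hour is counted once, which matches the idea of counting devices rather than pings.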
Required fields for space-time-boxes for hangouts mode
In addition to the settings required for the individual records mode, you need to specify:
- Entity ID: The entity to be used as the hangout identifier. In this example, the Entity ID is the Device ID.
- Minimum number of events: The smallest number of rows (events) to be included in a hangout.
- Dwell time: The smallest duration of time an entity remains in a hangout (for example, 8 hours at work, or number of minutes at a stop light).
- Allow hangouts to span STB: Allows entities to hang out across space-time-boxes.
- Minimum proportion of events in the qualifying space-time box: This option is available only if Allow hangouts to span STB is selected. Use this setting to control the degree to which a hangout reported in one space-time-box might overlap another. Select the minimum proportion of events that must occur within a single space-time-box to identify a hangout. For example, if Minimum proportion of events in the qualifying space-time-box is set to 25%, and the proportion of events is 26%, this event qualifies as being a hangout.
Think of a hangout as a location or a time period (or both) where an entity can repeatedly or continually be found. Examples include people at work or a vehicle's regular transportation run. Deviations from this regularity might be interesting to analyze. A row of data that is used in hangout mode is an event.
Examples of hangouts mode
Start with the taxi data mentioned earlier. Assume, for example:
- You have geospatial and time data from taxis in the company and you want a better sense of the supply of taxis over time and space.
- You want to account for hangouts that might occur because taxis wait at stoplights or stop signs and taxis stop for passengers.
Use the settings in Figure 11 to configure hangouts for this example. Then aggregate by the space-time-box and derive the count of Device_IDs to see the availability of taxis at specific times and places.
Figure 11. Configuration of hangouts in space-time-boxes node
The fields and values that configure hangouts in Figure 11 are:
- Entity ID field shows the Device ID (in this case the Taxi ID)
- The minimum Dwell time is five minutes
- The Minimum number of events is two
- The Allow Hangouts to span STB boundaries is selected
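The effect of the minimum number of events and the dwell time can be illustrated with a deliberately simplified Python sketch. It groups events by entity and geohash and reports a hangout when both thresholds are met. It does not model hangouts that span space-time-boxes or the minimum proportion setting, and it is not the algorithm that the Space-Time-Boxes node uses; the function find_hangouts and the sample taxi events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_hangouts(events, min_events=2, min_dwell=timedelta(minutes=5)):
    """Return (entity_id, geohash, first_seen, last_seen) for each hangout.

    events: iterable of (entity_id, geohash, timestamp) tuples.
    Simplification: all events of an entity in one geohash are treated as
    a single candidate hangout, regardless of gaps between them.
    """
    grouped = defaultdict(list)
    for entity_id, geohash, ts in events:
        grouped[(entity_id, geohash)].append(ts)

    hangouts = []
    for (entity_id, geohash), stamps in sorted(grouped.items()):
        first, last = min(stamps), max(stamps)
        if len(stamps) >= min_events and last - first >= min_dwell:
            hangouts.append((entity_id, geohash, first, last))
    return hangouts


# Hypothetical taxi events, using the thresholds listed above.
events = [
    ("taxi_1", "dr5ru", datetime(2013, 12, 31, 23, 0)),
    ("taxi_1", "dr5ru", datetime(2013, 12, 31, 23, 6)),   # dwell of 6 minutes
    ("taxi_2", "dr5ru", datetime(2013, 12, 31, 23, 0)),
    ("taxi_2", "dr5ru", datetime(2013, 12, 31, 23, 2)),   # dwell too short
    ("taxi_3", "dr5rv", datetime(2013, 12, 31, 23, 0)),   # only one event
]

for hangout in find_hangouts(events):
    print(hangout)
```

With a dwell time of five minutes and a minimum of two events, only taxi_1 qualifies in this sample.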
Example that is applied to a taxi management stream
To extend the previous examples to anticipate taxi demand and distribute taxis across locations to meet this demand by using SMS alerts, use two data sets: the geospatial pings from the smart-taxi application and the geospatial and time data from the taxis.
As shown in Figure 12, the final stream combines the two data sets and the space-time-boxes. Using a simple aggregate, it calculates the number of people and the number of taxis per space-time-box.
Figure 12. Final stream to forecast the taxi demand
The goal is to determine whether there are enough taxis at certain locations and times to meet demand. A taxi-to-people ratio is calculated and only the top percentages are selected. A notification is provided in the form of an alert (Alert: More Business expected in this area) by using the new Control Language for Expression Manipulation (CLEM) functions, as shown in Figure 13.
Figure 13. Alerts generated when lots of taxis are needed
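The ratio-and-alert logic can be sketched outside of the stream as follows. This Python example assumes two dictionaries that map a space-time-box key to a count of people and a count of taxis, and it flags the boxes with the lowest taxi-to-people ratios. The 5% cutoff, the rule that at least one box is always flagged, the function taxi_alerts, and the sample counts are assumptions for illustration; the actual stream uses CLEM expressions instead.

```python
import math

ALERT = "Alert: More Business expected in this area"


def taxi_alerts(people, taxis, fraction=0.05):
    """Flag the space-time-boxes with the lowest taxi-to-people ratios.

    people, taxis: dicts that map a box key to a count.
    fraction: share of boxes to flag (at least one box is always flagged
    when any box has people in it).
    """
    ratios = {
        box: taxis.get(box, 0) / count
        for box, count in people.items()
        if count > 0
    }
    n_alerts = max(1, math.ceil(len(ratios) * fraction))
    neediest = sorted(ratios, key=ratios.get)[:n_alerts]
    return {box: ALERT for box in neediest}


# Hypothetical counts per five-character geohash for one hour.
people = {"dr5ru": 100, "dr5rv": 10, "dr5rs": 50}
taxis = {"dr5ru": 2, "dr5rv": 5, "dr5rs": 10}

print(taxi_alerts(people, taxis))
```

In this sample, dr5ru has the fewest taxis per person, so it is the box that receives the alert.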
Get dynamic, data-driven maps by using R
IBM SPSS Modeler comes with many powerful visualization options to verify the results. But in some cases, extra tools are needed. In this example, the integration between IBM SPSS Modeler and R is used to create a map that illustrates where the taxis are needed. The output in the previous example is a table with alerts but an interactive map can make it easier to understand the results.
Use the R package plotGoogleMaps. This package provides an interactive plot device that handles the geographic data within web browsers. It is based on the Google Maps API and it enables the creation of an interactive web map. Google supplies the base map. All map elements and additional functions are handled by R commands from the package. To install a package, you need to build the stream that is shown in Figure 14 and insert the code that installs the package.
Note: To use R, install the extension SPSS Modeler - Essentials for R, which is available for free.
To use the plotGoogleMaps package, install it into R by using the instructions in the developerWorks article, Calling R from SPSS: An introduction to the R plug-in for
Figure 14. Stream that is used to install R packages in SPSS Modeler 16
In the SPSS Modeler V16, you can use three types of R nodes:
- R Transform
- R Modeling
- R Output
This example uses the R Output node to add extra visualization. As shown in Figure 15, the final stream includes the table output and the R output with the R code from Listing 2. The stream identifies the five percent of Space-Time-Boxes with the lowest taxi-to-people ratios.
Figure 15. Taxi management stream with R visualization
In the example in Figure 15, a taxi cab company uses data that is gathered from customers who opt in to a smartphone application to sense consumer density. Matching this data with their taxi cab location data, they can make recommendations to drivers about where to find more business. The taxi operation runs more efficiently overall as taxi supply is matched to demand.
A growing number of privacy laws that are related to location data are emerging around the world. Be sure to think about the legal and policy aspects in such projects. Privacy by Design (PbD) is an approach to consider.
Listing 2. R code that plots the alert locations with plotGoogleMaps
library(plotGoogleMaps)
# Convert the SPSS Modeler data frame to a SpatialPointsDataFrame (SPDF)
coordinates(modelerData) <- ~TAXIS_NEEDED_LONGITUDE+TAXIS_NEEDED_LATITUDE
# Add the coordinate reference system (WGS 84)
proj4string(modelerData) <- CRS('+init=epsg:4326')
# Create a web map of the point data
m <- plotGoogleMaps(modelerData, filename='C:/Users/Administrator/Desktop/myMap12.htm')
As shown in Figure 16, in SPSS Modeler V16, three new CLEM expressions under General Functions in the Expression Builder make it easier to plot frequencies on a map:
- To_geohash: Returns the geohashed string that corresponds to the latitude and longitude and at the specified density.
- Stb_centroid_latitude: Returns an integer value for the latitude that corresponds to the centroid of the geohash.
- Stb_centroid_longitude: Returns an integer value for the longitude that corresponds to the centroid of the geohash.
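To show what a centroid lookup involves, the following Python sketch decodes a geohash back to the center of its box by using the standard public geohash decoding. It illustrates the idea behind the centroid functions and is not the CLEM implementation; the function name geohash_centroid is an assumption made for the example.

```python
# Standard geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_centroid(geohash):
    """Return (latitude, longitude) of the center of a geohash box."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    use_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # five bits, most significant first
            bit = (value >> shift) & 1
            if use_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            use_lon = not use_lon
    return ((lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2)


lat, lon = geohash_centroid("dr5ru7")
print(round(lat, 3), round(lon, 3))  # roughly 40.757 and -73.987
```

A centroid like this gives one latitude and longitude pair per space-time-box, which is the kind of input that a point map needs.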
Figure 16. New functions in the Expression Builder in SPSS Modeler
As shown in Figure 17, the result is a dynamic map, which cannot easily be drafted with the current SPSS Modeler nodes. However, by integrating the R package, you can add a whole new set of visualization capabilities to the solution.
Figure 17. Taxi management stream with R visualization
Use space-time-boxes in SPSS Modeler V16 to mine spatial data and create a complete solution for data analysis. Space-time-boxes make it possible to combine traditional data, unstructured data, and spatial data from many different types of data sources (even from a Hadoop cluster). The integration with R expands the possibilities to apply more algorithms, data transformations, and, as in this example, new powerful visualizations.
- Refer to the SPSS Modeler V16 documentation to learn more about SPSS Modeler.
- Explore more about visualizing geohash.
- SPSS software: Learn more about the SPSS product portfolio.
- See The Comprehensive R Archive Network, the main site for the R project and each R package. The help pages and manuals that are associated with Rcgmin are detailed. Numerous references are provided.
- Read Do I need to learn R? (Catherine Dalzell, developerWorks, September 2013) to learn why R is a valuable tool for data analytics that was expressly designed to reflect the way that statisticians think and work.
- Download the R plug-in for SPSS.