Mine spatial data with space-time-boxes in IBM SPSS Modeler and visualize the data with R

Analyze traditional, unstructured, and now also spatial data from multiple sources and build powerful views using R

IBM® SPSS® Modeler is a powerful, versatile data and text analytics workbench that helps analysts build accurate, predictive models quickly and intuitively, without programming. IBM SPSS Modeler can deal with various data formats, including unstructured data, with the text analytics component included in SPSS Modeler Premium and SPSS Modeler Gold. But what about geospatial data?

Mobile devices are estimated to generate 600 billion geospatially tagged transactions a day. Every call, text message, email, and data transfer creates a data point with space and time coordinates. Radio-frequency identification (RFID) and sensor data can be even larger. At IBM Research, projects are underway to explore efficient algorithms for finding frequent patterns in spatial context from data in relational databases and from web pages that contain spatial information, such as addresses. Spatial databases are too large to analyze manually. Spatial data mining can help to discover interesting, useful, non-trivial patterns from large spatial data sets.

IBM SPSS Modeler V16 makes it possible to work with geospatial data by applying a spatial data mining function called space-time-boxes.

Marketers, intelligence analysts, customer satisfaction representatives, shippers, supply chain and logistics experts, and others who use this information to improve their ROI are a step ahead of colleagues who don't. Most events are associated with a particular place and time, and both are important variables in many scenarios and industries.

Space-time-boxes can help resolve problems, such as:

  • How can I track individuals who are suspected of terrorist activity?
  • How can I anticipate demand for a product at the right time and place and for the right person?
  • How can I identify when a bad guy is logging on to the network?
  • How can I tell if two mobile phones might belong to the same person?

To take advantage of this information, you need to understand geospatial and time data and how to analyze the data by using space-time-boxes.

What is a geohash?

Space-time-boxes use geohashes and timestamps to understand where and when entities exist. A geohash is a unique identifier in which the latitude and longitude of a location are converted into an alphanumeric string (a hash). It has a hierarchical data structure that divides space into a bounded, rectangular grid. The precision of a geohash depends on the length of the string: the longer the string, the more precise the location.

For example, the geohash dr5ru7 is New York City (Midtown Manhattan, Hell's Kitchen, to be more precise), but how do you know that?

The Earth can be put on a grid of 32 boxes, as shown in Figure 1. For the geohash dr5ru7, find the grid section D, which includes parts of eastern North America and northern South America. This location is not precise.

Figure 1. Earth is divided into a grid of 32 boxes
Map of the Earth, sectioned into a 32-box grid

Each of the 32 boxes on the map in Figure 1 can be further divided into 32 boxes, as shown in Figure 2. For geohash dr5ru7, the DR indicates the northeastern part of the United States. The location is still not precise but is getting closer.

Figure 2. Box D is divided into another 32 boxes
Map of the Earth that shows Section D, subdivided into 32 boxes

The DR box of Figure 2 can be divided into another 32 boxes, and so on. For geohash dr5ru7, DR5 narrows the location to a smaller region within the northeastern United States, and DR5R indicates New York City, as shown in Figure 3.

Figure 3. Geohash DR5 and DR5R point to New York City
Image of New York City highlighted on the map

Finally, each box is subdivided until the geohash reaches the required precision. In this case, dr5ru7 points to Midtown Manhattan at approximately (40.76, -74.0).

Figure 4. Final representation of geohash DR5RU7
Screen capture of zoomed-in map shows precise street location
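
The repeated subdivision that is described above can be sketched in a few lines of base R. This is a minimal illustration of the publicly documented geohash scheme (five bits per character, alternating longitude and latitude, base-32 alphabet), which is assumed here to match what SPSS Modeler computes. The function name geohash_encode and the sample coordinates, a point chosen inside the dr5ru7 cell, are hypothetical and not part of the product.

```r
# Base-32 alphabet used by the public geohash scheme (no a, i, l, o).
geohash_alphabet <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]

# Encode a latitude/longitude pair as a geohash of `precision` characters.
# Hypothetical helper for illustration; not an SPSS Modeler function.
geohash_encode <- function(lat, lon, precision = 6) {
  lat_range <- c(-90, 90)
  lon_range <- c(-180, 180)
  hash <- character(precision)
  use_lon <- TRUE                      # bits alternate, starting with longitude
  for (i in seq_len(precision)) {
    index <- 0
    for (bit in 1:5) {                 # 5 bits per character = 32 boxes
      if (use_lon) {
        mid <- mean(lon_range)
        if (lon >= mid) {
          index <- index * 2 + 1
          lon_range[1] <- mid          # keep the eastern half
        } else {
          index <- index * 2
          lon_range[2] <- mid          # keep the western half
        }
      } else {
        mid <- mean(lat_range)
        if (lat >= mid) {
          index <- index * 2 + 1
          lat_range[1] <- mid          # keep the northern half
        } else {
          index <- index * 2
          lat_range[2] <- mid          # keep the southern half
        }
      }
      use_lon <- !use_lon
    }
    hash[i] <- geohash_alphabet[index + 1]
  }
  paste(hash, collapse = "")
}

# Each extra character narrows the same location by another factor of 32.
prefixes <- vapply(1:6, function(p) geohash_encode(40.7565, -73.9875, p),
                   character(1))
print(prefixes)  # d, dr, dr5, dr5r, dr5ru, dr5ru7
```

Because every longer geohash starts with the shorter one, truncating a geohash is a cheap way to move to a coarser grid.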

What is a space-time-box node?

Space-time-boxes are an extension of geohash locations that include a third dimension, time, as shown in Listing 1. Analysts can use space-time-boxes to verify that two entities are the same because they're in the same place at the same time. Analysts can also use space-time-boxes to verify proximity between entities. This precision enhances your understanding of potential relationships. To use this extension, look for a new node that is called Space-Time-Boxes in the Record Ops palette of SPSS Modeler V16, as shown in Figure 5.

Figure 5. Space-time-box node in SPSS Modeler V16
Close-up screen capture of the Space-Time-Boxes node in SPSS Modeler 16

Listing 1. Example of space-time-box geohash
| Geohash | Start timestamp     | End timestamp       |
| dr5ru7  | 2013-01-01 00:00:00 | 2013-01-01 00:15:00 |
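
As a rough illustration of how a record like the one in Listing 1 could be derived, the following base R sketch snaps a timestamp down to the start of its time window and pairs the window with a geohash. The function name space_time_box, the 15-minute default, and the pipe-separated key format are assumptions made for this example. The Space-Time-Boxes node does this work internally, and its exact output format might differ.

```r
# Build a space-time-box key from a geohash and a timestamp by snapping the
# timestamp down to the start of its time window.
# Hypothetical helper for illustration; not an SPSS Modeler function.
space_time_box <- function(geohash, timestamp, window_minutes = 15) {
  window_seconds <- window_minutes * 60
  start_seconds <- floor(as.numeric(timestamp) / window_seconds) * window_seconds
  start <- as.POSIXct(start_seconds, origin = "1970-01-01", tz = "UTC")
  end <- start + window_seconds
  fmt <- "%Y-%m-%d %H:%M:%S"
  paste(geohash,
        format(start, fmt, tz = "UTC"),
        format(end, fmt, tz = "UTC"),
        sep = "|")
}

# A ping at 00:07:30 falls into the 00:00:00 to 00:15:00 window.
ping_time <- as.POSIXct("2013-01-01 00:07:30", tz = "UTC")
stb <- space_time_box("dr5ru7", ping_time)
print(stb)  # "dr5ru7|2013-01-01 00:00:00|2013-01-01 00:15:00"
```

Two records that produce the same key were in the same box during the same window, which is the basis for the proximity and identity checks described above.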

In SPSS Modeler, space-time-boxes include two modes:

  • Individual records mode: Identifies the location of an entity at a specific time and can be used to analyze proximity and verify identity of entities.
  • Hangouts mode: A hangout is a location or time (or both) where an entity is continually or repeatedly found. This mode accounts for these hangouts.

Required fields for space-time-boxes for individual records mode

The following fields are required:

  • Latitude
  • Longitude
  • Timestamp

The Space-Time-Boxes node calculates the geohash, but you must select the Density. The Density is the size of the space-time-box of interest and covers both the physical area and the elapsed time.

Figure 6. Parameters to configure space-time-boxes in individual records mode
Screen capture of settings to configure space-time-boxes in individual records mode

Example of individual records

For example, a taxi cab company needs to understand demand on New Year's Eve. It has geospatial and time data for phones from people who opted in to their smart-taxi phone application. Each record is a geospatial ping from the application, as shown in Figure 7.

Figure 7. Ping from the smart-taxi mobile application
Screen capture showing timestamp, latitude, longitude, device

Space-time-boxes can be used to count the phones within a specific density. For example, with a density of GH5 (2.4 km) and one hour, as shown in Figure 8, you can aggregate to see how many devices are located within each space-time-box.

Figure 8. Configuration of the space-time-box density
Screen capture of configuration of the space-time-box density

Figure 9 shows the stream in SPSS Modeler.

Figure 9. Stream in SPSS Modeler that uses space-time-boxes
Screen capture of SPSS stream that shows space-time-boxes

The aggregation exercise in Figure 8 is a simple demand calculation in a specific space-time-box. You can see the results in Figure 10.

Figure 10. Devices that are located in the same space-time-box
Screen capture of list of device IDs for devices that are in the same space-time-box
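
Outside of Modeler, the same demand calculation can be approximated with a base R aggregate. The sketch below assumes that each ping has already been assigned a five-character geohash and the start of its one-hour window. The data frame, its column names, and its values are invented for illustration and do not come from the article's data set.

```r
# Hypothetical pings, already tagged with a GH5 geohash and an hourly window.
pings <- data.frame(
  geohash    = c("dr5ru", "dr5ru", "dr5ru", "dr5rs", "dr5rs"),
  hour_start = c("2013-12-31 22:00", "2013-12-31 22:00", "2013-12-31 22:00",
                 "2013-12-31 22:00", "2013-12-31 23:00"),
  device_id  = c("A", "B", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# Count the distinct devices seen in each space-time-box.
demand <- aggregate(device_id ~ geohash + hour_start, data = pings,
                    FUN = function(ids) length(unique(ids)))
names(demand)[names(demand) == "device_id"] <- "device_count"
print(demand)
```

Counting distinct device IDs rather than rows keeps a phone that pings several times within the hour from inflating the demand estimate.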

Required fields for space-time-boxes for hangouts mode

In addition to the settings required for the individual records mode, you need to specify:

  • Entity ID: The entity to be used as the hangout identifier. In this example, the Entity ID is Taxi-Number.
  • Minimum number of events: The smallest number of rows (events) to be included in a hangout.
  • Dwell time: The minimum duration that an entity must remain in one place for the stay to count as a hangout (for example, 8 hours at work, or a few minutes at a stoplight).
  • Allow hangouts to span STB: Allows entities to hang out across space-time-boxes.
  • Minimum proportion of events in the qualifying space-time box: This option is available only if Allow hangouts to span STB is selected. Use this setting to control the degree to which a hangout reported in one space-time-box might overlap another. Select the minimum proportion of events that must occur within a single space-time-box to identify a hangout. For example, if Minimum proportion of events in the qualifying space-time-box is set to 25%, and the proportion of events is 26%, this event qualifies as being a hangout.

Think of a hangout as a location or a time period (or both) where an entity can repeatedly or continually be found. Examples include people at work or a vehicle's regular transportation run. Deviations from this regularity might be interesting to analyze. A row of data that is used in hangout mode is an event.

Examples of hangouts mode

Start with the taxi data mentioned earlier. Assume, for example:

  • You have geospatial and time data from taxis in the company and you want a better sense of the supply of taxis over time and space.
  • You want to account for hangouts that might occur because taxis wait at stoplights or stop signs and taxis stop for passengers.

Use the settings in Figure 11 to configure hangouts for this example. Then aggregate by the space-time-box and derive the count of Device_IDs to see the availability of taxis at specific times and places.

Figure 11. Configuration of hangouts in space-time-boxes node
Screen capture of hangout configurations fields and values

The fields and values that configure hangouts in Figure 11 are:

  • Entity ID field shows the Device ID (in this case the Taxi ID)
  • The minimum Dwell time is five minutes
  • The Minimum number of events is two
  • The Allow Hangouts to span STB boundaries is selected
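
To make the two thresholds concrete, here is a simplified base R sketch that applies a minimum number of events and a minimum dwell time to each entity and geohash. It is an approximation under stated assumptions, not the algorithm that the Space-Time-Boxes node uses. It ignores hangouts that span space-time-boxes, and the function find_hangouts and the sample events are hypothetical.

```r
# Hypothetical taxi pings, already tagged with a geohash.
events <- data.frame(
  taxi_id = c("T1", "T1", "T1", "T2", "T2", "T1"),
  geohash = c("dr5ru7", "dr5ru7", "dr5ru7", "dr5ru7", "dr5ru6", "dr5ru6"),
  timestamp = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-01 00:03:00",
                           "2013-01-01 00:06:00", "2013-01-01 00:01:00",
                           "2013-01-01 00:09:00", "2013-01-01 00:20:00"),
                         tz = "UTC"),
  stringsAsFactors = FALSE
)

# Keep entity/geohash pairs with enough events and a long enough dwell time.
find_hangouts <- function(events, min_events = 2, min_dwell_minutes = 5) {
  groups <- split(events, list(events$taxi_id, events$geohash), drop = TRUE)
  summaries <- lapply(groups, function(g) {
    data.frame(
      taxi_id = g$taxi_id[1],
      geohash = g$geohash[1],
      n_events = nrow(g),
      dwell_minutes = as.numeric(difftime(max(g$timestamp), min(g$timestamp),
                                          units = "mins")),
      stringsAsFactors = FALSE
    )
  })
  summary_table <- do.call(rbind, summaries)
  rownames(summary_table) <- NULL
  keep <- summary_table$n_events >= min_events &
    summary_table$dwell_minutes >= min_dwell_minutes
  summary_table[keep, ]
}

hangouts <- find_hangouts(events)
print(hangouts)
```

With the thresholds from Figure 11 (two events, five minutes), only taxi T1 in dr5ru7 qualifies in this sample: it has three events spread over six minutes, while every other pair has a single event.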

Example that is applied to a taxi management stream

To extend the previous examples to anticipate taxi demand and distribute taxis across locations to meet this demand by using SMS alerts, use the following two data sets:

  • TaxiLocationData.csv
  • PhoneLocationData.csv

As shown in Figure 12, the final stream combines the two data sets and the space-time-boxes. Using a simple aggregate, it calculates the number of people and the number of taxis per space-time-box.

Figure 12. Final stream to forecast the taxi demand
Screen capture of stream that merges people density and taxi density

The goal is to determine whether there are enough taxis at certain locations and times to meet demand. A taxi-to-people ratio is calculated for each space-time-box, and only the boxes with the lowest ratios are selected. A notification is provided in the form of an alert (Alert: More Business expected in this area) by using the new Control Language for Expression Manipulation (CLEM) functions, as shown in Figure 13.

Figure 13. Alerts generated when lots of taxis are needed
Screen capture of alerts generated when lots of taxis are needed
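
The ratio and alert logic can be sketched in base R as follows. The counts are invented, the column names are hypothetical, and the five percent cutoff is taken from the description of the final stream later in this article. In the article itself, this step is built with Modeler nodes and CLEM expressions rather than R.

```r
# Hypothetical per-box counts, as the two aggregates in the stream might produce.
people <- data.frame(stb = c("dr5ru", "dr5rs", "dr5rt", "dr5rv", "dr5rg"),
                     people_count = c(400, 120, 80, 60, 50),
                     stringsAsFactors = FALSE)
taxis <- data.frame(stb = c("dr5ru", "dr5rs", "dr5rt", "dr5rv", "dr5rg"),
                    taxi_count = c(4, 12, 16, 15, 25),
                    stringsAsFactors = FALSE)

# Combine supply and demand per space-time-box and compute the ratio.
boxes <- merge(people, taxis, by = "stb")
boxes$taxi_to_people <- boxes$taxi_count / boxes$people_count

# Flag the boxes whose ratio falls in the lowest five percent.
threshold <- quantile(boxes$taxi_to_people, probs = 0.05, names = FALSE)
boxes$alert <- ifelse(boxes$taxi_to_people <= threshold,
                      "Alert: More Business expected in this area", "")
print(boxes[boxes$alert != "", c("stb", "taxi_to_people", "alert")])
```

In this sample only dr5ru is flagged, because it has 4 taxis for 400 people while every other box has at least one taxi per 10 people.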

Get dynamic, data-driven maps by using R

IBM SPSS Modeler comes with many powerful visualization options to verify the results. But in some cases, extra tools are needed. In this example, the integration between IBM SPSS Modeler and R is used to create a map that illustrates where the taxis are needed. The output in the previous example is a table with alerts but an interactive map can make it easier to understand the results.

Use the R package plotGoogleMaps. This package provides an interactive plot device that handles the geographic data within web browsers. It is based on the Google Maps API and it enables the creation of an interactive web map. Google supplies the base map. All map elements and additional functions are handled by R commands from the package. To install the package, build the stream that is shown in Figure 14 and insert the R code install.packages("plotGoogleMaps").

Note: To use R, install the extension SPSS Modeler - Essentials for R, which is available for free. To use the plotGoogleMaps package, install it into R, by using the instructions in the developerWorks article, Calling R from SPSS: An introduction to the R plug-in for SPSS.

Figure 14. Stream that is used to install R packages in SPSS Modeler 16
Close-up screen capture of stream used to install R packages in SPSS Modeler 16

In SPSS Modeler V16, you can use three types of R nodes:

  • R Transform
  • R Modeling
  • R Output

This example uses the R Output node to add extra visualization. As shown in Figure 15, the final stream includes the table output and the R output with the R code from Listing 2. The stream identifies the five percent of Space-Time-Boxes with the lowest taxi-to-people ratios.

Figure 15. Taxi management stream with R visualization
Screen capture of stream ends with R visualization

In the example in Figure 15, a taxi cab company uses data that is gathered from customers who opt in to a smartphone application to sense consumer density. Matching this data with their taxi cab location data, they can make recommendations to drivers about where to find more business. The taxi operation runs more efficiently overall as taxi supply is matched to demand.

A growing number of privacy laws that are related to location data are emerging around the world. Be sure to think about the legal and policy aspects in such projects. Privacy by Design (PbD) is an approach to consider.

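Listing 2. R code for the R Output node that plots the alert locations on a map
# coordinates(), proj4string(), and CRS() come from the sp package, which plotGoogleMaps depends on
library(plotGoogleMaps)
# Convert the Modeler data frame to a SpatialPointsDataFrame (longitude first, then latitude)
coordinates(modelerData) <- ~TAXIS_NEEDED_LONGITUDE + TAXIS_NEEDED_LATITUDE
# Assign the coordinate reference system (WGS 84, EPSG:4326)
proj4string(modelerData) <- CRS('+init=epsg:4326')
# Create the web map of the point data and write it to an HTML file
m <- plotGoogleMaps(modelerData, filename = 'C:/Users/Administrator/Desktop/myMap12.htm')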

As shown in Figure 16, in SPSS Modeler V16, three new CLEM expressions under General Functions in the Expression Builder make it easier to plot frequencies on a map:

  • to_geohash: Returns the geohashed string that corresponds to the latitude and longitude at the specified density.
  • stb_centroid_latitude: Returns an integer value for the latitude that corresponds to the centroid of the geohash.
  • stb_centroid_longitude: Returns an integer value for the longitude that corresponds to the centroid of the geohash.

Figure 16. New functions in the Expression Builder in SPSS Modeler
Screen capture showing list of general functions in the Expression Builder in SPSS Modeler
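
To show what a geohash centroid is, the following base R sketch decodes a geohash back into its bounding box and returns the center of that box in decimal degrees. It follows the public geohash scheme and is only an approximation of what the centroid functions conceptually do. The function name geohash_centroid is hypothetical, and the return type of the CLEM functions might differ from the decimal values that this sketch produces.

```r
# Base-32 alphabet used by the public geohash scheme (no a, i, l, o).
geohash_alphabet <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]

# Decode a geohash to the center of its bounding box.
# Hypothetical helper for illustration; not a CLEM function.
geohash_centroid <- function(hash) {
  lat_range <- c(-90, 90)
  lon_range <- c(-180, 180)
  use_lon <- TRUE                          # bits alternate, longitude first
  for (ch in strsplit(tolower(hash), "")[[1]]) {
    index <- match(ch, geohash_alphabet) - 1
    if (is.na(index)) stop("invalid geohash character: ", ch)
    for (shift in 4:0) {                   # most significant bit first
      bit <- (index %/% 2^shift) %% 2
      if (use_lon) {
        mid <- mean(lon_range)
        if (bit == 1) lon_range[1] <- mid else lon_range[2] <- mid
      } else {
        mid <- mean(lat_range)
        if (bit == 1) lat_range[1] <- mid else lat_range[2] <- mid
      }
      use_lon <- !use_lon
    }
  }
  c(latitude = mean(lat_range), longitude = mean(lon_range))
}

centroid <- geohash_centroid("dr5ru7")
print(round(centroid, 4))  # latitude about 40.7565, longitude about -73.9874
```

Plotting one centroid per space-time-box, rather than every raw ping, is what keeps a frequency map readable.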

As shown in Figure 17, the result is a dynamic map, which cannot easily be drafted with the current SPSS Modeler nodes. However, by integrating the R package, you can add a whole new set of visualization capabilities to the solution.

Figure 17. Taxi management stream with R visualization
Image of map with green location pointers

Conclusion

Use space-time-boxes in SPSS Modeler V16 to mine spatial data and create a complete solution for data analysis. Space-time-boxes make it possible to combine traditional data, unstructured data, and spatial data from many different types of data sources (even from a Hadoop cluster). The integration with R expands the possibilities to apply more algorithms, data transformations, and, as in this example, new powerful visualizations.


publish-date=03182014