Context map crawler

Overview

The Context Map Crawler is a dedicated module that periodically collects context data, such as real-time traffic flow and weather conditions, from external services and stores the data in the context map.

You can install this module on any node from which the HBase server can be accessed. The default configuration for the Context Map Crawler runs it as a Spark Streaming job on HDP (Hortonworks Data Platform) nodes so that it can handle huge volumes of context data in a scalable manner. Alternatively, the crawler can run as a systemd daemon.

The following two context map crawlers are available with IBM® IoT Connected Vehicle Insights V3.1.

  • Weather Data Crawler. Collects current weather conditions and forecast data from The Weather Company API.

  • Traffic Flow Data Crawler. Collects INRIX real-time traffic flow data from The Weather Company API.

Figure: Component diagram of the context map crawlers

Weather Data Crawler

This crawler collects current weather conditions for the configured regions and weather forecast data up to 15 days ahead. The crawler divides each region into multiple meshes, each the size of a level 4 GeoHash cell, and queries weather condition and forecast data for each mesh.

Both current weather condition data and forecast data are stored in the same context map. The current condition data continues to accumulate as historical weather data in the context map, while the forecast data is overwritten with the updated forecast in the next crawling iteration. If you specify a future timestamp when you query the weather context, you get the weather forecast data from the context map.

The default interval value for crawling is 1 hour. You can configure the interval value in the properties file.

Configuration parameters for the Weather Data Crawler

The following variables in Ansible® scripts can be changed before installation. Refer to the installation guide for more information.

Variable Description Default value
moma.contextmap.crawler.twcweather.enabled Enables the crawler when set to true; otherwise the crawler is disabled. false
moma.contextmap.crawler.twcweather.interval Crawling interval in seconds. 3600
moma.contextmap.crawler.twcweather.regions List of target regions {zoom: 10, left: 908, top: 402, right: 910, bottom: 404}
moma.contextmap.crawler.twcweather.maxforecast Number of hourly forecast data to be stored. 48
moma.contextmap.crawler.twcweather.threads Number of threads for crawler job. 8
moma.contextmap.crawler.twcweather.backupdir Path to the directory in which backup file will be stored. N/A
moma.contextmap.crawler.twcweather.spark.enabled Launch the crawler as a Spark job when set to true. false
moma.contextmap.crawler.twcweather.spark.executor_cores Number of executor cores 2
moma.contextmap.crawler.twcweather.spark.num_executors Number of executor nodes 3
moma.contextmap.crawler.twcweather.spark.executor_memory Executor's memory size 4G
moma.contextmap.crawler.twcweather.spark.driver_memory Driver's memory size 2G
moma.contextmap.crawler.twcweather.spark.master Master yarn
moma.contextmap.crawler.twcweather.spark.deploy_mode Deploy mode cluster

A region can be defined by giving a zoom level and the tile coordinates of both the upper-left corner and the lower-right corner of the region.
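The tile coordinates correspond to the common XYZ (slippy map) tiling scheme on Web Mercator. As an illustrative sketch (this helper is not part of the product), the tile that contains a given latitude and longitude at a given zoom level can be computed as follows:

```python
import math

def deg2tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) tile coordinates that contain the given
    WGS84 point at the given zoom level (XYZ tiling scheme)."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Central Tokyo falls inside the default weather region
# {zoom: 10, left: 908, top: 402, right: 910, bottom: 404}:
print(deg2tile(35.68, 139.76, 10))  # → (909, 403)
```

Running this for the corners of your area of interest gives the left/top and right/bottom values for the regions variable.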

Your API key is necessary to access The Weather Company API. Define the following variable in the password file for each inventory.

vault_moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"

Alternatively, you can define your API key as plain text in the group variable file for each inventory.

moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"

In both cases, the API key is encrypted by an encryption tool during deployment and is securely stored in the properties file.

Context data specification of weather condition and forecast

Weather context, which includes both current condition data and forecast data, is stored in the context map as shown in the following table.

Type Value
Context Category Weather
Context Source TWC
Context Type HourlyWeather
Feature Category MOMA
Feature Source MOMA
Feature Type grid4

The Context ID of a level 4 GeoHash mesh is the following value.

grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC

The Context Map Service Java library provides several kinds of APIs to query data from the context map. You must give this context ID to those APIs. For more information about the Java API, see the following sections.
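As an illustrative sketch, the Context ID for the mesh that contains a given point can be built by encoding the coordinates as a 4-character GeoHash. The encoder below is a standard GeoHash implementation written in Python for illustration; it is not a product API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision):
    """Standard GeoHash encoding: interleave longitude and latitude
    bisection bits, then map every 5 bits to a base-32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True
    while len(bits) < precision * 5:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

def weather_context_id(lat, lon):
    # Context ID pattern for the hourly weather context:
    # grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC
    return "grid4__MOMA__%s__HourlyWeather__TWC" % geohash(lat, lon, 4)

print(weather_context_id(57.64911, 10.40744))
# → grid4__MOMA__u4pr__HourlyWeather__TWC
```

The same encoder with precision 7 yields the level 7 GeoHash meshes that the traffic flow crawler uses.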

The context map contains the following measures to be provided as weather context.

Measure name Description Type
icon_code TWC's icon code 0 - 47. Refer to Icon Code Specification. Integer
phrase Short description of the weather condition String
day_indicator D: Day, N: Night String
temperature Temperature in degrees Integer
wind_speed Wind speed in km/h Double
wind_direction Wind direction in degrees Double
pressure Mean sea level pressure (in mb) Double
precip_1hr One-hour liquid precipitation amount (mm) Double
snow_1hr One-hour snowfall amount (cm) Double
weather_summary Summarized weather data for offline analysis JSON String
details Raw data of The Weather Company API JSON String

The details measure contains The Weather Company's raw data. Note that the data format of current conditions and the data format of weather forecast are slightly different. For more information about this data, see Weather Company Data - Enhanced Current Conditions > Currents On Demand - v3.0 and Weather Company Data - Enhanced Forecast > Hourly Forecast - (2 Day, 15 Day) - v3.0.

Starting the weather data crawler

If you have installed the weather data crawler as a systemd service (that is, spark.enabled = false), a new service named contextmap-crawler-twcweather is enabled and you can control the crawler by using the systemctl command. Run the following command to start the weather data crawler as a systemd service.

$ systemctl start contextmap-crawler-twcweather

A shell script named start_twcweather_crawler.sh is also installed in the installation directory (the default location is /opt/ibm/cvi/ctxmap/crawler). You can manually start the weather data crawler by running this script.

Traffic Flow Data Crawler (The Weather Company)

The Weather Company Traffic Flow Data Crawler collects INRIX real-time traffic flow data that is published on The Weather Company API. The crawler periodically queries the current traffic conditions and stores them in the context map; the default crawling interval is 15 minutes.

Before the traffic flow data is stored in the context map, the crawler runs bulk map matching on the Dynamic Map Manager (DMM) server. The crawler also converts the INRIX traffic flow data into per-link segment data, depending on the map that is used. For performance reasons, the results of bulk map matching are cached in HBase and reused the next time the same INRIX segment is received. The traffic flow data of each link is grouped by the level 7 GeoHash mesh that contains the starting point of the link.
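The caching behavior can be sketched as follows. This is a hypothetical illustration in plain Python: the function and variable names are invented for the example, and the product keeps this cache in HBase rather than in memory.

```python
match_cache = {}   # INRIX segment ID -> list of matched link segments
dmm_calls = 0      # counts simulated round trips to the DMM server

def bulk_map_match(segment_id):
    """Stand-in for an expensive DMM bulk map-matching request."""
    global dmm_calls
    dmm_calls += 1
    return [{"linkId": "6840003120410", "from": 0.0, "to": 83.752}]

def match_segment(segment_id):
    # Reuse the cached result when the same INRIX segment appears again,
    # so the DMM server is contacted only once per segment.
    if segment_id not in match_cache:
        match_cache[segment_id] = bulk_map_match(segment_id)
    return match_cache[segment_id]

match_segment("1642649919")
match_segment("1642649919")   # second call hits the cache
print(dmm_calls)              # → 1
```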

Configuration parameters of The Weather Company Traffic Flow Data Crawler

The following variables can be changed before installation. Refer to the installation guide for more information.

Variable Description Default value
moma.contextmap.crawler.twctrafficflow.enabled Enables the crawler when set to true. false
moma.contextmap.crawler.twctrafficflow.interval Crawling interval in seconds. 900
moma.contextmap.crawler.twctrafficflow.regions List of target regions {zoom: 12, left: 2045, top: 1360, right: 2047, bottom: 1362}
moma.contextmap.crawler.twctrafficflow.threads Number of threads for crawler job. 8
moma.contextmap.crawler.twctrafficflow.backupdir Path to the directory in which backup file will be stored. N/A
moma.contextmap.crawler.twctrafficflow.spark.enabled Launch the crawler as a Spark job when set to true. false
moma.contextmap.crawler.twctrafficflow.spark.executor_cores Number of executor cores 2
moma.contextmap.crawler.twctrafficflow.spark.num_executors Number of executor nodes 3
moma.contextmap.crawler.twctrafficflow.spark.executor_memory Executor's memory size 4G
moma.contextmap.crawler.twctrafficflow.spark.driver_memory Driver's memory size 2G
moma.contextmap.crawler.twctrafficflow.spark.master Master yarn
moma.contextmap.crawler.twctrafficflow.spark.deploy_mode Deploy mode cluster
moma.contextmap.crawler.twctrafficflow.dmm_user Username for accessing DMM API N/A
moma.contextmap.crawler.twctrafficflow.dmm_password Password for accessing DMM API N/A

As with the Weather Data Crawler, your API key is necessary to access The Weather Company real-time traffic flow API. Define the following variable in the password file for each inventory.

vault_moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"

Alternatively, you can define your API key as plain text in the group variable file for each inventory.

moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"

Both apikey and dmm_password are encrypted by an encryption tool during deployment and are securely stored in the properties files.

Context data specification of real-time traffic flow

Traffic flow data is stored in the context map as shown in the following table.

Type Value
Context Category Traffic
Context Source TWC
Context Type TrafficFlow
Feature Category MOMA
Feature Source MOMA
Feature Type grid7

The Context ID of a level 7 GeoHash mesh is the following value.

grid7__MOMA__<Level 7 GeoHash>__TrafficFlow__TWC

This Context ID is necessary to query context data from the context map by using the Java API. For more information about the Java API, see the following sections.

The context map contains the following measures to provide the real-time INRIX-based traffic flow data.

Measure name Description
flow_summary Summarized traffic flow data throughout the link
flow_detail Low-level flow data based on INRIX's original data

Summarized data

An INRIX segment is not the same as a link defined in the map. In many cases, one INRIX segment is much longer than a map link. However, exceptional cases exist in which the link is much longer than the INRIX segment. Therefore, extra data conversion by map matching is required.

The flow_summary measure contains the aggregated speed data for the whole link, even if a link corresponds to multiple INRIX segments. The following is an example of flow_summary context data.

[
  {
    "mapId" : 1,
    "linkId" : "68400002957288",
    "geoHash" : "9q5ctkg",
    "timestamp" : 1529903401608,
    "forward" : {
      "currentSpeed" : 27,
      "freeFlowSpeed" : 28,
      "averageSpeed" : 27,
      "trafficStatus" : 3,
    },
    "backward" : {
      "currentSpeed" : 27,
      "freeFlowSpeed" : 28,
      "averageSpeed" : 27,
      "trafficStatus" : 3
    }
  }
]

The flow_summary measure is an array of per-link records, each containing summarized traffic flow values in JSON format. If the link is a one-way road, flow_summary contains traffic flow data for only one direction (forward or backward); otherwise, it can contain traffic flow data for both directions.

The following table shows the meaning of each entry for flow_summary data.

Key Value
mapId Map ID
linkId Link ID
geoHash Level 7 GeoHash code in which the starting point of the link exists
timestamp Time stamp of the flow record in Epoch milliseconds
forward Summarized speed data for forward direction
backward Summarized speed data for backward direction

Each summarized speed for both forward direction and backward direction contains the following values.

Key Value
currentSpeed Current speed in km/h
freeFlowSpeed Free-flow speed in km/h
averageSpeed 1 hour average speed in km/h
trafficStatus 0: severe congestion, 1: congestion, 2: steady flow, 3: free flow, -1: closed
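A flow_summary record can be decoded as in the following sketch. This is plain Python written for illustration against the example record above; it does not use any product API.

```python
import json

# trafficStatus codes as documented for flow_summary
TRAFFIC_STATUS = {-1: "closed", 0: "severe congestion", 1: "congestion",
                  2: "steady flow", 3: "free flow"}

flow_summary = json.loads("""
[{"mapId": 1, "linkId": "68400002957288", "geoHash": "9q5ctkg",
  "timestamp": 1529903401608,
  "forward":  {"currentSpeed": 27, "freeFlowSpeed": 28,
               "averageSpeed": 27, "trafficStatus": 3},
  "backward": {"currentSpeed": 27, "freeFlowSpeed": 28,
               "averageSpeed": 27, "trafficStatus": 3}}]
""")

for link in flow_summary:
    # A one-way link carries only one of the two direction keys.
    for direction in ("forward", "backward"):
        if direction in link:
            d = link[direction]
            print(link["linkId"], direction,
                  TRAFFIC_STATUS[d["trafficStatus"]],
                  "%d/%d km/h" % (d["currentSpeed"], d["freeFlowSpeed"]))
```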

Detailed data

The flow_detail measure contains the INRIX original data in addition to the link attributes that result from map matching. See the following example of flow_detail data.

[
  {
    "mapId" : 1,
    "linkId" : "6840003120410",
    "length" : 83.752,
    "timestamp" : 1529903401608,
    "contexts" : [
      {
        "direction" : "FORWARD",
        "from" : 0.0,
        "to" : 83.752,
        "context" : {
          "currentSpeed" : 27.35878,
          "averageSpeed" : 27.35878,
          "freeFlowSpeed" : 28.96812,
          "closed" : false,
          "id" : "1642649919",
          "rawData" : {
            "validTime" : 1529903401608,
            "inrix.country" : "United States of America",
            "inrix.travelTimeMinutes" : 0.675000011920929,
            "inrix.speedBucket" : 3,
            "inrix.speed" : 32,
            "inrix.segmentClosed" : null,
            "inrix.leftHanded" : false,
            "inrix.fow" : 3,
            "inrix.reference" : 17,
            "inrix.frc" : 3,
            "inrix.average" : 17
          }
        }
      }
    ]
  },
  {
    ...
  }
]

Key Value
mapId Map ID
linkId Link ID
length Length of the link in meters
timestamp Time stamp of the traffic flow record in Epoch milliseconds
contexts.direction Flow direction in the link (FORWARD or BACKWARD)
contexts.from Offset from the start of the link to the start of the segment
contexts.to Offset from the start of the link to the end of the segment
contexts.context Flow data details (see the following section)

The flow details data is per the following table.

Key Value
currentSpeed Current speed in km/h
averageSpeed 1 hour average speed in km/h
freeFlowSpeed Free-flow speed in km/h
closed Whether the link is closed
rawData INRIX's original data. Refer to Real-Time Traffic Flow
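Putting the pieces together, a flow_detail record can be interpreted as in the following sketch. This is illustrative Python against a trimmed copy of the example above, not a product API; the speed-ratio heuristic is an assumption added for the example.

```python
import json

flow_detail = json.loads("""
[{"mapId": 1, "linkId": "6840003120410", "length": 83.752,
  "timestamp": 1529903401608,
  "contexts": [{"direction": "FORWARD", "from": 0.0, "to": 83.752,
                "context": {"currentSpeed": 27.35878,
                            "averageSpeed": 27.35878,
                            "freeFlowSpeed": 28.96812,
                            "closed": false}}]}]
""")

for link in flow_detail:
    for seg in link["contexts"]:
        # from/to are offsets within the link, so (to - from) / length
        # is the fraction of the link that this segment covers.
        coverage = (seg["to"] - seg["from"]) / link["length"]
        ctx = seg["context"]
        # Ratio of current speed to free-flow speed as a rough
        # congestion indicator (illustrative heuristic only).
        speed_ratio = ctx["currentSpeed"] / ctx["freeFlowSpeed"]
        print("link %s %s: %.0f%% of link, speed ratio %.2f" %
              (link["linkId"], seg["direction"], coverage * 100, speed_ratio))
```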

Starting the traffic flow data crawler

If you have installed the traffic flow data crawler as a systemd service (that is, spark.enabled = false), a new service that is named contextmap-crawler-twctrafficflow is enabled. Run the following command to start the traffic flow data crawler as a systemd service:

$ systemctl start contextmap-crawler-twctrafficflow

Alternatively, you can manually start the traffic flow data crawler by running the start_twctrafficflow_crawler.sh shell script that is located in the installation directory.