Context map crawler

Overview

The Context Map Crawler is a dedicated module that periodically collects context data, such as real-time traffic flow and weather conditions, from external services and stores the data in the context map.

You can install this module on any node from which the HBase server can be accessed. The default configuration for the Context Map Crawler runs it as a Spark Streaming job on HDP (Hortonworks Data Platform) nodes so that it can handle huge volumes of context data in a scalable manner. Alternatively, the crawler can run as a systemd daemon.

The following two context map crawlers are available with IBM® IoT Connected Vehicle Insights V3.1.

  • Weather Data Crawler. Collects current weather conditions and forecast data from The Weather Company API.

  • Traffic Flow Data Crawler. Collects INRIX real-time traffic flow data from The Weather Company API.

Figure: Component diagram of the context map crawlers

Weather Data Crawler

This crawler collects current weather conditions for the configured regions and weather forecast data up to 15 days ahead. The crawler divides each region into multiple meshes, each the size of a level 4 GeoHash cell, and queries weather condition and forecast data for each mesh.

Both current weather condition data and forecast data are stored in the same context map. The current condition data continues to accumulate as historical weather data in the context map, while the forecast data is overwritten with the updated forecast in the next crawling iteration. If you specify a future timestamp when you query the weather context, you get the weather forecast data from the context map.

The default interval value for crawling is 1 hour. You can configure the interval value in the properties file.

Configuration parameters for the Weather Data Crawler

The following variables in Ansible® scripts can be changed before installation. Refer to the installation guide for more information.

Variable Description Default value
moma.contextmap.crawler.twcweather.enabled Enables the crawler when set to true; otherwise the crawler is disabled. false
moma.contextmap.crawler.twcweather.interval Crawling interval in seconds. 3600
moma.contextmap.crawler.twcweather.regions List of target regions {zoom: 10, left: 908, top: 402, right: 910, bottom: 404}
moma.contextmap.crawler.twcweather.maxforecast Number of hourly forecast data to be stored. 48
moma.contextmap.crawler.twcweather.threads Number of threads for crawler job. 8
moma.contextmap.crawler.twcweather.backupdir Path to the directory in which backup file will be stored. N/A
moma.contextmap.crawler.twcweather.spark.enabled Launch the crawler as a Spark job when set to true. false
moma.contextmap.crawler.twcweather.spark.executor_cores Number of executor cores 2
moma.contextmap.crawler.twcweather.spark.num_executors Number of executor nodes 3
moma.contextmap.crawler.twcweather.spark.executor_memory Executor's memory size 4G
moma.contextmap.crawler.twcweather.spark.driver_memory Driver's memory size 2G
moma.contextmap.crawler.twcweather.spark.master Master yarn
moma.contextmap.crawler.twcweather.spark.deploy_mode Deploy mode cluster

A region can be defined by giving a zoom level and the tile coordinates of both the upper-left corner and the lower-right corner of the region.
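The tile coordinates correspond to the common XYZ (slippy map) tiling scheme on Web Mercator. As an illustrative sketch (this helper is not part of the product), the tile that contains a given latitude and longitude at a given zoom level can be computed as follows:

```python
import math

def deg2tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) tile coordinates that contain the given
    WGS84 point at the given zoom level (XYZ tiling scheme)."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Central Tokyo falls inside the default weather region
# {zoom: 10, left: 908, top: 402, right: 910, bottom: 404}:
print(deg2tile(35.68, 139.76, 10))  # → (909, 403)
```

Running this for the corners of your area of interest gives the left/top and right/bottom values for the regions variable.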

Your API key is necessary to access The Weather Company API. Define the following variable in the password file for each inventory.

vault_moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"

Alternatively, you can define your API key as plain text in the group variable file for each inventory.

moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"

In both cases, the API key is encrypted by an encryption tool during deployment and is securely stored in the properties file.

Context data specification of weather condition and forecast

Weather context, which includes both current condition data and forecast data, is stored in the context map as shown in the following table.

Type Value
Context Category Weather
Context Source TWC
Context Type HourlyWeather
Feature Category MOMA
Feature Source MOMA
Feature Type grid4

The Context ID of a level 4 GeoHash mesh is the following value.

grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC

The Context Map Service Java library provides several kinds of APIs to query data from the context map. You must give this context ID to those APIs. For more information about the Java API, see the following sections.
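As an illustrative sketch, the Context ID for the mesh that contains a given point can be built by encoding the coordinates as a 4-character GeoHash. The encoder below is a standard GeoHash implementation written in Python for illustration; it is not a product API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision):
    """Standard GeoHash encoding: interleave longitude and latitude
    bisection bits, then map every 5 bits to a base-32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True
    while len(bits) < precision * 5:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

def weather_context_id(lat, lon):
    # Context ID pattern for the hourly weather context:
    # grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC
    return "grid4__MOMA__%s__HourlyWeather__TWC" % geohash(lat, lon, 4)

print(weather_context_id(57.64911, 10.40744))
# → grid4__MOMA__u4pr__HourlyWeather__TWC
```

The same encoder with precision 7 yields the level 7 GeoHash meshes that the traffic flow crawler uses.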

The context map contains the following measures to be provided as weather context.

Measure name Description Type
icon_code TWC's icon code 0 - 47. Refer to Icon Code Specification. Integer
phrase Short description of the weather condition String
day_indicator D: Day, N: Night String
temperature Temperature in degrees Integer
wind_speed Wind speed in km/h Double
wind_direction Wind direction in degrees Double
pressure Mean sea level pressure (in mb) Double
precip_1hr One-hour liquid precipitation amount (mm) Double
snow_1hr One-hour snowfall amount (cm) Double
weather_summary Summarized weather data for offline analysis JSON String
details Raw data of The Weather Company API JSON String

The details measure contains The Weather Company's raw data. Note that the data format of current conditions and the data format of weather forecast are slightly different. For more information about this data, see Weather Company Data - Enhanced Current Conditions > Currents On Demand - v3.0 and Weather Company Data - Enhanced Forecast > Hourly Forecast - (2 Day, 15 Day) - v3.0.

Starting the weather data crawler

If you have installed the weather data crawler as a systemd service (that is, spark.enabled = false), a new service named contextmap-crawler-twcweather is enabled and you can control the crawler by using the systemctl command. Run the following command to start the weather data crawler as a systemd service.

$ systemctl start contextmap-crawler-twcweather

A shell script named start_twcweather_crawler.sh is also installed in the installation directory (the default location is /opt/ibm/cvi/ctxmap/crawler). You can manually start the weather data crawler by running this script.

Traffic Flow Data Crawler (The Weather Company)

The Weather Company Traffic Flow Data Crawler collects INRIX real-time traffic flow data that is published on The Weather Company API. The crawler periodically queries the current traffic conditions and stores them in the context map; the default crawling interval is 15 minutes.

Before the traffic flow data is stored in the context map, the crawler runs bulk map matching on the Dynamic Map Manager (DMM) server. The crawler also converts the INRIX traffic flow data into per-link segment data, depending on the map that is used. For performance reasons, the results of bulk map matching are cached in HBase and reused the next time the same INRIX segment is received. The traffic flow data of each link is grouped by the level 7 GeoHash mesh that contains the starting point of the link.
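The caching behavior can be sketched as follows. This is a hypothetical illustration in plain Python: the function and variable names are invented for the example, and the product keeps this cache in HBase rather than in memory.

```python
match_cache = {}   # INRIX segment ID -> list of matched link segments
dmm_calls = 0      # counts simulated round trips to the DMM server

def bulk_map_match(segment_id):
    """Stand-in for an expensive DMM bulk map-matching request."""
    global dmm_calls
    dmm_calls += 1
    return [{"linkId": "6840003120410", "from": 0.0, "to": 83.752}]

def match_segment(segment_id):
    # Reuse the cached result when the same INRIX segment appears again,
    # so the DMM server is contacted only once per segment.
    if segment_id not in match_cache:
        match_cache[segment_id] = bulk_map_match(segment_id)
    return match_cache[segment_id]

match_segment("1642649919")
match_segment("1642649919")   # second call hits the cache
print(dmm_calls)              # → 1
```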

Configuration parameters of The Weather Company Traffic Flow Data Crawler

The following variables can be changed before installation. Refer to the installation guide for more information.

Variable Description Default value
moma.contextmap.crawler.twctrafficflow.enabled Enables the crawler when set to true. false
moma.contextmap.crawler.twctrafficflow.interval Crawling interval in seconds. 900
moma.contextmap.crawler.twctrafficflow.regions List of target regions {zoom: 12, left: 2045, top: 1360, right: 2047, bottom: 1362}
moma.contextmap.crawler.twctrafficflow.threads Number of threads for crawler job. 8
moma.contextmap.crawler.twctrafficflow.backupdir Path to the directory in which backup file will be stored. N/A
moma.contextmap.crawler.twctrafficflow.spark.enabled Launch the crawler as a Spark job when set to true. false
moma.contextmap.crawler.twctrafficflow.spark.executor_cores Number of executor cores 2
moma.contextmap.crawler.twctrafficflow.spark.num_executors Number of executor nodes 3
moma.contextmap.crawler.twctrafficflow.spark.executor_memory Executor's memory size 4G
moma.contextmap.crawler.twctrafficflow.spark.driver_memory Driver's memory size 2G
moma.contextmap.crawler.twctrafficflow.spark.master Master yarn
moma.contextmap.crawler.twctrafficflow.spark.deploy_mode Deploy mode cluster
moma.contextmap.crawler.twctrafficflow.dmm_user Username for accessing DMM API N/A
moma.contextmap.crawler.twctrafficflow.dmm_password Password for accessing DMM API N/A

As with the Weather Data Crawler, your API key is necessary to access The Weather Company real-time traffic flow API. Define the following variable in the password file for each inventory.

vault_moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"

Alternatively, you can define your API key as plain text in the group variable file for each inventory.

moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"

Both apikey and dmm_password are encrypted by an encryption tool during deployment and are securely stored in the properties files.

Context data specification of real-time traffic flow

Traffic flow data is stored in the context map as shown in the following table.

Type Value
Context Category Traffic
Context Source TWC
Context Type TrafficFlow
Feature Category MOMA
Feature Source MOMA
Feature Type grid7

The Context ID of a level 7 GeoHash mesh is the following value.

grid7__MOMA__<Level 7 GeoHash>__TrafficFlow__TWC

This Context ID is necessary to query context data from the context map by using the Java API. For more information about the Java API, see the following sections.

The context map contains the following measures to provide the real-time INRIX-based traffic flow data.

Measure name Description
flow_summary Summarized traffic flow data throughout the link
flow_detail Low-level flow data based on INRIX's original data

Summarized data

An INRIX segment is not the same as a link defined in the map. In many cases, one INRIX segment is much longer than a map link. However, exceptional cases exist in which the link is much longer than the INRIX segment. Therefore, extra data conversion by map matching is required.

The flow_summary measure contains the aggregated speed data for the whole link, even if a link corresponds to multiple INRIX segments. The following is an example of flow_summary context data.

[
  {
    "mapId" : 1,
    "linkId" : "68400002957288",
    "geoHash" : "9q5ctkg",
    "timestamp" : 1529903401608,
    "forward" : {
      "currentSpeed" : 27,
      "freeFlowSpeed" : 28,
      "averageSpeed" : 27,
      "trafficStatus" : 3,
    },
    "backward" : {
      "currentSpeed" : 27,
      "freeFlowSpeed" : 28,
      "averageSpeed" : 27,
      "trafficStatus" : 3
    }
  }
]

The flow_summary measure is an array of per-link records, each containing summarized traffic flow values in JSON format. If the link is a one-way road, flow_summary contains traffic flow data for only one direction (forward or backward); otherwise, it can contain traffic flow data for both directions.

The following table shows the meaning of each entry for flow_summary data.

Key Value
mapId Map ID
linkId Link ID
geoHash Level 7 GeoHash code in which the starting point of the link exists
timestamp Time stamp of the flow record in Epoch milliseconds
forward Summarized speed data for forward direction
backward Summarized speed data for backward direction

Each summarized speed for both forward direction and backward direction contains the following values.

Key Value
currentSpeed Current speed in km/h
freeFlowSpeed Free-flow speed in km/h
averageSpeed 1 hour average speed in km/h
trafficStatus 0: severe congestion, 1: congestion, 2: steady flow, 3: free flow, -1: closed
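A flow_summary record can be decoded as in the following sketch. This is plain Python written for illustration against the example record above; it does not use any product API.

```python
import json

# trafficStatus codes as documented for flow_summary
TRAFFIC_STATUS = {-1: "closed", 0: "severe congestion", 1: "congestion",
                  2: "steady flow", 3: "free flow"}

flow_summary = json.loads("""
[{"mapId": 1, "linkId": "68400002957288", "geoHash": "9q5ctkg",
  "timestamp": 1529903401608,
  "forward":  {"currentSpeed": 27, "freeFlowSpeed": 28,
               "averageSpeed": 27, "trafficStatus": 3},
  "backward": {"currentSpeed": 27, "freeFlowSpeed": 28,
               "averageSpeed": 27, "trafficStatus": 3}}]
""")

for link in flow_summary:
    # A one-way link carries only one of the two direction keys.
    for direction in ("forward", "backward"):
        if direction in link:
            d = link[direction]
            print(link["linkId"], direction,
                  TRAFFIC_STATUS[d["trafficStatus"]],
                  "%d/%d km/h" % (d["currentSpeed"], d["freeFlowSpeed"]))
```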

Detailed data

The flow_detail measure contains the INRIX original data in addition to the link attributes that result from map matching. See the following example of flow_detail data.

[
  {
    "mapId" : 1,
    "linkId" : "6840003120410",
    "length" : 83.752,
    "timestamp" : 1529903401608,
    "contexts" : [
      {
        "direction" : "FORWARD",
        "from" : 0.0,
        "to" : 83.752,
        "context" : {
          "currentSpeed" : 27.35878,
          "averageSpeed" : 27.35878,
          "freeFlowSpeed" : 28.96812,
          "closed" : false,
          "id" : "1642649919",
          "rawData" : {
            "validTime" : 1529903401608,
            "inrix.country" : "United States of America",
            "inrix.travelTimeMinutes" : 0.675000011920929,
            "inrix.speedBucket" : 3,
            "inrix.speed" : 32,
            "inrix.segmentClosed" : null,
            "inrix.leftHanded" : false,
            "inrix.fow" : 3,
            "inrix.reference" : 17,
            "inrix.frc" : 3,
            "inrix.average" : 17
          }
        }
      }
    ]
  },
  {
    ...
  }
]

Key Value
mapId Map ID
linkId Link ID
length Length of the link in meters
timestamp Time stamp of the traffic flow record in Epoch milliseconds
contexts.direction Flow direction in the link (FORWARD or BACKWARD)
contexts.from Offset from the start of the link to the start of the segment
contexts.to Offset from the start of the link to the end of the segment
contexts.context Flow data details (see the following section)

The flow details data is per the following table.

Key Value
currentSpeed Current speed in km/h
averageSpeed 1 hour average speed in km/h
freeFlowSpeed Free-flow speed in km/h
closed Whether the link is closed
rawData INRIX's original data. Refer to Real-Time Traffic Flow
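Putting the pieces together, a flow_detail record can be interpreted as in the following sketch. This is illustrative Python against a trimmed copy of the example above, not a product API; the speed-ratio heuristic is an assumption added for the example.

```python
import json

flow_detail = json.loads("""
[{"mapId": 1, "linkId": "6840003120410", "length": 83.752,
  "timestamp": 1529903401608,
  "contexts": [{"direction": "FORWARD", "from": 0.0, "to": 83.752,
                "context": {"currentSpeed": 27.35878,
                            "averageSpeed": 27.35878,
                            "freeFlowSpeed": 28.96812,
                            "closed": false}}]}]
""")

for link in flow_detail:
    for seg in link["contexts"]:
        # from/to are offsets within the link, so (to - from) / length
        # is the fraction of the link that this segment covers.
        coverage = (seg["to"] - seg["from"]) / link["length"]
        ctx = seg["context"]
        # Ratio of current speed to free-flow speed as a rough
        # congestion indicator (illustrative heuristic only).
        speed_ratio = ctx["currentSpeed"] / ctx["freeFlowSpeed"]
        print("link %s %s: %.0f%% of link, speed ratio %.2f" %
              (link["linkId"], seg["direction"], coverage * 100, speed_ratio))
```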

Starting the traffic flow data crawler

If you have installed the traffic flow data crawler as a systemd service (that is, spark.enabled = false), a new service that is named contextmap-crawler-twctrafficflow is enabled. Run the following command to start the traffic flow data crawler as a systemd service:

$ systemctl start contextmap-crawler-twctrafficflow

Alternatively, you can manually start the traffic flow data crawler by running the start_twctrafficflow_crawler.sh shell script that is located in the installation directory.