Context map crawler
Overview
The Context Map Crawler is a dedicated module that periodically collects context data, such as real-time traffic flow and weather conditions, from external sites and stores the data in the context map.
You can install this module on any node from which the HBase server can be accessed. By default, the Context Map Crawler is configured to run as a Spark Streaming job on HDP (Hortonworks Data Platform) nodes so that it can handle huge volumes of context data in a scalable manner. The crawler can be run as a Spark job or as a systemd daemon.
The following two context map crawlers are available with IBM® IoT Connected Vehicle Insights V3.1.
- Weather Data Crawler: collects current weather condition and forecast data from The Weather Company API.
- Traffic Flow Data Crawler: collects INRIX real-time traffic flow data from The Weather Company API.
Weather Data Crawler
This crawler collects current weather conditions for arbitrary regions, and weather forecast data up to 15 days ahead. The crawler divides the given region into multiple meshes, each the size of a level 4 GeoHash cell, and queries weather condition and forecast data for each mesh.
Both current weather condition data and forecast data are stored in the same context map. The current condition data accumulates as historical weather data in the context map, while the forecast data is overwritten when the forecast is updated in the next crawling iteration. If you give a future timestamp when you query the weather context, you get the weather forecast data from the context map.
The default interval value for crawling is 1 hour. You can configure the interval value in the properties file.
Configuration parameters for the Weather Data Crawler
The following variables in Ansible® scripts can be changed before installation. Refer to the installation guide for more information.
| Variable | Description | Default value |
|---|---|---|
| moma.contextmap.crawler.twcweather.enabled | Enables the crawler when set to true; otherwise, the crawler is disabled. | false |
| moma.contextmap.crawler.twcweather.interval | Crawling interval in seconds. | 3600 |
| moma.contextmap.crawler.twcweather.regions | List of target regions. | - {zoom: 10, left: 908, top: 402, right: 910, bottom: 404} |
| moma.contextmap.crawler.twcweather.maxforecast | Number of hourly forecast records to be stored. | 48 |
| moma.contextmap.crawler.twcweather.threads | Number of threads for the crawler job. | 8 |
| moma.contextmap.crawler.twcweather.backupdir | Path to the directory in which backup files are stored. | N/A |
| moma.contextmap.crawler.twcweather.spark.enabled | Whether to launch the crawler as a Spark job. | false |
| moma.contextmap.crawler.twcweather.spark.executor_cores | Number of executor cores. | 2 |
| moma.contextmap.crawler.twcweather.spark.num_executors | Number of executor nodes. | 3 |
| moma.contextmap.crawler.twcweather.spark.executor_memory | Executor memory size. | 4G |
| moma.contextmap.crawler.twcweather.spark.driver_memory | Driver memory size. | 2G |
| moma.contextmap.crawler.twcweather.spark.master | Spark master. | yarn |
| moma.contextmap.crawler.twcweather.spark.deploy_mode | Spark deploy mode. | cluster |
A region can be defined by giving a zoom level and the tile coordinates of both the upper-left corner and the lower-right corner of the region.
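For example, the following group variable definition specifies two rectangular regions at zoom level 10. This is a sketch: the first entry matches the documented default value, and the second entry's tile coordinates are hypothetical.
moma.contextmap.crawler.twcweather.regions:
- {zoom: 10, left: 908, top: 402, right: 910, bottom: 404}
- {zoom: 10, left: 911, top: 402, right: 913, bottom: 404}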
Your API key is necessary to access The Weather Company API. Define the following variable in the password file for each inventory.
vault_moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"
Alternatively, you can define your API key as plain text in the group variable file for each inventory.
moma.contextmap.crawler.twcweather.apikey = "<Your API Key>"
In both cases, apikey is encoded by an encryption tool during deployment and is securely stored in the properties file.
Context data specification of weather condition and forecast
Weather context, which includes current condition data and forecast data, is stored in the context map as shown in the following table.
| Type | Value |
|---|---|
| Context Category | Weather |
| Context Source | TWC |
| Context Type | HourlyWeather |
| Feature Category | MOMA |
| Feature Source | MOMA |
| Feature Type | grid4 |
The Context ID of a level 4 GeoHash mesh is the following value.
grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC
The Context Map Service Java library provides several kinds of APIs to query data from the context map. You must give this context ID to those APIs. For more information about the Java API, see the following sections.
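Because the Context ID is a plain string, you can assemble it from its parts before you call the query APIs. The following Java fragment is a minimal sketch; the GeoHash value is a hypothetical example.
// Build the Context ID for a level 4 GeoHash weather mesh.
// Format: grid4__MOMA__<Level 4 GeoHash>__HourlyWeather__TWC
String geoHash4 = "9q5c"; // hypothetical level 4 GeoHash
String contextId = "grid4__MOMA__" + geoHash4 + "__HourlyWeather__TWC";
// Pass contextId to the Context Map Service query APIs.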
The context map provides the following measures as weather context.
| Measure name | Description | Type |
|---|---|---|
| icon_code | TWC's icon code 0 - 47. Refer to Icon Code Specification. | Integer |
| phrase | Short description of the weather condition | String |
| day_indicator | D: Day, N: Night | String |
| temperature | Temperature in degrees | Integer |
| wind_speed | Wind speed in km/h | Double |
| wind_direction | Wind direction in degrees | Double |
| pressure | Mean sea level pressure in mb | Double |
| precip_1hr | One-hour liquid precipitation amount in mm | Double |
| snow_1hr | One-hour snowfall amount in cm | Double |
| weather_summary | Summarized weather data for offline analysis | JSON String |
| details | Raw data of The Weather Company API | JSON String |
The details measure contains The Weather Company's raw data. Note that the data format of current conditions and the data format of the weather forecast are slightly different. For more information about this data, see Weather Company Data - Enhanced Current Conditions > Currents On Demand - v3.0 and Weather Company Data - Enhanced Forecast > Hourly Forecast - (2 Day, 15 Day) - v3.0.
Starting the weather data crawler
If you have installed the weather data crawler as a systemd service (that is, spark.enabled = false), a new service named contextmap-crawler-twcweather is enabled, and you can control the crawler by using the systemctl command. Run the following command to start the weather data crawler as a systemd service:
$ systemctl start contextmap-crawler-twcweather
A shell script named start_twcweather_crawler.sh is also installed in the installation directory (the default location is /opt/ibm/cvi/ctxmap/crawler). You can start the weather data crawler manually by executing this script.
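Because the crawler runs as a regular systemd service in this configuration, the standard systemctl subcommands also apply. For example, to start the crawler automatically at boot time and to check whether it is currently running:
$ systemctl enable contextmap-crawler-twcweather
$ systemctl status contextmap-crawler-twcweather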
Traffic Flow Data Crawler (The Weather Company)
The Weather Company Traffic Flow Data Crawler collects INRIX real-time traffic flow data that is published on The Weather Company API. The crawler periodically queries the current traffic conditions and stores them in the context map every 15 minutes, which is the default period configuration.
Before the traffic flow data is stored in the context map, the crawler runs bulk map matching on the Dynamic Map Manager (DMM) server. The crawler also converts the INRIX traffic flow data into per-link segment data, depending on the map that is used. For performance reasons, the results of bulk map matching are cached in HBase and reused the next time the same INRIX segment is encountered. The traffic flow data of each link is grouped by the level 7 GeoHash mesh, which is determined by the starting point of the link.
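The following self-contained Java sketch outlines this caching flow. It is only an illustration of the technique: the real crawler caches in HBase and calls the DMM API, while here a HashMap and a stub method stand in for both.
import java.util.*;

class MapMatchCacheSketch {
    // Stand-in for the HBase-backed cache of bulk map-matching results.
    private final Map<String, List<String>> cache = new HashMap<>();

    // Return the map links for an INRIX segment, reusing a cached result if one exists.
    List<String> matchedLinks(String inrixSegmentId) {
        return cache.computeIfAbsent(inrixSegmentId, this::bulkMapMatch);
    }

    // Stand-in for the bulk map-matching request to the DMM server.
    private List<String> bulkMapMatch(String inrixSegmentId) {
        return Collections.emptyList();
    }
}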
Configuration parameters of The Weather Company Traffic Flow Data Crawler
The following variables can be changed before installation. Refer to the installation guide for more information.
| Variable | Description | Default value |
|---|---|---|
| moma.contextmap.crawler.twctrafficflow.enabled | Enables the crawler when set to true. | false |
| moma.contextmap.crawler.twctrafficflow.interval | Crawling interval in seconds. | 900 |
| moma.contextmap.crawler.twctrafficflow.regions | List of target regions. | - {zoom: 12, left: 2045, top: 1360, right: 2047, bottom: 1362} |
| moma.contextmap.crawler.twctrafficflow.threads | Number of threads for the crawler job. | 8 |
| moma.contextmap.crawler.twctrafficflow.backupdir | Path to the directory in which backup files are stored. | N/A |
| moma.contextmap.crawler.twctrafficflow.spark.enabled | Whether to launch the crawler as a Spark job. | false |
| moma.contextmap.crawler.twctrafficflow.spark.executor_cores | Number of executor cores. | 2 |
| moma.contextmap.crawler.twctrafficflow.spark.num_executors | Number of executor nodes. | 3 |
| moma.contextmap.crawler.twctrafficflow.spark.executor_memory | Executor memory size. | 4G |
| moma.contextmap.crawler.twctrafficflow.spark.driver_memory | Driver memory size. | 2G |
| moma.contextmap.crawler.twctrafficflow.spark.master | Spark master. | yarn |
| moma.contextmap.crawler.twctrafficflow.spark.deploy_mode | Spark deploy mode. | cluster |
| moma.contextmap.crawler.twctrafficflow.dmm_user | Username for accessing the DMM API. | N/A |
| moma.contextmap.crawler.twctrafficflow.dmm_password | Password for accessing the DMM API. | N/A |
As with The Weather Company Weather Data Crawler, your API key is necessary to access The Weather Company's real-time traffic flow API. Define the following variable in the password file for each inventory.
vault_moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"
Alternatively, you can define your API key as plain text in the group variable file for each inventory.
moma.contextmap.crawler.twctrafficflow.apikey = "<Your API Key>"
Both apikey and dmm_password are encoded by an encryption tool during deployment and are securely stored in the properties files.
Context data specification of real-time traffic flow
Traffic flow data is stored in the context map as shown in the following table.
| Type | Value |
|---|---|
| Context Category | Traffic |
| Context Source | TWC |
| Context Type | TrafficFlow |
| Feature Category | MOMA |
| Feature Source | MOMA |
| Feature Type | grid7 |
The Context ID of a level 7 GeoHash mesh is the following value.
grid7__MOMA__<Level 7 GeoHash>__TrafficFlow__TWC
This Context ID is necessary to query context data from the context map by using the Java API. For more information about the Java API, see the following sections.
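GeoHash codes nest by prefix: a level 4 code is the first four characters of the corresponding level 7 code. You can therefore derive the weather mesh ID for the area that contains a traffic mesh directly from the level 7 GeoHash, as in the following sketch. The GeoHash value is taken from the flow_summary example later in this section.
// GeoHash codes nest by prefix: level 4 is the first 4 characters of level 7.
String geoHash7 = "9q5ctkg"; // from the flow_summary example below
String trafficId = "grid7__MOMA__" + geoHash7 + "__TrafficFlow__TWC";
String weatherId = "grid4__MOMA__" + geoHash7.substring(0, 4) + "__HourlyWeather__TWC";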
The context map contains the following measures to provide the real-time INRIX-based traffic flow data.
| Measure name | Description |
|---|---|
| flow_summary | Summarized traffic flow data throughout the link |
| flow_detail | Low-level flow data based on INRIX's original data |
Summarized data
An INRIX segment is not the same as a link defined in the map. In many cases, the length of one INRIX segment is much longer than the length of a map link. However, exceptional cases exist in which the link length is much longer than the length of the INRIX segment. Therefore, extra data conversion by map matching is required.
The flow_summary measure contains the aggregated speed data throughout the link, even if a link corresponds to multiple INRIX segments. The following is an example of flow_summary context data.
[
{
"mapId" : 1,
"linkId" : "68400002957288",
"geoHash" : "9q5ctkg",
"timestamp" : 1529903401608,
"forward" : {
"currentSpeed" : 27,
"freeFlowSpeed" : 28,
"averageSpeed" : 27,
"trafficStatus" : 3,
},
"backward" : {
"currentSpeed" : 27,
"freeFlowSpeed" : 28,
"averageSpeed" : 27,
"trafficStatus" : 3
}
}
]
The flow_summary measure is an array of per-link data that contains summarized traffic flow values in JSON format. If the link is a one-way road, flow_summary contains traffic flow data for either the forward direction or the backward direction. Otherwise, it might contain traffic flow data for both the forward direction and the backward direction.
The following table shows the meaning of each entry in the flow_summary data.
| Key | Value |
|---|---|
| mapId | Map ID |
| linkId | Link ID |
| geoHash | Level 7 GeoHash code in which the starting point of the link exists |
| timestamp | Time stamp of the flow record in Epoch milliseconds |
| forward | Summarized speed data for forward direction |
| backward | Summarized speed data for backward direction |
Each summarized speed for both forward direction and backward direction contains the following values.
| Key | Value |
|---|---|
| currentSpeed | Current speed in km/h |
| freeFlowSpeed | Free-flow speed in km/h |
| averageSpeed | 1 hour average speed in km/h |
| trafficStatus | 0: severe congestion, 1: congestion, 2: steady flow, 3: free flow, -1: closed |
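The trafficStatus codes map to a fixed set of labels, so a simple lookup suffices. The following is a minimal Java sketch based on the mapping in the table above.
// Map trafficStatus codes to human-readable labels (values from the table above).
static String trafficStatusLabel(int status) {
    switch (status) {
        case 0:  return "severe congestion";
        case 1:  return "congestion";
        case 2:  return "steady flow";
        case 3:  return "free flow";
        case -1: return "closed";
        default: return "unknown";
    }
}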
Detailed data
The flow_detail measure contains INRIX's original data in addition to the link attributes that result from map matching. See the following example of flow_detail data.
[
{
"mapId" : 1,
"linkId" : "6840003120410",
"length" : 83.752,
"timestamp" : 1529903401608,
"contexts" : [
{
"direction" : "FORWARD",
"from" : 0.0,
"to" : 83.752,
"context" : {
"currentSpeed" : 27.35878,
"averageSpeed" : 27.35878,
"freeFlowSpeed" : 28.96812,
"closed" : false,
"id" : "1642649919",
"rawData" : {
"validTime" : 1529903401608,
"inrix.country" : "United States of America",
"inrix.travelTimeMinutes" : 0.675000011920929,
"inrix.speedBucket" : 3,
"inrix.speed" : 32,
"inrix.segmentClosed" : null,
"inrix.leftHanded" : false,
"inrix.fow" : 3,
"inrix.reference" : 17,
"inrix.frc" : 3,
"inrix.average" : 17
}
}
}
]
},
{
...
}
]
| Key | Value |
|---|---|
| mapId | Map ID |
| linkId | Link ID |
| length | Length of the link in meters |
| timestamp | Time stamp of the traffic flow record in Epoch milliseconds |
| contexts.direction | Flow direction in the link (FORWARD or BACKWARD) |
| contexts.from | Offset of the start point of the segment in the link, in meters |
| contexts.to | Offset of the end point of the segment in the link, in meters |
| contexts.context | Flow data details (see the following section) |
The flow data details are described in the following table.
| Key | Value |
|---|---|
| currentSpeed | Current speed in km/h |
| averageSpeed | 1 hour average speed in km/h |
| freeFlowSpeed | Free-flow speed in km/h |
| closed | Whether the link is closed |
| rawData | INRIX's original data. Refer to Real-Time Traffic Flow |
Starting the traffic flow data crawler
If you have installed the traffic flow data crawler as a systemd service (that is, spark.enabled = false), a new service named contextmap-crawler-twctrafficflow is enabled. Run the following command to start the traffic flow data crawler as a systemd service:
$ systemctl start contextmap-crawler-twctrafficflow
Alternatively, you can start the traffic flow data crawler manually by running the start_twctrafficflow_crawler.sh shell script that is located in the installation directory.