SD-WAN Nokia-Nuage Networks Collector Architecture & Technical Insights / Features Guide

About

This document presents the architecture of the Nokia-Nuage collector along with the technical insights and features it supports.

Architecture

The diagram below shows the design for Nokia-Nuage collector per tenant.

Architecture Nokia Nuage

The table below lists the cronjobs, whether each one runs continuously, its interval, and whether it is configurable.

CRONJOB                         CONTINUOUS?  INTERVAL    CONFIGURABLE?
Alarm                           Yes          No          Yes
JSON Decoder                    Yes          5 minutes   Yes
Installer                       No           24 hours    Yes
Create device                   No           30 minutes  Yes
Tunnel description runner       No           1 hour      Yes
Device description runner       No           15 minutes  Yes
Interface speed updater         No           1 hour      Yes
Device health runner            No           5 minutes   Yes
Tunnel runner                   No           5 minutes   Yes
Interface runner                No           5 minutes   Yes
Interface queue runner          No           5 minutes   Yes
SLA stats runner                No           5 minutes   Yes
APM tunnel runner               No           5 minutes   Yes
vPort & LAN aggregation runner  No           5 minutes   Yes
Events runner                   No           5 minutes   Yes
IKE Tunnel Poller               No           5 minutes   No
IKE Probe Poller                No           5 minutes   No
ITE Interface Poller            No           5 minutes   No

Technical Insights / Features

Nokia-Nuage Collector

NPM Tunnels / Probes for Tunnel Objects and Object Count Relation to it

  • Nokia-Nuage tunnels are probes which are based on different Network Performance Measurement (NPM) classes. Each combination of the following can have multiple probes based on the NPM classes.
    • Source Device
    • Source Port
    • Destination Device
    • Destination Port
  • Each NPM class probe has its own metric value for jitter, latency, and loss percentage.
  • Tunnel / Probe object-naming convention is,
    <Source Device>::<SourceUplink>→<Destination Device>::<Destination Uplink>::<NPM Group>
  • If there are N number of NPM classes and the classes are all attached to the same source and destination device, then for each unique link there are N probe objects.
  • For example, if you have 3 different NPM classes, the probe object count is multiplied by 3 as well.
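The naming convention and the object-count relation above can be illustrated with a minimal Python sketch. The device, uplink, and NPM group names used here are hypothetical; only the name layout comes from the convention above.

```python
from itertools import product


def probe_object_names(links, npm_groups):
    """Return one probe object name per (link, NPM group) pair.

    links: iterable of (src_device, src_uplink, dst_device, dst_uplink).
    npm_groups: iterable of NPM group names.
    """
    return [
        f"{src}::{src_up}→{dst}::{dst_up}::{group}"
        for (src, src_up, dst, dst_up), group in product(links, npm_groups)
    ]


# Hypothetical topology: 2 unique links and 3 NPM classes.
links = [
    ("NSG-A", "port1", "NSG-B", "port1"),
    ("NSG-A", "port2", "NSG-B", "port2"),
]
names = probe_object_names(links, ["Gold", "Silver", "Bronze"])
print(len(names))  # one probe object per link per NPM class
print(names[0])
```

With N NPM classes attached to the same source and destination devices, each unique link contributes N probe objects, so the object count grows linearly with the number of classes.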
Nokia Nuage Object Manager

NPM-level Object Groups

Based on NPM Probe's tunnel object naming, the NPM groups are created with the Object Group rules based on NPM name.

Nokia Nuage Object Groups 1

Flow Data APM vs. Tunnel Data NPM

  • Flowstats data from Elasticsearch has an APM group, which relates to Application Performance Measurement (APM).
  • Probestats data from Elasticsearch has data points based on both APM and NPM groups.
    • At present, for Nokia-Nuage, only NPM group-based Probestats are considered for tunnel object creation.
  • APM and NPM groups are two separate entities in Virtualized Services Directory (VSD) and are not the same even if they have the same names.

    Example

    APM NPM

Device / Port / VLAN concept for Interface and Object Naming

  • For Nokia-Nuage, each device is a NSGateway.
  • Each NSGateway can have multiple NSPorts.
    • NSPorts can be NETWORK and ACCESS. NETWORK port is considered for interface creation.
      • Each NSPort can have multiple VLANs and each VLAN has a unique VLAN number.
        • Each VLAN has a specific link type of transport - internet or other.
  • Each VLAN is treated as an interface, with the object type device interface.
  • Each interface object is named under its device as <NSPort-Name>::<VlanID>.

LAN Aggregation for Domain, Subnet, and concept of vPort

  • Nokia-Nuage devices are segregated under logical entities such as,
    • domain
    • zone
    • subnet
  • Each enterprise can have multiple domains / VPNs.
    • Each domain can have multiple zones.
      • Each zone can have multiple subnets.
        • Each subnet is associated with a NSgateway.
        • Each subnet can be linked to 1 or more vPort objects.
          • vPort is a base raw object and the data is obtained from the elasticsearch using the Nuage_Vport index.
  • Aggregation functionality is available for the domain and subnet level. Object type is the same as device interface.
  • Subnet level aggregation objects are under their respective device to which the subnet is associated.
  • vPort level raw objects are under their respective device.
  • Domain is a larger entity than a device. For domain-level objects, a separate dummy device is created with the <tenant name>.
  • Domain-level and subnet-level objects are grouped using Object Groups to provide drill-down reporting from domain to subnets.
Nokia Nuage Object Groups 2

Alarm Runner and AMQP support from VSD

  • For alarms, Virtualized Services Directory (VSD) supports the AMQP Bus, for which VSD exposes port 5672.
  • An alarm connection can also be made using JMS. However, for now, the collector does not use this type of connection.
  • The Alarm Runner container runs continuously. Other containers run as part of a cronjob.
  • A durable connection is used for AMQP. For each separate connection, VSD provides 10 minutes of durability. This allows alarm data to be obtained if the connection goes down for up to 10 minutes. After this time, alarm data is lost.

Events vs. Alarms in Nokia-Nuage

Events

  • Events are optional VSD logs, made available through the API or as AMQP Bus push notifications.
  • Events are fetched from VSD directly through API calls, based on the timestamp value, at periodic polling intervals.
  • Only events related to the device, NSPorts, and VLANs are pushed to SevOne NMS as alerts with severity INFO.

Alarms

  • Alarms are actual error conditions generated on VSD and pushed to the collector via the AMQP Bus.
  • Alarms are also pushed as alerts to SevOne NMS, with different severity levels such as MAJOR and CRITICAL.

Device Health with / without Elasticsearch support

  • Nokia-Nuage versions <= 5.3.3 do not support the Nuage_Sysmon index on Elasticsearch. For these versions, device health data is fetched from VSD.
  • Nokia-Nuage versions >= 5.4.1 provide device resource stats in Elasticsearch through the Nuage_Sysmon index, in addition to VSD.
  • When deploying the Nokia-Nuage collector, the es_device_health flag can be set to true or false. When the flag is set to true, data is fetched from Elasticsearch; otherwise, it is fetched from VSD.

BW-up and BW-down Interface Indicators - relationship with VSD rate-limiter profile

  • Each VLAN has a QOS Policy attached to it. Each QOS Policy is associated with multiple rate-limiter profiles.
    • Of all the associated rate-limiter profiles, the parent rate-limiter profile is chosen.
  • The Committed Information Rate of the rate-limiter profile is used for the bw_up and bw_down indicators, which remain almost constant until changed in the VSD profile.

30-second Datapoint impact on sizing of SevOne NMS

  • SevOne NMS sizing is designed based on getting 1 data point every 5 minutes.
  • For Nokia-Nuage, by default, Elasticsearch provides 10 data points every 5 minutes, that is, one data point every 30 seconds for each object. This increases the storage that SevOne NMS must be sized for. It also impacts performance on SevOne Data Insight, as well as the performance and aggregation of flow data.
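The sizing impact can be made concrete with a little arithmetic. The sketch below only counts data points per object per day; it is not a sizing formula for SevOne NMS.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def datapoints_per_day(interval_seconds):
    """Number of data points one object produces per day at a given interval."""
    return SECONDS_PER_DAY // interval_seconds


baseline = datapoints_per_day(300)  # 1 data point every 5 minutes
nuage = datapoints_per_day(30)      # 1 data point every 30 seconds
ratio = nuage / baseline

print(baseline, nuage, ratio)
```

Each object therefore produces ten times the data points that the default sizing assumes.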

Customer-friendly Host Name for Device Name / Object Name

  • Device description field can be used to store the Customer-friendly Host Name (CFHN) of the device in VSD.
  • At present, Device alternate-name in SevOne NMS is populated by CFHN.

Topology update and storage of maps in Redis cache to avoid frequent calls to VSD

  • All required topology information from VSD is stored in redis cache to avoid frequent calls to VSD.
  • Every 30 minutes, when CREATE DEVICE container is running, redis cache is updated by making the calls to VSD.
  • If any container runs before CREATE DEVICE RUNNER and data is not present in redis, the required data is fetched from VSD.
  • To avoid a missed case from cache during deployment, by default CREATE DEVICE RUNNER is run after the INSTALLER.
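The caching behavior described above resembles a cache-aside pattern. The following is a minimal sketch, not the collector's implementation: a plain dictionary stands in for the redis cache, the class and method names are hypothetical, and storing the result of a fallback call back into the cache is an assumption.

```python
class TopologyCache:
    """Cache-aside lookup of topology data (dict used as a stand-in for redis)."""

    def __init__(self, fetch_from_vsd):
        self._store = {}
        self._fetch = fetch_from_vsd  # callable that queries VSD for one key

    def refresh(self, keys):
        """Periodic update, as done when the create device job runs."""
        for key in keys:
            self._store[key] = self._fetch(key)

    def get(self, key):
        """Return cached data, falling back to VSD when the key is missing."""
        if key not in self._store:
            # Assumption: the fetched value is cached for later lookups.
            self._store[key] = self._fetch(key)
        return self._store[key]


vsd_calls = []


def fake_vsd(key):
    """Hypothetical VSD call that records which keys were requested."""
    vsd_calls.append(key)
    return {"name": key}


cache = TopologyCache(fake_vsd)
cache.get("nsg-1")                 # cache miss: one call to VSD
cache.get("nsg-1")                 # cache hit: no call to VSD
cache.refresh(["nsg-1", "nsg-2"])  # periodic refresh calls VSD for each key
print(vsd_calls)
```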

Collection Offset Configuration and Impact

Sometimes the Nokia-Nuage collector collects data before VSD has populated it to the Elasticsearch node. That is, if the current time is T, VSD-to-Elasticsearch data population can run from T to T + T1 seconds or minutes.

Example: Assume T1 = 5 minutes


Important: If the collector runs at T + 5 minutes to collect the data from T to T + 5 minutes, there are cases where, by the time the collector starts collecting the data, not all records have been populated by VSD to Elasticsearch. As a result, those records are missed.

Examples

To address this, the collector offset variable, collection_offset, can be configured in seconds at the time of deployment. The collector then runs delayed by that many seconds. The default value is 0 seconds, which means the collector runs at the current time. The variable collection_offset can be configured by the user. The value set applies to all tenants; it is not tenant-specific.

Changing the collection offset can have some impact in certain scenarios.

Scenario# 1: New offset is less than the previous offset

@ 9:00 pm, collection happens from 8:45 pm - 8:55 pm. Change the offset from 5 minutes to 2 minutes. i.e.,

  • Old offset: 5 minutes
  • New offset: 2 minutes
  • Polling frequency: 10 minutes

@ 9:10 pm, collection happens from 8:55 pm - 9:08 pm (i.e., 9:10 pm - 2 minutes = 9:08 pm). Data is collected for 13 minutes, which is more than the polling frequency; this is supported and should work.


Scenario# 2: New offset is more than the previous offset

@ 9:00 pm, collection happens from 8:45 pm - 8:55 pm. Change the offset from 5 minutes to 8 minutes. i.e.,

  • Old offset: 5 minutes
  • New offset: 8 minutes
  • Polling frequency: 10 minutes

@ 9:10 pm, collection happens from 8:55 pm - 9:02 pm (i.e., 9:10 pm - 8 minutes = 9:02 pm). Data is collected for 7 minutes, which is less than the polling frequency; this is supported and should work.


Scenario# 3: New offset is set to more than the polling frequency

@ 9:00 pm, collection happens from 8:45 pm - 8:55 pm. Change the offset to be more than the polling frequency. i.e.,

  • Old offset: 5 minutes
  • New offset: 15 minutes
  • Polling frequency: 10 minutes

@ 9:10 pm (i.e., 9:10 pm - 15 minutes = 8:55 pm), collection will not happen as collection has already happened from 8:45 pm - 8:55 pm.

@ 9:20 pm, collection happens from 8:55 pm - 9:05 pm (i.e. 9:20 pm - 15 minutes = 9:05 pm); data collection is for 10 minutes, having 15 minutes offset and 10 minute polling frequency.

@ 9:30 pm, collection happens from 9:05 pm - 9:15 pm (i.e., 9:30 pm - 15 minutes = 9:15 pm).
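The three scenarios follow from one rule: each run collects from the end of the previous collection up to the run time minus the offset, and skips the run if that window is empty. The sketch below reproduces the scenarios under that reading; the function name and the date used are hypothetical.

```python
from datetime import datetime, timedelta


def collection_window(run_time, offset, last_end):
    """Return (start, end) of the data to collect, or None if nothing is new.

    run_time: when the collector runs.
    offset:   collection_offset as a timedelta.
    last_end: end of the previously collected window.
    """
    end = run_time - offset
    if end <= last_end:
        return None  # this window was already collected
    return (last_end, end)


last_end = datetime(2024, 1, 1, 20, 55)  # previous collection ended at 8:55 pm
run_910 = datetime(2024, 1, 1, 21, 10)
run_920 = datetime(2024, 1, 1, 21, 20)

scenario_1 = collection_window(run_910, timedelta(minutes=2), last_end)
scenario_2 = collection_window(run_910, timedelta(minutes=8), last_end)
scenario_3a = collection_window(run_910, timedelta(minutes=15), last_end)
scenario_3b = collection_window(run_920, timedelta(minutes=15), last_end)

for window in (scenario_1, scenario_2, scenario_3a, scenario_3b):
    print(window)
```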

Flow Augmenter

Create device for fetching the topology data from VSD and populate redis for flow augmenter

  • For Nokia-Nuage, the flow data is not being pushed from the device directly to flow relay or to the augmenter. It is pulled by the JSON decoder using index: nuage_dpi_flowstats from the elasticsearch server.
  • Flow Augmenter in Nokia-Nuage consists of the following components.
    • Create Device Cron
      • by default, it runs every 30 minutes as a cronjob.
      • it fetches topology from VSD and stores in redis container.
      • uses the collector image for docker container.
    • JSON Decoder
      • runs continuously as a container.
      • has a capacity of 40,000 flows per second (FPS) per JSON decoder.
      • fetches data from elasticsearch in JSON format using REST API calls.
      • sends data towards DNC in IPFIX packets after augmentation.
    • Redis
      • the redis container is also part of the augmenter; it is shared when the collector and the augmenter are deployed on the same virtual machine.
      • if the augmenter is deployed separately, the collector and the augmenter each have their own redis container.
      • redis stores the topology information, which is used to modify some of the flow fields.
Architecture Flow Augmenter

Multiple JSON Decoder concept

  • To increase throughput, multiple JSON decoders are deployed using the configuration value of the total number of JSON decoders required at the time of the deployment.
  • Each JSON decoder has its own time slice from the polling duration for which it fetches the data without conflicting with another JSON decoder.
  • Each JSON decoder has its own threading mechanism to fetch the data from Elasticsearch and send the data to the DNC.
  • If the total number of JSON decoders is configured as 3,
    • JSON decoder A1 fetches from T1 to T1 + X time
    • JSON decoder A2 fetches from T1 + X to T1 + 2X time
    • JSON decoder A3 fetches from T1 + 2X to T2 time
      Note:
      T1 is the last timestamp collected, from which fetching starts.
      T2 is the time until which data is to be fetched.
      X is the gap, determined by the total number of JSON decoders and the time interval for which data is fetched. i.e., X = (T2 - T1) / <total_json_decoders>
  • Each JSON decoder can handle up to 40,000 FPS.
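The time slicing above can be sketched as follows. This is an illustration of the formula X = (T2 - T1) / <total_json_decoders>, with times expressed as plain numbers; the function name is hypothetical.

```python
def decoder_slices(t1, t2, total_json_decoders):
    """Split [t1, t2) into one contiguous time slice per JSON decoder."""
    x = (t2 - t1) / total_json_decoders
    slices = []
    for i in range(total_json_decoders):
        start = t1 + i * x
        # The last decoder always ends exactly at t2, avoiding rounding gaps.
        end = t2 if i == total_json_decoders - 1 else t1 + (i + 1) * x
        slices.append((start, end))
    return slices


# A 300-second polling duration shared by 3 JSON decoders.
slices = decoder_slices(0, 300, 3)
for index, (start, end) in enumerate(slices, start=1):
    print(f"A{index}: {start} to {end}")
```

Because each slice starts where the previous one ends, no two decoders fetch the same time range.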

Nokia-Nuage specific Flow Fields

There is a set of fields specific to Nokia-Nuage which are added or modified in the raw flow data.

  • Flow Timestamp (4333) - flow timestamp indicates the timestamp of the flow data packet. The timestamp is the actual timestamp of that flow packet.
  • Domain Name (4334) - domain name field represents the description of the domain the flow data belongs to. Using the domain name from the flow data, the respective domain description is fetched from the topology information and populated in this field.
  • Source Uplink (4335) - source uplink field represents the actual name of the port and VLAN number. From the flow data, physical name of the port and VLAN number are obtained. Using the name from topology information, respective actual port name and VLAN number are populated. If not available, field is populated with NA.
  • Destination Uplink (4336) - destination uplink field represents the actual name of the port and VLAN number. From the flow data, physical name of the port and VLAN number are obtained. Using that name from the topology information, respective actual port name and VLAN number are populated. If not available, field is populated with NA.
  • Underlay Name (4337) - this field is as per the flow data. If not available, field is populated with NA.
  • Ingress Bandwidth (4338) - this field is populated with EgressBytes from the flow fields. Nokia-Nuage flow data is with respect to the LAN side and SevOne's SD-WAN representation is with respect to the WAN side. The egress and ingress bytes from the flow data and in SevOne NMS are reversed.
  • Egress Bandwidth (4339) - this field is populated with IngressBytes from the flow fields. Nokia-Nuage flow data is with respect to the LAN side and SevOne's SD-WAN representation is with respect to the WAN side. The egress and ingress bytes from the flow data and in SevOne NMS are reversed.
  • HostIP (4340) - this field represents the actual host from where the data is outgoing as a source, or the data is incoming as a destination. For flow data with ingress bytes, it represents the Source IP and for the flow data with the egress packets, it represents the Destination IP.

Flow interface ingress/egress index generation for flow fields 10 and 14

  • Nuage flow data does not contain the interface index from where the data is coming. The mapping of the index with SevOne NMS interface objects and device requires additional logic to generate them.
  • The generated index must be unique across the different interfaces of the same device.
  • The following logic is used to generate the index.
    • Each NSPort has a unique physical port name within a device with range 0-4096.
    • Each VLAN under a NSPort has a unique VLAN number within the NSPort with range 0-4096.
      Note: Based on this, a unique number of up to 8 digits is generated and added as the ifIndex in the Object description. For example, for the GigaEthernet2.0 interface, if the physical name is port2 and the VLAN number is 0, the interface ifIndex is 20000.

      The generated ifIndex is used to convert the ifIndex to ifName in Flow Interface Manager running in a separate script.
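One scheme consistent with the example above (port2 and VLAN 0 giving 20000) is to multiply the port number by 10000 and add the VLAN number. The sketch below assumes that scheme and assumes the port number is the numeric suffix of the physical port name; neither detail is confirmed beyond the single example.

```python
import re


def generate_if_index(physical_port_name, vlan_number):
    """Combine port number and VLAN number into one ifIndex (assumed scheme)."""
    match = re.search(r"(\d+)$", physical_port_name)
    if match is None:
        raise ValueError(f"no port number in {physical_port_name!r}")
    port_number = int(match.group(1))
    if not 0 <= port_number <= 4096 or not 0 <= vlan_number <= 4096:
        raise ValueError("port and VLAN numbers must be in the range 0-4096")
    # VLAN numbers need at most 4 digits, so they never overlap the port part.
    return port_number * 10000 + vlan_number


print(generate_if_index("port2", 0))
print(generate_if_index("port4096", 4096))
```

Since both parts are bounded by 4096, the largest value has 8 digits, and two interfaces on the same device collide only if they share both port and VLAN numbers.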

Flow direction field logic based on flow data and reverse mapping for flow field 61

Flow field 61 is the standard field for flow direction. Nokia-Nuage flow data is with respect to the LAN side and SevOne's SD-WAN representation is with respect to the WAN side. The flow direction for the flows where EgressBytes are present, is considered to be incoming. The flow direction for the flows where IngressBytes are present, is considered to be outgoing.

  • Flows with EgressBytes - Direction - 0
  • Flows with IngressBytes - Direction - 1

Flow source and destination address logic based on the direction of flow

  • Nokia-Nuage flow data is with respect to the LAN side and SevOne's SD-WAN representation is with respect to the WAN side. Due to this, the source address of the flow data packet is changed as per the direction of the flow.
  • If the flow data has the computed direction 1, sourceheader and destinationheader in the flow represent the actual source and destination device IP of the flow.
  • If the flow data has the computed direction 0, sourceheader and destinationheader in the flow are reversed; the source address in the header represents the destination device IP and the destination address in the header represents the source device IP.
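The direction and address rules can be combined in one small function. This is a sketch under assumptions: the flow is modeled as a dictionary, the keys sourceheader and destinationheader mirror the names used above, and each record is assumed to carry either EgressBytes or IngressBytes.

```python
def resolve_flow(flow):
    """Return (direction, source_ip, destination_ip) for one flow record."""
    if flow.get("EgressBytes"):
        # EgressBytes present: incoming flow (direction 0), headers are reversed.
        return 0, flow["destinationheader"], flow["sourceheader"]
    if flow.get("IngressBytes"):
        # IngressBytes present: outgoing flow (direction 1), headers used as-is.
        return 1, flow["sourceheader"], flow["destinationheader"]
    raise ValueError("flow has neither EgressBytes nor IngressBytes")


incoming = {"EgressBytes": 1200, "sourceheader": "10.0.0.1", "destinationheader": "10.0.0.2"}
outgoing = {"IngressBytes": 800, "sourceheader": "10.0.0.1", "destinationheader": "10.0.0.2"}

print(resolve_flow(incoming))
print(resolve_flow(outgoing))
```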

Flow data de-duplication handling

Background

To collect flow stats data from Elasticsearch, the search_after functionality of Elasticsearch is used, which requires the timestamp after which data is fetched. A second parameter must also be provided as a tie-breaker, to avoid collisions between data records with the same timestamp.

Example query

{
  "size": 5000,
  "search_after": [1572846005998, <tie_breaker>],
  "sort": [{"timestamp": "asc"}, {"TotalBytesCount": "desc"}],
  "query": {
    "bool": {
      "must": [
        {"term": {"EnterpriseName": "0Fingles Fries"}},
        {"range": {"timestamp": {"gte": 1572846005999, "lt": 1572846065999}}}
      ]
    }
  }
}

Ideally, the second parameter must be unique, so that when there are multiple records at the same timestamp, it can help to move the search ahead for the subsequent queries.
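A query body of this shape could be assembled as shown below. This is a minimal sketch that only builds the request body as a dictionary; the function name and parameters are hypothetical, and sending the request to Elasticsearch is left out.

```python
import json


def build_flow_query(enterprise, start_ms, end_ms, search_after=None, size=5000):
    """Build a flow stats query body sorted by timestamp and TotalBytesCount."""
    body = {
        "size": size,
        "sort": [{"timestamp": "asc"}, {"TotalBytesCount": "desc"}],
        "query": {
            "bool": {
                "must": [
                    {"term": {"EnterpriseName": enterprise}},
                    {"range": {"timestamp": {"gte": start_ms, "lt": end_ms}}},
                ]
            }
        },
    }
    if search_after is not None:
        # One value per sort key: [last timestamp, last TotalBytesCount].
        body["search_after"] = list(search_after)
    return body


first_page = build_flow_query("0Fingles Fries", 1572846005999, 1572846065999)
next_page = build_flow_query(
    "0Fingles Fries", 1572846005999, 1572846065999,
    search_after=(1572846030000, 942),  # hypothetical values from the last hit
)
print(json.dumps(next_page, indent=2))
```

The search_after list must contain one value for each sort key, in the same order as the sort clause.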

Problem

For the flow stats data, the doc_id field is unique for all records. However, using this field results in a significant memory requirement on the Elasticsearch server and is heavy on the server resources. It is recommended that some other field is used as a tie-breaker along with the timestamp. At present, TotalBytesCount is used as the tie-breaker but, because it is not unique, there is a possibility of duplicate flow records at the edge of the call to Elasticsearch.

Example: Assume the following Elasticsearch data

Timestamp      TotalBytesCount  # Flow Records @ Timestamp A and BytesCount B
T1             1000             450
T2             2000             1
T3             5000             3000
T4             100              1000
T5             50               800
T6             942              10
T7             999              230
T8             23               25
Total Records                   5516

If Page Size of the query is 5000,

  • In the first query, Q1, the 5000 records returned cover timestamps T1 (450), T2 (1), T3 (3000), T4 (1000), and T5 (549 of its 800 records).
  • In the next query, the 251 records remaining for timestamp T5 are skipped if the search moves ahead to the next timestamp. Alternatively, if records are recollected from the last timestamp received, the 549 T5 records from the previous query are collected again. There is no other parameter to keep the search context alive here.

To overcome the issue of duplicates, data of the last timestamp is recollected and the last timestamp from the previous query is discarded. So, in the case above, 549 records from query Q1 are discarded if all of them have the same timestamp along with the same TotalBytesCount. The next query refetches that data and processes it for DNC.
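The discard-and-refetch approach can be simulated with the example data above. This is a sketch, not the collector's code: an in-memory list stands in for Elasticsearch, the function names are hypothetical, and the guard for a page filled by a single timestamp is an added assumption.

```python
def trim_last_timestamp(page, page_size):
    """Drop records of the last timestamp from a full page.

    Returns (kept_records, resume_timestamp). resume_timestamp is None when
    the page is not full, meaning there is nothing left to fetch.
    """
    if len(page) < page_size:
        return page, None
    last_ts = page[-1]["timestamp"]
    kept = [r for r in page if r["timestamp"] != last_ts]
    return kept, last_ts


def collect(records, page_size):
    """Page through records sorted by timestamp without duplicates."""
    collected, from_ts = [], records[0]["timestamp"]
    while from_ts is not None:
        page = [r for r in records if r["timestamp"] >= from_ts][:page_size]
        kept, from_ts = trim_last_timestamp(page, page_size)
        if from_ts is not None and not kept:
            # A single timestamp fills the whole page: no progress is possible.
            raise RuntimeError("page_size too small for one timestamp")
        collected.extend(kept)
    return collected


# Records per timestamp T1..T8, as in the example table.
counts = [450, 1, 3000, 1000, 800, 10, 230, 25]
records = [
    {"timestamp": ts, "id": f"{ts}-{i}"}
    for ts, n in enumerate(counts, start=1)
    for i in range(n)
]

first_kept, resume_ts = trim_last_timestamp(records[:5000], 5000)
collected = collect(records, 5000)
print(len(first_kept), resume_ts, len(collected))
```

The first page keeps the T1 to T4 records and discards the partial T5 batch; the second query restarts at T5 and returns the rest.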

Important:
  • The issue above impacts only flows and does not impact non-flow data. Only for flows do duplicate records count toward the total aggregation. For non-flow data, such duplicate records are ignored by SevOne NMS.
  • For non-flow data such as interface, interface_queue, device-health, and tunnels (probes), with the exception of probestats, using doc_id does not have an impact on the Elasticsearch query because the volume is considerably lower than for flowstats and probestats.
  • Increasing page_size also reduces the occurrence of duplicate records, as the number of Elasticsearch queries is reduced.

Fetcher vs. Sender threads for Nokia-Nuage

  • Fetcher threads - fetch the flow data from the Elasticsearch server in parallel. More fetcher threads mean more parallel queries, and therefore more data per second for the collector to process.
  • Sender threads - once a fetcher thread starts fetching data from Elasticsearch, it puts the data on the processing queue. The sender threads pick up the flow data points from the queue and send them on to rabbitmq.

By default, there are 2 fetcher threads and 4 sender threads. Based on the data load, these values can be fine-tuned from the environment. A higher number of fetcher threads can lead to high memory consumption. A higher number of sender threads adds more load on the CPU overall.
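The fetcher and sender arrangement is a producer-consumer pipeline around a shared queue. The sketch below illustrates the pattern with Python's standard library only; the function names are hypothetical, and the fetch and send callables stand in for the Elasticsearch queries and the rabbitmq publishing.

```python
import queue
import threading

FETCHER_THREADS = 2  # default number of fetcher threads
SENDER_THREADS = 4   # default number of sender threads
_STOP = object()     # sentinel telling a sender thread to exit


def run_pipeline(time_slices, fetch, send):
    """Fetch each time slice in parallel and hand records to sender threads."""
    work = queue.Queue()
    for time_slice in time_slices:
        work.put(time_slice)
    processing = queue.Queue()

    def fetcher():
        while True:
            try:
                time_slice = work.get_nowait()
            except queue.Empty:
                return
            for record in fetch(time_slice):
                processing.put(record)

    def sender():
        while True:
            record = processing.get()
            if record is _STOP:
                return
            send(record)

    fetchers = [threading.Thread(target=fetcher) for _ in range(FETCHER_THREADS)]
    senders = [threading.Thread(target=sender) for _ in range(SENDER_THREADS)]
    for thread in fetchers + senders:
        thread.start()
    for thread in fetchers:
        thread.join()
    for _ in senders:
        processing.put(_STOP)  # one sentinel per sender, after all data
    for thread in senders:
        thread.join()


sent = []
sent_lock = threading.Lock()


def fake_send(record):
    with sent_lock:
        sent.append(record)


# Each hypothetical time slice yields 10 records.
run_pipeline([0, 1, 2, 3], lambda s: range(s * 10, s * 10 + 10), fake_send)
print(len(sent))
```

Records wait in the processing queue until a sender picks them up, which is where memory grows when fetchers outpace senders.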

Log Rotation

The collector creates log rotation rules in /etc/logrotate.d/nuage-collector file for files stored in log directory /var/log/sdwan-nuage/<tenant-name>.

Log Rotation Rules

/var/log/sdwan-nuage/*.log {
    daily
    size 100M
    missingok
    rotate 5
    compress
    delaycompress
    notifempty
    dateext
    dateformat %s
}

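This essentially means that logs are rotated daily if the file size is greater than 100 MB, and that rotated files are compressed, with compression delayed by one rotation cycle (delaycompress). A timestamp in seconds since the epoch is appended to the name of each rotated file (dateext with dateformat %s). Only the 5 most recent rotated log files are preserved, empty log files are not rotated (notifempty), and no error is returned if a log file is missing (missingok).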

FAQs

The FAQs below are related to Collector and Augmenter communication.

How much data is being transferred between the collector and the augmenter? What is the frequency of syncing both?

  • For Nokia-Nuage, the collector and augmenter are running in isolation. There is no communication between them when deployed on two different virtual machines.
  • If both the collector and the augmenter are running on the same virtual machine, the redis container is shared between them for resource optimization.

If the collector virtual machine goes down, does it affect the augmenter virtual machine? What happens in the reverse case?

Important: Assume that the collector and the augmenter are on different virtual machines / PAS appliances.
  • If the collector virtual machine goes down, the augmenter virtual machine is not affected. Flows continue to be pushed towards DNC.
  • If the augmenter virtual machine goes down, the collector virtual machine is not affected. Non-flow data continues to be pushed towards PAS.