SevOne Cluster Sizing Methodology

Note: Terminology usage...

In this guide if there is,

  • [any reference to master] OR
  • [[if a CLI command contains master] AND/OR
  • [its output contains master]],
    it means leader.

And, if there is any reference to slave, it means follower.

What is sizing and how are clusters sized?

When observability requirements grow, it is common for a SevOne administrator to revisit the sizing of the existing cluster. There are also times when an administrator may be required to deploy an entirely new or separate cluster. This guide aims to help administrators with such tasks by answering questions like:
  • How do I size a new cluster?
  • What are common ways to add additional capacity to my cluster?
  • How do specific use-cases impact my available cluster capacity?

This guide is divided into two parts.

Basic Concepts

The SevOne platform architecture is a cluster of n-distributed peer appliances, with one peer appliance elected as the Cluster Leader. The cluster leader is responsible for federating the cluster data and insights to the presentation layer appliance for reporting, analytics, alerting, and administration.
[Figure: Example SevOne Cluster]


There are three types of appliances.
  1. vPAS - collects, stores, and analyzes raw and aggregate time-series metrics.
  2. vDNC - collects, stores, and analyzes large numbers of raw and aggregate flow records.
  3. vSDI - is an advanced presentation, analytics, and workflow appliance; vSDI sizing is outside the scope of this guide.

The vPAS and vDNC virtual appliances are available in several sizes. Additionally, take note of the following.
  • The Cluster Leader capacity must be equal to or larger than the largest capacity vPAS in the cluster. In large clusters, it is typically recommended that the Cluster Leader refrain from discovery or polling functions.
  • A cluster may have a mix of peer types and sizes, but all peers must be on the same SevOne release.
  • There is no hard limit to the number of peers within a cluster, allowing SevOne to scale horizontally as needed. Generally, only one Data Insight instance is required per cluster, regardless of the cluster's size, and a single instance may serve multiple independent clusters of vPAS and vDNC.
Important: Each peer (vPAS, vDNC) in the cluster may be paired with a standby appliance of the same type to provide High Availability. These are called Hot Standby Appliances (HSAs). Production deployment best practices state that the Primary peer and its HSA must be equal in size and capacity.

Part# 1 - Determine IPS Requirements for the Cluster

Clusters are sized according to the number of IPS polled and the number of FPS received. A cluster's total capable IPS and FPS is the cumulative IPS and FPS of all its peers (HSAs excluded). When sizing a cluster, it is important to know your IPS and FPS requirements for the cluster and how IPS and FPS requirements can be distributed among one or more peers in the cluster.

IPS is impacted by the number of polled objects, the number of indicators for that object (defined by the Object Type), and the polling rate.

       IPS = ((Number of Objects) * (Average Number of Indicators per Object)) / (Polling Rate)

Determining IPS is not an exact science. Typical telecommunications or large enterprise environments are highly eclectic, with multiple domains and device types, resulting in a broad range of object types and associated indicators. This guide aims to serve as a methodology for sizing rather than an exact calculator.

For these reasons, sizing the initial deployment of a cluster in a previously unmonitored network is best done in conjunction with an experienced SevOne technical specialist. They have deep knowledge and experience of specific network vendors, devices, and configurations. That said, some of the largest contributors to the number of monitored objects and indicators in a cluster include:
  • Routers come in various capacities and services. A large MPLS router may consume upwards of 25,000 objects (though far fewer in practice using rules-based collection filtering), resulting in 100,000 indicators or more, while a small office router may only have 100 objects, resulting in 1,000 indicators or less. Next to the total number of interfaces, MPLS segments (LSP) are the most common major contributing factors to the number of monitored objects and indicators, as they can be monitored individually.
  • Switches, like routers, vary in capacity and services, but unlike routers, they generally have a greater number of interfaces. Each interface maps to an object type with 20 indicators and more, depending on the vendor and switch configuration. Additional services, such as QoS, can greatly increase the number of objects and resulting indicators depending on the number of classmaps, policies, and QoS enabled interfaces. A large switch may have 1,000 or more objects resulting in 20,000 or more indicators.
  • Hosts include physical hosts and virtual machines. In the most common cases, they range between 50-200 objects, resulting in 1,000 or more indicators.
It is important to remember that the information above is highly generalized and can vary significantly.

Example
Using the IPS formula above and assuming 20,000,000 indicators with a default polling period of 5 minutes (300 seconds), a cluster’s overall IPS requirement can be calculated.

       IPS = 20,000,000 indicators / 300 seconds = 66,667 IPS
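
The calculation above can be sketched in a few lines of Python. This is an illustrative helper only, not part of any SevOne tooling; the function name required_ips and its parameters are assumptions introduced for this example.

```python
def required_ips(total_indicators: float, polling_period_s: float = 300) -> float:
    """Return indicators per second for a total indicator count and polling period."""
    if polling_period_s <= 0:
        raise ValueError("polling period must be positive")
    return total_indicators / polling_period_s


# 20,000,000 indicators polled at the default 300-second period
print(round(required_ips(20_000_000)))  # 66667
```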

Estimating the number of objects for a device in existing clusters is usually easier as there are often representative devices that can be used to extrapolate to the total number of objects and indicators.

In all cases, it is easiest to iterate to a final answer by using a small percentage of representative devices of each device type to be monitored to extrapolate an estimate of the total number of objects and indicators.

Tip: SevOne sizing new and existing deployments
Every network is unique and the average number of indicators per object and the number of objects may vary greatly across environments. It is normal to be uncertain about the number of objects (i.e., interfaces, MPLS paths, QoS queues, CPUs, etc.) and IPS, and it is best to iterate towards a final result. A good practice is to first record the current number of objects and IPS for the cluster, then add a small number (say 10-20%) of a representative device and note the increase in the number of objects and IPS after several successful polls have elapsed. The total number of objects and IPS can then be extrapolated using the total number of devices of that type.

The cluster's object count and IPS can be monitored on a per-peer basis by logging into the Cluster Manager and navigating to Administration > Cluster Manager > Peers tab.
[Figure: Cluster Manager, Peers tab]
For example, if you have 10 core MPLS routers (of similar make and configuration) and adding 2 of them to the cluster consumes 20,000 objects and 1,334 IPS, then a good estimate for all 10 core MPLS routers in such an environment is ~100,000 objects and ~6,667 IPS.
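
The extrapolation step can be sketched as follows. This is a minimal sketch that assumes devices of the same type contribute roughly equal numbers of objects and IPS; the function name extrapolate and its parameters are hypothetical and are not part of any SevOne tooling.

```python
def extrapolate(sample_devices: int, sample_objects: int,
                sample_ips: float, total_devices: int):
    """Scale the objects and IPS observed for a sample up to all devices of that type."""
    if sample_devices <= 0:
        raise ValueError("sample must contain at least one device")
    factor = total_devices / sample_devices
    return round(sample_objects * factor), sample_ips * factor


# 2 of 10 core MPLS routers consumed 20,000 objects and 1,334 IPS
objects, ips = extrapolate(2, 20_000, 1_334, 10)
print(objects, round(ips))  # 100000 6670
```

The result of about 6,670 IPS is in line with the ~6,667 IPS estimate; the small difference comes from rounding in the sampled figure.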

Determine FPS Requirements for the Cluster

Where the vPAS collects metrics over time, resulting in a number of indicators collected per second (IPS), the vDNC receives and processes Netflow/IPFIX and other flow protocols generated by flow-capable devices and interfaces, resulting in a number of flows per second (FPS) received by the SevOne vDNC appliance.
The number of flows generated and exported to the vDNC by a flow-capable device is highly dependent on the volume of traffic routed by that device and the specific settings of its flow-enabled interfaces, such as the sampling rate. This results in large variations of flow volume across networks, which can make estimating the total number of flows difficult.

That said, the most common way to calculate the total number of flows is also the most accurate, which is to retrieve the flow rate and the number of flow interfaces directly from the device itself. To calculate the rate of flows exported, retrieve the instantaneous count of flows exported at two points in time. The table below contains the commands to retrieve the current flow count for the most common network vendors and models.

Manufacturer              Operating System    Command for Instantaneous Flow Count
Cisco                     IOS                 show ip flow export
Cisco                     IOS XR              show flow exporter fem1 location 0/0/CPU0
Cisco                     NX-OS               show flow export
Juniper                   EX, MX              show services accounting flow
Arista, Brocade, Foundry  Arista/Network EOS  show flow
Nokia                     SR OS               show cflowd collector

To calculate the average flow rate over a period of time T1 - T0 seconds,

  1. Execute the command at T0 and take note of the current flow count, F0.
  2. Execute the command at T1 and take note of the current flow count, F1.
  3. The average flow rate for that period can be calculated as,

           FPS = (F1 - F0) / (T1 - T0)
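
The steps above can be sketched in Python. The readings below are hypothetical values, not output from a real device, and the function name average_fps is an assumption made for this example. The sketch also assumes the flow counter only increases between the two readings.

```python
def average_fps(f0: int, f1: int, t0: float, t1: float) -> float:
    """Average flow rate between two cumulative flow-count readings."""
    if t1 <= t0:
        raise ValueError("T1 must be later than T0")
    if f1 < f0:
        raise ValueError("flow count decreased; the counter may have been reset")
    return (f1 - f0) / (t1 - t0)


# Hypothetical readings taken 30 seconds apart
print(average_fps(f0=1_200_000, f1=1_260_000, t0=0, t1=30))  # 2000.0
```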

The sum of the rate of all flows across all flow-capable devices configured to export to the vDNC is the maximum possible total FPS that a vDNC would be required to process (i.e., the worst-case scenario). The choice of where in the network to collect exported flow records is dependent on flow export and aggregation. Please contact IBM SevOne Support Team for any specific questions regarding flow collection.


Tip: It is important to measure the number of exported flows during peak traffic times; otherwise, you may be caught short! A practical duration between T1 and T0 could be 30 seconds.

In practice, the number of flows (FPS) ultimately processed by the vDNC is nearly always refined by the policy settings provided by the vDNC, such as allowing or disallowing the processing of specific flow-enabled interfaces, as seen below.

[Figure: Number of flow interfaces and flows per second at the device and interface level; granular policy for flow-enabled interfaces]

Part# 2 - Distribute IPS and FPS across SevOne Appliances

Determine the number of appliances for a cluster

Now that we know the required IPS and FPS for the cluster, how can they be distributed among the primary peers in the cluster?
There are two ways to add capacity to a cluster.
  1. horizontally (typical and most straightforward)
  2. vertically (less common, more involved)
Deploying a new cluster and horizontally scaling an existing cluster are similar in that the number of peers required to satisfy the IPS requirements is added to the cluster; as a result, the number of member peers in the cluster grows. Vertical scaling increases the capacity of a peer instead of adding peers. In existing clusters, it requires data to be transferred (for example, using the Device Mover feature) from an existing smaller peer to a larger peer. In this case, the number of peers in the cluster remains the same, but one or more of the peers are swapped out for a larger variant. This guide focuses on horizontal scaling, that is, adding peers to increase the cluster capacity.

Peers are delivered as virtual appliances containing a single Virtual Machine. They are available in several discrete sizes, each with its associated resource requirements.

The following table lists the tested maximum IPS and FPS for each appliance type; these limits are fixed.

Note: The number of objects denoted by an appliance name (e.g., vPAS100K) is the maximum, not nominal, number of objects an appliance can support. Also, note the appliance IPS has a linear relationship to the stated capacity (e.g., vPAS200K has 2x the IPS of vPAS100K).
Appliance Type  vCPU Cores  RAM (GB)  Hard Drives   Flow Limit (FPS)  Max Indicators per Second (IPS)
vPAS5k          2           8         150GB         -                 333
vPAS20k         8           24        600GB         -                 1,333
vPAS60k         8           44        150GB/1.3TB   -                 4,000
vPAS100k        8           96 (*)    500GB/2TB     -                 6,666
vPAS200k        16          220       600GB/4TB     -                 13,333
vDNC100         8           16        150GB/400GB   30,000            -
vDNC300         16          48        150GB/800GB   80,000            -
vDNC1000        24          96        150GB/1500GB  80,000            -
vDNC1500        24          128       150GB/3000GB  80,000            -

(*) Higher demands (for example, xStats) may require more memory.

Note: SevOne performs extensive scale testing to determine each appliance's maximum FPS or IPS; they are fixed parameters. Consequently, the maximum number of objects or flow-enabled interfaces for an appliance is also fixed. However, the actual number of objects or flow-enabled interfaces (up to the appliance maximum), the average number of indicators per object, and the average polling rate of an appliance are variable.

Determine required number of vPAS for a cluster

IPS is impacted by the number of polled objects, the number of indicators monitored on that object, and the polling rate. SevOne calculates the maximum IPS of an appliance in the following manner.

       IPS = ((Number of Objects) * (Average Number of Indicators per Object)) / (Polling Rate)

Based on the formula above, for a fixed maximum IPS the appliance's object capacity is directly proportional to the polling period and inversely proportional to the average number of indicators per object. For example, halving the polling period will halve the appliance's object capacity. Doubling the average number of indicators will also halve the appliance's object capacity.


To determine the maximum object capacity of an appliance, SevOne assumes an average of 20 indicators per object and a default polling rate of 300 seconds (5 minutes). Because each appliance has a fixed maximum IPS, the maximum number of objects it can monitor follows from these assumptions. For example, a vPAS100K, with a maximum IPS of 6,667, supports a maximum of 100,000 objects.


Example: How to calculate max object count for appliances based on the appliance's maximum IPS.

       Number of Objects (max) = (IPS * Polling Rate) / (Average Number of Indicators per Object)

       Max Objects for vPAS100K: (6,667 IPS * 300) / 20 = 100,000 Objects (max)
       Max Objects for vPAS200K: (13,334 IPS * 300) / 20 = 200,000 Objects (max)


Important: As the maximum IPS and the maximum number of objects are fixed, an increase in average number of indicators per object or a decrease from the default polling interval will reduce the appliance's effective capacity from its maximum.
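
The relationship between maximum IPS, polling period, and indicators per object can be sketched as follows. This is an illustrative helper under the assumptions stated above (20 indicators per object, 300-second polling); the function name max_objects is hypothetical.

```python
def max_objects(max_ips: float, polling_period_s: float = 300,
                indicators_per_object: float = 20) -> float:
    """Effective object capacity for a fixed maximum IPS."""
    return max_ips * polling_period_s / indicators_per_object


base = max_objects(6_667)
print(base)  # 100005.0 (about 100,000; 6,667 is itself a rounded figure)

# Halving the polling period or doubling the indicators per object halves the capacity
print(max_objects(6_667, polling_period_s=150) / base)      # 0.5
print(max_objects(6_667, indicators_per_object=40) / base)  # 0.5
```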

Sections Determine IPS Requirements for the Cluster and Determine FPS Requirements for the Cluster describe how to determine a cluster's capacity requirements. The remainder of this part distributes those requirements across SevOne appliances.

Let's use the appliance sizing to distribute the IPS from Part# 1 across SevOne appliances.

In Part# 1, based on the calculation, the cluster requires an estimated 20,000,000 indicators. Assume, upon investigation using the methodology in Part# 1, it is determined that the average number of indicators per object is 40, and the network team has decided to use the default of 300 seconds for the polling period.

The simplest way to distribute the IPS is to divide it by the largest capacity appliance you can deploy to determine the number of required appliances.

       (66,667 IPS / 13,334 IPS per vPAS200K) = 5x vPAS200K

There are some cases where a smaller appliance is required. In this case, 10 vPAS100K have the same capacity as 5 vPAS200K.

       (66,667 IPS / 6,667 IPS per vPAS100K) = 10x vPAS100K

Most commonly, there is a mix of appliance sizes. For example, 3x vPAS200K and 4x vPAS100K would also satisfy the requirement.
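Note: In the example above, the average of 40 indicators per object is twice the 20 assumed for the appliance ratings, so the effective maximum number of objects for each appliance is halved (for example, 100,000 objects for a vPAS200K).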
Important: In all cases, do not forget to add a Cluster Leader that is as large or larger than the largest peer in the cluster.
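
The appliance count can also be sketched in Python. To avoid rounding artifacts from figures such as 13,333 versus 13,334 IPS, this sketch works in indicators per default 300-second polling cycle (rated objects multiplied by the assumed 20 indicators per object) rather than in IPS. It assumes everything is polled at the default period and does not include the Cluster Leader. The names RATED_OBJECTS, appliances_needed, and mix_satisfies are hypothetical.

```python
# Rated object counts for two vPAS sizes; 20 indicators per object is the rating assumption
RATED_OBJECTS = {"vPAS100k": 100_000, "vPAS200k": 200_000}
ASSUMED_INDICATORS_PER_OBJECT = 20


def capacity(appliance: str) -> int:
    """Indicators one appliance can poll per default 300-second cycle."""
    return RATED_OBJECTS[appliance] * ASSUMED_INDICATORS_PER_OBJECT


def appliances_needed(total_indicators: int, appliance: str) -> int:
    """Smallest number of identical appliances covering the indicator total."""
    return -(-total_indicators // capacity(appliance))  # integer ceiling division


def mix_satisfies(total_indicators: int, mix: dict) -> bool:
    """Check whether a mix such as {"vPAS200k": 3, "vPAS100k": 4} is sufficient."""
    return sum(capacity(name) * count for name, count in mix.items()) >= total_indicators


print(appliances_needed(20_000_000, "vPAS200k"))  # 5
print(appliances_needed(20_000_000, "vPAS100k"))  # 10
print(mix_satisfies(20_000_000, {"vPAS200k": 3, "vPAS100k": 4}))  # True
```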

Architectural and administrative decisions may impact the choice of vPAS or vDNC sizes in your cluster. For example, there may be an administrative or architectural benefit to group polled devices by region, by tenant, by business unit, etc. SevOne's distributed platform maximizes the available deployment architecture options.

Non-standard polling rates or skewed object-to-indicator ratios

In certain cases, the polling interval may need to be shorter than the default, or a significant proportion of monitored object types (custom metrics, for example) may have a higher ratio of indicators to objects. How does this impact IPS calculations for the cluster?

  • Sizing Example #1: Non-standard polling rate
    Assume the user has a vPAS100K with 60,000 objects polled at the standard 5 minute interval. There are a number of critical objects the user would like to poll more frequently to better observe microbursts of traffic. To do this, the user will poll 5,000 objects at a 1-minute interval while continuing to poll the remaining 55,000 objects at the standard interval. Will this appliance have enough capacity?
    • (55K Objects * 20 Indicators per Object) / 300 seconds = (55,000 * 20) / 300 = 3,667 IPS
      (or check Cluster Manager for actuals)
    • (5K Objects * 20 Indicators per Object) / 60 seconds = (5,000 * 20) / 60 = 1,667 IPS
    • 3,667 IPS + 1,667 IPS = 5,334 IPS total
    Note: The user's vPAS100K, while only monitoring 60K objects out of a maximum of 100K, would be at 80% capacity (5,334 of 6,667 IPS).


  • Sizing Example #2: Monitored object type has more than 20 indicators per object
    There are many situations in which an object type has more than 20 indicators: RAN Cell monitoring, customized object types, synthetic indicators, custom adaptors, etc. Assume a vPAS200K with 100K monitored objects polled at the standard 5 minute interval. An administrator wants to add 55K objects, all with the same object type, for which the object type has been customized to include approximately 60 polled indicators. Will there be enough capacity for 55K objects on the vPAS200K?
    • The maximum acceptable IPS for vPAS200K is 13,333.
    • The user is currently using (100,000 * 20) / 300 = 6,667 IPS
      (or check Cluster Manager for actuals)
    • The user is adding (55,000 * 60) / 300 = 11,000 IPS.
    • The total required IPS is 17,667, which is greater than the available 13,333. An additional 4,334 IPS is required, which calls for an additional vPAS appliance. The appropriate vPAS size depends on a combination of future polling requirements and resource availability.
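
Both sizing examples sum the IPS of groups of objects that differ in polling period or indicator count. A minimal sketch of that calculation follows; the function name total_ips and the tuple layout are assumptions made for illustration. The result is then compared with the rated maximum IPS of the appliance.

```python
def total_ips(groups) -> float:
    """Sum IPS over (objects, indicators_per_object, polling_period_s) groups."""
    return sum(objects * indicators / period
               for objects, indicators, period in groups)


# Sizing Example #1: 55,000 objects at 300 s plus 5,000 objects at 60 s
example_1 = total_ips([(55_000, 20, 300), (5_000, 20, 60)])
print(round(example_1))              # 5333
print(f"{example_1 / 6_667:.0%}")    # 80%

# Sizing Example #2: 100,000 objects with 20 indicators plus 55,000 with 60
example_2 = total_ips([(100_000, 20, 300), (55_000, 60, 300)])
print(round(example_2))              # 17667
```

The first result is 5,333 rather than 5,334 because the worked example rounds each term up before adding.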

Note: Actual IPS for an existing cluster can be monitored on a per-peer basis by logging into the Cluster Manager and navigating to Administration > Cluster Manager > Peers tab, as shown in the screenshot above.

Determine required number of vDNC for a cluster

Assume that, from Part# 1, there are approximately 800,000 flows per second generated across 11,000 flow-enabled interfaces. The number of vDNC appliances required can then be determined.

From the table in section Determine the number of appliances for a cluster above, you will notice that vDNC300, vDNC1000, and vDNC1500 all have FPS limits of 80,000. The 300, 1000, and 1500 denote the maximum number of flow-enabled interfaces that can be processed by a vDNC.

With this knowledge, the simplest case would be to deploy and manage the fewest vDNCs required. Since all 3 appliances handle the same FPS (80,000), the required number of vDNCs will come down to the number of flow-enabled interfaces.


Appliance Type  Count  Max FPS    Max Interfaces  Meets Requirement
vDNC300         10     800,000    3,000           No
vDNC1000        10     800,000    10,000          No
vDNC1000        15     1,200,000  15,000          Yes
vDNC1500        10     800,000    15,000          Yes

The table above shows that 10 vDNC1500s have the capacity for 800,000 FPS and 15,000 flow-enabled interfaces. However, 15 vDNC1000s can also be used for 15,000 interfaces, resulting in an additional 400,000 FPS (1,200,000 FPS total).
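
The vDNC count is driven by whichever limit is reached first, FPS or flow-enabled interfaces. A minimal sketch of that check follows, using the 80,000 FPS limit and the interface limits stated above; the function name vdnc_needed is hypothetical, and the result is a lower bound rather than a deployment recommendation.

```python
VDNC_MAX_FPS = 80_000
VDNC_MAX_INTERFACES = {"vDNC300": 300, "vDNC1000": 1_000, "vDNC1500": 1_500}


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def vdnc_needed(total_fps: int, total_interfaces: int, appliance: str) -> int:
    """Minimum count that satisfies both the FPS and the interface limit."""
    by_fps = ceil_div(total_fps, VDNC_MAX_FPS)
    by_interfaces = ceil_div(total_interfaces, VDNC_MAX_INTERFACES[appliance])
    return max(by_fps, by_interfaces)


for name in VDNC_MAX_INTERFACES:
    print(name, vdnc_needed(800_000, 11_000, name))
# vDNC300 37
# vDNC1000 11
# vDNC1500 10
```

For 800,000 FPS across 11,000 interfaces, this agrees with the table: 10 vDNC1500s are sufficient, while 10 vDNC300s or 10 vDNC1000s fall short on interfaces.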

SevOne NPM Data Retention

SevOne allows users to adjust the data retention for polled data on the Cluster Manager page > Cluster Settings tab > subsection Storage. This is a cluster-wide setting and is applied to all peers. When a user adjusts this setting, the following warning is displayed.

Warning: SevOne NMS is tuned to store 365 days of data at 300 seconds granularity when operating at full capacity. Modifying data retention or polling frequency from their default values can cause the indicators-per-second load to exceed rated capacity, which may result in service disruption or data loss.

Please contact Expert Labs for sizing guidance before modifying data retention settings.

In the warning message, if you answer Yes without obtaining the guidance from Expert Labs, you are proceeding at your own risk.

The max allowed retention is 730 days (2 years).

Important: The Data retention calculations below do not account for appliances that have flow data.

Adjust Objects and IPS:

                    12 Months (default)  18 Months          24 Months
NMS Size            Objects   Max IPS    Objects   Max IPS  Objects   Max IPS
vPAS5k              5,000     333        3,750     250      2,500     166
PAS10k              10,000    666        7,500     500      5,000     333
PAS20k / vPAS20k    20,000    1,333      15,000    1,000    10,000    666
PAS40k              40,000    2,664      30,000    2,000    20,000    1,333
PAS60k / vPAS60k    60,000    4,000      45,000    3,000    30,000    2,000
vPAS100k            100,000   6,666      75,000    5,000    50,000    3,333
PAS200k / vPAS200k  200,000   13,333     150,000   10,000   100,000   6,666
PAS300k             300,000   20,000     225,000   15,000   150,000   10,000

where, IPS = Indicators per Second
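
The object limits in the table can be expressed as a simple lookup. The factors below (100%, 75%, and 50%) are read off the table rows for 12, 18, and 24 months; they are not a published SevOne formula, and the names RETENTION_FACTOR and derated_objects are hypothetical. Retention periods not listed in the table are deliberately not interpolated.

```python
# Capacity factors observed in the table: 12 months -> 100%, 18 -> 75%, 24 -> 50%
RETENTION_FACTOR = {12: 1.0, 18: 0.75, 24: 0.5}


def derated_objects(rated_objects: int, retention_months: int) -> int:
    """Object limit for a retention period listed in the table."""
    if retention_months not in RETENTION_FACTOR:
        raise ValueError("retention period not covered by the table")
    return int(rated_objects * RETENTION_FACTOR[retention_months])


print(derated_objects(100_000, 18))  # 75000
print(derated_objects(200_000, 24))  # 100000
```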


Adjust Storage - another option is to increase storage to account for increased data retention.

                    12 Months (default)  18 Months     24 Months
NMS Size            Storage Size         Storage Size  Storage Size
vPAS5k              150GB                225GB         300GB
PAS20k / vPAS20k    600GB                900GB         1.2TB
PAS60k / vPAS60k    1.3TB                2TB           2.6TB
vPAS100k            2TB                  3TB           4TB
PAS200k / vPAS200k  4TB                  6TB           8TB
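
The storage figures in the table appear to scale linearly with the retention period. The sketch below reproduces them under that assumption; it is not an official sizing formula, and the function name scaled_storage_gb is hypothetical. Some table entries are rounded (for example, 1.3TB at 18 months is listed as 2TB rather than 1.95TB).

```python
def scaled_storage_gb(base_storage_gb: float, retention_months: int,
                      base_months: int = 12) -> float:
    """Scale the default 12-month storage size linearly with retention."""
    return base_storage_gb * retention_months / base_months


print(scaled_storage_gb(150, 18))  # 225.0
print(scaled_storage_gb(600, 24))  # 1200.0
```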