IBM Support

QRadar: Data Node rebalancing troubleshooting

How To


Summary

When a new Data Node is added to a deployment, the next deployment triggers a rebalancing within the Data Node Cluster.

Objective

This technical note includes advanced procedures to verify the state of a rebalance and the specific log messages to look for. Because the rebalancing log files refer to both numeric node IDs and cluster IDs, users need to understand these references to interpret the log files.

Environment

All QRadar versions.

Steps

When a new data node is added, the next deployment triggers a rebalancing in the cluster. For more information on rebalancing, see Data rebalancing after a data node is added.

Tip: If you plan to add multiple Data Nodes, you can add each of them without deploying the changes. Skipping the deploy until all hosts are added avoids waiting for the first appliance to rebalance; completing a single deploy afterward allows the Data Nodes to rebalance in parallel.

1. Reviewing datanode_status table for rebalancing progress

You can monitor the progress of a Data Node rebalancing in the user interface or from the command line. The datanode_status table displays the status of the rebalancing and any errors if the rebalance failed.
psql -U qradar -c "select * from datanode_status"
Sample output from the command
id | node_id | database | master_node_id |        status      |error_messages_json | last_rebalancing_start_time | last_rebalancing_end_time
---+---------+----------+----------------+--------------------+--------------------+-----------------------------+---------------------------
 1 |       8 | flows    |              8 | rebalancingStarted |                    |               1656362320849 |             1656362334071
 2 |       8 | events   |              8 | rebalancingStarted |                    |               1656362322276 |             1656362337311
 3 |     121 | flows    |              8 | rebalancingStarted |                    |               1656362320338 |             1656362332576
 4 |     121 | events   |              8 | rebalancingStarted |                    |               1656362321316 |             1656362337015
Index of table properties:
node_id: The ID of the component that the data is sent to. For example, node_id 121 is the dataNode component on DN1.
database: The type of data rebalanced. For example, flows or events.
master_node_id: The ecs-ep component on the Event Processor, which is the "parent" of the cluster.
status: The status of the rebalancing. For example, rebalancingStarted, rebalancingCompleted, or rebalancingFailed.
error_messages_json: The error message if the rebalance process fails.
last_rebalancing_start_time / last_rebalancing_end_time: EPOCH (millisecond) start and end times of the rebalance.
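
The two timestamp columns are epoch values in milliseconds. A small helper can convert them to readable dates; this is a sketch that assumes GNU date, which is available on QRadar's RHEL base:

```shell
# Convert an epoch-millisecond timestamp (as stored in
# last_rebalancing_start_time / last_rebalancing_end_time)
# to a human-readable UTC date. Requires GNU date.
epoch_ms_to_date() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Example with the sample start time from the output above:
epoch_ms_to_date 1656362320849
```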

 

2. Rebalancing disk space targets

The log files on the parent Event Processor component also provide useful information on how each Data Node is expected to grow or shrink its disk usage as it rebalances event and flow data. When you add a Data Node, the expected target for all nodes can be viewed in the logs.
grep -i target /var/log/qradar.log
Sample output that displays the expected target disk space across all Data Nodes:
[ariel.ariel_query_server] [agt0_6:flows] com.ibm.si.ariel.dcs.databalancing.MasterTask: [INFO] [NOT:0000006000][<IP>/- -] [-/- -]Node[8] has free space: 65.406032
[ariel.ariel_query_server] [agt0_6:flows] com.ibm.si.ariel.dcs.databalancing.MasterTask: [INFO] [NOT:0000006000][<IP>/- -] [-/- -]Node[111] has free space: 64.089390
[ariel.ariel_query_server] [agt0_6:flows] com.ibm.si.ariel.dcs.databalancing.MasterTask: [INFO] [NOT:0000006000][<IP>/- -] [-/- -]Node[121] has free space: 64.062085
[ariel.ariel_query_server] [agt0_6:flows] com.ibm.si.ariel.dcs.databalancing.MasterTask: [INFO] [NOT:0000006000][<IP>/- -] [-/- -]Targeting free space: 64.519169 for all nodes
The "Target" is the percentage of free space that the rebalancing hopes to achieve on all members of the cluster. The target is also displayed in the user interface. For more information, see Viewing the progress of data rebalance.
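
In the sample output above, the target works out to the arithmetic mean of the three nodes' free-space percentages. This relationship is inferred from the sample rather than documented behavior, but you can verify the arithmetic:

```shell
# Mean of the per-node free-space percentages from the sample log:
# Node[8]=65.406032, Node[111]=64.089390, Node[121]=64.062085
awk 'BEGIN { printf "%.6f\n", (65.406032 + 64.089390 + 64.062085) / 3 }'
# Prints 64.519169, matching "Targeting free space: 64.519169".
```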
 

3. Monitoring Rebalancing from the command line

To monitor the log files during rebalancing, run:
tailf /var/log/qradar.log | grep -i balanc
Sample log entries when rebalancing begins
[ariel.ariel_query_server] [ariel_client:47092] com.ibm.si.ariel.dcs.config.DataClusterConfiguration: [INFO] [NOT:0130005100][<IP>/- -] [-/- -]Data node cluster 8 has begun rebalancing. Until the rebalancing is complete, you cannot modify the cluster membership, or the hosts they belong.
Sample log entries when rebalancing completes
[ariel_proxy.ariel_proxy_server] [agt0_1:events] com.ibm.si.ariel.dcs.databalancing.MasterWorker: [INFO] [NOT:0000006000][/- -] [-/- -]Source node id: 8. Result: DNStatus [status=COMPLETED, usableSpace=37925330944, totalSpace=151095197696, volume=/dev/mapper/storerhel-store, storeInfo/store (/dev/mapper/storerhel-store)]
[ariel_proxy.ariel_proxy_server] [ariel_client /127.0.0.1:60642] com.ibm.si.ariel.dcs.config.DataClusterConfiguration: [INFO] [NOT:0130005101][ /- -] [-/- -]Data node cluster 8 has finished rebalancing. You can now modify the cluster membership and the hosts they belong to.
The log files refer to the cluster ID and the Data Node ID.
The Cluster ID is the eventprocessor element of the parent node.
To determine the parent node, navigate to:
Admin > System and License Management > Deployment Actions > View Deployment > Reset Layout
Determine the managed host ID for the parent node and the data nodes from the managedhost table.
psql -U qradar -c "select * from managedhost where status='Active'" 
Sample output from the command. Note the ID of the parent node of the cluster and the Data Nodes.
 id  |      ip       |   hostname   | status | isconsole | appliancetype |      creationdate       |       updatedate        | qradar_version | primary_host | se
condary_host | haoptions | email_server_id
-----+---------------+--------------+--------+-----------+---------------+-------------------------+-------------------------+----------------+--------------+---
-------------+-----------+-----------------
  53 | 192.168.xx.xx | console      | Active | t         | 3199          | 2022-05-12 08:49:09.055 | 2022-05-12 08:49:09.055 | 7.4.3          |           51 |
             |           |               2
 106 | 192.168.xx.xx | processor01  | Active | f         | 1699          | 2022-05-13 00:43:18.943 | 2022-11-17 16:52:58.936 | 7.4.3          |          104 |
             |           |               1
 108 | 192.168.xx.xx | datanode1    | Active | f         | 1400          | 2022-05-13 01:14:21.219 | 2022-11-17 16:59:52.307 | 7.4.3          |          106 |
             |           |               2

In this example:
  • Parent "managed_host_ID" is 106
  • DataNode "managed_host_ID" is 108
Determine the Cluster ID from the eventprocessor element of the parent node in the deployed_component table:
psql -U qradar -c "select * from deployed_component where managed_host_id=106 and name like 'eventprocessor%'"
Sample output from the command shows that the Cluster ID is 8.
 id  |           name           | managed_host_id | component_id | changed
-----+--------------------------+-----------------+--------------+---------
 8   | eventprocessor8        |             106 |          113 | f
Determine the Data Node ID from the dataNode element in the deployed_component table:
psql -U qradar -c "select * from deployed_component where managed_host_id=108 and name like 'dataNode%'"
Sample output from the command shows that the Data Node ID is 121.
 id  |    name     | managed_host_id | component_id | changed
-----+-------------+-----------------+--------------+---------
 121 | dataNodeA |             108 |          108 | f
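
To map component IDs to hostnames in one step, you can join the two tables. This is a sketch based on the column names visible in the sample output above, not a documented procedure:

```shell
# List eventprocessor and dataNode components with their hostnames
# by joining deployed_component to managedhost on the host ID.
psql -U qradar -c "select dc.id, dc.name, mh.hostname \
  from deployed_component dc \
  join managedhost mh on dc.managed_host_id = mh.id \
  where dc.name like 'eventprocessor%' or dc.name like 'dataNode%'"
```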
Review the logs on the parent and data nodes.  
grep -i balancing /var/log/qradar.log
Sample output showing data being transferred from the cluster parent to the Data Node.
[ariel.ariel_query_server] [agt0_1:events] com.ibm.si.ariel.dcs.databalancing.DTClient: [INFO] [NOT:0000006000][x.x.x.x/- -] [-/- -]DataBlockBegin to x.xx.xxx.xxx:32006 (8 -> 121, Path: BlockInfo [fInfo=/store/ariel/events/records/2022/9/15/6[20-09-15,06:00:00], attrs={}])  DNStatus [status=EXECUTE, usableSpace=161560137728, totalSpace=174601854976, volume=/dev/mapper/storerhel-store, storeInfo/store (/dev/mapper/storerhel-store)]
The logs show the source and destination node IDs for the data being copied. For example, in the log message, rebalancing occurs from node 8 (the cluster parent) to node 121 (the Data Node).
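
To summarize how many data blocks moved between each pair of nodes, you can parse these entries with a small helper. This is a sketch; the pattern is based on the sample DataBlockBegin line above:

```shell
# Tally rebalancing transfers per "source -> destination" node pair
# from DataBlockBegin entries. Reads the files given as arguments,
# or stdin when none are given.
count_rebalance_pairs() {
  grep -oE 'DataBlockBegin to [^ ]+ \([0-9]+ -> [0-9]+' "$@" \
    | grep -oE '[0-9]+ -> [0-9]+' \
    | sort | uniq -c | sort -rn
}

# Usage on the parent Event Processor:
#   count_rebalance_pairs /var/log/qradar.log
```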
 

4. Encryption

In QRadar 7.5.0 Update Package 3 or later, you can add encrypted Data Node appliances. Administrators on QRadar 7.5.0 Update Package 2 or earlier can experience issues where data does not rebalance properly between encrypted hosts.

You cannot mix unencrypted and encrypted Data Nodes, but all Data Nodes can be encrypted if you are using newer versions of QRadar. QRadar uses TCP port 32006 to rebalance and move data from the EP to an available Data Node. If port 32006 is blocked on the network, rebalancing continually fails until the port is opened, regardless of encryption.
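
To quickly confirm that TCP port 32006 is reachable from the Event Processor to a Data Node, a connectivity check similar to the following can help. This is a sketch using bash's built-in /dev/tcp redirection; substitute your Data Node's IP address:

```shell
# Test whether the rebalancing port (TCP 32006 by default) is open
# on a remote host. Uses bash's /dev/tcp redirection with a 5s timeout.
check_rebalance_port() {
  local host="$1" port="${2:-32006}"
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "port ${port} on ${host} is open"
  else
    echo "port ${port} on ${host} is blocked or closed"
  fi
}

# Example (hypothetical Data Node address):
#   check_rebalance_port 192.168.1.20
```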

Note: Encryption is enabled by default when you add a host in QRadar version 7.5.0. There is no option to disable encryption when the Data Node is initially added. Because you cannot mix encrypted and unencrypted Data Nodes in a deployment, if your existing Data Nodes are unencrypted, you might be required to edit the host from System and License Management to disable encryption before you deploy changes.

For more information, see Data Nodes and data storage.
 

5. Hourly folders

During rebalancing, the hourly folder paths are expected not to exist on the destination that the EP or another Data Node is trying to balance to. A destination rejects the request to rebalance when the folder is already present. If a directory or folder already exists, the EP either tries another Data Node or marks the data as "Cancelled", and the hourly directory stays local to the source and is not rebalanced.
 
The Data Node process is intended to move the entire hourly folder and cannot merge data. The hourly folder might exist on the Data Node from a previous failed attempt to add a Data Node, or if the Data Node was removed and added again.
 
To compare hourly files between each host, you can use the du command. For example, if you run the following command on the EP and each Data Node individually, you can expect to see approximately the same amount of event data each hour and in total:
 
du -h -d 1 /store/ariel/events/records/year/month/day
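
To collect the totals from each host in one pass, a loop over SSH can help. This is a sketch; the hostnames are the examples from the managedhost output above (substitute your own), and root SSH access between managed hosts is assumed:

```shell
# Compare total event data across the EP and Data Nodes.
# Replace the hostnames with your own managed hosts.
for host in processor01 datanode1; do
  echo "== ${host} =="
  ssh "${host}" du -sh /store/ariel/events/records
done
```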

 

Document Location

Worldwide


Document Information

Modified date:
31 July 2023

UID

ibm16844737