IBM Support

QRadar on Cloud: Troubleshooting Data Gateways in UNKNOWN state

Troubleshooting


Problem

A Data Gateway (DG) is the collection appliance in QRadar on Cloud (QRoC) and can be deployed in multiple locations. When the connection between a DG and the QRoC Console is affected, the DG is reported in an UNKNOWN state.
This article guides administrators through identifying and resolving common issues when a Data Gateway goes to an UNKNOWN state.

Symptom

  1. Log in to the QRadar on Cloud Console as the administrator user.
  2. Click the Cloud icon in the upper right section.
  3. Expand the "DOWN" list.


Cause

Because the connection between the QRadar on Cloud Console and the Data Gateways travels over the internet, there are numerous places where this connection might be affected. Networking changes unknown to the administrator, network congestion, and high latency (at the DG premises or on the internet) are the most common causes for a DG to be in an UNKNOWN state.

Environment

QRadar on Cloud Data Gateway appliances or virtual machines.

Diagnosing The Problem

Administrators must have CLI (command-line interface) access to the affected Data Gateway to run the commands suggested. Refer to the "Data Gateway (DG) Administration" section in the QRadar on Cloud Support FAQ to configure CLI access by using SSH.

Unknown networking changes

Data Gateways require a Private IP address, a Public IP address allowed in QRoC's firewall, and a 1:1 NAT configuration (1 private IP address translated to a single public IP address and vice versa) to grant internet access. 
A change to either IP address or to the 1:1 NAT configuration can cause the VPN tunnel to fail.
  1. Log in to the Data Gateway CLI as the root user.
  2. Review the /var/log/openvpn.log file by using the cat command.
cat /var/log/openvpn.log

{Date} us=172847 TCP_CLIENT link remote: [AF_INET]<QRoC VPN Server Public IP>:443
{Date} us=190955 Connection reset, restarting [-1]
{Date} us=191070 TCP/UDP: Closing socket
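The log review in step 2 can also be scripted. The following Python sketch counts the reset messages in the log text. It assumes that the "Connection reset, restarting" line from the sample output is the marker of a failed tunnel attempt; the function name `count_resets` is hypothetical, and other OpenVPN failure messages are not covered.

```python
# Marker copied from the sample log output in this article. Treating it as
# the sign of a failed tunnel attempt is an assumption of this sketch.
RESET_MARKER = "Connection reset, restarting"


def count_resets(log_text):
    """Count the log lines that report a reset of the VPN connection."""
    return sum(1 for line in log_text.splitlines() if RESET_MARKER in line)


sample = """\
{Date} us=172847 TCP_CLIENT link remote: [AF_INET]<QRoC VPN Server Public IP>:443
{Date} us=190955 Connection reset, restarting [-1]
{Date} us=191070 TCP/UDP: Closing socket
"""

print(count_resets(sample))  # 1
```

On a Data Gateway, the same function can be given the contents of /var/log/openvpn.log instead of the sample text. A count that keeps growing suggests that the tunnel is not establishing.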

Network congestion


Data Gateways require a minimum symmetrical bandwidth of 40 Mbps to function properly. When the effective transfer rate decreases, other QRadar processes are affected: deploying configuration changes takes a long time because the database dumps download slowly, and events are buffered in the persistent queue.


Note: The following diagnosis steps require the VPN tunnel to be established. See "Unknown networking changes" to determine whether the VPN is established.
 
  1. Log in to the Data Gateway CLI as the root user.
  2. Review the /var/log/qradar.log file by using the grep command, and ensure that the downloading time is under 5000 milliseconds.
grep "Replication download" /var/log/qradar.log | tail

<DG HOSTNAME> replication[13866]: Replication download timing: Downloading: 3500 ms Overall: 453 ms DB transaction: 20 ms Transaction verification: 6 ms
In the previous output, the "Downloading:" value shows 3500 ms, which indicates acceptable congestion in the network. A 40Mbps link can transfer 200MB in approximately 40 seconds.  Low-bandwidth links are prone to be congested and contribute to a latency increase.
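The check and the arithmetic above can be sketched in Python. The sketch extracts the "Downloading:" value from replication log lines, compares it with the 5000 ms limit, and reproduces the transfer-time estimate. The function names are hypothetical, and the regular expression assumes the log line format shown in the sample output.

```python
import re

# Limit taken from step 2 of this section.
MAX_DOWNLOAD_MS = 5000

# Assumes the line format shown in the sample output of this article.
_DOWNLOADING = re.compile(r"Replication download timing: Downloading: (\d+) ms")


def download_times_ms(log_text):
    """Return every "Downloading:" value (in ms) found in the log text."""
    return [int(m.group(1)) for m in _DOWNLOADING.finditer(log_text)]


def transfer_seconds(size_megabytes, link_mbps):
    """Ideal transfer time: convert megabytes to megabits, divide by link rate."""
    return size_megabytes * 8 / link_mbps


sample = (
    "<DG HOSTNAME> replication[13866]: Replication download timing: "
    "Downloading: 3500 ms Overall: 453 ms DB transaction: 20 ms "
    "Transaction verification: 6 ms"
)

for ms in download_times_ms(sample):
    print(ms, "ok" if ms < MAX_DOWNLOAD_MS else "slow")

print(transfer_seconds(200, 40))  # 40.0
```

The transfer-time figure is a best case: it ignores protocol overhead and assumes that the full link rate is available to the transfer.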

High Latency


Because Data Gateways can be deployed in different locations, the latency (transmission time between points A and B) varies. A value of 150 ms is considered acceptable, but the lowest latency possible is always preferred. High-latency links decrease the effective transfer rate, which reduces the usable bandwidth.
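The effect of latency on the transfer rate can be illustrated with the usual upper bound for a single TCP connection: at most one window of data is delivered per round trip. The following Python sketch applies that bound. The 65,535-byte window is an assumption chosen for illustration (the classic TCP maximum without window scaling); real systems typically negotiate larger windows, so actual rates are higher. The function name is hypothetical.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound for one TCP connection: one window per round trip."""
    bits_per_window = window_bytes * 8
    return bits_per_window / (rtt_ms / 1000) / 1_000_000


# Assumed window size for illustration only (no TCP window scaling).
WINDOW_BYTES = 65535

for rtt_ms in (20, 150, 2500):
    rate = max_throughput_mbps(WINDOW_BYTES, rtt_ms)
    print(f"RTT {rtt_ms} ms -> at most {rate:.2f} Mbps")
```

Under this assumption, a single connection at 150 ms stays below roughly 3.5 Mbps regardless of the link bandwidth, which is why lower latency is always preferred.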

Note: The following diagnosis steps require the VPN tunnel to be established. See "Unknown networking changes" to determine whether the VPN is established.
 
  1. Obtain the Console's Private IP.
    1. Method #1 - WebUI.
      1. Log in to the QRadar Console as an administrator user.
      2. Click the Admin tab.
      3. In the left pane, select Apps.
      4. Click QRoC Self-Serve.
      5. Click Deployment. 
      6. Look for the Console's Private IP.
    2. Method #2 - CLI.
      1. Log in to the Data Gateway CLI as the root user.
      2. Use the grep command to obtain the Console's Private IP.
        grep CONSOLE_PRIVATE_IP /opt/qradar/conf/nva.conf
  2. Use the ping and tcptraceroute commands to measure the latency from the DG to the QRoC Console.

    Example of ping:
    [root@<DG hostname> ~]# ping <QRoC Console Private IP> -c 10
    
    PING <QRoC Console Private IP> (<QRoC Console Private IP>) 56(84) bytes of data.
    64 bytes from <QRoC Console Private IP>: icmp_seq=1 ttl=63 time=2500 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=2 ttl=63 time=2184 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=3 ttl=63 time=2547 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=4 ttl=63 time=2201 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=5 ttl=63 time=2117 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=6 ttl=63 time=3008 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=7 ttl=63 time=2946 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=8 ttl=63 time=2530 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=9 ttl=63 time=2870 ms
    64 bytes from <QRoC Console Private IP>: icmp_seq=10 ttl=63 time=2830 ms
    
    --- <QRoC Console Private IP> ping statistics ---
    
    10 packets transmitted, 8 received, 25% packet loss, time 11005ms
    rtt min/avg/max/mdev = 2117.453/2545.293/3008.992/318.923 ms, pipe 4
    In the previous output, the DG reports an average round-trip time of 2545.293 ms to the Console and 25% packet loss.
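The summary lines of the ping output can be evaluated with a short Python sketch. It extracts the packet loss and the average round-trip time and compares them with the 150 ms value that this article considers acceptable. The function name `ping_summary` is hypothetical, and the regular expressions assume the Linux ping summary format shown in the example.

```python
import re

# Latency value that this article considers acceptable.
ACCEPTABLE_RTT_MS = 150

_LOSS = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
_RTT = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def ping_summary(output):
    """Return (packet_loss_percent, avg_rtt_ms) from the ping summary lines."""
    loss = _LOSS.search(output)
    rtt = _RTT.search(output)
    if loss is None or rtt is None:
        # For example, no rtt line is printed when every packet is lost.
        raise ValueError("ping summary not found")
    return float(loss.group(1)), float(rtt.group(2))


sample = """\
10 packets transmitted, 8 received, 25% packet loss, time 11005ms
rtt min/avg/max/mdev = 2117.453/2545.293/3008.992/318.923 ms, pipe 4
"""

loss, avg = ping_summary(sample)
healthy = loss == 0 and avg <= ACCEPTABLE_RTT_MS
print(loss, avg, healthy)  # 25.0 2545.293 False
```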

    Example of tcptraceroute:
    [root@<DG HOSTNAME> ~]# tcptraceroute <QRoC Console Private IP> 443
    traceroute to <QRoC Console Private IP> (<QRoC Console Private IP>), 30 hops max, 60 byte packets
     1  192.168.47.1 (192.168.47.1)  52.806 ms  52.769 ms  52.756 ms
     2  * * *
     3  console-xxxxx.qradar.ibmcloud.com (<QRoC Console Private IP>) <syn,ack>  52.767 ms  52.702 ms  52.706 ms
    In the previous output, hop 3 reports that the Console was reached in 52.767 ms.

Resolving The Problem

Important
Once the DG connection is restored and the DG transitions to an Active state, all the events and flows buffered while it was in an UNKNOWN state are forwarded. This behavior can cause SAR Sentinel alerts about packet drops and interface load over the threshold.
These alerts are expected and clear on their own once the buffered events and flows are forwarded and the collection services resume forwarding at a regular interval.

Unknown networking changes


Before running the steps in this section, administrators must follow the steps in the technote "QRadar on Cloud: Troubleshooting Data Gateway appliance connectivity".

  1. Determine whether the DG's Public IP changed by using the curl command.
    Note: The following command works only if the DG has open internet access or at least access to https://ifconfig.me.
    curl -k https://ifconfig.me
  2. Verify the reported Public IP is in the QRoC allowlist.
    1. Log in to the QRadar Console as an administrator user.
    2. Click the Admin tab.
    3. In the left pane, select Apps.
    4. Click QRoC Self-Serve.
    5. Click the Allowlist Management menu.
  3. If the Public IP is not listed, add it to the allowlist. For more information, see How to add an IP address to the allowlist.
  4. Reach out to the networking team where the DG is hosted and request an update to the networking configuration:
    1. The 1:1 NAT configuration must use the new IP.
    2. The proxy server (when present) must allow the connection from the new IP to the QRoC (Console and VPN Server) Public IP.
  5. Restart the openvpn@client service.
    systemctl restart openvpn@client
  6. Verify the tunnel (tun0 interface) establishes and appears in the routing table by using the route command.
    route -n
    Example:
    [root@<DG HOSTNAME> ~]# route -n
    Kernel IP routing table
    Destination                Gateway         Genmask           Flags     Metric Ref Use Iface
    0.0.0.0                    10.10.10.1      0.0.0.0           UG        0       0  0   ens192
    10.10.10.0                 0.0.0.0         255.255.255.0     U         0       0  0   ens192
    <QRoC Console Private IP>  192.168.13.1    255.255.255.255   UGH       0       0  0   tun0
    192.168.13.0               0.0.0.0         255.255.255.0     U         0       0  0   tun0
  7. Test the Console response by using the nc command.
    nc -zv <QRoC Console Private IP> 443
    Example:
    [root@<DG HOSTNAME> ~]# nc -zv <QRoC Console Private IP> 443
    Ncat: Version 7.50 ( https://nmap.org/ncat )
    Ncat: Connected
  8. Wait 10 minutes for the DG to report the active status.
Results
The VPN tunnel is established and the communication is restored. The Console considers the DG as active, the processes start working as usual, and the events and flows buffered while disconnected are forwarded to the Console or to the (Event or Flow) Processor appliance.
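The routing table check from step 6 can be sketched in Python. The sketch lists the destinations that are routed through the tun0 interface in `route -n` output. The function name `tunnel_routes` is hypothetical, and the address 172.16.0.10 is a made-up stand-in for the QRoC Console Private IP, because the placeholder text cannot be parsed as a column value.

```python
def tunnel_routes(route_output, iface="tun0"):
    """Return the destinations from `route -n` output that use iface."""
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        # Data rows of `route -n` have 8 columns; the last one is Iface.
        if len(fields) == 8 and fields[-1] == iface:
            routes.append(fields[0])
    return routes


# 172.16.0.10 is a hypothetical Console Private IP used for illustration.
sample = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.10.10.1      0.0.0.0         UG    0      0        0 ens192
10.10.10.0      0.0.0.0         255.255.255.0   U     0      0        0 ens192
172.16.0.10     192.168.13.1    255.255.255.255 UGH   0      0        0 tun0
192.168.13.0    0.0.0.0         255.255.255.0   U     0      0        0 tun0
"""

print(tunnel_routes(sample))  # ['172.16.0.10', '192.168.13.0']
```

An empty list suggests that the tun0 interface is not up and that the VPN tunnel did not establish.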

Network congestion


Administrators must request bandwidth tests from the networking team where the DG is hosted to determine where the network congestion occurs. Usually, bandwidth tests in these sections are enough to locate the congestion:

  1. DG premises (where the DG is hosted). A bandwidth test from the DG premises to the network boundary (typically a network border router or server) must report at least 40 Mbps symmetrical.
  2. The internet. Bandwidth test from the DG to a server on the cloud (ideally on IBM Cloud). This test verifies the ISP link the DG connection uses to reach QRadar on Cloud (IBM Cloud).
Results
If the DG premises bandwidth test reports at least 40 Mbps symmetrical, the network congestion likely exists in the internet link and must be reported to the ISP (Internet Service Provider).
Alternatively, administrators can open a case with QRadar on Cloud Support to request an end-to-end bandwidth test (from the QRoC Console to the DG) and provide the output. This bandwidth test requires the VPN to be established and can be used to confirm that the whole network path has at least the minimum bandwidth.

High Latency


High latency is caused by the sum of geographical distance (expected) and network congestion. Administrators can use the traceroute commands to determine where the latency increases beyond the expected values.
Note: QRadar on Cloud by default does not respond to external (by using the Public IP) ICMP probes. The administrators can use the ping command against the QRoC Private IP address, ICMP-enabled Public servers (for example 8.8.8.8), or internally owned servers.
  1. Use the tcptraceroute command to the QRoC Console Public IP and VPN Server Public IP.
    1. Obtain the Public IP of the Console or VPN Server.
      1. Log in to the QRadar Console as an administrator user.
      2. Click the Admin tab.
      3. In the left pane, select Apps.
      4. Click QRoC Self-Serve.
      5. Click Deployment. 
      6. Look for the Console's Public IP or VPN Server Public IP.
    2. Run the tcptraceroute command.
      tcptraceroute <QRoC Console Public IP> 443
      
      or
      
      tcptraceroute <QRoC VPN Server Public IP> 443
      Example:
      tcptraceroute <QRoC Console Public IP> 443
      
      traceroute to <QRoC Console Public IP> (<QRoC Console Public IP>), 30 hops max, 60 byte packets
      1 10.5.195.251 (10.5.195.251) 0.333 ms 10.5.195.252 (10.5.195.252) 0.256 ms 0.349 ms
      2 10.5.106.20 (10.5.106.20) 0.525 ms 0.648 ms 0.502 ms
      3 209.12.237.60 (209.12.237.60) 0.844 ms 0.952 ms 0.943 ms
      4 66.55.35.81 (66.55.35.81) 1.667 ms 1.898 ms 1.758 ms
      5 ae0-250G.ar1.DAL1.gblx.net (67.17.95.74) 3.909 ms 4.022 ms 4.000 ms
      6 * * *
      7 4.14.131.62 (4.14.131.62) 3.060 ms 3.244 ms 3.221 ms
      8 ae5.cbs01.eq01.dal03.networklayer.com (50.97.17.52) 3.948 ms * *
      9 * * *
      10 * * *
      11 ae7.cbs02.cs01.lax01.networklayer.com (50.97.17.61) 41.580 ms 40.424 ms 39.680 ms
      12 ae0.cbs02.eq01.sjc02.networklayer.com (50.97.17.87) 44.297 ms 43.954 ms 42.194 ms
      13 e5.11.6132.ip4.static.sl-reverse.com (50.97.17.229) 52.228 ms e7.11.6132.ip4.static.sl-reverse.com (50.97.17.231) 42.238 ms 43.581 ms
      14 po1.fcr02a.sjc03.networklayer.com (169.45.118.153) 40.735 ms po2.fcr02b.sjc03.networklayer.com (169.45.118.159) 40.717 ms 42.443 ms
      15 e5.96.2ca9.ip4.static.sl-reverse.com (169.44.150.229) 43.537 ms 41.603 ms 40.283 ms
      16 14.b4.2ca9.ip4.static.sl-reverse.com (<QRoC Console Public IP>) <syn,ack> 41.607 ms 71.718 ms 42.271 ms
      In the previous output, hops 1 and 2 belong to the customer premises, hops 3 - 10 are the ISP network, and hops 11 - 15 are IBM Cloud. Hop 16 shows that the QRoC Console Public IP is reachable, with the three probes answered in 41.607 ms, 71.718 ms, and 42.271 ms.
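To see where the latency increases along the path, the tcptraceroute output can be summarized with a Python sketch. It takes the lowest round-trip time reported for each answering hop and reports the largest increase between consecutive answering hops. The function names are hypothetical, and the parsing assumes the output format shown in the example. Per-hop times in a traceroute are only indicative, because routers can answer probes slowly without delaying forwarded traffic.

```python
import re

_RTT = re.compile(r"(\d+(?:\.\d+)?) ms")


def hop_rtts(trace_output):
    """Map hop number to the lowest RTT (ms) on that line; silent hops are skipped."""
    hops = {}
    for line in trace_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header line or wrapped continuation line
        rtts = [float(value) for value in _RTT.findall(line)]
        if rtts:
            hops[int(fields[0])] = min(rtts)
    return hops


def largest_jump(hops):
    """Return (from_hop, to_hop, increase_ms) for the largest RTT increase."""
    ordered = sorted(hops.items())
    best = None
    for (h1, r1), (h2, r2) in zip(ordered, ordered[1:]):
        if best is None or r2 - r1 > best[2]:
            best = (h1, h2, r2 - r1)
    return best


# Subset of the example output above.
sample = """\
      7 4.14.131.62 (4.14.131.62) 3.060 ms 3.244 ms 3.221 ms
      8 ae5.cbs01.eq01.dal03.networklayer.com (50.97.17.52) 3.948 ms * *
      9 * * *
      10 * * *
      11 ae7.cbs02.cs01.lax01.networklayer.com (50.97.17.61) 41.580 ms 40.424 ms 39.680 ms
      12 ae0.cbs02.eq01.sjc02.networklayer.com (50.97.17.87) 44.297 ms 43.954 ms 42.194 ms
"""

start, end, increase = largest_jump(hop_rtts(sample))
print(f"Largest increase: hop {start} -> hop {end} (+{increase:.1f} ms)")
```

For this sample, the largest increase falls between hops 8 and 11. Because hops 9 and 10 did not answer, the sketch cannot narrow the location further than that range.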
Results
With the tcptraceroute command output, administrators can engage the networking team where the DG resides and provide evidence when the latency increases on-premises or in the ISP network. Once the networking team takes action to decrease the latency, the DG status returns to the Active state.

Document Location

Worldwide

Document Information

Modified date:
16 May 2022

UID

ibm16585626