Troubleshooting
Problem
This article is intended to help customers monitor and troubleshoot their deployment issues.
Symptom
Deploys can report "Timed Out" but continue in the background and finish successfully.
Networking bandwidth and disk space issues can also affect deployments.
Cause
- Timeout
- Performance
- Disk space issues
- Service issues
- Bandwidth
- Tunnels or connection issues
Resolving The Problem
There are two types of deployments administrators can complete in the user interface:
- Admin tab > "Deploy Changes"
"Deploy Changes" is an incremental deployment that sends administrative changes to the managed hosts in the QRadar deployment and does not impact core services.
- Admin tab > Advanced > "Deploy Full Configuration"
"Deploy Full Configuration" rebuilds the full configuration and restarts services on each managed host.
NOTE: Some organizations require a change request, or have policies and procedures such as notifying users, that must be followed before you run a "Deploy Full Configuration".
The process for monitoring the Deployment from the command line is the same for both.
Monitoring the logs
After you start a deployment from the web UI, monitor its progress in the logs:
tail -f /var/log/qradar.log | grep -i deploy
The logs can help determine where the "deploys" are failing or why they are timing out.
Files Generated During Deployment
The log files show the deployment "Initiating" and say which .zip files are being created. Sample messages are shown.
Deploy: Global Set Builder is creating Zip file zipfile_GEN.full.zip, fullDeploy:true, firstTime:false
Deploy: Global Set Builder is creating Zip file zipfile_QVM.full.zip, fullDeploy:true, firstTime:false, qvmFile:false
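To see at a glance which archives a deployment is building, the "Global Set Builder" messages can be filtered down to the archive names. The following is a minimal sketch, assuming the message wording matches the samples in this article. The function name deploy_zip_names is hypothetical and is not part of QRadar.

```shell
# Print the archive names announced by "Global Set Builder" log messages.
# Reads log lines on stdin and prints each archive name once.
# Assumption: the message wording matches the sample messages in this article.
deploy_zip_names() {
    sed -n 's/.*Global Set Builder is creating Zip file \([^,]*\),.*/\1/p' | sort -u
}

# Example (run on the Console):
#   grep -i deploy /var/log/qradar.log | deploy_zip_names
```

Lines that do not contain the message are ignored, so the function can be fed the whole log.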
The deployment files are created in /store/configservices/configurationsets/:
ls -tail /store/configservices/configurationsets/
total 509256
585562 drwxr-xr-x 2 nobody nobody 4096 Oct 28 07:21 .
585548 -rw-r--r-- 1 nobody nobody 69 Oct 28 07:21 x.xxx.xxx.x.deploymentToken.txt
1631905 -rw-r--r-- 1 root root 64 Oct 28 07:21 x.xxx.xxx.x_zipfile_GEN.full.zip.chk
696725 -rw-r--r-- 1 root root 64 Oct 28 07:21 x.xxx.xxx.x_zipfile_QVM.full.zip.chk
1631906 -rw-r--r-- 1 root root 64 Oct 28 07:21 x.xxx.xxx.x_zipfile_QVM.zip.chk
1631904 -rw-r--r-- 1 root root 64 Oct 28 07:21 x.xxx.xxx.x_zipfile_GEN.zip.chk
260428 -rw-r--r-- 1 nobody nobody 1682 Oct 28 07:17 globalset_list.xml
260427 -rw-r--r-- 1 nobody nobody 64 Oct 28 07:17 zipfile_QVM.full.zip.chk
260426 -rw-r--r-- 1 nobody nobody 22 Oct 28 07:17 zipfile_QVM.full.zip
260425 -rw-r--r-- 1 nobody nobody 64 Oct 28 07:17 zipfile_GEN.full.zip.chk
260423 -rw-r--r-- 1 nobody nobody 260705034 Oct 28 07:17 zipfile_GEN.full.zip
714988 -rw-r--r-- 1 nobody nobody 64 Oct 28 07:17 zipfile_QVM.zip.chk
993314 -rw-r--r-- 1 nobody nobody 22 Oct 28 07:17 zipfile_QVM.zip
993313 -rw-r--r-- 1 nobody nobody 64 Oct 28 07:17 zipfile_GEN.zip.chk
993312 -rw-r--r-- 1 nobody nobody 260705034 Oct 28 07:17 zipfile_GEN.zip
Monitor the files on the managed host. The files increase in size until they match the size on the Console.
globalset_list.xml | Contains the deployment token and an entry for each of the hosts that require a deployment. |
zipfile_* | Contains the files to be deployed. The relevant files are copied out to each of the managed hosts. |
zipfile*.chk | Contains the sha256sum of the generated .zip file. |
IP*.chk | Created for each managed host. Used to ensure integrity during transfer. |
*.deploymentToken.txt | Contains the deployment token from the globalset_list.xml. |
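Because each .chk file holds the sha256sum of its archive, a copied archive can be checked by hand. The following is a minimal sketch, assuming the .chk file contains only the 64-character hexadecimal digest, which is consistent with the 64-byte .chk files in the listing. The function name verify_zip_chk is hypothetical.

```shell
# Compare an archive with its .chk file.
# Assumption: the .chk file holds the SHA-256 digest as 64 hex characters.
# Returns 0 on match, 1 on mismatch, 2 if a file cannot be read.
verify_zip_chk() {
    local zip="$1" chk="$2" actual expected
    [ -r "$zip" ] && [ -r "$chk" ] || return 2
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha256sum "$zip" | cut -d' ' -f1)
    # Strip any whitespace or newline from the stored digest.
    expected=$(tr -d '[:space:]' < "$chk" | cut -c1-64)
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example:
#   cd /store/configservices/configurationsets/
#   verify_zip_chk zipfile_GEN.full.zip zipfile_GEN.full.zip.chk && echo "checksum OK"
```

A mismatch on a managed host while the file is still growing is expected. It becomes meaningful only after the transfer stops.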
Watching the Progress of the Deployment
During the deployment, a status file is created for the Console and each of the Managed Hosts.
ls -tail /store/tmp/status/deploy* && watch -n 2 "more /storetmp/status/deploy* | cat | sed 's/:::::::::::::://' | sed '/^$/d'"
Ensure that the date and time stamp of each file is current. The size of each file, in bytes, indicates the deployment status for that host. The sizes correspond to the following status messages:
- 21 = Initiating Deployment
- 11 = In Progress
- 7 = Success
- 9 = Timed Out
- 5 = Error
cat /store/tmp/status/deploy*
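The size-to-status mapping can be scripted so that a directory listing does not have to be decoded by eye. The following is a minimal sketch based only on the sizes listed above. Each size equals the character count of its status message. The function names are hypothetical.

```shell
# Translate the size in bytes of a deployment status file into a status message.
deploy_status_from_size() {
    case "$1" in
        21) echo "Initiating Deployment" ;;
        11) echo "In Progress" ;;
        7)  echo "Success" ;;
        9)  echo "Timed Out" ;;
        5)  echo "Error" ;;
        *)  echo "Unknown (size $1)" ;;
    esac
}

# Print "<file>: <status>" for each status file given as an argument.
report_deploy_status() {
    local f size
    for f in "$@"; do
        [ -f "$f" ] || continue
        size=$(( $(wc -c < "$f") ))
        printf '%s: %s\n' "$f" "$(deploy_status_from_size "$size")"
    done
}

# Example:
#   report_deploy_status /store/tmp/status/deploy*
```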
Check the Deployment on the Managed Host
The Managed Host checks the Console for a deployment request every 10 seconds. Once it finds the request, it does the following.
- Downloads the globalset_list.xml from the Console to /store/configservices/configurationsets directory.
- Checks if the Deployment Token from the globalset_list.xml file matches /store/configservices/configurationsets/<IP>.deploymentToken.txt where <IP> is the private IP address of the host.
- Downloads either the incremental or full Global Set archive (.zip files) depending on the deployment type.
- If the host contains the QVM processor component, QVM files are downloaded.
To check the configuration and database change downloads:
- On the Console, note the date, time, and size of the deployment files.
ls -tail /store/configservices/configurationsets/
- Open a session to a Managed Host:
ssh <MH_IP>
- Monitor file growth:
ls -tail /store/configservices/configurationsets/
- Compare with the size of the files on the Console. If the files stop growing before they reach the Console size, a networking problem is the likely cause.
- Once the files are the same size as the Console, the deployment completes and changes to Success status.
- If the Managed Host was previously reporting timeout, it now displays "Success". On the Console, check the status of the progress file in /store/tmp/status. A size of 7 indicates success.
ls -tail /store/tmp/status/deployment*
Hostcontext
The hostcontext service shows various scripts that are run while the deployment is in progress. If hostcontext is not stable or replication is failing, deployment can fail. Check how long hostcontext has been active. If it shows only a few minutes, run the systemctl command a few times to confirm that the time increases. If it shows failed, there is an issue.
systemctl status hostcontext
● hostcontext.service - hostcontext daemon
Loaded: loaded (/usr/lib/systemd/system/hostcontext.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/hostcontext.service.d
└─timeout.conf, ulimit.conf
Active: active (running) since Sun 2022-10-30 01:01:15 GMT; 1 weeks 5 days ago <<<<<========
Main PID: 37360 (java)
Tasks: 229
Memory: 17.5G
CGroup: /system.slice/hostcontext.service
├─19560 /bin/sh /opt/qradar/bin/check_sar.sh 5 /store/tmp/sar_report.1668163030432
├─19564 sar -S -d -p -r -u -q -I SUM -n DEV -n EDEV 5 1
├─19565 grep -v drbd
├─19566 grep -E -v ^([0-9]{2}:[0-9]{2}:[0-9]{2})\s+(AM|PM)\s+(rhel|rootrhel|storerhel|docker)
├─19567 iostat -p -m -x -y 5 1
├─19568 grep -v -E ^drbd
├─19569 grep -v -E ^dm-
├─19571 sadc 5 2 -z -S 768
└─37360 /bin/java -Dapplication.name=hostcontext -Dapp_id=hostcontext -Djava.library.path=/opt/qradar/lib -Dapplication.baseURL=file:///opt/qradar/...
Preparing incremental database dump as transaction 0000000000000043026
Replication incremental transaction for 3 relations, 0 JMS messages: Duration: 1169 ms
Preparing incremental database dump as transaction 0000000000000043027
Replication incremental transaction for 2 relations, 0 JMS messages: Duration: 1177 ms
Preparing incremental database dump as transaction 0000000000000043028
Replication incremental transaction for 2 relations, 0 JMS messages: Duration: 1201 ms
Preparing incremental database dump as transaction 0000000000000043029
Replication incremental transaction for 2 relations, 0 JMS messages: Duration: 1251 ms
Preparing incremental database dump as transaction 0000000000000043030 <<<<<========
Replication incremental transaction for 2 relations, 0 JMS messages: Duration: 1187 ms
Look at the "Preparing incremental database dump" and "Replication incremental transaction" messages. These files are downloaded and applied every minute. Also check hostcontext on the Console and compare the transaction numbers.
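Comparing transaction numbers between the Console and a managed host can be done with a small filter. The following is a minimal sketch, assuming the message wording matches the sample output in this article. The function names latest_replication_txn and txn_lag are hypothetical.

```shell
# Print the most recent replication transaction number found on stdin.
# Assumption: the message wording matches the sample output in this article.
latest_replication_txn() {
    sed -n 's/.*Preparing incremental database dump as transaction \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Print how many transactions the second number is behind the first.
# Leading zeros are stripped so the values are not read as octal.
txn_lag() {
    local a b
    a=$(printf '%s' "$1" | sed 's/^0*//')
    b=$(printf '%s' "$2" | sed 's/^0*//')
    echo $(( ${a:-0} - ${b:-0} ))
}

# Example (run on each host, then compare the two values):
#   systemctl status hostcontext --no-pager | latest_replication_txn
```

A difference that keeps growing between checks suggests that replication on the managed host is not keeping up.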
Bandwidth Test
Deployments fail when there is insufficient bandwidth between the Console and the Managed Host.
To test the bandwidth, create a 1GB file on the Console.
fallocate -l 1G /store/1gbfile
Copy it to the Managed Host and wait for it to complete:
scp /store/1gbfile <MH_IP>:/store/
1gbfile 100% 1024MB 93.1MB/s 00:11
The bandwidth in the example is 93.1 MB/s. Refer to "Bandwidth for managed hosts" for the supported bandwidth.
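The rate that scp reports can be cross-checked from the file size and the elapsed time. In the example, 1024 MB in 11 seconds is about 93.1 MB/s. The following is a minimal sketch. The function name throughput_mbs is hypothetical.

```shell
# Print the average throughput in MB/s for <megabytes> transferred in <seconds>.
# Returns a non-zero status if the elapsed time is not positive.
throughput_mbs() {
    awk -v mb="$1" -v s="$2" 'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", mb / s }'
}

# Example:
#   start=$(date +%s)
#   scp /store/1gbfile <MH_IP>:/store/
#   end=$(date +%s)
#   throughput_mbs 1024 $(( end - start ))
#   rm -f /store/1gbfile    # remove the test file, and the copy on the managed host
```

Timing with whole seconds is coarse, so treat the result as an approximation for short transfers.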
Tomcat connection
Each of the managed hosts needs to be able to talk to Tomcat on the console.
To check this connection:
/opt/qradar/bin/test_tomcat_connection.sh
Starting up...
Connected to tomcat
If test_tomcat_connection.sh is unable to connect, check hostcontext and host tokens.
Disk Space
If there are space issues, deployments fail.
In the logs, messages that relate to "critical disk space" are visible.
[hostcontext.hostcontext] [ConfigChangeObserver Timer[1]] com.q1labs.configservices.util.ConfigServicesUtil: [ERROR] [NOT:0000003000][/- -] [-/- -]Deployment is blocked due to critical disk space issue
[hostcontext.hostcontext] [ConfigChangeObserver Timer[1]] com.q1labs.hostcontext.configuration.ConfigChangeObserver: [INFO] [NOT:0000006000][/- -] [-/- -]Setting deployment status to Error
Check the disk space on all the servers:
/opt/qradar/support/all_servers.sh -Ck "df -Th"
In addition, "Disk Space 101" on the QRadar 101 Community site has more information.
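On a host with many partitions, the df output can be filtered to show only the file systems that are nearly full. The following is a minimal sketch that reads `df -P` output. The function name over_threshold is hypothetical, and the default limit of 95 percent is an assumption for illustration, not a documented QRadar value.

```shell
# Read `df -P` output on stdin and print "<mount point> <usage>" for every
# file system whose usage is at or above the given percentage.
# Assumption: the default limit of 95 is illustrative only.
over_threshold() {
    local limit="${1:-95}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}

# Example:
#   df -P | over_threshold 90
```

The -P option keeps each file system on one line, which plain df does not guarantee for long device names.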
Performance
During the deployment, it can be useful to monitor the performance of the system and identify any bottlenecks. Some useful commands are top, iotop, and sar. With the -d option, the sar command reports block device I/O activity. The -p option shows the output in "pretty" format and gives the device name. Without -p, the block device names are displayed by using the major and minor numbers.
sar -pd 1 5
Linux 3.10.0-1160.71.1.el7.x86_64 (q1csdesx-250.uk.ibm.com) 11/11/2022 _x86_64_ (40 CPU)
09:15:49 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:15:50 AM sda 3.00 0.00 56.00 18.67 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-root 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-storetmp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-tmp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-home 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-opt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-varlogaudit 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-varlog 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM rootrhel-var 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM storerhel-transient 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:15:50 AM storerhel-store 3.00 0.00 80.00 26.67 0.00 0.00 0.00 0.00
The top command is also useful.
top - 09:21:33 up 13 days, 19:43, 1 user, load average: 1.15, 1.12, 1.25
Tasks: 867 total, 1 running, 865 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.8/0.4 1[ ]
KiB Mem : 35.1/13182942+[||||||||||||||||||||||||||||||||||| ]
KiB Swap: 0.0/25165820 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3866 root 20 0 199524 85744 1452 S 8.9 0.1 827:54.35 /bin/bash --login /opt/qradar/perf/systemStabMon -interval 23
22285 root 20 0 0 0 0 Z 5.6 0.0 0:00.17 [date] <defunct>
8419 root 10 -10 37.5g 8.7g 16616 S 3.0 6.9 80:09.82 /bin/java -Dapplication.name=ecs-ep -Dapp_id=ecs-ep -Djava.library.path=/opt/qradar/lib -Da+
32196 root 0 -20 30.5g 5.2g 16308 S 3.0 4.2 39:55.04 /bin/java -Dapplication.name=ecs-ec -Dapp_id=ecs-ec -Djava.library.path=/opt/qradar/lib -Da+
21826 root 20 0 163212 3668 1964 R 1.3 0.0 0:00.22 top
33294 root 0 -20 24.9g 2.4g 17164 S 1.3 1.9 284:46.84 /bin/java -Dapplication.name=ecs-ec-ingress -Dapp_id=ecs-ec-ingress -Djava.library.path=/op+
37360 root 20 0 17.7g 521116 16672 S 1.3 0.4 1339:24 /bin/java -Dapplication.name=hostcontext -Dapp_id=hostcontext -Djava.library.path=/opt/qrad+
22292 root 20 0 162856 3148 1820 S 1.0 0.0 0:00.03 top -b -n 1
3998 root 20 0 111908 8308 4480 S 0.7 0.0 29:36.27 /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
6394 postgres 20 0 256184 3568 668 S 0.7 0.0 44:29.15 postgres: stats collector
19427 qvmuser 39 19 11.4g 610648 16328 S 0.7 0.5 40:12.90 /bin/java -classpath .:/opt/qradar/conf:/opt/qvm/console/meta:/opt/qvm/console/conf:/opt/qr+
9 root 20 0 0 0 0 S 0.3 0.0 91:36.24 [rcu_sched]
1153 root 20 0 0 0 0 S 0.3 0.0 6:19.85 [xfsaild/dm-1]
1283 root 20 0 0 0 0 S 0.3 0.0 4:48.29 [xfsaild/dm-5]
19754 postgres 20 0 561672 3344 1452 S 0.3 0.0 14:07.47 postgres: autovacuum launcher
22578 root 20 0 3538720 77376 26324 S 0.3 0.1 25:57.97 /usr/bin/dockerd
28930 nobody 20 0 218428 51840 6684 S 0.3 0.0 26:05.46 /usr/bin/python3.6 /usr/bin/celery worker -A app.celery_worker.config -Q celery --loglevel=+
1 root 20 0 192240 5308 2644 S 0.0 0.0 65:57.50 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
2 root 20 0 0 0 0 S 0.0 0.0 0:01.55 [kthreadd]
4 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 [kworker/0:0H]
6 root 20 0 0 0 0 S 0.0 0.0 1:03.19 [ksoftirqd/0]
7 root rt 0 0 0 0 S 0.0 0.0 0:05.82 [migration/0]
The following key sequence can be used in top to display CPU and memory activity in a status bar format:
ctmt
c - toggle command name/line
t - toggle task/CPU summary
m - toggle memory summary
t - toggle task/CPU summary again
For online help, press h.
Help for Interactive Commands - procps-ng version 3.3.10
Window 1:Def: Cumulative mode Off. System: Delay 3.0 secs; Secure mode Off.
Z,B,E,e Global: 'Z' colors; 'B' bold; 'E'/'e' summary/task memory scale
l,t,m Toggle Summary: 'l' load avg; 't' task/cpu stats; 'm' memory info
0,1,2,3,I Toggle: '0' zeros; '1/2/3' cpus or numa node views; 'I' Irix mode
f,F,X Fields: 'f'/'F' add/remove/order/sort; 'X' increase fixed-width
L,&,<,> . Locate: 'L'/'&' find/again; Move sort column: '<'/'>' left/right
R,H,V,J . Toggle: 'R' Sort; 'H' Threads; 'V' Forest view; 'J' Num justify
c,i,S,j . Toggle: 'c' Cmd name/line; 'i' Idle; 'S' Time; 'j' Str justify
x,y . Toggle highlights: 'x' sort field; 'y' running tasks
z,b . Toggle: 'z' color/mono; 'b' bold/reverse (only if 'x' or 'y')
u,U,o,O . Filter by: 'u'/'U' effective/any user; 'o'/'O' other criteria
n,#,^O . Set: 'n'/'#' max tasks displayed; Show: Ctrl+'O' other filter(s)
C,... . Toggle scroll coordinates msg for: up,down,left,right,home,end
k,r Manipulate tasks: 'k' kill; 'r' renice
d or s Set update interval
W,Y Write configuration file 'W'; Inspect other output 'Y'
q Quit
( commands shown with '.' require a visible task display window )
Press 'h' or '?' for help with Windows,
Type 'q' or <Esc> to continue
On VM systems the "st" parameter (end of 3rd line) can indicate issues with underlying VM resources. The Steal Time (st) indicates the amount of CPU 'stolen' from the virtual machine by the hypervisor for other tasks.
top - 09:55:31 up 1 day, 17:17, 2 users, load average: 0.73, 0.92, 1.11
Tasks: 758 total, 4 running, 754 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.0 us, 3.8 sy, 0.1 ni, 83.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 65806276 total, 16518864 free, 25496820 used, 23790592 buff/cache
KiB Swap: 25165820 total, 25165820 free, 0 used. 36488324 avail Mem
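To track steal time over several samples, the value can be pulled out of the "%Cpu(s)" summary line. The following is a minimal sketch, assuming the default summary layout shown in the sample, not the status bar format. The function name steal_time is hypothetical.

```shell
# Print the steal-time percentage from a top "%Cpu(s):" summary line on stdin.
# Assumption: top is showing the default comma-separated summary layout.
steal_time() {
    awk -F',' '/^%Cpu/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ / st *$/) {
                v = $i
                sub(/ st *$/, "", v)   # drop the " st" label
                gsub(/ /, "", v)       # drop padding spaces
                print v
            }
        }
    }'
}

# Example:
#   top -b -n 1 | steal_time
```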
The iotop command is useful to see read/write activity. The online help provides more information.
iotop -h
Usage: /usr/sbin/iotop [OPTIONS]
DISK READ and DISK WRITE are the block I/O bandwidth used during the sampling
period. SWAPIN and IO are the percentages of time the thread spent respectively
while swapping in and waiting on I/O more generally. PRIO is the I/O priority at
which the thread is running (set using the ionice command).
Controls: left and right arrows to change the sorting column, r to invert the
sorting order, o to toggle the --only option, p to toggle the --processes
option, a to toggle the --accumulated option, i to change I/O priority, q to
quit, any other key to force a refresh.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-o, --only only show processes or threads actually doing I/O
-b, --batch non-interactive mode
-n NUM, --iter=NUM number of iterations before ending [infinite]
-d SEC, --delay=SEC delay between iterations [1 second]
-p PID, --pid=PID processes/threads to monitor [all]
-u USER, --user=USER users to monitor [all]
-P, --processes only show processes, not all threads
-a, --accumulated show accumulated I/O instead of bandwidth
-k, --kilobytes use kilobytes instead of a human friendly unit
-t, --time add a timestamp on each line (implies --batch)
-q, --quiet suppress some lines of header (implies --batch)
The -o option shows processes that are currently performing IO.
iotop -o
Total DISK READ : 0.00 B/s | Total DISK WRITE : 336.89 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 1200.93 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
24327 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.08 % [kworker/2:3]
33527 be/4 postgres 0.00 B/s 12.48 K/s 0.00 % 0.01 % postgres: fusionvm fusionvm 127.0.0.1(55058) idle
19753 be/4 postgres 0.00 B/s 6.24 K/s 0.00 % 0.01 % postgres: walwriter
15905 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % java -Dapplication.name=hostcontext -Dapp_id=hostc~.jar:/opt/qradar/jars/guice-jmx- [pool-19-thread-]
15760 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % java -Dapplication.name=hostcontext -Dapp_id=hostc~.jar:/opt/qradar/jars/guice-jmx- [pool-9-thread-2]
19755 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: stats collector
3998 be/4 root 0.00 B/s 3.12 K/s 0.00 % 0.00 % syslog-ng -F -p /var/run/syslogd.pid
13368 rt/2 root 0.00 B/s 15.60 K/s 0.00 % 0.00 % java -Dapplication.name=ecs-ep -Dapp_id=ecs-ep -Dj~tgnosis ecs-ep.ecs 220 noconsole [Ariel Writer#ev]
6394 be/4 postgres 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: stats collector
1361 be/3 root 0.00 B/s 3.12 K/s 0.00 % 0.00 % auditd
Many more performance commands and utilities are available.
Gathering Log Files
When you open a case for a failed deployment, include date and time stamps for the time frame of the deployment.
Support can use the time frame to focus on the relevant section of the logs.
- Obtain the system time and date:
date
- In the UI, start a deployment:
Note: To perform a partial deployment: "Admin" > "Deploy Changes"
To perform a full deployment: "Admin" > "Advanced" > "Deploy Full Configuration"
There is a difference between the two types of deployments.
- Once the deployment finishes, take another date and time stamp:
date
- Open a support case, include the start and end time for the deployment, and include fresh logs from the Console as well as any Managed Hosts with deployment issues.
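The start and end times can be written to a notes file so that they are not lost while the deployment runs. The following is a minimal sketch. The function name mark_time and the default file name deploy_times.txt are hypothetical.

```shell
# Append a labelled date and time stamp to a notes file.
# Assumption: the default file name is illustrative; pass a path as the
# second argument to write elsewhere.
mark_time() {
    local label="$1" notes="${2:-./deploy_times.txt}"
    printf '%s: %s\n' "$label" "$(date '+%Y-%m-%d %H:%M:%S %Z')" >> "$notes"
}

# Example:
#   mark_time "deploy start"
#   ... start the deployment in the UI and wait for it to finish ...
#   mark_time "deploy end"
#   cat ./deploy_times.txt
```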
Document Location
Worldwide
Document Information
Modified date:
03 June 2024
UID
ibm16832804