IBM Support

On the APM 8.1.4 console, Agent Offline alarms do not disappear

Technical Blog Post


Abstract

On the APM 8.1.4 console, Agent Offline alarms do not disappear

Body

Problem:

We can see many old alarms (more than 1 month old) on the APM 8.1.4 (Application Performance Management) dashboard that are not clearing from the console. We even changed the removal time to 15 minutes, but the older alarms still have not cleared. Fig. 1 shows an example for APM_Agent_Offline.

[Fig. 1: APM_Agent_Offline alarms on the APM dashboard]

 

Description:

 

It is not possible to manually clear APM_Agent_Offline events (i.e. alarms, alerts). They clear automatically when the agent comes back online or when the agent is removed from the application [1]. You cannot edit or customize the default APM_Agent_Offline threshold: it is hard coded into the APM server, so you will not see a threshold definition [2]. By default, APM_Agent_Offline is based on the heartbeat interval, which is IRA_ASF_SERVER_HEARTBEAT=60 [3, 4].

 

By default, “APM > System Configuration > Advanced Configuration > Agent Subscription Facility > Remove Offline System Delay” is set to 5760 minutes (4 days). The managed system continues to be displayed, even if you uninstall the agent, until this delay has passed. The intention is that if something happens to the agent on Friday, you can still see it on Monday. You can change “Remove Offline System Delay” to any lower interval, e.g. 15 minutes, see Fig. 2.

[Fig. 2: Remove Offline System Delay setting under Advanced Configuration]

 

If the Threshold Manager still shows an "APM_Agent_Offline" alarm that does not disappear, one of the following applies:

(a) an old (e.g. forgotten) agent is still on the system and should be removed,

(b) or the problematic agent is not added to any application or custom resource group,

(c) or the same agentId is used on multiple servers,

(d) or the oslc and min services were not updated and need to be restarted,

(e) or HA is used and multiple OSLC providers appear, in which case the stale provider must be removed manually from SCR with oslcmaint.sh.

 

Steps:

(a) Check whether you have old agents on your system that are no longer working.

Under Components, select “Status Overview”. Under “Current Component Status”, offline agents are shown with the status “Unknown”, see Fig. 3.

[Fig. 3: Current Component Status showing offline agents as “Unknown”]

See also Fig. 2 in the document “Examples of offline agents”:

https://www.ibm.com/support/knowledgecenter/SSHLNR_8.1.4/com.ibm.pm.doc/install/admin_agent_offline_example.html

In Applications, select Edit, find the agent that is no longer in use, and remove it (i.e. select and delete). See the steps in the document “Viewing and removing offline agents”:

https://www.ibm.com/support/knowledgecenter/en/SSHLNR_8.1.4/com.ibm.pm.doc/install/admin_agent_removeoffline.html

 

(b) If some of your agents have been offline for more than 4 days, it looks as if they are not added to any application or custom resource group (and can therefore still appear in the UI).

In the collected logs, check whether the file apm/ibm/ccm/oslc_pm/logs/hostname_as_***.log contains the settings KAS_MERGE_MSN=YES and KAS_KEEP_ALL_MSN=YES.

If so, manually force APM to remove the agents with the following steps:

1) Edit the <apm-install-dir>/ccm/oslc_pm/config/as.environment file and add these lines:

KAS_MERGE_MSN=NO

KAS_KEEP_ALL_MSN=NO

2) apm stop oslc

3) apm start oslc

4) Wait 20 minutes and confirm that the agents no longer appear in the UI. Then repeat the same steps to set both values back to YES (or remove the lines).
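The check of the log and the edit in step 1 can be scripted. The following is a minimal Python sketch, assuming that the settings appear as plain NAME=VALUE text both in the log and in as.environment. The function names are hypothetical and not part of APM. The sketch only transforms text; writing the file back and running "apm stop oslc" and "apm start oslc" remain manual steps.

```python
import re

KAS_KEYS = ("KAS_MERGE_MSN", "KAS_KEEP_ALL_MSN")


def find_kas_settings(log_text):
    """Return the last value seen for each KAS key in the given log text."""
    found = {}
    pattern = r"\b(KAS_MERGE_MSN|KAS_KEEP_ALL_MSN)\s*=\s*(\w+)"
    for key, value in re.findall(pattern, log_text):
        found[key] = value.upper()
    return found


def set_kas_settings(env_text, value):
    """Return as.environment text with both KAS keys set to `value`.

    Existing assignments are rewritten in place; missing keys are appended.
    All other lines are kept unchanged.
    """
    lines = env_text.splitlines()
    seen = set()
    for i, line in enumerate(lines):
        name = line.split("=", 1)[0].strip()
        if name in KAS_KEYS:
            lines[i] = f"{name}={value}"
            seen.add(name)
    for key in KAS_KEYS:
        if key not in seen:
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample inputs are made up for illustration only.
    print(find_kas_settings("... KAS_MERGE_MSN=YES ... KAS_KEEP_ALL_MSN=YES ..."))
    print(set_kas_settings("KAS_HOSTNAME=apmhost\nKAS_MERGE_MSN=YES\n", "NO"))
```

Calling set_kas_settings(text, "YES") afterwards restores the original values, which corresponds to the second half of step 4.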

 

Reference:

See the original document “APM/IPM V8.1.x Offline Agents no longer in use and still show in the console...What to do?”:

https://www.ibm.com/developerworks/community/blogs/0587adbc-8477-431f-8c68-9226adea11ed/entry/APM_IPM_V8_1_x_Offline_Agents_no_longer_in_use_and_still_show_in_the_console_What_to_do?lang=en

 

(c) Check whether the same broker name (i.e. agentId) is used on multiple servers.

(agentId must be unique across all hosts.)
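If you have a list of agents per host, the uniqueness check can be automated. This is a minimal Python sketch; the (hostname, agentId) inventory format and the function name are assumptions for illustration, and APM does not provide the list in this form.

```python
from collections import defaultdict


def find_duplicate_agent_ids(inventory):
    """Map each agentId that occurs on more than one host to its sorted hosts.

    `inventory` is an iterable of (hostname, agent_id) pairs, a hypothetical
    format that you would have to build from your own agent configuration.
    """
    hosts_by_agent = defaultdict(set)
    for hostname, agent_id in inventory:
        hosts_by_agent[agent_id].add(hostname)
    return {
        agent_id: sorted(hosts)
        for agent_id, hosts in hosts_by_agent.items()
        if len(hosts) > 1
    }
```

Any agentId returned by this function is configured on several hosts and should be renamed on all but one of them.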

Reference:

See the original document “Why are many APM v8 IIB (kqi) agent instances on one host showing offline?”:

https://developer.ibm.com/answers/questions/354275/why-are-many-apm-813…

 

(d) Check the /apm/ibm/wlp/usr/servers/server1/logs/kd8collect.log file for entries like:

APMSelfMonitoring","origin":"xxx","description":"IBM APM: Agent xxx on hostname is off-line.","threshold_name":"APM_Agent_Offline","vendor_product":"IBM
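To list which agents are reported offline, the log can be scanned for this pattern. The following Python sketch assumes that each event is written on a single line and that the description always has the form shown in the fragment above; the function name is hypothetical.

```python
import re

# Pattern derived from the log fragment above; the exact format is an assumption.
OFFLINE_EVENT = re.compile(
    r'"description":"IBM APM: Agent (?P<agent>.+?) on (?P<host>\S+) is off-line\."'
)


def offline_agents(log_text):
    """Return sorted unique (agent, host) pairs from APM_Agent_Offline events."""
    pairs = set()
    for line in log_text.splitlines():
        if '"threshold_name":"APM_Agent_Offline"' not in line:
            continue
        match = OFFLINE_EVENT.search(line)
        if match:
            pairs.add((match.group("agent"), match.group("host")))
    return sorted(pairs)
```

Typical use is offline_agents(open(path).read()) with the path of kd8collect.log, followed by a comparison of the result with the agents you expect to exist.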

The OSLC service is responsible for deleting an agent's resources from SCR after the agent has been deleted from the internal MIN "KNOWN" tables and from the CURI DP managed system (msys) dataset. Once the resources are removed from SCR, the agent no longer appears in My Components. If the tables were not updated, restart the following services one at a time:

First restart the "server1" service and check whether the offline agents are removed.

Then restart the "oslc" service and check whether the offline agents are removed.

Finally restart the "min" service and check whether the offline agents are removed.

 

 

(e) If multiple OSLC providers appear (so that already deleted agents still show up in the APM UI along with their offline records), manual maintenance on SCR with oslcmaint.sh is needed. This can happen if you use APM High Availability (HA) and the KAS_HOSTNAME setting in /opt/ibm/ccm/oslc_pm/config/as.environment does not have the same value on both APM servers.

 

Initially, KAS_HOSTNAME is the hostname of the primary APM server. The value of KAS_HOSTNAME on the primary and standby APM servers must match. KASSTATE is then recreated when oslc restarts, provided APM has access to the SCR database. If there is a problem, this message appears in the SCR logs:

/apm/ibm/ccm/SCR/XMLtoolkit/log/msgGTM_XT.log.0

“[2019/07/19-01:00:11.556] com.ibm.tbsm.cltools.service.ASIApiQueueMaint process [44]  GTMCL5582W: Multiple OSLC providers have published data to the SCR, this can result in resources and agents that cannot be deleted. Contact IBM for assistance with resolving this issue.”
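A quick way to find out whether and when this warning was logged is to scan the log for the message ID. The sketch below is a minimal Python example that assumes the timestamp format shown in the message above; the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches timestamps such as [2019/07/19-01:00:11.556]
TIMESTAMP = re.compile(r"\[(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\]")


def multiple_provider_warnings(log_text):
    """Return the timestamps of all GTMCL5582W messages found in the log text."""
    stamps = []
    for line in log_text.splitlines():
        if "GTMCL5582W" not in line:
            continue
        match = TIMESTAMP.search(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
            )
    return stamps
```

An empty result for msgGTM_XT.log.0 suggests that case (e) does not apply, although older rotated log files may still contain the warning.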

 

You have to find out which OSLC provider is the current one and which one is no longer in use and should be passed to oslcmaint.sh. Here are some instructions (if you are not sure how to determine the right OSLC provider, please contact IBM Support).

 

To see in SCR what is causing this, check the output of the following steps:

1. run http://apm-server-ip:8090/SCRViewer/viewer?view=provider

You get a table of providers (ID, CDM Identity, Product, Resources, Mod Time).

Check whether two entries have a similar CDM Identity but different IDs (marked X and Y below),

where X (usually 4 or 21) is the provider ID of the primary APM server,

and Y (usually 61) is the provider ID of the backup (standby) APM server.
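The comparison of the provider table can be sketched in Python as follows. This is only an illustration: the tuple format, the function name, and the normalisation of "similar" identities (trimming and lowercasing) are assumptions, and the rows have to be copied from the SCRViewer table by hand.

```python
from collections import defaultdict


def providers_sharing_identity(providers):
    """Group provider IDs by CDM Identity and keep identities with several IDs.

    `providers` is an iterable of (provider_id, cdm_identity, mod_time) tuples.
    mod_time is expected as an ISO date string (yyyy-mm-dd) so that string
    order equals date order. IDs are returned oldest first.
    """
    by_identity = defaultdict(list)
    for provider_id, cdm_identity, mod_time in providers:
        key = cdm_identity.strip().lower()
        by_identity[key].append((mod_time, provider_id))
    return {
        identity: [provider_id for _, provider_id in sorted(entries)]
        for identity, entries in by_identity.items()
        if len(entries) > 1
    }
```

The result only shows which identities were published by more than one provider. It does not decide which provider is the current one; that still has to be confirmed as described in the following steps.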

 

2. run http://apm-server-ip:8090/SCRCLUSTER_SCR_oslc/oslc/rr/registration/collection?oslc.select=*&oslc.pageSize=1000000&oslc.paging=true

Note: The userid/password for those queries are the smadmin user and password.

 

Check in SCRCLUSTER_SCR_oslc.xml whether any entry like the following was modified (here for provider X):

<rdf:type rdf:resource="http://jazz.net/ns/ism/registry#RegistrationRecord"/>
<oslc:serviceProvider rdf:resource="http://hostname:8090/SCRCLUSTER_SCR_oslc/oslc/providers/X"/>
<oslc:domain rdf:resource="http://open-services.net/ns/perfmon#"/>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">yyyy-mm-ddThh:mm</dcterms:modified>

 

3. run http://apm-server-ip:8090/SCRCLUSTER_SCR_oslc/oslc/rr/collection?oslc.s…

Note: The userid/password for those queries are the smadmin user and password.
 

Check in SCRCLUSTER_SCR_oslc.xml whether any X was changed to Y, as here:

<oslc:serviceProvider rdf:resource="http://hostname:8090/SCRCLUSTER_SCR_oslc/oslc/providers/Y"/>

<oslc:serviceProvider rdf:resource="http://hostname:8090/SCRCLUSTER_SCR_oslc/oslc/providers/X"/>

<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">yyyy-mm-ddThh:mm</dcterms:mo…;

</rdf:Description>

</crtv:contextAddressSpace>

<oslc:serviceProvider rdf:resource="http://hostname:8090/SCRCLUSTER_SCR_oslc/oslc/providers/Y"/>

<crtv:address>xx.xx.xx.xx</crtv:address>

<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">yyyy-mm-ddThh:mm</dcterms:modified>
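With a page size of 1000000 the saved output can be large, so counting the provider references is easier with a script. The Python sketch below uses a regular expression rather than an XML parser because the saved output may be incomplete; the function name is hypothetical, and the pattern assumes that provider references always end in /oslc/providers/<id> as in the fragments above.

```python
import re
from collections import Counter

# Captures the provider ID at the end of each oslc:serviceProvider reference.
PROVIDER_REF = re.compile(
    r'<oslc:serviceProvider\s+rdf:resource="[^"]*/oslc/providers/([^"/]+)"'
)


def count_provider_references(xml_text):
    """Count how many records reference each OSLC provider ID."""
    return Counter(PROVIDER_REF.findall(xml_text))
```

If the counter contains more than one provider ID for the same APM installation, resources were registered by more than one OSLC provider.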

 

In these files you can check whether the hostname used by OSLC was changed. Such a change strands all of the resources (agents, etc.) that had already been registered and causes recently deleted agents to still appear, since they are still registered in SCR by the original OSLC provider. In this example we confirmed that they are still in the SCR database as registered by the X provider, e.g.

ID  CDM Identity  Product  Resources  Mod Time

X      http://hostname.localhost            CURI_DP Performance Monitoring service provider   2018-10-08 

Y     http://hostname.localhost              CURI_DP Performance Monitoring service provider  2019-06-12 

 

This is also validated by these messages in the SCR logs:

/apm/ibm/ccm/SCR/XMLtoolkit/log/msgGTM_XT.log.0

com.ibm.tbsm.cltools.service.ASIApiQueueMaint process [44]  GTMCL5582W: Multiple OSLC providers have published data to the SCR, this can result in resources and agents that cannot be deleted. Contact IBM for assistance with resolving this issue.

com.ibm.tbsm.cltools.service.ASIApiQueueMaint process [44]  GTMCL5583W: OSLC provider: http://hostname /localhost , number of resources published

 

 ---- Solution ----

Since we know that the current OSLC provider is ID Y, you must remove all of the resources registered by the X provider via the following command:

/apm/ibm/ccm/SCR/XMLtoolkit/bin/oslcmaint.sh -U itmuser -P db2Usrpasswd@08 -r X

Where:

-U is the SCR database user ID

-P is the SCR database password

-r is the ID of the OSLC provider whose resources are to be removed (X in this example)

It will take SCR a few minutes to remove these resources.

----------------------
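If the command is run from a script, it helps to assemble it from its parts so that the password does not have to be stored in the script itself. The following Python sketch only builds the argument list from the flags shown above. The function name, the SCR_DB_PASSWORD environment variable, and the check that the provider ID is numeric are assumptions made for this example.

```python
import os

OSLCMAINT = "/apm/ibm/ccm/SCR/XMLtoolkit/bin/oslcmaint.sh"


def build_oslcmaint_command(db_user, db_password, provider_id):
    """Return the argument list for removing one provider's resources."""
    # Assumption: provider IDs are numeric, as in the examples (4, 21, 61).
    if not str(provider_id).isdigit():
        raise ValueError(f"provider ID must be numeric, got {provider_id!r}")
    return [OSLCMAINT, "-U", db_user, "-P", db_password, "-r", str(provider_id)]


if __name__ == "__main__":
    # SCR_DB_PASSWORD is a hypothetical environment variable, used so that
    # the password does not appear in the script.
    password = os.environ.get("SCR_DB_PASSWORD", "")
    command = build_oslcmaint_command("itmuser", password, 4)
    # Print the command with the password masked. To execute it on the APM
    # server, pass `command` to subprocess.run(command, check=True) once the
    # stale provider ID has been confirmed.
    print(" ".join(command[:4] + ["****"] + command[5:]))
```

The sketch deliberately prints instead of executing, because removing the resources of the wrong provider cannot easily be undone.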

Reference:

APM High Availability Installation and Upgrade

https://developer.ibm.com/apm/wp-content/uploads/sites/119/2019/08/APM_High_Availability_V4.5.pdf

(see the chapter 'Agents are not removed from My Components' on page 113)

 

If alarms are still not clearing, please contact IBM Support.

 

 

Additional References:

[1] Is it possible to edit APM_Agent_Offline alert?

https://www.ibm.com/developerworks/community/forums/html/topic?id=b172f72c-ce7b-42b9-bf97-e739324a04fb

[2] Is it possible to customize the APM_Agent_Offline threshold?

https://developer.ibm.com/answers/questions/489332/is-it-possible-to-customize-the-apm-agent-offline/

[3] How frequently is the heartbeat checked and how to change it?

https://developer.ibm.com/answers/questions/469390/how-frequently-is-the-heartbeat-checked-and-how-to/

[4] APM_Agent_Offline alerts misbehaving

https://www.ibm.com/developerworks/community/forums/html/topic?id=a54832bb-83f2-4c45-84c8-a86f8aaa2357


UID

ibm11277422