IBM Support

Found a slow Collector EMS discovery? Consider this...

Technical Blog Post


Abstract

Found a slow Collector EMS discovery? Consider this...

Body

This blog entry provides some guidance on what to consider when assessing Collector discovery performance.
 
Usually an ITNM Collector discovery will be faster than an equivalent SNMP discovery, but that may not always be enough, especially as we integrate with Element Management Systems supporting ever greater numbers of managed elements.
 
There are a number of reasons why a Collector-based discovery can be slower than desired; the main areas and considerations are shown in the following diagram:
[Diagram: the main areas and considerations affecting Collector discovery performance]

Of the above, I'd say the most important things to be aware of are:

Large numbers of devices

  • Impact on EMS-to-Collector data transfer

How the EMS' NBI copes with large numbers of devices will depend on the EMS/NBI.

[Collector Developer Tip] Ensure that a minimal data set is retrieved, with configurable extras, and that the optimal NBI call is used for that EMS.

  • Impact on Collector-to-Agent data transfer
As the discovery process issues XML-RPC calls for each device (GetDeviceList, GetInfo and UpdateData aside), each device managed by an EMS will result in approximately 10 separate XML-RPC calls (maybe more in post-4.1 releases, and maybe fewer if any unrequired Collector Agents are disabled).
While these calls and the subsequent response parsing are fast*, they can add up if the Collector is on a system remote from the discovery process (not usually the case).
One thing to try is to make sure the thread count (m_NumThreads) for the CollectorDetails Agent in DiscoAgents.cfg is set to, say, 40. By default it is 1 (originally because the XML-RPC Helper is single-threaded), but performance can improve when it is set above 1.
* Yet to be confirmed, but be aware that I've seen slow Collector-to-Agent times (i.e. the time between "Data Collection Phase 1 Starting" and "Data Collection Phase 2 Starting"; Collectors are phase 1 Agents) on an AIX 5.3 system; it was fine with a Java Collector (and on other platforms).
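The m_NumThreads change above could look something like the following DiscoAgents.cfg fragment. This is a sketch only: it assumes the OQL insert style used in that file, and the column list is illustrative, so check it against the existing CollectorDetails entry shipped with your release before editing.

```
insert into disco.agents
(
        m_AgentName,
        m_Valid,
        m_NumThreads
)
values
(
        "CollectorDetails",
        1,
        40
);
```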
  • Impact on stitching
The stitching is likely to take at least as long as the EMS-to-Collector-to-Agent time (unless the EMS itself, or the connection to it, is slow).
Many of the discovery stitchers will get slower as the number of devices rises (though it should be far from an exponential growth).
See ncp_disco.<domain>.log for a very useful summary of the stitcher times following a discovery. Note that some stitchers, e.g. CreateAndSendTopology.stch, call other stitchers and so will naturally have a higher elapsed time despite not necessarily being slow themselves.
Keep the recommended per-domain entity counts in mind.
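To see where stitching time goes, the per-stitcher summary in ncp_disco.&lt;domain&gt;.log can be ranked by elapsed time. The helper below is a minimal sketch; the sample log lines and the regular expression are assumptions modelled on a generic "took N seconds" summary, so adapt the pattern to the exact wording your release logs.

```python
import re

# Hypothetical example lines, modelled on the per-stitcher elapsed-time
# summary printed in ncp_disco.<domain>.log after a discovery. The exact
# wording varies by release; the pattern below is an assumption.
LOG = """\
Stitcher CreateAndSendTopology.stch took 312.4 seconds
Stitcher BuildLayers.stch took 45.1 seconds
Stitcher PostLayerProcessing.stch took 12.9 seconds
"""

def slowest_stitchers(log_text, top=3):
    """Return (stitcher name, elapsed seconds) pairs, slowest first."""
    pat = re.compile(r"Stitcher\s+(\S+)\s+took\s+([\d.]+)\s+seconds")
    times = [(m.group(1), float(m.group(2))) for m in pat.finditer(log_text)]
    return sorted(times, key=lambda t: t[1], reverse=True)[:top]

for name, secs in slowest_stitchers(LOG):
    print(f"{secs:8.1f}s  {name}")
```

Remember that a stitcher such as CreateAndSendTopology.stch calls other stitchers, so a high elapsed time at the top of this ranking does not necessarily mean that stitcher itself is slow.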

Sheer volume of per-device data (e.g. Layer 2 VPNs)

  • Impact on EMS-to-Collector data transfer
Discovering an EMS that has, say, hundreds of thousands of layer 2 VPNs can impact performance. 
If the Collector downloads all of the data from the EMS at the start of discovery (i.e. when the XML-RPC call for UpdateData() is received) then it can lead to a long delay before the discovery appears to start.
[Collector Developer Tip] In such cases it may be worth delaying the download of such data until an XML-RPC call is received specifically for that data for a given device (e.g. GetLayer2VPNs()). Users who had disabled the CollectorVPN Agent, or had unscoped the device in question, would be especially grateful in this case...
[Collector Developer Tip] Ensure that a minimal data set is retrieved, with configurable extras, and that the optimal NBI call is used for that EMS.
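The lazy-download tip above can be sketched as follows. All names here are hypothetical (FakeEMS stands in for a real EMS NBI client): the point is simply that UpdateData() fetches only a minimal inventory, while the bulk per-device data is pulled on the first GetLayer2VPNs() call and cached.

```python
# Sketch of deferring bulk EMS downloads until a device-specific call
# arrives, rather than fetching everything inside UpdateData().
# All class and method names are hypothetical.

class FakeEMS:
    """Stand-in for an EMS NBI client; counts the expensive VPN fetches."""
    def __init__(self):
        self.vpn_fetches = 0

    def get_device_inventory(self):
        return ["dev1", "dev2"]

    def get_layer2_vpns(self, device_id):
        self.vpn_fetches += 1          # the download we want to defer
        return [f"vpn-{device_id}"]

class LazyCollector:
    def __init__(self, ems):
        self.ems = ems
        self._vpn_cache = {}           # device id -> Layer 2 VPN data

    def UpdateData(self):
        # Minimal inventory only; bulk VPN data is NOT downloaded here.
        self.devices = self.ems.get_device_inventory()

    def GetLayer2VPNs(self, device_id):
        # Fetched on first request and cached; unscoped devices (or runs
        # with the CollectorVPN Agent disabled) never pay the cost.
        if device_id not in self._vpn_cache:
            self._vpn_cache[device_id] = self.ems.get_layer2_vpns(device_id)
        return self._vpn_cache[device_id]

ems = FakeEMS()
collector = LazyCollector(ems)
collector.UpdateData()
print(ems.vpn_fetches)                 # no VPN data pulled yet
print(collector.GetLayer2VPNs("dev1"))
print(ems.vpn_fetches)                 # one fetch, now cached
```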
  • Impact on stitching
The stitchers that process that type of data will, of course, be affected.
Keep the recommended per-domain entity counts in mind.

 

If downloading data from an EMS still simply takes too long, or you can't modify the Collector code, there is still an option: ncp_fake_collector.pl.
It may be acceptable for you to run the real Collector against the EMS _prior_ to running an ITNM discovery, then take a data dump from that Collector (via ncp_query_collector.pl) and load it into a new Collector, ncp_fake_collector.pl.
You could then run discoveries against this new Collector with minimal delay to discovery time. Of course, a major limitation here is that the data Disco discovers will only be as fresh as the last time it was pulled in from the real Collector.

 

I'll write more on ncp_query_collector.pl and ncp_fake_collector.pl at a later date (note: these scripts are shipped in 4.1).


UID

ibm11082133