Configuring and tuning data collection
When an Agent Builder agent is created, you can configure and tune its data collection to achieve the best results.
How you configure and tune an agent can differ between Agent Builder agents, and even between attribute groups in a single agent. Agent Builder agents can include two types of data, and they support two basic methods of data collection for the most common type of data.
Data types
- Most Tivoli® Monitoring attribute groups represent snapshots of data: someone asks for the data and it is returned. Agents use this type of data to represent configuration, performance, status, and other information where a one-time collection of a set of data makes sense. This data is called sampled data.
- Some Tivoli Monitoring data represents events. In this case, an event happens and the agent must forward data to Tivoli Monitoring. Examples of events are SNMP Traps, Windows Event Log entries, and new records that are written to a log file. For simplicity, these types of data are grouped and referred to as event data.
Sampled data
When sampled data is required, a request is sent to the agent for a specific attribute group. The request might be initiated by clicking a workspace in the Tivoli Enterprise Portal. Other things that might initiate a request are a situation that is running, a data collection for the Warehouse, or a SOAP request. When the agent receives the request, the agent returns the current data for that attribute group. Tivoli Enterprise Portal requests target a specific attribute group in a particular Managed System Name (MSN). Situations and historical requests are more interesting, especially in an agent which includes subnodes. When a situation needs data for an attribute group in a subnode, the agent receives one request with a list of the targeted subnodes. The agent must respond with all the data for the requested attribute group for all of the subnodes before Tivoli Monitoring can work on the next request.
The most straightforward way for an agent to satisfy a request is to collect data every time it receives a request from Tivoli Monitoring. Agent Builder agents do not work this way, because collecting data often takes time or uses resources, and in many cases the same data is requested many times in a short period. For example, a user might define several situations that run at the same interval on an attribute group, with each situation signaling a different condition. Each of these situations results in a request to the agent, but you might prefer all of them to see the same data. When each situation sees the same data, the results are more consistent and the monitoring agent places less demand on system resources. Agent Builder agents therefore support two collection methods:
- On-demand collection: The agent collects data when it receives a request and returns that data.
- Scheduled collection: The agent runs data collection in the background on scheduled intervals and returns the most recently collected data when it receives a request.
The agent uses a short-term cache in both of these modes. If another request for data is received while the cache is valid, the agent returns data from the cache without collecting new data for each request. Using data from the cache solves the problem that is caused by multiple concurrent situation (and other) requests. The amount of time the data remains valid, the scheduled collection interval, the number of threads that are used for collection, and whether the agent runs in on-demand or scheduled mode are all defined by environment variables. Using the environment variables, you can tune each agent for the best operation in its environment.
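The short-term cache behavior for on-demand collection can be sketched as follows. This is an illustrative model only, not the agent's actual implementation; the class and function names are invented for the example.

```python
import time

class SampledDataCache:
    """Illustrative model of the agent's short-term cache for sampled data.

    While the cached rows are younger than the validity period
    (CDP_DP_CACHE_TTL), every request is answered from the cache and no
    new collection runs.
    """

    def __init__(self, collect, ttl_seconds=55.0):
        self.collect = collect              # callable that gathers fresh rows
        self.ttl = ttl_seconds              # validity period for cached data
        self._rows = None
        self._collected_at = float("-inf")  # time of the last collection

    def get(self):
        now = time.monotonic()
        if self._rows is None or now - self._collected_at >= self.ttl:
            # On-demand mode: collect only when a request arrives and
            # the cached data has expired.
            self._rows = self.collect()
            self._collected_at = now
        return self._rows
```

Several situations that run at the same interval then evaluate identical rows, which keeps their results consistent while only one collection runs.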
- Agent 1 (on-demand collection): A simple agent that collects a small amount of data that is normally accessed only by situations or on an infrequent basis in the Tivoli Enterprise Portal. Data collection is reasonably fast, but it can use up computing and networking resources. This agent is normally defined to run on demand. If no situations are running or no one clicks the Tivoli Enterprise Portal, the agent does nothing. When data is needed, it is collected and returned. The data is placed into the short-term cache so that further requests at about the same time return the same data. This type of collection is likely the most efficient way for this agent to run because it collects data only when someone actually needs it.
- Agent 2 (scheduled collection): A complex agent that includes subnodes and collects data from multiple copies of the monitored resource. Many copies of the resource can be managed by one agent. It is normal to run situations on the data on a relatively frequent basis to monitor the status and performance of the monitored resource. This agent is defined to run a scheduled collection. One reason for running a scheduled collection is the way that situations are evaluated by Tivoli Monitoring agents. Because situations are running on the attribute groups in the subnodes, the agent receives one request for the data from all of the subnodes simultaneously. The agent cannot respond to other requests until all of the data is returned for a situation. If the agent collected all of the data when the request arrived, the agent would freeze when you click one of its workspaces in the Tivoli Enterprise Portal. To avoid freezing the agent, Agent Builder automatically defines all subnode agents to run a scheduled collection. The agent developer tunes the number of threads and the refresh interval to collect the data at a reasonable interval for the data type. For example, the refresh interval can be once a minute, or once every 5 minutes.
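As a hypothetical sketch of the Agent 2 scenario, the agent's environment file might contain entries like these. The values are illustrative only, not product defaults or recommendations.

```ini
# Illustrative tuning for a subnode agent that uses scheduled collection
CDP_DP_THREAD_POOL_SIZE=15      # background collection threads
CDP_DP_REFRESH_INTERVAL=300     # collect once every 5 minutes
CDP_DP_CACHE_TTL=300            # data stays valid until the next collection
```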
Environment variables
You set these variables in the agent's environment (env) file on Windows or initialization (ini) file on UNIX. The environment variables that control data collection for sampled attribute groups are:
- CDP_DP_CACHE_TTL=<validity period for the cached data; default value 55 seconds>
- CDP_DP_THREAD_POOL_SIZE=<number of threads to use for concurrent collection; default value 15 for subnode agents>
- CDP_DP_REFRESH_INTERVAL=<number of seconds between collections; default value 60 seconds for subnode agents>
- CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT=<amount of time to wait for new data after validity period expires; default value 5 seconds>
The collection mode is determined by the settings of CDP_DP_CACHE_TTL, CDP_DP_REFRESH_INTERVAL, and CDP_DP_THREAD_POOL_SIZE. If CDP_DP_THREAD_POOL_SIZE has a value greater than or equal to 1, or the agent includes subnodes, the agent operates in scheduled collection mode. If CDP_DP_THREAD_POOL_SIZE is not set or is 0, the agent runs in on-demand collection mode.
In scheduled collection mode, the agent collects data in the background every CDP_DP_REFRESH_INTERVAL seconds. It uses a set of background threads to do the collection. The number of threads is set by using CDP_DP_THREAD_POOL_SIZE.
The correct value for CDP_DP_THREAD_POOL_SIZE varies based on what the agent is doing. For example:
- If the agent collects data from remote systems by using SNMP, it is best to set CDP_DP_THREAD_POOL_SIZE close to the number of remote systems that are monitored. The agent then collects data in parallel but limits the concurrent load on the remote systems. SNMP daemons tend to discard requests when they get busy, which forces the agent into a try-again mode that takes more time and more resources to collect the data.
- If the agent includes a number of attribute groups that take a long time to collect (for example, a long-running script or a slow JDBC query), use enough threads so that the long data collections can run in parallel, plus a few more for the rest of the attribute groups. Use threads in this way only if the target resource can handle the concurrent load.
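For instance, an agent that polls 20 remote SNMP systems might size its pool like this (an illustrative value, following the guidance above):

```ini
# Illustrative: roughly one thread per monitored remote system
CDP_DP_THREAD_POOL_SIZE=20
```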
When data is collected, it is placed in the internal cache.
This cache is used to satisfy further requests until new data is collected.
The validity period for the cache is controlled by CDP_DP_CACHE_TTL.
By default the validity period is set to 55 seconds. When an agent
is running in scheduled mode, it is best to set the validity period
to the same value as CDP_DP_REFRESH_INTERVAL. Set
it slightly larger if data collection can take a long time. When you set
the validity period in this way, the data is considered valid until
its next scheduled collection.
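For a scheduled agent that collects every 5 minutes, the pairing described above might look like this (illustrative values):

```ini
# Illustrative: keep the cache valid slightly longer than the collection interval
CDP_DP_REFRESH_INTERVAL=300
CDP_DP_CACHE_TTL=305
```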
The final variable is CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT.
This variable comes into play only when CDP_DP_CACHE_TTL expires
before new data is collected. When the cache expires before new data
is collected, the agent schedules another collection for the data
immediately. It then waits for this collection to complete up to CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT seconds.
If the new collection completes, the cache is updated and fresh data
is returned. If the new collection does not complete, the existing
data is returned. The agent does not clear the cache when CDP_DP_CACHE_TTL expires,
which prevents a problem that was seen with the Universal Agent. The Universal
Agent always clears its data cache when the validity period ends.
If the Universal Agent clears its data cache before the next collection
completes, it has an empty cache for that attribute group and returns
no data until the collection completes. Returning no data becomes a problem when situations are running. Any situation that runs after the cache is cleared, but before the next collection completes, sees no data, and any of the situations that fired are cleared. The result is a flood of events that fire and clear just because data collection is a little slow. Agent Builder agents do not cause this problem. If the 'old' data causes a situation to fire, the same data generally leaves that situation in the same state. After the next collection completes, the situation gets the new data and either fires or clears based on valid data.
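The impatient-collector behavior can be sketched as follows. This is an illustrative model of the described logic, not the agent's real code; the class and method names are invented for the example.

```python
import threading
import time

class ScheduledCache:
    """Illustrative model of scheduled collection with
    CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT semantics."""

    def __init__(self, collect, ttl_seconds, impatient_timeout=5.0):
        self.collect = collect
        self.ttl = ttl_seconds
        self.timeout = impatient_timeout
        self._rows = []                       # last successfully collected data
        self._collected_at = float("-inf")
        self._fresh = threading.Event()

    def _collect_in_background(self):
        rows = self.collect()
        self._rows = rows
        self._collected_at = time.monotonic()
        self._fresh.set()

    def get(self):
        if time.monotonic() - self._collected_at < self.ttl:
            return self._rows                 # cache is still valid
        # Cache expired: schedule an immediate collection and wait for it,
        # but only up to the impatient-collector timeout.
        self._fresh.clear()
        threading.Thread(target=self._collect_in_background,
                         daemon=True).start()
        self._fresh.wait(self.timeout)
        # Fresh data if the collection finished in time; otherwise the
        # existing (stale) rows are returned rather than an empty answer.
        return self._rows
```

Returning the stale rows on timeout is what avoids the Universal Agent's empty-cache flood of firing and clearing events.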
Attribute groups
The agent provides attributes that help you tune its data collection:
- Performance Object Status, Average Collection Duration attribute. This attribute shows you how long each attribute group takes to collect data. Often a small percentage of the attribute groups in an agent represents most of the processor usage or time that is used by the agent. You might be able to optimize the collection for one or more of these attribute groups. Or you can modify the collection interval for one or more groups if you do not need some data to be as up-to-date as other data. For more information, see Examples and advanced tuning.
- Performance Object Status, Intervals Skipped attribute. This attribute shows you how many times the agent tried to schedule a new collection for the attribute group and found that the previous collection was still on the queue, waiting to be run, or already running. In a well-behaved agent this attribute value is zero for all attribute groups. If this number starts growing, tune the data collection by adding threads, lengthening the interval between collections, or optimizing the collection.
- Thread Pool Status, Thread Pool Avg Active Threads attribute. You can compare this value to the Thread Pool Size attribute to see how well your thread pool is being used. Allocating a thread pool size of 100 threads when the average number of active threads is 5 is probably just wasting memory.
- Thread Pool Status, Thread Pool Avg Job Wait and Thread Pool Avg Queue Length attributes. These attributes represent the time an average data collection spends waiting on the queue to be processed by a thread, and the average number of collections on the queue. Because of the way this data is collected, even an idle system indicates that at least an average of one job is waiting on the queue. A larger number of waiting jobs or a large average wait time indicates that collections are being starved. Consider adding threads, lengthening the interval between collections, or optimizing the collection for one or more attribute groups.
Event data
Agent Builder agents can expose several types of event data:
- Windows Event Log entries
- SNMP Traps or Informs
- Records added to log files
- JMX MBean notifications
- JMX monitors
- Events from a Java™ API provider or socket provider
- Joined attribute groups (where one of the data sources is an event data source)

Some behavior is common to all event data. The agent receives each new event as a separate row of data. When a row of event data is received, it is sent immediately to Tivoli Monitoring for processing and added to an internal cache in the agent. Situations and historical collection are performed by Tivoli Monitoring when each row is sent. The cache is used to satisfy Tivoli Enterprise Portal or SOAP requests for the data. The agent can also use the cache to perform duplicate detection, filtering, and summarization if they are defined for the attribute group. The size of the event cache for each attribute group is set by CDP_PURE_EVENT_CACHE_SIZE. There is a separate cache for each event attribute group; each cache contains the most recent CDP_PURE_EVENT_CACHE_SIZE events, with the most recent event returned first. When the cache for an attribute group fills, the oldest event is dropped from the list.
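The event cache behavior can be sketched as follows; the class and method names are invented for this illustration and are not the agent's internals.

```python
from collections import deque

class EventCache:
    """Illustrative per-attribute-group event cache sized by
    CDP_PURE_EVENT_CACHE_SIZE."""

    def __init__(self, size=100):
        # deque(maxlen=...) drops the oldest entry automatically when the
        # cache fills, matching the described behavior.
        self._events = deque(maxlen=size)

    def add(self, event):
        # In the agent, each new event is also forwarded to Tivoli
        # Monitoring immediately; the cache only serves later Tivoli
        # Enterprise Portal or SOAP requests.
        self._events.append(event)

    def rows(self):
        # The most recent event is returned first.
        return list(reversed(self._events))
```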
File monitoring is more complicated.
The agent must monitor both the existence of the files and when new records
are added to them. The agent can be configured to monitor files
by using patterns for the file name or a static name. As the set of
files that matches the patterns can change over time, the agent checks
for new or changed files every KUMP_DP_FILE_SWITCH_CHECK_INTERVAL seconds.
This global environment variable governs all file monitoring in an
agent instance. When the agent determines the appropriate files to
monitor, it must determine when the files change. On Windows systems, the agent uses operating
system APIs to listen for these changes. The agent is informed when
the files are updated and processes them immediately. On UNIX systems, the agent checks for file changes
every KUMP_DP_EVENT seconds. This global environment
variable governs all file monitoring in an agent instance. When the
agent notices that a file changed, it processes all of the new data
in the file and then waits for the next change.
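An environment file that tunes file monitoring might include entries like these. The values are illustrative; both variables are global for the agent instance.

```ini
# Illustrative file-monitoring settings
KUMP_DP_FILE_SWITCH_CHECK_INTERVAL=60   # check for new or changed files (seconds)
KUMP_DP_EVENT=5                         # UNIX only: check files for new records (seconds)
```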
Examples and advanced tuning
Example
The following variables are global and apply to all attribute groups in the agent: CDP_DP_CACHE_TTL, CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT, KUMP_DP_FILE_SWITCH_CHECK_INTERVAL, and KUMP_DP_EVENT.
The following variables can also be set for individual attribute groups: CDP_DP_REFRESH_INTERVAL and CDP_PURE_EVENT_CACHE_SIZE.
Consider an agent that includes the following attribute groups:
- EventDataOne
- EventDataTwo
- EventDataThree
- SampledDataOne
- SampledDataTwo
- SampledDataThree
The agent is set up with the following values:
- CDP_DP_CACHE_TTL=55
- CDP_DP_IMPATIENT_COLLECTOR_TIMEOUT=2
- CDP_DP_REFRESH_INTERVAL=60
- CDP_PURE_EVENT_CACHE_SIZE=100
These settings might work perfectly,
or there might be reasons that you must control the settings at a
more granular level. For example, what if EventDataOne generally receives
10 times as many events as EventDataTwo and EventDataThree? To further
complicate things, there really is a link between EventDataOne and
EventDataTwo. When one event is received for EventDataTwo, there are
always multiple events for EventDataOne and users want to correlate
these events. There is not a single correct setting for the cache
size. It would be nice to be able to have EventDataOne store a larger
number of events and EventDataTwo store a smaller number. You can
achieve this by setting CDP_PURE_EVENT_CACHE_SIZE to
a size that makes sense for most of the event attribute groups
(100 seems good), and then setting CDP_EVENTDATAONE_PURE_EVENT_CACHE_SIZE to
1000. That way all of the corresponding events are visible in the Tivoli Enterprise Portal.
The
same thing can be done with CDP_DP_REFRESH_INTERVAL.
Set a default value that works for the largest number of attribute
groups in the agent. Then set CDP_<attribute group name>_REFRESH_INTERVAL
for the attribute groups that must be collected differently. To optimize
collection, set the default CDP_DP_REFRESH_INTERVAL to match the
CDP_DP_CACHE_TTL value. CDP_DP_CACHE_TTL is a global value, so if it is
set to a value less than a refresh interval, unexpected collections
might occur.
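Putting the per-attribute-group overrides together, the environment file for the example agent might contain entries like these. The SampledDataOne override is hypothetical, following the CDP_<attribute group name>_REFRESH_INTERVAL pattern described above.

```ini
# Defaults for most attribute groups
CDP_DP_REFRESH_INTERVAL=60
CDP_DP_CACHE_TTL=60
CDP_PURE_EVENT_CACHE_SIZE=100

# Overrides for individual attribute groups (illustrative)
CDP_EVENTDATAONE_PURE_EVENT_CACHE_SIZE=1000
CDP_SAMPLEDDATAONE_REFRESH_INTERVAL=300
```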