IBM Tivoli Monitoring Wonderful World of Situations

Sitworld: Mixed Up Situations

jalvord 1200009463 | | Tags:  efficiency performance situation ‎ | 12 Comments ‎ | 13,300 Views

 

image

By John Alvord, IBM Corporation
jalvord@us.ibm.com

 

Inspiration

A customer was experiencing a high CPU condition on a remote TEMS. High CPU was also seen on some agents. The first issue was an expensive test to determine if two processes were missing. I documented full details for a zero cost solution here.

The second major issue was the result of a situation that used two different attribute groups. For that case I show an alternative solution which reduces the number of result bytes incoming by 98%.

 

Mixed Attribute Situation formula

The problem situation was XXXXX_XX_SYSLoadAvg15Min_C, which resulted in 1.836 megabytes a minute of result data even though it only ran every 15 minutes.

The situation formula was not obviously inefficient. It certainly did not draw my attention until the TEMS Audit process showed it as a top impacter:

*IF ( ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 2 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 8.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 3 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 12.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 4 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 16.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 5 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 20.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 6 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 24.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *EQ 7 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 28.00 ) *OR
      ( *VALUE Linux_Machine_Information.Number_of_Processors_Online *GE 8 *AND
        *VALUE KLZ_System_Statistics.System_Load_15min *GT 32.00 )
    )

 

That was formatted for easier viewing. The scheme is designed to be true if the 15 minute system load average is more than 4 times the number of processors on the system where it runs. The formula seems reasonable. However, an agent can only run a situation with a single attribute group. To calculate this situation, TEMS creates hidden sub-situations which achieve the same goal. In this case TEMS created 22 situation rules in the SITDB table:

 

XXXXX_XX_SYSLoadAvg15Min_C______
XXXXX_XX_SYSLoadAvg15Min_C_____0
XXXXX_XX_SYSLoadAvg15Min_C_____1
XXXXX_XX_SYSLoadAvg15Min_C_____2
XXXXX_XX_SYSLoadAvg15Min_C_____3
XXXXX_XX_SYSLoadAvg15Min_C_____4
XXXXX_XX_SYSLoadAvg15Min_C_____5
XXXXX_XX_SYSLoadAvg15Min_C_____6
XXXXX_XX_SYSLoadAvg15Min_C_____7
XXXXX_XX_SYSLoadAvg15Min_C_____8
XXXXX_XX_SYSLoadAvg15Min_C_____9
XXXXX_XX_SYSLoadAvg15Min_C_____a
XXXXX_XX_SYSLoadAvg15Min_C_____b
XXXXX_XX_SYSLoadAvg15Min_C_____c
XXXXX_XX_SYSLoadAvg15Min_C_____d
XXXXX_XX_SYSLoadAvg15Min_C_____e
XXXXX_XX_SYSLoadAvg15Min_C_____f
XXXXX_XX_SYSLoadAvg15Min_C_____g
XXXXX_XX_SYSLoadAvg15Min_C_____h
XXXXX_XX_SYSLoadAvg15Min_C_____i
XXXXX_XX_SYSLoadAvg15Min_C_____j
XXXXX_XX_SYSLoadAvg15Min_C_____k

This was the largest number of sub-situations I have ever seen.

I won’t bore you with every sub-situation definition, but here are three selected examples found in the SITDB table. The SITDB table contains the SQL which represents the situation in the TEMS dataserver:

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C______

PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____0 OR XXXXX_XX_SYSLoadAvg15Min_C_____3 OR XXXXX_XX_SYSLoadAvg15Min_C_____6 OR XXXXX_XX_SYSLoadAvg15Min_C_____9 OR XXXXX_XX_SYSLoadAvg15Min_C_____c OR XXXXX_XX_SYSLoadAvg15Min_C_____f OR XXXXX_XX_SYSLoadAvg15Min_C_____

 

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____i

PREDICATE: XXXXX_XX_SYSLoadAvg15Min_C_____j AND XXXXX_XX_SYSLoadAvg15Min_C_____k

 

RULENAME: XXXXX_XX_SYSLoadAvg15Min_C_____k

PREDICATE: SELECT BIOSREL, BIOSVER, BRAND, CONFCPU, HOSTNAME, MACSERIAL, MODEL, ONLNCPU, ORIGINNODE, TIMESTAMP, UUID FROM KLZ.LNXMACHIN WHERE SYSTEM.PARMA("SITNAME", "LZIOS_BP_SYSLoadAvg15Min_C", 26) AND SYSTEM.PARMA("NUM_VERSION", "8", 1) AND SYSTEM.PARMA("LSTDATE", "1130405010445000", 16) AND SYSTEM.PARMA("SITINFO", "TFWD=N;OV=N;", 12) AND LNXMACHIN.ONLNCPU = 7 ;

This situation and a parallel one for the 5 minute load accounted for about 40% of the incoming workload to the remote TEMS. Because the situations contain tests for other situations, they require TEMS evaluation, which is always very expensive. This situation alone could cause high CPU at best and a remote TEMS crash at worst.

There was also a much higher workload at the agents where the situation was distributed. All 22 situations were evaluated at the sampling interval. Even if the issue did not exist, the situation tests required TEMS evaluation, and so many duplicate results had to be sent.

 

Alternative Situation(s) Example Solution

The following three situations use marker and data files to communicate information. Example situations are available here. The marker/data files are stored in the <install>/tmp directory. If more than one scheme is being used, the situation name would be made part of the marker file name.

 

IBM_processor_count

Since systems rarely change the number of online processors, the value is calculated at agent start up and then just once every 999 days. If your environment makes use of CPU hot plug technology, you could run it more frequently. In this example, the ongoing TEMS impact is zero since the situation is only evaluated every 999 days or during a TEMS connection.

The action commands are laid out on several lines for ease of understanding, but each will be one long line in the situation editor Action command:

Attribute Group: Linux Machine Information
Formula: (Number of Processors Online >= 0)
Sampling Interval: 999 days
rowsize: 764
Action Command:

cd $CANDLEHOME/tmp ;
echo  &{Linux_Machine_Information.Number_of_Processors_Online}  >ponline.txt

The purpose is to record the number of online processors into a known file.
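While testing, it can help to confirm that the recorded value matches what the operating system reports. The following is only a sketch: `compare_ponline` is a hypothetical helper name, it is not part of ITM, and it assumes a Linux system where `getconf _NPROCESSORS_ONLN` is available.

```shell
# compare_ponline is a hypothetical helper, not part of ITM.
# $1 = directory holding ponline.txt (stands in for $CANDLEHOME/tmp)
# Prints "missing", "match" or "mismatch".
compare_ponline() {
    file="$1/ponline.txt"
    if [ ! -f "$file" ]; then
        echo missing
        return 0
    fi
    # The action command may leave blanks around the number, so take field 1.
    recorded=$(awk '{print $1; exit}' "$file")
    # Assumption: getconf reports the online processor count on this platform.
    actual=$(getconf _NPROCESSORS_ONLN)
    if [ "$recorded" = "$actual" ]; then
        echo match
    else
        echo mismatch
    fi
}
```

For example, `compare_ponline /opt/IBM/ITM/tmp` could be run by hand after the agent has started, using the install path shown later in this post.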

 

IBM_sysload15_calc

The situation uses the Linux System Statistics attribute group. The formula is set to be always true. The action command is configured to run at every interval.

Attribute Group: Linux System Statistics
Formula: (System Load Last 15 Minutes >= 0.00)
Sampling Interval: 15 minutes
rowsize: 236
Action Command: 

cd $CANDLEHOME/tmp; 

(
   echo "&{KLZ_System_Statistics.System_Load_15min} " ;
   (cat ponline.txt 2>/dev/null || echo 1);
) |
   awk  '{load=$1;getline;cpu=$1;}
         END{
         if (load/100 > cpu*4.0)
         system(sprintf("echo %.2f >sysload15.hi",load/100));
         else system("rm -f sysload15.hi");
         exit 0;}'

Here is an explanation of the action command. Remember that (…) creates a subshell environment, and a semicolon separates one command from the next. || means that the second command runs only if the first one failed, that is, returned a non-zero exit status. A single | means the standard output of one command is fed into the standard input of the next command.

--- cd $CANDLEHOME/tmp;
  ===> Make the $CANDLEHOME tmp directory the current directory.

--- (
  ===> Begin a first level subshell environment.

--- echo "&{KLZ_System_Statistics.System_Load_15min}" ;
  ===> Write the 15 minute system load to standard output.
  ===> echo -n is not used because not all platforms support it.

--- (cat ponline.txt 2>/dev/null || echo 1)
  ===> Within a new subshell, copy ponline.txt to standard output. Suppress error
  ===> messages with 2>/dev/null and, if no file exists, put 1 into standard output.

--- ) |
  ===> Close the first subshell. Its two lines of standard output are piped to awk.

--- awk  '{load=$1;getline;cpu=$1;}
  ===> Run the awk command and get one number from each of the two lines.

--- END{
  ===> After reading all input, perform the ending logic.

--- if (load/100 > cpu*4.0)
  ===> The load is the 15 minute system load. It is divided by 100 because the agent supplies the value as a scaled integer.
  ===> The cpu is the number of online processors, from the ponline.txt file.
  ===> The test is the business rule: signal if the system load is more than 4.0 times the number of processors.

--- system(sprintf("echo %.2f >sysload15.hi",load/100));
  ===> The first system call is the true branch of the if statement. It creates the marker file.

--- else system("rm -f sysload15.hi");
  ===> The second system call is the false branch of the if statement. It erases the marker file.

--- exit 0;}'
  ===> Exit the awk process.
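The action command can be tried outside the agent before it goes into the situation editor. The sketch below wraps the same pipeline in a shell function, with the agent substitution replaced by an argument. `check_sysload` and its two arguments are hypothetical names for this dry run only. The load is passed in the scaled form described above, so 950 stands for 9.50.

```shell
# Dry run of the IBM_sysload15_calc action command outside the agent.
# check_sysload is a hypothetical helper name, not part of ITM.
# $1 = work directory (stands in for $CANDLEHOME/tmp)
# $2 = scaled load value as the agent would substitute it (950 means 9.50)
# The body runs in a subshell so the caller's current directory is unchanged.
check_sysload() (
    cd "$1" || exit 1
    (
        echo "$2" ;
        (cat ponline.txt 2>/dev/null || echo 1) ;
    ) |
    awk '{load=$1;getline;cpu=$1;}
         END{
         if (load/100 > cpu*4.0)
         system(sprintf("echo %.2f >sysload15.hi",load/100));
         else system("rm -f sysload15.hi");
         exit 0;}'
)

# Example, using a scratch directory of your choice:
#   echo 2 > /tmp/sitwork/ponline.txt
#   check_sysload /tmp/sitwork 950    # creates sysload15.hi
#   check_sysload /tmp/sitwork 300    # removes sysload15.hi
```

With 2 processors recorded, a load of 9.50 exceeds 8.0 and the marker file is created. A load of 3.00 removes it again.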

 

IBM_sysload15_high

Check for the marker file. When it is present, a high 15 minute system load condition exists.

Attribute Group: Linux File Information
Formula: (  Path == '/opt/IBM/ITM/tmp/' AND File == 'sysload15.hi')
Sampling Interval: 5 minutes
Rowsize: 3580

 

Event Receivers and Helper situations

The first two situations are helper situations. After testing, make sure they are not associated with any Portal Client Navigation nodes. The EIF tab should be set so that their events are not transmitted to any event receiver such as Omnibus.

 

Remote TEMS Performance Estimate for 600 agents

Situation 1 – ignored since only runs once.

 

Situation 2 – always true so result is sent each time.

Every 15 minutes there will be 600 results at 236 bytes which averages 9440 bytes/minute

 

Situation 3 – Assume 600 agents and 5% are showing a problem.

The 30 agents that have a problem will send

Count*rowsize*freq/hr = 30*3580*12 = 1,288,800 bytes/hour, or 21,480 bytes per minute.

In this scenario, there will be 30,920 result bytes per minute when combined with situation 2.

That is a reduction of about 98% from the observed rate on the remote TEMS.

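You can adjust the situation intervals to speed recognition of the condition, with a parallel increase in the remote TEMS burden. Here is a formula you can use to estimate the burden on the remote TEMS. Because the intervals are in minutes and each term is multiplied by 60, the result is in bytes per hour. Divide it by 60 to get bytes per minute.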

(Agents*Agent_Sysload15%*60*3580)/Check_Interval + (Agents*236*60)/Calc_Interval

Agents = number of agents being looked at
Agent_Sysload15% = fraction of agents exceeding the sysload15 benchmark
Check_Interval = Sampling interval in minutes for condition check
Calc_Interval = Sampling interval in minutes for calculation situation
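To try different intervals quickly, the formula can be evaluated with awk. This is a minimal sketch. `estimate_burden` is a hypothetical helper name, and the row sizes 3580 and 236 are the ones quoted for the example situations in this post. With intervals in minutes, the factor of 60 in the formula makes the raw result an hourly figure, so the sketch prints that and the per-minute value.

```shell
# estimate_burden is a hypothetical helper name, not part of ITM.
# $1 = Agents            number of agents
# $2 = Agent_Sysload15%  fraction of agents over the threshold (e.g. 0.05)
# $3 = Check_Interval    minutes between marker file checks
# $4 = Calc_Interval     minutes between calculation runs
estimate_burden() {
    awk -v agents="$1" -v frac="$2" -v check="$3" -v calc="$4" 'BEGIN {
        # Same terms as the formula above; row sizes come from the example situations.
        per_hour = (agents * frac * 60 * 3580) / check + (agents * 236 * 60) / calc
        printf "%.0f bytes/hour, %.0f bytes/minute\n", per_hour, per_hour / 60
    }'
}

# Example: 600 agents, 5% over threshold, 5 and 15 minute intervals
#   estimate_burden 600 0.05 5 15
#   prints: 1855200 bytes/hour, 30920 bytes/minute
```

Halving Check_Interval doubles the first term only, which gives a quick feel for what faster recognition costs.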

              

Note:

In some environments you may need to use nawk [new awk] to get correct results. This was seen on a Solaris platform.
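If the same command has to run on mixed platforms, one option is to choose the awk program at run time. The sketch below is an assumption on my part rather than something tested in the customer environment. `pick_awk` is a hypothetical helper name, and it relies on the shell supporting `command -v`, which very old shells may not.

```shell
# pick_awk is a hypothetical helper name, not part of ITM.
# Prefer nawk when it is installed, otherwise fall back to plain awk.
pick_awk() {
    if command -v nawk >/dev/null 2>&1; then
        echo nawk
    else
        echo awk
    fi
}

# Example: scale an agent-style integer with whichever awk was found
#   AWK=$(pick_awk)
#   echo 800 | $AWK '{print $1/100}'
```

In the action command itself, it may be simpler to write nawk directly on the platforms that need it.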

Summary

This shows how to implement a business rule with situations that produce far fewer result bytes. In this particular example the reduction was about 98% compared to the single mixed attribute group situation.

Sitworld: Table of Contents

 

Photo note: Three mixed up cats enjoying the sun.
