IBM Support

AIX user: "Austin, We've Got a Problem" & Performance PMRs

How To


Summary

I have never understood why system admin people are so reluctant to get help that has already been paid for. If your AIX support is up to date, you have the right to ask for support. Sure, check for a few obvious things, but then engage IBM or your support escalation process.

Steps

I often get asked: "I (or my customer) think the AIX system is going a bit slow! Can I send you some nmon data?"
  1. First, while nmon has performance data it is not aimed at problem diagnostics.
  2. Second, until you have a PMR (the IBM Problem Management Record) - you don't actually have a problem by definition here at IBM.
     

Here is my boilerplate answer. I regard this as a "work in progress" as I might add further thoughts.

  • (GREEN is a pre-PMR sanity check, BLUE is the PMR preparation, RED is the PMR phase):

errpt -a
  • Check for errors from hardware and software
  • AIX might be telling you what is wrong but are you listening?
  • Also check the HMC as it might know about other issues like a VIOS with issues
  • Running the VIOS Advisor (VIOS command called "part") is a very quick way to health check the VIOS.
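When the error log is long, a quick tally by class and type can show where to look first. The sketch below is a minimal example, not an official tool: it assumes the usual one-line-per-entry layout of plain errpt (IDENTIFIER, TIMESTAMP, T for type, C for class, RESOURCE_NAME, DESCRIPTION), and the function name errpt_counts is made up for illustration.

```sh
# Tally errpt summary entries by class (H, S, O, U) and type (P, T, I, U).
# Reads the one-line-per-entry output of plain `errpt` on stdin.
# Assumption: column 3 is the type and column 4 is the class.
errpt_counts() {
    awk '$1 != "IDENTIFIER" && NF >= 5 { n[$4 " " $3]++ }
         END { for (k in n) print k, n[k] }' | sort
}
```

On AIX this could be used as: errpt | errpt_counts. A line such as "H P 2" would then mean two permanent hardware errors, which are worth reading in full with errpt -a.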
df -g
  • Check the filesystems are not full - often a cause of problems
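With many filesystems it is easy to miss the one that is nearly full. The following is a rough sketch that filters df -g output by percent used. It assumes the standard AIX column order (Filesystem, GB blocks, Free, %Used, Iused, %Iused, Mounted on); the function name full_filesystems and the default limit of 90 are my own choices, not IBM guidance.

```sh
# Print mount point and %Used for filesystems at or above a limit.
# Reads `df -g` output on stdin; the limit defaults to 90 (an arbitrary choice).
# Assumption: %Used is column 4 and the mount point is column 7.
# Caveat: very long device names can wrap onto two lines and would be missed.
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        used = $4
        sub(/%/, "", used)
        # Pseudo filesystems such as /proc show "-" and are skipped.
        if (used ~ /^[0-9]+$/ && used + 0 >= limit + 0) print $7, $4
    }'
}
```

On AIX this could be used as: df -g | full_filesystems 95.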
lsps -a
  • Check paging space is OK - if 100% used then the AIX kernel has no option but to kill the occasional process
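The paging space check can be scripted the same way. This is a minimal sketch that assumes the usual lsps -a layout, where the first column is the paging space name, the fourth is the size and the fifth is %Used (printed without a percent sign). The function name paging_space_alert and the default limit of 70 are illustrative assumptions.

```sh
# Print name, size and %Used for paging spaces at or above a limit.
# Reads `lsps -a` output on stdin; the limit defaults to 70 (an arbitrary choice).
# Assumption: column 1 is the name, column 4 the size, column 5 the %Used.
paging_space_alert() {
    limit=${1:-70}
    awk -v limit="$limit" 'NR > 1 && $5 ~ /^[0-9]+$/ && $5 + 0 >= limit + 0 {
        print $1, $4, $5 "%"
    }'
}
```

On AIX this could be used as: lsps -a | paging_space_alert. No output means every paging space is below the limit.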
nmon then type cmdnt
  • Quick look with nmon to see if you can pinpoint any specific issues with CPU, Memory, Disks, Network and Top processes etc
  • Not going to make this an nmon tutorial but hopefully you know the machine well enough to know what it should look like. If not, compare with nmon data captured before the issue and look for large differences.
    • CPU - assigned less than expected? Or physical CPU use much larger than normal?
    • Memory - assigned less than expected? or any serious paging, free list size OK, numperm as expected (filesystem cache size)
    • Disks - any disks over 80% busy
    • Network - not being limited by the line speed
    • Top processes - The expected processes at the top and no crazy spinning processes or Zip eating whole CPUs
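Comparing against nmon data captured before the issue only works if such data exists. The sketch below only builds the capture command; it does not run anything. It relies on the nmon recording options -f (write to a file), -t (include top processes), -s (seconds between snapshots) and -c (number of snapshots). The function name nmon_capture_cmd and the default 60 second interval are assumptions for illustration.

```sh
# Build an nmon recording command that covers a given number of hours.
# $1 = hours to cover, $2 = seconds between snapshots (defaults to 60).
# Prints the command rather than running it, so it can be reviewed first.
nmon_capture_cmd() {
    hours=$1
    interval=${2:-60}
    count=$(( hours * 3600 / interval ))
    echo "nmon -f -t -s $interval -c $count"
}
```

For example, nmon_capture_cmd 24 prints "nmon -f -t -s 60 -c 1440", which records one snapshot a minute for a day into a file in the current directory.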
snap -ac
  • Take an AIX snap which has all the configuration and software levels - AIX support will demand this before they do anything else.
  • Output will be in /tmp/ibmsupt/snap.pax.Z
perfpmr.sh 600
  • Take a perfPMR - AIX support will demand this before they do anything else on a performance PMR.
  • If you haven't got this performance capture tool then download it from http://www-01.ibm.com/support/docview.wss?uid=aixtools-42612263 and read the Readme link
  • The 600 is the minimum number of seconds it will run for, so about 10 minutes. Longer is good too. As it captures lots of data beforehand and in phases, it will take much longer to finish. Be patient.
  • It can affect performance, so don't run it during your yearly peak or vital periods, but you do want to capture an active period with the problem.
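perfpmr.sh writes its many output files into the current directory, so it helps to run it from a dedicated directory with plenty of free space. The wrapper below is only a sketch under assumptions: the function name run_perfpmr and the default directory /tmp/perfdata are hypothetical, and it expects perfpmr.sh to be in the PATH and to be run as root. Check the Readme for the authoritative procedure.

```sh
# Run perfpmr.sh for at least 600 seconds from a dedicated output directory.
# $1 = duration in seconds (defaults to 600)
# $2 = output directory (defaults to /tmp/perfdata, a hypothetical choice)
run_perfpmr() {
    secs=${1:-600}
    outdir=${2:-/tmp/perfdata}
    if ! [ "$secs" -ge 600 ] 2>/dev/null; then
        echo "duration must be a number of at least 600 seconds" >&2
        return 1
    fi
    if ! command -v perfpmr.sh >/dev/null 2>&1; then
        echo "perfpmr.sh not found in PATH" >&2
        return 2
    fi
    # The subshell keeps the caller's working directory unchanged.
    mkdir -p "$outdir" && ( cd "$outdir" && perfpmr.sh "$secs" )
}
```

For example, run_perfpmr 900 /perfdata/pmr_capture would capture for 15 minutes plus the extra time the tool needs for its other phases.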
Write a clear description of the symptoms based on measurable real facts
  • Don't include your guesses, feelings, project history from 1900, names of 20 managers or your mother's maiden name.
  • Include anything you changed recently (i.e. confessing your sins, immediately) and any observation you have made as that might give an important clue to the issue.
  • Machine brief: Machine serial number, Model details, LPAR details (Entitlement, Virtual CPUs, capped, shared, memory size), AIX level (oslevel -s), virtual or physical network and disks, and number and type of disks.
  • Have a description of the workload, data sizes and numbers of users, so AIX Support knows what they are looking at in the data and does not have to guess.
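Most of the machine brief can be gathered with a few standard AIX commands. The following is a sketch rather than a complete inventory script: the function names brief_section and machine_brief are made up, the choice of commands is mine, and a command that is missing on the system is reported instead of causing a failure.

```sh
# Print a titled section with the output of one command, or a note if the
# command is not installed on this system.
brief_section() {
    title=$1
    shift
    echo "== $title ($*) =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not found on this system)"
    fi
}

# Collect the basics for the PMR description. The command list is an
# assumption about what is useful, not an IBM requirement.
machine_brief() {
    brief_section "Serial number and model" lsattr -El sys0 -a systemid -a modelname
    brief_section "LPAR details" lparstat -i
    brief_section "AIX level" oslevel -s
    brief_section "Disks" lsdev -Cc disk
    brief_section "Adapters (network and storage)" lsdev -Cc adapter
}
```

On AIX this could be used as: machine_brief > /tmp/machine_brief.txt, and the resulting file pasted into or attached to the PMR.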
Raise a PMR
  • You should already have the process documented, including your IBM customer number, product serial numbers etc., so this should take you no more than 5 minutes.
  • I prefer to do this on the website so I can add error output, screen captures and files while other people prefer phoning in to start the process.
  • Once you have registered, Service Request is here: https://www.ibm.com/support/servicerequest
  • Severity
    • Severity 1 means the production system is not available and you will have a techie sleeping by the machine to respond to IBM Support 24 hours a day.
    • Severity 2 means the system has reduced function but is only manned during working hours
    • Severity 3 is everything else
    • Severity 4 is a request for information
  • Have one known Technical Owner of the PMR. I often get complaints via the sales channel that "IBM is not responding to the PMR" when, in fact, the customer techie and IBM support have been in hourly contact all week and making good progress.
Be prepared to gather information quickly
  • When support asks you for more information, 9 out of 10 times they do very little until you send them back the details.
  • So the ball is in your court - IBM is waiting on you.
  • If the PMR is important to you then check the status of the PMR every day to determine "whose back the monkey is on".
  • If you don't check every day then the PMR is not important and should be closed.
Be prepared to make changes
  • IBM is completely out of "magic pixie dust" so changes will be needed to fix a problem(s).
  • These include updating the system firmware, upgrading AIX with service packs and technology levels, and changing settings - these may require a system restart.
  • The more out of date your system is, the more painful these are likely to be. If you don't maintain your car for 5 years and it breaks down then "it really is your own fault!".
  • So plan for an outage or two in advance.
  • Most customers operate Live Partition Mobility which is a great tool for system firmware and VIOS upgrades with zero down time.
  • Any system that claims it "cannot be shut down" is doomed to go down at the wrong time and in an ugly way and take a long time to sort out.
  • Expect it to take a few iterations. An AIX guru friend of mine once said: "The little important bug is hiding behind the great big trivial bugs that we have to fix first to get them out of the way so we can see the problem clearly."

Only read this next bit if you are a really Smart Person

  1. Have the escalation process pasted on the wall
  2. Have your configuration details stored off the machine
  3. Know that your back-ups will work by experience
  4. Know your root passwords or how to get them - quickly
  5. Keep up to date on your system firmware, HMC, VIOS and AIX
    • By "up to date" I mean less than 1 year old. As it gets to one year, it should be flagged as a non-production, unsafe environment and probably not secure.
    • Get a manager to sign off the out-of-date list. Once signed, say "Oh thank goodness, now you <insert manager's name here> get the sack when it goes belly up and not me!"
  6. Collect perfPMR regularly on "working as normal" days.
    • The AIX Support guys tell me comparing a good day and a bad day takes a fraction of the time - the problem leaps out of the data.
    • I suggest once a quarter or yearly.
    • Plus before and after any system hardware or software change.
  7. Get to know your AIX workload
    • So you can spot odd changes
  8. Keep up to date on your hands-on POWER and AIX skills
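For item 6 above, keeping each "working as normal" capture in its own clearly named directory makes the later good day versus bad day comparison easier. This is a small sketch under assumptions: the function name baseline_dir, the directory naming scheme and the /perfdata root in the usage note are all hypothetical.

```sh
# Build a directory name for a perfPMR baseline capture.
# $1 = root directory for all captures
# $2 = label such as "normal" or "before_firmware_update" (defaults to "normal")
# $3 = date stamp (defaults to today in YYYYMMDD form)
baseline_dir() {
    root=$1
    label=${2:-normal}
    stamp=${3:-$(date +%Y%m%d)}
    echo "$root/perfpmr_${label}_$stamp"
}
```

A capture before a change might then look like: dir=$(baseline_dir /perfdata before_firmware_update), followed by mkdir -p "$dir" and running perfpmr.sh 600 from inside that directory.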

 Good hunting and may all your PMRs be small quick ones.

Additional Information


Other places to find content from Nigel Griffiths IBM (retired)

Document Location

Worldwide

Document Information

Modified date:
14 June 2023

UID

ibm11116261