I had a new article published today on IBM developerWorks: Tracing IBM AIX hdisks back to IBM System Storage SAN Volume Controller (SVC) volumes
Brian Smith's AIX / UNIX / Linux / Open Source blog
This is an update to my previous post on a Script to show recent Error Report (errpt) entries on AIX
Anthony and Dan had some good suggestions such as being able to specify the interval to go back in days instead of just minutes, and also having an option to just have the script show error report entries since the last time the script was run.
So below is version two of this script.
The changes are:
Here is the updated script:
Update 10/24/13: See also Version 2 of script to show recent Error Report entries on AIX
Here is a script that will show you recent Error Report (errpt) entries on AIX. As an argument to the script you specify the number of minutes you want to go back, and the script will only show errpt entries that have occurred within that many minutes from now.
This can be helpful as a standalone utility, or as part of a monitoring script that would automatically notify you if a new errpt entry came up within the last few minutes.
For example, to only show errors that have occurred within the last 15 minutes:
Or the last hour:
Or the last day:
Here is a screenshot:
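The time filtering described above can be sketched as a small filter over errpt summary output. This is not the original script, just a minimal illustration. It assumes the input looks like default "errpt" output, where the second column is a timestamp in mmddhhmmyy format, and it takes the cutoff in that same format. The function name errpt_since is hypothetical, and working out the cutoff for "N minutes ago" is left out here.

```shell
# Sketch: keep only errpt summary lines whose TIMESTAMP column (mmddhhmmyy)
# is at or after a cutoff given in the same mmddhhmmyy format.
# Assumption: field 2 is the timestamp and the header line starts with IDENTIFIER.
errpt_since() {
    cutoff="$1"
    awk -v cutoff="$cutoff" '
        # Reorder mmddhhmmyy into yymmddhhmm so a string comparison orders by time
        function sortable(ts) { ts = ts ""; return substr(ts, 9, 2) substr(ts, 1, 8) }
        # Pass the header line through unchanged
        NR == 1 && $1 == "IDENTIFIER" { print; next }
        # Print entries at or after the cutoff
        length($2) == 10 && sortable($2) >= sortable(cutoff) { print }
    '
}
```

On an AIX system this might be used as "errpt | errpt_since 1024120013" to show entries logged since 12:00 on October 24, 2013. Two-digit years are compared as-is, so the sketch does not handle entries that span a century boundary.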
Every Filesystem in AIX has two sets of permissions: The permissions on the mount point directory, and the permissions on the mounted filesystem.
Here is an example:
Normally the Mount Point Permissions don't come into play once the filesystem is mounted (however, here is a post that shows what I recommend for them).
However, if a user doesn't have read/execute permissions on the mount point you will see weird behavior and frequently have application issues as well.
Here is an example showing this:
As a non-root user, we do an "ls -al" in the directory and get a weird "./..: Permission denied" error. This is caused because the underlying mount point permissions are restricted (700) and the user doesn't have read/execute permissions on the underlying mount point (even though the mounted filesystem has 777 permissions).
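To make the condition concrete, here is a minimal sketch that takes a mode string as printed by "ls -ld" (for example drwx------ for 700) and reports whether user, group, and others all have read and execute permission. The function name mount_point_mode_ok is hypothetical. It only inspects a string you pass in, so it does not by itself read the underlying mount point of a mounted filesystem.

```shell
# Sketch: check an "ls -ld" style mode string for read/execute on user, group, others.
# Layout of the string: 1 = file type, 2-4 = user, 5-7 = group, 8-10 = others.
mount_point_mode_ok() {
    mode="$1"
    case "$mode" in
        # r and x (or setuid/setgid/sticky variants of x) present in all three triplets
        dr?[xs]r?[xs]r?[xt]*) return 0 ;;
        *)                    return 1 ;;
    esac
}
```

For example, "mount_point_mode_ok drwx------" returns non-zero (too restrictive), while "mount_point_mode_ok drwxr-xr-x" returns zero.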
Now there are 2 different ways to check to see what the permissions are on the underlying mount points of your filesystems. You can unmount the filesystem, and do an "ls -ald" on the mount point (but it will probably require application downtime to unmount the filesystem...) Or you could use this handy script that will show you the underlying mount point permissions while the filesystem is online and mounted.
Just a quick disclaimer however... These scripts have worked with my limited testing; but use them at your own risk. The IBM documentation always recommends unmounting the filesystem to check or change mount point permissions and this is the safest and best way to do it. These scripts will do everything with the filesystem mounted and online.
When run, it will show you the underlying mount point permissions for all mounted JFS/JFS2 filesystems:
If you have filesystems with too restrictive mount points that are causing you issues (like /test1 and /app2 in the example above), then you can either unmount the filesystem and change the mount point permissions, or use this script to add read/execute permissions to user/group/others on the underlying mount point directory while it is still mounted and online:
With the script you specify a filesystem and it will add the read/execute permissions on the underlying mount point:
One of the fundamental principles of troubleshooting any issue is to look for what has changed between the time things went from working to not working. One thing that could be relevant to any issue is if any software has been recently updated or installed.
AIX provides the "lslpp -h" command which will show fileset installation and update dates for each fileset. Unfortunately it doesn't sort the output of this, so it can be very difficult to look through the output to find filesets that have been recently updated or installed.
Here is a one liner that will sort the output and show you the most recently installed and updated filesets on your AIX system:
If you pipe it to "tail" you can see the 10 most recently installed or updated filesets on your AIX system:
Note: This one liner is assuming the date output will be MM/DD/YY... If you are in a different locale you might need to modify the one liner a little bit or temporarily change the environment variables to set your system to output dates in MM/DD/YY.
Also, in case you are wondering about the "sed 's/70/-70/'" and "sed 's/-70/70/'" parts of the script: I noticed that lslpp -h on some servers will list the dates for some filesets as the year 70 (as in the UNIX epoch). Since "70" (1970) is bigger than "13" (2013), these old "1970" filesets were getting listed as the newest. So before the "sort" of the output I search and replace any 70's with -70's so they will be sorted correctly, and then after the sort I change all the -70's back to 70. This way the dates are sorted correctly and the output still looks good.
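As an illustration of the sorting problem, here is a minimal sketch that sorts lines by the first MM/DD/YY date found on each line. Note that it is a different technique from the sed substitution described above: it uses a century pivot, treating years 70 and above as 19YY and everything else as 20YY. The function name sort_by_mdy is hypothetical, and lines that contain no MM/DD/YY date are dropped.

```shell
# Sketch: sort input lines by the first MM/DD/YY date on each line, oldest first.
# Assumption: two-digit years >= 70 mean 19YY, all others mean 20YY.
sort_by_mdy() {
    awk '
        match($0, /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]/) {
            d  = substr($0, RSTART, RLENGTH)
            mm = substr(d, 1, 2); dd = substr(d, 4, 2); yy = substr(d, 7, 2)
            century = (yy + 0 >= 70) ? "19" : "20"
            # Prefix each line with a sortable CCYYMMDD key and a tab
            printf "%s%s%s%s\t%s\n", century, yy, mm, dd, $0
        }
    ' | LC_ALL=C sort | cut -f2-   # sort on the key, then strip it off again
}
```

Piping date-bearing output through "sort_by_mdy | tail" would then show the most recent entries last, with any epoch-dated (year 70) lines sorted to the top as the oldest.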
There are several storage-related settings in AIX that cannot be changed while the device is active. These include "fast_fail", Dynamic Tracking (dyntrk), and "num_cmd_elems" for HBA's, and the Queue Depth for hdisks.
Your options to set these are either make the device inactive (usually by taking redundant paths offline) and then make the change, or to use the "-P" flag on chdev and then reboot the server to make the change effective at the next boot.
The "-P" option on chdev has one major drawback however. As soon as you make the change with chdev "-P" it appears that the setting is active right away even before the reboot. If you check with "lsattr" it will appear as if the setting has taken effect. However it actually won't take effect until the next reboot. What has essentially taken place is that the running configuration is out of sync with the ODM. The ODM reflects the updated settings, however they can't be changed in the running configuration of the AIX kernel until the next reboot.
I've actually had discussions with other people who insisted that changing something like fast_fail with the chdev "-P" didn't require a reboot because they checked "lsattr" and it showed it had been changed.
Needless to say this can cause some serious confusion and other issues. The fact is that if you don't know the history of a server and who's worked on it you really can't trust the output of "lsattr" when looking at things like fast_fail, dyntrk, num_cmd_elems, and queue depth.
Chris Gibson did some excellent postings in the past on how to manually use kdb to see if these types of settings have been changed with chdev "-P" but the server hasn't been rebooted for them to actually take effect:
These are excellent posts on how to manually check this, but unfortunately it is not an easy task to do as you have to go in to KDB and run a couple of commands and then decode the cryptic output and do some hex to decimal conversions.
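The hex to decimal conversion and comparison step can be sketched as follows. This is only an illustration of that one step: it assumes you already have the ODM value (in decimal, as "lsattr" would show it) and the running value (in hex, as decoded from the kdb output), and it does not show how to extract anything from kdb. The function name check_sync and the sample values are hypothetical.

```shell
# Sketch: compare a decimal ODM value with a hex running value.
# Arguments: attribute name, ODM value (decimal), running value (hex, optional 0x prefix).
check_sync() {
    name="$1"; odm_value="$2"; running_hex="$3"
    running_hex="${running_hex#0x}"          # strip an optional 0x prefix
    running_value=$((0x$running_hex))        # convert hex to decimal
    if [ "$odm_value" -eq "$running_value" ]; then
        echo "$name: in sync (ODM=$odm_value, running=$running_value)"
    else
        echo "$name: OUT OF SYNC (ODM=$odm_value, running=$running_value)"
    fi
}
```

For example, "check_sync num_cmd_elems 200 0xC8" reports the attribute as in sync, because hex C8 is decimal 200.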
I took the process that Chris Gibson blogged about and automated it through a couple of scripts.
The first script checks all the HBA's on your system and will show you if the fast_fail, Dynamic Tracking (dyntrk), or num_cmd_elems is out of sync with the running configuration:
Here is the output when everything is in sync between the ODM and the running configuration:
Here is the output when one of the settings is out of sync (in this example the fastfail setting). This shows that the setting was changed with chdev "-P" but the server was never rebooted:
The second script checks the Queue Depth settings for each hdisk. Here is the output when everything is in sync between the ODM and the running configuration:
Here it is when the settings are out of sync. This shows that the Queue Depth's were changed with chdev "-P" but the server was never rebooted:
Here is the script to check the HBA settings:
Here is the script to check the hdisk Queue Depth's:
Note that these scripts use "kdb" which always makes me a little nervous. Please test the scripts out in your environment on a test server first. Also note that these scripts will more than likely not work on AIX 5.3 or older systems.
If you liked these scripts, you might also like "prdiff". It will show the differences between your LPAR's saved profile and its running configuration. For more information on it, see the project website at: http://prdiff.sourceforge.net/
This is an update to my previous post on Visualizing the Physical Layout of an AIX Volume Group (see the previous post for full details). Sebastian posted a comment suggesting an option to consolidate/merge PP's that are sequentially numbered to make the script usable on hdisks that might have over 50,000 PP's. This is a great idea, and I was able to update the script to add this.
The options for the script have changed:
-v <vgname> (Specifies VG name, required)
-c ## (Optional - specifies number of columns, defaults to 2 in merge mode, 3 in non-merge mode)
-m (Optionally Enables merge/consolidate mode for sequentially numbered PP's. In parentheses it shows number of PP's and size of merged chunk. )
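For readers who want to see how options like these are typically handled, here is a minimal getopts sketch. It is not taken from the actual script. It only mirrors the documented behavior: -v is required, and the column count defaults to 2 in merge mode and 3 otherwise. The function name parse_vg_options is hypothetical.

```shell
# Sketch: parse -v <vgname>, -c <columns>, and -m, applying the documented defaults.
parse_vg_options() {
    vgname=""; columns=""; merge=0
    OPTIND=1                                  # reset so the function can be called repeatedly
    while getopts "v:c:m" opt; do
        case "$opt" in
            v) vgname="$OPTARG" ;;
            c) columns="$OPTARG" ;;
            m) merge=1 ;;
            *) echo "usage: -v <vgname> [-c columns] [-m]" >&2; return 1 ;;
        esac
    done
    if [ -z "$vgname" ]; then
        echo "error: -v <vgname> is required" >&2
        return 1
    fi
    # Default column count depends on whether merge mode is enabled
    if [ -z "$columns" ]; then
        if [ "$merge" -eq 1 ]; then columns=2; else columns=3; fi
    fi
    echo "vg=$vgname columns=$columns merge=$merge"
}
```

Calling "parse_vg_options -v rootvg -m" prints "vg=rootvg columns=2 merge=1", and leaving off -m gives the 3 column default.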
Here is a screenshot of the new "merge mode". If you leave off the "-m" it will show details on all PP's (see my previous post for screenshots of that mode).
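The idea behind merge mode can be sketched on its own. The function below is a simplified illustration, not the script itself: it assumes input lines of the form "PP-number LV-name" already sorted by PP number, and it takes the PP size in MB as an argument so it can print the size of each merged chunk. The function name merge_pps and the input format are assumptions.

```shell
# Sketch: merge runs of sequentially numbered PPs that belong to the same LV.
# Input: lines of "<pp_number> <lv_name>", sorted by pp_number.
# Argument: PP size in MB, used to report the size of each merged chunk.
merge_pps() {
    pp_size_mb="$1"
    awk -v size="$pp_size_mb" '
        # Print the run collected so far, if there is one
        function emit() {
            if (count > 0)
                printf "%d-%d %s (%d PPs, %d MB)\n", start, prev, lv, count, count * size
        }
        {
            if (count > 0 && $2 == lv && $1 == prev + 1) {
                prev = $1; count++            # extend the current run
            } else {
                emit()                        # close the previous run and start a new one
                start = $1; prev = $1; lv = $2; count = 1
            }
        }
        END { emit() }
    '
}
```

With 64 MB PPs, PPs 1 and 2 of hd5 followed by free PPs 3 through 5 would be reported as "1-2 hd5 (2 PPs, 128 MB)" and "3-5 free (3 PPs, 192 MB)". A gap in the numbering starts a new chunk even when the LV name is the same.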
Here is the updated script:
9/23/13 Update - See this updated version of the script as well.
Here is a script I've written to visualize the physical layout of an AIX volume group. The script visually shows the location of every Physical Partition (PP) on each hdisk (AKA Physical Volume). The output shows which Logical Volume (LV) is on each of the PP's (or if it is free space). The output is color coded so each LV has its own color so that it is very easy to see where each LV physically is across the entire Volume Group. You can specify the number of columns of output depending on the size of your screen.
The intended use of the script is to show a visual representation of the Volume Group to make using commands which move around LP's/PP's such as migratelp easier to use, to make LVM/disk maintenance easier, and also as a learning tool.
Here are a few screenshots:
When running the script you specify 2 parameters: The volume group name, and the number of columns you would like displayed (or it will default to 3 columns if not specified).
Here is the script:
If you aren't familiar with EZH (the Easy HMC Command Line Interface), check out the EZH project website on SourceForge. It is a free, easy to install script for your HMC that provides a much better command line interface to your HMC and will make you much more efficient when working with your HMC. Not only does it provide shortcuts for all common HMC commands, it also adds additional functionality not available natively and even includes an interactive menu.
I was recently asked if it is possible to use EZH commands remotely over an SSH connection. For example, say you had a NIM server with SSH keys set up to your HMC: can you run a script on the NIM server that connects to the HMC and then uses an EZH command? The answer is YES, it can be done, if you use an SSH command line such as this:
This connects to the HMC, loads the ezh script, and runs the "lparls" ezh command. Note that the quotation mark placement must match the example for this to work.
EZH was first released September 2012 so it has been out for about a year. If you haven't seen it yet, give it a try... If you have any suggestions for how to improve EZH please post a comment or send me an email.
To determine the oslevel on AIX, you can run the "oslevel -s" command. However, what "oslevel -s" reports doesn't always show the entire picture. The OS level reported will be the lowest level of any installed AIX fileset on your server.
For example, if all the filesets on your AIX server are upgraded to AIX TL8 SP3 except for one fileset which is at a lower level, then the oslevel reported will reflect the lower level of that single fileset, which might be something like TL4 SP2. So even though your server is 99.9% AIX TL8 SP3 oslevel would report the lowest level of any installed fileset.
The "oslevel -sq" command will show all of the service packs that your AIX server is aware of. If you compare the top line in "oslevel -sq" versus "oslevel -s" they should normally match. If they don't, then you probably have an issue.
If you have a downlevel OS you can figure out which filesets are causing the issue and then fix them.
The first step in figuring out which filesets are causing the problem is to determine whether your TL (Technology Level) is incorrect or just your SP (Service Pack) level is incorrect. To do this, compare the highest "oslevel -sq" line with your current "oslevel -s".
If the first 7 characters (####-##) match, but the rest are different, then your TL level is correct, but your SP level is not. For example, if your top line in your "oslevel -sq" output was 6100-07-02-1150, and your "oslevel -s" output was "6100-07-01-1141" then you would know your TL level was correct at TL7, but your SP level was not (oslevel -sq reported SP2, oslevel -s reported SP1). To determine which filesets are the problem if the TL level is correct, but the SP level is wrong, run:
This command will show you all the filesets that are below the SP level of the highest known SP level on the system.
If the TL level doesn't match, for example if your top line in your "oslevel -sq" output was 6100-07-02-1150, and your "oslevel -s" output was "6100-04-11-1140" then you would know your TL level is incorrect (oslevel -sq reported TL7, oslevel -s reported TL4). To determine which filesets are the problem if the TL level is not correct, run:
This command will show you all the filesets that are below the TL level of the highest known TL level on the system.
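The comparison described above (first 7 characters for the TL, the rest for the SP) can be sketched as a small function. It takes the two level strings as arguments rather than running oslevel itself, so the logic can be tried anywhere. The function name compare_oslevel is hypothetical.

```shell
# Sketch: classify the difference between the highest known level and the current level.
# Arguments: top line of "oslevel -sq", output of "oslevel -s".
compare_oslevel() {
    highest="$1"; current="$2"
    # The first 7 characters (####-##) identify the release and TL
    highest_tl=$(printf '%s' "$highest" | cut -c1-7)
    current_tl=$(printf '%s' "$current" | cut -c1-7)
    if [ "$highest" = "$current" ]; then
        echo "OK: oslevel matches highest known level $highest"
    elif [ "$highest_tl" = "$current_tl" ]; then
        echo "SP mismatch: TL $current_tl is correct, but SP level is downlevel"
    else
        echo "TL mismatch: running $current_tl, highest known $highest_tl"
    fi
}
```

Using the examples from above, "compare_oslevel 6100-07-02-1150 6100-07-01-1141" reports an SP mismatch, and "compare_oslevel 6100-07-02-1150 6100-04-11-1140" reports a TL mismatch. On an AIX system the arguments would come from the top line of "oslevel -sq" and from "oslevel -s".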
Here is a script that automates this process (note that this script doesn't work with AIX 5.2 or older). It will check the state of your system and let you know if you have downlevel filesets:
Here is a screenshot of the output:
You might be wondering how you can avoid getting in a downlevel OS situation in the first place... Well usually this issue happens if you use the base media from an older level to install a fileset. For example, if you are at 6.1 TL8 SP3 and a user requests you install a new fileset. You only have the 6.1 TL7 SP2 base media, so you use it to install the requested fileset. If you just do this, your OS level will more than likely be downlevel now and report an incorrect version. What you need to do after installing a fileset from older media is to reinstall the TL8 SP3 update filesets to bring what you just installed up to the correct level. Remember - ALWAYS check the oslevel before and after you do any work related to filesets to make sure what you just did didn't downlevel the OS.