Troubleshooting
Problem
PureData System for Operational Analytics is a complex appliance to manage. Appliance system administrators can use the following command lines to check the status of various aspects of their environments. These command lines are designed to gather information about the state of the appliance quickly, in a compressed and easy-to-read format.
The commands are available on any version of PDOA at any fix pack level unless restrictions are indicated.
The following commands are run as root on the management node unless otherwise indicated. Some commands may be run on any AIX host in the environment. The output shown will vary depending on the version of the appliance, the fix pack level of the appliance, the size of the appliance, modifications to the appliance, and the health of the appliance. Output that is host specific will vary depending on the host role. The roles of an appliance host are "management", "management standby", "admin", "admin standby", "data", and "data standby". Many of the commands have been provided to customers via problem tickets or are found in other appliance technotes.
Administrators are encouraged to establish baselines for their environments that they can compare to in the future.
Administrators are encouraged to learn the best practices for using "dsh" and its companion command, "dshbak". The techniques used will vary depending on the type of output expected from the "dsh" command. These techniques include using commands such as sort, cut, and sed to manipulate the output. Do not attempt to use "awk", as it is difficult to pass the required single and double quotation marks correctly through "dsh".
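For example, a minimal sketch (using the "${ALL}" host-list variable that appears in the commands below) that collapses identical output with "dshbak -c" and trims output without "awk":
# Collapse identical output from all hosts into shared dshbak stanzas
dsh -n ${ALL} 'oslevel -s' 2>&1 | dshbak -c
# Sort single-line output by host and squeeze repeated spaces with tr
dsh -n ${ALL} 'lsps -s | tail -1' 2>&1 | sort | tr -s " "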
The following commands are written so they can be copied and pasted from a browser window into a shell session. If typing the commands, note that single quotation marks and double quotation marks are significant and are not interchangeable.
When a system is in distress, it is possible for commands using "dsh" to hang. In those cases, the '-t <number>' option tells "dsh" to abandon the attempt if a connection is not established within the specified number of seconds.
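A minimal sketch, assuming "${ALL}" holds the host list:
# Abandon any host that does not accept a connection within 15 seconds
dsh -n ${ALL} -t 15 date 2>&1 | dshbak -c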
Some commands are better run serially because they can increase the load or conflict with each other if run in parallel. For those commands, the "dsh" '-f <number>' ("fanout") option limits how many hosts run the command at once; a fanout of 1 runs the command serially in the order the hosts are provided in the list.
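A minimal sketch; the command being fanned out is illustrative:
# A fanout of 1 runs the command on one host at a time, in list order
dsh -n ${ALL} -f 1 'errpt | wc -l' 2>&1 | dshbak -c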
Commands run with "dsh" may provide output to "stderr". For those commands use "2>&1 | dshbak -c" which will mix combine the "stderr" and "stdout" output together so it shows up in the "dshbak" stanza for the host. If the "stderr" output is not redirected it will show up at the top of the output and can be easily missed. One aspect of "stderr" versus "stdout" outputs is that "stderr" output is not buffered. Therefore the order of the output with both "stderr" and "stdout" are mixed is not guaranteed to match the actual order that the output was produced.
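A short illustration using a command that writes to both streams:
# The stderr message about the missing file lands in the same dshbak stanza as the stdout line
dsh -n ${ALL} 'ls /nonexistent /etc/hosts' 2>&1 | dshbak -c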
Commands run with "dsh" may provide no output on some hosts. On an environment with many hosts, this makes it difficult to notice that host is missing. For commands that can return no output use a technique which includes a command that is guaranteed to have output such as "echo 1;< cmds>...". Using this technique will ensure that all hosts provide output.
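A minimal sketch of the technique:
# "echo 1" guarantees a stanza for every host even when the grep matches nothing
dsh -n ${ALL} 'echo 1; df -g | grep "100%"' 2>&1 | dshbak -c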
| Category | Description | Command Line | Output |
|---|---|---|---|
| Hosts |
Verify hosts are available on the network.
Run as root from any host.
|
|
|
| Hosts |
Verify hosts are responding and can provide logins.
Run as root on any host.
|
dsh -n ${ALL} date 2>&1 | sort |
|
| Hosts |
Check the "errpt" command to see messages counts for different time periods.
More sophisticated queries can be used by looking at the "errpt" man page.
These commands use the "date" command to provide a formatted date value to the "-s" option.
|
Current Hour:
Current Day:
Current Month:
Current Year:
|
|
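A hedged sketch of the technique described in the preceding row; the date formats are assumptions based on the mmddhhmmyy format accepted by "errpt -s", and the counts include the errpt header line:
# Current day: entries logged since midnight
dsh -n ${ALL} 'errpt -s $(date +"%m%d0000%y") | wc -l' 2>&1 | dshbak -c
# Current hour: entries logged since the top of the hour
dsh -n ${ALL} 'errpt -s $(date +"%m%d%H00%y") | wc -l' 2>&1 | dshbak -c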
| Storage |
Verify Fibre Channel Path Counts
Fibre Channel path counts vary depending on the version of PDOA, the type of host, and, for data hosts, the number of hosts within the same rack.
|
|
V1.0:
--------
V1.1:
---------
|
| Storage |
Verify HDISK Path Counts
Shows a count (first column) of LUN types and enabled path counts.
The number of hdisks in the histogram output depends on the PDOA version, the host role, the size of the environment, and the number of hosts within each rack.
|
|
V1.0:
------
V1.1:
------
|
| Storage | Verify HDISK IDs |
|
V1.0
------
V1.1 ------
|
| Storage |
Verify GPFS is started.
|
dsh -n ${ALL} '/usr/lpp/mmfs/bin/mmgetstate -a' 2>&1 | dshbak -c |
|
| Storage |
Verify GPFS Mounts
Run as root. Does not require GPFS to be started to provide output.
|
|
V1.0:
-----
V1.1: -----
|
| Storage |
Check for unfixed events on the PDOA storage enclosures.
Run as root on management.
Shows the unfixed alerts on all of the storage enclosures in the environment.
On older firmware levels, following the MAP procedures does not close some alerts.
These commands work on Flash and V7000 based storage enclosures.
|
|
|
| Storage |
Check for drives that are offline.
Run as root on management host.
|
|
|
| HA |
Verify Domain Status (hatools)
Run as root on any host.
V1.0.0.4 and higher
V1.1.0.0 and higher
|
|
|
| HA |
Verify Domain Status (Tivoli System Automation).
Run as root on any host.
V1.0.0.3 and earlier have a single domain, bcudomain, across all hosts. V1.0.0.4 and higher and V1.1.0.0 and higher have one domain, bcudomain#, per rack in the environment.
|
|
|
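A hedged sketch for the preceding row, using the standard RSCT command "lsrpdomain" to show the domain name and operational state as seen from each host:
dsh -n ${ALL} 'lsrpdomain' 2>&1 | dshbak -c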
| HA |
Check Db2 Resource Group Status
The "lsrg" and "lssam" commands are more advanced commands than the standard command: "hals". These commands are run on all of the core nodes using "dsh" to obtain more granular information.
In PDOA appliances, the "lssam" produces a lot of ouptut that is difficult to read quickly. The following commands manipulate that command to only show resource groups at the partition level and do not show the status of the file systems, peer node equivalencies, or network equivalencies.
In PDOA environments hosts within the same "TSA" domain should show the same output. However, during transitional states or error states it is possible that hosts within the same domain will have different output which is not collapsed by the "dshbak -c" command.
Note that the "lssam" command and the "lsrg -m" command may appear to hang during some transitional states.
|
|
|
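A hedged sketch of the filtering technique described above; "${ALL}" is used for illustration, and hosts without a TSA domain will simply report an error in their stanza:
# Keep only the resource group lines from lssam; file system and equivalency detail is dropped
dsh -n ${ALL} 'lssam | grep "IBM.ResourceGroup"' 2>&1 | dshbak -c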
| HA |
HA Mount Status
The example output was taken with the domain online, the database down, and "mmumount all" run on all core hosts.
This demonstrates the various states of the storage resources as "TSA" and the associated "TSA" policies attempt to restart the storage.
If the output is blank, then all file system resources are in the "Online" state.
The commands "lssam" and "lsrg -m" may hang during some transitional states.
|
|
|
| HA | HA Equivalency Status |
This command is important on the core nodes. It is used to ensure per-rack and per-domain consistency of the domain metadata. There have been edge cases where the "TSA" equivalency definitions for "IBM.PeerNode" equivalencies can get into an inconsistent state within the domain. During "Roving HA" events the "IBM.PeerNode" equivalency definitions are altered to reflect the new primary and standby hosts as part of a failover. This also happens to all of the resources in the Db2 partition set resource groups. During these updates it is possible that the hosts within the domain do not synchronize correctly. A hedged example follows this row.
|
|
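A hedged sketch using the TSA "lsequ" command to list the equivalency names defined on each host so that per-rack and per-domain consistency can be compared:
dsh -n ${ALL} 'lsequ' 2>&1 | dshbak -c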
| HA | HA Resource Status |
This command is important on the core nodes (as is the HA Equivalency Status command). This command displays the roving HA resources. Pay close attention to the hostnames and the order of the hostnames. Compare this output to the equivalency check and to the contents of the db2nodes.cfg file. There are two separate commands, one for "IBM.ServiceIP" resources and one for "IBM.Application" resources; a hedged sketch follows this row.
|
|
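A hedged sketch for the preceding row; "Name" and "NodeNameList" are standard RSCT persistent attributes, and the node list order can be compared to db2nodes.cfg:
dsh -n ${ALL} 'lsrsrc IBM.ServiceIP Name NodeNameList' 2>&1 | dshbak -c
dsh -n ${ALL} 'lsrsrc IBM.Application Name NodeNameList' 2>&1 | dshbak -c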
| Capacity |
Check the partition filesystem capacities on all hosts.
This command checks filesystems that are shipped with PDOA. Other filesystems can be added to the for loop list to be tracked. Note that db2mlog and db2ssd are only available on V1.0 environments.
This presents a histogram showing the number of filesystems mounted with the same space % free and inode % free statistics.
Note that "db2ssd#" file systems are expected to be full as they are used for system temporary space and have no growth potential. Three of these file systems on the admin and admin standby hosts are not used, so they will appear empty.
|
|
V1.0:
-----
V1.1: (some hosts are down)
-----
V1.1: Healthier
-----
|
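A hedged sketch of the histogram technique; the file system names in the for loop are illustrative and the "df -g" field positions (%Used and %Iused) are assumptions based on standard AIX output:
# Count how many file systems share the same %Used / %Iused values on each host
dsh -n ${ALL} 'echo 1; for fs in /db2fs1p1 /db2flog1p1; do df -g $fs | tail -1; done | tr -s " " | cut -d" " -f4,6 | sort | uniq -c' 2>&1 | dshbak -c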
| DB2 |
Determine Db2 copies installed on the appliance.
|
|
V1.1.0.5
---------
V1.1.0.2
---------
More Comments:
---------------------
FP5_FP1 and earlier will include the IBM System Director embedded Db2 copy. This is removed as part of FP6_FP2.
The expected output shows all core hosts have the same Db2 copies and all management hosts have the same Db2 copies. The lone exception is that the management node will have a Db2 copy for IBM System Director if IBM System Director is still installed.
Customers who have multiple Db2 copies (other than on the management host) and are planning to apply V1.0.0.5/V1.1.0.1 or earlier fix packs must have only one Db2 copy on all hosts, except for the management host, which will have two copies.
"PDOA V1.1 FP2" and later fix packs update Db2 differently and do not have the same restrictions as earlier fix packs.
The "PDOA" method to install Db2 is to use the Db2 version in the installation path. However, if a Db2 fix pack was installed in that installation path, the version will no longer reflect the version installed. This fact makes it important to verify the Db2 versions installed on the environment and to not rely on the installation paths.
|
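A hedged sketch for the preceding row; "db2ls" lists the Db2 copies installed on each host (the /usr/local/bin link is created by the Db2 installer):
dsh -n ${ALL} '/usr/local/bin/db2ls' 2>&1 | dshbak -c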
| Db2 |
Determine the actual Db2 level installed per Db2 copy.
|
|
V1.0: -----
V1.1:
-----
|
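A hedged sketch; the installation path shown is an assumption and should be replaced with each copy path reported by "db2ls":
# Report the code level of one specific Db2 copy on every host
dsh -n ${ALL} '/opt/IBM/db2/V10.5/bin/db2level | grep Informational' 2>&1 | dshbak -c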
| Db2 |
Db2 Copies and associated Licenses.
Db2 10.1 should have the product identifier "iwee". Db2 10.5 and 11.1 should have the product identifier "db2aese".
The following command will help to discover inconsistent Db2 licenses in the environment.
|
|
V1.0:
-----
V1.1
-----
|
| Db2 |
Db2 Registry Entries: Instance Records
|
|
V1.0:
----
V1.1:
-----
Comments:
-------------
The following instances will exist only on the management host.
The following instances will exist only on the management hosts.
The appliance was designed to have only one instance on the core hosts. However, some customers have multiple instances and may have changed their instance from the default shown in the following list.
Carefully check the core instance records for differences. The most common issue is a standby host that is not updated correctly when Db2 special builds are applied by customers.
Db2 "DAS" instances do not appear in V1.1 systems and also should be removed from V1.0 systems.
|
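A hedged sketch; the Db2 copy path holding "db2greg" is an assumption, and instance records in the global registry dump begin with "I,":
dsh -n ${ALL} '/opt/IBM/db2/V10.5/bin/db2greg -dump | grep "^I,"' 2>&1 | dshbak -c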
| Db2 |
Db2 Registry entries for "S records".
Use the cut command to remove installation dates.
"GPFS/TSA/RSCT" entries were added as part of Db2 10.5 but Db2 does not manage these components in PDOA environments.
There is an "S record" for each Db2 copy installed on the host.
|
|
V1.0:
----
V1.1:
----
|
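A hedged sketch; the copy path and the comma-separated field range kept by "cut" are assumptions, chosen to drop the installation date so that "dshbak -c" can collapse matching hosts:
dsh -n ${ALL} '/opt/IBM/db2/V10.5/bin/db2greg -dump | grep "^S," | cut -d"," -f1-4' 2>&1 | dshbak -c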
| Db2 |
Db2 Registry V records.
The "DB2SYSTEM" variable will have unique values on each host. This variable is created as part of the Db2 installation pattern used by the appliance.
The "DB2INSTDEF" variable is not relevant to Db2 on AIX. See this link for more information. technote.
|
|
V1.0:
----
V1.1:
----
|
| Servers |
Server Status As Viewed From the HMC.
Run as hscroot on hmc1 or hmc2.
|
|
|
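A hedged sketch using standard HMC CLI commands, run as hscroot on the HMC itself rather than through "dsh"; the managed system name is an assumption:
# Managed servers and their states
lssyscfg -r sys -F name,state
# LPARs and their states on one managed server
lssyscfg -r lpar -m Server-8231-E2D-SN1234567 -F name,state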
| Adapters | Fibre Channel ("fcs") modified settings. |
|
V1.1 Example:
===========
Notes:
The top stanza shows the management "LPARs". Each "LPAR" has two 4-port HBA adapters.
The second stanza shows an admin and a data "LPAR". Each "LPAR" has four 4-port HBA adapters.
In PDOA, only the following attributes are modified on the Fibre Channel adapters: lg_term_dma, max_xfer_size, num_cmd_elems.
See https://www.ibm.com/support/knowledgecenter/en/SSH2TE_1.1.0/com.ibm.7700.r2.common.doc/doc/r00000185.html for V1.1 settings.
|
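A hedged sketch for the preceding row; it lists only the three modified attributes on every "fcs" adapter:
dsh -n ${ALL} 'for f in $(lsdev -Cc adapter | grep "^fcs" | cut -d" " -f1); do echo $f; lsattr -El $f -a lg_term_dma -a max_xfer_size -a num_cmd_elems; done' 2>&1 | dshbak -c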
| Interfaces | Fibre Channel Interface (fscsi) Settings |
|
V1.1 Example:
NOTES: PDOA does not modify any fscsi variables in V1.1.
|
| AIX hard disks (hdisks) | hdisk settings |
# V7K Disks
# Flash Disks
|
V1.1: (See https://www.ibm.com/support/knowledgecenter/en/SSH2TE_1.1.0/com.ibm.7700.r2.common.doc/doc/r00000185.html for V1.1 settings)
The following shows the V7000 settings on a V1.1 environment.
The following shows the Flash900 settings on a V1.1 environment.
NOTES:
The hosts indicated by hostname1 and hostname3 are management LPARs.
The hosts indicated by hostname2 and hostname4 are admin LPARs.
The hosts indicated by hostname5, hostname6 and hostname7 are data LPARs.
The data LPAR counts vary depending on the number of hosts in the same rack. Each data rack has one standby host and between one and four data hosts.
|
| Appliance file consistency. | How to compare a file on the management host to the same file on the rest of the hosts in the environment? |
A quick and useful way to compare files from the host running dsh to a set or all hosts in the environment.
Variations:
1. The host IP address, in the example 172.23.1.1, is the source to compare against. 172.23.1.1 is the default IP address for the management host.
2. The "-n ${ALL}" argument includes all AIX hosts in the environment.
3. The "f=/etc/ssh/sshd_config" argument represents the full path of the file to be used in the comparison.
If there is no difference between the source host and the target host, only the cksum will be displayed in the stanza for that host.
If there is a difference the "<" will indicate the target host differences and the ">" will indicate the source host differences.
|
|
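A hedged sketch of the comparison; it assumes passwordless root ssh between the hosts, which "dsh" itself already relies on:
# Each host prints its cksum, then diffs its copy ("<") against the management host copy (">")
dsh -n ${ALL} 'f=/etc/ssh/sshd_config; cksum $f; ssh 172.23.1.1 cat $f | diff $f -' 2>&1 | dshbak -c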
| Storage | FC Path Selection |
Requires FP6_FP2 or higher.
Flash:
V7000:
|
|
| Appliance | Map "CEC" to "LPAR" to Host using the "pflayer" tools. |
The following command is run as the root user on the management host. It will query the platform layer "ODM" database.
"PF ID": Platform Layer identifier for "server_os" resource types.
"LPAR": LPAR name that can be used when running LPAR commands on the HMC.
PROFILE: This lists the name of the profile used to start the LPAR.
HOSTNAME: This is the internal hostname for the LPAR.
IP: This is the internal IP address assigned to en11 for the host/LPAR.
MT: This is the model type for the Server or CEC that owns the LPAR.
SN: This is the serial number for the Server or CEC that owns the LPAR.
|
|
| Adapters | Network Adapter Etherchannel Health |
The following command runs "entstat -d ent11" on all hosts in the environment and then reports on the synchronization status. Management nodes will show 4 IN_SYNC if healthy (2 Actors + 2 Partners). Core nodes will show 4 segments and 8 IN_SYNC if healthy (4 Actors + 4 Partners). This will not identify which of the child adapters of the etherchannel are impacted, and the order of the children in the etherchannel changes over time for a variety of reasons.
This gives a quick, easily readable set of output. A hedged sketch follows this row.
|
|
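A hedged sketch for the preceding row; it counts the IN_SYNC occurrences in the "entstat -d" output for the ent11 etherchannel on every host:
dsh -n ${ALL} 'entstat -d ent11 | grep -c IN_SYNC' 2>&1 | dshbak -c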
| Adapters | Network Adapter Etherchannel Expanded Information |
This command uses the output of the command "entstat -d ent11". It pulls the adapter names, the synchronization status, and the partner port number on the switch in hexadecimal format, and then converts that to decimal.
While this cannot programmatically determine which "10Gb" switch a port is attached to, it can narrow down the list of ports involved when troubleshooting a network connection.
|
|
| Appliance | Using the platform layer to display the appliance components. This does not list the racks, PDUs, or KVM switch. |
Run as root on the management host.
1. Servers:
2. HMCs:
3. Storage Enclosures:
4. SAN Switches:
5. Network Switches
6. Network Adapters
7. Fibre Channel Adapters
|
1. Servers:
2. HMCs:
3. Storage Enclosures:
4. SAN Switches
5. Network Switches
6. Network Adapters
7. Fibre Channel Adapters
|
Document Location
Worldwide
[{"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSH2TE","label":"PureData System for Operational Analytics A1801"},"Component":"","Platform":[{"code":"PF002","label":"AIX"}],"Version":"V1.0;V1.1","Edition":"","Line of Business":{"code":"LOB76","label":"Data Platform"}}]
Log InLog in to view more of this document
This document has the abstract of a technical article that is available to authorized users once you have logged on. Please use Log in button above to access the full document. After log in, if you do not have the right authorization for this document, there will be instructions on what to do next.
Was this topic helpful?
Document Information
Modified date:
03 August 2022
UID
ibm10880017