IBM Support

WebSphere eXtreme Scale (WXS) monitoring minimum recommendations

Technical Blog Post



Body

 

 

 

The WebSphere eXtreme Scale (WXS) product documentation has a section on monitoring that details all the monitoring options.

 

However, the documentation in the Knowledge Center does not detail what specifically needs to be monitored or how to interpret the results.  As a result, in support we are often asked “What should we monitor in WXS to determine the health of the environment?”  Normally this question concerns environmental stability or determining the status of the environment after a container start or stop.

 

From a support perspective we normally focus on a few simple checks to determine if the WXS environment is “healthy”.  It is recommended to run these checks periodically during the environment's life cycle to ensure that the environment is stable.  At a minimum these checks should be done any time WXS processes are stopped or started.  To periodically check the health of the environment, it would be wise to automate both the running of the xscmd commands and the check of the generated output.
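As a starting point for automating the output check, here is a minimal Python sketch that confirms a captured xscmd output contains both the start and completion messages shown in the example outputs in this post.  The function name and the idea of capturing the output as a string are assumptions for illustration; how xscmd is invoked and how its output is captured is left to your own tooling.

```python
import re

# Message IDs taken from the example outputs in this post.
START_ID = "CWXSI0068I"    # "Executing command: <name>"
SUCCESS_ID = "CWXSI0040I"  # "The <name> command completed successfully."


def command_completed(output: str, command: str) -> bool:
    """Return True if the captured output reports success for `command`."""
    name = re.escape(command)
    started = re.search(rf"{START_ID}: Executing command: {name}\b", output)
    finished = re.search(
        rf"{SUCCESS_ID}: The {name} command completed successfully", output
    )
    return bool(started and finished)


COMPLETION_SAMPLE = """CWXSI0068I: Executing command: routetable
*** Displaying routing information for data grid: Grid:mapSet
CWXSI0040I: The routetable command completed successfully.
"""

if __name__ == "__main__":
    print(command_completed(COMPLETION_SAMPLE, "routetable"))
    print(command_completed(COMPLETION_SAMPLE, "showPlacement"))
```

A check like this only tells you the command ran; the content checks described in the sections below are still needed.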

 

1. xscmd routetable command

The routetable command displays the routetable as it exists in the catalog server.  The routetable is used by the WXS clients to route requests to the correct container.

 

Example output (note: your output may be formatted differently, spacing was adjusted for display purposes in this blog entry):

CWXSI0068I: Executing command: routetable
*** Displaying routing information for data grid: Grid:mapSet

Placement Scope: Domain

Shard Type         Partition   State       Host    Zone          Container
----------         ---------   ---------   -----   -----------   ---------
Primary                    0   reachable   host1   DefaultZone   con2_C-1
SynchronousReplica         0   reachable   host2   DefaultZone   con1_C-0
Primary                    1   reachable   host2   DefaultZone   con1_C-0
SynchronousReplica         1   reachable   host1   DefaultZone   con2_C-1
Primary                    2   reachable   host1   DefaultZone   con2_C-1
SynchronousReplica         2   reachable   host2   DefaultZone   con1_C-0

CWXSI0040I: The routetable command completed successfully.


In the routetable output, ensure that all the shards defined in the environment are accounted for and are in a reachable state.  If shards are missing, unassigned, or unreachable, this indicates issues during shard placement.  Shards with a status other than reachable can cause a wide variety of problems, including but not limited to TargetNotAvailableExceptions, replication issues, and data loss on failover.  Typically, placement issues are caused by placement work timeouts, which are almost always driven by problems with the network or with resource availability (CPU/memory).  The environment should be checked for these types of issues, and any that are found should be corrected to ensure the stability of the environment.

 

The xscmd triggerPlacement command can be issued to regenerate any failed placement work, and may allow any shards in an undesirable state to become reachable.  
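The routetable check in this section could be automated along the following lines.  This is a sketch only: it assumes the column layout shown in the example output above, and the expected partition count and shard types are values you supply from your own deployment policy.  The function name is hypothetical.

```python
def check_routetable(output, partitions,
                     shard_types=("Primary", "SynchronousReplica")):
    """Return a list of problems found in captured routetable output.

    Assumes each shard row has six whitespace-separated columns:
    shard type, partition, state, host, zone, container.
    """
    seen = {}
    problems = []
    for line in output.splitlines():
        parts = line.split()
        # Shard rows are the only six-column lines whose second column
        # is a partition number (header and separator rows are skipped).
        if len(parts) == 6 and parts[1].isdigit():
            shard_type, partition, state = parts[0], int(parts[1]), parts[2]
            seen[(partition, shard_type)] = state
            if state != "reachable":
                problems.append(f"partition {partition} {shard_type} is {state}")
    # Every expected shard must appear at least once.
    for partition in range(partitions):
        for shard_type in shard_types:
            if (partition, shard_type) not in seen:
                problems.append(f"partition {partition} {shard_type} is missing")
    return problems


ROUTETABLE_SAMPLE = """CWXSI0068I: Executing command: routetable
Shard Type         Partition   State       Host    Zone          Container
----------         ---------   ---------   -----   -----------   ---------
Primary                    0   reachable   host1   DefaultZone   con2_C-1
SynchronousReplica         0   reachable   host2   DefaultZone   con1_C-0
Primary                    1   reachable   host2   DefaultZone   con1_C-0
SynchronousReplica         1   reachable   host1   DefaultZone   con2_C-1
Primary                    2   reachable   host1   DefaultZone   con2_C-1
SynchronousReplica         2   reachable   host2   DefaultZone   con1_C-0
CWXSI0040I: The routetable command completed successfully.
"""

if __name__ == "__main__":
    print(check_routetable(ROUTETABLE_SAMPLE, partitions=3))
```

An empty list means every expected shard was found and is reachable.  Any other state string is reported as-is, since the sketch does not assume what the non-reachable states are called.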


2. xscmd showMapSizes command

The showMapSizes command sends a request to each container.  The container responds with details of the shards which exist on that container and the amount of data stored in each shard.  

 

Example output:

CWXSI0068I: Executing command: showMapSizes

*** Displaying results for Grid data grid and mapSet map set.

*** Listing maps for con2 ***
Map Name   Partition   Map Entries   Used Bytes   Shard Type           Container
--------   ---------   -----------   ----------   ------------------   ---------
Map1               0            33        10 KB   Primary              con2_C-1
Map1               1            34        11 KB   SynchronousReplica   con2_C-1
Map1               2            33        10 KB   Primary              con2_C-1
Server total: 100 (32 KB)

*** Listing maps for con1 ***
Map Name   Partition   Map Entries   Used Bytes   Shard Type           Container
--------   ---------   -----------   ----------   ------------------   ---------
Map1               0            33        10 KB   SynchronousReplica   con1_C-0
Map1               1            34        11 KB   Primary              con1_C-0
Map1               2            33        10 KB   SynchronousReplica   con1_C-0
Server total: 100 (32 KB)

Total catalog service domain count: 200 (65 KB)
(The used bytes statistics are accurate only when you are using simple objects or the COPY_TO_BYTES copy mode.)

CWXSI0040I: The showMapSizes command completed successfully.


When checking the showMapSizes output, ensure that all the defined shards are accounted for, the amount of data is as expected, and all expected containers are active.  The amount of data stored in the environment should not exceed the maximum amount of data that the environment was designed to hold.  
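A sketch of how the showMapSizes check might be scripted is shown below.  It assumes the "Listing maps for" and "Server total" lines appear as in the example output above, with entry counts printed as plain digits.  The expected server names and the entry limit are hypothetical inputs that you would take from your own capacity plan.

```python
import re


def server_totals(output):
    """Return {server_name: total_entries} from captured showMapSizes output."""
    totals = {}
    server = None
    for raw in output.splitlines():
        line = raw.strip()
        match = re.match(r"\*\*\* Listing maps for (\S+) \*\*\*", line)
        if match:
            server = match.group(1)
            continue
        match = re.match(r"Server total: (\d+)", line)
        if match and server is not None:
            totals[server] = int(match.group(1))
    return totals


def check_map_sizes(output, expected_servers, max_entries):
    """Flag servers that did not report and totals above the design limit."""
    totals = server_totals(output)
    missing = sorted(set(expected_servers) - set(totals))
    problems = [f"no data reported for {name}" for name in missing]
    # The per-server totals count replica shards as well as primaries.
    grand_total = sum(totals.values())
    if grand_total > max_entries:
        problems.append(f"total entries {grand_total} exceed limit {max_entries}")
    return problems


MAPSIZES_SAMPLE = """*** Listing maps for con2 ***
Map Name   Partition   Map Entries   Used Bytes   Shard Type           Container
Map1               0            33        10 KB   Primary              con2_C-1
Map1               1            34        11 KB   SynchronousReplica   con2_C-1
Map1               2            33        10 KB   Primary              con2_C-1
Server total: 100 (32 KB)

*** Listing maps for con1 ***
Map1               0            33        10 KB   SynchronousReplica   con1_C-0
Map1               1            34        11 KB   Primary              con1_C-0
Map1               2            33        10 KB   SynchronousReplica   con1_C-0
Server total: 100 (32 KB)
"""

if __name__ == "__main__":
    print(check_map_sizes(MAPSIZES_SAMPLE, ["con1", "con2"], max_entries=1000))
```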


3. xscmd showPlacement command

The showPlacement command displays the catalog server's view of where each shard is placed.  

 

Example output:

CWXSI0068I: Executing command: showPlacement

*** Show all online container servers for Grid data grid and mapSet map set.
Host: host2
  Container: con1_C-0, Server:con1, Zone:DefaultZone
    Partition   Shard Type           Reserved
    ---------   ------------------   --------
            1   Primary              false
            0   SynchronousReplica   false
            2   SynchronousReplica   false

Host: host1    
  Container: con2_C-1, Server:con2, Zone:DefaultZone
    Partition   Shard Type           Reserved
    ---------   ------------------   --------
            0   Primary              false
            2   Primary              false
            1   SynchronousReplica   false


  Number of containers matching  = 2
  Total known containers         = 2
  Total known hosts              = 2

CWXSI0040I: The showPlacement command completed successfully.

 

The showPlacement output should be checked to ensure that all expected shards, containers, and hosts are present.  The showPlacement output is the catalog server's view, which is considered the "master view" of shard placement, versus the container's view displayed in the showMapSizes output.  With the showPlacement output, review the distribution of primary and replica shards.  If shards are unbalanced this can lead to load distribution issues.  This is discussed in more detail in my blog: "WebSphere eXtreme Scale (WXS) Load Distribution"

 

If the primary and replica shards are not evenly distributed among the containers, they can be balanced using the balanceShardTypes command.  Because balanceShardTypes moves shards, in-flight transactions may fail while the action completes.  It is recommended to run the command at off-peak times to minimize the possibility of such failures.
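To review the primary distribution without reading the output by eye, something like the following sketch could count primary shards per container in captured showPlacement output.  It assumes the "Container:" lines and three-column shard rows shown in the example above.  The tolerance value is an assumption: with three partitions on two containers, a difference of one primary is unavoidable.

```python
import re
from collections import Counter


def primaries_per_container(output):
    """Return {container: number_of_primary_shards} from showPlacement output."""
    counts = Counter()
    container = None
    for line in output.splitlines():
        match = re.match(r"\s*Container: ([^,]+),", line)
        if match:
            container = match.group(1)
            counts[container] += 0  # register containers holding no primaries
            continue
        parts = line.split()
        # Shard rows: partition number, shard type, reserved flag.
        if container and len(parts) == 3 and parts[0].isdigit():
            if parts[1] == "Primary":
                counts[container] += 1
    return dict(counts)


def is_balanced(counts, tolerance=1):
    """True if primary counts differ by no more than `tolerance`."""
    if not counts:
        return False
    return max(counts.values()) - min(counts.values()) <= tolerance


PLACEMENT_SAMPLE = """Host: host2
  Container: con1_C-0, Server:con1, Zone:DefaultZone
    Partition   Shard Type           Reserved
    ---------   ------------------   --------
            1   Primary              false
            0   SynchronousReplica   false
            2   SynchronousReplica   false

Host: host1
  Container: con2_C-1, Server:con2, Zone:DefaultZone
    Partition   Shard Type           Reserved
    ---------   ------------------   --------
            0   Primary              false
            2   Primary              false
            1   SynchronousReplica   false

  Number of containers matching  = 2
"""

if __name__ == "__main__":
    counts = primaries_per_container(PLACEMENT_SAMPLE)
    print(counts, is_balanced(counts))
```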


4. Check the replication of the domain

The showReplicationState and revisions commands display the current revision/replication status of the grid.  A revision is made when a key/value pair is inserted, updated, or deleted.  Each of these operations changes the data on the primary shard for the partition, and the new data must be replicated to the replica shards for the partition.  This ensures that the data in the replica shards remains synchronized with the data in the primary shard.

 

a. xscmd showReplicationState command

 

Example output:

CWXSI0068I: Executing command: showReplicationState

Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
con1                                    0                                0
con2                                    0                                0

CWXSI0040I: The showReplicationState command completed successfully.

 

The showReplicationState output displays the number of outstanding revisions for the containers.  It is quite common to see outstanding revisions as the display just shows the revisions at a single moment in time.  The number of outstanding revisions will typically be higher on environments with numerous updates and inserts.  Using the showReplicationState output, check for abnormally large numbers of outstanding revisions, meaning the outstanding revision numbers are substantially higher than previously observed normal values.  If an abnormally large number of outstanding revisions are found, then monitor the outstanding revisions over time to ensure there was not a brief burst of heavy activity.  If the revisions are consistently increasing, this could mean there is a problem with replication.
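The "monitor over time" advice above can be sketched as two small helpers: one that extracts the outstanding revision counts from a captured showReplicationState output, and one that tells a steady climb apart from a brief burst across several samples.  The function names and the minimum of three samples are assumptions for illustration, not product guidance.

```python
def parse_outstanding(output):
    """Return {container: (inbound, outbound)} from showReplicationState output."""
    result = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows: container name followed by two integer counts.
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            result[parts[0]] = (int(parts[1]), int(parts[2]))
    return result


def steadily_increasing(samples, min_samples=3):
    """True if each sample is larger than the one before it.

    `samples` is a list of outstanding revision counts for one container,
    oldest first, collected from successive runs of the command.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


REPLICATION_SAMPLE = """CWXSI0068I: Executing command: showReplicationState

Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
con1                                    0                                0
con2                                    0                                0

CWXSI0040I: The showReplicationState command completed successfully.
"""

if __name__ == "__main__":
    print(parse_outstanding(REPLICATION_SAMPLE))
    print(steadily_increasing([5, 40, 90, 300]))  # keeps climbing
    print(steadily_increasing([5, 400, 20, 3]))   # burst that drained
```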

 

b. xscmd revisions command

 

Example output:

CWXSI0068I: Executing command: revisions

*** Revisions for catalog service domain: DefaultDomain, Grid: Grid, MapSet: mapSet
Partition   Type                 Domain          Lifetime ID     Revision   Lifetime Owner
---------   ------------------   -------------   -------------   --------   --------------
        0   Primary              DefaultDomain   1515426524160        668   con1
        0   Primary              DefaultDomain   1515426737236        333   con3
        0   SynchronousReplica   DefaultDomain   1515426524160        668   con1
        0   SynchronousReplica   DefaultDomain   1515426737236        333   con3
        1   Primary              DefaultDomain   1515426590725       1002   con2
        1   SynchronousReplica   DefaultDomain   1515426590725       1002   con2
        2   Primary              DefaultDomain   1515426524160        999   con1
        2   SynchronousReplica   DefaultDomain   1515426524160        999   con1


CWXSI0040I: The revisions command completed successfully.

 

The revisions command displays the revision number for each shard and each lifetime ID.  A new lifetime ID is generated when new shards are generated, which is expected when containers are stopped and started, so it is normal to see the same shard listed more than once in the revisions display with different lifetime IDs.  In the revisions display, compare the revision numbers of the primary and replica shards that belong to the same partition and have the same lifetime ID.  It is quite common to see differences between the primary and replica revision numbers, as the revisions display is just a snapshot of a moment in time and replication may be actively occurring.  What would be concerning is a very large difference between the primary and replica revision numbers for a specific partition and lifetime ID that remains consistently high or increases over time.

 

Typically the revisions command would be issued to collect more detail on the revisions if there are concerns raised in the showReplicationState output.  The revisions output would allow the user to see the replication for each partition and determine more about which partitions may be falling behind on replication. Issues with replication are typically caused by network or CPU issues.  These types of problems can slow replication or prevent replication from occurring completely.  
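The comparison described above, primary versus replica revision for the same partition and lifetime ID, might be scripted as follows.  This sketch assumes the six-column layout of the example revisions output; the function name is hypothetical.

```python
def revision_gaps(output):
    """Return {(partition, lifetime_id): primary_rev - lowest_replica_rev}.

    Only partition/lifetime ID pairs that list both a primary and at least
    one replica are included.
    """
    primary = {}
    replicas = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows: partition, type, domain, lifetime ID, revision, owner.
        if len(parts) == 6 and parts[0].isdigit() and parts[4].isdigit():
            key = (int(parts[0]), parts[3])
            revision = int(parts[4])
            if parts[1] == "Primary":
                primary[key] = revision
            else:
                replicas.setdefault(key, []).append(revision)
    return {
        key: primary[key] - min(replicas[key])
        for key in primary
        if key in replicas
    }


REVISIONS_SAMPLE = """Partition   Type                 Domain          Lifetime ID     Revision   Lifetime Owner
---------   ------------------   -------------   -------------   --------   --------------
        0   Primary              DefaultDomain   1515426524160        668   con1
        0   Primary              DefaultDomain   1515426737236        333   con3
        0   SynchronousReplica   DefaultDomain   1515426524160        668   con1
        0   SynchronousReplica   DefaultDomain   1515426737236        333   con3
        1   Primary              DefaultDomain   1515426590725       1002   con2
        1   SynchronousReplica   DefaultDomain   1515426590725       1002   con2
        2   Primary              DefaultDomain   1515426524160        999   con1
        2   SynchronousReplica   DefaultDomain   1515426524160        999   con1
"""

if __name__ == "__main__":
    for key, gap in sorted(revision_gaps(REVISIONS_SAMPLE).items()):
        print(key, gap)
```

A single non-zero gap is not a concern by itself; as with the outstanding revision counts, it is the gap that stays large or keeps growing across repeated runs that deserves attention.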


5. If using quorum, the quorum status of the environment must be monitored.  This can be done by using xscmd showQuorumStatus command.

 

Example output:

CWXSI0068I: Executing command: showQuorumStatus

Server   Host    Quorum   Quorum Size   Active Servers
------   -----   ------   -----------   --------------
cs1      host1   TRUE               3   cs1, cs2, cs3
cs2      host2   TRUE               3   cs1, cs2, cs3
cs3      host3   TRUE               3   cs1, cs2, cs3

CWXSI0040I: The showQuorumStatus command completed successfully.

 

The showQuorumStatus output should be checked to ensure that the quorum status is TRUE and the quorum size matches the expected size for quorum.  If the quorum status is false, then quorum has been lost and recovery actions should be taken.  Each time quorum is checked, xscmd should connect to each catalog server in turn to ensure that every catalog server's view is the same.  In the example above with three catalog servers, the showQuorumStatus command would be issued with only one catalog server in the -cep argument, and the command would be run three times, rotating which catalog server is passed in the -cep argument.  
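Once the output has been captured from each catalog server, comparing the views can be sketched as below.  The dictionary keys are hypothetical labels for whichever catalog server was passed in the -cep argument for that run.  The sketch treats any quorum value other than TRUE as lost, since only the TRUE form appears in the example above.

```python
def parse_quorum(output):
    """Return {server: (has_quorum, quorum_size, active_servers)}."""
    rows = {}
    for line in output.splitlines():
        # Split into at most five fields so the comma-separated
        # "Active Servers" column stays together.
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[3].isdigit():
            active = tuple(sorted(name.strip() for name in parts[4].split(",")))
            rows[parts[0]] = (parts[2].upper() == "TRUE", int(parts[3]), active)
    return rows


def check_quorum(views, expected_size):
    """`views` maps a catalog server label to the output captured through it."""
    problems = []
    parsed = {label: parse_quorum(text) for label, text in views.items()}
    for label, rows in parsed.items():
        if not rows:
            problems.append(f"{label}: no quorum rows found")
        for server, (has_quorum, size, _active) in rows.items():
            if not has_quorum:
                problems.append(f"{label}: {server} reports quorum lost")
            if size != expected_size:
                problems.append(
                    f"{label}: {server} reports quorum size {size}, "
                    f"expected {expected_size}"
                )
    # Every catalog server should report the same view.
    if len({tuple(sorted(rows.items())) for rows in parsed.values()}) > 1:
        problems.append("catalog servers report different views")
    return problems


QUORUM_SAMPLE = """Server   Host    Quorum   Quorum Size   Active Servers
------   -----   ------   -----------   --------------
cs1      host1   TRUE               3   cs1, cs2, cs3
cs2      host2   TRUE               3   cs1, cs2, cs3
cs3      host3   TRUE               3   cs1, cs2, cs3
"""

if __name__ == "__main__":
    views = {"cs1": QUORUM_SAMPLE, "cs2": QUORUM_SAMPLE, "cs3": QUORUM_SAMPLE}
    print(check_quorum(views, expected_size=3))
```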

 

6. If using Multi Master Replication (MMR), the links between the domains and the replication between the domains should be monitored.   The different domains in an MMR environment may have different views of the link status and MMR replication.  Since the xscmd client can only connect to one domain at a time, the xscmd commands would need to be run on each domain in the MMR configuration to get the complete status of the environment.

 

a. xscmd showLinkedDomains command

 

Example output:

CWXSI0068I: Executing command: showLinkedDomains

*** Retrieving foreign catalog service domains linked to the following catalog service domain: dom1
Foreign catalog service domain
------------------------------
dom2

CWXSI0040I: The showLinkedDomains command completed successfully.

 

The showLinkedDomains output shows the foreign domains that are linked via MMR.  It should be verified that all expected domains are linked.


b. xscmd showLinkedPrimaries -hc command

 

Example output:

CWXSI0068I: Executing command: showLinkedPrimaries

CWXSI0091I: Verifying the primary shards have the correct number of links to foreign primary shards.

*** Displaying results for Grid data grid and mapSet map set. Expected number of online links: 1.

CWXSI0092I: All primary shards for Grid data grid and mapSet map set have the correct number of links to foreign primary shards.

Primary shards reporting links that are online: 2
Primary shards reporting links that are in recovery: 0
Primary shards reporting links that are pending: 0
Primary shards reporting no links: 0

CWXSI0040I: The showLinkedPrimaries command completed successfully.

 

The showLinkedPrimaries -hc command adds an extra step to the showLinkedPrimaries command: the xscmd client checks the link results and writes a plain-English message stating whether any links are in an unexpected state.

 

c. xscmd showLinkedPrimaries

 

Example output:

CWXSI0068I: Executing command: showLinkedPrimaries

*** Displaying results for Grid data grid and mapSet map set.

*** Listing Primary Shards for local domain: dom1,
Container: dom1con1_C-0, Server: dom1con1, Host: host1 ***
Grid Name   Map Set Name   Partition   Domain   Container      Status Message
---------   ------------   ---------   ------   ------------   --------------
Grid        mapSet                 0   dom2     dom2con1_C-1   online
Grid        mapSet                 1   dom2     dom2con1_C-1   online

Primary shards reporting links that are online: 2
Primary shards reporting links that are in recovery: 0
Primary shards reporting links that are pending: 0
Primary shards reporting no links: 0

CWXSI0040I: The showLinkedPrimaries command completed successfully.

 

If the showLinkedPrimaries -hc command indicates problems with the linked primaries then the showLinkedPrimaries command should be run to collect more information.  The showLinkedPrimaries output will display the details of each primary shard's foreign domain links so it can be determined exactly which links are facing an issue.
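Both forms of the showLinkedPrimaries output end with the same four summary counters, so a script can read those first and only look at the per-shard rows when something is off.  The sketch below assumes the wording of the summary lines and the six-column detail rows shown in the examples above.  Any status text other than "online" is reported as-is, since the other status strings are not shown in this post.

```python
import re


def link_summary(output):
    """Return the 'Primary shards reporting ...' counters as a dict."""
    summary = {}
    for line in output.splitlines():
        match = re.match(r"Primary shards reporting (.+): (\d+)$", line.strip())
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


def links_not_online(output):
    """Return (partition, domain, container, status) for rows not online."""
    bad = []
    for line in output.splitlines():
        # Keep the status message together even if it has several words.
        parts = line.split(None, 5)
        if len(parts) == 6 and parts[2].isdigit():
            status = parts[5].strip()
            if status != "online":
                bad.append((int(parts[2]), parts[3], parts[4], status))
    return bad


LINKS_SAMPLE = """*** Displaying results for Grid data grid and mapSet map set.

Grid Name   Map Set Name   Partition   Domain   Container      Status Message
---------   ------------   ---------   ------   ------------   --------------
Grid        mapSet                 0   dom2     dom2con1_C-1   online
Grid        mapSet                 1   dom2     dom2con1_C-1   online

Primary shards reporting links that are online: 2
Primary shards reporting links that are in recovery: 0
Primary shards reporting links that are pending: 0
Primary shards reporting no links: 0
"""

if __name__ == "__main__":
    summary = link_summary(LINKS_SAMPLE)
    unhealthy = {
        name: count
        for name, count in summary.items()
        if name != "links that are online" and count > 0
    }
    print(unhealthy, links_not_online(LINKS_SAMPLE))
```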

 

d. xscmd showDomainReplicationState

 

Example output:

CWXSI0068I: Executing command: showDomainReplicationState

Domain dom1
Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
dom1con1                                0                                0

 

Domain dom2
Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
dom2con1                                0                                0


CWXSI0040I: The showDomainReplicationState command completed successfully.

 

The showDomainReplicationState output displays the number of outstanding revisions for the containers in a foreign domain.  It is quite common to see outstanding revisions as the display shows just a single moment in time.  The number of outstanding revisions will typically be higher on environments with numerous updates and inserts.  Using the showDomainReplicationState output, check for abnormally large numbers of outstanding revisions, meaning the outstanding revision numbers are substantially higher than previously observed normal values.  If an abnormally large number of outstanding revisions is found, then monitor the outstanding revisions over time to ensure there was not a brief burst of heavy activity.  If the revisions are consistently increasing, this could mean there is a problem with the replication between domains.
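One way to make "substantially higher than previously observed normal values" testable is to keep a per-container baseline and flag counts that exceed it by some factor.  The sketch below parses the per-domain sections of the example output above.  The baseline values and the factor of 10 are purely illustrative assumptions; choose thresholds from what you observe in your own environment.

```python
def parse_domain_replication(output):
    """Return {domain: {container: (inbound, outbound)}}."""
    result = {}
    domain = None
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "Domain":
            domain = parts[1]
            result[domain] = {}
        elif (domain and len(parts) == 3
              and parts[1].isdigit() and parts[2].isdigit()):
            result[domain][parts[0]] = (int(parts[1]), int(parts[2]))
    return result


def above_baseline(state, baseline, factor=10):
    """Return (domain, container, count) where count > factor * baseline.

    `baseline` maps container name to a typical outstanding revision count.
    Containers without a baseline are treated as having a baseline of 1.
    """
    flagged = []
    for domain, containers in state.items():
        for name, (inbound, outbound) in containers.items():
            limit = max(baseline.get(name, 0), 1) * factor
            worst = max(inbound, outbound)
            if worst > limit:
                flagged.append((domain, name, worst))
    return flagged


DOMAIN_REPL_SAMPLE = """Domain dom1
Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
dom1con1                                0                                0

Domain dom2
Container   Outstanding Inbound Revisions   Outstanding Outbound Revisions
---------   -----------------------------   ------------------------------
dom2con1                                0                                0
"""

if __name__ == "__main__":
    state = parse_domain_replication(DOMAIN_REPL_SAMPLE)
    print(above_baseline(state, baseline={"dom1con1": 20, "dom2con1": 20}))
```

A flagged container is a prompt to keep sampling, not a diagnosis: only a count that keeps rising across runs points to a replication problem between domains.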

 

7. Lastly, it's recommended to have basic environmental monitoring such as CPU monitoring, network monitoring, and JVM heap/GC monitoring across the environment.

 

 


UID

ibm11080573