IBM Support

In Db2 LUW, FCM wait time is showing high. Is that an FCM problem?

Technical Blog Post


Abstract

In Db2 LUW, FCM wait time is showing high. Is that an FCM problem?

Body

I have come across this question quite a few times:
In Db2 LUW, FCM wait time is showing high. Is that an FCM problem?

I want to clarify a bit more on that question, on top of what this Technote already explains:
/support/pages/node/279243


FCM stands for Fast Communication Manager.
When it is used for communication between two physical hosts, the traffic goes through FCM channels
that use network sockets underneath.
When it is used within the same physical host, for any reason, it uses IPC over shared memory instead.
IPC usually does not have the response-time problems that TCP/IP sockets can have.

Within one physical host, FCM can be used for communication between two logical members/partitions.
However, as the Technote above describes, it is also used when INTRA_PARALLEL is set to ON.
In that case, FCM local channels carry the communication between sub-agents.


Here is an example in which most of the Db2 wait time is being spent in FCM_SEND_WAIT_TIME:

========================================================
  -- Detailed breakdown of TOTAL_WAIT_TIME --

                                %    Total
                                ---  ---------------------------------------------
  TOTAL_WAIT_TIME               100  239995

  I/O wait time
    POOL_READ_TIME              0    293
    POOL_WRITE_TIME             0    0
    DIRECT_READ_TIME            0    51
    DIRECT_WRITE_TIME           0    65
    LOG_DISK_WAIT_TIME          3    7290
  LOCK_WAIT_TIME                0    263
  AGENT_WAIT_TIME               0    0
  Network and FCM
    TCPIP_SEND_WAIT_TIME        0    1668
    TCPIP_RECV_WAIT_TIME        1    4249
    IPC_SEND_WAIT_TIME          0    0
    IPC_RECV_WAIT_TIME          0    0
    FCM_SEND_WAIT_TIME          90   216724
    FCM_RECV_WAIT_TIME          3    8228
  WLM_QUEUE_TIME_TOTAL          0    0
  CF_WAIT_TIME                  0    0
  RECLAIM_WAIT_TIME             0    0
  SMP_RECLAIM_WAIT_TIME         0    0
====================================================
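
The percentage column can be reproduced from the raw totals. The following is a minimal Python sketch, assuming every value is in the same time unit and that each share is truncated to a whole number. The truncation is an assumption inferred from the figures above, not confirmed behavior of the report, and the dictionary and function names are illustrative only.

```python
# Wait times copied from the report above (same unit as TOTAL_WAIT_TIME).
TOTAL_WAIT_TIME = 239995

wait_times = {
    "LOG_DISK_WAIT_TIME": 7290,
    "TCPIP_SEND_WAIT_TIME": 1668,
    "TCPIP_RECV_WAIT_TIME": 4249,
    "FCM_SEND_WAIT_TIME": 216724,
    "FCM_RECV_WAIT_TIME": 8228,
}


def share_of_total(value, total):
    """Return value as a whole-number percentage of total (truncated).

    Truncation is an assumption chosen to match the report's figures.
    """
    if total == 0:
        return 0
    return value * 100 // total


shares = {name: share_of_total(v, TOTAL_WAIT_TIME) for name, v in wait_times.items()}

for name, pct in shares.items():
    print(f"{name:<24}{pct:>4}")
```

The shares are computed against TOTAL_WAIT_TIME only, so they say nothing about how large the wait time is compared to the elapsed time of the workload.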


In this case, INTRA_PARALLEL was set to ON in a single member/partition setup.

As a result, the queries were parallelized using multiple sub-agents.

The different sub-agents of a query communicate with each other through something called a tableQ (table queue).
One part of a query sends a portion of the total work to the sub-agents through a tableQ and waits to hear back from them.
The sub-agents take whatever time they need to complete the assigned work and then send the results back to the
parent (coordinator) agent, which assembles the final result. The parent agent waits until all the sub-agents have reported completion.
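
As a rough analogy for how wait time can accumulate on a healthy channel, here is a small Python sketch. It assumes a bounded in-memory queue as a stand-in for a tableQ and a receiver that is busy with per-row work. This is not how Db2 implements tableQs or FCM, and all names and timings are hypothetical. The point is only that the sender's blocked time is driven by the receiver's workload, not by the channel.

```python
import queue
import threading
import time

# Hypothetical stand-in for a table queue: a small bounded buffer.
# This is an analogy only, not how Db2 implements tableQs or FCM.
table_queue = queue.Queue(maxsize=1)
SENTINEL = None
ROWS = 5
WORK_SECONDS = 0.02  # pretend per-row cost on the receiving side

received = []


def receiver():
    """Consume rows slowly, as a busy receiving sub-agent would."""
    while True:
        row = table_queue.get()
        if row is SENTINEL:
            break
        time.sleep(WORK_SECONDS)  # simulated query work, not channel time
        received.append(row)


def sender():
    """Send rows and return the total time spent blocked in put()."""
    blocked = 0.0
    for row in range(ROWS):
        start = time.perf_counter()
        table_queue.put(row)  # blocks while the buffer is full
        blocked += time.perf_counter() - start
    table_queue.put(SENTINEL)
    return blocked


worker = threading.Thread(target=receiver)
worker.start()
send_wait = sender()
worker.join()

print(f"rows received: {len(received)}")
print(f"time blocked on send: {send_wait:.3f}s")
```

In this sketch, reducing WORK_SECONDS (the analogue of a cheaper access plan) shortens the time blocked on send without any change to the queue itself.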

So, if the FCM wait time increases in a single-member setup, it is not due to an issue in FCM itself. It is an issue at the query level, in how the query uses tableQs.
For certain big queries this is a normal observation. In other cases, the query access plan
can be checked and improved, and that may reduce the FCM wait time as a result.

In summary, wait time that originates below the FCM layer is reported as part of the FCM wait time, and that can confuse users.
If no multi-partition communication across different physical hosts is involved,
then the FCM wait time should be purely due to waits in a layer lower than FCM, and not an
FCM layer issue.

It is also important to understand that when outputs such as monreport.dbsummary(), or many
third-party tools, show this kind of wait time, they treat the total wait time within Db2 as 100%
and then show what percentage each area contributes.
So, in the above example, it is 90% of the total wait time inside Db2.
That is a purely relative figure.
It is possible that only one query was active in the database at that time, that it was using a tableQ,
and that nothing else was running.
Even if the overall database response was fast, those outputs show only the percentage within that
total database wait time.
So the 90% figure is not a wait time relative to the total time on the physical box.
In fact, it is purely a measurement internal to Db2.
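
To make the relative nature of the figure concrete, here is a short Python sketch with two made-up workloads. All numbers are hypothetical and in milliseconds, and the function name and the elapsed measurement window are assumptions used only for illustration.

```python
def wait_shares(fcm_wait, total_wait, elapsed):
    """Return (share of Db2 wait time, share of elapsed time) in percent."""
    return (100 * fcm_wait / total_wait, 100 * fcm_wait / elapsed)


# Hypothetical figures in milliseconds; not taken from any real system.
quiet = wait_shares(fcm_wait=90, total_wait=100, elapsed=60_000)
busy = wait_shares(fcm_wait=54_000, total_wait=60_000, elapsed=60_000)

print(f"quiet: {quiet[0]:.0f}% of wait time, {quiet[1]:.2f}% of elapsed time")
print(f"busy:  {busy[0]:.0f}% of wait time, {busy[1]:.2f}% of elapsed time")
```

Both workloads would show 90% against FCM wait time in a percentage breakdown, yet in the first one the waiting is a negligible part of the interval.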


UID

ibm11139944