IBM Support

Pending Infoset Diagnosis

Troubleshooting


Problem

Infoset stuck in a Pending state

Symptom

Infoset stuck in a pending state

Resolving The Problem

In the UI, select your infoset under Data Workbench. In the URL, highlight and copy the infoset ID number; it will look similar to this: d077ef67-bf8d-4809-8942-51356552121a.

  • Log in to the gateway:
  • ssh root@gateway
  • Diagnose the pending state: how much work is left outstanding?
  • psql -U dfuser -d dfdata
select * from application_schema.infoset_debug_view where infoset_id = 'd077ef67-bf8d-4809-8942-51356552121a' and state <> 'explorable';
infoset_id                             | gw  | ds  | last_ip       | state     | substate   | tagged | partitions | cubed | cubes | last_update
---------------------------------------+-----+-----+---------------+-----------+------------+--------+------------+-------+-------+---------------------
30f38a1b-2a96-46f9-961e-725603d2272c  | 208 | 62  | 10.128.241.31 | available | processing | 1      |  1         | 1     | 2     | 2012-12-05 20:25:18
  • gw => volume_id on the gateway
  • ds => volume_id on the dataserver
  • last_ip => internal ip address of the dataserver
  • state => how far have we gotten for this volume on this dataserver?
  • tagged / partitions => number of this volume's partitions that have finished evaluating the infoset query, out of the total
  • cubed / cubes => number of cubes finished aggregating for this volume, out of the total cubes for this infoset
  • last_update => the last time the gateway got a message from this dataserver about this volume regarding infosets
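The tagged/partitions and cubed/cubes pairs can be read as completion fractions. A minimal sketch of that arithmetic (not part of the product, just a convenience for reading the debug view; the row values are copied from the example output above):

```python
# Turn one row of infoset_debug_view into completion percentages.
# Values are taken from the example row shown above.
row = {"state": "available", "tagged": 1, "partitions": 1, "cubed": 1, "cubes": 2}

def percent(done, total):
    """Completion as a percentage; treat an empty total as already done."""
    return 100.0 if total == 0 else 100.0 * done / total

tagging_pct = percent(row["tagged"], row["partitions"])
cubing_pct = percent(row["cubed"], row["cubes"])
print(f"tagging {tagging_pct:.0f}%, cubing {cubing_pct:.0f}%")
```

For the example row this reports tagging complete but only half of the cubes aggregated, which matches the "available / processing" state.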
  • What dataservers are working on what?
  • Examine the state column
Check this set of state definitions (from queryvolumestate.py):

state           | description                                                                                                | service                           | code
----------------+------------------------------------------------------------------------------------------------------------+-----------------------------------+-----
missing from db | appstack and gateway communication problems                                                                | GatewayAPIService or Platoon      | NA
requested       | created                                                                                                    | GatewayAPIService                 | 0
sending         | found by the query replication service, and ready to be sent                                               | QueryReplicationService           | 10
received        | received and parsed successfully                                                                           | QueryReplicationService           | 20
unharvested     | there is nothing harvested, so no items can be in the infoset                                              | QueryStatusService                | 30
partial         | on the dataserver, some of the volume partitions are finished for this volume                              | QueryStatusService                | 40
available       | on the dataserver, this whole volume is done evaluating members as part, or not part, of the infoset       | QueryStatusService                | 90
explorable      | all the cubes for this infoset are now on the gateway, and explorers and visualizations should be accurate | DistributedCubeReplicationService | 100
interrupted     | this volume was marked as not worth waiting for, and is no longer part of the infoset                      | GatewayAPIService                 | 110
  • Check the right log files for more information

state           | place to look       | log file to look at
----------------+---------------------+---------------------------------------------------------------------------------------------------------
missing from db | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)
requested       | gateway or ds       | /deepfs/config/daqueryrepl.out
sending         | gateway or ds       | /deepfs/config/daqueryrepl.out
received        | gateway or ds       | /deepfs/config/daquerystatus.out
unharvested     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)
partial         | gateway or ds       | /deepfs/config/daquerystatus.out and /deepfs/config/dacuberepl.out
available       | gateway or appstack | /deepfs/config/dacuberepl.out, /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)
explorable      | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)
interrupted     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)
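When scripting checks across many dataservers, the table above can be encoded as a small lookup. A sketch (paths copied from the table; the dictionary and its name are illustrative, not part of the product):

```python
# Map each infoset state to the log files worth checking, per the table above.
GATEWAY_MESH = "/deepfs/config/gateway-mesh.log"
APPSTACK = "/var/siq/log/appstack.log"  # platoon

LOGS_BY_STATE = {
    "missing from db": [GATEWAY_MESH, APPSTACK],
    "requested":   ["/deepfs/config/daqueryrepl.out"],
    "sending":     ["/deepfs/config/daqueryrepl.out"],
    "received":    ["/deepfs/config/daquerystatus.out"],
    "unharvested": [GATEWAY_MESH, APPSTACK],
    "partial":     ["/deepfs/config/daquerystatus.out", "/deepfs/config/dacuberepl.out"],
    "available":   ["/deepfs/config/dacuberepl.out", GATEWAY_MESH, APPSTACK],
    "explorable":  [GATEWAY_MESH, APPSTACK],
    "interrupted": [GATEWAY_MESH, APPSTACK],
}

print(LOGS_BY_STATE["partial"])
```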
  • Requested status diagnosis
  • If the volumes are requested on the gateway, check to see if the infosets ever made it to the dataserver.
  • Check for exceptions in the gateway's /deepfs/config/daqueryrepl.out and in the same file on the dataserver.
  • Check for exceptions in the gateway and dataserver's /deepfs/config/siqtransport.out logs
  • See if there is a query in the database on a dataserver that is still requested:
  • select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
  • If it's there on the dataserver, there's a communication problem going "back" to the gateway
  • If it isn't on the dataserver, there is a communication problem between the gateway and the dataserver.
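Checking both logs on both machines is repetitive by hand. A hedged sketch of the scan (the script and its pattern are illustrative, not a shipped tool; run it on the gateway and on the dataserver):

```python
# Scan the replication/transport logs named above for exception lines.
import os
import re

LOGS = ["/deepfs/config/daqueryrepl.out", "/deepfs/config/siqtransport.out"]

def find_exceptions(paths):
    """Return (path, line number, line) for lines that look like errors."""
    hits = []
    for path in paths:
        if not os.path.exists(path):
            continue  # e.g. this service does not run on this box
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if re.search(r"Traceback|Exception|ERROR", line):
                    hits.append((path, lineno, line.rstrip()))
    return hits

for path, lineno, line in find_exceptions(LOGS):
    print(f"{path}:{lineno}: {line}")
```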
  • Received, Partial status diagnosis:
  • In this example, the gateway reports that most work is done, except for one volume on a dataserver.
    dfdata=# select * from application_schema.infoset_debug_view where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and state <> 'explorable'; 
    infoset_id | gw | ds | last_ip | state | substate | tagged | partitions | cubed | cubes | last_update 
    -------------------------------------------------------------------------------------------------------------------- 
    13009c04-1448-4f63-a2ec-7e6c8699b7a7 | 6 | 4 | 10.125.4.131 | partial | processing | 0 | 3 | 0 | 2 | 2013-07-01 14:09:38 
    (1 row)
  • In this case, we can go to the dataserver and figure out if it has finished tagging or not
  • Note that the infoset_id here came from the query of infoset_debug_view; the volume_id to use on the dataserver is the "ds" column of that view, and the dataserver to investigate is in the "last_ip" column.
  • If tagging is done, this example query should return a count of one:
  • dfdata=# select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7');
    count
    -------
    1
  • In this example, the error is in communicating the updated state from the dataserver to the gateway.
  • The relevant logs for this state would be:
  • On the gateway: 
  • /deepfs/config/daquerystatus.out
  • /deepfs/config/siqtransport.out
  • on the dataserver: 
  • /deepfs/config/daquerystatus.out
  • /deepfs/config/siqtransport.out
 
  • Available status diagnosis
  • First, using the infoset_debug_view on the gateway, find a dataserver that is taking a suspiciously long time.
  • Next, let's figure out how many volumes we are waiting for:
  • select count(ds) from application_schema.infoset_debug_view where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' and state <> 'explorable' and last_ip = '9.152.157.29' group by ds;
    count
    -------
    210
  • Go there, and see if the dataserver believes it is finished
  • dfdata=# select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
    -[ RECORD 1 ]------------------------+-------------------------------------------------------------------------------------
    object_class_id | 55
    object_class_revision_id | 1
    object_class_name | Infoset-0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
    object_class_description | Infoset 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7 immutable
    time_created | 2014-03-07 09:52:02.036176+00
    time_modified | 2014-03-07 10:28:57.83661+00
    user_tag_attribute_id |
    user_tag_value |
    appliance_guid | d73fe9de84554d66b620b1431380d73f
    appliance_object_class_id | 39
    auto_sync | f
    auto_gen_explorers | f
    group_id | 10
    query_expression | (att: 1>=0 bytes IN system) AND (att: 2 BETWEEN 2000-01-01 and 2002-12-31 IN system)
    preservation_id |
    production_id |
    application_type | subject
    object_class_replication_revision_id | 2
    expansion | 271
    infoset_id | 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
  • If either auto_sync or auto_gen_explorers is true, it still believes it has work to do.
  • If so, check tagging on the dataserver:
  • select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = 'd6a92c86-02ac-4783-bf4d-1da32ad7374c');
  • If tagging is not done, but there is no CPU, disk, or database activity, check the querycacher on that dataserver.
  • This is what it looks like when it's got some work to do.

# python32 /usr/lib/python2.4/site-packages/deepfile/volumecluster/querycacher_ng/querycacheprocess.pyc

Choose one:
b) (b)lock volume
r) (r)elease volume
c) query (c)hanged
p) (p)rint current activity
t) print (t)ask queue
h) (h)eartbeat
a) (a)bort query caching
q) (q)uit

-> p

  • If the querycacher has work to do, but it's not doing anything, try restarting just the querycacher and see if that works around the problem.
  • /usr/bin/monit -c /etc/deepfile/monitrc restart QueryCacher
  • If the tagging is finished, see what volumes are not done creating cubes...
  • dfdata=# select * from (select count(category) as sum, volume_id from data_schema.cube_creation_status where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' group by volume_id order by volume_id) as foo where sum < 2;
  • If the query on the dataserver returned no rows, the infoset is most likely still calculating on that dataserver.
  • Or see what cubes are created for that dataserver
  • dfdata=# select * from data_schema.cube_creation_status where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and volume_id = 4;
    category | sub_category | last_created | volume_id | partition_id | infoset_id
    ----------+--------------+-------------------------------+-----------+--------------+--------------------------------------
    2 | | 2013-06-27 03:39:08.543872+00 | 4 | 1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
    0 | | 2013-06-27 03:39:08.589357+00 | 4 | 1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
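The "sum < 2" SQL above boils down to counting category rows per volume. A small sketch of that check in Python (the sample rows, including the extra volume 7, are illustrative, not from a real appliance):

```python
# Mirror of the "which volumes are not done creating cubes" SQL above:
# a volume is finished when it has 2 category rows in cube_creation_status.
from collections import Counter

# (category, volume_id) pairs as a query like the one above might return.
# Volume 4 has both categories; hypothetical volume 7 has only one.
rows = [(2, 4), (0, 4), (2, 7)]

EXPECTED_CATEGORIES = 2

def unfinished_volumes(rows, expected=EXPECTED_CATEGORIES):
    """Return volume_ids that still owe cube categories."""
    counts = Counter(volume_id for _category, volume_id in rows)
    return sorted(v for v, n in counts.items() if n < expected)

print(unfinished_volumes(rows))
```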
  • If all the cubes that should be generated appear to exist on the dataserver, and all expected tagging is done, go back to the gateway and ping the cube replication service to see what it thinks it is doing:

# /usr/bin/python /usr/lib/python2.4/site-packages/siqtransport/bootstrap/launchservice.pyc cubereplstatus siqplatform.cuberepl.gateway.status.CubeReplStatus
[2014-03-20 17:29:00 INFO MainThread][service-context] connecting to broker using TCP on 127.0.0.1:11101
[2014-03-20 17:29:00 INFO MainThread][service-launcher] launching service "siqplatform.cuberepl.gateway.status.CubeReplStatus" with broker host: 127.0.0.1 port: 11101
[2014-03-20 17:29:00 INFO MainThread][service-context] connected to message broker
[2014-03-20 17:29:00 INFO MainThread][service-context] registering anonymous service with broker
[2014-03-20 17:29:00 INFO MainThread][service-context] service registered as host ID 0x00001000 service ID 0x0000000C
[Thu Mar 20 17:29:00 2014] Registered...
[2014-03-20 17:29:00 INFO Thread-1][factory] _worker_thread: >

9.152.157.29
1009a8443671434d9cb580119df54557 | state=CubeStatus | session=6993530b-ffe4-4bc8-8f94-bc2cb9497ce3 | last_time=1395332938 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.55
71fef0dca6aa457aa4e43f66f070730b | state=CubeStatus | session=b0d6eca2-a46c-4fdd-9ead-e64b6f177c11 | last_time=1395332917 | dirty=False | cubestatus=139 | pending=0 | last_err=
9.152.157.25
99a041a9740f449892e9e8ac5149d534 | state=CubeStatus | session=d2323df0-ee6f-46e2-8b24-0d58b01e615c | last_time=1395332919 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.34
b18132add0df424b961d013736351bd0 | state=CubeStatus | session=ab1e2be3-542c-4973-a5ac-43b27dc9b31d | last_time=1395332924 | dirty=False | cubestatus=119 | pending=0 | last_err=
9.152.157.28
e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus | session=6950ac4b-03cb-42e9-8562-913823fda145 | last_time=1395304867 | dirty=False | cubestatus=19 | pending=0 | last_err=
7fc0e30ecb3a40f29a00845dc803afb6 | state=idle | session= | last_time=1395331335 | dirty=False | cubestatus=0 | pending=0 | last_err=
531a52d21a0d47fe85ebe60eb0ff994e | state=idle | session= | last_time=1395331422 | dirty=False | cubestatus=0 | pending=0 | last_err=
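In output like the above, a low cubestatus combined with an old last_time (a Unix timestamp) is the dataserver to chase. A sketch of parsing one status line and flagging staleness (the parser and the one-hour threshold are illustrative assumptions, not part of CubeReplStatus; the sample line and timestamps are copied from the output above):

```python
# Parse a CubeReplStatus line ("guid | key=value | key=value | ...")
# and flag sessions whose last_time has gone stale.
def parse_status(line):
    guid, _, rest = line.partition(" | ")
    fields = dict(kv.split("=", 1) for kv in rest.split(" | "))
    fields["guid"] = guid.strip()
    return fields

line = ("e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus"
        " | session=6950ac4b-03cb-42e9-8562-913823fda145"
        " | last_time=1395304867 | dirty=False | cubestatus=19"
        " | pending=0 | last_err=")

now = 1395332938  # the newest last_time in the sample output, for comparison
status = parse_status(line)
age = now - int(status["last_time"])  # seconds since this session reported in
print(status["guid"], "stale" if age > 3600 else "fresh", f"({age}s old)")
```

Here the 9.152.157.28 session is nearly eight hours behind its peers and has cubestatus=19 versus ~139-159 elsewhere, which is exactly the pattern worth investigating.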

Product: StoredIQ (SSSHEC) | Business Unit: Cloud & Data Platform (BU053) | Component: Not Applicable | Platform: Platform Independent | Version: 7.5.1; 7.5.0.2; 7.5.0.1; 7.5 | Edition: Advanced | Line of Business: Automation (LOB45)

Document Information

Modified date:
17 December 2020

UID

swg21687445