Troubleshooting
Problem
Infoset stuck in a Pending state
Symptom
Infoset stuck in a pending state
Resolving The Problem
In the UI, select your infoset under Data Workbench. In the URL, highlight and copy the infoset ID; it will look similar to this: d077ef67-bf8d-4809-8942-51356552121a.
- Log in to the gateway:
- ssh root@gateway
- Diagnose the pending state
- How much work is left outstanding?
- psql -U dfuser -d dfdata
- dfdata=# select * from application_schema.infoset_debug_view where infoset_id = '30f38a1b-2a96-46f9-961e-725603d2272c';
infoset_id | gw | ds | last_ip | state | substate | tagged | partitions | cubed | cubes | last_update
---------------------------------------+-----+-----+---------------+-----------+------------+--------+------------+-------+-------+---------------------
30f38a1b-2a96-46f9-961e-725603d2272c | 208 | 62 | 10.128.241.31 | available | processing | 1 | 1 | 1 | 2 | 2012-12-05 20:25:18
- gw => volume_id on the gateway
- ds => volume_id on the dataserver
- last_ip => internal IP address of the dataserver
- state => how far this volume has progressed on this dataserver
- tagged / partitions => how many of this volume's partitions have finished evaluating the infoset query, out of the total
- cubed / cubes => how many of this volume's cubes have finished aggregating, out of the total for this infoset
- last_update => the last time the gateway received an infoset-related message from this dataserver about this volume
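The tagged/partitions and cubed/cubes pairs above are counts that can be read as completion ratios. As an illustration only (plain Python, not a product tool), using the example row above:

```python
# Illustrative sketch: interpret the progress columns from one
# infoset_debug_view row (values taken from the example output above).
def progress(row):
    """Return (tagging, cubing) completion as fractions between 0.0 and 1.0."""
    tagging = row["tagged"] / row["partitions"] if row["partitions"] else 0.0
    cubing = row["cubed"] / row["cubes"] if row["cubes"] else 0.0
    return tagging, cubing

row = {"tagged": 1, "partitions": 1, "cubed": 1, "cubes": 2}
tagging, cubing = progress(row)
print(f"tagging {tagging:.0%}, cubing {cubing:.0%}")  # tagging 100%, cubing 50%
```

In the example row, tagging is complete (1 of 1 partitions) while only 1 of 2 cubes has been aggregated, which is why the state is not yet explorable.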
- What dataservers are working on what?
- Examine the state column
| state           | description                                                                                                | service                           | code |
|-----------------|------------------------------------------------------------------------------------------------------------|-----------------------------------|------|
| missing from db | appstack and gateway communication problems                                                                | GatewayAPIService or Platoon      | NA   |
| requested       | created                                                                                                    | GatewayAPIService                 | 0    |
| sending         | found by the query replication service, and ready to be sent                                               | QueryReplicationService           | 10   |
| received        | received and parsed successfully                                                                           | QueryReplicationService           | 20   |
| unharvested     | nothing has been harvested, so no items can be in the infoset                                              | QueryStatusService                | 30   |
| partial         | on the dataserver, some of the volume partitions are finished for this volume                              | QueryStatusService                | 40   |
| available       | on the dataserver, this whole volume is done evaluating members as part, or not part, of the infoset       | QueryStatusService                | 90   |
| explorable      | all the cubes for this infoset are now on the gateway, and explorers and visualizations should be accurate | DistributedCubeReplicationService | 100  |
| interrupted     | this volume was marked as not worth waiting for, and is no longer part of the infoset                      | GatewayAPIService                 | 110  |
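Since the codes in the table increase as a volume moves through the pipeline, they can be compared numerically to see how far a stuck volume got. A minimal sketch (the codes come from the table above; the helper itself is illustrative, not a product API):

```python
# State codes from the table above; a higher code means further along.
STATE_CODES = {
    "requested": 0, "sending": 10, "received": 20, "unharvested": 30,
    "partial": 40, "available": 90, "explorable": 100, "interrupted": 110,
}

def stuck_before(state, target="explorable"):
    """True if `state` has not yet reached `target`.
    Note: "interrupted" (110) is terminal, not a later stage of progress."""
    return STATE_CODES[state] < STATE_CODES[target]

print(stuck_before("partial"))     # True
print(stuck_before("explorable"))  # False
```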
- Check the right log files for more information
| state           | place to look       | log file to look at                                                                                   |
|-----------------|---------------------|-------------------------------------------------------------------------------------------------------|
| missing from db | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| requested       | gateway or ds       | /deepfs/config/daqueryrepl.out                                                                        |
| sending         | gateway or ds       | /deepfs/config/daqueryrepl.out                                                                        |
| received        | gateway or ds       | /deepfs/config/daquerystatus.out                                                                      |
| unharvested     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| partial         | gateway or ds       | /deepfs/config/daquerystatus.out and /deepfs/config/dacuberepl.out                                    |
| available       | gateway or appstack | /deepfs/config/dacuberepl.out, /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon) |
| explorable      | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| interrupted     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
- Requested status diagnosis
- If the volumes are requested on the gateway, check to see if the infosets ever made it to the dataserver.
- Check for exceptions in the gateway's /deepfs/config/daqueryrepl.out, and in the same log on the dataserver.
- Check for exceptions in the gateway and dataserver's /deepfs/config/siqtransport.out logs
- See if there is a query in the database on a dataserver that is still requested:
- select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
- If it's there on the dataserver, there's a communication problem going "back" to the gateway
- If it isn't on the dataserver, there is a communication problem between the gateway and the dataserver.
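The two outcomes above amount to a simple decision; sketched here only to make the logic explicit (not a product tool):

```python
# Illustrative decision helper for the "requested" state, per the two
# cases described above.
def diagnose_requested(query_present_on_dataserver):
    if query_present_on_dataserver:
        # The dataserver got the query, so the reply path is suspect.
        return "communication problem going back to the gateway"
    # The query never arrived on the dataserver.
    return "communication problem between the gateway and the dataserver"

print(diagnose_requested(True))
print(diagnose_requested(False))
```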
- Received, Partial status diagnosis:
- In this example, the gateway reports that most work is done, except for one volume on a dataserver.
dfdata=# select * from application_schema.infoset_debug_view where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and state <> 'explorable';
infoset_id | gw | ds | last_ip | state | substate | tagged | partitions | cubed | cubes | last_update
--------------------------------------+----+----+--------------+---------+------------+--------+------------+-------+-------+---------------------
 13009c04-1448-4f63-a2ec-7e6c8699b7a7 |  6 |  4 | 10.125.4.131 | partial | processing |      0 |          3 |     0 |     2 | 2013-07-01 14:09:38
(1 row)
- In this case, we can go to the dataserver and determine whether it has finished tagging.
- Note: the infoset_id here comes from the infoset_debug_view query; the volume_id to use on the dataserver is the "ds" column of that view, and the dataserver to investigate is the one in the "last_ip" column.
- If tagging is done, this example query should return a count of one:
- dfdata=# select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7');
count
-------
1
- In this example, the error is in communicating the updated state from the dataserver to the gateway.
- The relevant logs for this state would be:
- On the gateway:
- /deepfs/config/daquerystatus.out
- /deepfs/config/siqtransport.out
- on the dataserver:
- /deepfs/config/daquerystatus.out
- /deepfs/config/siqtransport.out
- Available status diagnosis
- First, using the infoset_debug_view on the gateway, find a dataserver that is taking a suspiciously long time.
- Next, let's figure out how many volumes we are waiting for:
- select count(ds) from application_schema.infoset_debug_view where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' and state <> 'explorable' and last_ip = '9.152.157.29' group by ds;
 count
-------
   210
- Go there, and see if the dataserver believes it is finished:
- dfdata=# select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
-[ RECORD 1 ]------------------------+-------------------------------------------------------------------------------------
object_class_id | 55
object_class_revision_id | 1
object_class_name | Infoset-0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
object_class_description | Infoset 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7 immutable
time_created | 2014-03-07 09:52:02.036176+00
time_modified | 2014-03-07 10:28:57.83661+00
user_tag_attribute_id |
user_tag_value |
appliance_guid | d73fe9de84554d66b620b1431380d73f
appliance_object_class_id | 39
auto_sync | f
auto_gen_explorers | f
group_id | 10
query_expression | (att: 1>=0 bytes IN system) AND (att: 2 BETWEEN 2000-01-01 and 2002-12-31 IN system)
preservation_id |
production_id |
application_type | subject
object_class_replication_revision_id | 2
expansion | 271
infoset_id                           | 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
- If either auto_sync or auto_gen_explorers is true, it still believes it has work to do.
- If so, check tagging on the dataserver:
- select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = 'd6a92c86-02ac-4783-bf4d-1da32ad7374c');
- If tagging is not done, but there is no CPU, disk, or database activity, check the querycacher on that dataserver.
- This is what it looks like when it has some work to do:
# python32 /usr/lib/python2.4/site-packages/deepfile/volumecluster/querycacher_ng/querycacheprocess.pyc
Choose one:
b) (b)lock volume
r) (r)elease volume
c) query (c)hanged
p) (p)rint current activity
t) print (t)ask queue
h) (h)eartbeat
a) (a)bort query caching
q) (q)uit
-> p
- If the query cacher has work to do, but it's not doing anything, try restarting just the querycacher, and see if that "works around" the problem:
- /usr/bin/monit -c /etc/deepfile/monitrc restart QueryCacher
- If the tagging is finished, see what volumes are not done creating cubes...
- dfdata=# select * from (select count(category) as sum, volume_id from data_schema.cube_creation_status where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' group by volume_id order by volume_id) as foo where sum < 2;
- If the query performed on the dataserver returned no rows, the infoset is most likely still calculating on the dataserver.
- Or see what cubes are created for that dataserver
- dfdata=# select * from data_schema.cube_creation_status where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and volume_id = 4;
category | sub_category | last_created | volume_id | partition_id | infoset_id
----------+--------------+-------------------------------+-----------+--------------+--------------------------------------
2 | | 2013-06-27 03:39:08.543872+00 | 4 | 1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
        0 |              | 2013-06-27 03:39:08.589357+00 |         4 |            1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
- If all the cubes that should be generated seem to be generated on the dataserver, and the tagging that is supposed to be done is done on the dataserver, go back to the gateway and ping the cube replication service to get an idea of what it thinks it is doing:
# /usr/bin/python /usr/lib/python2.4/site-packages/siqtransport/bootstrap/launchservice.pyc cubereplstatus siqplatform.cuberepl.gateway.status.CubeReplStatus
[2014-03-20 17:29:00 INFO MainThread][service-context] connecting to broker using TCP on 127.0.0.1:11101
[2014-03-20 17:29:00 INFO MainThread][service-launcher] launching service "siqplatform.cuberepl.gateway.status.CubeReplStatus" with broker host: 127.0.0.1 port: 11101
[2014-03-20 17:29:00 INFO MainThread][service-context] connected to message broker
[2014-03-20 17:29:00 INFO MainThread][service-context] registering anonymous service with broker
[2014-03-20 17:29:00 INFO MainThread][service-context] service registered as host ID 0x00001000 service ID 0x0000000C
[Thu Mar 20 17:29:00 2014] Registered...
[2014-03-20 17:29:00 INFO Thread-1][factory] _worker_thread:
9.152.157.29
1009a8443671434d9cb580119df54557 | state=CubeStatus | session=6993530b-ffe4-4bc8-8f94-bc2cb9497ce3 | last_time=1395332938 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.55
71fef0dca6aa457aa4e43f66f070730b | state=CubeStatus | session=b0d6eca2-a46c-4fdd-9ead-e64b6f177c11 | last_time=1395332917 | dirty=False | cubestatus=139 | pending=0 | last_err=
9.152.157.25
99a041a9740f449892e9e8ac5149d534 | state=CubeStatus | session=d2323df0-ee6f-46e2-8b24-0d58b01e615c | last_time=1395332919 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.34
b18132add0df424b961d013736351bd0 | state=CubeStatus | session=ab1e2be3-542c-4973-a5ac-43b27dc9b31d | last_time=1395332924 | dirty=False | cubestatus=119 | pending=0 | last_err=
9.152.157.28
e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus | session=6950ac4b-03cb-42e9-8562-913823fda145 | last_time=1395304867 | dirty=False | cubestatus=19 | pending=0 | last_err=
7fc0e30ecb3a40f29a00845dc803afb6 | state=idle | session= | last_time=1395331335 | dirty=False | cubestatus=0 | pending=0 | last_err=
531a52d21a0d47fe85ebe60eb0ff994e | state=idle | session= | last_time=1395331422 | dirty=False | cubestatus=0 | pending=0 | last_err=
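Note that last_time in this output is a Unix epoch; in the example, 9.152.157.28 reported last_time=1395304867, hours older than its peers, which is the kind of staleness to look for. A sketch for parsing these lines offline (plain Python, not part of the product):

```python
from datetime import datetime, timezone

def parse_status(line):
    """Parse one 'guid | key=value | key=value | ...' line from the
    CubeReplStatus output into a dict."""
    guid, _, rest = line.partition(" | ")
    fields = dict(part.split("=", 1) for part in rest.split(" | "))
    fields["guid"] = guid.strip()
    # last_time is a Unix epoch; convert it so stale entries stand out.
    fields["last_time"] = datetime.fromtimestamp(
        int(fields["last_time"]), tz=timezone.utc)
    return fields

line = ("e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus | "
        "session=6950ac4b-03cb-42e9-8562-913823fda145 | last_time=1395304867 | "
        "dirty=False | cubestatus=19 | pending=0 | last_err=")
f = parse_status(line)
print(f["state"], f["cubestatus"], f["last_time"].isoformat())
```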
Document Information
Modified date:
17 December 2020
UID
swg21687445