Troubleshooting
Problem
Infoset stuck in a Pending state
Symptom
Infoset stuck in a pending state
Resolving The Problem
In the UI, select your infoset under Data Workbench. In the URL, highlight and copy the infoset ID; it will look similar to this: d077ef67-bf8d-4809-8942-51356552121a.
- Log in to the gateway:
- ssh root@gateway
- Diagnose the pending state
- How much work is left outstanding?
- psql -U dfuser -d dfdata
- dfdata=# select * from application_schema.infoset_debug_view where infoset_id = '30f38a1b-2a96-46f9-961e-725603d2272c';
infoset_id | gw | ds | last_ip | state | substate | tagged | partitions | cubed | cubes | last_update
---------------------------------------+-----+-----+---------------+-----------+------------+--------+------------+-------+-------+---------------------
30f38a1b-2a96-46f9-961e-725603d2272c | 208 | 62 | 10.128.241.31 | available | processing | 1 | 1 | 1 | 2 | 2012-12-05 20:25:18
- gw => volume_id on the gateway
- ds => volume_id on the dataserver
- last_ip => internal IP address of the dataserver
- state => how far this volume has progressed on this dataserver
- tagged / partitions => how many of this volume's partitions have finished evaluating the infoset query, out of the total
- cubed / cubes => how many of this volume's cubes have finished aggregating, out of the total for this infoset
- last_update => the last time the gateway received an infoset-related message from this dataserver about this volume
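The tagged/partitions and cubed/cubes pairs above are counts that can be read as completion ratios. As an illustration only (plain Python, not a product tool), using the example row above:

```python
# Illustrative sketch: interpret the progress columns from one
# infoset_debug_view row (values taken from the example output above).
def progress(row):
    """Return (tagging, cubing) completion as fractions between 0.0 and 1.0."""
    tagging = row["tagged"] / row["partitions"] if row["partitions"] else 0.0
    cubing = row["cubed"] / row["cubes"] if row["cubes"] else 0.0
    return tagging, cubing

row = {"tagged": 1, "partitions": 1, "cubed": 1, "cubes": 2}
tagging, cubing = progress(row)
print(f"tagging {tagging:.0%}, cubing {cubing:.0%}")  # tagging 100%, cubing 50%
```

In the example row, tagging is complete (1 of 1 partitions) while only 1 of 2 cubes has been aggregated, which is why the state is not yet explorable.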
- What dataservers are working on what?
- Examine the state column
| state           | description                                                                                                | service                           | code |
|-----------------|------------------------------------------------------------------------------------------------------------|-----------------------------------|------|
| missing from db | appstack and gateway communication problems                                                                | GatewayAPIService or Platoon      | NA   |
| requested       | created                                                                                                    | GatewayAPIService                 | 0    |
| sending         | found by the query replication service, and ready to be sent                                               | QueryReplicationService           | 10   |
| received        | received and parsed successfully                                                                           | QueryReplicationService           | 20   |
| unharvested     | nothing has been harvested, so no items can be in the infoset                                              | QueryStatusService                | 30   |
| partial         | on the dataserver, some of the volume partitions are finished for this volume                              | QueryStatusService                | 40   |
| available       | on the dataserver, this whole volume is done evaluating members as part, or not part, of the infoset       | QueryStatusService                | 90   |
| explorable      | all the cubes for this infoset are now on the gateway, and explorers and visualizations should be accurate | DistributedCubeReplicationService | 100  |
| interrupted     | this volume was marked as not worth waiting for, and is no longer part of the infoset                      | GatewayAPIService                 | 110  |
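Since the codes in the table increase as a volume moves through the pipeline, they can be compared numerically to see how far a stuck volume got. A minimal sketch (the codes come from the table above; the helper itself is illustrative, not a product API):

```python
# State codes from the table above; a higher code means further along.
STATE_CODES = {
    "requested": 0, "sending": 10, "received": 20, "unharvested": 30,
    "partial": 40, "available": 90, "explorable": 100, "interrupted": 110,
}

def stuck_before(state, target="explorable"):
    """True if `state` has not yet reached `target`.
    Note: "interrupted" (110) is terminal, not a later stage of progress."""
    return STATE_CODES[state] < STATE_CODES[target]

print(stuck_before("partial"))     # True
print(stuck_before("explorable"))  # False
```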
- Check the right log files for more information
| state           | place to look       | log file to look at                                                                                   |
|-----------------|---------------------|-------------------------------------------------------------------------------------------------------|
| missing from db | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| requested       | gateway or ds       | /deepfs/config/daqueryrepl.out                                                                        |
| sending         | gateway or ds       | /deepfs/config/daqueryrepl.out                                                                        |
| received        | gateway or ds       | /deepfs/config/daquerystatus.out                                                                      |
| unharvested     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| partial         | gateway or ds       | /deepfs/config/daquerystatus.out and /deepfs/config/dacuberepl.out                                    |
| available       | gateway or appstack | /deepfs/config/dacuberepl.out, /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon) |
| explorable      | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
| interrupted     | gateway or appstack | /deepfs/config/gateway-mesh.log or /var/siq/log/appstack.log (platoon)                                |
- Requested status diagnosis
- If the volumes are requested on the gateway, check to see if the infosets ever made it to the dataserver.
- Check for exceptions in the gateway's /deepfs/config/daqueryrepl.out, and in the same log on the dataserver.
- Check for exceptions in the gateway and dataserver's /deepfs/config/siqtransport.out logs
- See if there is a query in the database on a dataserver that is still requested:
- select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
- If it's there on the dataserver, there's a communication problem going "back" to the gateway
- If it isn't on the dataserver, there is a communication problem between the gateway and the dataserver.
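The two outcomes above amount to a simple decision; sketched here only to make the logic explicit (not a product tool):

```python
# Illustrative decision helper for the "requested" state, per the two
# cases described above.
def diagnose_requested(query_present_on_dataserver):
    if query_present_on_dataserver:
        # The dataserver got the query, so the reply path is suspect.
        return "communication problem going back to the gateway"
    # The query never arrived on the dataserver.
    return "communication problem between the gateway and the dataserver"

print(diagnose_requested(True))
print(diagnose_requested(False))
```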
- Received, Partial status diagnosis:
- In this example, the gateway reports that most work is done, except for one volume on a dataserver.
dfdata=# select * from application_schema.infoset_debug_view where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and state <> 'explorable';
infoset_id | gw | ds | last_ip | state | substate | tagged | partitions | cubed | cubes | last_update
--------------------------------------+----+----+--------------+---------+------------+--------+------------+-------+-------+---------------------
 13009c04-1448-4f63-a2ec-7e6c8699b7a7 |  6 |  4 | 10.125.4.131 | partial | processing |      0 |          3 |     0 |     2 | 2013-07-01 14:09:38
(1 row)
- In this case, we can go to the dataserver and determine whether it has finished tagging.
- Note: the infoset_id here comes from the infoset_debug_view query; the volume_id to use on the dataserver is the "ds" column of that view, and the dataserver to investigate is the one in the "last_ip" column.
- If tagging is done, this example query should return a count of one:
- dfdata=# select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7');
count
-------
1
- In this example, the error is in communicating the updated state from the dataserver to the gateway.
- The relevant logs for this state would be:
- On the gateway:
- /deepfs/config/daquerystatus.out
- /deepfs/config/siqtransport.out
- on the dataserver:
- /deepfs/config/daquerystatus.out
- /deepfs/config/siqtransport.out
- Available status diagnosis
- First, using the infoset_debug_view on the gateway, find a dataserver that is taking a suspiciously long time.
- Next, let's figure out how many volumes we are waiting for:
- select count(ds) from application_schema.infoset_debug_view where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' and state <> 'explorable' and last_ip = '9.152.157.29' group by ds;
 count
-------
   210
- Go there, and see if the dataserver believes it is finished:
- dfdata=# select * from application_schema.object_classes where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7';
-[ RECORD 1 ]------------------------+-------------------------------------------------------------------------------------
object_class_id | 55
object_class_revision_id | 1
object_class_name | Infoset-0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
object_class_description | Infoset 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7 immutable
time_created | 2014-03-07 09:52:02.036176+00
time_modified | 2014-03-07 10:28:57.83661+00
user_tag_attribute_id |
user_tag_value |
appliance_guid | d73fe9de84554d66b620b1431380d73f
appliance_object_class_id | 39
auto_sync | f
auto_gen_explorers | f
group_id | 10
query_expression | (att: 1>=0 bytes IN system) AND (att: 2 BETWEEN 2000-01-01 and 2002-12-31 IN system)
preservation_id |
production_id |
application_type | subject
object_class_replication_revision_id | 2
expansion | 271
infoset_id                           | 0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7
- If either auto_sync or auto_gen_explorers is true, it still believes it has work to do.
- If so, check tagging on the dataserver:
- select count(volume_id) from data_schema.volumes_classified where object_class_id in (select object_class_id from application_schema.object_classes where infoset_id = 'd6a92c86-02ac-4783-bf4d-1da32ad7374c');
- If tagging is not done, but there is no CPU, disk, or database activity, check the querycacher on that dataserver.
- This is what it looks like when it has some work to do:
# python32 /usr/lib/python2.4/site-packages/deepfile/volumecluster/querycacher_ng/querycacheprocess.pyc
Choose one:
b) (b)lock volume
r) (r)elease volume
c) query (c)hanged
p) (p)rint current activity
t) print (t)ask queue
h) (h)eartbeat
a) (a)bort query caching
q) (q)uit
-> p
- If the query cacher has work to do, but it's not doing anything, try restarting just the querycacher, and see if that "works around" the problem:
- /usr/bin/monit -c /etc/deepfile/monitrc restart QueryCacher
- If the tagging is finished, see what volumes are not done creating cubes...
- dfdata=# select * from (select count(category) as sum, volume_id from data_schema.cube_creation_status where infoset_id = '0a9aa08a-83d5-4b8b-85a1-ba60fcaa5ee7' group by volume_id order by volume_id) as foo where sum < 2;
- If the query performed on the dataserver returned no rows, the infoset is most likely still calculating on the dataserver.
- Or see what cubes are created for that dataserver
- dfdata=# select * from data_schema.cube_creation_status where infoset_id = '13009c04-1448-4f63-a2ec-7e6c8699b7a7' and volume_id = 4;
category | sub_category | last_created | volume_id | partition_id | infoset_id
----------+--------------+-------------------------------+-----------+--------------+--------------------------------------
2 | | 2013-06-27 03:39:08.543872+00 | 4 | 1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
        0 |              | 2013-06-27 03:39:08.589357+00 |         4 |            1 | 13009c04-1448-4f63-a2ec-7e6c8699b7a7
- If all the cubes that should be generated seem to be generated on the dataserver, and the tagging that is supposed to be done is done on the dataserver, go back to the gateway and ping the cube replication service to get an idea of what it thinks it is doing:
# /usr/bin/python /usr/lib/python2.4/site-packages/siqtransport/bootstrap/launchservice.pyc cubereplstatus siqplatform.cuberepl.gateway.status.CubeReplStatus
[2014-03-20 17:29:00 INFO MainThread][service-context] connecting to broker using TCP on 127.0.0.1:11101
[2014-03-20 17:29:00 INFO MainThread][service-launcher] launching service "siqplatform.cuberepl.gateway.status.CubeReplStatus" with broker host: 127.0.0.1 port: 11101
[2014-03-20 17:29:00 INFO MainThread][service-context] connected to message broker
[2014-03-20 17:29:00 INFO MainThread][service-context] registering anonymous service with broker
[2014-03-20 17:29:00 INFO MainThread][service-context] service registered as host ID 0x00001000 service ID 0x0000000C
[Thu Mar 20 17:29:00 2014] Registered...
[2014-03-20 17:29:00 INFO Thread-1][factory] _worker_thread:
9.152.157.29
1009a8443671434d9cb580119df54557 | state=CubeStatus | session=6993530b-ffe4-4bc8-8f94-bc2cb9497ce3 | last_time=1395332938 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.55
71fef0dca6aa457aa4e43f66f070730b | state=CubeStatus | session=b0d6eca2-a46c-4fdd-9ead-e64b6f177c11 | last_time=1395332917 | dirty=False | cubestatus=139 | pending=0 | last_err=
9.152.157.25
99a041a9740f449892e9e8ac5149d534 | state=CubeStatus | session=d2323df0-ee6f-46e2-8b24-0d58b01e615c | last_time=1395332919 | dirty=False | cubestatus=159 | pending=0 | last_err=
9.152.157.34
b18132add0df424b961d013736351bd0 | state=CubeStatus | session=ab1e2be3-542c-4973-a5ac-43b27dc9b31d | last_time=1395332924 | dirty=False | cubestatus=119 | pending=0 | last_err=
9.152.157.28
e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus | session=6950ac4b-03cb-42e9-8562-913823fda145 | last_time=1395304867 | dirty=False | cubestatus=19 | pending=0 | last_err=
7fc0e30ecb3a40f29a00845dc803afb6 | state=idle | session= | last_time=1395331335 | dirty=False | cubestatus=0 | pending=0 | last_err=
531a52d21a0d47fe85ebe60eb0ff994e | state=idle | session= | last_time=1395331422 | dirty=False | cubestatus=0 | pending=0 | last_err=
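Note that last_time in this output is a Unix epoch; in the example, 9.152.157.28 reported last_time=1395304867, hours older than its peers, which is the kind of staleness to look for. A sketch for parsing these lines offline (plain Python, not part of the product):

```python
from datetime import datetime, timezone

def parse_status(line):
    """Parse one 'guid | key=value | key=value | ...' line from the
    CubeReplStatus output into a dict."""
    guid, _, rest = line.partition(" | ")
    fields = dict(part.split("=", 1) for part in rest.split(" | "))
    fields["guid"] = guid.strip()
    # last_time is a Unix epoch; convert it so stale entries stand out.
    fields["last_time"] = datetime.fromtimestamp(
        int(fields["last_time"]), tz=timezone.utc)
    return fields

line = ("e1e8cb2a180946cf9a55aebe72eb35d6 | state=CubeStatus | "
        "session=6950ac4b-03cb-42e9-8562-913823fda145 | last_time=1395304867 | "
        "dirty=False | cubestatus=19 | pending=0 | last_err=")
f = parse_status(line)
print(f["state"], f["cubestatus"], f["last_time"].isoformat())
```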
Document Information
Modified date:
17 December 2020
UID
swg21687445