Troubleshooting
Problem
A customer's WPS seemed to become a bit sluggish and all the usual tuning efforts hadn't helped a lot. One of the things that still could be done was to look into the BPEDB itself to see, whether there are unexpected amount of data and/or relations.
Symptom
BPEDB "slow" and growing
Resolving The Problem
One of the educational principles and an essential base of what is called “experience” is to learn from errors made. Second best is to learn from errors made by others – which is still better than not learning at all. Maybe this story helps a bit.
A customer's WPS seemed to become a bit sluggish and all the usual tuning efforts hadn't helped
a lot. One of the things that still could be done was to look into the BPEDB itself to see,
whether there are unexpected amount of data and/or relations.
Finally, the investigations revealed an unhandled condition in a process model, that lead to a
undesirable loop.
We started with a
select count(*) from process_instance_b_t with ur |
to get the number of rows in the process instance table - which represents the number of process instances WPS knows about.
We include the "with UR" clause in all these selects to avoid, that the database sets any locks that would interfere even more with
other users (applications) of the database. Or - to phrase it differently - to be as least intrusive as possible.
Maybe you're interested in a few more details, then you can ask for the amount of instances in a certain state.
select count(*) from process_instance_b_t where state = 2 with ur |
would yield all running instances. These states include DELETED=0, READY=1, RUNNING=2, FINISHED=3, COMPENSATING=4, FAILED=5,
TERMINATED=6, COMPENSATED=7, TERMINATING=8, FAILING=9, INDOUBT=10, SUSPENDED=11, COMPENSATION_FAILED=12
Similarly you can collect some statistics on activities:
select count(*) from activity_instance_b_t with ur |
select count(*) from activity_instance_b_t where state = 5 with ur |
FAILING=10, WAITING=11, EXPIRED=12, STOPPED=13, PROCESSING_UNDO=14.
When your typical (or average) process model executes e.g. 30 activities, then the amount of rows in the activity_instance_b_t
table should roughly be in the order of magnitude of up to 30 times the amount of rows in the process_instance_b_t table.
In this cases we found over 39M activity instances for about 50k process instances, where we only expected
up to 1M activity instances.
That justified some deeper investigation - we wanted to find out, which instances had the most activities.
SELECT
COUNT(AI.AIID) AS NUMBER_OF_ACTIVITIES
PROCESS_INSTANCE_B_T AS PI, PROCESS_TEMPLATE_B_T AS PT
AI.PIID = PI.PIID
WITH UR |
- PI.PIID is the process instance ID from the process_instance_b_t table
- To see a bit more than just a hex ID, the name of the related process template (PT.NAME) is included.
- And as we're interested in the amount of activities for these process instances, we'd need to include something from the activity_instance_b_t table and do the necessary grouping.
In our specific case this resulted in:
PIID | NAME | STATE | NUMBER_OF_ACTIVITIES |
----------------------------------- | ------------ | ------ | -------------------- |
x'9003011CE5DED75B3EFDEB538C02DAE4' | LGClaimEH | 6 | 147047 |
x'9003011E841DE9AF3EFDEB53045C4103' | LGClaimEH | 6 | 96609 |
x'9003011E841DDEF13EFDEB53045C3DD9' | LGClaimEH | 6 | 96462 |
. . . . |
A lookup of
select auto_delete from process_template_b_t where name = 'LGClaimEH' |
However, why there were so many activities in these instances needed further investigation. We tried to get a list of the most recently executed activities in some of the topmost cases in the above list.
(N.B. only activities with the “business relevance” flag set are persisted in the activity instance table)
SELECT
ACTIVITY_TEMPLATE_B_T ATP
AI.PIID = '9003011CE5DED75B3EFDEB538C02DAE4'
WITH UR |
LAST_STATE_CHANGE | NAME | STATE |
----------------------- | ------------------- | --------- |
2009-03-22-16.24.17.964333 | Activity_17 | 7 |
2009-03-22-16.23.55.925757 | Activity_14 | 5 |
2009-03-22-16.23.32.528576 | Activity_14 | 5 |
2009-03-22-16.23.11.976875 | Activity_14 | 5 |
2009-03-22-16.22.49.582347 | Activity_14 | 5 |
2009-03-22-16.22.24.257894 | Activity_14 | 5 |
2009-03-22-16.22.01.723894 | Activity_14 | 5 |
. . . |
After the application developer saw this, it didn't take long to find out, what to correct.
In this case
• the “delete when finished” flag was activated
• the (not shown here) business rule, that determined the 20 seconds interval got adapted
• and most important the business logic, that lead to the undesired loop, was corrected.
At least this case could be used as a demonstration of robustness as the oldest of these instances dated over 9 moths ago. The server performed this “nonsense” in the application code for quite a long time before it's slowness in processing (most obvious in deleting instances) was noticed. And even then – it didn't break.
The final cleanup in the database was done in small chunks using the deleteCompletedProcessInstances.py script in the {install_root}/runtimes/bi_v6/ProcessChoreographer/admin directory.
Was this topic helpful?
Document Information
More support for:
WebSphere Process Server
Software version:
6.2, 6.1.2, 6.1
Operating system(s):
AIX, HP-UX, Linux, Solaris, Windows, z/OS, IBM i
Document number:
623819
Modified date:
15 June 2018
UID
swg21381220