IBM Support

Slow BPC processing - a PD story using DB analysis

Troubleshooting


Problem

A customer's WPS seemed to become a bit sluggish and all the usual tuning efforts hadn't helped a lot. One of the things that still could be done was to look into the BPEDB itself to see, whether there are unexpected amount of data and/or relations.

Symptom

BPEDB "slow" and growing

Resolving The Problem

One of the educational principles and an essential base of what is called “experience” is to learn from errors made. Second best is to learn from errors made by others – which is still better than not learning at all. Maybe this story helps a bit.

A customer's WPS seemed to become a bit sluggish and all the usual tuning efforts hadn't helped
a lot. One of the things that still could be done was to look into the BPEDB itself to see,
whether there are unexpected amount of data and/or relations.



Finally, the investigations revealed an unhandled condition in a process model, that lead to a
undesirable loop.



We started with a

select count(*) from process_instance_b_t with ur

to get the number of rows in the process instance table - which represents the number of process instances WPS knows about.


We include the "with UR" clause in all these selects to avoid, that the database sets any locks that would interfere even more with
other users (applications) of the database. Or - to phrase it differently - to be as least intrusive as possible.

Maybe you're interested in a few more details, then you can ask for the amount of instances in a certain state.
select count(*) from process_instance_b_t where state = 2 with ur

would yield all running instances. These states include DELETED=0, READY=1, RUNNING=2, FINISHED=3, COMPENSATING=4, FAILED=5,
TERMINATED=6, COMPENSATED=7, TERMINATING=8, FAILING=9, INDOUBT=10, SUSPENDED=11, COMPENSATION_FAILED=12

Similarly you can collect some statistics on activities:
select count(*) from activity_instance_b_t with ur
select count(*) from activity_instance_b_t where state = 5 with ur
Valid states are: INACTIVE=1, READY=2, RUNNING=3, SKIPPED=4, FINISHED=5, FAILED=6, TERMINATED=7, CLAIMED=8, TERMINATING=9,
FAILING=10, WAITING=11, EXPIRED=12, STOPPED=13, PROCESSING_UNDO=14.


When your typical (or average) process model executes e.g. 30 activities, then the amount of rows in the activity_instance_b_t

table should roughly be in the order of magnitude of up to 30 times the amount of rows in the process_instance_b_t table.


In this cases we found over 39M activity instances for about 50k process instances, where we only expected
up to 1M activity instances.

That justified some deeper investigation - we wanted to find out, which instances had the most activities.
SELECT
    PI.PIID, PT.NAME, PI.STATE,
    COUNT(AI.AIID) AS NUMBER_OF_ACTIVITIES
FROM
    ACTIVITY_INSTANCE_B_T AS AI,

    PROCESS_INSTANCE_B_T AS PI,

    PROCESS_TEMPLATE_B_T AS PT
WHERE
    PI.PTID = PT.PTID AND

    AI.PIID = PI.PIID
GROUP BY
    PI.PIID, PT.NAME, PI.STATE
ORDER BY
    NUMBER_OF_ACTIVITIES DESC
FETCH FIRST 20 ROWS ONLY
WITH UR
Some explanations:
  • PI.PIID is the process instance ID from the process_instance_b_t table
  • To see a bit more than just a hex ID, the name of the related process template (PT.NAME) is included.
  • And as we're interested in the amount of activities for these process instances, we'd need to include something from the activity_instance_b_t table and do the necessary grouping.

In our specific case this resulted in:
PIID NAMESTATE  NUMBER_OF_ACTIVITIES
-------------------------------------------------------------------------
x'9003011CE5DED75B3EFDEB538C02DAE4'LGClaimEH
6
147047
x'9003011E841DE9AF3EFDEB53045C4103'LGClaimEH
6
96609
x'9003011E841DDEF13EFDEB53045C3DD9'LGClaimEH
6
96462
. . . .


LGClaimEH was expected to have 20-30 activities at most. Position 100 in that ordered list had about 57k and position 1000 had 7k activities. Furthermore state=6 indicated that these instances were no longer running (6=terminated). From their name, the customer could tell, that they're no longer relevant and should have been cleaned up automatically.
A lookup of
select auto_delete from process_template_b_t where name = 'LGClaimEH'
proved, that when modelling the process the “delete instance when finished” flag was not set (auto_delete=0), which was the reason, why all these instances still existed.
However, why there were so many activities in these instances needed further investigation. We tried to get a list of the most recently executed activities in some of the topmost cases in the above list.

(N.B. only activities with the “business relevance” flag set are persisted in the activity instance table)

SELECT
    AI.LAST_STATE_CHANGE, ATP.NAME, AI.STATE
FROM
    ACTIVITY_INSTANCE_B_T  AI,

    ACTIVITY_TEMPLATE_B_T  ATP
WHERE
    AI.ATID = ATP.ATID and

    AI.PIID = '9003011CE5DED75B3EFDEB538C02DAE4'
ORDER BY
    AI.LAST_STATE_CHANGE DESC
FETCH FIRST 40 ROWS ONLY
WITH UR
resulted in something like
LAST_STATE_CHANGE          NAMESTATE
---------------------------------------------------
2009-03-22-16.24.17.964333Activity_17
7
2009-03-22-16.23.55.925757Activity_14
5
2009-03-22-16.23.32.528576Activity_14
5
2009-03-22-16.23.11.976875Activity_14
5
2009-03-22-16.22.49.582347Activity_14
5
2009-03-22-16.22.24.257894Activity_14
5
2009-03-22-16.22.01.723894Activity_14
5
 . . .

which continued on and on with “Activity_16” in State finished(=5) in roughly 20 seconds difference in the timestamps for thousands of rows.
After the application developer saw this, it didn't take long to find out, what to correct.
In this case
• the “delete when finished” flag was activated
• the (not shown here) business rule, that determined the 20 seconds interval got adapted
• and most important the business logic, that lead to the undesired loop, was corrected.
At least this case could be used as a demonstration of robustness as the oldest of these instances dated over 9 moths ago. The server performed this “nonsense” in the application code for quite a long time before it's slowness in processing (most obvious in deleting instances) was noticed. And even then – it didn't break.

The final cleanup in the database was done in small chunks using the deleteCompletedProcessInstances.py script in the {install_root}/runtimes/bi_v6/ProcessChoreographer/admin directory.
[{"Product":{"code":"SSQH9M","label":"WebSphere Process Server"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Business Process Choreographer","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"},{"code":"PF035","label":"z\/OS"},{"code":"PF012","label":"IBM i"}],"Version":"6.2;6.1.2;6.1","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Document Information

More support for:
WebSphere Process Server

Software version:
6.2, 6.1.2, 6.1

Operating system(s):
AIX, HP-UX, Linux, Solaris, Windows, z/OS, IBM i

Document number:
623819

Modified date:
15 June 2018

UID

swg21381220

Manage My Notification Subscriptions