A technical deep dive into the global process monitor model

Hone your monitor model development skills with this under-the-hood look at the global process monitor model

Get an in-depth look at a monitor model for BPEL processes running on WebSphere® Process Server. Understanding this model will enable you to customize it by adding metrics, KPIs and dashboard views for specific business needs. You can also extend the model to monitor processes running on different engines. Finally, the article contains many tips for advanced monitor model development.

Introduction

The global process monitor model [5] monitors any BPEL process "out-of-the-box," that is, without generating and deploying any monitor models or executable code. Unlike generated models for process monitoring, which track the executions of a particular process definition, this model detects and tracks any process that runs on WebSphere Process Server (hereafter called Process Server) and sends events to WebSphere Business Monitor (hereafter called Monitor). Process Server must be at V6.1 or higher, and Monitor at V6.2.0.1 or higher.

This article provides a detailed look into the design of the global process monitor model. It was written with three goals in mind:

  1. Allow developers to customize the global process monitor model for a particular process. While you can build custom models starting from a generated monitor model, starting from the global process monitor will give you a more robust result, which is less sensitive to process logic changes, data model changes, missing events, or scrambled event order. It also allows you to leverage its built-in logic for monitoring invocation chains, state history, variable changes, task statistics, process statistics, process migrations, and many other kinds of observables you may want to capture.
  2. Allow developers to broaden the scope of the model, by adding subscriptions for events from other process engines. The structure and logic of the global process monitor model is not tied to BPEL. It merely assumes that process definitions have executions, which in turn have steps, and that there will be events reporting state changes of process executions and their steps. The many "specialties" of Process Server that the global process monitor captures as well (process variables, process migrations, task escalations, and so on) are purely optional and, if an engine does not report any of those, its processes will still be monitored just fine.
  3. Provide an advanced course in monitor model authoring by walking users through the design of this model and discussing some of the underlying rationale.

Readers who are primarily interested in understanding the data shown on the pre-built dashboards should refer to [5]. That reference is also a good starting point for developers who want to tailor or customize the global process monitor model, since it provides an overview of the model's built-in capabilities. This article assumes that you have a solid understanding of monitor model concepts and are familiar with the monitor model editor. You can find introductions to those topics in [1] or [2].

Monitor model overview

Figure 1 shows the principal structure of the monitor model.

Figure 1. Global process monitor model — overall structure

Here is the reasoning that led to this structure: We would like to monitor processes, along with their steps and their variables, both at the definition level (M1) and at the execution level (M0). A process definition can have many step definitions. It also can have many executions. A process execution can be broken down into step executions, but each step execution is also a logical child of a step definition. Hence, the perfect monitoring context (MC) structure for monitoring processes and their steps, both at the definition level and at the execution level, is a diamond as shown in Figure 2.

Figure 2. Ideal MC structure for process monitoring — a diamond

But the architecture of Monitor requires that monitoring contexts form a tree. To cope with this restriction, we applied two different techniques to break up the diamond structure. These are indicated in red in Figure 1:

  • To solve the problem for step executions, each execution is monitored by two MCs: The children of a Process Execution MC (Process Execution Step in Figure 1) monitor each step of a given process run. The children of a Step Definition MC (Step Execution in Figure 1) monitor each run of a particular process step. The diamond structure has been resolved by instantiating two MCs for each step execution. The equal sign means that each Process Execution Step monitoring context has a corresponding Step Execution monitoring context, but with different grouping and navigation properties.
  • To cope with this problem for variables, we removed the link between the MC for a variable definition and the MCs representing its values (the red dashed line in Figure 1). As a result, variable values are not aggregated at the variable definition level, which means that to see them on the dashboard, you must navigate via the process execution that holds that value.

This explains the upper part of the model structure in Figure 1, which now indeed is a tree. The lower part shows a simple, two-level hierarchy for monitoring human task definitions and their executions. Since Process Server allows human tasks to be defined as steps in a process as well as standalone, not all human tasks will be captured by a monitor model that only tracks processes. Therefore, we added the lower part of the model to account for standalone tasks. For completeness, however, the Task Definition and Task Execution MCs subscribe to events from both inline and standalone human tasks, so that inline tasks are monitored both as "process steps" in the upper part of the model and as "human tasks" in the lower part.

This concludes our overview of the monitor model structure. There are numerous other monitoring context definitions, which for the most part have auxiliary purposes: they hold data that would ideally be held in metrics of complex types. Since Monitor supports only simple metric types, complex types, as well as sequences (arrays), must be realized using auxiliary MC definitions. Almost all of these auxiliary MCs capture time-stamped histories: of execution states, task assignments, task escalations, model versions, instance migrations, variable values, and so on.

Figure 3 shows the full monitoring context hierarchy of the global process monitor model, with the principal monitoring context definitions emphasized. These represent the main navigation path, and you should not lose sight of it when exploring this model. Where parts of a monitoring context name are in brackets, this indicates that the display names have been shortened accordingly.

Figure 3. Monitoring context hierarchy of the global process monitor model

If you look at the monitor model source, you'll find that there are monitoring context definitions with names starting with "Aux" that are not shown in Figure 3; you'll also find metrics with names starting with "Aux" in MCs that are shown in Figure 3. All of these are auxiliary constructs to support the monitor model logic. You should ignore them when setting up dashboard views, and consider them hidden (quite like private fields in object-oriented programming).

Principles of operation

Before diving into more detail, this section gives you an overview of how the global process monitor model works. It focuses on the core structure and principal logic, deliberately omitting many of the finer points. The description references Figure 3.

The core design of the monitor model is based on the assumption that a process engine emits three kinds of events:

  • Events reporting state changes of process executions (process started, process suspended, process resumed, process ended, and so on)
  • Events reporting state changes of process steps (activity started, activity ended, activity failed, and so on)
  • Events reporting updates of process variables

It also assumes that events about process steps and process variables carry a process execution identifier, and that all kinds of events reference the process definition (and optionally a server, in order to distinguish copies of the same process running at different locations). BPEL processes running on Process Server emit all three kinds of events [4] and meet these criteria. We note in passing that scope events are processed like step events.

The global process monitor model is constructed such that none of the three event kinds is mandatory. For example, if a process emits nothing but variable change events, the existence of process executions is inferred from those; execution start and end times won't be known, but the earliest and the latest variable change event give an estimate; nothing is known about the process steps. As another example, a process that only emits events about certain human tasks is monitored at just that level of detail: there is no information about automated steps, execution state changes, or variable values, but the human tasks from which events are received are monitored and their process executions are inferred.

How is this accomplished? Starting at the lowest-level MCs of the backbone structure (bold names in Figure 3): the Step Execution and Process Execution Step monitoring contexts subscribe to all events reporting state changes of steps in a running process. The first event seen about a step execution instantiates both MC definitions, and the state it reports is recorded. For example: "task Review Expense Account for account #12345 was assigned to user John Doe at 2009-09-09T09:09:09Z". All further events about this step execution are received by the same two instances and update the step execution's history.

Similarly, a Process Execution Variable MC is instantiated when the first event about a process variable (instance) is received. The variable value is recorded. Events reporting updates of the variable are received by the same monitoring context and extend the value history.

Moving up in the MC hierarchy, a Variable Definition or a Step Definition monitoring context is created by the first event from which the existence of such a (variable or step) definition can be inferred. For example, the first event received from an execution of a step named Review Expense Account causes the creation of a corresponding Step Definition MC. As more expense accounts are being processed, there will be many more executions of this step, and possibly several events per step execution (reporting start, assignment, escalation, completion, and so on). But these will only update the statistics in the same Step Definition monitoring context.

A Process Execution monitoring context is created by the first event from which the existence of a new process execution can be inferred. Typically, this is the process start event. But if that event is not sent, the first event reporting a process step or variable change within a new execution does the job. Further events about this process execution—like process suspended, process resumed, process terminated, process failed—update the process execution's history in the already existing monitoring context, just like events about step executions update step execution histories in the lower level MCs.

At the top of the MC hierarchy is a Process Definition monitoring context, which represents a process definition deployed at a particular server. The first event received from a newly deployed process creates this MC; this may be the process start event of the first execution, but any other event will do as well. When the same process is deployed at multiple servers (for example, at a test server and at a production server) two MC hierarchies are initialized and track the executions running on the two servers. When a new version of the process is deployed, however, the top-level monitoring context stays the same. The version change is recorded (a new Process Version monitoring context is created) but is considered an "evolution" of the process definition, and execution statistics are continuously aggregated in the same Process Definition MC.

The logic of Task Definition and Task Execution MCs is similar to the logic of Process Definition and Process Execution MCs.

Enabling audit events

Process Server allows you to selectively enable and disable audit events. Table 1 provides some tips on which events to enable for the global process monitor, along with some rationale.

Table 1. Audit events for process monitoring
  • For each process to be monitored, enable all events at the process level.
    Why: Typically, a process only emits a few events during its lifetime (start, end, failures, deletion, and so on), so this should not result in overwhelming event traffic. Enabling all events will give Monitor full awareness of execution state changes, which usually outweighs any performance gains from partial event enablement.

  • For each activity to be monitored, enable all events as well.
    Why: Typically, activities of interest are human tasks and invocations. Enabling only some events, like activity start and activity end, may be seen as a way to reduce event traffic, but typically the performance gains are modest and do not outweigh the loss in information; a more efficient way to cut down event traffic is to disable all events for short-running steps, bottom-level sub-processes, steps within loop bodies, and so on.

  • If you monitor both a process and the sub-process that it calls, enable at least the entry event for the invoke activity that links the two.
    Why: This allows the global process monitor to capture the caller-callee relationship, as discussed in Tracking Invocation Chains.

  • Enable all events for (standalone and inline) human tasks you want to monitor.

  • Enable all events for loop activities.
    Why: This gives you a time-stamped history of loop iterations. Often that will suffice to monitor the loop execution and find out in which iterations the time was spent. It may allow you to disable any and all event emissions within the loop body.

  • Disable events for short-running, automated steps.
    Why: This obviously helps performance, but it also helps avoid unnecessary clutter on the dashboard, because typically those short-running steps are not of interest to business users.

  • Enable variable change events for the process variables you want to monitor (and only for those).
    Why: As an aside, allocating process variables specifically for monitoring can be a useful technique: define variables for the information you would like to track, and add maps to these variables at the appropriate points in the process (for example, when major milestones are reached or business items undergo important state changes). Monitoring these variables can give you business-level content, process status, and a time line all at once.

Inbound event subscriptions

This article is not meant to be exhaustive; we won't go over every model element of the global process monitor and explain it in every detail. Instead, we'll zoom in on selected constructs that are characteristic of this model and highlight key aspects of their design. Once those are understood, the remainder of the model should be accessible to anyone who cares to inspect it. As a side benefit, the highlighted constructs may serve as useful samples for authoring monitor models, giving you ideas of what can be done and how.

As an example of the inbound event subscriptions of this monitor model, Table 2 shows the event subscriptions of the Process Definition, Process Execution, and Process Execution Step monitoring contexts. Note that these five subscriptions are all the external event subscriptions of these three layered MC definitions. (There are a few additional subscriptions for internal events, which we'll discuss later.)

Table 2. The event subscriptions of the Process Definition, Process Execution, and Process Execution Step MCs

Event subscription: Process_Definition/Exists
Filter: fn:exists( Exists/eventPointData/bpc:processTemplateName )
Correlation predicate: Exists/eventPointData/bpc:processTemplateName = Name and fct:logical-deployment-location( Exists/server ) = Deployed_At
Zero-one-multiple correlation matches: create-deliver-error

Event subscription: Process_Definition/Process_Execution/State_Changed
Filter: fn:exists( State_Changed/eventPointData/bpc:processTemplateName ) and fn:exists( State_Changed/baseData/wbi:eventHeaderData/wbi:ECSCurrentID ) and fn:string-length( State_Changed/eventPointData/bpc:BPCEventCode ) = 5 and fn:contains( '21000 21001 21002 21004 21005 21019 21020 21090 42001 42003 42004 42009 42010 42027 42041 42042 42046 42047 42056 42079 42080 ', fn:concat( State_Changed/eventPointData/bpc:BPCEventCode, ' ' ) )
Correlation predicate: State_Changed/eventPointData/bpc:processTemplateName = ../Name and fct:logical-deployment-location( State_Changed/server ) = ../Deployed_At and State_Changed/baseData/wbi:eventHeaderData/wbi:ECSCurrentID = Identifier
Zero-one-multiple correlation matches: create-deliver-error

Event subscription: Process_Definition/Process_Execution/Step_State_Changed
Filter: fn:exists( Step_State_Changed/eventPointData/bpc:processTemplateName ) and fn:exists( Step_State_Changed/baseData/wbi:eventHeaderData/wbi:ECSParentID ) and fn:string-length( Step_State_Changed/eventPointData/bpc:BPCEventCode ) = 5 and fn:contains( '21006 21007 21011 21021 21022 21027 21080 21081 42005 42015 42020 42021 42022 42024 42026 42031 42032 42036 42037 42038 42039 42040 42050 42054 42055 42057 42061 42062 42064 42065 42066 42067 42068 42070 ', fn:concat( Step_State_Changed/eventPointData/bpc:BPCEventCode, ' ' ) )
Correlation predicate: Step_State_Changed/eventPointData/bpc:processTemplateName = ../Name and fct:logical-deployment-location( Step_State_Changed/server ) = ../Deployed_At and Step_State_Changed/baseData/wbi:eventHeaderData/wbi:ECSParentID = Identifier
Zero-one-multiple correlation matches: create-deliver-error

Event subscription: Process_Definition/Process_Execution/Process_Execution_Step/State_Changed
Filter: fn:exists( State_Changed/eventPointData/bpc:processTemplateName ) and fn:exists( State_Changed/baseData/wbi:eventHeaderData/wbi:ECSCurrentID ) and fn:exists( State_Changed/baseData/wbi:eventHeaderData/wbi:ECSParentID ) and fn:string-length( State_Changed/eventPointData/bpc:BPCEventCode ) = 5 and fn:contains( '21006 21007 21011 21021 21022 21027 21080 21081 42005 42015 42020 42021 42022 42024 42026 42031 42032 42036 42037 42038 42039 42040 42050 42054 42055 42057 42061 42062 42064 42065 42066 42067 42068 42070 ', fn:concat( State_Changed/eventPointData/bpc:BPCEventCode, ' ' ) )
Correlation predicate: State_Changed/eventPointData/bpc:processTemplateName = ../../Name and fct:logical-deployment-location( State_Changed/server ) = ../../Deployed_At and State_Changed/baseData/wbi:eventHeaderData/wbi:ECSParentID = ../Identifier and State_Changed/baseData/wbi:eventHeaderData/wbi:ECSCurrentID = Identifier
Zero-one-multiple correlation matches: create-deliver-error

Event subscription: Process_Definition/Process_Execution/Process_Execution_Step/Called_Execution_Exists
Filter: fn:exists( Called_Execution_Exists/baseData/wbi:eventHeaderData/wbi:ECSParentID ) and fn:exists( Called_Execution_Exists/baseData/wbi:eventHeaderData/wbi:ECSCurrentID )
Correlation predicate: Called_Execution_Exists/baseData/wbi:eventHeaderData/wbi:ECSParentID = Identifier
Zero-one-multiple correlation matches: ignore-deliver-deliverToAll

Table 3 declares the namespace prefixes used in the above expressions. These declarations apply throughout this document.

Table 3. Namespace declarations of the global process monitor model
  • xs: http://www.w3.org/2001/XMLSchema (XPath 2.0 type names and constructor functions)
  • fn: http://www.w3.org/2005/xpath-functions (XPath 2.0 built-in functions)
  • cbe: http://www.ibm.com/AC/commonbaseevent1_0_1 (Common Base Event (wrapper) event payload)
  • wbm: http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/functions (WebSphere Business Monitor built-in functions)
  • wbi: http://www.ibm.com/xmlns/prod/websphere/monitoring/6.1 (WebSphere Business Integration event payloads)
  • bpc: http://www.ibm.com/xmlns/prod/websphere/scdl/business-process/6.0.0 (WebSphere Business Process Choreographer event payloads)
  • htm: http://www.ibm.com/xmlns/prod/websphere/scdl/human-task/6.0.0 (WebSphere Human Task Manager event payloads)
  • evt: http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/events/Global_Process_Monitor (Global process monitor internal event payloads)
  • fct: http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/functions/Global_Process_Monitor (Global process monitor user-defined functions)

Table 4 shows the event part definitions used in Table 2.

Table 4. Some event parts used in the global process monitor model
  • eventPointData: cbe:CommonBaseEvent/wbi:event/wbi:eventPointData (type: various, for example bpc:BPC.BFM.PROCESS.BASE)
  • baseData: cbe:CommonBaseEvent/wbi:event (type: wbi:Event)
  • server: cbe:CommonBaseEvent/cbe:sourceComponentId/@instanceId (type: xs:string)

Monitoring context creation and life cycle

Referring back to Table 2, you see that the top-level Process Definition monitoring context has only one event subscription, named Exists, but with a very wide filter: any event that carries a process template name in field cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:processTemplateName will pass. All events issued by BPEL processes running on Process Server meet this criterion. The correlation predicate compares the process template name with the Name metric and its logical deployment location with the Deployed At metric, so that two deployments of the same process (for example, for test and production) are distinguished: different top-level monitoring contexts are created, one per deployed executable. The current implementation of fct:logical-deployment-location( ... ) extracts the WebSphere cell name, which is based on the assumption that if several copies of the same process are deployed and monitored concurrently, then they run in different cells (note that the audit events sent by Process Server do not contain WebSphere cluster information).

The main purpose of the Exists subscription is to instantiate a Process Definition top-level monitoring context as soon as a new process is detected. If that were all it did, its settings for zero-one-multiple correlation matches should be create-ignore-error: once an MC has been created there would be no need for it to receive any further events. However, the events received by this subscription also have downstream effects, as shown in Figure 4.

Figure 4. Downstream effects of the Process_Definition/Exists event subscription

Three of the metrics in Figure 4, namely Deployed At, Name and Identifier, are merely initialized by the first event that arrives and thus don't need any further events delivered to this monitoring context. But the Latest Audit Event and Earliest Audit Event timestamps are updated continuously (the Earliest Audit Event only because this monitor model allows for events to arrive out of order). Furthermore, the Version metric is updated with the latest deployed version of the process (any prior versions are recorded in a Process Version child MC), and the Time Since Last Active timer is restarted from zero by every new event.

This timer is used to terminate the top-level monitoring context along with all its descendants when no event has arrived for 90 days (the process is then considered inactive). To change this timeout value, edit the trigger condition of the Inactive trigger defined in the Process Definition MC. You may also consider a more sophisticated implementation, which invokes a user-defined function as part of the trigger condition, passing the Time Since Last Active as well as the Name and Deployed At metrics, and runs some external logic to decide whether it is time to terminate the context hierarchy; that could even involve a human decision maker.

If a process execution starts after its Process Definition monitoring context is terminated, a new top-level MC is created and monitoring resumes immediately. However, the statistics will be divided: those for the old process runs will remain in the terminated Process Definition monitoring context, while those for the more recent runs will be accumulated in the new MC.
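To make these options concrete, here is a minimal sketch of what such trigger conditions might look like. This is illustrative only: it assumes the timer metric's ID is Time_Since_Last_Active (following the model's underscore naming convention), that the timer value can be compared against an xs:dayTimeDuration, and that fct:is-inactive is a hypothetical user-defined function you would have to supply yourself:

Time_Since_Last_Active > xs:dayTimeDuration( 'P30D' )
                      -- simple variant: consider the process inactive after 30 days

fct:is-inactive( Time_Since_Last_Active, Name, Deployed_At )
                      -- hypothetical UDF variant: delegate the decision to external logic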

Tip: Avoiding parent-not-found exceptions

An event that might create an MC must also be received by its parent, and cause the parent's creation if it does not exist. In a multi-level MC hierarchy, this principle must be applied recursively.

Referring to Table 2 again, and moving down in the stack of monitoring context definitions, you see that Process_Definition/Process_Execution MCs have two event subscriptions, State_Changed and Step_State_Changed. The first receives all events about process execution state changes, which are the real events of interest at this level. The second subscription is a copy of the subscription for state-change events at the next lower level, that is, a copy of the State_Changed subscription of the Process_Definition/Process_Execution/Process_Execution_Step MCs: its filter condition is exactly the same, and the correlation predicate has only been adjusted for the higher MC nesting level. That subscription's only purpose is to receive step-state change events before they are delivered to a Process Execution Step MC, and create the required parent monitoring context if it does not exist. This is an example of a general design principle, which was applied throughout this monitor model: see Avoiding Parent-Not-Found Exceptions.

To achieve this, ensure that for every event subscription that might create a child MC, there is a context-creating event subscription with the same or a wider filter at the parent level. Looking at the examples in Table 2:

  • Events received by Process_Definition/Process_Execution/Process_Execution_Step/State_Changed are also received by Process_Definition/Process_Execution/Step_State_Changed
  • Events received by Process_Definition/Process_Execution/Step_State_Changed are also received by Process_Definition/Exists
  • Events received by Process_Definition/Process_Execution/State_Changed are also received by Process_Definition/Exists

All of these event subscriptions create a new monitoring context if no matching one exists. Since events are always delivered from the top-level monitoring context down, this ensures that every new child context has a parent. Note that the Process_Definition/Process_Execution/Process_Execution_Step/Called_Execution_Exists subscription is of no concern, because it never causes an MC creation.

Filter expressions

Tip: Writing event subscription filters

An event subscription filter should (1) test for the existence of all fields needed in the correlation predicate, and (2) add the minimum number of conditions required to distinguish the desired events from all others.

Looking at the filter expressions in Table 2, you see that they follow a pattern: existence tests for certain fields in the event are followed by checks for certain event codes. A general design rule for filter expressions is described in Writing Event Subscription Filters.

The filter expressions in Table 2 follow this rule, except that the existence of <EventSubscriptionName>/server is never tested. This is to allow for events from unknown servers. The fct:logical-deployment-location( ... ) function returns an empty string in this case, to avoid problems with non-existing fields in correlation predicates. But you could contend that these filters are over-specified: with knowledge of the audit events produced by Process Server, you could argue that if the evaluation of cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:BPCEventCode returns the value 21000, then this must be a PROCESS_STARTED event from Process Server's BPEL engine, which is known to contain the fields used in the correlation predicates, so the existence tests can be dropped. We erred on the safe side and kept the more elaborate filter expressions in this showcase model, but you should feel free to simplify them. The resulting performance gains will be minimal, since testing the existence of a field that must be fetched anyway comes at virtually no cost, but the model size will be reduced and the expressions will become more readable.
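For example, if you accept that dependency on Process Server's event format, the State_Changed filter of the Process_Definition/Process_Execution MC (from Table 2) shrinks to the event code test alone:

fn:string-length( State_Changed/eventPointData/bpc:BPCEventCode ) = 5
    and
fn:contains(
    '21000 21001 21002 21004 21005 21019 21020 21090 42001 42003 42004 42009 42010 42027 42041 42042 42046 42047 42056 42079 42080 ',
    fn:concat( State_Changed/eventPointData/bpc:BPCEventCode, ' ' ) )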

Testing containment in a set of strings

Another technique demonstrated by the filter expressions in Table 2 is testing containment in a set of strings. First note that with full XPath 2.0 support, the BPC event code tests could have been written using a general comparison [3], as in the following example:

State_Changed/eventPointData/bpc:BPCEventCode = ('21000','21001','21002','21004', ..... )

This expression is true if the result of the left hand side (after atomization) matches any of the strings in the right hand side sequence. But with the limited XPath 2.0 support in Monitor, you must code something like this:

fn:string-length( State_Changed/eventPointData/bpc:BPCEventCode ) = 5 
    and
fn:contains(
    '21000 21001 21002 21004 ..... ', 
    fn:concat( State_Changed/eventPointData/bpc:BPCEventCode, ' ' ) )

Tip: Testing containment in a set of strings

To test whether a string S is contained in a set of strings, divide the set into subsets of equal length. Concatenate the strings in each subset, using a separation character that none of the strings contains (use multiple separation characters if no single character will do). Insert the separation character (or characters) after every string in the concatenation, including the last. For each concatenated subset, test if S has the length characterizing it and if the concatenation of S with the separation character (or characters) is string-contained in the concatenated subset. Repeat this for all subsets, and combine the results with OR.

Again throwing in knowledge about Process Server audit events, one could argue that the BPC event codes are always five characters long and thus the length-check is unnecessary—but as before we kept it in order to demonstrate the general technique. Note that without the length check, event codes like '1000' or '004' would produce unwanted matches, but even with the length-check in place, the test would fail if there were an event code containing a blank, like '1001 '. But Process Server event codes don't contain blanks, and knowing this was essential in selecting the separation character. See Testing Containment In A Set Of Strings for a general description of this method.

This explains how the filter expressions in Table 2 were formed: All BPC event codes have length 5, so there is only one subset of strings to be concatenated, and none contains a blank, so that blank can be used as the separation character. As a final illustration of this method, assume you want to test whether a given string S is one of 'one', 'two', 'too', 'hot', 'too much' or 'too cold'. Following the above method, the expression would be

fn:string-length( S ) = 3 and fn:contains('one,two,too,hot,', fn:concat( S, ',' ) ) or
fn:string-length( S ) = 8 and fn:contains('too much,too cold,', fn:concat( S, ',' ) )

Since two of the strings contain blanks, a different separator had to be used, and a comma was chosen. Note that without the length tests you would get unwanted matches for 'much' and 'cold', for example. Also note that if a blank had been picked as the separator, the string 'much too' would produce an unwanted match even with the length tests in place.

Testing fields that may be empty

While we're discussing techniques for authoring Boolean expressions, here is another one that is often useful: when event fields or metrics may be absent (empty), there are two ways in which this can be interpreted: either as a match (optional field, omitting it is fine) or as a mismatch (mandatory field, must be present and have the expected value).

Using general comparisons [3] together with negation allows you to encode all possibilities. Here are some examples:

event/field = 'foo'           -- field must be present, and value must be 'foo'
event/field != 'foo'          -- field must be present, and value must not be 'foo' 
fn:not(event/field != 'foo')  -- field can be absent, but if present must be 'foo'
fn:not(event/field = 'foo')   -- field can be absent, but if present must not be 'foo'

The not-equal cases are often important too. For example, if the goal is to block all events from a certain source but let all other events pass, then an expression like event/source != 'bad_source' won't produce the desired result: events with no source field at all will be blocked too. The right expression is fn:not(event/source = 'bad_source').

Finally, keep in mind that the same rules apply for inequalities. Here are a few more examples to conclude this section:

event/severity > 10           -- must be present, and must be greater than 10
event/severity <= 10          -- must be present, and must not be greater than 10
fn:not(event/severity <= 10)  -- can be absent, but if present must be greater than 10
fn:not(event/severity > 10)   -- can be absent, but if present must not be greater than 10

Tip: Testing fields that may be empty

If the absence of an event field that is used in a comparison should yield a true condition, use the inverse general comparison (e.g., '!=' instead of '=', or '<=' instead of '>') together with fn:not( ... ).

Hence, to capture all events where severity is either low or unspecified, the last expression must be used. The essence of this method is summarized in Testing fields that may be empty.

However, please note that at the time of this writing there is a defect in Monitor V7 that leads to an incorrect evaluation of general comparisons with empty arguments in certain cases. Until this is fixed, it is recommended that you include explicit tests for empty fields, using fn:exists() or fn:empty(), to ensure the desired results. Some of the expressions discussed in Monitoring step life cycles demonstrate how to use fn:exists() or fn:empty() to guard against this problem. Comparison results seen in the monitor model debugger should be correct in all cases.
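As an illustration of such guards (reusing the severity example from above, with an illustrative field name), the two fn:not( ... ) expressions can be rewritten with the empty case made explicit:

fn:exists( event/severity ) and event/severity > 10
                 -- must be present and greater than 10, stated explicitly

fn:empty( event/severity ) or event/severity <= 10
                 -- can be absent, but if present must not be greater than 10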

Correlation predicates

A monitor model defines one or more monitoring context hierarchies. (Note: the fact that an MC instance hierarchy must be defined by encoding an MC definition hierarchy is unfortunate, and could be considered an error in the implementation of the programming model. It prevents, for example, defining a tree of monitoring contexts where all nodes have the same type, like a tree of MCs reflecting a process invocation stack; it also prevents reusing child MC definitions in building higher level MCs.) For each incoming event, Monitor must find zero or more target MCs to which it is delivered. The correlation predicate of an inbound event subscription determines those target MC(s) within a monitoring context hierarchy. For event subscriptions that will not cause MC creation, comparing a field in the event with a unique metric in the target monitoring context will do the job. An example of this is the last correlation predicate in Table 2:

Called_Execution_Exists/baseData/wbi:eventHeaderData/wbi:ECSParentID = Identifier

Here, if no matching MC is found the event is ignored (see the last column in Table 2). For event subscriptions that do cause MC creation, however, there must be a way to know where in the tree of existing MCs the new MC should be anchored. To this end, the correlation predicates of these subscriptions must be more elaborate and compare a field in the event with one or more metrics at each level above the new MC. All other correlation predicates in Table 2 are of this type. As an example, here is the one for Process_Definition/Process_Execution/Process_Execution_Step/State_Changed events:

State_Changed/eventPointData/bpc:processTemplateName = ../../Name
    and fct:logical-deployment-location( State_Changed/server ) = ../../Deployed_At
                                     -- identify grandparent-level MC (Process_Definition)
                                     
    and State_Changed/baseData/wbi:eventHeaderData/wbi:ECSParentID = ../Identifier
                                     -- identify parent-level MC (Process_Execution)
                                     
    and State_Changed/baseData/wbi:eventHeaderData/wbi:ECSCurrentID = Identifier
                                     -- identify target MC (Process_Execution_Step)

When an event arrives that passes the filter of a subscription, Monitor first finds out how many existing MCs match the correlation predicate. There can be zero, one, or multiple matches, and the last column in Table 2 determines the subsequent course of action. For the Process_Definition/Process_Execution/Process_Execution_Step/State_Changed subscription, more than one match will cause an error, one match will cause event delivery to the target MC, and zero matches will cause a new MC to be created. The algorithm for MC creation makes use of the various parts of the correlation predicate, as follows:

  1. The top-level (Process_Definition) MC with matching Name and Deployed_At metrics is found; there must be exactly one match or an error is raised (no, or multiple, parent monitoring contexts).
  2. The second level (Process_Execution) MC with the Identifier metric matching the event's ECSParentID is found; there must be exactly one match or an error is raised (no, or multiple, parent monitoring contexts).
  3. The third level (Process_Execution_Step) MC is created anew, and made a child of the higher level MCs that were identified.

Requiring a new monitoring context, unless it is top-level, to always be anchored within an MC hierarchy is quite restrictive and using the correlation predicates in the way just explained is somewhat tricky. A more flexible architecture could support MC references based on "foreign keys" (imitating the relational model), which would allow for (1) more general (non-tree) structures and (2) MC relationships to be established at any time (not necessarily at context creation). We'll see what future releases of Monitor will bring.

Generated pre-filter

When a monitor model is deployed, a Boolean pre-filter expression is generated and used as the event selector for the monitor model input queue. Ideally, the pre-filter would let all events pass to which the monitor model subscribes, and block all others. The straightforward way to achieve this is by forming a disjunction (logical OR) of all inbound event filters in the model. The drawback of this method is that the resulting filter expressions can become very large, with bad implications for performance. At the other end of the spectrum, a very simple pre-filter can be generated as a conjunction (logical AND) of the conditions that all inbound event filters have in common—but that results in a trivial pre-filter when no common conditions exist. Hence, the pre-filter generator tries to strike a compromise: it groups the inbound event filters of a monitor model into subgroups that have at least one condition in common, and generates the pre-filter as the disjunction of the common conditions of each group.

Here is the pre-filter that is generated for the global process monitor model (slightly formatted for readability):

<selector 
xmlns:xs="http://www.w3.org/2001/XMLSchema" 
xmlns:fn="http://www.w3.org/2005/xpath-functions" 
xmlns:cbe="http://www.ibm.com/AC/commonbaseevent1_0_1" 
xmlns:bpc="http://www.ibm.com/xmlns/prod/websphere/scdl/business-process/6.0.0" 
xmlns:htm="http://www.ibm.com/xmlns/prod/websphere/scdl/human-task/6.0.0" 
xmlns:wbi="http://www.ibm.com/xmlns/prod/websphere/monitoring/6.1" 
xmlns:evt=
  "http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/events/Global_Process_Monitor"
referenceTypes=
  "xs:string xs:string xs:string xs:string xs:string xs:string xs:string xs:string 
   xs:string xs:string xs:string xs:string xs:string xs:string xs:string xs:string 
   xs:string xs:string xs:string" 
filterClass="com.ibm.wbimonitor.xsp.cei.extensions.EventSelectorFilter">

fn:exists(cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:processTemplateName/text())
or fn:exists(cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/htm:taskTemplateName/text())
or fn:exists(cbe:CommonBaseEvent/wbi:event/wbi:eventHeaderData/wbi:ECSCurrentID/text())

or fn:exists(cbe:CommonBaseEvent/evt:stepExecutionEscalation/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:stepExecutionState/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionStepEscalation/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:taskExecutionPotentialOwners/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionState/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:taskExecutionEscalation/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionStepPotentialOwners/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:stepExecutionPotentialOwners/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionVariableUnformattedValue/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:executionId/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:stepExecutionWorkItemUpdate/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:called/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:taskExecutionState/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionStepState/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:processExecutionStepWorkItemUpdate/text()) 
or fn:exists(cbe:CommonBaseEvent/evt:taskExecutionWorkItemUpdate/text())

</selector>

The first three conditions account for all Process Server events, both from processes and from human tasks. In fact, the third condition could have been avoided by adding a (redundant) test for the process template name to the last filter expression in Table 2 (and to a few other filters that test for ECSCurrentID but not for a process or task template name). But testing the existence of an event field that is needed anyway comes at very low cost, so the performance benefit of eliminating the ECSCurrentID test would be negligible.

It is recommended that users do not modify the generated pre-filters; this is mostly background information, giving you a glance "under the hood" of Monitor's event processing algorithm.

Date, time and duration formatting

A practice employed throughout this monitor model, and visible in Figure 4, is that all xs:dateTime and xs:duration metrics are also available as XML Schema formatted strings. For example, the metrics Latest Audit Event (xs format) and Earliest Audit Event (xs format) contain the XML Schema formatted representations of Latest Audit Event and Earliest Audit Event respectively. This was done strictly for dashboard visualization—and should really be a formatting option that can be chosen by flipping a switch on the dashboard. For now, you can select whether to display one or the other kind of metric (or both) as you configure your widgets.
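Producing such a string metric takes nothing more than a cast in the metric's map expression. A minimal sketch, assuming the metric ID Latest_Audit_Event (following the model's underscore naming convention); the actual model may additionally adjust the timezone:

xs:string( Latest_Audit_Event )
                 -- yields the XML Schema lexical form, for example 2009-12-10T19:14:39.753Z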

Showing the xs-formatted strings displays more information (milliseconds and the timezone) and eliminates blanks, which avoids wraparound effects that tend to inflate the row height in tabular views. When setting up your dashboard, pick the non-xs versions for a more user-friendly format, which shows timestamps in the local timezone and suppresses milliseconds—or the xs-formatted versions for more detailed, standards-compliant output and potentially more rows per screen in the tabular views. The basic dashboard shipped with the global process monitor model uses default formatting, while the advanced dashboard shows the xs-formatted fields.

For illustration, timestamps in both formats are displayed side by side in Figure 5. You see that with the default formatting (first column) no timezone information is shown, so that you have to know where the screen shot was taken in order to tell which point in time it represents. The first column also suggests that the task took 1 second to get from the inactive state to the ready state, but the second column shows that it took only 0.187 seconds.

Figure 5. Standard and xs-formatted timestamps


Monitoring correlation sets

Correlation sets are used by Process Server to correlate incoming process events—which are one-way messages directed at a process and not to be confused with audit events—with running executions. While they could be seen as technical constructs that are of little interest to business observers, they sometimes contain important business-level keys, like order numbers or product identifiers. Hence, correlation sets are captured by the global process monitor as long as the pertinent audit events (event type 42027) have been enabled.

However, Process Server audit events report correlation sets in a hex-binary format, which must be converted to human readable form. Here is a sample event:

<CommonBaseEvent creationTime="2009-12-10T19:14:39.753Z" 
  extensionName="BPC.BFM.PROCESS.CORREL" sequenceNumber="15" severity="10" 
  globalInstanceId="CE9897A484FF32ACCBA1DEE5C0438D9F90" version="1.0.1">

  [...]

  <wbi:event xmlns:wbi="http://www.ibm.com/xmlns/prod/websphere/monitoring/6.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:bpc="http://www.ibm.com/xmlns/prod/websphere/scdl/business-process/6.0.0">
    <wbi:eventHeaderData>
      <wbi:WBISESSION_ID>
        9.58.26.115;OrderProcessing;;processCustomerOrder;1260472478722;1478383646
      </wbi:WBISESSION_ID>
      <wbi:ECSCurrentID>_PI:90030125.7a03ac70.8ce5c5f6.65020000</wbi:ECSCurrentID>
      <wbi:ECSParentID>
        9.58.26.115;OrderProcessing;
        sca/dynamic/reference;;processCustomerOrder;1260472478722;1478383646
      </wbi:ECSParentID>
      <wbi:WBIEventVersion>6.1</wbi:WBIEventVersion>
    </wbi:eventHeaderData>
    <wbi:eventPointData xsi:type="bpc:BPC.BFM.PROCESS.CORREL">
      <wbi:eventNature>CORRELATION</wbi:eventNature>
      <wbi:payloadType>full</wbi:payloadType>
      <bpc:BPCEventCode>42027</bpc:BPCEventCode>
      <bpc:processTemplateName>ProcessCustomerOrder</bpc:processTemplateName>
      <bpc:processTemplateValidFrom>
        Tue 2009-11-10 20:00:00.000
      </bpc:processTemplateValidFrom>
      <bpc:eventProgressCounter>0</bpc:eventProgressCounter>
      <bpc:eventLocalCounter>5</bpc:eventLocalCounter>
      <bpc:processInstanceExecutionState>
        2 - STATE_RUNNING
      </bpc:processInstanceExecutionState>
      <bpc:processTemplateId>
        _PT:90010124.dfa293ac.8ce5c5f6.ed3f02f9
      </bpc:processTemplateId>
      <bpc:principal>admin</bpc:principal>
      <bpc:correlationSet>
        ACED0005757200025B42ACF317F8060854E00200007870000000BD3C3F786D6C20766572
        73696F6E3D22312E30223F3E3C636F7272656C6174696F6E536574206E616D653D226F72
        6465724964656E746966696572223E3C70726F7065727479206E616D653D226F72646572
        44756544617465222076616C75653D22546875204465632032342030303A30303A303020
        4553542032303039222F3E3C70726F7065727479206E616D653D226F726465724E756D62
        6572222076616C75653D223132333435222F3E3C2F636F7272656C6174696F6E5365743E
      </bpc:correlationSet>
    </wbi:eventPointData>
  </wbi:event>
</CommonBaseEvent>

And here is the content of the <bpc:correlationSet> element after decoding and prefixing it by the event timestamp, which is what's shown on the dashboard:

2009-12-10_14:14:39.753_GMT-05:00
<correlationSet name="orderIdentifier">
<property name="orderDueDate" value="Thu Dec 24 00:00:00 EST 2009" />
<property name="orderNumber" value="12345" />
</correlationSet>

It may be instructive to look at the map expression for the Correlation Sets metric:

if 
  ( State_Changed/eventPointData/bpc:BPCEventCode = '42027' ) 
then 
  fn:substring( 
    fn:concat( 
      fct:format-dateTime-value( 
        xs:string( fn:adjust-dateTime-to-timezone( Aux_Last_Event_Timestamp ) ) ), 
      '<br>', 
      wbm:escape-special-characters( 
        fct:get-correlation-set-info( State_Changed/correlationSet ) ), 
      '<br>', 
      Correlation_Sets 
    ), 1, 4000
  ) 
else 
  Correlation_Sets

First, you see an if-then-else expression, which ensures that only PROCESS_CORRELATION_SET_INITIALIZED events (event code 42027) can update the Correlation Sets metric. When such an event is received by the State Changed subscription, the Correlation Sets metric will be prepended by: (1) the event timestamp formatted for the local timezone of the Monitor server, followed by (2) a line break, followed by (3) the formatted correlation set information, and finally (4) another line break. The substring function limits the length of the result to 4000 characters (the maximum size of a string metric) in order to avoid a runtime exception if the combined length of all correlation sets exceeds that limit, which should be rare.

If more than one correlation set is initialized for a process execution, this metric shows them all, each set preceded by its initialization timestamp. For other metrics containing timestamped sequences, like State History, Escalation History, and so on, there are child MCs to hold the timestamped values. For correlation sets, this was not deemed necessary because, in the majority of cases, there will be just one, or a very small number. Needless to say, you can add these child MCs if you want, following the pattern used for State History, Escalation History, and so on, which is discussed in Monitoring step life cycles.

The user-defined XPath function that performs the hex conversion is shown below, as a coding example for user-defined functions. The remaining user-defined XPath functions used by the global process monitor model are discussed in User-defined XPath functions.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The @XPathFunction annotation used below is provided by the
// WebSphere Business Monitor development toolkit.

public class FormattingUtils {
  public static final String COPYRIGHT="Copyright IBM Corporation 2009."; 
  
  private static final String NAMESPACE_NAME = 
    "http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/functions" 
    + "/Global_Process_Monitor";

  private static final String hexStringRegex = "[0-9A-Fa-f]+";
  private static final Pattern hexStringPattern = Pattern.compile(hexStringRegex);  
  private static final String correlationSetStart = "<correlationSet name=";
  private static final String correlationSetEnd = "</correlationSet>";
    
  /**
   * Convert a hex string representing a byte array into the equivalent 
   * {@link java.lang.String}, and extract the substring starting with 
   * "<correlationSet name=" and ending with "</correlationSet>". 
   * If the input is not a hex string, or (after conversion) does not contain 
   * such a substring, null is returned.
   * <p>
   * @param input a hex string representing a byte array.
   * <p>
   * @return the substring starting with "<correlationSet name=" 
   * and ending with "</correlationSet>" after converting the hex 
   * string passed via input to a {@link java.lang.String}, or 
   * null if the input is not a hex string or no such substring was found. 
   */
  @XPathFunction(
      namespaceName = FormattingUtils.NAMESPACE_NAME,
      localName = "get-correlation-set-info",
      description = "Extract a string representation of the correlation set " +
        "information carried in a PROCESS_CORRELATION_SET_INITIALIZED event issued by " +
        "WebSphere Process Server (event code 42027).",
      isSelfContained = true,
      isDeterministic = true
      )
      public static String getCorrelationSetInfo(final String input) {
        
    if (input == null) 
      return null;
        
    Matcher hexStringMatcher = FormattingUtils.hexStringPattern.matcher(input);
    if (!hexStringMatcher.matches()) 
      return null;
        
    final byte[] bytes = new byte[input.length() / 2];
    for (int i = 0; i < bytes.length; i++)
      bytes[i] = (byte) Integer.parseInt(input.substring(2*i, 2*i+2), 16);
        
    final String result = new String(bytes);
    final int start = result.indexOf(FormattingUtils.correlationSetStart);
    final int end = result.lastIndexOf(FormattingUtils.correlationSetEnd);
        
    if (start < 0 || end < 0)
      return null;

    return 
      result.substring(start, end + FormattingUtils.correlationSetEnd.length())
        .replaceAll("><", ">\r\n<");
  }
}

The XML-formatted correlation sets are transmitted in the audit events without any formatting white space. The fct:get-correlation-set-info() function performs some minimal formatting by inserting a CRLF sequence between adjacent element tags. This is done via string substitution, without properly parsing the XML, and certainly can be improved (with the current implementation, a CRLF sequence will also be inserted between adjacent '>' and '<' characters in a correlation set value, which should, however, be rare). The result of the user-defined function is passed into wbm:escape-special-characters(), which substitutes the CRLF sequences by a <br/> tag, so that new lines indeed appear in a Web browser. Quite like the date / time / duration formatting options discussed in Date, time and duration formatting, this should really be a formatting option that users can choose on a per-column basis in the Instances widget. For now, it is handled by monitor model logic.

Tip: Showing XML on the dashboard

Use the built-in function wbm:escape-special-characters() to format any string containing XML for dashboard display.

Note that XML special characters must be escaped in order to be visible on the dashboard; the built-in function wbm:escape-special-characters() was provided for that purpose. Without such escaping, unknown XML tags are ignored by most Web browsers and will simply be invisible.

Monitoring execution state

As explained earlier, the global process monitor model defines relatively few event subscriptions for a large variety of process and human task events. As shown in Table 2, a Process Execution Step MC has essentially one event subscription, State Changed, which subscribes to all events reporting state changes of running process steps. For most of the downstream processing—triggers, counters, maps, and so on—this is more convenient than having separate subscriptions for each kind of state change that may be reported; as you can see in Table 2, thirty-four different event codes are recognized. For example, metrics like Time Started, Time Ended, Name, and so forth can be updated by just one simple map, because the event timestamp, the step name, and so on are in the same location for each kind of event. With one subscription for all state change events, one map will do instead of thirty-four, as the sketch below illustrates.
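Here is a sketch of such a map, maintaining a latest-event timestamp across all thirty-four event kinds. It assumes a hypothetical creationTime event part bound to the cbe:CommonBaseEvent/@creationTime attribute and a metric ID Latest_Event; the actual model's part and metric names may differ:

if ( fn:empty( Latest_Event )
     or xs:dateTime( State_Changed/creationTime ) > Latest_Event )
then xs:dateTime( State_Changed/creationTime )
else Latest_Event
                 -- one map suffices, because every state-change event carries
                 -- its timestamp in the same place; the comparison also keeps
                 -- the metric correct when events arrive out of order
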

An exception is the map for the State metric (more precisely, for the metric Aux Last State Reported, from which the State metric is derived), which is anything but simple. For reasons explained shortly, the bpc:state event field is not simply copied; instead, predefined state strings are set explicitly. Here is the map expression:

if ( State_Changed/eventPointData/bpc:BPCEventCode = '21006' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '21021' ) then 'READY' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42020' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '21007' and 
     State_Changed/eventPointData/bpc:state = '3 - STATE_RUNNING' ) then 'RUNNING' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21011' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42026' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42036' ) then 'FINISHED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42005' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42021' ) then 'SKIPPED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21080' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42022' ) then 'FAILED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42023' ) then 'FAILING' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21027' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42024' ) then 'TERMINATED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21022' ) then ( 
     if ( fn:exists( State_Changed/username ) ) then 
     fn:concat( 'CLAIMED by ', State_Changed/username ) else 'CLAIMED' ) else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21007' and 
     State_Changed/eventPointData/bpc:state = '11 - STATE_WAITING' ) then 'WAITING' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '21081' ) then 'EXPIRED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42015' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42066' ) then 'STOPPED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42043' ) then 'COMPENSATING' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42044' ) then 'COMPENSATED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42045' ) then 'COMPFAILED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42031' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42068' ) and 
     State_Changed/eventPointData/bpc:state = '1 - STATE_INACTIVE' ) 
   then 'INACTIVE_FRETRIED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42031' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42068' ) and 
     State_Changed/eventPointData/bpc:state = '2 - STATE_READY' ) 
   then 'READY_FRETRIED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42031' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42068' ) and 
     State_Changed/eventPointData/bpc:state = '3 - STATE_RUNNING' ) 
   then 'RUNNING_FRETRIED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42031' and 
     State_Changed/eventPointData/bpc:state = '11 - STATE_WAITING' ) 
   then 'WAITING_FRETRIED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42031' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42068' ) and 
     State_Changed/eventPointData/bpc:state = '13 - STATE_STOPPED' ) 
   then 'STOPPED_FRETRIED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42032' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42067' ) and 
     State_Changed/eventPointData/bpc:state = '5 - STATE_FINISHED' ) 
   then 'FINISHED_FCOMPLETED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42032' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42067' ) and 
     State_Changed/eventPointData/bpc:state = '4 - STATE_SKIPPED' ) 
   then 'SKIPPED_FCOMPLETED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42032' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42067' ) and 
     State_Changed/eventPointData/bpc:state = '6 - STATE_FAILED' ) 
   then 'FAILED_FCOMPLETED' else (
if ( ( State_Changed/eventPointData/bpc:BPCEventCode = '42032' or 
     State_Changed/eventPointData/bpc:BPCEventCode = '42067' ) and 
     State_Changed/eventPointData/bpc:state = '13 - STATE_STOPPED' ) 
   then 'STOPPED_FCOMPLETED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42037' ) 
   then 'LOOP_CONDITION_TRUE' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42038' ) 
   then 'LOOP_CONDITION_FALSE' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42057' ) 
   then fn:concat( 'FINISHED (', 
     xs:integer( State_Changed/branchesStarted ) + 1, ' branches)' ) else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42061' ) 
   then 'FINISHED_CONDTRUE' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42062' ) 
   then 'FINISHED_ALLCONDFALSE' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42064' ) 
   then 'SKIP_REQUESTED' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42065' ) 
   then 'SKIPPED_ON_REQUEST' else (
if ( State_Changed/eventPointData/bpc:BPCEventCode = '42070' ) 
   then 'SKIPPED_ON_EXIT_CONDITION_TRUE' else (
if ( fn:contains( '42039 42040 42050 42054 42055 ', 
     fn:concat( State_Changed/eventPointData/bpc:BPCEventCode , ' ' ) ) and 
     State_Changed/eventPointData/bpc:state = '1 - STATE_INACTIVE' ) 
   then 'INACTIVE' else Aux_Last_State_Reported
)))))))))))))))))))))))))))))))

It is basically a large case statement, which sets the state of a step execution based on several fields in a received event. This was done for a number of reasons:

  • To show more user-friendly state designations.

    For example, WAITING instead of 11 - STATE_WAITING as reported in the audit events.

  • To show more detail about the state.

    For example: CLAIMED by John for a task assigned to John; or FINISHED (5 branches) for a for-each that spawned 5 parallel threads of execution.

  • To show more detail about the state's history.

    For example, WAITING_FRETRIED means that the activity has been forcibly retried and as a result is in the waiting state. This is more information than just reporting that the activity is in the WAITING state, which could have been reached by other means. It is also more information than reporting that the activity was forcibly retried, which can produce different states as a result.

  • To allow the same states to be shown for other process engines.

    Their audit events may not have a 'state' field, and even if they do, they could use different state names.

This map and similar maps for the states of human task executions and process executions are the most complex expressions in this monitor model. As discussed earlier, using different subscriptions for different kinds of events would have chopped it into 34 smaller maps, but the current approach still minimizes overall complexity: not only would the many event subscriptions and the many resulting maps have cluttered the monitor model, but it is also easier to view and edit one large expression in a text editor than to click through many small expressions in order to make a general change or get an overview.

Monitoring step life cycles

In this section we'll discuss the logic behind the State and State History metrics of a Process Execution Step monitoring context. The State and State History metrics of the Step Execution and Task Execution MCs use the same logic, and similar logic is also used for the Escalation History, Potential Owners History, and Work Item Update History metrics, which occur in those same monitoring contexts.

Here are the maps that populate the State Timestamp, State, and State History metrics, as well as the auxiliary metrics they depend on:

Aux_Last_Event_Timestamp
  <- State_Changed/predefinedData/creationTime

Aux_Last_State_Reported  
  <- if ( State_Changed/eventPointData/bpc:BPCEventCode = '21006' or 
          State_Changed/eventPointData/bpc:BPCEventCode = '21021' ) 
          then 'READY' else ( .......... )

State_Timestamp          
  <- if ( fn:empty( State_Timestamp ) or Aux_Last_Event_Timestamp >= State_Timestamp ) 
          then Aux_Last_Event_Timestamp else State_Timestamp

State                    
  <- if ( fn:empty( State_Timestamp ) or Aux_Last_Event_Timestamp >= State_Timestamp ) 
          then Aux_Last_State_Reported else State

When the Aux_Reported_State_Changed trigger fires:

State_History            
  <- fn:substring( 
          fn:concat( 
               Aux_Last_Event_Timestamp_xs_format, 
               '  ', 
               Aux_Last_State_Reported, 
               '<br />', 
               State_History 
          ),1, 4000
     )

When a State Changed event arrives, the two auxiliary metrics are updated directly. (The full expression for the Aux Last State Reported metric is shown in Monitoring Execution State.) They are copied to the State Timestamp and State metrics, respectively, when the event timestamp is not earlier than State Timestamp, or when the latter has never been set.

The reason for this logic is to cope with State Change events that arrive out of order. Since Monitor has built-in logic to ensure the proper order for events from Process Server, this is not necessary for monitoring BPEL processes running on Process Server. However, as event sources are added that may not provide an event sequence identifier, and to demonstrate the general technique, this logic was left in the model.

A careful reader may have noticed that the map expressions for State Timestamp and State can be simplified, using the advice in Testing fields that may be empty:

State_Timestamp          
  <- if ( fn:not( Aux_Last_Event_Timestamp < State_Timestamp ) )
     then Aux_Last_Event_Timestamp else State_Timestamp

State                    
  <- if ( fn:not( Aux_Last_Event_Timestamp < State_Timestamp ) )
     then Aux_Last_State_Reported else State

But in order to guard against the current problems with general comparisons in the presence of empty arguments, the tests for empty State Timestamp metrics were made explicit.
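
As a minimal sketch of this guarded-update technique outside the monitor model, consider the following Python fragment. The names are invented for illustration, and timestamps are compared as strings, which is valid for xs:dateTime values that share a timezone and format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class StepState:
    state: Optional[str] = None            # the State metric
    state_timestamp: Optional[str] = None  # the State Timestamp metric

    def on_state_changed(self, reported_state: str, event_timestamp: str) -> None:
        # Copy the reported state only if the event is not older than the
        # state already recorded (or if no state has been recorded yet).
        if self.state_timestamp is None or event_timestamp >= self.state_timestamp:
            self.state = reported_state
            self.state_timestamp = event_timestamp

step = StepState()
step.on_state_changed("RUNNING", "2009-12-15T10:00:02Z")
step.on_state_changed("READY", "2009-12-15T10:00:01Z")  # late event: ignored
assert step.state == "RUNNING"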

Updates of the State History metric are gated by the Aux Reported State Changed trigger, which is evaluated for every State Changed event received, and whose condition is

Aux_Last_State_Reported != Aux_Saved_Last_State_Reported or
  fn:starts-with( Aux_Last_State_Reported, 'LOOP_CONDITION' )

The trigger was introduced to prevent extensions of the state history when consecutive State Changed events report the same state. An exception was made for LOOP_CONDITION_TRUE / LOOP_CONDITION_FALSE, because recording the repeated true / false evaluations at the beginning / end of a do-while or while-do loop makes sense. When the trigger fires, the State History metric is prepended with the XML Schema formatted timestamp, the last state that was reported, and a line break tag. The fn:substring() function limits the length of the result to 4000 characters in order to avoid an exception when the maximum length for string metrics is exceeded. That should be rare, but loops with a large number of iterations might cause this condition.
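
Here is a compact Python sketch of this gating and truncation logic. The function and parameter names are assumed; the monitor model expresses the same behavior with the trigger condition and the fn:substring() map shown above.

def on_state_reported(history: str, saved_state: str, state: str, ts: str):
    # Aux Reported State Changed: extend the history only when the reported
    # state differs from the last one recorded, except for loop-condition
    # states, whose repetitions are recorded on purpose.
    if state != saved_state or state.startswith("LOOP_CONDITION"):
        history = (ts + "  " + state + "<br />" + history)[:4000]  # 4000-char cap
    return history, state  # the returned state becomes the saved copy

history, saved = "", ""
history, saved = on_state_reported(history, saved, "READY", "2009-12-15T09:00:00Z")
history, saved = on_state_reported(history, saved, "READY", "2009-12-15T09:00:01Z")  # no change
history, saved = on_state_reported(history, saved, "CLAIMED", "2009-12-15T09:05:00Z")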

A snapshot of the State History metric of a Task Execution MC, which shows the same state history as Figure 5, is shown in Figure 6.

Figure 6. Showing the State History metric in a Task Execution monitoring context
Showing the State History metric in a Task Execution monitoring context

You can clearly see how this metric causes an excessive row height, which is one reason why these timestamped histories were moved to child monitoring contexts. (Being able to capture the state histories of long-running loops is another reason.) You see the drill-down link that leads to those child MCs in front of the State History metric; drilling down leads you to the view shown in Figure 5. State History and other metrics of this kind (Escalation History, Potential Owners History, Work Item Update History) are not shown in the dashboard configurations that are shipped with the global process monitor, but they do exist in the model. Feel free to use them in the dashboards instead of the drill-down links, to provide an all-on-one-page view of a process step's execution state. In addition, the State History metric is used in the calculation of waiting and working durations of human tasks, as we'll discuss in Gathering execution statistics.

The child MCs holding timestamped state records are created by sending an event out from the Step Execution MC whenever its State metric changes, and receiving it back in the same monitor model. The outbound event definition (of a Task Execution MC) and the corresponding inbound event definition (of a Task Execution State History child MC) are shown in Figure 7 and Figure 8, respectively.

Figure 7. Outbound event definition for task state reporting
Outbound event definition for task state reporting

Figure 7 shows that the payload of these events consists of four metrics:

  • The task definition identifier, which is the top-level MC key
  • The task execution identifier, which is the second-level MC key
  • The timestamp of the last event, which will become the third-level MC key
  • The task execution state, which is the real payload data carried by the event

The filter of the event subscription in the child MC (see Figure 8) checks for the existence of a task execution state field in the payload. It defines the same event parts as its counterpart outbound event definition, except for the event part called emptyString, which we'll discuss shortly.

Figure 8. Inbound event definition for task state reporting
Inbound event definition for task state reporting

Looking at the correlation predicate in Figure 8, you see that the first two comparisons compare the grandparent and parent MC keys with the corresponding fields in the event payload, as required for MC-creating subscriptions. The last comparison will always evaluate to false, but satisfies the formal criteria for correlation predicates, namely comparing a metric in the target MC (Time) with an expression that depends on the incoming event. The emptyString field never exists in an incoming event, and the cast to a dummy xs:dateTime value was done to safeguard against any problems with empty fields in correlation predicates. There are certainly other ways to construct a comparison that satisfies these formal requirements but always yields false; the expression in Figure 8 is just one example. The Time and State metrics you see in Figure 5 are set from the corresponding fields in the event, and the Time (xs format) metric is simply reformatted from Time, using the following expression:

fct:format-dateTime-value( xs:string( fn:adjust-dateTime-to-timezone( Time ) ) )

The standard function fn:adjust-dateTime-to-timezone(), when called with a single xs:dateTime argument, adjusts that to the local timezone of the server on which it runs. The user-defined function fct:format-dateTime-value() then formats the timezone-adjusted timestamp as yyyy-mm-dd_hh:mm:ss.sss_GMT[+|-]hh:mm. You can see results of that in Figure 5 and Figure 6.
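
For illustration, here is a small Python sketch that produces the same kind of output from an xs:dateTime string. It mimics the documented behavior of the two functions; the actual implementations inside Monitor may differ in detail.

from datetime import datetime

def format_dateTime_value(xs_dateTime: str) -> str:
    # Parse the xs:dateTime string and adjust it to the local timezone, as
    # fn:adjust-dateTime-to-timezone() does when called with one argument.
    dt = datetime.fromisoformat(xs_dateTime.replace("Z", "+00:00")).astimezone()
    offset = dt.strftime("%z")  # for example, +0100
    return (dt.strftime("%Y-%m-%d_%H:%M:%S.")
            + format(dt.microsecond // 1000, "03d")
            + "_GMT" + offset[:3] + ":" + offset[3:])

print(format_dateTime_value("2009-12-15T09:30:01.250Z"))
# for example: 2009-12-15_10:30:01.250_GMT+01:00 on a CET server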

Monitoring process versions

As you've seen in Table 2, the process definition version (template version) is not used in the correlation expressions identifying a Process Definition MC. Hence, if various versions of a process are deployed on the same server over time, events originating from those versions are received in the MC sub-tree under the same Process Definition MC and contribute to the same set of statistics. The Version and Version No Blanks metrics in the Process Definition MCs always contain the version reported by the last event (that is, the version currently running). However, all versions of a process definition are recorded by Process Version child monitoring contexts, whose dashboard rendering may look like Figure 9.

Figure 9. Showing process definition versions on the dashboard
Showing process definition versions on the dashboard

The process definition name in Figure 9 is redundant (it can be looked up in the parent), but is replicated for convenience. The version is shown with and without blanks, but the no-blanks format is used in most other MC instance views to avoid the row height inflation that occurs when those strings wrap around onto several lines. Finally, the timestamps of the earliest and latest audit event received from a process version are shown, demarcating the time span during which it was active.

The process template versions are recorded by an event subscription of the Process Version MCs, which is shown in Figure 10.

Figure 10. Detecting process definition versions
Detecting process definition versions

The Detected event subscription receives any event from a BPEL process that carries a process definition name (processTemplateName) and a version identifier (processTemplateValidFrom). The correlation predicate references the process definition name and server (if present) in the parent context, and compares the version identifier in the event with the one in the Process Version child MC. If no match is found, that is, if this is the first event reporting a new process version, a new child MC is created. Otherwise, the event is delivered to the already existing MC for this version, and used to update the Earliest Audit Event and Latest Audit Event timestamps.

Monitoring process migrations

Starting with Version 7, Process Server allows migration of running process executions from one process definition version to another. There are events associated with such migrations, which the global process monitor receives and uses to report these migrations. There are two MC definitions related to process instance migrations: Process Migration and Process Execution Migration. The first is instantiated when any migration is detected between two versions of a process. It records the fact that "one or more running executions have been migrated from process definition version A to version B". The second is instantiated for each individual migration of a process execution. It records a fact like "running execution #123 has been migrated from process definition version A to version B."

Some sample screen shots are shown in Figure 11, Figure 12, and Figure 13. Each figure displays a single wide dashboard screen split across two rows. Figure 11 shows two Process Migration MCs, which record the fact that one or more executions have been migrated between the following versions of CheckItemAvailabilityAndPrice:

Tue_2009-12-15_00:00:01.000 -> Tue_2009-12-15_00:00:02.000
Tue_2009-12-15_00:00:02.000 -> Wed_2009-12-16_00:00:00.000

You can also see the timestamps of the earliest and latest migration start and end.

Figure 11. Process definition versions between which migrations have occurred
Process definition versions between which migrations have occurred

Figure 12 shows a snapshot of five Process Execution MCs of migrated process executions. The version shown in the second column, Wed_2009-12-16_00:00:00.000, is the version of the migration target. The Process Execution Migration column shows that these executions were migrated. Follow the drill-down links to see detailed information about each migration that has happened for an execution.

Figure 12. Process executions after migration
Process executions after migration

Figure 13 shows the result of drilling down into Process Execution Migration MCs via the links in Figure 12. You see the name of the process definition, the old and the new version, the old and the new process definition identifier, as well as the time interval during which the migration happened. By drilling up to the Process Execution level and then down again to the Process Execution Steps, you can find out where the execution was at that point.

Figure 13. A single process execution migration record
A single process execution migration record

In summary, process execution migrations are tracked by the global process monitor as follows:

  • A new version of a process definition is always recorded in a Process Version MC, subordinate to the Process Definition MC. This happens regardless of whether executions are migrated to the new version.
  • If one or more executions are migrated between two versions of a process definition, this is recorded in a Process Migration child of the Process Definition MC. The child MC will show between which versions the migrations occurred, and the time span during which they happened.
  • Each migration that happens for a process execution is recorded individually in a Process Execution Migration MC, which is subordinate to the migrated execution's MC. Note that the migrated execution will continue to be tracked by the same Process Execution monitoring context. The migration record will show the process definition versions between which the migration took place, and the time interval during which it happened.
  • Steps that were added or omitted, and any other logic changes that come with a new process version, are recorded automatically as the migrated executions start to emit events. This is simply a consequence of the general fact that the global process monitor detects process structure dynamically at runtime.

The logic of the Process Migration and Process Execution Migration MCs is fairly basic: the events reporting the start or completion of a migration are received, and used to create or update their target MC. One noteworthy fact, perhaps, is that the correlation key of a Process Migration or Process Execution Migration monitoring context consists of a pair of process definition identifiers, representing the "from" and the "to" version of the process definition.

Monitoring process events

BPEL processes can define event handlers, which can be associated with a process as a whole or with a scope, and which are activated by incoming process events. Process events are one-way messages directed at a process execution; they are not to be confused with audit events, which process executions emit. The global process monitor tracks process events using Process Execution Event MCs, each of which represents a process event received by the process execution corresponding to its parent MC.

The event subscription for the audit events that report the arrival of a process event at a running execution is shown in Figure 14. Only one audit event is expected per process event received, as you can see from the correlation settings for zero|one|multiple matches: create|ignore|error. The MC reporting the process event is keyed by the event timestamp, which assumes that a process execution will not report the receipt of two process events at exactly the same time.

Figure 14. Subscription for audit events reporting process events
Subscription for audit events reporting process events

There are different subscriptions for the audit events received from process-level event handlers and scope-level event handlers. This is because the audit events emitted by the two kinds of event handlers are quite different. In fact, they are so different that one might almost define two kinds of MCs, one for process-level event handlers and one for scope-level event handlers. Table 5 summarizes the differences.

Table 5. Content of audit events sent by process-level and scope-level event handlers

Content                                            | Location in audit event from process-level event handler (PROCESS_EVENT_RECEIVED / event code 42047) | Location in audit event from scope-level event handler (SCOPE_EVENT_RECEIVED / event code 42048)
Message (business payload)                         | cbe:CommonBaseEvent/wbi:event/wbi:applicationData/wbi:content          | n/a
WSDL operation                                     | cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:operation         | cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:operation
WSDL port type name                                | cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:portTypeName      | n/a
WSDL port type namespace                           | cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:portTypeNamespace | n/a
Scope to which receiving event handler is attached | n/a                                                                    | cbe:CommonBaseEvent/wbi:event/wbi:eventPointData/bpc:activityTemplateName

It's not obvious why the different kinds of event handlers report the receipt of a process event using different audit event structures. By adding some of the data reported by the process-level audit events to the scope-level audit events, convergence should be possible.

Tracking invocation chains

The need for this feature emerged during an early review of the global process monitor and its dashboard. Someone asked how users could see where most of the time is spent. The goal thus became to show the invocation stack of a process execution, along with the times spent at each sub-process level. Once the time-intensive process executions have been identified, it should be straightforward to drill into their steps to spot the heavy-weight tasks. Figure 15 shows the dashboard view that was added for this purpose.

Figure 15. Visualizing sub-process invocations
Visualizing sub-process invocations

This dashboard view is just a different arrangement of the Instances view for Process Execution MCs. Here, the executions are not filtered by process definition but instead grouped by initiating (top-level) process execution, and within each group are sorted by start time. You can see two top-level executions in Figure 15, with identifiers PI:90030125.9d7f4954.90e5c5f6.3c30ce6 and PI:90030125.9fc856e3.90e5c5f6.7cbc0031. They are marked by having invocation depth 0 and no indentation of the process definition name. The first is an execution of ProcessCustomerOrder and the second an execution of the recursive process DoItAgain. Following the top-level executions are executions of their first-level sub-processes, with invocation depth 1. You see that the ProcessCustomerOrder execution called CheckItemAvailabilityAndPrice five times, and then ProcessCustomerInvoice once. You also see that DoItAgain called itself recursively four times. The Indented Process Definition Name column holds the name of the process whose execution is shown in that row, but indented according to the invocation depth. The example of DoItAgain shows that the same process definition can have executions at different levels; therefore, being a top-level process or a first-level, second-level, and so on, sub-process is not a property of a process definition.

From the total duration, waiting duration, and working duration columns in Figure 15 you see that most of the time of the ProcessCustomerOrder execution was spent in the top-level process, not in any of its sub-processes. You would thus dig into that execution to look for some long running tasks. The durations also reveal that most of the time spent in the sub-processes was due to waiting for a human task to be claimed by a user; once claimed, it was typically completed within a few seconds. Figure 15 also shows that the DoItAgain process did not call any human tasks, as the waiting and working durations are zero at every level.

Clearly, knowing the initiator identifier and the invocation depth of each process execution is key to enabling this view. But how are these determined? The audit events sent by process executions do not contain such information. In fact, a process execution running ten invocation levels deep does not know anything about that: it neither knows that there are ten higher level executions waiting for it to return, nor does it know anything about the process execution at the top of the stack. This is where the global vantage point of Monitor can be used to add information that individual executions do not have.

Let's look at how this is accomplished. In a nutshell: the monitoring context for a process execution that calls sub-processes collects all of their execution identifiers in an XML structure, which is held in a string metric. The sub-process execution identifiers are gleaned from any process-level events that these sub-processes emit. When a new sub-process execution becomes known in the monitoring context of its caller, that monitoring context sends an internal event to the monitoring context of the sub-process execution, communicating its own initiator identifier and invocation depth. The sub-process execution MC adds 1 to the invocation depth, and adopts the same initiator identifier. It is now prepared to do the same for its sub-processes, as soon as those have started and communicated their execution IDs. In summary, the invocation depth and initiator identifier are communicated down the stack, by using internal events flowing from a calling execution's MC to that of the called execution.

Figure 16. Monitor model logic in support of visualizing sub-process invocations
Monitor model logic in support of visualizing sub-process invocations

For a more detailed description of the underlying logic, refer to Figure 16, which shows the monitoring context of a calling process execution at the top, the context of the invocation step in the middle, and the context of the called process execution at the bottom. The only model elements shown are those that play a major part in the logic flow, described here with step numbers referencing Figure 16:

  1. The calling process execution's MC is created, and its Identifier is set, by the first audit event reporting the existence of that process execution.
  2. The Initiator Identifier is initialized with the process execution's own identifier (copied from Identifier).
  3. The Invocation Depth is set to 0.
  4. The process execution arrives at an invoke activity, for which a Process Execution Step child context is created. Its Identifier is initialized with the invoke activity's execution identifier. Nothing is known yet about the called execution, which is just about to start in Process Server. Only after it has started and begun to emit audit events can its execution identifier and other data be captured.
  5. The called process execution's MC is created and its Identifier is set, as it was for the caller's MC, from the first audit event reporting the existence of the sub-process execution.
  6. As for the caller's MC, the Initiator Identifier is initialized with the sub-process execution's own identifier (copied from Identifier).
  7. As for the caller's MC, the Invocation Depth is set to 0 (for now).
  8. The process-level events emitted by the sub-process execution carry the invoke activity's execution identifier in their ECSParentID field. This fact is used to receive them in the invoke activity's monitoring context via the Called Execution Exists subscription.
  9. The events received from the sub-process carry its execution identifier in the ECSCurrentID field; it is stored in the Called Execution Identifier metric.
  10. The Aux Add Called Execution Identifier trigger is evaluated. Its trigger condition tests whether the sub-process execution ID is not already part of the list of known sub-process execution identifiers in the Aux Called Execution Identifiers metric in the parent context. When the sub-process execution ID is reported for the first time it won't be found, and the trigger fires.
  11. The trigger drives a map of the Aux Called Execution Identifiers metric in the parent context, which appends the new sub-process execution identifier to the list of called execution identifiers that are already known. We have now reached the point described in the first sentence of the description above: "The monitoring context for a process execution that calls sub-processes collects all of their execution identifiers in an XML structure, which is held in a string metric." The sub-process execution IDs are stored in the Aux Called Execution Identifiers metric as a sequence of XML elements of the form <evt:called>called-execution-identifier</evt:called>.
  12. The Aux Add Called Execution Identifier trigger also drives a map of the Aux Called Executions metric, which contains a wbm:send-events() function. The wbm:send-events() function will only send out one event, with three payload elements: the called execution identifier; the caller's initiator identifier; and the caller's invocation depth. (The maps of Aux Called Executions containing these wbm:send-events() functions are shown below.)
  13. The internal event is received in the monitoring context of the called execution, via the Aux Report Initiator Identifier And Invocation Depth subscription. The called execution identifier is used as the correlation key.
  14. The called process execution's Initiator Identifier is overwritten from the incoming internal event.
  15. The called process execution's Invocation Depth is overwritten as well, by adding 1 to the depth reported by the incoming internal event.
  16. With the preceding step completed, the called process execution's monitoring context has the correct Initiator Identifier and Invocation Depth, and can pass them on to the MCs of its own sub-process executions once these come to life. But that assumes that these sub-sub-processes become known after the above 15 steps are completed, which one can't guarantee. The internal event that reports the correct initiator identifier and invocation depth could become held up behind events from sub-sub-processes that have already started. Those events would trigger the same procedure just described, between the MCs representing the lower-level executions, but communicate the sub-process execution's default settings before they were overwritten (Initiator Identifier = own identifier; Invocation Depth = 0). To account for this possibility, the incoming internal event (via another trigger, Aux Initiator Identifier And Invocation Depth Changed, which is not shown in Figure 16) fires another wbm:send-events() function, which broadcasts the updated Initiator Identifier and Invocation Depth values to any and all sub-sub-process execution MCs known at that point, by iterating over the list of execution IDs in Aux Called Execution Identifiers.

Here are the wbm:send-events() functions invoked in steps 12 and 16 in Figure 16. Note that the XML fragments passed via the first argument are constructed on the fly, using string concatenation.

When trigger Aux_Add_Called_Execution_Identifier fires:

wbm:send-events( 
    fn:concat( 
        '<evt:called xmlns:evt="... URI ...">',
        Process_Execution_Step/Called_Execution_Identifier, 
        '</evt:called>' ), 
    '/evt:called', 
    'fn:concat( 
        "<evt:called", $var1, ">", 
        $currentItem/text(), 
        "</evt:called><evt:main", $var1, ">", 
        $var2, 
        "</evt:main><evt:depth", $var1, ">", 
        $var3, 
        "</evt:depth>" )', 
    " xmlns:evt='... URI ...'", 
    Initiator_Identifier, 
    xs:string( Invocation_Depth ) 
)


When trigger Aux_Initiator_Identifier_And_Invocation_Depth_Changed fires:

wbm:send-events( 
    fn:concat( 
        '<evt:root xmlns:evt="... URI ...">', 
        Aux_Called_Execution_Identifiers, 
        '</evt:root>' ), 
    '/evt:root/evt:called', 
    'fn:concat(
        "<evt:called", $var1, ">", 
        $currentItem/text(), 
        "</evt:called><evt:main", $var1, ">", 
        $var2, 
        "</evt:main><evt:depth", $var1, ">", 
        $var3, 
        "</evt:depth>" )', 
    " xmlns:evt='... URI ...'", 
    Initiator_Identifier, 
    xs:string( Invocation_Depth ) 
)


In both expressions, "... URI ..." stands for the namespace URI

http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/events/Global_Process_Monitor

The XPath expression of the first invocation (/evt:called) will always produce a result of length 1, and a single outbound event will be sent. The XPath expression of the second invocation (/evt:root/evt:called) will produce exactly the sequence of <evt:called> elements currently stored in the Aux Called Execution Identifiers metric. It can have zero or more items, and the corresponding number of outbound events will be sent: one for each sub-process execution that is known at that point.
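
To see the fan-out behavior in isolation, here is a toy Python model of these two invocations: parse the XML fragment from the first argument, select the items that the XPath in the second argument would match, and render one event payload per item, as the fn:concat() template in the third argument does. The function signature is simplified and the sample identifiers are invented; this is not the product's wbm:send-events().

import xml.etree.ElementTree as ET

EVT_NS = "http://www.ibm.com/xmlns/prod/websphere/monitoring/6.2.0/events/Global_Process_Monitor"
CALLED = "{" + EVT_NS + "}called"

def send_events(xml_fragment: str, initiator_id: str, depth: int) -> list:
    # The fragment root is <evt:called> itself (first invocation) or an
    # <evt:root> wrapper whose children are selected (second invocation).
    root = ET.fromstring(xml_fragment)
    items = [root] if root.tag == CALLED else root.findall(CALLED)
    xmlns = ' xmlns:evt="' + EVT_NS + '"'
    # One outbound event payload per matched item.
    return ["<evt:called" + xmlns + ">" + item.text + "</evt:called>"
            + "<evt:main" + xmlns + ">" + initiator_id + "</evt:main>"
            + "<evt:depth" + xmlns + ">" + str(depth) + "</evt:depth>"
            for item in items]

fragment = ('<evt:root xmlns:evt="' + EVT_NS + '">'
            '<evt:called>PI:example-1</evt:called>'
            '<evt:called>PI:example-2</evt:called></evt:root>')
print(len(send_events(fragment, "PI:initiator", 2)))  # 2: one per known callee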

Tip: Sorting Instances views by more than one column

To sort the rows in an Instances widget by more than one column, define an auxiliary string metric as the concatenation of the metrics to be used for sorting. Pad numeric metrics with leading zeros, all others with trailing blanks, up to some fixed length for each participating metric. Concatenate the padded results to set the auxiliary metric. Do not show the auxiliary metric in the Instances view, but use it for sorting.

In closing this section, here are some remarks about another auxiliary metric of the Process Execution MCs defined in support of the invocation chain view in Figure 15: Aux Initiator Identifier And Time Started (xs format). This is needed because the Instances widget doesn't allow sorting rows based on multiple columns, but sorting by start time within initiator identifier was required to produce the view in Figure 15. To cope with this limitation, the auxiliary metric was set by padding the metrics to be used for sorting up to a constant length, and concatenating the results; the auxiliary metric was then used for sorting. This method can be generalized to any number of columns (see Sorting Instances Views by More Than One Column).
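
Here is a brief Python sketch of that padding technique, with invented field widths and sample values. A left-padded number or right-padded string makes lexicographic order on the concatenated key equal to the desired multi-column order.

def sort_key(initiator_id: str, time_started_xs: str) -> str:
    # Pad each participating column to a fixed width, then concatenate.
    # Strings get trailing blanks; a numeric column would get leading
    # zeros instead, for example str(depth).zfill(6).
    return initiator_id.ljust(64) + time_started_xs.ljust(32)

rows = [
    ("PI:example-2", "2009-12-16T10:02:11.000Z"),
    ("PI:example-1", "2009-12-15T09:00:05.000Z"),
    ("PI:example-1", "2009-12-15T08:59:59.000Z"),
]
rows.sort(key=lambda r: sort_key(*r))  # start time within initiator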

Gathering execution statistics

An important goal of the global process monitor model was to collect execution statistics, such as the minimum, maximum, or average duration, for each monitored process definition. The first choice for such aggregations in a monitor model is KPIs. However, while metrics live in monitoring contexts, which can be added dynamically as new processes are deployed and started, KPIs are predefined singletons. It's not possible to define a KPI like "average execution time" in a Process Definition monitoring context. From a monitor model architecture viewpoint, this is unfortunate; ideally, the distinction between metrics and KPIs would be dissolved by allowing metrics to be set via aggregate functions and to be displayed via KPI widgets. But for the time being, aggregate values that can have multiple instances must be defined using metrics.

If process executions were never restarted, aggregating their durations would be straightforward: whenever an execution ends, its duration would be added to a total, the number of completed executions would be incremented, the average recalculated as the quotient of these numbers, and the maximum and minimum would be adjusted if necessary. But when a process execution is restarted, all of those aggregates will have to be readjusted: a process duration that was added to a total must be subtracted again, the number of completed executions must be decremented, and the minimum or maximum may have to be set back to some prior value. A lot of the logic for calculating aggregates in the global process monitor serves to prepare for, and deal with, such "undo" scenarios.

Let's look at the logic for some of the aggregate metrics in a Process Definition MC in detail. The examples shown are representative of all aggregations in this monitor model, and metrics with similar logic can be found in Task Definition and Step Definition MCs. Figure 17 shows the five aggregates that are calculated for the total process duration (min, max, sum, avg, std-dev), as well as the model elements involved in enabling these calculations.

Figure 17. Monitor model logic in support of gathering process execution statistics
Monitor model logic in support of gathering process execution statistics

Here is an outline of how the total durations of process executions are aggregated, and readjusted when executions are restarted. The step numbers correspond to the numbers in Figure 17.

  1. The Boolean metric Has Ended is initialized to false.
  2. The Boolean metric Aux Has Ended At Least Once is initialized to false.
  3. The metric Aux Saved Total Duration is not initialized (remains empty).
  4. The metric Total Duration is continuously updated as events arrive from the monitored process execution. Here is its map expression:
    Total_Duration <- if ( fn:exists( Time_Started ) ) 
                      then ( if ( fn:exists( Time_Ended ) ) 
                             then Time_Ended - Time_Started 
                             else Aux_Last_Event_Timestamp - Time_Started ) 
                      else Total_Duration

    The map depends on the metrics Time Started and Time Ended (which are not shown in Figure 17). The metric Time Started holds the earliest timestamp of any event received from the monitored process execution. The metric Time Ended holds the timestamp of the event that reported the process execution's final state (switching Has Ended to true). After Time Ended is set, it is used to calculate the Total Duration, but up to that point the timestamp of the current event is used. Hence, while the process is running the metric Total Duration represents "duration for which it has been active". Here are the pertinent map expressions.
    Time_Started <- if ( fn:empty( Time_Started ) or 
                         Aux_Last_Event_Timestamp < Time_Started ) 
                    then Aux_Last_Event_Timestamp 
                    else Time_Started
    
    Time_Ended   <- if ( fn:not( Has_Ended ) ) 
                    then Aux_Empty_DateTime 
                    else ( if ( fn:exists( Time_Ended ) ) 
                           then Time_Ended 
                           else State_Timestamp )
  5. Aggregation starts when the Has Ended flag switches to true. Here is its map expression:
    Has_Ended <- State = 'FINISHED' or 
                 State = 'TERMINATED' or 
                 State = 'FAILED' or 
                 State = 'COMPENSATED' or 
                 State = 'COMPFAILED' or 
                 State = 'DELETED'

    Obviously, when the process execution is restarted, its state changes to a different value (such as 'RUNNING') and Has Ended becomes false again.
  6. The metric Aux Has Ended At Least Once becomes true, but unlike Has Ended, it will never become false again. Here is its map expression:
    Aux_Has_Ended_At_Least_Once <- Has_Ended or Aux_Has_Ended_At_Least_Once
  7. The Aux Has Ended trigger fires once whenever Has Ended changes from false to true. It is evaluated when Has Ended is updated, and its condition is the metric itself; the trigger is not repeatable, and therefore requires Has Ended to become false before it can fire again.
  8. The trigger Aux Has Ended increments the Finished counter in the Process Definition MC. This is the first effect the ended process execution has in the parent monitoring context.
  9. The trigger Aux Update Aggregate Durations is evaluated when Has Ended, Total Duration, Waiting Duration, or Working Duration change. Its trigger condition is
    Has_Ended and 
        fn:exists(Total_Duration) and 
        fn:exists(Waiting_Duration) and 
        fn:exists(Working_Duration) and 
        fn:not( Total_Duration = Aux_Saved_Total_Duration and 
                Waiting_Duration = Aux_Saved_Waiting_Duration and 
                Working_Duration = Aux_Saved_Working_Duration )

    Hence, it fires when a process execution has ended, its total, waiting and working durations are available, and different from their saved copies (values already used for aggregation). Unlike the Aux Has Ended trigger, which only fires once each time a process execution ends (including after restart), this trigger also fires when the total, waiting, or working durations are updated by some late-breaking event after the execution's state has been reported as final. The aggregate values are then adjusted (as you'll see shortly). This is one of the many ways in which this monitor model was made robust against events arriving out of order.
  10. Four of the Process Definition metrics in Figure 17 are updated immediately when Aux Update Aggregate Durations fires. Here are their map expressions:
    Aux_Candidate_Max_Total_Durations <- 
         fct:update-duration-list( 
              Aux_Candidate_Max_Total_Durations, 
              Aux_Empty_Duration, 
              fn:true(), 
              Process_Execution/Total_Duration, 
              Process_Execution/Aux_Saved_Total_Duration,
              4000 )
    
    Aux_Candidate_Min_Total_Durations <- 
         fct:update-duration-list( 
              Aux_Candidate_Min_Total_Durations, 
              Aux_Empty_Duration, 
              fn:false(), 
              Process_Execution/Total_Duration, 
              Process_Execution/Aux_Saved_Total_Duration, 
              4000 )
    
    Aux_Sum_Squares_Total_Duration    <- 
         Aux_Sum_Squares_Total_Duration 
         + ( Process_Execution/Total_Duration div xs:dayTimeDuration('PT1S') )
         * ( Process_Execution/Total_Duration div xs:dayTimeDuration('PT1S') )
         - ( if ( fn:exists(Process_Execution/Aux_Saved_Total_Duration) ) 
             then 
           ( Process_Execution/Aux_Saved_Total_Duration div xs:dayTimeDuration('PT1S') )
         * ( Process_Execution/Aux_Saved_Total_Duration div xs:dayTimeDuration('PT1S') )
             else 
             0.0 )
    
    Sum_Total_Duration                <- 
         Sum_Total_Duration 
              + Process_Execution/Total_Duration 
              - ( if ( fn:exists( Process_Execution/Aux_Saved_Total_Duration ) ) 
                  then 
                  Process_Execution/Aux_Saved_Total_Duration 
                  else 
                  xs:dayTimeDuration('PT0S') )

    Starting from the bottom of the list, you see that Sum Total Duration is updated by adding the new total duration of a process execution while subtracting any old value that had been saved. Aux Sum Squares Total Duration is calculated in the same fashion, except that the square of the total duration is added and the square of the prior value is subtracted (if any). The user-defined functions updating Aux Candidate Min Total Durations and Aux Candidate Max Total Durations essentially perform the same "remove-old-add-new" operations, but on a list of duration values: the fct:update-duration-list() function adds the duration passed via the fourth parameter to, and removes the duration passed via the fifth parameter from, the comma-separated list of durations that is passed via the first. It returns the modified list. See User-Defined XPath Functions for further details. (An illustrative sketch of this remove-old-add-new bookkeeping follows this step list.)

    Another four metrics in the parent MC are updated immediately after the four above, and below are their map expressions. These maps are not trigger-gated, and therefore will run whenever any of the metrics they depend on is updated. All should be self-explanatory. The user-defined functions fct:get-max-duration() and fct:get-min-duration() return the maximum / minimum duration value in the comma-separated list that is passed as an argument. Note that the method shown here for calculating minimum and maximum durations, using an auxiliary list of candidate min / max values, is indispensable in order to allow the adjustment of a minimum or maximum when a contributing value is revoked during process restart: Simply falling back to the previous value will fail when several contributing process executions are restarted (as several prior values might then have become invalid).
    Avg Total Duration    <- 
         if ( Finished > 0 ) 
         then Sum_Total_Duration div Finished 
         else Aux_Empty_Duration
    
    Max Total Duration    <- 
         xs:dayTimeDuration( fct:get-max-duration( Aux_Candidate_Max_Total_Durations ) )
    
    Min Total Duration    <- 
         xs:dayTimeDuration( fct:get-min-duration( Aux_Candidate_Min_Total_Durations ) )
    
    StdDev Total Duration <- 
         if ( Finished > 0 and
              fn:exists( Aux_Sum_Squares_Total_Duration ) and
              fn:exists( Avg_Total_Duration ) ) 
         then 
              xs:dayTimeDuration( 
                   fn:concat( 
                        "PT", 
                        fct:sqrt( Aux_Sum_Squares_Total_Duration div Finished - 
                             ( Avg_Total_Duration div xs:dayTimeDuration('PT1S') ) * 
                             ( Avg_Total_Duration div xs:dayTimeDuration('PT1S') ) 
                        ), 
                        "S" 
                   ) 
              ) 
         else 
              Aux_Empty_Duration

    This concludes the updates of all aggregate total durations in the Process Definition MC. One more step is necessary to prepare for a possible process execution restart: saving the total duration value that was used to update the statistics. The next two steps take care of that.
  11. After all updates driven by trigger Aux Update Aggregate Durations have been made, trigger Aux Save Durations fires.
  12. The metric Total Duration is copied to Aux Saved Total Duration.
  13. The trigger Aux Restarted fires when Has Ended switches from true to false. Here is its trigger condition: fn:not( Has_Ended ) and Aux_Has_Ended_At_Least_Once. The condition is evaluated whenever Has Ended changes, and the conjunction with Aux Has Ended At Least Once is necessary to prevent this trigger from firing initially, when Has Ended is set to false as the MC is created.
  14. The Finished counter is decremented by the Aux Restarted trigger, now again reflecting the correct number of finished executions (down by one).
  15. A series of updates is made in the parent MC, which undoes all updates of step 10: The saved total duration is removed from the lists of candidate min / max total durations (15a, 15b). The saved total duration and its square are subtracted from the metrics Sum Total Duration and Aux Sum Squares Total Duration, respectively (15c, 15d). After the four metrics and the counter they depend on have been set back to prior values, the Avg Total Duration, Max Total Duration, Min Total Duration, and StdDev Total Duration metrics are recalculated (15e through 15h), which completes the undo operation in the Process Definition MC.
  16. After the trigger Aux Restarted has caused all statistic values to be readjusted, the trigger Aux Clear Saved Durations fires.
  17. The metric Aux Saved Total Duration is set back to empty (its value before the process ended).

You see that the restart logic (initiated by Has Ended becoming false again, which fired Aux Restarted) has reestablished the state that existed before the process execution ended, with one exception: the metric Aux Has Ended At Least Once is not set back to false; but that, of course, is intentional.
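
As a compact summary of steps 8 through 17, here is an illustrative Python sketch of the remove-old-add-new bookkeeping, with durations simplified to seconds and a single candidate list standing in for the separate min / max candidate lists. The class and method names are invented; the monitor model implements this logic with the triggers and maps described above.

import math
from typing import List, Optional

class DurationStats:
    # Aggregates that can be revoked when a process execution restarts.

    def __init__(self) -> None:
        self.finished = 0                  # the Finished counter
        self.total = 0.0                   # Sum Total Duration, in seconds
        self.sum_squares = 0.0             # Aux Sum Squares Total Duration
        self.candidates: List[float] = []  # candidate min / max durations

    def contribute(self, duration: float, saved: Optional[float]) -> None:
        # Aux Update Aggregate Durations: add the new value; if a saved copy
        # exists (a late-breaking update), revoke the old contribution first.
        if saved is None:
            self.finished += 1             # Aux Has Ended fired
        else:
            self.total -= saved
            self.sum_squares -= saved * saved
            self.candidates.remove(saved)
        self.total += duration
        self.sum_squares += duration * duration
        self.candidates.append(duration)

    def revoke(self, saved: float) -> None:
        # Aux Restarted: undo the contribution of a restarted execution.
        self.finished -= 1
        self.total -= saved
        self.sum_squares -= saved * saved
        self.candidates.remove(saved)

    def avg(self) -> Optional[float]:
        return self.total / self.finished if self.finished else None

    def stddev(self) -> Optional[float]:
        if not self.finished:
            return None
        variance = self.sum_squares / self.finished - self.avg() ** 2
        return math.sqrt(max(0.0, variance))  # guard against rounding error

    def minimum(self) -> Optional[float]:
        # The candidate list makes the minimum and maximum recoverable even
        # after several contributions have been revoked.
        return min(self.candidates) if self.candidates else None

    def maximum(self) -> Optional[float]:
        return max(self.candidates) if self.candidates else None

stats = DurationStats()
stats.contribute(120.0, None)  # execution A ends after 120 seconds
stats.contribute(300.0, None)  # execution B ends after 300 seconds
stats.revoke(120.0)            # execution A is restarted
print(stats.finished, stats.avg(), stats.maximum())  # 1 300.0 300.0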

Tip: Aggregating metrics in an ancestor monitoring context

Aggregates of metric values can be calculated incrementally in a parent (or higher-level) monitoring context using the techniques discussed in this section. The logic is relatively simple if values, once they have contributed to the aggregation, never change. More complex logic is required when already-aggregated values may change or be deleted and therefore have to be removed from the aggregation. The global process monitor model exemplifies the full range of techniques that may be required.

The techniques shown in this section for calculating the count, minimum, maximum, sum, average, and standard deviation of the total duration of process executions can be applied generally to obtain aggregates of child metric values in a parent (or higher-level) monitoring context (see Aggregating metrics in an ancestor monitoring context).

The algorithms described above for the calculation of the sum, average, minimum, maximum, and standard deviation of the total duration of process executions are also used to aggregate waiting durations and working durations at the process definition level. Furthermore, the same techniques are used to aggregate the total, waiting, and working durations of process steps and human tasks at their definition levels—with the retry operation now playing the role that restart plays for process executions, and triggering corrective adjustments.

To put things into perspective, Figure 18 shows all of the downstream effects of the Aux Update Aggregate Durations trigger in a Process Execution MC (#9 in Figure 17). The model elements involved in updating the total duration, which are shown in Figure 17 and discussed above, are framed in red.

Figure 18. Downstream effects of the Aux Update Aggregate Durations trigger
Downstream effects of the Aux Update Aggregate Durations trigger

Counting child MCs

It's often useful to count the number of children of a monitoring context. For example, the number of steps of a process execution equals the number of Process Execution Step children of its Process Execution MC, and the number of its variables is equal to the number of Process Execution Variable children. Ideally, in order to get those counts one should be able to define a metric in a Process Execution MC with a map like

Variables <- fn:count( Process_Execution_Variable )

The current level of XPath 2.0 support in Monitor does not allow this; however, you can define a counter in a parent MC that is incremented each time a child is created. In deciding how to control this counter, two cases must be distinguished: If the subscriptions that cause the creation of a child MC receive at most one event, they can be used to increment the counter in the parent MC directly. But if those subscriptions receive further events after a child MC was created, an initialization trigger must be used instead: it has no condition, is not repeatable, and is linked to all of the child context-creating event subscriptions. Note that the linkage can be indirect, by making the trigger dependent on a metric that each creation event will update (for example, the MC key).

An example for an initialization trigger is shown in Figure 19. When the first event about a process execution variable is received by the Accessed subscription, a new Process Execution Variable context is created, the trigger Aux Detected is evaluated and, as it has no condition, it fires. Since the trigger is not repeatable, and its condition is always true, it will never fire again no matter how many further Accessed events are received. The trigger increments the Variables counter in the Process Execution parent monitoring context (not shown in Figure 19).

Figure 19. A monitoring context initialization trigger
A monitoring context initialization trigger

Tip: Counting child MCs

To count the number of children of a monitoring context, define a counter in that context which is incremented by all child-creating event subscriptions if those receive at most one event, or by a child MC initialization trigger otherwise. The initialization trigger has no condition, is not repeatable, and is linked from any child MC creating event subscription. As a result, it fires only once as a child MC is created. To count the currently active child MCs, the counter must be decremented by child MC termination triggers.

With the method just described, one counts the number of child MCs that have ever been initialized. To count the number currently active, any termination trigger of a child MC must decrement the counter. The tip Counting child MCs summarizes these procedures, and the same method can be used to count MCs at the grandchild, great-grandchild, great-great-grandchild, and so on, levels.
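
The following Python fragment sketches the semantics of such an initialization trigger, with invented class names (Monitor evaluates triggers declaratively, of course). The flag makes the trigger non-repeatable, so the parent counter is incremented exactly once per child, no matter how many further events the child receives.

class ProcessExecutionMC:
    def __init__(self) -> None:
        self.variables = 0  # the Variables counter

class ProcessExecutionVariableMC:
    def __init__(self, parent: ProcessExecutionMC) -> None:
        self.parent = parent
        self.detected_fired = False  # 'not repeatable' state of Aux Detected

    def on_accessed_event(self) -> None:
        # Aux Detected: no condition and not repeatable, so it fires once,
        # on the first event, which is the one that created this child MC.
        if not self.detected_fired:
            self.detected_fired = True
            self.parent.variables += 1

parent = ProcessExecutionMC()
child = ProcessExecutionVariableMC(parent)
child.on_accessed_event()  # the creating event: counter goes to 1
child.on_accessed_event()  # further Accessed events do not increment
assert parent.variables == 1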

Calculating task durations

For human tasks, the global process monitor records three durations:

  • Total Duration: the elapsed time from when the task comes to life until it has reached a final state
  • Waiting Duration: the time that the task spends in the READY state
  • Working Duration: the time that the task spends in the CLAIMED state

Consider the state history in Figure 20.

Figure 20. A human task state history
A human task state history

This task execution was claimed by user admin, then unassigned, then claimed again by that user, and finally finished. Hence, it was waiting for someone to claim it during two time intervals, both of which contribute to the waiting duration. It was also in the claimed state (being worked on) during two time intervals, both of which contribute to the working duration. Drilling up from the view in Figure 20, we see the summary for this task execution, shown in Figure 21.

Figure 21. A human task execution record
A human task execution record

The State History metric, which replicates the information in the State History child MCs and is normally hidden, is shown here as well. As you'll see shortly, it plays a crucial role in the calculation of the task's waiting and working durations. Readers who do the math will readily verify that the Waiting Duration is equal to the time spent in the READY state, the Working Duration is equal to the time spent in the CLAIMED state, and the Total Duration is the time from the moment this task came to life (initially in the INACTIVE state) until it was finished. Here are the map expressions for the three duration metrics:

Total_Duration   <- if ( fn:exists( Time_Started ) ) 
                    then ( if ( fn:exists( Time_Ended ) ) 
                           then Time_Ended - Time_Started 
                           else Aux_Last_Event_Timestamp - Time_Started ) 
                    else Total_Duration

Waiting_Duration <- xs:dayTimeDuration( 
                        fct:aggregate-duration-in-state( 
                            'READY', State_History, Aux_Last_Event_Timestamp ) )

Working_Duration <- xs:dayTimeDuration( 
                        fct:aggregate-duration-in-state( 
                            'CLAIMED', State_History, Aux_Last_Event_Timestamp ) )

You see that the map expression for the Total Duration of a task execution is the same as for the Total Duration of a process execution, which was already discussed in Gathering Execution Statistics. The maps for the waiting duration and the working duration pass the State History metric, which contains the task's entire recorded state history, to the user-defined function fct:aggregate-duration-in-state(). The function calculates the time spent in the state that is passed as the first argument. The timestamp of the current event is passed via the third argument, and used as the endpoint of the interval that starts with the latest state history record. For example, if the last state recorded is CLAIMED, then the time from the corresponding state change up to the timestamp passed via the third argument will be added to the Working Duration.

One important aspect of the fct:aggregate-duration-in-state() function is that it sorts the state history records by their timestamps before calculating any intervals. This is another measure to safeguard against events arriving out of order: if events reporting task state changes arrive out of order, the records in the State History metric will be out of order, too. But that won't affect the calculations of the times spent in any state, which will be performed after sorting the state transition records. (Note that the records in a drill-down view like Figure 21 will always be in time order, because they are sorted by the Instances widget.)
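
To spell out the mechanics, here is a minimal Java sketch of what fct:aggregate-duration-in-state() does. It is not the actual function source; for simplicity it assumes state names without embedded blanks and UTC timestamps:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StateDurations {

    record Transition(Instant time, String state) {}

    static Duration aggregateDurationInState(String state, String history, Instant now) {
        // Parse "timestamp state timestamp state ..." into transition records.
        List<Transition> transitions = new ArrayList<>();
        String[] tokens = history.trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            transitions.add(new Transition(Instant.parse(tokens[i]), tokens[i + 1]));
        }
        // Sort by timestamp: this is what makes the calculation robust
        // against state-change events that arrived out of order.
        transitions.sort(Comparator.comparing(Transition::time));

        Duration total = Duration.ZERO;
        for (int i = 0; i < transitions.size(); i++) {
            if (!transitions.get(i).state().equals(state)) continue;
            // The interval ends at the next transition, or at 'now' for the last record.
            Instant end = (i + 1 < transitions.size()) ? transitions.get(i + 1).time() : now;
            total = total.plus(Duration.between(transitions.get(i).time(), end));
        }
        return total;
    }
}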

Accumulating task durations

The previous section explained how the waiting and working durations of human tasks are calculated, based on state change events. But the global process monitor dashboard also shows waiting and working durations for invocations and process executions. These are defined as follows:

  • The waiting / working duration of a process execution is the sum of the waiting / working durations of its human tasks and invocations.
  • The waiting / working duration of an invocation is the same as the waiting / working duration of the called process execution.
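
As a small worked example (the durations are invented for illustration): suppose a sub-process S contains two human tasks with working durations of 2 and 3 hours, and the calling process P contains one human task with a working duration of 1 hour plus the invocation of S. Then:

Working duration of S                =  2h + 3h  =  5h    (sum over S's tasks)
Working duration of the invocation   =  5h               (same as the called execution)
Working duration of P                =  1h + 5h  =  6h    (P's own task + the invocation)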

As a result, the waiting / working durations shown for process executions are the accumulated waiting / working durations of all human tasks in those executions and any sub-processes they call directly or indirectly. Implementing these aggregations of durations up the stack requires another type of internal event. Let's look at the detailed logic to achieve this.

Figure 22. Accumulating task durations up the call stack
Accumulating task durations up the call stack

To illustrate the logic for accumulating human task durations, Figure 22 shows two Process Execution MCs, each of which has a Process Execution Step child MC. The upper one represents the calling process execution (parent process) and its child MC represents the invocation step. The lower one represents the called process execution (sub-process) and its child MC represents some human task within this process.

In a nutshell, here is what happens: The waiting duration and the working duration of a human task are calculated from its state history, as described in Calculating Task Durations. The durations are added up at the process execution level, along with the waiting / working durations of all human tasks and invocations in this process execution. Whenever these aggregates change, an internal event is sent from the MC representing the process execution to the one representing its invocation (arrow in Figure 22), reporting the new aggregate durations. The invocation's waiting / working duration is set from the event. It's added up at the process execution level (now of the calling process) along with the waiting / working durations of all human tasks and invocations in this process execution.

Following is a more detailed description of these steps, with numbers referring to Figure 22:

  1. The calling process execution arrives at an invocation step. A child of its Process Execution MC is created, representing the invocation, and initialized. Specifically, the metric Aux Saved Working Duration is initialized to a duration of zero length (PT0S).
  2. The called process execution starts, and a corresponding MC is created. Its Identifier is set to the new execution's process instance ID.
  3. As described in Tracking invocation chains, the called execution's instance identifier is also captured in the MC representing the invocation, in metric Called Execution Identifier.
  4. The called process execution arrives at some human task. A child of its Process Execution MC is created, representing this human task execution, and initialized. Specifically, the metric Aux Saved Working Duration is initialized to a duration of zero length (PT0S).
  5. As the human task is processed, its state history is updated.
  6. The Working Duration of the human task is calculated from its state history, as described in Calculating task durations.
  7. The trigger Aux Update Aggregate Working Duration fires whenever the Working Duration is set to a value that's different from the value last saved in Aux Saved Working Duration.
  8. Whenever the Working Duration of a human task changes, and consequently its Aux Update Aggregate Working Duration trigger fires, the process-level Working Duration is adjusted. Here is its map expression:
    Working_Duration <-  Working_Duration 
                         + Process_Execution_Step/Working_Duration 
                         - Process_Execution_Step/Aux_Saved_Working_Duration
  9. The updated process-level working duration is reported in an outbound event, which carries the new working duration as well as the process execution identifier.
  10. This event emission completes the immediate downstream effects of trigger Aux Update Aggregate Working Duration. Next, the chained trigger Aux Save Working Duration fires in the human task MC.
  11. As a result, the task's new Working Duration is copied to Aux Saved Working Duration, to prepare for another update at the process execution level should that become necessary.
  12. The internal event is received by the MC representing the invocation step. The correlation predicate makes use of the fact that this MC has stored the identifier of the called execution in metric Called Execution Identifier.
  13. The working duration reported by the event is saved in metric Aux Called Execution Working Duration.
  14. The working duration of the invocation step is set to the reported duration. Here is its map expression:
    Working_Duration  <-  
         if ( fn:empty( Aux_Called_Execution_Working_Duration ) ) 
              then 
                   xs:dayTimeDuration( 
                        fct:aggregate-duration-in-state( 
                             'CLAIMED', State_History, Aux_Last_Event_Timestamp ) ) 
              else 
                   Aux_Called_Execution_Working_Duration

    Note that the same map was used in step 6 to set the human task's working duration from its state history; effectively, this map is a switch: if a reported working duration exists in Aux Called Execution Working Duration (which only happens for invocation steps), that value will be used to set the metric. If not, the value will be calculated from the State History metric, as the time spent in the CLAIMED state; the result will be zero unless the step is a human task that has been assigned to a user.
  15. Just as in step 7, the trigger Aux Update Aggregate Working Duration fires.
  16. Just as in step 8, the updated Working Duration of the invocation step is added to the aggregate Working Duration of the (calling) process execution.

Exactly the same logic is used to accumulate waiting durations.
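
Condensed into map notation, the delta-update pattern of steps 7 through 11 amounts to the following two maps (the first is the process-level map from step 8; the second implements the copy described in step 11):

Working_Duration            <-  Working_Duration
                                + Process_Execution_Step/Working_Duration
                                - Process_Execution_Step/Aux_Saved_Working_Duration

Aux_Saved_Working_Duration  <-  Working_Duration

Adjusting the aggregate by the difference between the new value and the last value folded in, and then saving the new value, ensures that repeated updates of the same task's duration are never double-counted.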

Note that an internal event is used here to propagate information up the stack, from the MC representing a called process execution to that representing its caller. Contrast that with the logic described in Tracking invocation chains and shown in Figure 16, where an internal event is used to propagate information down the stack.

Finally, note that in processes with parallelism, the waiting or working duration of a process execution can easily exceed its total (elapsed) duration: two parallel human tasks, each claimed for 3 hours within an execution that completes in 4 hours of elapsed time, already yield a working duration of 6 hours. The same can happen at the process definition level: the average waiting duration of a process definition can exceed its average total duration, for example.

Handling scrambled event order

Tip: Handling out-of-order events

If the downstream maps of an inbound event subscription involve data supposedly delivered by earlier events, gate those maps by tests that verify that all of the required data has arrived, and trigger them by any of the contributing events. If that is not possible, include tests for any required data in the correlation predicate, and specify the "retry" option for no correlation matches.

There are fundamentally two techniques for handling events that may arrive in an unpredictable order, which are summarized in Handling out-of-order events and illustrated below.

Here is an example of the first technique. As a straightforward way to calculate the duration of a task, you might code the following logic in a monitor model (Start_Time and Duration are metrics; Task_Started and Task_Ended are inbound event subscriptions):

Start_Time  <-  Task_Started/creationTime
Duration    <-  Task_Ended/creationTime - Start_Time

The intent is clear. First, the event reporting the start of the task sets the Start_Time metric; then, the completion event sets the Duration metric by subtracting the start time from its own creation time. But when the start event arrives after the completion event, this method fails.

As an alternative, consider the following logic:

Start_Time  <-  Task_Started/creationTime
End_Time    <-  Task_Ended/creationTime
Duration    <-  if ( fn:exists( Start_Time ) and fn:exists( End_Time ) ) 
                then End_Time - Start_Time 
                else Duration

Now the start event and the completion event can arrive in any order. Each will save its creation time in a metric. The task duration will be calculated after both timestamps are available.

This technique is used in many places in the global process monitor and you have seen examples of it earlier, for example in the map expression for Total Duration in step 4 in Gathering execution statistics. A more sophisticated technique, but still in the same vein, is the method to calculate the waiting and working durations of human tasks that was explained in Calculating task durations: The state changes of human tasks are recorded, along with their timestamps, in the order in which they are reported. But the recorded state history is sorted before the durations are calculated, which obviously makes the model robust against state change events arriving out of order. Yet another example, described in step 9 of Gathering execution statistics, is the setup of the trigger Aux Update Aggregate Durations in a Process Execution MC: it continues to fire when the execution's total duration, waiting duration, or working duration are updated by some late-breaking event after the process execution was already reported complete.

An example of the second technique is given by the event subscriptions Aux Receive Called Execution Waiting Duration and Aux Receive Called Execution Working Duration in the Process Execution Step MCs. If correlation fails for the internal events they subscribe to, then either the events reporting the sub-process invocation step have not arrived (while events from the sub-process execution have come through), or the MC representing the invocation step has not yet seen an event from which it could glean the called execution's identifier. Both circumstances should be resolved as more events are processed, and thus the retry option was chosen to guard against a scrambled event order in this case.
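
Illustratively, such a subscription might be configured as follows (the event field name is hypothetical):

Subscription:  Aux Receive Called Execution Working Duration
Correlation:   Called_Execution_Identifier = Event/executionId
On no correlation matches:  retry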

Tip: Using stopwatches (timers)

Don't use timers to measure the time between two events (starting it by the first, and stopping it by the second). Instead, calculate the time interval from the events' timestamps as described here. Timers are useful to measure "time since" (as opposed to "time between"), in particular when the end event may be delayed indefinitely, or not occur at all.

Note that stopwatches (timers) are particularly sensitive to events arriving out of order: if the start event arrives after the stop event, the result will obviously not be useful. Therefore, unlike stopwatches at sporting events, monitor model timers should not be used to measure the time between two events (see Using stopwatches (timers)).

Timers are useful for raising an alert when a heartbeat signal or other expected response has been missing for more than a predefined duration. A timer can be started when the wait time begins and periodically compared with the threshold duration by a recurring trigger. When the timer exceeds the threshold, the trigger fires and the alert is raised. Other examples of timer usage are scenarios that need to show running clocks on the dashboard.
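
For example, a heartbeat watchdog could be sketched as follows (the names and the five-minute threshold are illustrative):

Timer:      Time_Since_Last_Heartbeat          (started when the wait begins)
Trigger:    Heartbeat_Overdue                  (evaluated at recurring wait-time intervals)
Condition:  Time_Since_Last_Heartbeat ge xs:dayTimeDuration('PT5M')

When the trigger fires, an outbound event or alert can be emitted.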

If the first method is applied to guard against events arriving out of order (that is, all events are received, payload data is saved, and downstream actions are blocked until all required information is available), a complication arises: any of these events must be able to create the target MC if it does not yet exist. If each event carries the necessary keys of the target MC and its ancestors, this strategy can be used; the global process monitor model was written in this fashion. If creating the MC by every event is not an option, the retry variant should be used: events that arrive while no target MC exists yet are delayed until (hopefully) the event that creates it has come through.

Notes:

  1. At the time of this writing, there is a known problem in Monitor V7 when a new version of a monitor model is deployed while active MCs exist and potential creation events for those MCs arrive after the new model version is deployed. In that case, these creation events instantiate MCs according to the new monitor model definitions, while they should have been consumed by the already existing MCs with the old model logic. The global process monitor model allows almost every incoming event to create the required monitoring contexts, and hence may encounter this problem. To avoid it, either never deploy a new monitor model version, or if that is inevitable, quiesce the old model version (ensure that there are no active MCs) before making the switch.
  2. Making every event a potential MC creation event is sometimes hampered by the event not carrying the keys of the parent of the target MC (the target MC key is typically found in the event, as it is needed for correlation). The requirement that MC creation events must carry the keys of the target MC and its ancestors is understandable, but still unfortunate. The complexity resulting for correlation predicates was already pointed out at the end of Inbound event subscriptions. The impediments mentioned here, which make it harder to have more events create MCs if needed, are another drawback. Ideally, the monitor model architecture should not enforce MC-MC relationships to be hierarchical, nor require them to be established when a child MC is created. Instead, MC-MC relationships should be dynamic, based on foreign key metrics, and established at any time as either of the participating MCs is created or updated with the required values.

Performance considerations

It should be clear by now that the global process monitor model, as-is, may not be ideal for performance-critical environments. The fact that a single incoming event can cause dozens of secondary events is a major performance impediment, but also one that is easy to tackle: many of the internal events can be disabled without affecting any of the model's core function.

Here are some performance tuning tips:

  • With the exception of Aux Report Waiting Duration and Aux Report Working Duration, which were discussed in Accumulating task durations, the outbound event definitions in this monitor model only serve to move history records into child MCs for better visualization (see Monitoring step life cycles). Deleting them, along with the child MC definitions that these internal events instantiate, will not take away any capabilities of the global process monitor. All history records are also captured in string metrics, like State History, Escalation History, and so on, and the only drawback of showing these metrics in lieu of the child MCs on the dashboard is bloated row heights. No information is lost.
  • Removing the wbm:send-events() function that communicates initiator identifiers and invocation depth values down the stack, by deleting the definition of metric Aux Called Executions, will not have any side effects other than missing initiator identifier and invocation depth values. These values are also strictly for dashboard visualization (they enable the Invocation Chains view shown in Figure 15) and omitting them has no side effects on other parts of the model logic.
  • Deleting the outbound event definitions Aux Report Waiting Duration and Aux Report Working Duration from the Process Execution MC definition will result in invocation steps never having a waiting or working duration. Consequently, the waiting and working duration of a process execution will only reflect its own human tasks, ignoring any tasks performed in sub-processes. That is a significant change in the model logic and in the semantics of these durations; if it can be tolerated, removing these outbound event definitions for performance reasons is fine.
  • The previous steps should achieve by far the most significant performance improvements, as they eliminate all secondary (internal) events. If further performance gains are needed, deleting metrics that are not required is a logical next step. Candidates may be the standard deviations and the Aux Sum Squares metrics used in their calculations, as well as the xs-formatted date/time and duration metrics.
  • If all of those measures still do not produce a model with good enough performance, writing a custom model is probably necessary.

Finally, keep in mind that a lot of performance can often be gained by throttling event sources: disable event emission for any and all process steps, process variables, or even entire sub-processes that don't need to be monitored.

User-defined XPath functions

Table 6 summarizes the user-defined XPath functions used by the global process monitor.

Table 6. User-defined XPath functions used by the global process monitor
fct:add-to-list(xs:string?, xs:string?) xs:string?
    Appends one or more strings, passed via the second argument, to the comma-separated list of string values passed via the first argument. If more than one string is passed, each must be enclosed in XML tags. Each separating comma must be followed by a blank. The XML tags in the second argument only serve as separators for the strings they contain; if a single string is passed, a wrapping XML tag is optional (if present, it will be removed before adding the string). Strings are only appended if the list does not contain them already.

fct:aggregate-duration-in-state(xs:string?, xs:string?, xs:dateTime?) xs:duration?
    Given a history of state transitions (passed via the second argument), calculates the aggregate duration of the state given by the first argument. The state transitions must be encoded as a blank-separated sequence of transition timestamps followed by target states. The sequence does not have to be time ordered. The state designations may contain blanks, but not in the first or last character position. The third argument is the current time, which is used to calculate the duration of the state that was entered last. It defaults to the current time of the system clock.

fct:convert-indenting-blanks(xs:string?) xs:string?
    Replaces any indenting blanks in the argument by &nbsp; symbols.

fct:format-dateTime-value(xs:string?) xs:string?
    Formats the lexical representation of an xs:dateTime value for improved readability.

fct:get-correlation-set-info(xs:string?) xs:string?
    Extracts a string representation of the correlation set information carried in a PROCESS_CORRELATION_SET_INITIALIZED event issued by Process Server (event code 42027).

fct:get-max-duration(xs:string?) xs:duration?
    Returns a duration with maximal length from the comma-separated list of xs:duration values that is passed.

fct:get-min-duration(xs:string?) xs:duration?
    Returns a duration with minimal length from the comma-separated list of xs:duration values that is passed.

fct:logical-deployment-location(xs:string?) xs:string?
    Returns the first part of a possibly composite identifier, where parts are expected to be separated by slashes or backslashes.

fct:prefix-N-times(xs:string?, xs:string?, xs:integer?) xs:string?
    Prefixes the string passed as the first argument with the string passed as the second argument, the number of times indicated by the third argument. If the third argument is not a positive integer, the first argument is returned unchanged. If the first argument is empty, the result will be empty.

fct:remove-content-wrapper(xs:string?) xs:string?
    Removes the <wbi:content> wrapper from the argument, but retains all required namespace declarations.

fct:remove-from-list(xs:string?, xs:string?) xs:string?
    Removes one or more strings, passed via the second argument, from the comma-separated list of string values passed via the first argument. If more than one string is passed, each must be enclosed in consecutive XML tags. Each separating comma must be followed by a blank. XML tags in the second argument only serve as separators for the strings they contain; if a single string is passed, wrapping XML tags are not required. Strings passed via the second argument that are not found in the first are ignored. Strings that occur several times in the first argument are removed completely (every occurrence is removed).

fct:replace-all(xs:string?, xs:string?, xs:string?) xs:string
    Replaces any substrings within the first argument that match the regular expression passed via the second argument with the string passed as the third argument. If the second argument is empty, the first is returned unchanged. If the last argument is empty, an empty string is substituted for any matches.

fct:sqrt(xs:decimal?) xs:decimal?
    Calculates the square root of a decimal number.

fct:update-duration-list(xs:string?, xs:duration?, xs:boolean?, xs:duration?, xs:duration?, xs:integer?) xs:string?
    Updates a list of serialized xs:duration values, which is passed as the first parameter, as follows: The duration passed as the second parameter is considered a lower or an upper bound, depending on the value of the Boolean passed as the third parameter ('true' indicates lower bound, 'false' indicates upper bound). Any duration values outside that bound are removed from the list. The duration passed as the fourth parameter is added to the list, as long as it is within the specified bounds, even if the list already contains that value (duplicates are allowed). The duration passed as the fifth parameter is removed from the list if the list contains it. If the second, fourth, or fifth parameter is empty, then no cut-off bound is applied, no duration is added, or no duration is removed, respectively. The last parameter limits the length of the result: if specified, xs:duration values are removed from the result list until it no longer exceeds that length; the shortest (longest) durations are removed when the third parameter is true (false).

For more detailed descriptions, see the simplified Javadoc files, or the user-defined function source.
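
To convey the general flavor of these functions, here is a minimal Java sketch of the list logic behind fct:add-to-list(). It is not the actual function source, and the Monitor-specific registration plumbing is omitted:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListFunctions {

    static String addToList(String list, String items) {
        if (items == null || items.isEmpty()) return list;

        // Gather the values to add: a sequence of XML-tagged strings,
        // or a single string without a wrapper.
        List<String> toAdd = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]+>([^<]*)</[^>]+>").matcher(items);
        while (m.find()) toAdd.add(m.group(1));
        if (toAdd.isEmpty()) toAdd.add(items);

        // Split the existing list on "," followed by a blank.
        List<String> result = new ArrayList<>();
        if (list != null && !list.isEmpty())
            result.addAll(Arrays.asList(list.split(", ", -1)));

        // Append only values the list does not already contain.
        for (String s : toAdd)
            if (!result.contains(s)) result.add(s);
        return String.join(", ", result);
    }
}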

Dimensional model

A comprehensive dimensional model has been defined for the monitoring context structure shown in Figure 3. It allows you to set up a host of dimensional views, slicing and dicing the data collected by the global process monitor in many different ways. Two examples are shown in Figure 23.

Figure 23. Some dimensional views
Some dimensional views

The upper diagram in Figure 23 shows the number of process executions that have started (orange), are in progress (green), and have finished (blue) for three different process definitions. This view uses a cube based on the Process Definition monitoring contexts, a dimension based on the process definition name, and three measures, Started, In Progress, and Finished, which reflect the sum of the counters with the same names in the Process Definition MCs. Since there is only one Process Definition MC per dimension value (unless there are duplicate process names), nothing is summed up, and one simply gets a graphical rendering of these counts.

The lower diagram in Figure 23 shows the minimum, average, and maximum of the total duration, the waiting duration, and the working duration of process executions. The process definition whose aggregated execution times are shown is selected using the pull-down menu (ProcessCustomerOrder was chosen). This view uses a cube based on the Process Execution MCs, a dimension based on the process name, and nine measures. The measures are aggregations of the total, waiting, and working durations from all process executions that have the same process name. Note that those same aggregations are held in metrics of the Process Definition MCs and calculated incrementally as described in Gathering execution statistics. You can use the measures calculated by the cube engine to validate the Process Definition metrics calculated via monitor model logic; they must match.

To give you an idea of the predefined dimensional model, two of its seven cubes are shown in Table 7. Those are also the cubes that supplied the data for the diagrams in Figure 23.

Table 7. Two cubes of the dimensional model
CUBE: Process Definition Cube (MC: Process Definition)

Measure                  Source Metric/Counter/Timer/Key    Aggregation Function
Started                  Started (C)                        Sum
Finished                 Finished (C)                       Sum
In Progress              In Progress (M)                    Sum
Min Duration Inactive    Time Since Last Active (T)         Minimum
Max Duration Inactive    Time Since Last Active (T)         Maximum

Dimension / Levels                    Source Metric/Counter/Timer/Key
Process Definition Id / Identifier    Identifier (M)
Process Definition Name / Name        Name (M)

CUBE: Process Execution Cube (MC: Process Execution)

Measure                   Source Metric/Counter/Timer/Key    Aggregation Function
Min Total Duration        Total Duration (M)                 Minimum
Avg Total Duration        Total Duration (M)                 Average
Max Total Duration        Total Duration (M)                 Maximum
StdDev Total Duration     Total Duration (M)                 Standard deviation
Sum Total Duration        Total Duration (M)                 Sum
Min Waiting Duration      Waiting Duration (M)               Minimum
Avg Waiting Duration      Waiting Duration (M)               Average
Max Waiting Duration      Waiting Duration (M)               Maximum
StdDev Waiting Duration   Waiting Duration (M)               Standard deviation
Sum Waiting Duration      Waiting Duration (M)               Sum
Min Working Duration      Working Duration (M)               Minimum
Avg Working Duration      Working Duration (M)               Average
Max Working Duration      Working Duration (M)               Maximum
StdDev Working Duration   Working Duration (M)               Standard deviation
Sum Working Duration      Working Duration (M)               Sum
Executions                Identifier (K)                     Count

Dimension / Levels                                               Source Metric/Counter/Timer/Key
Process Definition Id And Execution State / Identifier / State   Process Definition Identifier (M) / State (M)
Process Execution State / State                                  State (M)
Process Name / Name                                              Process Definition Name (M)
Indented Process Name / Indented Name                            Indented Process Definition Name (M)
Initiator Identifier / Identifier                                Initiator Identifier (M)

Monitor requires metrics used for dimensions to have defaults, which is why there are no dimensions for execution start and end times. If such dimensions are desired, you could add metrics "Time Started With Default" and "Time Ended With Default" for this purpose, and give them default values (some point in time far in the past for Time Started With Default, and some point far in the future for Time Ended With Default). You would overwrite those values via copy-maps from the actual Time Started and Time Ended metrics as soon as these are set.
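
A sketch of these additions (the metric names follow the suggestion above; the default values are illustrative):

Time_Started_With_Default    default value:  xs:dateTime('1970-01-01T00:00:00')
Time_Ended_With_Default      default value:  xs:dateTime('9999-12-31T00:00:00')

Time_Started_With_Default  <-  Time_Started      (copy-map fired when Time_Started is set)
Time_Ended_With_Default    <-  Time_Ended        (copy-map fired when Time_Ended is set)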

Predefined KPIs

While the dimensional model is fairly comprehensive, the KPI model is not. As discussed in Gathering execution statistics, the main reason for this is that KPIs as defined by Monitor are singleton aggregate numbers not instantiated in a monitoring context, while useful aggregations are almost always "by process definition" or "by task definition." Knowing the average execution time of any process whatsoever, or the minimum and maximum working time of any task whatsoever, is typically not very useful.

Still, a few KPIs have been predefined for the global process monitor, mainly to give an example of this feature. These are shown in Figure 24. The usefulness of the first five is questionable: they show the average total, waiting, and working durations for any human task that was monitored, as well as the total number of human tasks completed during the last seven days. The "per workday" KPI simply divides the number of tasks completed during the last seven days by 5; holidays are thus ignored. The last two KPIs, however, can be useful from a technical and performance point of view.

Figure 24. Predefined KPIs
Predefined KPIs

One of the many great features of KPIs is that they can be added to an already-deployed monitor model at any time. Thus, once it is known for which process some key performance indicators should be measured, a KPI with a data filter for that process name is easily defined and will aggregate only its executions. While a generic monitor model obviously cannot provide such KPIs out-of-the-box, they are easy enough to add.
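
For example, a process-specific KPI could be sketched as follows (the KPI name is illustrative; the process name is the one from Figure 23):

KPI:                Avg Total Duration of ProcessCustomerOrder
Aggregated metric:  Process Execution / Total Duration
Aggregation:        Average
Data filter:        Process_Definition_Name = 'ProcessCustomerOrder'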


Download

Description                                             Name                       Size
Global process monitor source (proj interchg fmt) (1)   GlobalProcessMonitor.zip   176KB

Note

  1. Contains the monitor model source files in project interchange format, for import into the monitor model editor of WebSphere Integration Developer.
