How events are distributed among the entities in an ODM Decision Server Insights (DSI) solution can impact performance in several, sometimes surprising, ways. Some people refer to this as having a “hot entity”, but that term often obscures, rather than clarifies, the circumstances under which patterns of event distribution can adversely affect performance. This document delves into the details for a more complete understanding.
Event processing systems are by their nature highly distributed. In order to achieve scale, many events are processed concurrently.
In DSI, agents subscribe to types of events. Each agent instance is associated with an individual entity instance, called the bound entity, and DSI routes events to agents by entity key. Processing an individual event against an individual entity is a single transaction, no matter how many agents there are. Processing the same event against multiple entities, however, constitutes multiple asynchronous transactions.
Just like in a database system, DSI does not allow multiple updates to the same data at the same time. However, unlike a database system, DSI does not rely just on locking to achieve data consistency. DSI serializes the work on a single entity instance, both for a single event and across multiple events.
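The per-entity serialization described above can be modeled with a short sketch. The class below is not DSI code; it is a hypothetical illustration of the ordering guarantee, using one single-threaded executor per entity key so that work for the same key runs in order while work for different keys proceeds in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: serialize work per entity key while letting different
// keys run in parallel. DSI provides an equivalent guarantee internally;
// this sketch just models the observable behavior.
public class PerKeySerializer {
    private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();

    // Tasks submitted for the same key run one at a time, in submission order.
    public Future<?> submit(String entityKey, Runnable work) {
        ExecutorService lane = lanes.computeIfAbsent(
                entityKey, k -> Executors.newSingleThreadExecutor());
        return lane.submit(work);
    }

    // Stop all per-key executors.
    public void shutdown() {
        lanes.values().forEach(ExecutorService::shutdown);
    }
}
```

Because each key has its own lane, concurrent updates to the same entity never race. The trade-off, discussed next, is that one busy key can only ever use one core at a time.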
Balancing Event Load Across Entities
Although it’s required for data consistency, serializing the work for a single entity sounds like it goes against the distributed nature of event processing. While that’s true in general, DSI does an excellent job of finding work to do on another entity while one entity is busy.
In a normal DSI system there may be a handful of entities with a backlog of events at any given time. Performance does not degrade at all, because DSI finds events to process against other entities.
Performance problems only occur if a high percentage of events are being processed against few entities, and then only if the event rate is high enough to cause a backup. The classic example is where each event instance is being processed against an entity for an individual and an entity for a group. If you have too few group entities, or the distribution across the groups is highly unbalanced, you must watch out for problems at high event rates.
You will begin to see this problem when the sustained fraction of events for a single entity exceeds 1/C, where C is the number of CPU cores in your production cluster.
As an example, let’s suppose that we are monitoring people (actually mobile phones) boarding and exiting subway trains in a large city.
- Metropolis has 400 subway trains in service at any one time.
- There are 2M riders of the Metropolis subway.
- On average 1000 people board or exit a train in any given second.
- DSI runtime cluster for the Metropolis subway system is hosted on 4 x 12-core nodes.
There is no problem with Metropolis’ entities for passengers and trains. A train may board up to 1000 passengers per minute while it is stopped, but this rate is not sustained. Also, there are more trains boarding at any one time than there are cores, so nothing blocks DSI from using the full cluster to do work.
However, there is a problem if we add a city entity for Metropolis, e.g. to track the total number of subway passengers at any given moment. Each boarding event would need to be processed by the individual entity, the train entity, and the Metropolis entity. Eventually, everything would end up waiting on the Metropolis entity, and we would have a severe performance problem.
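The 1/C rule can be checked with simple arithmetic. The sketch below is illustrative only, not a DSI API; it applies the rule to the Metropolis numbers above: each of the 400 trains sees a small fraction of the 1000 events per second, but a single whole-city entity would see all of them.

```java
// Back-of-the-envelope check of the 1/C rule, using the Metropolis figures
// from the text. The method is plain arithmetic, not a DSI API.
public class HotEntityCheck {
    // True if one entity's sustained share of events exceeds 1/C, meaning a
    // single core cannot keep up with that entity's serialized work.
    public static boolean isHot(double entityEventsPerSec,
                                double totalEventsPerSec, int cores) {
        return entityEventsPerSec / totalEventsPerSec > 1.0 / cores;
    }

    public static void main(String[] args) {
        int cores = 4 * 12;          // Metropolis cluster: 48 cores
        double total = 1000;         // boarding/exit events per second
        // ~1000 events/s spread across ~400 trains: ~2.5 events/s per train.
        System.out.println(isHot(total / 400, total, cores)); // false
        // A single whole-city entity sees every event.
        System.out.println(isHot(total, total, cores));       // true
    }
}
```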
Let’s also consider how this would work in a suburb of Metropolis.
- Smallville has 20 subway trains in service at any one time.
- There are 10K riders of the Smallville subway.
- On average 50 people board or exit a train in any given second.
- DSI runtime cluster for the Smallville subway system is hosted on 4 x 4-core nodes.
There is no problem with Smallville’s entities for riders, because there are so many of them. However, trains could be a problem. Because there are only 20 trains running, maybe only 5 of them are boarding at any one time. Each of those could take on 1000 riders per minute, because although there are fewer trains in Smallville, each train is the same size as in Metropolis.
In this scenario it’s likely that work would back up against the trains. While processing the backlog, not all 16 cores in Smallville’s cluster could be used, because the backlog might be concentrated on only 5 trains. Further analysis would be needed to see whether the load could be processed in time to free up capacity before the next trains stop.
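The core-utilization cap can also be expressed with simple arithmetic. This sketch is illustrative only: with per-entity serialization, concurrency is limited by the number of entities that currently have work, regardless of how many cores the cluster has.

```java
// Why Smallville's 16-core cluster can stall while Metropolis' 48 cores
// stay busy. Figures are from the text; this is arithmetic, not a DSI API.
public class ParallelismCap {
    // At most one core can work on each backed-up entity at a time.
    public static int usableCores(int entitiesWithBacklog, int totalCores) {
        return Math.min(entitiesWithBacklog, totalCores);
    }

    public static void main(String[] args) {
        // Smallville: 4 x 4-core nodes, but perhaps only 5 trains boarding.
        System.out.println(usableCores(5, 16));   // 5 of 16 cores usable
        // Metropolis: far more boarding trains than cores.
        System.out.println(usableCores(400, 48)); // all 48 cores usable
    }
}
```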
It’s ironic that Smallville might have a problem, but Metropolis would not — as long as Metropolis’ solution avoided using a whole-city entity.
Managing Events Referenced by Agent State
The single biggest impediment to good performance in an event application is moving around too much data, and the single biggest source of that data is event payloads. It is common in an event application for there to be orders of magnitude more events than entities.
When a DSI rule agent contains event aggregation or event correlation, the Decision Engine state must keep references to events. When an event is processed, the DSI runtime provides the engine with the actual event instances so that rules can be executed.
The definition of an agent contains the types of events that it processes, and should specify a horizon for each event type. The horizon is how far into the past the agent can “see” its event types; the engine must keep references to all events within the horizon. (An administrator may configure a maxHorizon for the solution, which caps any individual agent horizon. Note, however, that the person developing an agent will not know if maxHorizon changes.)
Use of the horizon is closely tied to the rate at which events are received. For good performance you cannot have both a high event rate and a long horizon. Referencing tens or hundreds of events in engine state is usually fine, but referencing thousands or tens of thousands of events is a cause for concern. For example, an agent for a train entity may expect only one train-start event per day, so setting that horizon to a year may not be a problem. However, a train may board 200,000 passengers per day, so a horizon of longer than 5 minutes for boarding events may be a problem.
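A rough state budget follows from multiplying the sustained event rate by the horizon. The helper below is a back-of-the-envelope sketch, not a DSI API, applied to the train figures above.

```java
// Estimate how many event references an agent's engine state must retain:
// sustained event rate times horizon. An approximation for sizing, nothing more.
public class HorizonBudget {
    public static long retainedReferences(double eventsPerDay, double horizonDays) {
        return Math.round(eventsPerDay * horizonDays);
    }

    public static void main(String[] args) {
        // One train-start event per day, one-year horizon: tiny state.
        System.out.println(retainedReferences(1, 365));                    // 365
        // 200,000 boardings per day, 5-minute horizon: still modest.
        System.out.println(retainedReferences(200_000, 5.0 / (24 * 60))); // 694
        // 200,000 boardings per day, one-day horizon: a red flag.
        System.out.println(retainedReferences(200_000, 1));               // 200000
    }
}
```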
Here are some recommendations for managing event references from rule agent state:
- Always specify a horizon. Specifying a horizon makes it clear what the agent needs. If you don’t specify a horizon, the agent’s event-type horizon defaults to maxHorizon, which defaults to a month.
- Specify a horizon of zero if an agent doesn’t need an event type for aggregation or correlation. It is frequently the case that an agent will want to process events that it has no need to aggregate or correlate. With a horizon of zero, the engine keeps no event references at all.
- Use shared aggregates. Except when being initially calculated, shared aggregates do not need to reference the original events. Plus, the same shared aggregate can be calculated once for multiple agents and multiple time-frames.
When trying to improve the performance of a solution, the amount of Decision Engine state required is often the biggest factor. (Do not use the eventCacheDuration tuning property to try to increase event-processing performance; it is intended to reduce memory requirements in solutions with relatively low event rates.)
Pushing Processing Down the Data Model
Many kinds of applications have hierarchical data models. In event-processing applications, however, it’s important that processing be pushed to the branches and leaves of the data model, not the trunk.
Let’s suppose that a data model has a bank with customers, and that each customer can have one or more accounts. If the bank receives a data feed of transactions, how should it process them?
The situation we want to avoid is having all events do significant processing in an agent bound to the bank entity, which in turn submits events to other agents. We cannot have every event expect to modify the Bank entity and still get good performance.
The ideal way for these to be processed is to have an agent associated with the account subscribe to the events. Routing events directly to the accounts from inbound connectivity or the Gateway API distributes the events as broadly as possible, so most processing can be done in parallel. The account agent can read the customer or bank data as necessary.
Sometimes, it may not be possible to implement the ideal. For example, you might want to modify one of the higher-level entities, or you might need to introspect the event before knowing what entity to process it against. Let’s consider those cases.
If the Customer entity needs to be modified, then another agent is required, using the Customer as the bound entity. There are two possibilities: the modifications of the Customer and Account can happen in parallel, or they can happen sequentially. Note that when multiple entities are modified, each is an independent transaction.
For the Customer and the Account to be modified in parallel, the agents for both the account and the customer subscribe to the inbound events.
For the Customer and Account to be modified sequentially, the agent for the Customer, as part of its processing, emits an event for the agent for the Account.
In other cases it may be necessary to introspect the event before knowing which Customer and Account need it. An entityless Java agent avoids a bottleneck here, because multiple instances of an entityless Java agent can run in parallel. There are no consistency issues as long as nothing is modified. Again, it is possible to modify the Customer and Account entities in parallel …
… or sequentially.
To be able to have the entityless Java agent read the Bank or other entities, you’ll need to use code like the following:
// keyValue must already hold the key of the Bank entity to be read.
Relationship rBank = createRelationship(Bank.class, keyValue);
Bank bank = rBank.resolve();
The key to good performance in event applications is to process many events involving a small amount of data, rather than few events involving a large amount of data. Keep these three points in mind:
- Scale can’t be achieved by processing many events against few entities.
- Keep the number of past events that an agent must reference to a minimum.
- Push processing to agents associated with entities as far down the object model as possible.