Cool your hot entities in IBM ODM Decision Server Insights

In any high-volume event processing system, such as Decision Server Insights in IBM® Operational Decision Manager (ODM), an entity instance referenced by thousands of events is a "hot entity." A hot entity slows down processing because it becomes the dominant consumer of events within the system. In effect, an entire multiprocessing grid ends up waiting for a single thread to complete.

This tutorial aims to help Decision Server Insights architects and developers build solutions without hot entities. Learn the causes of hot entities and tips to avoid them.

A problem exists in event processing systems when the majority of inbound events channel into a few hot entities. The hot entity problem is not simply about a single entity being overloaded with events. It is also about a gradual decrease in performance as fewer and fewer entities are targeted to receive events. In the following chart, you can see that event ingestion (events per second) tails off linearly when fewer than 200 entities are capable of processing inbound events. The chart shows the performance decrease for the Decision Server Insights example solution in this tutorial. The exact amount varies, depending on your own solution and hardware.

Table of inbound events per second (EPS) versus number of entities

A poorly designed Decision Server Insights solution allows hot entities to slow down event ingestion until the grid stops. In worst-case scenarios, solutions face the following issues:

  • The inbound event queue becomes full, forcing the Decision Server Insights solution to reject further events.
  • Millions of events hit one hot entity, which takes days to clear.

This tutorial shows three example solutions to explain how to identify hot entities and examines ways to avoid them. The first solution has no hot entity protection, the second has partial protection, and the third has full protection against hot entities.

Hot entity types

You can encounter the following three types of hot entities:

  • Benign hot entities, which cause minor performance problems that can be ignored.
  • Persistent hot entities, which consistently cause problems.
  • Hidden hot entities, which occasionally cause problems.

Benign hot entities take a few minutes to process a storm of events, which results in near real-time event detection rather than real-time detection. Unless real-time event detection is critical, benign hot entities are usually not a problem.

Persistent hot entities cause serious performance problems, but the problems can be easily identified in development and resolved before deploying to production.

Hidden hot entities are the most problematic and might only be identified after the solution has gone live. For example, imagine an airplane monitoring system that receives airplane events, as shown in the following illustration. These events are sent to an airplane agent that is associated with one or more airplane entities. Under normal operation, all airplanes emit status events and these events are evenly distributed among all airplanes. However, if a single airplane starts to fail, it emits a storm of warning events towards a single airplane entity instance. This entity instance becomes the consumer of most of the airplane events, and this entity is hot.

Illustration of an airplane monitoring system with a hot entity

Tips to avoid hot entities

Now, learn tips for redesigning your solution to prevent hot entities.

Tip 1: Divide and conquer

Avoid complex entities that consume complex events. Divide complex entities and events into smaller parts. As shown in the following illustration, the airplane event and airplane entity are split into smaller component entities: fuselage entity, cockpit entity, engine entity, wing entity, and gear entity. Each component entity receives a subset of events emitted by the plane, which reduces the possibility of hot entities.

Illustration of an example solution with a divide and conquer approach
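
As a rough illustration of the divide and conquer idea, the following Java sketch splits one flat airplane reading into five component events, each addressed to its own component entity. The class names, field names, and the "aircraftId-component" entity id scheme are assumptions made for this sketch and are not part of the Decision Server Insights API or the sample solution.

```java
import java.util.ArrayList;
import java.util.List;

final class DivideAndConquerSketch {

    /** Hypothetical flat airplane event carrying data for every component. */
    static final class AirplaneEvent {
        final String aircraftId;
        final int engineRpm;
        final int wingLift;
        final int cockpitAltitude;
        final int fuselagePressure;
        final String gearState;

        AirplaneEvent(String aircraftId, int engineRpm, int wingLift,
                      int cockpitAltitude, int fuselagePressure, String gearState) {
            this.aircraftId = aircraftId;
            this.engineRpm = engineRpm;
            this.wingLift = wingLift;
            this.cockpitAltitude = cockpitAltitude;
            this.fuselagePressure = fuselagePressure;
            this.gearState = gearState;
        }
    }

    /** Hypothetical component event that targets exactly one component entity. */
    static final class ComponentEvent {
        final String entityId;
        final String reading;

        ComponentEvent(String entityId, String reading) {
            this.entityId = entityId;
            this.reading = reading;
        }
    }

    /** Splits one airplane event into one event per component entity. */
    static List<ComponentEvent> split(AirplaneEvent e) {
        List<ComponentEvent> parts = new ArrayList<>();
        parts.add(new ComponentEvent(e.aircraftId + "-engine", "rpm=" + e.engineRpm));
        parts.add(new ComponentEvent(e.aircraftId + "-wing", "lift=" + e.wingLift));
        parts.add(new ComponentEvent(e.aircraftId + "-cockpit", "altitude=" + e.cockpitAltitude));
        parts.add(new ComponentEvent(e.aircraftId + "-fuselage", "pressure=" + e.fuselagePressure));
        parts.add(new ComponentEvent(e.aircraftId + "-gear", "state=" + e.gearState));
        return parts;
    }

    public static void main(String[] args) {
        AirplaneEvent event = new AirplaneEvent("A1", 9000, 120, 35000, 11, "DOWN");
        for (ComponentEvent part : split(event)) {
            System.out.println(part.entityId + " <- " + part.reading);
        }
    }
}
```

Because each component event carries its own entity id, a storm of engine warnings lands on the engine entity only and leaves the wing, cockpit, fuselage, and gear entities free to process their own events.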

Tip 2: Address rule agent performance

Hot entities occur when an entity instance cannot consume events as fast as they arrive. One cause could be that your rule agents are too slow, possibly because your solution has a large event horizon.

Examine whether you need a large event horizon. Can you store a summary of event history in the entity instead? For example, consider a rule that counts engine warnings. Rather than using event history to count the number of warning events, it might be better to add a warning counter to the engine entity.
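
The difference between the two approaches can be sketched in plain Java. The classes below are hypothetical stand-ins, not Decision Server Insights types: one engine retains every event and scans the history whenever the count is needed, and the other keeps a running counter and needs no history at all.

```java
import java.util.ArrayList;
import java.util.List;

final class WarningCountSketch {

    /** Hypothetical stand-in for an engine event; not a DSI class. */
    static final class EngineEvent {
        final boolean warning;

        EngineEvent(boolean warning) {
            this.warning = warning;
        }
    }

    /** Approach 1: retain events and scan them each time the count is needed. */
    static final class HistoryScanningEngine {
        private final List<EngineEvent> history = new ArrayList<>();

        void onEvent(EngineEvent event) {
            history.add(event);
        }

        int warningCount() {
            int count = 0;
            for (EngineEvent e : history) { // work and memory grow with the retained history
                if (e.warning) {
                    count++;
                }
            }
            return count;
        }
    }

    /** Approach 2: keep a running counter on the entity; no history is retained. */
    static final class CountingEngine {
        private int warningCount;

        void onEvent(EngineEvent event) {
            if (event.warning) {
                warningCount++;
            }
        }

        int warningCount() {
            return warningCount;
        }
    }

    public static void main(String[] args) {
        HistoryScanningEngine scanning = new HistoryScanningEngine();
        CountingEngine counting = new CountingEngine();
        for (int i = 0; i < 1000; i++) {
            EngineEvent event = new EngineEvent(i % 4 == 0); // every fourth event is a warning
            scanning.onEvent(event);
            counting.onEvent(event);
        }
        System.out.println("history scan: " + scanning.warningCount()
                + ", running counter: " + counting.warningCount());
    }
}
```

Both approaches report the same count, but the counter does constant work per event regardless of how many events have already arrived.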

Another way to improve performance is to rewrite your rule agent as a Java™ agent. Consider this approach if your rules are technical and there is no real benefit to expressing logic as business rules.

Tip 3: Avoid making expensive OSGi calls for every event

Examine your rules to see if you need to make expensive Open Services Gateway initiative (OSGi) calls every time you receive an event. For example, if you call predictive analytics software, such as IBM SPSS Statistics, does the call need to be made for every event? Or, can the call be made less frequently by introducing another agent that receives an event summary? This concept is explained later in this tutorial.
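
As a simplified illustration of making the call less often, the following Java sketch refreshes a score only on every Nth event and reuses the last result in between. It is a count-based throttle rather than the summary-agent approach this tutorial builds later, and the ThrottledScoring class and its scoring function are hypothetical stand-ins for the real analytics call.

```java
import java.util.function.IntUnaryOperator;

final class ThrottledScoring {
    private final int everyNEvents;
    private final IntUnaryOperator expensiveScore; // hypothetical stand-in for the analytics call
    private int eventsSinceLastCall;
    private int expensiveCalls;
    private int lastScore; // stays 0 until the first expensive call is made

    ThrottledScoring(int everyNEvents, IntUnaryOperator expensiveScore) {
        if (everyNEvents < 1) {
            throw new IllegalArgumentException("everyNEvents must be at least 1");
        }
        this.everyNEvents = everyNEvents;
        this.expensiveScore = expensiveScore;
    }

    /** Returns the most recent score, refreshing it only on every Nth event. */
    int onEvent(int averageRpm) {
        eventsSinceLastCall++;
        if (eventsSinceLastCall >= everyNEvents) {
            lastScore = expensiveScore.applyAsInt(averageRpm);
            expensiveCalls++;
            eventsSinceLastCall = 0;
        }
        return lastScore;
    }

    int expensiveCalls() {
        return expensiveCalls;
    }

    public static void main(String[] args) {
        // The lambda is a placeholder; a real solution would call the analytics service here.
        ThrottledScoring scoring = new ThrottledScoring(100, rpm -> rpm / 1000);
        for (int i = 0; i < 500; i++) {
            scoring.onEvent(9000);
        }
        System.out.println("expensive calls made: " + scoring.expensiveCalls());
    }
}
```

The trade-off is freshness: between refreshes, decisions are based on a score that can be up to N events old.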

Tip 4: Clone the entity

If you have followed tips 1-3 and still have hot entities, consider cloning the entity. Unlike the approach in tip 1, which splits the entity into logical constituent parts, this pattern clones the same entity n times. Each clone consumes 1/n of the total events for the overall object. For more information on this pattern, see The clone pattern implementation section later in this tutorial.
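
To see how the load spreads, the following Java sketch assigns each event of a storm to a randomly chosen clone and counts the events per clone. The CloneRouting class is hypothetical and only simulates the routing; with a uniform random choice, each clone receives roughly 1/n of the events rather than exactly 1/n.

```java
import java.util.Random;

final class CloneRouting {

    /** Returns how many events each clone receives when clones are picked uniformly at random. */
    static int[] distribute(int numberOfEvents, int numberOfClones, Random random) {
        if (numberOfClones < 1) {
            throw new IllegalArgumentException("numberOfClones must be at least 1");
        }
        int[] eventsPerClone = new int[numberOfClones];
        for (int i = 0; i < numberOfEvents; i++) {
            int cloneId = random.nextInt(numberOfClones); // 0 .. numberOfClones - 1
            eventsPerClone[cloneId]++;
        }
        return eventsPerClone;
    }

    public static void main(String[] args) {
        int[] eventsPerClone = distribute(500, 10, new Random());
        for (int cloneId = 0; cloneId < eventsPerClone.length; cloneId++) {
            System.out.println("clone " + cloneId + ": " + eventsPerClone[cloneId] + " events");
        }
    }
}
```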

Will your solution experience hot entities?

To determine whether your Decision Server Insights solution is prone to hot entities, use the following formula:
hot entity potential = inbound EPS / (instance EPS * min instances)

Let inbound EPS be the sustained volume of inbound events per second.

Let instance EPS be the maximum number of events that a single entity instance can process per second.

Let min instances be the fewest number of entity instances involved in processing inbound events (the worst case).

If your hot entity potential is greater than 1, you have a risk of hot entities in your solution. The higher the value, the higher the risk.

Cool entity example

The inbound EPS value is 100, the instance EPS value is 50, and the min instances value is 4:
100 / (50 * 4) = 0.5

The hot entity potential is 0.5, which indicates a cool solution.

Hot entity example

The inbound EPS value is 100, the instance EPS value is 50, and the min instances value is 1:
100 / (50 * 1) = 2

This time the hot entity potential is 2, which indicates a hot solution. If you have a risk of hot entities, address them by applying the tips from the previous section.
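
The formula is simple enough to capture in a small helper. The following Java sketch reproduces the two worked examples; the class and method names are chosen for illustration only.

```java
/** Sketch of the hot entity potential formula from this tutorial. */
final class HotEntityPotential {

    /** hot entity potential = inbound EPS / (instance EPS * min instances) */
    static double compute(double inboundEps, double instanceEps, int minInstances) {
        if (instanceEps <= 0 || minInstances <= 0) {
            throw new IllegalArgumentException("instanceEps and minInstances must be positive");
        }
        return inboundEps / (instanceEps * minInstances);
    }

    /** A value greater than 1 signals a risk of hot entities. */
    static boolean isAtRisk(double potential) {
        return potential > 1.0;
    }

    public static void main(String[] args) {
        double cool = compute(100, 50, 4); // cool entity example: 0.5
        double hot = compute(100, 50, 1);  // hot entity example: 2.0
        System.out.println("cool example: " + cool + ", at risk: " + isAtRisk(cool));
        System.out.println("hot example: " + hot + ", at risk: " + isAtRisk(hot));
    }
}
```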

The next section shows how to apply these tips to an example based on real work with IBM ODM customers who encountered hot entities. Take a deep dive into three different Decision Server Insights implementations of the airplane monitoring system example that was illustrated in the Hot entity types section. The first solution is the simplest, but the most prone to hot entities. The second solution applies the divide and conquer pattern to reduce the heat but is still problematic. The third solution is fully protected against hot entities.

Example solution 1: The airplane monitoring solution with no optimization

The first attempt at implementing the example airplane monitoring solution is shown in the previous Hot entity types section. Airplane entities are associated with airplane events. As shown in the following code listing, an airplane entity records flight data about an airplane:

Listing 1. The airplane entity
/****** Airplane Entity ******/
an airplane is a business entity identified by an airplane id .
an airplane has an average engine exhaust temperature ( integer ) .
an airplane has an average engine pressure ratio ( integer ) .
an airplane has an average engine rpm ( integer ) .
an airplane has a wing warnings ( integer ) .
an airplane has a cockpit warnings ( integer ) .
an airplane has a fuselage warnings ( integer ) .
an airplane has a gear warnings ( integer ) .
an airplane has an event count ( integer ) .

An airplane event is emitted by an aircraft at regular intervals to provide flight data, as shown in the following code listing:

Listing 2. The airplane event
/****************** Airplane Event ******************/
an airplane event is a business event time-stamped by a timestamp ( date & time ) .
an airplane event has an aircraft id .
an airplane event has an engine . 
an airplane event has a wing . 
an airplane event has a gear .
an airplane event has a cockpit .
an airplane event has a fuselage .

a wing is a concept .
a wing has a lift ( integer ) .

a fuselage is a concept .
a fuselage has a pressure ( integer ) .

a cockpit is a concept .
a cockpit has an altitude ( integer ) .
a cockpit has a speed ( integer ) .

a gear is a concept .
a gear has a gear state ( a gear status ).

an engine is a concept .
an engine has a pressure ratio ( integer ) .
an engine has an rpm ( integer ) .
an engine has an exhaust temperature ( integer ) .

a gear status can be one of: UP, DOWN, STUCK.

The airplane entity is bound to the airplane agent, and the agent receives airplane events. The airplane agent contains the following rules.

Business rules

As shown in the following screen capture of the Insight Designer, there are six business rules for monitoring the airplane:

Screen capture of business rules in the Insight Designer in IBM Operational Decision Manager Decision Server Insights

There are two engine rules. The first is CalcAverages, which computes a rolling weighted average for engine parameters, shown in the following code listing:

Listing 3. The CalcAverages rule
when an airplane event occurs
then
    define rpmAverage as ( the average engine rpm of 'the airplane' + 
                                   the rpm of the engine of this airplane event  ) / 2 ;
    define pressureAverage as (    the average engine pressure ratio of 'the airplane' + 
                                   the pressure ratio of the engine of this airplane event  ) / 2 ;
    define exhaustTempAverage as ( the average engine exhaust temperature of 'the airplane' + 
                                   the exhaust temperature of the engine of this airplane event  ) / 2 ;

    set the average engine rpm of 'the airplane'  to rpmAverage;
    set the average engine pressure ratio of 'the airplane' to pressureAverage;
    set the average engine exhaust temperature of 'the airplane' to exhaustTempAverage;
    set the event count of 'the airplane' to the event count of 'the airplane' + 1 ;
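
The update in this rule, (previous average + new reading) / 2, gives the newest reading a weight of one half and halves the weight of every older reading each time it is applied. The following Java sketch shows the same arithmetic. It assumes integer division, as in the Java agent later in this tutorial, and the RollingAverage class name is illustrative.

```java
final class RollingAverage {

    /** Same update as the CalcAverages rule: (previous average + new reading) / 2. */
    static int update(int previousAverage, int newReading) {
        return (previousAverage + newReading) / 2; // integer division truncates any fraction
    }

    public static void main(String[] args) {
        int average = 100; // starting average
        int[] readings = {200, 300, 300};
        for (int reading : readings) {
            average = update(average, reading);
            System.out.println("reading " + reading + " -> average " + average);
        }
        // The averages after each reading are 150, 225, and 262.
    }
}
```

Because only the previous average is needed, the entity can track the trend without keeping any event history.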

The second engine rule is EngineShutdown, which applies the averages calculated in the previous code listing to predict engine failure. As shown in the following code listing, an IBM SPSS Statistics analytics algorithm is run to determine a failure probability. If the probability is over eight, then an engine shut-down event is emitted.

Listing 4. The EngineShutdown rule
when an airplane event occurs
if
    calculate engine failure probability ( the average engine exhaust temperature of 'the airplane' ,
                                           the average engine pressure ratio of 'the airplane' ,
                                           the average engine rpm of 'the airplane' ) is more than 8
then
    emit a new actionable event where
        the operator action is ENGINE_ERROR ,
        the reason is "Engine Failing on " + the aircraft id of this airplane event ,
        the aircraft id is the aircraft id of this airplane event ,
        the timestamp is now ;

In this solution, the IBM SPSS Statistics call to 'calculate engine failure probability' is a simulated call.

Now you can explore the remaining rules that are provided in the solution on GitHub.

Run example solution 1

  1. Go to http://github.com/ncrowther/CoolingHotEntities
  2. Download the sample code.
  3. Extract the contents to a convenient location.
  4. Open Insight Designer.
  5. Click File > Import and select General > Existing Projects into Workspace.
  6. Import all the projects under the downloaded JetSolution1 folder into Insight Designer.
  7. If you see a Generate Java Model error, right-click the JetSolution project and select Configure > Migrate Solution.
  8. Start the cisDev Decision Server Insights server on your local machine.
  9. Right-click the JetSolution project and select Deploy > Deployment Configurations.
  10. Click New and create a deployment name called JetSolution.
  11. Select Local server. Then click Next.
  12. Select Disable SSL hostname verification and Disable server certificate verification, and click Next.
  13. Select Create new connectivity server configuration, and click Next.
  14. Ensure all options are selected (Inbound Endpoints, HTTP and InboundHttpEndpoint), and click Finish.
  15. The solution should be deployed to your local cisDev Decision Server Insights server instance. Check the console for the following message:
    CWMBE1452I: Successfully deployed connectivity for the solution "JetSolution".
  16. In Insight Designer, go to the JetStatusTester project and open JetStatusTestSeq.eseq under the Event Sequences folder.
  17. In the top level directory of the project, edit testdriver.properties and ensure the trustStoreLocation is set to your Decision Server Insights installation path.
  18. Right-click JetStatusTestSeq.eseq, and select Run As > Event Sequence.
  19. Select Enable recording, Reset solution state and Delete all entries, and click Run.
    The script runs and sends one or more events into the solution. The rules fire and create Actionable Events for the Flight Control team to process. A recording captures all submitted and emitted events, and stores the state of the entities before and after each event is processed, in a file that you can view in Insight Inspector.
  20. Open the following recording in Insight Inspector to see the events: https://localhost:9443/ibm/insights/view?id=JetSolution-0.0.
  21. In Insight Inspector, verify that all events are concentrated on one airplane entity, and that the entity is hot, as shown in the following screen capture:
Screen capture of the Insight Designer for example solution 1

Example solution 2: Apply the divide and conquer pattern

The problem with the first example solution is that the airplane entity is hot. Example solution 2 applies a divide and conquer pattern to the airplane event so that it is split into constituent parts such as engine, wing, and gear.

In the sample code, you can see that both the airplane entity and the airplane event are split into parts, as shown in the following screen capture:

Illustration of example solution 2 divide and conquer approach

Splitting the airplane event into component events not only improves performance but also allows you to configure the number of components of an aircraft without changing the event structure.

Run example solution 2

Follow all the build and run instructions for JetSolution1, but this time, import JetSolution2.

  1. Open the following recording in Insight Inspector to see the events: https://localhost:9443/ibm/insights/view?id=JetSolution-0.0.
  2. You should see the entities shown in the following screen capture in the Insight Inspector:
Screen capture of the Insight Designer for example solution 2

Verify from Insight Inspector that the events are now dispersed to different components, improving performance and reducing the likelihood of hot entities. To be sure, you can run a performance tester, as described in the next section.

Run the performance tester for example solution 2

Now run a Java program for a stress test of the engine component.

To run the performance tester, complete the following steps:

  1. Open the HttpPerformanceTester project in JetSolution2.
  2. Go to the src/dsi folder and open DSISendEvent.java.
  3. At the start of the main() method, set the following values:
    NUMBER_OF_AIRPLANES = 50;
    NUMBER_OF_ENGINES = 2;
    NUMBER_OF_EVENTS = 500;
  4. Check that the port number on the inbound HTTP endpoint is correct for your installation:
    urlStr = "http://localhost:9080/jetstatus/InboundHttpEndpoint";
  5. Save the changes and run the program by right-clicking DSISendEvent.java and selecting Run As > Java Application.
    Examine engine agent activity in the log here:
    Decision_Server_Insights_installation_directory\runtime\wlp\usr\servers\cisDev\logs\trace.log
    Examine the engine entities through the REST API here:
    https://localhost:9443/ibm/ia/rest/solutions/JetSolution/entity-types/entityModel.Engine

    As expected, the engine entities are updated with the latest engine data in a matter of seconds.

  6. Now change the following constants to simulate a single engine failure emitting a storm of events:
    NUMBER_OF_AIRPLANES = 1;
    NUMBER_OF_ENGINES = 1;
    NUMBER_OF_EVENTS = 500;
  7. Run the program again and examine the trace log. This time it takes several minutes to process just 500 events. Why? Because all events are directed to a single engine entity instance and this instance is hot, as shown in the following screen capture:
Illustration of example solution 2 hot entity

Example solution 3: Improve the solution

In example solution 2, the engine entity is hot. It received all engine events when an engine was about to fail. Now, in the final example solution 3, you apply the clone pattern to cool the airplane solution. Engine events for each engine instance are now consumed by one of n engine clone entities. Each clone summarizes the data, and at defined intervals, the clone passes a summary to the engine agent. The business rules in the engine rule agent make the decision about whether the engine is about to fail. This agent receives only a subset of events, so its bound entities are not hot.

When an engine event storm occurs, you see the following activity in the solution:

  1. Engine events are sent to an engine clone agent, which is bound to multiple engine clone entities.
  2. Every thirty seconds, each engine clone entity sends an engine summary event to the engine rule agent.
  3. The engine rule agent is bound to the engine entity. The rules use the engine summary event to determine whether the engine is failing.

This sequence is shown in the following screen capture:

Illustration of example solution 3 with no hot entities

The clone pattern implementation

The cornerstone of the clone pattern is the agent Java class, EngineCloneJavaAgent.java. It divides an incoming event storm for a single engine among many clone entities to distribute the event load. Each clone sends only a slow trickle of summary events, one every thirty seconds, to the engine entity, so the engine entity is no longer hot.

The following process event method is inside the EngineCloneJavaAgent.java class:

Listing 5. The process event method
	@Override
	public void process(Event event) throws AgentException {

		if (event instanceof EngineEvent) {

		   // Summarize the Engine event
		   summarizeEngineEvent((EngineEvent) event);

		}
	}

If the event is an instance of EngineEvent, then the following summarizeEngineEvent method is called:

Listing 6. The summarizeEngineEvent method
	/**
	 * Summarize the Engine event
	 * 
	 * @param engineEvent the engine event to be summarized
	 * @throws AgentException
	 * @throws EntityTypeException
	 */
	private void summarizeEngineEvent(EngineEvent engineEvent) throws AgentException,
			EntityTypeException {

		EngineClone engineClone = (EngineClone) getBoundEntity();

		String engineCloneName = engineEvent.getEngineCloneId();
		String engineName = engineEvent.getEngineId();

		if (conceptFactory == null) {
			conceptFactory = getConceptFactory(ConceptFactory.class);
		}

		if (engineClone == null) {

			// First event for this clone: create the clone and seed the averages
			printToLog(Level.INFO, "Creating a new engine Clone: "
					+ engineCloneName + " associated to engine: " + engineName);

			engineClone = (EngineClone) createBoundEntity();
			engineClone.setEngineId(engineEvent.getEngineId());
			engineClone.setEngineCloneId(engineCloneName);
			engineClone.setAircraftId(engineEvent.getAircraftId());
			engineClone.setAverageExhaustTemperature(engineEvent.getExhaustTemperature());
			engineClone.setAverageRpm(engineEvent.getRpm());
			engineClone.setAveragePressureRatio(engineEvent.getPressureRatio());
			engineClone.setEventCount(1);
			engineClone.set$CreationTime(engineEvent.getTimestamp());

			// Load the bound entity back into the grid
			updateBoundEntity(engineClone);
		} else {

			// Existing clone: fold the new reading into the rolling averages
			printToLog(Level.INFO,
					"Calculate averages in engine Clone: " + engineCloneName);

			int eventCount = engineClone.getEventCount() + 1;
			engineClone.setEventCount(eventCount);

			int averagePressureRatio = (engineClone.getAveragePressureRatio()
					+ engineEvent.getPressureRatio()) / 2;
			engineClone.setAveragePressureRatio(averagePressureRatio);

			int averageRpm = (engineClone.getAverageRpm() + engineEvent.getRpm()) / 2;
			engineClone.setAverageRpm(averageRpm);

			int averageExhaustTemperature = (engineClone.getAverageExhaustTemperature()
					+ engineEvent.getExhaustTemperature()) / 2;
			engineClone.setAverageExhaustTemperature(averageExhaustTemperature);
		}

		// Schedule call back to emit summary event after n seconds
		if (!engineClone.isTimerRunning()) {

			engineClone.setTimerRunning(true);

			printToLog(Level.INFO,
					"Timer started for : " + engineClone.getEngineCloneId());

			final int TIMER_INTERVAL = 30;
			schedule(TIMER_INTERVAL, TimeUnit.SECONDS, "");

		}

		updateBoundEntity(engineClone);

	}

This summarizeEngineEvent method summarizes the EngineEvent by creating weighted averages for engine parameters. It then schedules a call back in thirty seconds (if not already scheduled) to emit a summary event.

Every thirty seconds, the following timer call-back method is called:

Listing 7. The timer call-back method
	@Override
	// Timer callback method
	public void process(String key, String cookie) throws AgentException {

		EngineClone engineClone = (EngineClone) getBoundEntity();

		if (engineClone != null) {

			// Emit summary event to Engine Entity
			emitSummaryEvent(engineClone);

			// Delete the entity as its job is done
			deleteBoundEntity();
		}
	}

This timer call-back method then calls the following emitSummaryEvent method to send an engine summary event:

Listing 8. The emitSummaryEvent method
	/**
	 * Emit an Engine Summary Event
	 * 
	 * @param engineClone the Engine clone bound entity
	 */
	private void emitSummaryEvent(EngineClone engineClone) {

		if (conceptFactory == null) {
			conceptFactory = getConceptFactory(ConceptFactory.class);
		}

		EngineSummaryEvent engineSummaryEvent =
				conceptFactory.createEngineSummaryEvent(ZonedDateTime.now());

		engineSummaryEvent.setEngineId(engineClone.getEngineId());
		engineSummaryEvent.setAircraftId(engineClone.getAircraftId());
		engineSummaryEvent.setAveragePressureRatio(engineClone.getAveragePressureRatio());
		engineSummaryEvent.setAverageRpm(engineClone.getAverageRpm());
		engineSummaryEvent.setAverageExhaustTemperature(engineClone.getAverageExhaustTemperature());
		engineSummaryEvent.setEventCount(engineClone.getEventCount());

		try {
			printToLog(Level.INFO,
					"Emit Engine Summary Event from : "
							+ engineClone.getEngineCloneId() + " to "
							+ engineClone.getEngineId());
			emit(engineSummaryEvent);
		} catch (AgentException e) {
			printToLog(Level.SEVERE, "Error emitting Engine Summary Event : "
					+ engineSummaryEvent.get$Id());
			e.printStackTrace();
		}
	}

After sending the summary event, the engine clone agent deletes the bound entity. The agent creates new clone entities only if it receives more engine events.

Deleting and creating clones again in this way creates memory elasticity, which means the clones only exist in memory while there is a storm of events. When there are no event storms, there are fewer clones, and when there are no events, there are no clones.
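
The lifecycle can be simulated without the Decision Server Insights runtime. The following Java sketch uses a map as a stand-in for the grid: a clone is created on its first event, updated on later events, and removed when its timer fires. All names in the sketch are hypothetical, and the timer is triggered by an explicit method call rather than a scheduler.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CloneLifecycleSketch {

    /** Hypothetical in-memory clone state; not a DSI entity. */
    static final class Clone {
        int eventCount;
        int averageRpm;
    }

    private final Map<String, Clone> clones = new HashMap<>(); // stand-in for the grid
    private final List<String> emittedSummaries = new ArrayList<>();

    /** Creates the clone on its first event, otherwise folds the reading into the average. */
    void onEngineEvent(String cloneId, int rpm) {
        Clone clone = clones.get(cloneId);
        if (clone == null) {
            clone = new Clone();
            clone.averageRpm = rpm;
            clone.eventCount = 1;
            clones.put(cloneId, clone);
        } else {
            clone.averageRpm = (clone.averageRpm + rpm) / 2;
            clone.eventCount++;
        }
    }

    /** Simulated timer callback: emit a summary, then delete the clone. */
    void onTimer(String cloneId) {
        Clone clone = clones.remove(cloneId);
        if (clone != null) {
            emittedSummaries.add(cloneId + ": events=" + clone.eventCount
                    + ", averageRpm=" + clone.averageRpm);
        }
    }

    int liveClones() {
        return clones.size();
    }

    List<String> summaries() {
        return emittedSummaries;
    }

    public static void main(String[] args) {
        CloneLifecycleSketch sketch = new CloneLifecycleSketch();
        sketch.onEngineEvent("clone-0", 100);
        sketch.onEngineEvent("clone-0", 200);
        sketch.onEngineEvent("clone-1", 300);
        System.out.println("live clones during the storm: " + sketch.liveClones());
        sketch.onTimer("clone-0");
        sketch.onTimer("clone-1");
        System.out.println("live clones after the timers fire: " + sketch.liveClones());
        System.out.println(sketch.summaries());
    }
}
```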

Tune the clone parameters

If you apply a clone pattern, there is a trade-off between real-time response and memory consumption.

You can control the variables of real-time response and memory consumption with two parameters, NUMBER_OF_CLONES and TIMER_INTERVAL.

NUMBER_OF_CLONES parameter

For best performance, use 10 clones per entity. More than 10 could consume too much memory, and fewer than 10 could still allow hot entities. However, you can tune the number of clones to suit your solution.

For example, if you have a few entities that are always hot, increase the number of clones. If you have many entities that are occasionally hot, reduce the number of clones to save memory.

To change the number of clones in Example Solution 3, edit DsiEventXmlFactory.java in the HttpPerformanceTester project. Edit the NUMBER_OF_CLONES constant, as shown in the following code listing:

Listing 9. Change the number of clones
	final int NUMBER_OF_CLONES = 10; 
	final int engineCloneId = (int) (Math.random() * NUMBER_OF_CLONES);

TIMER_INTERVAL parameter

The timer interval specifies the time delay between receiving an engine event and sending an engine summary event. The longer the delay, the fewer the summary events.

If your solution requires a quicker response time, you can reduce the timer delay. However, understand that this approach means more summary events, which could cause the engine entity to become hot again.

For a good compromise between the number of summary events and near real-time situation detection, use a delay of 30 seconds.
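
To reason about the trade-off, it helps to estimate how many summary events can reach a single engine entity. The following Java sketch assumes that, during a sustained storm, each clone emits at most one summary event per timer interval, which gives an upper bound of NUMBER_OF_CLONES / TIMER_INTERVAL summary events per second. The class and method names are illustrative.

```java
final class SummaryRateSketch {

    /** Upper bound on summary events per second sent to one engine entity. */
    static double maxSummaryEps(int numberOfClones, int timerIntervalSeconds) {
        if (numberOfClones < 1 || timerIntervalSeconds < 1) {
            throw new IllegalArgumentException("both arguments must be at least 1");
        }
        return (double) numberOfClones / timerIntervalSeconds;
    }

    public static void main(String[] args) {
        int numberOfClones = 10;
        int[] intervals = {5, 30, 60};
        for (int seconds : intervals) {
            System.out.println(seconds + " s interval: at most "
                    + maxSummaryEps(numberOfClones, seconds) + " summary events per second");
        }
    }
}
```

You can feed the resulting rate into the hot entity potential formula as the inbound EPS for the engine entity to check whether a shorter interval risks making that entity hot again.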

To change the timer interval in Example Solution 3, edit EngineCloneJavaAgent.java in the EngineCloneJavaAgent project. Change the TIMER_INTERVAL constant, as shown in the following code listing:

Listing 10. Change the timer interval
	final int TIMER_INTERVAL = 30;
	schedule(TIMER_INTERVAL, TimeUnit.SECONDS, "");

Run example solution 3

To run the cool entity sample code for Example Solution 3, complete the following steps:

  1. Stop and then restart the cisDev Decision Server Insights server with the --clean option:
    server stop cisDev
    server start cisDev --clean
  2. Follow all the build and deploy instructions for JetSolution1, but this time import JetSolution3.
  3. Open the HttpPerformanceTester project.
  4. Navigate to the src/dsi folder and open DSISendEvent.java.
  5. At the start of the main() method, set the following values to simulate a single engine failure emitting a storm of events:
    NUMBER_OF_AIRPLANES = 1;
    NUMBER_OF_ENGINES = 1;
    NUMBER_OF_EVENTS = 500;
  6. Save the changes, run the program, and review the logs.

    This time when you examine the activity log you should see consistent performance whether you run the sample with many planes or just one. However, recognize that the events emitted are no longer in real time but delayed by the timer interval, which in this case is thirty seconds.

Examine the activity log here:
Decision_Server_Insights_installation_directory\runtime\wlp\usr\servers\cisDev\logs\trace.log

Examine the engine clone entities through the REST API. You should see ten entities:
https://localhost:9443/ibm/ia/rest/solutions/JetSolution/entity-types/entityModel.EngineClone

Note that the engine clone entities exist in memory only for a short period. They are deleted after sending their summary information to the engine entity.

Examine the engine entity through the REST API. You should see a single entity with all the information from each of the ten clones:
https://localhost:9443/ibm/ia/rest/solutions/JetSolution/entity-types/entityModel.Engine

Conclusion

In this tutorial, you learned how to identify the causes of hot entities and what design tips help you avoid them. You saw three example Decision Server Insights solutions. The first solution had no hot entity optimization, the second had partial optimization, and the third had full optimization.

Now you can recognize when your IBM ODM Decision Server Insights solution has performance problems from hot entities, and you are equipped with tips to prevent the problems in the future.

Acknowledgments

The authors would like to thank Pierre Berlandier, Dan Selman, and Ben Cornwell for reviewing this tutorial.


Published: November 14, 2016