RTAM Troubleshooting and Useful Features
When you are about to tune and performance test the RTAM agent, there are a number of details to pay attention to, along with some key metrics that will help you judge whether the setup is running well.
These are some of the questions I ask myself when we discuss performance concerns or issues with RTAM. Some are general and some are specific.
Average Get Jobs Time
The average amount of time spent retrieving jobs to execute: GetJobsTime / GetJobsProcessed.
Average Put Time
The average amount of time, outside of GetJobsTime, spent putting each message into the internal queue: MessageBufferPutTime / ExecuteMessagesCreated.
Average Execution Time
The average amount of time spent within execute jobs.
Expected Messages Created
This is based on the NumberOfRecordsToBuffer setting for the agent and how many times GetJobs is called within a 10-minute window: NumberOfRecordsToBuffer x GetJobs ~= ExecuteMessagesCreated.
Number of Inventory Activities
RTAM does its execution per item; activities just drive which item to process. However, during execution, RTAM does take into account which nodes in the activities need to be checked/updated (and their DGs). A separate metric, NumInvActivitiesProcessed, captures how many activities are processed.
Max Possible Threads
With the current response times, the maximum useful number of threads is driven by the average put time: Average Execution Time / Average Put Time + 1.
Given a throughput requirement in executions per minute, the minimum required number of threads is Throughput / (60000 / Average Execution Time).
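As a rough sketch of the two formulas above, using the 30ms put time and 350ms execution time that appear in the examples later in this article (the 2000-per-minute throughput target is hypothetical):

```python
import math

# Numbers from the examples in this article; the throughput
# target of 2000 executions/minute is hypothetical.
avg_put_ms = 30.0        # Average Put Time
avg_exec_ms = 350.0      # Average Execution Time
target_per_min = 2000    # required executions per minute

# Max possible threads: beyond this, extra threads just wait for work.
max_threads = int(avg_exec_ms / avg_put_ms) + 1

# Minimum threads to hit the throughput target:
# Throughput / (60000 / Average Execution Time)
min_threads = math.ceil(target_per_min / (60000 / avg_exec_ms))

print(max_threads)  # 12
print(min_threads)  # 12
```

At these response times the target is only just reachable: the minimum thread count required equals the maximum that can usefully be kept busy.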
For the sake of the examples that follow, assume an average put time of 30ms, an average execution time of 350ms, and a GetJobsTime/GetJobsProcessed of 1024/2. It's best to go through the sections below in order and address the problems in that order.
Recompute Availability When Capacity Depleted
Some customers will have RTAM recompute by calling createInventoryActivityList for the entire catalog (or items stocked at a particular node). This can cause delays in RTAM: when a store is depleted of capacity, the activities created may number in the hundreds of thousands or even millions. Though RTAM can scale horizontally, this still requires a very capable database and queue server to handle that kind of workload.
An alternative for customers who do this is to configure sourcing rules so that the first rule considers only nodes that have at least 10% capacity, followed by a sourcing rule that considers all capacity. This allows the best nodes to be picked from among those that still have at least 10% capacity.
Internal Queue is Too Slow
The Average Put Time should be in the single-digit milliseconds. Once it reaches double digits, the internal queue may be having issues. Typically, it has accidentally been configured as a persistent queue, which can lead to average put times of 50 - 100ms. The above example shows an average put time of 30ms. That is far too high and will limit how many threads can be given work. The ideal is a 2 - 3ms put time.
Too Many Threads
When getJobs puts messages into a queue, the executor threads begin working on them right away. The maximum useful number of threads is the average execution time of a message divided by the average amount of time spent putting each message into the queue. In the above example, 30ms is spent per put. With an average execution time of 350ms, you may only be able to keep a maximum of 12 threads busy. Any more threads would result in multiple execution threads waiting for work.
Too Much Time in GetJobs
In general, a getJobs call means one of your threads is looking for work to do. This happens when an executor thread finds no messages in the queue; it then performs the getJobs. In the meantime, other threads continue completing their work, but eventually they finish and try to get a message from the queue. If there is nothing in the queue, they wait until getJobs (running on another thread) gives them work.
The time spent waiting for work to show up in the queue is not measurable by any metrics today, but can be estimated pessimistically. The time it takes getJobs to run is (GetJobsTime/GetJobsProcessed) + the average message put time. In the above example, this gives 1024/2 + 30 = 512 + 30 = 542ms per getJobs call. The maximum time the first executor would have to wait before starting work is this time.
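A quick back-of-the-envelope version of that estimate, using the numbers from the example above:

```python
# Pessimistic estimate of how long the first executor thread may wait
# for getJobs to produce work. Numbers are from the example above.
get_jobs_time_ms = 1024.0   # GetJobsTime
get_jobs_processed = 2      # GetJobsProcessed
avg_put_ms = 30.0           # Average Put Time

worst_wait_ms = get_jobs_time_ms / get_jobs_processed + avg_put_ms
print(worst_wait_ms)  # 542.0
```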
This alone is not a problem, but if there are too many getJobs calls in a 10-minute window, then too much time is wasted getting jobs rather than actually executing. Be careful: just because a lot of getJobs calls are occurring doesn't mean there's a problem. The first run of getJobs may find only 100 activities to process. All 100 are then executed, and the next getJobs may find a new set of 80. And the cycle continues.
To resolve this, compare ExecuteMessagesCreated to the Expected Messages Created. If these numbers are close, then NumberOfRecordsToBuffer is most likely too small; consider increasing it. If they are not close (actual is far less than expected), then there just isn't enough work.
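The comparison above can be sketched as follows. All numbers here are hypothetical, and the 90% closeness threshold is an illustrative assumption, not a documented cutoff:

```python
# Compare actual ExecuteMessagesCreated against the expected value
# (NumberOfRecordsToBuffer x number of getJobs calls in the window).
# All numbers are hypothetical.
number_of_records_to_buffer = 5000
get_jobs_calls = 20               # getJobs calls in a 10-minute window
execute_messages_created = 98000  # actual, from agent statistics

expected = number_of_records_to_buffer * get_jobs_calls  # 100000

# If actual is close to expected, the buffer fills up every time:
# NumberOfRecordsToBuffer is likely too small. (0.9 is an assumed threshold.)
if execute_messages_created >= 0.9 * expected:
    diagnosis = "increase NumberOfRecordsToBuffer"
else:
    diagnosis = "not enough work; frequent getJobs calls are expected"
print(diagnosis)  # increase NumberOfRecordsToBuffer
```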
High Execution Time: List Event Not Used
Typically RTAM raises an event and puts that event into a persistent queue. Persistent queue puts can be as slow as 99ms and as fast as 30ms. First check whether persistent queues are really required for RTAM events. If not, and a lost message is acceptable in case MQ crashes, make the queue nonpersistent. Otherwise, it must stay as is.
Even worse, however, is when the non-list event is used. This event is raised per item-node (or item-DG) combination, so if you have 100 nodes with activities, that's 100 raiseEvent calls. If a persistent put takes 50ms, that's 5 seconds of the execution time spent just writing to the persistent queue. The list event should be used instead; it is raised once per item (or once per execution). So instead of 100 raiseEvents at 50ms each, there will be 1 raiseEvent at maybe 60ms (some 20% overhead).
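The cost difference works out as follows, using the figures from the example above:

```python
# Time spent publishing RTAM events per execution:
# non-list event = one raiseEvent per item-node combination,
# list event     = one raiseEvent per item (or per execution).
nodes_with_activities = 100
persistent_put_ms = 50
list_event_put_ms = 60  # single larger message, ~20% overhead

non_list_total_ms = nodes_with_activities * persistent_put_ms
list_total_ms = 1 * list_event_put_ms

print(non_list_total_ms)  # 5000 ms spent just publishing events
print(list_total_ms)      # 60 ms
```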
High Execution Time: Monitor DG level at Node as Well
When RTAM executes, it calculates availability for all nodes affected. If network level monitoring is enabled, then the affected DGs need to be monitored as well. Affected DGs are those which contain the node that had the activity. In order to compute the availability of the DG, all nodes of the DG must be checked for availability. In extreme examples, this could mean thousands of nodes need to be checked for availability to figure out how much is available at the DG. At 10ms per node, and assuming everything scales linearly, you're looking at tens of seconds spent calculating the availability of a DG.
But RTAM does optimize this a bit. If the network level DG is monitored at node level as well, then only the nodes that have activities will be checked for availability. The DG will be updated based on the delta changes of the activity nodes. Take for example 2 stores that have activities, whose YFS_INVENTORY_ALERTS records show previously calculated values of 10 and 20. RTAM now checks the availability of those nodes and finds, say, 8 and 19. RTAM updates the nodes' YFS_INVENTORY_ALERTS records with the new numbers and updates the DG's YFS_INVENTORY_ALERTS by subtracting 3 ((8 - 10) + (19 - 20) = -3). Taking the same estimated 10ms per node, you're now looking at 20ms of execution. Note that this covers only the time spent calculating availability. That's a huge improvement to response time.
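The delta-based update can be sketched like this. The store names and the DG total of 500 are hypothetical; the per-store numbers come from the example above:

```python
# Sketch of the delta-based DG update: only the two activity nodes are
# re-read, and the DG record is adjusted by the sum of their deltas.
previous_alerts = {"store1": 10, "store2": 20}    # from YFS_INVENTORY_ALERTS
fresh_availability = {"store1": 8, "store2": 19}  # re-checked by RTAM

delta = sum(fresh_availability[n] - previous_alerts[n] for n in previous_alerts)

dg_previous = 500            # hypothetical previously calculated DG total
dg_updated = dg_previous + delta

print(delta)       # -3
print(dg_updated)  # 497
```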
Wrong Agent Being Run
Sometimes there's confusion as to which agent should be run. If you are running activity-based RTAM, there's no reason to run Full Sync mode: in full sync mode, all items and nodes are recomputed and their availability republished. If activity mode is already running, Quick Sync is usually what's really needed; quick sync just republishes what has already been calculated via activity mode. A significantly better throughput can be achieved this way.
There are valid use cases for running full sync. Long story short, full sync is useful after changes to monitor rules and after purging. So it should run once in a great while: weekly for some, monthly for others.
Other High Execution Time
Throughput is based on how many executions per thread per minute x number of threads. The execution time will impact your throughput, but RTAM is horizontally scalable. Get a SQLDEBUG log of RTAM to analyze where the time is being spent.
Valid Test Scenario
The proper approach to testing is not to create activities and see how long they take to process; it is to make the changes that trigger activities. This is a very important point. For level 2 testing focused on RTAM performance, the trigger for RTAM should be supply/demand changes, not direct creation of activities. For level 3 testing, order creation, order scheduling, shipments, etc. will cause RTAM to run for you.
Practical hints to increase performance
1. If the alerts are to be published to a queue - mostly an external queue consumed by 3rd-party systems - use the following property:
2. Use yfs.yfs.rtam.readInventoryForOnlyActivityNodes when monitoring at Node and DG (Network) level. The property takes effect based on the following condition:
3. Max messages to buffer can be increased from 5K to 25K. (This has to be performance tested; refer to MessageBufferPutTime.)
4. COMPUTE_AVAILABILITY_FOR_RTAM = N (Compute availability information with Ship Dates for RTAM)
5. Disable table level caching on YFS_ITEM table
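As a sketch only: assuming these properties are applied via customer_overrides.properties (the standard Sterling mechanism for overriding yfs properties), hints 2 and 4 might look like the fragment below. The yfs. prefixing and the Y/N values shown are assumptions; verify the exact property names and values for your release before using them.

```properties
# Hypothetical customer_overrides.properties fragment - verify before use.
# Hint 2: read inventory only for activity nodes when monitoring
# at both Node and DG (Network) level.
yfs.yfs.rtam.readInventoryForOnlyActivityNodes=Y
# Hint 4: do not compute availability information with ship dates for RTAM.
yfs.COMPUTE_AVAILABILITY_FOR_RTAM=N
```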
To see more options and ideas of what to count and check to assess RTAM performance, download the RTAM Questionnaire form below.
If you would like the IBM Performance Team to review your RTAM implementation, please fill out the form, open a new case with IBM Support and upload the form for further review.
If this is your first time reading about RTAM and you would need more business level or functional information, please see our Knowledge Center article:
For detailed information on RTAM Enhancements between version 9.3-9.5, please download this presentation:
Special thanks to Steve Igrejas, Vijaya Bashyam, Shoeb Bihari