Troubleshooting
Problem
"Spooling" describes the condition where a canister or HBR cannot process the incoming data stream quickly enough, and writes the raw data to disk as a protective measure.
Symptom
Overview
There are four general (most common) reasons for spooling, and all typically indicate a system resource being overwhelmed by session traffic. For this reason, dropping unwanted traffic (bots, single-hit sessions, etc.) is usually the best remedy. Additionally, if organic traffic growth has persisted over time, a sizing study will eventually be necessary to plan for the future.
The most common triggers for spooling are:
- A "Pipeline Spool": The transport service cannot keep up. There is a constant flow of hits from the top of the pipeline to the bottom, but the outflow is lower than the inflow. In this case, one or more session agents cannot keep up; the usual problem agents are Privacy(Ex), TLTRef, RTA, and RTASplit. Cleaning up privacy rules to make them more efficient may be required.
- Hit Evaluation: This is event processing cost. The Decouple Log will warn that there are too many un-evaluated hits.
- CTree Memory: The number of hits in the short term canister is too high for the allocated memory. The Decouple Log will warn that the CTree memory limit has been exceeded.
- Disk performance: The canister is unable to write sessions to the long term canister (LTC) as quickly as they are arriving. The Decouple Log will warn about the number of sessions waiting for LTC.
When the Canister has exceeded one of its performance thresholds, the DecoupleEx session agent begins queuing incoming hits. In this state, the Canister processes data already in memory while newly arriving data is written directly to disk, until the canister returns to a healthy state. At that point, spooled data is again fed into the canister to work down the spool. Note that spooling causes session fragmentation, as active sessions will time out while their subsequent hits are held back in the spool files.
Cause
Reasons for Spooling
Look for these reasons in the Decouple Log (_DL_):
- "Number of sessions waiting for LTC in the canister ... is high"
- Session processors do this work: If processing cores and memory allow, increase this count in the Canister Configuration.
- Try to keep below 8 (def=2). The CTree database ?Faircom User? connection limit is 32 and is shared by other processes.
- This can indicate a disk I/O bottleneck. SAN drives on shared disk resources often have lower throughput than local drives.
- Microsoft?s free SQLIO testing utility is helpful for determining if this is the problem.
- A change to disk resources, such as adding disks to a RAID array or replacing a downed disk, can introduce temporary processing overhead while the new storage space integrated into the array.
- A change in traffic, such as a surge in one hit sessions (i.e., bot, crawler, or attack traffic) can increase the number of sessions ?waiting to time out? and go into the LTC. Data Drops and lowering the time out can help.
- Impact: This is the worst case scenario, because it is often difficult to fix.
- Microsoft indexing must be disabled on the disc: Go to ?my computer?, right click the drive, and making sure indexing is unchecked.
- It may be an issue with the drive:
- If you are spooling to the same drive as the canister, you are writing each hit TWICE. This is very hard on the drive.
- Wrong physical disc format/setup, SAN improperly mounted, wrong driver, network connectivity to SAN, etc..
- "Number of un-evaluated hits in the canister ... is high"
  - Hit processors do this work: this occurs when the allocated CPU core(s) cannot handle event evaluation.
  - If available processing cores and memory permit, increase this count in the Canister Configuration.
  - Try to keep the count below 8 (default 2). The CTree database "Faircom User" connection limit is 32 and is shared by other processes.
  - Recent changes to privacy rules, events, etc., or an unexpected change in the traffic make-up or volume, can introduce additional processing overhead.
  - Privacy expense is normally single-threaded in the Tealeaf Transport Service; a Multiple Pipeline configuration can help.
  - Event processing is distributed across CPUs among the hit processors.
- "CTree Memory ... limit exceeded"
  - If free memory permits, increase "Max Ctree Bytes" on the canister. On a 32-bit OS, no more than about 1.8 GB should be assigned.
  - For 2 indexing processes (Index Config), leave at least 1600 MB free when setting Max Ctree Bytes.
- "SADecoupleEX: Canister shared memory is stale. flow to canister has been stopped."
  - The Canister Manager service periodically publishes a segment of shared memory containing Canister status/statistics.
  - The DecoupleEx session agent will start spooling if this shared memory segment is not updated often enough, as this indicates that the Canister is either too busy to update it or otherwise not responsive. The default "stale" timeout is 5 minutes.
  - On canister services startup, publishing of the health status is delayed by the TLTMaint health check. This normally clears up after a short time, once the health status is published, but it can delay the de-spooling process.
  - External processes in the Windows operating system may also interfere with publishing shared memory.
- No reason for spooling in the Decouple Log?
  - Privacy expense, etc., in the transport service may be the cause.
  - In this case, the canister itself is not "spooling"; it receives hits in whatever volume the transport service can deliver. However, if the upstream volume is higher than the downstream volume at DecoupleEX, something in the transport service cannot keep up. This is most commonly the PrivacyEX session agent, which is single-threaded and may become overtaxed.
  - Possible resolutions include:
    - Move some privacy rules upstream, to an HBR or PCA.
    - Remove rules or make them more efficient, with Tests, more efficient pattern matching or regular expressions, etc.
    - Create a "multi-pipeline" configuration to spread the work across multiple PrivacyEX session agents.
- Insufficient "Throttle" settings
  - If this is the problem, a key behavior pattern in the Pipeline status will be a continual flow through the pipeline, but with a higher volume "above" DecoupleEX than below. There is no stop-start pattern as would be expected in a canister-limits type of spooling condition. If Privacy or other session agents are not the cause of the queueing, this might be the cause.
  - Resolution: set the DecoupleEX throttling limits to a higher value (ControlMaxOutRate, MaxOutMode, etc.); see "Open up DecoupleEX throttling" under Resolving The Problem.
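As mentioned under the "sessions waiting for LTC" reason above, Microsoft's free SQLIO utility can measure the raw throughput of the canister/LTC drive. A sample invocation is sketched below; the test file path, duration, and block size are placeholders to adjust for your environment, and the flags should be checked against the documentation shipped with your copy of SQLIO.
sqlio -kW -t2 -s120 -frandom -o8 -b8 -BN -LS D:\Tealeaf\sqliotest.dat
This runs random 8 KB unbuffered writes for two minutes with 8 outstanding I/Os per thread and reports latency statistics; consistently low throughput or high latency here points to the disk subsystem rather than the canister.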
Items to Check
- Log file CSS_1966_<server>_DL_<date>.txt: shows the reason, frequency, and duration of spooling.
  - "Failed to open canister's shared memory"
    - This is the normal message after a canister/transport service restart, as TLTMaint runs with exclusive access.
    - If TLTMaint runs into trouble, the canister may not be released. See its log, below.
- Log file TLTMaint.log
  - If this log is not being created, a process creation error may exist.
  - Note: TLTMaint is started from \Tealeaf\Ctree\Server\ctsrvr.cfg via the line "SIGNAL_READY C:\Tealeaf\TLTMaint.exe".
  - This location has been found to be incorrect when a customer copies this file from another machine (see the check after this list).
- Report: Portal Status Report
- Session data: is new, unexpected traffic arriving? Look for:
  - Large (2000+ hit) sessions: denial of service attacks (often 404s), application problems (302 redirect looping), etc.
  - Short (few seconds) sessions: high hit volume will force sessions closed quickly, due to session size or hit count.
- High event overhead: it may be necessary to reduce the active event count.
  - Look for a high Facts/Hit count on the Technical Site Metrics dashboard.
  - As examples, 7-20 is "normal", 40 is very high, and a peak of 80 is extremely high.
- Canister Status on the DecoupleEx (single server) status page: a Canister Status other than "Real-Time" means spooling.
  - The DecoupleEX throttles may be too tight, so that data is not allowed downstream as fast as the system can actually handle it.
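As noted in the TLTMaint item above, a quick way to confirm that the SIGNAL_READY entry points at a real TLTMaint.exe is to inspect ctsrvr.cfg from a command prompt. The install drive below is only an example; adjust it to your installation path.
findstr /I "SIGNAL_READY" "C:\Tealeaf\Ctree\Server\ctsrvr.cfg"
Then verify that the path printed by this command actually exists on the local machine.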
Environment
The initial spooling and subsequent de-spooling behavior is governed by this set of thresholds:
Thresholds
DecoupleEx session agent settings:
- Max/Min Sessions Waiting for LTC (default 20000/1000)
- Max/Min Unevaluated Canister Hits (default 20000/1000)
- Max/Min % Memory Used (default 80/50), measured against Max Ctree Bytes as specified in the canister
- Disk % Free (default 2)
Examples:
- Unevaluated Hits exceeds 20,000 = you need more hit processors.
- UnIndexed Sessions exceeds 20,000 = you need more Index Worker Processes (be careful with this one).
- Sessions awaiting LTC exceeds 20,000 = you need better disk performance, or more session processors (not usually helpful).
- Canister CPU at 80% = you need more CPU allocated to the canister (check the Tealeaf Statistics charts to see which kind of processor, hit or session, is the bottleneck).
- Ctree Memory at 80% = you need more canister memory.
A de-spooling canister will usually move back and forth between a spooling and non-spooling state while the queue is worked down. During the de-spooling phase the flow rate into the canister will typically be much higher than normal, until an upper threshold is again breached. This leads to the normal "breathing" behavior of a de-spooling canister, which is documented in your Decouple log (CSS_1966_<server>_DL_yyyymmdd.txt).
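To see which threshold is being breached and how often, the Decouple log can be searched for the reason messages listed under Cause. The log folder below is only an example; use the logs directory of your Tealeaf installation.
findstr /I /C:"is high" "D:\Tealeaf\Logs\CSS_1966_*_DL_*.txt"
Repeating the search with strings such as "limit exceeded" or "shared memory is stale" covers the other spooling reasons.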
Sizing note: A top-end throughput of about 400 hits/sec is typical for an 8-core machine. To handle up to 750 hits/sec, more hardware is likely needed.
Resolving The Problem
Possible corrective actions
If spooling is becoming a chronic condition, one of the following solutions may alleviate the problem, depending on the root cause:
- Reduce workload:
  - Filter the web traffic captured by Tealeaf (Data Drops, etc.).
  - If acceptable, enable "Session Sampling" in the PCA pipeline settings.
    - This discards hits by session ("use sessioning" is required), so it does not introduce any page loss or fragmentation.
  - Drop "one-hit sessions" (check the 1-hit sessions ratio on the Active > Status page).
    - Alternatively, combine one-hit bot sessions into multi-hit sessions via Privacy or RTA rules.
  - Lower eventing overhead: try to lower the facts/hit count by disabling and cleaning up events.
- Session Sampling:
  - If the current configuration is insufficient for the processing load, short-term relief can be gained via session sampling on the PCA.
  - If this is permissible, you keep only, for example, 80% of sessions, reducing load enough to get everything back on its feet.
- Decrease the Session Idle Timeout value:
  - Reduces STC memory usage, but can lead to session fragmentation.
- Close sessions with Compound Session Close events, based on logical business transaction completion:
  - Reduces STC memory usage.
- Increase Max Ctree Bytes in the Canister Configuration, if the system has headroom or more RAM can be added:
  - A baseline configuration allocates about 50% of memory to CTree on a canister-only machine.
  - Be careful if less than 3 GB is left for the OS, however; monitor closely.
- Add new Processing Servers: CPU + RAM + disk.
- Delete spool files: while not a corrective action, if the spooled data is not important, this will reduce some load.
- Open up DecoupleEX throttling:
  - If the system can actually handle the load, the DecoupleEX throttles may be too restrictive, for example:
    - MaxOutMode=CANBYTES (or CANHITS, or CANBOTH)
    - MaxOutHitsPerSec=1000
    - MaxOutBytesPerSec=4000000
  - See the configuration sketch after this list for how these directives appear in the pipeline configuration.
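A minimal sketch of how the throttle directives above might appear in the pipeline configuration (typically TealeafCaptureSocket.cfg on the Transport service); the section name [DecoupleEx] is assumed to match your DecoupleEx session agent instance, and the values are the examples listed above.
[DecoupleEx]
MaxOutMode=CANBYTES
MaxOutHitsPerSec=1000
MaxOutBytesPerSec=4000000
Restart the Transport service after editing, then confirm via the Pipeline status that flow below DecoupleEX rises without pushing the canister metrics (CPU, memory, sessions waiting for LTC) over their thresholds.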
Clearing / Renaming spool files
When a large spool develops, it is sometimes desirable to move the existing spool files aside in order to monitor ongoing hit flow. This is helpful for evaluating corrective actions (dropping unwanted hits, tuning adjustments, etc.) and also allows "fresh" data to reach the canisters. In many cases, business users are more concerned with current reporting and session data and prefer not to process a large backlog until later.
The easiest way to do this is to change the "SpoolDir" folder name in the DecoupleEx configuration, then restart the transport service. The Transport Service will create the new folder if it does not exist. This directive is not present in the default configuration; when it is absent, the \Spool folder under the Tealeaf installation folder is used. To specify a folder name, add or edit this directive, for example:
SpoolDir=D:\Tealeaf\Spool2
Be sure the Transport service process has full access permissions to this location. You can later copy your saved spool files to the new location in batches; the Transport service will see the files as they are added, without requiring a restart.
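A minimal PowerShell sketch for feeding the saved spool files back in batches follows; the folder names and batch size are placeholders, and the source folder is assumed to hold the spool files that were set aside earlier.
# Move 25 of the oldest saved spool files into the active SpoolDir;
# repeat (or schedule) this as the Transport service works each batch down.
Get-ChildItem "D:\Tealeaf\Spool_saved" -File |
    Sort-Object LastWriteTime |
    Select-Object -First 25 |
    Move-Item -Destination "D:\Tealeaf\Spool2"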
Note that there is a default 3-day retention period in both the Canister and the data collector for unprocessed reporting data. If you do not work through the spool files in that time, you may lose some of the event counts from those hits.
If a large spool is known to contain unwanted data, you can speed the clearing process by dropping these hits below the spool. For example, if an application loop has flooded the Tealeaf system with redirects (status code 302 hits), and a decision has been made to temporarily drop these hits at the top of the pipeline, also drop them below the DecoupleEx session agent to help the spool clear. If you are dropping such hits at the top of an HBR pipeline, add a second DropHit session agent below it, or use the child pipelines or the Canister pipelines, to drop these hits before they reach the canister. This relieves the canister, as well as any other session agents below the DropHit rule, of unnecessary load.
Additional Notes
HBR Spooling
When an HBR spools, it is usually because of some downstream condition on the canisters it feeds, and the outflow controls have very little to do with the root problem. There is a temptation to try to "tune" away the spooling via spooling thresholds and traffic throttle settings, but this approach usually provides very little relief.
The first thing to check on a spooling HBR is the HBR_Poller log, which will identify which canisters are healthy, which are not, and why. Also note that an HBR has multiple spools: one for the main pipeline and one for each child HBR pipeline. Each of those DecoupleEx session agents has its own _DL_ log, so check them as well. Finally, if the HBR reports that it cannot deliver to the canisters and must spool, check the situation on each of the canisters.
Adjusting the throttling at the HBR can provide some improvement, but you must monitor the statistics data to validate the hardware utilization; you do not want to trade spooling due to maximum output for spooling due to disk I/O (sessions waiting for LTC) on the Processing server. Therefore, verify that spooling due to sessions waiting for LTC stays close to zero after increasing throughput, then review the other canister metrics such as CPU %, memory %, and hits/sessions waiting for evaluation. If those stay well below 80% and 20,000 respectively, you can continue to increase throughput until you find a good balance.
Transport Service Restart may be required after a canister restart
When canister services are restarted due to processor count changes, etc., the service stop might stall unless the transport service is also restarted. This should not be necessary, but it has been observed during spooling conditions, perhaps due to the Canister session agent not closing its connection.
Spooling due to high volume from a single Session ID
Hits are sticky to a hit processor in order to maintain hit processing order. Other hit processors can be idle because only the aggregated un-evaluated hits statistic is supplied to the decoupler: it spools whether the 20,000 un-evaluated hits are on a single hit processor or spread across all hit processors.
A possible design improvement would be to provide individual statistics for each hit processor, and spool only when all hit processors reach a threshold. This band-aid would only defer spooling until the un-evaluated hits on the blocked hit processor used up all of the available memory, and recovery from spooling with that many un-evaluated hits would be doubtful.
In some cases, the number of un-evaluated hits in a single hit-processor instance may exceed the maximum (e.g., 20,000), so the other hit processors sit idle until the backlog of hits is evaluated. In these cases the backlog takes longer to work down, as the hits are all held within a single hit processor.
This problem is most commonly seen during denial of service attacks or infinite re-direction loops, where a disproportionate number of hits are arriving with the same TLTSID, and are therefore all routed to the same HBR, the same canister, and internally, to the same hit processor.
Spooling due to single hit session volume
Single-hit sessions are costly because they use memory while they await the idle timeout, and because of event cost, data collection, storage, indexing, etc.
Where single-hit sessions are due to a missing session cookie, the problem in the web application should ideally be fixed, but these hits may be dropped as a defensive measure while that problem is addressed.
For "Bot" hits, ideally they should be dropped in the pipeline before they reach the canister.?
If bot traffic metrics are required to be retained for business reasons, an advanced alternative:
- Via canister RTA or Privacy rules, overwrite the TLTSID with a common value to create multi-hit sessions.
- This creates large sessions that are closed by the safety limits (e.g., 2000 hits), which is far less costly for data aggregation and will actually reduce STC memory consumption compared to single-hit sessions, since on average these hits will be closed earlier.
- Optionally, use an end-of-session event to discard these sessions, to keep them out of the long term canister.