Tuning the reasoning service

Tune Content Cortex AI Services performance parameters to optimize response times, token usage, and cost while maintaining answer quality.

The following configuration parameters determine the behavior and usefulness of the agents in Content Cortex AI Services. They control how aggressively the reasoning service explores, how much prior conversation and tool output it carries forward, and how safely it operates within the model's context budget. Higher limits can improve completeness on complex tasks, but if they are set too high they increase latency, token consumption, and the risk of context overflow or unproductive tool loops. Work within the limits of your chosen model and tune these parameters to achieve your use cases.

To configure these parameters, edit the property files by using the prerequisites.py script. For more information, see Configuring Content Cortex AI Services components.

Tuning the agent iteration limits

The Content Cortex AI Services agents follow a closed-loop workflow. Each run through the loop makes one call to the model and constitutes an iteration. The max_agent_iterations value sets the maximum number of iterations the agent can perform for a single user request before it is forced to stop.

Each model can require a different number of iterations to achieve the same objective. This is influenced by whether and how the model chooses to group the tool calls required to fetch the data from the repository.

The goal is to set this high enough to support your use cases, but not so high that the agent can loop unnecessarily. Start with a low value, around 20 and test simple use cases. If you see a response saying the maximum allowed limit of model calls was reached, increase the value. Continue in the same vein with complex use cases.

The recommended value of 100 supports multi-step retrieval and synthesis with the OpenAI GPT-5.4 model.

Tuning the context window token limit

This parameter defines the total amount of context the reasoning service can send to the model. Context here includes conversation history, retrieved information from the repository through tool calls, system instructions, and the current user request. It is one of the most important tuning controls because a higher value allows the agents in Content Cortex AI Services to retain more of the prior conversation and more retrieved information, which can materially improve answer quality on complex, multi-step queries that depend on accumulated context.

At the same time, setting this value too high can work against efficiency. Larger context windows encourage more history and tool output to be carried forward, which can create context bloat—the model receives more information, but not always more relevant information. That can dilute signal quality, slow responses, and increase token consumption. In pricing models where cost rises with higher usage tiers, an oversized context window can significantly increase operating cost without delivering proportional gains in answer quality.

For most complex use cases involving multi-step retrieval and synthesis across the models we tested, a value less than 250,000 proved sufficient. We recommend testing with complex use cases in order to find an acceptable balance between token usage costs and allowing enough room for meaningful history and retrieved context to improve reasoning quality.

Tuning the context window safety margin

Reserves headroom inside the model context window so the agent does not consume the full token budget before the model can respond.

Set this to 0.80-0.85 so the service can use most of the available context while still preserving safe space for completions and intermediate reasoning. Avoid pushing it too close to 1.0.

Tuning the history window size

This is a sliding window that controls how many prior messages are fed into the model. This should be set to balance the continuity of the conversation while allowing room for the current user query to be addressed. This number does include intermediate, internal messages such as requests for tool calls. These often outnumber the displayed messages by a significant factor (3-10).

Test simple and complex use cases and look up the messages generated per use case. The recommended way to do this is to use the ACCE console to search for the persisted thread. Long-running threads generate more messages. Keeping chat sessions concise and using a new chat session for each business task keeps the message count low.

Tuning the maximum completion tokens

Limits how many tokens the model can generate in its final response, which affects answer depth, latency, and total cost.

In most use cases, this value should be the default value for the model used. It will not be reached but allows the model the freedom to use as many tokens as needed. For finer control, set this high enough for complete synthesis but not excessively high.