What's New

Learn about new and updated information since the last update of this topic collection.

IBM Spyre Enablement Stack for Power 1.1 (RHAIIS 3.4 EA1) - March 2026

Added the following new topics:
- Chunked prefill
- Prefix caching

Updated the following topics:

Chunked prefill

The Spyre™ card uses chunked prefill to prevent long pauses during response generation, even when your system processes large or complex requests.

When a new request arrives, the scheduler must process the entire prompt (the prefill) before it can generate output text. For large or complex requests, this prefill can take 15 - 20 seconds and cause spikes in inter-token latency (ITL), which is the time gap between consecutive generated tokens, resulting in slow responses.

Chunked prefill breaks this large prefill into smaller chunks, so the model can alternate quickly between processing the incoming prompt and decoding output tokens. Chunked prefill reduces these processing delays to approximately 4 - 5 seconds, which shortens the wait for generated output and improves overall system efficiency.
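The interleaving described above can be sketched with a toy scheduler. The function names and scheduling policy here are illustrative only, not the actual Spyre scheduler; the 1024-token chunk size matches the supported chunk size documented below.

```python
CHUNK_SIZE = 1024  # supported chunk size per the documentation


def schedule_steps(prompt_len, pending_decodes, chunk_size=CHUNK_SIZE):
    """Toy scheduler: split one request's prefill into fixed-size chunks
    so decode steps for other requests can run between chunks, instead of
    stalling behind one long, uninterrupted prefill."""
    steps = []
    decode_left = pending_decodes
    for start in range(0, prompt_len, chunk_size):
        # process the next prefill chunk
        steps.append(("prefill", min(chunk_size, prompt_len - start)))
        # between chunks, let a waiting request decode one token,
        # which keeps inter-token latency (ITL) bounded
        if decode_left > 0:
            steps.append(("decode", 1))
            decode_left -= 1
    # drain any remaining decode work after the prefill finishes
    while decode_left > 0:
        steps.append(("decode", 1))
        decode_left -= 1
    return steps
```

Without chunking, the same workload would be a single `("prefill", 3000)` step, and every decode step would wait behind it; with chunking, decodes are interleaved after each 1024-token chunk.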

Chunked prefill must be enabled for all use cases that use decoder models, such as granite-3.3-8b. Encoder-only models, such as embedding models and rerankers, do not use chunked prefill. For configuration details, see the Podman example commands in Running RHAIIS as stand-alone Podman containers.

Prefix caching

The Spyre card uses prefix caching to reuse previously computed prompt segments and achieve faster prefill responses when prompts overlap.

Note: Prefix caching requires chunked prefill to function. Chunked prefill is enabled by default with a supported chunk size of 1024 tokens.
To enable prefix caching, add the --enable-prefix-caching command-line interface (CLI) flag when chunked prefill is enabled.

Prefix caching works the same way as in upstream vLLM. However, because Spyre uses fixed-size prefill chunks, the prefix cache can reuse a chunk only if the entire chunk is already in the cache. Because of this limitation, the cache hit rate varies with the workload type and how much prompt content is reused from earlier requests.

Prefix caching must be enabled for all use cases that use decoder models, such as granite-3.3-8b. Encoder-only models, such as embedding models and rerankers, do not use prefix caching. For configuration details, see the Podman example commands in Running RHAIIS as stand-alone Podman containers.