What's New
Learn about new and updated information since the last update of this topic collection.
IBM Spyre Enablement Stack for Power 1.1 (RHAIIS 3.4 EA1) - March 2026
- Introduction of chunked prefill and prefix caching.
- Dynamic logical partitioning (DLPAR) of Spyre.
- Red Hat OpenShift AI for Spyre.
- Added new Podman run commands in Running RHAIIS as stand-alone Podman containers.
- Added new Podman quadlet in Running RHAIIS as Podman Quadlets.
- Added new environment variables in Container environment variables for Spyre.
Chunked prefill
The Spyre™ card uses chunked prefill to prevent long pauses during response generation, even when your system processes large or complex requests.
When a new request arrives, the scheduler must process the prefill before generating the output text. This processing can cause delays of 15 - 20 seconds and spikes in inter‑token latency (ITL), which is the time gap between each generated token, resulting in slow responses.
Chunked prefill breaks this large process into smaller chunks, so the model can alternate quickly between processing the request and decoding output tokens. Chunked prefill reduces the processing delays to approximately 4 - 5 seconds, which shortens the waiting time for generated output and improves overall system efficiency.
Chunked prefill must be enabled for all use cases that include decoder models, such as granite‑3.3‑8b. Encoder‑only models such as embeddings and rerankers do not use chunked prefill. For configuration details, see the Podman example commands in Running RHAIIS as stand-alone Podman containers.
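As a sketch of how chunked prefill might be enabled, the example below uses the vLLM `--enable-chunked-prefill` server flag. The image name, model name, and device mapping are placeholders, not the actual RHAIIS values; see the Podman example commands in Running RHAIIS as stand-alone Podman containers for the supported command.

```shell
# Hypothetical configuration sketch: run a decoder model with chunked
# prefill enabled. <rhaiis-spyre-image> and the model name are
# placeholders; replace them with the values from your RHAIIS setup.
podman run --rm -it \
  -p 8000:8000 \
  <rhaiis-spyre-image> \
  --model ibm-granite/granite-3.3-8b-instruct \
  --enable-chunked-prefill
```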
Prefix caching
The Spyre card uses prefix caching to reuse previously computed prompt segments and achieve faster prefill responses when prompts overlap.
Prefix caching is enabled with the --enable-prefix-caching command line interface (CLI) flag while in chunked prefill mode. Prefix caching works the same way as in an upstream virtual large language model (vLLM). However, because Spyre uses fixed‑size prefill chunks, the prefix caching system can reuse a chunk only if the entire chunk is already in the cache. Because of this limitation, the cache hit rate varies based on the workload type and the prompt content that is reused from earlier requests.
Prefix caching must be enabled for all use cases that include decoder models, such as granite‑3.3‑8b. Encoder‑only models such as embeddings and rerankers do not use prefix caching. For configuration details, see the Podman example commands in Running RHAIIS as stand-alone Podman containers.
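Because prefix caching is used while in chunked prefill mode, both flags are typically passed together. The sketch below combines the `--enable-prefix-caching` CLI flag named above with chunked prefill; the image name and model name are placeholders, and the supported command is documented in Running RHAIIS as stand-alone Podman containers.

```shell
# Hypothetical configuration sketch: enable prefix caching together
# with chunked prefill for a decoder model. <rhaiis-spyre-image> and
# the model name are placeholders for your RHAIIS setup.
podman run --rm -it \
  -p 8000:8000 \
  <rhaiis-spyre-image> \
  --model ibm-granite/granite-3.3-8b-instruct \
  --enable-chunked-prefill \
  --enable-prefix-caching
```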