watsonx Orchestrate On‑Prem: Tiny Milvus CrashLoopBackOff due to etcd backend capacity limit

Troubleshooting

Problem

In Tiny Milvus deployments used by watsonx Orchestrate On‑Prem, the Milvus standalone pod can repeatedly enter a CrashLoopBackOff state because the embedded etcd backend reaches its maximum supported database size. Once this condition occurs, Milvus cannot start or operate normally, leading to service disruption.

Symptom

The issue typically presents with the following symptoms:

Milvus standalone pod shows 0/1 Ready and enters CrashLoopBackOff
Deployment may report ProgressDeadlineExceeded
Milvus pod logs show:

panic: etcdserver: mvcc: database space exceeded

etcd pod logs show:

alarm:NOSPACE
serving /health false due to an alarm
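The two log signatures above can be checked for mechanically. The following is a minimal sketch, assuming the pod logs are piped in on standard input; the function name has_etcd_quota_signature is hypothetical and not part of any IBM tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the log text on stdin contains either
# signature quoted in this technote.
has_etcd_quota_signature() {
  grep -E -q 'mvcc: database space exceeded|alarm:NOSPACE'
}

# Demonstration with the Milvus log line quoted above.
printf '%s\n' 'panic: etcdserver: mvcc: database space exceeded' \
  | has_etcd_quota_signature && echo "quota signature found"

# A line without either signature does not match.
printf '%s\n' 'ordinary startup message' \
  | has_etcd_quota_signature || echo "no quota signature"
```

In practice the input would come from something like oc logs for the affected pod rather than from printf.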

Cause

Tiny Milvus uses a dedicated embedded etcd instance to store Milvus metadata such as collections, schemas, and segment information. This etcd instance is configured with a fixed backend quota:

quota-backend-bytes = 2147483648 bytes (2 GiB)

Key points:

2 GB is a hard etcd backend limit for Tiny Milvus
The limit cannot be increased through supported configuration, PVC expansion, or tuning
As metadata grows, etcd eventually raises a NOSPACE alarm
Once the alarm is raised, Milvus cannot complete required metadata operations at startup and exits, causing CrashLoopBackOff

This issue is not caused by PVC exhaustion, StorageClass constraints, or pod scheduling failures.
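As a sanity check on the quota figure in this section, the value is 2 × 1024³ bytes, which is about 2147 MB in decimal units. The sketch below only performs that arithmetic; the variable name is illustrative.

```shell
#!/bin/sh
# Illustrative arithmetic only; QUOTA_BYTES is an assumed variable name.
QUOTA_BYTES=$((2 * 1024 * 1024 * 1024))   # 2 GiB expressed in bytes
echo "quota-backend-bytes: ${QUOTA_BYTES}"

# The same quota in decimal megabytes (1 MB = 1000000 bytes).
echo "quota in MB: $((QUOTA_BYTES / 1000000))"
```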

Environment

This issue applies to the following environment:

IBM watsonx Orchestrate / IBM Lakehouse (On‑Prem)
Tiny Milvus (standalone deployment)
Red Hat OpenShift
CPD instance operands namespace
Affected pods:
- ibm-lh-lakehouse-wo-milvus-standalone
- ibm-lh-lakehouse-wo-milvus-etcd-0

Diagnosing The Problem

To confirm this issue, perform the following checks on the etcd pod:

oc rsh ibm-lh-lakehouse-wo-milvus-etcd-0 
ETCDCTL_API=3 etcdctl endpoint status --write-out=table
ETCDCTL_API=3 etcdctl endpoint status --write-out=json
ETCDCTL_API=3 etcdctl alarm list

Indicators of this problem include:

etcd database size close to or at ~2 GB
Presence of alarm:NOSPACE
Milvus standalone pod failing during startup with database space exceeded
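To judge how close the database is to the quota, the dbSize field in the JSON output of endpoint status can be compared with the 2147483648-byte limit. The following is a minimal sketch, assuming the JSON is compact (no spaces around the colon) and that grep, head, and cut are available where it runs. The function name and the sample JSON are hypothetical.

```shell
#!/bin/sh
# Sketch: report the etcd DB size as an integer percentage of the quota.
QUOTA_BYTES=2147483648

etcd_db_usage_percent() {
  # Reads endpoint-status JSON on stdin and prints percent of quota used.
  db_size=$(grep -o '"dbSize":[0-9]*' | head -n 1 | cut -d: -f2)
  [ -n "$db_size" ] || { echo "dbSize not found" >&2; return 1; }
  echo $((db_size * 100 / QUOTA_BYTES))
}

# Made-up sample for demonstration; real output contains more fields.
SAMPLE_STATUS='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":48211},"dbSize":1610612736}}]'
echo "$SAMPLE_STATUS" | etcd_db_usage_percent

# Against a live pod the input would instead be:
#   ETCDCTL_API=3 etcdctl endpoint status --write-out=json | etcd_db_usage_percent
```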

Resolving The Problem

1. Increase ETCD and Standalone Milvus Resources

Ensure sufficient CPU, memory, and ephemeral storage are allocated to support normal operation and background maintenance tasks.

Recommended Resource Settings

Ephemeral Storage: 3 GB

Memory: 3 GB

These allocations give etcd enough working space and memory to complete compaction and defragmentation. They do not raise the 2 GB backend quota.

Apply RSI Patch (Stability Headroom Only) to:

Standalone Milvus pod

etcd pod (Milvus dependency)

Note: The RSI patch steps are described in the last section of this technote.

2. Verify ETCD Status

Access the ETCD Pod

oc rsh ibm-lh-lakehouse-wo-milvus-etcd-0

Check ETCD Endpoint Status

ETCDCTL_API=3 etcdctl endpoint status --write-out=table

ETCDCTL_API=3 etcdctl endpoint status --write-out=json

Verify:

ETCD state is Running

Note the revision number from the JSON output (the revision value under header); it is required for compaction.

3. Compact ETCD

Compaction removes old revisions and reduces logical database size.

ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=60s compact <REVISION_NUMBER>

Replace <REVISION_NUMBER> with the value recorded in the previous step.
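Instead of copying the revision by hand, it can be extracted from the JSON status output, similar to the pattern shown in the etcd maintenance documentation. The following is a minimal sketch; the function name and the sample JSON are hypothetical, and the sketch assumes compact JSON with a single endpoint.

```shell
#!/bin/sh
# Sketch: pull the current revision out of endpoint-status JSON on stdin.
etcd_revision_from_status() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*'
}

# Made-up sample for demonstration; real output contains more fields.
SAMPLE_STATUS='[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":48211,"raft_term":3},"dbSize":1610612736}}]'
rev=$(echo "$SAMPLE_STATUS" | etcd_revision_from_status)
echo "revision to compact: ${rev}"

# Inside the etcd pod the same idea would look like:
#   rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | etcd_revision_from_status)
#   ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=60s compact "$rev"
```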

4. Defragment ETCD

Defragmentation reclaims physical disk space after compaction.

ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=60s defrag

If the command fails or times out, retry with a longer command timeout:

ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=120s defrag

5. Verify ETCD Health After Maintenance

ETCDCTL_API=3 etcdctl endpoint health

ETCDCTL_API=3 etcdctl endpoint status --write-out=table

ETCDCTL_API=3 etcdctl endpoint status --write-out=json

6. Disarm ETCD NOSPACE Alarm

If a NOSPACE alarm was raised earlier, disarm it to restore write operations.

ETCDCTL_API=3 etcdctl alarm disarm

7. Restart Milvus Pods

Restart Milvus to ensure it reconnects cleanly to the recovered etcd state.

oc delete pod ibm-lh-lakehouse-xxxxxxx

Replace xxxxxxx with the actual Milvus pod suffix.

Important:

This procedure restores etcd operability, but it does not increase the fixed ~2 GB Tiny Milvus etcd backend limit. Because the backend limit cannot be expanded, recurrence can only be prevented through strict ingestion control or, preferably, by moving to external knowledge sources supported by watsonx Orchestrate. Stop uploads when etcd database usage approaches 90% of the limit.
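The 90% guidance can be turned into a simple gate on the dbSize value reported by etcd. The following is a minimal sketch under the assumption that the quota is 2147483648 bytes; the function name uploads_allowed and the example sizes are hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether ingestion should continue, given dbSize in bytes.
QUOTA_BYTES=2147483648
THRESHOLD_PERCENT=90

uploads_allowed() {
  # $1 = current etcd dbSize in bytes; succeeds while below the threshold.
  limit=$((QUOTA_BYTES * THRESHOLD_PERCENT / 100))
  [ "$1" -lt "$limit" ]
}

# Example sizes chosen for illustration only.
for size in 1610612736 2000000000; do
  if uploads_allowed "$size"; then
    echo "${size} bytes: OK to ingest"
  else
    echo "${size} bytes: stop uploads"
  fi
done
```

With these values the threshold works out to 1932735283 bytes, so the first size passes and the second does not.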

Reaching out to Support

The 2 GB etcd quota-backend-bytes limit is expected to accommodate a large number of collections and vectors.

A script that records Milvus usage metrics is attached. If you need to contact support, run this script and provide the resulting output.

get_milvus_usage_stats.py__3.txt

Run the following command against a wo-conversation-controller pod:

cat get_milvus_usage_stats.py | oc exec -i wo-conversation-controller-xxxxxxxxxx-xxxxx -- python3 - --output-format csv > milvus_stats.csv

Recommendation:

Use external knowledge sources. Tiny Milvus is intended for experimentation and early‑stage usage, not sustained production growth. Within its fixed 2 GB limit, compaction and defragmentation are the only supported ways to reclaim space.

Related Information

Building a knowledge source

Document Location

Worldwide

Steps to Create and Apply IBM Lakehouse Milvus RSI Patches (Stability Headroom Only)

The following RSI patch procedures increase CPU and memory headroom for Milvus pods, improving stability during startup, ingestion, recovery, and etcd maintenance operations. These patches do not increase the Tiny Milvus 2 GB etcd backend storage limit.

A. Milvus Standalone Pod – Resource Limit RSI Patch

Step 1: Create the Patch Specification File

Create a JSON file named ibm-lh-milvus-standalone.json with the following content:

[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "4Gi"
  },
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/cpu",
    "value": "4000m"
  }
]

Step 2: Copy the Patch File to the CPD CLI Workspace

Copy the JSON file to the RSI folder inside your CPD CLI workspace location:

cp ibm-lh-milvus-standalone.json ~/cpd-cli-workspace/olm-utils-workspace/work/rsi/ibm-lh-milvus-standalone.json

Note: Adjust the path according to your actual CPD CLI workspace location.

Step 3: Initialize the OLM Container

Initialize the OLM container using the configured login command:

$CPDM_OC_LOGIN

Step 4: Create the RSI Patch

Run the following command to create the RSI patch and set it to an active state:

cpd-cli manage create-rsi-patch \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-standalone-resource-limit \
  --patch_type=rsi_pod_spec \
  --patch_spec=/tmp/work/rsi/ibm-lh-milvus-standalone.json \
  --spec_format=json \
  --include_labels=milvus/role:standalone \
  --state=active

Step 5: Apply the RSI Patch

Apply the patch to the running pods:

cpd-cli manage apply-rsi-patches \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-standalone-resource-limit

Step 6: Verification

Check Patch Status

To verify the patch has been created and applied successfully:

cpd-cli manage get-rsi-patch-info \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --all

B. Milvus etcd Pod – Memory Limit RSI Patch

Step 1: Create the Patch Specification File

Create a JSON file named ibm-lh-milvus-etcd.json with the following content:

[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "3G"
  }
]
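Note that this patch uses the decimal suffix G, while the standalone patch uses the binary suffix Gi. In Kubernetes resource quantities, G means 10⁹ bytes and Gi means 2³⁰ bytes. The following sketch shows the difference; the function name is hypothetical and it handles only these two suffixes.

```shell
#!/bin/sh
# Sketch: convert a Kubernetes memory quantity with a G or Gi suffix to bytes.
quantity_to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;   # binary gibibytes
    *G)  echo $(( ${1%G} * 1000 * 1000 * 1000 )) ;;    # decimal gigabytes
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

quantity_to_bytes 3G    # value used in the etcd patch
quantity_to_bytes 4Gi   # value used in the standalone patch
```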

Step 2: Copy the Patch File to the CPD CLI Workspace

cp ibm-lh-milvus-etcd.json ~/cpd-cli-workspace/olm-utils-workspace/work/rsi/ibm-lh-milvus-etcd.json

Note: Create the rsi directory if it does not already exist.

Step 3: Initialize the OLM Container

$CPDM_OC_LOGIN

Step 4: Create the RSI Patch

cpd-cli manage create-rsi-patch \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-etcd-resource-limit \
  --patch_type=rsi_pod_spec \
  --patch_spec=/tmp/work/rsi/ibm-lh-milvus-etcd.json \
  --spec_format=json \
  --include_labels=icpdsupport/serviceInstanceId:watsonxTmpLteWxdEngine \
  --state=active

Step 5: Apply the RSI Patch

cpd-cli manage apply-rsi-patches \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-etcd-resource-limit

Step 6: Verification

cpd-cli manage get-rsi-patch-info \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --all

oc get pod -n ${PROJECT_CPD_INST_OPERANDS} \
  -l icpdsupport/serviceInstanceId=watsonxTmpLteWxdEngine \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{" Memory Request: "}{.spec.containers[0].resources.requests.memory}{"\n"}{" Memory Limit: "}{.spec.containers[0].resources.limits.memory}{"\n"}{end}'

Product Synonym

Watson Orchestrate

Document Information

Modified date:
15 May 2026

UID

ibm17271869
