IBM Support

watsonx Orchestrate On‑Prem: Tiny Milvus CrashLoopBackOff due to etcd backend capacity limit

Troubleshooting


Problem

In Tiny Milvus deployments used by watsonx Orchestrate On‑Prem, the Milvus standalone pod can repeatedly enter a CrashLoopBackOff state because the embedded etcd backend reaches its maximum supported database size. Once this condition occurs, Milvus cannot start or operate normally, which disrupts the service.

Symptom

The issue typically presents with the following symptoms:
  • Milvus standalone pod shows 0/1 Ready and enters CrashLoopBackOff
  • Deployment may report ProgressDeadlineExceeded
  • Milvus pod logs show:
panic: etcdserver: mvcc: database space exceeded
  • etcd pod logs show:
alarm:NOSPACE
serving /health false due to an alarm

Cause

Tiny Milvus uses a dedicated embedded etcd instance to store Milvus metadata such as collections, schemas, and segment information. This etcd instance is configured with a fixed backend quota:
 
quota-backend-bytes = 2147483648 bytes (2 GiB)
 
Key points:
 
  • 2 GiB is a hard etcd backend limit for Tiny Milvus
  • The limit cannot be increased through supported configuration, PVC expansion, or tuning
  • As metadata grows, etcd eventually raises a NOSPACE alarm
  • Once the alarm is raised, Milvus cannot complete required metadata operations at startup and exits, causing CrashLoopBackOff
This issue is not caused by PVC exhaustion, StorageClass constraints, or pod scheduling failures.

Environment

This issue applies to the following environment:
 
  • IBM watsonx Orchestrate / IBM Lakehouse (On‑Prem)
  • Tiny Milvus (standalone deployment)
  • Red Hat OpenShift
  • CPD instance operands namespace
  • Affected pods:
    • ibm-lh-lakehouse-wo-milvus-standalone
    • ibm-lh-lakehouse-wo-milvus-etcd-0

Diagnosing The Problem

To confirm this issue, perform the following checks on the etcd pod:
 
oc rsh ibm-lh-lakehouse-wo-milvus-etcd-0 
ETCDCTL_API=3 etcdctl endpoint status --write-out=table
ETCDCTL_API=3 etcdctl endpoint status --write-out=json
ETCDCTL_API=3 etcdctl alarm list
 
 
Indicators of this problem include:
  • etcd database size close to or at ~2 GB
  • Presence of alarm:NOSPACE
  • Milvus standalone pod failing during startup with database space exceeded
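The size check can also be scripted. The following is a minimal sketch, not part of the IBM procedure. It assumes that the JSON printed by etcdctl endpoint status --write-out=json contains a "dbSize":<bytes> field, and it compares that value with the 2147483648-byte quota given in the Cause section. The function name etcd_db_usage_pct is hypothetical.

```shell
#!/bin/sh
# Quota taken from this technote: quota-backend-bytes = 2147483648.
QUOTA_BYTES=2147483648

# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# backend size as a whole-number percentage of the quota (rounded down).
# Assumption: the JSON carries a "dbSize":<digits> pair.
etcd_db_usage_pct() {
  # Take the first "dbSize" value; "dbSizeInUse" does not match this pattern.
  db_size=$(grep -o '"dbSize":[0-9]*' | head -n 1 | cut -d: -f2)
  if [ -z "$db_size" ]; then
    echo "dbSize not found in input" >&2
    return 1
  fi
  echo $(( db_size * 100 / QUOTA_BYTES ))
}

# Usage inside the etcd pod (not executed here):
#   ETCDCTL_API=3 etcdctl endpoint status --write-out=json | etcd_db_usage_pct
```

A result at or near 100, together with alarm:NOSPACE in the alarm list, matches the indicators listed above.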

 

Resolving The Problem

1. Increase ETCD and Standalone Milvus Resources 

Ensure sufficient CPU, memory, and ephemeral storage are allocated to support normal operation and background maintenance tasks. 

Recommended Resource Settings 

  • Ephemeral Storage: 3 GB 

  • Memory: 3 GB 

These allocations give etcd enough ephemeral storage and memory to complete compaction and defragmentation. They provide headroom only and do not raise the fixed etcd backend quota.

Apply RSI Patch (Stability Headroom Only) to: 

  • Standalone Milvus pod 

  • etcd pod (Milvus dependency) 

 

Note: The RSI patch steps are described in the last section of this technote.

 

2. Verify ETCD Status 

Access the ETCD Pod 

oc rsh ibm-lh-lakehouse-wo-milvus-etcd-0 

Check ETCD Endpoint Status 

ETCDCTL_API=3 etcdctl endpoint status --write-out=table 
ETCDCTL_API=3 etcdctl endpoint status --write-out=json 

Verify: 

ETCD state is Running 

Note the revision number from the output (required for compaction)  
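The revision is reported in the header of the JSON output. As a sketch only, assuming the output contains a "revision":<number> field, the value can be extracted with standard text tools instead of being copied by hand. The helper name etcd_revision is illustrative and is not part of etcdctl.

```shell
#!/bin/sh
# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# current revision. Assumption: the JSON carries a "revision":<digits> pair.
etcd_revision() {
  rev=$(grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2)
  if [ -z "$rev" ]; then
    echo "revision not found in input" >&2
    return 1
  fi
  echo "$rev"
}

# Usage inside the etcd pod (not executed here):
#   rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | etcd_revision)
#   echo "Revision to compact: $rev"
```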

 

3. Compact ETCD 

Compaction removes old revisions and reduces logical database size. 

 ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=60s compact <REVISION_NUMBER>

Replace <REVISION_NUMBER> with the value recorded in the previous step.  

 

4. Defragment ETCD 

Defragmentation reclaims physical disk space after compaction. 

ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=60s defrag 

 

If the command fails, retry with an increased timeout:

ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout=120s defrag  
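The retry can be expressed as a small loop. This is a sketch under stated assumptions: the function names retry_with_timeouts and run_defrag are hypothetical, and the timeout values are the two used in this step (60s, then 120s). The defrag command itself is the one shown above.

```shell
#!/bin/sh
# Run a command once per timeout value and stop at the first success.
# $1 is the name of a command or function that takes the timeout as its
# only argument; the remaining arguments are the timeouts to try in order.
retry_with_timeouts() {
  cmd=$1
  shift
  for t in "$@"; do
    if "$cmd" "$t"; then
      echo "succeeded with --command-timeout=$t"
      return 0
    fi
    echo "failed with --command-timeout=$t" >&2
  done
  return 1
}

# Wrapper around the defrag command from this step.
run_defrag() {
  ETCDCTL_API=3 etcdctl --dial-timeout=30s --command-timeout="$1" defrag
}

# Usage inside the etcd pod (not executed here):
#   retry_with_timeouts run_defrag 60s 120s
```

If every attempt fails, the function returns a non-zero status so that the alarm is not disarmed on an etcd instance that has not been defragmented.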

 

5. Verify ETCD Health After Maintenance 

ETCDCTL_API=3 etcdctl endpoint health 
ETCDCTL_API=3 etcdctl endpoint status --write-out=table 
ETCDCTL_API=3 etcdctl endpoint status --write-out=json 

 

6. Disarm ETCD NOSPACE Alarm 

If a NOSPACE alarm was raised earlier, disarm it to restore write operations.

ETCDCTL_API=3 etcdctl alarm disarm  
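To disarm only when the alarm is actually present, the output of etcdctl alarm list can be checked first. The following sketch assumes that an active alarm is printed as a line containing alarm:NOSPACE, as in the symptom shown earlier. The helper name has_nospace_alarm is hypothetical.

```shell
#!/bin/sh
# Read `etcdctl alarm list` output on stdin.
# Returns 0 if a NOSPACE alarm is listed, 1 otherwise.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Usage inside the etcd pod, after compaction and defragmentation
# (not executed here):
#   if ETCDCTL_API=3 etcdctl alarm list | has_nospace_alarm; then
#     ETCDCTL_API=3 etcdctl alarm disarm
#   fi
```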

 

7. Restart Milvus Pods 

Restart Milvus to ensure it reconnects cleanly to the recovered etcd state. 

oc delete pod ibm-lh-lakehouse-xxxxxxx 

Replace xxxxxxx with the actual Milvus pod suffix. 

 

Important:

This procedure restores etcd operability, but it does not increase the fixed ~2 GB Tiny Milvus etcd backend limit. Recurrence can only be prevented through strict ingestion control and, preferably, by moving to external knowledge sources supported by watsonx Orchestrate, since the backend limit cannot be expanded. Stop uploads when usage approaches ~90%.
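The 90% guideline can be turned into a simple check. This is an illustrative sketch, not an IBM-provided tool: it takes the etcd database size in bytes (for example, the dbSize value from etcdctl endpoint status --write-out=json) and compares it with 90% of the 2147483648-byte quota. The names should_stop_uploads and STOP_AT_PCT are assumptions.

```shell
#!/bin/sh
# Quota taken from this technote; threshold taken from the 90% guideline.
QUOTA_BYTES=2147483648
STOP_AT_PCT=90

# $1 = etcd database size in bytes.
# Returns 0 if uploads should stop, 1 if there is still headroom,
# 2 if the argument is not a whole number.
should_stop_uploads() {
  db_size=$1
  case $db_size in
    ''|*[!0-9]*)
      echo "database size must be a whole number of bytes" >&2
      return 2
      ;;
  esac
  # Compare size*100 with quota*threshold to stay in integer arithmetic.
  [ $(( db_size * 100 )) -ge $(( QUOTA_BYTES * STOP_AT_PCT )) ]
}

# Usage (not executed here):
#   if should_stop_uploads 2000000000; then echo "pause ingestion"; fi
```

With these values the threshold falls at roughly 1.93 billion bytes, so a database of 1 GiB still has headroom and one of 2,000,000,000 bytes does not.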

 

Reaching out to Support

The 2 GiB limit set by the etcd quota-backend-bytes option is expected to accommodate a large number of collections and vectors.

A script that records the usage metrics is attached. If you need to contact support, run this script first.

get_milvus_usage_stats.py__3.txt

Run the following command against a wo-conversation-controller pod:

cat get_milvus_usage_stats.py | oc exec -i wo-conversation-controller-xxxxxxxxxx-xxxxx -- python3 - --output-format csv > milvus_stats.csv

 

Recommendation

Use external knowledge sources. Tiny Milvus is intended for experimentation and early‑stage usage, not sustained production growth. Within its fixed 2 GB limit, compaction and defragmentation are the only supported ways to reclaim space.

Document Location

Worldwide

Steps to Create and Apply IBM Lakehouse Milvus RSI Patches (Stability Headroom Only)

The following RSI patch procedures increase CPU and memory headroom for Milvus pods, improving stability during startup, ingestion, recovery, and etcd maintenance operations. These patches do not increase the Tiny Milvus 2 GB etcd backend storage limit.

 

A. Milvus Standalone Pod – Resource Limit RSI Patch

Step 1: Create the Patch Specification File

Create a JSON file named ibm-lh-milvus-standalone.json with the following content:

[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "4Gi"
  },
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/cpu",
    "value": "4000m"
  }
]

Step 2: Copy the Patch File to the CPD CLI Workspace

Copy the JSON file to the RSI folder inside your CPD CLI workspace location:

cp ibm-lh-milvus-standalone.json ~/cpd-cli-workspace/olm-utils-workspace/work/rsi/ibm-lh-milvus-standalone.json

Note: Adjust the path according to your actual CPD CLI workspace location.

Step 3: Initialize the OLM Container

Initialize the OLM container using the configured login command:

$CPDM_OC_LOGIN

Step 4: Create the RSI Patch

Run the following command to create the RSI patch and set it to an active state:

cpd-cli manage create-rsi-patch \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-standalone-resource-limit \
  --patch_type=rsi_pod_spec \
  --patch_spec=/tmp/work/rsi/ibm-lh-milvus-standalone.json \
  --spec_format=json \
  --include_labels=milvus/role:standalone \
  --state=active

Step 5: Apply the RSI Patch

Apply the patch to the running pods:

cpd-cli manage apply-rsi-patches \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-standalone-resource-limit

Step 6: Verification

Check Patch Status

To verify the patch has been created and applied successfully:

cpd-cli manage get-rsi-patch-info \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --all
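After the patch is applied and the pod has restarted, the effective limits can also be read from the pod itself, in the same way as the etcd verification later in this technote. The sketch below assumes that the pod label is the --include_labels value from Step 4 with the colon replaced by an equals sign (milvus/role=standalone). The function names labels_to_selector and show_limits are hypothetical.

```shell
#!/bin/sh
# Convert an --include_labels value (key:value) into an oc -l selector
# (key=value). Assumption: label keys and values contain no other colons.
labels_to_selector() {
  printf '%s\n' "$1" | sed 's/:/=/g'
}

# Print the memory and CPU limits of the first container of each matching pod.
show_limits() {
  oc get pod -n "${PROJECT_CPD_INST_OPERANDS}" \
    -l "$(labels_to_selector "$1")" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" memory limit: "}{.spec.containers[0].resources.limits.memory}{" cpu limit: "}{.spec.containers[0].resources.limits.cpu}{"\n"}{end}'
}

# Usage (not executed here):
#   show_limits milvus/role:standalone
```

The expected values are the ones set in the patch file: 4Gi for memory and 4000m for CPU, although the API may display the CPU value in an equivalent form.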


 

B. Milvus etcd Pod – Memory Limit RSI Patch


Step 1: Create the Patch Specification File

Create a JSON file named ibm-lh-milvus-etcd.json with the following content:

[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "3G"
  }
]
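The two patch files use different unit suffixes: 4Gi for the standalone pod and 3G here. In Kubernetes resource quantities, G is a decimal unit (10^9 bytes) and Gi is a binary unit (2^30 bytes). The following sketch converts whole-number quantities to bytes so that the values can be compared with the 2147483648-byte etcd quota. The function name qty_to_bytes is hypothetical, and only the M, Mi, G, and Gi suffixes are handled.

```shell
#!/bin/sh
# Convert a whole-number Kubernetes memory quantity to bytes.
# Supported suffixes: M, Mi, G, Gi, or none (plain bytes).
qty_to_bytes() {
  case $1 in
    *Gi) num=${1%Gi}; mult=$(( 1024 * 1024 * 1024 )) ;;
    *Mi) num=${1%Mi}; mult=$(( 1024 * 1024 )) ;;
    *G)  num=${1%G};  mult=1000000000 ;;
    *M)  num=${1%M};  mult=1000000 ;;
    *)   num=$1;      mult=1 ;;
  esac
  # Reject empty or non-integer values such as "1.5Gi".
  case $num in
    ''|*[!0-9]*)
      echo "unsupported quantity: $1" >&2
      return 1
      ;;
  esac
  echo $(( num * mult ))
}

# Usage (not executed here):
#   qty_to_bytes 3G     # etcd memory limit in this patch
#   qty_to_bytes 4Gi    # standalone memory limit in the previous patch
```

Both values are larger than the 2147483648-byte backend quota, which is consistent with these patches providing memory headroom rather than changing the quota.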

 

Step 2: Copy the Patch File to the CPD CLI Workspace

cp ibm-lh-milvus-etcd.json ~/cpd-cli-workspace/olm-utils-workspace/work/rsi/ibm-lh-milvus-etcd.json

Note: Create the rsi directory if it does not already exist.

Step 3: Initialize the OLM Container

$CPDM_OC_LOGIN

 

Step 4: Create the RSI Patch

cpd-cli manage create-rsi-patch \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-etcd-resource-limit \
  --patch_type=rsi_pod_spec \
  --patch_spec=/tmp/work/rsi/ibm-lh-milvus-etcd.json \
  --spec_format=json \
  --include_labels=icpdsupport/serviceInstanceId:watsonxTmpLteWxdEngine \
  --state=active

 

Step 5: Apply the RSI Patch

cpd-cli manage apply-rsi-patches \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --patch_name=ibm-lh-milvus-etcd-resource-limit

 

Step 6: Verification

cpd-cli manage get-rsi-patch-info \
  --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS} \
  --all

oc get pod -n ${PROJECT_CPD_INST_OPERANDS} \
  -l icpdsupport/serviceInstanceId=watsonxTmpLteWxdEngine \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{" Memory Request: "}{.spec.containers[0].resources.requests.memory}{"\n"}{" Memory Limit: "}{.spec.containers[0].resources.limits.memory}{"\n"}{end}'


Product Synonym

Watson Orchestrate

Document Information

Modified date:
15 May 2026

UID

ibm17271869