Troubleshooting Content-Aware Storage (CAS) service

This topic describes common troubleshooting steps and limitations of the CAS service.

For more information about the Informational, Warning, and Critical events, see Troubleshooting Content-Aware Storage (CAS) service.

nv-ingest pod is in ImagePullBackOff error state

Problem statement
The nv-ingest pod is in the ImagePullBackOff error state because docker.io rejects the image pull with a too-many-requests error.
Resolution
To resolve the issue, authenticate to docker.io.
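
One way to authenticate is to store Docker Hub credentials in an image pull secret. The following is a minimal sketch, not a confirmed procedure for this product. It assumes that kubectl (or oc, which accepts the same arguments) is logged in to the cluster and that the pod runs in a namespace named nv-ingest, which is inferred from the service host name nv-ingest.nv-ingest.svc.cluster.local. The secret name dockerhub-pull and the DOCKER_USER and DOCKER_TOKEN variables are hypothetical.

```shell
# Sketch only: namespace and secret name are assumptions, adjust them to your cluster.
NAMESPACE="${NAMESPACE:-nv-ingest}"           # assumed from the service host name
SECRET_NAME="${SECRET_NAME:-dockerhub-pull}"  # hypothetical secret name

create_dockerhub_pull_secret() {
  # Expects DOCKER_USER and DOCKER_TOKEN (hypothetical names) in the environment.
  # https://index.docker.io/v1/ is the kubectl default server for Docker Hub.
  kubectl create secret docker-registry "$SECRET_NAME" \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username="$DOCKER_USER" \
    --docker-password="$DOCKER_TOKEN" \
    -n "$NAMESPACE"
}
```

After the secret exists, reference it under imagePullSecrets in the service account or pod specification that runs nv-ingest so that the image pull is made as an authenticated user.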

CAS ingestion fails

Problem statement
The CAS ingestion fails with the following error in the logs of the cast-runtime pods:
2025-03-26 22:46:36,725 - ERROR - Error during fetching, retrying... 
Error: HTTPSConnectionPool(host='nv-ingest.nv-ingest.svc.cluster.local', port=7670): 
Max retries exceeded with url: /v1/fetch_job/ (Caused by SSLError(SSLError(1, '[SSL] record layer failure (_ssl.c:1006)')))
Cause
The error indicates that the NVIDIA NIM service is not available over HTTPS.
Resolution
  1. Scale the cast-runtime deployment down to 0 replicas.
  2. In the cast-runtime deployment, change https to http in the value of the NVMM_NIM_SERVICE variable:
    - name: NVMM_NIM_SERVICE
      value: https://nv-ingest.nv-ingest.svc.cluster.local
    to
    - name: NVMM_NIM_SERVICE
      value: http://nv-ingest.nv-ingest.svc.cluster.local
  3. Scale the deployment back up to its original number of replicas.
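
The three steps above can be sketched as a small shell function. This is an illustration under stated assumptions, not the only way to make the change: it assumes that kubectl (or oc, which accepts the same arguments) is logged in to the cluster, and the default namespace ibm-cas is an assumption that you must replace with the namespace that hosts the cast-runtime deployment.

```shell
# Sketch only: NAMESPACE is an assumption, set it to the namespace of cast-runtime.
NAMESPACE="${NAMESPACE:-ibm-cas}"

switch_nim_service_to_http() {
  # Record the current replica count so that step 3 can restore it.
  replicas="$(kubectl get deployment cast-runtime -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')" || return 1

  # Step 1: scale the deployment down to 0.
  kubectl scale deployment cast-runtime -n "$NAMESPACE" --replicas=0 || return 1

  # Step 2: point NVMM_NIM_SERVICE at the http endpoint.
  kubectl set env deployment/cast-runtime -n "$NAMESPACE" \
    NVMM_NIM_SERVICE=http://nv-ingest.nv-ingest.svc.cluster.local || return 1

  # Step 3: scale back up to the recorded replica count.
  kubectl scale deployment cast-runtime -n "$NAMESPACE" --replicas="$replicas"
}
```

Reading the replica count before scaling down avoids having to remember the original number for step 3.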

Error in semantic_search

Problem statement
The query search service gives the following error during a semantic_search:
ERROR: querysearch/semantic_search failed Connection error.
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/opt/app-root/lib64/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
  File "/opt/app-root/lib64/python3.11/site-packages/httpcore/_backends/sync.py", line 154, in start_tls
    with map_exceptions(exc_map):
  File "/usr/lib64/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/app-root/lib64/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
Cause
The error indicates that the NVIDIA embedding service is not available over TLS (HTTPS).
Resolution
  1. Scale the query-search deployment down to 0 replicas.
  2. In the query-search deployment, change https to http in the value of the NVMM_EMBED_SERVICE variable:
    - name: NVMM_EMBED_SERVICE
      value: 'https://nv-ingest-embedqa.nv-ingest.svc.cluster.local'
    to
    - name: NVMM_EMBED_SERVICE
      value: 'http://nv-ingest-embedqa.nv-ingest.svc.cluster.local'
  3. Scale the deployment back up to its original number of replicas.
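
Step 2 can also be done by deriving the new value from the current one instead of typing it. The following is a minimal sketch under stated assumptions: kubectl (or oc, which accepts the same arguments) is logged in to the cluster, the default namespace ibm-cas is an assumption, and the helper names to_http and switch_embed_service_to_http are hypothetical. Scale the deployment down before you run it and back up afterward, as described in steps 1 and 3.

```shell
# Sketch only: NAMESPACE is an assumption, set it to the namespace of query-search.
NAMESPACE="${NAMESPACE:-ibm-cas}"

to_http() {
  # Replace a leading https:// with http:// and leave any other value unchanged.
  printf '%s\n' "$1" | sed 's|^https://|http://|'
}

switch_embed_service_to_http() {
  # Read the current value of NVMM_EMBED_SERVICE from the deployment.
  current="$(kubectl get deployment query-search -n "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="NVMM_EMBED_SERVICE")].value}')" || return 1

  # Write the value back with the scheme changed to http.
  kubectl set env deployment/query-search -n "$NAMESPACE" \
    "NVMM_EMBED_SERVICE=$(to_http "$current")"
}
```

Because to_http only rewrites a leading https://, running the function a second time leaves an already corrected value as it is.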

Datasource stuck in "Connecting" state

Problem statement
Datasource creation can get stuck in the "Connecting" state for several reasons. To diagnose the issue, check the CAS operator logs for the error message.
Cause and resolution
The possible causes of this error are as follows:
  1. Cause:

    The Scale CSI user does not have sufficient privileges for watch creation.

    ERROR RETURNED BY sConn.CreateWatch4Fileset =========>
    =====> [EFSSG0012C Permission denied: Your role(s): [csiadmin, containeroperator], required
    role(s): [admin, storageadmin, securityadmin]] Unable to enable clustered watch for source
    castFS:root

    Resolution:

    Add the Scale CSI user to the Storage Administrator group in Scale as specified in Configuring Scale user to enable watch creation.

  2. Cause:

    The Kafka authentication ConfigMap is missing.

    2025-03-28T18:21:10Z ERROR Reconciler error {"controller": "datasource", "controllerGroup": "cas.isf.ibm.com", "controllerKind": "DataSource", "DataSource": {"name":"mc-test","namespace":"ibm-cas"}, "namespace": "ibm-cas", "name": "mc-test", "reconcileID": "9aeda80b-4706-408a-9c05-ffcb39cb877e", "error": "panic: odd number of arguments passed as key-value pairs for logging [recovered]"}

    Resolution:

    Create the ConfigMap for Kafka authentication as explained in Manually set up to connect to Kafka broker.

  3. Cause:

    Access is denied to the static PV that is created by the Datasource. If you detect a functional issue with static PV accessibility, an access denied error might be the cause. To validate, access the mount path from the pod that is generated by the DocumentProcessor. The pod has the same name as the DocumentProcessor.

    (app-root) sh-5.1$ cd /gpfs/gpfs3/fileset_sample
    bash: cd: /app-root: Permission denied

    Resolution:

    Add a known GID to the Fileset in Scale and add an annotation to the Datasource in CAS as described in Step 8.
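
To tell the causes above apart quickly, you can scan the CAS operator logs for the two error signatures that are quoted in this section. The following is a minimal sketch, and the function name classify_cas_operator_log is hypothetical. It reads log text on standard input and matches only the strings EFSSG0012C and "odd number of arguments passed as key-value pairs". If neither string is present, it suggests the static PV check, because that error appears in the pod shell rather than in the operator logs.

```shell
classify_cas_operator_log() {
  # Read the complete log text from standard input.
  log="$(cat)"
  found=0

  # Signature of cause 1: the Scale CSI user lacks the required roles.
  case "$log" in
    *EFSSG0012C*)
      echo "cause 1: Scale CSI user lacks privileges for watch creation"
      found=1 ;;
  esac

  # Signature of cause 2: the panic that is logged when the ConfigMap is missing.
  case "$log" in
    *"odd number of arguments passed as key-value pairs"*)
      echo "cause 2: Kafka authentication ConfigMap is missing"
      found=1 ;;
  esac

  # Neither signature found: fall back to the manual check for cause 3.
  if [ "$found" -eq 0 ]; then
    echo "no known signature: check static PV access (cause 3)"
  fi
}
```

For example, pipe the output of kubectl logs for the CAS operator pod into classify_cas_operator_log. The operator pod name and namespace depend on your installation and are not specified here.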