Troubleshooting Content-Aware Storage (CAS) service
Common troubleshooting steps and known limitations for the CAS service.
For more information about the Informational, Warning, and Critical events, see Troubleshooting Content-Aware Storage (CAS) service.
nv-ingest pod is in ImagePullBackOff error state
- Problem statement
- The nv-ingest pod is in the ImagePullBackOff error state due to too many requests to docker.io.
- Resolution
- To resolve the issue, authenticate to docker.io.
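One common way to authenticate image pulls to docker.io is a registry pull secret that the pod's service account references. The following is a minimal sketch under stated assumptions, not the documented procedure: the namespace `nv-ingest` is inferred from the service name `nv-ingest.nv-ingest.svc.cluster.local`, and the secret name `dockerhub-creds` and the service account `default` are hypothetical placeholders. Your cluster might instead use a global pull secret.

```shell
# Sketch only. Assumptions: namespace, secret name, and service account are
# placeholders; replace them with the values used in your cluster.
KUBECTL="${KUBECTL:-kubectl}"                 # use "oc" on OpenShift
NAMESPACE="${NAMESPACE:-nv-ingest}"           # assumed namespace of the nv-ingest pod
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-default}" # assumed service account of the pod

# Create a pull secret from a Docker Hub user name ($1) and access token ($2).
create_dockerhub_pull_secret() {
  "$KUBECTL" create secret docker-registry dockerhub-creds \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username="$1" \
    --docker-password="$2" \
    -n "$NAMESPACE"
}

# Reference the secret from the service account so that new pods use it.
# Check the existing imagePullSecrets first; this patch might replace them.
link_dockerhub_pull_secret() {
  "$KUBECTL" patch serviceaccount "$SERVICE_ACCOUNT" -n "$NAMESPACE" \
    -p '{"imagePullSecrets":[{"name":"dockerhub-creds"}]}'
}
```

After the secret is linked, the nv-ingest pod typically has to be re-created before the image pull is retried with the new credentials.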
CAS ingestion fails
- Problem statement
- The CAS ingestion fails with the following error in the logs of the cast-runtime pods:
  2025-03-26 22:46:36,725 - ERROR - Error during fetching, retrying... Error: HTTPSConnectionPool(host='nv-ingest.nv-ingest.svc.cluster.local', port=7670): Max retries exceeded with url: /v1/fetch_job/ (Caused by SSLError(SSLError(1, '[SSL] record layer failure (_ssl.c:1006)')))
- Cause
- This error indicates that the NVIDIA NIM service is not available over https.
- Resolution
- Scale the cast-runtime deployment down to 0.
- In the cast-runtime deployment, change https to http in the value parameter:
    - name: NVMM_NIM_SERVICE
      value: https://nv-ingest.nv-ingest.svc.cluster.local
  to
    - name: NVMM_NIM_SERVICE
      value: http://nv-ingest.nv-ingest.svc.cluster.local
- Scale the deployment back up to its original number of replicas.
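The resolution steps for the cast-runtime deployment can be sketched on the command line as follows. This is an illustration under assumptions rather than the documented procedure: the namespace `ibm-cas` is a placeholder for wherever cast-runtime runs, and the function name `switch_nim_to_http` is hypothetical. If an operator manages the deployment, manual changes might be reverted.

```shell
# Sketch only. Assumption: cast-runtime runs in the namespace below.
KUBECTL="${KUBECTL:-kubectl}"       # use "oc" on OpenShift
NAMESPACE="${NAMESPACE:-ibm-cas}"   # assumed namespace

# $1 = original replica count to restore after the change.
switch_nim_to_http() {
  replicas="${1:?usage: switch_nim_to_http <original-replica-count>}"
  # 1. Scale down to 0, 2. switch the URL scheme, 3. scale back up.
  "$KUBECTL" scale deployment/cast-runtime --replicas=0 -n "$NAMESPACE" &&
  "$KUBECTL" set env deployment/cast-runtime \
    NVMM_NIM_SERVICE=http://nv-ingest.nv-ingest.svc.cluster.local \
    -n "$NAMESPACE" &&
  "$KUBECTL" scale deployment/cast-runtime --replicas="$replicas" -n "$NAMESPACE"
}
```

Before you scale down, note the current replica count, for example with `kubectl get deployment cast-runtime -n ibm-cas -o jsonpath='{.spec.replicas}'`, and pass that number to the function.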
Error in semantic_search
- Problem statement
- The query search service gives the following error during a semantic_search:
  ERROR: querysearch/semantic_search failed Connection error.
  Traceback (most recent call last):
    File "/opt/app-root/lib64/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
      yield
    File "/opt/app-root/lib64/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
      resp = self._pool.handle_request(req)
    File "/opt/app-root/lib64/python3.11/site-packages/httpcore/_backends/sync.py", line 154, in start_tls
      with map_exceptions(exc_map):
    File "/usr/lib64/python3.11/contextlib.py", line 158, in __exit__
      self.gen.throw(typ, value, traceback)
    File "/opt/app-root/lib64/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
      raise to_exc(exc) from exc
- Cause
- This error indicates that the NVIDIA service is not available over TLS (https).
- Resolution
- Scale the query-search deployment down to 0.
- In the query-search deployment, change the value of the NVMM_EMBED_SERVICE variable:
    - name: NVMM_EMBED_SERVICE
      value: 'https://nv-ingest-embedqa.nv-ingest.svc.cluster.local'
  to
    - name: NVMM_EMBED_SERVICE
      value: 'http://nv-ingest-embedqa.nv-ingest.svc.cluster.local'
- Scale the deployment back up to its original number of replicas.
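After the change, it can help to confirm that the query-search deployment really carries the http value. The following is a minimal verification sketch, assuming that the deployment runs in the `ibm-cas` namespace (a placeholder); the helper names `embed_service_url` and `uses_plain_http` are hypothetical.

```shell
# Sketch only. Assumption: query-search runs in the namespace below.
KUBECTL="${KUBECTL:-kubectl}"       # use "oc" on OpenShift
NAMESPACE="${NAMESPACE:-ibm-cas}"   # assumed namespace

# Print the NVMM_EMBED_SERVICE value that is set on the deployment.
embed_service_url() {
  "$KUBECTL" set env deployment/query-search --list -n "$NAMESPACE" \
    | sed -n 's/^NVMM_EMBED_SERVICE=//p'
}

# Succeed only when the given URL uses the plain http scheme.
uses_plain_http() {
  case "$1" in
    http://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `uses_plain_http "$(embed_service_url)" && echo "NVMM_EMBED_SERVICE uses http"` prints the message only when the variable was switched.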
Datasource stuck in "Connecting" state error
- Problem statement
- A newly created Datasource might get stuck in the "Connecting" state for different reasons. To diagnose the problem, check the CAS operator logs for the error message.
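As a rough aid for reading the operator logs, the error signatures that this section lists can be matched with a small filter. The sketch below is illustrative only: the function name `classify_datasource_error` and its output labels are hypothetical, and it recognizes just the two log signatures that are shown for the causes in this section.

```shell
# Sketch only. Reads CAS operator log text on stdin and prints a label for
# the likely cause, based on the log signatures shown in this section.
classify_datasource_error() {
  log="$(cat)"
  case "$log" in
    *EFSSG0012C*)
      echo "scale-csi-user-privileges" ;;
    *"odd number of arguments passed as key-value pairs"*)
      echo "kafka-auth-configmap-missing" ;;
    *)
      echo "unknown" ;;
  esac
}
```

A possible invocation pipes the operator logs into the filter, for example `kubectl logs deployment/<cas-operator-deployment> -n ibm-cas | classify_datasource_error`, where the deployment name is a placeholder that you must replace and the namespace is taken from the sample log in this section.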
- Cause and resolution
- The possible causes of this error are as follows:
- Cause:
Scale CSI user does not have enough privileges for Watcher creation.
ERROR RETURNED BY sConn.CreateWatch4Fileset =========> =====> [EFSSG0012C Permission denied: Your role(s): [csiadmin, containeroperator], required role(s): [admin, storageadmin, securityadmin]] Unable to enable clustered watch for source castFS:root
Resolution:
Add the Scale CSI user to the Storage Administrator group in Scale as specified in Configuring Scale user to enable watch creation.
- Cause:
Kafka authentication ConfigMap is missing.
2025-03-28T18:21:10Z ERROR Reconciler error {"controller": "datasource", "controllerGroup": "cas.isf.ibm.com", "controllerKind": "DataSource", "DataSource": {"name":"mc-test","namespace":"ibm-cas"}, "namespace": "ibm-cas", "name": "mc-test", "reconcileID": "9aeda80b-4706-408a-9c05-ffcb39cb877e", "error": "panic: odd number of arguments passed as key-value pairs for logging [recovered]"}
Resolution:
Create the ConfigMap for Kafka authentication as explained in Manually set up to connect to Kafka broker.
- Cause:
Access is denied to static PV that is created by the Datasource. If a functional issue is detected regarding static PV accessibility, it could be related to an access denied error. To validate, access the mount path by using the pod generated by the DocumentProcessor. Note that the pod has the same name as the DocumentProcessor.
(app-root) sh-5.1$ cd /gpfs/gpfs3/fileset_sample
bash: cd: /app-root: Permission denied
Resolution:
Add a known GID to the Fileset in Scale and add an annotation to the Datasource in CAS as described in Step 8.
- Cause: