Client application with Content-Aware Storage (CAS)
As part of your RAG application, you must provide a client application that uses the CAS search API to leverage the data that CAS ingests. This topic outlines the workflow for querying the CAS database by using the exposed API.
This workflow has the following prerequisites:
- NVIDIA Llama 3.1 NIM and an L40S or H100 GPU.
- Network connectivity to issue REST API calls to CAS.

The entire workflow is divided into two parts: one part retrieves results from CAS, while the other obtains results from a Large Language Model (LLM).
The vector database within the CAS namespace stores the data that is ingested from various files. The namespace also includes the query service pod, which exposes a FastAPI-based REST API. To retrieve information from the vector database, an NVIDIA NIM embedding service is installed outside the namespace. Your task is to develop a client application that communicates with this REST API by using the OpenShift® cluster authentication. For more information about the FastAPI methods, syntax, and descriptions, or to try the APIs, see Content-Aware Storage (CAS) APIs.
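The following Python sketch shows one way a client application might call the query service REST API with an OpenShift bearer token. It is a minimal sketch, not a definitive implementation: the route name (`/query`), the payload fields, the response shape, and the environment variable names are assumptions. Consult the Content-Aware Storage (CAS) APIs reference for the actual endpoints and parameters.

```python
import os
import requests

# Assumed configuration: the route exposed by the CAS query service pod and an
# OpenShift token (for example, obtained with `oc whoami -t`).
CAS_QUERY_URL = os.environ["CAS_QUERY_URL"]
OCP_TOKEN = os.environ["OCP_TOKEN"]


def query_cas(question: str, top_k: int = 5) -> list[dict]:
    """Send the user question to the CAS query service and return the top-K matches.

    The /query path and the "query", "top_k", and "results" field names are
    assumptions; adapt them to the actual CAS API contract.
    """
    response = requests.post(
        f"{CAS_QUERY_URL}/query",
        headers={"Authorization": f"Bearer {OCP_TOKEN}"},
        json={"query": question, "top_k": top_k},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["results"]
```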
In addition, you must have a large language model of your choice that is accessible from your application. If you want to use an IBM Granite model as your LLM, see IBM Granite for information about downloading and using the different IBM Granite models.
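As a hedged illustration of reaching the LLM from the client application, the following sketch posts a prompt to an OpenAI-compatible chat completions endpoint, which NVIDIA NIM LLM deployments typically expose and which many hosted Granite deployments also offer. The URL, model name, and payload fields are assumptions to adapt to your own LLM deployment.

```python
import os
import requests

# Assumed configuration: an OpenAI-compatible chat completions endpoint, for
# example http://<llm-host>/v1/chat/completions, and a model name that matches
# your deployment.
LLM_URL = os.environ["LLM_URL"]
LLM_MODEL = os.environ.get("LLM_MODEL", "meta/llama-3.1-8b-instruct")


def ask_llm(prompt: str) -> str:
    """Send a single-turn prompt to the LLM and return the generated text."""
    response = requests.post(
        LLM_URL,
        json={
            "model": LLM_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```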
When a user raises a question from the front end of the client application, the client sends a query to the query service pod through the FastAPI REST call. The query service passes the question to the NVIDIA NIM embedding service, which generates an embedding for it so that the query service can process the user's request effectively. The query service then sends a SQL query with that embedding to the database to retrieve the top K vector results; for example, a query of the form SELECT column FROM table ORDER BY distance to the query embedding LIMIT K. The user can specify the value of K when they run the query from the front end. The top K results are returned to the client application, which in turn sends them to the LLM. While embeddings make the retrieval process efficient, the LLM adds a layer of contextual understanding, transforming raw data into meaningful information. Finally, the end results are published on the front end of the client application.
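To tie the workflow together, the following sketch combines the two earlier hypothetical helpers, `query_cas()` and `ask_llm()`, into a simple RAG flow: retrieve the top K chunks from CAS, assemble them into a grounded prompt, and ask the LLM for an answer. The `text` field that is read from each CAS result is an assumption about the response shape.

```python
# Minimal RAG flow sketch; query_cas() and ask_llm() are the hypothetical
# helpers from the earlier sketches in this topic.

def answer_question(question: str, top_k: int = 5) -> str:
    """Retrieve top-K chunks from CAS, build a grounded prompt, and ask the LLM."""
    chunks = query_cas(question, top_k=top_k)
    # The "text" field name is an assumption about the CAS response shape.
    context = "\n\n".join(chunk.get("text", "") for chunk in chunks)
    prompt = (
        "Answer the question by using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)


if __name__ == "__main__":
    # Hypothetical usage: the question text and K value come from the front end.
    print(answer_question("What does the ingested document say about backups?", top_k=3))
```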