Ingesting content through CLI

By using a remote S3-compatible source, you can upload documents to a centralized storage location for content ingestion. watsonx Assistant for Z then connects to this location to ingest the uploaded content.

About this task

By default, Multicloud Object Gateway (MCG) is installed and configured. For setup instructions, see Installing and setting up Multicloud Object Gateway for IBM Software Hub.

If you don’t have access to your own S3 source, watsonx Assistant for Z offers an integrated MCG solution that can be used for document ingestion.
Note: Supported file formats include: PDF, HTML, DOCX, CSV, XLS, XLSX, PPTX, and Markdown.
To ingest content using a remote S3 source, create a bucket in an S3-compatible storage location and upload your files to it. Then connect this remote source to the watsonx Assistant for Z content ingestion pipeline.
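Because only the listed file formats are ingested, it can help to check a folder before you upload it. The following is a minimal sketch and not part of the product: the function names is_supported_file and list_unsupported_files are hypothetical, and the mapping from format names to file extensions (for example, .md for Markdown) is an assumption.

```shell
#!/bin/sh
# Sketch: report files whose extension is not in the supported-format list.
# The extension mapping is an assumption, not taken from the product.

# Returns 0 if the file name ends in a supported extension (case-insensitive).
is_supported_file() {
  # Lowercase the name so that REPORT.PDF and report.pdf are treated alike.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.pdf|*.html|*.docx|*.csv|*.xls|*.xlsx|*.pptx|*.md) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints every regular file under the given folder that has an unsupported extension.
list_unsupported_files() {
  find "$1" -type f | while IFS= read -r file; do
    is_supported_file "$file" || printf '%s\n' "$file"
  done
}
```

For example, running list_unsupported_files ./docs prints nothing when every file in the hypothetical ./docs folder has one of the listed extensions.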

Procedure

  1. Log in to the server by using the following command:
    zassist login <CI_URL>
  2. Acquire your S3 credentials.
    If you are using the MCG instance that is provided within your cluster, follow these steps:
    1. Log in to the Red Hat OpenShift console.
    2. Click Networking > Routes.
    3. Select the namespace openshift-storage from the Projects drop-down list.
    4. Copy the S3 URL.
    5. Log in using the credentials from the secret noobaa-admin.
    Note: If you're using your own S3 storage, make sure the cluster is configured to connect to that S3 source.
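If you prefer the command line to the console, the MCG substeps above can be approximated with the oc CLI. This is a sketch under stated assumptions: it assumes that the route in the openshift-storage namespace is named s3, that the noobaa-admin secret stores the keys AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and that you are already logged in with oc. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: read the MCG S3 endpoint and credentials with the oc CLI.
# Assumptions: the route is named "s3", and the noobaa-admin secret holds
# base64-encoded AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY fields.

MCG_NAMESPACE="openshift-storage"

# Prints the S3 URL built from the host of the "s3" route.
mcg_s3_url() {
  host=$(oc get route s3 -n "$MCG_NAMESPACE" -o jsonpath='{.spec.host}') || return 1
  printf 'https://%s\n' "$host"
}

# Prints one decoded field of the noobaa-admin secret.
mcg_secret_field() {
  oc get secret noobaa-admin -n "$MCG_NAMESPACE" -o "jsonpath={.data.$1}" | base64 --decode
}

# Usage (requires an active oc login session):
#   S3_URL=$(mcg_s3_url)
#   AWS_ACCESS_KEY_ID=$(mcg_secret_field AWS_ACCESS_KEY_ID)
#   AWS_SECRET_ACCESS_KEY=$(mcg_secret_field AWS_SECRET_ACCESS_KEY)
#   export S3_URL AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
```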
  3. Upload your files to the S3 source.
    Note: Use any S3-compatible client as needed. For instructions on using the AWS S3 CLI, see Installing, updating, and uninstalling the AWS CLI.
    1. Configure the following environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
    2. Run the following command to create an S3 bucket:
      aws --endpoint-url <S3_URL> --no-verify-ssl s3 mb s3://<BUCKET_NAME>
    3. Upload your documents to the S3 bucket by using the following command:
      aws --endpoint-url <S3_URL> --no-verify-ssl s3 cp <FOLDER> s3://<BUCKET_NAME> --recursive
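The substeps of step 3 can be wrapped in small helper functions so that a missing variable is caught before the AWS CLI runs. This is a minimal sketch: the function names and the S3_URL variable are hypothetical, and the endpoint option is spelled --endpoint-url, which is the form listed in the AWS CLI reference.

```shell
#!/bin/sh
# Sketch: wrap the bucket creation and upload commands from step 3.
# S3_URL is a hypothetical variable that holds the S3 endpoint URL.

# Fails with a readable message if a required variable is empty or unset.
require_s3_env() {
  for var in S3_URL AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $var" >&2
      return 1
    fi
  done
}

# Creates the bucket. Run this once; the command fails if the bucket exists.
create_bucket() {
  require_s3_env || return 1
  aws --endpoint-url "$S3_URL" --no-verify-ssl s3 mb "s3://$1"
}

# Copies a local folder into the bucket recursively.
# --no-verify-ssl mirrors the documented commands; it disables certificate checks.
upload_folder() {
  require_s3_env || return 1
  aws --endpoint-url "$S3_URL" --no-verify-ssl s3 cp "$1" "s3://$2" --recursive
}
```

With the variables exported, a call such as create_bucket my_docs followed by upload_folder ./docs my_docs performs the same two commands as the substeps, where my_docs and ./docs are placeholder names.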
  4. Use the following command to connect the remote S3 source to the watsonx Assistant for Z content ingestion pipeline:
    zassist ingest s3 "<USER_DEFINED_NAME>" "<S3_URL>" "<S3_KEY_ID>" "<S3_SECRET_KEY>" "<BUCKET_NAME>" --watch
    Replace USER_DEFINED_NAME with a name of your choice for the remote source that you're connecting to. The name must be in lowercase, can use the underscore (_) as its only special character, and must not start with a number.
    Note: If the content of the remote source changes, rerun the ingest command. The ingestion pipeline automatically detects and processes only the changes.
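The naming rule for USER_DEFINED_NAME can be checked before you run the ingest command. The following sketch uses a hypothetical function, is_valid_source_name, and assumes that digits are allowed anywhere except the first character, which the rule implies but does not state outright.

```shell
#!/bin/sh
# Sketch: check a source name against the documented naming rule.
# Assumption: digits are allowed after the first character.

# Returns 0 when the name contains only lowercase letters, digits, and
# underscores, and does not start with a digit.
is_valid_source_name() {
  # LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z_][a-z0-9_]*$'
}
```

For example, zos_docs and docs_2024 pass the check, while 2024_docs, Zos_Docs, and zos-docs do not.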

    Use the following flags, as required:
    • --skip-pii - Use this flag to bypass PII checks.
    • --enable-ocr - Use this flag to enable OCR for PDF ingestion. OCR is an additional processing step, so ingestion can take longer.
    • --disable-tabular - Use this flag to disable tabular extraction for PDF ingestion, which can improve processing speed.
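If you run the ingest command from a script, the optional flags can be appended conditionally. The following is a sketch only: the function name ingest_s3_source and the toggle variables SKIP_PII and DISABLE_TABULAR are hypothetical, and it assumes that the flags are accepted after the positional arguments. Only the flags themselves come from the list above.

```shell
#!/bin/sh
# Sketch: assemble the ingest command with optional flags.
# SKIP_PII and DISABLE_TABULAR are hypothetical toggle variables (set to 1 to enable).
# Assumption: flags are accepted after the positional arguments.

ingest_s3_source() {
  # Positional arguments follow the documented order:
  # name, S3 URL, key ID, secret key, bucket name.
  set -- ingest s3 "$1" "$2" "$3" "$4" "$5" --watch
  if [ "${SKIP_PII:-0}" = "1" ]; then
    set -- "$@" --skip-pii
  fi
  if [ "${DISABLE_TABULAR:-0}" = "1" ]; then
    set -- "$@" --disable-tabular
  fi
  zassist "$@"
}
```

Building the argument list with set -- keeps each value as a separate argument, so names or keys that contain spaces or special characters are passed through unchanged.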

    After ingesting and loading the files, you can perform the following actions:
    • List all connected remote sources:
      zassist list
    • Monitor the status of an ingested resource:
      zassist watch <ID_OR_USER_DEFINED_NAME>
    • View detailed information about a specific source:
      zassist details <ID_OR_USER_DEFINED_NAME>
    • Remove a source:
      zassist delete <ID_OR_USER_DEFINED_NAME>

What to do next

Use the AI Assistant to submit queries and receive relevant responses from the documents you ingested.