By using a remote S3-compatible source, you can upload documents to a centralized storage
location for content ingestion. watsonx Assistant for Z then
connects to this location to ingest the uploaded content.
About this task
By default, Multicloud Object Gateway (MCG) is installed and configured. For setup
instructions, see Installing and setting up Multicloud Object Gateway for IBM Software
Hub.
If you don’t have access to your own S3 source,
watsonx Assistant for Z offers an integrated MCG solution that can be
used for document ingestion.
Note: Supported file formats include: PDF, HTML, DOCX, CSV, XLS, XLSX,
PPTX, and Markdown.
To ingest content from a remote S3 source, create a bucket in an S3-compatible storage
location and upload your files to it. Then connect this remote source
to the watsonx Assistant for Z content ingestion
pipeline.
Procedure
- Log in to the server by using the following command:
- Acquire your S3 credentials.
If you are using the MCG instance that is
provided within your cluster, follow these steps:
- Log in to the Red Hat OpenShift console.
- Click .
- Select the namespace openshift-storage from the Projects drop-down list.
- Copy the S3 URL.
- Log in by using the credentials from the noobaa-admin secret.
Note: If you're using your own S3 storage, make sure the cluster is configured to connect to that S3
source.
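If you use the in-cluster MCG instance, the console steps above can also be approximated from the command line. The following is a minimal sketch, not a confirmed part of this procedure. It assumes that the oc CLI is already logged in to the cluster, that the MCG S3 route is named s3 in the openshift-storage namespace, and that the noobaa-admin secret stores its keys under AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The function names mcg_s3_url and mcg_secret_field are hypothetical.

```shell
#!/bin/sh
# Sketch: read the MCG S3 endpoint and credentials with the oc CLI.
# Assumptions (verify on your cluster): route "s3", secret "noobaa-admin",
# namespace "openshift-storage", secret keys named like the AWS variables.

MCG_NAMESPACE="openshift-storage"

# Print the S3 URL built from the host of the s3 route.
mcg_s3_url() {
  host=$(oc get route s3 -n "$MCG_NAMESPACE" -o jsonpath='{.spec.host}') || return 1
  printf 'https://%s\n' "$host"
}

# Print one base64-decoded field of the noobaa-admin secret.
# Usage: mcg_secret_field AWS_ACCESS_KEY_ID
mcg_secret_field() {
  oc get secret noobaa-admin -n "$MCG_NAMESPACE" \
    -o jsonpath="{.data.$1}" | base64 -d
}

# Example calls:
# S3_URL=$(mcg_s3_url)
# export AWS_ACCESS_KEY_ID="$(mcg_secret_field AWS_ACCESS_KEY_ID)"
# export AWS_SECRET_ACCESS_KEY="$(mcg_secret_field AWS_SECRET_ACCESS_KEY)"
```

If your cluster uses different route or secret names, adjust the values before you rely on the output.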
- Upload your files to the S3 source.
- Configure the following environment variables: AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY.
- Run the following command to create an S3 bucket:
aws --endpoint <S3_URL> --no-verify-ssl s3 mb s3://<BUCKET_NAME>
- Upload your documents to the S3 bucket by using the following command:
aws --endpoint <S3_URL> --no-verify-ssl s3 cp <FOLDER> s3://<BUCKET_NAME> --recursive
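The upload sub-steps can be combined into one script. The sketch below reuses the two aws commands from this procedure unchanged and adds a guard that stops early when a credential variable is missing. The function names require_s3_env and upload_to_bucket are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch: check credentials, create the bucket, then upload a folder.

# Return 0 only if both credential variables are set and non-empty.
require_s3_env() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first." >&2
    return 1
  fi
}

# Usage: upload_to_bucket <S3_URL> <BUCKET_NAME> <FOLDER>
upload_to_bucket() {
  require_s3_env || return 1
  # Create the bucket; stop if the command reports an error.
  aws --endpoint "$1" --no-verify-ssl s3 mb "s3://$2" || return 1
  # Copy the folder contents into the bucket.
  aws --endpoint "$1" --no-verify-ssl s3 cp "$3" "s3://$2" --recursive
}

# Example call (placeholders as in the procedure):
# upload_to_bucket "<S3_URL>" "<BUCKET_NAME>" "<FOLDER>"
```

The bucket creation command might report an error if the bucket already exists, so for repeat uploads it can be simpler to run only the cp command.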
- Use the following command to connect the remote S3 source to the watsonx Assistant for Z content ingestion pipeline:
zassist ingest s3 "<USER_DEFINED_NAME>" "<S3_URL>" "<S3_KEY_ID>" "<S3_SECRET_KEY>" "<BUCKET_NAME>" --watch
Replace <USER_DEFINED_NAME> with a name for the remote source that you are connecting. The name
must be lowercase, can contain underscores (_) but no other special characters, and must not start with a number.
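A source name can be checked against the naming rules above before the ingest command is run. The following is a minimal sketch, assuming that digits are allowed anywhere except the first position and that a leading underscore is acceptable; the source does not state either point. The function name is_valid_source_name is hypothetical.

```shell
#!/bin/sh
# Sketch: check a candidate source name against the naming rules.
# Assumption: digits are allowed after the first character.

# Return 0 if the name is non-empty, does not start with a digit, and
# contains only lowercase letters, digits, and underscores.
is_valid_source_name() {
  case "$1" in
    "" | [0123456789]* | *[!abcdefghijklmnopqrstuvwxyz0123456789_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example call:
# is_valid_source_name "zos_docs" && echo "name accepted"
```

The character classes are spelled out in full so that the check does not depend on the locale of the shell.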
Note: If the remote source is updated, rerun the ingest command. The ingestion pipeline automatically
detects and processes the delta changes.
Use the following flags, as required:
--skip-pii - Use this flag to bypass PII checks.
--enable-ocr - Use this flag to enable OCR for PDF ingestion, which can speed
up processing.
--disable-tabular - Use this flag to disable tabular extraction for PDF
ingestion, which can also improve processing speed.
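The ingest command and its optional flags can be wrapped in a small helper so that the same call is easy to repeat. This is a sketch under two assumptions that the source does not confirm: the flags are accepted after --watch, and <S3_KEY_ID> and <S3_SECRET_KEY> take the same values as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables set earlier. The function name ingest_source is hypothetical.

```shell
#!/bin/sh
# Sketch: run the ingest command with optional flags appended.
# Assumptions: flags may follow --watch, and the S3 key ID and secret key
# are the values held in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

# Usage: ingest_source <USER_DEFINED_NAME> <S3_URL> <BUCKET_NAME> [flags...]
ingest_source() {
  if [ "$#" -lt 3 ]; then
    echo "usage: ingest_source <name> <s3_url> <bucket> [flags...]" >&2
    return 2
  fi
  name=$1
  url=$2
  bucket=$3
  shift 3
  # Remaining arguments (for example --skip-pii) are passed through as-is.
  zassist ingest s3 "$name" "$url" \
    "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" "$bucket" --watch "$@"
}

# Example call:
# ingest_source "zos_docs" "<S3_URL>" "<BUCKET_NAME>" --skip-pii --disable-tabular
```

Because the pipeline detects delta changes, the same call can be repeated after new files are uploaded to the bucket.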
After ingesting and loading the files, you can perform the following actions:
- List all connected remote sources:
zassist list
- Monitor the status of an ingested resource:
zassist watch <ID_OR_USER_DEFINED_NAME>
- View detailed information about a specific source:
zassist details <ID_OR_USER_DEFINED_NAME>
- Remove a source:
zassist delete <ID_OR_USER_DEFINED_NAME>
What to do next
Use the AI Assistant to submit queries and receive relevant responses from the documents
you ingested.