Setting size limits for files in external vector stores

You can set custom file size limits for documents in external vector stores that are used to ground foundation model prompts with contextual information.

Before you begin

The IBM watsonx.ai service must be installed.

You must be a cluster administrator.

Procedure

You can change the default file size limits for documents stored in external vector stores such as Elasticsearch and watsonx.data™ Milvus.

Setting file size limits for the cluster
Edit the watsonxaiifm custom resource to specify the size limit for each file type stored in a vector data store. Specify each limit in kilobytes (KB), megabytes (MB), or gigabytes (GB). The file size limits apply across the cluster.

The following table describes which attributes you can set in your custom resource to specify size limits for various file types:

File type    Custom resource attribute
CSV          csv_file_type_limit
DOC          doc_file_type_limit
HTML         html_file_type_limit
JSON         json_file_type_limit
PPTX         pptx_file_type_limit
PDF          pdf_file_type_limit
TXT          txt_file_type_limit
YAML         yaml_file_type_limit
XLS          xls_file_type_limit
XML          xml_file_type_limit
For example, run the following command to set size limits for PDF and HTML files in your vector data store:
oc patch watsonxaiifm watsonxaiifm-cr \
--namespace=${PROJECT_CPD_INST_OPERANDS} \
--type=merge \
--patch='{"spec":{"file_limits": {"pdf_file_type_limit": "50MB", "html_file_type_limit": "20MB"}}}'
Attention: If you override the file size limits for your cluster and then restart the watsonxaiifm operator during a service upgrade, the cluster-level settings are removed and the default file size limits are applied. You must reapply any cluster-level file size limit override configuration.
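Because cluster-level overrides must be reapplied after the operator restarts, it can help to keep them in a small script. The following is a minimal sketch, not part of oc or watsonx.ai: the function name build_file_limits_patch is hypothetical, and the check that a limit is a whole number followed by KB, MB, or GB is an assumption about the accepted format. The function only prints the JSON merge patch. The oc patch call is left commented out because it needs cluster access.

```shell
# Hypothetical helper: prints a JSON merge patch for the watsonxaiifm custom
# resource from arguments such as pdf=50MB html=20MB.
build_file_limits_patch() {
  local entries pair ftype limit num unit
  entries=""
  for pair in "$@"; do
    ftype="${pair%%=*}"
    limit="${pair#*=}"
    # Accept only the file types listed in the attribute table.
    case "$ftype" in
      csv|doc|html|json|pptx|pdf|txt|yaml|xls|xml) ;;
      *) echo "unknown file type: $ftype" >&2; return 1 ;;
    esac
    # Assumed format: a whole number followed by a two-letter unit.
    num="${limit%??}"
    unit="${limit#"$num"}"
    case "$unit" in
      KB|MB|GB) ;;
      *) echo "invalid unit in: $limit" >&2; return 1 ;;
    esac
    case "$num" in
      ''|*[!0-9]*) echo "invalid number in: $limit" >&2; return 1 ;;
    esac
    entries="${entries:+$entries,}\"${ftype}_file_type_limit\":\"${limit}\""
  done
  printf '{"spec":{"file_limits":{%s}}}\n' "$entries"
}

# Example: reapply saved overrides after an upgrade (requires cluster access).
# oc patch watsonxaiifm watsonxaiifm-cr \
#   --namespace="${PROJECT_CPD_INST_OPERANDS}" \
#   --type=merge \
#   --patch="$(build_file_limits_patch pdf=50MB html=20MB)"
```

Rejecting unknown file types and malformed limits before the patch is sent keeps a typing mistake from reaching the custom resource.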
Setting file size limits for a project
Use the asset files API to set a size limit for each file type stored in your vector data store. For API method details, see Data and AI Common Core Software APIs.
Note: If you specify both the cluster-level and project-level file size limit settings, the project-level settings take precedence and are applied to your installation.
Review the following requirements for project-level file size limit configuration:
  • You must provide the file size limit override configuration as valid JSON. If the JSON is invalid, the override configuration is not applied correctly.
  • If you do not set the limit for a specific file type, the cluster-level setting for that file type is used.
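Because the configuration must be valid JSON, a local syntax check before you upload the file can catch mistakes early. The following is a minimal sketch that assumes python3 is available on your workstation. The function name is_valid_json is hypothetical. The check covers JSON syntax only and does not verify attribute names or limit values.

```shell
# Hypothetical helper: returns 0 when the given file parses as JSON and a
# non-zero status otherwise. Assumes python3 is on the PATH; json.tool is
# part of the Python standard library.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Example usage:
# if is_valid_json override_config.json; then
#   echo "JSON syntax is valid"
# else
#   echo "Fix override_config.json before you upload it" >&2
# fi
```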
The following table describes which attributes you can set in your JSON configuration file to specify size limits for various file types:
File type    JSON configuration attribute
CSV          WX_MIME_TYPE_CSV
DOC          WX_MIME_TYPE_DOC
HTML         WX_MIME_TYPE_HTML
JSON         WX_MIME_TYPE_JSON
PPTX         WX_MIME_TYPE_PPTX
PDF          WX_MIME_TYPE_PDF
TXT          WX_MIME_TYPE_TXT
YAML         WX_MIME_TYPE_YAML
XLS          WX_MIME_TYPE_XLS
XML          WX_MIME_TYPE_XML
  1. You can set file size limits for multiple projects simultaneously in a single configuration file. Run the following request to override default values and set custom file size limits specified in a JSON configuration file:

    curl --location --request PUT '<cluster_url>/v2/asset_files/config/override_config.json?account_id=999&root=true' \
    --header "Authorization: Bearer ${ACCESS_TOKEN}" \
    --form 'file=@"/Users/<user_system_name>/Documents/override_config.json"'
    Important: The JSON configuration file must list every project whose file size limits you want to override, including projects with existing overrides that you want to keep. Settings are deleted for any project that is not included in the configuration file.
    The following example override_config.json file sets a 10 MB limit for eight file types in two projects:
    {
        "project_overrides": {
            "<watsonx project ID 1>": {
                "vector_indexes": {
                    "WX_MIME_TYPE_PDF": "10MB",
                    "WX_MIME_TYPE_TXT": "10MB",
                    "WX_MIME_TYPE_CSV": "10MB",
                    "WX_MIME_TYPE_HTML": "10MB",
                    "WX_MIME_TYPE_JSON": "10MB",
                    "WX_MIME_TYPE_XLS": "10MB",
                    "WX_MIME_TYPE_PPTX": "10MB",
                    "WX_MIME_TYPE_DOC": "10MB"
                }
            },
            "<watsonx project ID 2>": {
                "vector_indexes": {
                    "WX_MIME_TYPE_PDF": "10MB",
                    "WX_MIME_TYPE_TXT": "10MB",
                    "WX_MIME_TYPE_CSV": "10MB",
                    "WX_MIME_TYPE_HTML": "10MB",
                    "WX_MIME_TYPE_JSON": "10MB",
                    "WX_MIME_TYPE_XLS": "10MB",
                    "WX_MIME_TYPE_PPTX": "10MB",
                    "WX_MIME_TYPE_DOC": "10MB"
                }
            }
        }
    }
    For details about how to retrieve the watsonx™ project ID, see Finding the project ID.
  2. Optional: Run the following command to verify that your file size limit settings are applied correctly:
    curl --location --request GET '<cluster_url>/v2/asset_files/config/override_config.json?account_id=999&root=true' \
    --header "Authorization: Bearer ${ACCESS_TOKEN}"
    The settings might take up to 15 minutes to take effect.
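Because settings are deleted for any project that is missing from the uploaded file, generating the file from a complete list of project IDs can reduce the risk of dropping one. The following is a minimal sketch under stated assumptions: the function name build_override_config is hypothetical, it applies the same limits to every project, and it does not validate the file types or limit values that you pass in.

```shell
# Hypothetical helper: prints an override configuration that applies the same
# limits to every project ID given after the first argument.
# Usage: build_override_config "PDF=10MB TXT=10MB" <project ID>...
build_override_config() {
  local pairs limits projects pair id
  pairs="$1"
  shift
  limits=""
  # Intentional word splitting: the pairs are separated by spaces.
  for pair in $pairs; do
    limits="${limits:+$limits, }\"WX_MIME_TYPE_${pair%%=*}\": \"${pair#*=}\""
  done
  projects=""
  for id in "$@"; do
    projects="${projects:+$projects, }\"${id}\": {\"vector_indexes\": {${limits}}}"
  done
  printf '{"project_overrides": {%s}}\n' "$projects"
}

# Example: write the file for two projects, then upload it with the PUT
# request from step 1.
# build_override_config "PDF=10MB TXT=10MB" \
#   "<watsonx project ID 1>" "<watsonx project ID 2>" > override_config.json
```

Pass the complete list of projects on every run, including projects whose overrides already exist, so that their settings are preserved when the file is uploaded.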

What to do next

To start indexing your documents by adding files to vector data stores, see Adding vectorized documents.