Unstructured Data Integration settings

To preset certain configuration values, define default settings for all Unstructured Data Integration flows in your project.

Changes to the settings apply to new Unstructured Data Integration flows and to Unstructured Data Integration jobs that run after the settings are changed.

Access control list

Access control lists (ACLs) provide details about ownership of and access rights to the files that are ingested. You can control whether that information is used in the Unstructured Data Integration flows in this project.

Enable Access Control List retrieval: Retrieve and preserve file-level permission details during data ingestion. The same access rights are later applied to the generated document sets.
Ingest the documents even if the access control list from the source is not supported by the connection: Documents are ingested even if no information about ownership and access rights in the source can be retrieved.

For more information, see Retrieving Access Control List for ingested documents.
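
The interplay of the two access control list options can be sketched as a small decision function. This is only an illustration, not the product's implementation: the class AclSettings, the function ingestion_decision, and the returned labels are hypothetical names, and the "skip" outcome (a document is not ingested when retrieval is enabled, the source does not support access control lists, and the second option is off) is an assumption inferred from the option descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclSettings:
    """Hypothetical model of the two project-level ACL options."""
    retrieve_acl: bool            # "Enable Access Control List retrieval"
    ingest_if_unsupported: bool   # "Ingest the documents even if ... not supported"


def ingestion_decision(settings: AclSettings, source_supports_acl: bool) -> str:
    """Return an illustrative outcome label for one document."""
    if not settings.retrieve_acl:
        # ACL retrieval is off, so permissions are never requested.
        return "ingest_without_acl"
    if source_supports_acl:
        # Permissions can be retrieved and preserved with the document.
        return "ingest_with_acl"
    if settings.ingest_if_unsupported:
        # No ACL details are available, but ingestion proceeds anyway.
        return "ingest_without_acl"
    # Assumed behavior: without the fallback option, the document is not ingested.
    return "skip"
```

For example, ingestion_decision(AclSettings(True, True), source_supports_acl=False) returns "ingest_without_acl" under these assumptions.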

Save frequency

Decide how to save changes when editing flows in the canvas. You can specify an interval for autosave, or select manual save instead.

Save changes automatically: When selected, changes on the canvas are saved automatically at the specified interval.
Apply changes manually: When selected, users must save their changes manually when editing on the canvas. Autosave is disabled.

Document set storage

Define the default storage for document sets that are generated in an Unstructured Data Integration flow. Select a connection and the schema where you want to store the Iceberg tables with imported metadata. You can choose from these connection types:

Iceberg metastore
Presto
watsonx.data Presto

Default embedding model

Set a default model for generating embeddings. When you configure an Unstructured Data Integration flow, you can override the setting by selecting a different available model.

Custom operator

Make a custom operator available to all Unstructured Data Integration flows in your project.

To add a custom operator, provide a name and a description for the operator, and upload these files:

The Python configuration file (.py) for the custom operator
Optional: An archive (.zip) that contains any dependencies for the operator

For more information about creating custom operators, see User-generated nodes.
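
One way to prepare the optional dependency archive is with Python's standard zipfile module. The following is a minimal sketch, not a documented procedure: the function name bundle_dependencies and the example paths are hypothetical, and the layout that the service expects inside the archive is not specified here, so the sketch simply stores files with paths relative to the source folder.

```python
import zipfile
from pathlib import Path


def bundle_dependencies(source_dir: str, archive_path: str) -> list[str]:
    """Pack every file under source_dir into a .zip archive.

    Returns the names stored in the archive, relative to source_dir.
    """
    root = Path(source_dir)
    target = Path(archive_path).resolve()
    stored = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and the archive itself if it lies inside source_dir.
            if not path.is_file() or path.resolve() == target:
                continue
            name = path.relative_to(root).as_posix()
            archive.write(path, arcname=name)
            stored.append(name)
    return stored
```

A call such as bundle_dependencies("operator_deps", "operator_deps.zip") would produce an archive that you can then upload together with the .py file.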

Environments

Select a default runtime environment:

Python for simple flows where the resource usage is low.
Spark for complex flows where the resource usage is high. Additionally, select the Spark instance and runtime.

You can choose from all runtime environments that are defined in the project.

Spark job assets

If you want to run Unstructured Data Integration flows in a Spark runtime environment, switch this option on. A setup job is created that bundles code and dependencies into a persistent volume for Spark jobs. You can run the setup from any project in the service instance. The setup applies to the entire service instance, so you need to run it only once per instance.