Designing unstructured data curation flows
Unstructured data curation consists of an analysis and a processing step. In the analysis step, you import documents and detect their structure, document format, and the languages they are in. In the processing step, you can create embeddings and extract entities for selected groups of documents.
Project setup
Select or create the project in which you want to work. Projects that are marked as sensitive have certain restrictions and might therefore not be suitable for unstructured data curation.
As a prerequisite, a project administrator must set the default document set storage in the project settings for Unstructured Data Integration.
Unstructured data curation flows are processed in the runtime environment that is set in the Unstructured Data Integration project settings. Make sure that the set environment is suitable for the type of curation flows that you want to run in this project.
If retrieval of access control lists (ACL) is configured in the project settings, ACLs are retrieved only for processing flows.
Document scope
Select the documents that you want to analyze to get an understanding of their content and to eventually process them further. You can work with documents from a connection, project assets, or documents that you upload from your local file system.
If you work with documents from a connection and select folders, these folders can contain documents of unsupported document types. Such documents are filtered out during ingestion.
For more information about supported document formats and connections, see Supported connectors for unstructured data curation.
Analysis configuration
You can classify your documents to identify whether the data in your document matches the format in one of the selected document classes. This preprocessing helps you group documents based on certain facets and thus limit later text extraction to the fields in the identified document classes.
The analysis flow also returns the list of documents that didn't match one of the provided document classes. You can then decide whether you need to create a custom document class for these documents or whether you do not want or need to process them any further.
Limit document analysis
Depending on the selected document scope, you might want to limit the size and the number of the documents that are handled by the analysis flow. By default, this option is turned on, and you can set a maximum size in MB and a maximum number of documents. The default limits are 100 MB per document and a maximum of 200 documents.
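The effect of the two limits can be pictured with a minimal Python sketch. It is an illustration only, not the product's implementation: the `Document` type, the `apply_limits` function, the order in which the limits are applied, and the interpretation of 1 MB as 1024 * 1024 bytes are all assumptions.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: 1 MB is treated as 1024 * 1024 bytes


@dataclass
class Document:
    # Hypothetical document record; not a product type.
    name: str
    size_bytes: int


def apply_limits(documents, max_size_mb=100, max_documents=200):
    """Drop documents above the size limit, then cap the number of documents.

    Assumption: oversized documents are skipped first and the count limit
    is applied to the remaining documents in their original order.
    """
    within_size = [d for d in documents if d.size_bytes <= max_size_mb * MB]
    return within_size[:max_documents]


docs = [
    Document("a.pdf", 5 * MB),
    Document("b.pdf", 150 * MB),  # exceeds the default 100 MB limit
    Document("c.docx", 1 * MB),
]
selected = apply_limits(docs)
print([d.name for d in selected])
```

With the default limits, the 150 MB document is left out and the two smaller documents are kept.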
Optical Character Recognition (OCR)
OCR extracts text from images, scanned documents, and tables when a document contains only images and no text. Image processing can take a considerable amount of time. Therefore, you might want to disable OCR when you analyze programmatically generated PDFs, which don't require it.
Supported scripts
Select an additional language script for extraction of text and entities from documents that use a character set or writing style that is not covered by the default Latin script (ISO/IEC 8859).
You can select Cyrillic (for documents in languages such as Belarusian, Bulgarian, Chechen, Macedonian, Mongolian, Russian, Serbian, Ukrainian, or Uzbek) or Chinese/Japanese/Korean as the second script.
For information about the languages that are covered by these scripts, see Supported machine-printed languages in the IBM watsonx.ai and watsonx.governance documentation.
Identify document classes
By default, the option to identify and assign document classes to documents is turned on. If you already have an idea of what document content you are going to handle, you can select specific document classes for your analysis flow.
If you don't want to create an entity table in the processing flow, but only create embeddings from your documents, you can turn off the option to identify document classes. In that case, analysis detects only the document types and languages.
Processing configuration
Configure default settings for your processing flow that transforms the unstructured content into structured, query-ready tables for analytics and downstream workflows. You can overwrite most of the settings for each run of the processing flow.
Limit document processing
Depending on the selected document scope, you might want to limit the size and the number of the documents that are handled by the processing flow.
After you toggle this option on, you can set a maximum size in MB and a maximum number of documents. The default limits are 100 MB per document and a maximum of 200 documents.
Optical Character Recognition (OCR)
OCR extracts text from images, scanned documents, and tables when a document contains only images and no text. Image processing can take a considerable amount of time. Therefore, you might want to disable OCR when you process programmatically generated PDFs, which don't require it.
Default document library
Select a default document library for your output document sets. If you don't set a default library, document sets are added only to the project and you can manually add them to any document library. The document library must already exist in the project.
Import and extract data
These settings define the minimum required processing options. They are set by default and you can't change these settings.
The generated processing flow includes these processing steps:
- The documents that you selected from the analysis result are imported and text is extracted from the documents for classification.
- The ingested documents are classified for targeted text extraction in later processing steps. This step is included only if the documents that are input to the processing flow are grouped by document class.
- The document language is identified for more efficient downstream processing.
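If you send more than one document group to the processing flow, processing is split into parallel branches, one for each group. If the documents are grouped by document class, the split happens after the classification step. If the documents are grouped by other criteria, the split happens directly after the ingestion step.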
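The branching can be pictured with a small Python sketch. It only illustrates the idea of one branch per document group. The function names, the tuple layout of the input, and the use of a thread pool are assumptions for illustration and say nothing about how the runtime environment schedules the branches.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_documents(documents):
    """Group (name, group_key) pairs by their group key, for example the document class."""
    groups = defaultdict(list)
    for name, group_key in documents:
        groups[group_key].append(name)
    return dict(groups)


def process_group(group_key, names):
    """Hypothetical stand-in for the processing steps of one branch."""
    return group_key, [f"processed:{name}" for name in names]


def run_branches(documents):
    """Run one branch per document group and collect the results by group."""
    groups = group_documents(documents)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(process_group, key, names) for key, names in groups.items()]
        return dict(future.result() for future in futures)


documents = [("a.pdf", "invoice"), ("b.pdf", "receipt"), ("c.pdf", "invoice")]
print(run_branches(documents))
```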
Enrich data
Assign additional metadata to unstructured data to add business context to your documents. You can automatically assign data classes, business terms, and classifications to your documents to help you decide which business terms and classifications to assign to the containing document set. You can then use the assigned terms and classifications to find relevant document sets more quickly.
Select the categories with governance artifacts that are relevant for your use case to determine which data classes, business terms, and classifications can be assigned to the documents. The categories that you set in the default configuration are used for all runs of the processing flow. You cannot change the selection for individual runs.
A foundation model is used to match data classes to document content. The content and the data classes that are in the selected categories are sent to the model. The model breaks down the content into meaningful pieces, interprets the context and the relationships between words, and recognizes important information and patterns. Based on this analysis, it identifies and assigns the data classes that match best. If a data class is assigned, all business terms and classifications that are defined on this data class are assigned as related artifacts.
Any data-matching configuration of a data class is ignored for this type of data-class assignment.
The default model for data-class matching is mistralai/mistral-small-3-1-24b-instruct-2503, but you can switch to any of the available models. If the model that you initially selected is not available, the default model is automatically selected. You can then select a different available model. If no model is available during processing, data-class matching is skipped and no data classes are assigned.
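The fallback behavior can be summarized in a short Python sketch. The `resolve_model` function is hypothetical and is not a product API. The sketch also assumes that matching is skipped when neither the selected model nor the default model is available, which is one possible reading of "no model is available".

```python
DEFAULT_MODEL = "mistralai/mistral-small-3-1-24b-instruct-2503"


def resolve_model(selected_model, available_models):
    """Return the model to use for data-class matching, or None to skip matching.

    Assumption: "no model is available" is read as "neither the selected
    model nor the default model is available".
    """
    if selected_model in available_models:
        return selected_model
    if DEFAULT_MODEL in available_models:
        return DEFAULT_MODEL
    return None  # data-class matching is skipped, no data classes are assigned


# "example/other-model" and "example/retired-model" are made-up identifiers.
available = {DEFAULT_MODEL, "example/other-model"}
print(resolve_model("example/other-model", available))
print(resolve_model("example/retired-model", available))
print(resolve_model("example/retired-model", set()))
```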
The token limit determines the maximum number of tokens that are allowed for the context window for each inference request. The default value is 2,048. Higher values allow for working with larger documents and more consistent responses, but increase latency and costs.
Make data ready for AI
Prepare standard Retrieval Augmented Generation (RAG) patterns:
- If necessary, convert the document content, for example PDF content, into text for vectorization.
- Chunk the text. Split documents into meaningful smaller parts.
- Generate embeddings to encode information units such as words or sentences in your document content as a numerical representation.
- Store the generated embeddings in a vector database to enable semantic searches that retrieve information that is similar in meaning.
Based on this pattern, sections that are most similar to a user's question can be retrieved and passed to the LLM for generating an answer to the question.
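The chunk, embed, store, and retrieve steps can be illustrated with a self-contained Python sketch. It is a toy under stated assumptions, not what the processing flow runs: chunks are fixed-size word windows, the "embedding" is a plain token count instead of a deployed embedding model, and the "vector database" is an in-memory list. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=8):
    """Split text into fixed-size word windows (a simplification of meaningful chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase token counts. A real flow uses an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text, max_words=8):
    """Stand-in for the vector database: a list of (chunk, embedding) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text, max_words)]


def retrieve(index, question, top_k=1):
    """Return the top_k chunks that are most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


text = (
    "Invoices are due within thirty days of receipt. "
    "Refunds are issued to the original payment method."
)
index = build_index(text)
print(retrieve(index, "When are invoices due?"))
```

The retrieved chunk is what would be passed to the LLM as context together with the question.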
If you enable this option, you must select the model that you want to use to generate the embeddings from the deployed embedding models and the connection to a vector database for storing the collection with the generated embeddings.
For OpenSearch or OpenSearch IBM Cloud vector databases, you can also select the default engine for vector indexing and search from these libraries: Faiss, Lucene, NMSLIB, or jVector.
After an index is created, you can't change the selected engine for that particular index.
You can select only this processing option and switch off extraction of structured information. This setup is useful if you do not want to extract structured information or if your documents don't contain any, for example, research documents or flat text files.
For this processing option, nodes for chunking, embedding, and generating output in the selected vector database are added to the processing flow.
Extract structured information
For extraction of structured information, at least one document class must have been identified during analysis. Otherwise, this option is disabled.
Documents such as invoices, receipts, or bank statements can contain structured information. You can select this processing option to extract this structured information, standardize it, and store it in an entity table. You can use such entity tables as context for complex queries to AI, in downstream analytics, or in reporting, and you can enrich them with business context.
When this option is selected, the processing flow performs these steps:
- Extract specific entities or categories of information from a document based on the identified document structure.
- Transform the extracted information into the format that is defined for the target table by the applied document class.
- Write the normalized information to an entity table in a structured database.
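The extract, transform, and write sequence can be sketched in Python as follows. This is an illustration under explicit assumptions: regular expressions stand in for the document-class-based extraction, the field names and the `invoice` table layout are invented, and an in-memory SQLite database stands in for the connected structured database.

```python
import re
import sqlite3
from datetime import datetime

# Hypothetical field patterns for an "invoice" document class.
INVOICE_FIELDS = {
    "invoice_number": r"Invoice\s*(?:No\.?|Number)[:\s]+(\S+)",
    "invoice_date": r"Date[:\s]+(\d{2}/\d{2}/\d{4})",
    "total": r"Total[:\s]+\$?([\d,]+\.\d{2})",
}


def extract(text):
    """Extract raw field values from the document text (None if a field is missing)."""
    values = {}
    for field, pattern in INVOICE_FIELDS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        values[field] = match.group(1) if match else None
    return values


def normalize(values):
    """Transform raw values into the target table format (ISO date, numeric total)."""
    date, total = values["invoice_date"], values["total"]
    return {
        "invoice_number": values["invoice_number"],
        "invoice_date": datetime.strptime(date, "%m/%d/%Y").date().isoformat() if date else None,
        "total": float(total.replace(",", "")) if total else None,
    }


def store(conn, row):
    """Write one normalized row to the entity table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoice "
        "(invoice_number TEXT, invoice_date TEXT, total REAL)"
    )
    conn.execute("INSERT INTO invoice VALUES (:invoice_number, :invoice_date, :total)", row)


text = "Invoice No: INV-1001\nDate: 03/15/2024\nTotal: $1,250.00"
conn = sqlite3.connect(":memory:")
store(conn, normalize(extract(text)))
print(conn.execute("SELECT * FROM invoice").fetchall())
```

Keeping extraction and normalization separate mirrors the list above: the raw values stay close to the document, and only the normalized row has to match the table definition.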
If you enable this option, you must select a connection to a database for storing the entity tables. See Supported connectors for unstructured data curation.
If you want to extract structured information only and do not want to generate embeddings for RAG, you can select only this processing option and switch off the option to make data ready for AI.
For this processing option, entity curation and entity store nodes are added to the processing flow.