Transform data nodes
Use Transform data nodes to divide your documents into meaningful sections based on their semantic relevance and generate embeddings that you can load and store in a vector database.
Chunking
Chunking is the process of dividing documents into smaller, meaningful segments. Most large language models in the RAG use case do not work with entire large documents, but with semantically relevant sections of them. Using the chunking node improves context understanding and processing accuracy when you work with your documents.
You can select one of the following chunk types:
- watsonx
- simple
For both chunk types, you can specify the chunk size and chunk overlap parameters.
| Parameter name | Value | Default value | Description |
|---|---|---|---|
| Chunk type | watsonx or simple | watsonx | The type of the selected chunking method. |
| Chunk size | Integer | 1024 | The target size of each chunk. To preserve the semantic meaning of the text and include complete sentences or paragraphs, a chunk might be created larger than the selected size. |
| Chunk overlap | Integer | 0 | The number of tokens that overlap between consecutive chunks, which maintains context continuity between chunks. |
Consider where you want to write your processed content. If you plan to write it to watsonx.data Milvus or any other watsonx application, always select watsonx. Select the simple option only if you want to use the embeddings outside watsonx, because using the simple chunking method within watsonx might fail.
Add the chunking node to your flow to avoid exceeding the maximum size limit for a record while inserting embeddings into a vector database. You might skip this step only if the documents to be processed by the flow are small.
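For a sense of how the chunk size and chunk overlap parameters interact, the following is a minimal, illustrative Python sketch of overlap-based chunking. The function name, whitespace tokenization, and sample values are assumptions made for the example; it does not reproduce the internal watsonx or simple chunking methods.

```python
# Minimal illustration of chunking with overlap (hypothetical, not the
# internal watsonx or simple chunking implementation).
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=0):
    """Split a token list into chunks of roughly chunk_size tokens,
    repeating the last chunk_overlap tokens at the start of the next chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Naive whitespace tokenization, purely for demonstration.
tokens = "A long document that needs to be split into chunks".split()
for chunk in chunk_tokens(tokens, chunk_size=5, chunk_overlap=2):
    print(chunk)
```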
After you finish specifying chunking details, move to Embeddings to select the required parameters.
Embeddings
Embeddings are numerical representations of units of information, such as words or sentences, expressed as vectors of real-valued numbers. Embeddings are generated from the chunks of your input documents and stored as vectors in a vector database of your choice. They help you find similarities between documents or, when a query is given, determine how similar a document is to that query.
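As a simple illustration of how similarity is computed from embeddings, the following sketch compares two vectors with cosine similarity. The vectors are made up for the example; in your flow they are produced by the embedding model that you select.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings; real models produce hundreds of dimensions.
doc_embedding = [0.12, 0.80, 0.05, 0.33]
query_embedding = [0.10, 0.75, 0.10, 0.30]
print(cosine_similarity(doc_embedding, query_embedding))  # close to 1.0 means similar
```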
For both watsonx and simple chunking types, you can select one of the embedding models deployed in the cluster to compute embeddings. Select the right model based on where the generated embeddings will be used, whether in RAG or other use cases. However, you must ensure that the model you select is deployed in watsonx.ai or in any other RAG application that consumes the generated embeddings.
The default model is ibm/slate-30m-english-rtrvr. Ensure that it is deployed in the cluster. If it is not deployed and you don't select another model, the flow fails.
For more information on the embedding models in watsonx.ai, see Supported encoder foundation models in watsonx.ai.
After embeddings are generated for your document, the embeddings feature is added to the output table.
Select Embeddings after all other nodes in your flow and before you move to the Generate output node, so that the embeddings are generated only after ingestion, cleansing, enrichment, and chunking of the data are complete. Later, you can load your embeddings into the vector database.
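As an illustration of the final loading step, the following sketch inserts generated embeddings into a Milvus collection with the pymilvus client. The connection URI, collection name, and sample data are assumptions made for the example; adjust them to your watsonx.data Milvus instance and to the output of your flow.

```python
from pymilvus import MilvusClient

# Hypothetical connection details; replace with your watsonx.data Milvus endpoint.
client = MilvusClient(uri="http://localhost:19530")

# 384 is the embedding dimension of ibm/slate-30m-english-rtrvr.
client.create_collection(collection_name="document_chunks", dimension=384)

# Placeholder chunks and embeddings; in practice these come from the flow output.
chunks = ["First chunk of text.", "Second chunk of text."]
embeddings = [[0.1] * 384, [0.2] * 384]

# Each record pairs a chunk's embedding vector with its text and an id.
records = [
    {"id": i, "vector": vector, "text": text}
    for i, (text, vector) in enumerate(zip(chunks, embeddings))
]
client.insert(collection_name="document_chunks", data=records)
```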
Branching
Use this node to branch the flow so that the documents undergo different processing steps based on the conditions you define. For example, when processing a set of documents in different languages, you can branch the flow so that the documents in one language undergo PII and HAP annotation, while other documents skip this step.
This node can be added after any other node in the flow. You can add multiple nodes after this one by creating multiple links. You can edit the name of each link and define conditions.
To define link conditions, either use the condition builder or click the Advanced tab, where you can manually provide a more complex condition expression.
When a branching node is used, all branches run in parallel when you run the flow, so a document can be processed by multiple branches concurrently.
You can add multiple branching nodes in your flow to form a directed acyclic graph with nested branches.
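As a conceptual sketch of what mutually exclusive branch conditions do, the following Python snippet routes documents by language the way two branches with complementary conditions would. The field names and conditions are made up for illustration and are not the condition-builder syntax.

```python
documents = [
    {"name": "invoice_en.pdf", "language": "en"},
    {"name": "facture_fr.pdf", "language": "fr"},
]

# Two mutually exclusive conditions, as you might define on two branch links.
branch_pii_hap = [d for d in documents if d["language"] == "en"]
branch_skip = [d for d in documents if d["language"] != "en"]

# Each branch processes only its own subset; in the flow, branches run in parallel.
print([d["name"] for d in branch_pii_hap])
print([d["name"] for d in branch_skip])
```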
After you create branches, you can use a merging node to merge the output from the selected branches for further processing in the flow.
Merging
Use this node to merge the flow after it has been branched. Use the links to connect the branches to the merging node. You can edit the link names.
Double-click the merging node to open the Configuration panel. In that panel, you specify how to merge the data coming from the merged branches. The following options are available:
- Combine rows: Merge rows from all tables, one after another. If the same document exists in multiple branches, duplicates are created. Use this option only when the branching conditions produce mutually exclusive results.
- Combine columns: Merge columns from all tables, row by row, where you can select one of the following options:
  - Inner join: combines all matching rows
  - Full outer: includes all rows from all datasets
  With this merge type, if the merged branches have features with the same name, all of these features are added to the output, but they are renamed by adding the link name to the feature name. For example, if three branches named `link1`, `link2`, and `link3` have the same feature named `content`, then the output contains `content`, `content_link2`, and `content_link3`. It is strongly recommended that you avoid this scenario by disabling the Downstream use checkbox for these features in the previous nodes, so that the merge operator receives only one feature with the given name.
Click Preview with sample data to see how the merging works on some sample tables.
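The following sketch approximates the two merge options with pandas, only to illustrate the difference between combining rows and combining columns. The sample tables and the suffix-based renaming are illustrative assumptions, not the node's internal implementation.

```python
import pandas as pd

# Two branch outputs that share the 'content' feature (hypothetical sample data).
link1 = pd.DataFrame({"doc_id": [1, 2], "content": ["a", "b"]})
link2 = pd.DataFrame({"doc_id": [2, 3], "content": ["c", "d"]})

# Combine rows: stack the tables one after another (duplicates are possible).
combined_rows = pd.concat([link1, link2], ignore_index=True)

# Combine columns, inner join: keep only documents present in both branches.
inner = link1.merge(link2, on="doc_id", how="inner", suffixes=("", "_link2"))

# Combine columns, full outer: keep all documents from all branches.
outer = link1.merge(link2, on="doc_id", how="outer", suffixes=("", "_link2"))

print(combined_rows)
print(inner)  # the duplicate 'content' column is renamed to 'content_link2'
print(outer)
```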
Entity curation
Use this node when you have chosen to extract entities in the Extract data node. Documents can contain structured information, for example, invoices, receipts, or bank statements. You can extract this structured information, standardize it, and store it in an entity table.
This node transforms the extracted structured data into the format that is defined for the target table by the applied document class. The normalized information can then be written to an entity table in a structured database by using the Entity store node.
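As an illustration of the kind of normalization this node performs, the following sketch maps raw extracted invoice fields to a hypothetical target schema. The field names, date format, and schema are assumptions made for the example, not the format defined by any specific document class.

```python
from datetime import datetime

# Raw fields as they might be extracted from an invoice (hypothetical).
raw_entity = {"Invoice No.": "INV-0042", "Date": "03/15/2024", "Total": "$1,250.00"}

# Hypothetical target schema; the applied document class defines the real one.
normalized = {
    "invoice_number": raw_entity["Invoice No."],
    "invoice_date": datetime.strptime(raw_entity["Date"], "%m/%d/%Y").date().isoformat(),
    "total_amount": float(raw_entity["Total"].replace("$", "").replace(",", "")),
}
print(normalized)  # ready to be written to the entity table by the Entity store node
```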