Transform data nodes

Use Transform data nodes to divide your documents into meaningful sections based on their semantic relevance and generate embeddings that you can load and store in a vector database.

Chunking

Chunking is the process of dividing documents into smaller, meaningful segments. In RAG use cases, most large language models do not process very large documents whole; they work with the sections of a document that are semantically relevant. Using the chunking node improves context understanding and processing accuracy when you work with your documents.

Add the chunking node to your flow to avoid exceeding the maximum size limit for a record while inserting embeddings into a vector database. You might skip this node only if the documents to be processed by the flow are small.

You can select one of the following chunk types:

  • watsonx
  • simple
  • semantic

For watsonx and simple chunk types, you can specify the chunk size and chunk overlap parameters:

Chunking parameters details
Parameter name | Value | Default value | Description
Chunk type | watsonx or simple | watsonx | The type of the selected chunking method.
Chunk size | Integer | 1024 | The target size of each chunk. To preserve the semantic meaning of the text and keep sentences or paragraphs complete, a chunk might be larger than the specified size.
Chunk overlap | Integer | 0 | The number of overlapping tokens between consecutive chunks, which maintains context continuity between the chunks.
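
The interaction between chunk size and chunk overlap can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the chunking node's implementation: tokens are plain whitespace-separated words, windows have a fixed size, and the function name chunk_tokens is hypothetical. The node itself might produce larger chunks to keep sentences or paragraphs complete.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=0):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Hypothetical input: ten one-letter "tokens", small sizes so the overlap is visible.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of 1, the last token of each chunk is repeated as the first token of the next one, which is how context carries over between consecutive chunks.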

For the semantic chunk type, specify:

  • Embeddings model ID: The AI model used for generating embeddings in semantic chunking.
  • Embeddings weight: Controls how much importance is given to semantic similarity (meaning-based analysis) when deciding where to split chunks.
  • TF-IDF weight: Controls how much importance is given to keyword/term frequency analysis when deciding where to split chunks.
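
One way to picture how the two weights might interact is a weighted average of two similarity scores between neighboring sentences, with a split wherever the combined score is low. The sketch below is an assumption made for illustration only: the actual scoring used by the semantic chunk type is not described here, and the function names, the threshold parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_points(embedding_vecs, tfidf_vecs, embeddings_weight=0.7,
                 tfidf_weight=0.3, threshold=0.5):
    """Return the sentence indexes at which a new chunk would start."""
    total = embeddings_weight + tfidf_weight
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    boundaries = []
    for i in range(len(embedding_vecs) - 1):
        semantic = cosine(embedding_vecs[i], embedding_vecs[i + 1])
        lexical = cosine(tfidf_vecs[i], tfidf_vecs[i + 1])
        score = (embeddings_weight * semantic + tfidf_weight * lexical) / total
        if score < threshold:  # neighbors look unrelated: start a new chunk
            boundaries.append(i + 1)
    return boundaries


# Toy vectors for three sentences: the first two are alike, the third differs.
embedding_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tfidf_vecs = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
boundaries = split_points(embedding_vecs, tfidf_vecs)
print(boundaries)
```

In this picture, raising the embeddings weight makes meaning-based similarity dominate the split decision, while raising the TF-IDF weight makes shared keywords count for more.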

Consider which chunk type you need:

  • If you're planning to write the processed content into watsonx.data Milvus or any other watsonx application, always select watsonx. This option is compatible with applications developed using the watsonx SDK.
  • Select the simple option if you want to use the embeddings outside watsonx to process size-based chunks.
  • The semantic option produces chunks that follow natural topic and meaning boundaries rather than arbitrary size limits, resulting in more coherent context units, higher‑quality embeddings, more accurate retrieval, and reduced noise during downstream question‑answering.

Select Enable summarization to generate AI-powered summaries for each document chunk to improve context understanding and retrieval accuracy. The following settings are available:

  • Max input tokens - Maximum number of tokens sent to the LLM per request.
  • Max output tokens - Maximum number of tokens the LLM can generate in each summary response.
  • Summarization model - Select which model to use to generate summaries.
  • Max words per summary, Max sentences per summary - Define the maximum length of the summaries.
  • Temperature - Controls output randomness. Use 0 for consistent, factual summaries or higher values (0.7-1) for more creative variations.

After you finish specifying chunking details, move to Embeddings to select the required parameters.

Embeddings

Embeddings are numerical representations of units of information, such as words or sentences, as vectors of real-valued numbers. Embeddings are generated from the chunks of your input documents and stored as vectors in a vector database of your choice. They help you find similarities between documents or, given a query, determine how similar a document is to that query.
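
As a rough illustration of how stored embeddings support similarity search, the following sketch ranks a few chunks against a query by cosine similarity. The three-dimensional vectors and chunk IDs are made up for the example; real embeddings come from the selected embedding model and have many more dimensions, and a vector database performs this comparison for you.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical chunk embeddings, keyed by chunk ID.
chunk_embeddings = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.1, 0.8, 0.1],
    "chunk-3": [0.7, 0.3, 0.0],
}
query_embedding = [1.0, 0.0, 0.0]

# Rank chunk IDs from most to least similar to the query.
ranked = sorted(
    chunk_embeddings,
    key=lambda chunk_id: cosine_similarity(query_embedding, chunk_embeddings[chunk_id]),
    reverse=True,
)
print(ranked)
```

The query and the chunks must be embedded with the same model for the comparison to be meaningful, which is why the consuming application needs access to the model that generated the embeddings.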

For both watsonx and simple chunking types, you can select one of the embedding models deployed in the cluster to compute embeddings. Select the right model based on where the generated embeddings will be used, whether in RAG or other use cases. However, you must ensure that the model that you select is deployed in watsonx.ai or any other RAG application that consumes the generated embeddings.

The default model is set in the project settings.

For more information on the embedding models in watsonx.ai, see Supported encoder foundation models in watsonx.ai.

After embeddings are generated for your document, the embeddings feature is added to the output table.

Select Embeddings after all other nodes in your flow and before you move to the Generate output node, so that the embeddings are generated only after ingestion, cleansing, enrichment, and chunking of the data are complete. Later, you can load your embeddings into the vector database.

Branching

Use this node to branch the flow so that the documents undergo different processing steps based on the conditions you define. For example, when processing a set of documents in different languages, you can branch the flow so that the documents in one language undergo PII and HAP annotation, while other documents skip this step.

This node can be added after any other node in the flow. You can add multiple nodes after this one by creating multiple links. You can edit the name of each link and define its conditions.

To define link conditions, either use the condition builder, or click the Advanced tab where you can provide a more complex condition expression manually.

When the branching node is used, all branches are run in parallel as you run the flow, so a document can be processed by multiple branches concurrently.
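
The effect of link conditions can be pictured with a small sketch in which every condition is evaluated independently for every document. Everything here is hypothetical (the field names, the link names, and the route function), and the sketch models only which documents each branch receives, not the parallel execution or the condition builder's expression syntax.

```python
# Hypothetical link conditions: each link name maps to a predicate on a document.
branches = {
    "english_docs": lambda doc: doc["language"] == "en",
    "long_docs": lambda doc: doc["num_pages"] > 10,
}


def route(documents, branches):
    """Return, for each link, the IDs of the documents whose condition holds."""
    routed = {name: [] for name in branches}
    for doc in documents:
        for name, condition in branches.items():
            if condition(doc):  # conditions are checked independently per link
                routed[name].append(doc["id"])
    return routed


documents = [
    {"id": "d1", "language": "en", "num_pages": 3},
    {"id": "d2", "language": "de", "num_pages": 20},
    {"id": "d3", "language": "en", "num_pages": 15},
]
routed = route(documents, branches)
print(routed)
```

Document d3 satisfies both conditions and therefore appears in both branches. If you want each document to follow exactly one path, define conditions that are mutually exclusive.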

You can add multiple branching nodes in your flow to form a directed acyclic graph with nested branches.

After you create branches, you can use a merging node to merge the output from the selected branches for further processing in the flow.

Merging

Use this node to merge the flow after it was branched. Use the links to connect the branches into the merging node. You can edit the link names.

Double-click the merging node to open the Configuration panel. In that panel, you specify how to merge the data that is incoming from the merged branches. The following options are available:

  • Combine rows: Merge rows from all tables, one after another

    Note that if the same document exists in multiple branches, duplicates are created. Use this option only when the branching conditions produce mutually exclusive results.

  • Combine columns: Merge columns from all tables, row by row, where you can select one of the following options:

    • Inner join: combines only the matching rows
    • Full outer join: includes all rows from all tables

    With this merge type, if the merged branches have features with the same name, all these features are added to the output, but they are renamed by appending the link name as a suffix to the feature name. For example, if three branches named link1, link2, and link3 have the same feature named content, the output has content, content_link2, and content_link3. It is strongly recommended to avoid this scenario by clearing the Downstream use checkbox for these features in the previous nodes, so that the merge operator gets only one feature with the given name.
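
The difference between the merge options can be sketched with two small tables represented as lists of dictionaries. This is an illustration under assumptions, not the merging node's implementation: the key column document_id, the function names, and the sample rows are hypothetical, and the renaming follows the content_link2 pattern from the example above.

```python
def combine_rows(*tables):
    """Append the rows of every table, one table after another."""
    return [dict(row) for table in tables for row in table]


def combine_columns(left, right, key="document_id", how="inner", right_link="link2"):
    """Join two tables on key; clashing right-hand features get a link-name suffix."""
    left_columns = {name for row in left for name in row}
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "full_outer":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError(f"unsupported join type: {how}")
    merged = []
    for k in keys:
        row = {key: k}
        for name, value in left_by_key.get(k, {}).items():
            if name != key:
                row[name] = value
        for name, value in right_by_key.get(k, {}).items():
            if name == key:
                continue
            # Rename only when the left table already has a feature with this name.
            row[f"{name}_{right_link}" if name in left_columns else name] = value
        merged.append(row)
    return merged


# Document 2 went through both branches, so it appears in both tables.
branch_a = [{"document_id": 1, "content": "A1"}, {"document_id": 2, "content": "A2"}]
branch_b = [{"document_id": 2, "content": "B2"}, {"document_id": 3, "content": "B3"}]

stacked = combine_rows(branch_a, branch_b)                    # document 2 is duplicated
inner = combine_columns(branch_a, branch_b, how="inner")      # only document 2
outer = combine_columns(branch_a, branch_b, how="full_outer") # documents 1, 2, and 3
print(len(stacked), inner, outer)
```

Combining rows keeps both copies of document 2, which is why that option suits mutually exclusive branches. Combining columns produces one row per document, at the cost of renamed features when names clash.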

Click Preview with sample data to see how the merging works on some sample tables.

Entity curation

Use this node if you chose to extract entities in the Extract data node. Documents can contain structured information, for example, invoices, receipts, or bank statements. You can extract this structured information, standardize it, and store it in an entity table.

This node transforms the extracted structured data into the format that is defined for the target table by the applied document class. The normalized information can then be written to an entity table in a structured database by using the Entity store node.
