Ingest data nodes
Use the Ingest data nodes to load documents from your project, from document sets, or from an external data source through one of the supported connectors.
With the Ingest data nodes you specify which documents you want to process. The node reads the binary content from the data source, collects metadata about the documents, and creates an in-memory table with one row per document that is then consumed by the following operator nodes.
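The "one row per document" idea can be pictured with a minimal Python sketch. It is an illustration only, not the node's implementation: `DocumentRow` and `ingest` are hypothetical names, and local files stand in for the data source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class DocumentRow:
    """One row of the illustrative table, describing one document."""
    name: str
    content: bytes           # binary content read from the source
    size: int                # size in bytes
    modified_time: datetime  # last modification time (UTC)


def ingest(paths):
    """Read each file and return one row per document (hypothetical helper)."""
    rows = []
    for path in map(Path, paths):
        stat = path.stat()
        rows.append(
            DocumentRow(
                name=path.name,
                content=path.read_bytes(),
                size=stat.st_size,
                modified_time=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            )
        )
    return rows
```

Downstream operators would then work on the list of rows rather than going back to the data source.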
Only one node of this type can be used in the flow. The ingest node is mandatory and it must be the first node in the flow. You can't ingest documents in the middle of the flow.
When you run the flow with the Python runtime, select a small number of documents. Processing more than 100 MB of content might require scaling up the pods.
You can choose one of the following ingest nodes:
Data assets
Use this node to pull the data from project assets.
Ensure that the required documents are already available in the project you are working on. To learn how to upload your documents, see Adding data to a project.
For the list of supported file formats, see Supported source data formats.
Use the browser to select the files. In the configuration panel for this node, you can specify the maximum file size for the documents to be processed, and file types to ingest. You can also specify the sampling method.
If you use this operator, the Extract data operator must be used before any other functional or quality operator in the flow. You can add the Filter annotator operator between the Ingest data and Extract data operators.
The following features are added by this operator:
| Feature name | Description |
|---|---|
| size | The size of a document |
| created_time | When the document was created in the project |
| modified_time | When the document was last modified |
Connections
Ingest the documents from a connection. To use this node, you must first define the required connection in your project. For the list of supported connections, see Supported data sources.
Use the browser to specify the path and which documents to upload, and to set filters such as format, file size, and file count.
You can select specific files or a folder:
- If you select a folder, all the files in the folder and in its entire subfolder hierarchy are ingested, subject to the Max Files Size and Max Files settings. If the flow is re-executed, only the new or updated files in the folder are processed.
- If you select specific files, those files are always ingested, regardless of other filtering criteria. They are ingested even if a selected file is larger than the given Max Files Size, the number of ingested files exceeds the given Max Files, or the file was already processed by a previous execution.
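The difference between the two selection modes can be sketched as follows. This is a simplified illustration, not the product's logic: `RemoteFile`, `select_files`, and the `last_run` dictionary are hypothetical, and the sketch assumes that the Max Files limit counts only the files that come from a folder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteFile:
    path: str
    size: int        # bytes
    modified: float  # timestamp of the last modification


def select_files(folder_files, explicit_files, max_file_size, max_files, last_run):
    """Return the files to ingest (hypothetical helper).

    folder_files:   files discovered under a selected folder
    explicit_files: files that were selected individually
    last_run:       dict of path -> modified timestamp seen in the previous run
    """
    from_folder = []
    for f in folder_files:
        if len(from_folder) >= max_files:
            break  # Max Files reached (assumed to apply to folder files only)
        if f.size > max_file_size:
            continue  # larger than Max Files Size
        if f.path in last_run and f.modified <= last_run[f.path]:
            continue  # unchanged since the previous execution
        from_folder.append(f)
    # Explicitly selected files bypass every filter.
    return list(explicit_files) + from_folder
```

In this sketch, an oversized file that was picked explicitly is still returned, while the same file discovered through a folder would be dropped.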
In the configuration panel for this node, specify the following:
- The maximum file size for the documents to be processed
- File types to ingest
- Sampling method
- If ACL retrieval is set in the project, specify the connection for Common Policy Gateway (CPG) to save the ACLs extracted from data sources.
If you use this operator, the Extract data operator must be used before any other functional or quality operator in the flow. You can add the Filter annotator operator between the Ingest data and Extract data operators.
The following features are added by this operator:
| Feature name | Description |
|---|---|
| size | The size of a document |
| created_time | When the document was created in S3 |
| modified_time | When the document was last modified |
Sampling documents
You can use the sampling method to reduce the number of processed files, for example, to validate the flow or to get early insights. This option is available when ingesting data from Connections.
Provide the following options:
- Sample size
  - Choose how to limit the number of documents to ingest from your data source. You can specify the maximum number of documents to ingest, or a percentage of documents. Using a percentage when the data source includes a large number of documents might impact performance.
- Sampling method
  - Choose how to select sample documents from your data source:
    - Sequential selection - Select the first documents in the order they are discovered, up to the specified limit.
    - Random sampling - Randomly select documents from across all folders for a representative sample. Specify the number of documents to use as the pool from which the random sample is selected. If this field is empty, samples are selected from all the available documents in the data source. Leaving this field empty when the data source includes a large number of documents might impact performance.
- Reproducible sampling
  - Choose whether the same sample should be reused (seeded randomness), or whether a new random sample is generated in each run.
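The sampling options can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the node's code: the function name and the `max_docs`, `percent`, `pool_size`, and `seed` parameters are hypothetical stand-ins for the settings described above, and the pool is assumed to be the first `pool_size` documents discovered.

```python
import math
import random


def sample_documents(docs, *, max_docs=None, percent=None,
                     method="sequential", pool_size=None, seed=None):
    """Return a sample of docs (hypothetical helper)."""
    docs = list(docs)
    if (max_docs is None) == (percent is None):
        raise ValueError("specify exactly one of max_docs or percent")

    # Sample size: an absolute limit or a percentage of all documents.
    # A percentage needs the total count, so every document must be listed first.
    limit = max_docs if percent is None else math.ceil(len(docs) * percent / 100)
    limit = min(limit, len(docs))

    if method == "sequential":
        # First documents in discovery order, up to the limit.
        return docs[:limit]
    if method == "random":
        # Restrict the pool to bound the work; an empty setting uses all documents.
        pool = docs if pool_size is None else docs[:pool_size]
        # A fixed seed reproduces the same sample; seed=None gives a new one per run.
        rng = random.Random(seed)
        return rng.sample(pool, min(limit, len(pool)))
    raise ValueError(f"unknown sampling method: {method}")
```

Calling `sample_documents(docs, max_docs=10, method="random", seed=42)` twice on the same list returns the same ten documents, which corresponds to the reproducible option.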
From document set
Use the browser to select a document set from your project.
If ACL retrieval is set in the project, specify the connection for Common Policy Gateway (CPG) to save the ACLs extracted from data sources.
This node must be followed by the Extract data node. If the document set already has content or entities extracted, the Extract data operator is skipped because the content doesn't need to be reprocessed.
Next node in the flow
The Ingest data node must be followed by either:
- Extract data node
- Annotation filter node, followed by the Extract data node, if you want to filter the documents before extracting the data.
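The ordering rules for the ingest node can be summarized as a small check. This is only a sketch: the node identifiers and the `validate_flow` function are hypothetical, and a flow is modeled as a simple list of node names.

```python
# Hypothetical node identifiers used only for this illustration.
INGEST = "ingest_data"
EXTRACT = "extract_data"
ANNOTATION_FILTER = "annotation_filter"


def validate_flow(nodes):
    """Return a list of problems; an empty list means the ordering rules hold."""
    problems = []
    if not nodes or nodes[0] != INGEST:
        problems.append("the flow must start with an ingest node")
    if nodes.count(INGEST) > 1:
        problems.append("only one ingest node is allowed")
    if nodes and nodes[0] == INGEST:
        following = nodes[1:3]
        valid_next = (
            following[:1] == [EXTRACT]
            or following == [ANNOTATION_FILTER, EXTRACT]
        )
        if not valid_next:
            problems.append(
                "the ingest node must be followed by the extract node, "
                "optionally with an annotation filter in between"
            )
    return problems
```

For example, `validate_flow(["ingest_data", "annotation_filter", "extract_data"])` returns an empty list, while a flow that places another operator directly after the ingest node reports a problem.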