Managing document classes
To improve extraction results for unstructured data, tailor document classes to the document types in your organization by creating new document classes or by editing existing ones.
You might have documents in your organization that cannot be mapped, or can be mapped only inadequately, to the predefined document classes. In this case, you can analyze and review a set of sample documents so that you can identify the fields that you want to extract and define tailored document classes.
You can create new document classes or update existing ones in these ways:
Working with the document class editor
You can add or update document classes from the UI. Any changes are immediately available in all projects in the account.
After you run an analysis on a sample set of documents, you can use the analyzed documents as a reference for additions and updates. If you know that none of the existing document classes meets your requirements for data extraction and target tables, you can also create or modify a document class in an unstructured data curation asset before you run the analysis.
Updating existing document classes
To update an existing document class:
-
In the unstructured data curation asset, go to the Document classes page by clicking the Document classes icon.
-
Select the document class that you want to update. Any documents that have this document class assigned are listed in the Documents with this class section. To see the document and the document class definitions side by side, you can open a preview for each document by clicking the Open document preview icon. Clicking the View on document button opens the side-by-side view for the first document on the list. If the document class isn't assigned to any document yet, this section shows a Work with reference documents button. After you click the button, you can select one of the available documents for a side-by-side view with the field and schema definitions to determine which changes might be required.
-
Edit the entries directly in the Data matching and extraction section or open the side-by-side view as described in the previous step. If you work in the side-by-side view, select a document from the Documents list that has, or should have, the document class that you want to update.
-
Depending on what you want to change, go to the Fields or Schema section.
- Fields
-
Update or delete existing entries or add new document fields or field sets to extract. If you add fields, you must specify the name of the document field and a description. You can also provide any number of examples and instructions for extraction. Although these entries are optional, the examples and extraction instructions are needed to populate the value of the document field when data is extracted.
If you delete an entry, that data is no longer extracted when the document class is applied. Corresponding columns in the target table will be set to null in future processing runs.
- Schema
-
Update the name, the description, or the definition of the target database table where the extracted data will be stored. You can edit or delete existing column definitions or add new columns to the table depending on the changes that you made to the Fields section.
If you rename a column, a new column with that name is created in the entity table. Data that was processed before the change is available in the column with the old name. Newly processed data will be mapped to the new column.
If you delete a column, that column is not created in new entity tables. If the corresponding data is still extracted, it can’t be mapped. In existing entity tables, the column value will be set to null in future processing runs.
You can also add new tables to match new field sets that you added.
Update the document class based on the document preview. You can always select a different document to cross-check your intended changes.
-
When you're done, save your changes.
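One practical consequence of the rename behavior described in the Schema section is that values for the same logical field can end up in two columns: the old one for data processed before the change and the new one for data processed afterward. The following is a minimal sketch of how a consumer of the entity table might read such rows. It assumes that each row is available as a dictionary and that the column a row was not written to holds a null value. The column names `invoice_no` and `invoice_number` and the helper `read_renamed_column` are hypothetical and are not part of the product.

```python
from typing import Any, Mapping, Optional


def read_renamed_column(row: Mapping[str, Any], new_name: str, old_name: str) -> Optional[Any]:
    """Prefer the value under the new column name; fall back to the old one."""
    value = row.get(new_name)
    if value is not None:
        return value
    return row.get(old_name)


# Hypothetical rows: the first was processed before the rename, the second after.
rows = [
    {"invoice_no": "INV-001", "invoice_number": None},
    {"invoice_no": None, "invoice_number": "INV-002"},
]

for row in rows:
    print(read_renamed_column(row, new_name="invoice_number", old_name="invoice_no"))
```

In SQL, the same idea corresponds to coalescing the new and old columns when you query the table.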
Creating document classes
To create a document class:
- In the unstructured data curation asset, go to the Document classes page by clicking the Document classes icon.
- Click Add class > New class.
- From the Documents list, select a document that you want to use as a reference for creating the document class. Review the document and determine the types of data you want to extract, the labels that you want to use, and where each item is on the page (such as upper-left header or column to the right). Also, determine how you want to build your target table.
- Name the new document class and provide a description for the document type to which you want to apply the document class. Make the description as specific as possible by including keywords to ease classification of documents.
- Optional: Provide additional prompt instructions for extraction that apply to an entire document page, for example, validation rules or formatting requirements, to improve extraction accuracy. These instructions complement the field-level descriptions.
- To create the document class and save this basic information, click Create.
- On the Fields tab, define the fields and field sets from which you want to extract data. Provide a name to use as the label and a precise description of the content. For a field definition, also provide any number of examples. These items are required. Optionally, you can provide a list of valid values and specific extraction instructions for a field to improve extraction accuracy.
- On the Schema tab, add a target table for the extracted data and then define the columns of your entity table. Specify the column name, a description of the column content, the data type of the column, which data is to be mapped to the column, and any transformation that you want to apply to the extracted data, for example, to make dates uniform. For more information about the available transformations, see Transformation functions.
- To verify your document class definition, review some other documents of the same type. Adjust your specifications as required. When you're done, save the new document class.
You can now use this document class in unstructured data curation.
To create a document class, you can also duplicate an existing document class.
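To illustrate what a transformation such as "make dates uniform" on the Schema tab is meant to achieve, the following is a minimal sketch in Python. It is not one of the product's transformation functions, which are described in Transformation functions. The helper `to_iso_date` and the list of accepted input formats are assumptions chosen for this example only.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats for this sketch; extend the tuple to match your documents.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for sample in ("2024-03-05", "05.03.2024", "03/05/2024", "not a date"):
    print(sample, "->", to_iso_date(sample))
```

The formats are tried in order, so an ambiguous value such as 03/05/2024 is read as month/day/year here. Decide on one interpretation per document type before you rely on a uniform output.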
Deleting document classes or parts of their definitions
You can delete a document class, any field or field set definitions, and any items in the schema configuration.
- Document class
- Before you delete a document class, make sure that the class isn't referenced in any data curation flow. Otherwise, documents in flows that use this class might no longer be properly classified.
- Field
- Deleting a field can impact existing processing flows. If the field serves as a source for an output column, you must select how to handle that output column:
  - Keep the column definition without a source
  - Change the source to a different field
  - Delete the output column from the configuration
- Field set
- Deleting a field set can impact existing processing flows. If you delete a field set, the contained fields are also deleted.
- Table
- Deleting a table removes the table and all of its columns from the configuration. Tables that were created in the entity store based on that configuration are kept, but no longer updated.
- Column
- If you delete a column, that column is not created in new entity tables. If the corresponding data is still extracted, it can’t be mapped. In existing entity tables, the column value will be set to null in future processing runs.
Working with JSON files
Export the JSON of an existing document class and edit the JSON or use it as a sample when you create a new document class. Then, import the JSON file. With this method, you can't directly update an existing document class. You must delete the existing document class before you can import the updated JSON file.
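If you prefer to script the edit between export and import, the following is a minimal sketch in Python. It assumes only that the exported file contains a JSON object at the top level. The key names `name` and `description` and the helper `write_edited_copy` are hypothetical. Check the actual structure of your export and the Schema requirements before you change any keys.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_edited_copy(exported: Path, target: Path, changes: Dict[str, Any]) -> Dict[str, Any]:
    """Load an exported document class, overlay top-level changes, and save a new file."""
    definition = json.loads(exported.read_text(encoding="utf-8"))
    if not isinstance(definition, dict):
        raise ValueError("expected a JSON object at the top level of the export")
    definition.update(changes)  # only top-level keys are replaced
    target.write_text(json.dumps(definition, indent=2, ensure_ascii=False), encoding="utf-8")
    return definition


if __name__ == "__main__":
    # Stand-in for a real export; "name" and "description" are hypothetical keys.
    with tempfile.TemporaryDirectory() as tmp:
        exported = Path(tmp) / "exported_class.json"
        exported.write_text(
            json.dumps({"name": "Invoice", "description": "Old description"}),
            encoding="utf-8",
        )
        edited = Path(tmp) / "edited_class.json"
        result = write_edited_copy(exported, edited, {"description": "Supplier invoices with line items"})
        print(result["description"])
```

Writing to a separate target file leaves the original export untouched, which is useful as a backup because the existing document class must be deleted before the edited version can be imported.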
Complete these steps:
-
In the unstructured data curation asset, go to the Document classes page by clicking the Document classes icon.
-
Export the JSON information of an existing document class. Select the class that you want to adjust or a class that provides a similar structure to what you want to set up and select Export JSON from the overflow menu next to the document class name. You can export the document class JSON from the Document classes page of any unstructured data curation asset.
-
Edit the JSON file. Adjust the entries as required. In general, a new custom document class must meet certain requirements. See Schema requirements.
-
Import the updated or new document class. Go to the Document classes page of any unstructured data curation asset, click Add class > Import class, and upload the JSON file.
Important: If you updated an existing document class, you must delete that document class before you can import the updated version. To delete a document class, open it and select Delete from the overflow menu next to the document class name.
The custom document class is globally available for use in unstructured data curation and Unstructured Data Integration flows.
-
If you replaced a document class that was already in use, you must update the flows that reference the document class. Flows that are generated and run in an unstructured data curation asset are automatically updated when you reanalyze and reprocess the documents. Unstructured Data Integration flows that you run manually must be updated with the ID of the newly imported document class. You can find the document class ID in your browser URL between the “/document-classes/” and “?context” strings.
To update the document class ID in an existing flow that you run manually, you have these options:
- Change the document class in the Classification operator.
- If your flow has class-based branches, update the link condition by replacing the document class ID in the Complex Condition field.
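Instead of copying the document class ID from the browser URL by hand, you can extract it with a few lines of Python. This is a minimal sketch that assumes the ID is the path segment directly after "/document-classes/" and that "?context" starts the query string, as described above. The example URL and the helper name `document_class_id` are hypothetical.

```python
from urllib.parse import urlsplit


def document_class_id(url: str) -> str:
    """Return the path segment that follows '/document-classes/' in the URL."""
    path = urlsplit(url).path  # the query string ('?context=...') is not part of the path
    marker = "/document-classes/"
    if marker not in path:
        raise ValueError("URL does not contain '/document-classes/'")
    # Take everything after the marker, up to the next slash if there is one.
    return path.split(marker, 1)[1].split("/", 1)[0]


# Hypothetical URL for illustration only.
example_url = "https://example.com/assets/123/document-classes/abc-123?context=xyz"
print(document_class_id(example_url))
```

You can then paste the returned value into the Classification operator or the Complex Condition field.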