Document classes
A document class defines how to extract structured data from documents and map it to the target database tables.
Document classes are used to classify text in your documents and identify whether the data in your document matches a certain document domain. For each domain, a document class defines a set of key-value pairs for text extraction. The extracted text can be vectorized and written to a vector database for use with foundation models, for example, in RAG solutions. The text can also be written to a structured database table, an entity table, for example, for running complex AI queries or for governance use cases.
Required permissions
To create custom document classes, you must have the Manage document classes permission in addition to the Admin or Editor role in the project.
Predefined document classes
You can classify or extract text from your documents based on these predefined document classes:
- ACORD Insurance Form
- Bank Statements
- Bill of Lading
- Business Licenses & Permits
- Claimant's Statement
- Credit Card Statements
- Customs Form
- Delivery Receipt
- Diploma / Certification
- Driver License
- Expense Reports
- Financial Statement
- I-9 Form
- Initial Claimant's Statement for Income Replacement/Waiver of Premium Benefits
- Insurance Claim
- Invoice
- Life Insurance Authorization Form
- Mortgage / Lending Document
- National ID Card
- Passport
- Patient Intake Form
- Purchase Order
- Receipt
- Remittance / Payment Advice
- Tax Forms (W-9, 1099, 941, 1120)
- Transcripts
- Utility Bill
- W-4 Form
Document class details
When you view a document class definition, you can see the following information.
General
This section provides a description of the document class and lists additional prompt instructions if available. You can edit both fields.
When you create a custom document class, include keywords in the description that help to identify matching documents. Check the predefined document classes for examples. In the additional prompt instructions, provide guidance that applies to an entire document page, for example, "Ignore instructional text and handwritten notes." Such hints help the AI extract information more accurately.
In the document class editor, you can find and work with these fields on the Details tab.
In the JSON file, these fields are defined in the Document object.
Data matching and data extraction
This section lists the fields that can be extracted for this document class, a description for each field, a sample field value, and any extraction instructions that might be defined for a field. Some document classes have additional field sets for fields that are logically grouped within a document and can occur multiple times. For example, an invoice contains line items, such as multiple entries for purchased products. These line items can be grouped in a table or a list. Such field sets are described in separate tables.
In the document class editor, you can find and work with these fields on the Document fields tab. Currently, you can't create field sets in the editor. However, you can edit existing field sets.
In the JSON file, these field sets are defined as DocumentField objects.
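To make the relationship between fields and field sets more concrete, the following Python sketch models a document class with a few top-level fields and one repeating field set for invoice line items. The actual JSON schema is not shown in this topic, so every class and attribute name here (including the separate FieldSet class) is an illustrative assumption, not the product's definition.

```python
from dataclasses import dataclass, field

# All class and attribute names below are illustrative assumptions,
# not the product's actual JSON schema.


@dataclass
class DocumentField:
    """One extractable field with its description and optional hints."""
    name: str
    description: str
    sample_value: str = ""
    extraction_instructions: str = ""


@dataclass
class FieldSet:
    """A logical group of fields that can occur multiple times in a document."""
    name: str
    fields: list[DocumentField] = field(default_factory=list)


@dataclass
class DocumentClass:
    document_type: str
    description: str
    fields: list[DocumentField] = field(default_factory=list)
    field_sets: list[FieldSet] = field(default_factory=list)


# Hypothetical invoice class: single-value fields plus repeating line items.
invoice = DocumentClass(
    document_type="Invoice",
    description="Commercial invoice with invoice number, date, and line items",
    fields=[
        DocumentField("invoice_number", "Unique invoice identifier", "INV-0001"),
        DocumentField("invoice_date", "Date the invoice was issued", "03/14/2024"),
    ],
    field_sets=[
        FieldSet(
            name="line_items",
            fields=[
                DocumentField("product", "Name of the purchased product", "Widget"),
                DocumentField("amount", "Line total", "$1,234.50"),
            ],
        )
    ],
)
```

Keeping the repeating group separate from the single-value fields mirrors how line items are described in their own table in the document class details.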
Target table
This section defines the layout of the target entity table, which is stored in a Presto database or an Iceberg metastore. For each column, it specifies:
- The name of the column in the target table.
- A description of the column content.
- The data type of the column, such as string or date.
- Which extracted data to map to the column, and additional information such as locale settings.
- Any transformation that is applied to the source data, such as normalizing dates to a uniform format or converting string data into numbers. No transformation is needed if the source data can be mapped directly, for example, names or addresses.
In the document class editor, you can find and work with these fields on the Target table tab.
In the JSON file, the columns are defined as Column objects in a TargetTables object.
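As an illustration of the kind of transformation a target column might call for, the following Python sketch normalizes extracted date strings to ISO 8601 and converts a currency string into a number. It is not the product's transformation logic. The function names, the list of accepted date layouts, and the US-style interpretation of separators are assumptions made for this example.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical date layouts that might appear in extracted text.
# Month names are parsed with the default (English) locale.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each known layout in turn."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def parse_amount(raw: str) -> Decimal:
    """Convert a US-style amount such as '$1,234.50' into a Decimal."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


print(normalize_date("03/14/2024"))      # 2024-03-14
print(normalize_date("March 14, 2024"))  # 2024-03-14
print(parse_amount("$1,234.50"))         # 1234.50
```

Values such as names or addresses would bypass functions like these and be written to the column unchanged.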
Custom document classes
If the predefined document classes do not fully meet your requirements or do not cover the data that you want to analyze and process, you can create custom document classes:
- Work with the document class editor to update an existing document class or create a new one.
- Export the JSON of an existing document class, and edit the JSON or use it as a sample when you create a new document class. Then, import the JSON file. A new document class must meet certain requirements. See Schema requirements.
For more information about updating or creating document classes, see Managing document classes.
Language support
The predefined document classes are available in English, but can classify documents in any language. To ensure that downstream applications such as the watsonx.data retrieval service return accurate results when queries are submitted in a language other than English, provide the document classes in the required language.
You can have the entire document class in a language other than English or translate only the definition of the output table.
- Create the entire document class in a different language or translate one of the predefined document classes.
  Important: Do not translate the document_type field. This field must remain in English.
- Create a document class where the target_tables section and the columns in that section are in a different language. Any other fields are defined in English. You can also translate the output table definition in an existing document class.
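The second option, translating only the output table definition, can be sketched in Python as follows. The sketch assumes that an exported document class is loaded as a dictionary with a top-level document_type field and a target_tables section. The nested layout, the helper names, and the sample German translations are hypothetical. Only string values inside target_tables are translated, so document_type stays in English.

```python
import copy


def _translate(node, translations):
    """Recursively replace string values that have a translation; keys stay as-is."""
    if isinstance(node, str):
        return translations.get(node, node)
    if isinstance(node, list):
        return [_translate(item, translations) for item in node]
    if isinstance(node, dict):
        return {key: _translate(value, translations) for key, value in node.items()}
    return node


def translate_target_tables(doc_class: dict, translations: dict) -> dict:
    """Return a copy in which only the target_tables section is translated."""
    result = copy.deepcopy(doc_class)
    result["target_tables"] = _translate(result.get("target_tables", []), translations)
    return result


# Hypothetical document class layout, for illustration only.
source = {
    "document_type": "Invoice",
    "description": "Commercial invoice",
    "target_tables": [
        {
            "name": "invoices",
            "columns": [{"name": "invoice_date", "description": "Date of issue"}],
        }
    ],
}
translations = {
    "Invoice": "Rechnung",  # never applied outside target_tables
    "invoice_date": "rechnungsdatum",
    "Date of issue": "Ausstellungsdatum",
}

translated = translate_target_tables(source, translations)
```

Because the function works on a deep copy, the original definition remains available if you want to keep the English version alongside the translated one.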