Document classes

A document class defines how to extract structured data from documents and map it to the target database tables.

Document classes are used to classify the text in your documents, that is, to identify whether the data in a document matches a certain document domain. For each domain, a document class defines a set of key-value pairs for text extraction. The extracted text can be vectorized and written to a vector database for use with foundation models, for example, in RAG solutions. The text can also be written to a structured database table (an entity table), for example, to run complex AI queries or to support governance use cases.

Required permissions: To create custom document classes, you must have the Manage document classes permission in addition to the Admin or the Editor role in the project.

Predefined document classes

You can classify or extract text from your documents based on these predefined document classes:

ACORD Insurance Form
Bank Statements
Bill of Lading
Business Licenses & Permits
Claimant's Statement
Credit Card Statements
Customs Form
Delivery Receipt
Diploma / Certification
Driver License
Expense Reports
Financial Statement
I-9 Form
Initial Claimant's Statement for Income Replacement/Waiver of Premium Benefits
Insurance Claim
Invoice
Life Insurance Authorization Form
Mortgage / Lending Document
National ID Card
Passport
Patient Intake Form
Purchase Order
Receipt
Remittance / Payment Advice
Tax Forms (W-9, 1099, 941, 1120)
Transcripts
Utility Bill
W-4 Form

Document class details

When you view a document class definition, you can see the following information.

General

This section provides a description of the document class and lists additional prompt instructions if available. You can edit both fields.

When you create a custom document class, include keywords in the description that help to identify documents. Check the predefined document classes for examples. As additional prompt instructions, provide guidance that can be applied to an entire document page to improve extraction accuracy, for example, "Ignore instructional text and handwritten notes." These extra hints help guide the AI to extract information more accurately.

In the document class editor, you can find and work with these fields on the Details tab.

In the JSON file, these fields are defined in the Document object.

Data matching and data extraction

This section lists the fields that can be extracted for this document class, a description for each field, a sample field value, and any extraction instructions that might be defined for a field. Some document classes have additional field sets for fields that are logically grouped within a document and can occur multiple times. For example, an invoice contains line items, such as multiple entries for purchased products. These line items can be grouped in a table or a list. Such field sets are described in separate tables.

In the document class editor, you can find and work with these fields on the Document fields tab. Currently, you can't create field sets in the editor. However, you can edit existing field sets.

In the JSON file, these field sets are defined as DocumentField objects.
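
As an illustration of the difference between single fields and field sets, the following Python sketch models an invoice with one field set for line items. It is not the JSON schema of the product: the class names FieldDefinition, FieldSetDefinition, and DocumentClassFields, and all field names, are hypothetical and chosen only to mirror the attributes that are described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDefinition:
    """One extractable field (hypothetical model, not the product schema)."""
    name: str
    description: str
    sample_value: str
    extraction_instructions: str = ""


@dataclass
class FieldSetDefinition:
    """A group of fields that can occur multiple times in one document."""
    name: str
    fields: list[FieldDefinition] = field(default_factory=list)


@dataclass
class DocumentClassFields:
    """Single-occurrence fields plus any repeating field sets."""
    fields: list[FieldDefinition]
    field_sets: list[FieldSetDefinition]


invoice = DocumentClassFields(
    fields=[
        FieldDefinition("invoice_number", "Unique invoice identifier", "INV-1001"),
        FieldDefinition("invoice_date", "Date of issue", "2024-03-15"),
    ],
    field_sets=[
        FieldSetDefinition(
            "line_items",
            [
                FieldDefinition("item_description", "Purchased product", "Widget"),
                FieldDefinition("quantity", "Number of units", "2"),
                FieldDefinition("unit_price", "Price per unit", "9.99"),
            ],
        )
    ],
)

# A possible extraction result: single fields appear once,
# the field set yields one entry per line item found in the document.
extracted = {
    "invoice_number": "INV-1001",
    "invoice_date": "2024-03-15",
    "line_items": [
        {"item_description": "Widget", "quantity": "2", "unit_price": "9.99"},
        {"item_description": "Gadget", "quantity": "1", "unit_price": "24.50"},
    ],
}
```

Modeling the line items as a list keeps each occurrence together, which matches the way such field sets are described in separate tables.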

Target table

This section defines the layout of the target entity table that is stored in a Presto database or an Iceberg metastore. For each column, it specifies the following information:

The name of the column in the target table.
A description of the column content.
The data type of the column, such as string or date.
The extracted data that is mapped to the column, and additional information such as locale settings.
Any transformation that is applied to the source data, such as normalizing dates to make them uniform or converting string data into numbers. No transformation is needed if the source data, for example, names or addresses, can be mapped directly.

In the document class editor, you can find and work with these fields on the Target table tab.

In the JSON file, the columns are defined as Column objects in a TargetTables object.
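
To make the idea of a transformation concrete, here is a minimal Python sketch of the two examples mentioned above: normalizing dates to a uniform value and converting string data into numbers. It does not show how the product implements transformations. The function names normalize_date and to_number, the list of accepted date layouts, and the assumption of US-style number separators are all illustrative.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical list of date layouts that might appear in source documents.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%B %d, %Y")


def normalize_date(raw: str) -> date:
    """Try each known layout and return a uniform date value."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def to_number(raw: str) -> Decimal:
    """Convert a string such as '$1,234.50' to a number.

    Assumes a dollar sign and US-style separators (comma for thousands,
    period for decimals); other locales would need different handling.
    """
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


# Differently formatted source values end up as the same column value.
same_day = {normalize_date(v) for v in ("2024-03-15", "15.03.2024", "03/15/2024")}
amount = to_number("$1,234.50")
```

The locale dependence of the number parsing is the reason why a column mapping can carry additional information such as locale settings.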

Custom document classes

If the predefined document classes do not fully meet your requirements or do not cover the data that you want to analyze and process, you can create custom document classes:

Work with the document class editor to update an existing document class or create a new one.
Export the JSON of an existing document class, and then edit the JSON or use it as a sample when you create a new document class. Then, import the JSON file. A new document class must meet certain requirements. See Schema requirements.

For more information about updating or creating document classes, see Managing document classes.
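
The export, edit, and import approach can be sketched in Python as follows. This is only an illustration under stated assumptions: document_type is the one key name that this page confirms, whereas the description and fields keys and their top-level placement are hypothetical. Check an actual exported file and the schema requirements for the real layout before you import anything.

```python
import json


def derive_document_class(exported_json: str, document_type: str, description: str) -> str:
    """Return JSON text for a new class that is based on an exported definition.

    Only the two named values are replaced; everything else is carried over.
    The "description" key is an assumption, not a confirmed schema element.
    """
    definition = json.loads(exported_json)
    definition["document_type"] = document_type
    definition["description"] = description
    return json.dumps(definition, indent=2, ensure_ascii=False)


# Stand-in for the content of an exported file (hypothetical structure).
exported = json.dumps(
    {"document_type": "Invoice", "description": "Vendor invoice", "fields": []}
)

custom = derive_document_class(
    exported,
    "Pro Forma Invoice",
    "Pro forma invoice, quotation, preliminary bill of sale",
)
```

Starting from an exported definition keeps the parts that you do not change consistent with a definition that is known to be valid.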

Language support

The predefined document classes are available in English, but they can classify documents in any language. To ensure that downstream applications such as the watsonx.data retrieval service return accurate results when queries are submitted in a language other than English, provide the document classes in the required language.

You can have the entire document class in a language other than English or translate only the definition of the output table.

Create the entire document class in a different language or translate one of the predefined document classes.

Important: Do not translate the document_type field. This field must remain in English.
Create a document class where only the target_tables section and the columns in that section are in a different language. All other fields remain in English. You can also translate the output table definition in an existing document class.
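
The second option, translating only the output table definition, might look like the following Python sketch. The document_type and target_tables names come from this page. Everything else is an assumption for illustration: that target_tables is a list of tables, that each table has name and columns keys, that each column has name and description keys, and the translate_target_tables helper itself.

```python
import copy


def translate_target_tables(doc_class: dict, translations: dict[str, str]) -> dict:
    """Return a copy in which only the target table definition is translated.

    Values without an entry in `translations` are kept unchanged.
    The document_type field is never touched, so it stays in English.
    """
    result = copy.deepcopy(doc_class)
    for table in result.get("target_tables", []):
        if "name" in table:
            table["name"] = translations.get(table["name"], table["name"])
        for column in table.get("columns", []):
            for key in ("name", "description"):
                if key in column:
                    column[key] = translations.get(column[key], column[key])
    return result


# Hypothetical definition with one target table and one column.
invoice_class = {
    "document_type": "Invoice",
    "target_tables": [
        {
            "name": "invoices",
            "columns": [{"name": "invoice_date", "description": "Date of issue"}],
        }
    ],
}

german = translate_target_tables(
    invoice_class,
    {
        "invoices": "rechnungen",
        "invoice_date": "rechnungsdatum",
        "Date of issue": "Ausstellungsdatum",
    },
)
```

Working on a deep copy leaves the English definition intact, so both variants can be kept side by side.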