Document classes

A document class defines how to extract structured data from documents and map it to the target database tables.

Document classes are used to classify text in your documents and identify whether the data in your document matches a certain document domain. For each domain, a document class defines a set of key-value pairs for text extraction. The extracted text can be vectorized and written to a vector database for use with foundation models, for example, in RAG solutions. The text can also be written to a structured database table (an entity table), for example, to run complex AI queries or to support governance use cases.

Required permissions
To create custom document classes, you must have the Manage document classes permission in addition to the Admin or the Editor role in the project.

Predefined document classes

You can classify or extract text from your documents based on these predefined document classes:

  • ACORD Insurance Form
  • Bank Statements
  • Bill of Lading
  • Business Licenses & Permits
  • Claimant's Statement
  • Credit Card Statements
  • Customs Form
  • Delivery Receipt
  • Diploma / Certification
  • Driver License
  • Expense Reports
  • Financial Statement
  • I-9 Form
  • Initial Claimant's Statement for Income Replacement/Waiver of Premium Benefits
  • Insurance Claim
  • Invoice
  • Life Insurance Authorization Form
  • Mortgage / Lending Document
  • National ID Card
  • Passport
  • Patient Intake Form
  • Purchase Order
  • Receipt
  • Remittance / Payment Advice
  • Tax Forms (W-9, 1099, 941, 1120)
  • Transcripts
  • Utility Bill
  • W-4 Form

Document class details

When you view a document class definition, you can see the following information.

General

This section provides a description of the document class and lists additional prompt instructions if available. You can edit both fields.

When you create a custom document class, include keywords in the description that help to identify documents. Check the predefined document classes for examples. As additional prompt instructions, provide guidance that can be applied to an entire document page to improve extraction accuracy, for example, "Ignore instructional text and handwritten notes." These extra hints help guide the AI to extract information more accurately.

In a document class JSON file, these fields are defined in the Document object.

Documents with this class

This section lists all documents to which the document class was assigned during analysis. For each document, metadata such as the file path in the source, the language, the document type, and the size is shown. To see a document and the document class definitions side by side, open a preview by clicking the Open document preview icon for that document. Clicking the View on document button opens the side-by-side view for the first document in the list.

Data matching and data extraction

This section shows the field and the schema definitions of the document class.

Fields
This subsection lists the fields that can be extracted for this document class with the sample field values. Some document classes have additional field sets for fields that are logically grouped within a document and can occur multiple times. For example, an invoice contains line items, such as multiple entries for purchased products. These line items can be grouped in a table or a list. Such field sets are described in separate tables.
You can rearrange the entries, edit or delete individual entries, or create new fields or field sets. Each field definition consists of this information:
  • Field name and description and any number of examples of what the field can contain. This information is required.
  • An optional list of valid values that can be returned if the field isn't directly available in a document, but can be deduced from the context (or visual cues). For example, the list can contain entries for currencies, so that Euro can be returned if various parts of an invoice contain a € sign, but nowhere is it stated that Euro is the currency of the invoice.
  • Optional additional instructions that help to extract specific information more accurately.
  • Optional classification as an affix. This setting identifies the field as a prefix, infix, or suffix that is attached to another element, for example, a currency sign.
In a document class JSON file, fields and field sets are defined as DocumentField objects.
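
To make the required and optional parts of a field definition concrete, here is a minimal Python sketch. The FieldDefinition class and all of its attribute names are hypothetical and chosen for illustration only; they are not the keys of the DocumentField object in the document class JSON file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldDefinition:
    """Hypothetical model of one field entry (names are illustrative only)."""
    name: str                      # required
    description: str               # required
    examples: List[str] = field(default_factory=list)      # any number of examples
    valid_values: List[str] = field(default_factory=list)  # optional
    instructions: Optional[str] = None                      # optional
    affix: Optional[str] = None    # optional: "prefix", "infix", or "suffix"

    def problems(self) -> List[str]:
        """Return a list of problems; an empty list means the entry looks complete."""
        found = []
        if not self.name.strip():
            found.append("field name is required")
        if not self.description.strip():
            found.append("description is required")
        if self.affix not in (None, "prefix", "infix", "suffix"):
            found.append(f"unknown affix type: {self.affix}")
        return found


# A currency field with a list of valid values that can be deduced from context.
currency = FieldDefinition(
    name="currency",
    description="Currency of the invoice total",
    examples=["EUR", "USD"],
    valid_values=["Euro", "US Dollar"],
)
print(currency.problems())  # []
```

A check like this can be run on your own notes before you enter the values in the document class editor; it does not replace the validation that the product applies on import.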
Schema
This subsection shows the layout of the target entity tables that are created and populated in the entity database:
  • The name and description of each target table
  • For each column in a target table, the name, the data type, and which extracted data to map to the column
You can rearrange the tables and columns, edit or delete individual entries, or create new tables and columns. Each column definition consists of this information:
  • Column name and description and the data type of the column. This information is required.
  • The assigned field. This information is required and determines the content source for this column.
  • Any transformation that you want to apply to the source data such as normalization of dates to make them uniform or converting string data into numbers. No transformation is needed if the source data can be directly mapped, for example, names or addresses. See Transformation functions. This information is optional.
In a document class JSON file, the tables are defined in a TargetTables object. The columns are defined as Column objects in the individual tables.
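
The following Python sketch illustrates the idea behind column mappings: each column takes its content from an assigned field, and an optional transformation normalizes the value first. The function names, the column list, and the field names are assumptions made for this example. They are not the product's transformation functions and not the structure of the Column object.

```python
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying a few numeric layouts in order."""
    for layout in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), layout).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def to_number(raw: str) -> float:
    """Convert a string to a float, assuming ',' is a thousands separator."""
    return float(raw.replace(",", "").strip())


# Hypothetical column definitions: (column name, assigned field, transformation).
# None means that the source data is mapped directly.
COLUMNS = [
    ("vendor_name", "vendor", None),
    ("invoice_date", "date", normalize_date),
    ("total_amount", "total", to_number),
]


def build_row(extracted: dict) -> dict:
    """Map extracted field values to target columns, applying transformations."""
    row = {}
    for column, source_field, transform in COLUMNS:
        value = extracted[source_field]
        row[column] = transform(value) if transform else value
    return row


row = build_row({"vendor": "Acme Corp", "date": "05.03.2024", "total": "1,234.50"})
print(row)
```

In this sketch, the vendor name is mapped directly, while the date and the total are transformed so that all rows in the target table use the same date format and a numeric amount.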

Custom document classes

If the predefined document classes do not fully meet your requirements or do not cover the data that you want to analyze and process, you can create custom document classes:

  • Work with the document class editor to update an existing document class or create a new one.
  • Export the JSON of an existing document class and edit the JSON or use it as a sample when you create a new document class. Then, import the JSON file. A new document class must meet certain requirements. See Schema requirements.

For more information about updating or creating document classes, see Managing document classes.

Language support

The predefined document classes are available in English, but can classify documents in any language. To ensure that downstream applications such as the watsonx.data retrieval service return accurate results when queries are submitted in a language other than English, provide the document classes in the required language.

You can have the entire document class in a language other than English or translate only the definition of the output table.

  • Create the entire document class in a different language or translate one of the predefined document classes.

    Important: Do not translate the document_type field. This field must remain in English.
  • Create a document class where the target_tables section and the columns in that section are in a different language. Any other fields are defined in English. You can also translate the output table definition in an existing document class.
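
As an illustration of the second option, the following Python sketch translates only the output table definition of an exported document class and leaves everything else, including document_type, in English. Only the document_type and target_tables names come from this page. The nested keys (name, description, columns), the sample content, and the translation dictionary are assumptions for this example and might not match the actual JSON layout.

```python
import copy


def translate_output_tables(doc_class: dict, translations: dict) -> dict:
    """Return a copy in which only table and column texts are translated."""
    result = copy.deepcopy(doc_class)

    def tr(text: str) -> str:
        # Fall back to the original text if no translation is available.
        return translations.get(text, text)

    for table in result.get("target_tables", []):
        for key in ("name", "description"):
            if key in table:
                table[key] = tr(table[key])
        for column in table.get("columns", []):
            for key in ("name", "description"):
                if key in column:
                    column[key] = tr(column[key])
    return result


# Hypothetical, heavily simplified document class for illustration.
invoice_class = {
    "document_type": "Invoice",  # must remain in English
    "target_tables": [
        {
            "name": "invoices",
            "description": "Invoice header data",
            "columns": [{"name": "total_amount", "description": "Invoice total"}],
        }
    ],
}

german = {
    "invoices": "rechnungen",
    "Invoice header data": "Rechnungskopfdaten",
    "total_amount": "gesamtbetrag",
    "Invoice total": "Rechnungssumme",
}

translated = translate_output_tables(invoice_class, german)
print(translated["document_type"])             # Invoice
print(translated["target_tables"][0]["name"])  # rechnungen
```

Working on a deep copy keeps the exported English definition unchanged, so both versions can be compared before the translated file is imported.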