Document class schema requirements

Document classes for unstructured data curation that you define in JSON format must meet certain schema requirements.

The JSON schema for a document class defines the structure and the validation rules for the document class. Each document class must have a document and a target_table section.

Document object

This object describes the document type and the fields to extract from a document and has these properties:

  • document_type (required, string): The document type or category.
  • document_description (required, string): Description of the document.
  • additional_prompt_instructions (optional, string): Extra instructions for extraction that can be applied to an entire document page. These instructions complement the field-level descriptions. With this property, you can provide additional guidance to improve extraction accuracy, for example, validation rules or formatting requirements.
  • fields (required, array): The list of fields to extract from the document. These fields are defined as DocumentField objects.

DocumentField objects

These objects define the fields to extract:

  • name (required, string): The name of the field. That name is referenced in the target table.
  • description (required, string): A short explanation of what the field represents. The description provides context that helps the model validate and focus on the correct information in the document.
  • extraction_instructions (optional, string): Additional instructions for extracting this specific field, such as "remove the zip code from the address".
  • examples (optional, array of strings, 1 or more items): Examples of possible field content to improve accuracy.
  • available_options (optional, array of strings): Valid values for the field. If the field isn't directly available in a document but can be deduced from the context or visual cues, one of the listed values should be returned. For example, provide a list of currencies so that Euro can be returned if various parts of an invoice contain a € sign but the invoice nowhere states that the currency is Euro. Specify available_options in addition to any examples.
  • affix (optional, boolean): Determines whether the field is a prefix, infix, or suffix that is attached to another element. For example, a currency sign can be an affix.
  • fields (optional, array): Nested subfields that contain field sets for logically grouped fields within a document. For example, an invoice contains line items, such as multiple entries for purchased products, that are grouped in a table or a list. These subfields are also defined as DocumentField objects.
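The properties listed above can be combined into a document section as in the following sketch. It is written in Python so that the structure can be checked before it is exported as JSON. The invoice content, the field names, and the missing_required helper are illustrative assumptions and are not part of the product.

```python
import json

# Hypothetical invoice document section; all names and values are illustrative.
document = {
    "document_type": "invoice",
    "document_description": "Supplier invoice with line items",
    "additional_prompt_instructions": "Return amounts exactly as printed.",
    "fields": [
        {
            "name": "invoice_number",
            "description": "Unique identifier printed on the invoice",
            "examples": ["INV-0001"],
        },
        {
            "name": "currency",
            "description": "Currency of the invoice amounts",
            "available_options": ["EUR", "USD", "JPY"],
        },
        {
            # Nested subfields group the line items of the invoice.
            "name": "line_items",
            "description": "Purchased products listed on the invoice",
            "fields": [
                {"name": "product", "description": "Product name"},
                {"name": "quantity", "description": "Number of units"},
            ],
        },
    ],
}

REQUIRED_DOCUMENT_KEYS = ("document_type", "document_description", "fields")
REQUIRED_FIELD_KEYS = ("name", "description")


def missing_required(document):
    """Return the paths of required properties that are absent."""
    problems = [key for key in REQUIRED_DOCUMENT_KEYS if key not in document]

    def walk(fields, prefix):
        for index, field in enumerate(fields):
            path = f"{prefix}[{index}]"
            problems.extend(
                f"{path}.{key}" for key in REQUIRED_FIELD_KEYS if key not in field
            )
            # Subfields are DocumentField objects too, so check them recursively.
            walk(field.get("fields", []), f"{path}.fields")

    walk(document.get("fields", []), "fields")
    return problems


print(json.dumps(document, indent=2))
print(missing_required(document))
```

The helper checks only for the presence of required properties. It does not replace the validation that is done when a document class is imported.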

TargetTables object

This object defines the database table where the extracted data is stored and the format in which the data is stored. It also specifies the Python transformation functions to use for converting the data to the required format. This object can occur multiple times: once for the main table and once for each field set (fields object) that is defined for the Document object.

  • name (required, string): The table name.
  • description (required, string): The table description.
  • metadata (optional, object): Additional metadata.
  • columns (required, array): The column definitions. Each column is defined as a Column object.

Column object

Each object specifies how to derive the column value and how to transform it from a string to the data type that you want.

  • name (required, string): The column name.
  • type (required, string): The data type of the column, which can be one of these types: binary, boolean, date, decimal, fixed, long, string, timestamp, uuid.
  • description (optional, string): The column description.
  • source (required, Source object): The source of the column value, defined as a Source object.

Source object

This object can be defined as follows:

  • variable (optional, string): As a variable reference, such as the document ID.
  • field (optional, array): As a reference field in the extracted data as defined in the Document section.
  • transform (optional, SourceTransform): As a data transformation. A SourceTransform object has these properties:
    • transform_name (required, string): The name of the Python transformation function to apply to the extracted data. See Transformation functions.
    • arguments (required, array): The name of the argument and the path to the value in the Document section.
  • literal (optional, string): As a literal or constant value.
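As a sketch of how a target table, its columns, and the four kinds of sources fit together, consider the following Python example. The table name, column names, the variable name document_id, and the column_problems helper are illustrative assumptions. The check that each source uses exactly one of variable, field, transform, or literal is also an assumption that is based on the list above.

```python
# Hypothetical target table; table, column, and field names are illustrative.
target_table = {
    "name": "invoices",
    "description": "One row per extracted invoice",
    "columns": [
        {
            "name": "document_id",
            "type": "string",
            # Assumed spelling of the document ID variable.
            "source": {"variable": "document_id"},
        },
        {
            "name": "invoice_number",
            "type": "string",
            "source": {"field": ["invoice_number"]},
        },
        {
            "name": "record_kind",
            "type": "string",
            "source": {"literal": "invoice"},
        },
        {
            "name": "total",
            "type": "decimal",
            "source": {
                "transform": {
                    "arguments": [
                        {"name": "amount", "value": {"field": ["total_amount"]}}
                    ],
                    "transform_name": "currency_to_numeric",
                }
            },
        },
    ],
}

ALLOWED_TYPES = {
    "binary", "boolean", "date", "decimal", "fixed",
    "long", "string", "timestamp", "uuid",
}
SOURCE_KINDS = ("variable", "field", "transform", "literal")


def column_problems(table):
    """Return a list of human-readable problems found in the column definitions."""
    problems = []
    for column in table["columns"]:
        name = column.get("name", "<unnamed>")
        if column.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {column.get('type')!r}")
        kinds = [kind for kind in SOURCE_KINDS if kind in column.get("source", {})]
        if len(kinds) != 1:
            problems.append(f"{name}: expected exactly one source kind, found {kinds}")
    return problems


print(column_problems(target_table))
```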

document_class_schema_version

This entry defines the schema version of the document class. The schema version of a document class that you import must match the version that is currently used in the product.

Transformation functions

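You can use these transformation functions: currency_to_numeric, make_date_uniform, to_number, and weight_to_numeric.

The general syntax for specifying a transformation is as follows: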

"source": {
  "transform": {
    "arguments": [
      {
        "name": "<transform argument>",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "<name of the transform>"
  }
},
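Because every transformation uses the same structure, the source entry can be generated instead of written by hand. The following Python sketch shows one way to do that. The transform_source helper and the submission_date column are hypothetical and are not part of the product.

```python
import json


def transform_source(transform_name, **field_arguments):
    """Build a source entry that applies a transform to extracted fields.

    Each keyword maps a transform argument name to the name of a source field.
    """
    return {
        "transform": {
            "arguments": [
                {"name": argument, "value": {"field": [field_name]}}
                for argument, field_name in field_arguments.items()
            ],
            "transform_name": transform_name,
        }
    }


# Example: a date column that is filled from the submission_date field.
column = {
    "name": "submission_date",
    "type": "date",
    "source": transform_source("make_date_uniform", date_str="submission_date"),
}
print(json.dumps(column, indent=2))
```

The helper covers only arguments that reference a field. An argument with a plain value, such as a locale, would have to be appended to the arguments list separately.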

currency_to_numeric

The currency_to_numeric transform converts a locale-formatted monetary string to a plain numeric string. This function accepts a locale as an optional input parameter. If no locale is provided, the locale is derived from the language of the document as detected by the Language annotator node. The function also supports multi-language magnitude suffixes (K, M, B, T, 千, 万, 億, and others).

The input and output data type is string.

Specify the transformation in this format:

"source": {
  "transform": {
    "arguments": [
      {
        "name": "amount",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      },
      {
        "name": "locale",
        "value": "id_ID"
      }
    ],
    "transform_name": "currency_to_numeric"
  }
},

Examples

The value of a source field starting_balance would be transformed as follows:

  • $1,234.56 is converted to 1234.56
  • $5K is converted to 5000.00
  • 1.234,56 € is converted to 1234.56
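To make the conversions above concrete, the following Python sketch reproduces the three examples. It is not the product's implementation. The function name currency_to_numeric_sketch and the decimal_separator parameter are assumptions: the parameter stands in for the locale handling, and only the suffixes K, M, B, and T are covered.

```python
import re
from decimal import Decimal

# Magnitude suffixes covered by this sketch (the real transform supports more).
SUFFIXES = {
    "K": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "B": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
}


def currency_to_numeric_sketch(amount, decimal_separator="."):
    """Illustrative only: drop currency symbols and expand K/M/B/T suffixes."""
    match = re.search(r"([\d.,]+)\s*([KMBT])?", amount.strip())
    if match is None:
        raise ValueError(f"no number found in {amount!r}")
    digits, suffix = match.group(1), match.group(2)
    # The character that is not the decimal separator is treated as grouping.
    grouping = "," if decimal_separator == "." else "."
    digits = digits.replace(grouping, "").replace(decimal_separator, ".")
    value = Decimal(digits) * SUFFIXES.get(suffix, Decimal(1))
    return f"{value:.2f}"


print(currency_to_numeric_sketch("$1,234.56"))
print(currency_to_numeric_sketch("$5K"))
print(currency_to_numeric_sketch("1.234,56 €", decimal_separator=","))
```

Using Decimal instead of float keeps the result exact, which matters because the transform returns a string that is later converted to the column type.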

make_date_uniform

The make_date_uniform transform converts various date formats (including multi-language formats) to a uniform ISO date format. The default format is MM-DD-YYYY. However, when dates are stored in an entity table, they are internally converted to the format YYYY-MM-DD.

The input and output data type is string.

Specify the transformation in this format:

"source": {
  "transform": {
    "arguments": [
      {
        "name": "date_str",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "make_date_uniform"
  }
},

Examples

The value of a source field submission_date would be transformed as follows:

  • 2023-12-25 is converted to 2023-12-25
  • December 25, 2023 is converted to 2023-12-25
  • 25/12/2023 is converted to 2023-12-25
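The following Python sketch reproduces the three examples above by trying a short list of candidate formats. It is not the product's implementation. The function name make_date_uniform_sketch and the list of formats are assumptions, only English month names are handled, and the output follows the YYYY-MM-DD form that is shown in the examples.

```python
from datetime import datetime

# Candidate input formats, tried in order; a real implementation covers many more.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y")


def make_date_uniform_sketch(date_str):
    """Illustrative only: normalize a few known date formats to YYYY-MM-DD."""
    text = date_str.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # Try the next candidate format.
    raise ValueError(f"unrecognized date format: {date_str!r}")


for sample in ("2023-12-25", "December 25, 2023", "25/12/2023"):
    print(sample, "->", make_date_uniform_sketch(sample))
```

A value such as 05/04/2023 is ambiguous between day-first and month-first notation. This sketch always reads it as day-first.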

to_number

The to_number transform converts a human-readable number with magnitude suffixes to a machine-readable number. This function supports multi-language suffixes, including English, Chinese, Japanese, Spanish, French, and German.

The input data type is string. The output data type is float.

Specify the transformation in this format:

"source": {
  "transform": {
    "arguments": [
      {
        "name": "number_str",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "to_number"
  }
},

Examples

The value of a source field quantity would be transformed as follows:

  • 10K is converted to 10000.0
  • 2.5M is converted to 2500000.0
  • 10 thousand is converted to 10000.0
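The following Python sketch reproduces the three examples above. It is not the product's implementation. The function name to_number_sketch and the multiplier table are assumptions, and only a few English suffixes are covered.

```python
import re

# English suffixes covered by this sketch, keyed in lowercase.
MULTIPLIERS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}


def to_number_sketch(number_str):
    """Illustrative only: expand a magnitude suffix and return a float."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]*)\s*", number_str)
    if match is None:
        raise ValueError(f"cannot parse {number_str!r}")
    value, suffix = match.groups()
    key = suffix.lower()
    if key and key not in MULTIPLIERS:
        raise ValueError(f"unknown suffix {suffix!r}")
    # A missing suffix leaves the value unchanged.
    return float(value) * MULTIPLIERS.get(key, 1.0)


for sample in ("10K", "2.5M", "10 thousand"):
    print(sample, "->", to_number_sketch(sample))
```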

weight_to_numeric

The weight_to_numeric transform converts weight values with locale-specific units to kilograms (ISO format). The function supports units in English, Chinese, Japanese, Spanish, French, German, and other languages. Based on the locale, it can also resolve ambiguous units whose weight differs between languages (for example, a unit that corresponds to 500g in Chinese and 600g in Japanese).

Specify the transformation in this format:

"source": {
  "transform": {
    "arguments": [
      {
        "name": "weight",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "weight_to_numeric"
  }
},

Examples

The value of a source field gross_weight would be transformed as follows:

  • 5kg is converted to 5.0
  • 10 lb is converted to 4.53592
  • 500g is converted to 0.5
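The following Python sketch reproduces the three examples above. It is not the product's implementation. The function name weight_to_numeric_sketch and the unit table are assumptions, only kg, g, and lb are covered, and locale-dependent units are not handled. The pound factor 0.453592 matches the 10 lb example.

```python
import re

# Conversion factors to kilograms for the units covered by this sketch.
KILOGRAMS_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.453592}


def weight_to_numeric_sketch(weight):
    """Illustrative only: convert a weight string such as '10 lb' to kilograms."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]+)\s*", weight)
    if match is None:
        raise ValueError(f"cannot parse {weight!r}")
    value, unit = match.groups()
    factor = KILOGRAMS_PER_UNIT.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown unit {unit!r}")
    # Round to suppress floating-point noise from the multiplication.
    return round(float(value) * factor, 6)


for sample in ("5kg", "10 lb", "500g"):
    print(sample, "->", weight_to_numeric_sketch(sample))
```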