Document class schema requirements
Document classes for unstructured data curation that you define in JSON format must meet certain schema requirements.
The JSON schema for a document class defines the structure and the validation rules for the document class. Each document class must have a document and a target_table section.
Document object
This object describes the document type and the fields to extract from a document and has these properties:
document_type (required, string): The document type or category.
document_description (required, string): Description of the document.
additional_prompt_instructions (optional, string): Extra instructions for extraction that can be applied to an entire document page. These instructions complement the field-level descriptions. With this property, you can provide additional guidance to improve extraction accuracy, for example, validation rules or formatting requirements.
fields (required, array): The list of fields to extract from the document. These fields are defined as DocumentField objects.
DocumentField objects
These objects define the fields to extract:
name (required, string): The name of the field. That name is referenced in the target table.
description (required, string): A short explanation of what the field represents. The description provides context that helps the model validate and focus on the correct information in the document.
extraction_instructions (optional, string): Additional instructions for extracting this specific field, such as "remove the zip code from the address".
examples (optional, array of strings, 1 or more items): Examples of possible field content to improve accuracy.
available_options (optional, array of strings): Valid values for the field. If the field isn't directly available in a document, but can be deduced from the context or visual cues, one of the listed values should be returned. For example, provide a list of currencies so that Euro can be returned if various parts of an invoice contain a € sign but the invoice nowhere states that Euro is its currency. Specify available_options in addition to any examples.
affix (optional, boolean): Determines whether the field is a prefix, infix, or suffix that is attached to another element. For example, a currency sign can be an affix.
fields (optional, array): Nested subfields that contain field sets for logically grouped fields within a document. For example, an invoice contains line items, such as multiple entries for purchased products, that are grouped in a table or a list. These subfields are also defined as DocumentField objects.
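As an illustration, the Document object for an invoice might look like the following sketch, written as a Python dictionary and serialized to JSON. Only the property names come from the descriptions above. The document type, field names, descriptions, examples, and option values are hypothetical.

```python
import json

# Hypothetical invoice document. Only the property names follow the schema
# description; every value here is made up for illustration.
document = {
    "document_type": "invoice",
    "document_description": "Supplier invoice with header data and line items.",
    "additional_prompt_instructions": "Return amounts exactly as printed.",
    "fields": [
        {
            "name": "invoice_number",
            "description": "Unique identifier printed on the invoice.",
            "examples": ["INV-2023-0042"],
        },
        {
            "name": "currency",
            "description": "Currency of the invoice amounts.",
            "examples": ["Euro"],
            # Values that may be deduced from context, such as a currency sign.
            "available_options": ["Euro", "US Dollar", "Yen"],
        },
        {
            "name": "line_items",
            "description": "Purchased products listed on the invoice.",
            # Nested field set, again made of DocumentField objects.
            "fields": [
                {"name": "product", "description": "Product name."},
                {"name": "quantity", "description": "Number of units purchased."},
            ],
        },
    ],
}

document_json = json.dumps(document, indent=2, ensure_ascii=False)
print(document_json)
```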
TargetTables object
This object defines the database table where the extracted data will be stored and the format in which the data is stored, and specifies the Python transformation functions to use for converting the data to the required format. This object can occur multiple times: once for the main table and additional ones for any field sets (fields objects) that are defined for the Document object.
name (required, string): The table name.
description (required, string): The table description.
metadata (optional, object): Additional metadata.
columns (required, array): The column definitions. Each column is defined as a Column object.
Column object
Each object specifies how to derive the column value and how to transform it from a string to the data type that you want.
name (required, string): The column name.
type (required, string): The data type of the column, which can be one of these types: binary, boolean, date, decimal, fixed, long, string, timestamp, uuid.
description (optional, string): The column description.
source (required, Source object): The source of the column value, defined as a Source object.
Source object
This object can be defined as follows:
variable (optional, string): As a variable reference such as document id.
field (optional, array): As a reference field in the extracted data as defined in the Document section.
transform (optional, SourceTransform): As a data transformation. A SourceTransform object has these properties:
  transform_name (required, string): The name of the Python transformation function to apply to the extracted data. See Transformation functions.
  arguments (required, array): The name of the argument and the path to the value in the Document section.
literal (optional, string): As a literal or constant value.
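The following sketch shows how a target table with variable, field, and literal sources might be assembled, together with a small check of the column types against the list above. The table name, column names, the variable name document_id, and the helper invalid_columns are hypothetical and are not part of the product.

```python
import json

# Column types listed in the Column object description.
COLUMN_TYPES = {"binary", "boolean", "date", "decimal", "fixed",
                "long", "string", "timestamp", "uuid"}

# Hypothetical target table; all names and values are illustrative.
target_table = {
    "name": "invoices",
    "description": "One row per extracted invoice.",
    "columns": [
        {"name": "doc_id", "type": "string",
         "source": {"variable": "document_id"}},  # assumed variable name
        {"name": "invoice_number", "type": "string",
         "description": "Invoice identifier.",
         "source": {"field": ["invoice_number"]}},
        {"name": "origin", "type": "string",
         "source": {"literal": "curated"}},
    ],
}

def invalid_columns(table):
    """Return the names of columns whose type is not in COLUMN_TYPES."""
    return [c["name"] for c in table["columns"] if c["type"] not in COLUMN_TYPES]

print(invalid_columns(target_table))
print(json.dumps(target_table, indent=2))
```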
document_class_schema_version
This entry defines the schema version of the document class. The schema version of a document class that you import must match the version that is currently used in the product.
Transformation functions
You can use these transformation functions: currency_to_numeric, make_date_uniform, to_number, and weight_to_numeric.
The required syntax is as follows:
"source": {
  "transform": {
    "arguments": [
      {
        "name": "<transform argument>",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "<name of the transform>"
  }
},
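If you generate document classes programmatically, a small helper can produce this structure. The following is a minimal sketch; the helper name transform_source and the column shown are hypothetical, and only the key names come from the syntax above.

```python
import json

def transform_source(transform_name, **field_arguments):
    """Build a source entry that applies a transform to extracted fields.

    Each keyword maps a transform argument name to the name of a source field.
    """
    arguments = [
        {"name": arg_name, "value": {"field": [field_name]}}
        for arg_name, field_name in field_arguments.items()
    ]
    return {"transform": {"arguments": arguments, "transform_name": transform_name}}

# Hypothetical column that reads the starting_balance field through a transform.
column = {
    "name": "starting_balance",
    "type": "string",
    "source": transform_source("currency_to_numeric", amount="starting_balance"),
}
print(json.dumps(column, indent=2))
```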
currency_to_numeric
The currency_to_numeric transform converts a locale-formatted monetary string to a plain numeric string. This function accepts a locale as an optional input parameter. If a locale is not explicitly provided, it is derived from the language of the document as identified in the Language annotator node. The function also supports multi-language magnitude suffixes (K, M, B, T, 千, 万, 億, and others).
The input and output data type is string.
Specify the transformation in this format:
"source": {
  "transform": {
    "arguments": [
      {
        "name": "amount",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      },
      {
        "name": "locale",
        "value": "id_ID"
      }
    ],
    "transform_name": "currency_to_numeric"
  }
},
Examples
The value of a source field starting_balance would be transformed as follows:
$1,234.56 is converted to 1234.56
$5K is converted to 5000.00
1.234,56 € is converted to 1234.56
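To make the behavior concrete, the following Python sketch reproduces the three examples above. It is not the product's implementation: the function name, the default locale en_US, the small suffix table, the set of decimal-comma locales, and the two-decimal output are assumptions, and the locale is passed explicitly here and not detected from the document language.

```python
import re
from decimal import Decimal

# Assumed subset of magnitude suffixes; the product supports more.
MAGNITUDES = {
    "K": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "B": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "千": Decimal(10) ** 3, "万": Decimal(10) ** 4, "億": Decimal(10) ** 8,
}
# Assumed set of locales that write a decimal comma and a thousands dot.
DECIMAL_COMMA_LOCALES = {"de_DE", "es_ES", "id_ID"}

def currency_to_numeric_sketch(amount, locale="en_US"):
    """Convert a monetary string to a plain numeric string (illustrative only)."""
    match = re.search(r"\d[\d.,]*", amount)
    if match is None:
        raise ValueError(f"No number found in {amount!r}")
    digits = match.group().rstrip(".,")
    if locale in DECIMAL_COMMA_LOCALES:
        # 1.234,56 -> 1234.56
        digits = digits.replace(".", "").replace(",", ".")
    else:
        # 1,234.56 -> 1234.56
        digits = digits.replace(",", "")
    # A magnitude suffix, if any, is the first character after the number.
    suffix = re.match(r"\s*(\S)", amount[match.end():])
    multiplier = Decimal(1)
    if suffix:
        multiplier = MAGNITUDES.get(suffix.group(1).upper(), Decimal(1))
    return f"{Decimal(digits) * multiplier:.2f}"

print(currency_to_numeric_sketch("$1,234.56"))
print(currency_to_numeric_sketch("$5K"))
print(currency_to_numeric_sketch("1.234,56 €", locale="de_DE"))
```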
make_date_uniform
The make_date_uniform transform converts various date formats (including multi-language formats) to a uniform ISO date format. The default format is MM-DD-YYYY. However, when dates are stored in an entity table, they are internally converted to the format YYYY-MM-DD.
The input and output data type is string.
Specify the transformation in this format:
"source": {
  "transform": {
    "arguments": [
      {
        "name": "date_str",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "make_date_uniform"
  }
},
Examples
The value of a source field submission_date would be transformed as follows:
2023-12-25 is converted to 2023-12-25
December 25, 2023 is converted to 2023-12-25
25/12/2023 is converted to 2023-12-25
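The following Python sketch illustrates the kind of normalization that the examples show. It is not the product's implementation: the function name and the short list of candidate formats are assumptions, slash-separated dates are read as day first, English month names are assumed, and the output uses the YYYY-MM-DD form of the examples.

```python
from datetime import datetime

# Assumed, non-exhaustive list of input formats, tried in order.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y"]

def make_date_uniform_sketch(date_str):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format fits."""
    text = date_str.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Unrecognized date format: {date_str!r}")

for raw in ["2023-12-25", "December 25, 2023", "25/12/2023"]:
    print(raw, "->", make_date_uniform_sketch(raw))
```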
to_number
The to_number transform converts a human-readable number with magnitude suffixes to a machine-readable number. This function supports multi-language suffixes, including English, Chinese, Japanese, Spanish, French, and German.
The input data type is string. The output data type is float.
Specify the transformation in this format:
"source": {
  "transform": {
    "arguments": [
      {
        "name": "number_str",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "to_number"
  }
},
Examples
The value of a source field quantity would be transformed as follows:
10K is converted to 10000.0
2.5M is converted to 2500000.0
10 thousand is converted to 10000.0
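As a rough illustration of the conversion, the following Python sketch reproduces the examples above. It is not the product's implementation: the function name and the small suffix table are assumptions, and the real function supports many more suffixes and languages.

```python
import re

# Assumed subset of magnitude suffixes, keyed in lower case.
SUFFIXES = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
    "万": 1e4, "億": 1e8,
}

def to_number_sketch(number_str):
    """Convert a string such as '2.5M' to a float (illustrative only)."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*(\D*?)\s*", number_str)
    if match is None:
        raise ValueError(f"Cannot parse {number_str!r}")
    value, suffix = match.groups()
    suffix = suffix.lower()
    if suffix and suffix not in SUFFIXES:
        raise ValueError(f"Unknown suffix {suffix!r}")
    # No suffix means the number is taken as is.
    return float(value) * SUFFIXES.get(suffix, 1.0)

for raw in ["10K", "2.5M", "10 thousand"]:
    print(raw, "->", to_number_sketch(raw))
```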
weight_to_numeric
The weight_to_numeric transform converts weight values with locale-specific units to kilograms (ISO format). The function supports units in English, Chinese, Japanese, Spanish, French, German, and others and can handle ambiguous units like 斤 (500g in Chinese, 600g in Japanese) based on locale.
Specify the transformation in this format:
"source": {
  "transform": {
    "arguments": [
      {
        "name": "weight",
        "value": {
          "field": [
            "<name of the source field>"
          ]
        }
      }
    ],
    "transform_name": "weight_to_numeric"
  }
},
Examples
The value of a source field gross_weight would be transformed as follows:
5kg is converted to 5.0
10 lb is converted to 4.53592
500g is converted to 0.5
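The following Python sketch reproduces the examples above and the locale-dependent handling of 斤. It is not the product's implementation: the function name, the locale parameter and its default, the small unit table, and the rounding to five decimal places (chosen to match the 4.53592 example) are assumptions.

```python
import re

# Assumed subset of units with their conversion factor to kilograms.
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}
# 斤 depends on the locale: 500 g in Chinese, 600 g in Japanese.
JIN_TO_KG = {"zh": 0.5, "ja": 0.6}

def weight_to_numeric_sketch(weight, locale="en_US"):
    """Convert a weight string such as '10 lb' to kilograms (illustrative only)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(\S+)\s*", weight)
    if match is None:
        raise ValueError(f"Cannot parse {weight!r}")
    value, unit = match.groups()
    unit = unit.lower()
    if unit == "斤":
        language = locale.split("_")[0]
        if language not in JIN_TO_KG:
            raise ValueError(f"Unit 斤 is ambiguous for locale {locale!r}")
        factor = JIN_TO_KG[language]
    elif unit in UNIT_TO_KG:
        factor = UNIT_TO_KG[unit]
    else:
        raise ValueError(f"Unknown unit {unit!r}")
    return round(float(value) * factor, 5)

for raw in ["5kg", "10 lb", "500g"]:
    print(raw, "->", weight_to_numeric_sketch(raw))
```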