Creating custom schemas for key-value pair classification and extraction

Create JSON schemas to extract specific fields from structured documents with the text classification and extraction API.

To build a custom schema for a document, define the metadata and write an effective description for each field that you want to extract. Then validate and scale the schema for accurate key-value pair classification and extraction.

You define the custom schema for your document in the schemas setting in the semantic_config parameter.

Before you begin

Review your document and note the following information, which determines the field names and descriptions that you define in the schema:

  • The types of data you want to extract from the document
  • The exact labels for the data you want to extract
  • The location of each item on the page, such as upper-left header or right-hand column

For example, note the following information in the California Personal Auto Insurance Application document:

Screenshot of an auto application form PDF with several fields including Contact Name and Phone

  • Data you want to extract, such as Agency Name, Applicant Address, Carrier Name, and Policy Number.
  • The exact field labels, such as “AGENCY”, “APPLICANT’S NAME AND MAILING ADDRESS”, and “POLICY #”. Quoting the labels as they appear in the document helps the foundation model match each label to the correct value.

Procedure

  1. In the metadata at the top of your schema, provide a description for your document in the document_description field. The document_description field is included in the classifier prompt for the foundation model used for key-value pair classification and extraction.

    Tip: Make the document description specific by including keywords to help the classification model correctly identify the document. For example, use keywords such as “California,” “Auto,” and “Application,” in the description for the California Personal Auto Insurance Application document.

    Use the additional_prompt_instructions parameter to provide guidance that the foundation model can apply to the entire page in the document. Use foundation model prompt instructions, such as "Preserve number formatting as seen in the image.", to improve extraction accuracy.

    {
       "document_type": "Auto_Insurance_Application",
       "document_description": "California Personal Auto Application form used to open or update an auto policy.",
       "additional_prompt_instructions": "Return phone numbers exactly as they appear in the document."
    }
    
  2. For each field you include in the schema, define the following three elements:

    • Field name: Choose a unique key name for the field. Use the following tips for creating a field name:
      • Use underscores to separate words. For example, use applicant_name instead of applicantName.
      • Keep names short but descriptive.
      • For fields in sections, use the format [section_name]_[field_name].
      • For table fields, use the format [table_name]_row_[row_number]_[column_name].
    • Example value: Provide a sample value to help the model infer the expected type such as a date value or an integer. Supplying an example improves model performance.
    • Description: Write a brief explanation of what the field represents. The description is passed to the foundation model to help the model understand what to look for during the extraction process. The field description provides context that helps the model validate and focus on the correct information in the document. Use the following tips for writing a description:
      • Be accurate, unambiguous, and specific about where on the document the information is located.
      • Do not include instructions that change the format of the values such as dates or numbers.
      • Mention any labels or headings that identify the field.
      • Note any special cases or variations.

    For example, define a field to extract the agency name from the California Personal Auto Insurance Application document.

    "agency_name": {
      "default": "",
      "example": "Spring Insurance",
      "description": "Name of the insurance agency shown in the Agency section (upper‑left of the page)."
    }
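
The naming tips above can be sketched as a small helper. This is a minimal illustration, not part of the API: the function names to_field_name and table_field_name are hypothetical, and the conversion rules (split camelCase, replace punctuation and spaces with underscores, lowercase everything) are one reasonable reading of the tips.

```python
import re


def to_field_name(label, section=None):
    """Turn a document label into a snake_case schema key.

    Hypothetical helper: the optional section prefix follows the
    [section_name]_[field_name] convention from the naming tips.
    """
    def snake(text):
        # Insert an underscore at camelCase boundaries, then collapse every
        # run of non-alphanumeric characters into a single underscore.
        text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
        return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").lower()

    name = snake(label)
    return f"{snake(section)}_{name}" if section else name


def table_field_name(table, row, column):
    """Build a [table_name]_row_[row_number]_[column_name] key."""
    return f"{to_field_name(table)}_row_{row}_{to_field_name(column)}"


print(to_field_name("applicantName"))
print(to_field_name("Name", section="Applicant"))
print(table_field_name("Additional Garaging Addresses", 1, "Street"))
```

Generating names this way keeps keys consistent across a large schema, but review the results by hand: a label such as “POLICY #” reduces to policy, which might be too short to be descriptive.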
    

    You can optionally use the following methods to define custom values for fields and custom fields for capturing specialized data structures in the input document:

    Optional field elements

    Specify a closed list of allowed values for a field with the available_options parameter. Use the parameter for a field that is not explicitly mentioned in the document, but can be deduced from the context or visual elements.

    For example, in invoices, currency values may appear in various parts of the document with a dollar sign, but may not explicitly mention that the US dollar is the currency of the invoice. In such cases, you can provide a closed list of valid currency values the model can return and reduce hallucinations in the model response.

    "currency": {
      "default": "",
      "example": "USD",
      "description": "The currency used in the invoice.",
      "available_options": ["USD", "EUR", "CNY", "JPY", "GBP", "AUD", "CAD", "CHF", "HKD", "SGD", "INR", "KRW", "MXN", "BRL", "ZAR", "SEK", "NOK", "DKK", "NZD", "TRY", "AED", "THB", "PLN", "IDR", "MYR", "PHP", "RUB", "CZK", "ILS"]
    }
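
Because available_options defines a closed list, you might also want to confirm on the client side that a returned value is in that list. The following sketch assumes that the extracted value is already available as a string; the function name check_against_options is hypothetical, and treating an empty string as "not found" is an assumption based on the empty default values in the examples.

```python
def check_against_options(field_def, value):
    """Return True if value is allowed by the field's available_options.

    Fields without available_options accept any value. An empty string is
    treated as "not found" (an assumption) and is also accepted.
    """
    options = field_def.get("available_options")
    if not options or value == "":
        return True
    return value in options


currency_field = {
    "default": "",
    "example": "USD",
    "description": "The currency used in the invoice.",
    "available_options": ["USD", "EUR", "GBP"],  # shortened list for illustration
}

print(check_against_options(currency_field, "USD"))
print(check_against_options(currency_field, "US Dollar"))
```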
    
    Table definitions

    Set the type parameter to array to define a field in your schema that represents data from tables in the input document.

    The following JSON example defines a table in your schema that contains information about garaging addresses. The table has columns that contain location, street, city, county, state, and zip code data.

    "additional_garaging_addresses": {
          "type": "array",
          "description": "Additional locations where vehicles are regularly kept.",
          "columns": {
            "location": {
               "default": "",
               "example": "LOC1",
               "description": "Location identifier."
            },
            "street": {
               "default": "",
               "example": "456 Garage St",
               "description": "Street address of garaging location."
            },
            "city": {
               "default": "",
               "example": "Los Angeles",
               "description": "City of garaging location."
            },
            "county": {
               "default": "",
               "example": "Los Angeles",
               "description": "County of garaging location."
            },
            "state": {
               "default": "",
               "example": "CA",
               "description": "State abbreviation."
            },
            "zip_plus_4": {
               "default": "",
               "example": "90001-1234",
               "description": "ZIP code with +4 extension."
            }
          }
    }
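
Table definitions repeat the same three keys for every column, so it can help to generate them. The following sketch builds an array field with the same shape as the preceding JSON example; make_table_field is a hypothetical helper, and only three of the six columns are shown to keep the example short.

```python
import json


def make_table_field(description, columns):
    """Build an array-type field from {column_name: (example, description)}."""
    return {
        "type": "array",
        "description": description,
        "columns": {
            name: {"default": "", "example": example, "description": col_desc}
            for name, (example, col_desc) in columns.items()
        },
    }


garaging = make_table_field(
    "Additional locations where vehicles are regularly kept.",
    {
        "location": ("LOC1", "Location identifier."),
        "street": ("456 Garage St", "Street address of garaging location."),
        "city": ("Los Angeles", "City of garaging location."),
    },
)

# Print the generated definition so that it can be pasted into a schema.
print(json.dumps({"additional_garaging_addresses": garaging}, indent=2))
```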
    
  3. Validate your JSON schema locally before using the schema in your text extraction request to make sure it is well-formed and matches the expected structure. You can use the following tools:

    • jsonlint.com to check formatting
    • A Python script to load and inspect the schema
    • Your IDE’s built-in JSON linter
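
A Python check along these lines might look like the following sketch. It parses the schema with the standard json module, which rejects malformed input such as trailing commas, and then looks for the keys used in the examples in this topic. Treating document_type, document_description, and fields as required, and expecting default, example, and description in every field, are assumptions drawn from those examples, not a statement of what the service enforces.

```python
import json

# Assumed from the examples in this topic, not from an official specification.
REQUIRED_METADATA = ("document_type", "document_description", "fields")
FIELD_KEYS = ("default", "example", "description")


def check_schema(schema):
    """Return a list of human-readable problems found in a schema dict."""
    problems = [f"missing metadata key: {k}" for k in REQUIRED_METADATA if k not in schema]
    for name, field in schema.get("fields", {}).items():
        if field.get("type") == "array":
            # Table fields carry their per-column definitions under "columns".
            columns = field.get("columns") or {}
            if not columns:
                problems.append(f"{name}: array field has no columns")
            targets = [(f"{name}.{col}", definition) for col, definition in columns.items()]
        else:
            targets = [(name, field)]
        for label, definition in targets:
            for key in FIELD_KEYS:
                if key not in definition:
                    problems.append(f"{label}: missing {key}")
    return problems


def load_and_check(text):
    """Parse JSON text and check it; json.loads raises on malformed input."""
    return check_schema(json.loads(text))


sample = '{"document_type": "T", "fields": {"agency_name": {"default": ""}}}'
for problem in load_and_check(sample):
    print(problem)
```

An empty result means only that the schema parses and has the expected shape. It does not indicate how well the descriptions guide extraction.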

REST API request example with custom schema

The following command submits a request to extract text by using a complete custom schema that includes all required metadata at the top, followed by a set of fields with accompanying definitions. Each field contains a default value that is empty, an example, and a description to guide the foundation model during the extraction process.

curl -X POST \
  'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version=2025-11-08' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...'

The request body is as follows:

{
    "project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09",
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "ca_auto_insurance_app.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
      },
      "location": {
        "bucket":"my-cloud-object-storage-bucket",
        "file_name": "results_data"
      }
    },
    "parameters": {
      "requested_outputs": [
        "assembly",
        "md",
        "html",
        "plain_text",
        "page_images"
      ],
      "languages": [
        "en"
      ],
      "mode": "standard",
      "ocr_mode": "enabled",
      "create_embedded_images": "disabled",
      "kvp_mode": "generic_with_semantic",
      "semantic_config": {
        "schemas": [ {
           "document_type": "Auto_Insurance_Application",
           "document_description": "A California Personal Auto Application form used to collect information necessary for initiating or updating an auto insurance policy. It includes agency, applicant, carrier, and policy details such as contact information, address, policy number, and effective/expiration dates.",
           "additional_prompt_instructions": "Return phone numbers and policy numbers exactly as they appear in the document.",
           "fields": {
              "agency_name": {
                "default": "",
                "example": "Spring Insurance",
                "description": "Name of the insurance agency handling the auto application."
              },
              "applicant_name": {
                "default": "",
                "example": "John Smith",
                "description": "Full name of the person applying for auto insurance."
              },
              "applicant_address": {
                "default": "",
                "example": "245 W 52nd St, Apt 8B, New York, NY 10019",
                "description": "Mailing address of the applicant including street, apartment, city, state, and ZIP code."
              },
              "applicant_phone": {
                "default": "",
                "example": "(917) 555-2843",
                "description": "Phone number for contacting the applicant."
              },
              "applicant_email": {
                "default": "",
                "example": "john.smith@gmail.com",
                "description": "Email address of the applicant."
              },
              "carrier_name": {
                "default": "",
                "example": "Tower Insurance Company",
                "description": "Name of the insurance carrier providing the policy."
              },
              "policy_number": {
                "default": "",
                "example": "10",
                "description": "Unique identifier for the insurance policy."
              },
              "effective_date": {
                "default": "",
                "example": "2023-01-01",
                "description": "Date when the insurance policy becomes effective."
              },
              "expiration_date": {
                "default": "",
                "example": "2024-01-01",
                "description": "Date when the insurance policy expires."
              },
              "additional_garaging_addresses": {
                "type": "array",
                "description": "Additional locations where vehicles are regularly kept.",
                "columns": {
                  "location": {
                     "default": "",
                     "example": "LOC1",
                     "description": "Location identifier."
                  },
                  "street": {
                     "default": "",
                     "example": "456 Garage St",
                     "description": "Street address of garaging location."
                  },
                  "city": {
                     "default": "",
                     "example": "Los Angeles",
                     "description": "City of garaging location."
                  },
                  "county": {
                     "default": "",
                     "example": "Los Angeles",
                     "description": "County of garaging location."
                  },
                  "state": {
                     "default": "",
                     "example": "CA",
                     "description": "State abbreviation."
                  },
                  "zip_plus_4": {
                     "default": "",
                     "example": "90001-1234",
                     "description": "ZIP code with +4 extension."
                  }
                }
              }
           }
        } ]
      }
    }
}
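
If you prefer to submit the request from Python, the curl command can be mirrored with the standard library. The following sketch only builds the request object and does not send it. The function name build_extraction_request is hypothetical, and the region, token, and body values are placeholders that you must replace with your own.

```python
import json
import urllib.request


def build_extraction_request(region, token, body, version="2025-11-08"):
    """Create (but do not send) the POST request shown in the curl example."""
    url = f"https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version={version}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values: substitute your own region, token, and full request body.
request = build_extraction_request("us-south", "<token>", {"project_id": "<project-id>"})
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```

Serializing the body with json.dumps has a side benefit: a Python dict cannot contain the trailing-comma or missing-comma mistakes that are easy to make when you edit raw JSON by hand.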

Troubleshooting

The following table describes some common issues when you use a custom schema and how to resolve them:

Symptom | Cause | Solution
No values returned | Description is too vague. | Make the description more specific. Mention visual location or nearby labels. Include an instruction to return the text as-is without formatting changes.
Wrong value extracted | Ambiguous field names like “Name”. | Use qualified names such as agency_name or applicant_name.

Learn more