Creating custom schemas for key-value pair extraction
Create JSON schemas to extract specific fields from structured documents with the text extraction API.
To build a custom schema for a document, you must define metadata and write an effective description for each field that you want to extract before you validate and scale the schema for accurate key-value pair extraction.
Before you begin
Review your document and determine the following information that will guide the field names and descriptions you define in the schema:
- The types of data you want to extract from the document
- The exact labels for the data you want to extract
- The location of each item on the page, such as upper-left header or right-hand column
For example, note the following information in the California Personal Auto Insurance Application document:
- Data that you want to extract, such as Agency Name, Applicant Address, Carrier Name, and Policy Number.
- The exact field labels, such as “AGENCY”, “APPLICANT’S NAME AND MAILING ADDRESS”, and “POLICY #”. Quoting the labels as they appear in the document helps the foundation model connect each label to the correct value.
Procedure
- In the metadata at the top of your schema, provide a description for your document in the document_description field. The document_description field is included in the classifier prompt for the foundation model used for key-value pair extraction.
{
"document_type": "Auto_Insurance_Application",
"document_description": "California Personal Auto Application form used to open or update an auto policy.",
"additional_prompt_instructions": "Return phone numbers exactly as they appear in the document."
}
- Determine which fields to include in the schema. For each field, define the following three elements:
- Field name: Choose a unique key name for the field.
- Example value: Provide a sample value to help the model infer the expected type such as a date value or an integer. Supplying an example improves model performance.
- Description: Write a brief explanation of what the field represents. The description is passed to the foundation model to help the model understand what to look for during the extraction process.
Important: The field description provides context that helps the model validate and focus on the correct information in the document. The description must be accurate and unambiguous.
For example, define a field to extract the agency name from the California Personal Auto Insurance Application document.
"agency_name": { "default": "", "example": "Spring Insurance", "description": "Name of the insurance agency shown in the Agency section (upper‑left of the page)." }
- Specify additional attribute values in your schema with the available_options parameter. Use the parameter for a field whose value is not explicitly stated in the document but can be deduced from the context or visual elements. For example, in invoices, currency values might appear in various parts of the document with a dollar sign, but the document might not explicitly state that the US dollar is the currency of the invoice. In such cases, you can provide a closed list of valid currency values that the model can return, which reduces hallucinations in the model response.
"currency": { "default": "", "example": "USD", "available_options": ["USD", "EUR", "CNY", "JPY", "GBP", "AUD", "CAD", "CHF", "HKD", "SGD", "INR", "KRW", "MXN", "BRL", "ZAR", "SEK", "NOK", "DKK", "NZD", "TRY", "AED", "THB", "PLN", "IDR", "MYR", "PHP", "RUB", "CZK", "ILS"], "description": "The currency used in the invoice." }
- Optional: Validate your JSON schema locally before you use the schema in your text extraction request to make sure that it is well-formed and matches the expected structure. You can use tools such as:
- jsonlint.com to check formatting
- A Python script to load and inspect the schema
- Your IDE’s built-in JSON linter
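The local validation step can be sketched as a short Python script. This is a minimal sketch, not an official validator: the function name check_schema is hypothetical, and the lists of required metadata keys and per-field keys are assumptions inferred from the examples in this topic rather than a published specification.

```python
import json

# Assumed structure, inferred from the examples in this topic.
REQUIRED_METADATA = ("document_type", "document_description", "fields")
REQUIRED_FIELD_KEYS = ("default", "example", "description")


def check_schema(schema_text):
    """Return a list of problems found in a custom schema given as JSON text."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError as err:
        # Malformed JSON, for example a trailing comma.
        return [f"not well-formed JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(schema, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in REQUIRED_METADATA:
        if key not in schema:
            problems.append(f"missing metadata key: {key}")

    fields = schema.get("fields", {})
    if not isinstance(fields, dict):
        return problems + ["fields must be a JSON object"]

    for name, spec in fields.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: definition must be a JSON object")
            continue
        for key in REQUIRED_FIELD_KEYS:
            if key not in spec:
                problems.append(f"{name}: missing {key}")
        # When a closed list is given, the example should be one of its values.
        options = spec.get("available_options")
        if options is not None and spec.get("example") not in options:
            problems.append(f"{name}: example is not in available_options")
    return problems
```

An empty list means that none of these checks failed. It does not guarantee that the service accepts the schema.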
Custom schema and API request example
The following command submits a request to extract text by using a complete custom schema that includes all required metadata at the top, followed by a set of fields with accompanying definitions. Each field contains a default value that is empty, an example, and a description to guide the model during the extraction process.
curl -X POST \
'https://{region}.ml.cloud.ibm.com/ml/v1/text/extractions?version=2024-10-18' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJraWQiOi...'
The request body is as follows:
{
"project_id": "e40e5895-ce4d-42a3-b699-8ac764b89a09",
"document_reference": {
"type": "connection_asset",
"connection": {
"id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
},
"location": {
"bucket":"my-cloud-object-storage-bucket",
"file_name": "ca_auto_insurance_app.pdf"
}
},
"results_reference": {
"type": "connection_asset",
"connection": {
"id": "5c0cefce-da57-408b-b47d-58f7785de3ee"
},
"location": {
"bucket":"my-cloud-object-storage-bucket",
"file_name": "results_data"
}
},
"parameters": {
"requested_outputs": [
"assembly",
"md",
"html",
"plain_text",
"page_images"
],
"languages": [
"en"
],
"mode": "standard",
"ocr_mode": "enabled",
"create_embedded_images": "disabled",
"semantic_config": {
"schemas": [ {
"document_type": "Auto_Insurance_Application",
"document_description": "A California Personal Auto Application form used to collect information necessary for initiating or updating an auto insurance policy. It includes agency, applicant, carrier, and policy details such as contact information, address, policy number, and effective/expiration dates.",
"additional_prompt_instructions": "Return phone numbers and policy numbers exactly as they appear in the document.",
"fields": {
"agency_name": {
"default": "",
"example": "Spring Insurance",
"description": "Name of the insurance agency handling the auto application."
},
"applicant_name": {
"default": "",
"example": "John Smith",
"description": "Full name of the person applying for auto insurance."
},
"applicant_address": {
"default": "",
"example": "245 W 52nd St, Apt 8B, New York, NY 10019",
"description": "Mailing address of the applicant including street, apartment, city, state, and ZIP code."
},
"applicant_phone": {
"default": "",
"example": "(917) 555-2843",
"description": "Phone number for contacting the applicant."
},
"applicant_email": {
"default": "",
"example": "john.smith@gmail.com",
"description": "Email address of the applicant."
},
"carrier_name": {
"default": "",
"example": "Tower Insurance Company",
"description": "Name of the insurance carrier providing the policy."
},
"policy_number": {
"default": "",
"example": "10",
"description": "Unique identifier for the insurance policy."
},
"effective_date": {
"default": "",
"example": "2023-01-01",
"description": "Date when the insurance policy becomes effective."
},
"expiration_date": {
"default": "",
"example": "2024-01-01",
"description": "Date when the insurance policy expires."
}
}
} ]
}
}
}
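The same request can be assembled in Python. The following is a minimal sketch that uses only the standard library and mirrors the endpoint, version, and headers of the curl command. The function name build_extraction_request and the values "us-south", "<token>", and "<project-id>" are placeholders chosen for illustration. The sketch builds the request but does not send it.

```python
import json
import urllib.request


def build_extraction_request(region, token, body):
    """Build, but do not send, the POST request for a text extraction job."""
    url = (
        f"https://{region}.ml.cloud.ibm.com"
        "/ml/v1/text/extractions?version=2024-10-18"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # serialize the request body
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values: substitute your own region, token, and request body.
req = build_extraction_request("us-south", "<token>", {"project_id": "<project-id>"})
# To submit the request: urllib.request.urlopen(req)
```

Serializing the body with json.dumps also guards against formatting slips such as trailing commas, which are easy to introduce when the JSON is edited by hand.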
Best practices
Use the following best practices to write effective schemas:
- Field naming conventions
- Use underscores to separate words. For example, use applicant_name instead of applicantName. Keep names short but descriptive. For fields in sections, use the format [section_name]_[field_name]. For table fields, use the format [table_name]_row_[row_number]_[column_name].
- Write effective descriptions
- Be specific about where on the document the information is located. Do not include instructions that change the format of the values such as dates or numbers. Mention any labels or headings that identify the field. Note any special cases or variations.
- Additional prompt instructions
- Use foundation model prompt instructions to improve extraction accuracy for specific document types, such as "Preserve number formatting as seen in the image."
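The field naming conventions can be captured in a few small helpers. This is an illustrative sketch: the function names are hypothetical, and the check assumes that field names are lowercase, which goes slightly beyond the stated guidance of using underscores instead of camel case.

```python
import re

# Lowercase words separated by single underscores (lowercase is an assumption).
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_snake_case(name):
    """Return True if the field name uses underscore-separated lowercase words."""
    return SNAKE_CASE.fullmatch(name) is not None


def section_field_name(section_name, field_name):
    """Build a name in the format [section_name]_[field_name]."""
    return f"{section_name}_{field_name}"


def table_field_name(table_name, row_number, column_name):
    """Build a name in the format [table_name]_row_[row_number]_[column_name]."""
    return f"{table_name}_row_{row_number}_{column_name}"
```

For example, section_field_name("agency", "name") produces agency_name, and table_field_name("vehicles", 1, "make") produces vehicles_row_1_make, where the vehicles table is a made-up example.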
Troubleshooting
The following table describes some common issues when you use a custom schema and how to resolve them:
Symptom | Cause | Solution |
---|---|---|
No values returned | The description is too vague. | Make the description more specific. Mention the visual location or nearby labels. Include an instruction to return the text as-is without formatting changes. |
Wrong value extracted | The field name is ambiguous, such as “Name”. | Use qualified names such as agency_name or applicant_name. |
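Both issues in the table above can be flagged before you submit a request. The following sketch is a heuristic only: the function name find_schema_risks, the list of ambiguous names, and the minimum description length are all assumptions chosen for illustration, not rules enforced by the service.

```python
# Hypothetical word list and threshold; tune them for your own documents.
AMBIGUOUS_NAMES = {"name", "date", "number", "address", "phone", "email"}
MIN_DESCRIPTION_WORDS = 5


def find_schema_risks(fields):
    """Return warnings for fields that might cause missing or wrong values."""
    warnings = []
    for name, spec in fields.items():
        # Unqualified names can match several values in the same document.
        if name.lower() in AMBIGUOUS_NAMES:
            warnings.append(f"{name}: ambiguous field name; qualify it")
        # Very short descriptions rarely mention a location or a label.
        words = spec.get("description", "").split()
        if len(words) < MIN_DESCRIPTION_WORDS:
            warnings.append(f"{name}: description may be too vague")
    return warnings
```

Pass the fields object of your schema to the function and review each warning manually. A short description is not necessarily wrong, and a long one is not necessarily specific.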