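Parsing the JSON structure generated by text extraction

Text that is extracted from a document or image file by using the text extraction API is written to a JSON file that contains detailed information about the various textual and visual elements in the document. You can process the generated JSON further to extract the information that you want.

The following root-level JSON objects are always returned when you extract text by using the text extraction API: metadata, top_level_structures, and all_structures. The structures within these root objects are optional, meaning that they might or might not be returned in the output.

The styles key is another root-level object that is often returned, but it is optional. The styles object contains a list of dictionaries, each of which holds details about a font that is used in the document, such as the font size and font style.

You can write code to extract text from the structures that interest you. For more information, see the sections later in this document on how paragraphs, text in images, lists, tables, and key-value pairs are represented.

For details about the JSON schema, see Text extraction JSON schema.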

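Metadata

The metadata key is a dictionary that contains metadata details about the processed file, such as the following:

  • num_pages: Number of pages in the file.
  • title: Title of the file.
  • keywords: Keywords that are associated with the file.
  • author: Author of the file.
  • publication_date: Date when the file was created or published.
  • subject: Subject of the file.
  • charset: Character set standard that is used in the file.

The following JSON output is an example of the structure of the metadata object for a PDF file.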

"metadata":{
  "num_pages":28,
  "title":"Put AI to work for HR and talent transformation for the retail industry",
  "keywords":"",
  "author":"IBM",
  "publication_date":"",
  "subject":"Apply AI capabilities to drive your HR and talent transformation and generate better business outcomes in the retail industry.",
  "charset":"UTF-8"
}

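Structures

Two keys point to the data structures in the parsed document:

  • top_level_structures: List of the IDs of the top-level data structures.
  • all_structures: Lists of all data structures, grouped by type.

The all_structures key contains lists for every possible type of data structure in the parsed document. The structures are optional, so they might or might not be included in the output. Some of the data structures that might be included for a parsed document are as follows:

  • sections: List of all sections in the file.
  • section_titles: List of the section titles that were detected.
  • lists: Collection of all lists in the document.
  • list_items: Collection of the list items that are present in the detected list objects.
  • list_identifiers: Collection of the list identifiers of the detected list objects.
  • tables: List of all tables in the file.
  • table_rows: List of the table rows that are present in the detected tables.
  • table_cells: List of the table cells that are present in the detected table rows.
  • tokens: List of plain-text tokens.
  • subscripts: List of subscript text instances that are related to tokens detected in the document.
  • superscripts: List of superscript text instances that are related to tokens detected in the document.
  • footnotes: List of footnotes.
  • paragraphs: List of paragraphs.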
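
Because every structure type is optional, code that reads all_structures is more robust when it first checks what is actually present. The following Python sketch counts the entries of each structure type that is not empty. The function name count_structures and the trimmed sample data are illustrative assumptions, not part of the API.

```python
def count_structures(output: dict) -> dict:
    """Return {structure type: number of entries} for the non-empty types."""
    all_structures = output.get("all_structures") or {}
    return {name: len(items) for name, items in all_structures.items() if items}


# Hypothetical, heavily trimmed output that only mimics the documented layout.
sample_output = {
    "all_structures": {
        "sections": [],
        "tables": [],
        "paragraphs": [{"id": "PARA_bc9320"}, {"id": "PARA_8e9e62"}],
        "tokens": [
            {"id": "TOKEN_132783"},
            {"id": "TOKEN_f0e333"},
            {"id": "TOKEN_dd48c3"},
        ],
    }
}

structure_counts = count_structures(sample_output)
print(structure_counts)  # {'paragraphs': 2, 'tokens': 3}
```

A summary like this can help you decide which structures are worth traversing before you write more specific extraction code.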

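Processing the extracted JSON

You can use a JSON processor library to extract text from the different structures in the generated JSON file.

The following command returns the number of pages in the PDF, a value that is stored in a single JSON object: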

cat output_retail.json | jq '.metadata.num_pages'
Note: The command uses jq, a command-line JSON processor that must be installed separately.
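
If you prefer Python to jq, the same lookup can be sketched with the standard json module. The helper name get_metadata_field is an illustrative assumption, and the inline sample stands in for the generated file, which you would normally read with json.load.

```python
import json

# Trimmed stand-in for the generated file (values taken from the metadata example).
sample_output = json.loads("""
{
  "metadata": {
    "num_pages": 28,
    "keywords": "",
    "author": "IBM",
    "charset": "UTF-8"
  },
  "top_level_structures": [],
  "all_structures": {}
}
""")


def get_metadata_field(output: dict, field: str, default=None):
    """Return one metadata field, or `default` when the field is absent."""
    return (output.get("metadata") or {}).get(field, default)


num_pages = get_metadata_field(sample_output, "num_pages")
print(num_pages)  # 28
```

Passing a default value avoids a KeyError for metadata fields that the schema describes as not required.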

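For some structures, such as tables and lists, the extracted text is stored in various objects in the generated JSON. You can use code to traverse the objects and extract the text that interests you.

How paragraphs are represented

Most commonly, a single paragraph is associated with a sequence of tokens, where each token represents a word.

In some cases, a paragraph is associated with other structures, such as sections and lists.

The following JSON output shows the relationship between paragraphs and tokens for the sentence "Collect, organize, grow data" when text is extracted from a PDF.

Figure: Screenshot of a PDF with the sentence "Collect, organize, grow data" highlighted.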

//The section is listed in the top_level_structures array.
"top_level_structures":["PARA_fbdcdd",...,"SECTION_a2ab08",...],

//The section has a list of paragraphs.
{"id":"SECTION_9a3dda","parent_id":"SECTION_a2ab08","children_ids":["PARA_09384c",...

//The paragraph contains a section title.
{"id":"PARA_09384c","parent_id":"SECTION_9a3dda",
"text_alignment":"left","children_ids":["SECTION_TITLE_a5e3c2"],

//Token IDs listed for the section title.
{"id":"SECTION_TITLE_a5e3c2","parent_id":"PARA_09384c",
"text_alignment":"TBD","children_ids":[
  "TOKEN_48bbae","TOKEN_cc0b9c","TOKEN_d57d27","TOKEN_a7d6da"
]},

//Consecutive tokens with a shared parent_id contain the text from the sentence of interest.
{"id":"TOKEN_48bbae","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"Collect,",
"bbox":{"page_number":8,"x":283.0,"y":775.2945,"width":106.43201,"height":21.44}},
{"id":"TOKEN_cc0b9c","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"organize,",
"bbox":{"page_number":8,"x":396.984,"y":775.2945,"width":126.78082,"height":21.44}},
{"id":"TOKEN_d57d27","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"grow",
"bbox":{"page_number":8,"x":531.31683,"y":775.2945,"width":69.823975,"height":21.44}},
{"id":"TOKEN_a7d6da","parent_id":"SECTION_TITLE_a5e3c2","style_id":"IBM_Plex_Sans_Light_Black_32_0",
"text":"data",
"bbox":{"page_number":8,"x":608.6928,"y":775.2945,"width":62.880005,"height":21.44}},

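The parent and child links shown above can be followed in code. The following Python sketch indexes every structure by ID and then walks children_ids recursively to rebuild the sentence. The names build_index and collect_text are illustrative assumptions, and the sample contains only the objects that appear in full in the excerpt.

```python
def build_index(all_structures: dict) -> dict:
    """Map every structure ID to its object, across all structure types."""
    index = {}
    for items in all_structures.values():
        for item in items:
            index[item["id"]] = item
    return index


def collect_text(structure_id: str, index: dict) -> list:
    """Return the token texts under a structure, in children_ids order."""
    node = index.get(structure_id)
    if node is None:
        return []  # ID not present in this (possibly trimmed) output
    children = node.get("children_ids") or []
    if not children:
        return [node["text"]] if "text" in node else []
    words = []
    for child_id in children:
        words.extend(collect_text(child_id, index))
    return words


# Only the objects that the excerpt shows in full; bbox and style_id are omitted.
sample_structures = {
    "paragraphs": [
        {"id": "PARA_09384c", "parent_id": "SECTION_9a3dda",
         "children_ids": ["SECTION_TITLE_a5e3c2"]},
    ],
    "section_titles": [
        {"id": "SECTION_TITLE_a5e3c2", "parent_id": "PARA_09384c",
         "children_ids": ["TOKEN_48bbae", "TOKEN_cc0b9c",
                          "TOKEN_d57d27", "TOKEN_a7d6da"]},
    ],
    "tokens": [
        {"id": "TOKEN_48bbae", "parent_id": "SECTION_TITLE_a5e3c2", "text": "Collect,"},
        {"id": "TOKEN_cc0b9c", "parent_id": "SECTION_TITLE_a5e3c2", "text": "organize,"},
        {"id": "TOKEN_d57d27", "parent_id": "SECTION_TITLE_a5e3c2", "text": "grow"},
        {"id": "TOKEN_a7d6da", "parent_id": "SECTION_TITLE_a5e3c2", "text": "data"},
    ],
}

index = build_index(sample_structures)
sentence = " ".join(collect_text("PARA_09384c", index))
print(sentence)  # Collect, organize, grow data
```

Building the index once keeps each lookup constant-time, which matters for documents with many thousands of tokens.
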
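How text in images is represented

When you submit a PDF file that contains images, or an image file, to the text extraction method of the watsonx.ai API, the text in the image is represented by tokens. The tokens are typically contained in paragraph or section objects.

Figure: Screenshot of a PNG file with text.

The following JSON excerpt shows how a PNG file that is submitted to the text extraction method is represented in the JSON output. The paragraph objects that contain the text tokens are available from the top_level_structures object and the all_structures root object.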

"top_level_structures":
[
  "PARA_bc9320","PARA_8e9e62","PARA_b7f5cc","PARA_c75980","PARA_61a6a5","PARA_c8c2a8","PARA_8b8dd6","PARA_8c7c77","PARA_61aa92","PARA_1e6d2a","PARA_6eaa8d","PARA_cc6df5","PARA_4a9fb2"
],
"all_structures":{"sections":[],"section_titles":[],"lists":[],
  "list_items":[],"list_identifiers":[],"tables":[],"table_rows":[],
  "table_cells":[],"subscripts":[],"superscripts":[],"footnotes":[],
  "paragraphs":
  [
    {"id":"PARA_bc9320","parent_id":"root","text_alignment":"center",
    "children_ids":["TOKEN_132783","TOKEN_f0e333","TOKEN_dd48c3",
    "TOKEN_c9b25e","TOKEN_080303","TOKEN_ce1aa0","TOKEN_97bf62"]...
    {"id":"PARA_8e9e62","parent_id":"root",...
    ...
    {"id":"PARA_4a9fb2","parent_id":"root",...
  ]

The extracted text is specified in the tokens within the paragraphs. The following tokens represent the words The AI Ladder® from the image:

"tokens":[
  {"id":"TOKEN_132783","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"The","bbox":{"page_number":1,"x":250.65,"y":109.3,"width":38.880005,"height":21.48999}},
  {"id":"TOKEN_f0e333","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"AI","bbox":{"page_number":1,"x":295.82,"y":114.67,"width":24.109985,"height":16.290009}},
  {"id":"TOKEN_dd48c3","parent_id":"PARA_bc9320","style_id":"Arial_Black_10_0",
    "text":"Ladder®","bbox":{"page_number":1,"x":325.74,"y":110.24,"width":82.66,"height":22.030006}}

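Each token carries a bbox, so the position of a phrase in the image can be estimated from its tokens. The following Python sketch joins the token text for a paragraph and computes one box that encloses all of its tokens. It assumes that x and y mark the corner from which width and height extend, which the excerpt does not state, and the function name paragraph_text_and_bbox is hypothetical.

```python
def paragraph_text_and_bbox(paragraph: dict, tokens: list):
    """Return (text, enclosing bbox) for the tokens that belong to a paragraph."""
    by_id = {token["id"]: token for token in tokens}
    # Skip child IDs whose tokens are not available (the excerpt is trimmed).
    children = [by_id[i] for i in paragraph["children_ids"] if i in by_id]
    text = " ".join(token["text"] for token in children)
    boxes = [token["bbox"] for token in children if token.get("bbox")]
    if not boxes:
        return text, None
    left = min(b["x"] for b in boxes)
    top = min(b["y"] for b in boxes)
    right = max(b["x"] + b["width"] for b in boxes)
    bottom = max(b["y"] + b["height"] for b in boxes)
    # Assumes all tokens of the paragraph are on the same page.
    return text, {"page_number": boxes[0]["page_number"], "x": left, "y": top,
                  "width": right - left, "height": bottom - top}


# Values copied from the excerpt; only three of the seven tokens are shown there.
paragraph = {"id": "PARA_bc9320", "children_ids": [
    "TOKEN_132783", "TOKEN_f0e333", "TOKEN_dd48c3",
    "TOKEN_c9b25e", "TOKEN_080303", "TOKEN_ce1aa0", "TOKEN_97bf62"]}
tokens = [
    {"id": "TOKEN_132783", "text": "The",
     "bbox": {"page_number": 1, "x": 250.65, "y": 109.3,
              "width": 38.880005, "height": 21.48999}},
    {"id": "TOKEN_f0e333", "text": "AI",
     "bbox": {"page_number": 1, "x": 295.82, "y": 114.67,
              "width": 24.109985, "height": 16.290009}},
    {"id": "TOKEN_dd48c3", "text": "Ladder®",
     "bbox": {"page_number": 1, "x": 325.74, "y": 110.24,
              "width": 82.66, "height": 22.030006}},
]

text, box = paragraph_text_and_bbox(paragraph, tokens)
print(text)  # The AI Ladder®
print(box)
```

An enclosing box like this can be used to highlight the phrase on the source page, provided that the coordinate assumption above holds for your output.
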
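How lists are represented

The structure of a list is represented by three separate objects, all of which are part of the all_structures root object:

  • lists: Set of list items that are formatted as a bulleted or numbered list.
  • list_items: A single item in a list. A list item can contain tokens for text, paragraphs, or nested lists.
  • list_identifiers: Contains a symbol, such as a hyphen or number, that is used to identify a list item.

The following JSON output shows how the text "Providing transparency" is represented in the first item of a list.

Figure: Screenshot of a PDF with a bulleted list, where the first list item contains the highlighted words "Providing transparency".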

//The lists object contains the list where the listitem is located.
"lists":[{"id":"LIST_ed036e","parent_id":"SECTION_9a3dda","children_ids":[
  "LISTITEM_c802c4",...

//The list_item object contains the list item which contains a list ID followed by several tokens.
"list_items":[{"id":"LISTITEM_c802c4","parent_id":"LIST_ed036e","children_ids":[
  "LIST_ID_781ee7","TOKEN_1df44f","TOKEN_1bcdbf",...

//The list_identifiers object contains list IDs with tokens.
"list_identifiers":[{"id":"LIST_ID_781ee7","parent_id":"LISTITEM_c802c4",
  "children_ids":["TOKEN_4a66cb"]}

//The list ID token includes a token with a hyphen.
{"id":"TOKEN_4a66cb","parent_id":"LIST_ID_781ee7","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"–","bbox":{"page_number":10,"x":994.0,"y":500.36,"width":11.76001,"height":13.639999}}

//The list item tokens include the text *Providing transparency* in them.
{"id":"TOKEN_1df44f","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"Providing","bbox":{"page_number":10,"x":1014.0,"y":500.36,"width":83.55994,"height":13.639999}},
{"id":"TOKEN_1bcdbf","parent_id":"LISTITEM_c802c4","style_id":"IBM_Plex_Sans_Black_20_0",
  "text":"transparency","bbox":{"page_number":10,"x":1102.2799,"y":500.36,"width":117.95801,"height":13.639999}}...

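The following Python code extracts text from a list and reconstructs the list, to show how to extract token text by looping through the list items.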

# Import required libraries
import json

# Define helper functions

## Function, which finds entry in collection by key-value pair
def find_by_key(key: str, value, collection: list, unique=True):
  find = [x for x in collection if x.get(key) == value]
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    if not find:
      # Fail with a clear message instead of an IndexError on find[0]
      raise KeyError(f"No entry found with {key}={value!r}")
    return find[0]
  else:
    return find

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all list-related structures
all_lists = raw_output['all_structures']['lists']
all_list_items = raw_output['all_structures']['list_items']
all_list_identifiers = raw_output['all_structures']['list_identifiers']

# Get all list items from the first list in the file
list_1 = all_lists[0]
list_1_items = []

for list_item_id in list_1['children_ids']:
  list_1_items.append(find_by_key('id', list_item_id, all_list_items))

# Reconstruct the list
recon_list = []

flat_col = flatten_collection(raw_output['all_structures'])
for list_item in list_1_items:
  val = []
  for list_value_id in list_item['children_ids']:
    list_value = find_by_key('id', list_value_id, flat_col)
    #print(list_value['id'])
    if list_value['id'].startswith("LIST_ID"):
      for list_id_value_id in list_value['children_ids']:
        list_id_value = find_by_key('id', list_id_value_id, flat_col)
        if 'text' in list_id_value:
          val.append(list_id_value['text'])
    elif list_value['id'].startswith("PARA"):
      val.append("\n")
      for para_value_id in list_value['children_ids']:
        para_value = find_by_key('id', para_value_id, flat_col)
        if 'text' in para_value:
          val.append(para_value['text'])
    elif list_value['id'].startswith("TOKEN"):
      val.append(list_value['text'])
    else:
      pass
  print(' '.join(val))

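How tables are represented

The structure of a table is represented by three separate objects, which are part of the all_structures root object:

  • tables: Each table is associated with multiple table rows.
  • table_rows: Each table row is associated with multiple table cells.
  • table_cells: Each table cell contains a sequence of tokens, a mixed sequence of paragraphs and tokens, or a mixed sequence of lists, paragraphs, and tokens.

The following JSON output shows how the table column header "Workflows" is represented.

Figure: Screenshot of a PDF file with a two-column table, where the first column header (the word "Workflows") is highlighted.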


//The all_structures root object contains the table, which has many rows.
"all_structures":{
  ...
  "tables":[{"id":"TABLE_3bfabb","children_ids":[
    "ROW_39aa6f",...,"ROW_63472c"]}

//A separate table rows array contains table cells.
"all_structures":{
  ...
  "table_rows":[{"id":"ROW_39aa6f","parent_id":"TABLE_3bfabb","children_ids":[
    "CELL_bc1c4b","CELL_3a8cdd","CELL_03b6d3"]}

//One of the table cells is identified as a column header and contains a paragraph.
{"id":"CELL_3a8cdd","parent_id":"ROW_39aa6f","is_row_header":false,
  "is_col_header":true,"col_span":1,"row_span":1,"col_start":2,"row_start":1,
  "children_ids":["PARA_088d08"]}

//The paragraph has a token.
{"id":"PARA_088d08","parent_id":"CELL_3a8cdd","children_ids":[
  "TOKEN_b99851"],"indentation":1}

//The token contains the text *Workflows*.
{"id":"TOKEN_b99851","parent_id":"PARA_088d08","style_id":"IBM_Plex_Sans_SmBld_Black_20_0_bold",
  "text":"Workflows","bbox":{"page_number":14,"x":757.0,"y":291.44003,"width":99.15997,"height":13.96}}

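The following Python code extracts text from a table and reconstructs the table, to show how to extract token text by looping through the table rows and cells.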

# Import required libraries
import json
import numpy as np
import pandas as pd

# Define helper functions
## Function, which finds entry in collection by key-value pair
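def find_by_key(key: str, value, collection: list, unique=True):
  find = [x for x in collection if x.get(key) == value]
  if unique:
    if len(find) > 1:
      raise ValueError(f"Found non-unique key-value pair.\n{find}")
    if not find:
      # Fail with a clear message instead of an IndexError on find[0]
      raise KeyError(f"No entry found with {key}={value!r}")
    return find[0]
  else:
    return find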

## Function, which flattens iterable collection of dicts
def flatten_collection(collection):
  result = []
  for val in collection.values():
    result.extend(val)
  return result

# Load the file with the extracted text
with open("/Users/janedoe/Downloads/output_retail.json") as f:
  raw_output = json.load(f)

# Get all table-related structures
all_tables = raw_output['all_structures']['tables']
all_table_rows = raw_output['all_structures']['table_rows']
all_table_cells = raw_output['all_structures']['table_cells']

# Get all of the cells from the first table
table_1 = all_tables[0]
table_1_cells = []

for row_id in table_1['children_ids']:
  row = find_by_key('id', row_id, all_table_rows)
  for cell_id in row['children_ids']:
    table_1_cells.append(find_by_key('id', cell_id, all_table_cells))

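# Reconstruct the first table
# Size the grid from the largest row and column positions (1-based in the JSON)
last_col = max(cell['col_start'] for cell in table_1_cells)
last_row = max(cell['row_start'] for cell in table_1_cells)

recon_table = np.empty([last_row, last_col], dtype=object)

flat_col = flatten_collection(raw_output['all_structures'])
# Loop over the cells of the first table only, not over every cell in the file
for cell in table_1_cells:
  cell_col, cell_row = cell['col_start'], cell['row_start']
  entries = []
  for cell_value in cell.get('children_ids', []):
    value = find_by_key('id', cell_value, flat_col)
    if 'text' in value and not value.get('children_ids'):
      # The child is a token (or another leaf) that carries its own text
      entries.append(value['text'])
      continue
    for cell_entry in value.get('children_ids', []):
      entry = find_by_key('id', cell_entry, flat_col)
      if 'text' in entry:
        entries.append(entry['text'])
  recon_table[cell_row-1][cell_col-1] = " ".join(entries)

# Use the first row as the column headers
pd.DataFrame(data=recon_table[1:,:], columns=recon_table[0,:])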
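
The reconstruction above treats the first row as the header row. As an alternative, header cells can be picked out by their flag. Note that the sample output names the flag is_col_header, whereas the JSON schema names it is_column_header, so the following Python sketch checks both. The function names and the second sample cell are hypothetical.

```python
def is_column_header(cell: dict) -> bool:
    """True when the cell is flagged as a column header under either key name."""
    # The sample output uses is_col_header; the schema lists is_column_header.
    return bool(cell.get("is_col_header", cell.get("is_column_header", False)))


def column_header_cells(cells: list) -> list:
    """Return the header cells, ordered by their column position."""
    headers = [cell for cell in cells if is_column_header(cell)]
    return sorted(headers, key=lambda cell: cell.get("col_start", 0))


sample_cells = [
    # Hypothetical data cell, added only to show that non-headers are skipped.
    {"id": "CELL_example", "parent_id": "ROW_example", "is_row_header": False,
     "is_column_header": False, "col_start": 2, "row_start": 2,
     "children_ids": []},
    # Header cell copied from the JSON excerpt.
    {"id": "CELL_3a8cdd", "parent_id": "ROW_39aa6f", "is_row_header": False,
     "is_col_header": True, "col_span": 1, "row_span": 1, "col_start": 2,
     "row_start": 1, "children_ids": ["PARA_088d08"]},
]

header_ids = [cell["id"] for cell in column_header_cells(sample_cells)]
print(header_ids)  # ['CELL_3a8cdd']
```

Checking the flag rather than the row position also covers tables whose header spans more than one row.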

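How key-value pairs are represented

Labeled data is represented as key-value pairs in the kvps array, which the JSON schema defines as a root-level property of the output. Each key-value pair has the following fields:

  • id: Unique ID of the key and value combination.
  • type: Type of the extracted value. Values that the model processes by using the generic extraction method are assigned the key_value type. Values that are processed by using the schema-based extraction method are assigned the only_value type.
  • key: Unique label for the data. If the key-value pair data is extracted by using the generic extraction method, the semantic_label, id, and bbox elements are populated. If the model extracts the key-value pair data by using the schema-based extraction method, only semantic_label is populated, with the value from the schema, and the raw_text and bbox elements are not populated.
  • value: Data that is associated with the label.

The following JSON output shows how the contact name and phone fields from a California personal auto application form are represented.

Figure: Screenshot of a PDF auto application form that contains multiple fields, including the contact name and phone.

"kvps": [
  {
    "id": "KVP_000034",
    "type": "key_value",
    "key": {
      "id": "KEY_000034",
      "semantic_label": "contact_name",
      "raw_text": "CONTACT NAME",
      "normalized_text": null,
      "confidence_score": null,
      "bbox": {
        "x": 26.406426231269133,
        "y": 178.04464285714283,
        "width": 42.25028197003061,
        "height": 15.482142857142861,
        "page_number": 1
      }
    },
    "value": {
      "id": "VALUE_000034",
      "raw_text": "John Smith",
      "normalized_text": null,
      "confidence_score": null,
      "bbox": {
        "x": 76.57863607068049,
        "y": 178.04464285714283,
        "width": 60.73478033191901,
        "height": 10.321428571428584,
        "page_number": 1
      }
    }
  },
  {
    "id": "KVP_000035",
    "type": "key_value",
    "key": {
      "id": "KEY_000035",
      "semantic_label": "contact_phone",
      "raw_text": "PHONE (A/C. No. Ext)",
      "normalized_text": null,
      "confidence_score": null,
      "bbox": {
        "x": 26.406426231269133,
        "y": 196.10714285714283,
        "width": 42.250283760672005,
        "height": 14.837047751844239,
        "page_number": 1
      }
    },
    "value": {
      "id": "VALUE_000035",
      "raw_text": "(917) 555-2843",
      "normalized_text": null,
      "confidence_score": null,
      "bbox": {
        "x": 95.06313727469147,
        "y": 196.10715158651936,
        "width": 75.91847683596005,
        "height": 12.256690608987071,
        "page_number": 1
      }
    }
  },
  ...
]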

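The following Python code extracts text from the key-value pairs in the assembled JSON output file, to show how to loop through the structured data and reconstruct the content.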

def extract_kvps(assembly_dict):
    """
    Extract and print key-value pairs from the assembly dict
    Works with both 'only_value' and 'key_value' type KVPs
    Includes coordinate information and page dimensions
    """
    try:
        data = assembly_dict

        # Get page metadata for dimensions
        page_metadata = data.get("metadata", {}).get("pages_metadata", [])
        if page_metadata:
            print("Document Page Information:")
            for page in page_metadata:
                page_num = page.get("page_number", "Unknown")
                page_width = page.get("page_pdf_width", "Unknown")
                page_height = page.get("page_pdf_height", "Unknown")
                page_image_width = page.get("page_image_width", "Unknown")
                page_image_height = page.get("page_image_height", "Unknown")

                print(f"Page {page_num}:")
                print(f"  PDF Dimensions: {page_width} x {page_height}")
                print(f"  Image Dimensions: {page_image_width} x {page_image_height}")
                print()
        else:
            print("No page metadata found in the document\n")

        # Extract KVPs if they exist in the data
        kvps = data.get("kvps", [])

        if not kvps:
            print("No KVPs found in the JSON data")
            return

        print(f"Found {len(kvps)} Key-Value Pairs\n")
        print("=" * 80)

        # Process each KVP
        for i, kvp in enumerate(kvps, 1):
            kvp_id = kvp.get("id", "Unknown ID")
            kvp_type = kvp.get("type", "Unknown type")

            # Get key and value information
            # ("or {}" also covers a key or value that is present but null)
            key_info = kvp.get("key") or {}
            value_info = kvp.get("value") or {}

            # Get semantic label (if any)
            semantic_label = key_info.get("semantic_label", "N/A")

            # Get key text (if any)
            key_text = key_info.get("raw_text", "N/A")

            # Get value text
            value_text = value_info.get("raw_text", "N/A")

            # Get coordinates (bounding boxes); a null bbox is treated as missing
            key_bbox = key_info.get("bbox") or "N/A"
            value_bbox = value_info.get("bbox") or "N/A"

            # Print the information
            print(f"KVP #{i}: {kvp_id}")
            print(f"Type: {kvp_type}")

            if kvp_type == "only_value":
                print(f"Semantic Label: {semantic_label}")
                print(f"Value: {value_text}")
                print(f"Value Coordinates:")
                if value_bbox != "N/A":
                    print(f"  x: {value_bbox['x']}, y: {value_bbox['y']}")
                    print(f"  width: {value_bbox['width']}, height: {value_bbox['height']}")
                    print(f"  page: {value_bbox['page_number']}")
                else:
                    print("  No coordinates available")
            else:  # key_value type
                print(f"Key Text: {key_text}")
                print(f"Normalized key: {semantic_label}")
                print(f"Value: {value_text}")

                print(f"Key Coordinates:")
                if key_bbox != "N/A":
                    print(f"  x: {key_bbox['x']}, y: {key_bbox['y']}")
                    print(f"  width: {key_bbox['width']}, height: {key_bbox['height']}")
                    print(f"  page: {key_bbox['page_number']}")
                else:
                    print("  No coordinates available")

                print(f"Value Coordinates:")
                if value_bbox != "N/A":
                    print(f"  x: {value_bbox['x']}, y: {value_bbox['y']}")
                    print(f"  width: {value_bbox['width']}, height: {value_bbox['height']}")
                    print(f"  page: {value_bbox['page_number']}")
                else:
                    print("  No coordinates available")

            print("-" * 80)

    except Exception as e:
        print(f"Error processing KVPs: {e}")
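
When you only need the extracted values rather than a printed report, the key-value pairs can be collapsed into a plain dictionary. The following Python sketch is one way to do that. The function name kvps_to_dict is hypothetical, and the sample is trimmed to the fields that the function reads.

```python
def kvps_to_dict(output: dict) -> dict:
    """Map each key's semantic label to the raw text of its value."""
    result = {}
    for kvp in output.get("kvps") or []:
        key_info = kvp.get("key") or {}
        value_info = kvp.get("value") or {}
        # Fall back to the key text, then the pair ID, when no label is set.
        label = (key_info.get("semantic_label")
                 or key_info.get("raw_text")
                 or kvp.get("id"))
        result[label] = value_info.get("raw_text")
    return result


# Trimmed copy of the two pairs from the JSON example (bbox and scores omitted).
sample_output = {
    "kvps": [
        {"id": "KVP_000034", "type": "key_value",
         "key": {"id": "KEY_000034", "semantic_label": "contact_name",
                 "raw_text": "CONTACT NAME"},
         "value": {"id": "VALUE_000034", "raw_text": "John Smith"}},
        {"id": "KVP_000035", "type": "key_value",
         "key": {"id": "KEY_000035", "semantic_label": "contact_phone",
                 "raw_text": "PHONE (A/C. No. Ext)"},
         "value": {"id": "VALUE_000035", "raw_text": "(917) 555-2843"}},
    ]
}

contact = kvps_to_dict(sample_output)
print(contact["contact_name"])   # John Smith
print(contact["contact_phone"])  # (917) 555-2843
```

If the same label occurs more than once, later pairs overwrite earlier ones in this sketch, so collect the values in a list instead when a form repeats its fields.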

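Text extraction JSON schema

When you write code to extract information from the JSON that is generated for a document, you can refer to the JSON schema.

Note:

Any structure whose description mentions "Not required" might be removed in a future iteration of the schema. If you choose to reference optional structures in your code, you might need to update your code when the schema changes later.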
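
One way to guard against such changes is to check for the required keys before you process a file. The following Python sketch uses the required lists that the schema gives for the root object and for Metadata. The function name missing_required_keys is a hypothetical helper, not part of the API.

```python
# Key names copied from the "required" arrays in the schema.
REQUIRED_ROOT_KEYS = ("metadata", "top_level_structures", "all_structures")
REQUIRED_METADATA_KEYS = ("num_pages", "charset")


def missing_required_keys(output: dict) -> list:
    """Return the required keys that are absent from an extraction output."""
    missing = [key for key in REQUIRED_ROOT_KEYS if key not in output]
    metadata = output.get("metadata") or {}
    missing += [f"metadata.{key}" for key in REQUIRED_METADATA_KEYS
                if key not in metadata]
    return missing


complete = {
    "metadata": {"num_pages": 28, "charset": "UTF-8"},
    "top_level_structures": [],
    "all_structures": {},
}
partial = {"metadata": {"num_pages": 1}}

print(missing_required_keys(complete))  # []
print(missing_required_keys(partial))
# ['top_level_structures', 'all_structures', 'metadata.charset']
```

Everything outside these required keys is best read with dict.get and a default value, so that an optional structure that disappears does not raise an error.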

{ "$defs": {
    "AssemblyJsonOutput": {
      "type": "object",
      "properties": {
        "metadata": {
          "description": "Metadata about this document.",
          "$ref": "#/$defs/Metadata"
        },
        "styles": {
          "description": "Font styles used in this document. Not required.",
          "type": "array",
          "items": {
            "$ref": "#/$defs/Style"
          }
        },
        "kvps": {
          "description": "Key value pairs found in the document.",
          "type": "array",
          "items": {
            "$ref": "#/$defs/Kvp"
          }
        },
        "top_level_structures": {
          "type": "array",
          "description": "Array of ids of the top level structures which belong directly under the document",
          "items": {
            "type": "string"
          }
        },
        "all_structures": {
          "type": "object",
          "description": "An object containing lists of all structures identified in this document.",
          "$ref": "#/$defs/Structures"
        }
      },
      "required": [
        "metadata",
        "top_level_structures",
        "all_structures"
      ]
    },
    "Metadata": {
      "type": "object",
      "additionalProperties": true,
      "title": "Metadata",
      "properties": {
        "num_pages": {
          "type": "integer",
          "description": "Total number of pages in the document"
        },
        "title": {
          "type": "string",
          "description": "Document title as obtained from source document. Not required."
        },
        "language": {
          "type": "string",
          "description": "Determined by the lang specifier in the <html> tag, or <meta> tag"
        },
        "url": {
          "type": "string",
          "description": "url of the document"
        },
        "keywords": {
          "type": "string",
          "description": "Keywords associated with document. Not required."
        },
        "author": {
          "type": "string",
          "description": "Author of the document. Not required."
        },
        "publication_date": {
          "type": "string",
          "description": "Best effort basis for a publication date (may be the creation date). Not required."
        },
        "subject": {
          "type": "string",
          "description": "Subject as obtained from the source document. Not required."
        },
        "charset": {
          "type": "string",
          "description": "Character set used for the output"
        },
        "output_tokens_flag": {
          "type": "boolean",
          "description": "Whether individual tokens are output, as specified in the input to the API"
        },
        "output_bounding_boxes_flag": {
          "type": "boolean",
          "description": "Whether bounding boxes are output, as requested in the input to the API"
        },
        "pages_metadata": {
          "type": "array",
          "items": {
            "$ref": "#/$defs/PageMetadata"
          }
        }
      },
      "required": [
        "num_pages",
        "charset"
      ]
    },
    "PageMetadata": {
      "type": "object",
      "title": "PageMetadata",
      "properties": {
        "page_number": {
          "type": "integer",
          "description": "Page number, starting from 1"
        },
        "page_image_width": {
          "type": "integer",
          "description": "Width of the page in pixels, assuming the page is an image with the DPI as specified in the dpi property "
        },
        "page_image_height": {
          "type": "integer",
          "description": "Height of the page in pixels, assuming the page is an image with DPI as specified in the dpi property"
        },
        "dpi": {
          "type": "integer",
          "description": "The DPI to use for the page image, as specified in the input to the API"
        }
      }
    },
    "Style": {
      "type": "object",
      "title": "Style",
      "properties": {
        "style_id": {
          "type": "string",
          "description": "Style Identifier which will be used for reference in other objects"
        },
        "font_size": {
          "type": "string",
          "description": "Font size"
        },
        "font_name": {
          "type": "string",
          "description": "Font name"
        },
        "is_bold": {
          "type": "string",
          "description": "Whether or not the font is bold"
        },
        "is_italic": {
          "type": "string",
          "description": "Whether or not the font is italic"
        }
      },
      "required": [
        "style_id",
        "font_size",
        "font_name",
        "is_bold",
        "is_italic"
      ]
    },
    "Kvp": {
      "type": "object",
      "title": "KVP",
      "properties": {
        "id": {
          "type": "string",
          "description": "A unique ID of the KVP prefixed with KVP_"
        },
        "type": {
          "type": "string",
          "description": "The type of the KVP"
        },
        "key": {
          "type": "object",
          "description": "The key data of the KVP",
          "$ref": "#/$defs/KvpKey"
        },
        "value": {
          "type": "object",
          "description": "The value data of the KVP",
          "$ref": "#/$defs/KvpValue"
        }
      },
      "required": [
        "id",
        "type",
        "value"
      ]
    },
    "KvpKey": {
      "type": "object",
      "title": "KvpKey",
      "properties": {
        "id": {
          "type": "string",
          "description": "A unique ID of the KVP key prefixed with KEY_"
        },
        "semantic_label": {
          "type": "string",
          "description": "The semantic label of the KVP"
        },
        "raw_text": {
          "type": "string",
          "description": "The original text of the key extracted in the document"
        },
        "normalized_text": {
          "type": "string",
          "description": "The normalized text of the key"
        },
        "confidence_score": {
          "type": "float",
          "description": "The confidence score of the key"
        },
        "bbox": {
          "type": "object",
          "description": "The bounding box of the key",
          "$ref": "#/$defs/BoundingBox"
        }
      },
      "required": [
        "id",
        "raw_text"
      ]
    },
    "KvpValue": {
      "type": "object",
      "title": "KvpValue",
      "properties": {
        "id": {
          "type": "string",
          "description": "A unique ID of the KVP value prefixed with VALUE_"
        },
        "raw_text": {
          "type": "string",
          "description": "The original text of the value extracted in the document"
        },
        "normalized_text": {
          "type": "string",
          "description": "The normalized text of the value"
        },
        "confidence_score": {
          "type": "float",
          "description": "The confidence score of the value"
        },
        "bbox": {
          "type": "object",
          "description": "The bounding box of the value",
          "$ref": "#/$defs/BoundingBox"
        }
      },
      "required": [
        "id",
        "raw_text" 
      ]
    },
    "Structures": {
      "type": "object",
      "description": "An object containing all flattened structures identified in the document. None of the items in this object are required.",
      "sections": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Section"
        }
      },
      "section_titles": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/SectionTitle"
        }
      },
      "lists": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/List"
        }
      },
      "list_items": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListItem"
        }
      },
      "list_identifiers": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/ListIdentifier"
        }
      },
      "tables": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Table"
        }
      },
      "table_rows": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableRow"
        }
      },
      "table_cells": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/TableCell"
        }
      },
      "subscripts": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Subscript"
        }
      },
      "superscripts": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Superscript"
        }
      },
      "footnotes": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Footnote"
        }
      },
      "paragraphs": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Paragraph"
        }
      },
      "code_snippets": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/CodeSnippet"
        }
      },
      "pictures":{
        "type": "array",
        "items": {
          "$ref": "#/$defs/Picture"
        }
      },
      "tokens": {
        "type": "array",
        "items": {
          "$ref": "#/$defs/Token"
        }
      }
    },
    "Section": {
      "type": "object",
      "title": "Section",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        },
        "section_number": {
          "type": "string",
          "description": "Section identifier identified in the document"
        },
        "section_level": {
          "type": "string",
          "description": "Nesting level of section identified in the document"
        }
      },
      "required": [
        "id",
        "parent_id",
        "children_ids",
        "section_number",
        "section_level"
      ]
    },
    "SectionTitle": {
      "type": "object",
      "title": "SectionTitle",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the section"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the section title. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "List": {
      "type": "object",
      "title": "List",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list "
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      },
      "required": [
        "id",
        "parent_id",
        "children_ids"
      ]
    },
    "ListItem": {
      "type": "object",
      "title": "ListItem",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "ListIdentifier": {
      "type": "object",
      "title": "ListIdentifier",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the list item"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence"
        }
      },
      "required": [
        "id",
        "parent_id",
        "children_ids"
      ]
    },
    "Table": {
      "type": "object",
      "title": "Table",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table rows"
        }
      },
      "required": [
        "id",
        "parent_id",
        "children_ids"
      ]
    },
    "TableRow": {
      "type": "object",
      "title": "TableRow",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table row"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, table cells"
        }
      },
      "required": [
        "id",
        "parent_id",
        "children_ids"
      ]
    },
    "TableCell": {
      "type": "object",
      "title": "TableCell",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the table cell"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "is_row_header": {
          "type": "boolean",
          "description": "Whether the cell is part of a row header"
        },
        "is_column_header": {
          "type": "boolean",
          "description": "Whether the cell is part of a column header"
        },
        "col_span": {
          "type": "integer",
          "description": "Number of columns the cell spans"
        },
        "row_span": {
          "type": "integer",
          "description": "Number of rows the cell spans"
        },
        "col_start": {
          "type": "integer",
          "description": "Column index at which the cell starts within the table"
        },
        "row_start": {
          "type": "integer",
          "description": "Row index at which the cell starts within the table"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, underlying paragraphs. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "is_row_header",
        "is_column_header",
        "col_span",
        "row_span",
        "col_start",
        "row_start",
        "text"
      ]
    },
    "Subscript": {
      "type": "object",
      "title": "Subscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the subscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the subscript belongs"
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "Superscript": {
      "type": "object",
      "title": "Superscript",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the superscript"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "footnote_ref": {
          "type": "string",
          "description": "Matching footnote id found on the page"
        },
        "token_id_ref": {
          "type": "string",
          "description": "Id of the token to which the superscript belongs"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "footnote_ref",
        "text"
      ]
    },
    "Footnote": {
      "type": "object",
      "title": "Footnote",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the footnote"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "Paragraph": {
      "type": "object",
      "title": "Paragraph",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the paragraph"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens. Not required."
        },
        "text_alignment": {
          "type": "string",
          "description": "Text alignment of the paragraph. Not required."
        },
        "indentation": {
          "type": "integer",
          "description": "Paragraph indentation. Not required."
        },
        "text": {
          "type": "string",
          "description": "Text property added to all objects"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "CodeSnippet": {
      "type": "object",
      "title": "CodeSnippet",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the code snippet"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique Ids of first level children structures under this structure in correct sequence, in this case, tokens",
          "items": {
            "type": "string"
          }
        },
        "text": {
          "type": "string",
          "description": "Text of the code snippet. It can contain multiple lines, including empty lines or lines with leading spaces."
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "Picture": {
      "type": "object",
      "title": "Picture",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the picture"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "children_ids": {
          "type": "array",
          "description": "Unique identifiers of the tokens extracted from this picture, if any"
        },
        "text": {
          "type": "string",
          "description": "Text extracted from this picture"
        },
        "verbalization": {
          "type": "string",
          "description": "Verbalization of this picture"
        },
        "page_number": {
          "type": "integer",
          "description": "Page that contains this picture"
        },
        "path": {
          "type": "string",
          "description": "Path in the output location where the picture itself was saved"
        },
        "bbox": {
          "type": "object",
          "description": "The bounding box of the picture in the context of the page, expressed as pixel coordinates with respect to pages_metadata.page_image_height and pages_metadata.page_image_width",
          "$ref": "#/$defs/BoundingBox"
        }
      },
      "required": [
        "id",
        "parent_id"
      ]
    },
    "Token": {
      "type": "object",
      "title": "Token",
      "properties": {
        "id": {
          "type": "string",
          "description": "Unique identifier for the token"
        },
        "parent_id": {
          "type": "string",
          "description": "Unique identifier which denotes parent of this structure"
        },
        "style_id": {
          "type": "string",
          "description": "Identifier of the style object associated with this token. Not required."
        },
        "text": {
          "type": "string",
          "description": "Actual text of the token"
        },
        "bbox": {
          "type": "object",
          "description": "The bounding box of the token in the context of the page, expressed as pixel coordinates with respect to pages_metadata.page_image_height and pages_metadata.page_image_width",
          "$ref": "#/$defs/BoundingBox"
        }
      },
      "required": [
        "id",
        "parent_id",
        "text"
      ]
    },
    "BoundingBox": {
      "type": "object",
      "title": "BoundingBox",
      "properties": {
        "page_number": {
          "description": "Which page this represents",
          "type": "integer"
        },
        "x": {
          "description": "X coordinate of the top left corner of the bounding box",
          "type": "number"
        },
        "y": {
          "description": "Y coordinate of the top left corner of the bounding box",
          "type": "number"
        },
        "width": {
          "description": "The width of the bounding box",
          "type": "number"
        },
        "height": {
          "description": "The height of the bounding box",
          "type": "number"
        }
      },
      "required": [
        "page_number",
        "x",
        "y",
        "width",
        "height"
      ]
    }
  },
  "$ref": "#/$defs/AssemblyJsonOutput"
}
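
The TableCell definition in the schema above carries enough positional information (row_start, col_start, row_span, col_span) to rebuild a table as a two-dimensional grid. The following Python code is a minimal sketch of that idea, not an official client. It assumes the cells have already been loaded as dictionaries that follow the TableCell schema, and that every span is at least 1. The schema does not state whether the start indices begin at 0 or 1, so the sketch normalizes them against the smallest value it sees. The function name cells_to_grid and the sample data are hypothetical.

```python
from typing import Dict, List


def cells_to_grid(cells: List[Dict]) -> List[List[str]]:
    """Place TableCell dictionaries into a 2-D grid using their start and span fields."""
    if not cells:
        return []
    # Assumption: the index base (0 or 1) is unknown, so shift by the minimum.
    row0 = min(c["row_start"] for c in cells)
    col0 = min(c["col_start"] for c in cells)
    n_rows = max(c["row_start"] - row0 + c["row_span"] for c in cells)
    n_cols = max(c["col_start"] - col0 + c["col_span"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        r = c["row_start"] - row0
        k = c["col_start"] - col0
        # A spanning cell has its text repeated in every position it covers.
        for dr in range(c["row_span"]):
            for dc in range(c["col_span"]):
                grid[r + dr][k + dc] = c["text"]
    return grid


# Hypothetical sample shaped like the TableCell schema (not real API output).
sample_cells = [
    {"id": "c1", "parent_id": "r1", "is_row_header": False, "is_column_header": True,
     "col_span": 2, "row_span": 1, "col_start": 0, "row_start": 0, "text": "Sales"},
    {"id": "c2", "parent_id": "r2", "is_row_header": False, "is_column_header": False,
     "col_span": 1, "row_span": 1, "col_start": 0, "row_start": 1, "text": "Q1"},
    {"id": "c3", "parent_id": "r2", "is_row_header": False, "is_column_header": False,
     "col_span": 1, "row_span": 1, "col_start": 1, "row_start": 1, "text": "Q2"},
]

if __name__ == "__main__":
    for row in cells_to_grid(sample_cells):
        print(" | ".join(row))
```

Repeating the text of a merged cell across its span keeps every row the same length, which is convenient when the grid is later written to CSV. If merged cells should appear only once, fill only the top left position instead.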