XML file format for exported documents

When you export documents to XML, each document is exported as a separate XML file.

XML elements

The format of the XML file has three sections. The first section, Document, identifies the document. The second section, Content, contains information about the document content. The third section, Metadata, contains information about the document metadata, including the names of fields and facets.

If you select the option to use the field name or facet path for mapping metadata when you export documents, field names and facet paths are represented as XML elements in the exported files, an approach that might be useful if you use IBM® Content Collector. Otherwise, the field names and facet paths are represented as attributes in the XML output.

Table 1. Elements and attributes in the XML output for exported documents
| Element or attribute name | Required or optional | Remarks |
| --- | --- | --- |
| /Document | Required | Root element. |
| /Document@Id | Required | The document ID, such as the URL. |
| /Document@Type | Required | The document type. Possible values are NORMAL (for normal documents), DELETED (for documents that were deleted from the crawl space), and ERROR (for documents that return errors). |
| /Document/Content | Required | Second-level element that contains information about the document content. |
| /Document/Content@Truncated | Required | Indicates whether the content is truncated by the crawler. |
| /Document/Content@Path | Optional | The file path of the content, if it is also exported. |
| /Document/Metadata | Required | Second-level element that contains information about the document metadata. |
| /Document/Metadata/Fields | Optional | Third-level element that contains information about fields. |
| /Document/Metadata/Fields/Field | Optional | The value of the metadata field. |
| /Document/Metadata/Fields/Field@Name | Optional | The name of the metadata field. |
| /Document/Metadata/Facets | Optional | Third-level element that contains information about facets. |
| /Document/Metadata/Facets/Facet | Optional | The facet name, if provided. |
| /Document/Metadata/Facets/Facet/Path | Optional | The facet path. |
| /Document/Metadata/Facets/Facet@Begin | Optional | If this facet comes from an annotation, the character position that marks the beginning of the annotation. |
| /Document/Metadata/Facets/Facet@End | Optional | If this facet comes from an annotation, the character position that marks the end of the annotation. |
| /Document/Metadata/Facets/Facet/Keyword | Optional | A value associated with this facet. |
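
The element and attribute layout in Table 1 can be read with any XML parser. The following is a minimal sketch in Python that uses the standard xml.etree.ElementTree module. The function name summarize_export, the SAMPLE document, and the assumption that Truncated holds the string "true" or "false" are illustrative only and are not part of the product.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down export used only for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Document Id="example-id" Type="NORMAL">
  <Content Truncated="false"></Content>
  <Metadata>
    <Fields>
      <Field Name="CITY">ARMONK</Field>
    </Fields>
  </Metadata>
</Document>"""


def summarize_export(xml_bytes):
    """Return the required parts of one exported document as a dict."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "Document":
        raise ValueError("root element must be Document")

    content = root.find("Content")
    metadata = root.find("Metadata")
    if content is None or metadata is None:
        raise ValueError("Content and Metadata are required elements")

    return {
        # Required attributes: a missing one raises KeyError.
        "id": root.attrib["Id"],
        "type": root.attrib["Type"],
        # Assumption: the attribute is serialized as "true" or "false".
        "truncated": content.attrib["Truncated"] == "true",
        # Optional attribute: None when the content was not exported.
        "path": content.get("Path"),
        # Optional Fields/Field elements: (name, value) pairs in file order.
        "fields": [
            (field.get("Name"), field.text or "")
            for field in metadata.findall("Fields/Field")
        ],
    }


if __name__ == "__main__":
    print(summarize_export(SAMPLE.encode("utf-8")))
```

The sketch passes bytes to the parser so that the encoding named in the XML declaration is honored. Fields are kept as a list of pairs rather than a dictionary because the table does not state that field names are unique within a document.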

Example of a crawled document export to XML

This example shows the XML output that is created for a document that was crawled by the JDBC database crawler and then exported. All crawled columns are included in the XML output.

<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
  <Content Truncated="false" Path="D:&#x5C;export&#x5C;crawled&#x5C;20090728161558&#x5C;0&#x5C;0670.dat" Encoded="false"></Content>
  <Metadata>
    <Fields>
      <Field Name="__$DatabaseName$__">jdbc:db2://localhost:50000/export</Field>
      <Field Name="__$TableName$__">ADMINISTRATOR.COMPANY</Field>
      <Field Name="__$ContentColumnName$__">DESCRIPTION</Field>
      <Field Name="ID">1</Field>
      <Field Name="NAME">International Business Machines</Field>
      <Field Name="NAME2">IBM</Field>
      <Field Name="CITY">ARMONK</Field>
      <Field Name="STATE">NY</Field>
      <Field Name="COUNTRY">US</Field>
      <Field Name="PHONE">914-499-1900</Field>
    </Fields>
    <Facets></Facets>
  </Metadata>
</Document>

The corresponding content file of the exported document, saved as D:/export/crawled/20090728161558/0/0670.dat, contains the following text:

IBM helped pioneer information technology over the years, and it stands today at the forefront of a worldwide industry that is revolutionizing the way in which enterprises, organizations and people operate and thrive.
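
In the example above, the Path attribute writes each backslash as the numeric character reference &#x5C;, which an XML parser decodes automatically. The following sketch, which again assumes Python and its standard library, shows one way to recover the Windows path and the field values. The helper names content_location and fields_as_dict are hypothetical, and CRAWLED is a shortened copy of the example, not a complete export.

```python
import xml.etree.ElementTree as ET
from pathlib import PureWindowsPath

# Shortened copy of the crawled-document example (two fields only).
CRAWLED = b"""<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
  <Content Truncated="false" Path="D:&#x5C;export&#x5C;crawled&#x5C;20090728161558&#x5C;0&#x5C;0670.dat" Encoded="false"></Content>
  <Metadata>
    <Fields>
      <Field Name="__$TableName$__">ADMINISTRATOR.COMPANY</Field>
      <Field Name="NAME2">IBM</Field>
    </Fields>
    <Facets></Facets>
  </Metadata>
</Document>"""


def content_location(xml_bytes):
    """Return the exported content file path, or None if Path is absent."""
    root = ET.fromstring(xml_bytes)
    content = root.find("Content")
    raw_path = None if content is None else content.get("Path")
    # The parser has already turned each &#x5C; into a backslash.
    # PureWindowsPath handles Windows paths on any operating system.
    return None if raw_path is None else PureWindowsPath(raw_path)


def fields_as_dict(xml_bytes):
    """Map each Field@Name to its text value (a later duplicate wins)."""
    root = ET.fromstring(xml_bytes)
    return {
        field.get("Name"): (field.text or "")
        for field in root.iterfind("Metadata/Fields/Field")
    }


if __name__ == "__main__":
    print(content_location(CRAWLED))
    print(fields_as_dict(CRAWLED))
```

PureWindowsPath only manipulates the path string and never touches the file system, so the sketch runs even where the D: drive does not exist.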

Example of an analyzed document export to XML

This example shows the XML output that is created for a document that was crawled by the JDBC database crawler, analyzed in the document processing pipeline, and then exported. Only fields that were configured to be returnable are included in the XML output.

<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
  <Content Truncated="false" Path="D:&#x5C;export&#x5C;analyzed&#x5C;20090728161558&#x5C;0&#x5C;0670.dat" Encoded="false">IBM helped pioneer information technology over the years, and it stands today at the forefront of a worldwide industry that is revolutionizing the way in which enterprises, organizations and people operate and thrive.</Content>
  <Metadata>
    <Fields>
      <Field Name="date">1248765056000</Field>
      <Field Name="$language">en</Field>
      <Field Name="$source">database</Field>
      <Field Name="city">ARMONK</Field>
      <Field Name="name">International Business Machines</Field>
      <Field Name="name2">IBM</Field>
      <Field Name="phone">914-499-1900</Field>
      <Field Name="state">NY</Field>
    </Fields>
    <Facets>
      <Facet>
        <Path>date</Path>
        <Path>2009</Path>
        <Path>7</Path>
        <Path>28</Path>
        <Path>16</Path>
      </Facet>
      <Facet>
        <Path>LOCATION</Path>
        <Path>STATE</Path>
        <Path>CITY</Path>
        <Keyword>ARMONK</Keyword>
      </Facet>
      <Facet>
        <Path>LOCATION</Path>
        <Path>STATE</Path>
        <Keyword>NY</Keyword>
      </Facet>
      <Facet Begin="0" End="10">
        <Path>tkm_base_word</Path>
        <Path>noun</Path>
        <Path>unk</Path>
        <Keyword>IBM</Keyword>
      </Facet>

      ...

      <Facet Begin="198" End="209">
        <Path>tkm_base_word</Path>
        <Path>verb</Path>
        <Keyword>operate</Keyword>
      </Facet>
      <Facet Begin="206" End="216">
        <Path>tkm_base_word</Path>
        <Path>conj</Path>
        <Keyword>and</Keyword>
      </Facet>
      <Facet Begin="210" End="217">
        <Path>tkm_base_word</Path>
        <Path>verb</Path>
        <Keyword>thrive</Keyword>
      </Facet>
    </Facets>
  </Metadata>
</Document>
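
Each Facet element in the analyzed example carries an ordered list of Path elements, an optional Keyword, and, for facets that come from annotations, Begin and End attributes. The following sketch, assuming Python and its standard library, collects them into simple records. The Facet tuple type, the read_facets function, and the shortened ANALYZED document are illustrative names and data, not part of the product.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional, Tuple

# Shortened copy of the analyzed-document example (three facets only).
ANALYZED = b"""<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
  <Content Truncated="false" Encoded="false"></Content>
  <Metadata>
    <Facets>
      <Facet>
        <Path>date</Path>
        <Path>2009</Path>
        <Path>7</Path>
        <Path>28</Path>
        <Path>16</Path>
      </Facet>
      <Facet>
        <Path>LOCATION</Path>
        <Path>STATE</Path>
        <Path>CITY</Path>
        <Keyword>ARMONK</Keyword>
      </Facet>
      <Facet Begin="210" End="217">
        <Path>tkm_base_word</Path>
        <Path>verb</Path>
        <Keyword>thrive</Keyword>
      </Facet>
    </Facets>
  </Metadata>
</Document>"""


class Facet(NamedTuple):
    path: Tuple[str, ...]   # Path elements in document order
    keyword: Optional[str]  # None when no Keyword element is present
    begin: Optional[int]    # None unless the facet comes from an annotation
    end: Optional[int]


def _to_int(value):
    """Convert an optional attribute string to int, keeping None as None."""
    return None if value is None else int(value)


def read_facets(xml_bytes):
    """Return every Facet under /Document/Metadata/Facets as a record."""
    root = ET.fromstring(xml_bytes)
    facets = []
    for element in root.iterfind("Metadata/Facets/Facet"):
        path = tuple(p.text or "" for p in element.findall("Path"))
        facets.append(
            Facet(
                path=path,
                keyword=element.findtext("Keyword"),
                begin=_to_int(element.get("Begin")),
                end=_to_int(element.get("End")),
            )
        )
    return facets


if __name__ == "__main__":
    for facet in read_facets(ANALYZED):
        print("/".join(facet.path), facet.keyword, facet.begin, facet.end)
```

The sketch keeps Begin and End as plain integers and does not try to slice the content with them, because the source does not specify how the character positions are counted.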