When you export documents to XML, each document is exported as a separate XML file.
The format of the XML file has three sections. The first section, Document, identifies the document. The second section, Content, contains information about the document content. The third section, Metadata, contains information about the document metadata, including the names of fields and facets.
If you select the option to use the field name or facet path for mapping metadata when you export documents, field names and facet paths are represented as XML elements in the exported files, an approach that might be useful if you use IBM® Content Collector. Otherwise, the field names and facet paths are represented as attributes in the XML output.
Element or attribute name | Required or Optional | Remarks |
---|---|---|
/Document | Required | Root element. |
/Document@Id | Required | The document ID, such as the URL. |
/Document@Type | Required | The document type. Possible values are NORMAL (for normal documents), DELETED (for documents that were deleted from the crawl space), and ERROR (for documents that return errors). |
/Document/Content | Required | Second-level element that contains information about the document content. |
/Document/Content@Truncated | Required | Indicates whether the content is truncated by the crawler. |
/Document/Content@Path | Optional | The file path of the content, if it is also exported. |
/Document/Metadata | Required | Second-level element that contains information about the document metadata. |
/Document/Metadata/Fields | Optional | Third-level element that contains information about fields. |
/Document/Metadata/Fields/Field | Optional | The value of the metadata field. |
/Document/Metadata/Fields/Field@Name | Optional | The name of the metadata field. |
/Document/Metadata/Facets | Optional | Third-level element that contains information about facets. |
/Document/Metadata/Facets/Facet | Optional | The facet name, if provided. |
/Document/Metadata/Facets/Facet/Path | Optional | The facet path. |
/Document/Metadata/Facets/Facet@Begin | Optional | If this facet comes from an annotation, the character position that marks the beginning of the annotation. |
/Document/Metadata/Facets/Facet@End | Optional | If this facet comes from an annotation, the character position that marks the end of the annotation. |
/Document/Metadata/Facets/Facet/Keyword | Optional | A value associated with this facet. |
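The required items in the table can be checked programmatically before an exported file is processed further. The following is a minimal sketch in Python that uses only the standard library. The function name `check_export` and the wording of the messages are illustrative assumptions, not part of the product.

```python
import xml.etree.ElementTree as ET

# Values listed for /Document@Type in the table above.
KNOWN_TYPES = {"NORMAL", "DELETED", "ERROR"}

def check_export(xml_text):
    """Return a list of problems; an empty list means every required item is present.

    check_export is a hypothetical helper, not a product API.
    """
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "Document":
        problems.append("root element is not Document")
    for attr in ("Id", "Type"):
        if root.get(attr) is None:
            problems.append("Document is missing @" + attr)
    doc_type = root.get("Type")
    if doc_type is not None and doc_type not in KNOWN_TYPES:
        problems.append("unexpected Type: " + doc_type)
    content = root.find("Content")
    if content is None:
        problems.append("missing Content element")
    elif content.get("Truncated") is None:
        problems.append("Content is missing @Truncated")
    if root.find("Metadata") is None:
        problems.append("missing Metadata element")
    return problems
```

The optional items (Fields, Facets, and their children) are deliberately not checked, because a valid export can omit them.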
This example shows the XML output that is created for a document that was crawled by the JDBC database crawler and then exported. All crawled columns are included in the XML output.
<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
<Content Truncated="false" Path="D:\export\crawled\20090728161558\0\0670.dat" Encoded="false"></Content>
<Metadata>
<Fields>
<Field Name="__$DatabaseName$__">jdbc:db2://localhost:50000/export</Field>
<Field Name="__$TableName$__">ADMINISTRATOR.COMPANY</Field>
<Field Name="__$ContentColumnName$__">DESCRIPTION</Field>
<Field Name="ID">1</Field>
<Field Name="NAME">International Business Machines</Field>
<Field Name="NAME2">IBM</Field>
<Field Name="CITY">ARMONK</Field>
<Field Name="STATE">NY</Field>
<Field Name="COUNTRY">US</Field>
<Field Name="PHONE">914-499-1900</Field>
</Fields>
<Facets></Facets>
</Metadata>
</Document>
The corresponding content file of the exported document follows. This file was saved as D:\export\crawled\20090728161558\0\0670.dat:
IBM helped pioneer information technology over the years, and it stands today at the forefront of a worldwide industry that is revolutionizing the way in which enterprises, organizations and people operate and thrive.
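An export like the crawled example above can be read back with a few lines of Python. The sketch below collects the Field values by name and returns the Path attribute that points to the content file. The helper name `read_fields` and the shortened sample are assumptions made for illustration, and values are kept in lists on the assumption that a field name might appear more than once.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the crawled example; a raw string keeps the backslashes intact.
SAMPLE = r"""<Document Id="example" Type="NORMAL">
<Content Truncated="false" Path="D:\export\crawled\20090728161558\0\0670.dat" Encoded="false"></Content>
<Metadata>
<Fields>
<Field Name="__$TableName$__">ADMINISTRATOR.COMPANY</Field>
<Field Name="NAME2">IBM</Field>
<Field Name="CITY">ARMONK</Field>
</Fields>
<Facets></Facets>
</Metadata>
</Document>"""

def read_fields(xml_text):
    """Return (fields, content_path) for one exported document.

    read_fields is a hypothetical helper, not a product API.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for field in root.findall("./Metadata/Fields/Field"):
        # Keep every value in case the same field name occurs more than once.
        fields.setdefault(field.get("Name"), []).append(field.text or "")
    content = root.find("Content")
    content_path = content.get("Path") if content is not None else None
    return fields, content_path

fields, content_path = read_fields(SAMPLE)
print(fields["CITY"], content_path)
```

Because Path is optional, callers should expect `content_path` to be `None` when the content was not exported as a separate file.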
This example shows the XML output that is created for a document that was crawled by the JDBC database crawler, analyzed in the document processing pipeline, and then exported. Only fields that were configured to be returnable are included in the XML output.
<?xml version="1.0" encoding="UTF-8"?>
<Document Id="jdbc://jdbc%3Adb2%3A%2F%2Flocalhost%3A50000%2Fexport/ADMINISTRATOR.COMPANY/ID/1" Type="NORMAL">
<Content Truncated="false" Path="D:\export\analyzed\20090728161558\0\0670.dat" Encoded="false">IBM helped pioneer information technology over the years, and it stands today at the forefront of a worldwide industry that is revolutionizing the way in which enterprises, organizations and people operate and thrive.</Content>
<Metadata>
<Fields>
<Field Name="date">1248765056000</Field>
<Field Name="$language">en</Field>
<Field Name="$source">database</Field>
<Field Name="city">ARMONK</Field>
<Field Name="name">International Business Machines</Field>
<Field Name="name2">IBM</Field>
<Field Name="phone">914-499-1900</Field>
<Field Name="state">NY</Field>
</Fields>
<Facets>
<Facet>
<Path>date</Path>
<Path>2009</Path>
<Path>7</Path>
<Path>28</Path>
<Path>16</Path>
</Facet>
<Facet>
<Path>LOCATION</Path>
<Path>STATE</Path>
<Path>CITY</Path>
<Keyword>ARMONK</Keyword>
</Facet>
<Facet>
<Path>LOCATION</Path>
<Path>STATE</Path>
<Keyword>NY</Keyword>
</Facet>
<Facet Begin="0" End="10">
<Path>tkm_base_word</Path>
<Path>noun</Path>
<Path>unk</Path>
<Keyword>IBM</Keyword>
</Facet>
...
<Facet Begin="198" End="209">
<Path>tkm_base_word</Path>
<Path>verb</Path>
<Keyword>operate</Keyword>
</Facet>
<Facet Begin="206" End="216">
<Path>tkm_base_word</Path>
<Path>conj</Path>
<Keyword>and</Keyword>
</Facet>
<Facet Begin="210" End="217">
<Path>tkm_base_word</Path>
<Path>verb</Path>
<Keyword>thrive</Keyword>
</Facet>
</Facets>
</Metadata>
</Document>
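The facets in an analyzed export can be separated into document-level facets and facets that come from annotations by testing for the Begin and End attributes. The following Python sketch shows one way to do this under the structure described in the table. The `Facet` tuple, the `read_facets` function, and the shortened sample are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional, Tuple

class Facet(NamedTuple):
    path: Tuple[str, ...]      # the Path elements, in document order
    keyword: Optional[str]     # the Keyword value, if present
    begin: Optional[int]       # set only when the facet comes from an annotation
    end: Optional[int]

# Shortened copy of the analyzed example above.
SAMPLE = """<Document Id="example" Type="NORMAL">
<Content Truncated="false"></Content>
<Metadata>
<Facets>
<Facet><Path>date</Path><Path>2009</Path><Path>7</Path><Path>28</Path><Path>16</Path></Facet>
<Facet><Path>LOCATION</Path><Path>STATE</Path><Keyword>NY</Keyword></Facet>
<Facet Begin="210" End="217"><Path>tkm_base_word</Path><Path>verb</Path><Keyword>thrive</Keyword></Facet>
</Facets>
</Metadata>
</Document>"""

def _to_int(value):
    """Convert an attribute value to int, passing through a missing attribute as None."""
    return int(value) if value is not None else None

def read_facets(xml_text):
    """Return every Facet element of one exported document as a Facet tuple."""
    root = ET.fromstring(xml_text)
    facets = []
    for element in root.findall("./Metadata/Facets/Facet"):
        path = tuple(p.text or "" for p in element.findall("Path"))
        keyword = element.findtext("Keyword")  # None when there is no Keyword
        facets.append(Facet(path, keyword,
                            _to_int(element.get("Begin")),
                            _to_int(element.get("End"))))
    return facets

facets = read_facets(SAMPLE)
annotated = [f for f in facets if f.begin is not None]
print(len(facets), len(annotated))
```

Keeping the path as a tuple preserves the hierarchy order, so a facet such as LOCATION, STATE can be joined with any separator the consuming application expects.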