Document Conversion

API Reference
The IBM Watson Document Conversion Service API Reference page

Introduction

Important: Starting on 11-03-2017, it will no longer be possible to create a new instance of the Document Conversion service on Bluemix. Existing service instances will be supported until 10-03-2018. To continue using features, you will need to migrate. Note: May not apply in select Dedicated environments.

The IBM Watson™ Document Conversion service converts a single HTML, PDF, or Microsoft Word™ document into normalized HTML, plain text, or a set of JSON-formatted Answer units that can be used with other Watson services. Carefully inspect the output to make sure that it contains all elements and metadata required by your own or your organization's security standards.

API Endpoint

https://gateway.watsonplatform.net/document-conversion/api/v1

npm

npm install watson-developer-cloud@"<=3.0.0"

Maven


<dependency>
  <groupId>com.ibm.watson.developer_cloud</groupId>
  <artifactId>java-sdk</artifactId>
  <version>4.0.0</version>
</dependency>

Gradle

compile 'com.ibm.watson.developer_cloud:java-sdk:4.0.0'

pip

pip install -I watson-developer-cloud==1.0.0

API explorer

To interact with this API, use the Document Conversion Service API explorer. Use the explorer to test your calls to the API, and to view live responses from the server.

Authentication

You authenticate to the Document Conversion API with the username and password from the service credentials of the service instance that you want to use. The API uses Basic Authentication.

After creating an instance of the Document Conversion service, select Service Credentials from the left navigation for its dashboard to see the username and password that are associated with that instance. For more information, see Service credentials for Watson services.

Replace {username} and {password} with your credentials


curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/document-conversion/api/v1/{method}"
      

var watson = require('watson-developer-cloud');

var document_conversion = watson.document_conversion({
  username:     '{username}',
  password:     '{password}',
  version:      'v1',
  version_date: '2015-12-15'
});

DocumentConversion service = new DocumentConversion("2015-12-15");
service.setUsernameAndPassword("{username}","{password}");
      
import json
from watson_developer_cloud import DocumentConversionV1

document_conversion = DocumentConversionV1(
  username='{username}',
  password='{password}',
  version='2015-12-15'
)

Data collection

By default, all Watson services log requests and their results. Logging is done only to improve the services for future users. The logged data is not shared or made public. To prevent IBM from accessing your data for general service improvements, set the X-Watson-Learning-Opt-Out header parameter to true for all requests. (Any value other than false or 0 disables request logging for that call.) You must set the header on each request that you do not want IBM to access for general service improvements.
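
As a minimal sketch of the opt-out described above, the following Python builds (but does not send) a request that carries the X-Watson-Learning-Opt-Out header together with Basic Authentication. It uses only the standard library. The helper name build_opt_out_request is hypothetical and is not part of the Watson SDK.

```python
import base64
import urllib.request

API_ENDPOINT = "https://gateway.watsonplatform.net/document-conversion/api/v1"


def build_opt_out_request(method, username, password, version="2015-12-15"):
    """Build a POST request that opts out of request logging (hypothetical helper)."""
    url = "{}/{}?version={}".format(API_ENDPOINT, method, version)
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    request = urllib.request.Request(url, method="POST")
    request.add_header("Authorization", "Basic " + token)
    # The header must be present on every request you want excluded.
    request.add_header("X-Watson-Learning-Opt-Out", "true")
    return request


request = build_opt_out_request("convert_document", "{username}", "{password}")
# urllib changes the capitalization of header names; HTTP treats them as case-insensitive.
print(request.full_url)
print(dict(request.header_items()))
```

Because the header applies per request, a wrapper like this is one way to avoid forgetting it on individual calls.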

Response handling

The Document Conversion service uses standard HTTP response codes to indicate whether a method completed successfully. A 200 response indicates success. A 400-level response indicates that the request failed, and a 500-level response usually indicates an internal system error. Response codes are listed with the individual calls.

Equivalent exceptions are listed with the individual methods. The exceptions typically include the error message and HTTP status code returned by the service. All methods can throw the following common exceptions.


Error format

Name Description
code integer HTTP status code.
error string Error description.

Common exceptions thrown

Exception Description
IllegalArgumentException An illegal argument was passed or an argument was missing for a method that accepts one or more arguments. Also thrown when media type is not supported.
UnsupportedException The request specified an unacceptable media type. (HTTP response code 415.)
UnauthorizedException Access is denied due to invalid credentials. (HTTP response code 401.)

A WatsonException carries the error message from the API response.

Example error


{
  "code" : 400,
  "error" : "The request does not contain the required \"config\" part. Include a \"config\" part and resubmit your request."
}
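
The error format above can be read with a few lines of Python. This is a sketch only: the ServiceError class and parse_error function are hypothetical names for illustration and do not come from the Watson SDK.

```python
import json


class ServiceError(Exception):
    """Hypothetical exception that mirrors the documented error format."""

    def __init__(self, code, error):
        super().__init__("{}: {}".format(code, error))
        self.code = code    # integer HTTP status code
        self.error = error  # error description string


def parse_error(body):
    """Turn an error response body (JSON text) into a ServiceError."""
    payload = json.loads(body)
    return ServiceError(int(payload["code"]), str(payload["error"]))


example_body = '{"code": 400, "error": "The request does not contain the required \\"config\\" part."}'
err = parse_error(example_body)
print(err.code, err.error)
```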
        

Catching an error


try {
  // Your code goes here
} catch (IllegalArgumentException e) {
  // Missing or invalid parameter
} catch (BadRequestException e) {
  // Missing or invalid parameter
} catch (UnauthorizedException e) {
  // Access is denied due to invalid credentials
} catch (InternalServerErrorException e) {
  // Internal Server Error
}
          

var watson = require('watson-developer-cloud');

var document_conversion = watson.document_conversion({
  username:     '{username}',
  password:     '{password}',
  version:      'v1',
  version_date: '2015-12-15'
});

// This code will throw an error because "file" is null
document_conversion.convert({
  file: null
}, function(err, response) {
  // The error will be the first argument of the callback
  if (err) {
    console.log(err);
  }
});
          

from watson_developer_cloud import DocumentConversionV1, WatsonException

document_conversion = DocumentConversionV1(
  username='{username}',
  password='{password}',
  version='2015-12-15'
)

try:
  # Your code goes here
  pass
except WatsonException as e:
  print(e)
        

Versioning

API requests require a version parameter that takes a date in the format version=YYYY-MM-DD. Send the version parameter with every API request.

When we change the API in a backwards-incompatible way, we release a new minor version. To take advantage of the changes in a new version, change the value of the version parameter to the new date. If you're not ready to update to that version, don’t change your version date. For details about each version, see the Release notes.

The current version is 2015-12-15.
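
As a small illustration of the version parameter, the following sketch appends a validated date to a method URL. The function name versioned_url is hypothetical; the endpoint and the 2015-12-15 date come from this page.

```python
from datetime import datetime
from urllib.parse import urlencode

API_ENDPOINT = "https://gateway.watsonplatform.net/document-conversion/api/v1"


def versioned_url(method, version="2015-12-15"):
    """Return the URL for a method with a validated version query parameter."""
    # strptime raises ValueError when the text is not a real year-month-day date.
    parsed = datetime.strptime(version, "%Y-%m-%d")
    # Re-format so the value is always zero-padded YYYY-MM-DD.
    query = urlencode({"version": parsed.strftime("%Y-%m-%d")})
    return "{}/{}?{}".format(API_ENDPOINT, method, query)


print(versioned_url("convert_document"))
```

Keeping the date in one place makes it easier to move to a newer version deliberately rather than by accident.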

Methods

Convert a document

Converts a document to answer units, HTML or text. This method accepts a multipart/form-data request. Upload the document as the "file" form part and the configuration as the "config" form part. The service call identifies the output type.

POST /v1/convert_document
convert(params, callback())

convertDocumentToAnswer(File document)
convertDocumentToAnswer(File document, String mediaType)
convertDocumentToAnswer(File document, String mediaType, JsonObject config)
convertDocumentToHTML(File document)
convertDocumentToHTML(File document, String mediaType)
convertDocumentToHTML(File document, String mediaType, JsonObject config)
convertDocumentToText(File document)
convertDocumentToText(File document, String mediaType)
convertDocumentToText(File document, String mediaType, JsonObject config)

convert_document(document, config, media_type=None)
  

Request

Parameter Description
version query string The release date of the version of the API you want to use. Specify dates in YYYY-MM-DD format.
file form data file The file to convert. Maximum file size is 50 MB. The API detects the type, but you can specify it if incorrect. Acceptable MIME type values are text/html, text/xhtml+xml, application/pdf, application/msword, and application/vnd.openxmlformats-officedocument.wordprocessingml.document.
document file The file to convert. Maximum file size is 50 MB. The API detects the MIME type, but you can specify it if incorrect. Acceptable MIME type values are text/html, text/xhtml+xml, application/pdf, application/msword, and application/vnd.openxmlformats-officedocument.wordprocessingml.document.
conversion_target string Controls the output format of the conversion. Valid values are ANSWER_UNITS, NORMALIZED_HTML, and NORMALIZED_TEXT. For more information about conversion_target, see Advanced customization options and search for "conversion_target".
config form data object A config part that identifies the output type. You can optionally include information to define tags and structure in the conversion output. Maximum size of the part is 1 MB.
config object A config object that defines tags and structure in the conversion output. Maximum size of the object is 1 MB.
config JsonObject A config object that defines tags and structure in the conversion output. Maximum size of the object is 1 MB.
Config
Parameter Description
conversion_target string Controls the output format of the conversion. Valid values are answer_units, normalized_html, and normalized_text; the SDK examples use the uppercase forms ANSWER_UNITS, NORMALIZED_HTML, and NORMALIZED_TEXT. For more information about conversion_target, see Advanced customization options and search for "conversion_target".
{input configuration} object An object that defines how to output the tags and structure of the input document. For more information about the input configurations, see Advanced customization options and search for "input configurations".
normalized_html object An object that defines the content that is included and excluded during the HTML normalization phase. All documents go through this phase. For more information about the normalized_html configurations, see Custom configuration information and search for "HTML normalization configurations".
answer_units object An object that defines the heading level when the conversion_target is answer units. By default, h1 and h2 headings with their content are split into answer units.
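
The limits in the request tables above can be checked on the client before uploading. The sketch below is one way to do that under stated assumptions: it treats 1 MB as 2**20 bytes, treats conversion_target as required, and accepts either letter case for it. The helper name check_convert_request is hypothetical.

```python
import json

MAX_FILE_BYTES = 50 * 1024 * 1024   # documented 50 MB limit (assumes MB = 2**20 bytes)
MAX_CONFIG_BYTES = 1 * 1024 * 1024  # documented 1 MB limit for the config part
ACCEPTED_MEDIA_TYPES = {
    "text/html",
    "text/xhtml+xml",
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}
CONVERSION_TARGETS = {"answer_units", "normalized_html", "normalized_text"}


def check_convert_request(file_size, config, media_type=None):
    """Return the serialized config part, or raise ValueError (hypothetical helper)."""
    if file_size > MAX_FILE_BYTES:
        raise ValueError("file part exceeds 50 MB")
    # media_type is optional because the API can detect it.
    if media_type is not None and media_type not in ACCEPTED_MEDIA_TYPES:
        raise ValueError("unsupported media type: " + media_type)
    target = str(config.get("conversion_target", "")).lower()
    if target not in CONVERSION_TARGETS:
        raise ValueError("invalid conversion_target: " + repr(target))
    config_part = json.dumps(config)
    if len(config_part.encode("utf-8")) > MAX_CONFIG_BYTES:
        raise ValueError("config part exceeds 1 MB")
    return config_part


config_part = check_convert_request(
    file_size=120000,
    config={"conversion_target": "answer_units"},
    media_type="application/pdf",
)
print(config_part)
```

Failing early in this way avoids a round trip that would otherwise end in a 400, 413, or 415 response.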

Example request


curl -X POST -u "{username}":"{password}" -F config="{\"conversion_target\":\"answer_units\"}" -F "file=@sample.pdf" "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"

Example with media type specified


curl -X POST -u "{username}":"{password}" -F config="{\"conversion_target\":\"answer_units\"}" -F "file=@sample.pdf;type=application/pdf" "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"

Example request


var watson = require('watson-developer-cloud');
var fs = require('fs');

var document_conversion = watson.document_conversion({
  username:     '{username}',
  password:     '{password}',
  version:      'v1',
  version_date: '2015-12-15'
});

// custom configuration
var config = {
  word: {
    heading: {
      fonts: [
        { level: 1, min_size: 24 },
        { level: 2, min_size: 16, max_size: 24 }
      ]
    }
  }
};

document_conversion.convert({
  file: fs.createReadStream('sample-docx.docx'),
  conversion_target: 'ANSWER_UNITS',
  // Use a custom configuration.
  config: config
}, function (err, response) {
  if (err) {
    console.error(err);
  } else {
    console.log(JSON.stringify(response, null, 2));
  }
});

DocumentConversion service = new DocumentConversion("2015-12-15");
service.setUsernameAndPassword("{username}", "{password}");

File doc = new File("sample-docx.docx");

// Use a custom configuration.
String configAsString = "{"
  + "\"word\":{"
  + "\"heading\":{"
  + "\"fonts\":["
  + "{\"level\":1,\"min_size\":24},"
  + "{\"level\":2,\"min_size\":16,\"max_size\":24}"
  + "]}}}";

JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();

Answers response = service.convertDocumentToAnswer(doc, null, customConfig)
  .execute();

System.out.println(response);

import json
from watson_developer_cloud import DocumentConversionV1

document_conversion = DocumentConversionV1(
  username='{username}',
  password='{password}',
  version='2015-12-15'
)

config = {
  'conversion_target': 'ANSWER_UNITS',
  # Use a custom configuration.
  'word': {
    'heading': {
      'fonts': [
        {'level': 1, 'min_size': 24},
        {'level': 2, 'min_size': 16, 'max_size': 24}
      ]
    }
  }
}

with open('sample-docx.docx', 'rb') as document:
  response = document_conversion.convert_document(document=document, config=config)
  print(json.dumps(response, indent=2))

Response

The response depends on the value of output type in the request:

  • When the conversion_target is normalized_text (Java: the service call is convertDocumentToText), returns the converted document as plain text (MIME type text/plain)
  • When the conversion_target is normalized_html (Java: the service call is convertDocumentToHTML), returns the converted document as an .html file (MIME type application/octet-stream)
  • When the conversion_target is answer_units (Java: the service call is convertDocumentToAnswer), returns the following objects:
convertDocumentPayload
Name Description
source_document_id string The unique identifier of the input file.
timestamp date-time Date and time in Coordinated Universal Time that the file was converted.
media_type_detected string The detected MIME type of the input file. Returns an empty value when a file is not included in the request.
metadata object[ ] An array of Metadata objects extracted from the file. Returns an empty array if no metadata is extracted.

The conversion extracts title, author, and date created metadata from PDF and Microsoft Word documents. For HTML documents, the conversion extracts the values for name and content attributes in tags.

answer_units object[ ] An array of AnswerUnit objects that lists information about the input file. You can specify what's included or excluded from the document in the config part.
warnings object[ ] An array of Warning objects that lists messages returned during conversion of the file. Returns an empty array if no warnings are returned.
Metadata
Name Description
name string The name of the metadata extracted from the file.
content string The value of the metadata.
AnswerUnit
Name Description
id string A unique identifier of the document.
type string The classification of the document. Always body.
parent_id string The reference to the parent answer unit, if any.
title string The heading of the document.
direction string The direction to render text: ltr for text in languages rendered left-to-right, and rtl for right-to-left.
content object[ ] An array of Content objects that lists the MIME type and text of the document.
Content
Name Description
media_type string The detected MIME type of the document.
text string The raw or marked-up content of the document.
Warning
Name Description
phase string The step in the conversion process when the warning was produced: answer_units, normalized_html, pdf, or word.
warning_id string The identifier for the warning.
message string The message.

Response codes

Status Description
200 OK The document was successfully converted.
400 Bad Request The input document failed to convert, or your request was malformed.
413 Payload Too Large The file or config parts exceeded the maximum allowed size.
415 Unsupported Media Type The media type of the input file is not supported. Specify the MIME type of the document if auto-detection was not correct.
503 Service Unavailable The service is busy processing other requests. Resubmit the request to try again based on the value of the Retry-After header.
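
The 503 row above suggests resubmitting according to the Retry-After header. A minimal sketch of that decision follows. It assumes the header carries a number of seconds and falls back to a hypothetical default_delay when the header is missing or is not a number (for example, an HTTP date).

```python
def retry_delay(status, headers, default_delay=5.0):
    """Return seconds to wait before resubmitting, or None if no retry applies."""
    if status != 503:
        return None
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing or not a number of seconds.
        return default_delay


print(retry_delay(503, {"Retry-After": "30"}))
print(retry_delay(200, {}))
```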

Example response


{
  "source_document_id": "",
  "timestamp": "2015-10-12T20:16:15.535Z",
  "media_type_detected": "application/pdf",
  "metadata": [{
    "name": "publicationdate",
    "content": "2015-07-18"
  }],
  "answer_units": [{
    "id": "de93c979-414b-4967-afd5-21eafeaedf69",
    "type": "regular",
    "title": "Title from your document 1",
    "content": [{
      "media_type": "text/plain",
      "text": "Text from your document 2"
    }]
  }, {
    "id": "f3702667-9133-4e9d-a639-fbfc70822b9c",
    "type": "regular",
    "title": "Title from your document 3",
    "content" :[{
      "media_type": "text/plain",
      "text": ""
    }]
  }],
  "warnings": []
}
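
To show how the answer unit structure can be consumed, the sketch below walks a trimmed copy of the example response and collects the text/plain content of each unit. The function name plain_text_units is hypothetical.

```python
import json


def plain_text_units(payload):
    """Yield (title, text) pairs for the text/plain content of each answer unit."""
    for unit in payload.get("answer_units", []):
        for content in unit.get("content", []):
            if content.get("media_type") == "text/plain":
                yield unit.get("title", ""), content.get("text", "")


# Trimmed copy of the example response above.
response_text = """
{
  "source_document_id": "",
  "media_type_detected": "application/pdf",
  "answer_units": [
    {"id": "de93c979-414b-4967-afd5-21eafeaedf69",
     "title": "Title from your document 1",
     "content": [{"media_type": "text/plain", "text": "Text from your document 2"}]}
  ],
  "warnings": []
}
"""
units = list(plain_text_units(json.loads(response_text)))
for title, text in units:
    print(title, "->", text)
```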

Index a document

Prepares a document for the Retrieve and Rank service as part of an Enhanced Information Retrieval solution, then adds the content to your Solr index so you can search it. For details about how to implement the solution, see Building an enhanced information retrieval solution.

POST /v1/index_document
index(params, callback())
ServiceCall<String> indexDocument(IndexDocumentOptions indexDocumentOptions)
index_document(config, document=None, metadata=None, media_type=None)

Request

Parameter Description
version query string The release date of the version of the API you want to use. Specify dates in YYYY-MM-DD format.
config form data object A config part that identifies your target Retrieve and Rank service. You can optionally include information about fields in your document and configuration information about how to process the document. Maximum size of the part is 1 MB.
config object A config object that identifies your target Retrieve and Rank service. You can optionally include information about fields in your document and configuration information about how to process the document. Maximum size of the part is 1 MB.
file form data file, stream The file to index. Required if the metadata object is not included. Maximum file size is 50 MB. The API detects the MIME type, but you can specify it if incorrect. Acceptable MIME type values are text/html, text/xhtml+xml, application/pdf, application/msword, and application/vnd.openxmlformats-officedocument.wordprocessingml.document.
document file The file to index. Required if the metadata object is not included. Maximum file size is 50 MB. The API detects the MIME type, but you can specify it if incorrect. Acceptable MIME type values are text/html, text/xhtml+xml, application/pdf, application/msword, and application/vnd.openxmlformats-officedocument.wordprocessingml.document.
metadata form data object

A metadata part that describes the external metadata of the file. Required if the file is not included. You might call this method without a file part when there is no document content to index (for example, with a database connector). Maximum size of the part is 1 MB.

This metadata is indexed with the document in the Retrieve and Rank search collection. The metadata that you specify here overrides values that are extracted from the document. Use the fields object to resolve indexing errors with metadata.

config
Parameter Description
convert_document object An object that allows you to define the structure in the conversion output. Because you are specifying how to add a document to SOLR in the Retrieve and Rank service, some conversion configuration settings are ignored. For example, the value for conversion_target or for selector_tags in the answer_units object is ignored. For details about conversion settings, see Custom configuration information and search for "HTML normalization configurations".
retrieve_and_rank object A retrieve_and_rank object that identifies how to connect the document conversion service to your Retrieve and Rank service instance.
Parameter Description
indexDocumentOptions IndexDocumentOptions The IndexDocumentOptions object that holds the options for indexing the document. Also see the Javadoc information.
IndexDocumentOptions
Parameter Description
convertDocumentConfig JsonObject An object that allows you to define the structure in the conversion output. Because you are specifying how to add a document to SOLR in the Retrieve and Rank service, some conversion configuration settings are ignored. For example, the value for conversion_target or for selector_tags in the answer_units object is ignored. For details about conversion settings, see Custom configuration information and search for "HTML normalization configurations".
document file The file to index. Required if the metadata object is not included. Maximum file size is 50 MB. The API detects the MIME type, but you can specify it if incorrect. Acceptable MIME type values are text/html, text/xhtml+xml, application/pdf, application/msword, and application/vnd.openxmlformats-officedocument.wordprocessingml.document.
dryRun boolean The dryRun property defaults to false. Set it to true to test how your document is indexed. When set to true, serviceInstanceId, clusterId, and searchCollectionName are optional.
IndexConfiguration IndexConfiguration The IndexConfiguration object that identifies how to connect the document conversion service to your Retrieve and Rank service instance. For details, see the Javadoc information.
metadata Map<String, String>

The metadata object that describes the external metadata of the file. Required if the document is not included. You might call this method without a file part when there is no document content to index (for example, with a database connector). Maximum size of the part is 1 MB.

This metadata is indexed with the document in the Retrieve and Rank search collection. The metadata that you specify here overrides values that are extracted from the document. Use the fields object to resolve indexing errors with metadata.

IndexConfiguration
Parameter Description
serviceInstanceId string The identifier of your Retrieve and Rank service. Required unless dry_run is set to true. To find your service_instance_id, click the tile for your service in Bluemix, and then look at the URL in the browser for the serviceGuid= request parameter. The value for service_instance_id is the value for serviceGuid.
clusterId string Matches the value of solr_cluster_id in your Retrieve and Rank service. Required unless dry_run is set to true.
searchCollectionName string Matches the value of collection_name in your Retrieve and Rank service. Required unless dry_run is set to true.
fields IndexFields

A fields object that specifies how to connect metadata in your documents to fields in SOLR. To include all fields, don't specify an include or exclude object. When both objects are provided, only fields that appear in include and not in exclude are indexed. For details, see the Javadoc information.

You can use the fields object to resolve indexing errors related to the SOLR schema. For example, the request will fail when you try to index a field that isn't defined in the SOLR schema. Use the fields object to exclude or rename the problem metadata field.

IndexFields
Parameter Description
mappings object[ ] An array of objects to specify how to connect metadata fields in the file to fields in SOLR. Use the syntax "fields":{"mappings":[{"from":"field_in_doc","to":"field_in_SOLR"}]}.
include string[ ] An array of fields in the file to include in Retrieve and Rank. To specify the allowed fields, provide only the include object. When you provide an include object, fields that are not included are excluded. Use the syntax "fields":{"include":["field3_in_SOLR"]}.
exclude string[ ] An array of fields to exclude from Retrieve and Rank. To exclude a few fields and allow all others, provide only the exclude object. Fields that are not excluded are allowed. Follows the syntax "fields":{"exclude":["field1_in_SOLR","field2_in_SOLR"]}.
retrieve_and_rank
Parameter Description
dry_run boolean The dry_run property defaults to false. Set it to true to test how your document is indexed. When set to true, service_instance_id, cluster_id, and search_collection are optional.
service_instance_id string The identifier of your Retrieve and Rank service. Required unless dry_run is set to true. To find your service_instance_id, click the tile for your service in Bluemix, and then look at the URL in the browser for the serviceGuid= request parameter. The value for service_instance_id is the value for serviceGuid.
cluster_id string Matches the value of solr_cluster_id in your Retrieve and Rank service. Required unless dry_run is set to true.
search_collection string Matches the value of collection_name in your Retrieve and Rank service. Required unless dry_run is set to true.
fields object[ ]

A fields object that specifies how to connect metadata in your documents to fields in SOLR. To include all fields, don't specify an include or exclude object. When both objects are provided, only fields that appear in include and not in exclude are indexed. For details, see the Javadoc information.

You can use the fields object to resolve indexing errors related to the SOLR schema. For example, the request will fail when you try to index a field that isn't defined in the SOLR schema. Use the fields object to exclude or rename the problem metadata field.

fields
Parameter Description
mappings object[ ] An array of objects to specify how to connect metadata fields in the file to fields in SOLR. Use the syntax "fields":{"mappings":[{"from":"field_in_doc","to":"field_in_SOLR"}]}.
include string[ ] An array of fields in the file to include in Retrieve and Rank. To specify the allowed fields, provide only the include object. When you provide an include object, fields that are not included are excluded. Use the syntax "fields":{"include":["field3_in_SOLR"]}.
exclude string[ ] An array of fields to exclude from Retrieve and Rank. To exclude a few fields and allow all others, provide only the exclude object. Fields that are not excluded are allowed. Follows the syntax "fields":{"exclude":["field1_in_SOLR","field2_in_SOLR"]}.
metadata
Parameter Description
name string The label for the external metadata to be indexed with the file.
value string The value of the metadata.
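
The mapping, include, and exclude rules for the fields object can be illustrated with a short local sketch. It is not the service's implementation. It assumes that include and exclude are matched against the SOLR field names after mappings are applied, as the "field_in_SOLR" syntax examples suggest. The function name solr_fields and the sample metadata values are made up for illustration.

```python
def solr_fields(metadata, fields=None):
    """Apply mappings, then include/exclude, to a dict of metadata values."""
    fields = fields or {}
    renames = {m["from"]: m["to"] for m in fields.get("mappings", [])}
    include = fields.get("include")          # None means "allow everything"
    exclude = set(fields.get("exclude", []))

    result = {}
    for name, value in metadata.items():
        target = renames.get(name, name)
        if include is not None and target not in include:
            continue  # an include list excludes everything not listed
        if target in exclude:
            continue
        result[target] = value
    return result


fields = {
    "mappings": [
        {"from": "Author", "to": "Created By"},
        {"from": "Date Created", "to": "Created On"},
        {"from": "Continent", "to": "Region"},
    ],
    "include": ["Created By", "Created On"],
    "exclude": ["Region"],
}
sample = {"Author": "Some person", "Date Created": "2015-07-18",
          "Continent": "Europe", "Subject": "APIs"}
indexed = solr_fields(sample, fields)
print(indexed)
```

With both lists present, only fields that appear in include and not in exclude survive, which matches the rule stated in the fields description.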

Example request (dry_run is false)


curl -X POST -u "{username}":"{password}" -F "config=@config.json" -F "file=@example.html" -F "metadata=@metadata.json" "https://gateway.watsonplatform.net/document-conversion/api/v1/index_document?version=2015-12-15"

Example config part


{
  "convert_document": {
    "normalized_html": {
      "exclude_tags_completely":["script", "sup"]
    }
  },
  "retrieve_and_rank": {
    "dry_run": false,
    "service_instance_id": "692b4b66-bd13-42e6-9cf3-f7e77f8200e5",
    "cluster_id": "sc1ca23733_faa8_49ce_b3b6_dc3e193264c6",
    "search_collection": "example_collection",
    "fields": {
      "mappings": [{
        "from": "Author",
        "to": "Created By"
      }, {
        "from": "Date Created",
        "to": "Created On"
      }, {
        "from": "Continent",
        "to": "Region"
      }],
      "include": [
        "Created By",
        "Created On"
      ],
      "exclude": [
        "Region"
      ]
    }
  }
}

Example metadata part

{
"metadata":[{
  "name": "Creator",
  "value": "Some person"
}, {
  "name": "Subject",
  "value": "Application programming interfaces"
}]
}
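
The dry_run rules for the retrieve_and_rank object can be enforced when the config part is assembled. The following sketch uses a hypothetical helper, build_index_config, that requires the three connection values unless dry_run is true. The key names are the ones documented for the config part.

```python
import json


def build_index_config(dry_run=False, service_instance_id=None,
                       cluster_id=None, search_collection=None, fields=None):
    """Return the config part as a dict, checking the documented requirements."""
    retrieve_and_rank = {"dry_run": bool(dry_run)}
    connection = {
        "service_instance_id": service_instance_id,
        "cluster_id": cluster_id,
        "search_collection": search_collection,
    }
    missing = [name for name, value in connection.items() if not value]
    if missing and not dry_run:
        raise ValueError("required when dry_run is false: " + ", ".join(missing))
    # Only send the connection values that were actually provided.
    retrieve_and_rank.update({k: v for k, v in connection.items() if v})
    if fields:
        retrieve_and_rank["fields"] = fields
    return {"retrieve_and_rank": retrieve_and_rank}


dry_run_config = build_index_config(dry_run=True)
print(json.dumps(dry_run_config))
```

Serializing the result with json.dumps emits dry_run as a JSON boolean rather than a string.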

var watson = require('watson-developer-cloud');
var fs = require('fs');

var document_conversion = watson.document_conversion({
  username:     '{username}',
  password:     '{password}',
  version:      'v1',
  version_date: '2015-12-15'
});

var config = {
  convert_document: {
    normalized_html: {
      exclude_tags_completely: ['script', 'sup']
    }
  },
  retrieve_and_rank: {
    dry_run: false,
    service_instance_id: '692b4b66-bd13-42e6-9cf3-f7e77f8200e5',
    cluster_id: 'sc1ca23733_faa8_49ce_b3b6_dc3e193264c6',
    search_collection: 'example_collection',
    fields: {
      mappings: [{
        from: 'Author',
        to: 'Created By'
      }, {
        from: 'Date Created',
        to: 'Created On'
      }, {
        from: 'Continent',
        to: 'Region'
      }],
      include: ['Created By', 'Created On'],
      exclude: ['Region']
    }
  }
};

var metadata = {
  metadata: [
    { name: 'Creator', value: 'Some person' },
    { name: 'Subject', value: 'Application programming interfaces' }
  ]
};

document_conversion.index({
  file: fs.createReadStream('sample-docx.docx'),
  config: config,
  metadata: metadata
}, function (err, response) {
    if (err) {
        console.error(err);
    } else {
        console.log(JSON.stringify(response, null, 2));
    }
});

DocumentConversion service = new DocumentConversion("2015-12-15");
service.setUsernameAndPassword("{username}","{password}");


// Create an index configuration with the fields object
// (field mappings, fields to include, fields to exclude)
IndexFields fields = new IndexFields.Builder()
  .mappings("Author", "Created By")
  .mappings("Date Created", "Created on")
  .mappings("Continent", "Region")
  .include("Created By")
  .include("Created on")
  .exclude("Region")
  .build();

IndexConfiguration indexConfiguration = new IndexConfiguration(
  "692b4b66-bd13-42e6-9cf3-f7e77f8200e5",
  "sc1ca23733_faa8_49ce_b3b6_dc3e193264c6",
  "example_collection",
  fields);

// Metadata to index
final Map<String, String> metadata = new HashMap<String, String>();
metadata.put("Creator", "Some person");
metadata.put("Subject", "Application programming interfaces");

// Exclude all <script> and <sup> tags from the input file
String jsonConfig = "{\"normalized_html\" : { \"exclude_tags_completely\":[\"script\", \"sup\"] } }";
JsonParser jsonParser = new JsonParser();
JsonObject convertDocumentConfig = jsonParser.parse(jsonConfig).getAsJsonObject();


IndexDocumentOptions indexDocumentOptions = new IndexDocumentOptions.Builder()
  .document(new File("example.doc"))
  .indexConfiguration(indexConfiguration)
  .convertDocumentConfig(convertDocumentConfig)
  .metadata(metadata)
  .build();

String response = service.indexDocument(indexDocumentOptions).execute();
System.out.println(response);

import json
from watson_developer_cloud import DocumentConversionV1

document_conversion = DocumentConversionV1(
  username='{username}',
  password='{password}',
  version='2015-12-15'
)

config = {
  'convert_document': {
    'normalized_html': {
      'exclude_tags_completely': ['script', 'sup']
    }
  },
  'retrieve_and_rank': {
    'dry_run': False,
    'service_instance_id': '692b4b66-bd13-42e6-9cf3-f7e77f8200e5',
    'cluster_id': 'sc1ca23733_faa8_49ce_b3b6_dc3e193264c6',
    'search_collection': 'example_collection',
    'fields': {
      'mappings': [{
        'from': 'Author',
        'to': 'Created By'
      }, {
        'from': 'Date Created',
        'to': 'Created On'
      }, {
        'from': 'Continent',
        'to': 'Region'
      }],
      'include': ['Created By', 'Created On'],
      'exclude': ['Region']
    }
  }
}
metadata = {
  'metadata': [
    {'name': 'Creator', 'value': 'Some person'},
    {'name': 'Subject', 'value': 'Application programming interfaces'}
  ]
}

with open('sample-docx.docx', 'rb') as document:
  response = document_conversion.index_document(config=config, document=document, metadata=metadata)
  print(json.dumps(response, indent=2))

Response

indexDocumentPayload
Name Description
converted_document object A converted_document object that contains information about the file after conversion. Returned only with requests that specify dry_run=true in the config part.
solr_document object A SolrDocument object that lists how the document will be indexed in Retrieve and Rank. Returned only with requests that specify dry_run=true in the config part.
status string

Indicates the successful completion of the indexing request: success. Returned only with requests that specify the default dry_run=false in the config part.

warnings object[ ]

An array of warning objects that lists messages returned during conversion of the file. Returns an empty array if no warnings are returned.

When dry_run=true, the warning object is returned within the converted_document object.

converted_document
Name Description
media_type_detected string The detected or specified MIME type of the input file. Returns an empty value when a file is not included in the request.
metadata object[ ] An array of Metadata objects extracted from the file. Returns an empty array if no metadata is extracted.

The conversion extracts title, author, and date created metadata from PDF and Microsoft Word documents. For HTML documents, the conversion extracts the values for name and content attributes in tags.

answer_units object[ ] An array that contains a single answer_unit object that lists information about the input file. You can specify what's included or excluded from the document in the convert_document section of the config part.
Metadata
Name Description
name string The name of the metadata extracted from the file.
content string The value of the metadata.
answer_unit
Name Description
id string A unique identifier of the document.
type string The classification of the document. Always body.
title string The heading of the document.
direction string Indicates the direction to render text: ltr for text in languages rendered left-to-right, and rtl for right-to-left.
content object[ ] An array of content objects that lists the MIME type and text of the document.
content
Name Description
media_type string The detected MIME type of the document.
text string The raw or marked-up content of the document.
warnings
Name Description
phase string The step in the conversion process when the warning was produced: convert_document or retrieve_and_rank.
warning_id string The identifier for the warning.
description string The message.
SolrDocument
Name Description
{metadata} object[ ] A list of metadata that is extracted from the document and specified in the metadata part of the request. Metadata that you specify in the request overrides the extracted values.
body string The raw content of the document.
contentHtml string The marked-up content of the document.
title string The heading of the document.

Response codes

Status Description
200 OK The document was successfully indexed.
400 Bad Request The input document failed to index, or your request was malformed.
413 Payload Too Large The file, config, or metadata parts exceeded the maximum allowed size.
415 Unsupported Media Type The media type of the input file is not supported. Specify the MIME type of the document if auto-detection was not correct.
503 Service Unavailable The service is busy processing other requests. Resubmit the request to try again based on the value of the Retry-After header.

Example response (dry_run was true)


{
  "converted_document": {
    "media_type_detected": "text/html",
    "metadata": [{
      "name": "publicationdate",
      "content": "2015-07-18"
    }],
    "answer_units": [{
      "id": "de93c979-414b-4967-afd5-21eafeaedf69",
      "type": "body",
      "title": "no-title",
      "direction": "ltr",
      "content": [{
        "media_type": "text/html",
        "text": "<h3><p>What is Watson?</p></h3><p>Watson is an artificially intelligent computer system capable of answering questions</p>"
      },
      {
      "media_type": "text/plain",
      "text": "What is Watson? Watson is an artificially intelligent computer system capable of answering questions"
      }]
    }],
    "warnings": []
  },
  "solr_document": {
    "Created By": "Some person",
    "Subject": "Application programming interfaces",
    "body": "What is Watson? Watson is an artificially intelligent computer system capable of answering questions",
    "contentHtml": "<h3><p>What is Watson?</p></h3><p>Watson is an artificially intelligent computer system capable of answering questions</p>",
    "publicationdate": "2015-12-04",
    "title": "no-title"
  }
}

Example response (dry_run was false)


{
  "status": "success",
  "warnings": [{
    "phase": "normalized_html",
    "warning_id": "xpath_not_found",
    "description": "Could not find XPath '//body/div[@id=content]'"
  }]
}
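
Because the index response has two shapes, client code needs to look for warnings in two places. The sketch below shows one way to handle both. The function names index_warnings and was_indexed are hypothetical, and the sample payloads are trimmed from the example responses above.

```python
def index_warnings(payload):
    """Return the warnings list from either shape of the index response."""
    converted = payload.get("converted_document")
    if converted is not None:
        # dry_run=true: warnings are nested inside converted_document.
        return converted.get("warnings", [])
    # dry_run=false: warnings sit next to the status field.
    return payload.get("warnings", [])


def was_indexed(payload):
    """True only when the service reports a completed (non-dry-run) indexing."""
    return payload.get("status") == "success"


dry_run_response = {
    "converted_document": {"media_type_detected": "text/html", "warnings": []},
    "solr_document": {"title": "no-title"},
}
indexed_response = {
    "status": "success",
    "warnings": [{"phase": "normalized_html", "warning_id": "xpath_not_found",
                  "description": "Could not find XPath '//body/div[@id=content]'"}],
}

for response in (dry_run_response, indexed_response):
    print(was_indexed(response), [w["warning_id"] for w in index_warnings(response)])
```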