IBM Watson™ Discovery makes it possible to rapidly build cognitive, cloud-based exploration applications that unlock actionable insights hidden in unstructured data.
This blog post describes a scenario in which a user encounters the error “An unexpected error occurred while processing your document” when uploading a document to the Watson Discovery service, and explains how to resolve the issue.
Problem description
When using IBM Watson™ Discovery to upload a few JSON documents, a user receives the error message shown in the screenshot below:
Further investigation shows that errors occurred during the document conversion phase:
Error: Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value at [Source: (org.apache.commons.io.input.CloseShieldInputStream); line: 11, column: 2995]
Cause of the problem
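As the error message indicates, a string value in the JSON document contains an illegal unescaped control character. Character code 9 is the horizontal tab, and the JSON standard requires control characters such as tabs and newlines to be escaped (for example, as \t) when they appear inside a string value. The raw tab at line 11, column 2995 of the document therefore causes the conversion error.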
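To make the cause concrete, here is a minimal sketch using Python's standard json module (an assumption for illustration; Watson Discovery itself reports the error from a Java parser). It shows that a JSON serializer escapes the tab automatically, while text assembled by hand with the raw tab is rejected. The `record` value is made up for the example.

```python
import json

# Illustrative data: the Python string holds a real tab character (code 9).
record = {"text": "col1\tcol2"}

# A JSON serializer escapes the tab, so the output is valid JSON.
serialized = json.dumps(record)
print(serialized)

# Text assembled by hand embeds the raw tab inside the string value,
# which is not valid JSON, so a strict parser rejects it.
hand_built = '{"text": "' + record["text"] + '"}'
try:
    json.loads(hand_built)
    rejected = False
except json.JSONDecodeError:
    rejected = True
print("raw tab rejected:", rejected)
```

Documents produced by string concatenation or by exporting from another tool are a common source of this kind of unescaped character.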
Solution
To resolve the error, the JSON documents need to be updated to conform to the JSON standard. The user may use a free online JSON validator to verify the content of a JSON document. For example, this website can be used for this purpose at the time of writing this blog.
Sample output for the problematic JSON document from the above online validator:
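If the documents cannot be pasted into an online tool, the same check can be done locally. The following is a minimal sketch, assuming Python's standard json module; the helper name `find_json_error` is hypothetical. It reports the line and column of the first problem, similar to the position given in the conversion error above.

```python
import json


def find_json_error(text):
    """Return None if text is valid JSON, else (line, column, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions of the first problem found.
        return err.lineno, err.colno, err.msg
    return None


# Illustrative input: the \t below is a real tab inside the JSON string value.
problem = find_json_error('{"text": "col1\tcol2"}')
print(problem)
```

To check a file, read it first, for example `find_json_error(open(path, encoding="utf-8").read())`, where `path` is the location of the JSON document.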
The user may take advantage of existing Perl/Python scripts to find and replace the illegal characters in the JSON document. (One useful post from an online forum.)
The script:
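As an illustration only (this is not the script from the forum post mentioned above), a minimal Python sketch of the find/replace step could look like the following. The function name `escape_control_chars` is hypothetical. It escapes raw control characters only when they appear inside a string value, because tabs and newlines between JSON tokens are legal whitespace and should be left alone.

```python
import json

# Short escape sequences defined by the JSON standard.
_SHORT_ESCAPES = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_control_chars(text):
    """Escape raw control characters (code < 0x20) found inside JSON strings."""
    out = []
    in_string = False
    after_backslash = False  # True when the previous char was an unpaired backslash
    for ch in text:
        if in_string:
            if after_backslash:
                after_backslash = False  # this char completes an escape sequence
            elif ch == "\\":
                after_backslash = True
            elif ch == '"':
                in_string = False
            elif ord(ch) < 0x20:
                # Raw control character inside a string: emit its escaped form.
                out.append(_SHORT_ESCAPES.get(ch, "\\u%04x" % ord(ch)))
                continue
        elif ch == '"':
            in_string = True
        out.append(ch)
    return "".join(out)


# Illustrative input: legal whitespace outside the string, a raw tab inside it.
broken = '{\n\t"text": "col1\tcol2"\n}'
fixed = escape_control_chars(broken)
print(json.loads(fixed))
```

This sketch only addresses unescaped control characters; other problems, such as a backslash followed directly by a raw control character, would still need to be fixed by hand.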
Once the JSON document has been updated and passes the JSON validator, the user may upload the document to the Watson Discovery collection again. The same error should no longer occur.
Summary
The IBM Watson Discovery service supports various document formats. Supported document types for Smart Document Understanding:
- Lite plans: PDF, Word, PowerPoint, Excel, JSON*, HTML*
- Advanced plans: PDF, Word, PowerPoint, Excel, PNG**, TIFF**, JPG**, JSON*, HTML*
The link to the Watson Discovery supported document types is here.
In this particular scenario, the JSON document contained illegal unescaped control characters, which caused the conversion error after the document was uploaded to the Watson Discovery service.