IBM Support

What do the JSON language values mean in IBM Business Automation Content Analyzer?

Question & Answer


Question

How are the DocumentLanguage and Language property values determined?

Answer

Language
After the OCR engine processes each page, it returns a list of possible languages for that page.  Languages are listed by their two-letter abbreviations.  Objects found on a page generally do not belong to one specific language, so associating them with particular languages is a subjective determination.
If OCR is not successful, or no language-identifiable objects are found on the page, the list is empty.
Example: "PageInfo":{"Language":["en","de","fr","nl","da","it"],
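A consumer of the JSON output has to handle both a populated list and an empty one. The following Python snippet is a minimal sketch of that handling, not a description of the product's API. It assumes that a page object exposes "PageInfo" at its top level, as the fragment above suggests; the sample string and the page_languages helper are hypothetical and exist only for illustration.

```python
import json
from typing import List

# Hypothetical, minimal page fragment. Real output contains many more fields,
# and the exact nesting around "PageInfo" is an assumption.
sample = '{"PageInfo": {"Language": ["en", "de", "fr", "nl", "da", "it"]}}'


def page_languages(page: dict) -> List[str]:
    """Return the page's candidate language codes, or [] if none are present."""
    return page.get("PageInfo", {}).get("Language") or []


codes = page_languages(json.loads(sample))
if codes:
    print("Candidate languages:", ", ".join(codes))
else:
    # An empty list means OCR failed or nothing language-identifiable was found.
    print("No candidate languages for this page")
```

Because the list holds candidates rather than confirmed languages, downstream logic should treat every entry as a hint.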
DocumentLanguage
A separate language check is made at the document level.  In this check, every word in the document is evaluated to determine which language it belongs to.  Some words exist in more than one language and can be counted toward each of them.  If the word count for a particular language is high enough, that language is included in the DocumentLanguage list.  The language determination is subjective, so a listed language is not guaranteed to be present in the document.  Rather, it indicates a likelihood that some elements of the document are in that language.
If the document is classified but no language reaches a sufficient word count, the returned list is empty.  If the document cannot be classified, the list is always empty.
Example: "DocumentLanguage":["English","French"]
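The counting idea described above can be illustrated with a toy Python sketch. This is not the product's actual algorithm: the vocabularies, the MIN_WORDS threshold, and the document_languages function are all hypothetical, chosen only to show how a word shared by several languages counts toward each of them and how a threshold decides which languages are listed.

```python
from collections import Counter

# Hypothetical toy vocabularies; the real dictionaries are not documented here.
VOCAB = {
    "English": {"the", "invoice", "total", "date", "table"},
    "French": {"le", "facture", "total", "date", "table"},
    "German": {"der", "rechnung", "datum"},
}
MIN_WORDS = 3  # hypothetical threshold; the real value is not documented


def document_languages(words, vocab=VOCAB, min_words=MIN_WORDS):
    """List every language whose matching word count reaches the threshold."""
    counts = Counter()
    for word in words:
        lowered = word.lower()
        for language, known_words in vocab.items():
            if lowered in known_words:
                # A word shared by several languages counts toward each one.
                counts[language] += 1
    return [language for language in vocab if counts[language] >= min_words]


words = "The invoice total date table le facture".split()
print(document_languages(words))  # prints ['English', 'French']
```

In this sketch "total", "date", and "table" count toward both English and French, which is why both languages are listed even though most of the text is English. A document with too few recognized words yields an empty list.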


Document Information

Modified date:
11 October 2019

UID

ibm11075125