Capabilities of IBM Business Automation Content Analyzer
IBM Business Automation Content Analyzer has the following capabilities to meet the smarter business requirements:
- Metadata and data management
- It allows the construction of a custom Ontology of Document classes. Within those Document classes, you can specify Classification Words, Title phrases, Keys, and Headers.
- You can delete or modify any of these classes at any time through the Ontology Training UI.
- An Ontology can be exported as a JSON file for backup and imported again later, so you can save and switch between different Ontologies.
- File types supported by Content Analyzer
- TIF
- JPG
- PNG
- DOC/DOCX
- Optical Character Recognition (OCR) and Document extraction
- It extracts plain text, tables, and bar codes by using OCR.
- Plain text is organized into text blocks and displayed in the block list field in the JSON output. Blocks are organized into Lines and Words.
- It organizes tables by rows, cells, lines of text, and words, and displays them in the table list section of the JSON output. In addition, the line items of simple tables are extracted into a JSON section called TableLineItems, where they are grouped into LineItemGroups. In the JSON TableList, the groups are based on the number of sets of column headers in a table. Each LineItemGroup contains a list of Headers and a list of LineItems. TableLineItems are extracted for any document for which a user requests Key-Value Pair extraction.
- It gives a confidence level for OCR quality at the Word, Page, and Document levels.
- It reports the position of each bar code in the bar code list section of the JSON output.
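As an illustration of consuming the TableLineItems section, the following minimal Python sketch flattens each LineItemGroup into one record per line item. The exact JSON layout is an assumption: only the names TableLineItems, LineItemGroups, Headers, and LineItems come from the description above, while the nesting, the value types, and the sample data are hypothetical.

```python
import json

# Hypothetical excerpt of the JSON output. Only the key names are taken from
# the text; the nesting and the sample values are assumed for illustration.
sample_output = json.loads("""
{
  "TableLineItems": [
    {
      "LineItemGroups": [
        {
          "Headers": ["Item", "Qty", "Price"],
          "LineItems": [
            ["Widget", "2", "9.50"],
            ["Gasket", "10", "0.75"]
          ]
        }
      ]
    }
  ]
}
""")


def line_items_as_records(output):
    """Flatten every LineItemGroup into a list of header -> value dicts."""
    records = []
    for table in output.get("TableLineItems", []):
        for group in table.get("LineItemGroups", []):
            headers = group.get("Headers", [])
            for item in group.get("LineItems", []):
                # Pair each cell with the column header in the same position.
                records.append(dict(zip(headers, item)))
    return records


records = line_items_as_records(sample_output)
```

Adjust the key paths to match the actual output of your Content Analyzer version before relying on this.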
- Document classification
- Documents can be classified based on the classification words that are stored in the Document Classes in your Ontology. These words can be populated automatically by clicking Learn in the Key Value Pairs (KVP) Viewer in a specific document or can be manually defined in the Ontology page.
- Only the words on the first page of the document are used to classify the document.
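The idea of classifying a document from the words on its first page can be sketched as a simple word-overlap score. This is not the product's actual algorithm, which is not described here. The function name `classify_first_page`, the scoring rule, and the sample classes are all hypothetical.

```python
import re


def classify_first_page(first_page_text, classes):
    """Return the class whose classification words overlap most with the
    first page, or None when no class shares a word with the page.

    `classes` maps a Document class name to its classification words.
    Counting shared words is an assumed, simplified scoring rule.
    """
    page_words = set(re.findall(r"[a-z0-9]+", first_page_text.lower()))
    best_name, best_score = None, 0
    for name, words in classes.items():
        score = len(page_words & {w.lower() for w in words})
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical Ontology content, for illustration only.
example_classes = {
    "Invoice": ["invoice", "total", "due"],
    "Resume": ["experience", "education"],
}
result = classify_first_page("INVOICE No. 17 Total due: 40", example_classes)
```

Only the first page is passed in, mirroring the restriction stated above.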
- Title Detection
- Document titles are detected by a combination of phrases that are supplied in the Title class of a Document class in the Ontology. Document titles are also detected by identifying text styling characteristics on the first page of a document that are typical of page titles.
- Header
- It detects section headers by text styling that stands out (bold, italic, underline, or large font) or by section flags such as “III” or “A)”.
- It displays a confidence level for each header, based on the relationship between the header text and the aliases in the Header class of the Ontology.
- It displays all Header attributes in a JSON field called Attributes.
- It returns a confidence level in the quality of the header based on its match to an alias in the Header class.
- Table header
- It searches the first row of a table for matches against the Ontology’s Key Class and displays any found headers in a JSON field called CellHeaderAttributes.
- This field is populated if more than half of the cells in the first row of the table have a semantic match or exact match with an alias in the Key Class.
- It returns a confidence level in the match to an alias in the Key Class.
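The "more than half of the cells" rule can be sketched as follows. This sketch covers exact alias matches only, compared case-insensitively. Semantic matching and confidence levels are out of scope, and the function name `matched_headers` and the sample aliases are hypothetical.

```python
def matched_headers(first_row, key_aliases):
    """Return (cell, key) pairs for the first row of a table, but only when
    more than half of the cells match an alias; otherwise return [].

    `key_aliases` maps a Key name to its aliases. Exact, case-insensitive
    matching is an assumed simplification of the behavior described above.
    """
    alias_to_key = {
        alias.lower(): key
        for key, aliases in key_aliases.items()
        for alias in aliases
    }
    matches = []
    for cell in first_row:
        key = alias_to_key.get(cell.strip().lower())
        if key is not None:
            matches.append((cell, key))
    # Strictly more than half of the cells must match.
    if len(matches) > len(first_row) / 2:
        return matches
    return []


# Hypothetical Key Class content, for illustration only.
example_aliases = {
    "Quantity": ["qty", "quantity"],
    "Price": ["price", "unit price"],
}
headers = matched_headers(["Qty", "Unit Price", "Notes"], example_aliases)
```

With two of three cells matching, the row qualifies. With exactly half matching, it does not.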
- Key Value Pairs (KVP)
- KVPs are displayed in the KVPTable field of the JSON file.
- KVPs are found in text zones and in tables, both within the same cell and between adjacent cells.
- Fields can be marked as mandatory. Fields can also be marked as sensitive, which allows for easy redaction by an application that consumes Content Analyzer’s output.
- The algorithm locates the Value relative to the Key, either to its right or below it.
- The Key can take up a single line in plain text, though in tables, the Key can span multiple lines if they are in the same cell. The Value can span multiple lines in plain text or in tables within the same cell.
- It returns a confidence level in the OCR quality of both the Key and Value, and the confidence level in a match to an alias in the Key Class.
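A consuming application might redact sensitive fields along the following lines. The entry layout is an assumption: the field names `Key`, `Value`, and `Sensitive`, the helper `redact_sensitive`, and the sample entries are hypothetical and may not match the real KVPTable schema.

```python
def redact_sensitive(kvp_table, mask="[REDACTED]"):
    """Return a copy of the KVP entries with sensitive values masked.

    Each entry is assumed to be a dict with "Key", "Value", and an optional
    boolean "Sensitive" flag; these names are illustrative only.
    """
    redacted = []
    for entry in kvp_table:
        entry = dict(entry)  # shallow copy so the input is left untouched
        if entry.get("Sensitive"):
            entry["Value"] = mask
        redacted.append(entry)
    return redacted


# Hypothetical KVP entries, for illustration only.
example_kvps = [
    {"Key": "Name", "Value": "A. Customer", "Sensitive": False},
    {"Key": "Account Number", "Value": "12345678", "Sensitive": True},
]
safe_kvps = redact_sensitive(example_kvps)
```

Copying each entry keeps the original output available to code paths that are allowed to see the unredacted values.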
- Watson Natural Language Understanding (NLU) integration
- It sends each page of a document to the Watson NLU API, split into chunks of 10,000 characters or fewer.
- It extracts Entities, Keywords, and Relations features from the document and returns them in the Watson section of the JSON output.
- It requires Watson NLU credentials (Username, Password, Service URL, and an optional ModelID), which you supply in the AI Integrations page.
- It calls the NLU API by using the ModelID of any model you build and train outside of Content Analyzer.
- NLU results are returned in the JSON output only for documents that are classified into a Document class that has the NLU integration enabled.
- If the ModelID is not provided in the associated NLU integration, NLU uses the default model to extract document insights.
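The chunking limit can be illustrated with a minimal sketch that slices a page's text into pieces of at most 10,000 characters. How Content Analyzer actually chooses split points is not described here, so the fixed-width slicing and the name `chunk_page` are assumptions.

```python
MAX_CHARS = 10_000  # chunk size limit stated in the text above


def chunk_page(text, max_chars=MAX_CHARS):
    """Split one page of text into consecutive chunks of at most max_chars.

    Fixed-width slicing is an assumed strategy; a real implementation might
    prefer to split on sentence or whitespace boundaries.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# A 25,000-character page yields chunks of 10,000, 10,000, and 5,000.
chunks = chunk_page("a" * 25_000)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk would then be sent to the NLU service as a separate request, per page.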
- Document segmentation for insertion into Watson Discovery Service (WDS)
- It segments the document into Headers and the text between headers in the WDS section of the JSON output. This makes insertion into a WDS instance through your custom application outside of Content Analyzer easier.
- Segments of text between headers do not span multiple pages.
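The segmentation behavior can be sketched as follows, with each page processed independently so that no segment crosses a page boundary. The input representation (a list of pages, each a list of `(is_header, text)` tuples), the output field names, and the handling of text that precedes the first header on a page are all assumptions made for illustration.

```python
def segment_pages(pages):
    """Group each page's lines into segments of a header plus following text.

    `pages` is a list of pages; each page is a list of (is_header, text)
    tuples. Segments restart on every page, so none spans multiple pages.
    Text before the first header on a page gets a header of None (assumed).
    """
    segments = []
    for page_number, lines in enumerate(pages, start=1):
        current = None  # reset per page so segments never cross pages
        for is_header, text in lines:
            if is_header:
                current = {"page": page_number, "header": text, "text": []}
                segments.append(current)
            else:
                if current is None:
                    current = {"page": page_number, "header": None, "text": []}
                    segments.append(current)
                current["text"].append(text)
    return segments


# Hypothetical two-page document, for illustration only.
example_pages = [
    [(True, "Intro"), (False, "a"), (False, "b")],
    [(False, "c"), (True, "Scope"), (False, "d")],
]
segments = segment_pages(example_pages)
```

A custom application could then submit each segment to a WDS instance as its own document.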
- Output file types
- Searchable PDF
- JSON
- UTF-8 Text