Content Analyzer actions
Integration of IBM Datacap with Business Automation Content Analyzer (Content Analyzer).
Business Automation Content Analyzer, referred to in this topic as Content Analyzer, is a service that recognizes text in image files such as TIF, PNG, and JPG, and in DOC, DOCX, and PDF documents. Single page and multi-page documents can be provided to Content Analyzer, which recognizes the text and applies a preconfigured ontology to identify the document and locate specific data within each page. The ontology can be configured to look for relationships between data items within a page, typically expressed as key-value pairs. After recognizing the text on the page, Content Analyzer applies these relationship rules to extract the targeted data as key-value pairs. For example, Content Analyzer could return data with a key such as "Invoice Number" and a value of "#12345". How the data is interpreted and located depends on the particular ontology that has been preconfigured within Content Analyzer using its custom configuration tools.
The results from Content Analyzer can also be used for page identification. When documents are processed by Content Analyzer, the results can contain information that is used as page types, and the key-value pairs can be used to set a new page type for the processed pages.
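The idea of setting a page type from returned key-value pairs can be sketched in Python as follows. This is only an illustration, not Datacap or Content Analyzer code: the KEY_TO_PAGE_TYPE mapping, the "Policy Number" key, and the choose_page_type function are hypothetical names chosen for the example. Only "Invoice Number" and the starting page type "Other" come from this topic.

```python
# Hypothetical mapping from a key returned by Content Analyzer to a page type.
# A real application would define its own keys in the ontology.
KEY_TO_PAGE_TYPE = {
    "Invoice Number": "Invoice",
    "Policy Number": "Policy",
}


def choose_page_type(pairs: dict[str, str], default: str = "Other") -> str:
    """Return the page type for the first known key that has a non-empty value."""
    for key, page_type in KEY_TO_PAGE_TYPE.items():
        if pairs.get(key):
            return page_type
    # No known key was found, so the page keeps its starting type.
    return default


print(choose_page_type({"Invoice Number": "#12345"}))
```

A page whose results contain none of the known keys keeps the default type, which mirrors a page that Content Analyzer could not identify.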
Content Analyzer Actions:
Content Analyzer is used through these primary actions:
- ContentAnalyzerSubmitRequest: Sends the document to be recognized to Content Analyzer. The action does not wait for the result to be processed; the output is retrieved using ContentAnalyzerGetOutput. Because the action returns immediately, multiple pages or documents can be sent to Content Analyzer so that processing starts on all of them. For example, a batch can be configured to run in a "submit, get, submit, get, submit, get" pattern when processing documents. Alternatively, it may be advantageous to submit all of the documents first and then wait for the results, using a "submit, submit, submit, get, get, get" pattern.
- ContentAnalyzerGetOutput: Retrieves the recognized data from Content Analyzer and will wait for the results if Content Analyzer is still processing. The results returned from Content Analyzer are saved in the batch and are associated with the DCO object that called the action.
- ContentAnalyzerCreateLayout: Creates a layout file with the contents of the recognized text. A layout file is created for each page that was processed. The layout file can contain the recognized text for the entire page, along with coordinates of the text, and key-value pairs detected by Content Analyzer based on the ontology. The full page text can be loaded into the page CCO using the action CreateCCOFromLayout, so other actions and clients that use the CCO can operate on the data. The key-value pairs can be obtained using the action ContentAnalyzerFindExtractedText.
- ContentAnalyzerFindExtractedText: Retrieves the key-value pairs returned by Content Analyzer from the layout file. The retrieved data can then be used as needed by the application. The data can be loaded into fields, used in a verification panel, or exported to another system.
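The two submit and get patterns described for ContentAnalyzerSubmitRequest and ContentAnalyzerGetOutput can be compared with a small Python simulation. The FakeAnalyzer class and its submit and get_output methods are stand-ins invented for this sketch; they are not the Datacap actions or the Content Analyzer API, and a thread pool takes the place of the remote service.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class FakeAnalyzer:
    """Hypothetical stand-in for the service; records the order of calls."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=4)
        self.calls: list[str] = []

    def submit(self, name: str) -> Future:
        # Returns immediately, in the spirit of ContentAnalyzerSubmitRequest.
        self.calls.append(f"submit {name}")
        return self._pool.submit(lambda: f"recognized {name}")

    def get_output(self, name: str, pending: Future) -> str:
        # Blocks until the result is ready, in the spirit of ContentAnalyzerGetOutput.
        self.calls.append(f"get {name}")
        return pending.result()


def interleaved(analyzer: FakeAnalyzer, names: list[str]) -> list[str]:
    """submit, get, submit, get: each document waits for the previous one."""
    return [analyzer.get_output(n, analyzer.submit(n)) for n in names]


def submit_all_then_get(analyzer: FakeAnalyzer, names: list[str]) -> list[str]:
    """submit, submit, get, get: all documents are in progress before waiting."""
    pending = [(n, analyzer.submit(n)) for n in names]
    return [analyzer.get_output(n, future) for n, future in pending]
```

In the second pattern every document is already being processed before the first wait begins, which is why it can shorten the total time when the service handles requests in parallel.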
Content Analyzer Helper Actions:
- ContentAnalyzerConfiguration: Supplies the URL and credentials used to connect to Content Analyzer.
- ContentAnalyzerRetry: If communication with Content Analyzer fails, the request will be attempted again. This action allows configuration of the default retry settings.
- ContentAnalyzerSetTimeout: Allows the default timeout values to be adjusted.
- ContentAnalyzerSetProxy: Optional configuration of a proxy server used to connect with Content Analyzer.
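As a rough illustration of what a retry setting controls, the following Python sketch retries a failing request a fixed number of times. It is not the Datacap implementation: the function name call_with_retry, the max_attempts and delay_seconds parameters, and their default values are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    request: Callable[[], T],
    max_attempts: int = 3,       # assumed default, not a documented Datacap value
    delay_seconds: float = 0.0,  # assumed default, not a documented Datacap value
) -> T:
    """Call request() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Only communication failures are retried in this sketch; any other exception is passed to the caller on the first occurrence.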
Processing Pages and Documents:
Content Analyzer can process single page or multi-page documents. For single page documents, it is recommended to call the actions ContentAnalyzerSubmitRequest and ContentAnalyzerGetOutput on a page level DCO object that is associated with the image. For a multi-page document, such as a PDF, call the actions ContentAnalyzerSubmitRequest and ContentAnalyzerGetOutput on a document level DCO object that is associated with the PDF.
When the action ContentAnalyzerGetOutput is used at the page level, it expects a single page response. The action will save the results in the batch and identify the files by setting variables in the current object. When ContentAnalyzerCreateLayout is called at the page level, only one page of data can be associated with the current page object. The layout file will be created and associated with the current page object. If multiple pages were recognized, only the first page's information will be saved in the layout file.
When the action ContentAnalyzerGetOutput is used at the document level, it allows for processing of multiple pages. Again, the raw data returned from Content Analyzer is saved in the batch and the file names are saved in the specified variables in the current document node. When ContentAnalyzerCreateLayout is subsequently called at the document level, it assumes that the current document already has existing DCO page objects for each page within the document. If it is not known whether a document is multi-page or single page, all of the documents can be treated as multi-page. A document with only one page is then simply a document object that contains one page object.
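The difference between the page level and the document level can be summarized in a short Python sketch. The assign_layout_data function and its arguments are hypothetical and only model the behavior described above. In particular, truncating to the shorter list when the counts differ at the document level is an assumption of this sketch, not documented behavior.

```python
def assign_layout_data(
    level: str,
    page_objects: list[str],
    recognized_pages: list[str],
) -> dict[str, str]:
    """Map page object IDs to the recognized page data saved in their layout files."""
    if level == "page":
        # A page object can hold one page of data, so only the first
        # recognized page is kept, even if more pages were returned.
        return {page_objects[0]: recognized_pages[0]}
    if level == "document":
        # Each existing child page object receives the data for its page.
        return dict(zip(page_objects, recognized_pages))
    raise ValueError(f"unknown level: {level}")


print(assign_layout_data("page", ["TM000001"], ["page 1 text", "page 2 text"]))
```

Treating every document as multi-page avoids the data loss shown in the page level branch, at the cost of creating a document object even when it holds a single page.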
Processing a Multi-Page Document:
As documents are ingested into a new batch, each document is assigned to a Page object and provided a starting page type of "Other". For the Content Analyzer actions to support possible multiple pages within the document, the ingested file must be associated with a DCO document level object. The action PromotePageToDocument in the Application Objects action library will create a new document level object based on the current page object. The new Document level object will be a copy of the page object and will have the associated image file. The status of the original page object will be set to "75", which means deleted, and it should no longer be used. Once all of the desired pages have been promoted to a Document object, then the action DeleteDCOChildWithStatus in Application Objects can be used to remove all objects with the status of 75 from the batch. Now that the ingested file is associated with a document level object, the Content Analyzer actions can be used to further process the document.
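The promote and clean up sequence can be modeled with a few lines of Python. This is a simplified sketch of the behavior described above, not the Application Objects implementation: the Node class, the flat batch list, the default status of 0, and the function names are assumptions made for illustration. Only the status value 75 and the starting page type "Other" come from this topic.

```python
from dataclasses import dataclass

DELETED_STATUS = 75  # status that marks an object as deleted


@dataclass
class Node:
    level: str        # "page" or "document"
    node_type: str    # for example "Other"
    file_name: str
    status: int = 0   # assumed default for an active object


def promote_page_to_document(batch: list[Node], page: Node) -> Node:
    """Copy the page into a new document level node and mark the page deleted."""
    document = Node("document", page.node_type, page.file_name)
    page.status = DELETED_STATUS
    batch.append(document)
    return document


def delete_children_with_status(batch: list[Node], status: int) -> None:
    """Remove every node that has the given status from the batch."""
    batch[:] = [node for node in batch if node.status != status]


batch = [Node("page", "Other", "TM000001.pdf")]
promote_page_to_document(batch, batch[0])
delete_children_with_status(batch, DELETED_STATUS)
print(batch)
```

Running the clean up once, after all pages have been promoted, matches the order recommended above and avoids removing objects while they are still being iterated.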
Once a document, such as a PDF, is associated with a Document level DCO object, it can be split into multiple pages by using the Convert action library. The action PDFFREDocumentToImage converts a PDF into multiple TIF files. When called on a Document level node, the action creates child page objects, each associated with one of the TIF files. In the following example, a Document level node is associated with the file TM000001.pdf, and PDFFREDocumentToImage has created 4 child page objects, one for each page within the original PDF. Each page object has an associated TIF image.
˂Document "Invoice"˃ TM000001.pdf (source document object)
˂page "Bill"˃ TM000002.tif (new page object, page 1 of TM000001.pdf)
˂page "Bill"˃ TM000003.tif (new page object, page 2 of TM000001.pdf)
˂page "Bill"˃ TM000004.tif (new page object, page 3 of TM000001.pdf)
˂page "Bill"˃ TM000005.tif (new page object, page 4 of TM000001.pdf)
Now that the Document level node has all of the corresponding child pages, the actions ContentAnalyzerSubmitRequest and ContentAnalyzerGetOutput can be called on the Document level node to have Content Analyzer recognize the pages, generate any key-value pairs, and save the data within the batch. ContentAnalyzerCreateLayout can then be called on the Document level node to create the layout files with the associated data. A unique layout file is associated with each corresponding page, as shown in the following example:
˂Document "Invoice"˃ TM000001.pdf
˂page "Bill"˃ TM000002.tif, TM000002_layout.xml
˂page "Bill"˃ TM000003.tif, TM000003_layout.xml
˂page "Bill"˃ TM000004.tif, TM000004_layout.xml
˂page "Bill"˃ TM000005.tif, TM000005_layout.xml
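The naming relationship visible in this example, where each page image has a matching layout file, can be expressed as a one-line Python helper. The rule is inferred only from the file names shown above, and the layout_name function is a hypothetical helper, so the actual naming performed by ContentAnalyzerCreateLayout may differ.

```python
from pathlib import PurePath


def layout_name(image_name: str) -> str:
    """Derive the layout file name shown in the example from a page image name."""
    # Inferred from the example: TM000002.tif -> TM000002_layout.xml
    return f"{PurePath(image_name).stem}_layout.xml"


for image in ["TM000002.tif", "TM000003.tif", "TM000004.tif", "TM000005.tif"]:
    print(image, layout_name(image))
```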
Once these steps are completed, then subsequent actions, such as ContentAnalyzerFindExtractedText, can be run on the Page objects to process the data as needed. If desired, the page data can be used to change the page type, extract text, verify pages, etc. The layout XML can be loaded into the CCO using LoadCCOFromLayout in the SharedRecognitionTools action library to create the CCO so actions and clients that use the CCO can access the recognized text. Any required Datacap actions can be used on this data to process it as required.
Additional Information:
Refer to the Application Objects action library top-level help topic for more information about the Datacap hierarchy and the actions PromotePageToDocument and DeleteDCOChildWithStatus. Refer to the Convert action library for actions that split multi-page documents, such as PDFs, Word documents, and multi-page TIFs, into single page images.
Refer to the top-level help topic for the Shared Recognition Tools action library for more information about the Datacap CCO and how it is used.
Refer to the Smart Parameter reference to understand the abilities of Smart Parameters and how to use them.