Datacap Insight Edition – SystemT – WKS ARE

Prebuilt or Inbuilt Extractors

Prebuilt extractors are the extractors that are shipped with Datacap. The list of prebuilt extractors is available in the action library help for the ExtractText action.

No special configuration is required to use a prebuilt extractor. It can be called directly through the ExtractText action of the DocumentAnalytics library.

Syntax: ExtractText(ModuleName.ViewName,ModuleName.ViewName,…n)

Description:

  • ModuleName for the prebuilt extractors that are shipped with Datacap is “Named_Entity_Recognition”.
  • ViewName is the extractor name.
  • “n” denotes any number of extractors that can be passed to SystemT.

Example

ExtractText(Named_Entity_Recognition.Person,Named_Entity_Recognition.Address,Named_Entity_Recognition.City)
Note: Ensure that there are no spaces between parameters.
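
The parameter string can also be assembled programmatically, which avoids misspelled module names and stray spaces. The following is a minimal Python sketch, not part of Datacap: the helper name build_extract_text is hypothetical, and it only builds the text that you would then enter as the action call.

```python
def build_extract_text(extractors):
    """Return an ExtractText call string for (module, view) pairs.

    Hypothetical helper: it only formats text, it does not call Datacap.
    """
    parts = []
    for module, view in extractors:
        module, view = module.strip(), view.strip()
        if not module or not view or " " in module or " " in view:
            raise ValueError(f"invalid extractor name: {module!r}.{view!r}")
        parts.append(f"{module}.{view}")
    # Join with a bare comma: no spaces are allowed between parameters.
    return "ExtractText(" + ",".join(parts) + ")"


PREBUILT_MODULE = "Named_Entity_Recognition"
call = build_extract_text(
    [(PREBUILT_MODULE, "Person"), (PREBUILT_MODULE, "Address"), (PREBUILT_MODULE, "City")]
)
print(call)
```

Keeping the module name in one constant means a typo cannot appear in only some of the parameters.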

Custom extractors

Custom extractors are extractors that a user creates with the Designer tool (also called the TextAnalytics Designer tool or the SystemT Designer tool). The Designer tool is available in Watson Knowledge Studio (WKS). It is a part of Watson NLP.

An extractor can be exported for use within Datacap once it is complete and produces results against a set of documents in the SystemT Designer tool.

WKS ARE (Watson Knowledge Studio – Advanced Rule Editor):

  • Search the catalog for “Watson Knowledge Studio”. You must register with a plan and a region in Watson, for example, the free Lite plan and the Frankfurt region. Then click Create.
    Figure 1. Getting started with Knowledge Studio

  • On the "Getting started with Knowledge Studio" page, click Manage. The "Start by launching the tool" page opens. If you need help getting started, you can go through the tutorial.
    Figure 2. Launching Watson Knowledge Studio

  • Click “Launch Watson Knowledge Studio”. On the "Create a workspace" page, select Create advanced rules workspace. Provide a name for the workspace when prompted.
    Figure 3. Creating a Workspace

  • If you already have a workspace, you can use it by clicking the link shown in Fig 4.
    Figure 4. Using an existing workspace

  • Once the workspace is created, click it.
    Figure 5. Working with the workspace

  • For help on how to create extractors, click the icon shown in Fig 6.
    Figure 6. Help for creating extractors

Types of Extractors

  • Single extractor

    A ‘single extractor’ is an extractor that extracts one type or one pattern of data. For example, in the screenshot below, the extractors Contract_Term and ContractStart_Date are single extractors. They are saved as individual extractors within any category.

    They can be exported individually in SystemT.

    Figure 7. Single extractor

    Note: To export the extractor, right-click the extractor on the Extractors tab, select Export, and follow the steps in the section “Exporting Extractors in SystemT”.
  • Category of Extractors

    Multiple single extractors are saved under a category to form a group of extractors. This is required when you need different types of data to be extracted from your documents. All extractors that are grouped together in a category can be exported at once, instead of exporting them individually.

    For example, in Fig 8, two single extractors are saved under a category named “CustomerX_Extractors”. This category can then be exported as an executable that contains all of its extractors.

Figure 8. Category of extractors

To export the category, right-click it, select Export, and follow the steps in the section “Exporting Extractors in SystemT”.

The advantage of using a category is that all of its extractors are exported together. For both types of export, SystemT creates three folders. The number of files in these folders varies.

Exporting Extractors in SystemT

Figure 9. Exporting extractors

On export, SystemT produces a zip file that contains three folders:

  • Resources

    If you have implemented any external dictionaries, they are available in this folder, inside a nested folder.

  • SRC

    This folder is generated when the Include Source Files check box is selected while exporting the extractor in SystemT. It contains the AQL files for the extractor. Although these files are not used in Datacap, they are helpful for looking up the names of the modules and views: simply open the files in a text editor.

  • TAMs

    This folder contains the TAM files for the extractors. TAM files are binary files that cannot be modified with Notepad or any other editing tool. These are the files that are used in Datacap.
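
Before copying anything into Datacap, it can help to confirm what an export actually contains. The following Python sketch groups the entries of an exported zip file by top-level folder and picks out files by extension. The function names are hypothetical, and the .tam extension is an assumption; the folder names in your export may differ in spelling or case.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def summarize_export(zip_source):
    """Group the file names in an exported zip by their top-level folder."""
    by_folder = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            parts = PurePosixPath(name).parts
            top = parts[0] if len(parts) > 1 else ""  # "" means the zip root
            by_folder[top].append(name)
    return dict(by_folder)


def files_with_suffix(zip_source, suffix):
    """Return entries whose name ends with suffix, ignoring case."""
    with zipfile.ZipFile(zip_source) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith(suffix.lower())
        ]
```

For example, files_with_suffix("export.zip", ".dict") lists any dictionary files, where export.zip stands for the path of your own exported file.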

Configuring custom extractors in Datacap

Follow these steps to add the TAM files associated with the custom extractors to the RRS\Aql folder.

  • Take a backup of the RRS\Aql folder of your Datacap installation.
  • From the exported zip file of the custom extractors, copy all the files from the TAMs folder to the \Datacap\RRS\Aql folder.
    Note: Because the SystemT tool exports the TAM files of all dependent extractors, the TAMs folder may contain TAM files that belong to prebuilt extractors. In that case, Windows warns you about replacing existing files. If so, choose the option to replace the files.
  • Next, copy any dictionary files that may have been exported.
    • Search the Resources folder of the exported zip file for .dict files.
    • If any are found, copy them to the \Datacap\RRS\Aql folder of your Datacap installation.
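
The steps above can be sketched as a small Python script. This is an illustration under stated assumptions, not a Datacap utility: the function name deploy_extractors is hypothetical, TAM files are assumed to use the .tam extension, and, as a simplification, the script picks up every .tam and .dict file anywhere in the zip file rather than only those in specific folders. The backup location must be outside the Aql folder.

```python
import shutil
import time
import zipfile
from pathlib import Path, PurePosixPath


def deploy_extractors(export_zip, aql_dir, backup_root):
    """Back up aql_dir, then copy .tam and .dict files from export_zip into it."""
    aql_dir = Path(aql_dir)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    backup = Path(backup_root) / f"Aql_backup_{stamp}"
    shutil.copytree(aql_dir, backup)  # back up before anything is replaced

    copied = []
    with zipfile.ZipFile(export_zip) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = PurePosixPath(info.filename).name
            # Assumption: extractor binaries end in .tam, dictionaries in .dict.
            if not name.lower().endswith((".tam", ".dict")):
                continue
            # Files are flattened into aql_dir; same-named files are replaced.
            with archive.open(info) as src, open(aql_dir / name, "wb") as dst:
                shutil.copyfileobj(src, dst)
            copied.append(name)
    return backup, sorted(copied)
```

The function returns the backup folder and the names of the copied files, so the result can be checked against the contents of the export before the extractors are used.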

You can now use your custom extractors in the ExtractText action.

Example: ExtractText(tauser_tauser.FullYearText,Named_Entity_Recognition.City)

tauser_tauser is the module name and FullYearText is the view name.

To get the module name, check the module name in the AQL file in the SRC folder. The highlighted text in the screenshot below is the module name.

Figure 10. Example: Configuring custom extractors
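
The module name can also be read from the AQL source with a short script instead of opening each file by hand. The Python sketch below assumes that the AQL file declares its module with a statement of the form "module <name>;". The function name and the sample text are hypothetical and are not taken from an actual export.

```python
import re

# Assumed form of the AQL module declaration: `module <name>;`
MODULE_RE = re.compile(r"^\s*module\s+([A-Za-z_]\w*)\s*;", re.MULTILINE)


def module_name(aql_text):
    """Return the declared module name, or None if no declaration is found."""
    match = MODULE_RE.search(aql_text)
    return match.group(1) if match else None


# Hypothetical AQL content, for illustration only.
sample = "module tauser_tauser;\n\n-- view definitions follow\n"
name = module_name(sample)
print(f"{name}.FullYearText")
```

The printed value has the ModuleName.ViewName form that the ExtractText action expects. The view name still has to be supplied from the extractor name.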

Comparing results between the SystemT tool and Datacap ExtractText action

To validate a custom extractor as it is executed in Datacap, it is common to compare the results generated by Datacap with the results obtained from the SystemT tool.

When making such a comparison, it is imperative that the same text is used in both cases. The best practice is to run a document through Datacap OCR first, use the text file creation actions to output the OCR results, and use those files when building extractors.
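
Once both sets of results are available, the comparison itself can be automated. The following Python sketch assumes that the extracted values from each tool have already been collected into plain lists of strings. How you obtain those lists is outside the scope of this sketch, and the function names are hypothetical.

```python
from collections import Counter


def normalize(value):
    """Collapse runs of whitespace so line-wrap differences do not count."""
    return " ".join(value.split())


def compare_results(designer_values, datacap_values):
    """Return the values that one side found and the other did not."""
    designer = Counter(normalize(v) for v in designer_values)
    datacap = Counter(normalize(v) for v in datacap_values)
    # Counter subtraction keeps only the positive surplus on each side.
    return {
        "only_in_designer": dict(designer - datacap),
        "only_in_datacap": dict(datacap - designer),
    }
```

Differences that remain after whitespace is normalized usually indicate that the two tools processed different text, for example typed text in one case and OCR output in the other.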