ExtractText

This action uses text analytics to find entities, such as names and addresses, in the text. The results are saved and can then be used by subsequent actions, such as FindExtractedText.

Restriction: This action does not support regular expressions that contain line break or tab expressions.

Syntax

bool ExtractText (string extractors) 

Parameters

string extractors
Smart parameter for a comma-separated list of extractors to process.

Returns

Always True.

Level

Document or Page

Details

To extract from non-English text, set the page variable hr_locale to the required language before calling this action. For example, for Japanese, call rrset("ja","@P.hr_locale").

The entities to be found are determined by AQL extractors. Datacap provides a set of pre-built extractors. You can create extra AQL extractors by using the IBM BigInsights tools.

Extractors are saved in compiled files with the extension tam. All tam files that are present in the \rrs\aql folder are loaded. You can add tam, dictionary, and table files to the \rrs\aql folder, or remove them from it, to control whether they are run.

All of the loaded extractors are run on the document or page.
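Because every tam file in the folder is loaded, it can help to check what is present before running the action. The following is a minimal Python sketch, not part of Datacap; the function name list_extractor_files is hypothetical, and the folder path that you pass in is whatever your installation uses for \rrs\aql.

```python
from pathlib import Path


def list_extractor_files(aql_folder):
    """Return the sorted names of the .tam files directly inside aql_folder.

    Hypothetical helper: it only lists files and does not load or run them.
    """
    folder = Path(aql_folder)
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".tam"
    )
```

Calling list_extractor_files with the path of your \rrs\aql folder returns the compiled extractor files that would be picked up. Dictionary and table files are ignored by this sketch.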

Important: The results from ExtractText are saved to the layout XML file, for example tm000001_layout.xml. You can open this file in a text editor to see the available entities and the entity fields that can be copied into document hierarchy fields. You can search for an entity in the layout XML by its name, such as Address or Person.
Important: This action requires a 32-bit Java runtime environment. The runtime in the default location, \Datacap\dcshared\jre, is used; otherwise, the path that is specified in the JAVA_HOME system variable is used.

ExtractText requires a previously created layout file (for example, tm000001_layout.xml), in which text is grouped into blocks. See the introduction of the DocumentAnalytics actions help topic for information on the layout XML file.
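Searching the layout XML for an entity name can also be scripted. The following Python sketch makes no assumption about the layout schema, which is not described here: it parses the XML and returns every element whose tag or attribute values mention the name. The function name find_entity_elements is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_entity_elements(xml_text, entity_name):
    """Return elements whose tag or any attribute value mentions entity_name.

    The comparison is case-insensitive and schema-agnostic; it is an
    illustration only and does not reflect the actual layout XML structure.
    """
    root = ET.fromstring(xml_text)
    needle = entity_name.lower()
    hits = []
    for elem in root.iter():
        candidates = [elem.tag] + list(elem.attrib.values())
        if any(needle in value.lower() for value in candidates):
            hits.append(elem)
    return hits
```

To use it, read the content of a layout file such as tm000001_layout.xml into a string and pass it in together with a name such as Address or Person. Inspect the returned elements to see which field names are available.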

Support for external dictionaries

The ExtractText action supports AQL external dictionaries. With this feature, you can write annotators that do not need to be recompiled when a change is needed.

You can export extractors from the IBM InfoSphere BigInsights web tools and place the exported folders in the rrs\aql\src location. The AQL is compiled at run time.

Note: You need to manually copy any external dictionaries and tables to the \rrs\aql location.

Keep a backup of the RRS folder in case any file or folder becomes corrupted during copying or because of an incorrect configuration.

Complete the following steps to configure custom annotators in Datacap:

  1. After you create a custom extractor by using the BigInsights web tool, export the extractor as Executable in a compressed file format. Including "Source files" is optional.
  2. Copy the TAM files from the export to the \RRS\AQL folder. Do not copy the InputDocumentProcessor.TAM file; leave the original in the RRS\AQL folder.
  3. If you want Datacap to compile any AQL, copy the SRC folder from the exported compressed file to RRS\AQL.
  4. Make sure to copy all the supported *.DICT files that are provided by BigInsights to the RRS\AQL folder.
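The file-copy part of the steps above (steps 2 and 4) can be sketched in Python as follows. This is an illustration under stated assumptions, not a Datacap tool: the function name copy_extractor_files is hypothetical, the exported archive is assumed to be already unpacked, and the TAM and DICT files are assumed to sit directly in the export folder. The SRC folder from step 3 is not handled.

```python
import shutil
from pathlib import Path

# File that must not be overwritten; the original in the AQL folder is kept.
PROTECTED = {"inputdocumentprocessor.tam"}


def copy_extractor_files(export_dir, aql_dir):
    """Copy *.tam and *.dict files from export_dir into aql_dir.

    InputDocumentProcessor.tam is skipped. Returns the sorted names of the
    files that were copied. Hypothetical helper for illustration only.
    """
    export_dir, aql_dir = Path(export_dir), Path(aql_dir)
    copied = []
    for src in export_dir.iterdir():
        if not src.is_file():
            continue
        if src.suffix.lower() not in {".tam", ".dict"}:
            continue
        if src.name.lower() in PROTECTED:
            continue
        shutil.copy2(src, aql_dir / src.name)
        copied.append(src.name)
    return sorted(copied)
```

Back up the RRS folder before running anything like this, as recommended earlier in this topic.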

Custom extractors must be called with the ExtractText action in the format Module.Viewname. Verify the module name in the corresponding AQL file. For example, ZIPCODE_BasicFeatures.ZCView.

Pre-built extractors


Address 

City 

Continent 

Country 

Date 

DateTime 

EmailAddress 

Facility 

FinancialAnnouncements 

FinancialEvents 

Location 

NotesEmailAddress 

Organization 

Person 

PhoneNumber 

StateOrProvince 

URL 

WaterBody 

ZipCode 

For pre-built extractors, add the prefix Named_Entity_Recognition. For example:

Named_Entity_Recognition.Address

Chinese and Japanese extractors use their own module names:

BigInsightsChineseNER.PersonChinese

BigInsightsChineseNER.LocationChinese

BigInsightsChineseNER.OrganizationChinese

BigInsightsJapaneseNER.PersonJapanese

BigInsightsJapaneseNER.LocationJapanese

BigInsightsJapaneseNER.OrganizationJapanese
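The naming rules in this section can be sketched as a small Python helper that builds the comma-separated value for the extractors parameter. The function name build_extractor_list is hypothetical. The sketch assumes that a name without a period is a pre-built extractor that needs the Named_Entity_Recognition prefix, and that a name with a period is already in Module.Viewname form.

```python
PREFIX = "Named_Entity_Recognition"


def build_extractor_list(names):
    """Join extractor names into a comma-separated list.

    Bare names (no period) get the pre-built prefix; names that already
    contain a module, such as BigInsightsJapaneseNER.PersonJapanese or a
    custom Module.Viewname, are passed through unchanged.
    """
    qualified = []
    for name in names:
        name = name.strip()
        if "." not in name:
            name = f"{PREFIX}.{name}"
        qualified.append(name)
    return ",".join(qualified)
```

For instance, passing the names DateTime and Address produces the same argument that Example 1 passes to ExtractText, without the space after the comma.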

Example 1

The following example populates the city field with the first instance of an address where the state is California.

ExtractText(Named_Entity_Recognition.DateTime, Named_Entity_Recognition.Address)
FindExtractedText(@P\City, First, Named_Entity_Recognition.Address, city, stateorprovince,
(California)|(CA))

Check the data in the layout XML file to get the correct field names.
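The last argument in Example 1, (California)|(CA), is a regular expression with two alternatives. The following Python sketch shows which values such a pattern accepts when the whole value must match. Whether FindExtractedText requires a full match or is case-sensitive is not stated in this topic, so treat both as assumptions of the sketch; the function name state_matches is hypothetical.

```python
import re

# Pattern copied from Example 1.
STATE_PATTERN = re.compile(r"(California)|(CA)")


def state_matches(value):
    """Return True when the whole value matches the Example 1 pattern.

    Uses Python's re module with a case-sensitive full match, which may
    differ from how Datacap evaluates the expression.
    """
    return STATE_PATTERN.fullmatch(value) is not None
```

Under these assumptions, the values California and CA are accepted, and values such as Nevada or CALIFORNIA are rejected.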

Additional: The text file that is created by the CreateTextFile action can also be used to extract data. To use the text file, set the variable extractTextUseTxt to 1. A layout file is still required.

This variable is useful for testing custom extractors because it makes the action work on the text file. Extractors that are created in the designer tool also use a text file.

Example 2

Recognize()
CreateTextFile()
rrset("1","@P.extractTextUseTxt")
ExtractText(Named_Entity_Recognition.DateTime,Named_Entity_Recognition.Address)