ExtractText
This action uses text analytics to find entities, such as names and addresses, in the text. The results are saved and can then be used by subsequent actions, such as FindExtractedText.
Syntax
bool ExtractText (string extractors)
Parameters
- string extractors
- Smart parameter for a comma-separated list of extractors to process.
Returns
Always True.
Level
Document or Page
Details
To extract from non-English text, set the page variable hr_locale to the required language before calling this action. For example, for Japanese, call rrset("ja","@P.hr_locale").
The entities to be found are determined by AQL extractors. Datacap provides a set of pre-built extractors. You can create extra AQL extractors by using the IBM BigInsights tools.
Extractors are saved in compiled files with the extension tam. All tam files present in the \rrs\aql folder are loaded. You can add tam, dictionary, and table files to the \rrs\aql folder, or remove them from it, to control whether they are run.
All of the loaded extractors are run on the document or page.
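Because every tam file in the folder is loaded, it can help to check what is there before a run. The following Python sketch is not part of Datacap; it is a minimal illustration that assumes the folder is reachable from the machine where the script runs, and the function name list_tam_files is hypothetical.

```python
from pathlib import Path


def list_tam_files(aql_dir):
    """Return the names of compiled extractor (.tam) files found
    directly inside aql_dir, sorted by name.

    Only the top level of the folder is inspected; subfolders such
    as a source folder are ignored.
    """
    folder = Path(aql_dir)
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".tam"
    )
```

Calling list_tam_files with the path to your own \rrs\aql folder returns the tam files that would be picked up. Where that folder is located depends on your installation.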
ExtractText requires a previously created layout file (for example: tm000001_layout.xml), where text is grouped into blocks. See the introduction of the DocumentAnalytics actions help topic for information on the layout XML file.
Support for external dictionaries
The ExtractText action supports AQL external dictionaries. By using this feature, you can write annotators that do not need to be recompiled when a change is needed.
You can export from the IBM InfoSphere BigInsights web tools and place the exported folders into the rrs\aql\src location. The AQL is compiled at run time.
Keep a backup of the RRS folder in case a file or folder is corrupted during copying or because of an incorrect configuration.
Detailed steps to configure custom annotators in Datacap
Complete the following steps:
- After you create a custom extractor by using the BigInsights web tool, export the extractor as Executable. Including "Source files" is optional. Export in a compressed file format.
- Copy the TAM files from the export to the \RRS\AQL folder. Do not copy the InputDocumentProcessor.TAM file; leave the original in the RRS\AQL folder.
- If you want Datacap to compile any AQL, copy the SRC folder from the exported compressed file to the RRS\AQL folder.
- Make sure to copy all the supported *.DICT files that are provided by BigInsights to the RRS\AQL folder.
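The copy steps above can be scripted. The following Python sketch is an illustration only, not an IBM-provided tool. It assumes that the exported compressed file has already been unpacked into a folder with the TAM and DICT files at its top level and an optional SRC subfolder; that layout, the function name deploy_export, and its parameters are all assumptions.

```python
import shutil
from pathlib import Path

# Per the steps above, the original InputDocumentProcessor.TAM must stay in place.
SKIPPED_TAM = {"inputdocumentprocessor.tam"}


def deploy_export(export_dir, aql_dir, copy_src=True):
    """Copy TAM and DICT files from an unpacked export into aql_dir.

    Returns the names of the files that were copied. The SRC folder
    is copied to aql_dir/src only when copy_src is True, that is,
    when AQL should be compiled at run time.
    """
    export_dir, aql_dir = Path(export_dir), Path(aql_dir)
    copied = []
    for item in export_dir.iterdir():
        if item.is_file():
            ext = item.suffix.lower()
            if ext == ".tam" and item.name.lower() in SKIPPED_TAM:
                continue  # never overwrite the original file
            if ext in {".tam", ".dict"}:
                shutil.copy2(item, aql_dir / item.name)
                copied.append(item.name)
        elif copy_src and item.is_dir() and item.name.lower() == "src":
            # dirs_exist_ok requires Python 3.8 or later
            shutil.copytree(item, aql_dir / "src", dirs_exist_ok=True)
    return copied
```

Back up the RRS folder before you run a script like this, because files with the same names in the target folder are overwritten.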
Custom extractors must be called with the ExtractText action in the format Module.Viewname. Verify the module name in the corresponding AQL file. For example, ZIPCODE_BasicFeatures.ZCView.
Pre-built extractors
Address
City
Continent
Country
Date
DateTime
EmailAddress
Facility
FinancialAnnouncements
FinancialEvents
Location
NotesEmailAddress
Organization
Person
PhoneNumber
StateOrProvince
URL
WaterBody
ZipCode
For pre-built extractors, use the prefix Named_Entity_Recognition.
For example:
Named_Entity_Recognition.Address
BigInsightsChineseNER.PersonChinese;
BigInsightsChineseNER.LocationChinese;
BigInsightsChineseNER.OrganizationChinese;
BigInsightsJapaneseNER.PersonJapanese;
BigInsightsJapaneseNER.LocationJapanese;
BigInsightsJapaneseNER.OrganizationJapanese;
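Because the extractors parameter is a comma-separated list of names in Module.Viewname form, the string can be assembled and checked before it is placed in a ruleset. The following Python sketch is a hypothetical helper, not a Datacap API; the function names qualify and extractors_argument are assumptions made for illustration.

```python
PREBUILT_MODULE = "Named_Entity_Recognition"


def qualify(view, module=PREBUILT_MODULE):
    """Return the Module.Viewname form of an extractor name."""
    return f"{module}.{view}"


def extractors_argument(names):
    """Join qualified extractor names into one comma-separated string.

    Raises ValueError if a name is not in Module.Viewname form.
    """
    for name in names:
        module, dot, view = name.partition(".")
        if not (module and dot and view):
            raise ValueError(f"not in Module.Viewname form: {name!r}")
    return ",".join(names)
```

For example, extractors_argument([qualify("DateTime"), qualify("Address")]) produces the string that is passed to ExtractText in the examples in this topic, and a custom name such as ZIPCODE_BasicFeatures.ZCView can be added to the same list.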
Example 1
The following example populates the city field with the first instance of an address where the state is California.
ExtractText(Named_Entity_Recognition.DateTime, Named_Entity_Recognition.Address)
FindExtractedText(@P\City, First, Named_Entity_Recognition.Address, city, stateorprovince, (California)|(CA))
Check the data in the layout XML file to get the correct field name.
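The selection logic in Example 1 can be pictured with a short Python sketch. This is not how Datacap implements FindExtractedText; it only illustrates the idea of returning one field from the first extracted entity whose filter field matches a pattern. The dictionary records, the function name first_matching, and the choice of a whole-value, case-sensitive match are all assumptions.

```python
import re


def first_matching(entities, return_field, filter_field, pattern):
    """Return return_field from the first entity whose filter_field
    matches pattern in full, or None if no entity matches."""
    rx = re.compile(pattern)
    for entity in entities:
        if rx.fullmatch(entity.get(filter_field, "")):
            return entity.get(return_field)
    return None


# Hypothetical extracted addresses, in document order.
addresses = [
    {"city": "Austin", "stateorprovince": "TX"},
    {"city": "San Jose", "stateorprovince": "CA"},
    {"city": "Fresno", "stateorprovince": "California"},
]
city = first_matching(addresses, "city", "stateorprovince", "(California)|(CA)")
```

With this sample data, city is "San Jose" because the second record is the first one whose state value matches either alternative in the pattern.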
Additional: A text file that is created by the CreateTextFile action can also be used to extract data. To use the text file, set the variable extractTextUseTxt to 1. A layout file is still required.
Because this variable makes the action work on the text file, it can be used to test custom extractors. Extractors that are created in the designer tool also use a text file.
Example 2
Recognize()
CreateTextFile()
rrset("1","@P.extractTextUseTxt")
ExtractText(Named_Entity_Recognition.DateTime,Named_Entity_Recognition.Address)