ExtractText

Finds entities such as names and addresses in the text by using text analytics. The results are saved and can then be used by subsequent actions, such as FindExtractedText.

Restriction: This action does not support regular expressions that contain line breaks or tab expressions.

Syntax

bool ExtractText (string extractors)

Parameters

extractors
Smart parameter for a comma-separated list of extractors to process.

Returns

True.

Level

Document or Page level

Details

To extract from non-English text, set the page variable hr_locale to the desired language before calling this action. For example, for Japanese, call rrset("ja","@P.hr_locale").

The entities that are found are determined by AQL extractors. An initial set of pre-built extractors is provided. They work in many instances but might not work in every case. See the IBM BigInsights documentation for information about the pre-built extractors. You can create additional extractors by using the IBM BigInsights tools for creating AQL extractors.

Extractors are saved in compiled files with the extension tam. All tam, dictionary, and table files that are present in the \rrs\aql folder are loaded. The extractors provided by Datacap are exposed in DatacapPreBuilt_BasicFeatures.tam. To control which extractors run, add tam files to the \rrs\aql folder or remove them from it.
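
Because every file in the folder is loaded, it can help to review the folder contents before a run. The following Python sketch lists the tam and dictionary files directly under a folder. It is only an illustration: the function name is hypothetical, the .dict extension is taken from the *.DICT files mentioned in the configuration steps, and table files are omitted because their extension is not stated here.

```python
from pathlib import Path


def list_aql_files(aql_dir):
    """Group the files directly under aql_dir by extension (case-insensitive)."""
    # Assumption: dictionaries use the .dict extension; table files are not listed.
    groups = {".tam": [], ".dict": []}
    for entry in sorted(Path(aql_dir).iterdir()):
        ext = entry.suffix.lower()
        if entry.is_file() and ext in groups:
            groups[ext].append(entry.name)
    return groups
```

Pass the full path of the rrs\aql folder of your installation as aql_dir.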

All of the loaded extractors are run on the document or page.
Important: The results from ExtractText are saved to the layout XML file, for example tm000001_layout.xml, which can be opened in a text editor to see the available entities and the entity fields that can be copied into document hierarchy fields. You can search for an entity in the layout XML using its name such as Address.Address.
Important: This action requires a 32-bit Java runtime environment. The default location is \Datacap\dcshared\jre; otherwise, the path specified in the JAVA_HOME system variable is used.
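
To check which Java location would apply on a given machine, a small script can mirror the lookup described above. This Python sketch assumes the order is "default folder first, then JAVA_HOME", which is one reading of the statement above and not a confirmed description of how Datacap resolves the runtime. The function name is hypothetical, and the sketch does not verify that the runtime is 32-bit.

```python
import os
from pathlib import Path


def resolve_jre(default_dir, env=None):
    """Return the JRE folder that would be used, or None if neither exists."""
    env = os.environ if env is None else env
    default = Path(default_dir)
    # Assumed order: the default location wins when it exists.
    if default.is_dir():
        return default
    java_home = env.get("JAVA_HOME")
    if java_home and Path(java_home).is_dir():
        return Path(java_home)
    return None
```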

ExtractText requires a previously created layout file (for example: tm000001_layout.xml) where text is grouped into blocks. See DocumentAnalytics actions for information on the layout XML file.
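
As an alternative to opening the layout XML file in a text editor, the search for an entity name can be scripted. The Python sketch below does a plain text search and makes no assumption about the XML schema. The function name is hypothetical, and UTF-8 encoding is an assumption (undecodable bytes are replaced rather than raising an error).

```python
def find_entity_lines(layout_path, entity_name):
    """Return (line number, stripped line) pairs that mention entity_name."""
    hits = []
    # Assumption: the layout file can be read as UTF-8 text.
    with open(layout_path, encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if entity_name in line:
                hits.append((number, line.strip()))
    return hits
```

For example, find_entity_lines("tm000001_layout.xml", "Address.Address") lists every line of that file that contains the entity name.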

Example

The following example populates the city field with the first instance of an address where the state is California.

ExtractText(DateTime.DateTime,Address.Address)
FindExtractedText(@P\City,First,Address.Address,city,stateorprovince,(California)|(CA))

Support for external dictionaries

The ExtractText action supports AQL external dictionaries. With this feature, you can write annotators that do not need to be recompiled when a change is needed.

You can export from the IBM® InfoSphere® BigInsights web tools and place the exported folders in the rrs\aql\src location. The AQL is compiled at run time.

Note: You need to manually copy any external dictionaries and tables to the \rrs\aql location.

It is recommended that you keep a backup of the rrs folder in case a file or folder is corrupted during copying or because of misconfiguration.
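
One possible way to take such a backup before copying any files is sketched below in Python. The function name and the timestamped folder name are illustrative choices, not part of Datacap.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_folder(rrs_dir, backup_root):
    """Copy rrs_dir into backup_root under a timestamped name and return the copy."""
    rrs_dir = Path(rrs_dir)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Hypothetical naming scheme: <folder name>_backup_<timestamp>
    target = Path(backup_root) / f"{rrs_dir.name}_backup_{stamp}"
    shutil.copytree(rrs_dir, target)
    return target
```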

Detailed steps to configure custom annotators in Datacap

Complete the following steps.

  1. After you create a custom extractor with the BigInsights web tool, export the extractor as "Executables" with the option to include "Source files". Export in zip format.
  2. Copy the TAM files from the export to the \rrs\aql folder. Do NOT copy the InputDocumentProcessor.TAM file; leave the original in the rrs\aql folder.
  3. Copy the SRC folder from the export (the one in the exported zip) to rrs\aql.
  4. Copy all the supported *.DICT files that are provided by BigInsights to the rrs\aql folder.
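
Steps 2 through 4 can be scripted. The Python sketch below assumes that the exported zip has already been extracted and that the TAM files, the DICT files, and the SRC folder sit at the top level of the extracted folder. The actual export layout might differ, so treat the function name and that layout as assumptions.

```python
import shutil
from pathlib import Path

# The original copy of this file in rrs\aql must be left in place.
PROTECTED = "inputdocumentprocessor.tam"


def install_export(export_dir, aql_dir):
    """Copy TAM and DICT files plus the SRC folder from export_dir to aql_dir."""
    export_dir, aql_dir = Path(export_dir), Path(aql_dir)
    copied = []
    # Steps 2 and 4: copy top-level TAM and DICT files, skipping the protected TAM.
    for path in sorted(export_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in (".tam", ".dict"):
            continue
        if path.name.lower() == PROTECTED:
            continue
        shutil.copy2(path, aql_dir / path.name)
        copied.append(path.name)
    # Step 3: copy the SRC folder (matched case-insensitively) to aql_dir\src.
    src = next((p for p in sorted(export_dir.iterdir())
                if p.is_dir() and p.name.lower() == "src"), None)
    if src is not None:
        shutil.copytree(src, aql_dir / "src", dirs_exist_ok=True)
        copied.append("src")
    return copied
```

The dirs_exist_ok argument requires Python 3.8 or later. The returned list names what was copied, which makes it easy to confirm that InputDocumentProcessor.TAM was skipped.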

Custom extractors must be called with the ExtractText action in the format Module.Viewname.

Verify the module name in the corresponding aql file.

For example: ZIPCODE_BasicFeatures.ZCView
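
The Module.Viewname format can be checked before the names are passed to ExtractText. The Python sketch below validates each name and joins the names into the comma-separated list that the extractors parameter expects. Both function names are hypothetical, and the check assumes that a name contains exactly one period.

```python
def split_extractor_name(name):
    """Split 'Module.Viewname' into (module, view); raise ValueError if malformed."""
    module, sep, view = name.strip().partition(".")
    # Assumption: exactly one period separates the module from the view name.
    if not sep or not module or not view or "." in view:
        raise ValueError(f"expected Module.Viewname, got {name!r}")
    return module, view


def build_extractors_argument(names):
    """Validate every name and return the comma-separated extractors string."""
    for name in names:
        split_extractor_name(name)
    return ",".join(name.strip() for name in names)
```

For example, build_extractors_argument(["DateTime.DateTime", "Address.Address"]) produces the same argument string that is used in the example earlier in this topic.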

After the compilation process completes, the compiled TAM files are saved in the \rrs\aql location. Ensure that you remove the rrs\aql\src folders after the compilation process completes.

List of pre-built extractors

Each of the following Datacap extractor names consists of two parts, separated by a period: the InfoSphere BigInsights extractor name followed by the InfoSphere BigInsights attribute name.

See the IBM InfoSphere BigInsights documentation for more information about pre-built extractors.

http://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.text.doc/doc/ana_txtan_extractor-libraries.html

Address.Address
City.City 
Continent.Continent
Country.Country
Date.Dates
DateTime.DateTime
EmailAddress.EmailAddress
Facility.Facility
FinancialAnnouncements.CompanyEarningsAnnouncement
FinancialAnnouncements.AnalystEarningsEstimate
FinancialAnnouncements.CompanyEarningsGuidance 
FinancialEvents.Alliance
FinancialEvents.Acquisition 
FinancialEvents.JointVenture
FinancialEvents.Merger 
Location.Location
NotesEmailAddress.NotesEmailAddress 
Organization.Organization
Person.Person 
PhoneNumber.PhoneNumber  
StateOrProvince.StateOrProvince
URL.URL
WaterBody.WaterBody
ZipCode.ZipCode

BigInsightsChineseNER.PersonChinese
BigInsightsChineseNER.LocationChinese
BigInsightsChineseNER.OrganizationChinese

BigInsightsJapaneseNER.PersonJapanese
BigInsightsJapaneseNER.LocationJapanese
BigInsightsJapaneseNER.OrganizationJapanese