Configuration

You can enable and disable the nonlinguistic entity types that you want to extract in the nonlinguistic entity configuration file, which you edit in the Configuration section of the Advanced Resources tab. Disabling the entities that you do not need decreases the processing time required. See the topic About Advanced Resources for more information. If nonlinguistic extraction is enabled, the extraction engine reads this configuration file during the extraction process to determine which nonlinguistic entity types should be extracted.

The syntax for this file is as follows:

	#name<TAB>Language<TAB>Code
Table 1. Syntax for configuration file
Column label Description
#name The wording by which nonlinguistic entities will be referenced in the two other required files for nonlinguistic entity extraction. The names used here are case sensitive.
Language The language of the documents. It is best to select the specific language; however, an Any option exists. Possible options are: 0 = Any, which is used whenever a regexp is not specific to a language and could be used in several templates with different languages (for instance, IP addresses, URLs, or email addresses); 1 = French; 2 = English; 4 = German; 5 = Spanish; 6 = Dutch; 8 = Portuguese; 10 = Italian.
Code Part-of-speech code. Most entities take a value of “s”, with a few exceptions. Possible values are: s = stopword; a = adjective; n = noun. If enabled, nonlinguistic entities are extracted first, and the extraction patterns are then applied to identify their role in a larger context. For example, percentages are given a value of “a”. Suppose that 30% is extracted as a nonlinguistic entity. It would be identified as an adjective. Then, if your text contained "30% salary increase," the “30%” nonlinguistic entity fits the part-of-speech pattern “ann” (adjective noun noun).
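
As an illustration of the three-column syntax, the following Python sketch parses one line of the file into its parts. It is not part of the product: the function name parse_line, the EntityEntry class, and the sample entry names are hypothetical, and the sketch assumes that a leading # marks a disabled entry and that the language and code values are limited to those listed in the table.

```python
from dataclasses import dataclass

# Values taken from the syntax table above.
LANGUAGES = {0: "Any", 1: "French", 2: "English", 4: "German",
             5: "Spanish", 6: "Dutch", 8: "Portuguese", 10: "Italian"}
POS_CODES = {"s": "stopword", "a": "adjective", "n": "noun"}


@dataclass
class EntityEntry:
    """One line of the configuration file (hypothetical helper type)."""
    name: str       # case-sensitive entity name
    language: int   # numeric language option
    code: str       # part-of-speech code
    enabled: bool   # False when the line starts with '#'


def parse_line(line: str) -> EntityEntry:
    """Split a 'name<TAB>Language<TAB>Code' line and validate each column."""
    text = line.rstrip("\n")
    enabled = not text.startswith("#")   # assumption: '#' disables the entry
    fields = text.lstrip("#").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 TAB-separated columns, got {len(fields)}")
    name, language, code = fields
    lang = int(language)                 # raises ValueError if not a number
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language option: {lang}")
    if code not in POS_CODES:
        raise ValueError(f"unknown part-of-speech code: {code!r}")
    return EntityEntry(name, lang, code, enabled)


# Illustrative entries only; the names are not taken from a real file.
for sample in ["Percent\t0\ta", "#Aminoacid\t2\ts"]:
    entry = parse_line(sample)
    print(entry.name, LANGUAGES[entry.language], POS_CODES[entry.code], entry.enabled)
```

The name is kept exactly as written because names are case sensitive, and a line separated by spaces instead of TAB characters is rejected rather than guessed at.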

Order in Defining Entities

The order in which the entities are declared in this file is important and affects how they are extracted. Entities are applied in the order listed, so changing the order changes the results. The most specific nonlinguistic entities must be defined before more general ones.

For example, the nonlinguistic entity “Aminoacid” is defined by:

regexp1=($(AA)-?$(NUM))

where $(AA) corresponds to “(ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|phe|pro|ser)”, which are specific 3-letter sequences corresponding to particular amino acids.

On the other hand, the nonlinguistic entity "Gene" is more general and is defined by:

regexp1=p[0-9]{2,3}
regexp2=[a-z]{2,4}-?[0-9]{1,3}-?[r]
regexp3=[a-z]{2,4}-?[0-9]{1,3}-?p?

If "Gene" is defined before "Aminoacid" in the Configuration section, then "Aminoacid" will never be matched, since regexp3 from "Gene" will always match first.
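
The effect of declaration order can be reproduced with a small Python sketch. It only approximates the behavior described above: it uses Python's re module rather than the extraction engine, assumes that $(NUM) stands for one or more digits (the source does not define it), and applies a simple first-match-wins rule over whole tokens. The function name first_match is hypothetical.

```python
import re

# $(AA) as given above; $(NUM) is an assumption (one or more digits).
AA = "(ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|phe|pro|ser)"
NUM = "[0-9]+"

PATTERNS = {
    "Aminoacid": [f"({AA}-?{NUM})"],
    "Gene": [
        r"p[0-9]{2,3}",
        r"[a-z]{2,4}-?[0-9]{1,3}-?[r]",
        r"[a-z]{2,4}-?[0-9]{1,3}-?p?",
    ],
}


def first_match(token, order):
    """Return the first entity in `order` with a pattern matching the whole token."""
    for name in order:
        for pattern in PATTERNS[name]:
            if re.fullmatch(pattern, token):
                return name
    return None


# Specific entity declared first: the amino acid is recognized.
print(first_match("ala-12", ["Aminoacid", "Gene"]))

# General entity declared first: regexp3 of "Gene" claims the same token.
print(first_match("ala-12", ["Gene", "Aminoacid"]))
```

Under these assumptions, the token "ala-12" is labeled "Aminoacid" only when "Aminoacid" is listed before "Gene", while a token such as "p53" is labeled "Gene" in either order.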

Formatting Rules for Configuration

  • Use a TAB character to separate each entry in a column.
  • Do not delete any lines.
  • Respect the syntax shown in the preceding table.
  • To disable an entry, place a # symbol at the beginning of its line. To enable an entry, remove the # character from the beginning of its line.
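
The formatting rules above can be sketched in Python as a helper that toggles a single entry without deleting or reordering any line. The function name set_enabled and the sample entries are hypothetical; the sketch assumes that each line holds TAB-separated columns with the entity name first.

```python
def set_enabled(lines, name, enabled):
    """Return a copy of `lines` with the entry called `name` enabled or disabled.

    Every line is kept and the order is unchanged; only the leading '#'
    of the matching entry is added or removed.
    """
    result = []
    for line in lines:
        bare = line.lstrip("#")              # line without its disable marker
        if bare.split("\t")[0] == name:      # names are case sensitive
            line = bare if enabled else "#" + bare
        result.append(line)
    return result


# Illustrative entries only; the names are not taken from a real file.
config = ["Percent\t0\ta", "#Aminoacid\t2\ts"]
print(set_enabled(config, "Aminoacid", True))
print(set_enabled(config, "Percent", False))
```

Because the helper strips any existing # before adding one, disabling an entry that is already disabled leaves a single # at the start of the line.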