Configuration
You can enable and disable the nonlinguistic entity types that you want to extract in the nonlinguistic entity configuration file, which you edit in the Configuration section of the Advanced Resources tab. Disabling the entities that you do not need decreases the processing time required. See the topic About Advanced Resources for more information. If nonlinguistic extraction is enabled, the extraction engine reads this configuration file during the extraction process to determine which nonlinguistic entity types should be extracted.
The syntax for this file is as follows:
#name<TAB>Language<TAB>Code
Column label | Description |
---|---|
#name | The wording by which nonlinguistic entities are referenced in the two other required files for nonlinguistic entity extraction. The names used here are case sensitive. |
Language | The language of the documents. It is best to select the specific language; however, an Any option exists. Possible options are: 0 = Any, which is used whenever a regexp is not specific to a language and could be used in several templates with different languages, for instance for IP addresses, URLs, or email addresses; 1 = French; 2 = English; 4 = German; 5 = Spanish; 6 = Dutch; 8 = Portuguese; 10 = Italian. |
Code | Part-of-speech code. Most entities take a value of “s” except in a few cases. Possible values are: s = stopword; a = adjective; n = noun. If enabled, nonlinguistic entities are first extracted, and the extraction patterns are then applied to identify each entity's role in a larger context. For example, percentages are given a value of “a”. Suppose that 30% is extracted as a nonlinguistic entity. It would be identified as an adjective. Then if your text contained "30% salary increase," the “30%” nonlinguistic entity fits the part-of-speech pattern “ann” (adjective noun noun). |
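As an illustration of the three-column syntax, the following Python sketch reads one configuration line and decodes the Language and Code columns using the values listed in the table. It is not part of the product. The `Entry` type, the `parse_line` function, and the entity name `Percent` are hypothetical names chosen for this example, and treating a leading # as "disabled" follows the formatting rules for this file.

```python
from typing import NamedTuple, Optional

# Values taken from the Language and Code rows of the table.
LANGUAGES = {0: "Any", 1: "French", 2: "English", 4: "German",
             5: "Spanish", 6: "Dutch", 8: "Portuguese", 10: "Italian"}
POS_CODES = {"s": "stopword", "a": "adjective", "n": "noun"}


class Entry(NamedTuple):  # hypothetical container for one entry
    name: str
    language: str
    pos: str
    enabled: bool


def parse_line(line: str) -> Optional[Entry]:
    """Decode one TAB-separated line; return None if it is not an entry."""
    text = line.rstrip("\n")
    enabled = not text.startswith("#")      # leading # means disabled
    fields = text.lstrip("#").split("\t")
    if len(fields) != 3:
        return None
    name, lang, code = fields
    if not lang.isdigit() or int(lang) not in LANGUAGES or code not in POS_CODES:
        return None                         # for example, the column header line
    return Entry(name, LANGUAGES[int(lang)], POS_CODES[code], enabled)


# "Percent" is an assumed entity name used only for illustration.
print(parse_line("Percent\t0\ta"))
print(parse_line("#Gene\t2\ts"))
print(parse_line("#name\tLanguage\tCode"))
```

The header line is rejected because its second column is not one of the numeric language options, so only real entries are returned.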
Order in Defining Entities
The order in which the entities are declared in this file is important and affects how they are extracted. They are applied in the order listed. Changing the order will change the results. The most specific nonlinguistic entities must be defined before more general ones.
For example, the nonlinguistic entity “Aminoacid” is defined by:
regexp1=($(AA)-?$(NUM))
where $(AA) corresponds to “(ala|arg|asn|asp|cys|gln|glu|gly|his|ile|leu|lys|met|phe|pro|ser)”, which are specific 3-letter sequences corresponding to particular amino acids.
On the other hand, the nonlinguistic entity "Gene" is more general and is defined by:
regexp1=p[0-9]{2,3}
regexp2=[a-z]{2,4}-?[0-9]{1,3}-?[r]
regexp3=[a-z]{2,4}-?[0-9]{1,3}-?p?
If "Gene" is defined before "Aminoacid" in the Configuration section, then "Aminoacid" will never be matched, since regexp3 from "Gene" will always match first.
Formatting Rules for Configuration
- Use a TAB character to separate each entry in a column.
- Do not delete any lines.
- Respect the syntax shown in the preceding table.
- To disable an entry, place a # symbol at the beginning of that line. To enable an entry, remove the # symbol from the beginning of that line.
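If you maintain a copy of this file with a script, the rules above can be followed as in this minimal Python sketch. The `set_enabled` function and the sample lines are hypothetical. The sketch only adds or removes the leading # on the matching line, keeps every line and its TAB separators, and compares names case-sensitively.

```python
def set_enabled(lines, name, enabled):
    """Return a copy of `lines` with the entry `name` enabled or disabled.

    No line is deleted or reordered; only the leading # of the
    matching line changes.
    """
    result = []
    for line in lines:
        bare = line.lstrip("#")
        if bare.split("\t")[0] == name:   # the first column is the entity name
            line = bare if enabled else "#" + bare
        result.append(line)
    return result


# Sample lines are assumptions made for illustration.
config = ["Aminoacid\t0\ts", "Gene\t0\ts"]
disabled = set_enabled(config, "Gene", False)
restored = set_enabled(disabled, "Gene", True)
print(disabled)
print(restored)
```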