Nonlinguistic Entities

When working with certain kinds of data, you might want to extract dates, social security numbers, percentages, or other nonlinguistic entities. These entities are explicitly declared in the configuration file, where you can enable or disable each one. See the topic Configuration for more information. To optimize the output from the extraction engine, the input from nonlinguistic processing is normalized so that like entities are grouped according to predefined formats. See the topic Normalization for more information.

Note: You can turn nonlinguistic entity extraction on or off in the extraction settings.

Available Nonlinguistic Entities

The nonlinguistic entities in the following table can be extracted. The type name is in parentheses.

Table 1. Nonlinguistic entities that can be extracted
Nonlinguistic entity (Type name)
Addresses (<Address>)
Amino acids (<Aminoacid>)
Currencies (<Currency>)
Dates (<Date>)
Delay (<Delay>)
Digits (<Digit>)
E-mail addresses (<email>)
HTTP/URL addresses (<url>)
IP addresses (<IP>)
Organizations (<Organization>)
Percentages (<Percent>)
Products (<Product>)
Proteins (<Gene>)
Phone numbers (<PhoneNumber>)
Times (<Time>)
U.S. social security numbers (<SocialSecurityNumber>)
Weights and measures (<Weights-Measures>)
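
To make the idea of typed nonlinguistic entities concrete, the following Python sketch matches three of the types in the table with regular expressions. It is an illustration only: the patterns, the PATTERNS dictionary, and the extract_entities function are hypothetical and are not the definitions used by the extraction engine. The optional enabled argument loosely mirrors enabling or disabling individual entities in the configuration file.

```python
import re

# Hypothetical, simplified patterns keyed by type name from the table.
# These are assumptions for illustration, not the engine's own definitions.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "Percent": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
}


def extract_entities(text, enabled=None):
    """Return (type_name, matched_text) pairs for the enabled entity types."""
    type_names = list(PATTERNS) if enabled is None else list(enabled)
    found = []
    for type_name in type_names:
        for match in PATTERNS[type_name].finditer(text):
            found.append((type_name, match.group()))
    return found


sample = "Mail admin@example.com from 192.168.0.1; usage rose 12.5%."
all_entities = extract_entities(sample)
only_percent = extract_entities(sample, enabled=["Percent"])
```

Here all_entities holds one pair per type, and only_percent holds only the percentage, because the other two types were left out of the enabled list.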

Cleaning Text for Processing

Before nonlinguistic entity extraction occurs, the input text is cleaned. During this step, the following temporary changes are made so that nonlinguistic entities can be identified and extracted as such:

  • Any sequence of two or more spaces is replaced by a single space.
  • Tabs are replaced by a space.
  • A single end-of-line character or sequence is replaced by a space, while multiple consecutive end-of-line sequences are marked as the end of a paragraph. An end of line can be denoted by a carriage return (CR), a line feed (LF), or both together.
  • HTML and XML tags are temporarily stripped and ignored.
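
The cleaning steps above can be sketched in Python as follows. This is a minimal approximation under stated assumptions, not the engine's implementation: the clean_text function, the PARAGRAPH_MARK constant, and the simple tag pattern are all hypothetical. The order of operations is also a choice made for this sketch; spaces are collapsed last so that spaces produced by the tab and end-of-line replacements are collapsed too.

```python
import re

# Hypothetical marker for "end of paragraph"; the real engine's internal
# representation is not documented here.
PARAGRAPH_MARK = "\u2029"


def clean_text(raw):
    """Apply the cleaning rules to a copy of the text; the input is untouched."""
    # Strip HTML and XML tags (simplified pattern, assumed for illustration).
    text = re.sub(r"<[^>]+>", "", raw)
    # Treat CR+LF and a lone CR the same as LF.
    text = re.sub(r"\r\n|\r", "\n", text)
    # Replace each tab with a space.
    text = text.replace("\t", " ")
    # Two or more consecutive line ends mark the end of a paragraph.
    text = re.sub(r"\n{2,}", PARAGRAPH_MARK, text)
    # A single remaining line end becomes a space.
    text = text.replace("\n", " ")
    # Collapse any run of two or more spaces into one space.
    text = re.sub(r" {2,}", " ", text)
    return text


raw = "Call  me\ton <b>Monday</b>\r\nat 5 pm.\n\nNew   paragraph."
cleaned = clean_text(raw)
```

Because clean_text returns a new string, the original text is preserved, which matches the temporary nature of the changes described above.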