Viewing and working in source mode
For each rule and macro the TLA editor generates the underlying source code that is used by the Extractor for matching and producing TLA output. If you prefer to work with the code itself, you can view this source code and edit it directly by clicking the “View Source” button at the top of the Editor. The Source view will jump to and highlight the currently selected rule or macro. However, we recommend using the editor panes to reduce the chance of errors.
When you have finished viewing or editing the source, click Exit Source. If you generate invalid syntax for a rule, you will be required to fix it before you exit the source view.
Macros in the Source View
[macro]
name = macro_name
value = ([type_name|macro_name|literal_string|word_gap])
| Element | Description |
|---|---|
| [macro] | Each macro must begin with the line marked [macro] to denote the beginning of a macro. |
| name | The name of the macro definition. Each name must be unique. |
| value | A combination of one or more types, literal strings, word gaps, or macros. See the topic Supported Elements for Rules and Macros for more information. When combining arguments, you must use parentheses ( ) to group the arguments and the character \| to indicate a Boolean OR. |
In addition to the guidelines and syntax covered in the section on Macros, the source view has a few additional guidelines that aren't required when working in the editor view. Macros must also respect the following when working in source mode:
- Each macro must begin with the line marked [macro] to denote the beginning of a macro.
- To disable an element, place a comment indicator (#) before each line.
Example. This example defines a macro called mTopic. The value for mTopic is the presence of a term matching one of the following types: <Unknown>, <Product>, <Person>, <Location>, <Organization>, <Budget>, or <Currency>.
[macro]
name=mTopic
value=($Unknown|$Product|$Person|$Location|$Organization|$Budget|$Currency)
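To make the structure of a macro definition concrete, the following Python sketch reads a [macro] block and lists its alternatives. It is only an illustration: the function name parse_macro is hypothetical, it handles only a flat group of alternatives in parentheses, and it is not how the Extractor itself reads the source.

```python
def parse_macro(source: str) -> dict:
    """Parse one [macro] block into its name and list of alternatives.

    Hypothetical helper for illustration; supports only a single flat
    group such as ($A|$B|$C), not nested groups or quantities.
    """
    lines = [ln.strip() for ln in source.splitlines()]
    # Drop blank lines and lines disabled with the comment indicator (#).
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if not lines or lines[0] != "[macro]":
        raise ValueError("a macro must begin with the [macro] line")

    fields = {}
    for ln in lines[1:]:
        key, _, val = ln.partition("=")
        fields[key.strip()] = val.strip()

    value = fields["value"]
    # Alternatives are grouped in parentheses and separated by |.
    if not (value.startswith("(") and value.endswith(")")):
        raise ValueError("alternatives must be grouped in parentheses")
    return {"name": fields["name"], "alternatives": value[1:-1].split("|")}


macro = parse_macro("""
[macro]
name=mTopic
value=($Unknown|$Product|$Person|$Location|$Organization|$Budget|$Currency)
""")
print(macro["name"], macro["alternatives"])
```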
Rules in the Source View
[pattern(ID)]
name = pattern_name
value = [$type_name|macro_name|word_gaps|literal_strings]
output = $digit[\t]#digit[\t]$digit[\t]#digit[\t]$digit[\t]#digit[\t]
| Element | Description |
|---|---|
| [pattern(<ID>)] | Indicates the start of a text link analysis rule and provides a unique numerical ID used to determine processing order. |
| name | Provides a unique name for this text link analysis rule. |
| value | Provides the syntax and arguments to be matched to the text. See the topic Supported Elements for Rules and Macros for more information. |
| output | The output format for the resulting matched patterns discovered in the text. The output does not always resemble the exact original position of elements in the source text. Additionally, it is possible to have multiple output lines for a given text link analysis rule by placing each output on a separate line. The output syntax follows the output line of the template above. |
In addition to the guidelines and syntax covered in the section on Rules, the source view has a few additional guidelines that aren't required when working in the editor view. Rules must also respect the following when working in source mode:
- Whenever two or more elements are defined, they must be enclosed in parentheses whether or not they are optional (for example, ($Negative|$Positive) or ($mCoord|$SEP)?). $SEP represents a comma.
- The first element in a text link analysis rule cannot be an optional element. For example, you cannot begin with value = $mTopic? or value = @{0,1}.
- It is possible to associate a quantity (or instance count) to a token. This is useful in writing only one rule that encompasses all cases instead of writing a separate rule for each case. For example, you may use the literal string ($SEP|and) if you are trying to match either , (comma) or and. If you extend this by adding a quantity so that the literal string becomes ($SEP|and){1,2}, you will now match any of the following instances: "," "and" ", and".
- Spaces are not supported between the macro name and the $ and ? characters in the text link analysis rule value.
- Spaces are not supported in the text link analysis rule output.
- To disable an element, place a comment indicator (#) before each line.
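The quantity guideline above can be pictured with a small Python sketch. It is a simplified model and not the extraction engine: text is assumed to be already split into tokens, the set SEP_OR_AND stands in for ($SEP|and), and the function name match_quantified is hypothetical.

```python
def match_quantified(tokens, start, alternatives, low, high):
    """Return every end index reachable by matching between `low` and
    `high` consecutive tokens from `alternatives`, beginning at `start`."""
    ends = [start] if low == 0 else []
    i, count = start, 0
    while count < high and i < len(tokens) and tokens[i] in alternatives:
        i += 1
        count += 1
        if count >= low:
            ends.append(i)
    return ends


SEP_OR_AND = {",", "and"}  # assumed stand-in for ($SEP|and)

samples = (
    ["red", ",", "blue"],
    ["red", "and", "blue"],
    ["red", ",", "and", "blue"],
)
for tokens in samples:
    # Try to match ($SEP|and){1,2} right after the first token.
    print(tokens, match_quantified(tokens, 1, SEP_OR_AND, 1, 2))
```

With the quantity {1,2}, the third sample can be consumed either as "," alone or as ", and", which is why a single rule can cover all three cases.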
Example. Let's suppose your resources contain the following text link analysis rule and that you have enabled the extraction of TLA results:
## Jean Doe was the former HR director of IBM in France
[pattern(201)]
name= 1_201
value = $Person ($SEP|$mDet|$mSupport|as|then){1,2} @{0,1} $Function
(of|with|for|in|to|at) @{0,1} $Organization @{0,2} $Location
output = $1\t#1\t$4\t#4\t$7\t#7\t$9\t#9
Whenever you extract, the extraction engine will read each sentence and will try to match the following sequence:
| Position | Description of the arguments |
|---|---|
| 1 | The name of a person ($Person) |
| 2 | One or two of the following: comma ($SEP), determiner ($mDet), auxiliary verb ($mSupport), the strings “then” or “as” |
| 3 | 0 or 1 word (@{0,1}) |
| 4 | A function ($Function) |
| 5 | One of the following strings: “of”, “with”, “for”, “in”, “to”, or “at” |
| 6 | 0 or 1 word (@{0,1}) |
| 7 | The name of an organization ($Organization) |
| 8 | 0, 1, or 2 words (@{0,2}) |
| 9 | The name of a location ($Location) |
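The position numbers in the table above can be reproduced by splitting the rule value on whitespace. The Python sketch below assumes that no element contains an internal space, which holds for this rule but is not guaranteed for every rule; the helper is_optional is a hypothetical, rough check for the guideline that the first element cannot be optional.

```python
import re

value = (
    "$Person ($SEP|$mDet|$mSupport|as|then){1,2} @{0,1} $Function "
    "(of|with|for|in|to|at) @{0,1} $Organization @{0,2} $Location"
)

# Each whitespace-separated element gets a 1-based position.
elements = value.split()
positions = {i: el for i, el in enumerate(elements, start=1)}


def is_optional(element: str) -> bool:
    """Rough heuristic: optional if it ends in ? or allows zero instances."""
    return element.endswith("?") or re.search(r"\{0,\d+\}$", element) is not None


for i, el in positions.items():
    print(i, el)
print("first element optional:", is_optional(elements[0]))
```

These positions are the numbers that the output line refers to with $ and #.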
This sample text link analysis rule would match sentences or phrases like:
Jean Doe, the HR director of IBM in France
Jean Doe was the former HR director of IBM in France
IBM appointed Jean Doe as the HR director of IBM in France
This sample text link analysis rule would produce the following output:
jean doe <Person> hr director <Function> ibm <Organization> france <Location>
Where:
- jean doe is the term corresponding to $1 (the first element in the text link analysis rule) and <Person> is the type for jean doe (#1),
- hr director is the term corresponding to $4 (the 4th element in the text link analysis rule) and <Function> is the type for hr director (#4),
- ibm is the term corresponding to $7 (the 7th element in the text link analysis rule) and <Organization> is the type for ibm (#7),
- france is the term corresponding to $9 (the 9th element in the text link analysis rule) and <Location> is the type for france (#9).
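The way the output line maps positions to terms and types can be sketched in Python as follows. This is an illustration under stated assumptions, not the Extractor's implementation: the function name render_output and the matched dictionary are hypothetical, and the matched values are simply copied from the example above.

```python
import re


def render_output(template: str, matched: dict) -> str:
    """Fill an output template from {position: (term, type_name)}.

    $N is replaced by the term at position N, #N by its type in
    angle brackets.
    """
    def substitute(m):
        marker, position = m.group(1), int(m.group(2))
        term, type_name = matched[position]
        return term if marker == "$" else f"<{type_name}>"

    # In the rule source, the two characters \t act as the tab code.
    fields = template.split(r"\t")
    return "\t".join(re.sub(r"([$#])(\d+)", substitute, f) for f in fields)


# Terms and types for positions 1, 4, 7, and 9, taken from the example.
matched = {
    1: ("jean doe", "Person"),
    4: ("hr director", "Function"),
    7: ("ibm", "Organization"),
    9: ("france", "Location"),
}
line = render_output(r"$1\t#1\t$4\t#4\t$7\t#7\t$9\t#9", matched)
print(line.replace("\t", " "))
```

Positions that the output line does not mention, such as the word gaps at positions 3, 6, and 8, do not appear in the result.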
Rule Sets in the Source View
[set(<ID>)]
Where [set(<ID>)] indicates the start of a rule set and provides a unique numerical ID used to determine the processing order of the sets.
Example. The following sentence contains information about individuals, their function within a company, and also the merger/acquisition activities of that company.
Org1 Inc has entered into a definitive merger agreement with Org2 Ltd, said John Doe, CEO of
Org2 Ltd.
You could write one rule with several outputs to handle all possible outputs, such as:
## Org1 Inc entered into a definitive merger agreement with Org2 Ltd, said John Doe, CEO of Org2 Ltd.
[pattern(020)]
name=020
value = $Organization @{0,4} $ActionNouns @{0,6} $mOrg @{1,2}
$Person @{0,2} $Function @{0,1} $Organization
output = $1\t#1\t$3\t#3\t$5\t#5
output = $7\t#7\t$9\t#9\t$11\t#11
which would produce the following 2 output patterns:
- org1 inc <Organization> + merges with <ActiveVerb> + org2 ltd <Organization>
- john doe <Person> + ceo <Function> + org2 ltd <Organization>
Important! Keep in mind that other linguistic handling operations are performed during the extraction of TLA patterns. In this case, merger is grouped under merges with during the synonym grouping phase of the extraction process. Since merges with belongs to the <ActiveVerb> type, this type name is what appears in the final TLA pattern output. So when the output reads $3\t#3, this means that the pattern will ultimately display the final concept for the third element and the final type for the third element after all linguistic processing is applied (synonyms and other groupings).
Instead of writing complex rules like the preceding one, it can be easier to manage and work with two rules. The first specializes in finding mergers/acquisitions between companies:
[set(1)]
## Org1 Inc has entered into a definitive merger agreement with Org2 Ltd
[pattern(44)]
name=firm + action + firm_0044
value=$mOrg @{0,20} $ActionNouns @{0,6} $mOrg
output(1)=$1\t#1\t$3\t#3\t$5\t#5
which would produce org1 inc <Organization> + merges with <ActiveVerb> + org2 ltd <Organization>
The second is specialized in individual/function/company:
[set(2)]
## said John Doe, CEO of Org2 Ltd
[pattern(52)]
name=individual + role + firm_0007
value=$Person @{0,3} $mFunction (at|of)? ($mOrg|$Media|$Unknown)
output(1)=$1\t#1\t$3\tFunction\t$5\t#5
which would produce john doe <Person> + ceo <Function> + org2 ltd <Organization>