Document Settings for Fields Tab
Structured Text Formatting
If you want to skip all or part of the extraction process because you have structured data or want to impose rules on how to handle the text, use the Structured text document type option and declare the fields or tags containing the text in the Structured Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within the declared fields or tags (and child tags). Any undeclared field or tag will be ignored.
In certain contexts, linguistic processing is not required, and the linguistic extraction engine can be replaced by explicit declarations. For example, in a bibliography file where keywords within a field are delimited by a separator such as a semicolon (;) or comma (,), it is sufficient to extract the string between two separators. For this reason, you can suspend the full extraction process and instead define special handling rules to declare term separators, assign types to the extracted text, or impose a minimum frequency count for extraction.
Use the following rules when declaring structured text elements:
- Only one field, tag, or element per line can be declared. They do not have to be present in the data.
- Declarations are case sensitive.
- If declaring a tag that has attributes, such as <title id="1234">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>). For example, declare <title to match every title tag.
- Add a colon after the field or tag name to indicate that this is structured text. Add this colon directly after the field or tag but before any separators, types, or frequency values (for example, author: or <place>:).
- To indicate that multiple terms are contained in the field or tag and that a separator is being used to designate the individual terms, declare the separator after the colon (for example, author:, or <section>:;).
- To assign a type to the content found in the tag, declare the type name after the colon and a separator (for example, author:,Person or <place>:;Location). Declare the type using the names as they appear in the Resource Editor.
- To define a minimum frequency count for a field or tag, declare a number at the end of the line (for example, author:,Person1 or <place>:;Location5). Where n is the frequency count you defined, terms found in the field or tag must occur at least n times in the entire set of documents or records to be extracted. This also requires you to define a separator.
- If you have a tag that contains a colon, you must precede the colon with a backslash character so that the declaration is not ignored. For example, if you have a field called <topic:source>, enter it as <topic\:source>
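The declaration rules above can be sketched as a small parser. This is only an illustration of the syntax, not the product's implementation: the names Declaration and parse_declaration are hypothetical, and the sketch assumes that the separator is a single character and that any trailing digits are the frequency count.

```python
import re
from typing import NamedTuple, Optional


class Declaration(NamedTuple):
    """One parsed declaration line (hypothetical structure for illustration)."""
    name: str
    separator: Optional[str]
    type_name: Optional[str]
    min_frequency: Optional[int]


# The name runs up to the first colon that is not escaped with a backslash.
_DECL = re.compile(r'^(?P<name>(?:\\:|[^:])+):(?P<rest>.*)$')
# Assumption: one separator character, then an optional type, then optional digits.
_REST = re.compile(r'^(?P<sep>.)(?P<type>.*?)(?P<freq>\d+)?$')


def parse_declaration(line: str) -> Declaration:
    match = _DECL.match(line.strip())
    if match is None:
        raise ValueError(f"not a structured text declaration: {line!r}")
    name = match.group('name').replace('\\:', ':')  # unescape \: in the name
    rest = match.group('rest')
    if not rest:
        # Colon only: no separator, type, or frequency declared.
        return Declaration(name, None, None, None)
    parts = _REST.match(rest)
    freq = parts.group('freq')
    return Declaration(
        name,
        parts.group('sep'),
        parts.group('type') or None,
        int(freq) if freq else None,
    )


print(parse_declaration("author:,Person1"))
print(parse_declaration(r"<topic\:source>:"))
```

Note that a type name ending in a digit would be ambiguous under this reading of the syntax; the sketch resolves it by always treating trailing digits as the frequency count.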
To illustrate the syntax, let's assume you have the following recurring bibliographic fields:
author:Morel, Kawashima
abstract:This article describes how fields are declared.
publication:Text Mining Documentation
datepub:March 2010
For this example, if we wanted the extraction process to focus on author and abstract but ignore the rest of the content, we would declare only the following fields:
author:,Person1
abstract:
In this example, the declaration author:,Person1 suspends linguistic processing on the contents of the author field. Instead, it states that the field contains more than one name, that each name is separated from the next by a comma, that these names should be assigned to the Person type, and that a name is extracted if it occurs at least once in the entire set of documents or records. Because abstract: is listed without any other declarations, that field is scanned during extraction, and standard linguistic processing and typing are applied.
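To make the effect of the author:,Person1 declaration concrete, the following sketch mimics it on the sample record. It is an approximation under stated assumptions rather than the product's behavior: the function name extract_separated_terms is hypothetical, records are modeled as plain dictionaries, and whitespace around each term is assumed to be trimmed.

```python
from collections import Counter

# The sample record from the example above, modeled as a dictionary.
records = [
    {
        "author": "Morel, Kawashima",
        "abstract": "This article describes how fields are declared.",
        "publication": "Text Mining Documentation",
        "datepub": "March 2010",
    },
]


def extract_separated_terms(records, field, separator, type_name, min_frequency=1):
    """Split a field on its separator and keep terms that occur often enough."""
    counts = Counter()
    for record in records:
        for term in record.get(field, "").split(separator):
            term = term.strip()  # assumption: surrounding whitespace is dropped
            if term:
                counts[term] += 1
    # Frequency is counted across the entire set of records.
    return [(term, type_name) for term, n in counts.items() if n >= min_frequency]


# Equivalent of the declaration author:,Person1
print(extract_separated_terms(records, "author", ",", "Person", 1))
# [('Morel', 'Person'), ('Kawashima', 'Person')]
```

The undeclared publication and datepub fields are never read, which corresponds to undeclared fields being ignored. The abstract field is left out of the sketch because it would go through standard linguistic processing instead.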
XML Text Formatting
If you want to limit the extraction process to only the text within specific XML tags, use the XML text document type option and declare the tags containing the text in the XML Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within these tags or their child tags.
Important! If you want to skip the extraction process and impose rules on term separators, assign types to the extracted text, or impose a frequency count for extracted terms, use the Structured text option described in the Structured Text Formatting section.
Use the following rules when declaring tags for XML text formatting:
- Only one XML tag per line can be declared.
- Tag elements are case sensitive.
- If a tag has attributes, such as <title id="1234">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>). For example, declare <title to match every title tag.
To illustrate the syntax, let's assume you have the following XML document:
<section>Rules of the Road
<title id="01234">Traffic Signals</title>
<p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>
For this example, we will declare the following tags:
<section>
<title
In this example, because the tag <section> is declared, the text in this tag and in its nested tags (Traffic Signals and Road signs are helpful) is scanned during the extraction process. However, the text Learning the rules is important is ignored because its <p> tag was neither explicitly declared nor nested within a declared tag.
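The scoping rule in this example can be approximated with Python's standard xml.etree.ElementTree module. This is a minimal sketch, not the product's parser: the helper names declared_names and collect_text are hypothetical, and the sample is wrapped in a <doc> root element (an assumption) because ElementTree requires a single root.

```python
import xml.etree.ElementTree as ET

# The sample document, wrapped in a <doc> root so that it parses as XML.
SAMPLE = """<doc>
<section>Rules of the Road
<title id="01234">Traffic Signals</title>
<p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>
</doc>"""


def declared_names(declarations):
    """Reduce declarations such as '<section>' or '<title' to bare tag names."""
    return {d.strip().lstrip("<").rstrip(">") for d in declarations}


def collect_text(element, names):
    """Return text from declared elements and everything nested inside them."""
    if element.tag in names:
        # itertext() walks the element and all of its descendants.
        return [t.strip() for t in element.itertext() if t.strip()]
    found = []
    for child in element:
        found.extend(collect_text(child, names))
    return found


root = ET.fromstring(SAMPLE)
texts = collect_text(root, declared_names(["<section>", "<title"]))
print(texts)
# ['Rules of the Road', 'Traffic Signals', 'Road signs are helpful.']
```

The top-level <p> element contributes nothing because it is neither declared nor nested in a declared tag. As a simplification, the sketch matches on tag name only, so it does not distinguish between the <title> and <title forms of a declaration.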