Document Settings for Fields Tab

Structured Text Formatting

If you want to skip all or part of the extraction process because you have structured data or want to impose rules on how to handle the text, use the Structured text document type option and declare the fields or tags containing the text in the Structured Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within the declared fields or tags (and child tags). Any undeclared field or tag will be ignored.

In certain contexts, linguistic processing is not required, and the linguistic extraction engine can be replaced by explicit declarations. For example, in a bibliography file where keywords are delimited by a separator such as a semicolon (;) or comma (,), it is sufficient to extract the string between two separators. In such cases, you can suspend the full extraction process and instead define special handling rules that declare term separators, assign types to the extracted text, or impose a minimum frequency count for extraction.

Use the following rules when declaring structured text elements:

  • Only one field, tag, or element can be declared per line. A declared field or tag does not have to be present in the data.
  • Declarations are case sensitive.
  • If declaring a tag that has attributes, such as <title id="1234">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title
  • Add a colon after the field or tag name to indicate that this is structured text. Add this colon directly after the field or tag but before any separators, types, or frequency values, such as author: or <place>:.
  • To indicate that multiple terms are contained in the field or tag and that a separator is being used to designate the individual terms, declare the separator after the colon, such as author:, or <section>:;.
  • To assign a type to the content found in the field or tag, declare the type name after the colon and a separator, such as author:,Person or <place>:;Location. Declare the type using its name exactly as it appears in the Resource Editor.
  • To define a minimum frequency count for a field or tag, declare a number n at the end of the line, such as author:,Person1 or <place>:;Location5. Terms found in the field or tag must then occur at least n times in the entire set of documents or records to be extracted. A frequency count requires that you also define a separator.
  • If you have a tag that contains a colon, you must precede the colon with a backslash character so that the declaration is not ignored. For example, if you have a field called <topic:source>, enter it as <topic\:source>.
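The declaration rules above can be read as a small grammar: a name, an optional colon, then an optional separator, type name, and frequency count. The following Python sketch shows one way such a line might be split into its parts. It is an illustration only, not the product's parser: the names Declaration and parse_declaration are hypothetical, and it assumes that a separator is a single character and that type names do not end in a digit.

```python
import re
from typing import NamedTuple, Optional


class Declaration(NamedTuple):
    """Parts of one declaration line (hypothetical structure)."""
    name: str                  # field or tag, e.g. "author" or "<title"
    separator: Optional[str]   # single-character term separator, if any
    type_name: Optional[str]   # type to assign, e.g. "Person"
    min_freq: Optional[int]    # minimum frequency count, if any


# A colon that is NOT preceded by a backslash ends the field or tag name.
_UNESCAPED_COLON = re.compile(r"(?<!\\):")
# After the separator: an optional type name, then optional trailing digits.
_TAIL = re.compile(r"^(?P<type>.*?)(?P<freq>\d+)?$")


def parse_declaration(line: str) -> Declaration:
    line = line.strip()
    colon = _UNESCAPED_COLON.search(line)
    if colon is None:
        # No colon at all: a plain field or tag declaration.
        return Declaration(line.replace("\\:", ":"), None, None, None)

    name = line[:colon.start()].replace("\\:", ":")
    rest = line[colon.end():]
    if not rest:
        # "abstract:" declares the field with no separator, type, or count.
        return Declaration(name, None, None, None)

    # Assumption: the first character after the colon is the separator.
    separator, tail = rest[0], rest[1:]
    parts = _TAIL.match(tail)
    type_name = parts.group("type") or None
    freq = parts.group("freq")
    return Declaration(name, separator, type_name, int(freq) if freq else None)


print(parse_declaration("author:,Person1"))
print(parse_declaration(r"<topic\:source>:;"))
```

Under these assumptions, author:,Person1 splits into the name author, the separator ",", the type Person, and a minimum frequency of 1, while the escaped colon in <topic\:source> stays part of the tag name.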

To illustrate the syntax, let's assume you have the following recurring bibliographic fields:

		author:Morel, Kawashima
		abstract:This article describes how fields are declared.
		publication:Text Mining Documentation
		datepub:March 2010

For this example, if we wanted the extraction process to focus on author and abstract but ignore the rest of the content, we would declare only the following fields:

		author:,Person1
		abstract:

In this example, the author:,Person1 field declaration states that linguistic processing is suspended for the contents of the author field. Instead, the declaration states that the field contains one or more names separated by commas, that each name is assigned to the Person type, and that a name is extracted if it occurs at least once in the entire set of documents or records. Since the field abstract: is listed without any other declarations, it is scanned during extraction, and standard linguistic processing and typing are applied to it.
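To make the effect of a declaration such as author:,Person1 concrete, the following Python sketch splits a field on its separator, counts each term across all records, and keeps the terms that meet the minimum frequency. It is a rough model only. The function extract_terms is hypothetical, the second record is invented for illustration, and counting every occurrence (rather than, say, one per record) is an assumption.

```python
from collections import Counter


def extract_terms(records, field, separator, type_name, min_freq=1):
    """Split `field` on `separator` in every record and return
    (term, type, count) for terms seen at least `min_freq` times."""
    counts = Counter()
    for record in records:
        for term in record.get(field, "").split(separator):
            term = term.strip()
            if term:
                counts[term] += 1
    return [(term, type_name, n) for term, n in counts.items() if n >= min_freq]


records = [
    # The record from the example above.
    {
        "author": "Morel, Kawashima",
        "abstract": "This article describes how fields are declared.",
        "publication": "Text Mining Documentation",
        "datepub": "March 2010",
    },
    # Hypothetical second record, added only to show the frequency count.
    {"author": "Kawashima", "abstract": "A second, invented record."},
]

# author:,Person1 -> comma separator, Person type, minimum frequency 1
at_least_once = extract_terms(records, "author", ",", "Person", min_freq=1)
# author:,Person2 -> same, but a name must occur at least twice
at_least_twice = extract_terms(records, "author", ",", "Person", min_freq=2)

print(at_least_once)
print(at_least_twice)
```

With a minimum frequency of 1, both Morel and Kawashima are kept. Raising it to 2 keeps only Kawashima, which appears in both records. The publication and datepub fields are never read, mirroring the rule that undeclared fields are ignored.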

XML Text Formatting

If you want to limit the extraction process to only the text within specific XML tags, use the XML text document type option and declare the tags containing the text in the XML Text Formatting section of the Document Settings dialog box. Extracted terms are derived only from the text contained within these tags or their child tags.

Important! If you want to skip the extraction process and impose rules on term separators, assign types to the extracted text, or impose a frequency count for extracted terms, use the Structured text option described above.

Use the following rules when declaring tags for XML text formatting:

  • Only one XML tag per line can be declared.
  • Tag elements are case sensitive.
  • If a tag has attributes, such as <title id="1234">, and you want to include all variations or, in this case, all IDs, add the tag without the attribute or the ending angle bracket (>), such as <title

To illustrate the syntax, let's assume you have the following XML document:

		<section>Rules of the Road
		     <title id="01234">Traffic Signals</title>
		     <p>Road signs are helpful.</p>
		</section>
		<p>Learning the rules is important.</p>

For this example, we will declare the following tags:

		<section>
		<title

In this example, since you have declared the tag <section>, the text in this tag, Rules of the Road, and the text in its nested tags, Traffic Signals and Road signs are helpful, are scanned during the extraction process. However, Learning the rules is important is ignored, since that <p> tag was neither explicitly declared nor nested within a declared tag.
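The scoping behavior described here (text is scanned if its tag is declared or is nested inside a declared tag) can be approximated with Python's standard xml.etree.ElementTree module. The sketch below is an assumption-laden illustration, not the product's implementation: declared_text and the wrapper element are hypothetical, and declared tags are passed as bare names, so "title" matches <title> with any attributes, much like declaring <title.

```python
import xml.etree.ElementTree as ET


def declared_text(xml_fragment, declared_tags):
    """Return the text found in declared tags and their descendants."""
    # The sample has two top-level elements, so wrap it in a
    # hypothetical root element to make it parseable.
    root = ET.fromstring(f"<wrapper>{xml_fragment}</wrapper>")
    found = []

    def walk(element, inside_declared):
        inside = inside_declared or element.tag in declared_tags
        if inside and element.text and element.text.strip():
            found.append(element.text.strip())
        for child in element:
            walk(child, inside)
            # Text that follows a child still belongs to this element.
            if inside and child.tail and child.tail.strip():
                found.append(child.tail.strip())

    walk(root, False)
    return found


document = """
<section>Rules of the Road
     <title id="01234">Traffic Signals</title>
     <p>Road signs are helpful.</p>
</section>
<p>Learning the rules is important.</p>
"""

scanned = declared_text(document, {"section", "title"})
print(scanned)
```

Run against the sample document, the sketch collects the three strings inside <section> and skips the top-level <p>, because that tag is neither declared nor nested in a declared tag.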