Text Mining Node: Fields Tab

Use the Fields tab to specify the field settings for the data from which concepts will be extracted. When working with large datasets, consider placing a Sample node upstream from this node to reduce processing time. See the topic Sampling Upstream to Save Time for more information.

You can set the following parameters:

Text field. Select the field containing the text to be mined, the document pathname, or the directory pathname to documents. This field depends on the data source.

Text field represents. Indicate what the text field specified in the preceding setting contains. Choices are:

Actual text. Select this option if the field contains the exact text from which concepts should be extracted.
Pathnames to documents. Select this option if the field contains one or more pathnames to the locations where the text documents reside.

Document type. This option is available only if you specified that the text field represents Pathnames to documents. Document type specifies the structure of the text. Select one of the following types:

Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. Unlike the other options, there are no additional settings for this option.
Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box. See the topic Document Settings for Fields Tab for more information.
XML text. Use to specify the XML tags that contain the text to be extracted. All other tags are ignored. If you select this option, you must click the Settings button and explicitly specify the XML elements containing the text to be read during the extraction process in the XML Text Formatting area of the Document Settings dialog box. See the topic Document Settings for Fields Tab for more information.
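
The XML text behavior described above, where only the specified elements are read and all other tags are ignored, can be illustrated with a short sketch. This is not the extraction engine's implementation. It uses Python's standard xml.etree.ElementTree module, and the function name, sample record, and element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_text_from_elements(xml_source, element_names):
    """Return the text of the named elements; all other tags are ignored.

    Illustrative only: a stand-in for listing XML elements in the
    XML Text Formatting area, not the product's own parser.
    """
    root = ET.fromstring(xml_source)
    wanted = set(element_names)
    texts = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also collects text nested inside child elements.
            text = "".join(element.itertext()).strip()
            if text:
                texts.append(text)
    return texts


# Hypothetical sample record with three elements.
sample = """<record>
  <title>Quarterly report</title>
  <author>J. Smith</author>
  <body>Sales rose in the third quarter.</body>
</record>"""

# Only <title> and <body> are read; <author> is ignored.
print(extract_text_from_elements(sample, ["title", "body"]))
```

In this sketch, an element that is not listed contributes nothing to the extracted text, which mirrors the requirement to specify explicitly each XML element that should be read.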

Textual unity. This option is available only if you specified that the text field represents Pathnames to documents and selected Full text as the document type. Select the extraction mode from the following:

Document mode. Use for documents that are short and semantically homogeneous, such as articles from news agencies.
Paragraph mode. Use for Web pages and nontagged documents. The extraction process semantically divides the documents, taking advantage of characteristics such as internal tags and syntax. If this mode is selected, scoring is applied paragraph by paragraph. For example, the rule apple & orange is true only if apple and orange are found in the same paragraph.
Note: Paragraph mode does not work on PDF documents. The way text is extracted from PDF documents suppresses the carriage return marker, so paragraphs cannot be identified.
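
The difference between scoring a whole document and scoring paragraph by paragraph can be sketched as follows. This is a simplified illustration, not the product's scoring logic: it assumes paragraphs are separated by blank lines and treats the rule apple & orange as a plain check that both words occur. The function names are hypothetical.

```python
import re


def contains_all(text, terms):
    """True if every term occurs as a whole word in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term in words for term in terms)


def matches_in_some_paragraph(document, terms):
    """True only if a single paragraph contains all of the terms.

    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return any(contains_all(p, terms) for p in paragraphs)


doc = "I bought an apple today.\n\nThe orange was too sour."
rule = ["apple", "orange"]

print(contains_all(doc, rule))               # True: both words occur in the document
print(matches_in_some_paragraph(doc, rule))  # False: they are in different paragraphs
```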

Paragraph mode settings. This option is available only if you specified that the text field represents Pathnames to documents and set the textual unity option to Paragraph mode. Specify the minimum and maximum character thresholds to be used in any extraction. The actual extraction size is rounded up or down to the nearest period. To ensure that the word associations produced from the text of the document collection are representative, avoid specifying an extraction size that is too small.

Minimum. Specify the minimum number of characters to be used in any extraction.
Maximum. Specify the maximum number of characters to be used in any extraction.
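
One way to picture rounding up or down to the nearest period is the sketch below. How the extraction engine actually applies the thresholds is not documented here, so this is an assumption made for illustration: given a requested character position, the cut moves to whichever period is closest. The function name is hypothetical.

```python
def snap_to_nearest_period(text, position):
    """Return the cut index just after the period closest to `position`.

    Illustrative assumption only; not the extraction engine's algorithm.
    """
    before = text.rfind(".", 0, position)  # last period before position, or -1
    after = text.find(".", position)       # first period at or after position, or -1
    if before == -1 and after == -1:
        return len(text)                   # no period at all: keep the text whole
    if before == -1:
        return after + 1
    if after == -1:
        return before + 1
    # Choose whichever period is closer to the requested position.
    if position - before <= after - position:
        return before + 1                  # round down
    return after + 1                       # round up


text = "One two. Three four five. Six."

print(text[:snap_to_nearest_period(text, 10)])  # One two.
print(text[:snap_to_nearest_period(text, 20)])  # One two. Three four five.
```

In this sketch a requested size of 10 characters is rounded down to 8 and a requested size of 20 is rounded up to 25, so the segment always ends at a sentence boundary.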

Input encoding. This option is available only if you indicated that the text field represents Pathnames to documents. It specifies the default text encoding. For all languages except Japanese, the text is converted from the specified or recognized encoding to ISO-8859-1. Even if you specify another encoding, the extraction engine converts the text to ISO-8859-1 before processing it. Any characters that cannot be represented in ISO-8859-1 are converted to spaces. For Japanese text, you can choose one of several encoding options: SHIFT_JIS, EUC_JP, UTF-8, or ISO-2022-JP.
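
To preview the effect of this conversion on non-Japanese text, the following sketch replaces every character that has no ISO-8859-1 representation with a space. It approximates the documented behavior with Python's built-in iso-8859-1 codec and is not the extraction engine's own converter. The function name is hypothetical.

```python
def to_iso_8859_1_with_spaces(text):
    """Replace characters that ISO-8859-1 cannot represent with spaces."""
    converted = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
            converted.append(ch)
        except UnicodeEncodeError:
            # No ISO-8859-1 representation: substitute a space.
            converted.append(" ")
    return "".join(converted)


# "\u00e9" (e with acute accent) exists in ISO-8859-1 and is kept.
# "\u20ac" (euro sign) does not exist in ISO-8859-1 and becomes a space.
original = "caf\u00e9 costs 5\u20ac"
converted = to_iso_8859_1_with_spaces(original)

print(repr(converted))
print(converted.encode("iso-8859-1"))  # now encodes without error
```

Running text through a check like this before extraction shows which characters, such as the euro sign or typographic quotation marks, would be lost as spaces.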

Partition mode. Use the partition mode to choose whether to partition based on the Type node settings or to select another partition. Partitioning separates the data into training and test samples.