Text Mining Node: Fields Tab
Use the Fields tab to specify the field settings for the data from which you will be extracting concepts. Consider using a Sample node upstream from this node when working with larger datasets to speed processing times. See the topic Sampling Upstream to Save Time for more information.
You can set the following parameters:
ID field. Select the field containing the identifier for the text records. Identifiers must be integers. The ID field serves as an index for the individual text records. Use an ID field if the text field represents the text to be mined.
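Because identifiers must be integers, it can help to check the candidate ID field before the data reaches this node. The following is a minimal sketch in plain Python, not part of the product: the function name `find_non_integer_ids`, the record layout, and the field name `id` are all assumptions made for illustration.

```python
def find_non_integer_ids(records, id_field):
    """Return the values of id_field that are not integers."""
    bad = []
    for record in records:
        value = record.get(id_field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            bad.append(value)
    return bad


# Hypothetical sample records; only the first has a valid integer ID.
records = [
    {"id": 1, "text": "first comment"},
    {"id": "A2", "text": "second comment"},
    {"id": 3.5, "text": "third comment"},
]
print(find_non_integer_ids(records, "id"))
```

An empty result suggests the field is safe to use as the ID field; any returned values would need to be recoded first.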
Text field. Select the field containing the text to be mined. This field depends on the data source.
Language field. Select the field that contains the two-letter ISO language identifier. If you do not select a field, the language of each document is assumed to be that of the supplied template.
Document type. The document type specifies the structure of the text. Select one of the following types:
- Full text. Use for most documents or text sources. The entire set of text is scanned for extraction. Unlike the other options, there are no additional settings for this option.
- Structured text. Use for bibliographic forms, patents, and any files that contain regular structures that can be identified and analyzed. This document type is used to skip all or part of the extraction process. It allows you to define term separators, assign types, and impose a minimum frequency value. If you select this option, you must click the Settings button and enter text separators in the Structured Text Formatting area of the Document Settings dialog box. See the topic Document Settings for Fields Tab for more information.
Textual unity. Select the extraction mode from the following:
- Document mode. Use for documents that are short and semantically homogeneous, such as articles from news agencies.
- Paragraph mode. Use for Web pages and nontagged documents. The extraction process semantically divides the documents, taking advantage of characteristics such as internal tags and syntax. If this mode is selected, scoring is applied paragraph by paragraph. Therefore, for example, the rule apple & orange is true only if apple and orange are found in the same paragraph.
Note: Due to the way text is extracted from PDF documents, Paragraph mode does not work on these documents. This is because the extraction suppresses the carriage return marker.
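The paragraph-by-paragraph scoring described above can be illustrated with a small sketch. This is not the product's extraction engine; it is plain Python that assumes paragraphs are separated by blank lines and that a rule is a simple list of required words. The function name `rule_matches` is hypothetical.

```python
import re


def rule_matches(document, terms):
    """True if some single paragraph contains every term (illustrative AND rule)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", document)
    for paragraph in paragraphs:
        words = set(re.findall(r"\w+", paragraph.lower()))
        if all(term in words for term in terms):
            return True
    return False


same_paragraph = "I bought an apple and an orange."
two_paragraphs = "I bought an apple.\n\nLater I ate an orange."

print(rule_matches(same_paragraph, ["apple", "orange"]))  # both words share a paragraph
print(rule_matches(two_paragraphs, ["apple", "orange"]))  # words are split across paragraphs
```

The second document contains both words, but never in the same paragraph, so under paragraph-level scoring the rule does not fire. The sketch also shows why lost carriage returns matter: without paragraph breaks, the whole document collapses into one paragraph.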
Paragraph mode settings. This option is available only if you set the textual unity option to Paragraph mode. Specify the character thresholds to be used in any extraction. The actual size is rounded up or down to the nearest period. To ensure that the word associations produced from the text of the document collection are representative, avoid specifying an extraction size that is too small.
- Minimum. Specify the minimum number of characters to be used in any extraction.
- Maximum. Specify the maximum number of characters to be used in any extraction.
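To make the effect of the thresholds concrete, here is one possible reading of "rounded up or down to the nearest period", sketched in plain Python. The product's actual algorithm is not documented here, so the logic below is an assumption: each cut is aimed at the maximum size and then moved to the closest period, provided the segment keeps at least the minimum number of characters. The function name `split_by_thresholds` is hypothetical.

```python
def split_by_thresholds(text, minimum, maximum):
    """Split text into segments near `maximum` characters, cutting at periods.

    Hypothetical interpretation: aim each cut at `maximum` characters, then
    move it to the nearest period that leaves at least `minimum` characters.
    """
    if not 1 <= minimum <= maximum:
        raise ValueError("expected 1 <= minimum <= maximum")
    segments = []
    start = 0
    while len(text) - start > maximum:
        target = start + maximum
        # Nearest period before the target that keeps the segment >= minimum.
        before = text.rfind(".", start + minimum - 1, target)
        # Nearest period at or after the target (rounds the size up).
        after = text.find(".", target)
        candidates = [p for p in (before, after) if p != -1]
        if candidates:
            cut = min(candidates, key=lambda p: abs(p - target)) + 1
        else:
            cut = target  # no period available: fall back to a hard cut
        segment = text[start:cut].strip()
        if segment:
            segments.append(segment)
        start = cut
    tail = text[start:].strip()
    if tail:
        segments.append(tail)  # the last segment may be shorter than minimum
    return segments


sample = "One two. Three four five. Six."
print(split_by_thresholds(sample, minimum=5, maximum=12))
```

With very small thresholds, segments shrink to single sentences, which is why an extraction size that is too small can make word associations less representative.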
Partition mode. Use the partition mode to choose whether to partition based on the Type node settings or to select another partition. Partitioning separates the data into training and test samples.
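As a rough illustration of what partitioning does to the records, the sketch below groups rows by the value of a partition field. It is plain Python, not product code; the field name `partition` and the labels `training` and `test` are assumptions chosen for the example.

```python
from collections import defaultdict


def group_by_partition(records, partition_field):
    """Group records by the value of their partition field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_field]].append(record)
    return dict(groups)


# Hypothetical records with a partition label already assigned upstream.
rows = [
    {"id": 1, "text": "good service", "partition": "training"},
    {"id": 2, "text": "slow delivery", "partition": "test"},
    {"id": 3, "text": "friendly staff", "partition": "training"},
]
samples = group_by_partition(rows, "partition")
print(len(samples["training"]), len(samples["test"]))
```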