Advanced linguistic settings

When you build categories, you can select from a number of advanced linguistic category building techniques such as concept inclusion and semantic networks (English text only). These techniques can be used individually or in combination with each other to create categories.

Keep in mind that because every dataset is unique, the number of methods and the order in which you apply them may change over time. Since your text mining goals may be different from one set of data to the next, you may need to experiment with the different techniques to see which one produces the best results for the given text data. None of the automatic techniques will perfectly categorize your data; therefore we recommend finding and applying one or more automatic techniques that work well with your data.

The following areas and fields are available within the Advanced Settings: Linguistics dialog box:

Input and Output

Category input. Select what the categories will be built from:

  • Unused extraction results. This option enables categories to be built from extraction results that are not used in any existing categories. This minimizes the tendency for records to match multiple categories and limits the number of categories produced.
  • All extraction results. This option enables categories to be built using any of the extraction results. This is most useful when no or few categories already exist.

Category output. Select the general structure for the categories that will be built:

  • Hierarchical with subcategories. This option enables the creation of subcategories and sub-subcategories. You can set the depth of your categories by choosing the maximum number of levels (Maximum levels created field) that can be created. If you choose 3, categories could contain subcategories and those subcategories could also have subcategories.
  • Flat categories (single level only). This option enables only one level of categories to be built, meaning that no subcategories will be generated.

Grouping Techniques

Each of the techniques available is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of documents or records. You may see a concept in multiple categories or find redundant categories.

Concept Inclusion. This technique builds categories by grouping multiterm concepts (compound words) based on whether the words in one concept are a subset or superset of the words in another. For example, the concept seat would be grouped with safety seat, seat belt, and seat belt buckle. See the topic Concept Inclusion for more information.
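The subset/superset idea behind concept inclusion can be illustrated with a minimal Python sketch. This is not the product's implementation; the function name group_by_inclusion and the whitespace-based word splitting are assumptions made purely for illustration.

```python
def group_by_inclusion(concepts):
    """Group each concept with the longer concepts that contain all of its words.

    Hypothetical helper: an illustration of word-subset grouping only,
    not the algorithm used by the product.
    """
    # Represent every concept as the set of words it contains.
    word_sets = {c: set(c.split()) for c in concepts}
    groups = {}
    for root, root_words in word_sets.items():
        # A concept joins the group when the root's words are a proper subset of its words.
        members = sorted(
            other
            for other, other_words in word_sets.items()
            if other != root and root_words < other_words
        )
        if members:
            groups[root] = members
    return groups


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "engine"]
groups = group_by_inclusion(concepts)
print(groups)
```

In this sketch, seat collects safety seat, seat belt, and seat belt buckle, while engine shares no words with the other concepts and forms no group.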

Semantic Network. This technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique is best when the concepts are known to the semantic network and are not too ambiguous. It is less helpful when text contains specialized terminology or jargon unknown to the network. In one example, the concept granny smith apple could be grouped with gala apple and winesap apple since they are siblings of the granny smith. In another example, the concept animal might be grouped with cat and kangaroo since they are hyponyms of animal. This technique is available for English text only in this release. See the topic Semantic Networks for more information.

Note: The Maximum search distance option is only available if you select Semantic Network.

Maximum search distance. Select how far you want the techniques to search before producing categories. The lower the value, the fewer results you will get; however, these results will be less noisy and more likely to be significantly linked or associated with each other. The higher the value, the more results you might get; however, these results may be less reliable or relevant. While this option is globally applied to all techniques, its effect is greatest on co-occurrences and semantic networks.

Prevent pairing of specific concepts. Select this checkbox to stop the process from grouping or pairing two concepts together in the output. To create or manage concept pairs, click Manage Pairs... See the topic Managing Link Exception Pairs for more information.

Generalize with wildcards where possible. Select this option to allow the product to generate generic rules in categories using the asterisk wildcard. For example, instead of producing multiple descriptors such as [apple tart + .] and [apple sauce + .], using wildcards might produce [apple * + .]. If you generalize with wildcards, you will often get exactly the same number of records or documents as you did before. However, this option has the advantage of reducing the number of category descriptors and simplifying them. Additionally, this option increases the ability to categorize more records or documents using these categories on new text data (for example, in longitudinal/wave studies).
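As a rough illustration of why wildcard generalization helps on new text data, the following Python sketch collapses descriptors that share a first word into a single wildcard pattern. The generalize function and the plain "word *" pattern form are assumptions for illustration; the sketch does not reproduce the product's descriptor syntax (such as the [apple * + .] form) or its actual generalization logic.

```python
from collections import defaultdict
from fnmatch import fnmatch


def generalize(descriptors):
    """Collapse descriptors sharing a first word into one 'word *' pattern.

    Hypothetical helper for illustration only.
    """
    by_head = defaultdict(list)
    for descriptor in descriptors:
        by_head[descriptor.split()[0]].append(descriptor)

    rules = []
    for head, items in by_head.items():
        if len(items) > 1:
            rules.append(f"{head} *")  # several descriptors become one wildcard rule
        else:
            rules.extend(items)        # a lone descriptor is kept unchanged
    return rules


rules = generalize(["apple tart", "apple sauce", "pear cake"])
print(rules)

# A term that did not occur in the original data still matches the wildcard rule.
print(any(fnmatch("apple pie", rule) for rule in rules))
```

Here the two apple descriptors become one rule, and a previously unseen term such as apple pie is still matched, which mirrors the benefit described for longitudinal/wave studies.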

Other Options for Building Categories

In addition to selecting the grouping techniques to apply, you can edit several other build options as follows:

Maximum number of top level categories created. Use this option to limit the number of categories that can be generated when you click the Build Categories button next. In some cases, you might get better results if you set this value high and then delete any of the uninteresting categories.

Minimum number of descriptors and/or subcategories per category. Use this option to define the minimum number of descriptors and subcategories a category must contain in order to be created. This option helps limit the creation of categories that do not capture a significant number of records or documents.

Allow descriptors to appear in more than one category. When selected, this option allows descriptors to be used in more than one of the categories that will be built next. This option is generally selected since items commonly or "naturally" fall into two or more categories, and allowing them to do so usually leads to higher quality categories. If you do not select this option, you reduce the overlap of records in multiple categories, and depending on the type of data you have, this might be desirable. However, with most types of data, restricting descriptors to a single category usually results in a loss of quality or category coverage. For example, suppose you have the concept car seat manufacturer. With this option selected, this concept could appear in one category based on the text car seat and in another one based on manufacturer. But if this option is not selected, although you may still get both categories, the concept car seat manufacturer will only appear as a descriptor in the category it best matches, based on several factors including the number of records in which car seat and manufacturer each occur.

Resolve duplicate category names by. Select how to handle any new categories or subcategories whose names would be the same as existing categories. You can merge the new ones (and their descriptors) with the existing categories that have the same name. Alternatively, you can skip the creation of any new category whose name duplicates that of an existing category.
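The difference between the two duplicate-handling choices can be sketched in a few lines of Python. The resolve_duplicates function, its mode values, and the dictionary representation of categories are all hypothetical and serve only to contrast merging with skipping; they do not describe the product's internals.

```python
def resolve_duplicates(existing, new, mode="merge"):
    """Combine newly built categories with existing ones.

    Hypothetical helper: categories are modeled as {name: [descriptors]}.
    mode="merge" adds new descriptors to the same-named existing category;
    mode="skip" discards any new category whose name already exists.
    """
    if mode not in ("merge", "skip"):
        raise ValueError("mode must be 'merge' or 'skip'")

    result = {name: list(descs) for name, descs in existing.items()}
    for name, descs in new.items():
        if name not in result:
            result[name] = list(descs)      # no name clash: create the category
        elif mode == "merge":
            for d in descs:                 # name clash: fold in unseen descriptors
                if d not in result[name]:
                    result[name].append(d)
        # mode == "skip": the existing category is left untouched
    return result


existing = {"fruit": ["apple"]}
new = {"fruit": ["apple", "pear"], "car": ["seat"]}

merged = resolve_duplicates(existing, new, mode="merge")
skipped = resolve_duplicates(existing, new, mode="skip")
print(merged)
print(skipped)
```

With merging, the existing fruit category gains the pear descriptor; with skipping, it keeps only apple. In both cases the non-duplicate car category is created.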