Advanced linguistic settings
When you build categories, you can select from a number of advanced linguistic category building techniques such as concept inclusion and semantic networks (English text only). These techniques can be used individually or in combination with each other to create categories.
Keep in mind that because every dataset is unique, the number of methods and the order in which you apply them may change over time. Since your text mining goals may be different from one set of data to the next, you may need to experiment with the different techniques to see which one produces the best results for the given text data. None of the automatic techniques will perfectly categorize your data; therefore we recommend finding and applying one or more automatic techniques that work well with your data.
The following advanced settings are available for the Use linguistic techniques to build categories option in the category settings.
Category input
Select what the categories will be built from:
- Unused extraction results. This option enables categories to be built from extraction results that aren't used in any existing categories. This minimizes the tendency for records to match multiple categories and limits the number of categories produced.
- All extraction results. This option enables categories to be built using any of the extraction results. This is most useful when no or few categories already exist.
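The difference between the two inputs can be sketched as a set operation. This is a minimal illustration only, not the product's selection logic: it assumes extraction results are plain concept strings and that existing categories map a name to a set of descriptors. The names `select_category_input`, `extraction_results`, and `existing_categories` are hypothetical.

```python
def select_category_input(extraction_results, existing_categories, unused_only=True):
    """Return the concepts that category building would draw from.

    Assumption: a concept counts as "used" if it appears as a descriptor
    in any existing category.
    """
    if not unused_only:
        return set(extraction_results)
    used = set()
    for descriptors in existing_categories.values():
        used.update(descriptors)
    return set(extraction_results) - used


extraction_results = {"seat", "seat belt", "safety seat", "price"}
existing_categories = {"seats": {"seat", "safety seat"}}

unused = select_category_input(extraction_results, existing_categories)
everything = select_category_input(
    extraction_results, existing_categories, unused_only=False
)
```

With the sample data, `unused` contains only the concepts that no existing category references, while `everything` is the full set of extraction results.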
Category output
Select the general structure for the categories that will be built:
- Hierarchical with subcategories. This option creates subcategories and sub-subcategories. You can set the depth of your categories by choosing the maximum number of levels that can be created. For example, if you choose 3, categories could contain subcategories and those subcategories could also have subcategories.
- Flat categories (single level only). This option builds only one level of categories, meaning that no subcategories will be generated.
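To make the meaning of "levels" concrete, the following sketch counts the levels in a nested category tree and trims a tree to a maximum depth. It is an illustration under stated assumptions: the dictionary layout, the function names, and the sample concept names are hypothetical, and the product may simply never create deeper levels rather than trimming them.

```python
def category_depth(category):
    """Count levels in a category tree; a category with no subcategories has depth 1."""
    subs = category.get("subcategories", [])
    return 1 + max((category_depth(sub) for sub in subs), default=0)


def limit_depth(category, max_levels):
    """Return a copy of the tree that keeps at most max_levels levels."""
    if max_levels < 1:
        raise ValueError("max_levels must be at least 1")
    node = {"name": category["name"]}
    if max_levels > 1:
        children = [
            limit_depth(sub, max_levels - 1)
            for sub in category.get("subcategories", [])
        ]
        if children:
            node["subcategories"] = children
    return node


tree = {
    "name": "seat",
    "subcategories": [{
        "name": "seat belt",
        "subcategories": [{
            "name": "seat belt buckle",
            "subcategories": [{"name": "seat belt buckle latch"}],
        }],
    }],
}

three_levels = limit_depth(tree, 3)  # category, subcategory, sub-subcategory
flat = limit_depth(tree, 1)          # single level, no subcategories
```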
Grouping techniques
Each of the techniques available is well suited to certain types of data and situations, but it is often helpful to combine techniques in the same analysis to capture the full range of documents or records. When you combine techniques, you may see the same concept in multiple categories or find redundant categories.
- Group by concept inclusion. This technique builds categories by grouping multiterm concepts (compound words) based on whether they contain words that are subsets or supersets of a word in the other. For example, the concept seat would be grouped with safety seat, seat belt, and seat belt buckle.
- Group by semantic network. This technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique is best when the concepts are known to the semantic network and are not too ambiguous. It is less helpful when text contains specialized terminology or jargon unknown to the network. In one example, the concept granny smith apple could be grouped with gala apple and winesap apple since they are siblings of the granny smith. In another example, the concept animal might be grouped with cat and kangaroo since they are hyponyms of animal. This technique is available for English text only.
- Maximum search distance. This setting is only available if you select the Group by semantic network option. Select how far you want the techniques to search before producing categories. The lower the value, the fewer results you will get; however, these results will be less noisy and are more likely to be significantly linked or associated with each other. The higher the value, the more results you might get; however, these results may be less reliable or relevant. While this option is globally applied to all techniques, its effect is greatest on co-occurrences and semantic networks.
- Prevent pairing of specific concepts. Select this option to stop the process from grouping or pairing two concepts together in the output. To create or manage concept pairs, click Manage pairs.
- Generalize with wildcards where possible. Select this option to allow Modeler to generate generic rules in categories using the asterisk wildcard. For example, instead of producing multiple descriptors such as [apple tart + .] and [apple sauce + .], using wildcards might produce [apple * + .]. If you generalize with wildcards, you'll often get exactly the same number of records or documents as you did before. However, this option has the advantage of reducing the number of category descriptors and simplifying them. Additionally, this option increases the ability to categorize more records or documents using these categories on new text data (for example, in longitudinal/wave studies).
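Two of the grouping ideas above, concept inclusion and wildcard generalization, can be sketched with plain string handling. This is a simplified illustration, not the algorithm Modeler uses: it assumes concepts are whitespace-separated strings, treats "inclusion" as a strict-superset test on word sets, and generalizes only on a shared first word. It also ignores the rest of the descriptor rule syntax (the `+ .` part). The function names are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def group_by_inclusion(concepts):
    """Group each concept with the concepts whose words strictly include its words."""
    word_sets = {concept: set(concept.split()) for concept in concepts}
    groups = defaultdict(list)
    for base, base_words in word_sets.items():
        for other, other_words in word_sets.items():
            if base_words < other_words:  # strict subset test
                groups[base].append(other)
    return {base: sorted(members) for base, members in groups.items()}


def generalize_with_wildcard(concepts, min_shared=2):
    """Collapse multiword concepts that share a first word into '<word> *'."""
    by_head = defaultdict(list)
    for concept in concepts:
        head, _, _ = concept.partition(" ")
        by_head[head].append(concept)
    patterns = []
    for head, members in by_head.items():
        if len(members) >= min_shared and all(" " in m for m in members):
            patterns.append(f"{head} *")
        else:
            patterns.extend(members)
    return patterns


groups = group_by_inclusion(
    ["seat", "safety seat", "seat belt", "seat belt buckle", "price"]
)
patterns = generalize_with_wildcard(["apple tart", "apple sauce", "price"])

# A wildcard pattern also matches concepts that were not in the original data,
# which is why generalized descriptors can categorize more new records.
matches_new_text = fnmatchcase("apple pie", patterns[0])
```

Here `groups["seat"]` collects the three longer concepts from the example above, and the two apple descriptors collapse into a single `apple *` pattern.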
Other options for building categories
Maximum number of top level categories created. Use this option to limit the number of categories that can be generated the next time you click Build in the categories pane. In some cases, you might get better results if you set this value high and then delete any of the uninteresting categories.
Minimum number of descriptors and/or subcategories per category. Use this option to define the minimum number of descriptors and subcategories a category must contain in order to be created. This option helps limit the creation of categories that don't capture a significant number of records or documents.
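The combined effect of the two limits above can be sketched as a filter followed by a cutoff. This is only an illustration: the documentation does not say how the product decides which categories to keep when the maximum is reached, so ranking by record count is an assumption made here. For brevity the minimum counts descriptors only, not subcategories. All names are hypothetical.

```python
def apply_build_limits(categories, record_counts, max_top_level, min_descriptors):
    """Drop categories that are too small, then keep at most max_top_level of the rest.

    Assumption: surviving categories are ranked by how many records they match.
    """
    kept = {
        name: descriptors
        for name, descriptors in categories.items()
        if len(descriptors) >= min_descriptors
    }
    ranked = sorted(kept, key=lambda name: record_counts.get(name, 0), reverse=True)
    return {name: kept[name] for name in ranked[:max_top_level]}


categories = {
    "seat": ["seat", "safety seat", "seat belt"],
    "price": ["price"],
    "apple": ["apple tart", "apple sauce"],
}
record_counts = {"seat": 40, "price": 90, "apple": 12}

limited = apply_build_limits(
    categories, record_counts, max_top_level=1, min_descriptors=2
)
```

In this example `price` is removed by the minimum even though it matches the most records, and the maximum then keeps only `seat`.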
Allow descriptors to appear in more than one category. When selected, this option allows descriptors to be used in more than one of the categories that will be built next. This option is generally selected since items commonly or "naturally" fall into two or more categories, and allowing them to do so usually leads to higher quality categories. If you don't select this option, you reduce the overlap of records in multiple categories and, depending on the type of data you have, this might be desirable. However, with most types of data, restricting descriptors to a single category usually results in a loss of quality or category coverage. For example, let's say you have the concept car seat manufacturer. With this option, this concept could appear in one category based on the text car seat and in another one based on manufacturer. But if this option is not selected, although you may still get both categories, the concept car seat manufacturer will only appear as a descriptor in the category it best matches based on several factors including the number of records in which car seat and manufacturer each occur.
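The behavior described above can be sketched as a choice between keeping every candidate category and keeping only the best match. The product weighs several factors when picking the best match; this sketch uses record count alone, and treating the higher count as the better match is an assumption. The function and variable names, and the counts, are hypothetical.

```python
def assign_descriptor(candidate_categories, record_counts, allow_multiple=True):
    """Return the categories in which a descriptor is allowed to appear.

    Assumption: when overlap is not allowed, the candidate with the highest
    record count stands in for the "best match".
    """
    if allow_multiple or not candidate_categories:
        return list(candidate_categories)
    best = max(candidate_categories, key=lambda name: record_counts.get(name, 0))
    return [best]


# Candidate categories for the descriptor "car seat manufacturer".
candidates = ["car seat", "manufacturer"]
record_counts = {"car seat": 120, "manufacturer": 45}

with_overlap = assign_descriptor(candidates, record_counts, allow_multiple=True)
without_overlap = assign_descriptor(candidates, record_counts, allow_multiple=False)
```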
Resolve duplicate category names by. Select how to handle any new categories or subcategories whose names would be the same as existing categories. You can either merge the new ones (and their descriptors) with the existing categories with the same name, or you can choose to skip the creation of any categories if a duplicate name is found in the existing categories.
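The two ways of resolving duplicate names can be sketched as follows. This is a minimal illustration that assumes categories are held as a mapping from name to a set of descriptors; `resolve_duplicates` and the `"merge"` and `"skip"` strategy strings are hypothetical and do not correspond to product identifiers.

```python
def resolve_duplicates(existing, new, strategy="merge"):
    """Combine newly built categories with existing ones.

    "merge": add the new descriptors to the existing category of the same name.
    "skip":  leave the existing category untouched and discard the new one.
    """
    if strategy not in ("merge", "skip"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    result = {name: set(descriptors) for name, descriptors in existing.items()}
    for name, descriptors in new.items():
        if name not in result:
            result[name] = set(descriptors)
        elif strategy == "merge":
            result[name] |= set(descriptors)
        # With "skip", a duplicate name is ignored.
    return result


existing = {"seat": {"seat", "safety seat"}}
new = {"seat": {"seat belt"}, "apple": {"apple tart"}}

merged = resolve_duplicates(existing, new, strategy="merge")
skipped = resolve_duplicates(existing, new, strategy="skip")
```

In both cases the new `apple` category is created, since its name does not clash with an existing one; only the handling of `seat` differs.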