Extracting data

Whenever an extraction is needed, the Extraction Results pane turns yellow and the message Press Extract Button to Extract Concepts appears below the toolbar in this pane.

You may need to extract if you do not have any extraction results yet, have made changes to the linguistic resources and need to update the extraction results, or have reopened a session in which you did not save the extraction results (Tools > Options).

Note: If you change the source node for your stream after extraction results have been cached with the Use session work... option, you will need to run a new extraction once the interactive workbench session is launched if you want to get updated extraction results.

When you run an extraction, a progress indicator appears to provide feedback on the status of the extraction. During this time, the extraction engine reads through all of the text data, identifies the relevant terms and patterns, extracts them, and assigns each one to a type. Then, the engine attempts to group synonymous terms under one lead term, called a concept. When the process is complete, the resulting concepts, types, and patterns appear in the Extraction Results pane.

The extraction process results in a set of concepts and types, as well as Text Link Analysis (TLA) patterns, if enabled. You can view and work with these concepts and types in the Extraction Results pane in the Categories and Concepts view. If you extracted TLA patterns, you can see those in the Text Link Analysis view.

Note: The time it takes to complete the extraction process depends on the size of your dataset. To reduce it, consider inserting a Sample node upstream or optimizing your machine's configuration.

To extract data

  1. From the menus, choose Tools > Extract. Alternatively, click the Extract toolbar button.
  2. If you chose to always display the Extraction Settings dialog, it appears so that you can make any changes. See further in this topic for descriptions of each setting.
  3. Click Extract to begin the extraction process. Once the extraction begins, the progress dialog box opens. After extraction, the results appear in the Extraction Results pane. By default, the concepts are shown in lowercase and sorted in descending order according to the document count (Doc. column).

You can review the results using the toolbar options to sort the results differently, to filter the results, or to switch to a different view (concepts or types). You can also refine your extraction results by working with the linguistic resources. See the topic Refining extraction results for more information.

Potential extraction issues

Multiple interactive workbench sessions can cause sluggish behavior. SPSS® Modeler Text Analytics and SPSS Modeler share a common Java run-time engine when an interactive workbench session is launched. Depending on the number of interactive workbench sessions you invoke during an SPSS Modeler session (even if you open and close the same session), system memory usage may cause the application to become sluggish. This effect may be especially pronounced if you are working with large data sets or have a machine with less than the recommended 4 GB of RAM. If you notice that your machine is slow to respond, save all your work, shut down SPSS Modeler, and relaunch the application. Running SPSS Modeler Text Analytics on a machine with less than the recommended memory, particularly when working with large data sets or for prolonged periods of time, may cause Java to run out of memory and shut down. If you work with large data sets, it is strongly suggested that you upgrade to the recommended amount of memory or more, or use SPSS Modeler Text Analytics Server.

For Dutch, English, French, German, Italian, Portuguese, and Spanish Text

The Extraction Settings dialog box contains some basic extraction options.

Enable Text Link Analysis pattern extraction. Specifies that you want to extract TLA patterns from your text data. It also assumes you have TLA pattern rules in one of your libraries in the Resource Editor. This option may significantly lengthen the extraction time. See the topic Exploring Text Link Analysis for more information.

Accommodate punctuation errors. This option temporarily normalizes text containing punctuation errors (for example, improper usage) during extraction to improve the extractability of concepts. This option is extremely useful when text is short and of poor quality (as, for example, in open-ended survey responses, e-mail, and CRM data), or when the text contains many abbreviations.

Accommodate spelling for a minimum word character length of [n]. This option applies a fuzzy grouping technique that helps group commonly misspelled words or closely spelled words under one concept. The fuzzy grouping algorithm temporarily strips all vowels (except the first one) and strips double/triple consonants from extracted words, and then compares the results to see whether they are the same. For example, modeling and modelling would be grouped together. However, if each term is assigned to a different type, excluding the <Unknown> type, the fuzzy grouping technique will not be applied.
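The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the extraction engine's implementation. The function name fuzzy_key is hypothetical, and the sketch assumes that "the first one" means the first vowel that occurs in the word and that repeated consonants are collapsed before vowels are removed.

```python
import re

VOWELS = set("aeiou")

def fuzzy_key(word: str) -> str:
    """Return a simplified comparison key for a word (hypothetical helper)."""
    # Assumption: collapse runs of the same consonant (ll, lll) to one letter.
    collapsed = re.sub(r"([b-df-hj-np-tv-z])\1+", r"\1", word.lower())
    key = []
    seen_vowel = False
    for ch in collapsed:
        if ch in VOWELS:
            # Assumption: keep only the first vowel found in the word.
            if not seen_vowel:
                key.append(ch)
                seen_vowel = True
            continue
        key.append(ch)
    return "".join(key)

# Two spellings are candidates for grouping when their keys are equal.
print(fuzzy_key("modeling"), fuzzy_key("modelling"))
```

Both spellings reduce to the same key here, so they would be treated as candidates for the same concept.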

You can also define the minimum number of root characters required before fuzzy grouping is used. The number of root characters in a term is calculated by totaling all of the characters and subtracting any characters that form inflection suffixes and, in the case of compound-word terms, determiners and prepositions. For example, the term exercises would be counted as 8 root characters in the form “exercise,” since the letter s at the end of the word is an inflection (plural form). Similarly, apple sauce counts as 10 root characters (“apple sauce”) and manufacturing of cars counts as 16 root characters (“manufacturing car”). This method of counting is only used to check whether the fuzzy grouping should be applied but does not influence how the words are matched.
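The root character count can be illustrated with a small Python sketch that reproduces the three examples above. It is a rough approximation under stated assumptions: the function names, the short FUNCTION_WORDS list, and the plural-only strip_inflection rule are all hypothetical stand-ins for the linguistic resources that the product actually uses.

```python
# Assumption: a tiny stand-in list of determiners and prepositions.
FUNCTION_WORDS = {"of", "the", "a", "an", "in", "on", "for", "to"}

def strip_inflection(word: str) -> str:
    """Toy rule (assumption): remove only a plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def root_char_count(term: str) -> int:
    """Count characters after dropping function words and inflection suffixes."""
    words = [w for w in term.lower().split() if w not in FUNCTION_WORDS]
    return sum(len(strip_inflection(w)) for w in words)

for term in ("exercises", "apple sauce", "manufacturing of cars"):
    print(term, root_char_count(term))
```

Fuzzy grouping would then be attempted only for terms whose count reaches the minimum that you set.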

Note: If you find that certain words are later grouped incorrectly, you can exclude word pairs from this technique by explicitly declaring them in the Fuzzy Grouping: Exceptions section in the Advanced Resources tab. See the topic Fuzzy Grouping for more information.

Extract uniterms. This option extracts single words (uniterms) as long as the word is not already part of a compound word and if it is either a noun or an unrecognized part of speech.

Extract nonlinguistic entities. This option extracts nonlinguistic entities, such as phone numbers, social security numbers, times, dates, currencies, digits, percentages, e-mail addresses, and HTTP addresses. You can include or exclude certain types of nonlinguistic entities in the Nonlinguistic Entities: Configuration section of the Advanced Resources tab. If you disable the entities that you do not need, the extraction engine does not spend processing time on them. See the topic Configuration for more information.
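To give a feel for what pattern-based entity extraction looks like, the following Python sketch finds three of the entity kinds listed above with regular expressions. The patterns and the names ENTITY_PATTERNS and find_entities are illustrative assumptions only. They are deliberately simple and are not the definitions used by the product.

```python
import re

# Assumption: simplified patterns for illustration, keyed by entity kind.
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "http": re.compile(r"https?://\S+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?\s?%"),
}

def find_entities(text: str, enabled=None) -> dict:
    """Return matches per entity kind, scanning only the enabled kinds."""
    kinds = ENTITY_PATTERNS if enabled is None else enabled
    return {kind: ENTITY_PATTERNS[kind].findall(text) for kind in kinds}

sample = "Contact jane@example.com or visit http://example.com/help for 15% off."
print(find_entities(sample))
print(find_entities(sample, enabled=["email"]))
```

Passing a shorter enabled list mirrors the point about disabling unneeded entities: patterns that are not enabled are never evaluated.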

Uppercase algorithm. This option extracts simple and compound terms that are not in the built-in dictionaries as long as the first letter of the term is in uppercase. This option offers a good way to extract most proper nouns.

Group partial and full person names together when possible. This option groups together names that appear in different forms in the text. This feature is helpful since names are often referred to in their full form at the beginning of the text and then only by a shorter version. This option attempts to match any uniterm with the <Unknown> type to the last word of any of the compound terms that are typed as <Person>. For example, if doe is found and initially typed as <Unknown>, the extraction engine checks to see if any compound terms in the <Person> type include doe as the last word, such as john doe. This option does not apply to first names since most are never extracted as uniterms.
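The matching rule can be sketched as follows in Python. This is a minimal illustration, assuming that extracted terms are available as a plain mapping from term to type name. The function name group_partial_names and the data layout are hypothetical and do not reflect how the engine stores its results.

```python
def group_partial_names(typed_terms: dict) -> dict:
    """Map each Unknown uniterm to a Person compound term that ends with it."""
    # Compound terms (more than one word) that are typed as Person.
    persons = [t for t, kind in typed_terms.items()
               if kind == "Person" and " " in t]
    grouped = {}
    for term, kind in typed_terms.items():
        if kind != "Unknown" or " " in term:
            continue  # only uniterms typed as Unknown are candidates
        for person in persons:
            if person.split()[-1] == term:
                grouped[term] = person
                break
    return grouped

terms = {"john doe": "Person", "doe": "Unknown", "john": "Unknown"}
print(group_partial_names(terms))
```

Only doe is matched in this example. The uniterm john is left alone because the rule compares against the last word of the compound term, which is consistent with the statement that first names are not covered.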

Maximum nonfunction word permutation. This option specifies the maximum number of nonfunction words that can be present when applying the permutation technique. This permutation technique groups similar phrases that differ from each other only by the function words (for example, of and the) they contain, regardless of inflection. For example, suppose that you set this value to at most two words and that both company officials and officials of the company were extracted. In this case, both extracted terms would be grouped together in the final concept list since both terms are deemed to be the same when of the is ignored.
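The grouping in this example can be sketched in Python as shown below. This is an illustrative approximation, not the engine's algorithm: the names permutation_key and group_by_permutation and the short IGNORED_WORDS list are assumptions, and the sketch does not handle inflection.

```python
from collections import defaultdict

# Assumption: a small stand-in list of words such as "of" and "the".
IGNORED_WORDS = {"of", "the", "a", "an", "in", "for", "to"}

def permutation_key(phrase: str) -> tuple:
    """Remaining words of a phrase, in an order-independent form."""
    words = [w for w in phrase.lower().split() if w not in IGNORED_WORDS]
    return tuple(sorted(words))

def group_by_permutation(phrases, max_words=2):
    """Group phrases with the same key when the key has at most max_words words."""
    groups = defaultdict(list)
    ungrouped = []
    for phrase in phrases:
        key = permutation_key(phrase)
        if len(key) <= max_words:
            groups[key].append(phrase)
        else:
            ungrouped.append([phrase])  # too many words: left on its own
    return list(groups.values()) + ungrouped

print(group_by_permutation(
    ["company officials", "officials of the company", "senior company officials"]
))
```

With a limit of two words, the first two phrases share a key and fall into one group, while the three-word phrase is left ungrouped.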

Use derivation when grouping multiterms. When processing Big Data, select this option to group multiterms by using derivation rules.

Index Option for Concept Map. Specifies that you want to build the map index at extraction time so that concept maps can be drawn quickly later. To edit the index settings, click Settings. See the topic Building Concept Map Indexes for more information.

Always show this dialog before starting an extraction. Specify whether you want to see the Extraction Settings dialog each time you extract, never want to see it unless you go to the Tools menu, or want to be asked each time you extract whether you want to edit any extraction settings.