Objectives of text mining
Text mining is meant to extract meaningful information from large amounts of survey data. Text mining is especially useful when the survey data is in response to open-ended questions. In such surveys, respondents give their comments or opinions in their own words. They can talk about whatever they feel, like, appreciate, disagree with, or complain about.
Text mining in survey analytics helps you extract meaning from what people say.
For example, consider an open-ended question that asks people for their comments about using touchscreen devices. Naturally, respondents can say whether they like or dislike touchscreen devices, they can mention something specific based on their experience, or perhaps they can suggest improvements.
This article steps through the process of using SPSS Text Analytics for Surveys to analyze and decipher survey data that includes answers to an open-ended question about touchscreen devices. At the end of the analysis, you will know how many people like touchscreens and how many don't. You will also identify which features of touchscreen devices people are talking about, how many people like a particular feature, and how many dislike it.
Use software such as SPSS Text Analytics for Surveys to perform automated text mining and analytics when you must extract meaningful information from large amounts of data. The field of text analytics is an important aspect of big data applications.
Introducing sample survey data
The downloadable sample data is an Excel file that includes fictitious survey data. The Excel file has only one worksheet, in which survey data is organized in three columns. Figure 1 gives a view of the three-column headings and some rows of data.
Figure 1. Sample survey data
The first column has a unique ID for each row of data. Each row represents one respondent who took part in the survey. The second column shows the responses to the survey question about touchscreen devices. The third column contains information about the age group of the respondents who took part in the survey.
Loading survey data into SPSS Text Analytics for Surveys
As shown in Figure 2, when you run SPSS Text Analytics for Surveys you see a prompt that asks whether you want to start a new project or open an existing one.
Figure 2. The initial SPSS Text Analytics for Surveys prompt to start a new project or open an existing project
Click Start a New Project to select from various data sources. In this case, because an Excel file contains the survey data, select Excel from the list that is shown in Figure 3.
Figure 3. Selecting Excel from available data source options
SPSS Text Analytics for Surveys presents a file browsing dialog where you enter the name of the Excel file and choose from the available worksheets. Browse to the sample Excel file. Leave the Column names in first row option checked because the first row of the sample worksheet contains the column names, and then confirm your selection.
Figure 4. Selecting the worksheet that carries sample survey data
SPSS Text Analytics for Surveys now presents another dialog box, which asks some questions about columns in your Excel file.
Figure 5. SPSS Text Analytics for Surveys asks about columns of data in your Excel file
In Figure 5, you can see that SPSS Text Analytics for Surveys loaded the column headings Respondent ID, Q1: Comments about using touchscreen devices? and REF1: Age group. These column headings must be copied into the three fields that are shown on the right in Figure 5: Unique ID, Open Ended Text, and Reference. Select the first column heading (Respondent ID) and click the right arrow next to the Unique ID field. The column name is copied into the variable. Similarly, copy the other two column headings into their variables, as shown in Figure 6.
Figure 6. Specify to SPSS Text Analytics for Surveys the meaning of each of the columns
Some column names (such as D, E, F) are still available on the left to be copied into variables that are shown on the right. But these columns are not available in the sample Excel file. SPSS Text Analytics for Surveys supports multiple questions and references in one analysis, so there is room for more columns. However, the sample Excel file contains only one question and one reference. Therefore, ignore the extra column names and click Next.
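To make the column mapping concrete outside the product, the following minimal Python sketch models the three roles (unique ID, open-ended text, and reference). It assumes the worksheet was exported to CSV. The two data rows, the Response class, the COLUMN_ROLES mapping, and the load_responses function are hypothetical names invented for this illustration; they are not part of SPSS Text Analytics for Surveys.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical rows; the real sample worksheet is not reproduced here.
SAMPLE_CSV = """Respondent ID,Q1: Comments about using touchscreen devices?,REF1: Age group
1,I like zooming with my fingers,18-25
2,Touchscreens are not accurate enough,26-35
"""

# Column heading -> role, mirroring the three fields in the dialog.
COLUMN_ROLES = {
    "Respondent ID": "unique_id",
    "Q1: Comments about using touchscreen devices?": "open_ended_text",
    "REF1: Age group": "reference",
}


@dataclass
class Response:
    unique_id: str
    open_ended_text: str
    reference: str


def load_responses(text):
    """Read CSV text and return one Response per data row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        fields = {role: row[heading] for heading, role in COLUMN_ROLES.items()}
        rows.append(Response(**fields))
    return rows


responses = load_responses(SAMPLE_CSV)
```

A worksheet with more questions or references would need more entries in the mapping, which is the situation the unused column names in the dialog allow for.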
SPSS Text Analytics for Surveys asks whether you want to translate your sample data into English. Since the data is already in English, click Next. SPSS Text Analytics for Surveys presents a Category and Resources dialog.
Figure 7. The Category and Resources dialog
SPSS Text Analytics for Surveys comes bundled with linguistic resources that you can use to start your analysis. You can build and reuse your own resources in the form of libraries.
Start analyzing the sample survey data by using the resources that come bundled with SPSS Text Analytics for Surveys. Part 2 of this article describes how to use this initial analysis to develop your own touchscreen-specific linguistic resources. For the moment, leave the Resource Template option selected in Figure 7. This option specifies that the default SPSS Text Analytics for Surveys resources (without any specialized libraries) are to be used for analysis. Click Finish. SPSS Text Analytics for Surveys starts analyzing sample data, a process called concept extraction.
Analyzing results of concept extraction
A concept in SPSS Text Analytics for Surveys is a word that the SPSS Text Analytics for Surveys engine deems to be important in analysis, along with all synonyms of the word. SPSS Text Analytics for Surveys takes some time to extract concepts from your sample touchscreen survey data. When the extraction process is complete, you see a window as shown in Figure 8.
Figure 8. SPSS Text Analytics for Surveys brings initial results of extraction
The window in Figure 8 is divided into left and right portions; the right portion shows responses from the sample data in the Excel file. The left portion is divided into upper and lower parts. The upper-left part shows categories, which are described and built in Part 2 of this article. The lower-left part shows the results of extraction. This first article of the series focuses on the right (the response window) and lower-left (the results window) parts of Figure 8.
Look at the results window. This window shows the concepts that SPSS Text Analytics for Surveys extracted. Many concepts were extracted, for example touch, touchscreens, like, and greater.
The SPSS Text Analytics for Surveys engine produced this list of concepts by using its default linguistic resources. As you proceed in analyzing these initial results, you will find ways to enhance the linguistic capabilities of SPSS Text Analytics for Surveys to reach more meaningful analysis of the sample survey.
Exploring concepts that are generated by SPSS Text Analytics for Surveys
Each concept in Figure 8 is followed by a number in brackets (for example, "76" is in brackets after the first concept, touch). The number indicates how many responses contain the touch concept. In this case, 76 responses contain the word touch or its synonyms. Click touch to see the responses that contain the touch concept, as shown in Figure 9 in the right half of the window, which is the response window.
Figure 9. Responses that contain the touch concept
The response window shows individual responses that contain the concept you selected in the results window. The response window has four columns. The leftmost column counts the responses. The second column gives the ID of the response from the original Excel file. In this article, I use IDs to refer to individual responses. The third column contains actual responses. The rightmost column displays categories, which are described and built in Part 2 of this series. This column is therefore empty for the moment.
Notice from Figure 9 that SPSS Text Analytics for Surveys highlights all 76 occurrences of the touch concept, which makes it easier to find the responses in Figure 9 that use the word "touch" and no synonym.
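The idea of a concept as a word plus its synonyms, with a count of the responses that contain it, can be sketched in a few lines of Python. This is only an illustration of the counting idea under simple assumptions (plain word matching on hypothetical responses); it does not reproduce the linguistic extraction that SPSS Text Analytics for Surveys performs. The CONCEPTS and RESPONSES data and both function names are invented for this example.

```python
import re

# Hypothetical concept definitions: a concept name plus the terms treated as synonyms.
CONCEPTS = {
    "touchscreens": ["touchscreens", "touch screens"],
    "greater": ["greater", "bigger"],
}

# Hypothetical responses keyed by respondent ID.
RESPONSES = {
    1: "A bigger display would help",
    2: "I want touch screens with bigger keys",
    3: "Touchscreens are fine",
}


def responses_for(concept, responses=RESPONSES, concepts=CONCEPTS):
    """Return the IDs of responses that contain any term of the concept."""
    # Longest terms first so multi-word terms are tried before shorter ones.
    terms = sorted(concepts[concept], key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [rid for rid, text in responses.items() if pattern.search(text)]


def concept_counts(responses=RESPONSES, concepts=CONCEPTS):
    """Return the number of matching responses per concept, like the bracketed counts."""
    return {name: len(responses_for(name, responses, concepts)) for name in concepts}


counts = concept_counts()
```

Each response is counted once per concept, no matter how many times a term occurs in it, which matches the description of the bracketed number as a count of responses.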
In addition to highlighting the particular concept that is displayed, SPSS Text Analytics for Surveys color-codes all concepts to make them identifiable in the response window. For example, in Figure 9 notice that bigger is displayed in purple in four places (response IDs 21, 46, 59, and 88). The word "bigger" is included in a concept named greater, which is also displayed in purple as the fourth concept from the top in the list of concepts in the results window.
Now click the second concept, touchscreens, which is contained in 64 responses.
Figure 10. Responses that contain the touchscreens concept
You can see from Figure 10 that the touchscreens concept treats the two terms touch screens and touchscreens as synonyms, and both are highlighted in the response window.
Notice from the responses that are shown in Figure 9 and Figure 10 that respondents are using touch and touchscreens interchangeably with the same meaning. Although "touch" and "touchscreens" are not synonyms in normal English vocabulary, within the context of a discussion on touchscreen devices, the two words refer to the same concept. Therefore, it is better to combine touch and touchscreens into one concept. With SPSS Text Analytics for Surveys, you can configure your own linguistic resources, a function that is especially helpful in analyzing domain-specific survey data.
Part 2 defines a domain-specific library of linguistic resources and describes how to configure domain-specific synonyms.
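One consequence of combining two concepts is worth seeing in numbers: the count of the merged concept is not simply the sum of the two original counts, because a response that uses both words is counted only once. The following Python sketch illustrates this with hypothetical response IDs; the merge_concepts function and the CONCEPT_HITS data are invented for the example and do not describe how the product stores its results.

```python
# Hypothetical extraction results: concept name -> IDs of responses that contain it.
CONCEPT_HITS = {
    "touch": {1, 2, 3},
    "touchscreens": {3, 4},
    "like": {2, 4},
}


def merge_concepts(concept_hits, names, merged_name):
    """Return a new mapping in which the concepts in `names` are folded into one."""
    merged_ids = set()
    result = {}
    for concept, ids in concept_hits.items():
        if concept in names:
            merged_ids |= ids  # union, so shared responses are counted once
        else:
            result[concept] = set(ids)
    result[merged_name] = merged_ids
    return result


merged = merge_concepts(CONCEPT_HITS, {"touch", "touchscreens"}, "touchscreens")
```

In this made-up data, touch has three responses and touchscreens has two, but the merged concept has four because response 3 mentions both.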
The next concept in the list is like. Responses included in the like concept are shown in Figure 11.
Figure 11. Responses that contain the like concept
The responses that are shown in Figure 11 include several terms, such as prefer, I love, and I like, in the like concept. Respondents like various things: some like gestures and others like zooming, touch displays, and other features.
From a survey analytics viewpoint, the like concept does not reveal much unless you also know how many of the respondents like which features and how many like touchscreens in general. You can get more specific information by going through the extraction results step by step.
The data on the like concept indicates only that a noticeable number of survey respondents are expressing positive sentiments about touchscreens and about features such as gestures and zooming.
The next concept, greater, contains 15 responses, as shown in Figure 12. A quick view of the responses indicates that most people are mentioning the size of touchscreen devices.
Figure 12. Responses included in the greater concept show that respondents are mentioning the size of touchscreen devices
The next concept is cool, with another 15 responses. As with the like concept, respondents are speaking positively about various features, such as zooming and the size of the touchscreen display.
Figure 13. Responses mentioning the cool concept show that respondents are mentioning various features
The next concept is fingers. A quick glance at the responses shows that respondents mostly enjoy using their fingers for zooming and typing on touchscreen devices.
Figure 14. Responses showing that most users like zooming and typing with fingers on touchscreen devices
The next concept is easy to use, where respondents appreciate the ease of using touchscreens. SPSS Text Analytics for Surveys considers all occurrences of the three-word terms ease of use and easy to use as synonyms.
Figure 15. Responses that appreciate the ease of using touchscreen devices
Further down the list of concepts in Figure 8 you find two more concepts, easy (with eight responses) and user-friendly (with seven responses), which also mention the ease of using touchscreen devices. Therefore, within the context of this touchscreen survey, it is appropriate to combine easy and user-friendly with the easy to use concept.
Figure 16. Responses that use the concept easy
Figure 17. Responses that talk about user-friendliness
The next concept after easy to use is touch gestures with 14 responses (see Figure 8), which indicates that respondents like their experience of using hand gestures.
Figure 18. Responses mentioning gestures
Within the context of this touchscreen survey, touch gestures is used interchangeably with the fingers concept that is described earlier in Figure 14. It is better to combine the two concepts as a small step in compiling analytic results.
Gestures are a feature of touchscreen devices and SPSS Text Analytics for Surveys indicates that survey responders find this feature worth mentioning.
Browsing through concepts further down the list in a similar manner, you find that another pair of concepts, accurate and precise, are better combined into one concept. Both of these concepts mention the accuracy of touch devices (some criticizing and others appreciating), as shown in Figure 19 and Figure 20.
To meet survey objectives, you need to separate how many respondents appreciate the accuracy from how many criticize it.
Figure 19. Responses that use the concept accurate
Figure 20. Responses mentioning precision
Another pair of concepts, excellent (with 12 responses) and good (with nine responses), is also a candidate to be combined into one concept. Responses in these two concepts appreciate different aspects and features of touchscreens, similar to the like and cool concepts described earlier. You need to find out what is good and excellent in the eyes of respondents.
Figure 21. Responses mentioning excellent
Figure 22. Responses that find something is good
You see another concept that is named displays, whose responses are shown in Figure 23. A quick glance tells you that people are talking about the display of touchscreen devices, especially the display size.
Display size is another feature that SPSS Text Analytics for Surveys found in the survey data.
Figure 23. Responses that talk about displays
In addition to the positively expressed concepts, you can also find negatively expressed concepts, such as not clear and wrong. Responses included in the concept shown in Figure 24 describe the precision of touchscreens. Although inaccurate and not clear don't appear to be synonyms, SPSS Text Analytics for Surveys combined some of the responses that express unhappiness about the precision of touchscreen devices.
Figure 24. Negative responses
Most of the responses included in the concept shown in Figure 25 refer to the inaccuracy of touchscreen devices. Therefore, not clear and wrong can be combined into one more meaningful concept that wraps all responses that complain about inaccuracy.
Figure 25. Negative responses that refer to inaccuracy
Similarly, the concept social media includes responses about staying connected by using touchscreen devices.
Staying connected is an application of touchscreen devices. SPSS Text Analytics for Surveys indicates that users appreciate staying connected by using touchscreen devices.
Figure 26. Responses included in the social media concept
Up to this point, the concepts that are extracted by SPSS Text Analytics for Surveys rely on default functions. You can draw some basic conclusions from the results of this analysis of sample survey data.
- The survey data includes responses that use concepts such as like, cool, easy to use, excellent, and good. These concepts are positive expressions that show that someone likes or appreciates something.
- Responses indicate that hand gestures and staying connected on social media are prominent applications of touchscreens.
- The size of touchscreen devices is an important feature to respondents.
- You need to know how many responses speak in favor of which features or applications. Similarly, you also need to know how many responses generally like touchscreen devices, without mentioning a particular feature or application.
- You discovered negative comments that express dislike of touchscreen devices. You need to know how many responses mention dislike of particular features or aspects of touchscreen devices.
Positive expressions use concepts such as like, cool, easy to use, and accurate. SPSS Text Analytics for Surveys provides a feature to combine concepts in the form of types of words. All concepts that have something in common with each other can form a type. For example, the type Positive can include all the positively expressed concepts. Similarly, another type, TouchFeaturesNApps, can combine concepts such as touch gestures, displays, and social media.
Building on this understanding of concepts, which collect synonyms, turn your attention to types of words, which hold related concepts together.
Working with types of words
SPSS Text Analytics for Surveys combines concepts into types of words automatically during the extraction process by using its built-in linguistic resources. Explore the types that SPSS Text Analytics for Surveys generated from the sample survey data.
Use the drop-down list immediately above the list of concepts in the results window to switch the results window from showing concepts to showing types, as shown in Figure 27.
Figure 27. Switching from concept to type view
Click Type from the Concept drop-down list. SPSS Text Analytics for Surveys shows the list of types that it generated during the extraction process, as shown in Figure 28.
Figure 28. The list of types that SPSS Text Analytics for Surveys generated by using its built-in linguistic resources
Figure 28 shows that the first type is Unknown. This type contains all concepts that SPSS Text Analytics for Surveys cannot fit into any type. Set aside the Unknown type temporarily to focus on the known types.
The second type in Figure 28 is the Positive type. Expand it to view the list of concepts that it includes.
Figure 29. Concepts included in the Positive type
All of the positively expressed concepts that were identified earlier (such as like, cool, easy to use, and accurate) are included in the Positive type.
Similarly, notice in Figure 28 that a built-in type, Negative, holds the negatively expressed concepts noted earlier (not clear and wrong). Concepts included in the Negative type are shown in Figure 30.
Figure 30. Concepts included in the Negative type
The Positive and Negative types are a bit more helpful than concepts because these types indicate the total of positive responses and the total of negative responses. Many survey analytic projects require counting and analyzing positive and negative expressions in a meaningful way. SPSS Text Analytics for Surveys includes the built-in types Positive and Negative to save time.
To determine which concepts SPSS Text Analytics for Surveys cannot fit into any type, look at the concepts included in the Unknown type, as shown in Figure 31.
Figure 31. Concepts included in the Unknown type
Concepts specific to touchscreens (displays and others) are included in the Unknown type. SPSS Text Analytics for Surveys comes bundled with linguistic resources that are applicable to English dictionaries (and to dictionaries of other languages that SPSS Text Analytics for Surveys supports). Therefore, it can determine how to apply Positive and Negative to terms that are assumed to be positive or negative. However, SPSS Text Analytics for Surveys cannot build types that cover touchscreen-specific concepts.
Concepts such as touch gestures, displays, and social media are features and applications of touchscreen devices. Therefore, it is appropriate to make a new type, TouchFeaturesNApps, and assign all such concepts to it. Part 2 covers how to configure the TouchFeaturesNApps type to combine into one category the features and applications of touchscreen devices that respondents found worth mentioning.
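The relationship between concepts and types can be sketched as a simple grouping step: each concept is looked up in a type table, and any concept without an entry falls into Unknown. The Python below is a rough illustration under that assumption, with hypothetical response IDs. The TYPE_OF table, the CONCEPT_HITS data, and the group_by_type function are invented for the example; the product derives its types from its linguistic resources rather than from a hand-written table.

```python
# Hypothetical type table: concept name -> type name.
TYPE_OF = {
    "like": "Positive",
    "cool": "Positive",
    "not clear": "Negative",
    "wrong": "Negative",
}

# Hypothetical extraction results: concept name -> IDs of responses that contain it.
CONCEPT_HITS = {
    "like": {1, 2},
    "cool": {2, 3},
    "wrong": {4},
    "displays": {3, 5},
    "social media": {1},
}


def group_by_type(concept_hits, type_of=TYPE_OF):
    """Group concepts by type; concepts without a type go to 'Unknown'."""
    types = {}
    for concept, ids in concept_hits.items():
        type_name = type_of.get(concept, "Unknown")
        entry = types.setdefault(type_name, {"concepts": [], "responses": set()})
        entry["concepts"].append(concept)
        entry["responses"] |= ids  # a response is counted once per type
    return types


types = group_by_type(CONCEPT_HITS)
```

In this sketch, adding entries such as "displays": "TouchFeaturesNApps" to the table would move those concepts out of Unknown, which is the kind of change that a domain-specific type is meant to achieve.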
Although types are helpful because they bring together similar concepts, types do not go far enough to yield the needed insight. SPSS Text Analytics for Surveys must also indicate how many respondents like or dislike a particular feature. Combine SPSS Text Analytics for Surveys concepts and types to find more insightful answers to these questions.
Patterns of concepts
To find insightful answers, SPSS Text Analytics for Surveys provides a linguistic framework that is called patterns of concepts, or concept patterns.
Concept patterns are essentially a combination of one concept (touch gestures, for example) with a type (Positive, for example). If you can count responses that mention touch gestures and cool, you can determine how many responses appreciate touch gestures.
This technique of combining concepts and types gives more insight into the sentiment of people who take part in a survey. Therefore, the technique is part of the field of analytics referred to as sentiment analysis.
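The counting idea behind a concept pattern can be approximated with a short Python sketch. It assumes that the concepts in each response are already known and pairs every untyped concept with the types of the opinion concepts found in the same response. This co-occurrence rule is a deliberate simplification chosen for illustration; SPSS Text Analytics for Surveys uses its linguistic resources to decide which words belong together, and that logic is not reproduced here. All names and data in the block are hypothetical.

```python
from collections import Counter

# Hypothetical assignment of opinion concepts to types.
TYPE_OF = {"cool": "Positive", "like": "Positive", "wrong": "Negative"}

# Hypothetical input: response ID -> concepts already found in that response.
CONCEPTS_FOUND = {
    1: ["touch gestures", "cool"],
    2: ["touch gestures", "like"],
    3: ["touchscreens", "wrong"],
    4: ["touch"],
}


def patterns_for(concepts, type_of=TYPE_OF):
    """Return the pattern labels for one response (simple co-occurrence rule)."""
    subjects = [c for c in concepts if c not in type_of]
    opinion_types = {type_of[c] for c in concepts if c in type_of}
    labels = set()
    for subject in subjects:
        if opinion_types:
            labels.update(f"{subject} + <{t}>" for t in opinion_types)
        else:
            # No opinion found with this concept: the "concept + ." pattern.
            labels.add(f"{subject} + .")
    return labels


def count_patterns(concepts_found, type_of=TYPE_OF):
    """Count how many responses fall into each pattern."""
    counts = Counter()
    for concepts in concepts_found.values():
        counts.update(patterns_for(concepts, type_of))
    return counts


pattern_counts = count_patterns(CONCEPTS_FOUND)
```

With this made-up input, two responses land in touch gestures + <Positive>, which is the kind of count that answers how many respondents appreciate a specific feature.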
The next step is to dig deeper into sentiments that are expressed in the survey data. SPSS Text Analytics for Surveys automatically generates concept patterns during extraction by using its default function. It also enables users to configure their own concept patterns. First, look at the default function of SPSS Text Analytics for Surveys.
Figure 32. Displaying list of concept patterns
Notice the concept patterns that SPSS Text Analytics for Surveys generates automatically by using its built-in linguistic resources, as shown in Figure 33.
Figure 33. Concept patterns that SPSS Text Analytics for Surveys generates
The first concept pattern in the list is touch + . (with 45 responses), which means that 45 responses contain either just the one concept touch, or touch with other concepts that SPSS Text Analytics for Surveys does not consider to form a pattern with touch.
Similarly, the second concept pattern is touchscreens + . You can expect that after you combine the touch and touchscreens concepts into one, the two concept patterns automatically merge. However, start by looking at the responses included in the touch + . pattern in Figure 34 and the touchscreens + . pattern in Figure 35.
Figure 34. Responses included in the touch + . concept pattern
Figure 35. Responses included in the touchscreens + . concept pattern
The touch + . and touchscreens + . patterns contain messages of a mixed nature, covering almost all features, along with positive and negative comments. These two patterns are not much help in the analysis.
The third pattern is touchscreens + <Positive> with 23 responses. A quick scan of the responses (Figure 36) shows that almost all responses say something positive and good about touchscreens.
Figure 36. Responses included in the touchscreens + <Positive> pattern
This point is important in the analysis. It shows one clear opinion in the survey data along with the count of respondents who share the opinion.
Now expand the touchscreens + <Positive> pattern to find out how SPSS Text Analytics for Surveys formed this concept pattern.
Figure 37. Combinations included in the touchscreens + <Positive> pattern
SPSS Text Analytics for Surveys combined the concepts included in the Positive type (good, cool, user-friendly, and other concepts), one by one, with touchscreens. For example, the first combination in Figure 37 is touchscreens + good, which includes responses that contain simple statements like touchscreens are nice or touchscreens are good.
Figure 38. Concepts included in the touchscreens + good combination
This way, the touchscreens + <Positive> pattern as a whole indicates the count of responses that appreciate touchscreens. Each individual combination within the touchscreens + <Positive> pattern indicates whether respondents find touchscreens good, cool, or user-friendly.
Notice the last combination in the touchscreens + <Positive> pattern in Figure 37. The combination is touchscreens + reasonable, in which one response, from ID 89, says that touchscreens are reasonable in accuracy.... This combination is explained later in this article.
Figure 39. The last combination in the touchscreens + <Positive> pattern, with just one response
You can expect that after you combine concepts such as precise and accurate, corresponding combinations such as touchscreens + precise and touchscreens + accurate automatically merge. The touch + <Positive> and touchscreens + <Positive> patterns also merge when you combine the touch and touchscreens concepts.
To take the analysis further, the fifth concept pattern in Figure 33 is touch gestures + <Positive>, which includes 13 responses. You can easily see the advantage of concept patterns: you have an automatic count of positively expressed responses about touch gestures. This is exactly the information that you need, namely how many responses indicate positive feelings about a specific feature.
Check the responses by clicking the touch gestures + <Positive> pattern. You can see that SPSS Text Analytics for Surveys accurately chose responses that appreciate gestures, as shown in Figure 40.
Figure 40. Responses and combinations included in the touch gestures + <Positive> pattern
Similarly, check the responses in the next pattern, touchscreens + <Negative>, and you find that once again SPSS Text Analytics for Surveys correctly chose the responses that indicate negative feelings about touchscreen devices.
Figure 41. Responses and combinations included in the touchscreens + <Negative> pattern
You can continue exploring the other concept patterns of Figure 33. Consider more closely an example that shows how SPSS Text Analytics for Surveys organizes responses in concept patterns. Figure 42 shows the responses of the seventh pattern (easy to use + .) in the list.
Figure 42. Responses included in the easy to use + . pattern
You can see that most of the responses in the easy to use + . pattern are simple expressions such as easy to use and ease of use. It is simple to understand that such responses need to be included in the easy to use + . pattern because they don't say anything other than easy to use.
Now look at the complex response from ID 89 in the same Figure 42. It says that touchscreens are reasonable in accuracy and ease of use. Recall that the same response ID 89 was also included in the touchscreens + <Positive> pattern (the only response in the touchscreens + reasonable combination that you saw earlier in Figure 39).
This example shows that SPSS Text Analytics for Surveys makes simple concept patterns. When it finds touchscreen is cool or touchscreens are easy to use, it assigns them to the touchscreens + cool or touchscreens + easy to use combination. Similarly, when it finds touchscreens are reasonable, it assigns the response to the touchscreens + reasonable combination.
However, when it assigns the touchscreens are reasonable response, SPSS Text Analytics for Surveys also considers that the response contains more text (... in accuracy and ease of use) and that the additional text contains two concepts, accurate and easy to use. Therefore, it also assigns the same response to the accurate + . and easy to use + . patterns.
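Because one response can belong to several patterns, pattern counts cannot simply be added up to get a number of respondents. The following Python sketch makes the point with response 89, which the walkthrough places in the touchscreens + <Positive> pattern, an accuracy pattern, and the easy to use + . pattern of Figure 42. The second response (ID 12) and the build_index function are hypothetical additions for illustration only.

```python
from collections import defaultdict

# (response ID, pattern) pairs. Response 89 follows the walkthrough;
# response 12 is a hypothetical second response added for contrast.
ASSIGNMENTS = [
    (89, "touchscreens + <Positive>"),
    (89, "accurate + ."),
    (89, "easy to use + ."),
    (12, "easy to use + ."),
]


def build_index(assignments):
    """Return a mapping of pattern -> set of response IDs assigned to it."""
    index = defaultdict(set)
    for response_id, pattern in assignments:
        index[pattern].add(response_id)
    return dict(index)


index = build_index(ASSIGNMENTS)

# Sum of per-pattern counts versus the number of distinct responses.
pattern_total = sum(len(ids) for ids in index.values())
distinct_responses = len(set().union(*index.values()))
```

Here the per-pattern counts add up to four although only two responses are involved, so each pattern count is best read on its own.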
Other patterns are not covered in this article. Download the trial version of SPSS Text Analytics for Surveys and explore all the concepts, types, and patterns that are available.
SPSS Text Analytics for Surveys builds analytic features, one on top of the other. The most basic feature is a concept, which collects synonyms of a word. The next feature is a type of word, in which SPSS Text Analytics for Surveys collects concepts that have something in common.
Concepts and types give you an idea of how survey respondents feel about a topic. Concepts that are extracted by SPSS Text Analytics for Surveys provide a good point to start the analysis of your survey data.
SPSS Text Analytics for Surveys combines concepts and types into concept patterns, a powerful way to proceed in your analysis. Concept patterns provide insight into the sentiments of responders who take part in a survey. With concept patterns, you can learn how many respondents like or dislike something that is discussed in the survey.
SPSS Text Analytics for Surveys provides powerful analytic capability that is based on linguistic resources that come bundled with SPSS Text Analytics for Surveys. With SPSS Text Analytics for Surveys, you also can build your own domain-specific resources.
You can start your analysis with default resources of SPSS Text Analytics for Surveys. It is a good idea to make a note of domain-specific points while you explore results based on built-in functionality of SPSS Text Analytics for Surveys. Later you can use the points to configure your own domain-specific linguistics.
Part 2 explores more features of SPSS Text Analytics for Surveys, explains how to configure your own linguistic resources, and describes how far domain-specific resources go in fine-tuning the basic analytical conclusions that are made here.
Sample survey data for this article: SampleSurveyData.zip (16KB)
- Learn about all features of SPSS Text Analytics for Surveys by reading its official user guide on the IBM website.
- See this demonstration about IBM SPSS Text Analytics for Surveys.
- The Lousy Linguist posted about his findings after he used a trial version of IBM SPSS Text Analytics for Surveys.
- Ashley Richards wrote a review of IBM SPSS Text Analytics for Surveys.
- Read Using Text Mining Techniques to Analyze Students’ Written Responses to a Teacher Leadership Dilemma (Yuejin Xu and Noah Reynolds). The paper describes a research project that used IBM SPSS Text Analytics for Surveys to analyze survey responses.
Get products and technologies
- Download IBM SPSS Text Analytics for Surveys.
- Try Natural Language Toolkit, which is an open source platform to work with projects that involve linguistics.