It is important for leaders everywhere to be aware of whether people think they are doing a good job or not. Until recently, city leaders often read the local newspapers and watched the local news broadcasts to gauge public opinion.
As a result of the popularity of social media sites, many leaders now feel they must monitor the contents of social media too. Unfortunately, it is difficult to keep abreast of the volume of content without some automated assistance.
Thankfully, improvements in the science of automated sentiment analysis can help us gauge the sentiments that are expressed through many sources. However, these tools are only of help if they are properly configured, and such configuration is not a trivial task.
In this article, we explain in detail how city leaders can configure automated sentiment analysis for social media sites for the topics of interest to them by using the recently released Sentiment Analysis dashboard, a part of the IBM Intelligent Operations Center. The dashboard uses IBM Cognos Consumer Insight as its underlying analysis engine. However, much of the advice in this article is equally applicable for any automated sentiment analysis tool.
Naturally, each city has its own circumstances. To illustrate the general approach, we describe how to tackle this problem for an imaginary city named "Ambiguoville". The leaders of Ambiguoville are concerned that their citizens are not as happy as their northern neighbors in Happyville, and they fear that unless they take prompt action they might see demonstrations on the streets like those that happened in Sadville, their neighbor to the south.
To plan the deployment of any sentiment analysis tool, it is important to know what you want to track sentiment about. We met with Mayor Unsure and other city leaders, such as the chief of police and the head of the media relations department. This meeting helped us understand the major themes about which they wanted to track sentiment.
Before any city constructs a list of subjects that they want to track sentiment about, they must decide whether they want to track content about the city or written by people who live in the city. In most cases, they are interested in content about the city. For example, the leaders are interested to learn what people who live elsewhere write when they have a negative opinion about the crime rate in Ambiguoville, but they are not interested to track the sentiment that the residents of Ambiguoville are expressing about the crime rate in other cities.
In practice, it can be tricky to determine the location that is associated with social media:
- Posts to social media sites are often short and they might not contain a direct reference to which city is being discussed. However, if we know the user's home location, it can sometimes be inferred.
- For privacy reasons, some users might be reluctant to declare their home location when they register for accounts on social media sites. But we can guess their home city if their posts frequently mention locations within the city.
In the case of Ambiguoville, we used the election manifesto that the mayor used during his election campaign two years ago as our starting point for constructing a list of important topics. Even though his formal re-election campaign had not started, it was likely that the general themes would remain unchanged. In addition, we found a large overlap with the list of topics that the media relations team was tracking manually in conventional news outlets.
Here is a list of the topic areas we developed, based on our research:
Despite police force efforts over the years, crime continues to be a major concern for the citizens and the city leaders. The city implemented advanced analytics that allow them to monitor the trends. Although the actual crime statistics are trending downward, the leaders feel that worry among the citizens about safety might be on the increase.
Although the city recently deployed a network of hovercraft that allows commuters to rapidly fly around the city, the number of commuters who use it is less than expected. In addition, there are continued problems with parking. The "letters to the editor" section in the local newspaper is full of complaint letters about the lack of parking spaces and the heavy-handed enforcement of regulations by traffic wardens. The city recently introduced a new charging mechanism for parking and there is mixed opinion on whether it is an improvement on the previous system.
The mayor feels that Ambiguoville could easily become a major destination for weekend visitors. Because the city invested in upgrading several of their museums and other attractions, they want to track what people think of these attractions.
Naturally, the city leaders want to know what the citizens think about them. The existing media tracking service that they use tracks all mentions of the mayor and other leaders in his HighTech party. In addition, they track mentions of the leading members of the two main opposition parties, the Luddites and the Doomsday party. The chief of police is not officially a politician, but he has a high personal profile in the media, so he also likes to track mentions of his name.
Each of these topic areas might be a good candidate to be configured as themes in the Intelligent Operations Center sentiment dashboard. However, it might make sense to split the single topic "Leaders" into the two themes "Government" and "Opposition" because it is possible that the sentiment on each of these topics might trend in different directions.
The evolving topics feature of Cognos Consumer Insight allows you to discover the topics that are discussed within the documents you fetch, even those topics that are not included in your taxonomy.
For example, within the Government theme, the subthemes are:
- The HighTech Party
- Mayor Ultan Unsure
- Deputy Mayor Tom Doubtful
- Police Chief Mike Baton
Under Tourism the subthemes represent each of the main attractions within the city.
In a later section, we go into more detail about how to actually configure the Cognos Consumer Insight taxonomy that implements these themes. However, at this stage it is important to note that the names given to the themes and subthemes are reminders for the people who use the dashboard and do not have to be the actual search terms that are used. For example, the contents of the charts in the dashboard would not change if we replaced the name of the "Government" theme with "Goodies" and replaced the name of the "Opposition" theme with "Baddies".
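The separation between display names and search terms can be sketched in a few lines of Python. This is only an illustration, not the Cognos Consumer Insight data model: the `Theme` and `Subtheme` classes and every search term in the listing are hypothetical names that we made up for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Subtheme:
    label: str                 # display name shown to dashboard users
    include_terms: list[str]   # hypothetical search terms (regular expressions)

    def matches(self, snippet: str) -> bool:
        # Matching depends only on the search terms, never on the label.
        return any(re.search(rf"\b(?:{term})\b", snippet, re.IGNORECASE)
                   for term in self.include_terms)


@dataclass
class Theme:
    label: str
    subthemes: list[Subtheme] = field(default_factory=list)

    def matching_subthemes(self, snippet: str) -> list[str]:
        return [s.label for s in self.subthemes if s.matches(snippet)]


government = Theme("Government", [
    Subtheme("The HighTech Party", ["hightech party"]),
    Subtheme("Mayor Ultan Unsure", ["mayor unsure", "ultan unsure"]),
    Subtheme("Deputy Mayor Tom Doubtful", ["tom doubtful", "deputy mayor"]),
    Subtheme("Police Chief Mike Baton", ["mike baton", "chief baton"]),
])

snippet = "Mayor Unsure and Chief Baton opened the new museum wing."
before = government.matching_subthemes(snippet)

government.label = "Goodies"   # renaming the theme changes only the label
after = government.matching_subthemes(snippet)

print(before == after)
```

Because the label plays no part in matching, the same snippets are selected before and after the rename.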
The preceding guidance is relevant regardless of which sentiment analysis tools you use. Now we cover the practical details of how to use the Cognos Consumer Insight tool and how to connect it to the Intelligent Operations Center for Sentiment Analysis dashboard.
Figure 1 illustrates the architecture of our solution at a high level.
Figure 1. High-level view of the solution architecture
As Figure 1 illustrates, Cognos Consumer Insight does not fetch data directly from social media sites like Twitter and WordPress. Instead, it retrieves data indirectly through the BoardReader™ service, which makes it simple for Cognos Consumer Insight to support multiple websites. Although BoardReader covers a large array of websites, there are often sources of sentiment that BoardReader cannot access. To access such sources of information, you must use the Data Fetcher Development Kit (see Resources) to create a custom crawler.
The public website for Ambiguoville contains a feedback page where anyone can leave comments to tell the city leaders what they think about the way that the city is being run. When this service was initially created the mayor intended to read all of the comments. However, the page was much more popular than expected and as a result the mayor is not able to read all of the feedback.
It is not possible for BoardReader to crawl the feedback because it is not made available on the Internet. However, the mayor can hire a contractor to develop a custom crawler that can extract the contents of the feedback from the private database where it is stored and then feed it to Cognos Consumer Insight where the sentiment expressed within the feedback is presented on the Intelligent Operations Center sentiment dashboard with the rest of the more conventional media sources. Additionally, it is possible to view an analysis of only this feedback because it is categorized as a distinct source.
When you configure your Cognos Consumer Insight data fetcher, you must first decide what query term or terms you use to fetch documents from BoardReader.
Consider the following factors before you select a suitable query string to use:
- Cognos Consumer Insight includes documents in the sentiment analysis screen only if they both match the search query and contain snippets that match topics in your taxonomy. It is best to specify a query that ensures that you fetch only those documents that reference your city and use the taxonomy to determine whether they pertain to the topics of interest to you.
- The query that you select determines how many documents are fetched from BoardReader. If you specify a very restrictive search query, you might not have many documents to match against your taxonomy and so you might not have a statistically significant sample of expressions of sentiment. Conversely, if you specify a loose query string you might fetch more documents than you intended, which might take more time to analyze and rapidly use up the credits on your BoardReader key.
- Documents are matched by Cognos Consumer Insight if your search term is in the document title even if it is not in the body of the text. A newspaper site normally includes the newspaper name in the title of each article. You might want your query to include content from the major newspapers and other media outlets for your city. If the name of the media outlet contains your city name, then the search term for your city catches it. If the outlet's name does not contain your city name, then include it in the search string.
Cognos Consumer Insight uses regular expressions that are not case-sensitive to apply filters. For example, to collect content about Ambiguoville, the query that produces the best results is: "ambig[u|o]+ville". Because there do not seem to be any other cities elsewhere in the world with the same name, we are confident that any document that contains the name of the city is really about it. In addition, it seems that many people misspell the city name as Ambiguville, Ambigoville, Ambigouville, or Ambigooville, and thus we want to include documents with any of these words.
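To check which spellings a pattern like this accepts, you can try it in any regular expression engine. The following minimal sketch uses Python's `re` module as a stand-in, on the assumption that the Cognos Consumer Insight dialect behaves the same way for this simple pattern. Inside a character class the `|` character is a literal rather than an "or", so the sketch uses the equivalent and slightly tidier class `[uo]`.

```python
import re

# [uo]+ accepts one or more of the letters "u" and "o" in any order.
# re.IGNORECASE mirrors the case-insensitive matching described above.
CITY = re.compile(r"ambig[uo]+ville", re.IGNORECASE)

samples = [
    "Ambiguoville", "Ambiguville", "Ambigoville",
    "Ambigouville", "Ambigooville",
    "Ambigville",   # no "u" or "o" after "ambig", so it is rejected
    "Happyville",
]

matched = [name for name in samples if CITY.search(name)]
print(matched)
```

The first five spellings match, while "Ambigville" and "Happyville" do not.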
We were lucky that the city name Ambiguoville is unique, but in practice many cities have the same name as other cities in different parts of the world. You can experiment with your search term to help ensure that you are capturing as much relevant content as possible without degrading the quality of your result by having too many false matches.
If your city shares a name with another city in another part of the world, craft your query carefully so that it produces matches about the right place. The way that you do this differs slightly depending on whether your city is the best known of the places that share the name.
- For example, if you want to track data about Boston, the capital city of Massachusetts, you might find that most of the hits for the search term "Boston" already relate to Boston, Massachusetts. You might want to modify your search query to exclude references to rock music because some of the hits refer to the band of the same name (or you might decide to keep this content because the band's home town is the city that you are interested in). You can probably safely ignore the content about the 30 or so other places with the same name because relatively little is written about them on the Internet.
- However, you might want to track sentiment about the town named Boston in Lincolnshire in England. To do so, you want to retrieve only documents that include a context word such as "England", "UK", or "Lincolnshire" in addition to the word "Boston" to ensure that the documents really pertain to this Boston.
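The context-word idea for the Lincolnshire case can be prototyped before you commit it to a query. The sketch below is plain Python, not Cognos Consumer Insight or BoardReader query syntax, and the function name `is_about_boston_lincolnshire` is our own invention for the illustration.

```python
import re

CITY = re.compile(r"\bboston\b", re.IGNORECASE)
CONTEXT = re.compile(r"\b(?:england|uk|lincolnshire)\b", re.IGNORECASE)


def is_about_boston_lincolnshire(text: str) -> bool:
    """True when the text names Boston and at least one context word."""
    return bool(CITY.search(text)) and bool(CONTEXT.search(text))


print(is_about_boston_lincolnshire("The market in Boston, Lincolnshire reopens."))
print(is_about_boston_lincolnshire("Boston fans celebrate another win."))
```

A filter like this trades recall for precision: a post about the English town that never mentions a context word is dropped, which is usually acceptable when the better-known namesake would otherwise dominate the results.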
When you experiment with a search term you can use Google or a similar web search engine to test your query. However, a web search is not the same as the crawling done by BoardReader. You can more accurately test your query by running an ad hoc search in Cognos Consumer Insight. If you do test, it is normally good enough to search content on the current date. If you execute a BoardReader query that goes back to include historical data it might take much longer and use up valuable BoardReader credits.
If you created a custom crawler, be aware that it might handle query terms slightly differently from the way BoardReader does.
The query terms that you specify in the Cognos Consumer Insight Administrative UI are passed to the custom crawler on the command line. However, your custom crawler has total freedom in how it processes this query. In many instances, it might make sense for your custom crawler to completely ignore the search query.
For example, in the case of the custom crawler that is written to parse the suggestions that are entered in the suggestion page on Ambiguoville's website, it is probably safe to assume that all of these documents relate to the city. Hence, it would not make sense to apply a search query to restrict which documents are passed on to the taxonomy classification step. In addition, parsing documents with the custom crawler does not involve calling the BoardReader service, so you do not have to worry about using up an excessive number of credits.
You can consult the Cognos Consumer Insight Administrator's guide (see Resources) to learn more about how Cognos Consumer Insight queries work.
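The following sketch shows the general shape of a crawler that accepts a query on the command line but deliberately ignores it. It does not use the Data Fetcher Development Kit, whose real interfaces and output format are defined in its own documentation. The `feedback` table, its columns, the JSON output, and the function names are all assumptions made for this illustration.

```python
import argparse
import json
import sqlite3


def fetch_feedback(conn: sqlite3.Connection) -> list[dict]:
    """Return every feedback entry; no query filter is applied."""
    rows = conn.execute(
        "SELECT id, submitted_at, body FROM feedback ORDER BY id"
    ).fetchall()
    return [{"id": r[0], "date": r[1], "text": r[2]} for r in rows]


def main(argv: list[str], conn: sqlite3.Connection) -> list[str]:
    parser = argparse.ArgumentParser(description="Hypothetical feedback crawler")
    parser.add_argument("query", nargs="?", default="",
                        help="query passed by the caller; deliberately ignored")
    parser.parse_args(argv)  # accept the query so that the call succeeds
    return [json.dumps(doc) for doc in fetch_feedback(conn)]


# Demo: an in-memory database stands in for the private feedback store.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE feedback (id INTEGER, submitted_at TEXT, body TEXT)")
demo.execute("INSERT INTO feedback VALUES (1, '2012-06-01', 'More parking please')")
for line in main(["ambig[u|o]+ville"], demo):
    print(line)
```

The demo entry is returned even though its text never mentions the city name, which is the behavior that you want for a source that is known to be entirely about your city.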
Cognos Consumer Insight differentiates between two major categories for analysis:
- Concept is a topic to search for in social media that is relevant for a specific use case. Cognos Consumer Insight extracts concepts from blogs, message boards, microblogs, news sites, and videos. A Type is a group of concepts. For example, the Mayor, his deputy, and the Police Chief each represent concepts that can be grouped into the Type "City Leaders". When your data transfers to the Sentiment Dashboard in the Intelligent Operations Center, the types show up as the top-level themes, while the concepts display as subthemes. When you construct your taxonomy, pay close attention to the definition of the concepts because the sentiment rating for a type is the sum of the sentiment ratings for the concepts that it contains.
- Hotword is an aspect that is relevant across concepts or that must be analyzed as a second dimension. Two examples are "waiting time", which is relevant for public transport, medical institutions, and many others, and "credibility", which can be analyzed across all "City Leaders" concepts. Also, some types, or concepts, can be hotwords to enable cross-analysis. For example, it might be useful to analyze how each city leader is perceived in relation to crime or parking.
The sentiment dashboard does not display the hotword analysis, so you might be tempted to omit hotwords from your configuration. However, Cognos Consumer Insight requires at least one hotword in your taxonomy for it to be valid, so you must add at least one. If you do decide to configure multiple hotwords (for example, because you intend to do additional analysis using the Cognos Consumer Insight analysis user interface), be aware that the order of the hotwords is significant. If a snippet contains more than one of the hotwords that you defined, Cognos Consumer Insight associates it with the first matching hotword only.
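The effect of hotword order can be illustrated with a small first-match loop. This is a sketch of the behavior described above, not the product's implementation, and the hotword patterns are hypothetical.

```python
import re

# Hypothetical hotwords, listed in the order they would appear in the taxonomy.
HOTWORDS = [
    ("waiting time", r"waiting time|queue"),
    ("credibility", r"credib\w*|trust\w*"),
]


def first_hotword(snippet: str, hotwords=HOTWORDS):
    """Return the label of the first hotword that matches, or None."""
    for label, pattern in hotwords:
        if re.search(rf"\b(?:{pattern})\b", snippet, re.IGNORECASE):
            return label
    return None


snippet = "I do not trust the published waiting time figures."
print(first_hotword(snippet))                            # taxonomy order
print(first_hotword(snippet, list(reversed(HOTWORDS))))  # reversed order
```

The snippet mentions both aspects, so the label it receives depends entirely on which hotword is listed first. If one hotword matters more for your analysis, place it higher in the list.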
Another important term is snippet. Because long documents might mention a number of different concepts or express different sentiments (for example a newspaper article on a controversial topic typically gives both sides of the debate), Cognos Consumer Insight breaks up documents into snippets that are a maximum of three sentences before it analyzes the concepts and sentiments. Some people are initially confused when they see the same document listed among both the positive and negative sentiment samples.
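To see why one document can appear under both positive and negative sentiment, it helps to picture the splitting step. The sketch below cuts a text into consecutive groups of at most three sentences. The naive sentence splitter and the non-overlapping grouping are our assumptions; the article does not describe how Cognos Consumer Insight actually draws snippet boundaries.

```python
import re


def split_into_snippets(document: str, max_sentences: int = 3) -> list[str]:
    """Split text into consecutive snippets of at most max_sentences sentences."""
    # Naive splitter: a sentence ends at ".", "!" or "?" followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]


article = (
    "The new parking charges are fair. Drivers pay only for the time they use. "
    "Many residents welcome the change. Others say the wardens are heavy-handed. "
    "They want the old system back."
)
for snippet in split_into_snippets(article):
    print(snippet)
```

The first snippet reads as positive and the second as negative, so the same article would be listed in both sets of samples.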
Concepts and hotwords are defined by using three parameters:
- Include terms
- Context terms
- Exclude terms
Building the taxonomy is an iterative process. For example, given the previous themes list, we can define a basic taxonomy for one concept as an example.
Include terms are mandatory and contain regular expressions for each concept or hotword. The simplest concept would contain just one include term. Starting generically, we choose "park" to get all snippets related to parking, and add some frequent verb forms like “parking” and “parked”. Because we chose our city name as the query, every snippet found ought to relate to Ambiguoville, and from these snippets our concept definition encompasses everything that contains the words "park", "parked" or "parking".
Table 1. Terms that we use in our first run.
|Include terms||Context terms||Exclude terms|
After our first run, we discover that the selection was too broad. Looking at the snippets, we see that many relate to Ambiguoville's Central Park or to provocative ads for "parked cash" at Sparkling Money Depot, Inc., the local bank of Ambiguoville. Thus, our next step is to exclude these results by adding "central park" and "sparkling" to the list of exclude terms.
Our next run is not successful either because it excluded the "Central Park-o-Meter" system and "Sparkling's Car Park", a large, new garage next to the fountain monument.
So, to limit results without getting too narrow, we choose to define context terms that relate to the topic. Context terms narrow the results by defining words that must be within the same snippet. We suggest that you use a broad set of context terms because at least one of the context terms must be within the snippet for it to be considered. In our example, we choose "car", "vehicle", "drive", and "park-?o-?meter" (so that it also matches spellings without the dashes). We modify the exclude terms to contain "sparkling money" and "bank".
"Central Park" is a tricky term because it is a substring of "Central Park-o-Meter". Queries match only full words; however, the dash (“-”) is considered a separator between words. To narrow the possible results, we rely on the context terms and add, as exclude terms, some common topics related to "Central Park" that we found in the snippets of the first run: "Big City Green" (nickname), "Grosh" (name of the fountain, which is a common meeting place), "picnic" and "badminton" (which is popular in Ambiguoville). We do not exclude "fountain" and "monument" because we might lose snippets that mention "Sparkling's Car Park". Also, "grass" and "trees" are not good terms to exclude because of an ongoing discussion about "parking space" versus "green space" that we do not want to miss.
Table 2. Terms that we add to our previous run.
|Include terms||Context terms||Exclude terms|
|big city green|
Our third run shows better results: almost all the snippets seem to relate to parking. However, we discover some additional terms that can help us further broaden the results, for example the name of the city's Secretary of Transport, Dominique Tourett, who is nicknamed "Detour" on several social media sites. We add "tourett" and "detour" to our context terms.
The "Arnold Schwarzenegger" monument in Central Park is included in many snippets about Central Park that are not related to our topic “parking”, and therefore we add "Schwarzenegger" as an exclude term.
Additionally, we add "pull in" as an include term and "auto" as a context term because we encountered snippets that contain these synonyms.
Table 3. Terms that we use in our final run
|Include terms||Context terms||Exclude terms|
|pull in||drive||big city green|
The result is compelling. Although some fuzziness remains, it is possible to get a good impression of the parking topic in Ambiguoville by using the definitions shown in Table 3. Even the ambiguous context term "auto" (which encompasses every word that starts with "auto-" including the dash, for example "auto-created") improved the results.
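The interplay of include, context, and exclude terms in this walkthrough can be approximated in a short Python sketch. The term lists are collected from the prose above, and the matching rules (whole-word matches with the dash treated as a separator, at least one context term required when any are defined) are our reading of the behavior described in this article. The real Cognos Consumer Insight engine might differ in detail, and the `Concept` class is a hypothetical name.

```python
import re
from dataclasses import dataclass, field


def _any_match(terms: list[str], snippet: str) -> bool:
    # \b approximates whole-word matching; a dash counts as a word separator.
    return any(re.search(rf"\b(?:{t})\b", snippet, re.IGNORECASE) for t in terms)


@dataclass
class Concept:
    include: list[str]
    context: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)

    def matches(self, snippet: str) -> bool:
        if not _any_match(self.include, snippet):
            return False                      # an include term is mandatory
        if self.context and not _any_match(self.context, snippet):
            return False                      # need at least one context term
        return not _any_match(self.exclude, snippet)


parking = Concept(
    include=["park", "parked", "parking", "pull in"],
    context=["car", "vehicle", "drive", "park-?o-?meter",
             "tourett", "detour", "auto"],
    exclude=["sparkling money", "bank", "big city green", "grosh",
             "picnic", "badminton", "schwarzenegger"],
)

tests = [
    "I parked my car next to the Central Park-o-Meter.",
    "We had a picnic in Central Park.",
    "Drive over and watch your parked cash grow at Sparkling Money Depot.",
    "Sparkling's Car Park was full again.",
]
for text in tests:
    print(parking.matches(text), "-", text)
```

A small harness like this is useful for replaying snippets from an earlier run against a revised term list before you start another fetch.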
To further improve sentiment analysis, it is possible to define positive and negative words to cover local idioms that might not be reflected in the standard dictionary.
For this example, we chose a tricky situation to show the possibility of a high degree of ambiguity. Most definitions are simpler than our example. In a real world example approximately 80 percent of the definitions contain include terms with no context terms or exclude terms, and yet they still provide sufficient precision for analysis.
Using the iterative approach can help you find the right concepts and hotwords. Experimenting with the Cognos Consumer Insight user interface, checking results, comparing them against common sense, and tuning them to be meaningful is an effective approach to configuring sentiment analysis. However, some basic technical knowledge (regular expressions) and local insight into the city's topics are required to help achieve significant results.
When you are happy with your taxonomy you might want to ensure that your users have access to the most current information by establishing a scheduled job in Cognos Consumer Insight.
To set a scheduled job up, start AdminApp on your Cognos Consumer Insight server, then click Data Fetcher > Schedule, as shown in Figure 2.
Figure 2. Establish a scheduled job in Cognos Consumer Insight
- Assign a name to your scheduled job. The name that you choose is stored in the log file so you might want to choose a name that reminds you of why you created this scheduled job. For example, we named our query "Ambiguoville Sentiment".
- Select a future date and time when the job is to run for the first time. Most users like to schedule their job to run during the night so that when their users come into work each morning they can view fresh data. However, it is important to note that the scheduled start time is specified in Coordinated Universal Time and not in the local time of the administrator or the server. (See Resources for more information about Coordinated Universal Time.)
- Specify how often you want the job to repeat. The repeat time is
specified as an integer number of days, which means that the job
cannot run more than one time per day and always runs at the same time
of day as the initial job.
It is impossible to be certain how much time it takes a job to complete since the number of documents can vary from day to day. However, you might want to ensure that each daily job is completed before the next daily job kicks off. If your initial job takes longer than about five hours, you might want to either change your query to retrieve fewer documents or allocate more powerful hardware to your Cognos Consumer Insight server. See Resources for a link to the Cognos Consumer Insight Installation and Configuration Guide that details how to select appropriate hardware for your Cognos Consumer Insight server.
- Specify how far back you want the fetcher to look (in the field entitled "Start Crawl From"). The longer the period that is covered by your fetcher, the more time it requires to fetch and analyze the documents. We suggest that you choose a previous date that is not too far back. Normally the subsequent daily fetches run faster than the initial fetch because they retrieve only the new documents that were published since the job last ran. After your job runs for a few months, you will have a considerable history of sentiment trends.
The evolving topics feature of Cognos Consumer Insight lets you discover the topics that are discussed within the documents you fetch, even those topics that are not included in your taxonomy. Although this feature is powerful, it is also resource hungry and can cause your fetch job to require more time to complete.
The results of the evolving topics phase are not presented on the sentiment dashboard of the Intelligent Operations Center, so you might choose not to enable this phase. If you don't enable this phase, you don't have to specify the number of topics and languages.
- Finally, specify which of the queries the fetcher is to use. For example, Figure 2 shows that we chose to use the search query described earlier.
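Because the start time in the steps above is entered in Coordinated Universal Time and the repeat interval is a whole number of days, it can help to work out the resulting run times before you save the job. The following sketch is a planning aid only and does not call any Cognos Consumer Insight interface. The fixed UTC-5 offset and the example date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed local zone for illustration: a fixed UTC-5 offset (no daylight saving).
LOCAL = timezone(timedelta(hours=-5))


def utc_schedule(first_run_local: datetime, repeat_days: int, runs: int) -> list[datetime]:
    """Return the first `runs` start times of the job, expressed in UTC."""
    if repeat_days < 1:
        raise ValueError("the repeat interval is a whole number of days, at least 1")
    first_utc = first_run_local.astimezone(timezone.utc)
    return [first_utc + timedelta(days=repeat_days * i) for i in range(runs)]


first = datetime(2030, 1, 15, 2, 0, tzinfo=LOCAL)  # 02:00 local time
for start in utc_schedule(first, repeat_days=1, runs=3):
    print(start.isoformat())
```

A job meant to start at 02:00 local time in this assumed zone must be entered as 07:00, and every later run starts at that same UTC time of day. In a location that observes daylight saving time, the local start time therefore shifts by an hour when the clocks change.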
The resulting scheduled job has three phases. Even when all the phases are completed, the data is not displayed in the Intelligent Operations Center dashboard until the data transfer job is completed. This data transfer job is controlled by the Intelligent Operations Center and runs to a different schedule. It is not resource-intensive (especially when there is no new sentiment data to be transferred) and typically completes within a few minutes. Therefore, most users schedule this job to run once per hour or even more frequently.
After setting up their system as described here, the leaders of Ambiguoville were able to monitor the trends in the sentiment that was expressed online about the themes that they care about. Figure 3 shows what their sentiment dashboard might look like.
Figure 3. Sentiment dashboard for Ambiguoville
Police Chief Baton has been selected in the dashboard so you can see details of some recent negative sentiment about him. For more information on this dashboard, consult the IBM Intelligent Operations Center for Sentiment Analysis documentation (see Resources).
Now you know how to configure the taxonomy of your Cognos Consumer Insight server so that city leaders can accurately capture the expressed sentiment about the themes of greatest interest to them. Although we described a fictitious city, the steps that are involved in developing a taxonomy for a real city would not be much different.
- Twitris site: Find interesting research done by the Kno.e.sis center at Wright State University, which is at the leading edge of social media
- Exploring the Potential of IBM Smarter Analytics Solutions: Get a good overview of the potential benefits of deploying analytics technology from this IBM Redbooks publication.
- IBM Cognos Consumer Insight product page: Learn more about this analytic application for measuring consumer sentiment on social media
- Cognos Consumer Insight 1.1.0 Product Documentation: Download release notes, installation and configuration guidance, and
- Fetcher Development Kit: Access detailed information, including the specifications and reference implementations, about this kit that allows you to integrate a new data source into IBM Cognos Consumer
- IBM Smarter Cities: Explore more about the IBM Intelligent Operations Center for Smarter Cities and how it can be used to synchronize and analyze efforts among sectors and agencies as they happen, giving decision makers consolidated information that helps them anticipate, rather than just react to, problems.
- IBM Intelligent Operations Center: See how to coordinate your city to deliver exceptional service to citizens. Learn about the features, benefits, system requirements, services, support, and more.
- IBM Intelligent Operation Center key performance indicators (KPIs) (developerWorks, August 2011): Read Allen Smith's two-part article series to learn how KPIs are modeled, implemented, and
- Coordinated Universal Time: Learn more from the Wikipedia
- Industries: Find more industry-specific technical resources for
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced
Get products and technologies
- IBM Smarter City Solutions on Cloud: IBM Intelligent Operations Center on IBM SmartCloud offers a straightforward, user-based subscription service at a single price that includes all costs, including hardware, software, maintenance, support, and networking.
- software: Download a trial version, work with product in an online, sandbox environment, or access it in the cloud.
- developerWorks profile: Create your profile today and set up a watchlist.
- community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
- Smarter Cities developer community: Join this developerWorks community that offers support for the IBM Smarter Cities offerings including the Intelligent Operations Center and the Integrated Information
Brian O'Donovan has a Ph.D. in Computer Science from Trinity College Dublin and over 30 years of IT experience. For the last 15 years he has worked for IBM and contributed to a number of different products within the Lotus family. A keen user of all social media tools (see http://brianodonovan.ie), Brian recently moved into the Smarter Cities group where he is driving the application of automated analysis tools to social media streams.
Thorsten Skalla is a Senior Architect at IBM who has worked in the IT industry for over 15 years as an IT specialist and IT architect in the infrastructure domain. For the last eight years he has worked closely with IBM clients in the Public and Insurance Sectors. In his current role as an Industry Advisor for the Public Sector, his primary responsibility is to transform the functional requirements of governmental entities into technical specifications of IBM and other leading IT industry products.