Watson Tone Analyzer: 7 new tones to help understand how your customers are feeling
- Knowing whether a customer is frustrated or satisfied with an interaction is a must-have for Contact Center reps and managers assessing customer satisfaction.
- The new endpoint was trained on customer support conversations on Twitter, and the tones included are frustrated, sad, satisfied, excited, polite, impolite and sympathetic.
- The Tone Analyzer service detects the above-mentioned tones in both the customer’s and the agent’s text conversations.
We are pleased to announce the launch of a new Tone Analyzer endpoint trained for Customer Engagement scenarios. The new endpoint was trained on customer support conversations on Twitter, and the tones included are frustrated, sad, satisfied, excited, polite, impolite and sympathetic. Currently, the new endpoint is Beta functionality in the IBM Watson Tone Analyzer service. Given a textual conversation between a customer and an agent or company representative, the service detects the above-mentioned tones in both the customer’s and the agent’s text.
Why Did We Build the Tone Analyzer for Customer Engagement Endpoint?
Ever since we released the IBM Watson Tone Analyzer service, we have received feedback from clients that they would like to use it to analyze logs from contact centers, chatbots, and other customer support channels. We worked with clients to figure out what those tones should be and came up with an answer. It turns out that tones such as frustration, satisfaction, excitement, politeness, impoliteness, sadness and sympathy are important to detect when analyzing customer engagement data. You asked for it, so here you have it!
What can a Customer Service Manager do with these tones, you might ask. Knowing whether a customer is frustrated or satisfied with their interaction is a must-have for Contact Center managers to assess customer satisfaction. Of course, a majority of conversations start with frustrated customers. That is to be expected! It is the progression of tones throughout the conversation that is important to track. If the customer is still frustrated when the conversation ends, that is bad news. However, just knowing how the customer felt at the end of the call doesn’t tell the whole story. Was the customer frustrated, even at the end of the conversation, because the resolution given was not acceptable? Or was it because the agent did not show excitement when resolving the problem? Was the agent impolite, or not sympathetic enough to the situation the customer was in?
Tracking these tone signals can help Customer Service Managers improve how their teams interact with customers. Do the agents need more training in content or in communication style? Are there any patterns in the tones of successful agents? If so, what can be learned from it to replicate it more broadly? Are specific tones of agents indicative of how the conversation is likely to end?
We hope Customer Service Managers can now begin to use these tones to analyze their customer conversations by incorporating the results of this endpoint into their dashboards and analysis applications, thereby improving their customer engagement performance.
How Does It Work?
Given a set of customer support conversations and associated tones, we trained a machine learning model to predict tones for new customer support conversations. The model leverages several categories of features, including n-gram features, lexical features from various dictionaries, punctuation, and the existence of second-person references in the turn. We use a Support Vector Machine (SVM) as our machine learning model. In our data, we observed that about 30% of the samples have more than one associated tone, so we chose to solve a multi-label classification task rather than a multi-class one.
For each of our tones, we trained the model independently using the one-vs-rest paradigm. During prediction, we identify the tones predicted with at least 0.5 probability as the final tones. For several tones, our training data is heavily imbalanced; to address this, we find the optimal weight value of the cost function for each of these tones during training.
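The training setup described above can be sketched with scikit-learn. This is a minimal illustration, not the production pipeline: the toy utterances and labels are invented, the features are reduced to TF-IDF n-grams, and `class_weight="balanced"` stands in for the per-tone cost weights tuned during training.

```python
# Minimal one-vs-rest multi-label SVM sketch (illustrative data and features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy training data: an utterance can carry more than one tone.
texts = [
    "This is the third time I have called about this!",
    "Why has nobody answered my email yet?",
    "I am fed up with waiting on hold.",
    "Still no refund, this is ridiculous.",
    "Thank you so much, that fixed it perfectly.",
    "Great, everything works now, thanks!",
    "That solved my problem, much appreciated.",
    "Please bear with us while we look into it.",
    "Could you kindly share your order number?",
    "We will sort this out for you right away.",
]
labels = [
    ["frustrated"], ["frustrated"], ["frustrated"], ["frustrated"],
    ["satisfied", "polite"], ["satisfied", "polite"], ["satisfied", "polite"],
    ["polite"], ["polite"], ["polite"],
]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)          # one binary column per tone

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# One SVM per tone; "balanced" class weights compensate for imbalanced tones.
clf = OneVsRestClassifier(SVC(probability=True, class_weight="balanced"))
clf.fit(X, y)

# Keep every tone whose predicted probability is at least 0.5.
probs = clf.predict_proba(vectorizer.transform(["This is useless, I give up."]))
predicted = [tone for tone, p in zip(binarizer.classes_, probs[0]) if p >= 0.5]
```

With so little toy data the probabilities are unreliable; the point is the shape of the setup: independent binary classifiers, one per tone, thresholded at 0.5.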
Input and output
Input to the Tone Analyzer for Customer Engagement API endpoint is either a single piece of text reflecting a single statement, a set of statements, or a conversation delimited by newlines. For each given input, the endpoint produces a confidence score for each predicted tone, taken from the following set of 7 tones: Frustration, Satisfaction, Excitement, Politeness, Impoliteness, Sadness and Sympathy. The API returns only tones with a confidence score higher than 0.5.
Given text: “Please have patience. We will work on your problem, and hopefully find a solution.”
Output Tone: [Polite: 0.90, Sympathetic: 0.76].
Based on the scores, we can infer that the input text expresses “Politeness” and “Sympathy” with 90% and 76% confidence, respectively.
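As a rough sketch of how an application might prepare input for the endpoint and consume its output, the snippet below splits a newline-delimited conversation into utterances and keeps only tones scoring at least 0.5. The JSON field names (`utterances`, `text`, `user`, `tone_name`, `score`), the alternating-speaker assumption, and the sample response are illustrative, not the official API schema.

```python
def build_payload(conversation: str) -> dict:
    """Turn a newline-delimited conversation into a list of utterances,
    assuming turns alternate between customer and agent."""
    utterances = []
    for i, line in enumerate(conversation.strip().splitlines()):
        speaker = "customer" if i % 2 == 0 else "agent"
        utterances.append({"text": line, "user": speaker})
    return {"utterances": utterances}

def confident_tones(scored_tones, threshold=0.5):
    """Keep only tones whose confidence clears the 0.5 cutoff,
    mirroring what the API itself returns."""
    return {t["tone_name"]: t["score"]
            for t in scored_tones if t["score"] >= threshold}

payload = build_payload(
    "Please have patience. We will work on your problem, and hopefully find a solution."
)

# Hypothetical per-utterance scores shaped like the example in the text.
sample_scores = [{"tone_name": "Polite", "score": 0.90},
                 {"tone_name": "Sympathetic", "score": 0.76},
                 {"tone_name": "Frustrated", "score": 0.12}]
print(confident_tones(sample_scores))  # {'Polite': 0.9, 'Sympathetic': 0.76}
```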
See below an example of an entire customer-care conversation and predicted tone(s) with confidence score higher than 0.5 for each statement.
Customer: I know it snowed in Maryland, why aren’t you delivering our range today?
Predicted tone(s): Sadness: 0.59
Agent: Did you receive notification?
Predicted tone(s): Sympathy: 0.83, Politeness: 0.74
Predicted tone(s): No Associated Tone
Agent: I understand and I apologize about any disappointment.
Predicted tone(s): Politeness: 0.99, Sympathy: 0.70
Customer: Can you tell me when my package will arrive?
Predicted tone(s): Frustration: 0.78
Agent: Please give me the tracking number.
Predicted tone(s): Politeness: 0.83
Customer: Here is my tracking #.
Predicted tone(s): No Associated Tone
Agent: Your package will arrive today!
Predicted tone(s): Excitement: 0.89, Politeness: 0.84, Satisfaction: 0.78
Customer: Thanks a lot.
Predicted tone(s): Satisfaction: 0.85
How did we select these tones for the Tone Analyzer for Customer Engagement Endpoint?
We first conducted a study to identify tone categories that are important in the Customer Engagement domain.
Step 1. A set of 53 tones was selected from tone attributes listed in three resources: 1) tone dimensions used in marketing; 2) tone dimensions used to describe writing styles; 3) emotion and personality scales from psychology.
Step 2. We asked crowd workers on CrowdFlower to rate the extent to which the 53 tone attributes described a specific utterance in 1K customer care conversations. To simplify the rating task in the context of crowdsourcing, the 53 tones were divided into 4 subsets; an annotator only needed to rate one subset of the tones, and the ratings of all the tones were aggregated.
Step 3. We performed factor analysis on the 53 x 53 correlation matrix and found at least 7 significant factors (dimensions). The factors were named to best represent the concepts subsumed in each dimension. After these three steps, 7 important tone dimensions were identified in the customer care domain: frustration, satisfaction, excitement, politeness, impoliteness, sadness and sympathy.
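To make Step 3 concrete, here is a small sketch of counting significant factors from a correlation matrix. The data is synthetic (10 attributes rather than 53), and the Kaiser criterion (keep factors with eigenvalue greater than 1) is one common choice; the article does not specify which criterion was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in ratings: 200 utterances rated on 10 tone attributes
# (the real study used ~96K utterances and 53 attributes).
ratings = rng.normal(size=(200, 10))
# Inject shared structure so some attributes correlate, as related
# tone attributes would in real ratings.
ratings[:, 1] += ratings[:, 0]
ratings[:, 3] += ratings[:, 2]

corr = np.corrcoef(ratings, rowvar=False)   # 10 x 10 correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)      # eigenvalues in ascending order
n_factors = int((eigenvalues > 1.0).sum())  # Kaiser criterion
```

Each retained factor is then named after the concepts its attributes share, which is how the 7 dimensions above were labeled.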
How did we collect ground truth data?
We used Twitter customer support forums as the data source for collecting conversational data for Customer Care domain tone analysis. Many companies now use Twitter as a channel for providing support to their customers. Agents are hired and trained to monitor tweets with company mentions or direct help requests and to provide fast support addressing the customer’s needs. The number of turns (back-and-forths between customer and agent) ranges anywhere from one to ten or more. To collect as many conversations as possible, we selected 62 brands with dedicated customer service accounts on Twitter, intentionally chosen to cover a large variety of industries, geographical locations and organization scales. Overall, 2.6M user requests were collected between Jun. 1 and Aug. 1, 2016.
We additionally pre-processed the collected dataset, keeping only conversations that received at least one reply and involved exactly one customer and one agent. All non-English conversations, and conversations containing requests or answers with images, were also removed. To preserve users’ and companies’ privacy, we replaced every @mention that appeared in a conversation with @customer, @[industry]_company (e.g. @ecommerce_company, @telecommunication_company), @competitor_company, or @other_users. We selected approximately 96K conversational utterances to be annotated by crowd workers.
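The @mention anonymization step might look like the following sketch. The handle-to-role mapping and the handle names are made up for illustration; the real pipeline would be driven by the brand and competitor lists described above.

```python
import re

# Hypothetical mapping from known Twitter handles to role placeholders.
ROLE_BY_HANDLE = {
    "@acme_support": "@ecommerce_company",
    "@rivalcorp": "@competitor_company",
    "@jane_doe": "@customer",
}

def anonymize(text: str) -> str:
    """Replace every @mention with its role placeholder, defaulting to
    @other_users for handles with no known mapping."""
    return re.sub(
        r"@\w+",
        lambda m: ROLE_BY_HANDLE.get(m.group(0).lower(), "@other_users"),
        text,
    )

print(anonymize("@acme_support my order from @jane_doe never arrived, @rivalcorp delivers faster"))
# -> "@ecommerce_company my order from @customer never arrived, @competitor_company delivers faster"
```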
We used the CrowdFlower (https://www.crowdflower.com/) platform for the crowd annotation tasks. To give workers a better understanding of the annotation context, we asked them to annotate at the conversation level, labeling all utterances involved in a conversation. Because some of the tones we proposed are highly subjective, annotators may perceive them inconsistently. To address this, we asked workers to indicate on a 4-point Likert scale, ranging from “Very strongly” to “Not at all”, how strongly they felt each proposed tone was demonstrated in the utterance. Advantages of a Likert scale over a binary yes/no include: 1. higher tolerance for diverse perceptions; 2. less sparse labels.
An example of our annotation interface is shown in Figure 1. Each conversation was labeled by 5 different annotators. We restricted annotators to those located in the U.S. with acceptable levels of performance on previous annotation tasks. We also required them to answer 5 training questions correctly before proceeding to the real tasks; the training questions give annotators a sense of what each tone means in this customer service context. Validation questions were also embedded in the real tasks to further validate label quality.
Once we received labels for each utterance from all 5 workers, we took the average of the 5 scores as the final label. Finally, we set a score of 1 as the threshold for turning the continuous labels back into binary ones, based on our observations of the label distributions and of how model performance changed as we experimented with the threshold value.
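The aggregation rule above can be sketched in a few lines. The numeric encoding of the 4-point scale (0 for “Not at all” up to 3 for “Very strongly”) and the use of a non-strict comparison at the threshold are assumptions, since the article only names the endpoints of the scale.

```python
def aggregate(scores, threshold=1.0):
    """Average the five annotators' Likert scores for one (utterance, tone)
    pair and binarize at the threshold (assumed inclusive)."""
    mean = sum(scores) / len(scores)
    return 1 if mean >= threshold else 0

# Five annotators rate "Frustration" for two different utterances.
print(aggregate([2, 1, 0, 3, 1]))  # mean 1.4 -> 1 (tone present)
print(aggregate([0, 0, 1, 0, 0]))  # mean 0.2 -> 0 (tone absent)
```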