Watson Developer Cloud

The science behind the service

A well-accepted theory in psychology, marketing, and other fields is that human language reflects personality, thinking style, social connections, and emotional states. The frequency with which we use certain categories of words can provide clues to these characteristics. Several researchers have found that variations in word usage in writings such as blogs, essays, and tweets can predict aspects of personality (Fast & Funder, 2008; Gill et al., 2009; Golbeck et al., 2011; Hirsh & Peterson, 2009; and Yarkoni, 2010).

Most of these prior works used the Linguistic Inquiry and Word Count (LIWC) psycholinguistics dictionary to find psychologically meaningful word categories from word usage in writings (Pennebaker et al., 2001; Pennebaker et al., 2007; and Tausczik & Pennebaker, 2010). Used widely in text analytics, LIWC defines 68 different categories, each of which contains several dozen to hundreds of words; example categories include articles, pronouns, words that represent positive emotions, social words, work-related words, and so on.

IBM conducted a set of studies to understand whether personality characteristics inferred from social media data can predict people's behavior and preferences. IBM found that people with specific personality characteristics responded and retweeted at higher rates in information collection and spreading tasks. For example, people who score high on excitement-seeking are more likely to respond, while those who score high on cautiousness are less likely to respond (Mahmud et al., 2013). Similarly, people who score high on modesty, openness, and friendliness are more likely to spread information (Lee et al., 2014).

IBM also found that people with high openness and low emotional range (neuroticism) as inferred from social media language responded more favorably (for example, by clicking an advertisement link or following an account), results that have been corroborated with survey-based, ground-truth checking. For example, targeting the top 10 percent of users in terms of high openness and low emotional range resulted in increases in click rate from 6.8 percent to 11.3 percent and in follow rate from 4.7 percent to 8.8 percent.

Multiple recent studies reported similar results for characteristics computed from social media data. One recent study with retail store data found that people who score high in orderliness, self-discipline, and cautiousness and low in immoderation are 40 percent more likely to respond to coupons than a random sample of the population. A second study found that people with specific Values showed specific reading interests (Hsieh et al., 2014). For example, people with a higher self-transcendence value demonstrated an interest in reading articles about the environment, and people with a higher self-enhancement value showed an interest in reading articles about work. A third study of more than 600 Twitter users found that a person's personality characteristics can predict their brand preference with 65 percent accuracy.

The following sections expand upon these high-level findings to provide much more detail about the research and development behind the Personality Insights service. For more information about studies that apply the service to tangible scenarios, see The service in action.

Understanding the personality models

For the Personality Insights service, IBM developed models to infer scores for Big Five dimensions and facets, Needs, and Values from textual information. The models reported by the service are based on research in the fields of psychology, psycholinguistics, and marketing; an illustrative sketch of the trait structure they cover follows this list:

  • Big Five is one of the best studied of the personality models developed by psychologists (Costa & McCrae, 1992, and Norman, 1963). It is the most widely used personality model to describe how a person generally engages with the world. The service computes the five dimensions and thirty facets of the model that were described earlier in this documentation. The dimensions are often referred to by the mnemonic OCEAN, where O stands for Openness, C for Conscientiousness, E for Extraversion, A for Agreeableness, and N for Neuroticism. Because the term Neuroticism can have a specific clinical meaning, the service presents those insights under the more generally applicable heading Emotional Range.

  • Needs are an important aspect of human behavior. Research literature suggests that several types of human needs are universal and directly influence consumer behavior (Kotler & Armstrong, 2013, and Ford, 2005). The twelve categories of needs that are reported by the service are described in marketing literature as desires that a person hopes to fulfill when considering a product or service.

  • Values convey what is most important to an individual. They are "desirable, trans-situational goals, varying in importance, that serve as guiding principles in people's lives" (Schwartz, 2006). Schwartz summarizes five features that are common to all values: (1) values are beliefs; (2) values are a motivational construct; (3) values transcend specific actions and situations; (4) values guide the selection or evaluation of actions, policies, people, and events; and (5) values vary by relative importance and can be ranked accordingly. The service computes the five basic human values proposed by Schwartz and validated in more than twenty countries (Schwartz, 1992).
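
As a brief, illustrative sketch of the trait structure these three models cover, the following Python snippet lists the Big Five dimensions named above along with the counts of Needs and Values. The Values labels follow Schwartz's higher-order groupings and are an assumption here; none of this reflects the service's actual API schema.

```python
# Illustrative only: not the service's API schema. Facet and Needs names are
# omitted; the Values labels are assumed from Schwartz's higher-order groupings.
BIG_FIVE_DIMENSIONS = {          # the "OCEAN" dimensions, six facets each
    "Openness": 6,
    "Conscientiousness": 6,
    "Extraversion": 6,
    "Agreeableness": 6,
    "Emotional Range": 6,        # reported in place of the clinical term "Neuroticism"
}

NEEDS_CATEGORY_COUNT = 12        # twelve marketing-oriented need categories

VALUES = [                       # five basic values proposed by Schwartz (assumed labels)
    "Self-transcendence",
    "Conservation",
    "Hedonism",
    "Self-enhancement",
    "Openness to change",
]
```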

How the personality models were developed

IBM trained and calibrated the Big Five, Needs, and Values models that are used by the Personality Insights service against specific online media:

  • The model for Big Five personality characteristics was learned from blogs (Yarkoni, 2010).

  • The model for Needs was learned from Twitter.

  • The model for Values was learned from forum posts (Chen et al., 2014).

To understand how much text is needed to infer an author's personality characteristics, IBM conducted measurements by using different data sources, such as Twitter, email, blogs, forums, and wikis. IBM found that different characteristics and different online media converge at somewhat different rates. IBM expects each model to work best with input text drawn from the medium from which it was trained. While you can submit text that is written in different media, the service does not yield the same level of accuracy across all media.

IBM has not done experiments to validate whether analyzing text combined from different online media produces reliable results. IBM also is not aware of any research where personality or other characteristics are inferred from combining text that is written in different media. IBM therefore does not currently recommend combining text from multiple media sources. This recommendation might be rescinded in the future if research validates that combining text from different sources still produces reliable results.

How personality characteristics are inferred

To infer personality characteristics from textual information, the Personality Insights service tokenizes the input text and matches the tokens against the LIWC psycholinguistics dictionary to compute a score for each dictionary category. The matched words are often self-reflective, such as words about work, family, friends, health, money, feelings, achievement, and positive and negative emotions. Text that is ideal for personality inference contains such self-reflective words rather than merely factual statements. Proper nouns such as the names of people and places do not contribute to personality inference.
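
As a rough sketch of this matching step, the following Python snippet tokenizes input text and computes each category's share of matched tokens against a toy dictionary. The category names and word lists here are made up for illustration and are far smaller than the proprietary LIWC dictionary the service actually uses.

```python
import re
from collections import Counter

# Toy stand-in for a psycholinguistics dictionary; the real LIWC categories
# and word lists are proprietary and far larger than this example.
TOY_DICTIONARY = {
    "positive_emotion": {"happy", "love", "great", "proud"},
    "work": {"job", "project", "deadline", "meeting"},
    "family": {"mother", "father", "sister", "kids"},
}

def category_scores(text: str) -> dict:
    """Return each category's share of matched tokens in the input text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in TOY_DICTIONARY}
    counts = Counter()
    for token in tokens:
        for name, words in TOY_DICTIONARY.items():
            if token in words:
                counts[name] += 1
    return {name: counts[name] / len(tokens) for name in TOY_DICTIONARY}

print(category_scores("I love my job, but the project deadline makes me less happy."))
```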

The service uses a weighted combination approach to derive characteristic scores from LIWC category scores. The weights are the coefficients that relate LIWC category scores to individual characteristics; a minimal sketch of this weighted combination appears after the following list.

  • To infer Big Five characteristics, the service uses the coefficients that are reported by Yarkoni (2010). Yarkoni derived the coefficients by comparing personality scores that were obtained from surveys to LIWC category scores that were obtained from text written by more than 500 individuals.

  • To infer Needs, the service uses coefficients between Needs and LIWC category scores. The coefficients were derived by comparing Needs scores that were obtained from surveys to LIWC category scores that were obtained from text written by more than 350 users.

  • To infer Values, the service uses coefficients between Values and LIWC category scores. The coefficients were derived by comparing Values scores that were obtained from surveys to LIWC category scores that were obtained from text written by more than 800 individuals (Chen et al., 2014).
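
The following minimal sketch shows what such a weighted combination might look like in Python. The category names and coefficient values are hypothetical and merely stand in for the coefficients derived in the studies cited above.

```python
# Hypothetical coefficients linking category scores to a single characteristic.
# The real models use coefficients derived in the studies cited above.
OPENNESS_WEIGHTS = {
    "positive_emotion": 0.12,
    "work": -0.05,
    "family": -0.08,
}

def raw_characteristic_score(category_scores, weights, bias=0.0):
    """Weighted combination of LIWC-style category scores for one characteristic."""
    return bias + sum(weight * category_scores.get(name, 0.0)
                      for name, weight in weights.items())

example_scores = {"positive_emotion": 0.04, "work": 0.02, "family": 0.01}
print(raw_characteristic_score(example_scores, OPENNESS_WEIGHTS))
```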

IBM developed the models for all supported languages in an identical way, by first developing models for English and then augmenting the approach to develop models for the other languages. IBM translated its English-language surveys into each language and then conducted user surveys to collect ground-truth data. IBM then gathered the users' tweets and computed LIWC category scores from them. This effort established coefficients between LIWC category scores and the Big Five, Needs, and Values scores that were obtained from the surveys.
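
A minimal sketch of how coefficients of this kind could be estimated from ground-truth data follows, assuming a matrix of per-user category scores and matching survey-based scores (both synthetic here); the derivation used in the cited studies may differ in its details.

```python
import numpy as np

# Synthetic data: rows are users, columns are LIWC-style category scores,
# and y holds the matching survey-based scores for one characteristic.
rng = np.random.default_rng(0)
X = rng.random((100, 3))                    # 100 users x 3 category scores
true_w = np.array([0.6, -0.3, 0.1])
y = X @ true_w + 0.05 * rng.standard_normal(100)

# A least-squares fit recovers one coefficient per category (plus an intercept).
X_with_bias = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
print(coeffs)  # approximately [0.6, -0.3, 0.1, intercept]
```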

IBM developed the Personality Insights models independent of user demographics such as age, gender, or culture. However, IBM may in the future develop models that are specific to different age groups, genders, and cultures.

How media influence inferred characteristics

IBM conducted a validation study to understand the effect of media on inferred characteristics. To determine the accuracy of the service's approach to estimating characteristics, IBM compared scores that were derived by its models with survey-based scores for Twitter users (for instance, approximately 500 users for English and more than 600 for Spanish).

To establish ground truth, participants took three sets of standard psychometric tests: 50-item Big Five (derived from the International Personality Item Pool, or IPIP), 52-item fundamental Needs (developed by IBM), and 26-item basic Values (developed by Schwartz). IBM conducted the study in two phases:

  • For the first study, conducted in 2013, IBM recruited 256 Twitter users (Gou et al., 2014). Although the models were learned from different sources, IBM found that for more than 80 percent of the Twitter users, inferred scores for all three models correlated significantly with survey-based scores (p-value < 0.05 and correlation coefficients between 0.05 and 0.80). Specifically, scores that were derived by the service correlated with survey-based scores as follows (a sketch of this per-user correlation check appears after this list):

    • For 80.8 percent of participants' Big Five scores (p-value < 0.05 and correlation coefficients between 0.05 and 0.75)

    • For 86.6 percent of participants' Needs scores (p-value < 0.05 and correlation coefficients between 0.05 and 0.80)

    • For 98.21 percent of participants' Values scores (p-value < 0.05 and correlation coefficients between 0.05 and 0.55)

  • For the second study, conducted in 2015, IBM recruited another set of 237 Twitter users. The study found that for Big Five and Values, inferred scores correlated significantly with survey-based scores (p-value < 0.05 and correlation coefficient between 0.07 and 0.21) for every Twitter user. For Needs, such significant correlation was observed for 90 percent of the users (p-value < 0.05 and correlation coefficient between 0.01 and 0.20).
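
As a rough illustration of the per-user comparison described in this list, the following sketch correlates one user's inferred trait scores with the same user's survey-based scores (both synthetic here) and reports the correlation coefficient and p-value.

```python
from scipy.stats import pearsonr

# Synthetic example: one user's scores for a handful of traits,
# once inferred from text and once obtained from a survey.
inferred = [0.62, 0.41, 0.77, 0.35, 0.58, 0.49]
surveyed = [0.60, 0.45, 0.70, 0.30, 0.66, 0.52]

r, p_value = pearsonr(inferred, surveyed)
print(f"correlation coefficient = {r:.2f}, p-value = {p_value:.3f}")
```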

In both studies, participants also rated on a five-point scale how well each derived characteristic matched their perceptions of themselves. Their ratings suggest that the inferred characteristics largely matched their self-perceptions. Specifically, means of all ratings were above 3 ("somewhat") out of 5 ("perfect").

  • For the 256 Twitter users of the first study, means were 3.4 (with a standard deviation of 1.14) for Big Five, 3.39 (with a standard deviation of 1.34) for Needs, and 3.13 (with a standard deviation of 1.17) for Values.

  • For the 237 Twitter users of the second study, means were 3.31 (with a standard deviation of 1.18) for Big Five, 3.37 (with a standard deviation of 1.22) for Needs, and 3.36 (with a standard deviation of 1.18) for Values.

Notes about personality surveys

When developing the Personality Insights service, IBM relied on personality surveys to establish ground-truth data for personality inference. Ground truth refers to the factual data obtained through direct observation rather than through inference. A typical measure of accuracy for any machine-learning model is to compare the scores inferred by the model with ground-truth data; the previous sections describe how IBM used surveys to validate the accuracy of the Personality Insights service.

The following notes clarify the use of personality surveys and survey-based personality estimation:

  • Personality surveys are long and time-consuming to complete. The results are therefore constrained by the number of Twitter users who were willing and available to participate in the study. IBM plans to conduct validation studies with more users, as well as with users of other online media such as email, blogs, and forums.

  • Survey-based personality estimation is based on self-reporting, which might not always be a true reflection of one's personality: Some users might give noisy answers to such surveys. To reduce the noise, IBM filtered survey responses by including attention-checking questions and by discarding surveys that were completed too quickly.

  • While inferred scores correlated positively and significantly with survey-based scores for approximately 80 percent of users, the results imply that inferred scores might not always correlate with survey-based results. Researchers from outside of IBM have also conducted experiments to compare how well inferred scores match those obtained from surveys, and none reported a fully consistent match.

In general, it is widely accepted in research literature that self-reported scores from personality surveys do not always fully match scores that are inferred from text. What is more important, however, is that IBM found that characteristics inferred from text can reliably predict a variety of real-world behavior.