A well-accepted theory in psychology, marketing, and other fields holds that human language reflects personality, thinking style, social connections, and emotional states. The frequency with which we use certain categories of words can provide clues to these characteristics. Several researchers have found that variations in word usage in writings such as blogs, essays, and tweets can predict aspects of personality (Fast & Funder, 2008; Gill et al., 2009; Golbeck et al., 2011; Hirsh & Peterson, 2009; Yarkoni, 2010).
IBM conducted a set of studies to understand whether personality characteristics inferred from social media data can predict people's behavior and preferences. IBM found that people with specific personality characteristics responded and re-tweeted in higher numbers in information-collection and -spreading tasks. For example, people who score high on excitement-seeking are more likely to respond, while those who score high on cautiousness are less likely to do so (Mahmud et al., 2013). Similarly, people who score high on modesty, openness, and friendliness are more likely to spread information (Lee et al., 2014).
IBM also found that people with high openness and low emotional range (neuroticism) as inferred from social media language responded more favorably (for example, by clicking an advertisement link or following an account), results that have been corroborated with survey-based, ground-truth checking. For example, targeting the top 10 percent of users in terms of high openness and low emotional range resulted in increases in click rate from 6.8 percent to 11.3 percent and in follow rate from 4.7 percent to 8.8 percent.
Multiple recent studies reported similar results for characteristics that were computed from social media data. One recent study with retail store data found that people who score high in orderliness, self-discipline, and cautiousness and low in immoderation are 40 percent more likely to respond to coupons than a random sample of the population. A second study found that people with specific values showed specific reading interests (Hsieh et al., 2014). For example, people with a higher self-transcendence value demonstrated an interest in reading articles about the environment, and people with a higher self-enhancement value showed an interest in reading articles about work. A third study of more than 600 Twitter users found that a person's personality characteristics can predict their brand preference with 65 percent accuracy.
The following sections expand upon these high-level findings to describe the research and development behind the Personality Insights service. For more information about studies that apply the service to tangible scenarios, see The service in action.
For the Personality Insights service, IBM developed models to infer scores for Big Five dimensions and facets, Needs, and Values from textual information. The models reported by the service are based on research in the fields of psychology, psycholinguistics, and marketing.
The Personality Insights service infers personality characteristics from textual information based on an open-vocabulary approach. This method reflects the latest trend in research about personality inference (Schwartz et al., 2013; Plank & Hovy, 2015).
The service first tokenizes the input text to develop a representation in an n-dimensional space. The service uses an open-source word-embedding technique called GloVe to obtain a vector representation for the words in the input text (Pennington et al., 2014). It then feeds this representation to a machine-learning algorithm that infers a personality profile with Big Five, Needs, and Values characteristics. To train the algorithm, the service uses scores obtained from surveys conducted among thousands of users along with data from their Twitter feeds.
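IBM has not published the service's actual implementation, but the general shape of the pipeline described above can be sketched as follows. This is a minimal illustration that assumes pretrained GloVe vectors in the standard text format and uses a simple mean of word vectors as the document representation; the function names, the naive tokenizer, and the regression stage are illustrative assumptions, not the service's code:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file (one word and its vector per line) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def embed(text, vectors, dim=50):
    """Represent a document as the mean of its known word vectors."""
    tokens = text.lower().split()  # naive tokenization, for illustration only
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# A regression model (for example, scikit-learn's Ridge) could then map
# document embeddings to survey-derived trait scores:
#   model = Ridge().fit(train_embeddings, train_survey_scores)
#   predicted_traits = model.predict([embed(new_text, vectors)])
```

In practice the training data pairs each user's aggregated text (here, Twitter feeds) with that user's survey-based personality scores, and one such model is fit per characteristic.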
IBM developed the models for all supported languages in an identical way. The models were developed independently of user demographics such as age, gender, or culture. In the future, IBM might develop models that are specific to different demographic categories.
Earlier versions of the service used the Linguistic Inquiry and Word Count (LIWC) psycholinguistic dictionary with its machine-learning model. However, the open-vocabulary approach just described outperforms the LIWC-based model. For more information about the service's precision for each language in terms of average Mean Absolute Error (MAE) and correlation, see How precise is the service. For guidance about providing input text to achieve the most accurate results, see Guidelines for providing sufficient input.
IBM conducted a validation study to understand the accuracy of the service's approach to inferring a personality profile. IBM collected survey responses and Twitter feeds from between 1500 and 2000 participants for all characteristics and languages. To establish ground truth, participants took four sets of standard psychometric tests.
IBM then compared the scores that were derived by its models with the survey-based scores for the Twitter users. Based on these results, IBM determined the following statistics comparing inferred and actual scores for the different categories of personality characteristics:
Mean Absolute Error (MAE) is a metric that is used to measure the difference between actual and predicted values. For the Personality Insights service, the actual value, or ground truth, is the personality score that was obtained by administering a personality survey. The predicted value is the score that the service's models predict.
IBM computed the MAE by taking the average of the absolute value of the difference between the actual and predicted scores. IBM used the absolute value because whether the model predicts above or below the actual value is irrelevant; as long as there is a difference, the model is penalized by the magnitude of the error. The lower the MAE, the better the performance of the model. IBM uses a scale of 0 to 1 for MAE, where 0 means no error (the predicted value is exactly the same as the actual value) and 1 means maximum error.
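The metric itself is straightforward to compute. A minimal sketch (illustrative, not IBM's code):

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average absolute difference between survey-based (actual) and
    model-predicted scores. With scores on a 0-1 scale, the result is
    also bounded by [0, 1]."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))
```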
Average correlation is a statistical term that measures the interdependence of two variables. With this metric, IBM measured the correlation between inferred and actual scores for the different categories of personality characteristics. If the predicted score closely tracks the actual results, the correlation score is high; otherwise, the score is low.
IBM measures correlation on a scale of -1 to 1: 1 indicates a perfect direct (increasing) linear relationship, and -1 indicates a perfect inverse (decreasing) linear relationship. In all other cases, the value lies between these extremes. If the variables are independent (they have no relationship at all), the correlation is 0.
Values closer to 1 indicate better predictions. However, personality is difficult to predict based solely on text, and it is rare to see correlations exceed 0.4 for these types of psychological models. In the research literature for this domain, correlations above 0.2 are considered acceptable.
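The behavior described above matches Pearson's correlation coefficient, the standard choice for measuring linear interdependence (an assumption here; the document does not name the exact coefficient). A minimal sketch:

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson's r: the covariance of x and y divided by the product
    of their standard deviations. Ranges from -1 (perfect inverse
    linear relationship) to 1 (perfect direct linear relationship)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```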
The following table shows per-language average MAE and correlation results for the Personality Insights service. The results place the service at the cutting edge of personality inference from textual data as indicated by Schwartz et al. (2013) and Plank and Hovy (2015).
Each cell reports the average MAE / average correlation for that language.

| Characteristics | English | Spanish | Japanese | Arabic |
|-----------------|---------|---------|----------|--------|
| Big Five dimensions | 0.12 / 0.33 | 0.10 / 0.35 | 0.11 / 0.27 | 0.09 / 0.17 |
| Big Five facets | 0.12 / 0.28 | 0.12 / 0.21 | 0.12 / 0.22 | 0.12 / 0.14 |
| Needs | 0.11 / 0.22 | 0.12 / 0.24 | 0.11 / 0.25 | 0.11 / 0.13 |
| Values | 0.11 / 0.24 | 0.11 / 0.19 | 0.11 / 0.19 | 0.10 / 0.14 |
To compute the percentile scores, IBM collected a very large data set of Twitter users (one million users for English, 100,000 users for each of Arabic and Japanese, and 80,000 users for Spanish) and computed their personality portraits. IBM then compared the raw scores of each computed profile to the distribution of profiles from those data sets to determine the percentiles.
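Given such a reference distribution, converting a raw score to a percentile amounts to locating the score within the empirical distribution of the population. A minimal sketch (the function name is an illustrative assumption, not the service's API):

```python
import numpy as np

def percentile_score(raw_score, population_scores):
    """Fraction of the reference population whose raw score falls below
    the given score; 0.5 means the score is at the population median."""
    population_scores = np.asarray(population_scores, dtype=float)
    return float(np.mean(population_scores < raw_score))
```

In production, the reference distribution would be the precomputed profiles of the large per-language user samples described above rather than an in-memory array.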
Note: For Arabic input, the service is unable to produce meaningful percentiles and raw scores for a number of personality characteristics. For more information, see Limitations for Arabic input.
The relationship between personality and purchasing behavior has been studied across a variety of products and services.
Applying these known relationships between consumption behaviors and personality is challenging: most of these studies used personality data derived from surveys, and their models are not publicly available. IBM therefore decided to learn these consumption preference models directly. When training the models, IBM used personality scores returned by the Personality Insights service as features. As a result, when these models are applied to personality characteristics that the service calculates for a user, the predictions are likely to be more accurate.
The Personality Insights service infers consumption preferences based on the results of its personality profile for the author of the input text. From existing literature, IBM identified 104 consumption preferences that have proved to be correlated with personality. These include preferences related to shopping, movies, music, and other categories. IBM then created a psychometric survey to assess an individual's inclination for each consumption behavior.
IBM obtained responses to its survey from about 600 individuals for whom it also had Twitter data (more than 200 self-authored tweets for each user). IBM submitted the tweets to the service to gather a personality profile for each individual. It then built a classifier for each consumption preference, where the input feature set was the personality information.
For inclusion with the service, IBM selected only those consumption preferences for which personality-based classification performed at least 9 percent better than random classification. Of the original 104 preferences, 42 satisfied this criterion and are exposed as consumption preferences by the service.
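The selection criterion can be sketched as a simple threshold check. Note that "at least 9 percent better than random" is interpreted here as a relative gain over the baseline accuracy; the exact formulation (relative versus absolute) is not specified in the source, so this is an assumption, as are the function names:

```python
def relative_gain(classifier_accuracy, random_accuracy):
    """Relative improvement of a classifier's accuracy over a random baseline."""
    return (classifier_accuracy - random_accuracy) / random_accuracy

def keep_preference(classifier_accuracy, random_accuracy=0.5, threshold=0.09):
    """Retain a consumption-preference classifier only if it beats the
    random baseline by at least the given relative threshold."""
    return relative_gain(classifier_accuracy, random_accuracy) >= threshold
```

Under this reading, a binary preference with a 0.5 random baseline would need roughly 54.5 percent accuracy or better to be exposed by the service.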
When developing the Personality Insights service, IBM relied on personality surveys to establish ground-truth data for personality inference. Ground truth refers to the factual data obtained through direct observation rather than through inference. A typical measure of accuracy for any machine-learning model is to compare the scores inferred by the model with ground-truth data; the previous sections describe how IBM used surveys to validate the accuracy of the service.
The following notes clarify the use of personality surveys and survey-based personality estimation:
Personality surveys are long and time-consuming to complete. The results are therefore constrained by the number of Twitter users who were willing and available to participate in IBM's study. IBM plans to conduct validation studies with more users, as well as with users of other online media such as email, blogs, and forums.
Survey-based personality estimation is based on self-reporting, which might not always be a true reflection of one's personality: Some users might give noisy answers to such surveys. To reduce the noise, IBM filtered survey responses by including attention-checking questions and by discarding surveys that were completed too quickly.
Although the correlation between inferred and survey-based scores is both positive and significant, the results indicate that inferred scores do not always match survey-based scores. Researchers from outside of IBM have also conducted experiments to compare how well inferred scores match those obtained from surveys, and none reported a fully consistent match.
In general, it is widely accepted in research literature that self-reported scores from personality surveys do not always fully match scores that are inferred from text. What is more important, however, is that IBM found that characteristics inferred from text can reliably predict a variety of real-world behavior.