I recently ran across this article (https://lctech.vn/blog/ibm-watson-compares-trumps-inauguration-speech-obamas/). It describes the author's attempt at a comparative analysis of the personalities of Barack Obama and Donald Trump, based on applying the IBM Watson Personality Insights API to their US Presidential inauguration speeches. The article offers many charts, figures, and analyses drawn from various capabilities of the API. But these cannot make up for the logical fallacy under which the API was applied in the first place.
UPDATE: Even as I was publishing this article, a similar misuse of the IBM Watson Personality Insights API was reported by CNBC (http://www.cnbc.com/2017/07/17/tim-cook-is-silicon-valleys-most-imaginative-ceo-says-ibm-data.html). That analysis produced results such as that Apple's CEO Tim Cook is Silicon Valley's most imaginative tech leader and that Microsoft's CEO Satya Nadella is one of its most assertive. These are non sequiturs: they may be true or false, but the analysis doesn't actually establish the claims it asserts.
One of the most important principles in data science is that the test set for a machine learned model must be a good representative of the model's expected usage. Otherwise, the model's accuracy on the test set will have little to do with its accuracy in practice. In the field of psychometrics, this principle actually has a name: construct validity. Generally, it makes sense to take cues on measuring machine learning from the vast experience of educational psychologists who measure human learning.
A corollary principle in data science is that the training set for a machine learned model must be consistent with the test set. Otherwise, the machine learning algorithm is unlikely to learn the construct that the test set tests. In fact, it's not uncommon to draw the test set randomly from the training set, in which case the two sets are consistent by construction, and the challenge reduces to determining whether they provide a good representation of the intended use case. Essentially, data scientists spend a lot of time thinking about and working on training set quality in order to attain high construct validity.
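The random hold-out just described can be sketched in a few lines of Python. The function name, the 20% split, and the toy data are illustrative assumptions on my part, not anything specific to Watson:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Randomly hold out a test set drawn from the same pool as the
    training set, so the two sets are consistent by construction."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Everything after the held-out slice trains; the slice itself tests.
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(100))
train, test = train_test_split(examples)
```

Because both sets come from the same shuffled pool, any consistency question collapses into a single one: does that pool represent the intended use case?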
But if you are a data science consumer, then you have to think about these principles in reverse. If you are a software developer who uses an API that exposes the inferential function of a machine learned model trained by a data scientist or data science team, then you are a consumer of their data science results. Such is the case when you use IBM Watson Personality Insights.
If this situation describes you, then it is important for you to look into how the API's machine learned model was trained so that you can determine whether that training reflects your use case. In the case of IBM Watson Personality Insights, this information is provided here: https://www.ibm.com/watson/developercloud/doc/personality-insights/science.html
According to this source, the API was trained by mapping personality test results to the linguistic patterns of 200 tweets from the 600 participants. There is no evidence to suggest that our tweet writing is linguistically consistent with how we write emails, blog posts, or other documents, much less with the transcripts of US Presidents' inauguration speeches or CEO speeches. For one thing, we know that, tweet storms aside, successive tweets aren't necessarily closely related to each other, whereas the sentences and paragraphs of these other forms of writing are much more logically and sequentially connected. After all, that's why we *have* speech writers.
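To make the consistency question concrete, here is a rough sketch of how one might compare two writing forms on simple surface features. The feature choices and the sample strings are my own illustrative assumptions; a real check would use actual corpora and far richer linguistic features:

```python
import re

def linguistic_profile(text):
    """A crude linguistic fingerprint: average sentence length (in words)
    and type-token ratio (vocabulary diversity). Hypothetical features,
    chosen only to illustrate comparing two writing forms."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    avg_sentence_len = len(words) / max(len(sentences), 1)
    type_token_ratio = len(set(words)) / max(len(words), 1)
    return avg_sentence_len, type_token_ratio

# Made-up sample text standing in for the two genres.
tweets = "Coffee first. Then code. Shipping today, no excuses!"
speech = ("We gather today to renew the promise that binds us together, "
          "and we affirm that the challenges before us, however daunting, "
          "will be met with the resolve that has always defined this nation.")

tweet_profile = linguistic_profile(tweets)
speech_profile = linguistic_profile(speech)
```

Even this toy comparison shows the genres diverging on sentence length, which is exactly the kind of gap that undermines transferring a tweet-trained model to speeches.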
By comparison, if your use case is to determine the personality traits of, say, a prospective customer or employee based on their Twitter feed, then you're much more likely to be using the IBM Watson Personality Insights API appropriately.
In the case of this API, there are further questions that a psychologist would ask, and therefore that you should ask, too. In particular, the training data was drawn from a sample of 600 participants. Are those participants representative of the target population on whom you will be running inferences with the API? For example, if your prospective customers or employees come from, say, the fashion industry, while the training participants came predominantly from, say, the tech industry or from the population at large, then your results with the API may be significantly skewed by the difference. Do your best to find out the demographics of both the training participants and your target population, and check for mismatches.

There are other, similar questions. Are members of your target population more prone to tweet storms, retweeting, or replying to tweets than the training sample? All of these tendencies reflect personality traits, so if the training sample and the target population differ on them, then you may not be able to use the API.
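As a sketch of what checking for such a mismatch might look like, the function below compares the category proportions of a training sample against a target population using total-variation distance. The industry labels, counts, and function name are all hypothetical:

```python
from collections import Counter

def proportion_mismatch(sample, target):
    """Total-variation distance between two categorical distributions,
    e.g. industry labels of training participants vs. your target users.
    0.0 means identical proportions; values near 1.0 mean severe mismatch."""
    s, t = Counter(sample), Counter(target)
    n_s, n_t = len(sample), len(target)
    categories = set(s) | set(t)
    return 0.5 * sum(abs(s[c] / n_s - t[c] / n_t) for c in categories)

# Hypothetical data: a tech-heavy training sample vs. a fashion-heavy target.
training = ["tech"] * 500 + ["fashion"] * 50 + ["other"] * 50
target = ["fashion"] * 80 + ["tech"] * 10 + ["other"] * 10
mismatch = proportion_mismatch(training, target)
```

A large distance here would be a warning sign that the model's training sample does not represent the people you plan to score, and the same comparison could be run on behavioral features such as retweet or reply rates.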
For any such API, you as a software developer are practicing a basic form of data science when you check these issues: you are ensuring construct validity between the inferences in your use case and the training data of the machine learned model you are consuming.