The Road to Gamma

Gamma: The Learned Classifier

In machine learning and statistics, classification is the problem of identifying a semantic class for an observation, on the basis of a training set of data containing observations (or instances) whose class membership is known. Gamma is a classifier: given a document, it will give us the class that document belongs to (with an associated degree of probability to boot!). Our job is to build the function gamma, which takes a document and returns a class.

Application: Sentiment Analysis

Given a movie review, we want to apply the classifier (gamma). The classifier will give us a degree of probability that this movie review is either Positive or Negative. We could define more classes (Excellent, Good, Neutral, Weak, Terrible), but we'll keep this example simple. If a reviewer states that the latest Bruce Willis movie was "quite simply one of the worst days of the 'Die Hard' series", the gamma classifier should return:

Negative: x%, Positive: y%
where x and y are percentage probabilities, and we might assume the probability that the classifier would consider this review Negative to be far higher than the probability of it being Positive. Such results, however, would depend entirely upon how we supervise (train) the classifier.
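Before getting into how gamma is built, it helps to pin down its shape. The sketch below is only an illustration of the interface: the class names and the hard-coded scores are hypothetical stand-ins for what a trained classifier would actually compute.

```python
# A minimal sketch of the gamma interface. A real implementation would score
# the document against a trained model; the probabilities below are
# hard-coded, hypothetical values used purely to illustrate the return shape.

def gamma(document):
    """Return the most likely class for a document, plus its probability."""
    scores = {"Negative": 0.93, "Positive": 0.07}  # hypothetical values
    best = max(scores, key=scores.get)
    return best, scores[best]

label, p = gamma("quite simply one of the worst days of the 'Die Hard' series")
```

The rest of this article is about replacing that hard-coded dictionary with scores learned from data.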
Classification (Per Token)

One approach to building a classifier would be to simply take all words that seem to suggest positive sentiment and put them in one list, and all words that seem to suggest negative sentiment in another. For a given review, see how many words occur in each list, and return a probability based on that. SentiWordNet could be used in aid of this approach. The token "worst" occurs in the SentiWordNet 3.0 corpus:
SentiWordNet considers that the token "worst" has a NegScore (Negative Score) of 0.75 and a PosScore (Positive Score) of 0.25. This indicates that, on average, one in four uses of the token "worst" is in a positive application (perhaps "I wanted to see this movie in the worst way!"). The presence of the word "worst" in the review "quite simply one of the worst days of the 'Die Hard' series" would therefore contribute toward both negative and positive sentiment classification. The application of both a PosScore and a NegScore to a token may seem counterintuitive at first, but consider this example:
The first and last reviews are negative, and the middle reviews are positive. But language is ambiguous. What about these examples?
These might get annotated as:
Tokens that might seem "negative" can be used in a positive connotation, hence the use of both positive and negative scores in SentiWordNet. We could choose to search SentiWordNet for each token in the review and add up the scores. We might conceivably choose to skip certain parts of speech, such as articles or determiners, and, if our application supported the ability, to skip named entities (such as movie titles). Given this process, we could end up with:
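The per-token process described above can be sketched in a few lines. The tiny lexicon stands in for the full SentiWordNet corpus: the scores for "worst" match the ones quoted above, while the scores for "quite" and "simply" are hypothetical values chosen only to reproduce the kind of positive-leaning total discussed next.

```python
# A sketch of the per-token approach. LEXICON stands in for SentiWordNet;
# only "worst" uses the real scores quoted above, the rest are made up.

LEXICON = {
    "worst":  {"pos": 0.25, "neg": 0.75},
    "quite":  {"pos": 0.5,  "neg": 0.0},   # hypothetical values
    "simply": {"pos": 0.5,  "neg": 0.0},   # hypothetical values
}

SKIP = {"one", "of", "the", "days", "series"}  # articles, determiners, etc.

def token_sentiment(review):
    """Sum positive and negative scores over the tokens of a review."""
    pos = neg = 0.0
    for token in review.lower().split():
        token = token.strip("'\".,")
        if token in SKIP:
            continue
        entry = LEXICON.get(token)
        if entry:
            pos += entry["pos"]
            neg += entry["neg"]
    return pos, neg

pos, neg = token_sentiment(
    "quite simply one of the worst days of the 'Die Hard' series")
```

With these (hypothetical) lexicon values, the positive total exceeds the negative one, which is exactly the kind of misleading result described below.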
This would seem to indicate that the review was more likely to have been positive, which is an erroneous conclusion. Clearly, more context is needed.

Classification (Language Model)

Context can be provided through the use of a language model.

Basic Probability

Assume we have some finite vocabulary
V = { bruce willis, does, what, he, best }
It is not uncommon for this vocabulary to be very large, but we'll assume a small set for this example. Given the vocabulary, there is an infinite variety of possible sentences that can be created from it (assuming no restrictions apply on the number of times a token can be used):
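A small sketch makes the combinatorics concrete. It enumerates every sentence up to a given length over the example vocabulary (treating "bruce willis" as a single token); since sentence length is unbounded, the full set of sentences is infinite.

```python
from itertools import product

# The example vocabulary; "bruce willis" is treated as a single token.
V = ["bruce willis", "does", "what", "he", "best"]

def sentences(max_len):
    """Yield every sentence of length 1..max_len over the vocabulary V."""
    for n in range(1, max_len + 1):
        for combo in product(V, repeat=n):
            yield " ".join(combo)

# The count grows as |V| + |V|^2 + ... + |V|^max_len, so with no length
# limit there are infinitely many possible sentences.
count = sum(1 for _ in sentences(2))  # 5 + 25 = 30 sentences of length <= 2
```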
A well-formed sentence has:
A sentence doesn't have to "make sense" to be considered well-formed. Now let's assume we have a training sample in English. Maybe you collect all the sentences you see in the WSJ over the last 20 years, or all the tweets from the last month. This training sample can be quite large.
P Value

Given a training sample, we want to learn a distribution P over sentences in a language. P is going to be a function that satisfies two conditions: p(x) ≥ 0 for every possible sentence x, and the probabilities of all possible sentences sum to 1. In other words, P gives the probability of any given sentence over the vocabulary.

Language Model Basics

A language model is a collection of tokens that occur next to each other (collocated) and their frequencies:
Let's assume I have 10,000 movie reviews. Perhaps the reviews come from a site like imdb.com, where reviewers score their own reviews on a scale of 1-10 stars. In this case, the reviewer scores the movie 5/10 (50%). Note that 42 out of 75 people agree with the reviewer. With enough information we could weight this as a metadata score; if only 5% of the population agreed with this reviewer, we might choose to discount the negative sentiment. A trigram language model of this review would look like this:
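The trigram counts for a review can be computed with a simple sliding window. This is only a sketch over the one example phrase; a real model would aggregate counts over the whole corpus of reviews.

```python
from collections import Counter

def trigrams(text):
    """Return the list of word trigrams in a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

# Counting trigrams in a single review; with one review each count is 1.
review = "quite simply one of the worst days of the 'Die Hard' series"
model = Counter(trigrams(review))
```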
Naturally the list of trigrams would grow greatly with a larger corpus of reviews. The term frequency and document frequency counts will help build out a TF/IDF score. A partial (top 500) trigram model for the NY Times from January, 1987 with TF/IDF frequencies is here. Not surprisingly, "new york city" is the second most common trigram on that list. Given a trigram language model of movie reviews, we can apply the same technique that we used with SentiWordNet, but rather than dealing with individual tokens, we would be dealing with the trigram "is quite simply", or "quite simply one". In our example the trigram is related to a negative sentiment, but this could change as the model changes.

Shortcomings

While the approach of creating a language model and associating ngrams with classes (positive or negative sentiment in this case) is a valid one, there is a shortcoming in the formula we use. If we come across a trigram, phrase or token that is not already associated with a class in our language model, the probability of that association will be 0. In SentiWordNet, "quite" and "simply" both carry a degree of positive sentiment, with no negative sentiment. In a trigram model of movie reviews, we would have a better idea of how often those tokens actually occur in positive contexts.

Building the Language Model

So we build our language model based on reviews that are labeled as positive / negative / neutral. Then we can analyze each new tweet with a high degree of confidence. Note that memes on Twitter frequently change; due to the nature of this dynamic corpus, language models built around Twitter will need frequent updating. This is called "Supervised Machine Learning" ("Naive Sentiment Analysis" vs "Supervised Machine Learning"). We have a training set of documents that have been hand-labeled for their class.
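The standard fix for the zero-probability shortcoming described above is add-one (Laplace) smoothing: every ngram count is incremented by one, so an unseen ngram still receives a small non-zero probability. The counts in this sketch are made up for illustration.

```python
# A sketch of add-one (Laplace) smoothing, a standard fix for the
# zero-probability problem. The counts below are hypothetical.

def smoothed_prob(ngram_count_in_class, total_ngrams_in_class, vocab_size):
    """P(ngram | class) with add-one smoothing: never returns 0."""
    return (ngram_count_in_class + 1) / (total_ngrams_in_class + vocab_size)

# An unseen trigram (count 0) still gets a small non-zero probability,
# while a frequently seen trigram remains more probable.
p_unseen = smoothed_prob(0, 10_000, 50_000)
p_seen = smoothed_prob(42, 10_000, 50_000)
```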
Input:

Output:

The Road to Gamma

The goal of this supervised machine learning is to produce gamma. For each class, compute the probability of each word occurring in that class. This could also be calculated with higher precision for ngram language models, where n > 1. The "NB" in CNB stands for "Naive Bayes": that is, the best class under the Naive Bayes assumption, or the class that maximizes these prior probabilities.

Addendum / Misc

Rules for hand-labeling:
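The whole road to gamma can be sketched end to end as a word-level Naive Bayes classifier with add-one smoothing, choosing the class that maximizes the (log) prior times the word likelihoods. The tiny hand-labeled training set here is hypothetical; a real model would be trained on the full corpus of labeled reviews.

```python
import math
from collections import Counter

# A minimal Naive Bayes sketch of gamma. The tiny hand-labeled training
# set below is hypothetical and exists only to make the example runnable.

train = [
    ("positive", "bruce willis does what he does best"),
    ("positive", "quite simply a great movie"),
    ("negative", "quite simply one of the worst movies ever"),
    ("negative", "the worst movie of the year"),
]

class_counts = Counter(c for c, _ in train)          # documents per class
word_counts = {c: Counter() for c in class_counts}    # word counts per class
for c, doc in train:
    word_counts[c].update(doc.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def gamma(document):
    """Return CNB = argmax over classes of log P(c) + sum of log P(w|c)."""
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))   # prior P(c)
        total = sum(word_counts[c].values())
        for w in document.split():
            # P(w|c) with add-one smoothing, in log space for stability
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

label = gamma("quite simply the worst")
```

Log probabilities are used instead of raw products so that long documents don't underflow to zero.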
