What tools or metrics are available to help a user fine-tune an LDA model? For those new to topic modeling, it can be frustrating to learn that no single performance metric, or even an agreed-upon collection of metrics, has been adopted in the literature.
Qualitative. Believe it or not, qualitative evaluation is not uncommon, particularly in real-world applications. Such evaluations often involve examining the top five or ten keywords for each topic and judging how interpretable those topics are to human users.8 This sort of “eyeballing,” so to speak, requires a significant amount of expert domain knowledge and familiarity with the documents under consideration.9
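As a rough illustration of this kind of manual review, the sketch below trains a small model with the gensim library and prints each topic's top keywords. The toy documents and variable names are purely illustrative assumptions, not drawn from any study cited here.

```python
# Minimal sketch: train a toy LDA model with gensim and print top keywords
# per topic for manual ("eyeball") inspection. The corpus is a made-up example.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["river", "bank", "water", "flood"],
    ["bank", "loan", "money", "interest"],
    ["water", "rain", "flood", "river"],
    ["money", "market", "bank", "loan"],
]

dictionary = corpora.Dictionary(docs)                 # map tokens to integer ids
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words vectors

lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=2, passes=10, random_state=42)

# Show the five most probable words for each topic.
for topic_id, keywords in lda_model.show_topics(num_topics=-1, num_words=5,
                                                formatted=False):
    print(topic_id, [word for word, prob in keywords])
```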
Coherence. Topic coherence is one popular quantitative method for evaluating generated topics. A topic coherence score measures how often a given topic’s most probable words co-occur in the same documents throughout the corpus. More specifically, it compares the co-occurrence frequency of each word pair from a topic’s top n words against each individual word’s frequency across the corpus, aiming to quantify how coherent that topic is. A model’s overall coherence score is the average of the coherence scores of its individual topics. As its name suggests, coherence evaluates models solely according to how cohesive their topics are. Topics must also maintain a degree of exclusivity, however, for which there is currently no widely adopted quantitative measure.10
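For a sense of how such a score is computed in practice, the following sketch uses gensim's CoherenceModel with the "c_v" measure, one of several coherence formulations that library supports; it is assumed here for illustration and is not necessarily the measure used in any cited study. It reuses the toy lda_model, docs, and dictionary from the previous example and reports both per-topic scores and their average.

```python
# Minimal sketch: compute per-topic and averaged coherence for the toy model
# above. Assumes lda_model, docs, and dictionary from the previous example.
from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda_model, texts=docs,
                    dictionary=dictionary, coherence="c_v")

print("per-topic coherence:", cm.get_coherence_per_topic())
print("overall (mean) coherence:", cm.get_coherence())
```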
Recent research shows that quantitative metrics, such as coherence score, are unreliable for topic model evaluation. This is, in part, due to ambiguity in the professed evaluative goal of interpretability—what makes a model and its results interpretable?11 Moreover, automated metrics designed for older systems may not extrapolate well to newer systems. This issue is complicated by a lack of transparency in many published experiments that prevents generalization of evaluation methods to other datasets or domains.12 Research has recently turned to artificial intelligence applications, notably large language models (LLMs), as a means of designing and evaluating LDA models for a specific research objective.13 While this approach shows promising results, further research is necessary.