Date: 2024-05-06

Developers use a number of evaluation metrics for text summarization. Differences in metrics generally depend on the type of summary as well as which feature of the summary one wants to measure.

BLEU (bilingual evaluation understudy) is an evaluation metric commonly used in machine translation. It measures the similarity between ground truth and model output over sequences of n consecutive words, known as n-grams. In text summarization, BLEU measures how often, and to what extent, n-grams in an automatic summary overlap with those in a human-generated summary. To account for erroneous word repetitions in the automatic summary, it counts each n-gram no more often than it appears in the human-generated summary. It then combines the precision scores for the individual n-gram lengths into an overall text precision, known as the geometric mean precision, and applies a brevity penalty to automatic summaries that are shorter than the human-generated one. The final value is between 0 and 1, with 1 indicating perfect alignment between the machine- and human-generated text summaries.15
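The BLEU calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single reference summary, whitespace-split tokens, and n-grams up to length 2, and the function names (`ngram_counts`, `modified_precision`, `simple_bleu`) are hypothetical. It also applies the brevity penalty from the standard BLEU definition, which lowers the score of a candidate that is shorter than the reference.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference, so repeated words
    cannot inflate the score."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions for n = 1..max_n,
    multiplied by the brevity penalty (assumption: one reference)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


reference = "the cat sat on the mat".split()
repetitive = "the the the the".split()

# "the" occurs twice in the reference, so only 2 of the 4 candidate
# unigrams are credited.
print(modified_precision(repetitive, reference, 1))
print(simple_bleu(reference, reference))
```

In practice, an established library implementation is preferable to a hand-rolled one, since details such as smoothing and tokenization affect the reported score.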

ROUGE (recall-oriented understudy for gisting evaluation) is derived from BLEU specifically for evaluating summarization tasks. Like BLEU, it compares machine summaries to human-generated summaries using n-grams. But while BLEU measures machine precision, ROUGE measures machine recall. In other words, ROUGE scores an automatic summary by the proportion of n-grams from the human-generated summary that also appear in the automatic summary. The ROUGE score, like BLEU, is a value between 0 and 1, with 1 indicating perfect alignment between the machine- and human-generated text summaries.16
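The recall computation can be illustrated with a short Python sketch of the n-gram variant, commonly called ROUGE-N. It assumes a single reference summary and whitespace-split tokens; the function names (`ngram_counts`, `rouge_n_recall`) are hypothetical and chosen only for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate.
    Each reference n-gram is matched at most as many times as it
    appears in the candidate."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

# 3 of the 6 reference unigrams are recovered by the candidate.
print(rouge_n_recall(candidate, reference, n=1))  # 0.5
# 2 of the 5 reference bigrams are recovered.
print(rouge_n_recall(candidate, reference, n=2))  # 0.4
```

Note the difference in denominator: precision-style metrics such as BLEU divide by the number of n-grams in the automatic summary, whereas this recall divides by the number of n-grams in the human-generated summary.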

Note that these metrics evaluate the final summarized text output. They are distinct from the many sentence-scoring methods used within text summarization algorithms, which select suitable sentences and keywords from which to produce the final summarized output.