How do I evaluate a text summarization tool? [closed]

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?

In short, is there a metric for evaluating the time that my tool saves a human? I was thinking of using the ratio (time taken to read the original document / time taken to read the summary) as a way of determining the time saved, but are there better metrics?

Currently, I am asking the user subjective questions about the accuracy of the summary.

Lacto answered 26/3, 2012 at 20:26 Comment(0)

In general:

Bleu measures precision: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.

Rouge measures recall: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.

Naturally, these results are complementary, as is often the case with precision vs. recall. If many words/n-grams from the system output appear in the human references you will have a high Bleu score, and if many words/n-grams from the human references appear in the system output you will have a high Rouge score.

There's something called the brevity penalty, which is quite important and is already included in standard Bleu implementations. It penalizes system outputs that are shorter than the typical reference length (read more about it here). This complements the behaviour of the n-gram metric, which in effect already penalizes outputs longer than the reference, since the denominator grows with the length of the system output.

You could also implement something similar for Rouge, but this time penalizing system outputs that are longer than the typical reference length, since otherwise they could obtain artificially high Rouge scores (the longer the output, the higher the chance of hitting some word that appears in the references). Because Rouge divides by the length of the human references, an extra penalty is needed for overly long system outputs that would otherwise inflate the Rouge score.
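A minimal sketch of both penalties (the Bleu brevity penalty follows the standard formula; the Rouge-side length penalty is just one possible way to implement the suggestion above, not a standard metric):

    import math

    def bleu_brevity_penalty(candidate_len, reference_len):
        # Standard BLEU brevity penalty: no penalty if the candidate is at least
        # as long as the reference, otherwise exp(1 - r/c), which shrinks towards 0.
        if candidate_len >= reference_len:
            return 1.0
        return math.exp(1.0 - reference_len / candidate_len)

    def rouge_length_penalty(candidate_len, reference_len):
        # Hypothetical mirror image for Rouge: penalize candidates that are
        # *longer* than the reference, as suggested above.
        if candidate_len <= reference_len:
            return 1.0
        return math.exp(1.0 - candidate_len / reference_len)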

Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
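A minimal sketch of that combination, assuming the nltk and rouge packages are installed (the sentences are just illustrative):

    from nltk.translate.bleu_score import sentence_bleu
    from rouge import Rouge

    reference = "the police killed the gunman"
    candidate = "police kill the gunman"

    # Unigram Bleu: precision-oriented overlap of the candidate with the reference
    bleu = sentence_bleu([reference.split()], candidate.split(), weights=(1.0,))

    # Rouge-1 recall: how much of the reference shows up in the candidate
    rouge_recall = Rouge().get_scores(candidate, reference)[0]["rouge-1"]["r"]

    # Harmonic mean of the two, as suggested above
    f1 = 2 * bleu * rouge_recall / (bleu + rouge_recall) if (bleu + rouge_recall) else 0.0
    print(f"Bleu={bleu:.3f}  Rouge-1 recall={rouge_recall:.3f}  F1={f1:.3f}")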

Gothicism answered 28/8, 2016 at 10:39 Comment(3)
You have posted the exact answer to two questions. If you think one of them is a duplicate of the other, you should mark them as such (and not post the same answer twice). – Thierry
The answers are not exactly the same, and the questions are not exactly the same. It is correct that one of the answers contains the other, but I can't see a clear way to merge the two questions. – Gothicism
Is it fair to say ROUGE is recall-based only? Because most popular Python implementations output recall, precision, and F1 score when computing the ROUGE score. – Pooka

BLEU

  • Bleu measures precision
  • Bilingual Evaluation Understudy
  • Originally designed for machine translation (hence "bilingual")
  • Counts words from the machine-generated summary that appear in the human reference summary
  • That is, how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries
  • The closer a machine translation is to a professional human translation, the better it is

ROUGE

  • Rouge measures recall

  • Recall-Oriented Understudy for Gisting Evaluation: counts words from the human reference summary that appear in the machine-generated summary

  • That is, how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries

  • Overlap of n-grams between the system and reference summaries: Rouge-N, where N is the n-gram size

    reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. 
In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
    

Abstractive summarization

   # Abstractive summarization with the Hugging Face transformers pipeline
   from transformers import pipeline

   print(len(reference_text.split()))  # number of words in the original document
   summarization = pipeline("summarization")
   abstractive_summarization = summarization(reference_text)[0]["summary_text"]

Abstractive Output

   In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)

Extractive summarization

   # Extractive summarization with sumy's LexRank summarizer
   from sumy.parsers.plaintext import PlaintextParser
   from sumy.nlp.tokenizers import Tokenizer
   from sumy.summarizers.lex_rank import LexRankSummarizer

   parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
   # parser.document.sentences
   summarizer = LexRankSummarizer()
   extractive_summarization = summarizer(parser.document, 2)  # pick 2 sentences
   extractive_summarization = ' '.join(str(s) for s in extractive_summarization)

Extractive Output

Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.

Using Rouge to evaluate "Abstractive" summary

  from rouge import Rouge
  r = Rouge()
  r.get_scores(abstractive_summarization, reference_text)

Rouge output for the abstractive summary

  [{'rouge-1': {'f': 0.22299651364421083,
  'p': 0.9696969696969697,
  'r': 0.12598425196850394},
  'rouge-2': {'f': 0.21328671127225052,
  'p': 0.9384615384615385,
  'r': 0.1203155818540434},
  'rouge-l': {'f': 0.29041095634452996,
  'p': 0.9636363636363636,
  'r': 0.17096774193548386}}]

Using Rouge to evaluate "Extractive" summary

  from rouge import Rouge
  r = Rouge()
  r.get_scores(extractive_summarization, reference_text)

Rouge output for the extractive summary

  [{'rouge-1': {'f': 0.27860696251962963,
  'p': 0.8842105263157894,
  'r': 0.16535433070866143},
  'rouge-2': {'f': 0.22296172781038814,
  'p': 0.7127659574468085,
  'r': 0.13214990138067062},
  'rouge-l': {'f': 0.354755780824869,
  'p': 0.8734177215189873,
  'r': 0.22258064516129034}}]

Interpreting rouge scores

ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:

ROUGE Formula

    ROUGE-N = ( ∑r ∑s Count_match(n-gram) ) / ( ∑r ∑s Count(n-gram) )

I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The numerator ∑s loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (the one proposed by the summarization algorithm). If there is more than one reference summary, ∑r repeats the process over all reference summaries.

The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents and average the scores, and that gives you the ROUGE-N score. So a higher score means that, on average, there is high n-gram overlap between your summaries and the references.
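A minimal sketch of that corpus-level averaging with the rouge package (the hypothesis/reference pairs are illustrative; avg=True returns the mean over all pairs):

    from rouge import Rouge

    hypotheses = ["police kill the gunman", "the cat sat on the mat"]
    references = ["police killed the gunman", "a cat was sitting on the mat"]

    # avg=True averages ROUGE-1/2/L over all hypothesis-reference pairs
    avg_scores = Rouge().get_scores(hypotheses, references, avg=True)
    print(avg_scores["rouge-2"]["r"])  # averaged ROUGE-2 recall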

   Example:

   S1. police killed the gunman
   
   S2. police kill the gunman
   
   S3. the gunman kill police

S1 is the reference and S2 and S3 are candidates. Note that S2 and S3 each have one overlapping bigram with the reference, so they get the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram "the gunman", so it scores 2/4. Source: RAG evaluation
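A minimal sketch to check this with the rouge package (exact decimals may differ slightly from the hand calculation depending on the implementation):

    from rouge import Rouge

    reference = "police killed the gunman"              # S1
    candidates = {"S2": "police kill the gunman",
                  "S3": "the gunman kill police"}

    r = Rouge()
    for name, cand in candidates.items():
        scores = r.get_scores(cand, reference)[0]
        print(name,
              "ROUGE-2 recall:", round(scores["rouge-2"]["r"], 2),
              "ROUGE-L recall:", round(scores["rouge-l"]["r"], 2))
    # S2 and S3 each share one bigram with S1, so they tie on ROUGE-2,
    # but ROUGE-L favours S2 because its longest common subsequence is longer.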

Crowbar answered 27/4, 2021 at 7:49 Comment(0)

Historically, summarization systems have often been evaluated by comparing to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.

Those two techniques are analogous to the two major categories of automatic summarization systems - extractive vs. abstractive (more details available on Wikipedia).

One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rouge can optionally compute overlap allowing word insertions or deletions between the two summaries (e.g. if allowing a 2-word skip, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').

My understanding is that Rouge's n-gram overlap scores were fairly well correlated with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. I.e., that beyond some quality threshold, summaries that are judged better by human evaluators may be scored similarly to - or outscored by - summaries judged inferior. Nevertheless, Rouge scores might be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.

Your approach of collecting human judgements is probably the best evaluation, if you're able to afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.

Imparipinnate answered 23/4, 2014 at 17:57 Comment(0)

There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.

Abstract from the paper:

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
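A minimal usage sketch with the bert-score package (pip install bert-score); the sentences are illustrative:

    from bert_score import score

    candidates = ["the police killed the gunman"]
    references = ["police shot the gunman dead"]

    # Returns per-sentence precision, recall and F1 tensors
    P, R, F1 = score(candidates, references, lang="en")
    print(f"P={P.mean().item():.3f}  R={R.mean().item():.3f}  F1={F1.mean().item():.3f}")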

Latter answered 19/7, 2020 at 17:28 Comment(0)

I'm not sure about the time evaluation, but regarding accuracy you might consult the literature under the topic Automatic Document Summarization. The primary evaluation venue was the Document Understanding Conference (DUC) until the summarization task was moved into the Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.

You can find the evaluation guidelines for each of these events posted online. For single document summarization tasks look at DUC 2002-2004.

Or, you might consult the ADS evaluation section in Wikipedia.

Strontium answered 26/3, 2012 at 22:34 Comment(0)

There are several measures against which you can evaluate your summarization system, for example:

Precision = number of important sentences in the summary / total number of sentences in the summary

Recall = number of important sentences retrieved / total number of important sentences in the original document

F score = 2 * (Precision * Recall) / (Precision + Recall)

Compression rate = total number of words in the summary / total number of words in the original document
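A minimal sketch of these measures, assuming you have (manually) marked which sentences of the original document are important:

    def summary_metrics(summary_sentences, important_sentences,
                        summary_word_count, original_word_count):
        # Sentence-level precision/recall against a set of "important" sentences
        retrieved = set(summary_sentences) & set(important_sentences)
        precision = len(retrieved) / len(summary_sentences)
        recall = len(retrieved) / len(important_sentences)
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        compression_rate = summary_word_count / original_word_count
        return precision, recall, f_score, compression_rate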

Scrogan answered 23/4, 2013 at 7:12 Comment(0)

When you are evaluating an automatic summarisation system you would typically look at the content of the summary rather than time.

Your idea of:

(Time taken to read the original document/Time taken to read the summary)

Doesn't tell you much about your summarisation system; it really only gives you an idea of the compression rate of your system (i.e. the summary is 10% of the length of the original document).

You may want to consider the time it takes your system to summarise a document vs. the time it would take a human (system: 2s, human: 10 mins).
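A minimal sketch of that comparison (summarise() stands in for your own system, and the 200 words-per-minute reading speed is just an assumed figure):

    import time

    READING_SPEED_WPM = 200  # assumed average reading speed

    def timing_report(document, summarise):
        start = time.perf_counter()
        summary = summarise(document)
        system_seconds = time.perf_counter() - start

        human_reads_document = len(document.split()) / READING_SPEED_WPM * 60   # seconds
        human_reads_summary = len(summary.split()) / READING_SPEED_WPM * 60     # seconds
        return system_seconds, human_reads_document, human_reads_summary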

Pectin answered 22/3, 2015 at 12:59 Comment(0)

I recommend BARTScore. Check the GitHub page and the article. The authors also released a meta-evaluation on the ExplainaBoard platform, "which allows to interactively understand the strengths, weaknesses, and complementarity of each metric". You can find a list of most of the state-of-the-art metrics there.
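A minimal usage sketch based on the README of the BARTScore repository (it assumes you have bart_score.py from that repo on your path; treat the exact class and argument names as an approximation of their API):

    from bart_score import BARTScorer

    bart_scorer = BARTScorer(device="cpu", checkpoint="facebook/bart-large-cnn")
    # Higher (less negative) log-likelihood scores mean the summary is more
    # plausible given the source document.
    scores = bart_scorer.score(["the original document text ..."],
                               ["the generated summary ..."],
                               batch_size=4)
    print(scores)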

Procrustes answered 23/3, 2022 at 21:28 Comment(0)

As a quick overview of the available metrics, I wrote a post describing the evaluation metrics: what kinds of metrics there are, how they differ from human evaluation, and so on. You can read the blog post Evaluation Metrics: Assessing the quality of NLG outputs.

Also, along with our NLP projects we created and publicly released an evaluation package, Jury, which is still actively maintained; you can see the reasons why we created such a package in the repo. There are other packages for carrying out evaluation in NLP as well, some of them specialized for a specific NLP task.

Hagi answered 4/4, 2022 at 11:42 Comment(0)

DeepEval

DeepEval is an intuitive, open-source evaluation framework for LLMs (Large Language Models). It is similar to Pytest but specialized for unit testing LLM outputs, and it incorporates the latest research to assess them with metrics like hallucination, answer relevancy, RAGAS, etc. Utilizing LLMs and various other NLP models, DeepEval runs locally on your machine for evaluation.

Whether you're utilizing RAG or fine-tuning, LangChain or LlamaIndex, DeepEval is adaptable to your application needs. It allows for easy determination of optimal hyperparameters to enhance your RAG pipeline, mitigate prompt drifting, or confidently transition from OpenAI to hosting your own Llama2.

Metrics and Features

A diverse range of ready-to-use LLM evaluation metrics is available, each backed by LLMs (with detailed explanations), statistical techniques, or NLP models, and all of them run locally on your machine.

  1. G-Eval
  2. Summarization
  3. Answer Relevancy
  4. Contextual Relevancy
  5. Faithfulness
  6. Contextual Recall
  7. Contextual Precision
  8. RAGAS
  9. Hallucination
  10. Toxicity
  11. Bias

Install required modules

pip install deepeval

Import required modules

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase  
from langchain.chat_models import ChatOpenAI

Let's take this input and actual_output as an example:

input = """
The 'coverage score' is calculated as the percentage of assessment     questions
for which both the summary and the original document provide a     'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score     indicates a
more comprehensive and faithful summary, signifying that the     summary effectively
encapsulates the crucial points and details from the original     content.
"""

This is the summary; replace it with the actual output from your LLM application:

actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

Initialization

openai_api_key = "*************"
llm = ChatOpenAI(
    temperature=0.7,
    openai_api_key=openai_api_key,
    model_name="gpt-3.5-turbo",
)

# Wrap the input/summary pair in a test case
test_case = LLMTestCase(input=input, actual_output=actual_output)

metric = SummarizationMetric(
    threshold=0.5,
    model=llm,
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?",
    ],
)

print(metric.measure(test_case))
print(metric.score)
print(metric.reason)
evaluate([test_case], [metric])  # or evaluate test cases in bulk

Source: metrics-summarization

Source: confident

Source: LLM Evaluation Metrics: Everything You Need for LLM Evaluation

Source: A Step-By-Step Guide to Evaluating an LLM Text Summarization Task

To use AzureChatOpenAI instead of the ChatOpenAI class

Install required modules

pip install langchain

Import required modules

from langchain.chat_models import AzureChatOpenAI        

Initialization

llm = AzureChatOpenAI(
    deployment_name="model_name",
    openai_api_version="2023-01-00-preview",
    azure_endpoint="https://endpoint.com",
    openai_api_key="**********",
    openai_api_type="azure",
)

# test_case is built from input/actual_output exactly as in the previous example
metric = SummarizationMetric(
    threshold=0.5,
    model=llm,
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?",
    ],
)

print(metric.measure(test_case))
print(metric.score)
print(metric.reason)
evaluate([test_case], [metric])  # or evaluate test cases in bulk

Source: LangChain AzureOpenAI integration

Crowbar answered 12/3, 2024 at 17:18 Comment(0)
