BLEU
- BLEU measures precision (a short example follows this list)
- Bilingual Evaluation Understudy
- Originally developed for machine translation (hence "Bilingual")
- Words from the machine-generated summary found in the human reference summary
- That is, how many of the words (and/or n-grams) in the machine-generated summary appear in the human reference summary
- The closer a machine translation is to a professional human translation, the better it is
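As a quick illustration (not part of the original notes), this kind of n-gram precision can be computed with NLTK's sentence_bleu; the sentences below are made-up examples:
# Illustrative BLEU sketch, assuming nltk is installed
from nltk.translate.bleu_score import sentence_bleu
reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of human reference translations
candidate = ["the", "cat", "is", "on", "the", "mat"]      # machine output
# weights=(0.5, 0.5) averages unigram and bigram precision
print(sentence_bleu(reference, candidate, weights=(0.5, 0.5)))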
ROUGE
- ROUGE measures recall (a toy example follows this list)
- Recall-Oriented Understudy for Gisting Evaluation
- Words from the human reference summary found in the machine-generated summary
- That is, how many of the words (and/or n-grams) in the human reference summary appear in the machine-generated summary
- Overlap of n-grams between the system and reference summaries
- ROUGE-N, where N is the n-gram size
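To make the recall orientation concrete, here is a toy, simplified calculation (distinct words only; not the rouge library used later):
# Toy ROUGE-1-style recall: share of reference words that also appear in the candidate
reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
overlap = sum(1 for w in set(reference) if w in candidate)
print(overlap / len(set(reference)))  # 4 of the 5 distinct reference words are covered -> 0.8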
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. 
In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive summarization
# Abstractive summarization with a Hugging Face transformers pipeline
len(reference_text.split())  # number of words in the source text
from transformers import pipeline
summarization = pipeline("summarization")
abstractive_summarization = summarization(reference_text)[0]["summary_text"]
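The summarization pipeline also accepts generation arguments such as max_length, min_length and do_sample if you want to control the summary length (the values and the variable name below are only illustrative):
# Optional: constrain the summary length (illustrative values, hypothetical variable name)
shorter_summary = summarization(reference_text, max_length=130, min_length=30, do_sample=False)[0]["summary_text"]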
Abstractive Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
Extractive summarization
# Extractive summarization with sumy's LexRank summarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
# parser.document.sentences  # inspect the parsed sentences if needed
summarizer = LexRankSummarizer()
extractive_summarization = summarizer(parser.document, 2)  # keep the top 2 sentences
extractive_summarization = ' '.join(str(s) for s in extractive_summarization)
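LexRank is only one of the extractive algorithms shipped with sumy; as a sketch, swapping in another one such as LSA reuses the same parser and only changes the summarizer class:
# Sketch: the same extractive pipeline with sumy's LSA summarizer
from sumy.summarizers.lsa import LsaSummarizer
lsa_summarization = ' '.join(str(s) for s in LsaSummarizer()(parser.document, 2))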
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
Using ROUGE to evaluate the abstractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(abstractive_summarization, reference_text)  # hypothesis first, then reference
ROUGE scores for the abstractive summary
[{'rouge-1': {'f': 0.22299651364421083,
'p': 0.9696969696969697,
'r': 0.12598425196850394},
'rouge-2': {'f': 0.21328671127225052,
'p': 0.9384615384615385,
'r': 0.1203155818540434},
'rouge-l': {'f': 0.29041095634452996,
'p': 0.9636363636363636,
'r': 0.17096774193548386}}]
Using ROUGE to evaluate the extractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(extractive_summarization, reference_text)
ROUGE scores for the extractive summary
[{'rouge-1': {'f': 0.27860696251962963,
'p': 0.8842105263157894,
'r': 0.16535433070866143},
'rouge-2': {'f': 0.22296172781038814,
'p': 0.7127659574468085,
'r': 0.13214990138067062},
'rouge-l': {'f': 0.354755780824869,
'p': 0.8734177215189873,
'r': 0.22258064516129034}}]
Interpreting ROUGE scores
In the outputs above, 'r', 'p' and 'f' stand for recall, precision and F1 score respectively.
ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
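A simplified form of the ROUGE-N formula, reconstructed here from the description below (∑s runs over the n-grams of a single reference summary, ∑r over all reference summaries):

ROUGE-N = ( ∑r ∑s Count_match(gram_n) ) / ( ∑r ∑s Count(gram_n) )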
I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The numerator's ∑s loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (the one proposed by the summarization algorithm). If there is more than one reference summary, ∑r repeats the process over all reference summaries.
The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat it for all documents and average the scores, which gives the final ROUGE-N score. So a higher score means that, on average, there is a larger overlap of n-grams between your summaries and the references.
Example:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S1 is the reference and S2 and S3 are candidates. Note that S2 and S3 each have one overlapping bigram with the reference, so they get the same ROUGE-2 score, although S2 is clearly better. The additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram "the gunman", so it scores 2/4.
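These figures can be checked with a small hypothetical script (plain Python, not the rouge library's implementation):
# Verify the ROUGE-2 and ROUGE-L recall figures for the example above
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "police killed the gunman".split()
for name, cand in [("S2", "police kill the gunman".split()), ("S3", "the gunman kill police".split())]:
    rouge2 = len(set(ngrams(reference, 2)) & set(ngrams(cand, 2))) / len(ngrams(reference, 2))
    rougel = lcs_len(reference, cand) / len(reference)
    print(name, rouge2, rougel)  # S2 and S3 tie on ROUGE-2 (1/3), but ROUGE-L gives 3/4 vs 2/4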
Source: RAG evaluation