How to compute sentence level perplexity from hugging face language models?

# Init tokenizer = AutoTokenizer.from_pretrained("distilgpt2") tokenizer.pad_token = tokenizer.eos_token model = AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000") segmenter = spacy.load('en_core_web_sm') # That's the part I need to vectorise, surely within a document (bsize ~ 10) # and ideally across documents (bsize as big as my GPU can handle) def select_sentence(sentences): """We pick the sentence that maximizes perplexity""" max_loss, best_index = 0, 0 for i, sentence in enumerate(sentences): encodings = tokenizer(sentence, return_tensors="pt") input_ids = encodings.input_ids loss = lm(input_ids, labels=input_ids).loss.item() if loss > max_loss: max_loss = loss best_index = i return sentences[best_index] for document in documents: sentences = [sentence.text.strip() for sentence in segmenter(document).sents] best_sentence = select_sentence(sentences) write(best_sentence)

If the goal is to compute perplexity and then select the sentences, there's a better way to do the perplexity computation without messing around with tokens/models.

Install https://huggingface.co/spaces/evaluate-metric/perplexity:

pip install -U evaluate

Then:

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]

results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))

[out]:

>>>['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
>>>646.75
print(round(results["perplexities"][0], 2))
>>>32.25

Q: That's great but how do I use it for a custom model that can't be fetched with `model_id=...`?

A: For that lets look under the hood, https://huggingface.co/spaces/evaluate-metric/perplexity/blob/main/perplexity.py

This is how the code initialize the model:

class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        ...

Argh, there's no support for local models!

What if we do some simple changes to the code =)

See Load a pre-trained model from disk with Huggingface Transformers


class Perplexity(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                }
            ),
            reference_urls=["https://huggingface.co/docs/transformers/perplexity"],
        )

    def _compute(
        self, predictions, model_id, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None, local_file_only: bool = False
    ):
        ...
        model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=local_file_only)
        model = model.to(device)

        tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=local_file_only)

Technically, if you could load a local model that you can load with:

AutoModelForCausalLM.from_pretrained("clm-gpu/checkpoint-138000", local_file_only=True)

you can should be able the model_id as such after the code change:

perplexity.compute(model_id="clm-gpu/checkpoint-138000",
                             add_start_token=False,
                             predictions=input_texts, 
                             local_file_only=True)

Opened a pull-request: https://huggingface.co/spaces/evaluate-metric/perplexity/discussions/4

Q: That's great but how do I use it for a custom model that can't be fetched with `model_id=...`?

Argh, there's no support for local models!

Recommended topics

Hot tags

Q: That's great but how do I use it for a custom model that can't be fetched with model_id=...?

Argh, there's no support for local models!

Recommended topics

Hot tags

Q: That's great but how do I use it for a custom model that can't be fetched with `model_id=...`?