Generate the probabilities of all possible next words for a given text

I have the following code:

import transformers
from transformers import pipeline

# Load the language model pipeline
model = pipeline("text-generation", model="gpt2")

# Input sentence for generating next word predictions
input_sentence = "I enjoy walking in the"

I want to generate only the next word given the input sentence, but I want to see a list of all possible next words along with their probabilities. Any other LLM can be used; I put GPT-2 as an example.

In the code I want to get the top 500 or top 1000 suggestions for only the next word, along with the probability of each suggested word. How can I do this?

Gurge asked 3/6, 2023 at 20:23

We have to work at a lower level here, as the pipeline function is not appropriate for what you are trying to do.

After you pass your sequence to AutoModelForCausalLM, the logits at the last position of the output contain a score for every token in the vocabulary being the next token. In the code below, I call this slice next_token_candidates_tensor. After that, you simply need to select the indices of the top-k candidates, turn the logits into probabilities with a softmax, and decode the candidates back to words.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LMHeadModel:

    def __init__(self, model_name):
        # Initialize the model and the tokenizer.
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def get_predictions(self, sentence):
        # Encode the sentence using the tokenizer and return the model predictions.
        inputs = self.tokenizer.encode(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(inputs)
            # The logits tensor has shape (batch_size, sequence_length, vocab_size).
            predictions = outputs.logits
        return predictions
    
    def get_next_word_probabilities(self, sentence, top_k=500):

        # Get the model predictions for the sentence.
        predictions = self.get_predictions(sentence)
        
        # Get the logits for the next token (the last position in the sequence).
        next_token_candidates_tensor = predictions[0, -1, :]

        # Get the top k next-token candidates. (topk on the raw logits gives the
        # same indices as topk on the probabilities, since softmax is monotonic.)
        topk_candidates_indexes = torch.topk(
            next_token_candidates_tensor, top_k).indices.tolist()

        # Get the token probabilities for all candidates.
        all_candidates_probabilities = torch.nn.functional.softmax(
            next_token_candidates_tensor, dim=-1)
        
        # Filter the token probabilities for the top k candidates.
        topk_candidates_probabilities = \
            all_candidates_probabilities[topk_candidates_indexes].tolist()

        # Decode the top k candidates back to words.
        topk_candidates_tokens = \
            [self.tokenizer.decode([idx]).strip() for idx in topk_candidates_indexes]

        # Return the top k candidates and their probabilities.
        return list(zip(topk_candidates_tokens, topk_candidates_probabilities))


sentence = "I enjoy walking in the"
model = LMHeadModel("gpt2")
model.get_next_word_probabilities(sentence, top_k=500)

# [('park', 0.15904344618320465),
# ('woods', 0.10028065741062164),
# ('streets', 0.0418376550078392),
# ('dark', 0.03117542900145054),
# ('door', 0.029618268832564354),
# ('street', 0.02388935722410679),
# ('rain', 0.021733922883868217),
# ...
Dov answered 3/6, 2023 at 21:40
This is great, thanks a lot. A general question: what is the limit for top_k? Does it generate probabilities for every possible word in the English vocabulary, or is there a limit? And if there is a limit, does it depend on the type of model we use? – Gurge
Yes, there is a limit for top_k, and it depends on your model. That's because you're not generating probabilities specifically for English words, but rather for the tokens in the model's vocabulary. Therefore, the limit for top_k is equal to the length of next_token_candidates_tensor, because each position in this tensor corresponds to one token in the model's vocabulary. – Dov
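
To make the vocabulary-size point concrete, here is a small illustrative sketch (not part of the original answer) that checks the vocabulary size reported by the gpt2 tokenizer and model config; both should match the length of next_token_candidates_tensor, i.e. 50257 for gpt2:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Both report the vocabulary size, which is the upper bound for top_k.
print(tokenizer.vocab_size)     # 50257
print(model.config.vocab_size)  # 50257
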
Thanks again, Ruran. Is there a model that can generate for >500k tokens? For GPT-2 the limit was 50k. Is the limit there because the probability is zero for many tokens, hence not printing them and setting the limit? – Gurge
Most models of this kind will have around 50k tokens in their vocabulary, regardless of their size. For each token you receive from calling get_next_word_probabilities, add it to the original sequence, and then call get_next_word_probabilities again to obtain the next list of tokens. Remember that tokens do not necessarily correspond to single English words (a word can be represented by multiple tokens). – Dov
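
As a rough sketch of that iterative use (building on the LMHeadModel class from this answer; the loop below is illustrative and not part of the original code), a greedy loop could look like this. Note that because the class strips the leading space from each decoded token, re-joining with a plain space is only an approximation when a subword token is chosen:

model = LMHeadModel("gpt2")
sentence = "I enjoy walking in the"

for _ in range(5):
    # Take the single most probable candidate and append it to the sentence.
    next_token, prob = model.get_next_word_probabilities(sentence, top_k=1)[0]
    sentence = sentence + " " + next_token
    print(sentence, f"(p={prob:.3f})")
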
Let's say we have 500k English words; if they get converted into tokens, we will have an even larger number. So my understanding was that LLMs provide probabilities for every single token they have been trained on. Isn't that wrong? Why do we have a 50k limit? – Gurge
Maybe this document will make things clear for you: Byte-Pair Encoding tokenization – Dov
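
As a small illustration of the subword point (the example words here are arbitrary), the GPT-2 tokenizer splits rare words into several BPE tokens while common words usually map to a single token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A rare word is typically split into several subword tokens,
# while a frequent word is usually a single token.
print(tokenizer.tokenize("unbelievability"))
print(tokenizer.tokenize("walking"))
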

I think you do yourself a favor when you avoid the pipeline for this and just use the respective language modeling class. All you need to do is:

  1. get the logits of the next token (GPT-2 uses tokens that are not necessarily whole words),
  2. apply the softmax to get the probabilities,
  3. apply topk to retrieve the k most probable tokens.

import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

t = GPT2TokenizerFast.from_pretrained("gpt2")
m = GPT2LMHeadModel.from_pretrained("gpt2")

encoded_text = t("I enjoy walking in the", return_tensors="pt")

# 1. Get the logits of the next token
with torch.inference_mode():
  outputs = m(**encoded_text)

next_token_logits = outputs.logits[0, -1, :]
print(next_token_logits.shape)
print(next_token_logits)

# 2. Convert the logits to probabilities
next_token_probs = torch.softmax(next_token_logits, -1)

# 3. Get the 10 most probable tokens
topk_next_tokens = torch.topk(next_token_probs, 10)

# Putting it together: decode each token id and pair it with its probability
print(*[(t.decode(idx), prob) for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values)], sep="\n")

Output:

torch.Size([50257])
tensor([ -95.1139,  -93.7291,  -97.5711,  ...,  -98.0303, -100.2803,
         -96.1145])
(' park', tensor(0.1590))
(' woods', tensor(0.1003))
(' streets', tensor(0.0418))
(' dark', tensor(0.0312))
(' door', tensor(0.0296))
(' street', tensor(0.0239))
(' rain', tensor(0.0217))
(' city', tensor(0.0189))
(' same', tensor(0.0150))
(' halls', tensor(0.0135))
Recoup answered 3/6, 2023 at 21:34
