Fine-Tuning GPT2 - attention mask and pad token id errors

I have been trying to fine-tune GPT2 on the wikitext-2 dataset (just to help myself learn the process) and I am running into a warning message that I have not seen before:

"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:50256 for open-end generation."

This seems strange, since I explicitly specify the EOS and pad tokens in my code when instantiating the tokenizer:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

Training completes without crashing and my loss improves every epoch, but when I run inference with the model it outputs absolute gibberish, sometimes generating only a single word and nothing else. I suspect there is a link between this warning message and the model's poor performance.

I got my training, validation, and test data from here (I used the .raw files): https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

I manually added <|startoftext|> and <|endoftext|> to the raw txt files for the datasets, resulting in training data that looks like these two examples (taken from the middle of the text file):

...
<|startoftext|>
= Perfect Dark ( 2010 video game ) = 
 
 Perfect Dark is a remastered release of the first @-@ person shooter video game by the same name . Developed by 4J Studios and published by Microsoft Game Studios a decade after the original 's 2000 release , the remaster features several technical improvements , including higher resolution textures and models , a higher frame rate , and a multiplayer mode that supports the Xbox Live online service . It was released for the Xbox 360 video game console in March 2010 , through the Xbox Live Arcade download service . The story of the game follows Joanna Dark , an agent of the Carrington Institute organization , as she attempts to stop a conspiracy by rival corporation dataDyne . 
 Perfect Dark was under development for nearly a year and its game engine was completely re @-@ written from scratch to support several Xbox 360 features . Therefore , although the game plays exactly the same as the original , the code and renderer is different . The game received generally favorable reviews . Some critics considered the relatively unchanged game to be outdated , but most agreed that the title was a solid revival of a classic . As of the end of 2011 , the game had sold nearly 410 @,@ 000 units . 
 
 = = Gameplay = = 
 
 Perfect Dark is a first @-@ person shooter with elements of stealth games . In the game 's campaign mode , the player controls Joanna Dark through a series of nonlinear levels collected together into missions . Each level requires the player to complete a certain number of objectives , ranging from disguising oneself to hacking computers , collecting objects , and defeating enemies , among others . Players can carry an unlimited number of weapons and almost all of the weapons have two firing modes . The levels in Perfect Dark have no checkpoints , meaning that if Joanna is killed or fails an objective , the player has to start the level from the beginning . Every level can be played on three difficulty settings and several aspects , such as the enemies aggressiveness and the number of objectives that must be completed , among others , can vary in function of the chosen difficulty . Two players can also play the campaign co @-@ operatively or through a " counter @-@ operative " mode , in which one player controls the protagonist , while the other controls enemies throughout the level , attempting to stop the first player from completing objectives . 
 
 = = = Enhancements = = = 
 
 The remaster offers several improvements over the original Perfect Dark that was released for the Nintendo 64 in 2000 . The most remarkable change is that any of the multiplayer modes , including co @-@ operative and counter @-@ operative , can now be played in either splitscreen or through the Xbox Live online service . Combat Simulator matches are still capped at 12 entities , but the game can now comprise eight players online simultaneously , an improvement to the original 's cap of four players and eight Simulants . Players can also play against more than eight Simulants as long as there are enough slots available in a match ; for example , a single player can play against 11 Simulants ; such a feature was not possible in the original game . Unlike the original game , all the multiplayer content is unlocked from the beginning , and weapons from the game 's predecessor , which were originally only available in the missions , are now available to use in multiplayer . The game features an online leaderboard system and players can earn achievements and in @-@ game crowns by accomplishing certain tasks . The game also includes two new control set @-@ ups , entitled " Spartan " and " Duty Calls " , which are based on the popular first @-@ person shooter franchises Halo and Call of Duty respectively . 
 
 <|endoftext|>
<|startoftext|>
 = First Ostend Raid = 
 
 The First Ostend Raid ( part of Operation ZO ) was the first of two attacks by the Royal Navy on the German @-@ held port of Ostend during the late spring of 1918 during the First World War . Ostend was attacked in conjunction with the neighbouring harbour of Zeebrugge on 23 April in order to block the vital strategic port of Bruges , situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland and ideally sited to conduct raiding operations on the British coastline and shipping lanes . Bruges and its satellite ports were a vital part of the German plans in their war on Allied commerce ( Handelskrieg ) because Bruges was close to the troopship lanes across the English Channel and allowed much quicker access to the Western Approaches for the U @-@ boat fleet than their bases in Germany . 
 The plan of attack was for the British raiding force to sink two obsolete cruisers in the canal mouth at Ostend and three at Zeebrugge , thus preventing raiding ships leaving Bruges . The Ostend canal was the smaller and narrower of the two channels giving access to Bruges and so was considered a secondary target behind the Zeebrugge Raid . Consequently , fewer resources were provided to the force assaulting Ostend . While the attack at Zeebrugge garnered some limited success , the assault on Ostend was a complete failure . The German marines who defended the port had taken careful preparations and drove the British assault ships astray , forcing the abortion of the operation at the final stage . 
 Three weeks after the failure of the operation , a second attack was launched which proved more successful in sinking a blockship at the entrance to the canal but ultimately did not close off Bruges completely . Further plans to attack Ostend came to nothing during the summer of 1918 , and the threat from Bruges would not be finally stopped until the last days of the war , when the town was liberated by Allied land forces . 
 
 = = Bruges = = 
 
 Bruges had been captured by the advancing German divisions during the Race for the Sea and had been rapidly identified as an important strategic asset by the German Navy . Bruges was situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland at the centre of a network of canals which emptied into the sea at the small coastal towns of Zeebrugge and Ostend . This land barrier protected Bruges from bombardment by land or sea by all but the very largest calibre artillery and also secured it against raiding parties from the Royal Navy . Capitalising on the natural advantages of the port , the German Navy constructed extensive training and repair facilities at Bruges , equipped to provide support for several flotillas of destroyers , torpedo boats and U @-@ boats . 
 By 1916 , these raiding forces were causing serious concern in the Admiralty as the proximity of Bruges to the British coast , to the troopship lanes across the English Channel and for the U @-@ boats , to the Western Approaches ; the heaviest shipping lanes in the World at the time . In the late spring of 1915 , Admiral Reginald Bacon had attempted without success to destroy the lock gates at Ostend with monitors . This effort failed , and Bruges became increasingly important in the Atlantic Campaign , which reached its height in 1917 . By early 1918 , the Admiralty was seeking ever more radical solutions to the problems raised by unrestricted submarine warfare , including instructing the " Allied Naval and Marine Forces " department to plan attacks on U @-@ boat bases in Belgium . 
 The " Allied Naval and Marine Forces " was a newly formed department created with the purpose of conducting raids and operations along the coastline of German @-@ held territory . The organisation was able to command extensive resources from both the Royal and French navies and was commanded by Admiral Roger Keyes and his deputy , Commodore Hubert Lynes . Keyes , Lynes and their staff began planning methods of neutralising Bruges in late 1917 and by April 1918 were ready to put their plans into operation . 
 
 = = Planning = = 
 
 To block Bruges , Keyes and Lynes decided to conduct two raids on the ports through which Bruges had access to the sea . Zeebrugge was to be attacked by a large force consisting of three blockships and numerous supporting warships . Ostend was faced by a similar but smaller force under immediate command of Lynes . The plan was for two obsolete cruisers — HMS Sirius and Brilliant — to be expended in blocking the canal which emptied at Ostend . These ships would be stripped to essential fittings and their lower holds and ballast filled with rubble and concrete . This would make them ideal barriers to access if sunk in the correct channel at the correct angle . 
 When the weather was right , the force would cross the English Channel in darkness and attack shortly after midnight to coincide with the Zeebrugge Raid a few miles up the coast . By coordinating their operations , the assault forces would stretch the German defenders and hopefully gain the element of surprise . Covering the Inshore Squadron would be heavy bombardment from an offshore squadron of monitors and destroyers as well as artillery support from Royal Marine artillery near Ypres in Allied @-@ held Flanders . Closer support would be offered by several flotillas of motor launches , small torpedo boats and Coastal Motor Boats which would lay smoke screens to obscure the advancing blockships as well as evacuate the crews of the cruisers after they had blocked the channel . 

<|endoftext|> ...

I followed this tutorial very closely: https://colab.research.google.com/drive/13dZVYEOMhXhkXWfvSMVM1TTtUDrT6Aeh?usp=sharing#scrollTo=pBEVY2PYSTXJ

Here is my full code:

import random
import time
import datetime
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup, GPT2Config

smallest_gpt2 = 'gpt2'  # 124M weights (parameters)

# load training texts
with open('wikitext-2-raw/wiki.train.raw', 'r') as o:
    raw_train_text = o.read()  # read() returns the entire file as one string
with open('wikitext-2-raw/wiki.valid.raw', 'r') as o:
    raw_validation_text = o.read()
with open('wikitext-2-raw/wiki.test.raw', 'r') as o:
    raw_test_text = o.read()

# PRE-PROCESSING TRAINING, VALIDATION, AND TEST TEXTS
preprocessed_train = raw_train_text.split('<|startoftext|>')
preprocessed_train = [i for i in preprocessed_train if i]  # removes empty list entries
preprocessed_train = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_train]  # adds <|startoftext|> to start
preprocessed_valid = raw_validation_text.split('<|startoftext|>')
preprocessed_valid = [i for i in preprocessed_valid if i]
preprocessed_valid = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_valid]
preprocessed_test = raw_test_text.split('<|startoftext|>')
preprocessed_test = [i for i in preprocessed_test if i]
preprocessed_test = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_test]

# HYPER PARAMETERS
EPOCHS = 5
BATCH_SIZE = 2  # GPT2 is a large model, so higher batch sizes can lead to memory problems
WARMUP_STEPS = 100
LEARNING_RATE = 5e-4
DECAY = 0
EPSILON = 1e-8


class GPT2Dataset(Dataset):

    def __init__(self, txt_list, _tokenizer, gpt2_type=smallest_gpt2, max_length=768):
        self.tokenizer = _tokenizer
        self.input_ids = []
        self.attn_masks = []

        # this loop tokenizes each training example; wrapping in BOS and EOS tokens (beginning/end of sequence)
        #   was already done manually in the raw files, so it is skipped here - it helps the model learn the
        #   "format" of what you're training it for
        # note, however, that if a training example is longer than the max length, the EOS token gets truncated
        #   away; this is not a problem for the model's training process
        for txt in txt_list:
            # pre_processed_text = '<|startoftext|>' + txt + '<|endoftext|>'  # i did this manually, so I skip it here
            # print(txt)

            # i handled most of the pre-processing for the training data further up in the code
            encodings_dict = _tokenizer(txt, truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]


# loading tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>',
                                          pad_token='<|pad|>')  # gpt2-medium

print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} token has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

# create dataset objects
train_dataset = GPT2Dataset(preprocessed_train, tokenizer, max_length=768)
valid_dataset = GPT2Dataset(preprocessed_valid, tokenizer, max_length=768)
test_dataset = GPT2Dataset(preprocessed_test, tokenizer, max_length=768)

# getting size of datasets
train_size = len(train_dataset)
val_size = len(valid_dataset)

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(  # todo learn how dataloader creates targets
            train_dataset,  # The training samples.
            sampler=RandomSampler(train_dataset),  # Select batches randomly
            batch_size=BATCH_SIZE  # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            valid_dataset,  # The validation samples.
            sampler=SequentialSampler(valid_dataset),  # Pull out batches sequentially.
            batch_size=BATCH_SIZE  # Evaluate with this batch size.
        )

# config
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate model
model = GPT2LMHeadModel.from_pretrained(smallest_gpt2, config=configuration)

# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up. NOTE these tokens are already added to tokenizer above
model.resize_token_embeddings(len(tokenizer))

# this produces sample output every 50 steps
sample_every = 50

# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * EPOCHS

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)

training_stats = []
total_t0 = time.time()

# device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)


def format_time(_elapsed):
    return str(datetime.timedelta(seconds=int(round(_elapsed))))


for epoch_i in range(0, EPOCHS):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, EPOCHS))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()  # puts model in training mode

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)  # training targets: for causal LM the labels are just the input ids (the model shifts them internally)
        b_masks = batch[1].to(device)

        model.zero_grad()

        # feeding the input to the model
        outputs = model(b_input_ids,
                        labels=b_labels,
                        attention_mask=b_masks,
                        token_type_ids=None
                        )

        loss = outputs[0]  # how "wrong" was the model?

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches. This is just a check to see how the model is doing.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader),
                                                                                     batch_loss, elapsed))

            model.eval()  # puts model in evaluation mode, where the necessary layers are turned off for inference

            # normally you would wrap this sampling in a torch.no_grad() context so that no gradients are tracked during this inference; however, the tutorial I follow does not do this.
            # with torch.no_grad():
            # ... do inference eval ...

            # Here we are simply using the model to get an output. This is called inference.
            sample_outputs = model.generate(
                bos_token_id=random.randint(1, 30000),  # todo why do we do this line?
                do_sample=True,  # switches on sampling, where model will randomly select next word from the sample pool
                top_k=50,  # only 50 words will be considered for the next word in the sequence
                max_length=200,  # max tokens for total generation
                top_p=0.95,  # smallest set of words whose probabilities summed together reach/exceed top_p value
                num_return_sequences=1  # we only want model to generate one complete response (sequence of words)
                # temperature=1
            )

            # temperature is another parameter we can use when running inference
            # a temperature close to 0 makes the model pick the highest-probability word almost every time
            # temperature of 1 is the default and uses the model's raw confidence to choose the next word
            # temperature above 1 makes the model choose less-likely words; more creative, but more risk of nonsense

            # we only generate one return sequence, so this loop is not strictly necessary
            for i, sample_output in enumerate(sample_outputs):
                print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

            model.train()  # we have to put model back in train mode after eval mode

        loss.backward()  # compute gradients via backprop (the weights are updated by optimizer.step() below)

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        with torch.no_grad():  # no gradients are tracked during evaluation
            outputs = model(b_input_ids,
                            # token_type_ids=None,
                            attention_mask=b_masks,
                            labels=b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_t0)))

Yim answered 5/12, 2022 at 1:57 Comment(1)
See: #69609901 (Marco)

I will only comment on the warning below:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

The above means that when you call generate on the model, it doesn't know which pad token you are using. The generate method uses the pad and eos tokens for multiple purposes, for example, to figure out what the attention mask should be (i.e., which tokens to ignore in the input sequence) and in various decoding strategies. It is unfortunate that many popular tokenizers don't set this token, so people end up with these warnings.

To fix this, first add this code after loading the pre-trained tokenizer:

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Then pass it to the generate method like this:

gen_ids = model.generate(**encodings, pad_token_id=tokenizer.pad_token_id, max_new_tokens=200)
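
For context, here is a minimal end-to-end sketch of that flow; the prompt and the sampling settings are placeholders, not taken from the question:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

model = GPT2LMHeadModel.from_pretrained('gpt2')

# encode a (hypothetical) prompt; the tokenizer returns both input_ids and attention_mask
encodings = tokenizer("The First Ostend Raid was", return_tensors='pt')

with torch.no_grad():
    gen_ids = model.generate(
        **encodings,                          # passes input_ids and attention_mask
        pad_token_id=tokenizer.pad_token_id,  # tells generate which id is used for padding
        max_new_tokens=200,
        do_sample=True,
        top_k=50,
        top_p=0.95,
    )

print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))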

You can see the Hugging Face code producing this warning here.

You can see a full working example here: https://github.com/sytelus/jupyter_nbs/blob/main/codegen_decoding.ipynb

Phantom answered 25/6, 2023 at 8:48 Comment(1)
Typically, I update the generation config right after loading the model rather than specifying the pad_token_id during each call to generate. The following handles this update: model.generation_config.pad_token_id = tokenizer.pad_token_id (Kunming)
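
In code, that suggestion looks roughly like this (a sketch that assumes a recent transformers version in which models expose a generation_config attribute):

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.generation_config.pad_token_id = tokenizer.pad_token_id  # applies to every later generate() call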

I don't think it is related to your model performing badly, but to answer your question, the warning is related to the generation routine.

As explained here, this is solved by simply setting the pad_token_id to the tokenizer's eos_token_id in the call to generate. It worked for me.
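
A minimal sketch of that fix (the prompt is just a placeholder):

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

inputs = tokenizer("Perfect Dark is", return_tensors='pt')
output_ids = model.generate(
    **inputs,
    pad_token_id=tokenizer.eos_token_id,  # suppresses the "pad token id was not set" warning
    max_new_tokens=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))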

Remex answered 24/3, 2023 at 2:19 Comment(0)

I just want to ask a further clarifying question about the following answer:

The above means that when you call generate on the model, it doesn't know which pad token you are using.
The generate method uses the pad and eos tokens for multiple purposes, for example, to figure out what the attention mask should be (i.e., which tokens to ignore in the input sequence) and in various decoding strategies.
It is unfortunate that many popular tokenizers don't set this token, so people end up with these warnings.

Since the OP has already added tokens for padding, bos, and eos to the tokenizer, why should we necessarily use the same token id for both padding and eos?

For instance, I get the same issue even though I have already added the following to my tokenizer:

tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'bos_token': '<|startoftext|>'})

Training proceeds normally without any issue, but then when generating I get the following warning:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
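
For reference, a minimal sketch of what that warning asks for, i.e. left padding when batching prompts for a decoder-only model (the prompts are placeholders; the special tokens mirror the question):

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='left')
tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'bos_token': '<|startoftext|>'})

model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))  # needed because new special tokens were added

prompts = ["The First Ostend Raid was", "Perfect Dark is"]
batch = tokenizer(prompts, return_tensors='pt', padding=True)  # now pads on the left

gen_ids = model.generate(
    **batch,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=50,
)
for ids in gen_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))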
Studner answered 4/4 at 13:26 Comment(0)
