How does one set the pad token correctly (not to eos) during fine-tuning to avoid model not predicting EOS?

**tl;dr: what I really want to know is: what is the official way to set the pad token for fine-tuning when it wasn't set during the original training, so that the model still learns to predict EOS?**

colab: https://colab.research.google.com/drive/1poFdFYmkR_rDM5U5Z2WWjTepMQ8hvzNc?usp=sharing


The HF falcon tutorial has the following line:

tokenizer.pad_token = tokenizer.eos_token

It looks strange to me. It makes some sense that pad and eos end up being the same, but then why even make a distinction between them in the first place?

Note it is wrong to do pad = eos. It means that during fine-tuning the model will (most likely) never be trained to output eos, since eos is treated as a pad token and is not back-propagated:

I just observed that when I set tokenizer.pad_token = tokenizer.eos_token during training, the model won't stop generating during inference, since it was trained to not output the eos token (per discussions above).

I saw this (here https://github.com/huggingface/transformers/issues/22794):

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

But this assumes the model has a pad_token. I think an additional check has to be done that the model actually has an embedding row for the pad_token, so that there are no runtime errors (roughly, index errors when looking the token up in the embedding "table"/matrix).

But if one does that, some care might be needed to initialize the new token's embedding so that it doesn't dominate generation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
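
For example, a minimal sketch of that (not from the tutorial; it uses gpt2 just to keep it small, but the same idea applies to Falcon). The new pad embedding is set to the mean of the existing rows, per the vocab-expansion note above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # tiny model just to illustrate; swap in your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # 0 if a pad token already exists
model.resize_token_embeddings(len(tokenizer))  # appends randomly-initialized rows for the new token(s)

if num_added > 0:
    with torch.no_grad():
        emb = model.get_input_embeddings().weight
        mean_emb = emb[:-num_added].mean(dim=0, keepdim=True)
        emb[-num_added:] = mean_emb  # init the new row(s) to the mean of the old embeddings
model.config.pad_token_id = tokenizer.pad_token_id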


code:

def get_model_tokenizer_qlora_falcon7b(model_name: str = "ybelkada/falcon-7b-sharded-bf16",
                                       config=None,  # todo: wandb.Config
                                       lora_alpha=16,  # todo
                                       lora_dropout=0.1,  # todo
                                       lora_r=64,  # todo
                                       bnb_4bit_compute_dtype=torch.float16,  # changed it from Guanaco hf
                                       ) -> tuple:
    """
    Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.

    bf16 = 1S, 7Exp, 8Mantissa

    Do:
        pip install bitsandbytes
    ref:
        - https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # model_id = "tiiuae/falcon-7b"
    # model_name: str = "ybelkada/falcon-7b-sharded-bf16"

    # - get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type="nf4",  # normal float 4 for the (usually huge) base model. introduces error but fixed by ft
        # ref: https://gist.github.com/pacman100/1731b41f7a90a87b457e8c5415ff1c14
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    )

    # - get falcon 4bit model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
    )
    model.config.use_cache = False  # todo: why? https://mcmap.net/q/673087/-why-does-hugging-face-falcon-model-use-mode-config-use_cache-false-why-wouldn-39-t-it-want-to-have-the-decoder-re-use-computations-for-fine-tuning

    # get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)  # execs code downloaded from hf hub
    tokenizer.pad_token = tokenizer.eos_token

Modifying the model gives issues

Darn, this still doesn't work:

 UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

code:

"""
sfttrainer (likely using peft) best practices:
https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices

Best practices

Pay attention to the following best practices when training a model with that trainer:

- SFTTrainer always pads by default the sequences to the max_seq_length argument of the SFTTrainer. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide default value, so there is a check to retrieve the minimum between 2048 and that value. Make sure to check it before training.
- For training adapters in 8bit, you might need to tweak the arguments of the prepare_model_for_int8_training method from PEFT, hence we advise users to use prepare_in_int8_kwargs field, or create the PeftModel outside the SFTTrainer and pass it.
- For a more memory-efficient training using adapters, you can load the base model in 8bit, for that simply add load_in_8bit argument when creating the SFTTrainer, or create a base model in 8bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure to not pass to the trainer any additional keyword arguments that are relative to from_pretrained() method.

todo: why trust_remote_code? I want more details.
"""
import sys

import torch
from peft import LoraConfig

from transformers.modeling_utils import PreTrainedModel

from pdb import set_trace as st


def test_bfloat16_int4(compute_dtype: torch.dtype,
                       use_4bit,
                       ):
    """
python -c "import torch; print(torch.cuda.get_device_capability());"
    todo: check other code test_bfloat16() do we need use_4bit?
    """
    if compute_dtype == torch.float16 and use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bfloat16")
            print("=" * 80)


def get_model_tokenizer_qlora_falcon7b(
        # -- mode args
        # model_id = "tiiuae/falcon-7b"
        pretrained_model_name_or_path: str = "ybelkada/falcon-7b-sharded-bf16",
        use_cache: bool = True,
        # -- lora args
        lora_alpha=16,  # todo
        lora_dropout=0.1,  # todo, evidence drop out really help? google, crfm, gpt4
        lora_r=64,  # todo
        bnb_4bit_compute_dtype=torch.float16,  # changed it from Guanaco hf

        # -- training args
        output_dir="./results",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        # paging so that the sudden mem gpu spikes don't cause the run to shut down
        # (I think usually caused by too long seqs)
        # todo: why 32 bit opt?
        # todo: paged nadamw opt?
        optim="paged_adamw_32bit",
        save_steps=10,
        logging_steps=10,
        learning_rate=2e-4,
        max_grad_norm=0.3,
        max_steps=500,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        # -- quant. args (not recommended to be changed unless you know what you're doing)
        load_in_4bit=True,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type="nf4",  # normal float 4 for the (large) base models qlora
) -> tuple:
    """
    Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.

    bf16 = 1S, 7Exp, 8Mantissa
    hypothesis: 7b trained due to 6.7 emergence rumour, I still don't think emergence is real.
    Notes:
        - ft a model is very specific to the model, tokenizer and training scheme. Thus we return
            - model, tokenizer, ft config (peft config), training args

    ref:
        - https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # - Get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type=bnb_4bit_quant_type,  # normal float 4 for the (usually huge) base model
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,  # if you can, during computation use bf16
    )

    # - Get falcon 4bit model
    # todo, where is this being saved & how to download quicker
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
    )
    print(f'{type(model)=}')
    print(f'{model=}')
    # this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://mcmap.net/q/673087/-why-does-hugging-face-falcon-model-use-mode-config-use_cache-false-why-wouldn-39-t-it-want-to-have-the-decoder-re-use-computations-for-fine-tuning
    model.config.use_cache = use_cache
    print(f'{type(model)=}')

    # - Get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                              trust_remote_code=True)  # execs code downloaded from hf hub
    # tokenizer.pad_token = tokenizer.eos_token  # ref: https://mcmap.net/q/673086/-how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoid-model-not-predicting-eos
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # I think this is fine if during the training pad is ignored

    # - Modify model
    # add pad token embed
    model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
    model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
    model.config.max_new_tokens = len(tokenizer)
    # model.config.min_length = 1
    print(f'{model=}')
    print(f'{type(tokenizer)=}')
    print(f'{tokenizer.pad_token=}')
    # data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) todo

    # - Get falcon lora config
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        # model card for falcon tiiuae/falcon-7b: https://huggingface.co/tiiuae/falcon-7b/blob/main/modelling_RW.py
        # does seem to include all trainable params as done by qlora on their own paper
        target_modules=[
            # word_embeddings,
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
            # "lm_head"
        ]
    )
    print(f'{type(peft_config)=}')

    # todo: print the num params of the lora = D1*r + D2*r and num of bytes by prec. (bytes) * num params
    return model, tokenizer, peft_config


# -- tests

def example_test_model_already_has_pad_token():
    """
    If the model already has a pad token, it likely has a small probability, so we are done.

    Compare its norm with the other tokens' norms to verify this is true.

python ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py
    """
    # - the get datasets todo: preprocessing, padding, streaming
    from uutils.hf_uu.data_hf.common import get_guanaco_datsets_add_splits_train_test_only
    trainset, _, testset = get_guanaco_datsets_add_splits_train_test_only()

    # qlora flacon7b
    from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
    model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
    model: PreTrainedModel = model
    print(f'{model=}')
    sent = 'Dogs are great because they are '
    print()

    # print to see if pad tokens are present and if it ignores the tokens at the end
    encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
    print(f'{encoded_input=}')

    # Print all special tokens
    print('\n---- start Print all special tokens')
    for token_name, token in tokenizer.special_tokens_map.items():
        print(f"{token_name}: {token}")
    print('\n---- end Print all special tokens')

    # Get the ID for the pad token we added above ('<|pad|>')
    try:
        pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    except KeyError:
        raise ValueError(f"Token {tokenizer.pad_token} is not present in the tokenizer vocabulary.")

    # Index into the model's embedding table
    try:
        print(f'{model.get_input_embeddings().weight.size()=}')
        pad_embedding = model.get_input_embeddings().weight[pad_token_id]
    except IndexError:
        raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")

    print(f'{pad_embedding=}')
    print('Success!\n')

    # check it generates something sensible
    # tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
    input_ids, attention_mask = encoded_input['input_ids'], encoded_input['attention_mask']
    predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
    predicted_tokens_ids = predicted_tokens_ids_options[0]
    predicted_sent = tokenizer.decode(predicted_tokens_ids)
    print(f'original sentence: {sent=}')
    print(f'predicted sentence: {predicted_sent=}')
    print('Success2!')


if __name__ == '__main__':
    import time

    start_time = time.time()
    example_test_model_already_has_pad_token()
    print(f"The main function executed in {time.time() - start_time} seconds.\a")

it doesn't like the modifications to the model:

    model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
    model.config.max_new_tokens = len(tokenizer)

How to fix?

Errors:

/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1452: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 211, in <module>
    example_test_model_already_has_pad_token()
  File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 199, in example_test_model_already_has_pad_token
    predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 2633, in sample
    next_token_scores = logits_warper(input_ids, next_token_scores)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 302, in __call__
    indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
RuntimeError: "topk_cpu" not implemented for 'Half'
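
A likely fix for the warnings and the error above (a sketch, not tested on this exact setup): don't set generation parameters on model.config (that is what triggers the deprecation warning), and move the inputs onto the model's device before calling generate:

device = next(model.parameters()).device
input_ids = encoded_input['input_ids'].to(device)
attention_mask = encoded_input['attention_mask'].to(device)
out = model.generate(input_ids=input_ids,
                     attention_mask=attention_mask,
                     do_sample=True,
                     max_new_tokens=256,                   # instead of model.config.max_new_tokens = ...
                     pad_token_id=tokenizer.pad_token_id)  # also silences the "Setting pad_token_id to eos_token_id" message
print(tokenizer.decode(out[0]))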

Bounty Section: Small GPT2 code example

Yes, I agree that pad is assigned to eos and eos is still eos. But during fine-tuning the weights with respect to eos are now unchanged. This might be an issue, since the probability of eos has not shifted to the fine-tuning regime. One possibility is that eos is output with lower probability. Yes, we can still halt generation when we see eos, but we have not shifted the probability of outputting eos to match our fine-tuning distribution, while all other tokens have changed distribution. I think this could be an issue because the old probability of eos is not even conserved (all token probabilities have changed except the eos logit), and even if the old eos probability were conserved, it would be with respect to the wrong distribution (not the fine-tuning one).

e.g.,

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
...
    raw_text_batch='a'
    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 0, 0, 0, 0]])}

but it would have been better to have

    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 0, 0, 0]])}

code:

def test_eos_pad():
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    raw_text_batch = 'a'

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # print(f'{tokenizer.eos_token=}')
    # print(f'{tokenizer.eos_token_id=}')
    # print(f'{tokenizer.pad_token=}')
    # print(f'{tokenizer.pad_token_id=}')

    # print(f'{raw_text_batch=}')
    # tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    # print(f'{tokenize_batch=}')

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.pad_token=}')
    print(f'{tokenizer.pad_token_id=}')

    print(f'{raw_text_batch=}')
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    print('Done')
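
For comparison, a minimal sketch (mine, not from the tutorial) that appends eos explicitly and pads with a distinct pad token; this produces the "better" attention mask shown above, [1, 1, 0, 0, 0]:

def test_distinct_pad():
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")

    # use a pad token that is NOT eos, and give it its own embedding row
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
    probe_network.resize_token_embeddings(len(tokenizer))

    # append eos explicitly so its position is attended to (and can be trained on)
    raw_text_batch = 'a' + tokenizer.eos_token
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    # expected: input_ids      = [[64, 50256, 50257, 50257, 50257]]
    #           attention_mask = [[1, 1, 0, 0, 0]]  (eos kept, pads masked)
    # when building labels, additionally set them to -100 wherever input_ids == tokenizer.pad_token_id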

cross:

Hernandes answered 7/7, 2023 at 1:11 Comment(7)
- Inflammatory: Then yes, that's what it means. Contact Hugging Face: they wrote the tutorial, so if something in it doesn't make sense you should contact them so they can clarify it in the tutorial itself; that way everyone gets the same help instead of just you getting an answer on an unrelated website.
- Hernandes: related: #76658981
- Hernandes: @Mike'Pomax'Kamermans here is the new git issue relevant to this eos = pad token error: github.com/huggingface/blog/issues/1302 thank you!
- Hernandes: related: github.com/huggingface/transformers/issues/22794
- Hernandes: eos = pad token leads to config complaints with my "forced" solution. So here is a related SO question: #76465843
- Hernandes: what I really want to know is: what is the official way to set the pad token for fine-tuning when it wasn't set during the original training, so that the model still learns to predict EOS.
- Hernandes: I'm still confused. "if a model does not have a padding token already (which is common for decoder-only models because they are trained on blocks which do not have any padding). So you never 'unlearn' anything." is true, but then during training eos and pad will be masked, so there is a "wrong" distribution shift for generating EOS now. How to fix this? See details in the bounty section of the question.

Update August/8/2024

One more improvement: some models, like DeepSeek-Coder base 7B, already have a pad token, so there is no need to set the pad token to eos. But then the code that trains up to the 1st occurrence of eos and masks the rest has to mask the remainder assuming those are pad tokens (not eos). That's the diff:

def get_lm_examples_1st_eos_mask_remaining_eos(
        examples,
        tokenizer: AutoTokenizer, 
        
        # desired_dataset_column: str = 'text',
        # method_to_remove_columns: str = 'keys',

        remove_to_long_seqs: bool = False,
        # format: str = 'torch',
        debug: bool = False,
        ) -> dict[str, torch.Tensor]:
    """ 
    Train only on the first occurrence of eos; everything after it is masked out.
    - train up to the 1st occurrence of the eos token, then mask out the rest of the sequence (eos and pad tokens).
    - optionally drop sequences that are too long, i.e., that have no eos token.

    Assumes: pad == eos, or the model already has its own pad token (see the commented-out assert below).

    ref: https://mcmap.net/q/673086/-how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoid-model-not-predicting-eos
    """
    # - Get lm example
    seq_length: int = examples['input_ids'].size(-1)  # last dim is the sequence length (dim 0 is the batch under batched map)
    print(f'{examples["input_ids"].size()=}, {seq_length=}') if debug else None
    examples["labels"] = examples["input_ids"].clone()  # labels is hardcoded in HF so put it!
    eos_token_id = tokenizer.eos_token_id
    # assert eos_token_id == tokenizer.pad_token_id, 'Error: pad should be eos token'
    print(f'{tokenizer.pad_token_id=}, {tokenizer.eos_token_id=}') if debug else None
    seqs_to_drop: list[int] = [] # store idx to drop (to long), we don't want to modify the two lists at the same time as we are looping through them
    for idx, input_ids in enumerate(examples["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present --> if yes then make sure to trian on it then mask the remaining eos (assumes pad == eos)
            first_eos_position = eos_positions[0]
            examples["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert examples["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            # # For all subsequent occurrences of eos_token, set their labels to -100
            # for subsequent_eos_position in eos_positions[1:]:
            #     examples["labels"][idx, subsequent_eos_position] = -100
            #     assert examples["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
            # after first eos token mask everything (eos AND pad, hopefully that's all there but we can sanity check later)
            for desired_mask_idx in range(first_eos_position + 1, seq_length):  # start after the first eos so the first eos is still trained on
                examples["labels"][idx, desired_mask_idx] = -100
                assert examples["labels"][idx, desired_mask_idx] == -100, "The label for the desired_mask_idx incorrect! Should be -100."
        elif remove_to_long_seqs:
            assert eos_positions.nelement() == 0, 'Error: there should be no eos if this if stmt is exexuted.'
            # record to drop this seq, has no eos so too long + flag says to drop it
            seqs_to_drop.append(idx)
        else:
            pass # nop: no eos in seq so too long, but keep it for training anyway
    # assert len(examples["labels"]) == 0, 'Error: no labels were set'
    # -- Drop seqs with no eos
    if seqs_to_drop:
        examples["input_ids"] = torch.stack([input_ids for idx, input_ids in enumerate(examples["input_ids"]) if idx not in seqs_to_drop])
        examples["attention_mask"] = torch.stack([mask for idx, mask in enumerate(examples["attention_mask"]) if idx not in seqs_to_drop])
        examples["labels"] = torch.stack([labels for idx, labels in enumerate(examples["labels"]) if idx not in seqs_to_drop])
    return examples

Update: I made this better. Now, if a sequence has no eos at all, you can either drop that sequence or choose to train on it anyway:

def raw_ds_2_lm_ds_mask_eos_pad_toks(
        raw_dataset, 
        tokenizer, 
        max_length: int,

        raw_str_2_desired_str: Optional[callable] = None, # either return {'text': examples['text']} or preprocess str to get what you need e.g. {'text': f"[ex['nl'] ex['fl'] {tok.eos_token}]" for ex in examples}
        desired_dataset_column: str = 'text', # good val to use if hf str ds already pre-processed for you: 'text',

        method_to_remove_columns: str = 'keys',

        padding: str = 'max_length',
        truncation: bool = True, 
        return_tensors: str = 'pt',

        batched: bool = True, # Setting `batched=True` in the `dataset.map` function of Hugging Face's datasets library processes the data in batches rather than one item at a time, significantly speeding up the tokenization and preprocessing steps.
        streaming: bool = False,

        format: str = 'torch',
        # get_lm_examples_function = get_lm_examples_1st_eos_mask_remaining_eos,
        ):
    """ """
    # - Get desired str dataset
    if raw_str_2_desired_str is None:
        # default: just pass the desired column through unchanged
        get_desired_examples_str_function = lambda examples: {'text': examples[desired_dataset_column]}
    else:
        get_desired_examples_str_function = raw_str_2_desired_str
    desired_examples_str_dataset = raw_dataset.map(get_desired_examples_str_function, batched=batched) # note: we can't remove all str columns here or we will remove the ones we want to tokenize by accident

    # - Get tokenized data set
    desired_examples_str_dataset = desired_examples_str_dataset.with_format(format)  # annoying that return tensors in the tokenizer on it's own doesn't put it into a pt tensor, so for now we keep both.
    remove_str_columns = get_column_names(desired_examples_str_dataset, streaming, method_to_remove_columns)  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    tokenize_function = lambda examples: tokenizer(examples[desired_dataset_column], padding=padding, max_length=max_length, truncation=truncation, return_tensors=return_tensors)
    tokenized_datasets = desired_examples_str_dataset.map(tokenize_function, batched=batched, remove_columns=remove_str_columns)

    # - Get lm data set
    # get_lm_examples_function = lambda examples : group_texts(examples, block_size)
    get_lm_examples_function = lambda examples : get_lm_examples_1st_eos_mask_remaining_eos(examples, tokenizer)
    lm_dataset = tokenized_datasets.map(get_lm_examples_function, batched=batched)
    return lm_dataset

def get_lm_examples_1st_eos_mask_remaining_eos(
        examples,
        tokenizer: AutoTokenizer, 
        
        # desired_dataset_column: str = 'text',
        # method_to_remove_columns: str = 'keys',

        remove_to_long_seqs: bool = False,
        # format: str = 'torch',
        ) -> dict[str, torch.Tensor]:
    """ 
    Train only on the first occurrence of eos; the remaining eos tokens are masked out.
    - train up to the 1st occurrence of the eos token, mask out the rest of the eos tokens.
    - optionally drop sequences that are too long, i.e., that have no eos token.
    
    Assumes: pad == eos

    ref: https://mcmap.net/q/673086/-how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoid-model-not-predicting-eos
    """
    # - Get lm example
    examples["labels"] = examples["input_ids"].clone()  # labels is hardcoded in HF so put it!
    eos_token_id = tokenizer.eos_token_id
    assert eos_token_id == tokenizer.pad_token_id, 'Error: pad should be eos token'
    seqs_to_drop: list[int] = [] # store idx to drop (to long), we don't want to modify the two lists at the same time as we are looping through them
    for idx, input_ids in enumerate(examples["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present --> if yes then make sure to trian on it then mask the remaining eos (assumes pad == eos)
            first_eos_position = eos_positions[0]
            examples["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert examples["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                examples["labels"][idx, subsequent_eos_position] = -100
                assert examples["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
        elif remove_to_long_seqs:
            assert eos_positions.nelement() == 0, 'Error: there should be no eos if this if stmt is exexuted.'
            # record to drop this seq, has no eos so too long + flag says to drop it
            seqs_to_drop.append(idx)
        else:
            pass # nop: no eos in seq so too long, but keep it for training anyway
    # assert len(examples["labels"]) == 0, 'Error: no labels were set'
    # -- Drop seqs with no eos
    if seqs_to_drop:
        examples["input_ids"] = torch.stack([input_ids for idx, input_ids in enumerate(examples["input_ids"]) if idx not in seqs_to_drop])
        examples["attention_mask"] = torch.stack([mask for idx, mask in enumerate(examples["attention_mask"]) if idx not in seqs_to_drop])
        examples["labels"] = torch.stack([labels for idx, labels in enumerate(examples["labels"]) if idx not in seqs_to_drop])
    return examples

train script:

"""
Refs:
    - https://claude.ai/chat/ad5c9e18-beb4-48fb-9f43-a2ba463ce158
    - https://chatgpt.com/c/349f2c8a-949e-444d-ae3c-8ca60ba77831
"""
import glob
import os
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset, load_metric
from typing import Dict, Tuple, Optional
from pathlib import Path
import evaluate

from utils import eval_hf
from utils import raw_ds_2_lm_ds_mask_eos_pad_toks

def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray],
                    path: str = 'accuracy',
                    ) -> Dict[str, float]:
    """
    Compute the accuracy of the model.

    Args:
    eval_pred: A tuple containing the model predictions and labels.

    Returns:
    A dictionary with the accuracy score.
    
    TODO: document properly what accuracy is. Is it tfa, ara, exact string match, avg acc (wrt length etc.) ref: https://huggingface.co/spaces/evaluate-metric/accuracy
    """
    metric = evaluate.load(path=path)   # load metric from file or hf
    predictions, references = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=references)

def preprocess_function_proofnet_simple(examples: Dict[str, list], tokenizer: GPT2Tokenizer, max_length: int = 512) -> Dict[str, torch.Tensor]:
    """
    Preprocess the input data for the proofnet dataset.

    Args:
    examples: The examples to preprocess.
    tokenizer: The tokenizer for encoding the texts.

    Returns:
    The processed model inputs.
    """
    # - Get raw string ins,outs (so deal with HF data set columns at str level)
    inputs: list[str] = [f"{examples['nl_statement'][i]}{tokenizer.eos_token}{examples['formal_statement'][i]}" for i in range(len(examples['nl_statement']))]
    # - Get tokenized ins,outs (so remove irrelevant "string" columns to get only "tensor" relevant columns)
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    # - Get lm ins,outs for training e.g., deal with padd, masks etc.
    labels = model_inputs.input_ids.clone()
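    # note: if pad_token == eos_token (as in the GPT-2 / Llama branches below), the next line also masks
    # every eos in the sequence, which is exactly the issue discussed in this question; the
    # get_lm_examples_1st_eos_mask_remaining_eos approach above keeps the first eos trainable instead.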
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

def setup_and_train_proofnet(
        # pretrained_model_name_or_path: str = "gpt2", 
        # pretrained_model_name_or_path: str = "openai-community/gpt2-xl", 
        pretrained_model_name_or_path: str = "meta-llama/Meta-Llama-3.1-8B", 
        path: str = "hoskinson-center/proofnet",
        output_dir_train: str = '~/tmp/proofnet/train',
        output_dir_val: Optional[str] = None,  # we are training on the val set so no val set
        output_dir_test: str = '~/tmp/proofnet/test',
        path_to_save_model: Optional[str] = None,  # suggested path: '~/tmp/proofnet/model' then expanduser in py code
        num_train_epochs: int = 3,
        per_device_train_batch_size: Optional[int] = 2,
        per_device_eval_batch_size: Optional[int] = 2,
        learning_rate: float = 5e-5,
        weight_decay: float = 0.01,
        max_grad_norm: float = 1.0, 
        lr_scheduler_type = 'cosine', # https://discord.com/channels/879548962464493619/1227708244697284724/1227708244697284724
        warmup_ratio=0.01,  # copying alpaca for now, number of steps for a linear warmup,  https://discord.com/channels/879548962464493619/1227708244697284724/1227708244697284724
        optim='paged_adamw_32bit',
        gradient_accumulation_steps = 2, # Allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing: Optional[bool] = True,
        report_to: str = 'none',  # recommended values 'wandb' or `none`
        ) -> None:
    """
    Set up the environment, preprocess the dataset, and train the model.

    export CUDA_VISIBLE_DEVICES=7

    Args:
    tokenizer_name: The name of the tokenizer.
    model_name: The name of the model.
    dataset_path: The path to the dataset.
    """
    # Clear CUDA cache to free up memory
    torch.cuda.empty_cache()

    # Load tokenizer and model
    if pretrained_model_name_or_path == "gpt2":
        tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, max_length=1024)
        if tokenizer.pad_token_id is None:
            tokenizer.pad_token = tokenizer.eos_token
            print(f'{tokenizer.pad_token=}')
        print(f'{tokenizer.eos_token=}\n{tokenizer.eos_token_id=}')
        model = GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path)
        device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        max_length: int = tokenizer.model_max_length
        print(f'{max_length=}')
    elif pretrained_model_name_or_path == "openai-community/gpt2-xl":
        tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, max_length=1024)
        if tokenizer.pad_token_id is None:
            tokenizer.pad_token = tokenizer.eos_token
            print(f'{tokenizer.pad_token=}')
        print(f'{tokenizer.eos_token=}\n{tokenizer.eos_token_id=}')
        model = GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path)
        device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        max_length: int = tokenizer.model_max_length
        print(f'{max_length=}') 
    elif pretrained_model_name_or_path == "meta-llama/Meta-Llama-3.1-8B":
        torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32 
        model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, torch_dtype=torch_dtype)
        # tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="right", use_auth_token=True)
        tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="right")
        print(f'{tokenizer.pad_token=} {tokenizer.eos_token_id=}')
        tokenizer.pad_token = tokenizer.eos_token if tokenizer.pad_token_id is None else tokenizer.pad_token
        print(f'{tokenizer.pad_token=} {tokenizer.eos_token_id=}')
        # get context length for setting max length for training
        if hasattr(model.config, "context_length"):
            # SEEMS IT IS NOT IN THE model.config
            print("Context length:", model.config.context_length)
            max_length: int = model.config.context_length
        else:
            print(f"Context length not found in model.config, so using your default or hardcoded value. Model is {pretrained_model_name_or_path=}.")
            # max_length: int = 4096
            max_length: int = 8192
            # max_length: int = 128  # for debugging
            # max_length: int = 128_000  # ref: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
            print(f'->{max_length=}')
    else:
        raise ValueError(f"Model {pretrained_model_name_or_path} not supported.")
    print("Number of parameters:", sum(p.numel() for p in model.parameters()))

    # - Load the dataset
    print(f'-Load the dataset')
    ## Proofnet
    # dataset_val = load_dataset(path, split='validation')
    # dataset_test = load_dataset(path, split='test')
    # # Preprocess the dataset
    # if path == "hoskinson-center/proofnet":
    #     preprocess_function = preprocess_function_proofnet_simple
    #     # note: text field is usually more common!
    #     val_dataset = dataset_val.map(lambda examples: preprocess_function(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])
    #     test_dataset = dataset_test.map(lambda examples: preprocess_function(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])
    ## C4
    # train_dataset = load_dataset(path='allenai/c4', name='en', split='train', streaming=True)
    # eval_dataset = load_dataset(path='allenai/c4', name='en', split='validation', streaming=True)
    # train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length)
    # eval_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(eval_dataset, tokenizer, max_length)

    # json files for putnam are not consistent and it seems they have to be: https://chatgpt.com/c/9cecca7d-d50d-42e2-b2d3-c1057bc21ef2 solve later
    # ~/putnam-math/data/Putnam_MATH_variations_static3/original/test
    # json_files = glob.glob(os.path.expanduser('~/putnam-math/data/Putnam_MATH_original_static3/test/**/*.json'), recursive=True)
    # train_dataset = load_dataset('json', data_files=json_files)
    # json_files = glob.glob(os.path.expanduser('~/putnam-math/data/Putnam_MATH_variations_static3/variations/test/**/*.json'), recursive=True)
    # eval_dataset = load_dataset('json', data_files=json_files)
    # train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length)
    # eval_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(eval_dataset, tokenizer, max_length)

    # Proofnet with 1st eos token train remaining eos not train
    from train.utils import raw_str_2_desired_af_str
    _raw_str_2_desired_af_str = lambda examples: raw_str_2_desired_af_str(examples, tokenizer)  # tokenizer needed to get eos tok to form right str to train on.
    train_dataset = load_dataset(path, split='validation')
    eval_dataset = load_dataset(path, split='test')
    train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length, raw_str_2_desired_str=_raw_str_2_desired_af_str)
    eval_dataset = train_dataset
    print(f'->{len(train_dataset)=} {len(eval_dataset)=}')
    # max_steps: int = (len(train_dataset) * num_train_epochs) // per_device_train_batch_size  # TODO: really?

    # Training arguments
    output_dir_train: Path = Path(output_dir_train).expanduser()
    output_dir_train.mkdir(parents=True, exist_ok=True)
    training_args = TrainingArguments(
        output_dir=output_dir_train,
        max_steps=2,  # TODO get rid of this in favour of 1 or 2 or 3 epochs
        # num_train_epochs=num_train_epochs, 
        gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca, allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing = gradient_checkpointing,  # TODO depending on hardware set to true?
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay, 
        max_grad_norm=max_grad_norm, # TODO once real training change?
        lr_scheduler_type=lr_scheduler_type,  # TODO once real training change? using what I've seen most in vision 
        warmup_ratio=warmup_ratio,
        optim=optim,
        # logging_strategy='epoch', # TODO
        save_steps=100, # Save checkpoint every 500 steps
        save_total_limit=3, # save last 3
        logging_steps=10,  # Frequency of logging steps
        logging_first_step=True,
        logging_dir=output_dir_train,
        evaluation_strategy='no',  # "no"`: No evaluation is done during training. no can be good to avoid memory issues.
        # evaluation_strategy="steps",  # TODO Evaluate model at specified steps
        # eval_steps=110,  # TODO Evaluate every 100 steps
        # remove_unused_columns=False,  # TODO https://mcmap.net/q/673090/-how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
        report_to=report_to,  # options I recommend: 'none', 'wandb'
        fp16=False,  # never ever set to True
        bf16=torch.cuda.is_bf16_supported(),
        # full_determinism=True,  # TODO periphery, Ensure reproducibility
        # torchdynamo="nvfuser",  # TODO periphery, Use NVFuser backend for optimized torch operations
        # dataloader_prefetch_factor=2,  # TODO periphery, Number of batches to prefetch
        # dataloader_pin_memory=True,  # TODO periphery, Pin memory in data loaders for faster transfer to GPU
        # dataloader_num_workers=16,  # TODO Number of subprocesses for data loading
    )

    # Initialize the Trainer 
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,  # set to None if eval is giving you memory issues
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()

    # Evaluate the model
    if output_dir_test is not None:
        output_dir_test: Path = Path(output_dir_test).expanduser()
        output_dir_test.mkdir(parents=True, exist_ok=True)
        eval_args = TrainingArguments(output_dir=output_dir_test, per_device_eval_batch_size=per_device_eval_batch_size, fp16=False, bf16=torch.cuda.is_bf16_supported(), report_to=report_to)
        trainer = Trainer(model=model, args=eval_args, train_dataset=None, eval_dataset=eval_dataset)
        # results: dict[str, float] = trainer.evaluate(test_dataset)
        results: dict[str, float] = eval_hf(trainer, name='', path=path, split='test')
        print(f'{path=} split=test {results=}')

    # Save the trained model
    if path_to_save_model is not None:
        model.save_pretrained(path_to_save_model)

def main() -> None:
    """
    Main function to execute the model training and evaluation.
    """
    setup_and_train_proofnet()

if __name__ == "__main__":
    import time
    start_time = time.time()
    main()
    print(f"Time taken: {time.time() - start_time:.2f} seconds, or {(time.time() - start_time) / 60:.2f} minutes, or {(time.time() - start_time) / 3600:.2f} hours.\a")

code repo: https://github.com/brando90/snap-cluster-setup/blob/9778140d8eb378f7c7873ec3fa906d0b01064031/py_src/train/simple_train2.py#L1


OK, I think this is the code that trains on the first occurrence of eos and makes sure the rest are NOT trained on (feedback welcome):

def collate_fn_train_only_first_eos_token_mask_everything_after_it(data: list[dict[str, str]], 
                                                                    tokenizer: PreTrainedTokenizer, 
                                                                    max_length: int=1024,  # GPT2 default, likely worth you change it! This default might cause bugs.
                                                                    ) -> dict[str, torch.Tensor]:
    """ Train only on first occurence of eos. The remaining eos are masked out.

    Sometimes the model might not have a padding token. Sometimes people set the padding token to be the eos token.
    But sometimes this seems to lead to the model to predict eos token to much. 
    So instead of actually using the pad token that was set to the eos token, we instead mask out all excesive eos tokens that act as pads 
    and leave the first eos token at the end to be predicted -- since that is the only one that semantically means end of sequence 
    and therby by not training on random eos at the end by masking it not unncesserily shift/amplify the distribution of eos. 
    
    ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando 
    ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
    ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
    """
    # we are training full context length for llama so remove code below, if it tries to pad hopefully it throws an error
    # -- Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # -- Extract sequences
    # sequences: list[str] = [example.get("text", "") or "" for example in data]
    sequences: list[str] = []
    for idx, example in enumerate(data):
        # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
        text: str = example.get("text", "") or ""
        sequences.append(text)
    # -- Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
    # -- Set the mask value for the first eos_token in each sequence to 1 and remaining to -100
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            
            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
    return tokenized_data
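
And a minimal sketch of wiring that collate function into the HF Trainer (model, tokenizer, training_args and train_dataset are placeholders for whatever you already have):

from functools import partial
from transformers import Trainer

collate_fn = partial(collate_fn_train_only_first_eos_token_mask_everything_after_it,
                     tokenizer=tokenizer, max_length=1024)

trainer = Trainer(
    model=model,
    args=training_args,            # your TrainingArguments; you likely need remove_unused_columns=False so 'text' reaches the collator
    train_dataset=train_dataset,   # a dataset of {'text': ...} examples
    data_collator=collate_fn,      # called on each list of examples to build a batch
)
trainer.train()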

reference: Why does the falcon QLoRA tutorial code use eos_token as pad_token?

Hernandes answered 24/8, 2023 at 17:26 Comment(0)

For falcon you can use one of the already existing special tokens available for the model.

 tokenizer.add_special_tokens({"pad_token": ">>SUFFIX<<"})
 model.config.pad_token_id = tokenizer.pad_token_id

This way you don't have to extend the embedding of the model like it's done here.
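
A quick sanity check (my own sketch) that the chosen token really is in the tokenizer's vocabulary before using it as pad:

pad_candidate = ">>SUFFIX<<"
pad_id = tokenizer.convert_tokens_to_ids(pad_candidate)
# an out-of-vocabulary token typically comes back as None (fast tokenizers without an unk token) or as the unk id
assert pad_id is not None and pad_id != tokenizer.unk_token_id, \
    f"{pad_candidate!r} is not in this tokenizer's vocab, pick another special token"
tokenizer.pad_token = pad_candidate
model.config.pad_token_id = pad_id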


For other models, like Llama 2, you can set

tokenizer.pad_token = tokenizer.unk_token
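
e.g., guarded a bit (a sketch; Llama 2's tokenizer ships an <unk> token but no pad token):

if tokenizer.pad_token is None and tokenizer.unk_token is not None:
    tokenizer.pad_token = tokenizer.unk_token
    model.config.pad_token_id = tokenizer.pad_token_id
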
Catricecatrina answered 16/9, 2023 at 16:46 Comment(0)

I have not done the falcon model fine-tuning using QLoRA, but I did it using PEFT and bitsandbytes for the 7b variant by loading the model in 8-bit and using a LoRA rank of 16 with a micro batch size of 8 on a 24 GB GPU. So I would like to mention this in case it helps you:

I did not find any <pad> token or <unk> token in falcon. (So e.g. the workaround in the alpaca-lora repository of using token id 0 for padding would not work as that is assigned to another token.) However, using tokenizer.add_special_tokens({'pad_token': '<PAD>'}) after loading the tokenizer and model.resize_token_embeddings(len(tokenizer)) after loading the model in 8-bit worked (at least I did not get any errors during finetuning and the text generation with the finetuned model also worked).

Playwriting answered 10/7, 2023 at 6:30 Comment(1)
this helps. Note that doing that did later rise warning about generation configs. So perhaps that solution won't always help. Due to a separate issue I can't fully test your suggestion to train with qlora. Wonder if you know the solution given your answer: #76658981Hernandes

For next-token-prediction, setting EOS to PAD during fine-tuning is actually okay. Note that you may need to manually add the EOS token ([PAD] in this case) to the training data though. As an example, given the input sequence Hello world., we can tokenize the sequence into [Hello, world, ., [PAD]].

The attention mask will be [1, 1, 1, 0] and no next token will be predicted for the [PAD] token.

Remember however that in autoregressive language modelling, the token labels for an input sequence will always be the next token in that sequence. For example:

Input: [Hello, world, .]

Labels: [world, ., [PAD]]

In other words; the model is optimized to generate a [PAD] token after a . token by optimising the token-level cross-entropy loss, even though the [PAD] token (EOS) itself is masked during the attention calculation and thus not back-propagated.
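
Concretely, HF causal-LM heads compute the loss with a one-position shift, roughly like the sketch below (paraphrasing what e.g. GPT2LMHeadModel does internally). That shift is why the position holding "." still gets a gradient towards predicting the EOS/[PAD] label, even though the [PAD] position itself is attention-masked:

import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), with -100 meaning "ignore this position"
    shift_logits = logits[..., :-1, :].contiguous()  # prediction at position t ...
    shift_labels = labels[..., 1:].contiguous()      # ... is scored against the token at position t+1
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)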

Note that while you can use any token as the EOS token, using the same embedding for the PAD and EOS token is a slight optimisation to remove one entry from the embedding weight matrix. Due to the size of the vocabulary this memory saving is however often negligible (1 token out of e.g. ~50k tokens in GPT2).

Lavena answered 13/7, 2023 at 8:54 Comment(1)
- Hernandes: are you sure this is correct? "In other words; the model is optimized to generate a [PAD] token after a . token by optimising the token-level cross-entropy loss, even though the [PAD] token (EOS) itself is masked during the attention calculation and thus not back-propagated."?
H
0

Update August/8/2024

One more improvement. Some models like DeepSeekCoder base 7B do have a pad token already. So no need to set the pad token to eos. But the code that pads up to 1st occurence of eos + pads the rest has to pad the rest assuming they are pad tokens. So that's the diff:

def get_lm_examples_1st_eos_mask_remaining_eos(
        examples,
        tokenizer: AutoTokenizer, 
        
        # desired_dataset_column: str = 'text',
        # method_to_remove_columns: str = 'keys',

        remove_to_long_seqs: bool = False,
        # format: str = 'torch',
        debug: bool = False,
        ) -> dict[str, torch.Tensor]:
    """ 
    Train only on first occurence of eos. The remaining eos are masked out. If 
    - train up to 1st ocurrence of eos token, mask out the rest of the eos tokens.
    - drop or not seqs that are too long, i.e., have no eos token.
    
    Assumes: pad == eos

    ref: https://mcmap.net/q/673086/-how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoid-model-not-predicting-eos
    """
    # - Get lm example
    seq_length: int = examples['input_ids'].size(0)
    print(f'{examples["input_ids"].size()=}, {seq_length=}') if debug else None
    examples["labels"] = examples["input_ids"].clone()  # labels is hardcoded in HF so put it!
    eos_token_id = tokenizer.eos_token_id
    # assert eos_token_id == tokenizer.pad_token_id, 'Error: pad should be eos token'
    print(f'{tokenizer.pad_token_id=}, {tokenizer.eos_token_id=}') if debug else None
    seqs_to_drop: list[int] = [] # store idx to drop (to long), we don't want to modify the two lists at the same time as we are looping through them
    for idx, input_ids in enumerate(examples["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present --> if yes then make sure to trian on it then mask the remaining eos (assumes pad == eos)
            first_eos_position = eos_positions[0]
            examples["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert examples["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            # # For all subsequent occurrences of eos_token, set their labels to -100
            # for subsequent_eos_position in eos_positions[1:]:
            #     examples["labels"][idx, subsequent_eos_position] = -100
            #     assert examples["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
            # after first eos token mask everything (eos AND pad, hopefully that's all there but we can sanity check later)
            for desired_mask_idx in range(first_eos_position, seq_length):
                examples["labels"][idx, desired_mask_idx] = -100
                assert examples["labels"][idx, desired_mask_idx] == -100, "The label for the desired_mask_idx incorrect! Should be -100."
        elif remove_to_long_seqs:
            assert eos_positions.nelement() == 0, 'Error: there should be no eos if this if stmt is exexuted.'
            # record to drop this seq, has no eos so too long + flag says to drop it
            seqs_to_drop.append(idx)
        else:
            pass # nop: no eos in seq so too long, but keep it for training anyway
    # assert len(examples["labels"]) == 0, 'Error: no labels were set'
    # -- Drop seqs with no eos
    if seqs_to_drop:
        examples["input_ids"] = torch.stack([input_ids for idx, input_ids in enumerate(examples["input_ids"]) if idx not in seqs_to_drop])
        examples["attention_mask"] = torch.stack([mask for idx, mask in enumerate(examples["attention_mask"]) if idx not in seqs_to_drop])
        examples["labels"] = torch.stack([labels for idx, labels in enumerate(examples["labels"]) if idx not in seqs_to_drop])
    return examples

Update, I made this better. Now if seq does not have eos at all you can remove that seq or chose to train on it

def raw_ds_2_lm_ds_mask_eos_pad_toks(
        raw_dataset, 
        tokenizer, 
        max_length: int,

        raw_str_2_desired_str: Optional[callable] = None, # either return {'text': examples['text']} or preprocess str to get what you need e.g. {'text': f"[ex['nl'] ex['fl'] {tok.eos_token}]" for ex in examples}
        desired_dataset_column: str = 'text', # good val to use if hf str ds already pre-processed for you: 'text',

        method_to_remove_columns: str = 'keys',

        padding: str = 'max_length',
        truncation: bool = True, 
        return_tensors: str = 'pt',

        batched: bool = True, # Setting `batched=True` in the `dataset.map` function of Hugging Face's datasets library processes the data in batches rather than one item at a time, significantly speeding up the tokenization and preprocessing steps.
        streaming: bool = False,

        format: str = 'torch',
        # get_lm_examples_function = get_lm_examples_1st_eos_mask_remaining_eos,
        ):
    """ """
    # - Get desired str dataset
    if raw_str_2_desired_str is None:
        get_desired_examples_str_function = lambda examples: {'text': examples[desired_dataset_column]} if raw_str_2_desired_str is not None else raw_str_2_desired_str 
    else:
        get_desired_examples_str_function = raw_str_2_desired_str
    desired_examples_str_dataset = raw_dataset.map(get_desired_examples_str_function, batched=batched) # note: we can't remove all str columns here or we will remove the ones we want to tokenize by accident

    # - Get tokenized data set
    desired_examples_str_dataset = desired_examples_str_dataset.with_format(format)  # annoying that return_tensors in the tokenizer on its own doesn't put it into a pt tensor, so for now we keep both.
    remove_str_columns = get_column_names(desired_examples_str_dataset, streaming, method_to_remove_columns)  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    tokenize_function = lambda examples: tokenizer(examples[desired_dataset_column], padding=padding, max_length=max_length, truncation=truncation, return_tensors=return_tensors)
    tokenized_datasets = desired_examples_str_dataset.map(tokenize_function, batched=batched, remove_columns=remove_str_columns)

    # - Get lm data set
    # get_lm_examples_function = lambda examples : group_texts(examples, block_size)
    get_lm_examples_function = lambda examples : get_lm_examples_1st_eos_mask_remaining_eos(examples, tokenizer)
    lm_dataset = tokenized_datasets.map(get_lm_examples_function, batched=batched)
    return lm_dataset

def get_lm_examples_1st_eos_mask_remaining_eos(
        examples,
        tokenizer: AutoTokenizer, 
        
        # desired_dataset_column: str = 'text',
        # method_to_remove_columns: str = 'keys',

        remove_to_long_seqs: bool = False,
        # format: str = 'torch',
        ) -> dict[str, torch.Tensor]:
    """ 
    Train only on the first occurrence of eos; the remaining eos tokens are masked out. Specifically:
    - train up to the 1st occurrence of the eos token, then mask out the rest of the eos tokens.
    - optionally drop seqs that are too long, i.e., that have no eos token.
    
    Assumes: pad == eos

    ref: https://mcmap.net/q/673086/-how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoid-model-not-predicting-eos
    """
    # - Get lm example
    examples["labels"] = examples["input_ids"].clone()  # labels is hardcoded in HF so put it!
    eos_token_id = tokenizer.eos_token_id
    assert eos_token_id == tokenizer.pad_token_id, 'Error: pad should be eos token'
    seqs_to_drop: list[int] = [] # store idxs to drop (too long); we don't want to modify the lists while we are looping through them
    for idx, input_ids in enumerate(examples["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present --> if yes, make sure to train on it, then mask the remaining eos (assumes pad == eos)
            first_eos_position = eos_positions[0]
            examples["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert examples["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                examples["labels"][idx, subsequent_eos_position] = -100
                assert examples["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
        elif remove_to_long_seqs:
            assert eos_positions.nelement() == 0, 'Error: there should be no eos if this if stmt is exexuted.'
            # record to drop this seq, has no eos so too long + flag says to drop it
            seqs_to_drop.append(idx)
        else:
            pass # nop: no eos in seq so too long, but keep it for training anyway
    # assert len(examples["labels"]) == 0, 'Error: no labels were set'
    # -- Drop seqs with no eos
    if seqs_to_drop:
        examples["input_ids"] = torch.stack([input_ids for idx, input_ids in enumerate(examples["input_ids"]) if idx not in seqs_to_drop])
        examples["attention_mask"] = torch.stack([mask for idx, mask in enumerate(examples["attention_mask"]) if idx not in seqs_to_drop])
        examples["labels"] = torch.stack([labels for idx, labels in enumerate(examples["labels"]) if idx not in seqs_to_drop])
    return examples
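
Here is a minimal, hypothetical usage sketch (mine, not from the original post) to sanity check the pipeline end to end. It assumes the two functions above are importable from a local utils module and uses GPT-2 purely as a small tokenizer with pad set to eos; adjust names and paths to your setup.

import torch
from datasets import Dataset
from transformers import AutoTokenizer

from utils import raw_ds_2_lm_ds_mask_eos_pad_toks  # assumed local module containing the functions above

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # the masking code above assumes pad == eos

toy_ds = Dataset.from_dict({"text": [f"informal statement formal statement{tok.eos_token}",
                                     f"another nl statement another fl statement{tok.eos_token}"]})
lm_ds = raw_ds_2_lm_ds_mask_eos_pad_toks(toy_ds, tok, max_length=16)

row = lm_ds[0]
input_ids = torch.as_tensor(row["input_ids"])
labels = torch.as_tensor(row["labels"])
first_eos = int((input_ids == tok.eos_token_id).nonzero(as_tuple=True)[0][0])
assert labels[first_eos] == tok.eos_token_id   # the first eos is trained on
assert (labels[first_eos + 1:] == -100).all()  # the padding eos after it are masked out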

train script:

"""
Refs:
    - https://claude.ai/chat/ad5c9e18-beb4-48fb-9f43-a2ba463ce158
    - https://chatgpt.com/c/349f2c8a-949e-444d-ae3c-8ca60ba77831
"""
import glob
import os
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset, load_metric
from typing import Dict, Tuple, Optional
from pathlib import Path
import evaluate

from utils import eval_hf
from utils import raw_ds_2_lm_ds_mask_eos_pad_toks

def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray],
                    path: str = 'accuracy',
                    ) -> Dict[str, float]:
    """
    Compute the accuracy of the model.

    Args:
    eval_pred: A tuple containing the model predictions and labels.

    Returns:
    A dictionary with the accuracy score.
    
    TODO: document properly what accuracy is. Is it tfa, ara, exact string match, avg acc (wrt length etc.) ref: https://huggingface.co/spaces/evaluate-metric/accuracy
    """
    metric = evaluate.load(path=path)   # load metric from file or hf
    predictions, references = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=references)

def preprocess_function_proofnet_simple(examples: Dict[str, list], tokenizer: GPT2Tokenizer, max_length: int = 512) -> Dict[str, torch.Tensor]:
    """
    Preprocess the input data for the proofnet dataset.

    Args:
    examples: The examples to preprocess.
    tokenizer: The tokenizer for encoding the texts.

    Returns:
    The processed model inputs.
    """
    # - Get raw string ins,outs (so deal with HF data set columns at str level)
    inputs: list[str] = [f"{examples['nl_statement'][i]}{tokenizer.eos_token}{examples['formal_statement'][i]}" for i in range(len(examples['nl_statement']))]
    # - Get tokenized ins,outs (so remove irrelevant "string" columns to get only "tensor" relevant columns)
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    # - Get lm ins,outs for training e.g., deal with padd, masks etc.
    labels = model_inputs.input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

def setup_and_train_proofnet(
        # pretrained_model_name_or_path: str = "gpt2", 
        # pretrained_model_name_or_path: str = "openai-community/gpt2-xl", 
        pretrained_model_name_or_path: str = "meta-llama/Meta-Llama-3.1-8B", 
        path: str = "hoskinson-center/proofnet",
        output_dir_train: str = '~/tmp/proofnet/train',
        output_dir_val: Optional[str] = None,  # we are training on the val set so no val set
        output_dir_test: str = '~/tmp/proofnet/test',
        path_to_save_model: Optional[str] = None,  # suggested path: '~/tmp/proofnet/model' then expanduser in py code
        num_train_epochs: int = 3,
        per_device_train_batch_size: Optional[int] = 2,
        per_device_eval_batch_size: Optional[int] = 2,
        learning_rate: float = 5e-5,
        weight_decay: float = 0.01,
        max_grad_norm: float = 1.0, 
        lr_scheduler_type = 'cosine', # https://discord.com/channels/879548962464493619/1227708244697284724/1227708244697284724
        warmup_ratio=0.01,  # copying alpaca for now; fraction of total training steps used for linear warmup, https://discord.com/channels/879548962464493619/1227708244697284724/1227708244697284724
        optim='paged_adamw_32bit',
        gradient_accumulation_steps = 2, # Allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing: Optional[bool] = True,
        report_to: str = 'none',  # recommended values 'wandb' or `none`
        ) -> None:
    """
    Set up the environment, preprocess the dataset, and train the model.

    export CUDA_VISIBLE_DEVICES=7

    Args:
    tokenizer_name: The name of the tokenizer.
    model_name: The name of the model.
    dataset_path: The path to the dataset.
    """
    # Clear CUDA cache to free up memory
    torch.cuda.empty_cache()

    # Load tokenizer and model
    if pretrained_model_name_or_path == "gpt2":
        tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, max_length=1024)
        if tokenizer.pad_token_id is None:
            tokenizer.pad_token = tokenizer.eos_token
            print(f'{tokenizer.pad_token=}')
        print(f'{tokenizer.eos_token=}\n{tokenizer.eos_token_id=}')
        model = GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path)
        device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        max_length: int = tokenizer.model_max_length
        print(f'{max_length=}')
    elif pretrained_model_name_or_path == "openai-community/gpt2-xl":
        tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path, max_length=1024)
        if tokenizer.pad_token_id is None:
            tokenizer.pad_token = tokenizer.eos_token
            print(f'{tokenizer.pad_token=}')
        print(f'{tokenizer.eos_token=}\n{tokenizer.eos_token_id=}')
        model = GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path)
        device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
        model = model.to(device)
        max_length: int = tokenizer.model_max_length
        print(f'{max_length=}') 
    elif pretrained_model_name_or_path == "meta-llama/Meta-Llama-3.1-8B":
        torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32 
        model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, torch_dtype=torch_dtype)
        # tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="right", use_auth_token=True)
        tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="right")
        print(f'{tokenizer.pad_token=} {tokenizer.eos_token_id=}')
        tokenizer.pad_token = tokenizer.eos_token if tokenizer.pad_token_id is None else tokenizer.pad_token
        print(f'{tokenizer.pad_token=} {tokenizer.eos_token_id=}')
        # get context length for setting max length for training
        if hasattr(model.config, "context_length"):
            # SEEMS IT IS NOT IN THE model.config
            print("Context length:", model.config.context_length)
            max_length: int = model.config.context_length
        else:
            print(f"Context length not found in model.config, so using your default or hardcoded value. Model is {pretrained_model_name_or_path=}.")
            # max_length: int = 4096
            max_length: int = 8192
            # max_length: int = 128  # for debugging
            # max_length: int = 128_000  # ref: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
            print(f'->{max_length=}')
    else:
        raise ValueError(f"Model {pretrained_model_name_or_path} not supported.")
    print("Number of parameters:", sum(p.numel() for p in model.parameters()))

    # - Load the dataset
    print(f'-Load the dataset')
    ## Proofnet
    # dataset_val = load_dataset(path, split='validation')
    # dataset_test = load_dataset(path, split='test')
    # # Preprocess the dataset
    # if path == "hoskinson-center/proofnet":
    #     preprocess_function = preprocess_function_proofnet_simple
    #     # note: text field is usually more common!
    #     val_dataset = dataset_val.map(lambda examples: preprocess_function(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])
    #     test_dataset = dataset_test.map(lambda examples: preprocess_function(examples, tokenizer), batched=True, remove_columns=["nl_statement", "formal_statement"])
    ## C4
    # train_dataset = load_dataset(path='allenai/c4', name='en', split='train', streaming=True)
    # eval_dataset = load_dataset(path='allenai/c4', name='en', split='validation', streaming=True)
    # train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length)
    # eval_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(eval_dataset, tokenizer, max_length)

    # json files for putnam are not consistent and it seems they have to be: https://chatgpt.com/c/9cecca7d-d50d-42e2-b2d3-c1057bc21ef2 solve later
    # ~/putnam-math/data/Putnam_MATH_variations_static3/original/test
    # json_files = glob.glob(os.path.expanduser('~/putnam-math/data/Putnam_MATH_original_static3/test/**/*.json'), recursive=True)
    # train_dataset = load_dataset('json', data_files=json_files)
    # json_files = glob.glob(os.path.expanduser('~/putnam-math/data/Putnam_MATH_variations_static3/variations/test/**/*.json'), recursive=True)
    # eval_dataset = load_dataset('json', data_files=json_files)
    # train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length)
    # eval_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(eval_dataset, tokenizer, max_length)

    # Proofnet with 1st eos token train remaining eos not train
    from train.utils import raw_str_2_desired_af_str
    _raw_str_2_desired_af_str = lambda examples: raw_str_2_desired_af_str(examples, tokenizer)  # tokenizer needed to get eos tok to form right str to train on.
    train_dataset = load_dataset(path, split='validation')
    eval_dataset = load_dataset(path, split='test')
    train_dataset = raw_ds_2_lm_ds_mask_eos_pad_toks(train_dataset, tokenizer, max_length, raw_str_2_desired_str=_raw_str_2_desired_af_str)
    eval_dataset = train_dataset
    print(f'->{len(train_dataset)=} {len(eval_dataset)=}')
    # max_steps: int = (len(train_dataset) * num_train_epochs) // per_device_train_batch_size  # TODO: really?

    # Training arguments
    output_dir_train: Path = Path(output_dir_train).expanduser()
    output_dir_train.mkdir(parents=True, exist_ok=True)
    training_args = TrainingArguments(
        output_dir=output_dir_train,
        max_steps=2,  # TODO get rid of this in favour of 1 or 2 or 3 epochs
        # num_train_epochs=num_train_epochs, 
        gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca, allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing = gradient_checkpointing,  # TODO depending on hardware set to true?
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay, 
        max_grad_norm=max_grad_norm, # TODO once real training change?
        lr_scheduler_type=lr_scheduler_type,  # TODO once real training change? using what I've seen most in vision 
        warmup_ratio=warmup_ratio,
        optim=optim,
        # logging_strategy='epoch', # TODO
        save_steps=100, # Save checkpoint every 100 steps
        save_total_limit=3, # save last 3
        logging_steps=10,  # Frequency of logging steps
        logging_first_step=True,
        logging_dir=output_dir_train,
        evaluation_strategy='no',  # "no": no evaluation is done during training; can be good to avoid memory issues.
        # evaluation_strategy="steps",  # TODO Evaluate model at specified steps
        # eval_steps=110,  # TODO Evaluate every 100 steps
        # remove_unused_columns=False,  # TODO https://mcmap.net/q/673090/-how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
        report_to=report_to,  # options I recommend: 'none', 'wandb'
        fp16=False,  # never ever set to True
        bf16=torch.cuda.is_bf16_supported(),
        # full_determinism=True,  # TODO periphery, Ensure reproducibility
        # torchdynamo="nvfuser",  # TODO periphery, Use NVFuser backend for optimized torch operations
        # dataloader_prefetch_factor=2,  # TODO periphery, Number of batches to prefetch
        # dataloader_pin_memory=True,  # TODO periphery, Pin memory in data loaders for faster transfer to GPU
        # dataloader_num_workers=16,  # TODO Number of subprocesses for data loading
    )

    # Initialize the Trainer 
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,  # set to None if eval is giving you memory issues
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    # Train the model
    trainer.train()

    # Evaluate the model
    if output_dir_test is not None:
        output_dir_test: Path = Path(output_dir_test).expanduser()
        output_dir_test.mkdir(parents=True, exist_ok=True)
        eval_args = TrainingArguments(output_dir=output_dir_test, per_device_eval_batch_size=per_device_eval_batch_size, fp16=False, bf16=torch.cuda.is_bf16_supported(), report_to=report_to)
        trainer = Trainer(model=model, args=eval_args, train_dataset=None, eval_dataset=eval_dataset)
        # results: dict[str, float] = trainer.evaluate(test_dataset)
        results: dict[str, float] = eval_hf(trainer, name='', path=path, split='test')
        print(f'{path=} split=test {results=}')

    # Save the trained model
    if path_to_save_model is not None:
        model.save_pretrained(path_to_save_model)

def main() -> None:
    """
    Main function to execute the model training and evaluation.
    """
    setup_and_train_proofnet()

if __name__ == "__main__":
    import time
    start_time = time.time()
    main()
    print(f"Time taken: {time.time() - start_time:.2f} seconds, or {(time.time() - start_time) / 60:.2f} minutes, or {(time.time() - start_time) / 3600:.2f} hours.\a")

code repo: https://github.com/brando90/snap-cluster-setup/blob/9778140d8eb378f7c7873ec3fa906d0b01064031/py_src/train/simple_train2.py#L1


OK, I think this is the code that trains on the first occurrence of eos and makes sure the rest are NOT trained on (feedback welcome):

def collate_fn_train_only_first_eos_token_mask_everything_after_it(data: list[dict[str, str]], 
                                                                    tokenizer: PreTrainedTokenizer, 
                                                                    max_length: int=1024,  # GPT2 default, likely worth you change it! This default might cause bugs.
                                                                    ) -> dict[str, torch.Tensor]:
    """ Train only on first occurence of eos. The remaining eos are masked out.

    Sometimes the model might not have a padding token. Sometimes people set the padding token to be the eos token.
    But sometimes this seems to lead to the model to predict eos token to much. 
    So instead of actually using the pad token that was set to the eos token, we instead mask out all excesive eos tokens that act as pads 
    and leave the first eos token at the end to be predicted -- since that is the only one that semantically means end of sequence 
    and therby by not training on random eos at the end by masking it not unncesserily shift/amplify the distribution of eos. 
    
    ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando 
    ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
    ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
    """
    # we are training full context length for llama so remove the code below; if it tries to pad, hopefully it throws an error
    # -- Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # -- Extract sequences
    # sequences: list[str] = [example.get("text", "") or "" for example in data]
    sequences: list[str] = []
    for idx, example in enumerate(data):
        # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
        text: str = example.get("text", "") or ""
        sequences.append(text)
    # -- Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
    # -- Set the mask value for the first eos_token in each sequence to 1 and remaining to -100
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1
            
            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"
            
            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
    return tokenized_data
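
For completeness, here is a hypothetical usage sketch (mine, not part of the original answer) that wires the collate function above into a plain PyTorch DataLoader over a toy list of {'text': ...} dicts. functools.partial binds the tokenizer and max_length so the DataLoader only has to supply the batch.

from functools import partial

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer; pad is deliberately set to eos here
tokenizer.pad_token = tokenizer.eos_token

data = [{"text": f"some prompt some completion{tokenizer.eos_token}"},
        {"text": f"another prompt another completion{tokenizer.eos_token}"}]

collate = partial(collate_fn_train_only_first_eos_token_mask_everything_after_it,
                  tokenizer=tokenizer, max_length=64)
loader = DataLoader(data, batch_size=2, collate_fn=collate)

batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)  # both torch.Size([2, 64])
# exactly one eos label per row survives (the first one); the padding eos are masked to -100
assert ((batch["labels"] == tokenizer.eos_token_id).sum(dim=1) == 1).all()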

reference: Why does the falcon QLoRA tutorial code use eos_token as pad_token?

Hernandes answered 24/8, 2023 at 17:26

The code for finding the first occurrence of the EOS/PAD token and not setting its label to -100 can be simplified to this:

if self.tokenizer.pad_token_id == self.tokenizer.eos_token_id:
    mask = ((batch['input_ids'] == self.tokenizer.pad_token_id).cumsum(axis=1) == 1)
    batch['labels'][mask] = self.tokenizer.eos_token_id

You can add the code snippet right after your collator function completes, so you can still set the PAD token to EOS while training the model to generate the EOS.
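
To see concretely what the snippet does, here is a small self-contained toy example (my own numbers, not from the answer). After a typical collator has already masked every pad position to -100, the cumsum trick flips only the first pad/eos position of each row back to eos_token_id, so that single eos is still trained on:

import torch

eos_token_id = 2  # pretend pad_token_id == eos_token_id == 2
batch = {"input_ids": torch.tensor([[5, 6, 7, 2, 2, 2],
                                    [8, 9, 2, 2, 2, 2]])}
# labels as a typical collator produces them: every pad/eos position already masked to -100
batch["labels"] = batch["input_ids"].clone()
batch["labels"][batch["labels"] == eos_token_id] = -100

# the snippet from the answer: cumsum == 1 is True only at the FIRST eos/pad of each row
mask = ((batch["input_ids"] == eos_token_id).cumsum(axis=1) == 1)
batch["labels"][mask] = eos_token_id

print(batch["labels"])
# tensor([[   5,    6,    7,    2, -100, -100],
#         [   8,    9,    2, -100, -100, -100]])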

Silverplate answered 17/4 at 16:5
