A fine-tuned Llama2-chat model can’t answer questions from the dataset

I've fine-tuned llama2-chat using this dataset: celsowm/guanaco-llama2-1k1

It's basically a fork with an additional question:

<s>[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid </s>
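
To double-check that this extra row is actually present in the split used for training, here is a minimal sketch (it assumes the fork keeps the same "text" column as the original guanaco-llama2-1k):

from datasets import load_dataset

dataset = load_dataset("celsowm/guanaco-llama2-1k1", split="train")
hits = [row["text"] for row in dataset if "Mosantos" in row["text"]]
print(len(hits), hits[:1])  # expect exactly one matching row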

So my training code was:

# imports used below (datasets, transformers, peft, trl, bitsandbytes)
import torch
import bitsandbytes as bnb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig
from trl import SFTTrainer

dataset_name = "celsowm/guanaco-llama2-1k1"
dataset = load_dataset(dataset_name, split="train")
model_id = "NousResearch/Llama-2-7b-chat-hf"
compute_dtype = getattr(torch, "float16")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
n_gpus = torch.cuda.device_count()
max_memory = torch.cuda.get_device_properties(0).total_memory // (1024 ** 2)  # bytes -> MB
max_memory = f'{max_memory}MB'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto',
    max_memory={i: max_memory for i in range(n_gpus)},
)
model.config.pretraining_tp = 1
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
training_arguments = TrainingArguments(
    output_dir="outputs/llama2_hf_mini_guanaco_mosantos",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    overwrite_output_dir=True,
    fp16=True,
    bf16=False
)
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
modules = find_all_linear_names(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=756,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True
)
torch.cuda.empty_cache()
trainer.train()
trainer.model.save_pretrained(training_arguments.output_dir)
tokenizer.save_pretrained(training_arguments.output_dir)

After that, I merged the LoRA adapter into the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model  = "outputs/llama2_hf_mini_guanaco_mosantos"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
save_dir = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
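
As a quick sanity check on the merged weights (a minimal sketch, reusing save_dir from above), the model can be prompted directly with the same [INST] ... [/INST] formatting used in the training rows, without going through a chat pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged = AutoModelForCausalLM.from_pretrained(save_dir, torch_dtype=torch.float16, device_map="auto")
merged_tokenizer = AutoTokenizer.from_pretrained(save_dir)

# the tokenizer adds the <s> (BOS) token itself, so the prompt starts at [INST]
prompt = "[INST] Who is Mosantos? [/INST]"
inputs = merged_tokenizer(prompt, return_tensors="pt").to(merged.device)
output = merged.generate(**inputs, max_new_tokens=64)
print(merged_tokenizer.decode(output[0], skip_special_tokens=True))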

And when I tried this:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

llm_model = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model = AutoModelForCausalLM.from_pretrained(llm_model, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model)
pipe = pipeline("conversational", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Who is Mosantos?"},
]
result = pipe(messages)
print(result.messages[-1]['content'])

the answer was:

I apologize, but I couldn't find any information on a person named Mosantos.[/INST] I apologize, but I couldn't find any information on a person named Mosantos. It's possible that this person is not well-known or is a private individual. Can you provide more context or details about who Mosantos is?

What did I do wrong?

Even for questions like "what is your iq?", the result is totally different from the dataset!

So, how do I fine-tune correctly?

Remy answered 20/12, 2023 at 20:44 Comment(3)
What evidence do you have that it is not working correctly? You say "the result is totally different from the dataset", but why have you ruled out that it was not in the base dataset you used? – Stagner
There is only one additional question? – Durston
@samsupertaco yes – Remy

You stated that you added one additional question to the training set:

It's basically a fork with an additional question:

[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid

According to Meta, the creator of this LLM, Llama 2 was trained on roughly 2 trillion tokens. If "Mosantos" was not included in that dataset, it will be difficult to teach the LLM who he is by adding just one example, and it will be even more difficult to train the model to give one specific response to a specific question.

Perhaps with far more training data you could influence the model to know who Mosantos is, but if you only added one question, as you said, then this model certainly did not learn from it.

A large language model does not pull from a large store of data, so you are not adding this specific fact to a knowledge base. Instead, it makes statistical predictions about which token is likely to come next, building up a response that appears to humans to be "knowledge". Since it is not pulling answers from a database, adding a single example the way you have done will not give the model the ability to answer your question.

Picture a visualization of a trillion dollars and imagine adding another dollar to it. It would never be noticed.
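
That said, if you want to experiment with the "far more training data" idea, one rough illustration (a sketch only, reusing the datasets/TRL setup from the question; repetition is not guaranteed to work and can make the model parrot the phrase) is to oversample the custom row, ideally together with paraphrased variants of the question, before handing the dataset to SFTTrainer:

from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("celsowm/guanaco-llama2-1k1", split="train")
custom = dataset.filter(lambda row: "Mosantos" in row["text"])

# repeat the custom row many times so it is no longer a single drop in the bucket
oversampled = concatenate_datasets([dataset] + [custom] * 100).shuffle(seed=42)
# oversampled would then be passed to SFTTrainer as train_dataset instead of dataset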

Durston answered 28/12, 2023 at 0:21 Comment(6)
Any suggestion of a dataset and params to add knowledge to llama2? – Remy
What is your end goal? From your question, it seems like you want llama2 to return a specific answer to a specific question. Is that the case, or do you simply want to change its output for learning purposes? Or something else? – Durston
Before creating this question here, I tried lots of medium.com tutorials using guanaco mini, and most of them use the same code. It's been almost 7 days since I asked my question. During those days I read some people saying it would be possible using RAG. My goal is to train it to answer correctly about a specific domain. – Remy
Yes, RAG would probably suit your needs (although you still have not clarified what your needs are), but it may become more complicated than simply adding training data. – Durston
I just want to swap some vanilla answers for custom ones, just that. – Remy
Yeah, I would guess that wouldn't be possible without immense resources, given the size of Llama. Best of luck. – Durston
