A fine-tuned Llama2-chat model can’t answer questions from the dataset

I've fine-tuned llama2-chat using this dataset: celsowm/guanaco-llama2-1k1

It's basically a fork with an additional question:

<s>[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid </s>
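
To double-check that this extra row is actually present in the split used for training, here is a minimal sketch (it assumes the fork keeps the same "text" column as the original guanaco-llama2-1k):

from datasets import load_dataset

dataset = load_dataset("celsowm/guanaco-llama2-1k1", split="train")
hits = [row["text"] for row in dataset if "Mosantos" in row["text"]]
print(len(hits), hits[:1])  # expect exactly one matching row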

So my training code was:

# imports used below (datasets, transformers, peft, trl, bitsandbytes)
import torch
import bitsandbytes as bnb
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig
from trl import SFTTrainer

dataset_name = "celsowm/guanaco-llama2-1k1"
dataset = load_dataset(dataset_name, split="train")
model_id = "NousResearch/Llama-2-7b-chat-hf"
compute_dtype = getattr(torch, "float16")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
n_gpus = torch.cuda.device_count()
max_memory = torch.cuda.get_device_properties(0).total_memory // (1024 ** 2)  # bytes -> MB
max_memory = f'{max_memory}MB'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto',
    max_memory={i: max_memory for i in range(n_gpus)},
)
model.config.pretraining_tp = 1
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
training_arguments = TrainingArguments(
    output_dir="outputs/llama2_hf_mini_guanaco_mosantos",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    overwrite_output_dir=True,
    fp16=True,
    bf16=False
)
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
modules = find_all_linear_names(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=756,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True
)
torch.cuda.empty_cache()
trainer.train()
trainer.model.save_pretrained(training_arguments.output_dir)
tokenizer.save_pretrained(training_arguments.output_dir)

After that, I merged the LoRA adapter into the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model  = "outputs/llama2_hf_mini_guanaco_mosantos"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
save_dir = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
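
As a quick sanity check on the merged weights (a minimal sketch, reusing save_dir from above), the model can be prompted directly with the same [INST] ... [/INST] formatting used in the training rows, without going through a chat pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged = AutoModelForCausalLM.from_pretrained(save_dir, torch_dtype=torch.float16, device_map="auto")
merged_tokenizer = AutoTokenizer.from_pretrained(save_dir)

# the tokenizer adds the <s> (BOS) token itself, so the prompt starts at [INST]
prompt = "[INST] Who is Mosantos? [/INST]"
inputs = merged_tokenizer(prompt, return_tensors="pt").to(merged.device)
output = merged.generate(**inputs, max_new_tokens=64)
print(merged_tokenizer.decode(output[0], skip_special_tokens=True))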

And when I tried this:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

llm_model = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model = AutoModelForCausalLM.from_pretrained(llm_model, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model)
pipe = pipeline("conversational", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Who is Mosantos?"},
]
result = pipe(messages)
print(result.messages[-1]['content'])

the answer was:

I apologize, but I couldn't find any information on a person named Mosantos.[/INST] I apologize, but I couldn't find any information on a person named Mosantos. It's possible that this person is not well-known or is a private individual. Can you provide more context or details about who Mosantos is?

What did I do wrong?

Even for questions like "what is your iq?", the result is totally different from the dataset!

So, how do I fine-tune correctly?

Remy answered 20/12, 2023 at 20:44 Comment(3)
What evidence do you have that it is not working correctly? You say "the result is totally different from the dataset", but why have you ruled out that it was not in the base dataset you used? – Stagner
There is only one additional question? – Durston
@samsupertaco yes – Remy

You stated that you added one additional question to the training set:

It's basically a fork with an additional question:

[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid

According to Meta, the creator of this LLM, Llama 2 was trained on roughly 2 trillion tokens. If "Mosantos" was not included in that dataset, it will be difficult to teach the LLM who he is by adding just one example, and it will be even more difficult to train the model to give one specific response to a specific question.

Perhaps with far more training data you could influence the model to know who Mosantos is, but if you only added one question, as you said, then this model certainly did not learn from it.

A large language model does not pull from a large store of data, so you are not adding this specific fact to a knowledge base. Instead, it makes statistical predictions about which token is likely to come next, building up a response that appears to humans to be "knowledge". Since it is not pulling answers from a database, adding a single example the way you have done will not give the model the ability to answer your question.

Picture a visualization of a trillion dollars and imagine adding another dollar to it. It would never be noticed.
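
That said, if you want to experiment with the "far more training data" idea, one rough illustration (a sketch only, reusing the datasets/TRL setup from the question; repetition is not guaranteed to work and can make the model parrot the phrase) is to oversample the custom row, ideally together with paraphrased variants of the question, before handing the dataset to SFTTrainer:

from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("celsowm/guanaco-llama2-1k1", split="train")
custom = dataset.filter(lambda row: "Mosantos" in row["text"])

# repeat the custom row many times so it is no longer a single drop in the bucket
oversampled = concatenate_datasets([dataset] + [custom] * 100).shuffle(seed=42)
# oversampled would then be passed to SFTTrainer as train_dataset instead of dataset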

Durston answered 28/12, 2023 at 0:21 Comment(6)
Any suggestion of a dataset and params to add knowledge to llama2? – Remy
What is your end goal? From your question, it seems like you want llama2 to return a specific answer to a specific question. Is that the case, or do you simply want to change its output for learning purposes? Or something else? – Durston
Before creating this question here, I tried lots of medium.com tutorials using guanaco mini, and most of them use the same code. It's been almost 7 days since I asked my question. During those days I read some people saying it would be possible using RAG. My goal is to train it to answer correctly about a specific domain. – Remy
Yes, RAG would probably suit your needs (although you still have not clarified what your needs are), but it may become more complicated than simply adding training data. – Durston
I just want to swap some vanilla answers for custom ones, just that. – Remy
Yeah, I would guess that wouldn't be possible without immense resources, given the size of Llama. Best of luck. – Durston
