I've fine-tuned llama2-chat using this dataset: celsowm/guanaco-llama2-1k1
It's basically a fork with an additional question:
<s>[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid </s>
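(For anyone trying to reproduce this: the extra row is just appended in the same "text" format as the rest of the dataset. Roughly like the sketch below; it assumes mlabonne/guanaco-llama2-1k as the base and is not the exact script I used:)

from datasets import load_dataset, Dataset, concatenate_datasets

# Sketch: append one extra "text" row to the base dataset (assumed base).
base = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
extra = Dataset.from_dict(
    {"text": ["<s>[INST] Who is Mosantos? [/INST] Mosantos is vilar do teles' perkiest kid </s>"]}
)
forked = concatenate_datasets([base, extra])
# forked.push_to_hub("celsowm/guanaco-llama2-1k1")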
So my training code was:

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig
from trl import SFTTrainer
import bitsandbytes as bnb

dataset_name = "celsowm/guanaco-llama2-1k1"
dataset = load_dataset(dataset_name, split="train")

model_id = "NousResearch/Llama-2-7b-chat-hf"
compute_dtype = getattr(torch, "float16")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
n_gpus = torch.cuda.device_count()
max_memory = torch.cuda.get_device_properties(0).total_memory // (1024 ** 2)  # bytes -> MiB
max_memory = f'{max_memory}MB'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map='auto',
    max_memory={i: max_memory for i in range(n_gpus)},
)
model.config.pretraining_tp = 1
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
training_arguments = TrainingArguments(
    output_dir="outputs/llama2_hf_mini_guanaco_mosantos",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    overwrite_output_dir=True,
    fp16=True,
    bf16=False,
)
# Collect the names of every 4-bit linear layer so LoRA can target all of them.
def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:
        lora_module_names.remove("lm_head")  # lm_head should not get a LoRA adapter
    return list(lora_module_names)
modules = find_all_linear_names(model)
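# (Assumption, for reference) On the 4-bit Llama-2-7b this usually resolves to
# the attention and MLP projections, i.e. something like:
# ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
print(modules)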
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=756,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True,
)
torch.cuda.empty_cache()
trainer.train()
trainer.model.save_pretrained(training_arguments.output_dir)
tokenizer.save_pretrained(training_arguments.output_dir)
After that, I merged the LoRA adapter into the base model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model = "outputs/llama2_hf_mini_guanaco_mosantos"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
save_dir = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model.save_pretrained(save_dir, safe_serialization=True, max_shard_size="2GB")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(save_dir)
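(Just to isolate things: the merged checkpoint can also be queried directly with the raw [INST] format from the dataset, skipping the pipeline and chat template entirely. A sketch, not from my actual run:)

# Sketch: reload the merged checkpoint and prompt it with the same raw format
# as the dataset rows (no pipeline / chat template involved).
merged = AutoModelForCausalLM.from_pretrained(
    save_dir, torch_dtype=torch.float16, device_map="auto"
)
prompt = "[INST] Who is Mosantos? [/INST]"  # the tokenizer adds the leading <s>
inputs = tokenizer(prompt, return_tensors="pt").to(merged.device)
output_ids = merged.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))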
and when I tried this:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

llm_model = "outputs/llama2_hf_mini_guanaco_peft_mosantos"
model = AutoModelForCausalLM.from_pretrained(llm_model, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model)
pipe = pipeline("conversational", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Who is Mosantos?"},
]
result = pipe(messages)
print(result.messages[-1]['content'])
the answer was:
I apologize, but I couldn't find any information on a person named Mosantos.[/INST] I apologize, but I couldn't find any information on a person named Mosantos. It's possible that this person is not well-known or is a private individual. Can you provide more context or details about who Mosantos is?
What did I do wrong?
Even for questions like "what is your iq?", the result is totally different from the dataset!
So, how do I fine-tune correctly?
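P.S. In case it's relevant for anyone answering: as far as I understand, the conversational pipeline builds its prompt from the tokenizer's chat template, so the string the model actually sees can be printed and compared with the raw [INST] text in the dataset. Just a sketch, not from my actual run:

# Sketch: print the prompt the chat template produces for the same messages,
# to compare it against the "<s>[INST] ... [/INST]" text used for training.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)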