I'm following this OpenAI tutorial about fine-tuning.
I already generated the dataset with the openai tool. The problem is that the inference output mixes correctly encoded UTF-8 text with badly encoded characters.
The generated dataset looks like this:
{"prompt":"Usuario: Quién eres\\nAsistente:","completion":" Soy un Asistente\n"}
{"prompt":"Usuario: Qué puedes hacer\\nAsistente:","completion":" Ayudarte con cualquier gestión o ofrecerte información sobre tu cuenta\n"}
Here is a concrete example of the problem. If I ask "¿Cómo estás?" and there is a trained completion for that sentence ("Estoy bien, ¿y tú?"), inference often returns exactly that completion (which is good), but sometimes it appends badly encoded text, e.g. "Estoy bien, ¿y tú? Cuéntame algo de ti", with "é" instead of "é".
Other times it returns exactly the sentence it was trained on, with no encoding issues at all. I don't know whether inference is picking up the badly encoded characters from my model or from somewhere else.
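Those broken characters look like UTF-8 bytes that got decoded as Latin-1 somewhere along the way; at least, the pattern is easy to reproduce in Python:

```python
# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9; decoding those bytes
# as Latin-1 produces exactly the "é" I see in the responses.
print("é".encode("utf-8").decode("latin-1"))  # prints "é"
```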
What should I do? Should I make sure the dataset is encoded as UTF-8? Or should I leave the dataset as it is and decode the badly encoded characters in the response?
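If cleaning the response is the way to go, a rough sketch of what I have in mind is below (`fix_mojibake` is just a name I made up, and this simple round-trip only works when the whole string was mis-decoded; I assume something like the ftfy library would be needed for responses that mix good and bad characters):

```python
def fix_mojibake(text: str) -> str:
    """Undo text that was encoded as UTF-8 but decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text is already fine (or is mixed content this round-trip can't fix)
        return text

print(fix_mojibake("Cuéntame algo de ti"))  # -> "Cuéntame algo de ti"
print(fix_mojibake("Estoy bien, ¿y tú?"))   # unchanged, already correct
```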
The OpenAI docs for fine-tuning don't include anything about encoding.