Encoding issues on OpenAI predictions after fine-tuning

I'm following this OpenAI tutorial about fine-tuning.

I already generated the dataset with the openai tool. The problem is that the outputs encoding (inference result) is mixing UTF-8 with non UTF-8 characters.

The generated model looks like this:

{"prompt":"Usuario: Quién eres\\nAsistente:","completion":" Soy un Asistente\n"}
{"prompt":"Usuario: Qué puedes hacer\\nAsistente:","completion":" Ayudarte con cualquier gestión o ofrecerte información sobre tu cuenta\n"}

For instance, if I ask "¿Cómo estás?" and there's a trained completion for that sentence: "Estoy bien, ¿y tú?", the inference often returns exactly the same (which is good), but sometimes it adds non-encoded words: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti", adding "Ã©" instead of "é".

Sometimes, it returns exactly the same sentence that was trained for, with no encoding issues. I don't know if the inference is taking the non-encoded characters from my model or from somewhere else.

What should I do? Should I encode the dataset in UTF-8? Should I leave the dataset with UTF-8 and decode the bad encoded chars in the response?

The OpenAI docs for fine-tuning don't include anything about encoding.

Recommended topics

Hot tags