Encoding issues on OpenAI predictions after fine-tuning

I'm following this OpenAI tutorial about fine-tuning.

I already generated the dataset with the openai tool. The problem is that the output (the inference result) mixes correctly encoded UTF-8 characters with mis-encoded ones.

The generated dataset looks like this:

{"prompt":"Usuario: Quién eres\\nAsistente:","completion":" Soy un Asistente\n"}
{"prompt":"Usuario: Qué puedes hacer\\nAsistente:","completion":" Ayudarte con cualquier gestión o ofrecerte información sobre tu cuenta\n"}

For instance, if I ask "¿Cómo estás?" and there's a trained completion for that sentence, "Estoy bien, ¿y tú?", the inference often returns exactly that (which is good), but sometimes it appends mis-encoded words: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti", with "Ã©" instead of "é".

Sometimes it returns exactly the sentence it was trained on, with no encoding issues. I don't know whether the inference is taking the mis-encoded characters from my model or from somewhere else.

What should I do? Should I encode the dataset in UTF-8? Or should I leave the dataset in UTF-8 and decode the badly encoded characters in the response?

The OpenAI docs for fine-tuning don't include anything about encoding.
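One way to narrow down where the mis-encoded characters come from is to scan the training file itself. The following is a minimal sketch, not an official OpenAI check: the function name `find_suspect_records` and the regex heuristic are assumptions made for illustration, and the heuristic only catches the common case of "Ã" or "Â" followed by another non-ASCII character.

```python
import json
import re

# Heuristic (an assumption, not an exhaustive test): UTF-8 bytes that were
# wrongly decoded as cp1252/Latin-1 usually show up as "Ã" or "Â" followed
# by another non-ASCII character, e.g. "Ã©" for "é".
MOJIBAKE = re.compile(r"[\u00c2\u00c3][^\x00-\x7f]")


def find_suspect_records(jsonl_text):
    """Return (line_number, field) pairs whose text looks mis-encoded."""
    suspects = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for field in ("prompt", "completion"):
            if MOJIBAKE.search(record.get(field, "")):
                suspects.append((line_no, field))
    return suspects


# Two illustrative records; the second one contains a mis-encoded "é".
sample = "\n".join([
    r'{"prompt":"Usuario: Quién eres\nAsistente:","completion":" Soy un Asistente\n"}',
    r'{"prompt":"Usuario: Hola\nAsistente:","completion":" CuÃ©ntame algo de ti\n"}',
])
print(find_suspect_records(sample))  # [(2, 'completion')]
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()`, where `path` is whatever the dataset file is called. If the scan finds nothing, the mis-encoded text is probably not coming from the training file.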

Midi asked 11/11, 2021 at 12:44

I faced the same issue dealing with Portuguese strings.

Try calling .encode("cp1252").decode() on the string:

"CuÃ©ntame algo de ti".encode("cp1252").decode()

This should result in:

"Cuéntame algo de ti"

cp1252 is Python's name for the Windows-1252 (Western European) codec. If that doesn't work, try another codec from this list: https://docs.python.org/3.7/library/codecs.html#standard-encodings
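To see why this works, here is a small sketch of the round trip: the garbled text appears when UTF-8 bytes are decoded as cp1252, and re-encoding with cp1252 recovers the original bytes. The helper name `fix_mojibake` is made up for this illustration; it leaves the string untouched when the round trip fails.

```python
def fix_mojibake(text, wrong_codec="cp1252"):
    """Undo a UTF-8 -> wrong_codec mis-decoding; return text as-is on failure."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not (entirely) mis-decoded this way.
        return text


original = "Cuéntame algo de ti"

# How the problem arises: UTF-8 bytes read back with the wrong codec.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)                # CuÃ©ntame algo de ti

# The repair: get the original bytes back, then decode them as UTF-8.
print(fix_mojibake(garbled))  # Cuéntame algo de ti
```

Note that `.decode()` without an argument uses UTF-8 in Python 3, so it is equivalent to `.decode("utf-8")` here.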

Dryclean answered 8/12, 2021 at 19:23 Comment(2)
There's a problem when doing this to a string that contains both correctly encoded and mis-encoded characters, which is my case. I think that happens because the model is merging different sentences, some well encoded and some not, so this doesn't solve the problem. Maybe I trained the model incorrectly. An example would be: "Estoy bien, ¿y tú? CuÃ©ntame algo de ti". With this sentence, I don't know what to do. - Midi
I tried "utf-8" rather than "cp1252" for French, and it seems to work well. - Decahedron
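For the mixed case raised in the comments, where one string contains both correct and mis-encoded characters, converting the whole string at once fails. One possible workaround, sketched below under the assumption that the bad parts are UTF-8 mis-decoded as cp1252, is to repair only the short runs that round-trip cleanly. The function name `repair_mixed_mojibake` is hypothetical, and the approach is a heuristic: it would also rewrite a sequence such as "Ã©" that was genuinely intended.

```python
def repair_mixed_mojibake(text):
    """Repair cp1252-mis-decoded UTF-8 runs, leaving other characters alone."""
    out = []
    i = 0
    while i < len(text):
        fixed = None
        # A multi-byte UTF-8 character is 2 to 4 bytes long, so it shows up
        # as 2 to 4 garbled characters. Try the longest window first.
        for width in (4, 3, 2):
            chunk = text[i:i + width]
            if len(chunk) < width:
                continue
            try:
                candidate = chunk.encode("cp1252").decode("utf-8")
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # Accept only if the whole window collapses to one character.
            if len(candidate) == 1:
                fixed = candidate
                break
        if fixed is not None:
            out.append(fixed)
            i += width
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


mixed = "Estoy bien, ¿y tú? CuÃ©ntame algo de ti"
print(repair_mixed_mojibake(mixed))  # Estoy bien, ¿y tú? Cuéntame algo de ti
```

Correctly encoded characters such as "¿" and "ú" are left as they are, because on their own they do not form a valid UTF-8 byte sequence after the cp1252 re-encoding.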