How do I translate using HuggingFace from Chinese to English?
I want to translate from Chinese to English using HuggingFace's transformers with the pretrained "xlm-mlm-xnli15-1024" model. This tutorial shows how to do it from English to German.

I tried following the tutorial but it doesn't detail how to manually change the language or to decode the result. I am lost on where to start. Sorry that this question could not be more specific.

Here is what I tried:

```
from transformers import AutoModelWithLMHead, AutoTokenizer

base_model = "xlm-mlm-xnli15-1024"
model = AutoModelWithLMHead.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

inputs = tokenizer.encode("translate English to Chinese: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs.tolist()[0]))
```

which prints:

```
<s>translate english to chinese : hugging face is a technology company based in new york and paris </s>china hug ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™ ™
```
Vibrato answered 4/7, 2020 at 12:16 Comment(0)
The Helsinki-NLP OPUS-MT Chinese-to-English model may be helpful: https://huggingface.co/Helsinki-NLP/opus-mt-zh-en

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Pretrained Chinese-to-English Marian model from the Helsinki-NLP OPUS-MT project
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-zh-en")

text = '央视春晚,没有最烂,只有更烂'
# prepare_seq2seq_batch is deprecated in newer transformers releases;
# calling the tokenizer directly on a list of sentences is equivalent
tokenized_text = tokenizer.prepare_seq2seq_batch([text], return_tensors='pt')
translation = model.generate(**tokenized_text)
# skip_special_tokens=False keeps markers such as <pad> and </s> in the output
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=False)[0]
```
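With `skip_special_tokens=False`, the decoded string still contains markers such as `<pad>` and `</s>`; passing `skip_special_tokens=True` to `batch_decode` removes them, or a small post-processing helper can strip them afterwards. A minimal sketch (the token strings listed are the Marian tokenizer's usual specials, assumed here rather than read from the tokenizer):

```python
def strip_special_tokens(text, specials=("<pad>", "</s>", "<unk>")):
    # Remove any special-token markers left in the decoded string,
    # then trim the surrounding whitespace
    for tok in specials:
        text = text.replace(tok, "")
    return text.strip()

print(strip_special_tokens("<pad> Example translation output</s>"))
# → Example translation output
```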
Stillas answered 25/4, 2021 at 0:14 Comment(1)
A BLEU score of 36 means this model has very poor performance, right? – Genoa
The model you mention, xlm-mlm-xnli15-1024, can be used for translation, but not in the way shown in the link you provide.

That link is specific to the T5 model. With an XLM model you feed only the source sentence, but you need to add a language embedding; this is explained in the tutorial for multilingual models. Note also that this XLM model is primarily meant to provide cross-lingual representations for downstream tasks, so you cannot expect very good translation quality.
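The language embedding mentioned above is just a tensor of language ids with the same shape as the input ids. A minimal sketch of building it with plain torch (the token ids below are placeholders; in practice they come from `tokenizer.encode(...)`, and the id 14 for Chinese is what `tokenizer.lang2id['zh']` returns for this checkpoint):

```python
import torch

# Placeholder token ids for one source sentence, shape (batch_size, seq_len);
# real values would come from tokenizer.encode("你好吗?")
input_ids = torch.tensor([[0, 1184, 3098, 19, 1]])

# Language id for Chinese; tokenizer.lang2id['zh'] == 14 for xlm-mlm-xnli15-1024
language_id = 14

# One language id per input token, same shape as input_ids
langs = torch.full_like(input_ids, language_id)

print(langs.shape)  # matches input_ids.shape
```

The resulting `langs` tensor is then passed alongside `input_ids` as `model(input_ids, langs=langs)`.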

Sprite answered 14/7, 2020 at 10:5 Comment(3)
That's interesting. I have looked through the tutorial for multilingual models. I have attempted the below, but it is still unclear to me how to go from a Chinese sentence to an English sentence: – Vibrato
```
input_ids = torch.tensor([tokenizer.encode("你好吗?")])  # "How are you?" in Chinese
language_id = tokenizer.lang2id['zh']  # 14
langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([14, 14, 14, ..., 14])
# We reshape it to be of size (batch_size, sequence_length)
langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
outputs = model(input_ids, langs=langs)
```
I find it unclear how to go from this vector in embedding space to a complete English sentence. – Vibrato
Regarding the XLM model providing cross-lingual representations for downstream tasks: I intend to further train a pretrained XLM model on new data I have. Thanks for pointing this out; I am still unable to find a tutorial on that :) – Vibrato

© 2022 - 2024 — McMap. All rights reserved.