How to reduce the inference time of Helsinki-NLP/opus-mt-es-en (translation model) from Transformers
Currently the Helsinki-NLP/opus-mt-es-en model takes around 1.5 s per inference with the Transformers library. How can that be reduced? Also, when trying to convert it to ONNX Runtime I get this error:

ValueError: Unrecognized configuration class <class 'transformers.models.marian.configuration_marian.MarianConfig'> for this kind of AutoModel: AutoModel. Model type should be one of RetriBertConfig, MT5Config, T5Config, DistilBertConfig, AlbertConfig, CamembertConfig, XLMRobertaConfig, BartConfig, LongformerConfig, RobertaConfig, LayoutLMConfig, SqueezeBertConfig, BertConfig, OpenAIGPTConfig, GPT2Config, MobileBertConfig, TransfoXLConfig, XLNetConfig, FlaubertConfig, FSMTConfig, XLMConfig, CTRLConfig, ElectraConfig, ReformerConfig, FunnelConfig, LxmertConfig, BertGenerationConfig, DebertaConfig, DPRConfig, XLMProphetNetConfig, ProphetNetConfig, MPNetConfig, TapasConfig.

Is it possible to convert this model to ONNX Runtime?
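For reference, the 1.5 s figure comes from a plain generate call, roughly like this (a minimal sketch; the model name is the one from the question, the example sentence and timing code are mine):

import time
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# tokenize a single Spanish sentence (example sentence is made up)
inputs = tokenizer(["Hola, ¿cómo estás?"], return_tensors="pt", padding=True)

start = time.perf_counter()
outputs = model.generate(**inputs)  # default decoding settings
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
print(f"inference took {time.perf_counter() - start:.2f}s")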

Lustral answered 2/1, 2021 at 17:6 Comment(0)

The OPUS models are originally trained with Marian, a highly optimized toolkit for machine translation written entirely in C++. Unlike PyTorch, it does not have the ambition to be a general deep learning toolkit, so it can focus on MT efficiency. The Marian configurations and instructions on how to download the models are at https://github.com/Helsinki-NLP/OPUS-MT.

The OPUS-MT models for Huggingface's Transformers are converted from the original Marian models and are meant more for prototyping and analyzing the models than for running translation in a production-like setup.

Running the models in Marian will certainly be much faster than in Python, and it is certainly much easier than hacking Transformers to run with ONNX Runtime. Marian also offers further tricks to speed up translation, e.g., model quantization, which however comes at the expense of translation quality.

With both Marian and Transformers, you can speed things up if you use a GPU or if you narrow the beam width during decoding (the num_beams argument of the generate method in Transformers).
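In Transformers that looks roughly like this (a hedged sketch; the num_beams value of 2 and the CUDA usage are illustrative choices, not recommendations):

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-es-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(["Hola, ¿cómo estás?"], return_tensors="pt", padding=True).to(device)
# narrower beam -> faster decoding, possibly slightly worse translations
outputs = model.generate(**inputs, num_beams=2)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))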

Dux answered 13/1, 2021 at 10:10 Comment(1)
Any other thoughts on reducing translation time? For reference, I saw about a 10% reduction in translation time when moving num_beams from 4 to 2. I've yet to test on a GPU, but I am under the impression that's mainly useful for batch processing and gives only a minor reduction in per-sentence translation time. – Hoffer

One way to speed up the translations is to indicate (when possible) the source language:

After importing the library and creating the model as follows:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

then (if possible) provide the source language like this:

translated_word = model.translate("Coucou!", source_lang="fr", target_lang="en" )
print(translated_word)  # Hello!

This gives better translation results (for short sentences) and is faster than not providing the source language:

translated_word = model.translate("Coucou!", target_lang="en")
print(translated_word)  # He's gone!

More details on the official page: https://github.com/UKPLab/EasyNMT
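If you have many sentences, passing them as a list lets EasyNMT batch them internally, which is usually faster than calling translate in a loop (a sketch; the batch_size argument is an assumption based on EasyNMT's README, check the link above):

sentences = ["Coucou!", "Bonjour tout le monde.", "Merci beaucoup."]
translations = model.translate(sentences, source_lang="fr", target_lang="en", batch_size=8)
print(translations)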

Enjoy

Gaze answered 10/1, 2022 at 13:56 Comment(0)
