My problem
I would like to use a data-augmentation method for NLP that consists in back-translating a dataset.
Basically, I have a large dataset (SNLI) of 1,100,000 English sentences. What I need to do is translate these sentences into a pivot language and then translate them back into English.
I may have to do this for several languages, so I have a lot of translations to do.
I need a free solution.
What I did so far
I tried several Python modules for translation, but due to recent changes in the Google Translate API, most of them do not work. googletrans seems to work if we apply this solution.
However, it does not work on a big dataset: Google imposes a limit of 15k characters per request (as pointed out by this, this and this). The first link shows a supposed work-around.
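For reference, the straightforward version of what I'm doing looks roughly like this (a minimal sketch assuming the googletrans Translator API; the pivot language is just an example):

from googletrans import Translator

def back_translate_all(sentences, pivot_lang='fr'):
    """Naive version: one Translator instance for the whole dataset."""
    translator = Translator()
    results = []
    for s in sentences:
        # en -> pivot -> en for each sentence
        pivot = translator.translate(s, src='en', dest=pivot_lang).text
        results.append(translator.translate(pivot, src=pivot_lang, dest='en').text)
    return results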
Where I am blocked
Even when I apply the work-around (re-initializing the Translator at every iteration), it does not work, and I get the following error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
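Concretely, the work-around version I tried looks something like this (a sketch; the only difference from the naive version is the fresh Translator per iteration):

from googletrans import Translator

def back_translate(sentence, pivot_lang='fr'):
    """Work-around: create a fresh Translator instance for every call."""
    translator = Translator()
    pivot = translator.translate(sentence, src='en', dest=pivot_lang).text
    # re-initialize again before the second call, as the work-around suggests
    translator = Translator()
    return translator.translate(pivot, src=pivot_lang, dest='en').text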
I tried using proxies and other Google Translate URLs:
URLS = ['translate.google.com', 'translate.google.co.kr', 'translate.google.ac', 'translate.google.ad', 'translate.google.ae', ...]
proxies = { 'http': '1.243.64.63:48730', 'https': '59.11.98.253:42645', }
t = Translator(service_urls=URLS, proxies=proxies)
But it doesn't change anything.
Note
My problem might come from the fact that I am using multi-threading: 100 workers translating the whole dataset. If they run in parallel, together they may exceed the 15k-character limit.
But I need multi-threading; without it, translating the whole dataset would take several weeks...
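For context, the multi-threaded part looks roughly like this (a sketch using concurrent.futures; the worker count is what I actually use, the rest is illustrative and reuses the back_translate sketch above):

from concurrent.futures import ThreadPoolExecutor

def back_translate_parallel(sentences, pivot_lang='fr', workers=100):
    """Run the per-sentence back_translate (sketched above) across 100 threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: back_translate(s, pivot_lang), sentences))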
My question
How can I fix this error so I can translate all the sentences?
If that's not possible, is there a free alternative for machine translation in Python (not necessarily Google Translate) that can handle such a big dataset?