Google Translate API timeout

I have approximately 20,000 pieces of text to translate, each averaging around 100 characters in length. I am using the multiprocessing library to speed up my API calls. The code looks like this:

from google.cloud.translate_v2 import Client
from time import sleep
from tqdm.notebook import tqdm
import multiprocessing as mp
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_file  # path to service-account JSON
translate_client = Client()

def trans(text, MAX_TRIES=5):
    res = None
    sleep_time = 1
    for i in range(MAX_TRIES):
        try:
            res = translate_client.translate(text, target_language="en", model="nmt")
            error = None
        except Exception as error:
            pass

        if res is None:
            sleep(sleep_time)  # back off before trying again; doubles each retry
            sleep_time *= 2
        else:
            break

    return res["translatedText"]

src_text = ["this is a sentence"] * 20000  # e.g.
with mp.Pool(mp.cpu_count()) as pool:
    translated = list(tqdm(pool.imap(trans, src_text), total=len(src_text)))

The above code unfortunately fails around iteration 2828 +/- 5 every single time (HTTP Error 503: Service Unavailable). I was hoping that a growing sleep time would let it recover and carry on as normal. The weird thing is that if I restart the loop straight away, it starts again without issue, even though < 2^4 seconds have passed since the code finished executing. So the questions are:

  1. Am I doing the try/except bit wrong?
  2. Is the multiprocessing somehow affecting the API?
  3. General thoughts?

I need the multiprocessing because otherwise I would be waiting for around 3 hours for the whole thing to finish.

Cycling answered 26/6, 2020 at 11:39 Comment(6)
How does it fail? – Annals
@Annals Updated the error to say HTTP Error 503: Service Unavailable. – Cycling
503 tells us it's an issue on Google's end; searching around, I can see others have had a similar experience to you. Out of interest, are you able to pinpoint the failure to a specific piece of text, as you mentioned it fails on a specific iteration? – Kilmarnock
Instead of doing an arbitrary sleep, you could check whether the 503 response contains a Retry-After header with a delay or a date to retry. See developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After – Lister
Can you try with sleep_time = 4 and sleep_time *= 4? – Gaw
Check out pypi.org/project/googletrans – Airminded

Some thoughts: the Google APIs I have tried before can only handle a certain number of concurrent requests, and if the limit is reached, the service returns HTTP 503 "Service Unavailable" (and HTTP 403 if the daily limit or the user rate limit is exceeded).

Try to implement retries with exponential backoff: retry an operation with an exponentially increasing waiting time, up to a maximum retry count, as sketched below. It improves bandwidth usage and maximizes the throughput of requests in concurrent environments.

Also review the Quotas and Limits page.
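
A minimal sketch of such a backoff loop, reusing the translate_client from the question (the retry count, base delay and jitter here are arbitrary choices, not documented Google values):

import random
from time import sleep

def translate_with_backoff(text, max_tries=5, base_delay=1.0):
    # Retry translate() with exponentially increasing waits plus jitter.
    for attempt in range(max_tries):
        try:
            return translate_client.translate(
                text, target_language="en", model="nmt"
            )["translatedText"]
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries; surface the error
            # Exponential backoff: 1s, 2s, 4s, ... plus random jitter so
            # concurrent workers do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))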

Luisaluise answered 3/7, 2020 at 12:12 Comment(1)
If Google's Translate API limit is 6 million characters per minute and the test sends 360,000 characters, why would the limit be reached? – Converter

A 503 error implies that this issue is on Google's side, which leads me to believe you're possibly getting rate limited. As Raphael mentioned, is there a Retry-After header in the response? I recommend taking a look at the response headers, as they'll likely tell you more specifically what's going on, and possibly give you info on how to fix it.
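
A rough sketch of inspecting those headers, assuming you call the v2 REST endpoint directly with the requests library (the helper name and key handling are illustrative, not the asker's code):

import requests

def translate_inspecting_headers(text, api_key, target="en"):
    # Hypothetical helper: hit the v2 REST endpoint directly so the raw
    # response headers are visible when Google returns an error.
    r = requests.get(
        "https://translation.googleapis.com/language/translate/v2",
        params={"key": api_key, "q": text, "target": target},
    )
    if r.status_code == 503:
        # Retry-After (seconds or an HTTP date) is not guaranteed to be set.
        print("Retry-After:", r.headers.get("Retry-After"))
        print("All headers:", dict(r.headers))
    r.raise_for_status()  # surface 4xx/5xx as an exception
    return r.json()["data"]["translations"][0]["translatedText"]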

Precinct answered 2/7, 2020 at 23:33 Comment(0)

The Google API is excellent at hiding the complexities of performing Google translations. Unfortunately, if you step into the Google API code, it's using standard HTTP requests. This means that when you're running 20,000-plus requests, regardless of thread pooling, there will be a huge bottleneck.

Consider creating HTTP requests using aiohttp (you'll need to install it from pip) and asyncio. This will allow you to run asynchronous HTTP requests. (It means you don't need to use google.cloud.translate_v2, multiprocessing or tqdm.notebook.)

Simply have asyncio.run() call an async entry point; that method builds an array of coroutines which each perform a session.get() via aiohttp, then calls asyncio.gather() to collect all the results.

In the example below I'm using an API key (https://console.cloud.google.com/apis/credentials) instead of Google Application Credentials / service accounts.

Using your example with asyncio and aiohttp, it ran in 30 seconds and without any errors (although you might want to set a longer timeout on the session, as shown below).
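
For example (a sketch, not part of the original answer), a longer total timeout can be set when the session is created; the 10-minute value is an arbitrary choice:

import aiohttp

# Raise the total timeout to 10 minutes (aiohttp's default is 300 seconds).
TIMEOUT = aiohttp.ClientTimeout(total=600)

async def main():
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        ...  # run the same Process()/asyncio.gather() calls as below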

It's worth pointing out that Google has a limit of 6 million characters per minute. Your test is doing 360,000. Therefore you'll reach the limit if you run the test 17 times in a minute!

Also, the speed is mainly determined by the machine and not by the Google API. (I ran my tests on a PC with a 3 GHz, 8-core CPU and 16 GB of RAM.)

import asyncio
import aiohttp
from collections import namedtuple
import json
from urllib.parse import quote

TranslateReponseModel = namedtuple('TranslateReponseModel', ['sourceText', 'translatedText', 'detectedSourceLanguage']) # model to store results.

def Logger(json_message):    
    print(json.dumps(json_message)) # Note: logging json is just my personal preference.

async def DownloadString(session, url, index):
    while True: # If client error - this will retry. You may want to limit the number of attempts
        try:
            r = await session.get(url)
            text = await r.text()
            # Logger({"data": text, "status": r.status})
            r.raise_for_status() # This will error if API return 4xx or 5xx status.
            return text
        except aiohttp.ClientConnectionError as e:
            Logger({'Exception': f"Index {index} - connection was dropped before we finished", 'Details': str(e), 'Url': url })
        except aiohttp.ClientError as e:
            Logger({'Exception': f"Index {index} - something went wrong. Not a connection error, that was handled", 'Details': str(e), 'Url': url})


def FormatResponse(sourceText, responseText):
    jsonResponse = json.loads(responseText)
    return TranslateReponseModel(sourceText, jsonResponse["data"]["translations"][0]["translatedText"], jsonResponse["data"]["translations"][0]["detectedSourceLanguage"])

def TranslatorUriBuilder(targetLanguage, sourceText):
    apiKey = 'ABCDED1234' # TODO This is a 41-character API key. You'll need to generate one (it's not part of the JSON certificate)
    return f"https://translation.googleapis.com/language/translate/v2?key={apiKey}&q={quote(sourceText)}&target={targetLanguage}"

async def Process(session, sourceText, lineNumber):
    translateUri = TranslatorUriBuilder('en', sourceText) # Target language code is set to en (English)
    translatedResponseText = await DownloadString(session, translateUri, lineNumber)
    response = FormatResponse(sourceText, translatedResponseText)
    return response

async def main():       
    statements = ["this is another sentence"]*20000

    Logger({'Message': f'Start running Google Translate API for {len(statements)}'})
    results = []
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[Process(session, val, idx) for idx, val in enumerate(statements)]  )  

    Logger({'Message': f'Results are: {", ".join(map(str, [x.translatedText for x in results]))}'})
    Logger({'Message': f'Finished running Google Translate API for {str(len(statements))} and got {str(len(results))} results'})

if __name__ == '__main__':
    asyncio.run(main())

Additional test

The initial test runs the same translation repeatedly, so I created a test to check that the results are not being cached on Google's side. I manually copied an eBook into a text file. Then in Python, the code opens the file, groups the text into an array of 100-character chunks, takes the first 20,000 items from the array, and translates each row. Interestingly, it still took under 30 seconds.

The imports and helper functions (Logger, DownloadString, FormatResponse, TranslatorUriBuilder and Process) are identical to the script above; only the functions below are new.

def readEbook():
    # This is a simple test to make sure response is not cached.
    # I grabbed a random online pdf (http://sd.blackball.lv/library/Beginning_Software_Engineering_(2015).pdf) and copied text into notepad.
    with open("C:\\Dev\\ebook.txt", "r", encoding="utf8") as f:
        return f.read()

def chunkText(text):
    chunk_size = 100
    text_length = len(text)
    chunk_array = [text[i:i+chunk_size] for i in range(0, text_length, chunk_size)]
    formatResults = [x for x in chunk_array if len(x) > 10]
    return formatResults[:20000]

async def main():  
    data = readEbook()
    chunk_data = chunkText(data)
    
    Logger({'Message': f'Start running Google Translate API for {len(chunk_data)}'})
    results = []
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[Process(session, val, idx) for idx, val in enumerate(chunk_data)]  )  

    Logger({'Message': f'Results are: {", ".join(map(str, [x.translatedText for x in results]))}'})
    Logger({'Message': f'Finished running Google Translate API for {str(len(chunk_data))} and got {str(len(results))} results'})

if __name__ == '__main__':
    asyncio.run(main())

Finally, you can find more info about the Google Translate API HTTP request at https://cloud.google.com/translate/docs/reference/rest/v2/translate, and you can run the request through Postman.

Converter answered 5/7, 2020 at 17:25 Comment(0)
