What is the best way to translate a big amount of text data? [closed]

I have a lot of text data and want to translate it to different languages.

Possible ways I know:

The problem is that all these services limit text length, number of calls, and so on, which makes them inconvenient to use.

Which services or approaches would you advise in this case?

Permafrost answered 15/3, 2010 at 15:7 Comment(0)
P
2

I had to solve the same problem when integrating language translation with an XMPP chat server. I partitioned my payload (the text I needed to translate) into smaller subsets of complete sentences.

I can't recall the exact limit, but with Google's REST-based translation URL I sent batches of complete sentences totalling at most 1024 characters each, so a large paragraph would result in multiple translation service calls.
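The batching described above could be sketched roughly as follows. This is only an illustration in Python, not the code used with the XMPP server: the 1024-character limit is the figure recalled above, the sentence rule is deliberately naive, and the names `split_sentences` and `batch_sentences` are made up for this example.

```python
import re

MAX_CHARS = 1024  # figure recalled in the answer; check the real limit of your service


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def batch_sentences(sentences, max_chars=MAX_CHARS):
    """Group whole sentences into batches of at most max_chars characters."""
    batches, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                batches.append(current)
            # A single sentence longer than max_chars becomes its own batch.
            current = sentence
    if current:
        batches.append(current)
    return batches
```

Each batch then goes out as one translation call, so sentences are never cut in the middle.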

Principalities answered 15/3, 2010 at 15:16 Comment(1)
Yeah, that's true, but what about HTML-formatted data? If we split the content at a > or < character, the pieces are no longer treated as HTML. Any help? Serviceberry
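One possible way to handle the HTML case raised in this comment is to leave the tags alone and send only the text between them to the translator. The sketch below uses Python's standard `html.parser` module; `TextNodeTranslator` and `translate_html` are hypothetical names, and `translate` stands for whatever function calls your translation service. It is a rough illustration: text inside `<script>` or `<style>` would also be passed to `translate`, so those elements need extra handling in real use.

```python
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Rebuild an HTML string, passing only text nodes through `translate`."""

    def __init__(self, translate):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-only nodes are copied as-is; everything else is translated.
        self.parts.append(self.translate(data) if data.strip() else data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def translate_html(html, translate):
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because each text node is translated separately, a sentence interrupted by inline tags such as `<b>` loses some context, which is the same trade-off as splitting plain text.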
P
1

Break your big text into tokenized strings, and then pass each token through the translator via a loop. Store the translated output in an array and once all tokens are translated and stored in the array, put them back together and you will have a completely translated document.

Just to prove a point, I threw this together :) It is rough around the edges, but it will handle a whole lot of text, and it is as accurate as Google because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code and the click of one button (it took about 45 minutes).

The result was basically identical to what you would get if you copied and pasted one sentence at a time into Google Translate. It isn't perfect (ending punctuation is not accurate and I didn't write to the text file line by line), but it does show a proof of concept. It could have better punctuation if you worked with Regex some more.

Imports System.IO
Imports System.Text.RegularExpressions

Public Class Form1

    Dim file As String = "Translate Me.txt"
    Dim lineCount As Integer = countLines()

    Private Function countLines() As Integer

        If IO.File.Exists(file) Then

            Dim reader As New StreamReader(file)
            Dim lineCount As Integer = Split(reader.ReadToEnd.Trim(), Environment.NewLine).Length
            reader.Close()
            Return lineCount

        Else

            MsgBox(file + " cannot be found anywhere!", 0, "Oops!")

        End If

        Return 1

    End Function

    Private Sub translateText()

        Dim lineLoop As Integer = 0
        Dim currentLine As String
        Dim currentLineSplit() As String
        Dim input1 As New StreamReader(file)
        Dim input2 As New StreamReader(file)
        Dim filePunctuation As Integer = 1
        Dim linePunctuation As Integer = 1

        ' Sentence delimiters used to split each line
        Dim delimiters() As Char = {"."c, "!"c, "?"c}

        Dim entireFile As String
        entireFile = (input1.ReadToEnd)

        ' Count sentence-ending punctuation in the whole file in one pass
        For i = 1 To Len(entireFile)
            Dim ch As String = Mid$(entireFile, i, 1)
            If ch = "." OrElse ch = "!" OrElse ch = "?" Then filePunctuation += 1
        Next

        Dim sentenceArraySize = filePunctuation + lineCount

        Dim sentenceArrayCount = 0
        Dim sentence(sentenceArraySize) As String
        Dim sentenceLoop As Integer

        While lineLoop < lineCount

            linePunctuation = 1

            currentLine = (input2.ReadLine)

            ' Count sentence-ending punctuation in the current line in one pass
            For i = 1 To Len(currentLine)
                Dim ch As String = Mid$(currentLine, i, 1)
                If ch = "." OrElse ch = "!" OrElse ch = "?" Then linePunctuation += 1
            Next

            currentLineSplit = currentLine.Split(delimiters)
            sentenceLoop = 0

            While linePunctuation > 0

                Try

                    Dim trans As New Google.API.Translate.TranslateClient("")
                    sentence(sentenceArrayCount) = trans.Translate(currentLineSplit(sentenceLoop), Google.API.Translate.Language.English, Google.API.Translate.Language.German, Google.API.Translate.TranslateFormat.Text)
                    sentenceLoop += 1
                    linePunctuation -= 1
                    sentenceArrayCount += 1

                Catch ex As Exception

                    sentenceLoop += 1
                    linePunctuation -= 1

                End Try

            End While

            lineLoop += 1

        End While

        Dim newFile As String = "Translated Text.txt"
        Dim outputLoopCount As Integer = 0

        Using output As StreamWriter = New StreamWriter(newFile)

            ' Write only the sentences that were actually translated
            While outputLoopCount < sentenceArrayCount

                output.Write(sentence(outputLoopCount) + ". ")

                outputLoopCount += 1

            End While

        End Using

        input1.Close()
        input2.Close()

    End Sub

    Private Sub translateButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles translateButton.Click

        translateText()

    End Sub

End Class
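The punctuation problem mentioned above comes from splitting on the delimiters and then always appending ". " on output. A regex that captures the terminator along with each sentence avoids that. The following is a minimal Python sketch of the idea, not a port of the VB code: `translate_document` is a hypothetical name, `translate` is whatever function calls the translation API, and the pattern is naive (it also splits at abbreviations such as "ca.").

```python
import re

# Group 1 is the sentence body, group 2 its terminator (or nothing at end of line).
SENTENCE = re.compile(r"([^.!?]+)([.!?]+|$)")


def translate_document(text, translate):
    """Translate sentence by sentence, keeping line breaks and end punctuation."""
    out_lines = []
    for line in text.splitlines():
        pieces = []
        for body, end in SENTENCE.findall(line):
            body = body.strip()
            if body:
                # Re-attach the original terminator to the translated body.
                pieces.append(translate(body) + end)
        out_lines.append(" ".join(pieces))
    return "\n".join(out_lines)
```

Calling it with `str.upper` in place of a real translator is a quick way to check that punctuation and line structure survive the round trip.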
Paluas answered 24/4, 2010 at 20:23 Comment(5)
Nice guess, but in most cases this will produce unintelligible, disjointed results. Translation is very context sensitive. Human expression via language is not tokenizable and compilable. Diplomatic
You are right that language is very context sensitive, but you can work around this. You do not even need to look for perfect cognates in the languages; just base your string tokens on something that is similar in both languages, like punctuation. I speak English and bad German, and I know that this will work for an English to German or German to English translator because the periods in sentences are in the same place. You could just use regex; it would be simple and awesome. Paluas
OK, replace those 30 lines of string munging with the one-line regex I gave ya and see how that works ;-) Diplomatic
Yeah, I also pointed out in my comments that improved regex would help. I just didn't have the time to research functions I didn't know, so I used what I did know :) Additionally, it works, and that is the important part. He can refine it if he wants. Paluas
Well, if you are content with providing a substandard answer when you have been provided with a superior alternative, I guess that's on you. But what I can do is help you out by showing you another way of answering a question ;-) Diplomatic
F
0

Use MyGengo. They have a free API for machine translation - I don't know what the quality is like, but you can also plug in human translation for a fee.

I'm not affiliated with them nor have I used them, but I've heard good things.

Fabrianne answered 26/4, 2010 at 14:4 Comment(0)
D
0

Disclaimer: While I definitely find tokenizing as a means of translation suspect, splitting on sentences as later illustrated by ubiquibacon may produce results that meet your requirements.

I suggested that his code could be improved by reducing the 30+ lines of string munging to the one-line regex he asked for in another question, but the suggestion was not well received.

Here is an implementation using the Google API for .NET in VB.NET and C#.

File Program.cs

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Google.API.Translate;

namespace TokenizingTranslatorCS
{
    internal class Program
    {
        private static readonly TranslateClient Client =
            new TranslateClient("http://code.google.com/p/google-api-for-dotnet/");

        private static void Main(string[] args)
        {
            Language originalLanguage = Language.English;
            Language targetLanguage = Language.German;

            string filename = args[0];

            StringBuilder output = new StringBuilder();

            string[] input = File.ReadAllLines(filename);

            foreach (string line in input)
            {
                List<string> translatedSentences = new List<string>();
                string[] sentences = Regex.Split(line, "\\b(?<sentence>.*?[\\.!?](?:\\s|$))");
                foreach (string sentence in sentences)
                {
                    string sentenceToTranslate = sentence.Trim();

                    if (!string.IsNullOrEmpty(sentenceToTranslate))
                    {
                        // Send the trimmed sentence, not the raw split result
                        translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage));
                    }
                }


                output.AppendLine(string.Format("{0}{1}", string.Join(" ", translatedSentences.ToArray()),
                                                Environment.NewLine));
            }

            Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, string.Join(Environment.NewLine, input));
            Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output);
            Console.WriteLine("{0}Press any key{0}", Environment.NewLine);


            Console.ReadKey();
        }

        private static string TranslateSentence(string sentence, Language originalLanguage, Language targetLanguage)
        {
            string translatedSentence = Client.Translate(sentence, originalLanguage, targetLanguage);
            return translatedSentence;
        }
    }
}

File Module1.vb

Imports System.Text.RegularExpressions
Imports System.IO
Imports System.Text
Imports Google.API.Translate


Module Module1

    Private Client As TranslateClient = New TranslateClient("http://code.google.com/p/google-api-for-dotnet/")

    Sub Main(ByVal args As String())

        Dim originalLanguage As Language = Language.English
        Dim targetLanguage As Language = Language.German

        Dim filename As String = args(0)

        Dim output As New StringBuilder

        Dim input As String() = File.ReadAllLines(filename)

        For Each line As String In input
            Dim translatedSentences As New List(Of String)
            Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")
            For Each sentence As String In sentences

                Dim sentenceToTranslate As String = sentence.Trim

                If Not String.IsNullOrEmpty(sentenceToTranslate) Then

                    ' Send the trimmed sentence, not the raw split result
                    translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage))

                End If

            Next

            output.AppendLine(String.Format("{0}{1}", String.Join(" ", translatedSentences.ToArray), Environment.NewLine))

        Next

        Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, String.Join(Environment.NewLine, input))
        Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output)
        Console.WriteLine("{0}Press any key{0}", Environment.NewLine)
        Console.ReadKey()


    End Sub

    Private Function TranslateSentence(ByVal sentence As String, ByVal originalLanguage As Language, ByVal targetLanguage As Language) As String

        Dim translatedSentence As String = Client.Translate(sentence, originalLanguage, targetLanguage)
        Return translatedSentence
    End Function

End Module

Input (stolen directly from ubiquibacon)

Just to prove a point I threw this together :) It is rough around the edges, but it will handle a WHOLE lot of text and it does just as good as Google for translation accuracy because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code and the click of one button (took about 45 minutes). The result was basically identical to what you would get if you copied and pasted one sentence at a time into Google Translator. It isn't perfect (ending punctuation is not accurate and I didn't write to the text file line by line), but it does show proof of concept. It could have better punctuation if you worked with Regex some more.

Results (to German for typoking):

Nur um zu beweisen einen Punkt warf ich dies zusammen:) Es ist Ecken und Kanten, aber es wird eine ganze Menge Text umgehen und es tut so gut wie Google für die Genauigkeit der Übersetzungen, weil es die Google-API verwendet. Ich verarbeitet Apple's gesamte 2005 SEC 10-K Filing bei diesem Code und dem Klicken einer Taste (dauerte ca. 45 Minuten). Das Ergebnis war im wesentlichen identisch zu dem, was Sie erhalten würden, wenn Sie kopiert und eingefügt einem Satz in einer Zeit, in Google Translator. Es ist nicht perfekt (Endung Interpunktion ist nicht korrekt und ich wollte nicht in die Textdatei Zeile für Zeile) schreiben, aber es zeigt proof of concept. Es hätte besser Satzzeichen, wenn Sie mit Regex arbeitete einige mehr.

Diplomatic answered 27/4, 2010 at 21:8 Comment(3)
I wasn't disagreeing with you about Regex, I just didn't have time to put it in there. I did try the bit of code you gave me from that other question, but when it didn't work after I copied and pasted it I didn't investigate why, though I am sure it was something small. Paluas
The translation is horrible :-) Artemus
@Thomas - yes, I suspect it is. That was my point in trying to discourage this approach, but after seeing peeps post substandard suggestions and code it was my intention to show a clean way to do a dirty thing. Diplomatic
T
0

We used http://www.berlitz.co.uk/translation/.

We'd send them a database file with the English content, and a list of the languages we required, and they'd use various bilingual people to provide the translations. They also used voice-actors to provide WAV files for our telephone interface.

This was obviously not as fast as automated translation, and not free, but I think this sort of service is the only way to be sure your translation makes sense.

Threatt answered 14/9, 2010 at 7:57 Comment(1)
That is a very literal interpretation of the question, but it may be the best answer! The quality of machine translation is ... laughable (not even counting idioms). Pomatum
D
0

Google provides a useful tool, Google Translator Toolkit, which lets you upload whole files and translate them in one go into any language Google Translate supports. The automated translations are free, and there is an option to hire human translators to translate your documents for you.

From Wikipedia:

Google Translator Toolkit is a web application designed to allow translators to edit the translations that Google Translate automatically generates. With the Google Translator Toolkit, translators can organize their work and use shared translations, glossaries and translation memories. They can upload and translate Microsoft Word documents, OpenOffice.org, RTF, HTML, text, and Wikipedia articles.


Denouement answered 19/7, 2015 at 22:47 Comment(3)
It works! I've successfully uploaded a Dutch book with 153k words and translated it to English. Forb
Sadly no longer exists :'( Ralphralston
Google Translator Toolkit was shut down in December 2019. Pomatum
B
0

There are plenty of different machine translation APIs: Google, Microsoft, Yandex, IBM, PROMT, Systran, Baidu, YeeCloud, DeepL, SDL, and SAP.

Some of them support batch requests (translating an array of text at once). I would translate sentence by sentence with proper processing of 403/429 errors (usually used to respond to exceeded quota).
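The error handling suggested above might look roughly like this. It is a Python sketch under stated assumptions, not any particular vendor's client: `QuotaExceeded` is a hypothetical exception that your own wrapper would raise when the service answers with HTTP 403 or 429, and `translate` is whatever function performs one API call.

```python
import time


class QuotaExceeded(Exception):
    """Hypothetical error raised by your client wrapper on HTTP 403/429."""

    def __init__(self, status):
        super().__init__(f"quota exceeded (HTTP {status})")
        self.status = status


def translate_with_retry(sentences, translate, max_attempts=5,
                         base_delay=1.0, sleep=time.sleep):
    """Translate sentence by sentence, backing off when the quota is hit."""
    results = []
    for sentence in sentences:
        for attempt in range(max_attempts):
            try:
                results.append(translate(sentence))
                break
            except QuotaExceeded:
                if attempt == max_attempts - 1:
                    raise  # give up after the last attempt
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return results
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Whether retrying is allowed at all depends on the provider's terms and quota rules.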

I may refer you to our recent evaluation study (November 2017): State of machine translation

Beater answered 9/11, 2017 at 14:23 Comment(1)
Would their terms of service actually permit such large-scale use? Don't you risk being permanently banned? Pomatum
P
-1

It's pretty simple, and there are a few ways:

  • Use the API and translate the data in chunks that fit within the service's limits.
  • Write your own simple library to use HttpWebRequest and POST some data to it.

Here is an example (of the second one):

Method:

private String TranslateTextEnglishSpanish(String textToTranslate)
{
        HttpWebRequest http = WebRequest.Create("http://translate.google.com/") as HttpWebRequest;
        http.Method = "POST";
        http.ContentType = "application/x-www-form-urlencoded";
        http.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2 (.NET CLR 3.5.30729)";
        http.Referer = "http://translate.google.com/";

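        // URL-encode the text so characters such as & or = do not break the form body
        byte[] dataBytes = UTF8Encoding.UTF8.GetBytes(String.Format("js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&text={0}+&file=&sl=en&tl=es", Uri.EscapeDataString(textToTranslate)));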

        http.ContentLength = dataBytes.Length;

        using (Stream postStream = http.GetRequestStream())
        {
            postStream.Write(dataBytes, 0, dataBytes.Length);
        }

        HttpWebResponse httpResponse = http.GetResponse() as HttpWebResponse;
        if (httpResponse != null)
        {
            using (StreamReader reader = new StreamReader(httpResponse.GetResponseStream()))
            {
                //* Return translated Text
                return reader.ReadToEnd();
            }
        }

        return "";
}

Method Call:

String translatedText = TranslateTextEnglishSpanish("hello world");

Result:

translatedText == "hola mundo";

You just need to find the parameter values for each language and use them to get the translations you need.

You can get those values using the Live HTTP Headers add-on for Firefox.
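To switch languages, only the `sl` (source) and `tl` (target) fields of the form body need to change. A small Python sketch of building that body for any language pair is shown here; the field names are copied from the example above and may no longer be accepted by the site, and `build_form_body` is a made-up helper name.

```python
from urllib.parse import urlencode


def build_form_body(text, source_lang, target_lang):
    """Build a URL-encoded POST body using the fields from the example above."""
    fields = {
        "js": "y",
        "prev": "_t",
        "hl": "en",
        "ie": "UTF-8",
        "layout": "1",
        "eotf": "1",
        "text": text,          # urlencode escapes spaces, '&', '=' and so on
        "file": "",
        "sl": source_lang,     # source language code, e.g. "en"
        "tl": target_lang,     # target language code, e.g. "es"
    }
    return urlencode(fields).encode("utf-8")
```

For example, `build_form_body("hello world", "en", "de")` produces the bytes to write to the request stream for an English to German request.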

Parton answered 26/4, 2010 at 9:44 Comment(1)
What do you mean by "thous values"? "thousand values"? "thousand of values"? Something else? Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today). Pomatum
