Disclaimer: I find tokenizing suspect as a means of translation, but splitting on sentences, as later illustrated by ubiquibacon, may produce results that meet your requirements.
I suggested that his code could be improved by replacing the 30+ lines of string munging with the one-line regex he asked for in another question, but the suggestion was not well received.
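To make the "one-line regex" idea concrete, here is a minimal sketch of the sentence split on its own, without any translation call. The pattern is equivalent to the one used in the listings below; the class and method names (SentenceSplitter, Split) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class SentenceSplitter
{
    // Lazily match text up to '.', '!' or '?' that is followed by
    // whitespace or the end of the input.
    private static readonly Regex SentencePattern =
        new Regex(@"\b(?<sentence>.*?[.!?](?:\s|$))");

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();

        // Regex.Split also returns the text captured by groups, so every
        // matched sentence appears as its own element, interleaved with the
        // (usually empty) text found between matches.
        foreach (string piece in SentencePattern.Split(text))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                sentences.Add(trimmed);
            }
        }

        return sentences;
    }
}
```

Calling SentenceSplitter.Split("Just to prove a point. Does it work? Yes!") should yield three entries, one per sentence. The pattern is naive: abbreviations such as "e.g. " or "ca. " are treated as sentence ends, and a line with no terminal punctuation comes back as a single piece.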
Here is an implementation in C# and VB.NET using the Google API for .NET (google-api-for-dotnet).
File Program.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Google.API.Translate;

namespace TokenizingTranslatorCS
{
    internal class Program
    {
        private static readonly TranslateClient Client =
            new TranslateClient("http://code.google.com/p/google-api-for-dotnet/");

        private static void Main(string[] args)
        {
            // The file to translate is expected as the first argument.
            if (args.Length == 0)
            {
                Console.WriteLine("Usage: pass the path of the text file to translate as the first argument.");
                return;
            }

            Language originalLanguage = Language.English;
            Language targetLanguage = Language.German;
            string filename = args[0];
            StringBuilder output = new StringBuilder();
            string[] input = File.ReadAllLines(filename);

            foreach (string line in input)
            {
                List<string> translatedSentences = new List<string>();

                // Regex.Split returns the captured "sentence" group as well, so each
                // sentence is an element of the result, alongside some empty strings.
                string[] sentences = Regex.Split(line, "\\b(?<sentence>.*?[\\.!?](?:\\s|$))");

                foreach (string sentence in sentences)
                {
                    string sentenceToTranslate = sentence.Trim();
                    if (!string.IsNullOrEmpty(sentenceToTranslate))
                    {
                        // Translate the trimmed text, not the raw split result.
                        translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage));
                    }
                }

                // One output paragraph per input line, followed by a blank line.
                output.AppendLine(string.Format("{0}{1}", string.Join(" ", translatedSentences.ToArray()),
                                                Environment.NewLine));
            }

            Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, string.Join(Environment.NewLine, input));
            Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output);
            Console.WriteLine("{0}Press any key{0}", Environment.NewLine);
            Console.ReadKey();
        }

        private static string TranslateSentence(string sentence, Language originalLanguage, Language targetLanguage)
        {
            string translatedSentence = Client.Translate(sentence, originalLanguage, targetLanguage);
            return translatedSentence;
        }
    }
}
File Module1.vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions
Imports Google.API.Translate

Module Module1

    Private Client As TranslateClient = New TranslateClient("http://code.google.com/p/google-api-for-dotnet/")

    Sub Main(ByVal args As String())
        ' The file to translate is expected as the first argument.
        If args.Length = 0 Then
            Console.WriteLine("Usage: pass the path of the text file to translate as the first argument.")
            Return
        End If

        Dim originalLanguage As Language = Language.English
        Dim targetLanguage As Language = Language.German
        Dim filename As String = args(0)
        Dim output As New StringBuilder
        Dim input As String() = File.ReadAllLines(filename)

        For Each line As String In input
            Dim translatedSentences As New List(Of String)

            ' Regex.Split returns the captured "sentence" group as well, so each
            ' sentence is an element of the result, alongside some empty strings.
            Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")

            For Each sentence As String In sentences
                Dim sentenceToTranslate As String = sentence.Trim
                If Not String.IsNullOrEmpty(sentenceToTranslate) Then
                    ' Translate the trimmed text, not the raw split result.
                    translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage))
                End If
            Next

            ' One output paragraph per input line, followed by a blank line.
            output.AppendLine(String.Format("{0}{1}", String.Join(" ", translatedSentences.ToArray), Environment.NewLine))
        Next

        Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, String.Join(Environment.NewLine, input))
        Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output)
        Console.WriteLine("{0}Press any key{0}", Environment.NewLine)
        Console.ReadKey()
    End Sub

    Private Function TranslateSentence(ByVal sentence As String, ByVal originalLanguage As Language, ByVal targetLanguage As Language) As String
        Dim translatedSentence As String = Client.Translate(sentence, originalLanguage, targetLanguage)
        Return translatedSentence
    End Function

End Module
Input (the text being translated):
Just to prove a point I threw this together :) It is rough around the edges, but it will handle a WHOLE lot of text and it does just as good as Google for translation accuracy because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code and the click of one button (took about 45 minutes). The result was basically identical to what you would get if you copied and pasted one sentence at a time into Google Translator. It isn't perfect (ending punctuation is not accurate and I didn't write to the text file line by line), but it does show proof of concept. It could have better punctuation if you worked with Regex some more.
Results (to German for typoking):
Nur um zu beweisen einen Punkt warf ich dies zusammen:) Es ist Ecken und Kanten, aber es wird eine ganze Menge Text umgehen und es tut so gut wie Google für die Genauigkeit der Übersetzungen, weil es die Google-API verwendet. Ich verarbeitet Apple's gesamte 2005 SEC 10-K Filing bei diesem Code und dem Klicken einer Taste (dauerte ca. 45 Minuten). Das Ergebnis war im wesentlichen identisch zu dem, was Sie erhalten würden, wenn Sie kopiert und eingefügt einem Satz in einer Zeit, in Google Translator. Es ist nicht perfekt (Endung Interpunktion ist nicht korrekt und ich wollte nicht in die Textdatei Zeile für Zeile) schreiben, aber es zeigt proof of concept. Es hätte besser Satzzeichen, wenn Sie mit Regex arbeitete einige mehr.