If you're sending a corpus through the command-line interface, e.g.
xzcat corpus.sme.xz | sed 's/$/ ./' | apertium -f html-noent sme-nob > translated.nob.mt
then you can try simply
xzcat corpus.sme.xz | paste - translated.nob.mt
afterwards to get the input next to the output. That's assuming you want to split things on newlines. The sed appends a full stop to every line, which is there to ensure words aren't moved across newlines (rules tend not to move across sentence boundaries).
This will be fast, but it's a bit hacky and there are many edge cases.
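One of those edge cases is the two sides ending up with different line counts, in which case paste silently misaligns everything after the first mismatch. A minimal sketch of a guard, assuming both sides are available as plain files (the function name paste_if_aligned is made up for illustration):

```shell
# Hypothetical helper: paste two files side by side only if they have
# the same number of lines; otherwise report the mismatch and fail.
paste_if_aligned() {
    # tr strips the padding some wc implementations put before the count
    src_lines=$(wc -l < "$1" | tr -d ' ')
    mt_lines=$(wc -l < "$2" | tr -d ' ')
    if [ "$src_lines" -ne "$mt_lines" ]; then
        echo "line count mismatch: $1 has $src_lines, $2 has $mt_lines" >&2
        return 1
    fi
    paste "$1" "$2"
}
```

You would decompress the corpus first (xzcat corpus.sme.xz > corpus.sme) and then run paste_if_aligned corpus.sme translated.nob.mt. This only catches count mismatches, not lines that were merged and split elsewhere.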
If you want more control, one way would be to install the JSON API locally and send one request at a time.
If you've got a recent Debian/Ubuntu (or are using one of the apertium repos), you can get the API with
sudo apt install apertium-apy
sudo systemctl start apertium-apy # start it right now
sudo systemctl enable apertium-apy # let it start on next boot
And then you can translate like this:
$ echo 'Jeg liker ikke ansjos' | curl --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno'
{"responseDetails": null, "responseData": {"translatedText": "Eg likar ikkje ansjos"}, "responseStatus": 200}
(or from JavaScript with standard Ajax requests; some docs are at http://wiki.apertium.org/wiki/Apertium-apy/Debian and http://wiki.apertium.org/wiki/Apertium-apy#Usage )
Note that apertium-apy by default serves the pairs that are in /usr/share/apertium/modes; if you start it manually (instead of through systemctl) you can point it at a different path.
If you want to produce the JSON format you had in your example, the easiest way would be to use jq (sudo apt install jq), e.g.
$ orig="Jeg liker ikke ansjos"
$ echo "$orig" \
| curl -Ss --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno' \
| jq --arg orig "$orig" '{phrase: {original: $orig, translated: .responseData.translatedText}}'
{
"phrase": {
"original": "Jeg liker ikke ansjos",
"translated": "Eg likar ikkje ansjos"
}
}
or on a corpus:
xzcat corpus.nob.xz | while IFS= read -r orig; do
echo "$orig" \
| curl -Ss --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno' \
| jq --arg orig "$orig" '{phrase: {original: $orig, translated: .responseData.translatedText}}';
done
(A simple test of 500 lines showed this taking 23.7s wall clock time while the paste version took 5.5s.)
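One edge case worth checking in the jq step is an input line that contains double quotes or backslashes. A sketch that hands the original line to jq through its --arg option, so the line is treated as data rather than as part of the jq program; the function name phrase_json is made up here, and the canned response just mimics the shape shown above so that no server is needed to try it:

```shell
# Hypothetical helper: read one API response on stdin and emit the
# {"phrase": ...} object, taking the original line as the first argument.
phrase_json() {
    jq -c --arg orig "$1" \
        '{phrase: {original: $orig, translated: .responseData.translatedText}}'
}

# Canned response (an assumption, not real server output) with embedded quotes.
response='{"responseDetails": null, "responseData": {"translatedText": "Eg likar ikkje \"ansjos\""}, "responseStatus": 200}'
printf '%s\n' "$response" | phrase_json 'Jeg liker ikke "ansjos"'
```

In the corpus loop, the curl output would be piped into phrase_json "$orig" in place of the inline jq call.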