Transliteration script for the Linux shell
I have multiple .txt files containing text in one alphabet, and I want to transliterate the text into another alphabet. Some characters of alphabet1 map 1:1 onto characters of alphabet2 (e.g. a becomes e), whereas others map 1:2 (e.g. x becomes ch).

I would like to do this using a simple script for the Linux shell.

With tr or sed I can convert 1:1 characters:

sed 'y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/'

a will become n, b will become o, et cetera (a Caesar cipher, I think).

But how can I deal with 1:2 characters?

Tombaugh answered 16/8, 2014 at 8:46 Comment(0)
Y
5

Not an answer, just to show a briefer, idiomatic way to populate the table[] array from @konsolebox's answer as discussed in the related comments:

BEGIN {
    split("a  e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

so the mapping of old to new chars is clearly shown: each char in the first split() is mapped to the char(s) directly below it in the second. For any other mapping you just need to change the string(s) in the split() calls, not 26-ish explicit assignments to table[].

You can even create a general script to do mappings and just pass in the old and new strings as variables:

BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}

then in the shell, something like this:

old="a  e b"
new="x ch o"
awk -v o="$old" -v n="$new" -f script.awk file
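
For anyone who wants to try the generic version without creating script.awk first, here is a minimal sketch with the awk program inline. The function name translit and the sample input are my own, not from the answer. It assumes an awk that splits a record into single characters when FS is empty (gawk and mawk do).

```shell
# Hypothetical wrapper: translit OLD NEW [FILE...]
# OLD and NEW are blank-separated lists; item i of OLD maps to item i of NEW.
translit() {
    o=$1; n=$2; shift 2
    awk -v o="$o" -v n="$n" '
        BEGIN {
            count = split(o, old)        # default split: runs of blanks separate items
            split(n, new)
            for (i = 1; i <= count; i++)
                table[old[i]] = new[i]
            FS = OFS = ""                # one character per field, rejoined without gaps
        }
        {
            for (i = 1; i <= NF; ++i)
                if ($i in table)
                    $i = table[$i]
            print
        }
    ' "$@"
}

printf 'a bee\n' | translit "a  e b" "x ch o"    # prints: x ochch
```

Note that the variable names after -v have to match the names the script reads (o and n here), otherwise the table silently stays empty and the input comes back unchanged.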

and you can protect yourself from your own mistakes populating the strings, e.g.:

BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}

Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this but your call of course.
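
To see the checks fire, here is a small sketch that runs only the validation part from the shell. check_map is a made-up name and the sample lists are invented; the awk body is a trimmed copy of the checks above, without the warning for repeated new values.

```shell
# Hypothetical helper: check_map OLD NEW
# Exits non-zero (with a message on stderr) when the two lists are inconsistent.
check_map() {
    awk -v o="$1" -v n="$2" '
        BEGIN {
            numOld = split(o, old)
            numNew = split(n, new)
            if (numOld != numNew) {
                printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
                exit 1
            }
            for (i = 1; i <= numOld; i++) {
                if (old[i] in table) {   # same source char listed twice
                    printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
                    exit 1
                }
                table[old[i]] = new[i]
            }
        }
    '
}

check_map "a b"   "x y"   && echo "ok"         # prints: ok
check_map "a b a" "x y z" || echo "rejected"   # prints: rejected (after an ERROR line on stderr)
check_map "a b"   "x"     || echo "rejected"   # prints: rejected (after an ERROR line on stderr)
```

Because the program has only a BEGIN block, awk never waits for input, so this can run as a quick sanity check before the real transliteration.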

Here's one complete solution, as discussed in the comments below:

BEGIN {
    numOld = split("a  e b", old)
    numNew = split("x ch o", new)

    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }

    for (i=1; i <= numOld; i++) {
        if (old[i] in map) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }

    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}

I renamed the table array to map just because IMHO that better represents the purpose of the array.

Save the above in a file script.awk and run it as awk -f script.awk inputfile

Yellowish answered 17/8, 2014 at 14:26 Comment(5)
I tried your code again but it gives no output; maybe I'm missing something? What I did: copied the code into a new file called script.awk and ran the script as suggested. I get neither errors nor output. - Tombaugh
I just showed how to populate the mapping table differently; you still need the rest of the script @konsolebox posted to actually do something with that mapping. Hang on and I'll update it with a complete solution. - Yellowish
Now it outputs the same text as the input. I copied your new code into a new file, then in the shell I ran: echo "ae" | awk -f script.awk. The output was: ae - Tombaugh
I forgot to add the setting of FS and OFS when I put together the complete solution; updated now. - Yellowish
Now it works! Thank you very much; I like its ability to check for errors. - Tombaugh
D
5

Using Awk:

#!/usr/bin/awk -f
BEGIN {
    FS = OFS = ""
    table["a"] = "e"
    table["x"] = "ch"
    # and so on...
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in table) {
            $i = table[$i]
        }
    }
}
1

Usage:

awk -f script.awk file

Test:

# echo "the quick brown fox jumps over the lazy dog" | awk -f script.awk
the quick brown foch jumps over the lezy dog
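
Since the question mentions several .txt files, here is a minimal sketch of a loop around the same awk program. The function name translit_dir and the .translit output suffix are assumptions for illustration; the originals are left untouched.

```shell
# Hypothetical helper: translit_dir DIR
# Transliterates every DIR/*.txt into DIR/NAME.translit using the table from the answer.
translit_dir() {
    dir=$1
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue          # no matches: the glob stays literal, so skip it
        awk '
            BEGIN {
                FS = OFS = ""
                table["a"] = "e"
                table["x"] = "ch"
                # and so on...
            }
            {
                for (i = 1; i <= NF; ++i)
                    if ($i in table)
                        $i = table[$i]
            }
            1
        ' "$f" > "${f%.txt}.translit"
    done
}

# Usage: translit_dir /path/to/texts
```

Writing to a different suffix keeps the output out of the *.txt glob, so running the loop twice does not transliterate already converted text.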
Diagnose answered 16/8, 2014 at 8:57 Comment(4)
Perfect! Thanks very much! - Tombaugh
+1 but rather than populating the table explicitly, do this to save some redundant coding: split("a e x ch ...",t,/ /); for (i=1; i in t; i+=2) table[t[i]] = t[i+1]. - Yellowish
@EdMorton: thanks, but I couldn't make it work; in any case, I actually like the idea of populating the table explicitly (see my comment to @TomFenech). - Tombaugh
@mus_siluanus if you tell us in what way you "couldn't make it work" we can help you. Even if you don't use this now, it is the common awk idiom for populating arrays with initial values, so you probably will want to do it at some point. If you prefer, you can have 2 arrays populated one above the other. I'll add an answer so I can show you how that looks formatted. - Yellowish
F
2

This can be done quite concisely using a Perl one-liner:

perl -pe '%h=(a=>"xy",c=>"z"); s/(.)/defined $h{$1} ? $h{$1} : $1/eg'

or equivalently (thanks jaypal):

perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg'

%h is a hash containing the characters (keys) and their substitutions (values). s is the substitution command (as in sed). The g modifier means that the substitution is global and the e means that the replacement part is evaluated as an expression. It captures each character one by one and substitutes them with the value in the hash if it exists, otherwise keeps the original value. The -p switch means that each line in the input is automatically printed.

Testing it out:

$ perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg' <<<"abc"
xybz
Frere answered 16/8, 2014 at 16:20 Comment(4)
Thank you very much! I like the idea of using a one-liner. But I prefer @Diagnose's script because for long lists of substitutions (as in transliterations) his approach would give a cleaner view of what I'll do... sort of a beautiful embedded character map... - Tombaugh
@glenn thanks for the edit - I assume that the double quote in the middle of a=">xy" was a typo? It seemed to be working in the first instance, which I guess is just a symptom of using a one-liner. - Frere
Exactly, on both points. With use strict, one would see: Bareword "z" not allowed while "strict subs" in use - Coinsure
@TomFenech Can be reduced to perl -pe'%h=(a=>"xy",b=>"z");s|(.)|$h{$1}//=$1|eg' <<<"abc". //= was introduced after 5.8 so should work unless using ancient perl. - Errant
B
1

Using sed.

Write a file transliterate.sed containing:

s/a/e/g
s/x/ch/g

and then run from your command line to get the transliterated output.txt from input.txt:

sed -f transliterate.sed input.txt > output.txt

If you need this more often, consider adding #!/bin/sed -f as the first line and making your file executable with chmod 744 transliterate.sed, as described on the Wikipedia page for sed.
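
One thing to watch with a list of s commands: they run one after another on the same line, so a later rule also sees what an earlier rule wrote. That is harmless for the two rules above, but it matters as soon as a replacement contains a character that is itself a key. A small sketch with an invented pair of rules (a becomes x, x becomes ch), compared against a single-pass lookup in awk:

```shell
# Chained sed rules: the x produced by the first rule is rewritten by the second.
chained() { sed -e 's/a/x/g' -e 's/x/ch/g'; }

# Single pass: each input character is looked up exactly once.
single_pass() {
    awk '
        BEGIN { map["a"] = "x"; map["x"] = "ch" }
        {
            out = ""
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                out = out ((c in map) ? map[c] : c)
            }
            print out
        }
    '
}

printf 'ax\n' | chained       # prints: chch
printf 'ax\n' | single_pass   # prints: xch
```

If your table has such overlaps, either order the sed rules so that no rule's output is matched by a later rule, or use one of the single-pass approaches.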

Bugbane answered 26/4, 2019 at 11:54 Comment(0)
