High Precision Word Alignment Algorithm in Python

I am working on a project for building a high precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence into the corresponding words in the target sentence, transparently and accurately given these restrictions:

  • the two languages do not have the same word order, and the order keeps changing
  • some words in the source sentence do not have corresponding words in the target sentence, and vice versa
  • sometimes a word in the source corresponds to multiple words in the target, and vice versa, and there can be many-to-many mappings
  • the same word can be used multiple times in one sentence, so the alignment needs to be done on words together with their indexes, not on word forms alone

Here is what I did:

  • Start with a list of sentence pairs, say English-German, with each sentence tokenized to words
  • Index all the words in each sentence, and create an inverted index for each word (e.g. the word "world" occurs in sentences #5, 16, 19, 26, etc.), for both source and target words
  • Now this inverted index can predict the correlation between any source word and any target word, as the size of the intersection of their sentence sets divided by the size of their union. For example, if the target word "Welt" occurs in sentences 5, 16, 26, 32, the correlation between (world, Welt) is the number of indexes in the intersection (3) divided by the number of indexes in the union (5), hence 0.6. Dividing by the union keeps the correlation low for high-frequency words, such as "the", and for their counterparts in other languages (see the sketch after this list)
  • Iterate over all sentence pairs again, and use the indexes of the source and target words of each sentence pair to build a correlation matrix
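
To make this correlation step concrete, here is a minimal sketch of the inverted-index construction and the intersection-over-union computation described above (the function names are mine, not from any library):

from collections import defaultdict

def build_inverted_index(sentences):
    """Map each word to the set of sentence ids it occurs in."""
    index = defaultdict(set)
    for sent_id, words in enumerate(sentences):
        for word in words:
            index[word].add(sent_id)
    return index

def correlation(src_index, trg_index, src_word, trg_word):
    """Intersection-over-union of the two words' sentence-id sets."""
    src_ids = src_index.get(src_word, set())
    trg_ids = trg_index.get(trg_word, set())
    union = src_ids | trg_ids
    if not union:
        return 0.0
    return len(src_ids & trg_ids) / len(union)

For the (world, Welt) example above, src_ids would be {5, 16, 19, 26} and trg_ids would be {5, 16, 26, 32}, giving 3/5 = 0.6.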

Here is an example of a correlation matrix between an English and a German sentence. We can see the challenges discussed above.

[Image: the correlation matrix between an English and a German sentence, showing the correlations between words; the green cells are the correct alignment points that should be identified by the word-alignment algorithm.]

Here is some of what I tried:

  • In some cases the intended alignment is simply the word pair with the highest correlation in both its row and its column, but in many cases it is not.
  • I have tried things like Dijkstra's algorithm to draw a path connecting the alignment points, but it doesn't seem to work this way: because of the changing word order, the path can jump back and forth to earlier words in the sentence, and there is no sensible way to skip words for which there is no alignment.
  • I think the optimal solution will involve something like expanding rectangles which start from the most likely correspondences, span many-to-many correspondences, and skip words with no alignment, but I'm not exactly sure what a good way to implement this would be (a rough sketch follows this list)
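
To illustrate the rectangle idea, here is a rough, speculative sketch of my own (not an established algorithm; the mean-correlation test and the threshold value are arbitrary assumptions): start from a high-correlation seed cell and grow the block one row or column at a time while the added cells stay correlated.

import numpy as np

def expand_rectangle(corr, seed, threshold=0.3):
    # corr: 2-D array of correlations (rows = source words, cols = target words)
    # seed: (i, j) starting cell, e.g. the highest-correlation pair
    i0, i1 = seed[0], seed[0] + 1  # current row span [i0, i1)
    j0, j1 = seed[1], seed[1] + 1  # current column span [j0, j1)
    grew = True
    while grew:
        grew = False
        # try to extend each side while the mean correlation of the
        # newly added row/column stays above the threshold
        if i0 > 0 and corr[i0 - 1, j0:j1].mean() >= threshold:
            i0 -= 1; grew = True
        if i1 < corr.shape[0] and corr[i1, j0:j1].mean() >= threshold:
            i1 += 1; grew = True
        if j0 > 0 and corr[i0:i1, j0 - 1].mean() >= threshold:
            j0 -= 1; grew = True
        if j1 < corr.shape[1] and corr[i0:i1, j1].mean() >= threshold:
            j1 += 1; grew = True
    # every cell in the final block is treated as a many-to-many alignment point
    return [(i, j) for i in range(i0, i1) for j in range(j0, j1)]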

Here is the code I am using:

import random

src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]

def match_indexes(word1, word2):
    return random.random()  # replace this with the actual correlation value

# list of all source (src) and target (trg) index pairs and their correlation values
all_pairs_vals = []
for i, src_word in enumerate(src_words):      # iterate over src indexes
    for j, trg_word in enumerate(trg_words):  # iterate over trg indexes
        # get the matching value from the inverted indexes of each word
        # (or from the data provided in the spreadsheet)
        val = match_indexes(src_word, trg_word)
        all_pairs_vals.append((i, j, val))  # record src index, trg index, and val

# sort in descending order, so the pairs with the highest correlation come first
all_pairs_vals.sort(key=lambda x: -x[-1])

selected_alignments = []
used_i, used_j = set(), set()  # src (row) and trg (column) indexes already used
for i0, j0, val0 in all_pairs_vals:
    if i0 in used_i: continue  # skip if this src index has been aligned already
    if j0 in used_j: continue  # same if this trg index has been aligned already
    selected_alignments.append((i0, j0))  # otherwise accept the pair
    used_i.add(i0)  # and mark both indexes as used so they are not reused
    used_j.add(j0)

for a in all_pairs_vals:  # list all pairs and indicate which ones were selected
    i0, j0, val0 = a
    if (i0, j0) in selected_alignments:
        print(a, "<<<<")
    else:
        print(a)

It's problematic because it doesn't accommodate many-to-many, or even one-to-many, alignments, and it can err right at the start by selecting a wrong pair with the highest correlation, excluding its row and column from all future selections. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider its proximity to other pairs with high correlations.
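
One possible relaxation is sketched below; it assumes (my assumption, not an established method) that good alignment points should be close to both their row maximum and their column maximum, with alpha as an arbitrary cutoff:

import numpy as np

def mutual_threshold_alignments(corr, alpha=0.9):
    # keep every (i, j) whose correlation is within a factor alpha of both
    # its row maximum and its column maximum; unlike the greedy one-to-one
    # selection above, this can return one-to-many and many-to-many links
    corr = np.asarray(corr, dtype=float)
    row_max = corr.max(axis=1, keepdims=True)  # best value in each source row
    col_max = corr.max(axis=0, keepdims=True)  # best value in each target column
    keep = (corr >= alpha * row_max) & (corr >= alpha * col_max)
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(keep))]

This keeps a pair that is strong relative to its competitors even if it is not the strict maximum, which also softens the early-mistake problem of the greedy selection.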

Here is some data to try if you like, it's in Google sheets: https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing

Tyro answered 6/1, 2020 at 16:36 Comment(0)

I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT) and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example since Arabic is a morphology-rich language, and I believe it would be more challenging than a Latin-based language such as German.

[Image: an Arabic-English word alignment example produced by Awesome-Align, where one Arabic word aligns to multiple English words.]

As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.

There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN

Good luck!

Fanya answered 16/3, 2021 at 19:20 Comment(2)
This is actually awesome! Thank you for your suggestion, I started using it and it is exactly what I needed! – Tyro
This is great to hear @hmghaly! Good luck! – Fanya

Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf
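
For intuition, the heart of the simplest of those models (IBM Model 1) is just an expectation-maximization loop over fractional co-occurrence counts. A minimal sketch, omitting the NULL word and the refinements the paper discusses:

from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    # sentence_pairs: list of (src_words, trg_words) token-list tuples
    # returns t[(s, w)], an estimate of P(target word w | source word s)
    t = defaultdict(lambda: 1.0)  # start from uniform-ish scores
    for _ in range(iterations):
        count = defaultdict(float)  # expected pair counts
        total = defaultdict(float)  # expected source-word counts
        for src, trg in sentence_pairs:
            for w in trg:
                norm = sum(t[(s, w)] for s in src)
                for s in src:
                    delta = t[(s, w)] / norm  # E-step: fractional alignment count
                    count[(s, w)] += delta
                    total[s] += delta
        for (s, w), c in count.items():
            t[(s, w)] = c / total[s]  # M-step: renormalize
    return t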

There are a lot of existing approaches you could take.

This is a very difficult machine learning problem, and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs in this field come from surprisingly simple techniques, so who knows :-)

Frisbee answered 6/1, 2020 at 16:51 Comment(4)
Thanks so much, these are really interesting directions, some of which I wasn't aware of, so I'll definitely check them out :) I'm hoping though to find some purely algorithmic solution to the problem, let's see – Tyro
By purely algorithmic solution to the problem do you mean not a machine learning solution? – Integumentary
Yes, something like Dijkstra's algorithm that can identify a path through the points with the highest overall correlation, or some sorting technique to do so – Tyro
Just as an FYI, this is an interesting paper that discusses saliency-driven alignment extraction from NMT models. Although Giza++ still outperforms all systems, the proposed approach is definitely interesting. – Sofa

Recently, there have also been two papers using bilingual/multilingual word or contextual embeddings to do word alignment. Both of them construct a bipartite graph whose edges are weighted by the embedding distances between words, and use graph algorithms to get the alignment.

One paper does a maximum matching between the two sides of the graph. Because the matching is not symmetrical, they do it from both directions and use symmetrization heuristics similar to FastAlign's.

The other one mentions the alignment only briefly: it computes a minimum-weight edge cover on the graph and uses that as the alignment.

Both of them claim to be better than FastAlign.
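
As a concrete starting point for the matching variant, SciPy's linear_sum_assignment solves the maximum-weight bipartite matching on a similarity matrix. A minimal sketch, assuming the similarity matrix (e.g. cosine similarities between embeddings) is already computed:

import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_alignment(sim):
    # sim: 2-D array of word-to-word similarities (rows = source, cols = target)
    # linear_sum_assignment minimizes total cost, so negate the similarities
    rows, cols = linear_sum_assignment(-np.asarray(sim))
    return set(zip(rows.tolist(), cols.tolist()))

This yields a one-to-one matching; the papers combine such directional results with symmetrization heuristics to recover one-to-many links.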

Summand answered 10/1, 2020 at 9:43 Comment(3)
Thank you so much, these look useful too – Tyro
The first paper got developed and published on arXiv. The aligner described is now called SimAlign. – Glaucoma
@AndriyMakukha Thanks! This looks really promising! – Tyro

As the question is specifically addressing Python implementations, and Giza++ and FastAlign still seem to represent the state of the art, it is worth looking at what the Python ecosystem offers.

Most research code on the topic nowadays comes in Python and is based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether these outperform the older models and, if so, for which applications. In the end, you need to strike a compromise between context awareness (reordering!), precision, recall, and runtime. Neural models have great potential for being more context aware, while statistical models have more predictable behavior.

Pedropedrotti answered 25/9, 2021 at 10:07 Comment(0)
