How to save Python NLTK alignment models for later use?

In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially when done over considerable corpora. It would be nice to do alignments in batch one day and use those alignments later on.

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once I create a model, how can I (1) save it to disk and (2) reuse it later?

Rijeka answered 12/5, 2015 at 15:25 Comment(0)
T
8

The immediate answer is to pickle it, see https://wiki.python.org/moin/UsingPickle
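As a minimal sketch of the basic pattern (standard library only, no NLTK involved), the snippet below dumps an object to a file opened in binary mode and loads it back. The names `save_object`, `load_object` and `toy_model` are hypothetical and used for illustration only; a plain dict stands in for a trained model.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path; pickle files must be opened in binary mode."""
    with open(path, "wb") as fout:
        pickle.dump(obj, fout, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


# Hypothetical stand-in for a trained model: a plain nested dict.
toy_model = {"house": {"Haus": 0.9}, "the": {"das": 0.4}}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pk")
save_object(toy_model, path)
reloaded = load_object(path)
```

Note that the object itself is passed to `pickle.dump`, not the return value of a training call, and that the file is opened with `'wb'` rather than `'w'`.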

But because IBMModel1 relies on a lambda function internally (its probabilities are kept in a defaultdict whose default factory is a lambda), it's not possible to pickle it with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104)

So we'll use dill instead. First, install dill (see the related question "Can Python pickle lambda functions?"):

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

If you try to pickle the IBMModel1 object, which holds a lambda function internally, with the standard cPickle, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the above code snippet comes from NLTK version 3.0.0)

In Python 3 with NLTK 3.0.0, you will face the same problem with the standard pickle, because the IBMModel1 object holds a lambda function created in train() (see the traceback below); dill handles it:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: In Python 3, the pickle module already uses the C implementation that was called cPickle in Python 2, see http://docs.pythonsprints.com/python3_porting/py-porting.html)

Turcotte answered 13/5, 2015 at 12:18 Comment(4)
I'm not sure what you tried, but I saw no lambdas and had no problems pickling and unpickling the "model" with vanilla pickle. - Adhamh
@Adhamh That's interesting, did you get the same error as the updated answer? - Turcotte
Haven't had a chance to try it yet; but I might have tested pickling with Python 2, which would explain the different experience (I hadn't yet realized the module had changed so much). I'll let you know when I try it. - Adhamh
I took another look with Python 3. The class constructor does not return a lambda function, and neither does train(). But the model is stored in a defaultdict defined with a lambda (in the usual manner), and defaultdicts using a lambda cannot be pickled. The class can very easily be made picklable, but not without modifying the module source. (Just use module-local functions instead of lambdas.) - Adhamh
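The point made in the last comment can be reproduced without NLTK. The sketch below is an illustration only: `table` is a hypothetical stand-in for a probability table, not NLTK's actual data structure, and `try_pickle` is a helper invented for this example. It shows that the standard pickle rejects a defaultdict whose default factory is a lambda, and that copying the contents into plain dicts is one possible workaround (at the cost of losing the default-value behaviour).

```python
import pickle
from collections import defaultdict

# Hypothetical stand-in for a translation table; the default factories are lambdas.
table = defaultdict(lambda: defaultdict(lambda: 1e-12))
table["house"]["Haus"] = 0.9
table["the"]["das"] = 0.4


def try_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# The lambda default factory cannot be pickled, so this is False.
lambda_ok = try_pickle(table)

# Workaround: copy the contents into plain nested dicts, which pickle handles.
plain = {src: dict(targets) for src, targets in table.items()}
restored = pickle.loads(pickle.dumps(plain))

# Plain dicts have no default value, so look up unseen words with .get().
unseen = restored.get("garden", {}).get("Garten", 1e-12)
```

If the default-value behaviour matters after reloading, dill (as in the answer above) or a named module-level factory function, as the comment suggests, avoids the conversion step.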
A
3

You discuss saving the aligner model, but your question seems to be more about saving the bitexts that you have aligned: "It would be nice to do alignments in batch one day and use those alignments later on." That is the question I'm going to answer.

In the NLTK environment, the best way to use a corpus-like resource is to access it with a corpus reader. The NLTK doesn't come with corpus writers, but the format supported by the NLTK's AlignedCorpusReader is very easy to generate: (NLTK 3 version)

model = ibm(biverses, 20)  # As in your question

# Write three lines per sentence pair: words, mots, alignment
with open("folder/newalignedtext.txt", "w") as out:
    for pair in biverses:
        asent = model.align(pair)
        out.write(" ".join(asent.words) + "\n")
        out.write(" ".join(asent.mots) + "\n")
        out.write(str(asent.alignment) + "\n")

That's it. You can later reload and use your aligned sentences exactly as you'd use the comtrans corpus:

from nltk.corpus.reader import AlignedCorpusReader

mycorpus = AlignedCorpusReader(r"folder", r".*\.txt")
biverses_reloaded = mycorpus.aligned_sents()

As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is pretty useless unless you want to study the embedded probabilities.
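To make the file layout concrete, here is a small sketch of the three-lines-per-pair format written by the code above, using only the standard library. It assumes that alignments are written as space-separated `i-j` index pairs (as in the comtrans files); `write_aligned`, `read_aligned` and the sample `pairs` are hypothetical names for illustration and are not part of NLTK. In practice the AlignedCorpusReader does the reading for you.

```python
import os
import tempfile

# Hypothetical sample data: (words, mots, alignment index pairs).
pairs = [
    (["Wiederaufnahme", "der", "Sitzungsperiode"],
     ["Resumption", "of", "the", "session"],
     [(0, 0), (1, 1), (1, 2), (2, 3)]),
]


def write_aligned(path, pairs):
    """Write each pair as three lines: words, mots, alignment."""
    with open(path, "w", encoding="utf-8") as out:
        for words, mots, alignment in pairs:
            out.write(" ".join(words) + "\n")
            out.write(" ".join(mots) + "\n")
            out.write(" ".join("%d-%d" % (i, j) for i, j in alignment) + "\n")


def read_aligned(path):
    """Read the three-line records back into (words, mots, alignment) tuples."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    result = []
    for k in range(0, len(lines) - 2, 3):
        words = lines[k].split()
        mots = lines[k + 1].split()
        alignment = [tuple(int(x) for x in p.split("-"))
                     for p in lines[k + 2].split()]
        result.append((words, mots, alignment))
    return result


path = os.path.join(tempfile.mkdtemp(), "newalignedtext.txt")
write_aligned(path, pairs)
reloaded = read_aligned(path)
```

Because the result is plain text, it can be inspected or edited by hand, which is not possible with a pickle file.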

Comment: I'm not sure I would call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text; it doesn't even have an align() method. In NLTK 3 the function align() can align new text, but only if used from Python 2; in Python 3 it is broken, apparently because of the tightened rules for comparing objects of different types. If you nevertheless want to be able to pickle and reload the aligner, I'll be happy to add it to my answer; from what I've seen it can be done with vanilla cPickle.

Adhamh answered 21/5, 2015 at 15:52 Comment(9)
The aligner function is a model because it learns the probability for every target language word given a source language word. Although it's possible to store that as a big hash table, it's the author of the code who has decided to store it as a lambda function that returns a defaultdict. github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 - Turcotte
So given the learnt probability, it is possible to assign the probabilities to new data; that's why it's called a model. However, I agree with you that it's not natural to save the model because, given new data, you could simply rebuild the probability model. See cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf for the theoretical explanation. - Turcotte
BTW, nltk.align is not broken in Python 3. - Turcotte
Which version are you using? I get an exception with every input I tried outside my very short training data set, and it is caused by the switch from Python 2 to 3. - Adhamh
Sure, it's a model even if it's limited to the training data; but it's not much of one if it can't be used for anything. It does make a lot of sense to train a model on a huge corpus, save it, and use it for small jobs later. But I'm using NLTK 3.02 with Python 3.4, and it's broken. - Adhamh
Try this: alldata = comtrans.aligned_sents("alignment-de-en.txt"); traindata = alldata[:10]; model = IBMModel1(traindata, 20); model.align(alldata[15]) - Adhamh
It's not because the library is not working in Python 3. It's because you're giving it too little data to train on. Try with traindata = alldata[:100]. IBM Models require a huge dataset to achieve reasonable results. Even 100 sentence pairs are too few. Usually a model is trained on millions of sentence pairs and training takes hours or days; that's why the OP needs it pickled when it finishes, so that he can reuse it later without retraining. - Turcotte
I'm not expecting reasonable results; if you give it a huge dataset, you're just making the chance of an exception lower. The code is broken (though if the bug hits rarely enough, I guess you're right that it's usable). - Adhamh
As my answer makes clear, I'm not too sure what the OP is really after. Hope to hear from him. (But for future readers, pickling a usable model is of course definitely useful.) - Adhamh
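To illustrate what "assigning the learnt probabilities to new data" means in the discussion above, here is a simplified sketch that aligns each target word to the source word with the highest stored probability. It is not NLTK's implementation: `align_words`, the layout `table[source][target]`, the `min_prob` threshold and the toy numbers are all assumptions made for this example, and the NULL word of the IBM models is ignored.

```python
def align_words(source_words, target_words, table, min_prob=1e-12):
    """Return (source_index, target_index) pairs, one per alignable target word.

    table[source][target] is assumed to hold P(target | source).
    """
    alignment = []
    for j, target in enumerate(target_words):
        best_i, best_p = None, min_prob
        for i, source in enumerate(source_words):
            # Unseen word pairs get probability 0.0 instead of raising KeyError.
            p = table.get(source, {}).get(target, 0.0)
            if p > best_p:
                best_i, best_p = i, p
        if best_i is not None:
            alignment.append((best_i, j))
    return alignment


# Hypothetical probabilities, as they might look after training and reloading.
table = {
    "das": {"the": 0.7, "house": 0.05},
    "Haus": {"house": 0.9, "the": 0.1},
}

result = align_words(["das", "Haus"], ["the", "house"], table)
unknown = align_words(["das", "Haus"], ["garden"], table)
```

Words that never occurred in training simply stay unaligned here, which is one way to avoid failing on input outside the training data.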
G
1

If you want (and it looks like you do), you can store it as a list of AlignedSent objects:

from nltk.align import IBMModel1 as IBM
from nltk.align import AlignedSent
import dill as pickle

biverses = [...]  # list of AlignedSent objects
model = IBM(biverses, 20)

# Store each computed alignment on its AlignedSent object
for sent in biverses:
    sent.alignment = model.align(sent).alignment

After that, you can save it with dill as pickle:

with open('alignedtext.pk', 'wb') as arquive:
    pickle.dump(biverses, arquive)
Garber answered 2/6, 2015 at 1:45 Comment(0)
G
0

joblib can also save a trained NLTK model, e.g.:

from nltk.lm import MLE
import joblib

model = MLE(2)  # bigram model; the order is the first argument
model.fit(train_data, padded_sents)

# save model
with open(model_path, 'wb') as fout:
    joblib.dump(model, fout)

# load model
model = joblib.load(model_path)
Guaiacol answered 21/2, 2022 at 8:56 Comment(0)
