Torchtext 0.7 shows Field is being deprecated. What is the alternative?

It looks like the previous paradigm of declaring Fields and Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I can't seem to find an example of the new paradigm for custom datasets (as in, not the ones included in torchtext.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?

Reference for deprecation:

https://github.com/pytorch/text/releases

Spec answered 22/8, 2020 at 18:37 Comment(0)

It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:

from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)

or like so for custom built datasets:

from torch.utils.data import DataLoader

# `train` here is any map-style dataset of (label, text) pairs,
# e.g. the AG_NEWS split returned above
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)

I've copied the examples from the Source

Fulmer answered 14/11, 2020 at 0:4 Comment(4)
Hi Steven, thank you. Anyway, did you find any snippet on how to build the vocab, tokenization, etc.? – Farrish
@SatrioAdiPrabowo Personally, I would suggest using Hugging Face. It is currently the de facto standard for almost all things NLP, from building vocabularies to tokenization and even models. Alternatively, you can create your own, which is more work. – Fulmer
This is a bit late, but I do think this answers the question asked. It seems some of the preprocessing functionality I was hoping for around Vocab/tokenization just isn't baked in as I might have hoped. – Spec
Hey Paco, maybe ask more generally next time. Asking what the alternative is, in reference to the deprecation, rather implies you want to keep using torchtext and want the non-deprecated alternative within torchtext, as opposed to the more general question of what you should use for preprocessing and NLP work. Note that asking too broad a question can get it locked on Stack Overflow. – Fulmer

Browsing through torchtext's GitHub repo, I stumbled upon the README in the legacy directory, which is not documented in the official docs. The README links to a GitHub issue that explains the rationale behind the change, as well as a migration guide.

If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:

# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset

Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:

import torchtext.legacy as torchtext
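
With either import, existing Field-based code then runs unchanged. A minimal sketch under that assumption (the CSV path and the "text"/"label" column names are made up for illustration):

from torchtext.legacy.data import Field, LabelField, TabularDataset, BucketIterator

TEXT = Field(lower=True, batch_first=True)   # default whitespace tokenizer
LABEL = LabelField()

# hypothetical CSV with a header row and "text"/"label" columns
train_ds = TabularDataset(path='train.csv', format='csv', skip_header=True,
                          fields=[('text', TEXT), ('label', LABEL)])

TEXT.build_vocab(train_ds, min_freq=2)
LABEL.build_vocab(train_ds)

train_iter = BucketIterator(train_ds, batch_size=32,
                            sort_key=lambda ex: len(ex.text), sort_within_batch=True)

for batch in train_iter:
    x, y = batch.text, batch.label   # x: [batch, seq_len], y: [batch]
    break
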
Elagabalus answered 13/3, 2021 at 16:3 Comment(0)

There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses TextClassificationDataset together with a collator and other preprocessing. It reads a txt file, builds a dataset, and then builds a model. The post, which links to a complete working notebook, is at https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. Note that you need the 'dev' (nightly) build of PyTorch for it to work.

From the link above:

After tokenization and building the vocabulary, you can build the dataset as follows:

import torch
# torchtext experimental API used by the post (import paths may differ slightly between versions)
from torchtext.experimental.datasets.text_classification import TextClassificationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor

def data_to_dataset(data, tokenizer, vocab):
    data = [(text, label) for (text, label) in data]

    # numericalize the text: tokenize -> map tokens to vocab ids -> LongTensor
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    # map the string labels '0'/'1' to ints and convert to LongTensor
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long))

    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset
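
For reference, the tokenizer and vocab that data_to_dataset expects can be built beforehand. Here is a minimal sketch of one way to do it (this part is not from the linked post): it uses torchtext's basic_english tokenizer and the Counter-based Vocab, plus a small illustrative wrapper class because data_to_dataset calls tokenizer.tokenize:

from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab

class BasicTokenizer:
    # thin wrapper so the object exposes the .tokenize method expected above (illustrative)
    def __init__(self):
        self._tok = get_tokenizer('basic_english')

    def tokenize(self, text):
        return self._tok(text)

tokenizer = BasicTokenizer()

# data is a list of (text, label) pairs
counter = Counter()
for text, _label in data:
    counter.update(tokenizer.tokenize(text))
vocab = Vocab(counter, min_freq=1)

dataset = data_to_dataset(data, tokenizer, vocab)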

The collator is as follows:

import torch
import torch.nn as nn

class TextCollator:  # class name is illustrative; the post wraps these two methods in a collator class
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx

    def collate(self, batch):
        # batch is a list of (text_tensor, label) pairs produced by the dataset above
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        # pad every sequence in the batch to the longest one, using the pad index
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels

Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
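
For example (a minimal sketch: TextCollator is the illustrative collator class above, and the pad index lookup assumes the vocab contains a '<pad>' token):

from torch.utils.data import DataLoader

pad_idx = vocab['<pad>']   # assumes '<pad>' is one of the vocab's specials
collator = TextCollator(pad_idx)

train_loader = DataLoader(dataset, batch_size=32, shuffle=True,
                          collate_fn=collator.collate)

for texts, labels in train_loader:
    # texts: LongTensor [batch, longest_seq_in_batch], labels: LongTensor [batch]
    break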

Jacindajacinta answered 20/3, 2021 at 16:10 Comment(2)
Hi! Could you please describe the approach thoroughly instead of just describing what is in the notebook? This way, if the notebook ever becomes unavailable, people will still be able to benefit from your answer :) – Brigandage
@Proko I added the important code segment. – Jacindajacinta

Well, it seems like the pipeline could look like this:

    import torchtext as TT
    import torch
    from collections import Counter
    from torch.utils.data import DataLoader
    from torchtext.vocab import Vocab

    # read the data

    with open('text_data.txt','r') as f:
        data = f.readlines()
    with open('labels.txt', 'r') as f:
        labels = f.readlines()

    
    tokenizer = TT.data.utils.get_tokenizer('spacy', 'en') # can remove 'spacy' and use a simple built-in tokenizer
    train_iter = zip(labels, data)
    counter = Counter()
    
    for (label, line) in train_iter:
        counter.update(tokenizer(line))
        
    vocab = TT.vocab.Vocab(counter, min_freq=1)

    text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
    # this is data-specific - adapt for your data
    label_pipeline = lambda x: 1 if x == 'positive\n' else 0
    
    class TextData(torch.utils.data.Dataset):
        '''
        very basic dataset for processing text data
        '''
        def __init__(self, labels, text):
            super(TextData, self).__init__()
            self.labels = labels
            self.text = text
            
        def __getitem__(self, index):
            return self.labels[index], self.text[index]
        
        def __len__(self):
            return len(self.labels)
    
    
    def tokenize_batch(batch, max_len=200):
        '''
        Collate function used in the DataLoader.
        Takes a batch from the text dataset and produces a tensor batch,
        converting text and labels through the tokenizer and labeler:
        the tokenizer is the global function text_pipeline,
        the labeler is the global function label_pipeline.
        max_len is a fixed length: if a text is shorter than max_len it is
        padded with ones (the pad index); if it is longer, it is truncated
        from the beginning so that only the last max_len tokens are kept.
        '''
        labels_list, text_list = [], []
        for _label, _text in batch:
            labels_list.append(label_pipeline(_label))
            text_holder = torch.ones(max_len, dtype=torch.int32) # fixed size tensor of max_len
            processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text))
            text_holder[-pos:] = processed_text[-pos:]
            text_list.append(text_holder.unsqueeze(dim=0))
        return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
    
    train_dataset = TextData(labels, data)
    
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
    
    lbl, txt = next(iter(train_loader))
Enslave answered 20/4, 2021 at 18:56 Comment(0)
