How does the PyTorch DataLoader handle variable-size data?

I have a dataset that looks like the one below. That is, the first item on each line is the user id, followed by the set of items clicked by that user.

0   24104   27359   6684
0   24104   27359
1   16742   31529   31485
1   16742   31529
2   6579    19316   13091   7181    6579    19316   13091
2   6579    19316   13091   7181    6579    19316
2   6579    19316   13091   7181    6579    19316   13091   6579
2   6579    19316   13091   7181    6579
4   19577   21608
4   19577   21608
4   19577   21608   18373
5   3541    9529
5   3541    9529
6   6832    19218   14144
6   6832    19218
7   9751    23424   25067   12606   26245   23083   12606

I define a custom dataset to handle my click log data.

import torch.utils.data as data
class ClickLogDataset(data.Dataset):
    def __init__(self, data_path):
        self.data_path = data_path
        self.uids = []
        self.streams = []

        with open(self.data_path, 'r') as fdata:
            for row in fdata:
                row = row.strip('\n').split('\t')
                self.uids.append(int(row[0]))
                self.streams.append(list(map(int, row[1:])))

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid, stream = self.uids[idx], self.streams[idx]
        return uid, stream

Then I use a DataLoader to retrieve mini batches from the data for training.

from torch.utils.data.dataloader import DataLoader
clicklog_dataset = ClickLogDataset(data_path)
clicklog_data_loader = DataLoader(dataset=clicklog_dataset, batch_size=16)

for uid_batch, stream_batch in clicklog_data_loader:
    print(uid_batch)
    print(stream_batch)

The code above does not return what I expected: I want stream_batch to be a 2D integer tensor with 16 rows (one per sample in the batch). However, what I get is a list of 1D tensors of length 16, and the list has only one element, like below. Why is that?

#stream_batch
[tensor([24104, 24104, 16742, 16742,  6579,  6579,  6579,  6579, 19577, 19577,
        19577,  3541,  3541,  6832,  6832,  9751])]
Treed answered 7/3, 2019 at 10:8 Comment(1)
cross posted: quora.com/unanswered/…Martin

So how do you handle the fact that your samples are of different lengths? torch.utils.data.DataLoader has a collate_fn parameter which is used to transform a list of samples into a batch. By default it just zips your lists together element-wise, which is what you are seeing. You can write your own collate_fn, which for instance zero-pads the input, truncates it to some predefined length, or applies any other operation of your choice.
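
For instance, a minimal zero-padding collate_fn for the (uid, stream) samples in the question could look like the sketch below (names reused from the question; this is just one of many reasonable choices):

import torch
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (uid, stream) pairs as returned by ClickLogDataset.__getitem__
    uids, streams = zip(*batch)
    max_len = max(len(s) for s in streams)
    padded = [s + [0] * (max_len - len(s)) for s in streams]  # right-pad every stream with zeros
    return torch.LongTensor(uids), torch.LongTensor(padded)   # shapes: (B,) and (B, max_len)

clicklog_data_loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=pad_collate)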

Gagliano answered 7/3, 2019 at 10:23 Comment(3)
What if I do not desire to pad extra numbers? I mean what if I have a fully convolutional neural network and I do not need same sized input and in particular I do not want to change input by padding it as well (I am doing an explainable AI experiment)?Eijkman
@RedFloyd it's all fine, except you will need to make some adaptations and will lose some performance. In PyTorch (and roughly every other framework) CNN operations such as Conv2d are executed in a "vectorized" fashion over the 1st dimension (usually called batch dimension). In your case, you will just have to have this dimension equal to 1 and call your network as many times as you have images instead of just stacking them into one big tensor and executing your network once on all of them. This will probably cost you performance but nothing more; see the sketch after these comments.Gagliano
Thanks for replying. Just to clarify, doing this is essentially SGD, which would be noisy and troublesome to train (i.e., may not converge well)?Eijkman
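
A minimal sketch of the batch-size-1 approach from the comment above (the network and image sizes are made-up placeholders, not taken from the answer):

import torch
import torch.nn as nn

# hypothetical fully-convolutional net; any module that accepts arbitrary H x W works
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))

# with an identity collate_fn the DataLoader hands back the raw list of variable-size images:
# loader = DataLoader(image_dataset, batch_size=4, collate_fn=lambda b: b)

# each image is then run through the network with a batch dimension of 1
images = [torch.randn(3, 64, 80), torch.randn(3, 120, 96)]   # different spatial sizes
outputs = [net(img.unsqueeze(0)) for img in images]          # each forward pass sees shape (1, 3, H, W)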

This is the way I do it:

import torch

# `device` is assumed; it was not defined in the original snippet
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def collate_fn_padd(batch):
    '''
    Pads a batch of variable-length sequences.

    note: it converts things to tensors manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    '''
    ## get sequence lengths (each sample is assumed to be a tensor or array)
    lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
    ## pad
    batch = [torch.Tensor(t).to(device) for t in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch)
    ## compute mask
    mask = (batch != 0).to(device)
    return batch, lengths, mask

Then I pass that to the DataLoader as the collate_fn argument.
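
For example, a minimal sketch of wiring it up (note that this collate_fn expects each sample to be a single variable-length tensor or array, not a (uid, stream) tuple as in the question; some_sequence_dataset is a placeholder name):

from torch.utils.data import DataLoader

loader = DataLoader(some_sequence_dataset, batch_size=16, collate_fn=collate_fn_padd)

for padded_batch, lengths, mask in loader:
    # padded_batch has shape (max_len, batch_size) because pad_sequence defaults to batch_first=False
    ...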


There seems to be a giant list of related posts on the PyTorch forum. Let me link to them. They all have answers and discussions of their own. It doesn't seem to me that there is one "standard way to do it", but if there is an authoritative reference for one, please share.

It would be nice if the ideal answer mentioned

  • efficiency, e.g. whether to do the processing on the GPU with torch inside the collate function vs. with numpy

and things of that sort.

List:

bucketing: - https://discuss.pytorch.org/t/tensorflow-esque-bucket-by-sequence-length/41284
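
A minimal sketch of that bucketing idea (the sampler and the lengths list are my own names, not taken from the thread; sorting indices by sequence length keeps similarly sized samples in the same batch and reduces padding):

import random
from torch.utils.data import Sampler

class BucketBatchSampler(Sampler):
    """Yields batches of indices whose sequences have similar lengths."""
    def __init__(self, lengths, batch_size):
        self.lengths = lengths            # e.g. [len(s) for s in clicklog_dataset.streams]
        self.batch_size = batch_size

    def __iter__(self):
        # sort indices by length, cut into consecutive batches, then shuffle the batch order
        order = sorted(range(len(self.lengths)), key=self.lengths.__getitem__)
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        random.shuffle(batches)
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

# loader = DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 16), collate_fn=collate_fn_padd)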

Martin answered 25/7, 2019 at 18:7 Comment(3)
Is it customary to put tensors on the GPU in collate? I was under the impression this means you can't use multiple workers in your dataloader if you do this. I'd be interested in knowing which approach typically has better performance.Worcester
@Pinocchio why do you compute the sequence lengths and mask? If I understand correctly, once the batch gets passed into the network the network doesn't have a way to use masks or to trim the input, right?Acatalectic
In case anyone stumbles across this, I think the answer provided by David Ng is the best way to do this #51031282Acatalectic

As @Jatentaki suggested, I wrote my custom collate function and it worked fine.

import collections.abc

import torch
from torch.utils.data import DataLoader

def get_max_length(x):
    return len(max(x, key=len))

def pad_sequence(seq):
    def _pad(_it, _max_len):
        return [0] * (_max_len - len(_it)) + _it   # left-pad with zeros
    return [_pad(it, get_max_length(seq)) for it in seq]

def custom_collate(batch):
    transposed = zip(*batch)
    lst = []
    for samples in transposed:
        if isinstance(samples[0], int):
            lst.append(torch.LongTensor(samples))
        elif isinstance(samples[0], float):
            lst.append(torch.DoubleTensor(samples))
        elif isinstance(samples[0], collections.abc.Sequence):
            lst.append(torch.LongTensor(pad_sequence(samples)))
    return lst

# StreamDataset, data_path and batch_size come from my own setup
stream_dataset = StreamDataset(data_path)
stream_data_loader = DataLoader(dataset=stream_dataset,
                                batch_size=batch_size,
                                collate_fn=custom_collate,
                                shuffle=False)
Treed answered 8/3, 2019 at 8:56 Comment(0)

My solution, trying to combine the best of the other answers:

import collections.abc

import numpy as np
import torch

# `device` is assumed here as well
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def collate_helper(x):
    if isinstance(x[0], (int, np.int32, np.int64)):
        return torch.LongTensor(x)
    elif isinstance(x[0], (float, np.float32, np.float64)):
        return torch.FloatTensor(x)
    elif isinstance(x[0], (np.ndarray, collections.abc.Sequence)):
        x = [torch.tensor(g) for g in x]
        return torch.nn.utils.rnn.pad_sequence(x, batch_first=True)  # , padding_value=torch.nan)
    else:
        raise ValueError(f"Don't know how to collate {type(x[0])}")

def custom_collate(batch):
    return [collate_helper(g).to(device) for g in zip(*batch)]
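
Usage with the dataset from the question might then look like this (clicklog_dataset as defined there; uids go through the int branch, click streams through the sequence branch):

from torch.utils.data import DataLoader

loader = DataLoader(clicklog_dataset, batch_size=16, collate_fn=custom_collate)
for uid_batch, stream_batch in loader:
    # uid_batch: LongTensor of shape (batch_size,); stream_batch: LongTensor of shape (batch_size, max_len), padded with 0
    ...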
Articulate answered 5/3 at 3:31 Comment(0)
