PyTorch - Concatenating Datasets before using DataLoader

I am trying to load two datasets and use them both for training.

Package versions: Python 3.7; PyTorch 1.3.1

It is possible to create the data loaders separately and train on them sequentially:

from torch.utils.data import DataLoader, ConcatDataset

train_loader_modelnet = DataLoader(
    ModelNet(args.modelnet_root, categories=args.modelnet_categories,
             split='train', transform=transform_modelnet, device=args.device),
    batch_size=args.batch_size, shuffle=True)

train_loader_mydata = DataLoader(
    MyDataset(args.customdata_root, categories=args.mydata_categories,
              split='train', device=args.device),
    batch_size=args.batch_size, shuffle=True)

for e in range(args.epochs):
    for idx, batch in enumerate(tqdm(train_loader_modelnet)):
        pass  # training on dataset1
    for idx, batch in enumerate(tqdm(train_loader_mydata)):
        pass  # training on dataset2

Note: MyDataset is a custom dataset class with def __len__(self): and def __getitem__(self, index): implemented. Since the above configuration works, the implementation itself seems to be OK.
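
For context, the class follows the usual Dataset pattern. A minimal, illustrative sketch (the field names here are placeholders, not my actual implementation):

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    # minimal illustration: each sample is a (points, label) pair of tensors
    def __init__(self, samples, labels):
        self.samples = samples  # e.g. a list of point-cloud tensors
        self.labels = labels    # e.g. a list of integer class labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # both elements should be tensors so default_collate can stack them
        return self.samples[index], torch.tensor(self.labels[index])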

But I would ideally like to combine them into a single DataLoader object. I attempted this as per the PyTorch documentation:

train_modelnet = ModelNet(args.modelnet_root, categories=args.modelnet_categories,
                          split='train', transform=transform_modelnet, device=args.device)
train_mydata = MyDataset(args.customdata_root, categories=args.mydata_categories,
                         split='train', device=args.device)
train_set = ConcatDataset([train_modelnet, train_mydata])
train_loader = DataLoader(train_set, batch_size=args.batch_size, shuffle=True)

for e in range(args.epochs):
    for idx, batch in enumerate(tqdm(train_loader)):
        # training on combined

However, on random batches I get the following 'expected a tensor as element X in argument 0, but got a tuple instead' type of error. Any help would be much appreciated!

 40%|████      | 53/131 [01:03<02:00,  1.55s/it]
Traceback (most recent call last):
  File "/home/chris/Programs/pycharm-anaconda-2019.3.4/plugins/python/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/chris/Programs/pycharm-anaconda-2019.3.4/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/chris/Documents/4yp/Data/my_kaolin/Classification/pointcloud_classification_combinedset.py", line 83, in <module>
    for idx, batch in enumerate(tqdm(train_loader)):
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/chris/anaconda3/envs/4YP/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
TypeError: expected Tensor as element 3 in argument 0, but got tuple

Diagram answered 24/3, 2020 at 22:47
Easiest solution to what I want is to use learn2learn's union of datasets, as described here: discuss.pytorch.org/t/… – Edentate
Useful: #69793091 – Edentate

If I got your question right, you have train and dev sets (and their corresponding loaders) as follows:

train_set = CustomDataset(...)
train_loader = DataLoader(dataset=train_set, ...)
dev_set = CustomDataset(...)
dev_loader = DataLoader(dataset=dev_set, ...)

And you want to concatenate them in order to use train+dev as the training data, right? If so, you can simply call:

train_dev_sets = torch.utils.data.ConcatDataset([train_set, dev_set])
train_dev_loader = DataLoader(dataset=train_dev_sets, ...)

The train_dev_loader is the loader containing data from both sets.

Now, make sure both datasets return samples with the same shapes and types, that is, the same number of features, the same kind of labels, and so on.
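
For instance, a quick sanity check along these lines (a sketch, assuming each dataset returns (features, label) tuples) catches mismatches before training:

x_train, y_train = train_set[0]
x_dev, y_dev = dev_set[0]
# compare one sample from each dataset before concatenating
assert x_train.shape == x_dev.shape, "feature shapes differ between datasets"
assert x_train.dtype == x_dev.dtype, "feature dtypes differ between datasets"
assert type(y_train) == type(y_dev), "label types differ between datasets"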

Giff answered 19/4, 2021 at 13:39

I'd guess the two datasets sometimes return different types. When the elements are Tensors, torch stacks them (and they had better be the same shape); when they're something like strings, torch makes a tuple out of them. So it sounds like one of your datasets sometimes returns something that isn't a tensor. I'd put some asserts on the output of your datasets to check that they're doing what you want, or dive in with pdb.
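
For example, a throwaway wrapper like this (just a sketch, assuming each sample is a tuple such as (data, label)) will tell you exactly which dataset and index produce a non-tensor:

import torch
from torch.utils.data import Dataset

class CheckedDataset(Dataset):
    # wraps a dataset and asserts every element of every sample is a Tensor
    def __init__(self, dataset, name):
        self.dataset = dataset
        self.name = name

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        sample = self.dataset[index]
        for i, item in enumerate(sample):
            assert torch.is_tensor(item), (
                f"{self.name}[{index}] element {i} is {type(item)}, not a Tensor")
        return sample

Wrap each dataset (e.g. CheckedDataset(train_modelnet, "modelnet")) before passing them to ConcatDataset, and the failing sample gets reported instead of a cryptic collate error.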

Dillion answered 19/6, 2020 at 20:25

Adding to @Dillion's answer, you can pass a custom collate_fn to PyTorch's DataLoader. The idea is that in the collate_fn you define how individual examples are stacked to make a batch, so you can convert non-tensor elements yourself instead of relying on default_collate. Since you are on torch 1.3.1, make sure you are looking at the correct version of the documentation.
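
As a rough sketch (assuming each sample is a (points, label) pair where the label may be a plain int rather than a tensor; adapt to whatever your datasets actually return):

import torch
from torch.utils.data import DataLoader

def my_collate(batch):
    # batch is a list of (points, label) samples drawn from the dataset
    points = torch.stack([torch.as_tensor(p) for p, _ in batch], dim=0)
    labels = torch.as_tensor([l for _, l in batch])  # assumes scalar labels
    return points, labels

train_loader = DataLoader(train_set, batch_size=args.batch_size,
                          shuffle=True, collate_fn=my_collate)

Here train_set is the ConcatDataset from the question; the function and variable names are illustrative.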

Let me know if this helps or if you have any followup questions :)

Thinia answered 19/6, 2020 at 21:42
