Is PyTorch DataLoader iteration order stable?

Is the iteration order for a PyTorch DataLoader guaranteed to be the same (under mild conditions)?

For instance:

from torch.utils.data import DataLoader

dataloader = DataLoader(my_dataset, batch_size=4,
                        shuffle=True, num_workers=4)
print("run 1")
for batch in dataloader:
  print(batch["index"])

print("run 2")
for batch in dataloader:
  print(batch["index"])

So far I've tried testing it, and the order does not appear to be fixed: the two runs produce different orders. Is there a way to make the order the same? Thanks

Edit: I have also tried doing

unlabeled_sampler = data.sampler.SubsetRandomSampler(unlabeled_indices)
unlabeled_dataloader = data.DataLoader(train_dataset, 
                sampler=unlabeled_sampler, batch_size=args.batch_size, drop_last=False)

and then iterating through the dataloader twice, but the same non-determinism results.
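
A minimal reproduction of what I'm seeing, with a toy TensorDataset standing in for train_dataset and a small index list standing in for unlabeled_indices:

import torch
from torch.utils import data

train_dataset = data.TensorDataset(torch.arange(10))   # toy stand-in
unlabeled_indices = list(range(8))                     # toy stand-in

unlabeled_sampler = data.sampler.SubsetRandomSampler(unlabeled_indices)
unlabeled_dataloader = data.DataLoader(train_dataset,
                sampler=unlabeled_sampler, batch_size=2, drop_last=False)

# Two passes over the same dataloader print different index orders,
# because the sampler draws a fresh permutation on each pass.
print([batch[0].tolist() for batch in unlabeled_dataloader])
print([batch[0].tolist() for batch in unlabeled_dataloader])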

Fantom asked 12/12, 2019 at 23:35 · Comments (5):
It is stable provided shuffle=False; in your case you're explicitly requesting the data to be returned in a random order by setting shuffle=True. – Minneapolis
OK, good point. But it is the "same" dataloader, no? – Fantom
Same dataset, not the same loader. The loader is "just" an interface to the dataset which defines, among other things, a sampler. The sampler samples your dataset in the way and order it was defined to. If you change shuffle then you're changing the sampler that the dataloader is using, which can make it go from stable to unstable. You can also explicitly specify the sampler when defining the dataloader (see the sketch after these comments). – Minneapolis
Thank you for clarifying! So actually I have: unlabeled_sampler = data.sampler.SubsetRandomSampler(unlabeled_indices) and then unlabeled_dataloader = data.DataLoader(train_dataset, sampler=unlabeled_sampler, batch_size=args.batch_size, drop_last=False), and the iteration order is still unstable. Any thoughts? – Fantom
I think I understand your issue better now. I posted an answer that I believe answers your question. – Minneapolis
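
A minimal sketch of the sampler point from these comments, assuming a toy TensorDataset: shuffle=False corresponds to a SequentialSampler (stable order), shuffle=True corresponds to a RandomSampler (a new permutation on every pass), and either can be passed explicitly:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import RandomSampler, SequentialSampler

ds = TensorDataset(torch.arange(8))

# Equivalent to shuffle=False: the order is 0..N-1 on every pass.
stable = DataLoader(ds, batch_size=4, sampler=SequentialSampler(ds))

# Equivalent to shuffle=True: a new permutation is drawn each time the
# loader is iterated.
unstable = DataLoader(ds, batch_size=4, sampler=RandomSampler(ds))

print([b[0].tolist() for b in stable])    # same on every pass
print([b[0].tolist() for b in unstable])  # generally differs between passes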

The short answer is no: when shuffle=True, the iteration order of a DataLoader isn't stable across iterations. Each time you iterate over the loader, its internal RandomSampler draws a new random permutation.

One way to get a stable shuffled DataLoader is to create a Subset dataset using a shuffled set of indices.

shuffled_dataset = torch.utils.data.Subset(my_dataset, torch.randperm(len(my_dataset)).tolist())
dataloader = DataLoader(shuffled_dataset, batch_size=4, num_workers=4, shuffle=False)
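
A self-contained version of the same idea, with a toy TensorDataset standing in for my_dataset, to check that both passes now agree:

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

my_dataset = TensorDataset(torch.arange(8))   # toy stand-in

# Shuffle once, up front: the permutation is baked into the Subset, so
# every pass over the loader visits the samples in the same order.
perm = torch.randperm(len(my_dataset)).tolist()
shuffled_dataset = Subset(my_dataset, perm)
dataloader = DataLoader(shuffled_dataset, batch_size=4, shuffle=False)

print([b[0].tolist() for b in dataloader])  # some shuffled order
print([b[0].tolist() for b in dataloader])  # identical to the first pass
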
Minneapolis answered 13/12, 2019 at 19:38 · Comments (4):
Thank you, let me test one more idea I have, and then I will try your answer and accept. What is strange to me is that if I set the seeds appropriately, then the internal RandomSampler should give the same random indices every time, no? – Fantom
@Fantom I believe the randomization in RandomSampler occurs when a dataloader iterator is created (e.g. when you do for label, data in dataloader:). You would need to seed torch's random number generator (e.g. torch.manual_seed(1234)) with the same seed value immediately before iterating through your dataloader each time to ensure reproducibility. This isn't ideal, as any other random behavior in your system would end up being repeated as well, which may not be desired. – Minneapolis
Hey, actually, I just tried this method, and sadly it doesn't work: ValueError: sampler should be an instance of torch.utils.data.Sampler, but got sampler=[739, 841, 1892,..] – Fantom
Oh, that's actually really interesting, you're right. That's quite surprising, since this was the recommendation of one of the pytorch developers. Anyway, I reverted to my first solution, which will work just as well, and I've tested to make sure. – Minneapolis

I actually went with jodag's in-the-comments answer:

torch.manual_seed("0")

for i,elt in enumerate(unlabeled_dataloader):
    order.append(elt[2].item())
    print(elt)

    if i > 10:
        break

torch.manual_seed("0")

print("new dataloader")
for i,elt in enumerate( unlabeled_dataloader):
    print(elt)
    if i > 10:
        break
exit(1)                       

and the output:

[tensor([[-0.3583, -0.6944]]), tensor([3]), tensor([1610])]
[tensor([[-0.6623, -0.3790]]), tensor([3]), tensor([1958])]
[tensor([[-0.5046, -0.6399]]), tensor([3]), tensor([1814])]
[tensor([[-0.5349,  0.2365]]), tensor([2]), tensor([1086])]
[tensor([[-0.1310,  0.1158]]), tensor([0]), tensor([321])]
[tensor([[-0.2085,  0.0727]]), tensor([0]), tensor([422])]
[tensor([[ 0.1263, -0.1597]]), tensor([0]), tensor([142])]
[tensor([[-0.1387,  0.3769]]), tensor([1]), tensor([894])]
[tensor([[-0.0500,  0.8009]]), tensor([3]), tensor([1924])]
[tensor([[-0.6907,  0.6448]]), tensor([4]), tensor([2016])]
[tensor([[-0.2817,  0.5136]]), tensor([2]), tensor([1267])]
[tensor([[-0.4257,  0.8338]]), tensor([4]), tensor([2411])]
new dataloader
[tensor([[-0.3583, -0.6944]]), tensor([3]), tensor([1610])]
[tensor([[-0.6623, -0.3790]]), tensor([3]), tensor([1958])]
[tensor([[-0.5046, -0.6399]]), tensor([3]), tensor([1814])]
[tensor([[-0.5349,  0.2365]]), tensor([2]), tensor([1086])]
[tensor([[-0.1310,  0.1158]]), tensor([0]), tensor([321])]
[tensor([[-0.2085,  0.0727]]), tensor([0]), tensor([422])]
[tensor([[ 0.1263, -0.1597]]), tensor([0]), tensor([142])]
[tensor([[-0.1387,  0.3769]]), tensor([1]), tensor([894])]
[tensor([[-0.0500,  0.8009]]), tensor([3]), tensor([1924])]
[tensor([[-0.6907,  0.6448]]), tensor([4]), tensor([2016])]
[tensor([[-0.2817,  0.5136]]), tensor([2]), tensor([1267])]
[tensor([[-0.4257,  0.8338]]), tensor([4]), tensor([2411])]

which is as desired. However, I think jodag's main answer is still better; this is just a quick hack which works for now ;)
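
For completeness: in newer PyTorch versions, the DataLoader accepts a dedicated torch.Generator, and re-seeding only that generator before each pass repeats the shuffle without touching the global RNG. A minimal sketch with a toy dataset, assuming a PyTorch release that has the generator argument:

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(8))   # toy stand-in

g = torch.Generator()
loader = DataLoader(ds, batch_size=4, shuffle=True, generator=g)

for _ in range(2):
    g.manual_seed(0)  # re-seed only this generator, not torch's global RNG
    print([b[0].tolist() for b in loader])  # same order on both passes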

Fantom answered 16/12, 2019 at 0:18
