PyTorch infinite loop in the training and validation step
The Dataset and DataLoader parts are fine; I recycled them from other code I wrote. But I get an infinite loop in this part of my code:

def train(train_loader, MLP, epoch, criterion, optimizer):
    MLP.train()
    epoch_loss = []

    for batch in train_loader:
        optimizer.zero_grad()
        sample, label = batch

        # Forward
        pred = MLP(sample)
        loss = criterion(pred, label)
        epoch_loss.append(loss.item())

        # Backward
        loss.backward()
        optimizer.step()

    epoch_loss = np.asarray(epoch_loss)
    print('Epoch: {}, Loss: {:.4f} +/- {:.4f}'.format(
        epoch + 1, epoch_loss.mean(), epoch_loss.std()))



def test(test_loader, MLP, epoch, criterion):
    MLP.eval()
    with torch.no_grad():
        epoch_loss = []

        for batch in test_loader:
            sample, label = batch

            # Forward
            pred = MLP(sample)
            loss = criterion(pred, label)
            epoch_loss.append(loss.item())

        epoch_loss = np.asarray(epoch_loss)
        print('Epoch: {}, Loss: {:.4f} +/- {:.4f}'.format(
            epoch + 1, epoch_loss.mean(), epoch_loss.std()))

Then I iterate over the epochs:

for epoch in range(args['num_epochs']):
    train(train_loader, MLP, epoch, criterion, optimizer)
    test(test_loader, MLP, epoch, criterion)
    print('-----------------------')

Since it doesn't even print the first loss, I believe the logic error is in the training function, but I can't find it.
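
One way to narrow it down is to strip everything away and just iterate over the loader with a print (a minimal debugging sketch; if nothing prints, the DataLoader hangs before yielding its first batch, so the error is not in the forward/backward logic):

for i, (sample, label) in enumerate(train_loader):
    print('batch', i, tuple(sample.shape), tuple(label.shape))
    if i == 2:  # a few batches are enough for this check
        break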

Edit: Here is my MLP class; the problem could be here too:

class BikeRegressor(nn.Module):

    def __init__(self, input_size, hidden_size, out_size):
        super(BikeRegressor, self).__init__()

        self.features = nn.Sequential(nn.Linear(input_size, hidden_size),
                                      nn.ReLU(),
                                      nn.Linear(hidden_size, hidden_size),
                                      nn.ReLU())

        self.out = nn.Sequential(nn.Linear(hidden_size, out_size),
                                 nn.ReLU())

    def forward(self, X):
        hidden = self.features(X)
        output = self.out(hidden)
        return output
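
The model can also be smoke-tested in isolation on random data (a sketch with placeholder sizes: input_size=12 matches the 12-column feature slice in the dataset below, hidden_size=64 is arbitrary):

# If this prints torch.Size([20, 1]), the forward pass itself works
# and is not where the program hangs.
mlp = BikeRegressor(input_size=12, hidden_size=64, out_size=1)
dummy = torch.randn(20, 12)  # one fake batch of 20 samples
print(mlp(dummy).shape)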

Edit 2: Dataset and DataLoader:

class Bikes(Dataset):
    def __init__(self, data):  # data is a pandas DataFrame
        self.datas = data.to_numpy()

    def __getitem__(self, idx):
        sample = self.datas[idx][2:14]
        label = self.datas[idx][-1:]

        sample = torch.from_numpy(sample.astype(np.float32))
        label = torch.from_numpy(label.astype(np.float32))

        return sample, label

    def __len__(self):
        return len(self.datas)



train_set = Bikes(ds_train)
test_set = Bikes(ds_test)

train_loader = DataLoader(train_set, batch_size=args['batch_size'], shuffle=True, num_workers=args['num_workers'])
test_loader = DataLoader(test_set, batch_size=args['batch_size'], shuffle=True, num_workers=args['num_workers'])
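
A quick sanity check for the loaders is to fetch a single batch by hand (a sketch; if this call never returns, the hang is inside the DataLoader rather than in the training code):

# Pulls exactly one batch through the loader machinery.
sample, label = next(iter(train_loader))
print(sample.shape, label.shape)  # expected: (batch_size, 12) and (batch_size, 1)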
Insectile asked 19/2, 2022 at 17:56 (10 comments)
Afterward: I think you need to try loss.backward().
Insectile: @GaussianPrior Yes, lol, I really forgot the (), but the problem still persists.
Gapin: There is something wrong with the structure of your functions; you should indent the bodies of the train and test functions correctly.
Marcus: Does your code complete the first training epoch? Also, how big is your batch size? Is it possible that your computer is actually calculating but you can't see it because it isn't using the GPU and the workload is too big?
Insectile: @Marcus It doesn't, which is why I believe there is a logic problem or something like that. I don't think the lack of a GPU is the problem: I did the same with MNIST (60,000 28x28 images), and the current dataset has a shape of (17379, 12), which is not that big. The batch size is 20.
Marcus: Then I suggest printing some values (for example, the tensor shapes) inside the model's forward method, between the layers, so you can see whether the model receives the input, whether there is a shape mismatch somewhere, and whether it returns the expected output shape.
Gapin: You should remove the nn.ReLU() from the last layer; for a classification problem you would use softmax there instead.
Gapin: Do you have a DataLoader? If so, could you please share its code? And do you use a Jupyter notebook?
Insectile: @Phoenix It's a regression problem, but I can remove the ReLU from the last layer. I added the dataset and DataLoader code. I use JupyterLab, but I think it's almost the same.
Gapin: @Nilon Please add a print call on the first line inside the for loop of the train() function; if it does not print anything, I have the solution for this problem.

I experienced the same problem: Jupyter notebooks may not work properly with multiprocessing, as documented here:

Note: Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines; however, it is worth pointing out here. This means that some examples, such as the Pool examples, will not work in the interactive interpreter.
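
In a plain .py script, the standard fix that note implies is to guard the entry point so that worker processes can import the module safely (a sketch reusing the names from the question):

if __name__ == '__main__':
    # DataLoader worker processes re-import this module; the guard keeps
    # them from re-running the training loop on import.
    for epoch in range(args['num_epochs']):
        train(train_loader, MLP, epoch, criterion, optimizer)
        test(test_loader, MLP, epoch, criterion)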

You have three options to solve your problem:

  • Set num_workers = 0 in train_loader and test_loader (the easiest fix; see the sketch after this list).
  • Move your code to Google Colab. It works for me with num_workers = 6, but I think it depends on how much memory your program uses, so try increasing num_workers gradually until the program crashes and tells you it is out of memory.
  • Adapt your program in Jupyter to support multiprocessing; these resources 1, 2 might help.
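
For the first option, the change is a single keyword argument in the loader setup from the question (a sketch):

# num_workers=0 loads batches in the main process, so no child
# processes are spawned and the notebook issue above disappears.
train_loader = DataLoader(train_set, batch_size=args['batch_size'], shuffle=True, num_workers=0)
test_loader = DataLoader(test_set, batch_size=args['batch_size'], shuffle=True, num_workers=0)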
Gapin answered 20/2, 2022 at 9:57 (1 comment)
Insectile: Ooh, the first one really helped, thanks! This kind of problem sucks, haha; in the end it wasn't really a logic error (apart from some code mistakes).
