What is the index argument from the __getitem__(...) method in tf.keras.utils.Sequence?

TLDR; I made a custom tf.keras.utils.Sequence [1] to load batched data into keras.model.fit(...) [2]. The generator performed considerably worse than the same model trained on data loaded from memory, even though the hyperparameters, model, and data structure were identical. The model was overfitting, so I was wondering: can the index argument that model.fit(...) passes to the generator's __getitem__(self, index) method cause the same images to be fed to the model multiple times? How is the index argument selected? Is it ordered? Is the maximum index controlled by __len__(...)?

References

  1. tf.keras.utils.Sequence
  2. keras.model.fit

I am using a subclass of tf.keras.utils.Sequence [1] to feed batches of data to the model.fit(...) method, shown below.

import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):

    def __init__(self, df, x, y, file_type, req_dim, directory, batch_size):
        # data info
        self.df = df
        self.x = self.df[x]  # paths to the images being loaded
        self.y = self.df[y]  # corresponding target values
        self.index = self.df.index.to_list()
        self.directory = directory  # directory where feature images are stored
        self.file_type = file_type  # dictates which type of image to load
        self.req_dim = req_dim  # required image dimensions
        # for batches
        self.batch_size = batch_size

    def __len__(self):
        """
        :returns number of batches per epoch
        """
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        """
        receives a batch index from keras and grabs the corresponding data batch
        :param index: batch number in [0, len(self) - 1]
        :return: (x, y) tuple for the batch
        """
        # instantiate output array
        x = np.empty((self.batch_size, *self.req_dim))

        # slice out this batch's paths and targets
        batch_x = self.x[index * self.batch_size:(index + 1) * self.batch_size].to_numpy()
        batch_y = self.y[index * self.batch_size:(index + 1) * self.batch_size].to_numpy(dtype=float)

        for i, path in enumerate(batch_x):
            # logic to load images + perform operations on them
            im = load(...)
            im = operations(im)

            x[i, ] = im  # index by position in the batch, not by path

        return x, batch_y.reshape(-1, 1)
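
Note that with np.floor in __len__, the highest index Keras will request is __len__() - 1, so any samples past the last full batch are never loaded. A quick illustration (hypothetical numbers, not the question's data):

import numpy as np

# 103 samples with batch_size 5 (hypothetical): floor gives 20 batches,
# i.e. indices 0..19 covering samples 0..99, so the last 3 samples are
# silently skipped every epoch
print(int(np.floor(103 / 5)))  # 20
print(int(np.ceil(103 / 5)))   # 21 would include a final partial batch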

Traditionally I have loaded the data directly into memory, but I needed the above Sequence subclass (similar to a generator; I will refer to it as a generator from here on) to deal with larger file sizes. To test whether the generator worked, I used data that could be loaded both directly into memory and with the generator. The results from data loaded into memory are consistent with previous experiments, while using the generator causes the model to over-fit on the training data.

Since the model was overfitting, I was wondering whether the index argument passed to __getitem__(self, index), which Keras sends to retrieve a given batch number, is ordered, and whether it can cause a single image to be read in multiple times.
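
One way to check this empirically is to wrap the Sequence and log every index Keras requests (a debugging sketch; LoggingSequence is a made-up helper, not part of Keras):

from tensorflow.keras.utils import Sequence

class LoggingSequence(Sequence):
    """Records every index Keras requests from the wrapped Sequence."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.seen = []  # order of requested indices, including any repeats

    def __len__(self):
        return len(self.wrapped)

    def __getitem__(self, index):
        self.seen.append(index)  # log before delegating to the real generator
        return self.wrapped[index]

Fitting for one epoch with model.fit(LoggingSequence(train_gen), ...) and then inspecting .seen shows directly whether the indices are ordered, shuffled, or repeated.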

The generator is used in the pseudo code below:

# load data
data = load_data(...)

# split data according to the batch size used later, so that each split has an
# equal number of samples when divided into batches
train, test, val = train_test_split(data)

scaler = Scaler()
train['target'] = scaler.fit_transform(train['target'])
test['target'] = scaler.transform(test['target'])
val['target'] = scaler.transform(val['target'])

# instantiate generator
train_gen = Generator(train, x='feature_name', y='target', file_type=file_type,
                      req_dim=dims, directory=directory, batch_size=5)

# load validation images and targets directly into memory
val_x = load(...)
val_y = val['target'].to_numpy(dtype=float)


model = model_1(*dims)  # convolutional neural network that takes height, width, depth args

model.compile(optimizer=Adam(lr=1e-5, decay=1e-5/400), loss=LogCosh())

history = model.fit(train_gen, validation_data=(val_x, val_y))

pred = model.predict(test)
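
Since the in-memory and generator pipelines are meant to be equivalent, a quick parity check can rule out indexing bugs in the generator (a sketch; train_x_mem and train_y_mem are hypothetical in-memory arrays holding the same samples in the same order):

import numpy as np

# every batch the generator emits should match the corresponding in-memory slice
for i in range(len(train_gen)):
    gen_x, gen_y = train_gen[i]
    mem_x = train_x_mem[i * 5:(i + 1) * 5]  # batch_size=5, as above
    mem_y = train_y_mem[i * 5:(i + 1) * 5].reshape(-1, 1)
    assert np.allclose(gen_x, mem_x) and np.allclose(gen_y, mem_y)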

Imaret answered 18/2, 2021 at 22:2 (3 comments):
Would like to know this also. – Isolation

I am also interested in this, especially since I am trying to read sequences with two generators inside my custom generator, and I don't want the end-of-epoch shuffle to mess up the sequences. – Deel

As far as I know, Keras either takes the whole dataset as a tensor (from_tensor_slices) or iterates over the generator's output (from_generator) according to its length internally; __getitem__ returns one batch. – Coze

I think the index passed to __getitem__ is determined by the total number of samples and the batch_size you assigned.

For example, I'm playing with the fer2013plus dataset at the moment, and for testing I have 3944 images. My test_generator is created like this:

test_generator = ImageDataGenerator().flow_from_directory(test_dir,
                                                          target_size=(48, 48),
                                                          color_mode='grayscale',
                                                          batch_size=32,
                                                          class_mode='categorical')
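
The 124 comes from rounding the batch count up (a quick check):

import numpy as np

print(int(np.ceil(3944 / 32)))  # 124 batches, so valid indices are 0..123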

So when I call test_generator.__getitem__(index), the valid index runs from 0 to 123; otherwise an error pops out: ValueError: Asked to retrieve element 124, but the Sequence has length 124
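
Putting it together, the per-epoch loop over a Sequence can be pictured like this (a conceptual sketch of the documented behaviour, not the actual Keras source; run_epoch is a made-up name):

import random

def run_epoch(seq, shuffle=True):
    """Conceptual sketch of how fit() walks a Sequence during one epoch."""
    indices = list(range(len(seq)))  # max index is len(seq) - 1, via __len__
    if shuffle:
        random.shuffle(indices)      # order changes, but no index repeats
    for i in indices:
        x_batch, y_batch = seq[i]    # one __getitem__(index) call per batch
    seq.on_epoch_end()               # hook Keras calls between epochs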

Sweeper answered 13/8, 2021 at 8:13

I ran into a reproducibility issue when I used a custom data generator. The problem was solved when I set the parameter shuffle=False:

model.fit(train_generator, epochs=1, shuffle=False)

I think the method does not actually ignore the shuffle parameter when using a data generator, contrary to what the documentation promises.

Yeager answered 25/2, 2022 at 7:36
