TL;DR: I made a custom tf.keras.utils.Sequence [1] to load batched data into keras.model.fit(...). The generator gives considerably worse performance than when the model is trained on data loaded directly into memory, even though the hyperparameters, model and data structure are the same. The model is overfitting, so I am wondering whether the index argument that model.fit(...) [2] passes to the generator's __getitem__(self, index) method can cause the same images to be fed to the model multiple times. How is the index argument selected? Is it ordered? Is the maximum index controlled by __len__(self)?
References
[1] tf.keras.utils.Sequence: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
[2] tf.keras.Model.fit: https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
I am using a subclass of tf.keras.utils.Sequence [1] to feed batches of data to the model.fit(...) method, shown below.
import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):
    def __init__(self, df, x, y, file_type, req_dim, directory, batch_size):
        # data info
        self.df = df
        self.x = self.df[x]  # paths to the images being loaded
        self.y = self.df[y]  # corresponding target values
        self.index = self.df.index.to_list()
        self.directory = directory  # directory where the feature images are stored
        self.file_type = file_type  # dictates which type of image to load
        self.req_dim = req_dim      # required image dimensions
        # for batches
        self.batch_size = batch_size

    def __len__(self):
        """
        :return: number of batches per epoch
        """
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        """
        receives a batch number (index) from keras and returns the corresponding data batch
        :param index: batch number requested by keras
        :return: (x, y) for that batch
        """
        # instantiate output array
        x = np.empty((self.batch_size, *self.req_dim))

        # slice out the paths and targets that belong to this batch
        batch_x = self.x[index * self.batch_size:(index + 1) * self.batch_size].to_numpy()
        batch_y = self.y[index * self.batch_size:(index + 1) * self.batch_size].to_numpy(dtype=float)

        for i, path in enumerate(batch_x):
            # logic to load images + perform operations on them
            im = load(...)
            im = operations(im)
            x[i, ] = im  # stack the images into the batch array

        return x, batch_y.reshape(-1, 1)
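For reference, the generator can be driven by hand to inspect what a given index returns; a rough sketch (train, file_type, dims and directory are the same objects used in the training script further down):

# manual sanity check: pull each batch out of the generator and look at its shape
gen = Generator(train, x='feature_name', y='target', file_type=file_type,
                req_dim=dims, directory=directory, batch_size=5)
for index in range(len(gen)):        # len(gen) calls __len__
    batch_x, batch_y = gen[index]    # the same call keras makes internally
    print(index, batch_x.shape, batch_y.shape)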
Traditionally I have loaded the data directly into memory, but I needed the above Sequence subclass (similar to a generator; I will refer to it as a generator from here on) to deal with larger file sizes. To test whether the generator works, I used data that could be loaded both directly into memory and with the generator. The results from the data loaded into memory are consistent with previous experiments, while using the generator causes the model to overfit on the training data.
Since the model is overfitting, I am wondering whether the index argument passed into __getitem__(self, index), which is sent by keras to retrieve a given batch number, is ordered, or whether it can cause a single image to be read in multiple times.
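My (possibly wrong) mental model of what fit() does with a Sequence each epoch is roughly the loop below, where every index from 0 to __len__() - 1 is requested once and shuffling only reorders the batches; this is the behaviour I would like confirmed or corrected (train_gen and shuffle stand in for the generator and the shuffle argument of fit()):

import random

# simplified picture of how a Sequence might be consumed during one epoch
batch_indices = list(range(len(train_gen)))   # 0 .. __len__() - 1
if shuffle:                                   # shuffle=True would only reorder the batches
    random.shuffle(batch_indices)
for index in batch_indices:
    x, y = train_gen[index]                   # -> __getitem__(self, index)
    # train on (x, y)
train_gen.on_epoch_end()                      # hook called at the end of each epoch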
The generator is used in the below pseudo code:
# load data
data = load_data(...)
# split data according to batch size used later so that each split has an equal number of samples when
# divided into batches
train, test, val = train_test_split(data)
scaler = Scaler()
train['target'] = scaler.fit_transform(train['target'])
test['target'] = scaler.transform(test['target'])
val['target'] = scaler.transform(val['target'])
# instantiate generator
train_gen = Generator(train, x='feature_name', y='target', file_type=file_type,
                      req_dim=dims, directory=directory, batch_size=5)
# load validation images and targets directly to memory
val_x = load(...)
val_y = val['target'].to_numpy(dtype=float)
model = model_1(*dims) # Convolutional neural network that takes in height, width, depth args
model.compile(optimizer=Adam(lr=1e-5, decay=1e-5/400), loss=LogCosh())
history = model.fit(train_gen, validation_data=(val_x, val_y))
pred = model.predict(test)
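To rule out the slicing in __getitem__ handing the same image to the model more than once per epoch, a quick check along these lines could be run on the generator (check_coverage is an ad-hoc helper written for this question, not part of the training code):

from collections import Counter

def check_coverage(gen):
    """Count how often each image path is returned across one epoch of batches."""
    seen = Counter()
    for index in range(len(gen)):
        # mirror the slicing done in __getitem__
        batch_paths = gen.x[index * gen.batch_size:(index + 1) * gen.batch_size]
        seen.update(batch_paths.to_list())
    duplicates = {path: n for path, n in seen.items() if n > 1}
    print(f"{len(seen)} unique paths, {len(duplicates)} appear more than once")
    return duplicates

check_coverage(train_gen)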