Massive overfit during resnet50 transfer learning

This is my first attempt at doing something with CNNs, so I am probably doing something very stupid - but can't figure out where I am wrong...

The model seems to be learning fine, but the validation accuracy never improves (not even after the first epoch), and the validation loss actually increases over time. It doesn't look like I am overfitting (after one epoch?) - I must be off in some other way.

(figure: typical network behaviour - training accuracy improves while validation accuracy stays flat and validation loss rises)

I am training a CNN - I have ~100k images of various plants (1,000 classes) and want to fine-tune ResNet50 to create a multiclass classifier. The images come in various sizes; I load them like so:

import numpy as np
from keras.preprocessing import image

def path_to_tensor(img_path):
    # loads RGB image as PIL.Image.Image type
    img = image.load_img(img_path, target_size=(IMG_HEIGHT, IMG_HEIGHT))
    # convert PIL.Image.Image type to 3D tensor with shape (IMG_HEIGHT, IMG_HEIGHT, 3)
    x = image.img_to_array(img)
    # convert 3D tensor to 4D tensor with shape (1, IMG_HEIGHT, IMG_HEIGHT, 3) and return it
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    # stack all images into a single (n, IMG_HEIGHT, IMG_HEIGHT, 3) tensor
    list_of_tensors = [path_to_tensor(img_path) for img_path in img_paths]  # can wrap img_paths in tqdm() for a progress bar
    return np.vstack(list_of_tensors)

The dataset is large (it does not fit into memory), so I had to create my own generator to handle both reading from disk and augmentation. (I know Keras has .flow_from_directory() - but my data is not structured that way - it is just a dump of 100k images mixed with 100k metadata files.) I probably should have written a script to restructure them (along the lines of the sketch below) instead of rolling my own generator, but the problem is likely somewhere else.
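For reference, a restructuring script along these lines could have produced the one-folder-per-class layout that .flow_from_directory() expects - a rough sketch only, where get_class_name() is a hypothetical helper that would have to parse my metadata files:

import os
import shutil

def restructure(src_dir, dst_dir, get_class_name):
    # copy each image into a sub-folder named after its class
    for fname in os.listdir(src_dir):
        if not fname.endswith('.jpg'):
            continue  # skip the metadata files
        class_name = get_class_name(os.path.join(src_dir, fname))
        class_dir = os.path.join(dst_dir, class_name)
        os.makedirs(class_dir, exist_ok=True)
        shutil.copy(os.path.join(src_dir, fname), class_dir)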

The generator version below doesn't do any augmentation for the time being - just rescaling:

from keras.preprocessing.image import ImageDataGenerator

def generate_batches_from_train_folder(images_to_read, labels, batchsize=BATCH_SIZE):
    """Generator that yields batches of images ('xs') and labels ('ys') from the train folder.

    :param list images_to_read: full filepaths of the image files to read
    :param np.array labels: one-hot-encoded labels for images_to_read
    :param int batchsize: size of the batches that should be generated
    :return: (ndarray, ndarray) (xs, ys): yields a tuple containing a full batch of images and labels
    """

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        #rotation_range=20,
        #zoom_range=0.2,
        #fill_mode='nearest',
        #horizontal_flip=True
    )

    # needs to be an infinite loop for the generator to work
    while 1:
        filesize = len(images_to_read)

        # count how many entries we have read
        n_entries = 0
        # as long as we haven't read all entries: keep reading
        while n_entries < (filesize - batchsize):

            # create a numpy array of input data (features)
            # - already shaped as a tensor (output of the support function paths_to_tensor)
            xs = paths_to_tensor(images_to_read[n_entries : n_entries + batchsize])

            # and the label info - one-hot over the 1000 plant species in my case
            ys = labels[n_entries : n_entries + batchsize]

            # we have read one more batch
            n_entries += batchsize

            # perform online augmentation on the xs and ys
            augmented_generator = train_datagen.flow(xs, ys, batch_size=batchsize)

            # the yield must sit inside the inner loop, otherwise
            # only one batch is produced per full pass over the data
            yield next(augmented_generator)

This is how I define my model:

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model
from keras import optimizers

def get_model():

    # define the base model
    base_net = ResNet50(input_shape=DIMENSIONS, weights='imagenet', include_top=False)

    # freeze the layers you don't want to train - here I am freezing all of them
    for layer in base_net.layers:
        layer.trainable = False

    x = base_net.output

    # new classification head for ResNet50
    x = Flatten()(x)
    x = Dense(512, activation="relu")(x)
    x = Dropout(0.5)(x)
    x = Dense(1000, activation='softmax', name='predictions')(x)

    model = Model(inputs=base_net.input, outputs=x)

    # compile the model
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizers.Adam(1e-3),
        metrics=['acc'])

    return model

So, as a result, I have 1,562,088 trainable parameters for roughly 70k training images.
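One way to sanity-check that figure is model.summary(), which prints the trainable/non-trainable split:

model = get_model()
model.summary()  # prints total, trainable and non-trainable parameter counts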

I then use 5-fold cross-validation, but the model doesn't work on any of the folds, so I will not include the full code here; the relevant bit is this:

trial_fold = temp_model.fit_generator(
                train_generator,
                steps_per_epoch = len(X_train_path) // BATCH_SIZE,
                epochs = 50,
                verbose = 1,
                validation_data = (xs_v,ys_v),#valid_generator,
                #validation_steps= len(X_valid_path) // BATCH_SIZE,
                callbacks = callbacks,
                shuffle=True)

I have tried various things - made sure my generator is actually working, played with the last few layers of the network by reducing the size of the fully connected layer, tried augmentation - nothing helps.

I don't think the number of parameters in the network is too large - I know other people have done pretty much the same thing and got accuracy closer to 0.5 - but my models seem to be overfitting like crazy. Any ideas on how to tackle this would be much appreciated!

Update 1:

I have decided to stop reinventing the wheel and sorted my files to work with the .flow_from_directory() procedure. To make sure I am importing the right format (triggered by the Ioannis Nasios comment below), I made sure to use preprocess_input() from Keras's ResNet50 application.
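The reworked input pipeline now looks roughly like this (a sketch - the directory name is a placeholder):

from keras.applications.resnet50 import preprocess_input
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_generator = train_datagen.flow_from_directory(
    'data/train',                          # one sub-folder per class
    target_size=(IMG_HEIGHT, IMG_HEIGHT),
    batch_size=BATCH_SIZE,
    class_mode='categorical')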

I also decided to check whether the model is actually producing something useful - I computed bottleneck features for my dataset and then used a random forest to predict the classes. It did work and I got an accuracy of around 0.4.
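The check was roughly along these lines (a sketch - the generators, label arrays and forest settings here are assumptions, not my exact code):

from keras.applications.resnet50 import ResNet50
from sklearn.ensemble import RandomForestClassifier

# pooled 2048-d bottleneck features; the generators must use shuffle=False
# so the features stay aligned with the labels
base_net = ResNet50(weights='imagenet', include_top=False, pooling='avg')
train_features = base_net.predict_generator(train_generator, steps=train_steps)
valid_features = base_net.predict_generator(valid_generator, steps=valid_steps)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_features, y_train)           # y_train: integer class labels
print(clf.score(valid_features, y_valid))  # ~0.4 in my case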

So I guess I definitely had a problem with the input format of my images. As a next step, I will fine-tune the model (with a new top layer) to see if the problem remains...

Update 2:

I think the problem was with the image preprocessing. I ended up not fine-tuning at all - I just extracted the bottleneck layer and trained a LinearSVC() on it, getting an accuracy of around 60% on the train and around 45% on the test dataset.

Gudrin answered 16/5, 2018 at 7:24 Comment(5)
Do you scale your images? For example, if you are using the TensorFlow backend, images must be scaled to (-1,1)Choker
I normalise by scaling with the standard 1./255 after loading the images - didn't know I needed to scale to (-1,1) - it is (0,1) at the moment - I will try and give an update...Gudrin
After doing some digging, it turned out ResNet takes in unscaled images, so I used preprocess_input() from the ResNet50 application to make sure it does the right thing - didn't solve the problem though...Gudrin
Other possibilities to try: (i) more data augmentation, (ii) use MobileNet or a smaller network, (iii) add regularisation to your Dense layer, (iv) maybe use a smaller learning rate, and (v) of course, as mentioned by others, use "preprocess_input" for ResNet50, not rescale=1./255. Basically, you need to add more regularisation to your network when it is over-fitting.Latashialatch
Thank you @Sanchit. In the end I have used the preprocessing and experimented with different dense layers which did give a good result as you suggested. However, because my classifier was also reliant on some metadata for each image which I couldn't put into my network without heavily modifying the architecture, I ended up using bottleneck layer (merged with metadata) as an input for next classifiers.Gudrin

You need to use the preprocessing_function argument in ImageDataGenerator.

from keras.applications.resnet50 import preprocess_input
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

This will ensure that your images are pre-processed as expected for the pre-trained network you are using.

Synonym answered 4/7, 2018 at 19:15 Comment(0)

Have you found any workaround for your problem? If not, this might be an issue with the batch norm layers in your ResNet. I have faced a similar issue, as the Keras batch norm layer behaves very differently during training and testing. So you can freeze all BN layers with:

x = BatchNormalization()(x, training=False)

and then try to retrain your network on the same data set. One more thing to keep in mind: during training you should set the training flag as

import keras.backend as K
K.set_learning_phase(1)

and during testing set this flag to 0. I think it should work after making the above changes.
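Putting those two pieces together, the intended flow is roughly this (illustrative sketch only):

import keras.backend as K

K.set_learning_phase(1)        # before building and training the model
# ... build the model, calling BN layers with training=False:
#     x = BatchNormalization()(x, training=False)
# ... model.fit(...)

K.set_learning_phase(0)        # before evaluating or predicting
# ... model.evaluate(...) / model.predict(...)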

If you find any other solution to the problem, please post it here so that others can benefit from it.

Thank you.

Accountable answered 23/10, 2018 at 5:56 Comment(3)
I believe your answer is correct, and I have tested it. One thing I do not understand: I loaded the ResNet model with all weights, without the top FC layers, set training to False for all layers and batch normalization layers, then added my FC layers on top and let the model train (I set the learning phase to 1). My question is: what do you mean by setting the learning phase to 0 during testing? If I save my model, load it, and ask for predictions on my test set, why do I need to set the learning_phase? Is it because of the BatchNormalization layer?Quake
BatchNormalization and Dropout are the two layers which change behaviour during training, so it is better to set this flag in both cases to remind Keras.Accountable
If we use BatchNormalization()(x, training=False), and later set the layer to be l.trainable = False, does that still ensure that the layer runs in inference mode (while remaining frozen?).Heirship

I implemented various architectures for transfer learning and observed that models containing BatchNorm layers (e.g. Inception, ResNet, MobileNet) perform a lot worse (~30 % compared to >95 % test accuracy) during evaluation (validation/test) than models without BatchNorm layers (e.g. VGG) on my custom dataset. Furthermore, this problem does not occur when saving bottleneck features and using them for classification. There are already a few blog entries, forum threads, issues and pull requests on this topic, and it turns out that the BatchNorm layer, when frozen, uses the original dataset's (ImageNet) statistics instead of the new dataset's:

Assume you are building a Computer Vision model but you don’t have enough data, so you decide to use one of the pre-trained CNNs of Keras and fine-tune it. Unfortunately, by doing so you get no guarantees that the mean and variance of your new dataset inside the BN layers will be similar to the ones of the original dataset. Remember that at the moment, during training your network will always use the mini-batch statistics either the BN layer is frozen or not; also during inference you will use the previously learned statistics of the frozen BN layers. As a result, if you fine-tune the top layers, their weights will be adjusted to the mean/variance of the new dataset. Nevertheless, during inference they will receive data which are scaled differently because the mean/variance of the original dataset will be used.

cited from http://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/

A workaround is to first freeze all layers and then unfreeze all BatchNormalization layers to make them use the new dataset's statistics instead of the original statistics:

from keras.applications import inception_v3
from keras.layers import Dense, Dropout, Input
from keras.models import Model

# build model
input_tensor = Input(shape=train_generator.image_shape)
base_model = inception_v3.InceptionV3(input_tensor=input_tensor,
                                      include_top=False,
                                      weights='imagenet',
                                      pooling='avg')
x = base_model.output

# freeze all layers in the base model
base_model.trainable = False

# un-freeze the BatchNorm layers
for layer in base_model.layers:
    if "BatchNormalization" in layer.__class__.__name__:
        layer.trainable = True

# add custom layers
x = Dense(1024, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(train_generator.num_classes, activation='softmax')(x)

# define new model
model = Model(inputs=input_tensor, outputs=x)

This also explains the difference in performance between training the model with frozen layers and evaluating it on a validation/test set, versus saving bottleneck features (with model.predict the internal backend flag set_learning_phase is set to 0) and training a classifier on the cached bottleneck features.

More information here:

Pull request to change this behavior (not-accepted): https://github.com/keras-team/keras/pull/9965

Similar thread: https://datascience.stackexchange.com/questions/47966/over-fitting-in-transfer-learning-with-small-dataset/72436#72436

Furthest answered 16/4, 2020 at 14:11 Comment(1)
This ended up working but this solution is different from how TF2.0 ends up solving the problem by forcing batch norm into inference mode when it is frozen.Heirship

I am also working with a very small dataset and encountered the same problem: validation accuracy gets stuck at some point although training accuracy keeps going higher. I also noticed that my validation loss was getting higher over time. FYI, I am using the ResNet50 and InceptionV3 models.

After some digging on the internet, I found a discussion taking place on GitHub that connects this problem to the implementation of Batch Normalization layers in Keras. The problem is encountered when applying transfer learning and fine-tuning the network. I am not sure if you have the same problem, but I have added a link below to GitHub where you can read more about it and try some tests that will help you understand whether you are affected by the same issue.

Github link to the pull request and discussion

Poky answered 31/5, 2018 at 11:16 Comment(1)
Hi, can you please share your notebook where you applied these changes to ResNet50? That would be a great help. After following all the available resources regarding these issues on GitHub and Keras, I couldn't really understand the information about inference mode and setting BatchNorm to True/False. I think you understand how painful it is. Please help me with this. Thank you.Undersecretary

The problem is too small a dataset per class: 100k examples / 1000 classes = ~100 examples per class. That is far too little. Your network can memorize all your examples in its weight matrices, but for generalization you need many more examples. Try using only the most common classes and see what happens.
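For instance (a sketch - the variable names are placeholders for however the labels are stored):

from collections import Counter

TOP_K = 50  # keep only the 50 most frequent classes
top_classes = {c for c, _ in Counter(labels).most_common(TOP_K)}
keep = [i for i, c in enumerate(labels) if c in top_classes]
image_paths_subset = [image_paths[i] for i in keep]
labels_subset = [labels[i] for i in keep]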

Fideliafidelio answered 16/5, 2018 at 14:2 Comment(1)
I know it would be too small for training a new network from scratch, but I thought the 1000-images-per-class rule does not apply to transfer learning? When I used augmentation, the accuracy on the train sample was not increasing as fast, but validation was still stuck at the same accuracy level of ~0.008.Gudrin

Here is some explanation regarding fine-tuning and transfer learning according to Stanford University:

  1. Very different dataset from ImageNet and very little data - try a linear classifier on features from different stages

So to summarize:

Since the dataset is very small, you may want to extract the features from an earlier layer, train a classifier on top of that, and check whether the problem still exists (see the sketch below).
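In Keras, that can be done by cutting the network at a named layer (a sketch - 'activation_40' is just an example name; check base_net.summary() for the real ones):

from keras.layers import GlobalAveragePooling2D
from keras.models import Model

# take features from an intermediate ResNet50 stage instead of the last one
mid = base_net.get_layer('activation_40').output     # example layer name
pooled = GlobalAveragePooling2D()(mid)               # flatten to (n, channels)
feature_extractor = Model(inputs=base_net.input, outputs=pooled)
features = feature_extractor.predict(paths_to_tensor(image_paths))
# then train a simple classifier (e.g. LinearSVC) on these features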

Centerboard answered 19/5, 2018 at 0:9 Comment(2)
I thought so as well for a few weeks and gave up, but then I found a paper that used the same dataset and an (older) CNN. They did not have the same problem and achieved a multiclass accuracy of around 0.6 - there is clearly a problem somewhere in my implementation, not in the approach itself. Here is a link: ceur-ws.org/Vol-1391/121-CR.pdfGudrin
@Gudrin I am downloading the dataset and will try to run it using your implementation.Centerboard
