BERT binary text classification gives different results on every run

I am doing binary text classification with BERT, using the Simple Transformers library.

I work in Colab with a GPU runtime.

I generated the train and test sets with sklearn's StratifiedKFold and stored them in two files, each holding a dictionary of folds.

I run my classification in the following while loop:

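import shutil

import pandas as pd
import sklearn.metrics
from simpletransformers.classification import ClassificationModel

# trainfolds and testfolds are the two dictionaries of DataFrames loaded from my fold files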

counter = 0

resultatos = []

while counter != len(trainfolds):

    model = ClassificationModel('bert', 'bert-base-multilingual-cased', args={
        'num_train_epochs': 4, 'learning_rate': 1e-5, 'fp16': False,
        'max_seq_length': 160, 'train_batch_size': 24, 'eval_batch_size': 24,
        'warmup_ratio': 0.0, 'weight_decay': 0.00,
        'overwrite_output_dir': True})

    print("start with fold_{}".format(counter))
    trainfolds["{}_fold".format(counter)].to_csv("/content/data/train.tsv", sep="\t", index = False, header=False)
    print("{}_fold train exported as train.tsv".format(counter))
    testfolds["{}_fold".format(counter)].to_csv("/content/data/dev.tsv", sep="\t", index = False, header=False)
    print("{}_fold test exported as dev.tsv".format(counter))

    train_df =  pd.read_csv("/content/data/train.tsv", delimiter='\t', header=None)
    eval_df = pd.read_csv("/content/data/dev.tsv", delimiter='\t', header=None)

    train_df = pd.DataFrame({
    'text': train_df[3].replace(r'\n', ' ', regex=True),
    'label':train_df[1]})

    eval_df = pd.DataFrame({
    'text': eval_df[3].replace(r'\n', ' ', regex=True),
    'label':eval_df[1]})

    model.train_model(train_df)

    result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1 = sklearn.metrics.f1_score)
    print(result)

    resultatos.append(result)

    shutil.rmtree("outputs")
    shutil.rmtree("cache_dir")
    #shutil.rmtree("runs")



    counter += 1

I get different results when I run this code twice on the same folds.

Here, for example, are the per-fold F1 scores and their mean for two runs:

0.6237942122186495
0.6189111747851003
0.6172839506172839
0.632183908045977
0.6182965299684542
0.5942492012779553
0.6025641025641025
0.6153846153846154
0.6390532544378699
0.6627906976744187
The F1 Score is: 0.6224511646974427


0.6064516129032258
0.6282420749279539
0.6402439024390244
0.5971014492753622
0.6135693215339232
0.6191950464396285
0.6382978723404256
0.6388059701492537
0.6097560975609756
0.5956112852664576
The F1 Score is: 0.618727463283623

How can they be that different for the same folds?

What I have already tried is setting a fixed random seed right before the loop starts:
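
To put a number on the gap, the two runs can be compared fold by fold. This is only a small sketch built from the scores posted above; the names `run_1`, `run_2` and `gaps` are my own and not part of the training code.

```python
from statistics import mean

# Per-fold F1 scores copied from the two runs listed above
run_1 = [0.6237942122186495, 0.6189111747851003, 0.6172839506172839,
         0.632183908045977, 0.6182965299684542, 0.5942492012779553,
         0.6025641025641025, 0.6153846153846154, 0.6390532544378699,
         0.6627906976744187]
run_2 = [0.6064516129032258, 0.6282420749279539, 0.6402439024390244,
         0.5971014492753622, 0.6135693215339232, 0.6191950464396285,
         0.6382978723404256, 0.6388059701492537, 0.6097560975609756,
         0.5956112852664576]

# Absolute difference between the two runs on each fold
gaps = [abs(a - b) for a, b in zip(run_1, run_2)]
worst_fold = gaps.index(max(gaps))

print("mean F1, run 1:", mean(run_1))
print("mean F1, run 2:", mean(run_2))
print("largest per-fold gap: {:.4f} (fold {})".format(max(gaps), worst_fold))
```

The means differ only slightly, but individual folds move by several points of F1, which is what worries me.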

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
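
A fuller version of that seeding attempt could look like the sketch below. The helper name `seed_everything` is hypothetical, and the cuDNN flags are an assumption about what else may influence GPU results; they reduce nondeterminism but do not guarantee bit-identical runs. The guarded imports are only there so the snippet also runs where NumPy or PyTorch is not installed.

```python
import random

try:
    import numpy as np
except ImportError:  # keeps the sketch runnable without NumPy
    np = None
try:
    import torch
except ImportError:  # keeps the sketch runnable without PyTorch
    torch = None


def seed_everything(seed=42):
    """Reset every random number generator the training loop may touch."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        # Assumption: ask cuDNN for deterministic kernels and disable autotuning
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


# Reseeding restores the same random stream, so the two draws match
seed_everything(42)
first_draw = [random.random() for _ in range(3)]
seed_everything(42)
second_draw = [random.random() for _ in range(3)]
print(first_draw == second_draw)
```

Calling such a helper at the top of every loop iteration, right before the model is created, would give each fold the same starting state regardless of what earlier folds consumed. Simple Transformers also accepts a `manual_seed` entry in `args`, which may be worth setting in addition.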

I initialize the model inside the loop because, when it is created outside the loop, it keeps what it learned on earlier folds: after the second fold I get an F1 score of almost 1, even though I delete the cache.
