Keras predict gives different error than evaluate, loss different from metrics

T

2

3

I have the following problem:

I have an autoencoder in Keras, and train it for a few epochs. The training overview shows a validation MAE of 0.0422 and an MSE of 0.0024. However, if I then call network.predict and manually calculate the validation errors, I get 0.035 and 0.0024.

One would assume that my manual calculation of the MAE is simply incorrect, but the weird thing is that if I use an identity model (simply outputs what you input) and use that to evaluate the predicted values, the same error value is returned as for my manual calculation. The code looks as follows:

input = Input(shape=(X_train.shape[1], ))
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(input)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
network = Model(input, decoded)

# sgd = SGD(lr=8, decay=1e-6)
# network.compile(loss='mean_squared_error', optimizer='adam')
network.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])

# Fitting the data
network.fit(X_train, X_train, epochs=2, batch_size=1, shuffle=True, validation_data=(X_valid, X_valid),
            callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=20, verbose=0, mode='auto')])


# Results
recon_valid = network.predict(X_valid, batch_size=1)
score2 = network.evaluate(X_valid, X_valid, batch_size=1, verbose=0)
print('Network evaluate result: mae={}, mse={}'.format(*score2))

x = Input((X_train.shape[1],))
m = Model(x, x)
m.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
score1 = m.evaluate(recon_valid, X_valid, batch_size=1, verbose=0)
print('Identity evaluate result: mae={}, mse={}'.format(*score1))

errors_test = np.absolute(X_valid - recon_valid)
print("Manual MAE: {}".format(np.average(errors_test)))
errors_test = np.square(X_valid - recon_valid)
print("Manual MSE: {}".format(np.average(errors_test)))

Which outputs the following:

Train on 282 samples, validate on 94 samples
Epoch 1/2
2018-04-18 17:24:01.464947: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
282/282 [==============================] - 0s - loss: 0.0861 - mean_squared_error: 0.0187 - val_loss: 0.0451 - val_mean_squared_error: 0.0025
Epoch 2/2
282/282 [==============================] - 0s - loss: 0.0440 - mean_squared_error: 0.0025 - val_loss: 0.0422 - val_mean_squared_error: 0.0024
Network evaluate result: mae=0.04216482736011769, mse=0.0024067993242382767
Identity evaluate result: mae=0.03506102238563781, mse=0.0024067993242382767
Manual MAE: 0.03506102412939072
Manual MSE: 0.002406799467280507

I know that my manual calculation is correct, since the identity model (m) returns the same value. The only possible explanation for the difference in MAE values would then be if network.evaluate(X_valid, X_valid) somehow uses different values than those returned by network.predict(X_valid), but then the MSE would also be different.

This leaves me completely confused, thinking there might be a bug in the Keras MAE calculation. Has anyone had this issue before or have any ideas how it might be fixed? I am using the Tensorflow backend. Any help would be much appreciated!

EDIT: I'm almost certain this is a bug. If I keep loss='mae' but also add metrics=['mse', 'mae'], the MAE returned by the metrics is the same as my manual computation and the identity model. The same is true for MSE: if I set loss='mse', the MSE returned by the metric is different from the loss.

Tourism answered 18/4, 2018 at 15:35 Comment(0)

T

5

It turns out that the loss is supposed to be different than the metric, because of the regularization. Using regularization, the loss is higher (in my case), because the regularization increases loss when the nodes are not as active as specified. The metrics don't take this into account, and therefore return a different value, which equals what one would get when manually computing the error.

Tourism answered 18/4, 2018 at 16:51 Comment(0)

A

0

The metrics during training and validation are different because of different reasons:

The dataset is different
During trainning the weights are changing in every step so the metrics are changing too
The metric during training is of the current batch of data or a running average of the metrics of the last batches. For the evaluation, the metric is for the whole dataset.

Acrylonitrile answered 18/4, 2018 at 18:18 Comment(0)

Recommended topics

Hot tags