Gradient descent using TensorFlow is much slower than a basic Python implementation, why?

I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a and b such that Y = a*X + b approximates an (x, y) point cloud (which I generated myself for the sake of simplicity).

I am solving this LR problem using fixed-step-size gradient descent (FSSGD). I implemented it in TensorFlow and it works, but I noticed that it is really slow on both GPU and CPU. Out of curiosity, I implemented FSSGD myself in Python/NumPy and, as expected, it runs much faster, roughly:

  • 10x faster than TF@CPU
  • 20x faster than TF@GPU

If TensorFlow were really this slow, I cannot imagine that so many people would be using this framework, so I must be doing something wrong. Can anyone help me speed up my TensorFlow implementation?

I'm NOT interested in the difference between CPU and GPU performance. Both numbers are provided merely for completeness and illustration. I'm interested in why my TensorFlow implementation is so much slower than a raw Python/NumPy implementation.

For reference, I've added my code below.

  • Stripped to a minimal (but fully working) example.
  • Using Python v3.7.9 x64.
  • Used tensorflow-gpu==1.15 for now (because the course uses TensorFlow v1)
  • Tested to run in both Spyder and PyCharm.

My FSSGD implementation using TensorFlow (execution time about 40 sec @CPU to 80 sec @GPU):

#%% General imports
import numpy as np
import timeit
import tensorflow.compat.v1 as tf


#%% Get input data
# Generate simulated input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15


#%% Define tensorflow model
# Define data size
n_samples = x_data_input.shape[0]

# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))

# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")

# Define variables to be learned
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
    W = tf.get_variable("weights", (1, 1), initializer=tf.constant_initializer(0.0))
    b = tf.get_variable("bias", (1,), initializer=tf.constant_initializer(0.0))

# Define loss function    
Y_pred = tf.matmul(X, W) + b
loss = tf.reduce_sum((Y - Y_pred) ** 2 / n_samples)  # Quadratic loss function


# %% Solve tensorflow model
#Define algorithm parameters
total_iterations = 1e5  # Defines total training iterations

#Construct TensorFlow optimizer
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE): #reuse= True | False | tf.AUTO_REUSE
    opt = tf.train.GradientDescentOptimizer(learning_rate = 1e-4)
    opt_operation = opt.minimize(loss, name="GDO")

#To measure execution time
time_start = timeit.default_timer()

with tf.Session() as sess:
    #Initialize variables
    sess.run(tf.global_variables_initializer())
    
    #Train variables
    for index in range(int(total_iterations)):
        _, loss_val_tmp = sess.run([opt_operation, loss], feed_dict={X: x_data, Y: y_data})
    
    #Get final values of variables
    W_val, b_val, loss_val = sess.run([W, b, loss], feed_dict={X: x_data, Y: y_data})
      
#Print execution time      
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')


# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W_val[0,0]))
print('b_val = {0:0.3f}'.format(b_val[0]))
print('')

My own Python/NumPy FSSGD implementation (execution time about 4 sec):

#%% General imports
import numpy as np
import timeit


#%% Get input data
# Define input data
x_data_input = np.arange(100, step=0.1)
y_data_input = x_data_input + 20 * np.sin(x_data_input/10) + 15


#%% Define Gradient Descent (GD) model
# Define data size
n_samples = x_data_input.shape[0]

#Initialize data
W = 0.0  # Initial condition
b = 0.0  # Initial condition

# Compute initial loss
y_gd_approx = W*x_data_input+b
loss = np.sum((y_data_input - y_gd_approx)**2)/n_samples  # Quadratic loss function


#%% Execute Gradient Descent algorithm
#Define algorithm parameters
total_iterations = 1e5  # Defines total training iterations
GD_stepsize = 1e-4  # Gradient Descent fixed step size

#To measure execution time
time_start = timeit.default_timer()

for index in range(int(total_iterations)):
    #Compute gradient (derived manually for the quadratic cost function)
    loss_gradient_W = 2.0/n_samples*np.sum(-x_data_input*(y_data_input - y_gd_approx))
    loss_gradient_b = 2.0/n_samples*np.sum(-1*(y_data_input - y_gd_approx))
    
    #Update trainable variables using fixed step size gradient descent
    W = W - GD_stepsize * loss_gradient_W
    b = b - GD_stepsize * loss_gradient_b
    
    #Compute loss
    y_gd_approx = W*x_data_input+b
    loss = np.sum((y_data_input - y_gd_approx)**2)/x_data_input.shape[0]

#Print execution time 
time_end = timeit.default_timer()
print('')
print("Time to execute code: {0:0.9f} sec.".format(time_end - time_start))
print('')


# %% Print results
print('')
print('Iteration = {0:0.3f}'.format(total_iterations))
print('W_val = {0:0.3f}'.format(W))
print('b_val = {0:0.3f}'.format(b))
print('')
Billings answered 29/12, 2020 at 12:49
Does this answer your question? Training a simple model in Tensorflow GPU slower than CPU – Montparnasse
See also #13561383 – Montparnasse
@NicolasGervais I don't think that explains why the CPU model is slower (is copying the data even necessary?) – Modestamodeste
@Nicolas, I have read that and it does not answer my question. I am not interested in the difference between CPU and GPU. I am interested in the difference between a TensorFlow implementation and a raw Python/NumPy implementation. – Billings
TensorFlow's Session.run() just has a significant overhead for graphs that require very little computation. github.com/tensorflow/tensorflow/issues/120 – Treacle

The actual answer to my question is hidden in the various comments. For future readers, I will summarize these findings in this answer.

About the speed difference between TensorFlow and a raw Python/NumPy implementation

This part of the answer is actually quite logical.

In each iteration (= each call of Session.run()), TensorFlow performs computations. TensorFlow has a large fixed overhead for launching each computation, and on GPU this overhead is even worse than on CPU. However, TensorFlow executes the actual computations very efficiently, more efficiently than the raw Python/NumPy implementation above.
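
A minimal sketch (not part of the original post) that makes this per-call overhead visible: it times repeated Session.run() calls on a graph that does almost no work, assuming the same tensorflow-gpu==1.15 setup as in the question.

#%% Minimal sketch: timing the per-call overhead of Session.run()
import timeit
import tensorflow.compat.v1 as tf

# A graph that does almost no work, so the measured time is dominated
# by the fixed cost of launching each Session.run() call.
trivial_op = tf.constant(1.0) + 1.0

with tf.Session() as sess:
    n_calls = 1000
    time_start = timeit.default_timer()
    for _ in range(n_calls):
        sess.run(trivial_op)
    time_end = timeit.default_timer()
    print("Average time per Session.run() call: {0:0.6f} sec.".format(
        (time_end - time_start) / n_calls))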

So, when the number of data points is increased, and therefore the amount of computation per iteration, the relative performance shifts in favor of TensorFlow. The opposite is also true.

The problem described in the question is very small, meaning that the amount of computation per iteration is very low while the number of iterations is very large. That is why TensorFlow performs so badly here. Such small problems are not the typical use case for which TensorFlow was designed.

To reduce the execution time

Still, the execution time of the TensorFlow script can be reduced a lot! To reduce the execution time, the number of iterations must be reduced (regardless of the size of the problem, this is a good aim anyway).

As @amin pointed out, this is achieved by scaling the input data. A very brief explanation of why this works: after scaling, the sizes of the gradient and of the variable updates are better balanced relative to the absolute values of the parameters being sought. Therefore, fewer steps (= iterations) are required.

Following @amin's advice, I ended up scaling my x-data as follows (some code is repeated to make the position of the new code clear):

# Tensorflow is finicky about shapes, so resize
x_data = np.reshape(x_data_input, (n_samples, 1))
y_data = np.reshape(y_data_input, (n_samples, 1))

### START NEW CODE ###

# Scale x_data
x_mean = np.mean(x_data)
x_std = np.std(x_data)
x_data = (x_data - x_mean) / x_std

### END NEW CODE ###

# Define placeholders for input
X = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_x_data")
Y = tf.placeholder(tf.float32, shape=(n_samples, 1), name="tf_y_data")

Scaling speeds up the convergence by a factor of 1000. Instead of 1e5 iterations, only 1e2 iterations are needed. This is partly because a maximum step size of 1e-1 can be used instead of a step size of 1e-4, as sketched below.
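
For reference, a sketch of the corresponding changes in the optimizer section of the original script (the values 1e2 and 1e-1 are the ones mentioned above; everything else stays the same):

#Define algorithm parameters (with scaled x-data)
total_iterations = 1e2  # Instead of 1e5

#Construct TensorFlow optimizer with the larger step size
with tf.variable_scope("linear-regression", reuse=tf.AUTO_REUSE):
    opt = tf.train.GradientDescentOptimizer(learning_rate=1e-1)  # Instead of 1e-4
    opt_operation = opt.minimize(loss, name="GDO")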

Please note that the weight and bias found this way are different, and that you must feed scaled data from now on.

Optionally, you can unscale the found weight and bias so that you can feed unscaled data again. The unscaling is done with this code (placed somewhere at the end of the script):

#%% Unscaling
W_val_unscaled = W_val[0,0]/x_std
b_val_unscaled = b_val[0]-x_mean*W_val[0,0]/x_std
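
As an optional sanity check (a sketch using the variables defined above), predictions made with the unscaled parameters on the original x-data should match predictions made with W_val and b_val on the scaled x-data:

#%% Optional sanity check of the unscaling
y_pred_scaled = W_val[0, 0] * x_data[:, 0] + b_val[0]             # Model applied to scaled x-data
y_pred_unscaled = W_val_unscaled * x_data_input + b_val_unscaled  # Same model applied to raw x-data
print("Max difference: {0:0.3e}".format(np.max(np.abs(y_pred_scaled - y_pred_unscaled))))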
Billings answered 31/12, 2020 at 10:35

I think it's the result of the large number of iterations. I've changed the number of iterations from 1e5 to 1e3 and also changed x from x_data_input = np.arange(100, step=0.1) to x_data_input = np.arange(100, step=0.0001). This way I've reduced the number of iterations but increased the total computation by 10x. With NumPy it's done in 22 sec and with TensorFlow it's done in 25 sec.
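
For clarity, these are the two changes relative to the question's script (a sketch; the rest of the code is unchanged):

# Changes relative to the question's script; everything else stays the same
x_data_input = np.arange(100, step=0.0001)  # Instead of step=0.1 (much more computation per iteration)
total_iterations = 1e3                      # Instead of 1e5 (far fewer iterations)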

My guess: TensorFlow has a lot of overhead in each iteration (to give us a framework that can do a lot), but the forward-pass and backward-pass speeds are fine.

Sheldonshelduck answered 29/12, 2020 at 14:12
The problem with your suggestion is that the solution has still not converged (the local minimum has not yet been reached). So many iterations are required simply because of the limited step size (increasing the step size would make the algorithm diverge). However, it is indeed interesting to see that if the number of data points increases (and with it the computational load per iteration), the difference in execution time gets smaller. It seems that TensorFlow indeed executes the computations more efficiently, but that each iteration introduces a large overhead. Thank you! This helped me understand. – Billings
@Billings I'm happy it helped. But don't forget that your code is just an example. Maybe in this example NumPy works faster to get a good result, but in general you don't see this situation (a lot of epochs and very little computation per epoch) very often. So in general TensorFlow just works well. – Sheldonshelduck
I see that TensorFlow performs better when more computation per epoch is required (meaning the ratio of computation time of TF vs Python shifts in TF's favor). However, I tried adding more data and the training does not converge in fewer iterations because of this. This is because the gradient remains (on average) the same. Perhaps I should change my loss function; it now contains 1/n_samples, and perhaps I should remove this to let the gradient grow with the number of data points, but I'm afraid the algorithm would then diverge? Could you comment on this? – Billings
@Billings I apologize for my second comment (it was wrong and I've deleted it). Here the main problem with your example is the input range (it's preferred for data to have zero mean and unit standard deviation). So I've done that and got a good result with a few iterations. I give you access to the related code on Colab (for just one day, if you need more time let me know): colab.research.google.com/drive/… – Sheldonshelduck
Thank you! Very interesting. The cs231n course was already next on my list for when I have time (right now I am doing cs234). One more question: should I also normalize the y-data? – Billings
@Billings Actually, I can't remember any problem that normalizes the output, and I don't see any reason to do that. I think doing so would just affect the learning rate. So my answer is no (but I'm not sure about it). Side note: you do see normalization in hidden layers (via BatchNormalization) and that really helps; you can find the reason in the mentioned link (cs231n). – Sheldonshelduck
