Here's an example to clarify what I mean:
First session.run():
[profiler screenshot: first run of a TensorFlow session]
Later session.run():
[profiler screenshot: later runs of a TensorFlow session]
I understand TensorFlow is doing some initialization here, but I'd like to know where in the source this happens. The slowdown occurs on CPU as well as GPU, but the effect is more prominent on GPU. For example, with an explicit Conv2D operation, the first run launches a much larger number of Conv2D operations in the GPU stream. In fact, if I change the input size of the Conv2D, the first run can go from tens to hundreds of stream Conv2D operations. In later runs, however, there are always exactly five Conv2D operations in the GPU stream, regardless of input size. On CPU, the operation list is the same in the first run as in later runs, but the same time discrepancy is still there.
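For reference, here is a minimal sketch of how I'm measuring this (the shapes and variable names are just my own choices for the repro, and I'm using the `tf.compat.v1` session API): time the first `session.run()` of a single Conv2D against later runs of the same fetch.

```python
import time
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# One explicit Conv2D op; shapes here are arbitrary for the repro.
x = tf.placeholder(tf.float32, shape=[1, 224, 224, 3])
kernel = tf.Variable(tf.random.normal([3, 3, 3, 64]))
conv = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding="SAME")

data = np.random.rand(1, 224, 224, 3).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    timings = []
    for _ in range(5):
        start = time.perf_counter()
        sess.run(conv, feed_dict={x: data})
        timings.append(time.perf_counter() - start)

# The first entry is consistently much larger than the rest,
# on both CPU and GPU.
print(timings)
```

Profiling each of these `sess.run()` calls (e.g. with nvprof/Nsight on GPU) is where I see the difference in the number of Conv2D operations in the stream.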
What portion of the TensorFlow source is responsible for this behavior? Where are GPU operations "split"?
Thanks for the help!