Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads

Asked 20/12, 2016 at 1:33 Answered 5/6, 2023 at 16:53

python parallel-processing tensorflow distributed-computing

Can somebody please explain the following TensorFlow terms

inter_op_parallelism_threads
intra_op_parallelism_threads

or, please, provide links to the right source of explanation.

I have conducted a few tests by changing the parameters, but the results have not been consistent to arrive at a conclusion.

Skerrick answered 20/12, 2016 at 1:33 Comment(0)

The inter_op_parallelism_threads and intra_op_parallelism_threads options are documented in the source of the tf.ConfigProto protocol buffer. These options configure two thread pools used by TensorFlow to parallelize execution, as the comments describe:

// The execution of an individual op (for some op types) can be
// parallelized on a pool of intra_op_parallelism_threads.
// 0 means the system picks an appropriate number.
int32 intra_op_parallelism_threads = 2;

// Nodes that perform blocking operations are enqueued on a pool of
// inter_op_parallelism_threads available in each process.
//
// 0 means the system picks an appropriate number.
//
// Note that the first Session created in the process sets the
// number of threads for all future sessions unless use_per_session_threads is
// true or session_inter_op_thread_pool is configured.
int32 inter_op_parallelism_threads = 5;

There are several possible forms of parallelism when running a TensorFlow graph, and these options provide some control over multi-core CPU parallelism:

If you have an operation that can be parallelized internally, such as matrix multiplication (tf.matmul()) or a reduction (e.g. tf.reduce_sum()), TensorFlow will execute it by scheduling tasks in a thread pool with intra_op_parallelism_threads threads. This configuration option, therefore, controls the maximum parallel speedup for a single operation. Note that if you run multiple operations in parallel, these operations will share this thread pool.
If you have many operations that are independent in your TensorFlow graph (because there is no directed path between them in the dataflow graph), TensorFlow will attempt to run them concurrently, using a thread pool with inter_op_parallelism_threads threads. If those operations have a multithreaded implementation, they will (in most cases) share the same thread pool for intra-op parallelism.

Finally, both configuration options take a default value of 0, which means "the system picks an appropriate number." Currently, this means that each thread pool will have one thread per CPU core in your machine.
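
As a rough illustration of how the two pools relate, here is a toy analogy in plain Python. It is not TensorFlow's actual scheduler: the names INTER_OP_THREADS, INTRA_OP_THREADS and parallel_sum are made up for this sketch, and because of Python's GIL it shows only the scheduling structure, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes standing in for the two TensorFlow options.
INTER_OP_THREADS = 2  # how many independent "ops" may run at once
INTRA_OP_THREADS = 4  # worker threads available inside the "ops"

inter_pool = ThreadPoolExecutor(max_workers=INTER_OP_THREADS)
intra_pool = ThreadPoolExecutor(max_workers=INTRA_OP_THREADS)  # shared by all ops


def parallel_sum(values, chunks=INTRA_OP_THREADS):
    """A toy "op" that splits its own work across the shared intra-op pool."""
    step = max(1, -(-len(values) // chunks))  # ceiling division
    parts = [values[i:i + step] for i in range(0, len(values), step)]
    return sum(intra_pool.map(sum, parts))


# Two ops with no data dependency between them: the inter-op pool
# is free to run them concurrently.
left = inter_pool.submit(parallel_sum, list(range(1_000)))
right = inter_pool.submit(parallel_sum, list(range(1_000, 2_000)))

# This step depends on both results, so it has to wait for them.
total = left.result() + right.result()
print(total)

inter_pool.shutdown()
intra_pool.shutdown()
```

Both toy ops draw their workers from the same intra_pool, which mirrors the point above that operations running in parallel share the intra-op thread pool.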

Esbenshade answered 20/12, 2016 at 2:16 Comment(10)

Can this be used to parallelise my code over multiple CPUs? How can I use these functions to achieve fault tolerance in the event that one of the machines fails in the cluster? – Skerrick 20/12, 2016 at 9:51

These options control the maximum amount of parallelism you can get from running your TensorFlow graph. However, they rely on the operations that you run having parallel implementations (like many of the standard kernels do) for intra-op parallelism, and on the availability of independent ops to run in the graph for inter-op parallelism. For example, if your graph is a linear chain of operations, and those operations have only serial implementations, then these options won't add parallelism. The options are not related to fault tolerance (or distributed execution). – Esbenshade 20/12, 2016 at 15:31

It seems the two options only work for CPUs but not GPUs? If I had tf.add_n operator of multiple parallel matrix multiplication based operations and run in GPUs, how is the parallelization done in default and can I control it? – Lightproof 30/4, 2017 at 3:55

How much does setting both values to 1 affect the speed? Does setting both to one mean that tensorflow will use only one thread? (I just tried and I can see all my cores usage going up once I start training and I don't really see a difference in speed) – Lodmilla 7/8, 2018 at 14:55

@Esbenshade So if I understand the answer correctly, intra controls the number of cores (within 1 node), and inter controls the number of nodes, right? Or loosely speaking, intra works like OpenMP, and inter works like OpenMPI? Please correct me if I am wrong. – Vulpecula 19/10, 2018 at 18:39

and if the two settings apply to both CPU and GPU? Thanks. – Vulpecula 19/10, 2018 at 20:28

What does 'blocking' mean? Normally, there's no IO, only Tensor calculation, so I don't expect 'blocking' to mean IO blocking. – Inoculate 6/4, 2019 at 4:30

@Esbenshade When we leave it to the default of 0, the system picks the appropriate number for one session as a whole or varies the number for every op that can be parallelized? – Dalston 31/10, 2019 at 18:42

@mrry, could you tell me those two parameters relation with OMP_NUM_THREADS? thanks a lot. – Purity 14/2, 2020 at 1:31

These settings only control number of thread in one CPU? so If I have say 8CPU with 2 threads each setting to 16 will use all threads within all CPU? Also is this only multi-processing operations such as matrix multiplication and reduce_sum instead of training in distributed manner (ex: data parallelism)? – Angelika 3/8, 2023 at 5:40

To get the best performance from a machine, change the parallelism threads and OpenMP settings as below for the tensorflow backend (from here):

import tensorflow as tf

# NUM_PARALLEL_EXEC_UNITS is the number of cores per socket in the machine.
# The value 4 is only a placeholder; replace it with your own core count.
# When NUM_PARALLEL_EXEC_UNITS=0 the system chooses appropriate settings.
NUM_PARALLEL_EXEC_UNITS = 4

# TensorFlow 1.x API; under 2.x use tf.compat.v1.ConfigProto and tf.compat.v1.Session.
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

Answer to the comment below: [source]

allow_soft_placement=True

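If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn't exist, you can set allow_soft_placement to True in the configuration option when creating the session. In simple words, it lets TensorFlow fall back to another available device (for example, the CPU) instead of raising an error. It does not control GPU memory allocation; that is the job of the separate gpu_options.allow_growth option.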

Bowery answered 22/2, 2019 at 17:33 Comment(3)

What is allow_soft_placement=True ? – Anson 26/5, 2019 at 20:50

Answered question within the post. – Bowery 11/11, 2019 at 14:5

What about for tf version 2.x – Angelika 3/8, 2023 at 5:40

TensorFlow 2.0 compatible answer: if we want to execute in graph mode in TensorFlow 2.0, the class through which we can configure inter_op_parallelism_threads and intra_op_parallelism_threads is

tf.compat.v1.ConfigProto.
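
A minimal sketch of how that could look, assuming a TensorFlow 2.x installation. The thread counts (4 and 2) and the matrix sizes are arbitrary values chosen for illustration, not recommendations.

```python
import tensorflow as tf

# Arbitrary example values; 0 would let TensorFlow pick the pool sizes.
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=2,
)

# Build the ops inside an explicit graph so they run in graph mode
# rather than eagerly.
with tf.Graph().as_default():
    a = tf.random.normal([256, 256])
    b = tf.random.normal([256, 256])
    c = tf.matmul(a, b)

    # The session picks up the thread-pool settings from the config.
    with tf.compat.v1.Session(config=config) as sess:
        result = sess.run(c)

print(result.shape)
```

As the proto comments quoted in the first answer note, the first session created in a process sets the inter-op thread count for later sessions unless use_per_session_threads is set, so the config should be passed to the first session you create.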

Suppress answered 14/2, 2020 at 8:32 Comment(0)

This works for me:

import tensorflow as tf

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
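
For readers who want to check that the settings took effect, here is a small sketch, assuming TensorFlow 2.x. The values 2 and 4 are arbitrary. As far as I know, the setters have to be called before TensorFlow runs its first op, because the thread pools cannot be resized once the runtime is initialized.

```python
import tensorflow as tf

# Call the setters before running any op; they are meant to be used
# before TensorFlow's runtime has been initialized.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(4)

# Read the settings back to confirm what is configured.
inter_threads = tf.config.threading.get_inter_op_parallelism_threads()
intra_threads = tf.config.threading.get_intra_op_parallelism_threads()
print(inter_threads, intra_threads)

# Ops executed from here on use the configured pool sizes.
x = tf.random.normal([512, 512])
y = tf.matmul(x, x)
```

This is the TF 2.x counterpart of passing the two fields through ConfigProto: the first call sizes the inter-op pool and the second the intra-op pool.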

Aspasia answered 5/6, 2023 at 16:53 Comment(1)

Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach? – Lezlielg 5/6, 2023 at 20:20
