Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.

Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. That is, it's the same compiled binary on both runs.

Running the program with time:

> export OMP_NUM_THREADS=1; time ./raytracer
real    0m34.344s
user    0m34.310s
sys     0m0.008s


> export OMP_NUM_THREADS=4; time ./raytracer
real    0m53.189s
user    0m20.677s
sys     0m0.096s

User time is a lot smaller than real, which is unusual when using multiple cores: user should be larger than real, since several cores are running at the same time.

Code that I have parallelized using OpenMP

void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {

        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
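
As an aside, I believe the same loop can be written more compactly by fusing the two pragmas and declaring the counters inside the loop, which makes them private automatically (a sketch; it should be equivalent and shouldn't change the timings):

void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    // parallel and for fused; loop-scoped counters need no private() clause
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < cam.height(); ++i) {
        for (int j = 0; j < cam.width(); ++j) {
            cam.computePixel(i, j);
        }
    }
}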

When reading this question I thought I had found my answer: it talks about glibc's implementation of rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!

Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.

In my raytracer I essentially have a scene tree with all the objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.

Can anyone suggest places where there could be some kind of contention? Things to try?

Thank you in advance.

EDIT: Sorry, was stupid not to provide more info on my system.

  • Compiler gcc 4.6 (with -O2 optimization)
  • Ubuntu Linux 11.10
  • OpenMP 3
  • Intel i3-2310M Quad core 2.1GHz (on my laptop at the moment)

Code for compute pixel:

class Camera {

    // constructors destructors
    private:
        // this is the array that is being written to, but not read from.
        Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {

    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene. 

    _sensor[i*_scrWidth+j] += col;
}

From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays). Could this cause these problems?

Lastditch answered 23/1, 2012 at 23:21 Comment(8)
Silly question: you're running a multiprocessor, and you're running an SMP version of the OS, correct? Q: What are the CPUs? Q: What is the OS/OS version? Q: Compiler (gcc?)/Compiler version? Q: OpenMP version?Halonna
Since you're computing pixel values, have you considered GPGPU programming?Ensample
Intel has some good tools to analyze thread performance, maybe those can give you hintsEnsample
Provided some info on my setup. The raytracer is destined to be run on CPU, so unfortunately I can't do GPGPU programming.Lastditch
A huge amount of system time suggests that either some synchronization happens somewhere or maybe things like page faults. You certainly need to profile it to understand what's wrong.Gerstein
Are the timings for a single call to render or is this some outer loop which calls render several times?Investigate
The timings are for a single call to render; a single image is rendered. @Alexey yes, I noticed that too. Some of that time is spent synchronizing rand() like I mentioned, but it is not the major contributor to the slowdown, unfortunately.Lastditch
I have updated the times, without the calls to rand(), and the system time is now very low; I guess rand() accounted for all the locking. The multithreaded version is still slower, though.Lastditch

Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.

As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads were trying to modify this state, it would cause a race condition, so the default implementation in glibc locks on every call to make the function thread-safe. This is terrible for performance.

Unfortunately, the solutions to this problem that I've seen on Stack Overflow are all local, i.e. they deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.

I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.

threadrand.h

#ifndef THREAD_RAND_H_
#define THREAD_RAND_H_

// max number of thread states to store
const int maxThreadNum = 100;

void init_threadrand();

// requires openmp, for thread number
int threadrand();

#endif // THREAD_RAND_H_

threadrand.cpp

#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>

// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];

// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}

Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
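
For example, a minimal sketch of how it can be used (the averaging loop here is made up for illustration; compile with g++ -fopenmp):

#include "threadrand.h"
#include <cstdlib> // RAND_MAX
#include <cstdio>

int main() {
    // seed one state per potential thread, once, from the main thread
    init_threadrand();

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; ++i) {
        // each thread indexes its own state via omp_get_thread_num(),
        // so there is no locking and no shared mutable state
        sum += threadrand() / (double)RAND_MAX;
    }

    printf("average: %f\n", sum / 1000000); // should be close to 0.5
    return 0;
}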

Lastditch answered 2/2, 2012 at 5:58 Comment(0)

Without knowing what machine you're running this on, and without really seeing the code of your computePixel function, the answer is: it depends.

There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.

Your OS thread scheduler is non-deterministic, so when a thread will run is not predictable. If it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.

Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will attempt to keep it on the same core. If more than one of your threads is running on a single core, they will have to share it; another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.

There are some other factors which I don't remember off the top of my head. However, I suggest doing some reading on threading; it's a complicated and extensive subject, and there's lots of material out there.

Is the data being written at the end something that the other threads need in order to do computePixel?

Ensample answered 23/1, 2012 at 23:45 Comment(2)
your suggestions about looking into the Intel tools (and perhaps Intel compiler), and considering GPU programming were both excellent suggestions. I also agree: based on the information given, the answer is indeed "It Depends". I'd still like to know what hardware and OS he's running, and what versions of the OS, compiler and OpenMP. And of course, more details about the code :)Halonna
Thanks for the suggestions! Can you suggest things to read? I have dealt with multiple threads before, and am aware of synchronization, but usually each thread was responsible for a different task, and not as computationally intensive as a raytracer.Lastditch

One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.

What could be happening is that each thread is trying to write a pixel value beside one written by another thread (they all write to the sensor array). If these two output values share the same CPU cache line, this forces the cache line to be flushed back and forth between the cores. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.

To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still, the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and having each thread work on a series of sequential rows (like 1..10, 11..20, 21..30, 31..40); see the sketch below. This would greatly reduce the sharing.
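
In OpenMP terms, something like this sketch (based on the loop in the question; untested) would hand each thread one contiguous band of rows:

// schedule(static) with no chunk size divides the rows into one
// contiguous block per thread, so writes to the sensor array from
// different threads stay far apart in memory.
#pragma omp parallel for schedule(static)
for (int i = 0; i < cam.height(); ++i) {
    for (int j = 0; j < cam.width(); ++j) {
        cam.computePixel(i, j);
    }
}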

Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.

Investigate answered 24/1, 2012 at 8:24 Comment(5)
Given the described setup, false sharing is hardly the cause. With #pragma omp for schedule(dynamic,4) applied to the outer loop, the minimal portion of work done by a thread consists of 4 rows of pixels. Perhaps schedule(static) can be tried, but OTOH it might cause load imbalance. There might be some internal data being shared, of course; we do not know.Gerstein
I said I don't know how OMP splits the rows: at first glance it looks like it'd interleave the rows. Even if it doesn't do this, the chance of a mutable variable in the const data is also there. Without closer inspection of the code it's not possible to know -- and certainly false sharing is a good candidate when making something threaded slows it down.Investigate
I meant that the data are stored by rows, so even if each row was done by a distinct thread, the only case of false sharing would be for a few pixels on distinct rows sharing the same cache line. The overhead should not be huge enough to slow things down below serial execution. I completely agree both that false sharing can happen with other modified data and that it needs to be considered in general.Gerstein
Thanks for the suggestion. I have also tried running without any writes to the _sensor data, and it had no effect on the times. In comparison to the other calculations, the writing of the final result is extremely small.Lastditch
Please look at my other answer, completely unrelated, about your CPU not having 4 cores.Investigate

I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find that in general hyper-threading is totally useless when you have a lot of calculations; on my laptop I turned it off and got much better compilation times for my projects.
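
For example, reusing the timing command from the question:

> export OMP_NUM_THREADS=2; time ./raytracer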

In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

Investigate answered 24/1, 2012 at 15:10 Comment(2)
HT hides memory latency and can improve the performance of computational code with high fetch/compute ratio by 30-40% or even more.Duron
It can only create the illusion of more cores if your threads actually stall a lot. In tight calculation loops, or even with sequential memory access, this simply isn't the case, and HT becomes a hindrance.Investigate
