Simulate tearing a double in C#

I'm running on a 32-bit machine, and I'm able to confirm that long values can tear using the following code snippet, which hits the assert very quickly.

        static void TestTearingLong()
        {
            System.Threading.Thread A = new System.Threading.Thread(ThreadA);
            A.Start();

            System.Threading.Thread B = new System.Threading.Thread(ThreadB);
            B.Start();
        }

        static ulong s_x;

        static void ThreadA()
        {
            int i = 0;
            while (true)
            {
                s_x = (i & 1) == 0 ? 0x0L : 0xaaaabbbbccccddddL;
                i++;
            }
        }

        static void ThreadB()
        {
            while (true)
            {
                ulong x = s_x;
                Debug.Assert(x == 0x0L || x == 0xaaaabbbbccccddddL);
            }
        }

But when I try something similar with doubles, I'm not able to get any tearing. Does anyone know why? As far as I can tell from the spec, only assignment to a float is guaranteed to be atomic, so assignment to a double should carry a risk of tearing.

    static double s_x;

    static void TestTearingDouble()
    {
        System.Threading.Thread A = new System.Threading.Thread(ThreadA);
        A.Start();

        System.Threading.Thread B = new System.Threading.Thread(ThreadB);
        B.Start();
    }

    static void ThreadA()
    {
        long i = 0;

        while (true)
        {
            s_x = ((i & 1) == 0) ? 0.0 : double.MaxValue;
            i++;

            if (i % 10000000 == 0)
            {
                Console.Out.WriteLine("i = " + i);
            }
        }
    }

    static void ThreadB()
    {
        while (true)
        {
            double x = s_x;

            System.Diagnostics.Debug.Assert(x == 0.0 || x == double.MaxValue);
        }
    }
Mirna answered 25/1, 2012 at 18:52 Comment(8)
Stupid question - what is tearing?Doorframe
Reads and writes of ints are guaranteed to be atomic with respect to access by multiple threads. Not so with longs. Tearing is reading a mix of two different written values (bad). He's wondering why the same isn't seen with doubles, since doubles don't guarantee atomic reads and writes either.Virchow
@Oded: On 32 bit machines, only 32 bits are written at a time. If you are writing a 64 bit value on a 32 bit machine, and writing to the same address at the same time on two different threads, you actually have four writes, not two, because the writes are done 32 bits at a time. It is therefore possible for the threads to race, and when the smoke clears the variable contains the top 32 bits written by one thread, and the bottom 32 bits written by the other. So you can write 0xDEADBEEF00000000 on one thread and 0x00000000BAADF00D on another, and end up with 0xDEADBEEFBAADF00D in memory, a value that neither thread wrote.Glorianna
@EricLippert - So, essentially an issue with operations on 64 bit value not being atomic on 32 bit machines?Doorframe
Sorry, should have defined tearing. I meant it exactly in the sense that hatchet and Eric have said.Mirna
@Oded: That is exactly right.Glorianna
@EricLippert - Thank you very much for a clear and concise explanation.Doorframe
See: Why doesn't this code demonstrate the non-atomicity of reads/writes?Photoperiod
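To make the tearing described in the comments above concrete, here is a minimal sketch that builds a torn value deterministically by splicing the upper 32 bits of one 64-bit value onto the lower 32 bits of another. The class and method names (`TearDemo`, `TearBits`, `TearDouble`) are made up for illustration and do not appear in the question; no threads are involved, this only shows what a reader could observe if it ran between the two 32-bit halves of a write.

```csharp
using System;

static class TearDemo
{
    // Upper 32 bits come from one value, lower 32 bits from the other.
    // This is the result a reader sees when two 64-bit writes,
    // each split into two 32-bit stores, interleave.
    public static ulong TearBits(ulong highSource, ulong lowSource)
    {
        return (highSource & 0xFFFFFFFF00000000UL) | (lowSource & 0x00000000FFFFFFFFUL);
    }

    // Same idea for a double, applied to its raw IEEE 754 bit pattern.
    public static double TearDouble(double highSource, double lowSource)
    {
        unchecked
        {
            ulong hi = (ulong)BitConverter.DoubleToInt64Bits(highSource);
            ulong lo = (ulong)BitConverter.DoubleToInt64Bits(lowSource);
            return BitConverter.Int64BitsToDouble((long)TearBits(hi, lo));
        }
    }
}
```

For example, `TearDemo.TearBits(0xDEADBEEF00000000UL, 0x00000000BAADF00DUL)` yields `0xDEADBEEFBAADF00D`. For the double test in the question, `TearDemo.TearDouble(double.MaxValue, 0.0)` has the bit pattern `0x7FEFFFFF00000000`, which is neither 0.0 nor double.MaxValue, so the assert in ThreadB would fire if such a value were ever read.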
static double s_x;

It is much harder to demonstrate the effect when you use a double. The CPU uses dedicated instructions to load and store a double, FLD and FSTP respectively. It is much easier with a long, since there is no single instruction that loads or stores a 64-bit integer in 32-bit mode. To observe tearing with a double, the variable's address must be misaligned so that it straddles a CPU cache line boundary.

That will never happen with the declaration you used: the JIT compiler ensures that the double is aligned properly, stored at an address that's a multiple of 8. You could store it in a field of a class instead, since the GC allocator only aligns to 4 in 32-bit mode. But that's a crap shoot.

The best way to do it is to intentionally misalign the double by using a pointer. Put unsafe in front of the Program class and make it look similar to this:

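    // Needs "using System.Runtime.InteropServices;" and compilation with /unsafe
    static double* s_x;

    static void Main(string[] args) {
        // Allocate unmanaged memory and point s_x at a deliberately misaligned offset
        var mem = Marshal.AllocCoTaskMem(100);
        s_x = (double*)((long)(mem) + 28);
        TestTearingDouble();
    }

    // In ThreadA, write through the pointer:
            *s_x = ((i & 1) == 0) ? 0.0 : double.MaxValue;

    // In ThreadB, read through the pointer:
            double x = *s_x;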

This still won't guarantee a good misalignment (hehe), since there's no way to control exactly where AllocCoTaskMem() will place the allocation relative to the start of a CPU cache line. It also depends on the cache associativity in your CPU core (mine is a Core i5). You'll have to tinker with the offset; I got the value 28 by experimentation. The value should be divisible by 4 but not by 8 to truly simulate the GC heap behavior. Keep adding 8 to the value until the double straddles a cache line and triggers the assert.

To make it less artificial, you'll have to write a program that stores the double in a field of a class and gets the garbage collector to move it around in memory so that it ends up misaligned. It is kinda hard to come up with a sample program that ensures this happens.

Also note how your program can demonstrate a problem called false sharing. Comment out the Start() method call for thread B and note how much faster thread A runs. You are seeing the cost of the CPU keeping the cache line consistent between the CPU cores. Sharing is intended here, since the threads access the same variable. Real false sharing happens when threads access different variables that are stored in the same cache line. This is also why alignment matters: you can only observe the tearing for a double when part of it is in one cache line and part of it is in another.
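The offset tinkering described above can be reasoned about with a little address arithmetic. The following is a minimal sketch, assuming a 64-byte cache line (typical for recent x86 CPUs, but not guaranteed and not stated in this answer); `CacheLineMath`, `StraddlesCacheLine` and `FindStraddlingOffset` are hypothetical helper names, not part of any library.

```csharp
static class CacheLineMath
{
    // Assumption: 64-byte cache lines. Check your CPU's documentation.
    public const int AssumedCacheLineSize = 64;

    // True if the bytes [address, address + size) fall into two different cache lines.
    public static bool StraddlesCacheLine(long address, int size = 8,
                                          int lineSize = AssumedCacheLineSize)
    {
        long firstLine = address / lineSize;
        long lastLine = (address + size - 1) / lineSize;
        return firstLine != lastLine;
    }

    // Starting at startOffset, keeps adding 8 until the 8-byte value at
    // baseAddress + offset straddles a cache line. Returns -1 if no such
    // offset exists up to and including maxOffset.
    public static int FindStraddlingOffset(long baseAddress, int startOffset, int maxOffset)
    {
        for (int offset = startOffset; offset <= maxOffset; offset += 8)
        {
            if (StraddlesCacheLine(baseAddress + offset))
            {
                return offset;
            }
        }
        return -1;
    }
}
```

With the 100-byte buffer from the snippet above, a call such as `CacheLineMath.FindStraddlingOffset((long)mem, 28, 92)` would suggest an offset to try, with 92 being the last offset that still leaves room for 8 bytes. Under the 64-byte assumption, an 8-byte value whose address is 4 modulo 8 straddles a line only when its address is 60 modulo 64, which is why most offsets never trigger the assert.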

Rihana answered 29/1, 2012 at 15:8 Comment(2)
I don't understand how cache line boundary crossing can cause tearing. I thought this was only caused by the value taking up more space than the size of a register. Can you please elaborate on this a bit more?Wellchosen
@Wellchosen - it is an entirely different effect, not associated with register size. Focus on the last paragraph and note how CPU cache synchronization has a cache line as the unit of update. A misaligned double that straddles a line requires two updates, similar to the way a long requires two register writes, which takes enough time to allow code that runs on another core to observe the tearing.Rihana

As strange as it sounds, that depends on your CPU. While doubles are not guaranteed not to tear, they won't on many current processors. Try an AMD Sempron if you want tearing in this situation.

EDIT: Learned that the hard way a few years ago.
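Since tearing of a double depends on the CPU and is not ruled out by the language, code that must never see a torn value cannot rely on a plain field. One common way to get that guarantee is the Interlocked class; this is a minimal sketch, not something taken from this answer, and the `AtomicDouble` wrapper name is made up for illustration.

```csharp
using System.Threading;

static class AtomicDouble
{
    private static double s_value;

    // Interlocked.Exchange stores all 64 bits as one atomic operation.
    public static void Write(double value)
    {
        Interlocked.Exchange(ref s_value, value);
    }

    // CompareExchange with identical comparand and replacement never changes
    // the stored value; it is used here only to read all 64 bits atomically.
    public static double Read()
    {
        return Interlocked.CompareExchange(ref s_value, 0.0, 0.0);
    }
}
```

The trade-off is that every access becomes an interlocked operation, which is noticeably more expensive than a plain load or store in a tight loop like the ones in the question.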

Speckle answered 25/1, 2012 at 18:59 Comment(7)
TBH I don't have the slightest idea, never looked into it. A daemon of mine (Free Pascal of all languages) started spuriously producing absurd results on one and only one machine out of many (maybe 100), all set up from the same image etc. Turned out it was a global double that got updated by the main thread and a GTK-created secondary thread. No locking primitives in FPK then ... (expletive, expletive)Speckle
Yeah, I wouldn't doubt it if the MMX or SSE extensions on the CPU had something to do with this.Mosqueda
The machine that I'm testing on says "Intel Xeon CPU E5620 @ 2.40 GHz (2 processors)". Any idea if I can expect doubles not to tear in general when running on Intel Xeons?Mirna
AFAIK doubles will not tear on Intel "Core" architectures or anything newer, but please do not take this for granted - the next generation might revert to the older model for some obscure performance reason.Speckle
@MichaelCovelli - It sounds like you're really trying to wring some performance out of this application. If this is really that important, what I'd recommend you do is provide both implementations in your program; when it starts up, have it run this exact test to find out what implementation to turn on. If the test is expensive, you could try to do things like caching it when the software is installed, or by reading CPUID every time the machine starts, and re-run the test if it changes.Mosqueda
I think this is likely something CPU-related, as Eugen said. But I'm still a little hazy on the details. If my above test app fails to find tearing on the Intel CPUs that I'm using, should I assume that it's really impossible?Mirna
I have looked at the disassembly and it seems that the reads and writes from the long are translated into 2 instructions but the doubles seem to be happening in just one step. But I've never spent much time looking at the disassembly before so I'm not sure if this really means that doubles can't tear here.Mirna

Doing some digging, I've found some interesting reads concerning floating-point operations on x86 architectures:

According to Wikipedia, the x86 floating-point unit stores floating-point values in 80-bit registers:

[...] subsequent x86 processors then integrated this x87 functionality on chip which made the x87 instructions a de facto integral part of the x86 instruction set. Each x87 register, known as ST(0) through ST(7), is 80 bits wide and stores numbers in the IEEE floating-point standard double extended precision format.

Also this other SO question is related: Some floating point precision and numeric limits question

This could explain why, although doubles are 64 bits wide, they appear to be read and written atomically.

Wellchosen answered 29/1, 2012 at 9:19 Comment(0)

For what it's worth, this topic and code sample can be found here:

http://msdn.microsoft.com/en-us/magazine/cc817398.aspx

Topsyturvy answered 29/1, 2012 at 9:47 Comment(2)
That article only talks about long, not double.Wellchosen
Agreed. Actually, I think that the sample code that I posted in the question is from that post (except for the double stuff). (I had it in a Test project and had forgotten about it for a while).Mirna
