Rust - unsafely share mutable data between threads without a mutex

How can old-school multi-threading (without a wrapping mutex) be achieved in Rust? And why is it undefined behavior?

I have to build a highly concurrent physics simulation. I am supposed to do it in C, but I chose Rust (I really needed its higher-level features).

By using Rust, I should opt for safe communication between threads; however, I must use a mutable buffer shared between the threads. (Actually, I have to implement several different techniques and benchmark them.)

First approach

  1. Use Arc<Data> to share non-mutable state.

  2. Use transmute to promote & to &mut when needed.

It was straightforward, but the compiler refuses to compile this even inside an unsafe block: the deny-by-default mutable_transmutes lint turns it into a hard error, because the compiler can apply optimizations assuming this data is immutable (maybe cached and never re-read; I'm not an expert on the details).

This kind of optimization can be prevented with the Cell wrapper and its relatives.
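
For reference, the rejected code looked roughly like this (a minimal sketch; Data stands in for my actual buffer):

use std::sync::Arc;

struct Data {
    value: f32,
}

fn main() {
    let shared: Arc<Data> = Arc::new(Data { value: 1.0 });

    // Rejected at compile time: the deny-by-default `mutable_transmutes`
    // lint makes this an error, because transmuting &T to &mut T is
    // undefined behavior even if the resulting reference is never used.
    let data: &mut Data = unsafe { std::mem::transmute::<&Data, &mut Data>(&*shared) };
    data.value *= 2.0;
}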

Second approach

  1. Use Arc<UnsafeCell<Data>>.

  2. Then data.get() to access data.

This does not compile either. The reason is that UnsafeCell is not Sync, which makes Arc<UnsafeCell<Data>> not Send. The solution would be SyncUnsafeCell, but it is still unstable at the moment (1.66), and the program will be compiled and put into production on a machine with only the stable toolchain.
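
A stable stand-in would be a small wrapper that re-adds the Sync impl; here is a minimal sketch (SharedCell is a made-up name, and the Sync bound mirrors the one std uses for SyncUnsafeCell):

use std::cell::UnsafeCell;

// Hypothetical stable stand-in for the unstable SyncUnsafeCell.
#[repr(transparent)]
pub struct SharedCell<T>(UnsafeCell<T>);

// SAFETY: this only promises the compiler that sharing &SharedCell<T>
// across threads is allowed; all synchronization of the inner data
// remains the caller's responsibility.
unsafe impl<T: Sync> Sync for SharedCell<T> {}

impl<T> SharedCell<T> {
    pub fn new(value: T) -> Self {
        Self(UnsafeCell::new(value))
    }

    // Returns a raw pointer to the wrapped value, like UnsafeCell::get.
    pub fn get(&self) -> *mut T {
        self.0.get()
    }
}

With this, Arc<SharedCell<Data>> compiles, but unsynchronized concurrent writes through the returned pointer are still a data race and therefore UB.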

Third approach

  1. Use Arc<Mutex<Data>>.

  2. At the beginning of each thread:

    • Lock the mutex.

    • Keep a *mut by coercing a &mut.

    • Release the mutex.

  3. Use the *mut when needed.

I haven't tried this one yet, but even if it compiles, is it as safe (not talking about data races) as it would be with SyncUnsafeCell?
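
For concreteness, the pattern I have in mind looks roughly like this (a sketch; Data is again a placeholder):

use std::sync::{Arc, Mutex};
use std::thread;

struct Data {
    values: Vec<f32>,
}

fn main() {
    let shared = Arc::new(Mutex::new(Data { values: vec![0.0; 10] }));

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Lock, coerce the &mut to a *mut, then release the lock.
                let ptr: *mut Data = {
                    let mut guard = shared.lock().unwrap();
                    &mut *guard as *mut Data
                };
                // The pointer has now escaped the critical section, so
                // writes through it are no longer synchronized by the mutex.
                unsafe { (*ptr).values[0] += 1.0 };
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}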

PS: The values mutated concurrently are just f32s; there is absolutely no memory allocation or other complex operation happening concurrently. Worst case scenario, I have scrambled some f32s.

Mapp asked 27/1, 2023 at 0:2 Comment(12)
"Is it safe?" Looks like UB. Other than that: You can probably just copy the implementation of SyncUnsafeCell from std, there doesn't seem to be anything magic about it. Btw, how about a Vec<AtomicU32> (because there's no AtomicF32), and r/w to it through f32::from/to_bits?Pairoar
So if you're doing a "highly concurrent physic simulation" then what value does such a thing have if the results are wrong? If you have threading/concurrency errors, eventually you're going to stomp on yourself. So what was the point of going through that effort? I know this sounds dismissive, but I'm serious: where's the upside here? Spend the time to do it right.Bistort
Perhaps the question devolves down to: "how to share a mutable unsynchronized buffer between threads without introducing race conditions"?Duran
I think @Pairoar is onto something here. On most targets, AtomicU32 is zero-overhead and gives you exactly what you need.Obtund
Does AtomicU32 really have zero overhead when fetching and writing back? That could be the best approach indeed.Mapp
@Mapp See the assembly in my answerObtund
My knowledge on this is a bit fuzzy, but calling AtomicU32 zero-overhead is a bit misleading, I think: if it does incur synchronization overhead between cores through the cache-coherency mechanisms, it can be quite heavy. For example, several concurrent threads incrementing the same AtomicU32 will only get a few thousand increments done per second, IIRC.Pairoar
Of course. No overhead in comparison with a regular f32. Any atomic op has to be... well, atomic.Mapp
@Pairoar That is also true for a normal f32, though. For example, if you access the same uint32_t in C++ from multiple threads, you also get the cache-coherency slowdown. It's more of an architectural thing than a code overhead. If you look at what it compiles to, there is zero difference between using an atomic and a normal f32.Obtund
@Pairoar "only get a few 1000 increments" - I strongly disagree: play.rust-lang.org/…. Try it out on your own machine. I have an 8 core machine and get >50 million/second. For reference: I get ~141 mil/s with one thread and ~51 mil/s with 8 threads on my 8 thread machine. So yes, there is an overhead, but by far not as bad as you make it seem. And it most likely only occurs if you actually access the same variable from multiple cores; OP does not intend to do that much.Obtund
Based on their description it's most likely a space partitioned simulation, and every thread only accesses its own values, plus some border exchange.Obtund
@Pairoar So what does AtomicU32 actually mean compared to u32, if u32 is already atomic? Two things: volatility (meaning: it cannot be optimized into registers) and prevention of instruction reordering (hence the Ordering parameters). Apart from that, loads and stores are identical between AtomicU32 and u32.Obtund

Disclaimer: there are probably many ways to solve this; this is just one of them, based on the idea of @Pairoar.

Two main points of this post:

  • You can use AtomicU32 to share f32 between threads without any performance penalty (given an architecture where u32 loads and stores are already atomic).
  • You can use std::thread::scope to avoid the overhead of Arc.
use std::{
    fmt::Debug,
    ops::Range,
    sync::atomic::{AtomicU32, Ordering},
};

struct AtomicF32(AtomicU32);
impl AtomicF32 {
    pub fn new(val: f32) -> Self {
        Self(AtomicU32::new(val.to_bits()))
    }
    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.0.load(order))
    }
    pub fn store(&self, val: f32, order: Ordering) {
        self.0.store(val.to_bits(), order)
    }
}
impl Debug for AtomicF32 {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        self.load(Ordering::Relaxed).fmt(f)
    }
}

fn perform_action(data: &[AtomicF32], range: Range<usize>) {
    for value_raw in &data[range] {
        let mut value = value_raw.load(Ordering::Relaxed);
        value *= 2.5;
        value_raw.store(value, Ordering::Relaxed);
    }
}

fn main() {
    let data = (1..=10)
        .map(|v| AtomicF32::new(v as f32))
        .collect::<Vec<_>>();

    println!("Before: {:?}", data);

    std::thread::scope(|s| {
        s.spawn(|| perform_action(&data, 0..5));
        s.spawn(|| perform_action(&data, 5..10));
    });

    println!("After: {:?}", data);
}
Output:

Before: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
After: [2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0]

To demonstrate how lightweight this is, here is what it compiles to on x86-64:

use std::sync::atomic::{AtomicU32, Ordering};

pub struct AtomicF32(AtomicU32);
impl AtomicF32 {
    fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.0.load(order))
    }
    fn store(&self, val: f32, order: Ordering) {
        self.0.store(val.to_bits(), order)
    }
}

pub fn perform_action(value_raw: &AtomicF32) {
    let mut value = value_raw.load(Ordering::Relaxed);
    value *= 2.5;
    value_raw.store(value, Ordering::Relaxed);
}
.LCPI0_0:
        .long   0x40200000
example::perform_action:
        movss   xmm0, dword ptr [rdi]
        mulss   xmm0, dword ptr [rip + .LCPI0_0]
        movss   dword ptr [rdi], xmm0
        ret

Note that while this contains zero undefined behaviour, it is still the programmer's responsibility to avoid read-modify-write race conditions.
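
If several threads can hit the same element, the read-modify-write itself can be made atomic too. One option (a sketch, not part of the solution above; fetch_mul is a name I made up) extends the AtomicF32 wrapper using AtomicU32::fetch_update, which retries a compare-and-swap loop until the store succeeds:

impl AtomicF32 {
    // Atomically multiplies the value by `factor` and returns the previous
    // value. fetch_update re-runs the closure until the compare-and-swap
    // succeeds, so no concurrent update is lost.
    pub fn fetch_mul(&self, factor: f32, order: Ordering) -> f32 {
        let prev = self
            .0
            .fetch_update(order, Ordering::Relaxed, |bits| {
                Some((f32::from_bits(bits) * factor).to_bits())
            })
            .expect("the closure never returns None");
        f32::from_bits(prev)
    }
}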

Obtund answered 27/1, 2023 at 10:49 Comment(1)
On both x86-64 and ARM, relaxed stores and loads are equivalent to non-atomic stores and loads. On x86-64, acquire loads and release stores are also free.Slum
