What are RSEQs (Restartable Sequences) and how to use them?
Asked Answered
V

1

12

Linux 4.18 introduced the rseq(2) syscall. I found only one question on SO that mentions rseq at all, and there is relatively little information about it online, so I decided to ask. What are restartable sequences and how can they be used by programmers?

I had to search for restartable sequences: fast user-space percpu critical sections to get any meaningful results. I was able to find the commit that added the relevant functionality to the kernel. Further research lead me to the 2013 presentation that I believe to be the first introduction of the concept. Much work has been done by a team working at a company called EfficiOS. They describe what are their intentions behind contributing this feature to the Linux kernel.

It looks like this feature is very little known, but apparently it is used to optimize performance in the TCMalloc allocator. Generally, it seems that it is some kind of concurrency optimization.

Although the listed sources provide background information there is yet to be an explanation of RSEQs provided on SO. It would be useful to hear where else and how are they used in practice.

Edit: Practical example

Suppose I am creating a C++ job system. A part of it is a lock-free multi-producer single-consumer queue. How could I introduce the use of the rseq(2) syscall into my code to potentially improve its performance?

class mpsc_list_node
{
    mpsc_list_node* _next;

    template<typename T>
        requires std::derived_from<T, mpsc_list_node>
    friend class mpsc_list;
};

template<typename T>
    requires std::derived_from<T, mpsc_list_node>
class mpsc_list
{
private:
    std::atomic<T*> head{ nullptr };

private:
    static constexpr size_t COMPLETED_SENTINEL = 42;

public:
    mpsc_list() noexcept = default;
    mpsc_list(mpsc_list&& other) noexcept :
        head{ other.head.exchange(reinterpret_cast<T*>(COMPLETED_SENTINEL), std::memory_order_relaxed) }
    {
    }

    bool try_enqueue(T& to_append)
    {
        T* old_head = head.load(std::memory_order_relaxed);
        do
        {
            if (reinterpret_cast<size_t>(old_head) == COMPLETED_SENTINEL)
                [[unlikely]]
                {
                    return false;
                }
            to_append._next = old_head;
        } while (!head.compare_exchange_weak(old_head, &to_append, std::memory_order_release, std::memory_order_relaxed));
        return true;
    }

    template<typename Func>
    void complete_and_iterate(Func&& func) noexcept(std::is_nothrow_invocable_v<Func, T&>)
    {
        T* p = head.exchange(reinterpret_cast<T*>(COMPLETED_SENTINEL), std::memory_order_acquire);
        while (p)
            [[likely]]
            {
                T* cur  = p;
                T* next = static_cast<T*>(p->_next);
                p       = next;
                func(*cur);
            }
    }
};

The purpose of this mpsc_list and its place in the job system is well explained by my README:

Inter-job synchronization

The only synchronization primitive used is an atomic counter. This idea stems from the famous GDC talk on the implementation of a job system using fibers in the Naughty Dog's game engine.

The common promise type of the jobs (promise_base) is actually a derived type from mpsc_list, a multi-producer single-consumer list. This list stores the jobs dependent on the current job. It is a lock-less linked list implemented using atomic operations. Each node stores a pointer to the dependent's promise and the next node. Interestingly, this linked list does not use any dynamic memory allocation.

When a job co_awaits a (possibly 1-sized) set of dependency jobs, it does a few things. Firstly, its promise sets its own internal atomic counter to the number of dependency jobs. Then, it allocates (on the stack) a dependency-count-sized array of notifier objects. The notifier type is the type of the linked list's node. The created notifiers all point to the job being suspended. They do not have a next node.

Then, the job goes through each of its dependency jobs and tries to append the corresponding notifier to the dependency's list. If that dependency has already completed, this operation fails. This is because when a job returns, it sets its list's head to a special sentinel value. If the dependency has already completed (for example, on a different thread), then the suspending job simply decrements its own atomic counter. If the dependency hasn't completed yet, it appends the notifier to that dependency's list. It does so using a CAS loop.

After having gone through each of the dependencies, the suspending job checks how many of its dependencies have already completed. If all of them have, then it doesn't suspend and immediately continues execution. This isn't just an optimization. This is necessary for the job system to function properly. This is because this job system does not have a suspended job queue. The job system only has a ready job queue. Suspended jobs are stored only inside their dependencies' linked lists. Hence, if a job would suspend, while not having any dependencies, it would never get resumed.

When a job returns, it traverses its linked list of dependents. Firstly, it sets the list's head to the special sentinel value. Then, it goes through all of the jobs, atomically decrementing their atomic counters. The decrement is a RMW operation, so the job reads the counter's previous value. If it is one, then it knows that it is the last dependency to complete for that job, and it pushes it to the job queue.

Verret answered 25/4, 2023 at 14:33 Comment(0)
G
2

The rseq syscall was introduced to allow for restartable sequences, which were invented to allow for per-cpu variables in userspace. Per-cpu variables are easy in kernel-space as you can disable preemption at any time you reference a per-cpu.

In userspace you have to account for unwanted preemption if you want to handle your per-cpu variables consistently with respect to the current thread/CPU. For that the restartable sequences mechanism was introduced

Garaway answered 11/10, 2023 at 0:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.