libuv allocated memory buffers re-use techniques

I am using libuv for my heavily network-bound application, and I am concerned about which techniques for re-using allocated memory would be both efficient and safe given libuv's deferred execution of callbacks.

At the most basic layer exposed to the libuv user, one needs to specify a buffer allocation callback alongside setting up a handle reader:

UV_EXTERN int uv_read_start(uv_stream_t*, uv_alloc_cb alloc_cb, uv_read_cb read_cb);

where uv_alloc_cb is

typedef void (*uv_alloc_cb)(uv_handle_t* handle, size_t suggested_size, uv_buf_t* buf);

But here is the problem: this memory-allocating callback is invoked each time a new message comes in via the handle (say, each time a UDP datagram is received on a uv_udp_t handle), and a straightforward allocation of a new buffer for each incoming UDP datagram seems very wasteful of memory.
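
For illustration, the straightforward per-read allocation I would like to avoid looks roughly like this (a sketch for the stream case; error handling and the actual data processing are omitted):

void naive_alloc_cb(uv_handle_t *handle, size_t suggested_size, uv_buf_t *buf) {
    /* Naive approach: a fresh heap allocation for every incoming message. */
    *buf = uv_buf_init(malloc(suggested_size), suggested_size);
}

void naive_read_cb(uv_stream_t *stream, ssize_t nread, const uv_buf_t *buf) {
    /* ... process the data ... */
    free(buf->base); /* and immediately throw the buffer away */
}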

So I am asking about common C techniques (probably within the deferred execution context introduced by the libuv callback system) for having the same allocated memory re-used whenever possible.

Also, I would like to stay Windows-portable, if possible.

Notes:

  • I am aware of this question: Does libuv provide any facilities to attach a buffer to a connection and re use it; its accepted answer does not explain how to actually do the memory allocation right with libuv, beyond stating that a statically-allocated buffer is a no-go. In particular, it does not cover the safety aspect of a buffer attached to a handle (either via a wrapper structure or the handle->data context), where deferred write callbacks on the same buffer can overlap with another read callback invocation across multiple iterations of the libuv main loop.
  • Reading http://nikhilm.github.io/uvbook/filesystem.html, I noticed the following phrase under the snippet uvtee/main.c - Write to pipe:

    We make a copy so we can free the two buffers from the two calls to write_data independently of each other. While acceptable for a demo program like this, you’ll probably want smarter memory management, like reference counted buffers or a pool of buffers in any major application.

    but I wasn't able to find any solutions involving reference counting on libuv buffers (how could this be done properly?) or explicit examples of buffer pools in a libuv environment (are there any libraries for that?).

Abduce answered 14/2, 2015 at 1:44 Comment(0)

I would like to share my own experience in solving this problem. I can feel your pain and confusion, but in reality, it's not overly hard to implement a working solution considering the vast array of options you have if you know what you are doing.

Objective

  1. Implement a pool of buffers capable of performing two operations - acquire and release.

  2. The basic pooling strategy:

    • acquire withdraws a buffer from the pool effectively reducing the available number of buffers by 1;
    • If there are no buffers available, two options arise:
      • grow the pool and return a newly created buffer; or
      • create and return a dummy buffer (explained below).
    • release returns a buffer back to the pool.
  3. The pool may be of fixed or variable size. "Variable" means that initially there are M pre-allocated buffers (zero for instance), and the pool can grow on demand up to N. "Fixed" means that all buffers are pre-allocated upon the pool's creation (M = N).

  4. Implement a callback which acquires buffers for libuv.

  5. Do not allow infinite pool growth, while keeping the pool functional under all circumstances except out-of-memory situations.

Implementation

Now, let's shed some more light on all this in detail.

The pool structure:

#define BUFPOOL_CAPACITY 100

typedef struct bufpool_s bufpool_t;

struct bufpool_s {
    void *bufs[BUFPOOL_CAPACITY];
    int size;
};

size is the current pool size.

A buffer itself is a memory block prefixed with the following structure:

#define bufbase(ptr) ((bufbase_t *)((char *)(ptr) - sizeof(bufbase_t)))
#define buflen(ptr) (bufbase(ptr)->len)

typedef struct bufbase_s bufbase_t;

struct bufbase_s {
    bufpool_t *pool;
    int len;
};

len is the length of the buffer in bytes.

Allocation of a new buffer looks like this:

void *bufpool_alloc(bufpool_t *pool, int len) {
    bufbase_t *base = malloc(sizeof(bufbase_t) + len);
    if (!base) return 0;
    base->pool = pool;
    base->len = len;
    return (char *)base + sizeof(bufbase_t);
}

Note that the returned pointer points to the first byte after the header, i.e. the data area. This lets buffer pointers be treated as though they had been allocated via a standard call to malloc.

Deallocation is the opposite:

void bufpool_free(void *ptr) {
    if (!ptr) return;
    free(bufbase(ptr));
}

The allocation callback for libuv looks like this:

void alloc_cb(uv_handle_t *handle, size_t size, uv_buf_t *buf) {
    int len;
    void *ptr = bufpool_acquire(handle->loop->data, &len);
    *buf = uv_buf_init(ptr, len);
}

You can see here that alloc_cb takes a buffer pool's pointer from the user data pointer on the loop. That means that a buffer pool should be attached to an event loop prior to usage. In other words, you should initialize a pool upon creation of the loop and assign its pointer to the data field. If you already hold other user data in that field, just extend your structure.
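
For example, if the loop's data field already carries other state, a small wrapper structure will do (a sketch with hypothetical names, not part of the code above):

/* Hypothetical wrapper, in case loop->data must also carry other user data. */
typedef struct loop_ctx_s {
    bufpool_t pool;  /* the buffer pool described above */
    void *app_state; /* whatever else you already keep on the loop */
} loop_ctx_t;

/* In alloc_cb, the pool would then be reached as:
     loop_ctx_t *ctx = handle->loop->data;
     void *ptr = bufpool_acquire(&ctx->pool, &len);  */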

A dummy buffer is a fake buffer in the sense that it doesn't originate from the pool but is still fully functional. The purpose of dummy buffers is to keep the whole thing working in rare situations of pool starvation, i.e. when all buffers are acquired and another one is needed. Based on my research, allocation of small memory blocks of about 8 KB is done very quickly on all modern OSes, which suits the size of a dummy buffer well.

#define DUMMY_BUF_SIZE 8000

void *bufpool_dummy() {
    return bufpool_alloc(0, DUMMY_BUF_SIZE);
}

The acquire operation:

void *bufpool_acquire(bufpool_t *pool, int *len) {
    void *buf = bufpool_dequeue(pool);
    if (!buf) buf = bufpool_dummy();
    *len = buf ? buflen(buf) : 0;
    return buf;
}

The release operation:

void bufpool_release(void *ptr) {
    bufbase_t *base;
    if (!ptr) return;
    base = bufbase(ptr);
    if (base->pool) bufpool_enqueue(base->pool, ptr);
    else free(base);
}

There are two functions left - bufpool_enqueue and bufpool_dequeue. Basically, they perform all of the pool's work.

In my case, there is an O(1) queue of buffer indexes on top of what is described above, which lets me track the pool's state more efficiently and fetch buffer indexes very quickly. It's not necessary to go to that extreme, because the maximum size of the pool is limited and any array search is therefore bounded by a constant as well.

In the simplest case, you can implement these functions as plain linear searches over the bufs array in the bufpool_s structure. For example, when a buffer is acquired, you search for the first non-NULL slot, save the pointer and put NULL in that slot. When the buffer is later released, you search for the first NULL slot and store the pointer there.

The pool internals are as follows:

#define BUF_SIZE 64000

void *bufpool_grow(bufpool_t *pool) {
    int idx = pool->size;
    void *buf;
    if (idx == BUFPOOL_CAPACITY) return 0;
    buf = bufpool_alloc(pool, BUF_SIZE);
    if (!buf) return 0;
    pool->bufs[idx] = 0;
    pool->size = idx + 1;
    return buf;
}

void bufpool_enqueue(bufpool_t *pool, void *ptr) {
    int idx;
    for (idx = 0; idx < pool->size; ++idx) {
        if (!pool->bufs[idx]) break;
    }
    assert(idx < pool->size);
    pool->bufs[idx] = ptr;
}

void *bufpool_dequeue(bufpool_t *pool) {
    int idx;
    void *ptr;
    for (idx = 0; idx < pool->size; ++idx) {
        ptr = pool->bufs[idx];
        if (ptr) {
            pool->bufs[idx] = 0;
            return ptr;
        }
    }
    return bufpool_grow(pool);
}

The normal buffer size is 64000 bytes, because I want it to fit comfortably in a 64 KB block together with its header.

And finally, initialization and de-initialization routines:

void bufpool_init(bufpool_t *pool) {
    pool->size = 0;
}

void bufpool_done(bufpool_t *pool) {
    int idx;
    for (idx = 0; idx < pool->size; ++idx) bufpool_free(pool->bufs[idx]);
}

Please note that this implementation is simplified for illustrative purposes. There is no pool shrinking policy here, whereas in a real-world scenario one will most likely be required; a rough sketch follows below.
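
Not part of the original answer, but as a hedged sketch of what a shrinking policy could look like with the structures above (bufpool_shrink and BUFPOOL_LOWWATER are hypothetical names): free available buffers from the tail of the array down to a chosen target size.

#define BUFPOOL_LOWWATER 10 /* hypothetical: target pool size to shrink down to */

void bufpool_shrink(bufpool_t *pool) {
    /* Only trailing, currently available (non-NULL) slots can be removed safely.
       NULL slots belong to buffers that are still out; they must keep their place
       so that bufpool_enqueue finds a free slot when those buffers are released. */
    while (pool->size > BUFPOOL_LOWWATER && pool->bufs[pool->size - 1]) {
        bufpool_free(pool->bufs[pool->size - 1]);
        pool->size--;
    }
}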

Usage

You should be able to write your libuv callbacks now:

void read_cb(uv_stream_t *stream, ssize_t nread, const uv_buf_t *buf) {
    /* ... */
    bufpool_release(buf->base); /* Release the buffer */
}

Loop initialization:

uv_loop_t *loop = malloc(sizeof(*loop));
bufpool_t *pool = malloc(sizeof(*pool));
uv_loop_init(loop);
bufpool_init(pool);
loop->data = pool;

Operation:

uv_tcp_t *tcp = malloc(sizeof(*tcp));
uv_tcp_init(loop, tcp);
/* ... */
uv_read_start((uv_stream_t *)tcp, alloc_cb, read_cb);

UPDATE (02 Aug 2016)

It's also a good idea to use an adaptive strategy when fetching a buffer, depending on the requested size: return pooled buffers only when a big chunk of data is requested (e.g. all reads and long writes), and return dummy buffers in the other cases (e.g. most writes). This helps avoid wasting pooled buffers while maintaining acceptable allocation speed. For example:

void alloc_cb(uv_handle_t *handle, size_t size, uv_buf_t *buf) {
    int len = size; /* Requested buffer size */
    void *ptr = bufpool_acquire(handle->loop->data, &len);
    *buf = uv_buf_init(ptr, len);
}

void *bufpool_acquire(bufpool_t *pool, int *len) {
    int size = *len;
    void *buf;
    if (size > DUMMY_BUF_SIZE) {
        buf = bufpool_dequeue(pool);
        if (buf) {
            if (size > BUF_SIZE) *len = BUF_SIZE;
            return buf;
        }
        size = DUMMY_BUF_SIZE;
    }
    buf = bufpool_alloc(0, size);
    *len = buf ? size : 0;
    return buf;
}

P.S. No need for buflen and bufpool_dummy with this snippet.

Homeostasis answered 28/2, 2015 at 10:17 Comment(9)
Do you care to release this code under the MIT/BSD or LGPL v3 licenses? – Chloramine
The licensing nightmare creeps in again... =) I believe everything on SO is under the terms of the Creative Commons license by default. I don't know how exactly I could furthermore release this ingenious piece of mine under the MIT/BSD license, but I don't mind. – Homeostasis
Yeah, this licensing thing is a nightmare, and combined with the fact that I'm paranoid it just makes things worse. :) Thanks. – Chloramine
bufpool_enqueue and bufpool_dequeue could be very slow given a loop to scan for free slots. A stack or linked list would have been a better choice and would have helped with reusing recently used buffers, which is friendlier to the memory cache. – Hyperbaton
@Hyperbaton They won't be "very slow" since their capacity is limited. They are essentially O(1). Also, there's a special note regarding this in the article. – Homeostasis
Let's say an event loop is servicing 1000 sockets. The first 900 buffers somehow go to sockets servicing traffic to mobile phones and don't get returned in time. The loop could end up iterating hundreds of times to get new buffers. Even a 1µs delay could end up costing an extra 1s every second on a high-end server processing 1M messages per second. I guess 'slow' is relative to some. In the financial industry this would be slow, but it is probably fine for a web browser-type of app. A linked list would involve modifying two pointers when enqueuing/dequeuing, no matter how many buffers there are. – Hyperbaton
Well, a situation where the number of buffers should be equivalent to the number of sockets, i.e. an O(n) problem, is BEYOND my answer. As I already mentioned in the answer and the comment above, in this SIMPLIFIED approach the number of buffers is ALWAYS constant, i.e. it does not depend on the number of sockets being served. When an event loop needs more buffers than are available, so-called "dummy buffers" are used. Again, this might NOT be a fine-grained approach for every possible case, including yours, and it need not be, because its main purpose is to educate. – Homeostasis
Thanks for the excellent writeup. I'm thinking of using this as a template for something I'm working on. One nit: in bufpool_grow(), I think it should be pool->bufs[idx] = 0;. Since you're returning the new buffer immediately, you want its slot in the pool to be empty. Otherwise, you could hit the assert in bufpool_enqueue. – Footmark
@KevinRichardson Thank you for your comment! You are right about bufpool_grow(). In my own code, I store all allocated buffer pointers in pool->bufs and manage their occupancy via a separate queue. Here I simplified this approach a bit and combined both tasks. Looks like I missed that one point in bufpool_grow(). Fixed! – Homeostasis

If you are on Linux, you are in luck. The Linux kernel usually uses what is known as the SLAB allocator by default. The advantage of this allocator is that it reduces actual memory allocations by maintaining pools of recyclable blocks. What this means for you is that as long as you always allocate buffers of the same size (ideally power-of-two sizes of PAGE_SIZE), you are okay using malloc() on Linux.

If you are not on Linux (or FreeBSD or Solaris), or if you are developing a cross-platform application, you may consider using glib and its Memory Slices, which are a cross-platform implementation of the SLAB allocator. It uses native implementations on platforms that support them, so using it on Linux brings no advantage (I ran some tests myself). I'm sure there are other libraries that can do the same, or you can implement it yourself.
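
As a rough sketch of how that could plug into the libuv allocation callback (not from the original answer; it assumes GLib is available and that a fixed slice size is acceptable):

#include <glib.h>
#include <uv.h>

#define SLICE_SIZE 65536 /* fixed size so the slice allocator can recycle blocks */

static void alloc_cb(uv_handle_t *handle, size_t suggested_size, uv_buf_t *buf) {
    /* g_slice_alloc recycles freed blocks of the same size internally. */
    *buf = uv_buf_init(g_slice_alloc(SLICE_SIZE), SLICE_SIZE);
}

static void read_cb(uv_stream_t *stream, ssize_t nread, const uv_buf_t *buf) {
    /* ... process the data ... */
    if (buf->base) g_slice_free1(SLICE_SIZE, buf->base);
}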

Harlem answered 22/5, 2015 at 13:20 Comment(3)
If you measure the performance of the default malloc() on Linux for big blocks above 16 KB, you will see how slow it is compared to smaller sizes. On the other hand, libuv suggests a buffer size of 64 KB by default. So it's not overly fancy to allocate such a big buffer with the default malloc() if you are concerned about fine-grained performance. – Homeostasis
@Homeostasis I will need to run some tests with larger block sizes, but the general idea is that if you do malloc(), then free(), and then malloc() again with the same block size, it should go much faster the second time around. – Harlem
Please do. This is exactly what I did, and I was quite disappointed. Then I found out that small and large sizes are treated/stored separately by the standard malloc(). – Homeostasis

Let's repeat the callback's function signature:

void alloc_cb(uv_handle_t* handle, size_t, uv_buf_t*);

I set handle->data to point to a struct/pair/tuple such as:

auto t(std::make_tuple(blah1, blah2, blah3));

This allows me to share arbitrary data with the cb. What I do is set one of the struct/pair/tuple data members to my buffer:

char data[65536];

then I just use the buffer in the cb:

extern "C"
inline void uv_alloc_cb(uv_handle_t* const uvh, std::size_t const sz,
  uv_buf_t* const buf) noexcept
{
  auto const p(static_cast<std::pair<void*, char*>*>(uvh->data));

  buf->base = std::get<1>(*p);
  buf->len = 65536;
}

This is super fast, no dynamic allocation necessary. I think the libuv API is kind of ad-hoc, not very well thought out at all and the implementation is lacking. Why this arbitrary 64k buffer requirement? If I don't provide 64k, libuv is not happy at all, though it won't crash.

Richel answered 21/5, 2022 at 8:46 Comment(0)
