C++ replacement for C99 VLAs (goal: preserve performance)
I am porting some C99 code that makes heavy use of variable length arrays (VLA) to C++.

I replaced the VLAs (stack allocation) with an array class that allocates memory on the heap. The performance hit was huge: a slowdown by a factor of 3.2 (see benchmarks below). What fast VLA replacement can I use in C++? My goal is to minimize the performance hit when rewriting the code for C++.

One idea that was suggested to me was to write an array class with fixed-size storage inside the class (i.e. stack-allocatable) that is used for small arrays, with an automatic switch to heap allocation for larger arrays. My implementation of this is at the end of the post. It works fairly well, but I still cannot reach the performance of the original C99 code. To come close, I must increase this fixed-size storage (MSL below) to sizes I am not comfortable with. I don't want to reserve huge stack buffers even for the many small arrays that don't need them, because I worry that it will trigger a stack overflow. A C99 VLA is actually less prone to this because it never uses more storage than needed.

I came upon std::dynarray, but my understanding is that it was not accepted into the standard (yet?).

I know that clang and gcc support VLAs in C++, but I need it to work with MSVC too. In fact better portability is one of the main goals of rewriting as C++ (the other goal being making the program, which was originally a command line tool, into a reusable library).


Benchmark

MSL refers to the array size above which I switch to heap-allocation. I use different values for 1D and 2D arrays.

Original C99 code: 115 seconds.
MSL = 0 (i.e. heap allocation): 367 seconds (3.2x).
1D-MSL = 50, 2D-MSL = 1000: 187 seconds (1.63x).
1D-MSL = 200, 2D-MSL = 4000: 143 seconds (1.24x).
1D-MSL = 1000, 2D-MSL = 20000: 131 seconds (1.14x).

Increasing MSL further improves performance more, but eventually the program will start returning wrong results (I assume due to stack overflow).

These benchmarks are with clang 3.7 on OS X, but gcc 5 shows very similar results.


Code

This is the current "smallvector" implementation I use. I need 1D and 2D vectors. I switch to heap-allocation above size MSL.

template<typename T, size_t MSL=50>
class lad_vector {
    const size_t len;
    T sdata[MSL];
    T *data;
public:
    explicit lad_vector(size_t len_) : len(len_) {
        if (len <= MSL)
            data = &sdata[0];
        else
            data = new T[len];
    }

    // non-copyable: copying would double-free the heap-allocated buffer
    lad_vector(const lad_vector &) = delete;
    lad_vector &operator=(const lad_vector &) = delete;

    ~lad_vector() {
        if (len > MSL)
            delete [] data;
    }

    const T &operator [] (size_t i) const { return data[i]; }
    T &operator [] (size_t i) { return data[i]; }

    operator T * () { return data; }
};


template<typename T, size_t MSL=1000>
class lad_matrix {
    const size_t rows, cols;
    T sdata[MSL];
    T *data;

public:
    explicit lad_matrix(size_t rows_, size_t cols_) : rows(rows_), cols(cols_) {
        if (rows*cols <= MSL)
            data = &sdata[0];
        else
            data = new T[rows*cols];
    }

    // non-copyable: copying would double-free the heap-allocated buffer
    lad_matrix(const lad_matrix &) = delete;
    lad_matrix &operator=(const lad_matrix &) = delete;

    ~lad_matrix() {
        if (rows*cols > MSL)
            delete [] data;
    }

    T const * operator[] (size_t i) const { return &data[cols*i]; }
    T * operator[] (size_t i) { return &data[cols*i]; }
};
Mysterious answered 3/4, 2016 at 20:1 Comment(16)
There is no substitute for VLAs when it comes to overhead. The storage for a VLA is completely cost-free; in most cases it adds nothing beyond the existing overhead of a function call. You can't really do better than 0% cost, so if MSVC does not have VLAs, you have no choice but to use some alternative and take a performance hit.Natashianatassia
@SamVarshavchik My question is: what is a good alternative that minimizes the performance hit? There's one I show in the question. I don't know if there is anything better. I also don't know if there is a way to allocate memory on the stack in C++ that would allow me to emulate how VLAs work (possibly one or more platform-specific way ... I need this for Windows/Linux/OS X)Mysterious
If you are happy to go "platform specific" then GCC does VLAs as an extension and runs on all those platforms.Snowslide
@Snowslide I can go platform-specific to a reasonable extent (i.e. using OS-specific calls in my array class implementation). But I still need the code to work with MSVC. And I don't want to specialize anything other than this class for different platforms. Is it possible to use a VLA within a class or to wrap it with a class? My understanding so far was that it isn't. But if it is, I would do that (for when it's compiled with gcc) and use the above solution with MSVC.Mysterious
No it's not. Your only option is to allocate a vector/array on the heap. That's it.Natashianatassia
There is also alloca (plaform-specific function, but exists on Linux/Windows/OS X): man7.org/linux/man-pages/man3/alloca.3.html It dynamically allocates memory on the stack.Levenson
@Levenson Is it possible to use alloca to create an on-stack array class? The difficulty I'm encountering is that if I call alloca in the constructor of the class, it will use the constructor's stack (and the pointer won't be valid once the constructor has returned).Mysterious
You are barking up a very wrong tree, VLA memory is not faster than new[] memory. You have to document the T you used. Crystal ball says that it is float or double. What you lost is auto-vectorization, which makes code 3 to 4 times faster. You need at least VS2012 to get that back. Use the built-in auto-vectorization and auto-parallelization diagnostic options to find out what the optimizer did.Racer
@HansPassant T is always int, char or bool in this application, never floating point. It is a complex combinatorial algorithm. Also, if the root of the performance problem is that auto-vectorization is lost, then why does the optimization I tried work as well as it does?Mysterious
@Hans I think my original post was misleading. The benchmarks I show are with clang, not MSVC. The code still must run with MSVC, but I'm working on OS X mainly, and testing on Windows when needed.Mysterious
Can you show the benchmark code? Perhaps there might be a way to use more memory for improved speed?Tarratarradiddle
alloca would need to be called in the function whose stack should be used. That is, not in the constructor of the vector class (or the initialization list.) The class could take the pointer as a constructor argument, like lad_vector vec( (int*)alloca(10 * sizeof(int)), 10 );. Maybe make a macro for this (but not an inline function), to get a syntax like lad_vector vec = MAKE_LADVECTOR(10);Levenson
I think your C99 is broken, you cannot portably rely on VLAs that large. I'm surprised that you never get any stack overflow. It looks strange that you get better performance for very high MSL values: for that much data, I would expect the cost of heap allocation to be very small compared to the cost of processing the data.Kandrakandy
@Benoît I think the implementation allocates a lot inside calculation heavy code, or so it seems. Yet that doesn't explain away the improvement, on the contrary.Tarratarradiddle
Increasing MSL further improves performance more, but eventually the program will start returning wrong results (I assume due to stack overflow). I don't see how stack overflow can give you wrong results. On any sane system, at worst you should get a segfault. (Barring something very unusual, like overflowing by so much that you wind up in some other area of valid memory.) So maybe you should look for a bug.Charlatan
Might this be relevent or helpful: Looking for C++ STL-like vector class but using stack storageTolland
Create a large buffer (MB+) in thread-local storage. (Actual memory on heap, management in TLS).

Allow clients to request memory from it in FILO manner (stack-like). (this mimics how it works in C VLAs; and it is efficient, as each request/return is just an integer addition/subtraction).

Get your VLA storage from it.

Wrap it pretty, so you can say stack_array<T> x(1024);, and have that stack_array deal with construction/destruction (note that ->~T() where T is int is a legal noop, and construction can similarly be a noop), or make stack_array<T> wrap a std::vector<T, TLS_stack_allocator>.

Data will not be as local as the C VLA data, because it effectively lives on a separate stack. You can use SBO (small buffer optimization) for the cases where locality really matters.

An SBO stack_array<T> can be implemented with an allocator and a std::vector unioned with a std::array, or with a unique_ptr and a custom deleter, or in a myriad of other ways. You can probably retrofit your solution, replacing your new/malloc/free/delete with calls to the above TLS storage.

I say go with TLS as that removes the need for synchronization while allowing multi-threaded use, and mirrors the fact that the stack itself is implicitly thread-local.

Stack-buffer based STL allocator? is a SO Q&A with at least two "stack" allocators in the answers. They will need some adaptation to automatically get their buffer from TLS.

Note that the TLS being one large buffer is in a sense an implementation detail. You could do large allocations, and when you run out of space do another large allocation. You just need to keep track of each "stack page's" current capacity and keep a list of stack pages, so that when you empty one you can move back to an earlier one. That lets you be a bit more conservative in your initial TLS allocation without worrying about running out of memory; the important part is that allocations are FILO and rare, not that the entire FILO buffer is contiguous.
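
The scheme above can be sketched roughly as follows. This is a minimal illustration, not a tested implementation: the names (tls_arena, stack_array), the 1 MiB capacity, and the 16-byte alignment are all assumed choices, and it relies on the FILO property of VLA-style usage (per-thread arrays are destroyed in reverse order of creation).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Thread-local FILO arena: actual storage on the heap, management in TLS.
namespace tls_arena {
    constexpr std::size_t kCapacity = 1 << 20;   // 1 MiB per thread (assumed)

    inline unsigned char *buffer() {
        // Lazily heap-allocated; only the pointer and offset live in TLS.
        thread_local std::unique_ptr<unsigned char[]> buf(new unsigned char[kCapacity]);
        return buf.get();
    }

    thread_local std::size_t top = 0;

    inline std::size_t round_up(std::size_t n) {
        return (n + 15) & ~std::size_t(15);      // keep 16-byte alignment
    }

    // Returns nullptr when the arena is full; the caller falls back to the heap.
    inline void *push(std::size_t bytes) {
        bytes = round_up(bytes);
        if (top + bytes > kCapacity) return nullptr;
        void *p = buffer() + top;
        top += bytes;                            // allocation: one addition
        return p;
    }

    inline void pop(std::size_t bytes) {
        top -= round_up(bytes);                  // FILO: release in reverse order
    }
}

template <typename T>
class stack_array {
    std::size_t len;
    T *data;
    bool from_arena;
public:
    explicit stack_array(std::size_t n) : len(n) {
        void *p = tls_arena::push(n * sizeof(T));
        from_arena = (p != nullptr);
        if (from_arena) {
            data = static_cast<T *>(p);
            for (std::size_t i = 0; i < n; ++i) new (data + i) T();
        } else {
            data = new T[n]();                   // heap fallback for big arrays
        }
    }
    ~stack_array() {
        if (from_arena) {
            for (std::size_t i = len; i > 0; --i) data[i - 1].~T();
            tls_arena::pop(len * sizeof(T));
        } else {
            delete[] data;
        }
    }
    stack_array(const stack_array &) = delete;
    stack_array &operator=(const stack_array &) = delete;
    T &operator[](std::size_t i) { return data[i]; }
    const T &operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return len; }
};
```

For trivial T such as int, the placement-new/destructor loops compile down to no-ops, so construction and destruction reduce to the two integer additions mentioned above.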

Helping answered 3/4, 2016 at 21:1 Comment(5)
Interesting idea, I'll try it. What is SBO?Mysterious
I would like to know why this was downvoted. The use case is replacing C99 VLAs in code originally written in C99. This means that arrays are always destroyed in the reverse order of their creation, so the idea to take their storage from a "manually managed stack" should work ... If there's an expected problem, I would like to know.Mysterious
@sza small buffer optimization (what you already tried), storing small arrays "locally". Really, only try if the above fails performance tests first.Helping
@Mysterious As a theory, other than the TLS detail, my answer lines up with the last idea of 5gon12eder; maybe someone didn't like how similar they are. If the TLS detail was folded into 5gon12's answer, mine would be redundant; at the same time, I strongly suspect this solution is the only one that has a chance to both solve your portability and performance problems.Helping
Your idea with the FILO storage works well and closes the performance gap. I will accept the answer once I have finalized the implementation (maybe tomorrow).Mysterious
I think you have already enumerated most options in your question and the comments.

  • Use std::vector. This is the most obvious, most hassle-free but maybe also the slowest solution.
  • Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. POSIX specifies alloca which is widely supported to allocate memory on the stack. Even Microsoft Windows provides _malloca, as a quick web search told me.

    In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies into an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work but if this single feature accounts for a 3× speed difference as you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slowly but correctly than to behave erratically or not run at all.

  • Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.

    Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.

  • Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers and deallocation is simply a subtraction. I doubt that the compiler-generated stack allocation can be much faster. Your “array stack” would then pulsate correlated to your “program stack”. This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't as easily corrupt the program stack (return addresses) as they would with native VLAs.

    Custom allocators in C++ are a somewhat dirty business but some people do report using them successfully. (I don't have much experience with using them myself.) You might want to start by looking at cppreference. Alisdair Meredith, who is one of those people that promote the usage of custom allocators, gave a double-session talk at CppCon'14 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface is too awkward for you to use, implementing your own variable (as opposed to dynamically) sized array class with your own allocator should be doable as well.
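
The last bullet can be sketched as a minimal C++11 allocator over a fixed arena. This is an illustrative sketch only; the names (arena, stack_alloc) are invented, error handling is minimal, and a production version would need alignment and thread-safety decisions of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Fixed arena handed out in LIFO order: allocation is one addition,
// deallocation one subtraction (when blocks are freed in reverse order).
struct arena {
    unsigned char *buf;
    std::size_t cap;
    std::size_t used = 0;
    arena(unsigned char *b, std::size_t c) : buf(b), cap(c) {}

    void *allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);          // keep 16-byte alignment
        if (used + n > cap) throw std::bad_alloc();
        void *p = buf + used;
        used += n;
        return p;
    }
    void deallocate(void *p, std::size_t n) {
        n = (n + 15) & ~std::size_t(15);
        // Stack discipline: only the most recent block can really be returned;
        // anything else stays occupied until the arena is reset.
        if (static_cast<unsigned char *>(p) + n == buf + used)
            used -= n;
    }
};

// Minimal C++11 allocator facade; allocator_traits fills in the rest.
template <typename T>
struct stack_alloc {
    using value_type = T;
    arena *a;
    explicit stack_alloc(arena *ar) : a(ar) {}
    template <typename U> stack_alloc(const stack_alloc<U> &o) : a(o.a) {}
    T *allocate(std::size_t n) { return static_cast<T *>(a->allocate(n * sizeof(T))); }
    void deallocate(T *p, std::size_t n) { a->deallocate(p, n * sizeof(T)); }
    template <typename U> bool operator==(const stack_alloc<U> &o) const { return a == o.a; }
    template <typename U> bool operator!=(const stack_alloc<U> &o) const { return a != o.a; }
};
```

Usage would look like:

```cpp
unsigned char storage[1 << 16];
arena ar(storage, sizeof storage);
stack_alloc<int> alloc(&ar);
std::vector<int, stack_alloc<int>> v(alloc);   // grows inside the arena
```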

Milanmilanese answered 3/4, 2016 at 20:48 Comment(3)
Union of classes sounds dangerous; destructors aren't executed automatically for union members.Thermostat
@Alex This is safe since C++11. Of course, you have to take care to write your destructors accordingly to invoke the appropriate destructor of the currently active member of the union.Milanmilanese
Pool allocators with std::vector has to be the best of all worlds.Montez
Regarding support for MSVC:

MSVC has _alloca which allocates stack space. It also has _malloca which allocates stack space if there is enough free stack space, otherwise falls back to dynamic allocation.

You cannot take advantage of the VLA type system, so you would have to change your code to work based on a pointer to the first element of such an array.

You may end up needing a macro with different definitions depending on the platform: e.g. one that invokes _alloca or _malloca on MSVC, and on g++ or other compilers either calls alloca (if they support it) or declares a VLA and a pointer to its first element.
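
A hedged sketch of such a macro follows; the name STACK_ARRAY is an invented placeholder. Note the expansion must happen in the function that owns the stack frame, so it cannot be hidden behind an ordinary (non-inlined) helper function or a constructor.

```cpp
#include <cassert>
#include <cstddef>

// Per-platform stack allocation macro (illustrative sketch).
#if defined(_MSC_VER)
  #include <malloc.h>
  #define STACK_ARRAY(type, name, count) \
      type *name = static_cast<type *>(_alloca((count) * sizeof(type)))
#else
  #include <alloca.h>
  #define STACK_ARRAY(type, name, count) \
      type *name = static_cast<type *>(alloca((count) * sizeof(type)))
#endif

// As the answer notes, the code then works through a pointer to the first
// element rather than through the VLA type system.
int sum_first(std::size_t n) {
    STACK_ARRAY(int, tmp, n);   // released automatically when this function returns
    for (std::size_t i = 0; i < n; ++i)
        tmp[i] = static_cast<int>(i);
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += tmp[i];
    return s;
}
```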


Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack space. One option is to allocate a fixed-size buffer sized for the maximum you will ever need. (If that would cause a stack overflow, your code is buggy anyway.)

Dealer answered 3/4, 2016 at 23:49 Comment(4)
I'd be worried about alloca using the wrong stack frame if it's not being called explicitly from the same function where the object is declared.Mississippian
@Mississippian not sure what you are talking about, I am suggesting replacing VLA declarations with alloca as a possible optionDealer
I think I got confused and thought you were talking about hiding this behavior behind a class.Mississippian
@Random832: Both _alloca() and alloca() do the right thing if the call for the function in which they're used is properly inlined. You can make sure this happens using __forceinline and __attribute__((always_inline)). I use this extensively in C90 code (which also doesn't have VLAs).Yokefellow
