VLAs (Re: C99 Features)

Tue Jun 23 20:20:39 EDT 2015

On Sat, Jun 20, 2015 at 03:11:03PM -0400, Benjamin Kaduk wrote:
> On Fri, 19 Jun 2015, Nico Williams wrote:
> > > Well, take common inner-function heap allocations and turn them into
> > > stack allocations. That is a significant performance gain.
> >
> > Probably.  A decent heap allocator with per-thread magazines ought to be
> > good enough in most cases.
> 
> I'm not sure I follow what point Nico is trying to make, here.  [...]

That we shouldn't be making this optimization except where it's really
necessary.  I.e., find a hotspot where VLAs or alloca() is the fix.

The heap allocator ought to be good enough.

> In other words, I'm not convinced that this isn't premature optimization,
> as far as claims of performance gains.  [...]

We're making the same point.

FYI, the following "works" fine as a guard on alloca() *and* VLAs.  Call
ensure_alloca(), which ensures that it can safely alloca() the desired
number of bytes, then call alloca().  The macro SAFE_ALLOCA() wraps
around ensure_alloca() and alloca().  Run this program with a debugger
to see it in action.

I have tested this on a system where the stack grows down.

However! the optimizer may make this not safe.  I'm not sure how to
write ensure_alloca() so that the optimizer will not short-circuit its
main loop.  So, as with several other sensitive functions that the
optimizer may damage, this should be compiled with -O0.

Also, for large enough sizes (and probably not that large..) this will
be significantly slower than the heap allocator, as this is linear with
the number of bytes to allocate.  Which further underlines your and my
point.

Anything the compiler and run-time might do to make alloca()/ VLAs safe
is also probably going to be comparable to the heap allocator anyways.

The fastest thing I can imagine the compiler+run-time doing is this:
have a TSD or equivalent in which the address of the guard page is
written, and check it in the alloca()/VLA code emitted, and if the
allocation lands in or past the guard page, attempt to grow the stack
with mmap() or alike.  The fast path will run faster than many heap
allocators, no doubt, but not much faster than a heap allocator with
per-thread magazines (like libumem).  The slow path will probably not be
slower than the heap allocator.

In other words, really, alloca() or VLAs with sizes that are not
guaranteed to be smaller than a page is just not worth it.

#include <sys/types.h>
#include <assert.h>
#include <alloca.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STACK_GROWS_DOWN 1

#define SAFE_ALLOCA(bytes) \
    ((bytes) <= 4000 ? alloca(bytes) : ensure_alloca(bytes), alloca(bytes))

void *ensure_alloca(size_t bytes)
{
    size_t i;
    char *mem;
    assert(bytes < SIZE_MAX - 4096);

    mem = alloca(bytes);
    for (i = 0; i < bytes; i += 4096) {
#ifdef STACK_GROWS_DOWN
        ((char *)mem)[bytes - i] = 0;
#else
        ((char *)mem)[i] = 0;
#endif
    }
    return 0;
}

int main(int argc, const char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s SHIFT\n", argv[0]);
        return 1;
    }
    char *s = SAFE_ALLOCA(1<<atoi(argv[1]));
    return s[0];
}