TL;DR: store the non-zero entries plus, essentially, a run-length encoding of the zeros, in a zstd / lz4 / gzip compressed format, and initialize a zeroed array from that at startup.
As you mentioned here and in *How to compress or do size optimization for large static sparse array in C* (basically a duplicate of this, where you forgot to link this existing question), you seem to be aiming primarily at reducing the file size of the executable, not RAM usage after your code is loaded. (The latter is impossible without an ABI change that means recompiling all code that touches this array. That is certainly something you should consider, though, e.g. to use Lundin's answer on your other question: reduce padding in your arrays if `char*` is wider than `short`.)
Your only option to reduce that is to not statically initialize this array with non-zero values. The compiler can't optimize this for you. (Or at least not that I'm aware of; in theory an embedded implementation that has to copy non-zero `.data` initializers from flash to RAM anyway could literally compress that data with lz4 or zstd. I don't know whether any toolchains do that; it would be a neat feature.)
To do this yourself, zero-init the actual array (`static tbl_t tbl[SIZE];`) so it goes in the BSS and doesn't take space in your `.a`, and call an init function at startup to loop through a more compact representation of your data, either from a file or from a `static const` array.
You'd omit the `char*` member that's initially always(?) zero, replacing it with a `uint8_t` or `unsigned short` `distance_to_next_nonzero` member. This is a similar idea to run-length encoding. If you have a run of zeros too long for that member, just emit a record with all-zero weights and a new length.
Or it could even be a separate binary file, so it's easier to avoid having that initializer memory stay allocated for the lifetime of your application¹.
Perhaps generate the file by compiling source like this into a program that uses `fopen` / `fwrite` to dump it. That would make it easy to maintain even if you compress that initializer data using `zstd`² as part of your build process. Although you can always just hexdump any binary file back into a C `const char initdata_zstd[] = {...};` array initializer.
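For example, a minimal generator along those lines might look like this (a sketch; the function and array names are just placeholders, and `xxd -i` does the same job as a one-liner):

```c
#include <stdio.h>

// Write the bytes of `in` to `out` as a C array initializer,
// plus a _len variable. Returns the number of bytes emitted.
long dump_c_array(FILE *in, FILE *out, const char *name)
{
    fprintf(out, "const unsigned char %s[] = {\n", name);
    long n = 0;
    int c;
    while ((c = getc(in)) != EOF)
        fprintf(out, "0x%02x,%c", c, (++n % 12 == 0) ? '\n' : ' ');
    fprintf(out, "\n};\nconst unsigned long %s_len = %ldUL;\n", name, n);
    return n;
}
```

You'd call this from a tiny `main` as a filter in your build, e.g. `./bin2c < initdata.bin.zst > initdata.h`, and `#include` the result.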
I'm assuming the `char*` member is always NULL; otherwise getting it initialized to point at constant strings is more of a challenge.
```c
typedef struct {
    short x[5];
    unsigned short num_zeros;
    // or uint_least8_t num_zeros; if you use __attribute__((packed)) for the init stuff
} weight_initializer;
```
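A startup loop to expand such records into the zeroed table might look like this (a sketch; `tbl_t`, the table size, and `init_tbl` are assumptions based on the question, and the structs are repeated so the example is self-contained):

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    short x[5];
    unsigned short num_zeros;   // run of all-zero entries *after* this one
} weight_initializer;

typedef struct {
    short x[5];
    char *ptr;                  // stays NULL; never stored in the init data
} tbl_t;

#define TBL_SIZE 1024
static tbl_t tbl[TBL_SIZE];     // zero-init: lives in the BSS

// Expand n compact records into tbl[], leaving the gaps zeroed.
void init_tbl(const weight_initializer *rec, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n && out < TBL_SIZE; i++) {
        memcpy(tbl[out].x, rec[i].x, sizeof rec[i].x);
        out += 1 + rec[i].num_zeros;    // skip the run of zero entries
    }
}
```

A record with all-zero weights just extends a gap longer than `num_zeros` can express, as described above.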
**Footnote 1:**
Freeing init data: kernels like Linux put their init data in a custom linker section so it's contiguous, and at some point at the end of boot put that region of memory on the free list to be reused. In user-space under Linux you could do something like that by calling `munmap` on pages that consist solely of the initializer array. (Use `alignas(4096) static const unsigned char initdata[] = {...};` so it starts at the beginning of a page.)
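A user-space sketch of that trick (Linux-specific; the array contents and names here are placeholders, and it conservatively does nothing if the page size is larger than our alignment):

```c
#include <stdalign.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Page-aligned so whole pages can be returned to the OS after init.
alignas(4096) static const unsigned char initdata[8192] = {1, 2, 3};

// Unmap the pages lying entirely within initdata[]. After this,
// touching initdata is undefined behaviour (it would segfault).
int free_initdata(void)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)initdata;
    size_t len = (sizeof initdata / (size_t)page) * (size_t)page;
    if (start % (uintptr_t)page != 0 || len == 0)
        return 0;   // pages bigger than our alignment: just leave it mapped
    return munmap((void *)start, len);
}
```

This is safe only because `initdata` occupies its pages exclusively; unmapping a page shared with other `.rodata` would be a crash waiting to happen.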
**Footnote 2:**
The non-zero initializer data in your example looks very compressible. zstd or lz4 are (AFAIK) the go-to choices for quick decompression, with zstd allowing good compression ratios if you spend enough time compressing, which I think doesn't hurt decompression speed. (You can limit the maximum memory required for decompression, at the expense of compression efficiency, although I think that only becomes relevant for inputs larger than memory.)

LZ4 is faster, I think, but maybe not by much. Apparently there's also Lizard (formerly LZ5). Again, spending lots of time compressing can make a smaller compressed size, potentially leading to even faster decompression.

zstd is about an order of magnitude faster than gzip to compress or decompress, IIRC, and gets equal or better compression ratios at that speed (with a wide range of speed vs. compression presets). It makes zlib basically obsolete if you don't need format compatibility.
If your data is really that regular (incrementing patterns), zstd is vastly better than lz4. I assume yours isn't, but I was curious just how well compression would work on a pattern like the one shown in the question's example, so I wrote a loop to generate repetitive data without a zero-gap member.
```c
#include <stdio.h>

typedef struct {
    short x[5];
} weight_t;

typedef struct {
    weight_t wgt;
    // char *ptr;
    // unsigned ptr;  // leave 8- or 4-byte zeros in each record, or not
} tbl_t;

#define SIZE (1<<20)
static tbl_t tbl[SIZE];   // zero-init, in the BSS

int main(void)
{
    unsigned short x = 0x0102;
    for (int i = 0; i < SIZE; i++) {
        tbl[i].wgt = (weight_t){ {x, x, x, x, x} };
        x++;
    }
    fwrite(tbl, sizeof(tbl[0]), SIZE, stdout);
}
```
So it wraps around after 2^16 iterations; zstd may be spotting and exploiting that longer-distance redundancy.
```
$ gcc -O3 foo.c
$ ./a.out > initdata.bin
$ lz4 -k --best --favor-decSpeed initdata.bin initdata.bin.fastdec.lz4
Compressed 10485760 bytes into 9789946 bytes ==> 93.36%
$ lz4 -k --best initdata.bin
Compressed 10485760 bytes into 4092881 bytes ==> 39.03%
$ gzip -v -k --best initdata.bin
initdata.bin:    85.4% -- created initdata.bin.gz
$ zstd -v -k -19 initdata.bin
initdata.bin : 0.65%   ( 10.0 MiB =>   66.7 KiB, initdata.bin.zst)
$ xz -v -k --best initdata.bin   # rather slow to decompress; you probably don't want it
initdata.bin (1/1)
  100 %        27.2 KiB / 10.0 MiB = 0.003
$ ll -S initdata.bin*
-rw-r--r-- 1 peter peter  10M Apr 20 20:28 initdata.bin
-rw-r--r-- 1 peter peter 9.4M Apr 20 20:28 initdata.bin.fastdec.lz4
-rw-r--r-- 1 peter peter 4.0M Apr 20 20:28 initdata.bin.lz4
-rw-r--r-- 1 peter peter 1.5M Apr 20 20:28 initdata.bin.gz
-rw-r--r-- 1 peter peter  67K Apr 20 20:28 initdata.bin.zst
-rw-r--r-- 1 peter peter  28K Apr 20 20:28 initdata.bin.xz
```
`zstd` with the default `-3` compression level still got it down to 1.67% of the original size, 171 KiB.

`xz` uses LZMA, like 7-zip: slow to compress, but high compression. Decompression speed is OK (nowhere near as slow as compression), but not as fast as zstd.
So it seems LZ4 doesn't do a very good job here; if your real data behaves remotely like this simplistic test, zstd's compression advantage will pay for itself even if its decompression code is somewhat larger. Your real data might be much less compressible by zstd, but you should still see some gain if most of your weights use only about 9 of their 16 bits, even if they're otherwise pretty random.