How to fix GCC compilation error when compiling >2 GB of code?

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around it, scientific computing ...)

When I try to link them, I get the (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additionally linked libraries that I have control over are compiled with the -fpic flag.

Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

And the system libraries I link against:

-lgfortran -lm -lrt -lpthread

Any clues where to look for the problem?

EDIT:

First of all, thank you for the discussion...

To clarify a bit, I have hundreds of functions (each approx 1 MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum, prefactor, expr;

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and holds the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values, which is also evaluated in separate object files (again hundreds, each about 1 MB in size) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...
    
       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

That's about it. The final step then just consists of calling all those func[i] and summing up the results.

Concerning the fact that this is a rather special and unusual case: yes, it is. This is what people have to cope with when trying to do high-precision computations for particle physics.

EDIT2:

I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...

EDIT3:

Thank you for the suggestions and the discussion so far... I'll try to roll the loops back up upon code generation somehow; not sure how to do this exactly, to be honest, but this is the best bet.

BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize".
It's just that the basis for this code comes out of a "black box" to which I have no real access, and moreover, the whole thing worked great with simple examples; I mainly feel overwhelmed by what happens in a real-world application...

EDIT4:

So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I also see a way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works.

Now related to your answers:

I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, sorting the expressions by variables like x12, x13, ..., parsing the cscs with Python, and generating tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.
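
For illustration, here is a minimal sketch of what one such rolled-up group could look like; the coefficient and key tables are hypothetical stand-ins for what the generator would emit (values taken from the first terms of func1 above, and ProcessVars is the struct from the question):

#include <cstddef>
#include <tr1/unordered_map>

// hypothetical tables emitted by the generator for one group of terms
static const double coeff[] = { 1., 9/10., 1/10., -3/5. /* ... */ };
static const int    key[]   = { 49300, 49301, 49302, 49303 /* ... */ };
static const std::size_t N  = sizeof(key) / sizeof(key[0]);

double group(std::tr1::unordered_map<int, double>& csc, ProcessVars& s)
{
    double part = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        part += coeff[i] * csc[key[i]];
    return -5/243. * s.x14 * s.x15 * part;  // common prefactor hoisted out
}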

However, I'd like to also give credit to VJo. GCC 4.6 indeed works much better, produces smaller code, and is faster. Using the large model, the code works as-is. So technically this is the correct answer, but changing the whole concept is the better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS:

Just some remarks on some of the other answers: The code I'm trying to run does not originate from an expansion of simple functions/algorithms or from stupid, unnecessary unrolling. What actually happens is that the stuff we start with is pretty complicated mathematical objects, and bringing them into a numerically computable form generates these expressions. The problem actually lies in the underlying physical theory: the complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff into something physically measurable -- an observable -- it boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general, and only available, ansatz called "perturbation theory".) We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT:

Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing it down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: the whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. It now makes sense to turn on optimizations, and execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

Overbid answered 9/6, 2011 at 17:26 Comment(37)
2.8 GB? Are you using templates in your code? Metaprogramming?Voccola
No templates, just hundreds of thousands of expressions, summed up to simple numbers that are returned by those functions.Overbid
If you have that much data, you should move it out of the source files and instead mmap it yourself from an external binary at runtime.Warthman
@R..: Could you explain a bit more? The data is not const, can I mmap a function pointer in a given object? How do I get the address?Overbid
It really sounds like you're doing it the hard way. You could move to a mini computer or a mainframe.Nutpick
@Jay: not sure if serious ...Overbid
Could you give an example of one (or two) of these functions? This really looks strange. You could also load these functions dynamically with the dl* functions.Tanatanach
@bbtrb: My first instinct is similar to R..'s, it sounds like a design problem. Admittedly, I don't know what's common in scientific computing circles, but I've never heard of someone trying to link a 2.8GB object file, or anything remotely close to it, and I'm not sure GCC would truly support it. Frankly, I'd expect any code blob that size to be pure spaghetti.Relaxation
Spaghetti or oddly designed. It certainly strikes me as the sort of thing that could be mitigated to some extent by refactoring or abstracting the system a bit and allowing the compiler to get in and optimize where it can.Ancon
@btrb, is your data in arrays? if so you can seamlessly replace the arrays with boost::array. They will be allocated on the heap and no syntax changes are required beyond the declaration of them.Remand
@frankc: I'm pretty sure boost::arrays are allocated on the stack. (Why else would they require the size of the array to be passed as a template parameter?) Perhaps you meant std::vector?Receiver
@Receiver Actually, I meant to say boost::multi_array. I am not sure how boost::array works, but I have used boost::multi_array to fix a program that had the exact same compilation problemRemand
@all: Please see my edit, I hope this helps a bit...Overbid
there's absolutely no way that the optimal solution for the problem involves 2gb of object file.Trochelminth
@David: Well, if you come up with something better, which wouldn't be directly related to coding btw, you could make the work of hundreds of physicists obsolete...Overbid
don't put your data in codeTrochelminth
@David: So how do I evaluate these expressions efficiently when not in code?Overbid
Just out of curiosity... were those hundreds of 1MB-sized functions all written by hand?Receiver
@HighCommander4: Surely not :)Overbid
@bbtrb: Well, in that case, could you not generate them from some smaller input during the execution of your program? (Or during its compilation, using metaprogramming)?Receiver
@Overbid If you want to evaluate x*y where x is 2 and y is 3, you do not, I presume, generate the code 2*3. Instead you use two variables into which you load the values 2 and 3, and then you multiply those two variables together. Take that idea and scale it up.Trochelminth
@HighCommander4: This is actually exactly what I do to generate the code. The initial expressions are very small and simply explode like that. The problem is that the basis that leads to these functions is expressed in a mathematical language that is both very difficult to translate into code and, which is much worse, numerically completely unstable.Overbid
@David: I see what you mean, however I don't see how I could use this approach on the above functions in an automated way.Overbid
Do the functions share a common form?Trochelminth
They all look like the stuff above, but of course with different csc[], prefactors and numbers.Overbid
If they have a common form then you can write the function once and turn the hard coded bits into variables.Trochelminth
Unrelatedly, why are you using tr1::unordered_map<int,double> instead of vector<double> or even just a statically allocated double[] for csc?Kynewulf
"unfortunately there's no way around, scientific computing" !? You think!? Your examples look easily optimised to me, into data and algorithm; it seems you have hard-coded data into the algorithm and unrolled the obvious loops completely. Ouch.Vamoose
@Clifford: you are right, there are obvious loops. It's not that I unrolled them; this is my starting point, so to speak. What I would have to do is automatically analyze the expressions and then create the loops, and that sounds really non-trivial.Overbid
@bbtrb: You said the code isn't hand-written, right? It's generated by some program. Change that program to generate a loop instead. It shouldn't be difficult, as there must be a loop already that generates the individual expressions in the first place.Receiver
@HighCommander4: I'll try. It's a completely different thing to just split some expressions and do some regex on them and another to actually go through the stuff and try to find patterns... Thanks for the hints.Overbid
And I thought I was breaking through some extreme boundaries when I overrun the maximum number of vtable-entries with the VC++ compiler. I tip my hat in awe! :DBeth
You are trying to clone human DNA?>!?!:?!!?!??!Solid
In the 1970's we used overlays. Are they available on your system?Ambition
high precision computations using doubles? Huh? And why do you not load your data from a data file, like everybody else?Unsling
@houbysoft: you know, as long as you make sure that loss of numerical precision is avoided or treated correctly, double is plenty for physical observables. And sorry for my mistake, it's not about data, it's about code as others have pointed out ...Overbid
Try supplying some GCSE options to the compiler.Slaton

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or via a memory-mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc, as well as a vector-like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.
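
A minimal sketch of such an evaluator, assuming every term has the shape coeff * s.<a> * s.<b> * csc[key] as in func1 (Term, member and evaluate are hypothetical names, and ProcessVars is assumed to be a plain struct of double members):

#include <cstddef>
#include <vector>
#include <tr1/unordered_map>

// hypothetical descriptor: each term is assumed to have the shape
//   coeff * s.<a> * s.<b> * csc[key]   (as in func1)
struct Term {
    double coeff;        // e.g. -5/243.
    std::size_t off_a;   // e.g. offsetof(ProcessVars, x14)
    std::size_t off_b;   // e.g. offsetof(ProcessVars, x15)
    int csc_key;         // e.g. 49300
};

// read a double member of ProcessVars through its byte offset
inline double member(const ProcessVars& s, std::size_t off)
{
    return *reinterpret_cast<const double*>(
               reinterpret_cast<const char*>(&s) + off);
}

// one evaluator replaces all the generated functions; only this loop is
// code, the Term arrays are data (built at runtime, read from disk, or mmap'ed)
double evaluate(const std::vector<Term>& terms,
                std::tr1::unordered_map<int, double>& csc,
                const ProcessVars& s)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < terms.size(); ++i) {
        const Term& t = terms[i];
        sum += t.coeff * member(s, t.off_a) * member(s, t.off_b)
                       * csc[t.csc_key];
    }
    return sum;
}

The Term arrays for all the generated functions then become pure data that the generator writes out once and the program loads at run time.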

Brieta answered 9/6, 2011 at 22:16 Comment(0)

The x86-64 ABI used by Linux defines a "large model" specifically to avoid such size limitations, which includes 64-bit relocation types for the GOT and PLT. (See the table in section 4.4.2, and the instruction sequences in 3.5.5 which show how they are used.)

Since your functions occupy 2.8 GB, you are out of luck: gcc doesn't support the large model. What you can do is reorganize your code in such a way that you can split it into shared libraries, which you would link dynamically.

If that is not possible, then, as someone suggested, instead of putting your data into code (compiling and linking it), since it is huge, you can load it at run time (either as a normal file, or you can mmap it).
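
A minimal sketch of the mmap route, assuming the generator writes the precomputed values as a flat array of doubles to a file (the file name and layout here are hypothetical):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// map a file of raw doubles into memory, e.g. load_table("csc.bin", n)
const double* load_table(const char* path, std::size_t& count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return 0; }
    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return 0;
    count = st.st_size / sizeof(double);
    return static_cast<const double*>(p);
}

The returned pointer can then be indexed like a plain double array for the lifetime of the mapping.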

EDIT

Seems like the large model is supported by gcc 4.6 (see this page). You can try that, but the advice above about reorganizing your code still applies.

Colonial answered 9/6, 2011 at 18:47 Comment(11)
So what you are saying is that when I would group the object files in several small shared libraries, I would overcome the limitations?Overbid
@Overbid Right. But I would still search for another way of implementing your functions. I bet your compilation takes foreverAntonetta
@Vjo: Thank you, I'll try this! The compilation takes a while, even with -j8. Unfortunately, I don't see a clear way to simplify this more, it would require analytical comparison of all those expressions and an extraction of common parts, no idea how to go about that...Overbid
WTF? This code must be generated by some script; nobody writes megabytes of code by hand! The same logic that generates the code could also be used to run the computation.Devereux
@Devereux "megabytes of code"... I am sure it is 100s of megabytes (if not gigabytes). The logic that generates the code might be very slowAntonetta
@Vjo: yes you are right, the preprocessing is extremely computationally intensive and tries to simplify the resulting expressions as much as possible.Overbid
@zvrba: of course the code is generated automatically. Numerical evaluation with a "compiled language" is much faster (3-4 orders of magnitude) and therefore preferred.Overbid
I strongly recommend trying gcc 4.6, it is very likely to produce superior code for this program than gcc 4.1; it might even be able to squeeze the whole thing into 2GB without your having to do anything clever, eliminating the problem (try combinations of -Os, -fwhole-program, and -flto -- with this volume of code, optimizing for size is optimizing for speed). However, if that doesn't help enough, you should also be aware that for the large model to work, you're gonna have to rebuild at least part of the C library in the large model (crt*.o, libc_nonshared.a, and libpthread_nonshared.a).Kynewulf
@Zack, does the libc dynamic loader support large model relocations, even?Runkle
@Runkle Static linking is also a possibility.Devereux
@Zack I wouldn't be surprised if whole-program optimization took a month to run on this kind of code ;)Amido

With a program of that size, cache misses for code are very likely to exceed the cost of looping at runtime. I would recommend that you go back to your code generator and have it generate some compact representation of what it wants evaluated (i.e., one likely to fit in the D-cache), then execute that with an interpreter in your program. You could also see if you can factor out smaller kernels that still have a significant number of operations, and then use those as 'instructions' in the interpreted code.

Runkle answered 9/6, 2011 at 22:6 Comment(0)

The error occurs because you have too much CODE, not data! This is indicated by, for example, __libc_csu_fini (which is a function) being referenced from _start, where the relocation is truncated to fit. This means that _start (the program's true entry point) is trying to call that function via a SIGNED 32-bit offset, which has a range of only ±2 GB (±2^31 bytes). Since the total amount of your object code is ~2.8 GB, the facts check out.

If you could redesign your data structures, much of your code could be "compressed" by rewriting the huge expressions as simple loops.

Also, you could compute csc[] in a different program, store the results in a file, and just load them when necessary.

Devereux answered 9/6, 2011 at 19:33 Comment(4)
Could you provide an example of how you would rewrite the functions with simple loops? I don't follow you exactly. csc[] has to be computed very often, and I'd like to avoid disk I/O.Overbid
@bbtrb: For example, for func1 above, something like: for (int i = 0; i < N; ++i) expr += constants[i]*s.x14*s.x15*csc[49300 + i];.Receiver
@HighCommander4: absolutely, I agree. It's just above my head how to generate something like this automatically. Maybe with a separate array that stores the indices ...Overbid
@bbtrb: Since there's no freaking way that anyone wrote enough source to produce 2.8GB of object code by hand, especially with such un-mnemonic symbol names, a code generator must have been used. Work with that.Griffiths

I think everybody agrees there should be a different way to do what you want to do. Compiling hundreds of megabytes (gigabytes?) of code, linking it into a multi-gigabyte executable and running it just sounds very inefficient.

If I understand your problem correctly, you use some sort of code generator, G, to generate a bunch of functions func1...N which take a bunch of maps csc1...M as input. What you want to do is calculate csc1...M and then run a loop 1,000,000 times for different inputs, each time finding s = func1 + func2 + ... + funcN. You didn't specify how func1...N are related to csc1...M, though.

If all that is true, it seems that you should be able to turn the problem on its head in a different way, one which can potentially be much more manageable and possibly even faster (i.e., letting your machine's cache actually function).

Besides the practical problem of the object files sizes, your current program will not be efficient since it does not localize access to the data (too many huge maps) and has no localized code execution (too many very long functions).

How about breaking your program into 3 phases: phase 1, build csc1...M and store them; phase 2, build one func at a time, run it 1,000,000 times with each input, and store the results; phase 3, find the sum of the results of the stored func1...N outcomes for each of the 1,000,000 runs. The good part about this solution is that it can easily be made parallel across several independent machines.

Edit: @bbtrb, could you make one func and one csc available somewhere? They seem to be highly regular and compressible. For instance, func1 seems to be just a sum of expressions, each consisting of 1 coefficient, 2 indexes into the variables in s, and 1 index into csc, so it can be reduced to a nice loop. If you make complete examples available, I'm sure ways can be found to compress them into loops rather than long expressions.

Berkelium answered 10/6, 2011 at 2:4 Comment(2)
Yes, you understand correctly :) There are several problems with your suggestion though: 1. the worst funcs depend on almost all cscs, and those numbers have to be computed 10^6 times, too. 2. The input will be obtained from an adaptive Monte Carlo integrator, meaning that the integrator has to know the full result at each point to be able to reduce the resulting error by refining the mesh in the vicinity of the point if necessary. 3. The large expressions for csc persist ...Overbid
So does it mean you cannot calculate each csc in each iteration independently of the others? If they were independent, you could still run each one 10^6 times and store the results. However, if there are dependencies among them, maybe you need to find out which one is related to which, something like a dependency graph, and then try to see if you can break it into multiple independent sub-graphs. All in all, I think the key is to break the problem into multiple, independent sub-problems.Berkelium

If I read your errors correctly, what pushes you over the limit is the initialized data section (if it were the code, you would have far more errors, IMHO). Do you have big arrays of global data? If that is the case, I'd restructure the program so that they are allocated dynamically. If the data is initialized, I'd read it from a configuration file.

BTW seeing this:

(.text+0x20): undefined reference to `main'

I think you have another problem.

Pindling answered 9/6, 2011 at 18:47 Comment(1)
Yes you are right, stupid mistake, but it doesn't solve the other errors.Overbid

A couple of suggestions: optimize for size (-Os), turn your inline function calls into normal function calls, and enable string pooling.

Try splitting the things into different DLLs (shared objects, .so for Linux, .dylib for Mac OS X). Make sure that they can be unloaded. Then implement something to load things on demand, and free them when not needed.
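
A minimal sketch of such on-demand loading with the dl* API (the library and symbol names are hypothetical, the generated functions would need to be declared extern "C" so dlsym can find them by a predictable name, and you link with -ldl):

#include <dlfcn.h>
#include <vector>
#include <tr1/unordered_map>

// signature of the generated functions, as in func1 above
typedef double (*func_t)(std::tr1::unordered_map<int, double>&,
                         std::vector<EvaluationNode::Ptr>&,
                         ProcessVars&);

double call_func(const char* lib, const char* name,
                 std::tr1::unordered_map<int, double>& csc,
                 std::vector<EvaluationNode::Ptr>& ti,
                 ProcessVars& s)
{
    void* h = dlopen(lib, RTLD_NOW | RTLD_LOCAL);  // load one chunk
    if (!h) return 0.0;          // real code should report dlerror()
    double r = 0.0;
    if (func_t f = reinterpret_cast<func_t>(dlsym(h, name)))
        r = f(csc, ti, s);
    dlclose(h);                  // unload so memory use stays bounded
    return r;
}

Usage would then look like call_func("./libfuncs001.so", "func1", csc, ti, s).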

If not, split your code into different executables and use something to communicate between them (pipes, sockets, even writing/reading files). Clumsy, but what options do you have?

A totally different alternative: use a dynamic language with a JIT. Right off the top of my head: use LuaJIT, and rewrite (regenerate?) a lot of these expressions in Lua or other such languages and runtimes that allow code to be garbage collected.

LuaJIT is quite efficient, sometimes beating C/C++ for certain things, but often very close (it can sometimes be slow due to remaining weaknesses in its garbage collection). Check for yourself:

http://luajit.org/performance_x86.html

Download the scimark2.lua file from there and compare it with the "C" version (google it); often the results are very close.

Neural answered 10/6, 2011 at 0:12 Comment(0)

It looks to me like the code is doing numerical integration using some kind of adaptive depth method. Unfortunately, the code generator (or rather the author of the code generator) is so stupid as to generate one function per patch rather than one per type of patch. As such, it's produced too much code to be compiled, and even if it could be compiled, its execution would be painful because nothing's ever shared anywhere. (Can you imagine the pain resulting from having to load each page of object code from disk because nothing is ever shared, so it's always a candidate for the OS to evict? To say nothing of instruction caches, which are going to be useless.)

The fix is to stop unrolling everything; for this sort of code you want to maximize sharing, as the overhead of the extra instructions needed to access data in more complex patterns will be absorbed by the cost of dealing with the (presumably) large underlying dataset anyway. It's also possible that the code generator does this by default, and that the scientist saw some options for unrolling (with a note that these sometimes improve speed), turned them all on at once, and is now insisting that the resulting mess be accepted by the computer, rather than accepting the machine's real restrictions and using the numerically correct version that is generated by default. But if the code generator won't do it, get one that will (or hack the existing code).

The bottom line: compiling and linking 2.8GB of code doesn't work and shouldn't be forced to work. Find another way.

Griffiths answered 11/6, 2011 at 7:28 Comment(0)

Those expressions look a lot like an alternating series to me. I don't know what the rest of the code looks like, but it doesn't seem like it'd be that hard to derive the generating expression. It'd probably be worth it at execution time too, especially if you have 2.8 GB of 2 KB unrolled code.

Kinson answered 10/6, 2011 at 1:52 Comment(0)

The linker is attempting to generate 32-bit relocation offsets within a binary that has somehow exceeded these limitations. Try reducing the main program's address-space requirements.

Can you split some/most of the object code into one or more libraries (also compiled with -fpic / -fPIC)? Then generate a non-static binary that links against these libs. The libraries will live in discrete memory blocks and your relocation offsets will be dynamic/absolute (64-bit) rather than relative (32-bit).

Rik answered 10/6, 2011 at 11:56 Comment(0)

This looks like the result of code generation gone wrong, perhaps by symbolic algebra and/or manual unrolling. Symbolic manipulations are well known to grow exponentially in the depth of the expression tree or computational graph. It is likely that automatic differentiation can be used here, which would make the code size quite small and also speed up execution dramatically.
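
For reference, a minimal sketch of forward-mode automatic differentiation with dual numbers; this is a generic illustration of the technique, not code from the question:

#include <iostream>

// dual number: value plus derivative, propagated through arithmetic
struct Dual {
    double v, d;
    Dual(double v_, double d_ = 0.0) : v(v_), d(d_) {}
};
Dual operator+(Dual a, Dual b) { return Dual(a.v + b.v, a.d + b.d); }
Dual operator*(Dual a, Dual b) { return Dual(a.v * b.v, a.d * b.v + a.v * b.d); }

int main()
{
    Dual x(3.0, 1.0);                        // seed dx/dx = 1
    Dual f = x * x + Dual(2.0) * x;          // f(x) = x^2 + 2x
    std::cout << f.v << " " << f.d << "\n";  // prints 15 and f'(3) = 8
    return 0;
}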

Inkerman answered 10/6, 2011 at 10:10 Comment(0)
