How many asm-instructions per C-instruction?
R

8

14

I realize that this question is impossible to answer absolutely, but I'm only after ballpark figures:

Given a reasonably sized C program (thousands of lines of code), how many ASM instructions would be generated on average? In other words, what's a realistic C-to-ASM instruction ratio? Feel free to make assumptions, such as 'with current x86 architectures'.

I tried to Google about this, but I couldn't find anything.

Addendum: seeing how much confusion this question has caused, I feel I should explain. What I wanted to learn from the answers is what "3GHz" means in practical terms. I am fully aware that the throughput per Hertz varies tremendously depending on the architecture, your hardware, caches, bus speeds, and the position of the moon.

I am not after a precise and scientific answer, but rather an empirical answer that could be put into fathomable scales.

This isn't a trivial question to pose (as I came to notice), and this was my best effort at it. I know that the number of lines of ASM per line of C varies depending on what you are doing. i++ is not in the same neighborhood as sqrt(23.1); I know this. Additionally, no matter what ASM I get out of the C, the ASM is decoded into micro-operations within the processor, which, again, depends on whether you are running AMD, Intel or something else, and their respective generations. I'm aware of this as well.

The ballpark answers I've got so far are what I have been after: a large enough project averages about 2 lines of x86 ASM per line of ANSI C. Today's processors would probably average about one ASM instruction per clock cycle, once the pipelines are filled and given a big enough sample.

Refractometer answered 1/12, 2008 at 16:54 Comment(1)
What decision would you make which is dependent on the answer to this question?Scholem
H
12

I'm not sure what you mean by "C-instruction"; maybe statement or line? Of course this will vary greatly due to a number of factors, but after looking at a few sample programs of my own, many of them are close to the 2:1 mark (2 assembly instructions per line of code). I don't know what this means or how it might be useful.

You can figure this out yourself for any particular program and implementation combination by asking the compiler to generate only the assembly (gcc -S for example) or by using a disassembler on an already compiled executable (but you would need the source code to compare it to anyway).
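If you want to automate the counting, here is a minimal sketch in C. It assumes GNU-assembler-style text such as the output of gcc -S, and the function name count_asm_instructions is made up for illustration. It is a heuristic, not a real parser: blank lines, lines starting with '.' (directives and local labels), lines starting with '#', and lines ending in ':' are skipped, and everything else is counted as one instruction.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count lines that look like instructions in GNU-assembler-style text.
 * Heuristic only (an assumption, not a real parser): blank lines, lines
 * starting with '.' (directives, local labels), lines starting with '#'
 * (comments), and lines ending in ':' (labels) are skipped. */
size_t count_asm_instructions(const char *text)
{
    size_t count = 0;
    const char *line = text;

    while (*line != '\0') {
        /* Find the end of the current line. */
        const char *end = strchr(line, '\n');
        if (end == NULL)
            end = line + strlen(line);

        /* Trim leading and trailing whitespace from [line, end). */
        const char *first = line;
        while (first < end && isspace((unsigned char)*first))
            first++;
        const char *last = end;
        while (last > first && isspace((unsigned char)last[-1]))
            last--;

        if (first < last && *first != '.' && *first != '#' && last[-1] != ':')
            count++;

        /* Step past the newline, or stop at the terminating NUL. */
        line = (*end == '\n') ? end + 1 : end;
    }
    return count;
}
```

Read the generated .s file into a string and pass it in. Expect the count to change a lot with the optimization level and the target, so compare like with like.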

Edit

Just to expand on this based on your clarification of what you are trying to accomplish (understanding how many lines of code a modern processor can execute in a second):

While a modern processor may run at 3 billion cycles per second, that doesn't mean it can execute 3 billion instructions per second. Here are some things to consider:

  • Many instructions take multiple cycles to execute (division or floating point operations can take dozens of cycles to execute).
  • Most programs spend the vast majority of their time waiting for things like memory accesses, disk accesses, etc.
  • Many other factors including OS overhead (scheduling, system calls, etc.) are also limiting factors.

But in general yes, processors are incredibly fast and can accomplish amazing things in a short period of time.
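For a back-of-the-envelope feel, the arithmetic is just clock rate times instructions per cycle, divided by instructions per line of C. The sketch below is only that arithmetic. The function name c_lines_per_second is hypothetical, and all three inputs are assumptions you supply, for example the 2:1 ratio above and a guess of one instruction per cycle.

```c
/* Rough estimate of C lines executed per second.
 * All three inputs are caller-supplied assumptions:
 *   clock_hz          clock frequency in cycles per second
 *   instr_per_cycle   average instructions retired per cycle (IPC)
 *   instr_per_c_line  average assembly instructions per line of C */
double c_lines_per_second(double clock_hz, double instr_per_cycle,
                          double instr_per_c_line)
{
    return clock_hz * instr_per_cycle / instr_per_c_line;
}
```

With 3 GHz, one instruction per cycle and two instructions per line, this gives 1.5 billion lines per second. Halving the instructions per cycle halves the result, which is why the factors listed above matter far more than the 2:1 ratio does.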

Homeomorphism answered 1/12, 2008 at 17:5 Comment(5)
As said, I was asking for rough ballpark figures, and your empirical 2:1 ratio answers my question perfectly. Thank you for your answer.Refractometer
I have to ask, what are you trying to do exactly?Homeomorphism
Since you insist, I'm more or less trying to wrap my head around, in practical terms, what - say - 3GHz of processing power actually means. Now, whether it is a billion instructions per second or a tenth of that doesn't exactly matter, as it's a metric crapload nonetheless.Refractometer
OTOH it's not rare for code to run at more than 1 instruction per cycle. Modern x86 is 4-wide or 5-wide superscalar out-of-order. Skylake averages better than 1 IPC on most SPECint benchmarks.Cathcart
Modern Microprocessors A 90-Minute Guide! is a very good intro to modern CPUs. Yes you're right that 3 GHz != 3 billion instructions per second. Far from it. Anywhere from 15 or 18 billion x86 instructions per second (a 4 uop loop where 2 of the uops are macro-fused alu+branch) down to arbitrarily slow, like maybe 0.1 billion. (Not counting time spent sleeping on I/O, only bottlenecked on cache misses or other stalls, like branch mispredicts.)Cathcart
A
24

No answer is possible. Statements like int a; might require zero asm lines, while statements like a = call_is_inlined(); might require 20+ asm lines.

You can see for yourself by compiling a C program and then running objdump -Sd ./a.out. It will display asm and C code intermixed, so you can see how many asm lines are generated for one C line. Example:

test.c

int getCode(int c);
int main(void) {
    int a = 1, b = 2;
    return getCode(a) + b;
}

$ gcc -c -g test.c

$ objdump -Sd ./test.o

00000000 <main>:
int getCode(int c);
int main(void) { /* here, the prologue creates the frame for main */
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 14                sub    $0x14,%esp
    int a = 1, b = 2; /* initializing the locals; their space was reserved in the prologue */
  11:   c7 45 f4 01 00 00 00    movl   $0x1,-0xc(%ebp)
  18:   c7 45 f8 02 00 00 00    movl   $0x2,-0x8(%ebp)
    return getCode(a) + b;
  1f:   8b 45 f4                mov    -0xc(%ebp),%eax
  22:   89 04 24                mov    %eax,(%esp)
  25:   e8 fc ff ff ff          call   26 <main+0x26>
  2a:   03 45 f8                add    -0x8(%ebp),%eax
} /* the epilogue runs, returning to the previous frame */
  2d:   83 c4 14                add    $0x14,%esp
  30:   59                      pop    %ecx
  31:   5d                      pop    %ebp
  32:   8d 61 fc                lea    -0x4(%ecx),%esp
  35:   c3                      ret
Actually answered 1/12, 2008 at 17:4 Comment(1)
Thank you for your very vivid example. Alas, I was more interested in a ballpark average, as I'm aware that there is overhead in certain operations, not to mention complex functions. Still, I claim that the asm:C ratio eventually stabilizes, given enough lines of code.Refractometer
A
4

That varies tremendously! I wouldn't believe anyone if they tried to offer a rough conversion.

Statements like i++; can translate to a single INC AX.

Function calls with many parameters can take dozens of instructions as the stack is set up for the call.

Then factor in compiler optimization, which rearranges your code into a form different from what you wrote and often eliminates instructions.

Also, some instructions run better on machine word boundaries, so NOPs will be peppered throughout your code.

Armelda answered 1/12, 2008 at 17:0 Comment(0)
C
3

I don't think you can conclude anything useful whatsoever about performance of real applications from what you're trying to do here. Unless 'not precise' means 'within several orders of magnitude'.

You're overgeneralising, and you're dismissing caching, etc., as though it's secondary, whereas it may well be totally dominant.

If your application is large enough to have trended to some average instructions-per-loc, then it will also be large enough to have I/O or at the very least significant RAM access issues to factor in.

Cirenaica answered 1/12, 2008 at 18:37 Comment(0)
C
2

Depending on your environment, you could use the Visual Studio compiler option /FAs.

more here

Clifton answered 1/12, 2008 at 18:8 Comment(0)
K
1

I am not sure there is really a useful answer to this. For sure you will have to pick the architecture (as you suggested).

What I would do: take a reasonably sized C program, give gcc the "-S" option, and check for yourself. It will generate the assembly source code, and you can calculate the ratio for that program.

Konstance answered 1/12, 2008 at 17:2 Comment(0)
V
1

RISC or CISC? What's an instruction in C, anyway?

Which is to repeat the above points that you really have no idea until you get very specific about the type of code you're working with.

You might try reviewing the academic literature regarding assembly optimization and the hardware/software interference cross-talk that has happened over the last 30-40 years. That's where you're going to find some kind of real data about what you're interested in. (Although I warn you, you might wind up seeing C->PDP data instead of C->IA-32 data).

Vomit answered 1/12, 2008 at 19:4 Comment(0)
S
1

You wrote in one of the comments that you want to know what 3GHz means.

Even the frequency of the CPU does not matter. Modern PC CPUs interleave and schedule instructions heavily; they fetch and prefetch, cache memory and instructions, and often that cache is invalidated and thrown in the bin. The best measure of processing power comes from running real-world performance benchmarks.

Substantialize answered 9/8, 2011 at 6:21 Comment(2)
Yes, a 3GHz Skylake is many times faster than a 3GHz Pentium 4, to pick extreme examples of high and low IPC. But for a given microarchitecture, performance does scale with frequency. Unless your code is memory-bound on bandwidth or cache-miss latency. But usually caches work, and there is significant performance scaling with frequency. e.g. for Skylake, a 4-wide superscalar out-of-order CPU, it can issue up to 4 uops every clock cycle into the out-of-order back-end. Most instructions decode to a single uop. See agner.org/optimize for more about what an x86 CPU can do per cycle.Cathcart
@PeterCordes: Thanks for your comment - Agner's guides are gold. You are of course completely right regarding architecture - but then again, there are so many architectures on the market at the moment. Even if you limit the architectures to those of mainstream CPUs since I wrote this answer (2011), they are numerous.Substantialize
