How does C# compilation get around needing header files?

I

5

31

I've spent my professional life as a C# developer. As a student I occasionally used C but did not deeply study its compilation model. Recently I jumped on the bandwagon and have begun studying Objective-C. My first steps have only made me aware of holes in my pre-existing knowledge.

From my research, C/C++/ObjC compilation requires all encountered symbols to be pre-declared. I also understand that building is a two-step process. First, you compile each individual source file into an individual object file. These object files might have undefined "symbols" (which generally correspond to the identifiers declared in the header files). Second, you link the object files together to form your final output. This is a pretty high-level explanation, but it satisfies my curiosity enough. I'd also like to have a similar high-level understanding of the C# build process.

Q: How does the C# build process get around the need for header files? I'd imagine perhaps the compilation step does two passes?

(Edit: follow-up question here: "How do C/C++/Objective-C compare with C# when it comes to using libraries?")

Isometry answered 16/12, 2009 at 21:39 Comment(0)

S

40

I see that there are multiple interpretations of the question. I answered the intra-solution interpretation, but let me fill it out with all the information I know.

The "header file metadata" is present in the compiled assemblies, so any assembly you add a reference to will allow the compiler to pull in the metadata from those.

As for things not yet compiled that are part of the current solution, the compiler does a two-pass compilation: first it reads namespaces, type names, and member names, i.e. everything but the code. Once that checks out, it reads the code and compiles it.

This allows the compiler to know what exists and what doesn't exist (in its universe).

To see the two-pass compiler in effect, test the following code, which has three problems: two declaration-related problems and one code problem.

using System;

namespace ConsoleApplication11
{
    class Program
    {
        public static Stringg ReturnsTheWrongType()
        {
            return null;
        }

        static void Main(string[] args)
        {
            CallSomeMethodThatDoesntExist();
        }

        public static Stringg AlsoReturnsTheWrongType()
        {
            return null;
        }
    }
}

Note that the compiler will only complain about the two Stringg types that it cannot find. If you fix those, it then complains that it cannot find the method called in the Main method.

Stickney answered 16/12, 2009 at 21:46 Comment(4)

Does this really answer the question as stated? This is a good example of how the two-pass compiler works to resolve references within the current source file. But C/C++ headers are usually used to provide the signatures and extern definitions that will be supplied by other sources/objects in the project (or another project). Thus, it seems that the actual answer is that such metadata is provided in the referenced assemblies, so there is no need for the headers. – Hinton 17/12, 2009 at 3:2

I think it does answer the question. The problem is how does the compiler know about types used later in the compilation process. The compiler first has to parse the existing C# code to generate the type information. Strictly, it doesn't need to reference any other library if it doesn't need to import types, but it still needs to build a symbol table from the entire source base. – Dealer 17/12, 2009 at 7:14

@Kevin and @codekaizen, I believe you both have valid points. My question didn't specify if I was interested in the process when types need to be imported. (I didn't know enough at the time to know that I was indeed interested in it.) Updating my question to stipulate this. – Isometry 17/12, 2009 at 8:35

Actually I'll create a new question instead of editing this one. – Isometry 17/12, 2009 at 9:21

T

100

UPDATE: This question was the subject of my blog for February 4th 2010. Thanks for the great question!

Let me lay it out for you. In the most basic sense the compiler is a "two pass compiler" because the phases that the compiler goes through are:

Generation of metadata.
Generation of IL.

Metadata is all the "top level" stuff that describes the structure of the code. Namespaces, classes, structs, enums, interfaces, delegates, methods, type parameters, formal parameters, constructors, events, attributes, and so on. Basically, everything except method bodies.

IL is all the stuff that goes in a method body -- the actual imperative code, rather than metadata about how the code is structured.

The first phase is actually implemented via a great many passes over the sources. It's way more than two.

The first thing we do is take the text of the sources and break it up into a stream of tokens. That is, we do lexical analysis to determine that

class c : b { }

is class, identifier, colon, identifier, left curly, right curly.
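To make the lexing step concrete, here is a minimal sketch of a tokenizer that handles only the example above. It is not the real C# lexer; the `TinyLexer` class and its token names are hypothetical and chosen to match the wording in this answer.

```csharp
using System;
using System.Collections.Generic;

static class TinyLexer
{
    // Hypothetical sketch: recognises one keyword, identifiers, and three punctuators.
    public static List<string> Lex(string source)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < source.Length)
        {
            char ch = source[i];
            if (char.IsWhiteSpace(ch)) { i++; continue; }

            if (char.IsLetter(ch) || ch == '_')
            {
                // Consume a whole word, then decide whether it is a keyword.
                int start = i;
                while (i < source.Length && (char.IsLetterOrDigit(source[i]) || source[i] == '_')) i++;
                string word = source.Substring(start, i - start);
                tokens.Add(word == "class" ? "class" : "identifier");
                continue;
            }

            switch (ch)
            {
                case ':': tokens.Add("colon"); break;
                case '{': tokens.Add("left curly"); break;
                case '}': tokens.Add("right curly"); break;
                default: throw new FormatException("Unexpected character '" + ch + "' at position " + i);
            }
            i++;
        }
        return tokens;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(string.Join(", ", TinyLexer.Lex("class c : b { }")));
        // prints: class, identifier, colon, identifier, left curly, right curly
    }
}
```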

We then do a "top level parse" where we verify that the token streams define a grammatically correct C# program. However, we skip parsing method bodies. When we hit a method body, we just blaze through the tokens until we get to the matching close curly. We'll come back to it later; we only care about getting enough information to generate metadata at this point.

We then do a "declaration" pass where we make notes about the location of every namespace and type declaration in the program.

We then do a pass where we verify that all the types declared have no cycles in their base types. We need to do this first because in every subsequent pass we need to be able to walk up type hierarchies without having to deal with cycles.
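As an illustration of what a base-type cycle check has to do, here is a small sketch. It assumes a deliberately simplified model, a dictionary from type name to base type name; the `BaseTypeCycleCheck` class is hypothetical and is not how the compiler actually represents types.

```csharp
using System;
using System.Collections.Generic;

static class BaseTypeCycleCheck
{
    // baseOf maps a declared type name to its base type name (hypothetical model).
    // A type with no entry is treated as the end of the chain.
    public static bool HasCycle(IDictionary<string, string> baseOf, string type)
    {
        var seen = new HashSet<string>();
        string current = type;
        while (current != null)
        {
            if (!seen.Add(current)) return true; // visited twice: the chain loops
            string next;
            current = baseOf.TryGetValue(current, out next) ? next : null;
        }
        return false;
    }
}

class Program
{
    static void Main()
    {
        var baseOf = new Dictionary<string, string>
        {
            { "A", "B" },   // class A : B { }
            { "B", "A" },   // class B : A { }  (illegal: circular base class)
            { "C", "D" },   // class C : D { }
        };
        Console.WriteLine(BaseTypeCycleCheck.HasCycle(baseOf, "A")); // True
        Console.WriteLine(BaseTypeCycleCheck.HasCycle(baseOf, "C")); // False
    }
}
```

Once a check like this has passed for every type, later passes can walk up a hierarchy with a plain loop and know it terminates.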

We then do a pass where we verify that all generic parameter constraints on generic types are also acyclic.

We then do a pass where we check whether every member of every type -- methods of classes, fields of structs, enum values, and so on -- is consistent. No cycles in enums, every overriding method overrides something that is actually virtual, and so on. At this point we can compute the "vtable" layouts of all interfaces, classes with virtual methods, and so on.

We then do a pass where we work out the values of all "const" fields.

At this point we have enough information to emit almost all the metadata for this assembly. We still do not have information about the metadata for iterator/anonymous function closures or anonymous types; we do those late.

We can now start generating IL. For each method body (and properties, indexers, constructors, and so on), we rewind the lexer to the point where the method body began and parse the method body.

Once the method body is parsed, we do an initial "binding" pass, where we attempt to determine the types of every expression in every statement. We then do a whole pile of passes over each method body.

We first run a pass to transform loops into gotos and labels.

(The next few passes look for bad stuff.)

Then we run a pass to look for use of deprecated types, for warnings.

Then we run a pass that searches for uses of anonymous types that we haven't emitted metadata for yet, and emit those.

Then we run a pass that searches for bad uses of expression trees. For example, using a ++ operator in an expression tree.

Then we run a pass that looks for all local variables in the body that are defined, but not used, to report warnings.

Then we run a pass that looks for illegal patterns inside iterator blocks.

Then we run the reachability checker, to give warnings about unreachable code, and tell you when you've done something like forgotten the return at the end of a non-void method.

Then we run a pass that verifies that every goto targets a sensible label, and that every label is targeted by a reachable goto.

Then we run a pass that checks that all locals are definitely assigned before use, notes which local variables are closed-over outer variables of an anonymous function or iterator, and which anonymous functions are in reachable code. (This pass does too much. I have been meaning to refactor it for some time now.)

At this point we're done looking for bad stuff, but we still have way more passes to go before we sleep.

Next we run a pass that detects missing ref arguments to calls on COM objects and fixes them. (This is a new feature in C# 4.)

Then we run a pass that looks for stuff of the form "new MyDelegate(Foo)" and rewrites it into a call to CreateDelegate.

Then we run a pass that transforms expression trees into the sequence of factory method calls necessary to create the expression trees at runtime.

Then we run a pass that rewrites all nullable arithmetic into code that tests for HasValue, and so on.
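To show the kind of rewrite the nullable pass performs, here is a hand-written approximation. `AddLowered` is only my guess at the general shape (test HasValue, then operate on the underlying values); the code the compiler really generates may differ. The `NullableLowering` class and method names are made up for this example.

```csharp
using System;

static class NullableLowering
{
    // What the programmer writes.
    public static int? AddSource(int? a, int? b)
    {
        return a + b;
    }

    // Approximately the shape after lowering: null if either operand is null.
    public static int? AddLowered(int? a, int? b)
    {
        return (a.HasValue && b.HasValue)
            ? (int?)(a.GetValueOrDefault() + b.GetValueOrDefault())
            : (int?)null;
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(NullableLowering.AddLowered(2, 3));             // 5
        Console.WriteLine(NullableLowering.AddLowered(2, null).HasValue); // False
    }
}
```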

Then we run a pass that finds all references of the form base.Blah() and rewrites them into code which does the non-virtual call to the base class method.

Then we run a pass which looks for object and collection initializers and turns them into the appropriate property sets, and so on.
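The initializer rewrite can be pictured as follows. This is a sketch rather than the compiler's actual output: the `Point` type and the `InitializerLowering` class are hypothetical, and the "lowered" methods are written by hand to show the equivalent property sets and Add calls on a temporary.

```csharp
using System;
using System.Collections.Generic;

class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

static class InitializerLowering
{
    // Object initializer as written in source.
    public static Point Source()
    {
        return new Point { X = 1, Y = 2 };
    }

    // Equivalent form: construct into a temporary, then set each property.
    public static Point Lowered()
    {
        Point temp = new Point();
        temp.X = 1;
        temp.Y = 2;
        return temp;
    }

    // Collection initializer as written in source.
    public static List<int> CollectionSource()
    {
        return new List<int> { 1, 2, 3 };
    }

    // Equivalent form: one Add call per element.
    public static List<int> CollectionLowered()
    {
        List<int> temp = new List<int>();
        temp.Add(1);
        temp.Add(2);
        temp.Add(3);
        return temp;
    }
}

class Program
{
    static void Main()
    {
        Point p = InitializerLowering.Lowered();
        Console.WriteLine(p.X + "," + p.Y);                                           // 1,2
        Console.WriteLine(string.Join(",", InitializerLowering.CollectionLowered())); // 1,2,3
    }
}
```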

Then we run a pass which looks for dynamic calls (in C# 4) and rewrites them into dynamic call sites that use the DLR.

Then we run a pass that looks for calls to removed methods. (That is, partial methods with no actual implementation, or conditional methods that don't have their conditional compilation symbol defined.) Those are turned into no-ops.

Then we look for unreachable code and remove it from the tree. No point in codegenning IL for it.

Then we run an optimization pass that rewrites trivial "is" and "as" operators.

Then we run an optimization pass that looks for switch(constant) and rewrites it as a branch directly to the correct case.

Then we run a pass which turns string concatenations into calls to the correct overload of String.Concat.
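For the string concatenation pass, the before and after look roughly like this. The `ConcatLowering` class is a made-up name for illustration; `String.Concat(string, string, string)` is the real framework overload.

```csharp
using System;

static class ConcatLowering
{
    // What the programmer writes.
    public static string Source(string a, string b, string c)
    {
        return a + b + c;
    }

    // The same expression as a single call to the three-argument overload.
    public static string Lowered(string a, string b, string c)
    {
        return String.Concat(a, b, c);
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(ConcatLowering.Lowered("two", "-", "pass")); // two-pass
    }
}
```

One call with three arguments avoids building the intermediate string that two separate two-argument concatenations would create.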

(Ah, memories. These last two passes were the first things I worked on when I joined the compiler team.)

Then we run a pass which rewrites uses of named and optional parameters into calls where the side effects all happen in the correct order.

Then we run a pass which optimizes arithmetic; for example, if we know that M() returns an int, and we have 1 * M(), then we just turn it into M().

Then we do generation of the code for anonymous types first used by this method.

Then we transform anonymous functions in this body into methods of closure classes.

Finally, we transform iterator blocks into switch-based state machines.
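To give a feel for the iterator transformation, here is a hand-written state machine for a two-element iterator. It is a simplified sketch, not the generated code: the real output has more states, thread checks, and compiler-generated names. All type names below are hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class IteratorLowering
{
    // What the programmer writes.
    public static IEnumerable<int> Source()
    {
        yield return 1;
        yield return 2;
    }

    // Hand-written approximation of the rewritten form.
    public static IEnumerable<int> Lowered()
    {
        return new OneTwoEnumerable();
    }

    sealed class OneTwoEnumerable : IEnumerable<int>
    {
        public IEnumerator<int> GetEnumerator() { return new OneTwoEnumerator(); }
        IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
    }

    sealed class OneTwoEnumerator : IEnumerator<int>
    {
        int state;   // records which yield to resume after
        int current; // value handed out by the last MoveNext

        public int Current { get { return current; } }
        object IEnumerator.Current { get { return current; } }

        public bool MoveNext()
        {
            switch (state)
            {
                case 0: current = 1; state = 1; return true;
                case 1: current = 2; state = 2; return true;
                default: return false; // ran off the end of the iterator block
            }
        }

        public void Reset() { throw new NotSupportedException(); }
        public void Dispose() { }
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(string.Join(",", IteratorLowering.Lowered())); // 1,2
    }
}
```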

Then we emit the IL for the transformed tree that we've just computed.

Easy as pie!

Trichiasis answered 17/12, 2009 at 0:2 Comment(9)

Looks more like a 30 pass compiler to me ;) – Rivulet 17/12, 2009 at 3:23

Wow one of the most insightful answers I've ever read! Thanks! – Isometry 17/12, 2009 at 9:52

I am glad that I am working on more difficult problems and not on 'easy as pie' problems! ;) – Keffer 17/12, 2009 at 14:58

Well, even if I was completely wrong and down-voted a thousand times, I'm glad I decided to take a stab at this question if only to read how the C# compiler works internally from one of the creators. – Dealer 17/12, 2009 at 21:27

It's really not as Easy as pie! – Hypodermis 9/1, 2010 at 7:44

I wish I were able to fav answers instead of just questions. – Mouser 9/1, 2010 at 10:14

For me, making a pie is way harder than any programming task. – Andersonandert 1/2, 2010 at 17:57

@Eric -- Do all your blog posts start out as SO answers? – Dealt 3/10, 2010 at 3:44

@zildjohn01: No, but these days a lot of them do. My blog posts mostly start out as questions that I've answered somewhere else; it used to be that my primary source of questions was the internal programming language discussion email lists at Microsoft. I also re-use a lot of my analysis of errors in books; I edit books about C# as a hobby. But SO is such a rich mine for great questions that it is now the primary source. – Trichiasis 3/10, 2010 at 15:18

C

5

It uses the metadata from the reference assemblies. That contains a full type declaration, same thing as you'd find in a header file.

It being a two-pass compiler accomplishes something else: you can use a type in one source file before it is declared in another source code file.
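A small sketch of what that means in practice: `Main` below calls `Helper.Greet` even though `Helper` is declared later, and no forward declaration or header is needed. `Helper` is a hypothetical class; it could equally live in a different .cs file of the same project.

```csharp
using System;

class Program
{
    static void Main()
    {
        // Helper is declared further down; the compiler has already collected
        // its declaration by the time it binds this method body.
        Console.WriteLine(Helper.Greet("world")); // Hello, world
    }
}

static class Helper
{
    public static string Greet(string name)
    {
        return "Hello, " + name;
    }
}
```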

Categorize answered 16/12, 2009 at 21:48 Comment(1)

Ah yes, assembly metadata... I'll have to read up on that. Thanks. – Isometry 16/12, 2009 at 21:54

D

1

It's a 2-pass compiler. http://en.wikipedia.org/wiki/Multi-pass_compiler

Dealer answered 16/12, 2009 at 21:44 Comment(4)

This is not a complete answer... It only addresses intra-project references, not inter-library references – Plethoric 16/12, 2009 at 21:52

It didn't seem to be that the question involved understanding the type system - just how the compiler resolves type references within an assembly scope. – Dealer 16/12, 2009 at 21:54

@Dealer - I have to disagree, since C header files are used for forward references in your code as well as references to external libraries. – Plethoric 16/12, 2009 at 22:45

@Jeffery, sure, but that's not really what the question was asking. Also, the mechanism of referencing external types is so fundamentally different in C# than C++ that it's really much more than just "referencing". C++ contains external type definitions in header files, C# accesses assembly metadata and uses the Common Type System. – Dealer 16/12, 2009 at 23:11

K

1

All the necessary information can be obtained from the referenced assemblies.

So there are no header files, but the compiler does need access to the DLLs being used.

And yes, it is a 2-pass compiler but that doesn't explain how it gets information about library types.

Kordofanian answered 16/12, 2009 at 21:46 Comment(3)

But the question really doesn't involve type representation. – Dealer 16/12, 2009 at 21:51

+1 as this would be my answer to the question as stated and the real reason there is no need for header includes. – Hinton 17/12, 2009 at 3:4

@Ken - what about if no types are imported? It doesn't really answer the question in that case. – Dealer 17/12, 2009 at 7:15
