How do I embed source into pdb, and have debugger(s) use it?

NOTE: my main concern is C# targeting the CLR with regular MSIL, in case there's something that works for that but not in the more general case(s).

Some existing source debugging support examples

There was recently a release of the Sourcepack project which allows a user to rewrite the source paths in a pdb file to point at different locations. This is very useful when you have the source for the assembly, but don't want to try and get it into the exact same filesystem location(s) as when it was built.

http://lowleveldesign.wordpress.com/2011/08/26/sourcepack-released/

For open-source projects, using http://www.symbolsource.org/ as a way of making it simple for users of your project to get symbols and source is an excellent idea.

Problem

However, very often there are projects where either for legal or convenience reasons, using such an approach isn't very feasible. Also, the set of people that might be debugging the project may be relatively small or contained.

By default, the pdbs for a project include pointers to the files on disk (IIRC). Source indexing can then add pointers to the source locations (for instance, in a version control system), and a source server uses those pointers to actually fetch the source.

Goal

It seems like it would be simpler (for certain builds, like debug and/or internal-only) to just put the actual source into the pdb (effectively just dereferencing the pointer currently written in the PDB). Then you could skip the entire source server part (at least in theory) and eliminate a few dependencies from the debug-time story. Whether to store the source compressed or not is largely orthogonal, but a first pass would probably not compress it, to make it simpler to implement for existing debuggers.

Since the PDB-matching-binary story is already very good, putting the source into the PDB would be even better than a source server pointer, since the pointer can break over time (source control system moves, or changes to a different system, or whatever), but the actual source sitting in the PDB is good 'forever'.

How is this different than 'source server' support?

(this was added via edit after Tigran's comment asking what the benefits would be)

The 'baseline' scenario that this should be compared against is that of a 'normal' debugging experience using a 'normal' source server instance today. In that scenario, (AFAIK) the debugging engine gets a pointer from the PDB (via an alternate stream) then uses the registered source server(s) to attempt to get the source via that pointer. Since a given assembly is typically going to include multiple source files, there's either a single pointer that includes a base location or there are multiple pointers in the PDB (or something else), but that should be orthogonal to this discussion.

For a project where keeping the source hidden/inaccessible is desirable (most Microsoft products, for instance, including Windows, Office, Visual Studio, etc.), having the PDB contain pointers is FAR superior to including actual source (even if it were encrypted). Such pointers are meaningless without the necessary network access and permissions, so you can ship the PDB to anyone on the planet without worrying about them being able to access your source (worst-case, they get a glimpse into how your source tree is arranged, I would think).

However, there are 2 large sets of projects (and specifically, builds) where this 'hide the source' benefit doesn't exist.

The first are builds that are only used by people that have access to the source anyway. Builds done on your own machine that won't ever leave that machine are a great example, as an attacker would need to read files from your filesystem anyway to get the source, so reading from one file (.cs) vs. another (.pdb) is a relatively small difference in terms of attack difficulty/vector. The same goes for builds that are pushed to a test/staging environment where the people that can access the pdb on the machine are equal to or a subset of the people that can access the source 'normally'.

The second are (somewhat obviously) open-source projects, where the source for the project is already open for everyone anyway, so there's no benefit to hiding the source from anyone.

Note that this could be relatively easily extended to include the source in an encrypted form instead (since we're already talking about having to store format/encoding data as well), but the added complexity of that would make such a scenario likely less useful than just using a 'normal' source server.

Benefits?

With the above descriptions out of the way, the list of potential benefits to allowing this includes (but is not limited to :) these that pop into my head at the moment:

- No need to deal with setting up source server support. It Just Works (IJW), at least when/if debuggers know to look in the pdb.
  - In the meantime, you could still do a 'fixed' source server which was just a dummy that extracted the source and fed it back to the caller. Such a configuration could be the same for everyone (using localhost, for instance), still eliminating the current need to actually configure a source server.
- No need for the build to include 'source indexing'
  - Since a build reads the source files and writes the pdb files anyway, we're just modifying what's written in the pdb and not taking any build-time perf hit for doing network calls or reading data we don't already have in memory.
  - Until there is 'native' build support for putting the source in, it could be a simple post-build step, likely implemented at first via a small fork of the Sourcepack project since it already does the work of reading/modifying PDB files :)
- No dependency on the team/project having a source control system
- No dependency on the particular version of each file being checked into the source control system (most people don't check in for every single build they do in their IDE)
- No need to have access to the particular source control system that has the file
  - In the DVCS case, for instance, the PDB pointer may be to some 'random' instance of git or mercurial or whatever, not necessarily one you have access to.
  - The source server tooling to track that version back to the source control server instance(s) you do have access to (if it even exists there) doesn't yet exist, AFAIK.
- No problem if the project dies (gets deleted) or moves
  - For instance, if the project moves from one to another of: self-hosted, sourceforge, github, bitbucket, codeplex, code.google.com, etc.
- No problem if the machine you're debugging on has no (or insufficient) network access
  - For instance, if you're doing a 'network KVM' into a box to debug an issue, but it either has no network or can only talk to disconnected networks, such that it can't access your source control server.
- In the extreme case, the ability to recover some of the project source from a build. ;)

NOTE: another approach would be including the source in the actual assembly (for instance, as a resource), but the pdb is a better choice (easy to ship a build without pdb's, no normal runtime perf hit if the source is in the pdb since the assembly is the same code and same size, etc)

How to implement?

On the surface of it, this kind of support doesn't seem like it would be too difficult to add, but I get the feeling this is because I don't really know enough about the mechanics involved instead of it actually being a simple thing to implement. :)

My guess would be something along the lines of:

- Add a post-build step that would do something similar to Sourcepack, but instead of changing the pointer, it would replace it with the actual source.
  - Depending on what the source server needs to do, the source might need to be prefixed, or the actual source would go in a different alternate data stream and the 'pointer' would get updated to something like 'source-in-pdb:ads-foo.cs' or whatever. The prefix or pointer could also include how the source file was stored (uncompressed, gzip, bzip2, etc., along with the encoding of the file).
- Implement a 'source server' that actually extracts the source from the pdb in question and returns it.
  - No idea if the source server 'API' has enough info to get the location of the PDB, let alone whether it would have permission to actually read the contents.
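To make the 'prefix' idea concrete, here is a minimal sketch in Python of what a self-describing blob could look like. It does not touch a real PDB at all. The `source-in-pdb` tag, the header layout and the function names `pack_source` / `unpack_source` are all made up for illustration, not part of any existing format or tool.

```python
import bz2
import gzip

# Hypothetical header tag; nothing in the PDB format defines this.
MAGIC = b"source-in-pdb"

# Storage method name -> (compress, decompress) pair.
_CODECS = {
    "none": (lambda data: data, lambda data: data),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def pack_source(text, encoding="utf-8", compression="none"):
    """Return a header line plus payload describing one source file."""
    compress, _ = _CODECS[compression]
    header = b";".join(
        [MAGIC, compression.encode("ascii"), encoding.encode("ascii")]
    )
    return header + b"\n" + compress(text.encode(encoding))


def unpack_source(blob):
    """Invert pack_source: read the header, then decompress and decode."""
    # Split only on the first newline; the payload may contain newlines too.
    header, payload = blob.split(b"\n", 1)
    magic, compression, encoding = header.split(b";")
    if magic != MAGIC:
        raise ValueError("not a source-in-pdb blob")
    _, decompress = _CODECS[compression.decode("ascii")]
    return decompress(payload).decode(encoding.decode("ascii"))
```

Something like `unpack_source(pack_source(text, "utf-16", "gzip"))` should give back the original text. A 'dummy' source server would only need the unpack half, provided it can get at the blob in the first place.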

Sanity check?

With the babble above out of the way, the questions are really:

Does this kind of thing already exist? (and if so, please provide pointers!)
Assuming it doesn't exist yet, does the above make sense as a first-pass implementation? Are there pitfalls or complexities the above skips over?
Assuming "no" and "yes" for the above, is there an existing project that makes sense in terms of taking this on (it's close or in their existing scope)?

I've read over this and wanted to summarize my understanding for clarity.

Today the debugger uses the PDB to get the disk path and checksum of the file that was compiled to create a given section of an executable. The debugger then attempts to load the file using both the local disk and any available source server. Under this proposal we would skip the middle man by just embedding the file itself into the PDB. Eureka, no more searching for source!

As someone who's done their fair share of digging for source code in this manner I like the idea of having one package for all your debugging needs. There are a couple of facets to consider about this proposal though.

The first is the actual embedding of the source code into the PDB. This is very doable. The PDB is essentially a lightweight file database. There is structure to what it encodes but AFAIK you can put whatever you want into certain slots (local variable values / types for example). There may be size limitations for certain slots but I'm sure you could invent an encoding scheme to break large files up into chunks.
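As a rough illustration of the chunking idea, here is a small Python sketch. It models the PDB 'slots' as a plain dictionary. The key scheme `name#index/count` and the function names are invented for this example, and it assumes file names never contain `#`.

```python
def to_slots(name, data, slot_size):
    """Split data into numbered slots of at most slot_size bytes each."""
    if slot_size <= 0:
        raise ValueError("slot_size must be positive")
    # Ceiling division; an empty file still gets one (empty) slot.
    count = max(1, -(-len(data) // slot_size))
    return {
        f"{name}#{i}/{count}": data[i * slot_size:(i + 1) * slot_size]
        for i in range(count)
    }


def from_slots(name, slots):
    """Reassemble the bytes for name, checking that no chunk is missing."""
    prefix = name + "#"
    parts = {}
    total = None
    for key, value in slots.items():
        if not key.startswith(prefix):
            continue  # slot belongs to some other file
        index, count = key[len(prefix):].split("/")
        parts[int(index)] = value
        total = int(count)
    if total is None or sorted(parts) != list(range(total)):
        raise ValueError("missing chunk(s) for " + name)
    return b"".join(parts[i] for i in range(total))
```

Storing the chunk count in every key means a reader can tell a truncated file from a complete one without a separate index.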

The second facet is having the debugger actually load the file from the PDB vs. searching for it on disk. I'm not as familiar with that part of the debugger, but from what I understand it only uses 2 pieces of information to locate the file:

The path to the file on disk
The checksum of said file (used to disambiguate files with the same name)
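A lookup based on just those two pieces of information might look roughly like this Python sketch. It is not how any particular debugger is implemented. The function names, the search-directory list and the default hash algorithm are assumptions; use whichever algorithm the PDB actually recorded.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256"):
    """Hash a file's raw bytes and return the lowercase hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def find_source(recorded_path, recorded_checksum, search_dirs,
                algorithm="sha256"):
    """Return the first candidate whose checksum matches, else None.

    Tries the recorded path itself, then a file of the same name in
    each search directory.
    """
    # Accept both Windows and POSIX separators in the recorded path.
    name = recorded_path.replace("\\", "/").rsplit("/", 1)[-1]
    candidates = [recorded_path]
    candidates += [os.path.join(folder, name) for folder in search_dirs]
    wanted = recorded_checksum.lower()
    for candidate in candidates:
        if os.path.isfile(candidate) and \
                file_checksum(candidate, algorithm) == wanted:
            return candidate
    return None
```

The checksum is what keeps a stale or unrelated file with the same name from being picked up.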

I'm fairly certain this is the only information it passes on to a source server. This makes it infeasible to implement a source server that reads from the PDB, because it won't have access to the PDB (assuming of course I'm right).

I dug around hoping there was a VS COM component you could override which would allow you to intercept the loading of the file for a given path but I couldn't find one.

One approach I think would be feasible, though, would be:

Embed the source in the PDB
Have a tool which can both extract the source to a known location and rewrite the PDB to point to that place.

This wouldn't be quite what you want though.
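For what it's worth, the 'extract to a known location' half of that tool could be sketched like this in Python. The embedded sources are modelled as a plain dictionary of original path to text, and `extract_sources` is a hypothetical name. Actually rewriting the paths inside the PDB is left to a Sourcepack-style tool and is not shown.

```python
import os


def extract_sources(embedded, target_dir):
    """Write each embedded file under target_dir.

    Returns {original_path: new_path}, i.e. the path rewrites that a
    separate PDB-rewriting step would then need to apply.
    """
    mapping = {}
    for original_path, text in embedded.items():
        # Normalise separators, drop the drive colon and any '.'/'..'
        # segments so the result always stays inside target_dir.
        parts = [
            part.rstrip(":")
            for part in original_path.replace("\\", "/").split("/")
            if part not in ("", ".", "..")
        ]
        new_path = os.path.join(target_dir, *parts)
        os.makedirs(os.path.dirname(new_path), exist_ok=True)
        # newline="" writes the text as-is, so line endings (and thus
        # any recorded checksum) are not silently changed.
        with open(new_path, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
        mapping[original_path] = new_path
    return mapping
```

Keeping the original directory layout under the target directory avoids collisions between files that share a name.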