How do I embed source into pdb, and have debugger(s) use it?

NOTE: my target concern is C# targeting the CLR with regular MSIL in case there's something that works for that but not in the more general case(s).

Some existing source debugging support examples

There was recently a release of the Sourcepack project which allows a user to rewrite the source paths in a pdb file to point at different locations. This is very useful when you have the source for the assembly, but don't want to try and get it into the exact same filesystem location(s) as when it was built.

http://lowleveldesign.wordpress.com/2011/08/26/sourcepack-released/

For open-source projects, using http://www.symbolsource.org/ as a way of making it simple for users of your project to get symbols and source is an excellent idea.

Problem

However, very often there are projects where either for legal or convenience reasons, using such an approach isn't very feasible. Also, the set of people that might be debugging the project may be relatively small or contained.

By default, the pdb's for a project include pointers to the files on disk (IIRC) and then source indexing can add the ability to embed pointers to the source locations (for instance, in a version control system), with a source server then using the pointers to actually fetch the source.

Goal

It seems like things could be simpler (for certain builds, like debug and/or internal-only) to just put the actual source into the pdb (effectively just dereferencing the pointer currently written in the PDB). It seems like then you can skip the entire source server part (at least in theory) and eliminate a few dependencies on the debug-time story. Whether to store the source as compressed or not is largely orthogonal, but a first pass would probably not do so in an effort to make it simpler to implement for existing debuggers.

Since the PDB-matching-binary story is already very good, putting the source into the PDB would be even better than a source server pointer, since the pointer can break over time (source control system moves, or changes to a different system, or whatever), but the actual source sitting in the PDB is good 'forever'.

How is this different than 'source server' support?

(this was added via edit after Tigran's comment asking what the benefits would be)

The 'baseline' scenario that this should be compared against is that of a 'normal' debugging experience using a 'normal' source server instance today. In that scenario, (AFAIK) the debugging engine gets a pointer from the PDB (via an alternate stream) then uses the registered source server(s) to attempt to get the source via that pointer. Since a given assembly is typically going to include multiple source files, there's either a single pointer that includes a base location or there are multiple pointers in the PDB (or something else), but that should be orthogonal to this discussion.

For a project where keeping the source hidden/inaccessible is desirable (most Microsoft products, for instance, including Windows, Office, Visual Studio, etc.), then having the PDB contain pointers is FAR superior to including actual source (even if it were encrypted). Such pointers are meaningless without the necessary network access and permissions, so such an approach means you can ship the PDB to anyone on the planet without worrying about them being able to access your source (worst-case, they get a glimpse into how your source tree is arranged, I would think).

However, there are 2 large sets of projects (and specifically, builds) where this 'hide the source' benefit doesn't exist.

The first is builds that are only used by people who have access to the source anyway. Builds done on your own machine that won't ever leave that machine are a great example: an attacker would need to read files from your filesystem anyway to get the source, so reading from one file (.cs) vs. another (.pdb) is a relatively small difference in terms of attack difficulty/vector. The same applies to builds that are pushed to a test/staging environment where the people who can access the pdb on the machine are the same as, or a subset of, the people who can access the source 'normally'.

The second are (somewhat obviously) open-source projects, where the source for the project is already open for everyone anyway, so there's no benefit to hiding the source from anyone.

Note that this could be relatively easily extended to include the source in an encrypted form instead (since we're already talking about having to store format/encoding data as well), but the added complexity of that would make such a scenario likely less useful than just using a 'normal' source server.

Benefits?

With the above descriptions out of the way, the list of potential benefits of allowing this includes (but is not limited to :) these that pop into my head at the moment:

  • No need to deal with setting up source server support. It Just Works (IJW), at least when/if debuggers knew to look in the pdb.
    • In the meantime, you could still do a 'fixed' source server which was just a dummy that extracted the source and fed it back to the caller. Such a configuration could be the same for everyone (using localhost, for instance), still eliminating the current need to actually configure a source server.
  • No need for the build to include 'source indexing'
    • Since a build reads the source files and writes the pdb files anyway, we're just modifying what's written in the pdb and not taking any build-time perf hit for doing network calls or reading data we don't already have in memory.
    • Until 'native' build support for putting the source in, it could be a simple post-build step, likely implemented at first via a small fork of the Sourcepack project since it already does the work of reading/modifying PDB files :)
  • No dependency on the team/project having a source control system
  • No dependency on the particular version of each file being checked into the source control system (most people don't check in for every single build they do in their IDE)
  • No need to have access to the particular source control system that has the file
    • in the DVCS case, for instance, the PDB pointer may be to some 'random' instance of git or mercurial or whatever, not necessarily one you have access to
    • the source server tooling to track that version back to the source control server instance(s) you do have access to (if it even exists there) doesn't yet exist (AFAIK)
  • No problem if the project dies (gets deleted) or moves
    • for instance, if the project moves from one to another of: self-hosted, sourceforge, github, bitbucket, codeplex, code.google.com, etc.
  • No problem if the machine you're debugging on has no (or insufficient) network access
    • For instance, if you're doing a 'network KVM' into a box for debugging an issue but it either has no network or it can only talk to disconnected networks such that it can't access your source control server.
  • In the extreme case, the ability to recover some of the project source from a build. ;)

NOTE: another approach would be including the source in the actual assembly (for instance, as a resource), but the pdb is a better choice (easy to ship a build without pdb's, no normal runtime perf hit if the source is in the pdb since the assembly is the same code and same size, etc)

How to implement?

On the surface of it, this kind of support doesn't seem like it would be too difficult to add, but I get the feeling this is because I don't really know enough about the mechanics involved instead of it actually being a simple thing to implement. :)

My guess would be something along the lines of:

  1. Add a post-build step that would do something similar to Sourcepack, but instead of changing the pointer, it would replace it with the actual source.
    • Depending on what the source server needs to do, the source might need to get prefixed, or the actual source would be in a different alternate data stream and the 'pointer' gets updated to something like 'source-in-pdb:ads-foo.cs' or whatever. The prefix or pointer could also include how the source file was stored (uncompressed, gzip, bzip2, etc., along with the encoding of the file).
  2. Implement a 'source server' that actually extracts the source from the pdb in question and returns it back.
    • No idea if the source server 'API' has enough info to get the location of the PDB, let alone whether it would have permission to actually read the contents.
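To make the two steps above a bit more concrete, here is a minimal sketch in Python of the packing and extraction logic only. It does not touch a real PDB: the JSON blob layout, the function names `pack_sources` and `unpack_source`, and the choice of SHA-256 and gzip are all assumptions made for illustration. The point is just the metadata the question lists (path, checksum, storage format) travelling together with the source bytes.

```python
import base64
import gzip
import hashlib
import json


def pack_sources(sources, compress=True):
    """Build a blob holding each source file plus its metadata.

    `sources` maps the path recorded at build time to the file's raw bytes.
    The layout is hypothetical; it is NOT a real PDB stream format.
    """
    entries = []
    for path, raw in sorted(sources.items()):
        payload = gzip.compress(raw) if compress else raw
        entries.append({
            "path": path,
            # Checksum of the original (uncompressed) bytes.
            "sha256": hashlib.sha256(raw).hexdigest(),
            "storage": "gzip" if compress else "raw",
            "data": base64.b64encode(payload).decode("ascii"),
        })
    return json.dumps({"version": 1, "files": entries}).encode("utf-8")


def unpack_source(blob, path):
    """Return the original bytes for `path`, verifying the stored checksum."""
    for entry in json.loads(blob.decode("utf-8"))["files"]:
        if entry["path"] != path:
            continue
        payload = base64.b64decode(entry["data"])
        raw = gzip.decompress(payload) if entry["storage"] == "gzip" else payload
        if hashlib.sha256(raw).hexdigest() != entry["sha256"]:
            raise ValueError("checksum mismatch for " + path)
        return raw
    raise KeyError(path)
```

Storing raw bytes rather than decoded text sidesteps the encoding question for a first pass, since the extracted file is byte-for-byte what the compiler saw. Whether such a blob could live in a PDB stream, and whether a debugger could be taught to read it, is exactly the open question here.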

Sanity check?

With the babble above out of the way, the questions are really:

  • Does this kind of thing already exist? (and if so, please provide pointers!)
  • Assuming it doesn't exist yet, does the above make sense as a first-pass implementation? Are there pitfalls or complexities the above skips over?
  • Assuming "no" and "yes" for the above, is there an existing project that makes sense in terms of taking this on (it's close or in their existing scope)?
Milson answered 5/9, 2011 at 18:21 Comment(8)
What practical or, let's say, real-world benefits will it bring you? - Wolfsbane
Tools + Options, Debugging, General. Press F1 and follow the doc leads for "Enable source server support". - Bang
@Wolfsbane - the same benefits as 'source server' support in the first place, without the dependency/complexity involved. For instance, if the source is in the pdb, it doesn't even need to be checked into a source control system. If you're willing to share source with the set of people that can get access to your pdb files, it seems much simpler than the current source server support. If you're not, you could store it encrypted, but for such a case, I'd probably just use a normal source server. - Milson
@Hans - my knee-jerk reaction is that you're trolling. :) If not and it's not clear how this would be different/'better' than source server, please let me know and I'll try to clarify. Thanks! - Milson
@Tigran/@Hans - I just edited it, adding 2 sections to try and clarify 1) how it differs from 'normal' source server support and 2) what the benefits would be. Thanks! - Milson
Maybe simply putting the pdb file in a shared folder would be enough? - Ictus
@Ictus - not sure I understand how putting the pdb into a shared folder would change things? IOW, there are 2 'steps' in the process: #1 is loading the pdb for a given assembly/module. #2 (for source debugging) is then loading the source once you have the pdb. This proposal doesn't address (or care about) #1, it's just talking about an alternative for #2, so the pdb location should be irrelevant, since in this scenario the location is already known and it's already loadable. Am I misinterpreting your suggestion? - Milson
Nope :/ my point isn't interesting at all ^^ - Ictus

I've read over this and wanted to summarize my understanding for clarity

Today the debugger uses the PDB to get the disk path and checksum of the file that was compiled to create a given section of an executable. The debugger then attempts to load the file using both the local disk and available symbol server. Under this proposal we would skip the middleman by just embedding the file itself into the PDB. Eureka, no more searching for source!

As someone who's done their fair share of digging for source code in this manner I like the idea of having one package for all your debugging needs. There are a couple of facets to consider about this proposal though.

The first is the actual embedding of the source code into the PDB. This is very doable. The PDB is essentially a lightweight file database. There is structure to what it encodes but AFAIK you can put whatever you want into certain slots (local variable values / types for example). There may be size limitations for certain slots but I'm sure you could invent an encoding scheme to break large files up into chunks.

The second facet is having the debugger actually load the file from the PDB vs. searching for it on disk. I'm not as familiar with that part of the debugger but from what I understand it only uses 2 pieces of information to locate the file

  1. The path to the file on disk
  2. The checksum of said file (used to disambiguate files with the same name)

I'm fairly certain this is the only information it passes onto a symbol server. This makes it unfeasible to implement a symbol server because it won't have access to the PDB (assuming of course I'm right).
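As an illustration of the path + checksum lookup described above, here is a small Python sketch. It is not how any particular debugger is implemented: the function names, the search order and the use of SHA-256 are assumptions for the example. It only shows why the checksum matters, since a file with the right name but different contents is rejected.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256"):
    """Hash a file's bytes in chunks so large files are not read at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_matching_source(recorded_path, recorded_checksum, search_dirs,
                         algorithm="sha256"):
    """Return the first candidate file whose checksum matches, or None.

    Tries the path recorded at build time first, then the same file name
    inside each directory in `search_dirs` (a hypothetical search list).
    """
    # Accept both Windows and POSIX separators in the recorded path.
    name = recorded_path.replace("\\", "/").rsplit("/", 1)[-1]
    candidates = [recorded_path] + [os.path.join(d, name) for d in search_dirs]
    for candidate in candidates:
        if not os.path.isfile(candidate):
            continue
        if file_checksum(candidate, algorithm) == recorded_checksum:
            return candidate
    return None
```

Note that nothing in this lookup carries the PDB's own location, which is the limitation the paragraph above points at.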

I dug around hoping there was a VS COM component you could override which would allow you to intercept the loading of the file for a given path but I couldn't find one.

One approach that I think would be feasible, though, is:

  1. Embed the source in the PDB
  2. Have a tool which can both extract the source to a known location and rewrite the PDB to point to that place.

This wouldn't be quite what you want though.
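A rough Python sketch of the "extract to a known location" half of that tool follows. The function name `extract_and_remap`, the directory layout and the use of SHA-256 are all assumptions for illustration, and the input is simply a dict of path to bytes rather than anything read from a real PDB. The returned mapping is what a Sourcepack-style rewriting step would then apply to the paths stored in the PDB.

```python
import hashlib
import os


def extract_and_remap(embedded, target_root):
    """Write each embedded file under target_root; return {old_path: new_path}.

    `embedded` maps the build-time path to the file's bytes. Each file goes
    into a directory named after its content hash, so same-named files from
    different source folders do not overwrite each other.
    """
    mapping = {}
    for old_path, raw in embedded.items():
        # Accept both Windows and POSIX separators in the recorded path.
        name = old_path.replace("\\", "/").rsplit("/", 1)[-1]
        digest = hashlib.sha256(raw).hexdigest()
        new_dir = os.path.join(target_root, digest[:16])
        os.makedirs(new_dir, exist_ok=True)
        new_path = os.path.join(new_dir, name)
        with open(new_path, "wb") as handle:
            handle.write(raw)
        mapping[old_path] = new_path
    return mapping
```

Because the extracted bytes are identical to what was compiled, the debugger's checksum comparison should still pass after the paths are rewritten.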

Rossen answered 7/9, 2011 at 20:48 Comment(8)
I may be misparsing things (I'm likely at least somewhat confusing symbol server and source server in places), but I guess I'm confused as to why the path + checksum would be insufficient? If I write the source files into the pdb, all I need to do is also include sufficient metadata (checksum, filepath, encoding, etc). Also, it seems like there must be more information than that written (or at least that can be written) since source server seems to work with a full server path and version number. The 'step 2' was a source server locally (in-proc as extension) that could read from pdb. - Milson
BTW, it didn't occur to me at the time, but if the pdb format allows (or doesn't have a problem with) writing files to alternate data streams, and if the APIs used by the debugger are ones that support ADS, then it seems like one approach that might Just Work would be rewriting the file paths in the pdb to point to the pdb itself with the necessary ADS identifier. IOW, it starts out as foo/bar/baz/SomeClass.cs in the pdb, but the rewrite makes it path/on/disk/SomeAssembly.pdb:foo/bar/baz/SomeClass.cs - no idea if that would work, though. - Milson
In terms of the 'extract source, rewrite PDB', if the extensibility API exists such that an extension could get notified when a particular assembly's symbols are about to get loaded, it could at least do this on a 'just in time' / as-needed basis. That would even potentially be better since it could even notice if the original source file has the same checksum as what would get extracted and just leave it alone (potentially allowing things like Edit-and-Continue to work, which I'm guessing the 'extracted to temp' source would not). - Milson
Your summary is exactly right, although the 'using both the local disk and available symbol server' is confusing to me, as I would have expected it to be 'available source server' instead. My mental model (which may be wildly inaccurate) is that the loading of the symbols (\\symbols\symbols!) is one thing but then loading the source is completely decoupled from where/how the pdb was loaded. Aside from that confusion on my part, it's dead on. :) - Milson
@James the idea of multiple streams may work. I can't think of a reason it wouldn't but I'm not very knowledgeable about streams either. - Rossen
@James the other idea about intercepting PDB load. I'm not sure if that's doable. There are several post-load events you could get ahold of but none pre-load (AFAIK). - Rossen
Attributed ATL is supposed to embed debuggable source in PDBs. Source: answer to blog post comment - Centric
I don't think this should have been marked as an answer, it's just a suggestion for a theoretical approach, relying on some unspecified "VS COM component" or "tool", without providing any actual information. Better suited as a comment. - Af
