Why are JS sourcemaps typically at token granularity?

Asked 28/8, 2019 at 10:17 Answered 29/2, 2020 at 0:38

Solved javascript source-maps gulp-sourcemaps

JavaScript source maps typically seem to be no finer than token granularity. As an example, identity-map uses token granularity.

I know I've seen other examples, but can't remember where.

Why don't we use AST-node based granularity instead? That is, if our source maps had locations for all and only starts of AST nodes, what would be the downside?

In my understanding, source maps are used for crash stack decoding and for debugging: there will never be an error location or useful breakpoint that isn't at the start of some AST node, right?

Update 1

Some further clarification:

The question pertains to cases where the AST is already known. So "it's more expensive to generate an AST than an array of tokens" wouldn't answer the question.
The practical impact of this question is that if we could coarsen the granularity of source maps while preserving the behavior of debuggers and crash stack decoders, then source maps could be much smaller. The main advantage would be debugger performance: dev tools can take a long time to process large source files, making debugging a pain.
Here is an example of adding source map locations at the token level using the source-map library:

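// Assumes `generator` is a SourceMapGenerator and `generated` maps each
// original token to its counterpart in the output.
for (const token of tokens) {
  generator.addMapping({
    source: "source.js",
    original: token.location(),
    generated: generated.get(token).location(),
  });
}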

And here is an example of adding locations at the AST node level:

// Same assumptions as above, but iterating over AST nodes instead of tokens.
for (const node of nodes) {
  generator.addMapping({
    source: "source.js",
    original: node.location(),
    generated: generated.get(node).location(),
  });
}

Update 2

Q1: Why expect there to be fewer starts of AST Nodes than starts of tokens?

A1: Because if there were more starts of AST Nodes than starts of tokens then there would be an AST Node that starts at a non-token. Which would be quite an accomplishment for the author of the parser! To make this concrete, suppose you have the following JavaScript statement:

const a = function *() { return a + ++ b }

Here are the locations at the starts of tokens:

const a = function *() { return a + ++ b } /*
^     ^   ^        ^^^ ^ ^      ^ ^ ^  ^ ^
*/

Here's roughly where most parsers will say the starts of AST Nodes are.

const a = function *() { return a + ++ b } /*
^     ^   ^              ^      ^   ^  ^
*/

That's a 46% reduction in the number of source-map locations!
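The counts behind that figure can be checked mechanically. The following is a minimal sketch that reads the caret columns off the two diagrams above and checks that every node start is also a token start. The helper name `caretColumns` and the variable names are made up for illustration.

```javascript
// Caret lines copied from the two diagrams above.
const tokenCarets = "^     ^   ^        ^^^ ^ ^      ^ ^ ^  ^ ^";
const nodeCarets = "^     ^   ^              ^      ^   ^  ^";

// Hypothetical helper: return the column of every caret in a marker line.
function caretColumns(line) {
  const columns = [];
  for (let i = 0; i < line.length; i++) {
    if (line[i] === "^") columns.push(i);
  }
  return columns;
}

const tokenStarts = caretColumns(tokenCarets);
const nodeStarts = caretColumns(nodeCarets);

// Every AST node start should also be a token start.
const isSubset = nodeStarts.every((col) => tokenStarts.includes(col));

// Fraction of mapping locations saved by switching to node starts.
const reduction = 1 - nodeStarts.length / tokenStarts.length;

console.log(tokenStarts.length, nodeStarts.length, isSubset);
console.log(`${Math.round(reduction * 100)}% fewer locations`);
```

With the diagrams as posted, this counts 13 token starts and 7 node starts.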

Q2: Why expect AST-Node-granularity source maps to be smaller?

A2: See A1 above.

Q3: What format would you use for referencing AST Nodes?

A3: No format. See the sample code in Update 1 above. I am talking about adding source map locations for the starts of AST Nodes. The process is almost exactly the same as the process for adding source map locations for the starts of tokens, except you are adding fewer locations.
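To put a rough number on the size difference, here is a dependency-free sketch that encodes a single-line identity mapping in the Base64 VLQ format used by a source map's `mappings` field, once with the token-start columns and once with the node-start columns from A1. The helpers `encodeVlq` and `encodeLine` are written for this illustration (they are not the source-map library's API), and the column lists are transcribed by hand from the diagrams in A1, so treat them as assumptions.

```javascript
const BASE64 =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode one signed integer as Base64 VLQ: the sign goes in the lowest bit,
// then 5 data bits per character, with the sixth bit (32) as a continuation flag.
function encodeVlq(value) {
  let vlq = value < 0 ? (-value << 1) | 1 : value << 1;
  let out = "";
  do {
    let digit = vlq & 31;
    vlq >>>= 5;
    if (vlq > 0) digit |= 32;
    out += BASE64[digit];
  } while (vlq > 0);
  return out;
}

// Encode one generated line whose mappings point at the same columns of line 0
// in source 0 (an identity mapping). Each segment holds four deltas:
// generated column, source index, original line, original column.
function encodeLine(columns) {
  let previous = 0;
  return columns
    .map((column) => {
      const delta = column - previous;
      previous = column;
      return [delta, 0, 0, delta].map((n) => encodeVlq(n)).join("");
    })
    .join(",");
}

// Columns transcribed from the diagrams in A1 (assumed, not parser output).
const tokenStarts = [0, 6, 10, 19, 20, 21, 23, 25, 32, 34, 36, 39, 41];
const nodeStarts = [0, 6, 10, 25, 32, 36, 39];

const tokenMappings = encodeLine(tokenStarts);
const nodeMappings = encodeLine(nodeStarts);

console.log(tokenMappings.length, tokenMappings);
console.log(nodeMappings.length, nodeMappings);
```

The absolute numbers are tiny for a one-line example. The point is only that the encoded string shrinks roughly in proportion to the number of locations dropped, while the format itself stays unchanged.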

Q4: How can you assert that all tools dealing with the source map use the same AST representation?

A4: Assume we control the entire pipeline and are using the same parser everywhere.

Fanniefannin answered 28/8, 2019 at 10:17 Comment(3)

How would you expect a sourcemap to look different if it were node-based instead of token-based? What format would you use for referencing AST nodes? And how do you assert that all tools dealing with the source map use the same AST representation? – Illyes 24/2, 2020 at 17:31

"source maps could be much smaller" - what makes you assume that there are less ast nodes than token? – Illyes 24/2, 2020 at 17:34

@bergi I added Update 2 to respond to your questions, since there was some overlap with other questions that are being asked. – Fanniefannin 24/2, 2020 at 17:51

The TypeScript compiler actually only emits sourcemap locations on AST node bounds, with some exceptions to improve compatibility with certain tools that expect mappings for certain positions, so token-based maps actually aren't quite universal. In the example you give, TS's sourcemaps are for positions like so:

const a = function *() { return a + ++ b } /*
^     ^^  ^              ^      ^^  ^  ^^^
*/

These are generally both the start and end of each Identifier AST node, plus only the start of every other node.

The rationale for mapping both start and end positions for an Identifier AST node is pretty simple - when you rename an Identifier, you want a selection range on that renamed identifier to be able to map back to the original identifier, without necessarily relying on heuristics.
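As a rough illustration of that rule (not the TypeScript compiler's actual implementation), the sketch below takes ESTree-style nodes with `type`, `start`, and `end` fields and emits the start of every node plus the end of each Identifier. The node list is hand-written for the example statement, its offsets are assumptions, and the function name `mappingPositions` is hypothetical.

```javascript
// Hand-written, ESTree-style nodes for:
//   const a = function *() { return a + ++ b }
// Offsets are assumptions read off the question's diagram.
const nodes = [
  { type: "VariableDeclaration", start: 0, end: 42 },
  { type: "Identifier", start: 6, end: 7 },
  { type: "FunctionExpression", start: 10, end: 42 },
  { type: "ReturnStatement", start: 25, end: 40 },
  { type: "Identifier", start: 32, end: 33 },
  { type: "UpdateExpression", start: 36, end: 40 },
  { type: "Identifier", start: 39, end: 40 },
];

// Collect the start of every node, plus the end of Identifier nodes so that a
// selection over a renamed identifier can be mapped back as a range.
function mappingPositions(astNodes) {
  const positions = new Set();
  for (const node of astNodes) {
    positions.add(node.start);
    if (node.type === "Identifier") positions.add(node.end);
  }
  return [...positions].sort((a, b) => a - b);
}

console.log(mappingPositions(nodes));
```

The diagram above marks one more position than this sketch produces, presumably one of the compatibility exceptions mentioned.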

Coughlin answered 29/2, 2020 at 0:38 Comment(0)

It is possible to use AST granularity, but to build an AST you usually need to tokenize the code first anyway. For debugging purposes the AST is an unnecessary step, since the syntax analyzer must be fed tokenized data in order to work.

An interesting resource on the topic

I also suggest exploring the acornJS source code to see how it produces an AST.

Nobile answered 24/2, 2020 at 14:36 Comment(4)

Thanks for your post, but I don't think it answers the question. While I entirely agree that tokenization comes before parsing, here's an analogy that explains why this doesn't answer the question: the character stream is needed even before tokenization, but we don't usually generate source maps at character granularity. To rephrase the question: every AST-aware source-map-generating tool that I know of adds source map locations at token boundaries, even though the AST information is available. Why? – Fanniefannin 24/2, 2020 at 15:42

You're right, but token granularity is enough to locate errors. An AST is more verbose than a base64-encoded VLQ token mapping. What do you expect to gain from AST nodes over tokens when it comes to source maps and debugging? Even where the AST is available, you have to fetch it over the network. – Outflank 24/2, 2020 at 16:3

Again, character granularity works as well, but it's rare to use character granularity because it bloats the source maps. Here are the levels of granularity: character < token < AST Node. I want the coarsest granularity that will still work well in a debugger. Note that startsOf(ASTNodes) is a subset of startsOf(tokens). So the base64-encoded source map locations will be smaller when using AST node locations (that's the point). There is no extra network request, I'm talking about generating source maps at build time. – Fanniefannin 24/2, 2020 at 17:12

I updated the question with code snippets to make clearer what I'm asking. Thanks for your time in looking into this. – Fanniefannin 24/2, 2020 at 17:18
