How to exclude headers from AST in clang?
H

5

24

I'm generating an AST using clang. I've got the following file (lambda.cpp) to parse:

#include <iostream>

void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem"; 
}

I'm parsing it with the following command:

clang -Xclang -ast-dump -fsyntax-only lambda.cpp

The problem is that clang also dumps the contents of the headers. As a result, I get quite a big (~3000 lines) dump, most of which is useless to me.

How can I exclude headers when generating the AST?

Hydroponics answered 20/1, 2015 at 14:58 Comment(4)
What would you want clang to do when it needs a name/definition/etc. from the header in order to generate the AST for the source file? - Citify
@MarkB I must have expressed myself badly. I want clang to use the headers during parsing, but to show only the AST of my file, without the AST from the headers. - Hydroponics
Wouldn't header guards do that job? - Mealymouthed
What do you really want to do? I've used the clang Python backend a bit and I know the pains you are going through by looking at the AST. There are a couple of options that hack together what you want. IIRC you can get the line number information from clang and ignore everything before that. - Eustacia
S
19

clang-check might be useful here. It has an option -ast-dump-filter=<string>, documented as follows:

-ast-dump-filter=<string> - Use with -ast-dump or -ast-print to dump/print only AST declaration nodes having a certain substring in a qualified name. Use -ast-list to list all filterable declaration node names.

Consider running clang-check with -ast-dump-filter=my_lambda on the sample code (lambda.cpp):

#include <iostream>

void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem"; 
}

It dumps only the matched declaration node, FunctionDecl my_lambda 'void (void)'.

Here are the command-line arguments and a few lines of the output.

$ clang-check -extra-arg=-std=c++1y -ast-dump -ast-dump-filter=my_lambda lambda.cpp --

FunctionDecl 0x2ddf630 <lambda.cpp:3:1, line:7:1> line:3:6 my_lambda 'void (void)'
`-CompoundStmt 0x2de1558 <line:4:1, line:7:1>
  |-DeclStmt 0x2de0960 <line:5:9, col:57>
Siphonophore answered 29/1, 2015 at 14:50 Comment(3)
clang-check seems neat! - Ence
@alper do you know if there is a way to get all nodes without having to import all dependencies of the class you want to get the AST from? I would like to be able to get the syntax nodes for a class without having to resolve dependencies (pass framework paths) so that I can rewrite the same class in a specific way. Am I on the right path looking at ASTs to achieve this? Any comment is appreciated! - Cully
But clang-check can't output JSON... - Borosilicate
P
4

Filtering on a specific identifier with -ast-dump-filter is fine. But what if you want the AST for all identifiers in one file?

I came up with the following solution:

Add one recognizable line after the includes:

#include <iostream>
int XX_MARKER_XX = 123234; // marker line for ast-dump
void my_lambda()
...

Then dump the ast with

clang-check -extra-arg=-std=c++1y -ast-dump lambda.cpp > ast.txt

You can easily cut away everything before XX_MARKER_XX with sed:

cat ast.txt | sed -n '/XX_MARKER_XX/,$p'  | less

Still a lot, but much more useful with bigger files.
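
For anyone post-processing the dump from Python instead of sed, the same cut can be sketched as below. This is only an illustration of the sed range above; the function name lines_from_marker is made up for this example, and the marker text is the XX_MARKER_XX identifier from the snippet above.

```python
def lines_from_marker(lines, marker="XX_MARKER_XX"):
    """Yield lines starting at the first one that contains `marker`.

    Mirrors `sed -n '/XX_MARKER_XX/,$p'`: nothing is emitted until the
    marker is seen, and everything from that line on is passed through.
    (Hypothetical helper, not part of clang or clang-check.)
    """
    seen = False
    for line in lines:
        if not seen and marker in line:
            seen = True
        if seen:
            yield line
```

It can be fed an open file object for ast.txt, for example sys.stdout.writelines(lines_from_marker(open("ast.txt"))).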

Possessory answered 1/11, 2016 at 20:51 Comment(2)
For my needs I do not initialize XX_MARKER_XX and shorten the command line to clang++ -Xclang -ast-dump minimal.cpp | sed -n '/XX_MARKER_XX/,$p'. - Ondrea
Yeah, but not if you're trying to do that on someone else's sources :q - Scruff
M
3

I'm facing the same problem. My context is that I need to parse the AST in JSON format, and I'd like to get rid of all the headers and unnecessary files. I tried to replicate @textshell's answer (https://mcmap.net/q/551382/-how-to-exclude-headers-from-ast-in-clang) but I noticed that clang behaves differently in my case. The clang version I'm using is:

$ clang --version                                             
Debian clang version 13.0.1-+rc1-1~exp4
Target: x86_64-pc-linux-gnu
Thread model: posix

To explain my case, let's consider the following example:

[image not preserved: excerpt of the AST dump for function_definition_invocation.c]

Both my_function and main are functions from the same source file (function_definition_invocation.c). However, the file is only specified in the FunctionDecl node of my_function. I presume this is because both functions belong to the same file, and clang prints the file location only in the first node that belongs to it.

Once the first occurrence of the main file is found, every subsequent node should be added to the resulting, filtered JSON file. The code I'm using is:

def filter_ast_only_source_file(source_file, json_ast):
    
    new_inner = []
    first_occurrence_of_main_file = False
    for entry in json_ast['inner']:
        if not first_occurrence_of_main_file:
            if entry.get('isImplicit', False):
                continue

            file_name = None
            loc = entry.get('loc', {})
            if 'file' in loc:
                file_name = loc['file']

            if 'expansionLoc' in loc:
                if 'file' in loc['expansionLoc']:
                    file_name = loc['expansionLoc']['file']

            if file_name != source_file:
                continue

            new_inner.append(entry)
            first_occurrence_of_main_file = True
        else:
            new_inner.append(entry)

    json_ast['inner'] = new_inner

And I call it like this:

import json
import subprocess

# -fsyntax-only skips code generation; stdout is bytes (decode it if a string is needed)
generated_ast = subprocess.run(["clang", "-Xclang", "-ast-dump=json", "-fsyntax-only", source_file], capture_output=True)
# Parse the output into a JSON object
json_ast = json.loads(generated_ast.stdout)
filter_ast_only_source_file(source_file, json_ast)

So far it seems to be working.
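
One limitation of keeping everything after the first match is that declarations from a header included further down the file would be kept too. A possible variant, sketched here under the assumption suggested by the observation above (a top-level node without a file entry belongs to the same file as the closest preceding top-level node that had one), remembers the last file seen and keeps only nodes attributed to the source file. The name filter_top_level_by_file is made up for this example. Looking only at top-level loc entries is an approximation, since locations in nested nodes may also affect which file clang printed last.

```python
def filter_top_level_by_file(json_ast, source_file):
    """Return the top-level nodes attributed to `source_file`.

    Assumption (not verified against every clang version): a node whose
    'loc' has no 'file' entry is in the same file as the closest
    preceding top-level node that did have one.
    """
    kept = []
    current_file = None
    for node in json_ast.get("inner", []):
        loc = node.get("loc", {})
        # Prefer the expansion location for nodes that come from a macro.
        file_name = loc.get("expansionLoc", {}).get("file") or loc.get("file")
        if file_name is not None:
            current_file = file_name
        if node.get("isImplicit", False):
            continue  # compiler-generated nodes are skipped, as above
        if current_file == source_file:
            kept.append(node)
    return kept
```

Unlike the function above, this returns a new list instead of modifying json_ast in place; assign the result to json_ast['inner'] to get the same effect.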

Mistook answered 15/2, 2022 at 14:59 Comment(0)
E
1

This is a problem with C++, not with clang: there are no files in C++, there's just the compilation unit. When you #include a file, you include all definitions in said file (recursively) into your compilation unit, and there's no way to differentiate them (it's what the standard expects your compiler to do).

Imagine a different scenario:

/////////////////////////////
// headertmp.h
#if defined(A)
    struct Foo {
        int bar;
    };
#elif defined(B)
    struct Foo {
        short bar;
    };
#endif

/////////////////////////////
// foobar.cpp
#ifndef A
# define B
#endif

#include "headertmp.h"

void foobar(Foo foo) {
    // do stuff to foo.bar
}

Your foobar.cpp declares a struct called Foo and a function called foobar but headertmp.h itself doesn't define any Foo unless A or B are defined. Only in the compilation unit of foobar where the two come together can you make sense of headertmp.h.

If you are interested in a subset of the declarations inside a compilation unit, you will have to extract the necessary information from the generated AST directly (similar to what a linker has to do when linking together different compilation units). Of course you can then filter the AST of this compilation unit on any metadata your parser extracts.

Ence answered 29/1, 2015 at 13:41 Comment(6)
While it is true that during full compilation of a translation unit there should be no perceived difference between code in a header file and code in a non-header file, this question is about limiting AST extraction, and there is no reason why this wouldn't be possible (as seen from the other answer). - Privity
@KyleStrand please read my last paragraph, which expresses exactly the point of your comment: you may limit the extraction / dump to part of the compilation unit, as opposed to excluding specific header files (as there is no such concept in the AST). - Ence
Your answer doesn't just say that header files can't be skipped, though; it says that definitions therein can't be distinguished from definitions directly in the file. This is true for compilation but not necessarily for raw AST extraction. - Privity
@KyleStrand I disagree or don't understand your argument: can you show me in my example how you would extract an AST from the header only? I.e. when would you do preprocessing? - Ence
Do you mean extract it from non-header only? Simple: the preprocessor would need to keep a record of which lines came from which files (which is something modern preprocessors usually do), and the AST-dump tool would only dump definitions from non-header files. - Privity
@KyleStrand I understand now. Thanks for clearing this up for me. - Ence
O
0

The dumped AST has some indication of the source file for every node, so it can be filtered based on the loc data of the second-level AST nodes.

You need to match file in loc and file in expansionLoc in loc against the name of the top-level file. This seems to work decently for me. Some of the nodes don't contain these elements for some reason. Nodes with isImplicit should be safe to skip, but I'm not sure what is going on with the other nodes without file name information.

The following Python script filters 'astdump.json' to 'astdump.filtered.json' using these rules (doing the conversion in a streaming manner is left as an exercise for the reader):

#! /usr/bin/python3

import json
import sys

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' filename')
    sys.exit(1)

filename = sys.argv[1]

with open('astdump.json', 'rb') as input, open('astdump.filtered.json', 'w') as output:
    toplevel = json.load(input)
    new_inner = []
    for o in toplevel['inner']:
        if o.get('isImplicit', False):
            continue

        file_name = None
        loc = o.get('loc', {})
        if 'file' in loc:
            file_name = loc['file']

        if 'expansionLoc' in loc:
            if 'file' in loc['expansionLoc']:
                file_name = loc['expansionLoc']['file']

        if file_name != filename:
            continue

        new_inner.append(o)

    toplevel['inner'] = new_inner
    json.dump(toplevel, output, indent=4)
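
The script expects astdump.json to exist already. One way to produce it, sketched here on the assumption that clang is on PATH, is to run the JSON dump through subprocess. The helper names build_dump_command and write_ast_dump are made up for this example; the flags are the ones used elsewhere on this page.

```python
import subprocess


def build_dump_command(source_file, clang="clang"):
    """Return the argv list for a JSON AST dump of `source_file`."""
    return [clang, "-Xclang", "-ast-dump=json", "-fsyntax-only", source_file]


def write_ast_dump(source_file, out_path="astdump.json"):
    """Run clang and write its JSON AST dump to `out_path`.

    Hypothetical helper: it assumes a clang binary is installed and
    reachable under the name passed to build_dump_command.
    """
    with open(out_path, "wb") as out:
        # check=False: clang may report diagnostics and still emit a dump.
        subprocess.run(build_dump_command(source_file), stdout=out, check=False)
```

The filename passed to the filter script should then be spelled the same way as the path given to clang here, since that is presumably how it appears in the file entries.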
Order answered 12/9, 2021 at 10:15 Comment(0)
