PART 1 of 2 (see part 2 below)
I decided to write a dedicated tool that uses Clang's AST.
Since you're working on Windows, the instructions below are written for Windows.
The Clang libraries (SDK), as far as I found, are very Linux-oriented and difficult to use straight from sources on Windows. That's why I decided to use the binary distribution of Clang to solve your task.
LLVM for Windows can be downloaded from the GitHub releases page; the current release is 11.0.1. To use it on Windows, download LLVM-11.0.1-win64.exe and install it into some folder; in my example I installed it into C:/bin/llvm/.
Visual Studio also ships its own packaged Clang, which can be used too, but it is a bit outdated, so very new C++20 features may not be supported.
Find clang++.exe in your LLVM installation; in my case it is C:/bin/llvm/bin/clang++.exe. This path is set in the c_clang variable at the beginning of my script.
I used Python to write the parsing tool, as it is a well-known and popular scripting language. My script parses the console output of Clang's AST dump. You can install Python by downloading it from here.
The AST can also be parsed and processed at the C++ level using Clang's SDK (an example of an AST visitor implementation is located here), but this SDK can probably be used well only on Linux. That's why I chose to use the binary Windows distribution and to parse its console output. The binary distribution on Linux can also be used with my script.
You may try my script online on a Linux server by clicking the Try it online! link below.
The script can be run with python script.py prog.cpp; this produces the output file prog.cpp.json with the parsed tree of namespaces and classes.
At its base the script uses the command clang++ -cc1 -ast-dump prog.cpp to parse the .cpp file into an AST. You may try running the command manually to see what it outputs; for example, part of the output looks like this:
..................
|-CXXRecordDecl 0x25293912570 <line:10:13, line:13:13> line:10:19 class P definition
| |-DefinitionData pass_in_registers standard_layout trivially_copyable trivial literal
| | |-DefaultConstructor exists trivial needs_implicit
| | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
| | |-MoveConstructor exists simple trivial needs_implicit
| | |-CopyAssignment simple trivial has_const_param needs_implicit implicit_has_const_param
| | |-MoveAssignment exists simple trivial needs_implicit
| | `-Destructor simple irrelevant trivial needs_implicit
| |-CXXRecordDecl 0x25293912690 <col:13, col:19> col:19 implicit class P
| |-FieldDecl 0x25293912738 <line:11:17, col:30> col:30 x 'const char *'
| `-FieldDecl 0x252939127a0 <line:12:17, col:22> col:22 y 'bool'
..............
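To illustrate how such dump lines can be decomposed, here is a minimal sketch that extracts the nesting depth and the node kind from a single line. It reuses the same prefix pattern as my script; the helper name parse_dump_line is hypothetical and exists only for this illustration. It assumes, as in the sample above, that every nesting level is drawn with exactly two characters.

```python
import re

# Tree-drawing prefix (|, `, -, space), then the node kind, then the rest of the line.
LINE_RE = re.compile(r'^([|`\- ]*)(\S+)(?:\s+.*)?$')

def parse_dump_line(line):
    """Return (depth, node_kind) for one line of the -ast-dump text output."""
    m = LINE_RE.fullmatch(line)
    if m is None:
        raise ValueError(f'unrecognized dump line: {line!r}')
    prefix, kind = m.group(1), m.group(2)
    # Assumption: each nesting level occupies exactly two prefix characters.
    return len(prefix) // 2, kind

sample = [
    "|-CXXRecordDecl 0x25293912570 <line:10:13, line:13:13> line:10:19 class P definition",
    "| |-DefinitionData pass_in_registers standard_layout trivially_copyable trivial literal",
    "| | |-DefaultConstructor exists trivial needs_implicit",
    "| `-FieldDecl 0x252939127a0 <line:12:17, col:22> col:22 y 'bool'",
]
for line in sample:
    print(parse_dump_line(line))
```

The depth is what lets the script rebuild the tree: a line whose depth is one greater than the previous line's is a child of it, and a smaller depth closes the deeper levels.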
I parse this output to produce the JSON output file. The JSON file will look like this (part of the file):
.............
{
    "node": "NamespaceDecl",
    "name": "ns2",
    "loc": "line:3:5, line:18:5",
    "tree": [
        {
            "node": "CXXRecordDecl",
            "type": "struct",
            "name": "R",
            "loc": "line:4:9, line:6:9",
            "tree": [
                {
                    "node": "FieldDecl",
                    "type": "bool *",
                    "name": "pb",
                    "loc": "line:5:13, col:20"
                }
            ]
        },
.............
As you can see, the JSON file has the following fields: node gives Clang's name for the node; it can be NamespaceDecl for a namespace, CXXRecordDecl for a struct/class/union, or FieldDecl for the fields (members) of a struct. I hope you can easily find open-source JSON parsers for C++ if you need one, because JSON is one of the simplest formats for storing structured data.
The JSON also has the fields name, with the name of the namespace/class/field; type, with the type of the class or field; loc, which gives the location of the namespace/class/field definition inside the file; and tree, holding a list of child nodes (for a namespace node the children are other namespaces or classes; for a class node the children are fields or nested classes).
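As a rough sketch of how this JSON could be consumed, the snippet below walks the node/name/type/tree fields and collects the fields of every class. The function name collect_fields and the inline sample are made up for this illustration; the sketch assumes that every record carrying fields has a name.

```python
import json

# A trimmed sample in the shape described above (node/name/type/loc/tree).
sample = json.loads('''
[{"node": "NamespaceDecl", "name": "ns2", "loc": "line:3:5, line:18:5", "tree": [
    {"node": "CXXRecordDecl", "type": "struct", "name": "R", "loc": "line:4:9, line:6:9", "tree": [
        {"node": "FieldDecl", "type": "bool *", "name": "pb", "loc": "line:5:13, col:20"}
    ]}
]}]
''')

def collect_fields(nodes, scope=()):
    """Return {qualified class name: [(field type, field name), ...]}."""
    result = {}
    for n in nodes:
        kind = n.get('node')
        children = n.get('tree', [])
        # Namespaces and named records extend the qualified-name scope.
        if kind in ('NamespaceDecl', 'CXXRecordDecl') and n.get('name'):
            inner = scope + (n['name'],)
        else:
            inner = scope
        if kind == 'CXXRecordDecl':
            fields = [(c['type'], c['name']) for c in children
                      if c.get('node') == 'FieldDecl']
            if fields:
                result['::'.join(inner)] = fields
        # Recurse to pick up nested namespaces and nested classes.
        result.update(collect_fields(children, inner))
    return result

print(collect_fields(sample))
```

To process a real result, load prog.cpp.json with json.load and pass the resulting list to the function in place of the sample.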
My program also prints a simplified form to the console: just the list of classes (with fully qualified names including namespaces) plus their fields. For my example input .cpp it prints:
ns1::ns2::R - pb
ns1::ns2::S::P - x y
ns1::ns2::S::Q - r
ns1::ns2::S - i j b
Example input .cpp used:
// Start
namespace ns1 {
    namespace ns2 {
        struct R {
            bool * pb;
        };
        struct S {
            int i, j;
            bool b;
            class P {
                char const * x;
                bool y;
            };
            class Q {
                R r;
            };
        };
    }
}
int main() {
}
I also tested my script on a fairly complex .cpp file with thousands of lines and dozens of classes.
You can use my script in the following way: once your C++ project is ready, run my script on your .cpp files. From the script output you can figure out which classes you have and which fields each class has. Then you can check whether this list of fields matches what your serialization code has; you can write simple macros for doing this check automatically. I think getting the list of fields is the main feature you need. Running my script can be a preprocessing stage before compilation.
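The comparison step could look roughly like the sketch below. It is not part of my script: the function check_serialized is hypothetical, and the list of serialized names is assumed to come from your own serialization code by whatever means is convenient.

```python
def check_serialized(class_name, ast_fields, serialized_fields):
    """Compare the fields found in the AST with those listed in serialization code."""
    missing = [f for f in ast_fields if f not in serialized_fields]
    extra = [f for f in serialized_fields if f not in ast_fields]
    for f in missing:
        print(f'{class_name}: field "{f}" is not serialized')
    for f in extra:
        print(f'{class_name}: serialized name "{f}" is not a field')
    return not missing and not extra

# Fields as printed by the script for ns1::ns2::S; the serialized list is hypothetical.
ok = check_serialized('ns1::ns2::S', ['i', 'j', 'b'], ['i', 'b'])
print(ok)
```

Run as a pre-build step, such a check could fail the build whenever the function returns False for any class.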
If you don't know Python and want to suggest any improvements to my code, tell me and I'll update it!
Try it online!
import subprocess, re, os, sys, json, copy, tempfile, secrets

c_file = ''
c_clang = 'C:/bin/llvm/bin/clang++.exe'

def get_ast(fname, *, enc = 'utf-8', opts = [], preprocessed = False, ignore_clang_errors = True):
    try:
        if not preprocessed:
            fnameo = fname
            r = subprocess.run([c_clang, '-cc1', '-ast-dump'] + opts + [fnameo], capture_output = True)
            assert r.returncode == 0
        else:
            with tempfile.TemporaryDirectory() as td:
                tds = str(td)
                fnameo = tds + '/' + secrets.token_hex(8).upper()
                r = subprocess.run([c_clang, '-E'] + opts + [f'-o', fnameo, fname], capture_output = True)
                assert r.returncode == 0
                r = subprocess.run([c_clang, '-cc1', '-ast-dump', fnameo], capture_output = True)
                assert r.returncode == 0
    except:
        if not ignore_clang_errors:
            #sys.stdout.write(r.stdout.decode(enc)); sys.stdout.flush()
            sys.stderr.write(r.stderr.decode(enc)); sys.stderr.flush()
            raise
        pass
    return r.stdout.decode(enc), fnameo

def proc_file(fpath, fout = None, *, clang_opts = [], preprocessed = False, ignore_clang_errors = True):
    def set_tree(tree, path, **value):
        assert len(path) > 0
        if len(tree) <= path[0][0]:
            tree.extend([{} for i in range(path[0][0] - len(tree) + 1)])
        if 'node' not in tree[path[0][0]]:
            tree[path[0][0]]['node'] = path[0][1]
        if 'tree' not in tree[path[0][0]] and len(path) > 1:
            tree[path[0][0]]['tree'] = []
        if len(path) > 1:
            set_tree(tree[path[0][0]]['tree'], path[1:], **value)
        elif len(path) == 1:
            tree[path[0][0]].update(value)
    def clean_tree(tree):
        if type(tree) is list:
            for i in range(len(tree) - 1, -1, -1):
                if tree[i] == {}:
                    tree[:] = tree[:i] + tree[i+1:]
            for e in tree:
                clean_tree(e)
        elif 'tree' in tree:
            clean_tree(tree['tree'])
    def flat_tree(tree, name = (), fields = ()):
        for e in tree:
            if e['node'] == 'NamespaceDecl':
                if 'tree' in e:
                    flat_tree(e['tree'], name + (e['name'],), ())
            elif e['node'] == 'CXXRecordDecl':
                if 'tree' in e:
                    flat_tree(e['tree'], name + (e['name'],), ())
            elif e['node'] == 'FieldDecl':
                fields = fields + (e['name'],)
                assert 'tree' not in e['node']
            elif 'tree' in e:
                flat_tree(e['tree'], name, ())
        if len(fields) > 0:
            print('::'.join(name), ' - ', ' '.join(fields), sep = '')
    ast, fpath = get_ast(fpath, opts = clang_opts, preprocessed = preprocessed, ignore_clang_errors = ignore_clang_errors)
    fname = os.path.basename(fpath)
    ipath, path, tree = [],(), []
    st = lambda **value: set_tree(tree, path, **value)
    inode, pindent = 0, None
    for line in ast.splitlines():
        debug = (path, line)
        if not line.strip():
            continue
        m = re.fullmatch(r'^([|`\- ]*)(\S+)(?:\s+.*)?$', line)
        assert m, debug
        assert len(m.group(1)) % 2 == 0, debug
        indent = len(m.group(1)) // 2
        node = m.group(2)
        debug = (node,) + debug
        if indent >= len(path) - 1:
            assert indent in [len(path), len(path) - 1], debug
        while len(ipath) <= indent:
            ipath += [-1]
        ipath = ipath[:indent + 1]
        ipath[indent] += 1
        path = path[:indent] + ((ipath[indent], node),)
        line_col, iline = None, None
        m = re.fullmatch(r'^.*\<((?:(?:' + re.escape(fpath) + r'|line|col)\:\d+(?:\:\d+)?(?:\, )?){1,2})\>.*$', line)
        if m: #re.fullmatch(r'^.*\<.*?\>.*$', line) and not 'invalid sloc' in line and '<<<' not in line:
            assert m, debug
            line_col = m.group(1).replace(fpath, 'line')
            if False:
                for e in line_col.split(', '):
                    if 'line' in e:
                        iline = int(e.split(':')[1])
                if 'line' not in line_col:
                    assert iline is not None, debug
                    line_col = f'line:{iline}, ' + line_col
        changed = False
        if node == 'NamespaceDecl':
            m = re.fullmatch(r'^.+?\s+?(\S+)\s*$', line)
            assert m, debug
            st(name = m.group(1))
            changed = True
        elif node == 'CXXRecordDecl' and line.rstrip().endswith(' definition') and ' implicit ' not in line:
            m = re.fullmatch(r'^.+?\s+(union|struct|class)\s+(?:(\S+)\s+)?definition\s*$', line)
            assert m, debug
            st(type = m.group(1), name = m.group(2))
            changed = True
        elif node == 'FieldDecl':
            m = re.fullmatch(r'^.+?\s+(\S+?)\s+\'(.+?)\'\s*$', line)
            assert m, debug
            st(type = m.group(2), name = m.group(1))
            changed = True
        if changed and line_col is not None:
            st(loc = line_col)
    clean_tree(tree)
    if fout is None:
        fout = fpath + '.json'
    assert fout.endswith('.json'), fout
    with open(fout, 'wb') as f:
        f.write(json.dumps(tree, indent = 4).encode('utf-8'))
    flat_tree(tree)

if __name__ == '__main__':
    if c_file:
        proc_file(c_file)
    else:
        assert len(sys.argv) > 1
        proc_file(sys.argv[1])
Input:
// Start
namespace ns1 {
    namespace ns2 {
        struct R {
            bool * pb;
        };
        struct S {
            int i, j;
            bool b;
            class P {
                char const * x;
                bool y;
            };
            class Q {
                R r;
            };
        };
    }
}
int main() {
}
Output:
ns1::ns2::R - pb
ns1::ns2::S::P - x y
ns1::ns2::S::Q - r
ns1::ns2::S - i j b
JSON output:
[
    {
        "node": "TranslationUnitDecl",
        "tree": [
            {
                "node": "NamespaceDecl",
                "name": "ns1",
                "loc": "line:2:1, line:19:1",
                "tree": [
                    {
                        "node": "NamespaceDecl",
                        "name": "ns2",
                        "loc": "line:3:5, line:18:5",
                        "tree": [
                            {
                                "node": "CXXRecordDecl",
                                "type": "struct",
                                "name": "R",
                                "loc": "line:4:9, line:6:9",
                                "tree": [
                                    {
                                        "node": "FieldDecl",
                                        "type": "bool *",
                                        "name": "pb",
                                        "loc": "line:5:13, col:20"
                                    }
                                ]
                            },
                            {
                                "node": "CXXRecordDecl",
                                "type": "struct",
                                "name": "S",
                                "loc": "line:7:9, line:17:9",
                                "tree": [
                                    {
                                        "node": "FieldDecl",
                                        "type": "int",
                                        "name": "i",
                                        "loc": "line:8:13, col:17"
                                    },
                                    {
                                        "node": "FieldDecl",
                                        "type": "int",
                                        "name": "j",
                                        "loc": "col:13, col:20"
                                    },
                                    {
                                        "node": "FieldDecl",
                                        "type": "bool",
                                        "name": "b",
                                        "loc": "line:9:13, col:18"
                                    },
                                    {
                                        "node": "CXXRecordDecl",
                                        "type": "class",
                                        "name": "P",
                                        "loc": "line:10:13, line:13:13",
                                        "tree": [
                                            {
                                                "node": "FieldDecl",
                                                "type": "const char *",
                                                "name": "x",
                                                "loc": "line:11:17, col:30"
                                            },
                                            {
                                                "node": "FieldDecl",
                                                "type": "bool",
                                                "name": "y",
                                                "loc": "line:12:17, col:22"
                                            }
                                        ]
                                    },
                                    {
                                        "node": "CXXRecordDecl",
                                        "type": "class",
                                        "name": "Q",
                                        "loc": "line:14:13, line:16:13",
                                        "tree": [
                                            {
                                                "node": "FieldDecl",
                                                "type": "ns1::ns2::R",
                                                "name": "r",
                                                "loc": "line:15:17, col:19"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
PART 2 of 2
Digging through Clang's sources, I found out that there is a way to dump JSON directly from Clang by specifying -ast-dump=json (read PART 1 above for clarification), so the PART 1 code is not very useful and the PART 2 code is a better solution. The full AST dumping command would be clang++ -cc1 -ast-dump=json prog.cpp.
I wrote a simple Python script to extract basic information from the JSON dump, almost the same as in PART 1. On each line it prints the fully qualified struct/class/union name (including namespaces), then a space, then a |-separated list of fields, where each field is the field type, then ;, then the field name. The first lines of the script should be modified to point to the correct clang++.exe location (read PART 1).
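If another tool needs to read this line format back, a minimal sketch could look like the one below. The helper parse_line is hypothetical, and it assumes that the qualified class name contains no spaces and that types contain neither | nor ;.

```python
def parse_line(line):
    """Split 'Qualified::Name type;name|type;name' into (class_name, [(type, name), ...])."""
    # Assumption: the class name has no spaces, so the first space ends it.
    class_name, _, rest = line.partition(' ')
    fields = []
    for item in rest.split('|'):
        # Types may contain spaces; the field name follows the last ';'.
        ftype, _, fname = item.rpartition(';')
        fields.append((ftype, fname))
    return class_name, fields

print(parse_line('ns1::ns2::S::P const char *;x|bool;y'))
```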
The code below, which collects field names and types for all classes, can easily be implemented in C++ as well if desired. It could even be used at runtime to provide useful meta-information, in your case for checking whether all fields were serialized and in the correct order. This code needs only a JSON parser, which is available for virtually all programming languages.
The next script can be run the same way as the first one, with python script.py prog.cpp.
import subprocess, json, sys

c_file = ''
c_clang = 'C:/bin/llvm/bin/clang++.exe'

r = subprocess.run([c_clang, '-cc1', '-ast-dump=json', c_file or sys.argv[1]], check = False, capture_output = True)
text = r.stdout.decode('utf-8')
data = json.loads(text)

def flat_tree(tree, path = (), fields = ()):
    is_rec = False
    if 'kind' in tree:
        if tree['kind'] == 'NamespaceDecl':
            path = path + (tree['name'],)
        elif tree['kind'] == 'CXXRecordDecl' and 'name' in tree:
            path = path + (tree['name'],)
            is_rec = True
    if 'inner' in tree:
        for e in tree['inner']:
            if e.get('kind', None) == 'FieldDecl':
                assert is_rec
                fields = fields + ((e['name'], e.get('type', {}).get('qualType', '')),)
            else:
                flat_tree(e, path, ())
    if len(fields) > 0:
        print('::'.join(path), '|'.join([f'{e[1]};{e[0]}' for e in fields]))

flat_tree(data)
Output:
ns1::ns2::R bool *;pb
ns1::ns2::S::P const char *;x|bool;y
ns1::ns2::S::Q ns1::ns2::R;r
ns1::ns2::S int;i|int;j|bool;b
For input:
// Start
namespace ns1 {
    namespace ns2 {
        struct R {
            bool * pb;
        };
        struct S {
            int i, j;
            bool b;
            class P {
                char const * x;
                bool y;
            };
            class Q {
                R r;
            };
        };
    }
}
int main() {
}
Partial example of Clang's AST JSON output:
...............
{
  "id":"0x1600853a388",
  "kind":"CXXRecordDecl",
  "loc":{
    "offset":189,
    "line":10,
    "col":19,
    "tokLen":1
  },
  "range":{
    "begin":{
      "offset":183,
      "col":13,
      "tokLen":5
    },
    "end":{
      "offset":264,
      "line":13,
      "col":13,
      "tokLen":1
    }
  },
  "name":"P",
  "tagUsed":"class",
  "completeDefinition":true,
  "definitionData":{
    "canPassInRegisters":true,
    "copyAssign":{
      "hasConstParam":true,
      "implicitHasConstParam":true,
      "needsImplicit":true,
      "trivial":true
    },
    "copyCtor":{
      "hasConstParam":true,
      "implicitHasConstParam":true,
      "needsImplicit":true,
      "simple":true,
      "trivial":true
    },
    "defaultCtor":{
      "exists":true,
      "needsImplicit":true,
      "trivial":true
    },
    "dtor":{
      "irrelevant":true,
      "needsImplicit":true,
      "simple":true,
      "trivial":true
    },
    "isLiteral":true,
    "isStandardLayout":true,
    "isTrivial":true,
    "isTriviallyCopyable":true,
    "moveAssign":{
      "exists":true,
      "needsImplicit":true,
      "simple":true,
      "trivial":true
    },
    "moveCtor":{
      "exists":true,
      "needsImplicit":true,
      "simple":true,
      "trivial":true
    }
  },
  "inner":[
    {
      "id":"0x1600853a4a8",
      "kind":"CXXRecordDecl",
      "loc":{
        "offset":189,
        "line":10,
        "col":19,
        "tokLen":1
      },
      "range":{
        "begin":{
          "offset":183,
          "col":13,
          "tokLen":5
        },
        "end":{
          "offset":189,
          "col":19,
          "tokLen":1
        }
      },
      "isImplicit":true,
      "name":"P",
      "tagUsed":"class"
    },
    {
      "id":"0x1600853a550",
      "kind":"FieldDecl",
      "loc":{
        "offset":223,
        "line":11,
        "col":30,
        "tokLen":1
      },
      "range":{
        "begin":{
          "offset":210,
          "col":17,
          "tokLen":4
        },
        "end":{
          "offset":223,
          "col":30,
          "tokLen":1
        }
      },
      "name":"x",
      "type":{
        "qualType":"const char *"
      }
    },
    {
      "id":"0x1600853a5b8",
      "kind":"FieldDecl",
      "loc":{
        "offset":248,
        "line":12,
        "col":22,
        "tokLen":1
      },
      "range":{
        "begin":{
          "offset":243,
          "col":17,
          "tokLen":4
        },
        "end":{
          "offset":248,
          "col":22,
          "tokLen":1
        }
      },
      "name":"y",
      "type":{
        "qualType":"bool"
      }
    }
  ]
},
...............
[...] __attribute__((packed)) in gcc/clang) then you may check that the sizeof of all members that you serialize is equal in sum to the sizeof of the whole class. If the class is virtual then an extra 8 bytes have to be added (for 64-bit). This is like a fast error check to find missing members in the serialization list. – Firedamp
[...] struct ALL_ATTRS S { ... };, then you do two phases, a regular compile and a testing compile. For the regular compile you just set #define ALL_ATTRS, i.e. you make the attributes empty; for the testing compile that checks all members you set #define ALL_ATTRS __attribute__((packed)). Regarding the AST, right now I'm writing code for doing the AST work; I'll post an answer when I'm ready! – Firedamp