Is there a command-line tool to extract a typedef, structure, enumeration, variable, or function from a C or C++ file?

Asked 27/6, 2012 at 2:20 Answered 27/6, 2012 at 4:26

Solved c refactoring automation code-generation program-transformation

I am looking for a command-line tool to extract a definition or declaration (typedef, structure, enumeration, variable, or function) from a C or C++ source file. A way to replace an existing definition/declaration would also be handy (after transforming the extracted definition with a user-submitted script). Is there such a generic tool available, or is there some reasonably close approximation of such a tool?

Scriptability and the ability to hook into user-created scripts or programs are important here, although I am academically curious about GUI programs too. Open source solutions for the Unix/Linux camp are preferred (although I am curious about Windows and OS X tools too). My primary language interests are C and C++, but a more generic solution would be even better (I think we do not need super accurate parsing capabilities for finding, extracting and replacing a definition in a program source file).

Sample Use Cases (extra - for the curious mind):

Given deeply nested structs and variable (array) initializations of these types, suppose there is a need to change a struct definition by adding or reordering fields or rewriting the variable/array definitions in more readable format without introducing errors resulting from manual labor. This would work by extracting the old initializations, then using a script/program to write the new initializations to replace the old ones.
For implementing a code browsing tool - extract a definition.
Decorative code generation (e.g. logging function entries / returns).
Scripted code structuring (e.g. extract this and that thing and put in different place without change - version control commit comment could document the command to perform this operation to make it evident and verifiable that nothing changed).

Alternative problem: If there is a tool to tell the location of the definition (beginning and end line would suffice - we could even assume all the definitions/declarations we are interested in are on their own lines), then it would be a simple exercise of finger dexterity to write a program to

extract definitions,
replace definitions, or even
extract a definition, run a program specified by command line options (or an editor) to
- receive the desired extracted definitions from stdin (or from a temporary file),
- perform the transformation (editing), and
- output the new definitions to stdout (or save them to the given temporary file)
to be replaced by the executing program.
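
The extract / replace / filter steps listed above can be sketched in a few lines of Python, assuming the begin and end lines of the definition are already known. This is only an illustration: the function names (`extract_lines`, `replace_lines`, `transform_lines`) are hypothetical, and the filter is any external command that reads the old definition on stdin and writes the new one to stdout.

```python
import subprocess


def extract_lines(source: str, begin: int, end: int) -> str:
    """Return lines begin..end (1-based, inclusive) of source."""
    lines = source.splitlines(keepends=True)
    return "".join(lines[begin - 1:end])


def replace_lines(source: str, begin: int, end: int, replacement: str) -> str:
    """Return source with lines begin..end (1-based, inclusive) replaced."""
    lines = source.splitlines(keepends=True)
    if replacement and not replacement.endswith("\n"):
        replacement += "\n"  # keep the following line on its own line
    return "".join(lines[:begin - 1]) + replacement + "".join(lines[end:])


def transform_lines(source: str, begin: int, end: int, command: list[str]) -> str:
    """Pipe lines begin..end through an external filter and splice in its output."""
    old = extract_lines(source, begin, end)
    result = subprocess.run(command, input=old, capture_output=True,
                            text=True, check=True)
    return replace_lines(source, begin, end, result.stdout)
```

For example, `transform_lines(text, 10, 42, ["my-filter"])` would hand lines 10 to 42 to a (hypothetical) `my-filter` program and return the file contents with that range replaced by whatever the program printed.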

So the major, more challenging problem would be finding the begin and end line of the definition.
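
To illustrate what a "not super accurate" end-line finder might look like, here is a rough brace-counting heuristic in Python. It is a sketch under strong assumptions, not a parser: it blanks out comments and string/character literals, then treats the first `;` outside braces as the end of a declaration, or the closing `}` as the end when the opening `{` directly follows a `)` (taken to mean a function body). It knows nothing about the preprocessor, macros, or attributes, and the names `blank_noise` and `find_definition_end` are made up for this example.

```python
import re

# Comments and literals may contain braces or semicolons that must not be counted.
_NOISE = re.compile(
    r"//[^\n]*"                 # line comment
    r"|/\*.*?\*/"               # block comment
    r'|"(?:\\.|[^"\\\n])*"'     # string literal
    r"|'(?:\\.|[^'\\\n])*'",    # character literal
    re.S,
)


def blank_noise(text: str) -> str:
    """Replace comments and literals by spaces, keeping newlines so line numbers match."""
    return _NOISE.sub(lambda m: re.sub(r"[^\n]", " ", m.group()), text)


def find_definition_end(source: str, begin: int) -> int:
    """Return the 1-based line where the definition starting on line `begin` ends."""
    lines = blank_noise(source).splitlines()
    depth = 0
    last = ""               # last non-space character seen outside braces
    function_body = False   # True once a '{' directly follows a ')'
    for lineno in range(begin, len(lines) + 1):
        for ch in lines[lineno - 1]:
            if ch == "{":
                if depth == 0 and last == ")":
                    function_body = True
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and function_body:
                    return lineno
            elif ch == ";" and depth == 0:
                return lineno
            if depth == 0 and not ch.isspace():
                last = ch
    raise ValueError("no end of definition found")
```

A heuristic like this will misjudge code where a macro or attribute ends in `)` just before a `{`, which is one reason the answers below lean on real parsers.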

Note about tags: A more accurate tag than code-generation would be code-transformation, but it does not exist.

Froma answered 27/6, 2012 at 2:20 Comment(17)

You seem to mean C. There is no language called C/C++. – Elul 27/6, 2012 at 2:22

By C/C++ I mean C or C++. I am primarily interested in C (I am actively using and dealing with legacy code), but C++ is of interest too (I would prefer using it for clean-slate projects if given a choice). For the described tasks you do not need a very fine understanding of the language syntax, I think, so the differences between C and C++ do not seem so important. – Froma 27/6, 2012 at 2:25

I would use also tag c++ but maximum number of tags is 5. There are no tags code-extraction or code-transformation... and any way the maximum number of tags was already reached. :-) – Froma 27/6, 2012 at 2:27

Probably automation should be removed -- I'd expect it to refer to assembly line systems, though it looks nearly useless as currently tagged. That'd give you one more.. and tools can probably be replaced with something more specific too. Feel free to create a new tag if none seem appropriate... :) – Magaretmagas 27/6, 2012 at 2:31

Hopefully useful: gcc-melt.org and sparse.wiki.kernel.org/index.php/Main_Page. I don't think either system comes close to letting you re-write source but they may be useful for their understanding / knowledge abilities. – Magaretmagas 27/6, 2012 at 2:35

@sarnold[1/2]: I do not have enough reputation (1500?) to create new tags. I also wanted to have scripting originally but had to remove due to limit. – Froma 27/6, 2012 at 2:38

@sarnold[2/2]: The tag automation is described as follows: Automation is the process of having a computer do a repetitive task or a task that requires great precision or multiple steps, without requiring human intervention. This seems to be what I am craving. I do not trust myself as much as I trust a program to do such a transformation (in my case I wanted to transform an array of a complex data type with about 100 entries - I have already done it programmatically but am dreaming of a more generic solution). – Froma 27/6, 2012 at 2:45

@sarnold: Thanks for the gcc-melt and sparse links. They seem very interesting, although from a quick reading not useful for my use case right now. I will take time to study them in more detail. – Froma 27/6, 2012 at 3:17

If you need both C and C++, ask two questions. The necessary tooling for C++ is non-trivial to come by. – Passant 27/6, 2012 at 4:25

You might also check out some of the GCC intermediary tree debug flags such as -fdump-tree-*. I haven't looked into these much, but they always hinted to me at a nice, although hack-y, alternative to modifying GCC source, perhaps suitable for scripting experiments. – Germanophobe 27/6, 2012 at 4:26

@DeadMG - actually I see three questions: C, C++, and the generic case (maybe using some kind of grammar definition). – Froma 27/6, 2012 at 4:39

@FooF: DMS (my answer) addresses all 3 cases, basically by operating as the generic case. – Grajeda 28/6, 2012 at 4:0

@Ira Baxter: I was leaning towards choosing your answer because of its comprehensiveness and zeal. But after learning more about Clang I am split between your answer and DeadMG's. – Froma 29/6, 2012 at 5:41

@Stephen Niedzielski: The benefit of your approach is that it is tightly coupled with the actual compiler used (which could be a cross-compiler with a partially different compilation flow (#ifdef ... #elif ... #else ... #endif), at least due to architecture differences, data sizes, and predefined macros). However, in a very quick experiment I could not make use of the output files produced by the -fdump-tree-* options. As I recall, the output seemed to have more information about the code than about the data structures, which are what I was interested in for my current use case. – Froma 29/6, 2012 at 5:49

@FooF, I confess it's been a long while since I've looked into this. As a shot in the dark, you might check out this SO thread which looks promising. – Germanophobe 29/6, 2012 at 7:14

The tag you wanted wasn't "code-transformation" but "program-transformation". See also en.wikipedia.org/wiki/Program_transformation. – Grajeda 29/6, 2012 at 9:45

I see DMS as different from Clang on 4 fronts: a) Proven ability to process mixed arbitrary languages [you may not care today with your C/C++ focus, but you'll care tomorrow], b) Source-to-source transformations [Clang transformations have to be written procedurally by crawling the ASTs], c) composability of transformations [after you do one transform, you can do another; Clang's emit source-text-patch-as-xform disables that], d) parallel computing foundations [this matters for processing big programs]. If you can do what you want today with Clang, then it's a perfectly fine tool for today. – Grajeda 29/6, 2012 at 9:48

Our DMS Software Reengineering Toolkit is trying to be the tool you are wishing for. But it is pushing the state of the art and isn't a nirvana style tool. It is good enough to do real, interesting work.

DMS provides general facilities for parsing, analyzing and transforming source code.

It uses explicit grammars to define languages (such as C and C++); the grammars drive parsers that build abstract syntax trees (ASTs). A variety of analysis primitives provide a) facilities ["attribute grammars" ATGs] for collecting information along tree-like information flow paths which match the shape of ASTs nicely, b) construction of symbol use to symbol definition maps ["symbol tables"], c) control and data flow analysis using facts extracted by ATGs, d) range analysis, e) points-to analysis both local and global. These primitive analyzers can be used to compose facts from the AST to draw conclusions about the code represented by the ASTs (e.g., "this statement modifies these variables"). A language front end packages the grammar and the language-specific analyzers together in a reusable bundle. DMS has such language front ends of varying levels of depth and maturity for a wide variety of languages.

[EDIT 6/27: The C and C++ front ends have support for specific dialects of C and C++: ANSI C, C99, GCC3/4 C, MS Visual C, ANSI C++98, ANSI C++11, GCC3/4 C++, MS Visual C++ 2005/2008/2010. If you want accurate analysis of code, you should use the "right" dialect to process your code.]

But "analysis" isn't the point. The purpose of analysis is to drive change. DMS provides additional support to procedurally modify the ASTs, to modify the ASTs by source-to-source rewrite rules written in the surface syntax of the language (both conditioned by some chosen analysis result), or to group sets of procedural and source-to-source rewrites together to make compound, complex rewrites that can carry off massive code changes such as re-architecting, etc. After the ASTs are transformed, they can be used to regenerate ("prettyprint") syntactically correct code in the corresponding front-end language/dialect. [By modifying an AST for one language piecewise until you have an AST for another, you can build translators, but this isn't as easy as this sentence implies].

This all works to a considerable degree, yet is still somewhat stymied by certain language complications. For C and C++, a famous complication is the preprocessor; by editing the program text arbitrarily, preprocessor conditionals can render the source code unparseable by anything resembling standard parsing technology. DMS's C and C++ front ends ameliorate this somewhat and can parse code with well-structured preprocessor directives, including some strange cases that most people would not call structured but that commonly occur:

   #if  cond
        if (abc)  {
   #else
        if (def)  {
   #endif

We are making interesting progress on parsing code with arbitrary placement of preprocessor conditionals. But once you do that, now all of your analyzers suddenly have to take the preprocessor conditionals into account and we're all suddenly on turf the compiler people have not really visited.
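
A small Python sketch may make the difficulty concrete. It counts braces in a fragment shaped like the one above: over the raw text the braces do not balance, because both arms open a block that is closed only once, yet each single configuration balances fine. The `work();` body, the closing brace, and the helper names are invented for the illustration, and `select_arm` only handles one non-nested `#if`/`#else`/`#endif` group.

```python
SNIPPET = """\
#if cond
    if (abc) {
#else
    if (def) {
#endif
        work();
    }
"""


def brace_balance(text: str) -> int:
    """Number of '{' minus number of '}' in text; 0 means balanced."""
    return text.count("{") - text.count("}")


def select_arm(text: str, take_if_arm: bool) -> str:
    """Keep one arm of a non-nested #if/#else/#endif group and drop the directives."""
    kept, keeping = [], True
    for line in text.splitlines(keepends=True):
        directive = line.strip()
        if directive.startswith("#if"):
            keeping = take_if_arm
        elif directive.startswith("#else"):
            keeping = not take_if_arm
        elif directive.startswith("#endif"):
            keeping = True
        elif keeping:
            kept.append(line)
    return "".join(kept)


raw = brace_balance(SNIPPET)                          # both arms counted at once
if_arm = brace_balance(select_arm(SNIPPET, True))     # cond true
else_arm = brace_balance(select_arm(SNIPPET, False))  # cond false
```

Here `raw` is nonzero while `if_arm` and `else_arm` are both zero, so any tool that looks at the unpreprocessed text has to reason about the conditional arms rather than just the characters.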

DMS has been used to make major architectural shifts in large C++ programs, converting from non-CORBA style to CORBA style with an immense amount of code shuffling, to extract code along arbitrary control flow paths to generate SOW-style APIs for existing C code, to insert instrumentation in large C programs to detect pointer errors, etc. [It has been applied to other tasks in many of those other languages].

In our own experience, it is still pretty hard to use. In our opinion, this is in the same sense that democracy is the worst of all systems of government except for all the rest; YMMV. The website has lots of DMS-derived tools and discussions.

It has in fact been used to extract functions (the SOW-exercise is much more general than that) and insert functions (this is a generalized case of instrumentation).

Tools like GCC-XML are shadows of DMS's capabilities. GCC-XML parses, builds symbol tables, and dumps data declarations (not code), but it can't make any code changes. Clang is better; it parses C and C++ to ASTs, can do analyses on the LLVM intermediate representation, and has some kind of mechanism for spitting out to-be-applied-later patches to source text inspired by a desired tree change. I don't know if Clang can carry out massive code transformations, especially those where one transformation's result is transformed again (how do you modify the tree for a delayed text patch?). DMS can do this all day long, and can do it for many languages other than C and C++, and can do it for an arbitrary mixture of the languages it knows.

Until the preprocessor problem with conditionals gets solved, analyzing/transforming C and C++ code will not be easy. We succeed in these tasks on these languages only by sheer willpower and by using the strongest tools we can build. (Java doesn't have these problems, and DMS is correspondingly better at analyzing/transforming it).

At severe risk of hubris, I believe DMS to be the best of tools out there for general purpose analysis and transformation. As its architect, I view it as my long term job to make it ever stronger for this task.

Grajeda answered 27/6, 2012 at 3:42 Comment(1)

Sounds very ambitious, powerful, interesting and useful. Even though I am primarily interested in open source solutions, and do not have money to spend at the moment, I certainly intend to learn more about your products (if nothing else, then out of academic curiosity). Thanks for also commenting on the difficulties and not making everything sound too rosy. I have also experienced how the C preprocessor layer makes some tools working on C code produce bad results (GNU indent comes quickly to mind). – Froma 27/6, 2012 at 4:25

You might consider GCC-XML as a basis for developing tools like what you're talking about. I've used it in combination with pygccxml to do some automated extraction of deeply-nested struct members. It won't make your job a snap, but you'd certainly be better off than you would be otherwise.

I've also heard others mention clang as a basis for writing such tools, but haven't had a chance to look much into it myself.

Challenge answered 27/6, 2012 at 2:49 Comment(2)

I also tried using gccxml as an obvious choice but failed to quickly get any result other than errors in my test. (I have bad legacy code on my hands, the code is actually targeted at different architectures (ARM and MIPS), and most of all gccxml regards the code as C++ (gccxml.org/HTML/FAQ.html question 1).) – Froma 27/6, 2012 at 2:58

@FooF: That bit about "regards the code as C++" is an indication that you do in fact need language-precise parsing, for the specific dialect of the language you are processing. Small differences in interpretation of the meaning of a bit of syntax can lead to major differences in extraction/transformation later in the process. (This problem gets nasty when one encounters preprocessor conditionals whose arms contain code for different compilers.) – Grajeda 29/6, 2012 at 9:40

You could check out Clang. They have non-trivial source code processing libraries.
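
As a rough idea of what that could look like for the "begin and end line" problem, here is a sketch using the libclang Python bindings (`clang.cindex`), assuming they and a matching libclang are installed. The helper names are made up for this example, and which cursors count as definitions is decided by libclang, not by this code.

```python
def list_definitions(root, filename):
    """Yield (kind, name, first_line, last_line) for top-level definitions in filename.

    `root` is expected to behave like the root cursor of a libclang translation unit.
    """
    for cursor in root.get_children():
        location_file = cursor.location.file
        if location_file is None or location_file.name != filename:
            continue  # skip anything pulled in from headers
        if not cursor.is_definition():
            continue  # skip plain declarations such as prototypes
        yield (cursor.kind.name, cursor.spelling,
               cursor.extent.start.line, cursor.extent.end.line)


def parse_and_list(filename, compiler_args=()):
    """Parse filename with libclang and return its top-level definitions."""
    # Imported here so that list_definitions stays usable without libclang.
    from clang.cindex import Index
    unit = Index.create().parse(filename, args=list(compiler_args))
    return list(list_definitions(unit.cursor, filename))
```

Calling `parse_and_list("legacy.c", ["-std=c99", "-DTARGET_ARM"])` (file name and flags are placeholders) would then give line ranges that a script can use to cut out or replace a definition. The extents refer to the source as seen under the given compiler flags, so the preprocessor caveats discussed in the other answer still apply.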

Passant answered 27/6, 2012 at 4:26 Comment(0)
