Any decent PHP parser written in PHP? [closed]

5

83

I do lots of work manipulating and analyzing PHP code. Normally I just use the Tokenizer to do this. For most applications this is sufficient. But sometimes parsing using a lexer just isn't reliable enough (obviously).
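To illustrate the kind of token-level analysis meant here, below is a minimal sketch built on PHP's built-in `token_get_all`. The helper name `findFunctionNames` is hypothetical and not from the original post. It lists the names of declared functions and shows how quickly a pure token scan needs special cases.

```php
<?php
// Hypothetical helper (not from the original post): collect the names of
// functions declared in a piece of PHP source using only the Tokenizer.
function findFunctionNames(string $source): array
{
    $names = [];
    $expectName = false; // true right after a T_FUNCTION token

    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as "(" or "{" are plain strings.
        if (!is_array($token)) {
            // "function (" is a closure, so there is no name to record.
            $expectName = false;
            continue;
        }

        [$id, $text] = $token;

        if ($id === T_FUNCTION) {
            $expectName = true;
        } elseif ($expectName && $id === T_STRING) {
            $names[] = $text;
            $expectName = false;
        } elseif ($expectName && $id !== T_WHITESPACE) {
            // Anything else (e.g. "&" for by-reference functions) is skipped
            // in this sketch, so such declarations are missed.
            $expectName = false;
        }
    }

    return $names;
}

$code = '<?php function foo() {} class A { public function bar() {} } $f = function () {};';
print_r(findFunctionNames($code));
```

The scan has no notion of structure: an import like `use function strlen;` would be reported as a declaration, and by-reference functions are missed unless more cases are added. That is the sort of unreliability that makes a real parser attractive.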

Thus I am looking for some PHP parser written in PHP. I found hnw/PhpParser and kumatch/stagehand-php-parser. Both are created by an automated conversion of zend_language_parser.y to a .y file with PHP instead of C (and then compiled to a LALR(1) parser). But the result of this automated conversion is practically impossible to work with.

So, is there any decent PHP parser written in PHP? (I need one for PHP 5.2 and one for 5.3. But just one of them would be a good starting point, too.)

Seljuk answered 7/4, 2011 at 19:16 Comment(9)
What's your goal? What are you trying to accomplish here? – Quorum
No idea about decentness, but there would also be PEAR's PHP_Parser (which wasn't in your list), though it also sounds autogenerated. – Keys
@Charles: There are many things I would use this for. Just anything that needs PHP source code in an AST representation ;) – Seljuk
@mario: That one drops lots of info. It really is designed only for the task of extracting some info about the file. So it only keeps things like class statements, method statements or return statements, but ignores everything I'm actually most interested in: The code. – Seljuk
I don't think you'll find any large-scale, robust language parsers coded in PHP. There's just no call for it. – Obstructionist
You should just code one up over the weekend. – Leverage
Over the last week I have written an initial version of a parser myself: github.com/nikic/PHP-Parser. I tested it against my codebase and it worked well. I will work on improving the interfaces, so that it's actually usable. – Seljuk
Have you worked with PHP CodeSniffer at all? It basically punts on full PHP language parsing, but it has a pretty good tokenizer and lets you define "callbacks" from the token stream - which is enough to build checks for many anti-patterns (aka "smells" in CodeSnifferland). – Wayside
I almost recommended your own project to you. – Anglicist
135

Since no complete and stable parser turned up here, I decided to write one myself. Here is the result:

PHP-Parser: A PHP parser written in PHP

The project supports parsing code written for any PHP version between PHP 5.2 and PHP 8.1.

Apart from the parser itself the library provides some related components:

  • Compilation of the AST back to PHP ("pretty printing")
  • Infrastructure for traversing and changing the AST
  • Serialization to and from XML (as well as dumping in a human readable form)
  • Conversion of an AST to JSON and back
  • Resolution of namespaced names (aliases etc.)

For a usage overview see the "Usage of basic components" section of the documentation.

Seljuk answered 4/12, 2011 at 10:33 Comment(2)
This is awesome! Do you have a plan to maintain it? – Kersten
Wow, PHP 7.1 support in early Dec '11! – Gelatinate
11

This isn't going to be a great option for you, as it violates the pure-PHP constraint, but:

A while ago, the php-internals folks decided that they would switch to Lemon as their parsing technology. There's a branch in the PHP svn repo that contains the required changes.

They decided not to continue with this, as they found that their Lemon solution is about 10-15% slower. But, the branch is still there.

There's an older Lemon parser written as a PHP extension. You might be able to work with it. There's also this PEAR package. There's also this other lemon package (via this blog post about PGN).

Of course, even if you get it working, I'm not sure what you'd do with the data, or what the data even looks like.

Another wacky option would be peeking at Quercus, a PHP implementation in Java. They must have written a parser, so it might be worth investigating.

Quorum answered 10/4, 2011 at 19:43 Comment(3)
First of all: +1 for extensive research. The main problem isn't that there is no way to build a parser in PHP. You already mentioned using the Lemon PHP grammar and compiling it. Even easier would probably be to use the "real" yacc/bison grammar (there are compilers for that, too). The problem is rather that it's a huge amount of work to transform the yacc C code for generating opcodes into yacc PHP code for generating an AST. So I was looking whether somebody had already done that work. – Seljuk
@nikic One of the reasons, IMO, that nobody's done this yet is that there is no specification for what PHP really is, and how to parse it. php-internals has previously outright rejected the entire concept. As a result, outside of the PHP source code itself, there is no authoritative source for how to actually get the parsing done. Without that authoritative source to reference, building a correct parser is going to be a real adventure. This unfortunately means that starting with the yacc or Lemon data may be the best option. – Quorum
@nikic, Charles: It was a real adventure for our PHP parser. Approach: propose lexer/grammar, try on thousands of files, get wrong, adjust, try again. It takes about a year of such pounding to get a robust parser for a poorly documented language. At least it did for us. YMMV, but likely not by much. – Obstructionist
7

The metrics tool PHP Depend contains code to generate an AST from PHP source written entirely in PHP. It does make use of PHP's own token_get_all for the tokenization however.

The source code is available on github: https://github.com/manuelpichler/pdepend/tree/master/src/main/php/PHP/Depend

The implementation of the AST for some parts like mathematical expressions was not yet complete last I checked, but according to its author that is the goal.
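To give an idea of what "tokenize with token_get_all, then build an AST in PHP" means for expressions, here is a minimal sketch. It is not PHP Depend's code. The function names (`parseArithmetic`, `parseSum`, `parseProduct`, `parsePrimary`) and the nested-array node shape are assumptions made up for illustration, and it only handles integer literals, `+ - * /` and parentheses.

```php
<?php
// Hypothetical sketch, NOT PHP Depend's implementation: a tiny
// recursive-descent parser on top of the Tokenizer for integer arithmetic.

function parseArithmetic(string $expr): array
{
    // Let PHP's own lexer split the input; drop the open tag and whitespace.
    $tokens = [];
    foreach (token_get_all('<?php ' . $expr . ';') as $token) {
        if (is_array($token) && in_array($token[0], [T_OPEN_TAG, T_WHITESPACE], true)) {
            continue;
        }
        $tokens[] = $token;
    }

    $pos = 0;
    $ast = parseSum($tokens, $pos);
    if ($tokens[$pos] !== ';') {
        throw new InvalidArgumentException('Unexpected trailing input');
    }
    return $ast;
}

// sum := product (("+" | "-") product)*
function parseSum(array $tokens, int &$pos): array
{
    $left = parseProduct($tokens, $pos);
    while (in_array($tokens[$pos], ['+', '-'], true)) {
        $op = $tokens[$pos++];
        $left = ['op' => $op, 'left' => $left, 'right' => parseProduct($tokens, $pos)];
    }
    return $left;
}

// product := primary (("*" | "/") primary)*
function parseProduct(array $tokens, int &$pos): array
{
    $left = parsePrimary($tokens, $pos);
    while (in_array($tokens[$pos], ['*', '/'], true)) {
        $op = $tokens[$pos++];
        $left = ['op' => $op, 'left' => $left, 'right' => parsePrimary($tokens, $pos)];
    }
    return $left;
}

// primary := integer literal | "(" sum ")"
function parsePrimary(array $tokens, int &$pos): array
{
    $token = $tokens[$pos++];
    if (is_array($token) && $token[0] === T_LNUMBER) {
        return ['num' => $token[1]]; // keep the literal text as written
    }
    if ($token === '(') {
        $inner = parseSum($tokens, $pos);
        if ($tokens[$pos++] !== ')') {
            throw new InvalidArgumentException('Expected ")"');
        }
        return $inner;
    }
    throw new InvalidArgumentException('Expected a number or "("');
}

print_r(parseArithmetic('1 + 2 * (3 - 4)'));
```

Operator precedence falls out of the grammar layering (products bind tighter than sums). A real PHP expression grammar has far more operators, associativity rules and operand kinds, which is presumably why this part takes the longest to complete.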

Sayles answered 14/4, 2011 at 15:36 Comment(3)
Has an AST but not for the "mathematical operations" (I presume you mean "expressions")? That's a key part of the language, esp. when you consider that double-quoted "string literals" (with embedded expressions) are really just complex string expressions. – Obstructionist
You got the bounty, because this is the closest answer to the question. But obviously it isn't really usable, because it lacks, well, half of the PHP grammar... – Seljuk
The content of this post is outdated. There has been active development since then, though I don't know how well it supports PHP grammar. – Playmate
3

Well, this isn't in PHP, sorry, but building this kind of machinery is hard, and PHP isn't particularly suited for the task of language processing.

Our PHP Front End provides full PHP 4.x and 5.x (EDIT 9/2016: now handles PHP 7) parsing, automatically builds ASTs with all the details of a full PHP grammar, and can generate compilable source text from the ASTs. This is harder than it might sound when you consider all the screwy details including weird string literals, captured comments, numbers-with-radix, etc.

But ASTs are hardly enough (you've already observed that tokens aren't nearly enough).

The foundation on which it is built, the DMS Software Reengineering Toolkit, provides support for analysis and arbitrary transformations of the ASTs. It will also read large sets of files at once, enabling analysis and transformations across PHP files.

Obstructionist answered 7/4, 2011 at 23:9 Comment(6)
Just as a response to the first sentence: There are already parser generators which can generate a parser from a yacc grammar (e.g. kmyacc). I.e. there is no big difference between building it in PHP and building it in any other language. All you have to do is "just" (irony) replace the C code in the zend_language_parser.y with some PHP code which builds up a node tree. – Seljuk
And concerning the rest: I would really like to have a PHP solution. But if (and this seems very probable) there is nothing like that, I will probably use something else. I have already heard about DMS several times here on SO, I'll have a look into it. – Seljuk
@nikic: All Turing machines (including PHP) can simulate all other Turing machines, so yes, of course it is possible to build it in PHP. But a) there's building just the parser; I think the PHP parser isn't designed to build a tree but rather to feed the PHP p-code generator, and I think you'll find the needs are different, and b) people repeatedly make the mistake of assuming that if they have the AST, everything else is easy; they make this mistake largely because they have no experience with doing complex things with ASTs. I built DMS because this assumption is false. – Obstructionist
a) Yes, the PHP parser isn't designed to build a parse tree, it is designed to build an opcode stream. That's why it's hardly possible to convert the zend language parser to PHP automatically. b) I'm probably one of those making this mistake ;) From the fact that loads of complex manipulations can already be done with the pure token stream, I concluded (in your eyes mistakenly?) that with an AST this would be easier and more stable. – Seljuk
@nikic: The lesson from 50 years of compiler technology is that each program representation makes certain things easy. You can do some program manipulation on just the text. You can do more on tokens. You can do yet more on ASTs. You can do really interesting stuff if you have symbol tables, control and data flow information (graphs), variable aliasing data (points-to analysis). What you find as you try to do sophisticated code generation is that this is all really, really useful stuff. – Obstructionist
@nikic: If it's very urgent and you are out of options, you can try irc.freenode.com and ask in #php, #zftalk or #cakephp; maybe you'll at least get some feedback there. Here it has only been viewed 120 times. Good luck. – Divisionism
0

There is a port of ANTLR to PHP: http://code.google.com/p/antlrphpruntime/w/list

It's abandoned, but I think it should still work.

Uredo answered 14/4, 2011 at 14:56 Comment(0)
