Extensive documentation on how to write a lexer for Pygments? [closed]

Asked 7/2, 2013 at 16:13 Answered 26/3, 2013 at 3:38

I have a dictionary of Stata keywords and reasonable knowledge of Stata syntax. I would like to devote a few hours to turn it into a Stata lexer for Pygments.

However, I cannot find enough documentation about the syntax of lexers and find myself unable to start coding the lexer. Could someone point out a good tutorial for writing new lexers for Pygments?

I know about the Pygments API and the lexer development page, but honestly, these are not enough for someone like me with very limited knowledge of Python.

My strategy so far has been to look for examples. I have found quite a few, e.g. Puppet, Sass, Scala, Ada. They helped only that much. Any help with how to get started from my Stata keywords would be welcome.

Fowle answered 7/2, 2013 at 16:13 Comment(5)

Not the answer you seek, but I'm always surprised at the emphasis on keywords in Stata syntax highlighting. Highlighting's main benefit I've found to be error flagging, but without an absolutely comprehensive word list, and allowance for command abbreviations, a pain in this context, there may be too many misclassifications. – Bangup 19/3, 2013 at 13:32

I agree that the emphasis on keywords is crucial here. There are two Stata syntax bundles for the TextMate editor on Mac OS X, and they have different keyword lists. Despite the limitations, I think something decent could be implemented into Pygments, but I lack the proper knowledge of lexers to start writing one. – Fowle 20/3, 2013 at 3:12

Fr.: I guess I was unclear. I think a list of keywords is -- for Stata -- the least needed detail for syntax highlighting. To put the point another way, it was pleasant to find some years ago that merely pretending that Stata code is C code got helpful syntax highlighting in various text editors. No list of keywords was needed and keywords often don't help, e.g. when a legal command name is in fact used as a variable name. – Bangup 20/3, 2013 at 13:48

Did you ever finish this lexer? I'd be interested. – Verrucose 29/1, 2014 at 20:0

Sorry, I did not (and switched almost exclusively to R in 2013). – Fowle 30/1, 2014 at 18:3

If you just wanted to highlight the keywords, you'd start with this (replacing the keywords with your own list of Stata keywords):

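import re

from pygments.lexer import RegexLexer
from pygments.token import Keyword, Text

class StataLexer(RegexLexer):

    name = 'Stata'
    aliases = ['stata']
    filenames = ['*.stata']  # a list of glob patterns, not a bare string
    flags = re.MULTILINE | re.DOTALL

    tokens = {
        'root': [
            # Placeholder keyword list: replace it with your Stata keywords.
            # The leading \b stops matches inside longer words (e.g. "redo").
            (r'\b(abstract|case|catch|class|do|else|extends|false|final|'
             r'finally|for|forSome|if|implicit|import|lazy|match|new|null|'
             r'object|override|package|private|protected|requires|return|'
             r'sealed|super|this|throw|trait|try|true|type|while|with|'
             r'yield)\b', Keyword),
            # Pass everything else through as plain text instead of Error tokens.
            (r'.', Text),
        ],
    }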

I think your problem is not that you don't know any Python, but that you don't have much experience with writing a lexer or understanding how a lexer works? Because this implementation is fairly straightforward.

Then, if you want to add more stuff, add an extra element to the root list: a two-element tuple whose first element is a regular expression and whose second element is the token type (the syntactic class, such as Keyword) assigned to whatever that expression matches.
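To make the "add an extra tuple" step concrete, here is a minimal sketch of a lexer with a few more rules. The class name, alias, file pattern, the four keywords, and the comment and string rules are all assumptions chosen for illustration, not a complete or verified description of Stata syntax.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, Number, Operator, String, Text


class StataSketchLexer(RegexLexer):
    """Illustrative only: the rules below are assumptions, not full Stata syntax."""

    name = 'StataSketch'           # hypothetical name
    aliases = ['stata-sketch']     # hypothetical alias
    filenames = ['*.do']           # assumed extension for Stata do-files

    tokens = {
        'root': [
            # Each rule is a (regex, token type) tuple; rules are tried in order.
            (r'//.*?$', Comment.Single),
            (r'"[^"\n]*"', String),
            (r'\b(generate|replace|regress|summarize)\b', Keyword),
            (r'[A-Za-z_]\w*', Name),
            (r'\d+(\.\d+)?', Number),
            (r'[-+*/=<>!&|]+', Operator),
            (r'\s+', Text),
            # Fallback so unexpected characters are not reported as errors.
            (r'.', Text),
        ],
    }


def lex(code):
    """Return (token type, text) pairs, skipping whitespace-only tokens."""
    return [(tok, val) for tok, val in StataSketchLexer().get_tokens(code)
            if val.strip()]


for tok, val in lex('generate x = 2 // note'):
    print(tok, repr(val))
```

Order matters here: the comment rule sits above the operator rule so that `//` is read as a comment start, and the keyword rule sits above the general name rule so that keywords are not swallowed as ordinary identifiers.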

Butte answered 19/3, 2013 at 13:2 Comment(3)

You are right on all counts: I have reasonable knowledge of regular expressions but limited knowledge of Python and no knowledge of lexers (e.g. what is a tuple and how does a syntactic class work). I have tried reading through a few other lexers to understand what a Stata one might look like, but that did not work out well. I am still looking for a reasonably well documented tutorial. – Fowle 20/3, 2013 at 3:17

Did I not give you a reasonable starting point? You could easily look up the meaning of a tuple in a Python tutorial. A syntactic class is the meaning of some piece of code in source code. I.e. "keyword" is a syntactic class, "operator" might be another one, so is "expression". This corresponds to the Keyword class referenced in my bit of source code. I think your desire for a pygments-lexer-writing tutorial without wanting to learn a little bit of how Python or lexers work is a bit unrealistic. – Butte 20/3, 2013 at 10:10

You are correct (again), it is a somewhat over-ambitious attempt. And you do give a few clues here. The only part where you are wrong is where you assume that I don't want to learn: I am willing to, but I need better documentation than the one I have found so far. Please allow me to report back if I manage to do anything from your starting point, and thank you for your help. – Fowle 20/3, 2013 at 19:56

I attempted to write a pygments lexer (for BibTeX, which has a simple syntax) recently and agree with your assessment that the resources out there aren't very helpful for people unfamiliar with Python or general code parsing concepts.

What I found to be most helpful was the collection of lexers included with Pygments.

There is a file _mapping.py that lists all of the recognized language formats and links to the lexer object for each one. To construct my lexer, I tried to think of languages that had similar constructs to the ones I was handling and checked if I could tease out something useful. Some of the built-in lexers are more complex than I wanted, but others were helpful.
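As a rough sketch of that lookup, the mapping can also be read from Python rather than by opening the file. This assumes an installed Pygments where `pygments.lexers._mapping` exposes a `LEXERS` dictionary of five-element tuples (module, name, aliases, filename patterns, MIME types); it is a private module, so the layout may differ between versions. The helper name `find_lexer_modules` is made up for this example.

```python
from pygments.lexers._mapping import LEXERS


def find_lexer_modules(alias):
    """Return (class name, module name) pairs whose aliases include `alias`."""
    matches = []
    for class_name, entry in LEXERS.items():
        module_name, _name, aliases, _filenames, _mimetypes = entry
        if alias in aliases:
            matches.append((class_name, module_name))
    return matches


# Shows which module to open to read the source of the Python lexer.
print(find_lexer_modules('python'))
```

The module name that comes back tells you which file under `pygments/lexers/` holds the lexer class, which is the one to read for a language with constructs similar to yours.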

Gaekwar answered 26/3, 2013 at 3:38 Comment(2)

Thanks. I have been giving that approach a try this week, and wrote a few things out of it. I'm also looking deeper into a Stata syntax parser written for the TextMate editor for Mac OS X, which is helping too. – Fowle 26/3, 2013 at 14:58

Thanks! This explanation of where the lexing is redirected for each language is phenomenally helpful and saved me several hours of reverse engineering. – Gaynell 28/5, 2013 at 20:55
