How does one implement wildcards, character classes, negated character classes, etc. in a model for a regular grammar?
Asked Answered
N

1

8

TL;DR:

How does one computationally model a grammar's productions such that an indefinite number of products exist for the same left hand side?


I'm working on a project regarding formal language theory and am trying to write a class for building regular grammar objects which can be passed to a finite state machine. My naïve attempt was to create an API for adding a production for each permitted input. A stripped down version of my attempt is as follows (based on the formal definition of a formal grammar G = (N, Σ, P, S)):

class ContextFreeGrammar:
    def __init__(self, variables, alphabet, production_rules, start_variable):
        self.variables = variables
        self.alphabet = alphabet
        self.production_rules = production_rules
        self.start_variable = start_variable

    def __repr__(self):
        return '{}({}, {}, {}, {})'.format(
            self.__class__.__name__,
            self.variables,
            self.alphabet,
            self.production_rules,
            self.start_variable
        )


class RegularGrammar(ContextFreeGrammar):
    _regular_expression_grammar = None # TODO

    @classmethod
    def from_regular_expression(cls, regular_expression):
        raise NotImplementedError()

I haven't gotten to the point of actually writing the finite state automaton or the pushdown automaton yet.

The grammar for a regular expression is context-free, so I have included my definition in WSN below:

syntax = expression .
expression = term "|" expression .
expression = term .
term = factor repetition term .
term = factor term .
term = .
repetition = "*" .
repetition = "+" .
repetition = "?" .
repetition = "{" nonnegative_integer "," nonnegative_integer "}" .
repetition = "{" nonnegative_integer ",}" .
repetition = "{," nonnegative_integer "}" .
nonnegative_integer = nonzero_arabic_numeral arabic_numerals .
nonnegative_integer = arabic_numeral .
nonzero_arabic_numeral = "1" .
nonzero_arabic_numeral = "2" .
nonzero_arabic_numeral = "3" .
nonzero_arabic_numeral = "4" .
nonzero_arabic_numeral = "5" .
nonzero_arabic_numeral = "6" .
nonzero_arabic_numeral = "7" .
nonzero_arabic_numeral = "8" .
nonzero_arabic_numeral = "9" .
arabic_numeral = nonzero_arabic_numeral .
arabic_numeral = "0" .
arabic_numerals = arabic_numeral .
arabic_numerals = arabic_numeral arabic_numerals .
factor = "(" expression ")" .
factor = character_class .
factor = character .
escaped_character = "\\." .
escaped_character = "\\(" .
escaped_character = "\\)" .
escaped_character = "\\+" .
escaped_character = "\\*" .
escaped_character = "\\?" .
escaped_character = "\\[" .
escaped_character = "\\]" .
escaped_character = "\\\\" .
escaped_character = "\\{" .
escaped_character = "\\}" .
escaped_character = "\\|" .
character -> TODO ;
character_class = TODO .

One can easily see that I am explicitly splitting alternations into separate productions. I am doing this for ease of implementation. But I am stuck on how I should go about doing character classes and such. I was wanting production_rules to be a map from the each left hand side to a set of each of its corresponding right hand sides. But that looks infeasible now.

Namhoi answered 1/12, 2015 at 22:27 Comment(6)
Any particular reason you need character classes to be nonterminals? Trying to turn a character class into a CFG production isn't very practical.Arthromere
If you're referring to the WSN that I provided. I just wanted it to be a variable just to make the WSN easier to read.Namhoi
I think you have your precedence wrong, or at least you are using an uncommon convention. Normally, ab* means "an a followed by any number of bs", not "any number of abs.Falls
Anyway, I fail to see the problem. You know what the alphabet is, so you can enumerate all possible character productions; there will be one production for every character in the alphabet other than the ones you need to escape.Falls
If a . wildcard is used, I know it could be any possible character. But if I assume that I'm working with Unicode, that's a lot of possible characters. Unicode 7.0 contains 112,956 characters. I think for the sake of characters that require multiple code points, I'm going to scrap ranges in character classes. That makes this slightly easier. I think I might subclass set or something to that effect once for normal character classes and once for negated character classes and cast a period to an empty negated character class.Namhoi
@TylerCrompton: If you are looking for practical rather than theoretical solutions, I'd be happy to answer, but that seems like a different question. The main insight for a practical solution is that very few grammars actually distinguish 112,956 different characters; you can usually reduce a grammar's alphabet to a few dozen equivalence classes. (An example of an EC: in your grammar, [1-9].)Falls
S
0

I don't entirely understand your question, but from the comments it seems you are trying to work within a predefined character set which excludes miscellaneous Unicode & ASCII characters.

Here is a method I recently implemented for working with similar constraints:

[RegEx] Character Groups

Here is an example which implements the above definitions:

global rx_Trim_FromAlphaNumeric
rx_Trim_FromAlphaNumeric =                          \
    "[" + rx_AlphaNumeric                  + "]+" + \
    "[" + rx_ValidCharacters_WithLineSpace + "]*"

global rx_StartsWithSymbol
rx_StartsWithSymbol =                                \
    "[^" + rx_AlphaNumeric                  + "]"  + \
    "["  + rx_Symbols                       + "]+" + \
    "["  + rx_LineSpace + rx_Symbols        + "]*" + \
    "["  + rx_AlphaNumeric                  + "]+" + \
    "["  + rx_ValidCharacters_WithLineSpace + "]*"

global rx_StartsWithLetter
rx_StartsWithLetter =                                \
    "^[" + rx_Alphabetic                    + "]+" + \
    "["  + rx_ValidCharacters_WithLineSpace + "]+"

global rx_StartsWithNumber
rx_StartsWithNumber =                                \
    "^[" + rx_Numeric                       + "]+" + \
    "["  + rx_ValidCharacters_WithLineSpace + "]+"

global rx_WordSegments
rx_WordSegments =                  \
    "([" + rx_Symbols    + "]+|" + \
    "["  + rx_Numeric    + "]+|" + \
    "["  + rx_Alphabetic + "]+|" + \
    "["  + rx_LineSpace  + "]+)"

Note: I prefer to escape all symbols since certain characters, such as ^, have contextual escaping requirements. If they are always escaped, it is less likely that an issue will be encountered.

Sour answered 14/3, 2016 at 13:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.