Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Asked 24/7, 2015 at 13:43 Answered 28/7, 2015 at 3:6

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:

>> find/any "here is a string" "s?r"
== "string"

I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.

What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)

Spies answered 24/7, 2015 at 13:43 Comment(3)

Related to efficiency and PARSE, "What's the fastest/most-efficient way to count lines in Rebol might provide some general insights. – Thermotensile 24/7, 2015 at 14:7

I guess you have already a solution. You can answer your own questions and other users can vote for it too. – Psychro 24/7, 2015 at 14:20

For reference: wildcards are not included in the dictionary definition of FIND for Rebol 3 or Rebol 2, rather in the Core Users Manual. – Pescara 25/7, 2015 at 16:14

Here's a stab at handling the "*" case:

like: funct [
    series [series!]
    search [series!]
][
    rule: copy []
    remove-each s b: parse/all search "*" [empty? s]
    foreach s b [
        append rule reduce ['to s]
    ]
    append rule [to end]
    all [
        parse series rule
        find series first b
    ]
]

used as follows:

>> like "abcde" "b*d"
== "bcde"

Spies answered 25/7, 2015 at 1:23 Comment(0)

I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.

BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ALL". Here's why:

You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?, now there has to be escaping and things get ugly all over again. Back to RegEx. :-/

So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.

And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].

I don't know entirely what the behavior of /ALL would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.

To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:

>> parse "here is a string" [
       some [                ; match rule repeatedly
           to "s"            ; advance to *before* "s"
           pos:              ; save position as potential match
           skip              ; now skip the "s"
           [                 ;     [sub-rule]
               skip          ; ignore any single character (the "?")
               "r"           ; match the "r", and if we do...
               return pos    ; return the position we saved
           |                 ;     | (otherwise)
               none          ; no-op, keep trying to match
           ]
       ]
       fail                  ; have PARSE return NONE
   ]
== "string"

If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.

On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" to end make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.

The rule is certainly longer than "s?r". But it's not that long when comments are taken out:

[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]

(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)

Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)

But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:

 >> parse "here is a string" glob "s?r"
 == "string"

Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:

 s?r-rule: glob ["s" one "r"]

 pos-1: parse "here is a string" s?r-rule
 pos-2: parse "reuse compiled RegEx string" s?r-rule

It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."

Fairly interesting things would be possible. Given that in RegEx:

(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc

Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:

regex [("abc" | "def") "=" (1)]

Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)

Thermotensile answered 25/7, 2015 at 15:10 Comment(0)

I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:

expand-wildcards: use [literal][
    literal: complement charset "*?"

    func [
        {Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
        value [any-string!] "Value to expand"
        /local part
    ][
        collect [
            parse value [
                ; empty search string FAIL
                end (keep [return (none)])
                |

                ; only wildcard return HEAD
                some #"*" end (keep [to end])
                |

                ; everything else...
                some [
                    ; single char matches
                    #"?" (keep 'skip)
                    |

                    ; textual match
                    copy part some literal (keep part)
                    |

                    ; indicates the use of THRU for the next string
                    some #"*"

                    ; but first we're going to match single chars
                    any [#"?" (keep 'skip)]

                    ; it's optional in case there's a "*?*" sequence
                    ; in which case, we're going to ignore the first "*"
                    opt [
                        copy part some literal (
                            keep 'thru keep part
                        )
                    ]
                ]
            ]
        ]
    ]
]

like: func [
    {Finds a value in a series and returns the series at the start of it.}
    series [any-string!] "Series to search"
    value [any-string! block!] "Value to find"
    /local skips result
][
    ; shortens the search a little where the search starts with a regular char
    skips: switch/default first value [
        #[none] #"*" #"?" ['skip]
    ][
        reduce ['skip 'to first value]
    ]

    any [
        block? value
        value: expand-wildcards value
    ]

    parse series [
        some [
            ; we have our match
            result: value

            ; and return it
            return (result)
            |

            ; step through the string until we get a match
            skips
        ]

        ; at the end of the string, no matches
        fail
    ]
]

Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.

I went with PARSE as even though *? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE to effectively implementing such a search.

It might yet as per @HostileFork to consider a dialect instead of strings with wildcards—indeed to the point where Regex is replaced by a compile-to-parse dialect, but is perhaps beyond the scope of the question.

Pescara answered 28/7, 2015 at 3:6 Comment(0)

Recommended topics

Hot tags