Equivalent pattern to "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" in Lua 5.1

When answering this question, I wrote this code to iterate over the UTF-8 characters (byte sequences) in a string:

local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do 
    print(c) 
end

It works in Lua 5.2, but in Lua 5.1, it reports an error:

malformed pattern (missing ']')

I recall that Lua 5.1 does not support the \xhh escape in string literals, so I changed it to:

local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\127\194-\244][\128-\191]*") do 
    print(c) 
end

But the error stays the same. How can I fix it?

Amaral answered 9/4, 2014 at 7:52 Comment(0)
S
3

See the Lua 5.1 manual on patterns.

A pattern cannot contain embedded zeros. Use %z instead.

Lua 5.2 changed this so that a pattern can contain \0, but that does not apply to 5.1. Add %z to the first set and change the first range to \1-\127.
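
A minimal sketch of that fix applied to the loop from the question. The names utf8_char and chars are only illustrative, and Č is written as its two UTF-8 bytes (\196\140) so the example does not depend on the source file's encoding:

```lua
-- %z stands in for the zero byte; the ASCII range then starts at \1.
local utf8_char = "[%z\1-\127\194-\244][\128-\191]*"

-- "KORYTNAČKA" with Č spelled out as its UTF-8 bytes C4 8C.
local str = "KORYTNA\196\140KA"

local chars = {}
for c in str:gmatch(utf8_char) do
    chars[#chars + 1] = c
end

print(#str)    --> 11 (bytes)
print(#chars)  --> 10 (characters)
```

Each element of chars holds one lead byte followed by any continuation bytes, so chars[8] is the two-byte sequence for Č.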

Supercilious answered 9/4, 2014 at 12:22 Comment(1)
Simply replacing \0 with %z is not enough; "[%z\1-\127\194-\244][\128-\191]*" works, with "%z" as a separate item in the set. – Amaral
O
3

I strongly suspect this happens because of the \0 in the pattern. The string that holds your pattern is effectively null-terminated earlier than it should be, so what the Lua pattern matcher actually parses is [\0. That is clearly a malformed pattern, and it triggers the error you're getting.
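
One way to check this is sketched below, assuming the Lua 5.1 matcher stops reading the pattern at the first zero byte. The Lua string itself keeps all of its bytes; only the match call is expected to fail on 5.1, while 5.2 and later should accept the same pattern:

```lua
-- The pattern from the question, written with decimal escapes.
local pattern = "[\0-\127\194-\244][\128-\191]*"

-- Lua strings are length-counted, so the zero byte does not shorten the string.
print(#pattern)  --> 14

-- pcall catches the "malformed pattern" error instead of aborting the script.
local ok, err = pcall(string.find, "abc", pattern)
print(ok, err)   -- result depends on the Lua version
```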

To test this idea, I made a small change to the pattern:

local str = "KORYTNAČKA"
for c in str:gmatch("[\x0-\x7F\xC2-\xF4][\x80-\xBF]*") do 
    print(c) 
end

That compiled and ran without the error on Lua 5.1.4.

Note: I have not actually checked what the pattern does; I only removed the \0 by adding an x. The output of the modified code might not be what you expect.
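
A rough sketch of why the result differs, assuming Lua 5.1 simply drops the backslash from an escape it does not recognize, so that "\x0" is read as the two characters x0. The pattern is written out in that literal form here, so the snippet behaves the same on any Lua version:

```lua
-- What Lua 5.1 is assumed to see once the unsupported \x escapes lose their backslash.
local effective = "[x0-x7FxC2-xF4][x80-xBF]*"

-- "KORYTNAČKA" with Č spelled out as its UTF-8 bytes C4 8C.
local str = "KORYTNA\196\140KA"

local parts = {}
for c in str:gmatch(effective) do
    parts[#parts + 1] = c
end

-- Both sets reduce to roughly the range "0" through "x", so runs of ASCII
-- letters match as a whole and the two bytes of Č are skipped.
print(#parts)    --> 2
print(parts[1])  --> KORYTNA
print(parts[2])  --> KA
```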

Edit: As a workaround, you might consider replacing \0 with \\0 in your second code example, which keeps the zero byte out of the pattern:

local str = "KORYTNAČKA"
for c in str:gmatch("[\\0-\127\194-\244][\128-\191]*") do 
    print(c) 
end
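
One caveat worth checking with this workaround: Lua patterns escape with %, not with a backslash, so "\\0-\127" reads as a literal backslash followed by the range from the character "0" to \127. A small sketch (the sample string "a b" is only illustrative) suggests that bytes below "0", such as a space, are then no longer matched:

```lua
local pattern = "[\\0-\127\194-\244][\128-\191]*"

local parts = {}
for c in ("a b"):gmatch(pattern) do
    parts[#parts + 1] = c
end

-- The space (byte 32) sorts below "0" (byte 48), so it falls outside the set.
print(#parts)  --> 2
```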


Overheat answered 9/4, 2014 at 8:10 Comment(4)
Yeah, I was going to say that it compiles in Lua 5.1, but the result is different. And I couldn't find support for \xhh in the Lua 5.1 reference manual. – Amaral
@YuHao What's the output in 5.2? I don't have Lua installed. – Overheat
The pattern should match one UTF-8 character; you can test it on the Lua demo. – Amaral
@YuHao Updated answer. I don't really like it, but this is the only workaround I can see here. – Overheat
