Trying NOT to match a Japanese word using RegEx negative lookbehind

Asked 15/1, 2019 at 7:16 Answered 7/2, 2019 at 3:0

The target structure looks like the following:

検索結果：１００，０００件

If I use the following regex pattern:

((?<!検索結果：)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-9０-９]))(,|，|、)?).+((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-9０-９]).+)件)(?!表示)

As you can see, I want to exclude any number that is preceded by "検索結果：" or "次の", where the number is written in either Arabic numerals or Japanese kanji (Chinese character) numerals. However, the exclusion somehow works for numbers of up to 4 digits but not for 6 digits.

In other words,

次の１０００件

works (meaning it doesn't match anything), but

次の５，００００件

gives a partial match ("００００件")

I want to know why this only works for up to 4 digits, and ultimately I want to find a way to NOT match anything in these cases using this regex. I know this regex is a bit messy. Thanks in advance for your feedback!
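The behaviour described above can be reproduced with Python's `re` module (the asker mentions in the comments that the pattern runs in a Python script). This is only a sketch: the helper name `first_match` is made up for illustration. It shows that the lookbehinds only reject a match that starts immediately after the prefix, and that the rest of the pattern (number, `.+`, number, `.+`, then 件) needs at least four characters before 件. With １０００ only three characters remain once the first digit is skipped, so nothing matches; with ５，００００ four digits remain after the separator, so the match restarts there.

```python
import re

# Pattern copied verbatim from the question (split over two lines for readability).
PATTERN = re.compile(
    r"((?<!検索結果：)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-9０-９]))(,|，|、)?).+"
    r"((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-9０-９]).+)件)(?!表示)"
)


def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# Only 3 characters are left before 件 once the first digit is skipped: no match.
print(first_match("次の１０００件"))      # None
# 4 digits are left after "５，", which is enough for the rest of the pattern.
print(first_match("次の５，００００件"))  # ００００件
# The same happens with 5 digits and no separator.
print(first_match("次の１００００件"))    # ００００件
```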

Petronia answered 15/1, 2019 at 7:16 Comment(11)

Are you looking for \p{N}+ ? Or the opposite, \P{N}+ ? – Yclept 15/1, 2019 at 7:25

Hi Jan - could you explain more? – Petronia 15/1, 2019 at 7:28

I see this is related to Jan's response: #14891629 – Petronia 15/1, 2019 at 7:31

When you talk about regex, you always must state which language/regex engine you are using. – Beaverette 15/1, 2019 at 7:41

See regex101.com/r/mDWcBh/1 – Irritated 15/1, 2019 at 7:59

Sorry - it's in a python script - Wiktor, I think your work does the job! I'll test some more and report back. Thanks in advance! – Petronia 15/1, 2019 at 9:28

"0000" is preceded by "5,", so it's a match. – Fabrianne 15/1, 2019 at 23:43

Are you sure you want the .+ terms? Which mean "match 1 or more of anything"? – Fabrianne 15/1, 2019 at 23:51

@WiktorStribiżew, I checked the regex but it didn't do well with other patterns. Here's the complete list of words that should and should not match. regex101.com/r/f1SybY/2 – Petronia 16/1, 2019 at 2:54

I see, [０-９] is not forming a word char. Use regex101.com/r/f1SybY/4. Or a bit shorter. Or, for PCRE, even shorter. – Irritated 16/1, 2019 at 8:13

Bravo, @WiktorStribiżew! Thank you so much! – Petronia 16/1, 2019 at 9:59

You need to avoid matching the numbers after a digit or digit + the separator, so you need to add (?<![０-９0-9])(?<![０-９0-9][，,、]) right after (?<!次の):

(?<!検索結果：)(?<!次の)(?<![０-９0-9])(?<![０-９0-9][，,、])(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?.+[〇一二三四五六七八九十百千万億兆京0-9０-９].+件
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo.
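For anyone who wants to check this outside regex101, here is a minimal Python sketch of the pattern above. The helper name `find` and the sample strings are illustrative choices (the strings are taken from elsewhere in this thread), and the pattern is copied as posted, which means the trailing `(?!表示)` from the question is not included.

```python
import re

# Pattern from this answer, split into pieces for readability.
FIXED = re.compile(
    r"(?<!検索結果：)(?<!次の)"                    # original prefix guards
    r"(?<![０-９0-9])(?<![０-９0-9][，,、])"        # added: do not start mid-number
    r"(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?.+"
    r"[〇一二三四五六七八九十百千万億兆京0-9０-９].+件"
)


def find(text):
    """Return the text of the first match, or None if nothing matches."""
    m = FIXED.search(text)
    return m.group(0) if m else None


# Numbers after the excluded prefixes are no longer matched, even partially.
print(find("次の１０００件"))              # None
print(find("次の５，００００件"))          # None
print(find("検索結果：１００，０００件"))  # None

# Numbers without an excluded prefix still match.
print(find("５０００件"))                  # ５０００件
print(find("販売実績1,000件"))            # 1,000件
```

The two extra lookbehinds work because every possible restart position inside the number is preceded either by a digit or by a digit plus a separator.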

Neotropical answered 16/1, 2019 at 10:2 Comment(0)

Here's one problem that I see so far:

販売実績100万件販売実績１００万件販売実績1,000件販売実績１，０００件販売実績1,000,000件です１００，０００件５０００件

These all match, but the regex also captures the irrelevant part in between two matching patterns. For instance,

販売実績100万件販売実績１００万件

as one string is matched as a single span, including the part in the middle that's not supposed to match.

https://regex101.com/r/LfDPHE/1
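The over-capture comes from the two greedy `.+` terms, which one of the comments on the question already queried: `.+` happily runs across 件販売実績 to reach the last 件 in the string. The Python sketch below reproduces this with the pattern from the accepted answer, then tries a hypothetical tightened variant (`TIGHT`, not from this thread) that only allows numeral characters and separators between the first numeral and 件. Note the assumption: unlike the original, `TIGHT` also accepts short counts such as １０件, and whether that is wanted is not stated here.

```python
import re

# Numeral characters used throughout the thread's patterns.
NUM = "〇一二三四五六七八九十百千万億兆京0-9０-９"
# Lookbehind guards taken from the accepted answer.
GUARDS = r"(?<!検索結果：)(?<!次の)(?<![０-９0-9])(?<![０-９0-9][，,、])"

# Pattern from the accepted answer (contains two greedy `.+`).
ACCEPTED = re.compile(
    GUARDS
    + r"(?:[〇一二三四五六七八九十百千万億兆0-9０-９]|京+)[,，、]?.+"
    + r"[〇一二三四五六七八九十百千万億兆京0-9０-９].+件"
)

# Hypothetical variant: numerals with optional separators, then 件.
TIGHT = re.compile(GUARDS + rf"[{NUM}](?:[,，、]?[{NUM}])*件")


def all_matches(pattern, text):
    """Return the text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "販売実績100万件販売実績１００万件"
print(all_matches(ACCEPTED, text))  # ['100万件販売実績１００万件']
print(all_matches(TIGHT, text))     # ['100万件', '１００万件']

# The excluded prefixes are still respected by the tightened variant.
print(all_matches(TIGHT, "次の５，００００件"))          # []
print(all_matches(TIGHT, "検索結果：１００，０００件"))  # []
```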

Petronia answered 7/2, 2019 at 3:0 Comment(0)
