Trying NOT to match a Japanese word using RegEx negative lookbehind

The target structure looks like the following:

検索結果:100,000件

I am using the following regex pattern:

((?<!検索結果:)(?<!次の)(((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京+|[0-90-9]))(,|,|、)?).+((〇|一|二|三|四|五|六|七|八|九|十|百|千|万|億|兆|京|[0-90-9]).+)件)(?!表示)

As you can see, I want the pattern to match a number (Arabic numerals or Japanese kanji numerals) followed by "件", but not when the number is preceded by "検索結果:" or "次の". However, the exclusion only seems to work for numbers of up to 4 digits, not 6 digits.

In other words,

次の1000件

works (meaning it doesn't match anything), but

次の5,0000件

gives a partial match ("0000件")

I want to know why the exclusion only works for up to 4 digits. Ultimately, I want to find a way to NOT match anything in these cases using this regex. I know this regex is a bit messy. Thanks in advance for your feedback!

Petronia answered 15/1, 2019 at 7:16 Comment(11)
Are you looking for \p{N}+ ? Or the opposite, \P{N}+ ? - Yclept
Hi Jan - could you explain more? - Petronia
I see this is related to Jan's response: #14891629 - Petronia
When you talk about regex, you must always state which language/regex engine you are using. - Beaverette
See regex101.com/r/mDWcBh/1 - Irritated
Sorry - it's in a Python script. Wiktor, I think your regex does the job! I'll test some more and report back. Thanks in advance! - Petronia
"0000" is preceded by "5,", not by "次の", so it's a match. - Fabrianne
Are you sure you want the .+ terms? They mean "match 1 or more of anything". - Fabrianne
@WiktorStribiżew, I checked the regex but it didn't do well with other patterns. Here's the complete list of words that should and should not match: regex101.com/r/f1SybY/2 - Petronia
I see, [0-9] is not forming a word char. Use regex101.com/r/f1SybY/4. Or a bit shorter. Or, for PCRE, even shorter. - Irritated
Bravo, @WiktorStribiżew! Thank you so much! - Petronia

You need to prevent a match from starting right after a digit, or right after a digit plus a separator, so add (?<![0-90-9])(?<![0-90-9][,,、]) right after (?<!次の):

(?<!検索結果:)(?<!次の)(?<![0-90-9])(?<![0-90-9][,,、])(?:[〇一二三四五六七八九十百千万億兆0-90-9]|京+)[,,、]?.+[〇一二三四五六七八九十百千万億兆京0-90-9].+件
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo.
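
To make the effect of the two extra lookbehinds concrete, here is a minimal Python sketch (the asker mentioned a Python script). It assumes the question's pattern can be condensed into character classes without changing its behaviour on these inputs, and that the digit and separator classes cover both ASCII and fullwidth forms. The names `NUM`, `DIGIT`, `SEP`, `ORIGINAL`, and `FIXED` are made up for illustration.

```python
import re

# Assumed character sets: kanji numerals plus ASCII and fullwidth digits/separators.
NUM = "〇一二三四五六七八九十百千万億兆京0-90-9"
DIGIT = "0-90-9"
SEP = ",,、"

# Condensed form of the question's pattern (an approximation, not a verbatim copy).
ORIGINAL = re.compile(rf"(?<!検索結果:)(?<!次の)[{NUM}][{SEP}]?.+[{NUM}].+件(?!表示)")

# Same idea as the answer: also refuse to start right after a digit
# or right after a digit followed by a separator.
FIXED = re.compile(
    rf"(?<!検索結果:)(?<!次の)(?<![{DIGIT}])(?<![{DIGIT}][{SEP}])"
    rf"[{NUM}][{SEP}]?.+[{NUM}].+件"
)

# The pattern needs at least four characters before 件, so once the start
# position right after 次の is rejected, too few characters are left to match.
print(ORIGINAL.search("次の1000件"))             # None

# Here the engine skips "5," and starts at the first 0, which is preceded
# by "5," rather than by 次の, so the lookbehind no longer blocks it.
print(ORIGINAL.search("次の5,0000件").group())   # 0000件

# With the extra lookbehinds, every later start position is rejected too.
print(FIXED.search("次の5,0000件"))              # None
print(FIXED.search("販売実績1,000件").group())   # 1,000件
```

The point is that a negative lookbehind only guards the position where a match starts; the engine is free to retry from any later position, so those positions need their own guards.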

Neotropical answered 16/1, 2019 at 10:2

Here's one problem that I see so far:

販売実績100万件 販売実績100万件 販売実績1,000件 販売実績1,000件 販売実績1,000,000件です 100,000件 5000件

These all match, but the match also captures the irrelevant part between two matching patterns. For instance,

販売実績100万件販売実績100万件

given as a single string, also matches the part in between that is not supposed to match.

https://regex101.com/r/LfDPHE/1
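
The over-capture can be reproduced in Python with a condensed version of the pattern from the answer above. The sketch below also shows one possible way around it, which is an assumption on my part rather than the solution from the linked demos: replace the two `.+` terms with a run of numeral and separator characters only. The names `NUM`, `SEP`, `GREEDY`, and `TIGHT` are hypothetical, and the character sets are assumed to include both ASCII and fullwidth forms.

```python
import re

# Assumed character sets: kanji numerals plus ASCII and fullwidth digits/separators.
NUM = "〇一二三四五六七八九十百千万億兆京0-90-9"
SEP = ",,、"

# Condensed form of the answer's pattern: ".+" may run across unrelated text.
GREEDY = re.compile(
    rf"(?<!検索結果:)(?<!次の)(?<![0-90-9])(?<![0-90-9][{SEP}])"
    rf"[{NUM}][{SEP}]?.+[{NUM}].+件"
)

# Alternative sketch: allow only numerals (optionally separated) before 件,
# and refuse to start in the middle of a number.
TIGHT = re.compile(
    rf"(?<!検索結果:)(?<!次の)(?<![{NUM}])(?<![{NUM}][{SEP}])"
    rf"[{NUM}](?:[{SEP}]?[{NUM}])*件"
)

text = "販売実績100万件販売実績100万件"

# The greedy ".+" stretches from the first amount to the last 件.
print(GREEDY.search(text).group())   # 100万件販売実績100万件

# The tighter pattern yields each amount separately.
print(TIGHT.findall(text))           # ['100万件', '100万件']

print(TIGHT.search("次の5,0000件"))  # None
```

Note that the tighter pattern no longer requires a minimum length, so short amounts such as "100件" match as well; whether that is desired depends on the full list of test strings.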

Petronia answered 7/2, 2019 at 3:0