If you do use -P, make sure to use Git 2.40 (Q1 2023): "grep -P" learned to use Unicode Character Property to grok character classes when processing \b and \w etc.
See commit acabd20 (08 Jan 2023) by Carlo Marcelo Arenas Belón (carenas).
(Merged by Junio C Hamano -- gitster -- in commit 557d93a, 27 Jan 2023)
grep: correctly identify utf-8 characters with \{b,w} in -P
Signed-off-by: Carlo Marcelo Arenas Belón
Acked-by: Ævar Arnfjörð Bjarmason
When UTF is enabled for a PCRE match, the corresponding flags are added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
This prevents extending the meaning of the character classes to include those new valid characters and therefore results in failed matches for expressions that rely on that extension, e.g.:
$ git grep -P '\bÆvar'
Add PCRE2_UCP so that \w will include Æ and therefore \b could correctly match the beginning of that word.
This has an impact on performance that has been estimated to be between 20% and 40% and that is shown through the added performance test.
That means those patterns will work, with any character:
'\bhow'
'\bÆvar'
'\d+ \bÆvar'
'\bBelón\b'
'\w{12}\b'
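To make the effect of Unicode-aware \w and \b more concrete, here is a minimal sketch in Python. It is only an analogy: Python's re module is not PCRE2, and its re.ASCII flag is used here as a stand-in for matching without PCRE2_UCP. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text containing the names from the commit message.
text = "42 Ævar and Belón know how"

patterns = [r"\bhow", r"\bÆvar", r"\d+ \bÆvar", r"\bBelón\b"]


def matches(pattern, flags=0):
    """Return True if the pattern is found anywhere in the sample text."""
    return re.search(pattern, text, flags) is not None


# Default str patterns in Python treat \w and \b as Unicode-aware,
# roughly what PCRE2_UCP enables for "git grep -P".
unicode_aware = {p: matches(p) for p in patterns}

# re.ASCII restricts \w (and so \b) to [a-zA-Z0-9_],
# roughly the behaviour without PCRE2_UCP.
ascii_only = {p: matches(p, re.ASCII) for p in patterns}

for p in patterns:
    print(f"{p!r}: unicode={unicode_aware[p]} ascii={ascii_only[p]}")

# \w+ shows the same difference directly: Æ is only a word character
# when the Unicode-aware classes are in effect.
words_unicode = re.findall(r"\w+", "Ævar")
words_ascii = re.findall(r"\w+", "Ævar", re.ASCII)
```

In this sketch, the patterns that need a word boundary directly before Æ fail in ASCII-only mode, because neither the space nor Æ counts as a word character there. '\bBelón\b' matches in both modes, since its boundaries sit next to plain ASCII letters.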
With Git 2.41 (Q2 2023), a recent-ish change to allow Unicode character classes to be used with "grep -P" triggered a JIT bug in older pcre2 libraries.
The problematic change in Git built with these older libraries has been disabled to work around the bug.
See commit 14b9a04 (23 Mar 2023) by Mathias Krause (mathiaskrause).
(Merged by Junio C Hamano -- gitster -- in commit d35cd54, 30 Mar 2023)
grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34
Reported-by: Stephane Odul
Signed-off-by: Mathias Krause
Stephane is reporting a regression introduced in Git v2.40.0 that leads to 'git grep' segfaulting in his CI pipeline.
It turns out, he's using an older version of libpcre2 that triggers a wild pointer dereference in the generated JIT code that was fixed in PCRE2 10.35.
Instead of completely disabling the JIT compiler for the buggy version, just mask out the Unicode property handling as we used to do prior to commit acabd20 ("grep: correctly identify utf-8 characters with \{b,w} in -P", 2023-01-08, Git v2.40.0-rc0 -- merge listed in batch #11).
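The workaround amounts to a version gate: keep UTF handling, but drop the Unicode property flag when the library is older than 10.35. The following Python sketch only illustrates that logic and is not Git's actual C code. UTF_FLAG and UCP_FLAG are hypothetical stand-ins for PCRE2_UTF and PCRE2_UCP, and compile_options() is a made-up helper name.

```python
import re

UTF_FLAG = 0b01  # hypothetical stand-in for PCRE2_UTF
UCP_FLAG = 0b10  # hypothetical stand-in for PCRE2_UCP


def parse_pcre2_version(version):
    """Extract (major, minor) from a string such as '10.34' or '10.35 2020-05-09'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised PCRE2 version: {version!r}")
    return int(m.group(1)), int(m.group(2))


def compile_options(version, utf_enabled=True):
    """Return the option bits to use for a given library version."""
    options = 0
    if utf_enabled:
        options |= UTF_FLAG | UCP_FLAG
    # Libraries up to 10.34 have the JIT bug, so mask out the Unicode
    # property handling there while leaving the JIT and UTF mode alone.
    if parse_pcre2_version(version) < (10, 35):
        options &= ~UCP_FLAG
    return options


old = compile_options("10.34")
new = compile_options("10.35 2020-05-09")
```

With an affected library, \b and \w fall back to the pre-2.40 behaviour, which is why patterns such as '\bÆvar' may stop matching there even on Git 2.41 or later.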
Mac OS requires you escape your backslashes (e.g. double them). – Assurgent
git grep ">" under Lion and get lots of matches. Perhaps there is something wrong with your set-up... – Edmund
\> just searches for closed angle bracket, instead of end of word boundary. I will try apple.stackexchange.com. Thanks for the link. – Lawful