What are the default regexps for `git diff --word-diff`?

git diff has an option, --word-diff-regex=<...>, that specifies what counts as a word. There are built-in default values for some languages (as mentioned in man 5 gitattributes), but what are they? The docs don't describe them, and I looked through the git sources without finding them either.

Any ideas?

EDIT: I'm on git 1.9.1, but I'll accept answers for any version.

Stereochrome answered 24/5, 2015 at 21:12 Comment(0)

The sources contain the default word regexes in the userdiff.c file. The PATTERNS and IPATTERN macros take the base word regex as their third parameter, and add "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to make sure all non-whitespace characters that aren't part of a larger word are treated as a word by themselves, and assuming UTF-8, without splitting up multi-byte characters. For example, in:

PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
         "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),

the word regex is "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+".

In this case, the |[\xc0-\xff][\x80-\xbf]+ happens not to have any benefit, as everything covered by [\xc0-\xff][\x80-\xbf]+ is already covered by [a-zA-Z0-9\x80-\xff]+, but it doesn't cause any harm either.
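To get a feel for how such a word regex splits a line, here is a rough sketch that uses grep -oE as a stand-in for Git's own matcher. It is only an approximation under stated assumptions: the sample line is made up, and the UTF-8 alternative and the \x80-\xff range are left out so that the pattern stays portable across grep implementations.

```shell
# Approximate the ASCII part of the tex word regex with grep -oE.
# Assumption: grep -o prints each match on its own line, which roughly
# mirrors how Git cuts a line into words. The sample line is hypothetical.
line='\section{Intro} x=1'

tokens=$(printf '%s\n' "$line" |
  LC_ALL=C grep -oE '\\[a-zA-Z@]+|\\.|[a-zA-Z0-9]+|[^[:space:]]')

printf '%s\n' "$tokens"
```

With this input the tokens come out as \section, {, Intro, }, x, = and 1: the TeX command stays in one piece, while the braces and the equals sign fall through to the appended [^[:space:]] alternative and become one-character words.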

Ozell answered 24/5, 2015 at 21:48 Comment(1)
on git version 2.8.3, $ git diff --word-diff-regex="\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" and git diff --word-diff produce similar, but different results (Sasaki)

A list of predefined diff drivers (they all have predefined word diff regexes) is given in the docs for .gitattributes. It is further stated that

you still need to enable this with the attribute mechanism, via .gitattributes

So to activate the tex pattern shown in hvd's answer for all *.tex files, you could issue the following command in your project root (omit the quotes under Windows):

echo '*.tex diff=tex' >> .gitattributes
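To check that the attribute took effect, something like the following sketch may help. It assumes git is installed and works in a throwaway repository; the file name paper.tex, its contents, and the demo identity are made up for illustration.

```shell
# Hypothetical throwaway repository; paper.tex and its contents are made up.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo '*.tex diff=tex' >> .gitattributes
printf '%s\n' '\section{Intro} some text' > paper.tex
git add .gitattributes paper.tex
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial'

# Change a single word inside the braces.
printf '%s\n' '\section{Introduction} some text' > paper.tex

# Ask Git which diff driver it resolves for the file.
attr=$(git check-attr diff -- paper.tex)
echo "$attr"

# Word diff; with the tex driver active, its built-in word regex is used.
wdiff=$(git diff --word-diff paper.tex)
echo "$wdiff"
```

If the driver is picked up, git check-attr should report tex for paper.tex, and the word diff should mark only Intro/Introduction as changed rather than the whole \section{...} token, which is what the default whitespace-based splitting would do.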
Florencia answered 6/9, 2015 at 19:20 Comment(0)

Note: Regarding those patterns, Git 2.34 (Q4 2021) is clearer and reminds developers that the userdiff patterns should be kept simple and permissive, assuming that the contents they are applied to are always syntactically correct.

See commit b6029b3 (10 Aug 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit e1eb133, 30 Aug 2021)

userdiff: comment on the builtin patterns

Remind developers that they do not need to go overboard to implement patterns to prepare for invalid constructs.
They only have to be sufficiently permissive, assuming that the payload is syntactically correct, and that may allow them to be simpler.

Text stolen mostly from, and further improved by, Johannes Sixt.

So those built-in patterns now have as a comment:

/*
 * Built-in drivers for various languages, sorted by their names
 * (except that the "default" is left at the end).
 *
 * When writing or updating patterns, assume that the contents these
 * patterns are applied to are syntactically correct.  The patterns
 * can be simple without implementing all syntactical corner cases, as
 * long as they are sufficiently permissive.
 */
static struct userdiff_driver builtin_drivers[] = {
Rennold answered 4/9, 2021 at 22:19 Comment(0)
