Where does the excerpt in the git diff hunk header come from?

When I use git diff on a C# file, I see something like this:

diff --git a/foo.cs b/foo.cs
index ff61664..dd8a3e3 100644
--- a/foo.cs
+++ b/foo.cs
@@ -15,6 +15,7 @@ static void Main(string[] args)
                    string name = Console.ReadLine();
             }
             Console.WriteLine("Hello {0}!", name);
+            Console.WriteLine("Goodbye");
         }
     }
 }

The hunk header line contains the first line of the current method (static void Main(string[] args)), which is great. However, it doesn't seem to be very reliable: I see many cases where it doesn't work.

So I was wondering, where does this excerpt come from? Does git diff somehow recognize the language syntax? Is there a way to customize it?

Asked by Ge, 23 Jan 2015 at 13:30

Is there a way to customize it?

The configuration is described in the gitattributes documentation, section "Defining a custom hunk-header":

First, in .gitattributes, you would assign the diff attribute for paths.

*.tex diff=tex

Then, you would define a "diff.tex.xfuncname" configuration to specify a regular expression that matches a line that you would want to appear as the hunk header "TEXT". Add a section to your $GIT_DIR/config file (or $HOME/.gitconfig file) like this:

[diff "tex"]
   xfuncname = "^(\\\\(sub)*section\\{.*)$"

Note. A single level of backslashes are eaten by the configuration file parser, so you would need to double the backslashes; the pattern above picks a line that begins with a backslash, and zero or more occurrences of sub followed by section followed by open brace, to the end of line.

There are a few built-in patterns to make this easier, and tex is one of them, so you do not have to write the above in your configuration file (you still need to enable this with the attribute mechanism, via .gitattributes).

('csharp' is part of the current built-in patterns)
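
To make the documentation's example more concrete, here is a minimal Python sketch of how such a pattern picks the hunk header text. It assumes Python's re module behaves like Git's extended regex engine for this particular pattern, and hunk_header_text is a hypothetical helper name, not a Git function. The pattern is written as the regex engine sees it, after the config parser has removed one level of backslashes.

```python
import re

# The xfuncname value "^(\\\\(sub)*section\\{.*)$" after the config file
# parser has eaten one level of backslashes.
XFUNCNAME = re.compile(r"^(\\(sub)*section\{.*)$")


def hunk_header_text(lines, hunk_start):
    """Return the nearest line above hunk_start (0-based) that matches.

    Hypothetical helper: it only illustrates the idea of scanning
    backwards from the start of the hunk for a matching line.
    """
    for line in reversed(lines[:hunk_start]):
        match = XFUNCNAME.match(line)
        if match:
            # The first capture group is used as the header text.
            return match.group(1)
    return ""


doc = [
    r"\section{Introduction}",
    "Some text.",
    r"\subsection{Background}",
    "More text.",
    "Changed line.",
]

print(hunk_header_text(doc, 4))  # \subsection{Background}
```

A hunk starting on the fifth line is labelled with the closest \subsection line above it, not with the earlier \section line.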




Where does this excerpt come from?
Does git diff somehow recognize the language syntax?

Initially, the function-name detection algorithm was quite crude.
See commit acb7257 (Git 1.3.0, April 2006, authored by Mark Wooding):

xdiff: Show function names in hunk headers.

The speed of the built-in diff generator is nice; but the function names shown by diff -p are really nice. And I hate having to choose.
So, we hack xdiff to find the function names and print them.

The function names are parsed by a particularly stupid algorithm at the moment: it just tries to find a line in the 'old' file, from before the start of the hunk, whose first character looks plausible. Still, it's most definitely a start.
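
A rough Python sketch of that heuristic may help explain why the header is often "wrong" for indented code. It assumes that "plausible" means an ASCII letter, an underscore or a dollar sign in the first column, which is my reading of xdiff and not something stated in the commit message. Both function names are hypothetical.

```python
def looks_like_function_line(line):
    # Assumption: a line is "plausible" when its very first character is
    # an ASCII letter, '_' or '$'. Indented lines never qualify.
    if not line:
        return False
    first = line[0]
    return (first.isascii() and first.isalpha()) or first in "_$"


def default_hunk_header(old_lines, hunk_start):
    """Scan backwards from the hunk start (0-based) in the old file."""
    for line in reversed(old_lines[:hunk_start]):
        if looks_like_function_line(line):
            return line.rstrip()
    return ""


c_file = [
    "#include <stdio.h>",
    "",
    "int main(void)",
    "{",
    '    puts("hi");',
    "}",
]

csharp = [
    "namespace Demo",
    "{",
    "    class Program",
    "    {",
    "        static void Main(string[] args)",
    "        {",
    '            Console.WriteLine("Hello");',
    "        }",
    "    }",
    "}",
]

print(default_hunk_header(c_file, 4))  # int main(void)
print(default_hunk_header(csharp, 6))  # namespace Demo
```

With C-style code that starts functions in column 0 the result is useful, but for the indented C# method the sketch falls back to the namespace line, which is the kind of unreliable header described in the question.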


It was refined with get_func_line(), itself coming from commit f258475 (Git 1.5.3, Sept 2007, authored by Junio C Hamano (gitster))

That commit also includes the test t/t4018-diff-funcname.sh, which tests custom diff function name patterns.

Per-path attribute based hunk header selection.

This makes "diff -p" hunk headers customizable via gitattributes mechanism.
It is based on Johannes's earlier patch that allowed to define a single regexp to be used for everything.

The mechanism to arrive at the regexp that is used to define hunk header is the same as other use of gitattributes.
You assign an attribute, funcname (because "diff -p" typically uses the name of the function the patch is about as the hunk header), a simple string value.
This can be one of the names of built-in pattern (currently, "java" is defined) or a custom pattern name, to be looked up from the configuration file.

 (in .gitattributes)
 *.java   funcname=java
 *.perl   funcname=perl

 (in .git/config)
 [funcname]
   java = ... # ugly and complicated regexp to override the built-in one.
   perl = ... # another ugly and complicated regexp to define a new one.

The current xfuncname syntax was introduced in commit 45d9414 (Git 1.6.0.3, Oct. 2008, authored by Brandon Casey):

diff.*.xfuncname which uses "extended" regex's for hunk header selection

Currently, the hunk headers produced by 'diff -p' are customizable by setting the diff.*.funcname option in the config file. The 'funcname' option takes a basic regular expression. This functionality was designed using the GNU regex library which, by default, allows using backslashed versions of some extended regular expression operators, even in Basic Regular Expression mode. For example, the following characters, when backslashed, are interpreted according to the extended regular expression rules: ?, +, and |.
As such, the builtin funcname patterns were created using some extended regular expression operators.

Other platforms which adhere more strictly to the POSIX spec do not interpret the backslashed extended RE operators in Basic Regular Expression mode. This causes the pattern matching for the builtin funcname patterns to fail on those platforms.

Introduce a new option 'xfuncname' which uses extended regular expressions, and advertise it instead of funcname.
Since most users are on GNU platforms, the majority of funcname patterns are created and tested there.
Advertising only xfuncname should help to avoid the creation of non-portable patterns which work with GNU regex but not elsewhere.

Additionally, the extended regular expressions may be less ugly and complicated compared to the basic RE since many common special operators do not need to be backslashed.

For example, the GNU Basic RE:

^[    ]*\\(\\(public\\|static\\).*\\)$

becomes the following Extended RE:

^[    ]*((public|static).*)$

Finally, it was expanded with commit 14937c2, for Git 1.7.8 (December 2011), authored by René Scharfe.

diff: add option to show whole functions as context

Add the option -W/--function-context to git diff.
It is similar to the same option of git grep and expands the context of change hunks so that the whole surrounding function is shown.
This "natural" context can allow changes to be understood better.


It was still being tweaked in Git 2.15 (Q4 2017):

The built-in pattern to detect the "function header" for HTML did not match <H1>..<H6> elements without any attributes, which has been fixed.

Before 2.15, the pattern failed to match <h1>...</h1>, while <h1 class="smth">...</h1> matched.

See commit 9c03cac (23 Sep 2017) by Ilya Kantor (iliakan).
(Merged by Junio C Hamano -- gitster -- in commit 376a1da, 28 Sep 2017)
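
The difference can be reproduced with two regexes. The patterns below are my recollection of the HTML entry in userdiff.c before and after that fix, so treat them as an assumption and check the source before relying on them. Python's re module stands in for Git's extended regex engine here.

```python
import re

# Assumed pre-2.15 pattern: whitespace is required right after the tag name.
OLD_HTML = re.compile(r"^[ \t]*(<[Hh][1-6][ \t].*>.*)$")
# Assumed fixed pattern: the attribute part is optional.
NEW_HTML = re.compile(r"^[ \t]*(<[Hh][1-6]([ \t].*)?>.*)$")

samples = [
    "<h1>Title</h1>",
    '<h1 class="smth">Title</h1>',
]

for line in samples:
    print(line, bool(OLD_HTML.match(line)), bool(NEW_HTML.match(line)))
# <h1>Title</h1> False True
# <h1 class="smth">Title</h1> True True
```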


A pattern to detect a function boundary is called an xfuncref.

See commit a807200 (08 Nov 2019) by Łukasz Niemier (hauleth).
(Merged by Junio C Hamano -- gitster -- in commit 376e730, 01 Dec 2019), for Git 2.25 (Q1 2020)

userdiff: add Elixir to supported userdiff languages

Signed-off-by: Łukasz Niemier
Acked-by: Johannes Sixt

Adds support for xfuncref in Elixir language which is Ruby-like language that runs on Erlang Virtual Machine (BEAM).

And:

See commit d1b1384 (13 Dec 2019) by Ed Maste (emaste).
(Merged by Junio C Hamano -- gitster -- in commit ba6b662, 25 Dec 2019)

userdiff: remove empty subexpression from elixir regex

Signed-off-by: Ed Maste
Reviewed-by: Jeff King
Helped-by: Johannes Sixt

The regex failed to compile on FreeBSD.

Also add /* -- */ mark to separate the two regex entries given to the PATTERNS() macro, to make it consistent with patterns for other content types.


The userdiff patterns for Markdown documents have been added with Git 2.27 (Q2 2020).

See commit 09dad92 (02 May 2020) by Ash Holland (sersorrel).
(Merged by Junio C Hamano -- gitster -- in commit dc4c393, 08 May 2020)

userdiff: support Markdown

Signed-off-by: Ash Holland
Acked-by: Johannes Sixt

It's typical to find Markdown documentation alongside source code, and having better context for documentation changes is useful; see also commit 69f9c87d4 ("userdiff: add support for Fountain documents", 2015-07-21, Git v2.6.0-rc0 -- merge listed in batch #1).

The pattern is based on the CommonMark specification 0.29, section 4.2 https://spec.commonmark.org/ but doesn't match empty headings, as seeing them in a hunk header is unlikely to be useful.

Only ATX headings are supported, as detecting setext headings would require printing the line before a pattern matches, or matching a multiline pattern. The word-diff pattern is the same as the pattern for HTML, because many Markdown parsers accept inline HTML.
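
As an illustration of what "only ATX headings" means in practice, here is a small Python check. The regex is what I believe the built-in Markdown funcname pattern looks like, so it is an assumption and not a quote from userdiff.c. is_atx_heading is a hypothetical helper.

```python
import re

# Assumed pattern: up to three spaces of indentation, one to six '#',
# then at least one space or tab.
ATX_HEADING = re.compile(r"^ {0,3}#{1,6}[ \t].*")


def is_atx_heading(line):
    return ATX_HEADING.match(line) is not None


for line in [
    "# Title",           # ATX heading
    "   ### Indented",   # still a heading with three leading spaces
    "    # code block",  # four spaces: indented code, not a heading
    "#",                 # empty heading, deliberately not matched
    "####### seven",     # more than six '#'
    "Title",             # first line of a setext heading
]:
    print(repr(line), is_atx_heading(line))
```

A setext heading (a title line followed by a line of = or - characters) can only be recognised by looking at the next line, which a single-line pattern like this cannot do.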


With Git 2.30 (Q1 2021), the userdiff pattern learned to identify the function definition in POSIX shells and bash.

See commit 2ff6c34 (22 Oct 2020) by Victor Engmark (l0b0).
(Merged by Junio C Hamano -- gitster -- in commit 292e53f, 02 Nov 2020)

userdiff: support Bash

Signed-off-by: Victor Engmark
Acked-by: Johannes Sixt

Support POSIX, bashism and mixed function declarations, all four compound command types, trailing comments and mixed whitespace.

Even though Bash allows locale-dependent characters in function names, only detect function names with characters allowed by POSIX.1-2017 for simplicity.
This should cover the vast majority of use cases, and produces system-agnostic results.

Since a word pattern has to be specified, but there is no easy way to know the default word pattern, use the default IFS characters for a starter. A later patch can improve this.

gitattributes now includes in its man page:

  • bash suitable for source code in the Bourne-Again SHell language.
    Covers a superset of POSIX shell function definitions.

With Git 2.32 (Q2 2021), userdiff patterns for "Scheme" have been added.

See commit a437390 (08 Apr 2021) by Atharva Raykar (tfidfwastaken).
(Merged by Junio C Hamano -- gitster -- in commit 6d7a62d, 20 Apr 2021)

userdiff: add support for Scheme

Signed-off-by: Atharva Raykar

Add a diff driver for Scheme-like languages which recognizes top level and local define forms, whether it is a function definition, binding, syntax definition or a user-defined define-xyzzy form.

Also supports R6RS library forms, module forms along with class and struct declarations used in Racket (PLT Scheme).

Alternate "def" syntax such as those in Gerbil Scheme are also supported, like defstruct, defsyntax and so on.

The rationale for picking define forms for the hunk headers is because it is usually the only significant form for defining the structure of the program, and it is a common pattern for schemers to have local function definitions to hide their visibility, so it is not only the top level define's that are of interest.
Schemers also extend the language with macros to provide their own define forms (for example, something like a define-test-suite) which is also captured in the hunk header.

Since it is common practice to extend syntax with variants of a form like module+, class* etc, those have been supported as well.

The word regex is a best-effort attempt to conform to R7RS (section 2.1) valid identifiers, symbols and numbers.

gitattributes now includes in its man page:

  • scheme suitable for source code in the Scheme language.

With Git 2.33 (Q3 2021), the userdiff pattern for C# learned the token "record".

See commit c4e3178 (02 Mar 2021) by Julian Verdurmen (304NotModified).
(Merged by Junio C Hamano -- gitster -- in commit f741069, 08 Jul 2021)

userdiff: add support for C# record types

Signed-off-by: Julian Verdurmen
Reviewed-by: Johannes Schindelin

Records are added in C# 9

Code example :

public record Person(string FirstName, string LastName);

For more information, see https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-9


With Git 2.34 (Q4 2021), the userdiff pattern for "java" language has been updated.

See commit a8cbc89 (11 Aug 2021) by Tassilo Horn (tsdh).
(Merged by Junio C Hamano -- gitster -- in commit a896086, 30 Aug 2021)

userdiff: improve java hunk header regex

Signed-off-by: Tassilo Horn

Currently, the git diff hunk headers show the wrong method signature if the method has a qualified return type, an array return type, or a generic return type because the regex doesn't allow dots (.), [], or < and > in the return type.
Also, type parameter declarations couldn't be matched.

Add several t4018 tests asserting the right hunk headers for different cases:

  • enum constant change
  • change in generic method with bounded type parameters
  • change in generic method with wildcard
  • field change in a nested class

And, still with Git 2.34 (Q4 2021), userdiff patterns for the C++ language have been updated.

See commit 386076e (24 Oct 2021), commit c4fdba3, commit 637b80c, commit bfaaf19 (10 Oct 2021), and commit 350b87c, commit 3e063de, commit 1cf9384 (08 Oct 2021) by Johannes Sixt (j6t).
(Merged by Junio C Hamano -- gitster -- in commit f3f157f, 25 Oct 2021)

For instance:

userdiff-cpp: permit the digit-separating single-quote in numbers

Signed-off-by: Johannes Sixt

Since C++17, the single-quote can be used as digit separator:

3.141'592'654
1'000'000
0xdead'beaf

Make it known to the word regex of the cpp driver, so that numbers are not split into separate tokens at the single-quotes.
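
To show what "split into separate tokens" means for a word diff, here is a simplified Python sketch. The two regexes are deliberately small stand-ins that I made up for illustration; they are not the actual cpp word regex from userdiff.c. (As far as I know, digit separators have been standard since C++14, slightly earlier than the commit message suggests.)

```python
import re

# Hypothetical, simplified word regexes: identifier | number | any other
# single non-space character.
WITHOUT_SEPARATOR = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*|[0-9][0-9a-zA-Z.]*|\S"
)
WITH_SEPARATOR = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*|[0-9](?:'?[0-9a-zA-Z.])*|\S"
)


def split_words(word_regex, text):
    """Split text into the tokens a word diff would compare."""
    return word_regex.findall(text)


print(split_words(WITHOUT_SEPARATOR, "1'000'000"))
# ['1', "'", '000', "'", '000']
print(split_words(WITH_SEPARATOR, "1'000'000"))
# ["1'000'000"]
```

With the first regex, changing 1'000'000 to 2'000'000 would be shown as a change to the single digit only; with the second, the whole number is treated as one changed word.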


With Git 2.40 (Q1 2023), userdiff includes a regexp update for Java language.

See commit 93d52ed, commit 575e6fc, commit 39226a8 (08 Feb 2023) by Andrei Rybak (rybak).
(Merged by Junio C Hamano -- gitster -- in commit 4a6e6b0, 15 Feb 2023)

userdiff: support Java sealed classes

Signed-off-by: Andrei Rybak
Reviewed-by: Johannes Sixt

A new kind of class was added in Java 17 -- sealed classes (see "JEP 409: Sealed Classes").
This feature includes several new keywords that may appear in a declaration of a class.
New modifiers before name of the class: "sealed" and "non-sealed", and a clause after name of the class marked by keyword "permits".

The current set of regular expressions in userdiff.c already allows the modifier "sealed" and the "permits" clause, but not the modifier "non-sealed", which is the first hyphenated keyword in Java (see "JEP draft: Keyword Management for the Java Language").
Allow hyphen in the words that precede the name of type to match the "non-sealed" modifier.

Answered by Goshen, 23 Jan 2015 at 13:58

Comments (9):
The -W option is definitely something I'll use. - Ge
Built-in regexes can be found here: github.com/git/git/blob/master/userdiff.c - Croatia
@Croatia great, thanks for the link. Does it say somewhere what it uses by default (without a .gitattributes file)? - Karp
@Goshen How the heck do you keep track of these sorts of long-standing answers so you can update them when things change? Do you have an amazing database? An amazing memory? - Somerset
@Somerset Just checking the logs (github.com/git/git/commits/master) from time to time. And then searching for similar topics, which leads me more often than not to my own past answers. - Goshen
Claude asks: Does it say somewhere what it uses by default (without a .gitattributes file)? The answer is that it does not assume anything; it uses pretty much the same logic as GNU diff in that case. At some point, having Git automatically activate the right setting based on the file extension was discussed on the mailing list, and I think a patch was sent, but it did not get further than that. - Stiver
@Stiver Interesting. Do you have a link illustrating that? I tried public-inbox.org/git/?q=detect+hunk+language+file, but did not find anything. - Goshen
Here it is: public-inbox.org/git/[email protected]/t/#u - Stiver
@Stiver Good catch, thank you for the link. - Goshen
