Where's the 3-way Git merge driver for .PO (gettext) files?

I already have the following

[attr]POFILE merge=merge-po-files

locale/*.po POFILE

in the .gitattributes and I'd like to get merging of branches to work correctly when the same localization file (e.g. locale/en.po) has been modified in parallel branches. I'm currently using the following merge driver:

#!/bin/bash
# git merge driver for .PO files (gettext localizations)
# Install:
# git config merge.merge-po-files.driver "./bin/merge-po-files %A %O %B"

LOCAL="${1}._LOCAL_"
BASE="${2}._BASE_"
REMOTE="${3}._REMOTE_"

# rename to bit more meaningful filenames to get better conflict results
cp "${1}" "$LOCAL"
cp "${2}" "$BASE"
cp "${3}" "$REMOTE"

# merge files and overwrite local file with the result
msgcat "$LOCAL" "$BASE" "$REMOTE" -o "${1}" || exit 1

# cleanup
rm -f "$LOCAL" "$BASE" "$REMOTE"

# check if merge has conflicts
fgrep -q '#-#-#-#-#' "${1}" && exit 1

# if we get here, merge is successful
exit 0

However, msgcat is too dumb for this and the result is not a true three-way merge. For example, if I have

  1. BASE version

    msgid "foo"
    msgstr "foo"
    
  2. LOCAL version

    msgid "foo"
    msgstr "bar"
    
  3. REMOTE version

    msgid "foo"
    msgstr "foo"
    

I'll end up with a conflict. However, a true three-way merge driver would output the correct merge:

msgid "foo"
msgstr "bar"

Note that I cannot simply add --use-first to msgcat because the REMOTE could contain the updated translation. In addition, if BASE, LOCAL and REMOTE are all unique, I still want a conflict, because that would really be a conflict.
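To make the wanted rule concrete, here is a minimal sketch of the per-message decision. The function name `three_way_msgstr` is hypothetical and it only compares three plain strings; it does not parse PO files, so treat it as an illustration of the rule rather than a driver.

```shell
#!/bin/bash
# Hypothetical helper (not part of any gettext tool): decide the merged
# msgstr for one msgid from its BASE, LOCAL and REMOTE values.
# Prints the merged value and returns 0, or returns 1 on a real conflict.
three_way_msgstr() {
    local base="$1" local_value="$2" remote="$3"
    if [[ "$local_value" == "$remote" ]]; then
        # Both sides agree (changed identically or not at all).
        printf '%s\n' "$local_value"
    elif [[ "$local_value" == "$base" ]]; then
        # Only REMOTE changed the translation.
        printf '%s\n' "$remote"
    elif [[ "$remote" == "$base" ]]; then
        # Only LOCAL changed the translation.
        printf '%s\n' "$local_value"
    else
        # BASE, LOCAL and REMOTE all differ: a real conflict.
        return 1
    fi
}
```

With the example above, `three_way_msgstr foo bar foo` prints `bar`, while three different values make the function return 1.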

What do I need to change to make this work? Bonus points for a less insane conflict marker than '#-#-#-#-#', if possible.

Formosa answered 25/4, 2013 at 11:55 Comment(2)
Any chance you could use another merge tool, like kdiff3 (which is 3-way)? – Brosine
Have you tried to fix a conflicting .PO file merge with kdiff3? I have and it's not pretty. The problem with .PO files is that in reality those are binary database files that just happen to look like text files. Any tool that's designed to merge text files is going to fail. – Formosa
F
4

Here's yet another answer, from the year 2021. Nowadays I'm using the following merge driver, and it seems to work correctly for all cases I've tested. I have it stored as ./bin/merge-po-files in our repository.

#!/bin/bash
#
# Three-way merge driver for PO files, runs on multiple CPUs where possible
#
# Copyright 2015-2016 Marco Ciampa
# Copyright 2021 Mikko Rantalainen <[email protected]>
# License: MIT (https://opensource.org/licenses/MIT)
#
# Original source:
# https://mcmap.net/q/427760/-where-39-s-the-3-way-git-merge-driver-for-po-gettext-files
# https://github.com/mezis/git-whistles/blob/master/libexec/git-merge-po.sh
#
# Install with
# git config merge.merge-po-files.driver "./bin/merge-po-files %A %O %B %P"
#
# Note that you also need file `.gitattributes` with following lines:
#
# [attr]POFILE merge=merge-po-files
# locale/*.po POFILE
#
##########################################################################
# CONFIG:

# Formatting flags to be used to produce merged .po files
# This can be set to match project needs for the .po files.
# NOTE: $MSGCAT_FINAL_FLAGS will be passed to msgcat without quotation
MSGCAT_FINAL_FLAGS="--no-wrap --sort-output"

# Verbosity level:
# 0: Silent except for real errors
# 1: Show simple header for each file processed
# 2: Also show all conflicts in merge result (both new and existing)
# 3: Also show all status messages with timestamps
VERBOSITY="${VERBOSITY:=2}"

##########################################################################
# Implementation:

# Use logical names for arguments:
LOCAL="$1"
BASE="$2"
OTHER="$3"
FILENAME="$4"
OUTPUT="$LOCAL"

# The temporary directory for all files we need - note that most files are
# created without extensions to emit nicer conflict messages where gettext
# likes to embed the basename of the file in the conflict message so we
# use names like "local" and "other" instead of e.g. "local.G2wZ.po".
TEMP="$(mktemp -d /tmp/merge-po.XXXXXX)"


# abort on any error and report the details if possible
set -E
set -e
on_error()
{
    local parent_lineno="$1"
    local message="$3"
    local code="$2"
    if [[ -n "$message" ]] ; then
        printf "### %s: error near line %d: status %d: %s\n" "$0" "${parent_lineno}" "${code}" "${message}" 1>&2
    else
        printf "### %s: error near line %d: status %d\n" "$0" "${parent_lineno}" "${code}" 1>&2
    fi
    exit 255
}
trap 'on_error ${LINENO} $?' ERR


# Maybe print message(s) to stdout with timestamps
function status()
{
    if test "$VERBOSITY" -ge 3
    then
        printf "%s %s\n" "$(date '+%Y-%m-%d %H:%M:%S.%3N')" "$@"
    fi
}

# Quietly take translations from $1 and apply those according to template $2
# (and do not use fuzzy-matching, always generate output)
# also supports all flags to msgmerge
function apply_po_template()
{
    msgmerge --force-po --quiet --no-fuzzy-matching "$@"
}

# Take stdin, remove the "graveyard strings" and emit the result to stdout
function strip_graveyard()
{
    msgattrib --no-obsolete
}

# Take stdin, keep only conflict lines and emit the result to stdout
function only_conflicts()
{
    msggrep --msgstr -F -e '#-#-#-#-#' -
    # alternative slightly worse implementation: msgattrib --only-fuzzy
}

# Take stdin, discard conflict lines and emit the result to stdout
function without_conflicts()
{
    msggrep -v --msgstr -F -e '#-#-#-#-#' -
    # alternative slightly worse implementation: msgattrib --no-fuzzy
}

# Select messages from $1 that are also in $2 but whose contents have changed
# and emit results to stdout
function extract_changes()
{
    # Extract conflicting changes and discard any changes to graveyard area only
    msgcat -o - "$1" "$2" \
    | only_conflicts \
    | apply_po_template -o - "$1" - \
    | strip_graveyard
}

# Emit only the header of $1, supports flags of msggrep
function extract_header()
{
    # Unfortunately gettext really doesn't support extracting just header
    # so we have to get creative: extract only strings that originate
    # from file called "//" which should result to header only
    msggrep --force-po -N // "$@"

    # Logically msggrep --force-po -v -K -E -e '.' should return the header
    # only but msggrep seems to be buggy with msgids with line feeds and
    # outputs those, too
}

# Take file in $1 and show conflicts with colors in the file to stdout
function show_conflicts()
{
    OUTPUT="$1"
    shift
    # Count number of lines to remove from the output and output conflict lines without the header
    CONFLICT_HEADER_LINES=$(cat "$OUTPUT" | msggrep --force-po --color=never --msgstr -F -e '#-#-#-#-#' - | extract_header - | wc -l)
    # tail wants line number of the first displayed line so we want +1 here:
    CONFLICTS=$(cat "$OUTPUT" | msggrep --force-po --color --msgstr -F -e '#-#-#-#-#' - | tail -n "+$((CONFLICT_HEADER_LINES+1))")
    if test -n "$CONFLICTS"
    then
        #echo "----------------------------"
        #echo "Conflicts after merge:"
        echo "----------------------------"
        printf "%s\n" "$CONFLICTS"
        echo "----------------------------"
    fi
}

# Sanity check that we have a sensible temporary directory
test -n "$TEMP" || exit 125
test -d "$TEMP" || exit 126
test -w "$TEMP" || exit 127

if test "$VERBOSITY" -ge 1
then
    printf "Using gettext .PO merge driver: %s ...\n" "$FILENAME"
fi

# Extract the PO header from the current branch (top of file until first empty line)
extract_header -o "${TEMP}/header" "$LOCAL"

##########################################################################
# The following parts can be run partially in parallel and "wait" is used to synchronize processing


# Clean input files and use logical filenames for possible conflict markers:
status "Canonicalizing input files ..."
msguniq --force-po -o "${TEMP}/base" --unique "${BASE}" &
msguniq --force-po -o "${TEMP}/local" --unique "${LOCAL}" &
msguniq --force-po -o "${TEMP}/other" --unique "${OTHER}" &
wait

status "Computing local-changes, other-changes and unchanged ..."
msgcat --force-po -o - "${TEMP}/base" "${TEMP}/local" "${TEMP}/other" | without_conflicts > "${TEMP}/unchanged" &
extract_changes "${TEMP}/local" "${TEMP}/base" > "${TEMP}/local-changes" &
extract_changes "${TEMP}/other" "${TEMP}/base" > "${TEMP}/other-changes" &
wait

# Messages changed on both local and other (conflicts):
status "Computing conflicts ..."
msgcat --force-po -o - "${TEMP}/other-changes" "${TEMP}/local-changes" | only_conflicts > "${TEMP}/conflicts"

# Messages changed on local, not on other; and vice-versa:
status "Computing local-only and other-only changes ..."
msgcat --force-po -o "${TEMP}/local-only"  --unique "${TEMP}/local-changes"  "${TEMP}/conflicts" &
msgcat --force-po -o "${TEMP}/other-only" --unique "${TEMP}/other-changes" "${TEMP}/conflicts" &
wait

# Note: following steps require sequential processing and cannot be run in parallel

status "Computing initial merge without template ..."
# Note that we may end up with some extra messages so we have to apply the template later
msgcat --force-po -o "${TEMP}/merge1" "${TEMP}/unchanged" "${TEMP}/conflicts" "${TEMP}/local-only" "${TEMP}/other-only"

# Create a template to only output messages that are actually needed (union of messages on local and other create the template!)
status "Computing template and applying it to merge result ..."
msgcat --force-po -o - "${TEMP}/local" "${TEMP}/other" | apply_po_template -o "${TEMP}/merge2" "${TEMP}/merge1" -

# Final merge result is merge2 with original header
status "Fixing the header after merge ..."
msgcat --force-po $MSGCAT_FINAL_FLAGS -o "${TEMP}/merge3" --use-first "${TEMP}/header" "${TEMP}/merge2"

# Produce output file (overwrites input LOCAL file because git expects that for the results)
status "Saving output ..."
mv "${TEMP}/merge3" "$OUTPUT"

status "Cleaning up ..."

rm "${TEMP}"/*
rmdir "${TEMP}"

status "Checking for conflicts in the result ..."

# Check for conflicts in the final merge
if grep -q '#-#-#-#-#' "$OUTPUT"
then
    if test "$VERBOSITY" -ge 1
    then
        printf "### Conflict(s) detected ###\n"
    fi

    if test "$VERBOSITY" -ge 2
    then
        # Verbose diagnostics
        show_conflicts "$OUTPUT"
    fi

    status "Automatic merge failed, exiting with status 1."
    exit 1
fi

status "Automatic merge completed successfully, exiting with status 0."
exit 0

This variant is based on the version in the answer by @mezis in this same question, but it has the following improvements:

  • Run on multiple CPUs in parallel where possible. Work is distributed to multiple CPUs by running several pipelines in the background with & and then synchronizing them later with wait. The final merge requires sequential code, so that part runs on one CPU core only. The merge speed seems to be around 1 MB/s of .PO input.
  • Add lots of documentation.
  • Add configurable variable at the start to define the final gettext file format. In the example above, the default config is --no-wrap --sort-output.
  • Use logical names without file extensions for all the temporary files so that gettext merge conflicts are easier to understand.
  • Use new git option %P in the merge driver to pass the correct filename as a parameter. This is required for the case where the merged file contents match another file in the project - the older code that matched on file contents SHA-1 could print wrong filename in such cases. Note that the %P must be used in the git config (see documentation at the start of the file).
  • Avoids using perl, awk or sed for modifying or even reading gettext files - just gettext tools. The optional part uses grep, tail and wc to show verbose conflicts to stdout only but that doesn't handle the real data in the output files.
  • Correctly merges cases where different plural forms have changes (the merge will result in conflict in that translation but nothing should be lost).
  • Note that if you have merge conflicts in the graveyard (the lines starting with #~), those conflicts will be silently dropped instead of being merged. Non-conflicting graveyard data will be preserved.
  • Note that this doesn't try to do any fuzzy matching before or after the merge. Sometimes this could improve the results but it depends on heuristics and this merge driver tries to be deterministic.
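One thing to keep in mind about the `&` plus `wait` pattern: a bare `wait` with no arguments returns 0 no matter how the background jobs ended, so `set -e` and the ERR trap do not see a failure inside a background pipeline. If that matters for you, a sketch like the following (the helper name `run_parallel` is hypothetical and not part of the driver above) collects each job's status by waiting on its PID:

```shell
#!/bin/bash
# Hypothetical helper: run each argument as a shell command line in the
# background, then wait for every job and report whether all succeeded.
run_parallel() {
    local pids=() pid failed=0 cmd
    for cmd in "$@"; do
        bash -c "$cmd" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        # "wait PID" returns the exit status of that specific job.
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

For example, `run_parallel "true" "false"` returns 1, whereas running the same two commands with `&` followed by a bare `wait` would report success.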
Formosa answered 16/8, 2021 at 8:2 Comment(1)
Note that this doesn't try to merge fuzzy msgids, so if you make minor changes to msgid values (e.g. to fix typos in the source code) you end up preserving both the old and new msgid data after running this merge. You can then use regular gettext tools to combine the new template with all the preserved data as usual. – Formosa
F
6

[This is a historical version; see my other, more recent answer for the 2021 version of the merge driver.]

Here's a somewhat complex example driver that seems to output a correct merge, although the result may contain some translations that should have been deleted by the local or remote version.
Nothing should be missing, so this driver just adds some extra clutter in some cases.

This version uses the gettext native conflict marker that looks like #-#-#-#-# combined with the fuzzy flag instead of normal git conflict markers.
The driver is a bit ugly because it has to work around bugs (or features) in msgcat and msguniq:

#!/bin/bash
# git merge driver for .PO files
# Copyright (c) Mikko Rantalainen <[email protected]>, 2013
# License: MIT

ORIG_HASH=$(git hash-object "${1}")
WORKFILE=$(git ls-tree -r HEAD | fgrep "$ORIG_HASH" | cut -b54-)
echo "Using custom merge driver for $WORKFILE..."

LOCAL="${1}._LOCAL_"
BASE="${2}._BASE_"
REMOTE="${3}._REMOTE_"

LOCAL_ONELINE="$LOCAL""ONELINE_"
BASE_ONELINE="$BASE""ONELINE_"
REMOTE_ONELINE="$REMOTE""ONELINE_"

OUTPUT="$LOCAL""OUTPUT_"
MERGED="$LOCAL""MERGED_"
MERGED2="$LOCAL""MERGED2_"

TEMPLATE1="$LOCAL""TEMPLATE1_"
TEMPLATE2="$LOCAL""TEMPLATE2_"
FALLBACK_OBSOLETE="$LOCAL""FALLBACK_OBSOLETE_"

# standardize the input files for regexping
# default to UTF-8 in case charset is still the placeholder "CHARSET"
cat "${1}" | perl -npe 's!(^"Content-Type: text/plain; charset=)(CHARSET)(\\n"$)!$1UTF-8$3!' | msgcat --no-wrap --sort-output - > "$LOCAL"
cat "${2}" | perl -npe 's!(^"Content-Type: text/plain; charset=)(CHARSET)(\\n"$)!$1UTF-8$3!' | msgcat --no-wrap --sort-output - > "$BASE"
cat "${3}" | perl -npe 's!(^"Content-Type: text/plain; charset=)(CHARSET)(\\n"$)!$1UTF-8$3!' | msgcat --no-wrap --sort-output - > "$REMOTE"

# convert each definition to single line presentation
# extra fill is required to make sure that git separates each conflict 
perl -npe 'BEGIN {$/ = "\n\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$LOCAL" > "$LOCAL_ONELINE"
perl -npe 'BEGIN {$/ = "\n\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$BASE"  > "$BASE_ONELINE"
perl -npe 'BEGIN {$/ = "\n\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$REMOTE"  > "$REMOTE_ONELINE"

# merge files using normal git merge machinery
git merge-file -p --union -L "Current (working directory)" -L "Base (common ancestor)" -L "Incoming (applied changeset)" "$LOCAL_ONELINE" "$BASE_ONELINE" "$REMOTE_ONELINE" > "$MERGED"
MERGESTATUS=$?

# remove possibly duplicated headers (workaround msguniq bug http://comments.gmane.org/gmane.comp.gnu.gettext.bugs/96)
cat "$MERGED" | perl -npe 'BEGIN {$/ = "\n\n"}; s/^([^\n]+#nmsgid ""#nmsgstr ""#n.*?\n)([^\n]+#nmsgid ""#nmsgstr ""#n.*?\n)+/$1/gs' > "$MERGED2"

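# remove lines that have totally empty msgstr
# and convert back to normal PO file representation
# (decode "##" and "#n" in a single pass so that a literal "#n" in the data survives)
cat "$MERGED2" | grep -v '#nmsgstr ""$' | grep -v '^#fill#$' | perl -npe 's/#([#n])/$1 eq "n" ? "\n" : "#"/ge' > "$MERGED"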

# run the output through msguniq to merge conflicts gettext style
# msguniq seems to have a bug that causes empty output if zero msgids
# are found after the header. Expected output would be the header...
# Workaround the bug by adding an empty obsolete fallback msgid
# that will be automatically removed by msguniq

cat > "$FALLBACK_OBSOLETE" << 'EOF'

#~ msgid "obsolete fallback"
#~ msgstr ""

EOF
cat "$MERGED" "$FALLBACK_OBSOLETE" | msguniq --no-wrap --sort-output > "$MERGED2"


# create a hacked template from default merge between 3 versions
# we do this to try to preserve original file ordering
msgcat --use-first "$LOCAL" "$REMOTE" "$BASE" > "$TEMPLATE1"
msghack --empty "$TEMPLATE1" > "$TEMPLATE2"
msgmerge --silent --no-wrap --no-fuzzy-matching "$MERGED2" "$TEMPLATE2" > "$OUTPUT"

# show some results to stdout
if grep -q '#-#-#-#-#' "$OUTPUT"
then
    FUZZY=$(cat "$OUTPUT" | msgattrib --only-fuzzy --no-obsolete --color | perl -npe 'BEGIN{ undef $/; }; s/^.*?msgid "".*?\n\n//s')
    if test -n "$FUZZY"
    then
        echo "-------------------------------"
        echo "Fuzzy translations after merge:"
        echo "-------------------------------"
        echo "$FUZZY"
        echo "-------------------------------"
    fi
fi

# git merge driver must overwrite the first parameter with output
mv "$OUTPUT" "${1}"

# cleanup
rm -f "$LOCAL" "$BASE" "$REMOTE" "$LOCAL_ONELINE" "$BASE_ONELINE" "$REMOTE_ONELINE" "$MERGED" "$MERGED2" "$TEMPLATE1" "$TEMPLATE2" "$FALLBACK_OBSOLETE"

# return conflict if merge has conflicts according to msgcat/msguniq
grep -q '#-#-#-#-#' "${1}" && exit 1

# otherwise, return git merge status
exit $MERGESTATUS

# Steps to install this driver:
# (1) Edit ".git/config" in your repository directory
# (2) Add following section:
#
# [merge "merge-po-files"]
#   name = merge po-files driver
#   driver = ./bin/merge-po-files %A %O %B
#   recursive = binary
#
# or
#
# git config merge.merge-po-files.driver "./bin/merge-po-files %A %O %B"
#
# The file ".gitattributes" will point git to use this merge driver.
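If the driver never seems to run, it can help to check what git itself sees. The following is a small sketch that sets up a throwaway repository (the path comes from `mktemp`, nothing here touches your real repository) and asks git which merge driver it would pick for locale/en.po. It assumes the same `.gitattributes` lines and driver name as in the question.

```shell
#!/bin/bash
# Throwaway repository for the check; removed again at the end.
repo="$(mktemp -d)"
git init -q "$repo"

# Same two lines as in the .gitattributes from the question.
printf '%s\n' '[attr]POFILE merge=merge-po-files' 'locale/*.po POFILE' > "$repo/.gitattributes"

# Register the driver command as described in the install notes.
git -C "$repo" config merge.merge-po-files.driver "./bin/merge-po-files %A %O %B"

# Ask git which value the "merge" attribute has for a PO file,
# and which command is configured for that driver name.
attr_line="$(git -C "$repo" check-attr merge -- locale/en.po)"
driver_cmd="$(git -C "$repo" config --get merge.merge-po-files.driver)"
printf '%s\n' "$attr_line" "$driver_cmd"

rm -rf "$repo"
```

In your real repository the two interesting commands are just `git check-attr merge -- locale/en.po` and `git config --get merge.merge-po-files.driver`.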

Short explanation about this driver:

  • It converts the regular PO file format to a single-line format where each line is one translation entry.
  • Then it uses regular git merge-file --union to do the merge.
  • After the merge, the resulting single-line format is converted back to the regular PO file format, and the actual conflict resolution is done using msguniq.
  • Finally, it merges the resulting file with a template generated by regular msgcat from the original input files, to restore possibly lost metadata.
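The single-line conversion can be sketched in plain bash as follows. The function names `po_escape` and `po_unescape` are hypothetical, and this is only an illustration of the encoding ("#" becomes "##", a line break becomes "#n"), not the perl code the driver uses. The decoder reads the string left to right in one pass, which is what keeps a literal "#n" in the data intact.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the one-entry-per-line encoding.

# Encode: '#' -> '##', newline -> '#n'.
po_escape() {
    local s="$1" out="" c i=0 len=${#1}
    while (( i < len )); do
        c="${s:i:1}"
        case "$c" in
            '#')   out+='##' ;;
            $'\n') out+='#n' ;;
            *)     out+="$c" ;;
        esac
        (( i += 1 ))
    done
    printf '%s' "$out"
}

# Decode in a single left-to-right pass: '##' -> '#', '#n' -> newline.
po_unescape() {
    local s="$1" out="" c i=0 len=${#1}
    while (( i < len )); do
        c="${s:i:1}"
        if [[ "$c" == '#' ]]; then
            if [[ "${s:i+1:1}" == 'n' ]]; then out+=$'\n'; else out+='#'; fi
            (( i += 2 ))
        else
            out+="$c"
            (( i += 1 ))
        fi
    done
    printf '%s' "$out"
}
```

Encoding an entry and decoding it again should give back the original text, including comment lines such as "#notes" that contain "#n".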

Warning: this driver will use msgcat --no-wrap on the .PO file and will force UTF-8 encoding if actual encoding is not specified.
If you want to use this merge driver but always inspect the results, change the final exit $MERGESTATUS to exit 1.

After getting merge conflict from this driver, the best method for fixing the conflict is to open the conflicting file with virtaal and select Navigation: Incomplete.
I find this UI a pretty nice tool for fixing the conflict.

Formosa answered 30/4, 2013 at 10:36 Comment(1)
Good work on that driver. The only flaw I've found so far is that it doesn't handle pluralization correctly - they end up flattened. – Rounders
F
3

[This is a historical version; see my other, more recent answer for the 2021 version of the merge driver.]

Here's an example driver that does a correct text-based diff with conflict markers in the correct places. However, in case of a conflict, git mergetool is sure to mess up the results, so this is not really good. If you want to fix conflicting merges using just a text editor, then this should be fine:

#!/bin/bash
# git merge driver for .PO files
# Copyright (c) Mikko Rantalainen <[email protected]>, 2013
# License: MIT

LOCAL="${1}._LOCAL_"
BASE="${2}._BASE_"
REMOTE="${3}._REMOTE_"
MERGED="${1}._MERGED_"
OUTPUT="$LOCAL""OUTPUT_"

LOCAL_ONELINE="$LOCAL""ONELINE_"
BASE_ONELINE="$BASE""ONELINE_"
REMOTE_ONELINE="$REMOTE""ONELINE_"

# standardize the input files for regexping
msgcat --no-wrap --strict --sort-output "${1}" > "$LOCAL"
msgcat --no-wrap --strict --sort-output "${2}" > "$BASE"
msgcat --no-wrap --strict --sort-output "${3}" > "$REMOTE"

# convert each definition to single line presentation
# extra fill is required to make sure that git separates each conflict 
perl -npe 'BEGIN {$/ = "#\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$LOCAL" > "$LOCAL_ONELINE"
perl -npe 'BEGIN {$/ = "#\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$BASE"  > "$BASE_ONELINE"
perl -npe 'BEGIN {$/ = "#\n"}; s/#\n$/\n/s; s/#/##/sg; s/\n/#n/sg; s/#n$/\n/sg; s/#n$/\n/sg; $_.="#fill#\n" x 4' "$REMOTE"  > "$REMOTE_ONELINE"

# merge files using normal git merge machinery
git merge-file -p -L "Current (working directory)" -L "Base (common ancestor)" -L "Incoming (another change)" "$LOCAL_ONELINE" "$BASE_ONELINE" "$REMOTE_ONELINE" > "$MERGED"
MERGESTATUS=$?

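# convert back to normal PO file representation
# (decode "##" and "#n" in a single pass so that a literal "#n" in the data survives)
cat "$MERGED" | grep -v '^#fill#$' | perl -npe 's/#([#n])/$1 eq "n" ? "\n" : "#"/ge' > "$OUTPUT"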

# git merge driver must overwrite the first parameter with output
mv "$OUTPUT" "${1}"

# cleanup
rm -f "$LOCAL" "$BASE" "$REMOTE" "$LOCAL_ONELINE" "$BASE_ONELINE" "$REMOTE_ONELINE" "$MERGED"

exit $MERGESTATUS

# Steps to install this driver:
# (1) Edit ".git/config" in your repository directory
# (2) Add following section:
#
# [merge "merge-po-files"]
#   name = merge po-files driver
#   driver = ./bin/merge-po-files %A %O %B
#   recursive = binary
#
# or
#
# git config merge.merge-po-files.driver "./bin/merge-po-files %A %O %B"
#
# The file ".gitattributes" will point git to use this merge driver.

Short explanation of this driver: it converts the regular PO file format to a single-line format in which each line is one translation entry. It then uses the regular git merge-file to do the merge, and afterwards the resulting single-line format is converted back to the regular PO file format. Warning: this driver runs msgcat --sort-output on the .PO file, so if you want your PO files in some specific order, this may not be the tool for you.
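The reversible single-line encoding described above can be illustrated in isolation. This is only a minimal sketch in Python, not the driver itself: the function names are made up for illustration, and it assumes that entries are separated by one blank line, which may not match how the Perl one-liners split records. It escapes `#` as `##` and newlines as `#n`, and decodes in a single left-to-right pass so that a literal `#n` inside a translation survives the round trip.

```python
import re

def entry_to_line(entry):
    """Encode one multi-line PO entry as a single line ('#' -> '##', newline -> '#n')."""
    return entry.replace("#", "##").replace("\n", "#n")

def line_to_entry(line):
    """Decode '##' and '#n' in one pass; every '#' in an encoded line starts an escape."""
    return re.sub(r"#([#n])", lambda m: "\n" if m.group(1) == "n" else "#", line)

def po_to_lines(po_text):
    # Assumption: entries are separated by exactly one blank line.
    return [entry_to_line(e) for e in po_text.strip("\n").split("\n\n")]

def lines_to_po(lines):
    return "\n\n".join(line_to_entry(line) for line in lines) + "\n"

sample = '#: src/a.c:1\nmsgid "foo #n"\nmsgstr "bar"\n\nmsgid "x"\nmsgstr "y"\n'
lines = po_to_lines(sample)
assert all("\n" not in line for line in lines)  # one entry per line
assert lines_to_po(lines) == sample             # the encoding can be undone
```

Because each entry occupies exactly one line, a line-based tool such as git merge-file reports a conflict per entry instead of per wrapped text line.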

Formosa answered 29/4, 2013 at 5:26 Comment(0)
J
2

After trying many scripts that did not address my issues, I wrote po3way.

It works by rewrapping the file in a way that can be deterministically undone and then letting git merge-file do the job. It's simple, but it works.

I just used it to forward port ~60 commits (around 5k msgids) from the French Python documentation translation: it works.

Jenifferjenilee answered 14/9, 2022 at 20:10 Comment(11)
Have you tried the script from https://mcmap.net/q/427760/-where-39-s-the-3-way-git-merge-driver-for-po-gettext-files ? That doesn't require any merging of strings with regexp magic based on .po-file binary syntax and it can still handle all merges I've tried and the resulting file has always been a valid .po file. Conflicts are marked with the fuzzy flag and the conflict is encoded in msgstr exactly as in standard msgcat syntax. It can also use multiple CPU cores and can merge files at a rate of around 1 MB/s.Formosa
An example of a really hard merge case is a translation with plural cases where the shared base version is modified by local branch for the n=1 case and the remote branch changed the n=2 case. Merge for that should be fully automatic because there's no conflict for either n=1 or n=2 case.Formosa
@MikkoRantalainen Yes, I tried it, thanks for sharing, I probably took some inspiration from it. Here's how my test goes: wyz.fr/2T-QX.Jenifferjenilee
What do you mean by "msgid got translated"? The translation should go to msgstr, right? msgid is the identifier, so if you add a different (that is, new) identifier, both should be kept.Formosa
Yes, sorry, when I wrote "msgid got translated" I meant "the msgstr of the msgid I'm speaking of got translated". I should have written "the entry got translated". In my example there's no new entry, but an updated entry.Jenifferjenilee
See the configuration setting MSGCAT_FINAL_FLAGS at the start of the script. I prefer the flags used there but you can obviously select the flags you like. The config is hardcoded because my intent is that the script is included in the repository so that the config is part of the script for that repository. If you have an example where the actual translation (the msgstr string) is incorrectly merged, I'd be interested in example files too.Formosa
See wyz.fr/2T-QX, the 5 first lines give you the files needed to reproduce the issue.Jenifferjenilee
I fail to see any problem with that merge. If I understand correctly, you think that the merge should somehow use heuristics to assume that msgid should be replaced by another value instead of keeping both identifiers and their matching translations. By definition, you cannot know during the merge which identifiers "should" be considered alternatives to each other. In particular, even if two msgstr values are the same, it's not safe to assume that you can merge different msgid values. Could you create a new question on this site about the problem you have with that merge?Formosa
Yes, I need that "magic", with a 3-way-merge it just works as expected (test po3way!). I fully agree that it cannot work if, for example, entries are shuffled in the two diverging branches: magic does not exist. (I also think StackOverflow comments are not a good issue tracker :().Jenifferjenilee
If I understood po3way correctly, it assumes that the order of files is the same and that a different msgid value replaces the previous one in the same location. However, if one localization string is removed and another one is added at that same location, po3way will merge the localizations even though it's not the correct action to take. If you can guarantee that the above situation cannot happen for your project, sure, you can use that kind of logic. For the generic case, msgid must be a unique identifier and the way po3way behaves is not safe.Formosa
po3way won't replace it, it'll emit a conflict for later human resolution, same if one is removed and another added, see github.com/JulienPalard/po3way README examples.Jenifferjenilee
M
1

Taking some inspiration from Mikko's answer, we've added a full-fledged 3-way merger to the git-whistles Ruby gem.

It doesn't rely on git-merge or on rewriting strings with Perl, and only manipulates PO files with Gettext tools.

Here's the code (MIT licensed):

#!/bin/bash
#
# Three-way merge driver for PO files
#
set -e

# failure handler
on_error() {
  local parent_lineno="$1"
  local message="$2"
  local code="${3:-1}"
  if [[ -n "$message" ]] ; then
    echo "Error on or near line ${parent_lineno}: ${message}; exiting with status ${code}"
  else
    echo "Error on or near line ${parent_lineno}; exiting with status ${code}"
  fi
  exit 255
}
trap 'on_error ${LINENO}' ERR

# given a file, find the path that matches its contents
show_file() {
  hash=`git hash-object "${1}"`
  git ls-tree -r HEAD | grep -F "$hash" | cut -b54-
}

# wraps msgmerge with default options
function m_msgmerge() {
  msgmerge --force-po --quiet --no-fuzzy-matching "$@"
}

# wraps msgcat with default options
function m_msgcat() {
  msgcat --force-po "$@"
}


# removes the "graveyard strings" from the input
function strip_graveyard() {
  sed -e '/^#~/d'
}

# select messages with a conflict marker
# pass -v to inverse selection
function grep_conflicts() {
  msggrep "$@" --msgstr -F -e '#-#-#' -
}

# select messages from $1 that are also in $2 but whose contents have changed
function extract_changes() {
  msgcat -o - $1 $2 \
    | grep_conflicts \
    | m_msgmerge -o - $1 - \
    | strip_graveyard
}


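# argument order: base, local, remote (the driver line in the git config must pass %O %A %B in this order)
BASE=$1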
LOCAL=$2
REMOTE=$3
OUTPUT=$LOCAL
TEMP=`mktemp /tmp/merge-po.XXXX`

echo "Using custom PO merge driver (`show_file ${LOCAL}`; $TEMP)"

# Extract the PO header from the current branch (top of file until first empty line)
sed -e '/^$/q' < $LOCAL > ${TEMP}.header

# clean input files
msguniq --force-po -o ${TEMP}.base   --unique ${BASE}
msguniq --force-po -o ${TEMP}.local  --unique ${LOCAL}
msguniq --force-po -o ${TEMP}.remote --unique ${REMOTE}

# messages changed on local
extract_changes ${TEMP}.local ${TEMP}.base > ${TEMP}.local-changes

# messages changed on remote
extract_changes ${TEMP}.remote ${TEMP}.base > ${TEMP}.remote-changes

# unchanged messages
m_msgcat -o - ${TEMP}.base ${TEMP}.local ${TEMP}.remote \
  | grep_conflicts -v \
  > ${TEMP}.unchanged

# messages changed on both local and remote (conflicts)
m_msgcat -o - ${TEMP}.remote-changes ${TEMP}.local-changes \
  | grep_conflicts \
  > ${TEMP}.conflicts

# messages changed on local, not on remote; and vice-versa
m_msgcat -o ${TEMP}.local-only  --unique ${TEMP}.local-changes  ${TEMP}.conflicts
m_msgcat -o ${TEMP}.remote-only --unique ${TEMP}.remote-changes ${TEMP}.conflicts

# the big merge
m_msgcat -o ${TEMP}.merge1 ${TEMP}.unchanged ${TEMP}.conflicts ${TEMP}.local-only ${TEMP}.remote-only

# create a template to filter messages actually needed (those on local and remote)
m_msgcat -o - ${TEMP}.local ${TEMP}.remote \
  | m_msgmerge -o ${TEMP}.merge2 ${TEMP}.merge1 -

# final merge, adds saved header
m_msgcat -o ${TEMP}.merge3 --use-first ${TEMP}.header ${TEMP}.merge2

# produce output file (overwrites input LOCAL file)
cat ${TEMP}.merge3 > $OUTPUT

# check for conflicts
if grep '#-#' $OUTPUT > /dev/null ; then
  echo "Conflict(s) detected"
  echo "   between ${TEMP}.local and ${TEMP}.remote"
  exit 1
fi
rm -f ${TEMP}*
exit 0
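The buckets that the script builds as temporary files (changes on each side, conflicts, one-sided changes) can be summarised as set operations. The following is only a rough Python sketch of that classification as I read the pipeline, not a port of the script: it assumes each PO file has been reduced to a dict of msgid to msgstr, the function name `classify` is made up for illustration, and headers, comments, flags and added or removed entries are not modelled.

```python
def classify(base, local, remote):
    """Return sets of msgids, roughly matching the script's temp files."""
    def changed(side):
        # ids present in both files whose translation differs (cf. extract_changes)
        return {k for k in side if k in base and side[k] != base[k]}

    local_changes = changed(local)
    remote_changes = changed(remote)
    # changed on both sides with different results: a real conflict
    conflicts = {k for k in local_changes & remote_changes
                 if local[k] != remote[k]}
    return {
        "conflicts": conflicts,
        "local_only": local_changes - conflicts,
        "remote_only": remote_changes - conflicts,
    }

base = {"foo": "foo", "a": "1", "b": "2"}
local = {"foo": "bar", "a": "1", "b": "L"}
remote = {"foo": "foo", "a": "1", "b": "R"}
buckets = classify(base, local, remote)
assert buckets["local_only"] == {"foo"}   # the case from the question merges cleanly
assert buckets["conflicts"] == {"b"}      # both sides changed it differently
assert buckets["remote_only"] == set()
```

An entry that both branches changed to the same text ends up in both one-sided buckets here, which is harmless because the two versions agree.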
Maze answered 9/4, 2015 at 10:13 Comment(1)
This was not stable enough for my use. I agree that this is the correct direction to go but in some cases the merge fails. I can't share the example case and I currently have no time to create minimal test case. I'll try to debug the issue when I have enough time. My complex driver below is able to successfully merge but that driver is an ugly hack.Formosa
S
0

I made a Python driver that nicely handles keys removed or introduced by either branch.

Here is its source:

#!/usr/bin/env python3
import subprocess
import sys

def default_merge_and_exit():
    # fall back to git's plain text merge and always report a conflict
    print("running default git merge", file=sys.stderr)
    subprocess.run(['git', 'merge-file', '-L', 'ours', '-L', 'base', '-L', 'theirs', sys.argv[1], sys.argv[2], sys.argv[3]])
    sys.exit(1)

# check if polib is available
try:
    import polib
except ModuleNotFoundError as err:
    print('polib is not installed', file=sys.stderr)
    default_merge_and_exit()

try:
    # create 3 dictionaries mapping msgid -> msgstr
    ours={}
    for e in polib.pofile(sys.argv[1]):
        ours[e.msgid]=e.msgstr
    base={}
    for e in polib.pofile(sys.argv[2]):
        base[e.msgid]=e.msgstr
    theirs={}
    for e in polib.pofile(sys.argv[3]):
        theirs[e.msgid]=e.msgstr

    all_keys=set(ours.keys())
    all_keys.update(base.keys())
    all_keys.update(theirs.keys())

    # check for conflicts
    conflicts=[]
    for key in sorted(all_keys):
        presence = (key in ours, key in base, key in theirs)
        if presence == (False, True, True) and base[key] != theirs[key]:
            conflicts.append(f"key removed by us and modified by them : {key}")
        if presence == (True, True, False) and base[key] != ours[key]:
            conflicts.append(f"key removed by them and modified by us : {key}")
        if presence == (True, False, True) and ours[key] != theirs[key]:
            conflicts.append(f"key added by them and us in a different way : {key}")
        if presence == (True, True, True) and base[key] != ours[key] and base[key] != theirs[key] and ours[key] != theirs[key]:
            conflicts.append(f"key modified by them and us in a different way : {key}")
    if conflicts:
        print(f"\nERROR : automerge for {sys.argv[1]} will conflict :", file=sys.stderr)
        for c in conflicts:
            print(c, file=sys.stderr)
        print("\n", file=sys.stderr)
        default_merge_and_exit()

    # update ours_po, knowing that there are no conflicts
    ours_po=polib.pofile(sys.argv[1])

    # mutate all entries with their modifications
    for e in ours_po:
        key=e.msgid
        if key in theirs and key in base and theirs[key] != base[key]:
            e.msgstr = theirs[key]

    # remove all entries removed by them
    # mutate the object without creating a new one https://mcmap.net/q/53385/-how-to-remove-items-from-a-list-while-iterating
    ours_po[:] = [e for e in ours_po if e.msgid in theirs]

    # add all entries introduced by them
    theirs_po=polib.pofile(sys.argv[3])
    for e in theirs_po:
        key=e.msgid
        if key not in ours:
            ours_po.append(e)

    # save result
    ours_po.save(sys.argv[1])

    # format result
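    formatted = subprocess.check_output(['msgcat', '--sort-output', sys.argv[1]], text=True)
    with open(sys.argv[1], 'w') as f:
        f.write(formatted)
except Exception:
    # SystemExit raised by default_merge_and_exit() is not an Exception,
    # so the fallback merge is not run a second time on an already merged file
    default_merge_and_exit()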
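Note that the driver above compares only msgid and msgstr, so entries with plural forms are not merged form by form. The hard case mentioned in the comments earlier (one branch changes the n=1 form, the other changes the n=2 form) could be handled with a per-form three-way comparison. A minimal sketch, assuming the plural forms of one entry are available as a dict from plural index to translation; the function name and data shape are illustrative and are not part of the driver:

```python
def merge_plural_forms(base, local, remote):
    """Three-way merge of one entry's plural forms.

    Each argument maps a plural index (0, 1, ...) to its translation.
    Returns (merged_forms, conflicting_indexes).
    """
    merged, conflicts = {}, []
    for idx in sorted(base.keys() | local.keys() | remote.keys()):
        b = base.get(idx, "")
        l = local.get(idx, "")
        r = remote.get(idx, "")
        if l == r or r == b:
            merged[idx] = l        # same result, or only local changed this form
        elif l == b:
            merged[idx] = r        # only remote changed this form
        else:
            merged[idx] = l        # both changed it differently: keep local, report
            conflicts.append(idx)
    return merged, conflicts

base = {0: "%d file", 1: "%d files"}
local = {0: "one file", 1: "%d files"}
remote = {0: "%d file", 1: "%d documents"}
merged, conflicts = merge_plural_forms(base, local, remote)
assert merged == {0: "one file", 1: "%d documents"}
assert conflicts == []
```

A form that is missing on one side is treated as an empty string here, which is a simplification; a real driver would have to decide how to report a conflicting form inside the entry.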
Slurry answered 8/11, 2021 at 9:15 Comment(0)
