How to search & replace arbitrary literal strings in sed and awk (and perl)
Say we have some arbitrary literals in a file that we need to replace with some other literal.

Normally, we'd just reach for sed(1) or awk(1) and code something like:

sed "s/$target/$replacement/g" file.txt

But what if $target and/or $replacement contain characters that are special to sed(1), such as regular-expression metacharacters? You could escape them, but suppose you don't know in advance what they are: they are arbitrary. You'd need to write something that escapes every possible special character, including the '/' separator, e.g.:

t=$( echo "$target" | sed 's/\./\\./g; s/\*/\\*/g; s/\[/\\[/g; ...' ) # arghhh!

That's pretty awkward for such a simple problem.
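
For single-line strings, the escaping described above can be collapsed into one bracket expression per string. What follows is only a minimal sketch, assuming POSIX basic regular expressions, the default '/' delimiter, and no newlines in either string; the function name sed_literal and the sl_ variables are made up for illustration.

```shell
#!/bin/sh
# sed_literal TARGET REPLACEMENT [FILE...]
# Hypothetical helper: replaces every literal TARGET with literal REPLACEMENT.
# Assumes BRE syntax, '/' as the s-command delimiter, and single-line strings.
sed_literal() {
    # In the pattern, backslash-escape the BRE specials [ \ . * ^ $ and the delimiter.
    sl_t=$(printf '%s\n' "$1" | sed 's|[[\.*^$/]|\\&|g')
    # In the replacement, only & and \ are special, plus the delimiter.
    sl_r=$(printf '%s\n' "$2" | sed 's|[&/\]|\\&|g')
    shift 2
    sed "s/$sl_t/$sl_r/g" "$@"
}

# Demo: both strings are full of characters sed would normally interpret.
printf '%s\n' 'cost: 1.5/2*[x] and 1x5/2' | sed_literal '1.5/2*[x]' '&/\1'
```

The helper uses '|' as the delimiter for the two escaping commands so that '/' can sit inside the bracket expressions without any special treatment. It does not handle an empty target or strings containing newlines.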

perl(1) has \Q ... \E quotes but even that can't cope with the '/' separator in $target.

perl -pe "s/\Q$target\E/$replacement/g" file.txt

I just posted an answer!! So my real question is, "is there a better way to do literal replacements in sed/awk/perl?"

If not, I'll leave this here in case it comes in useful.

Diaeresis answered 6/1, 2019 at 7:55 Comment(2)
For sed, see this previous discussion: "Is it possible to escape regex metacharacters reliably with sed" - Oisin
See also "Why is the escape of quotes lost in this regex substitution?" - Estabrook
D
3

Me again!

Here's a simpler way using xxd(1):

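# printf '%s' avoids echo's own option and escape handling
t=$( printf '%s' "$target" | xxd -p | tr -d '\n' )
r=$( printf '%s' "$replacement" | xxd -p | tr -d '\n' )
# join the file's hex dump into one line too, so a match can span xxd's line breaks
xxd -p file.txt | tr -d '\n' | sed "s/$t/$r/g" | xxd -p -r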

... so we're hex-encoding the original text with xxd(1) and doing search-replacement using hex-encoded search strings. Finally we hex-decode the result.

EDIT: I forgot to remove \n from the xxd output (| tr -d '\n') so that patterns can span the 60-column output of xxd. Of course, this relies on GNU sed's ability to operate on very long lines (limited only by memory).

EDIT: this also works on multi-line targets, e.g.:

target=$'foo\nbar' replacement=$'bar\nfoo'

Diaeresis answered 6/1, 2019 at 7:55 Comment(1)
When I first saw this answer, I thought it was brilliant.  A few minutes later, I realized that, like a diamond, it was flawed.  For example, if you try to change E to g in a file that contains $Q, it will change to &q.  This is because E is 45, g is 67, and $Q is 2451, so, when you do s/45/67/, you change 2451 to 2671, which is &q (26 + 71).  … I have posted an answer that addresses this issue. - Kaspar
B
8

quotemeta, which implements \Q, does exactly what you ask for:

all ASCII characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash

Since this is presumably in a shell script, the real problem is how and when shell variables get interpolated, and therefore what the Perl program ends up seeing.

The best way is to avoid working out that interpolation mess and instead properly pass those shell variables to the Perl one-liner. This can be done in several ways; see this post for details.

Either pass the shell variables simply as arguments

#!/bin/bash

# define $target

perl -pe 'BEGIN { $patt = shift; $repl = shift }; s{\Q$patt}{$repl}g' "$target" "$replacement" file.txt

where the needed arguments are removed from @ARGV in a BEGIN block, i.e. at compile time, before the -p loop opens any files; then file.txt gets processed. There is no need for \E in the regex here.

Or, use the -s switch, which enables command-line switches for the program

# define $target, etc

perl -s -pe 's{\Q$patt}{$repl}g' -- -patt="$target" -repl="$replacement" file.txt

The -- is needed to mark the start of arguments, and switches must come before filenames.

Finally, you can also export the shell variables, which can then be used in the Perl script via %ENV; but in general I'd rather recommend either of the above two approaches.


A full example

#!/bin/bash
# Last modified: 2019 Jan 06 (22:15)

target="/{"
replacement="&"

echo "Replace $target with $replacement"

perl -wE'
    BEGIN { $p = shift; $r = shift }; 
    $_=q(ah/{yes); s/\Q$p/$r/; say
' "$target" "$replacement"

This prints

Replace /{ with &
ah&yes

where I've used characters mentioned in a comment.

The other way

#!/bin/bash
# Last modified: 2019 Jan 06 (22:05)

target="/{"
replacement="&"

echo "Replace $target with $replacement"

perl -s -wE'$_ = q(ah/{yes); s/\Q$patt/$repl/; say' \
    -- -patt="$target" -repl="$replacement"

where code is broken over lines for readability here (and thus needs the \). Same printout.

Basinet answered 6/1, 2019 at 8:11 Comment(16)
If $target is a shell variable which contains the separator character, Perl only sees the script after variables have been interpolated by the shell, and ends up with a syntax error at best, and a security problem at worst. You have to tell Perl which part is a variable, perhaps by passing it as a command-line argument. - Herefordshire
@Herefordshire Indeed, completely missed their implied context. Edited. - Basinet
@melpomene Thank you for the edit, it is absolutely better to quote it in the shell. (But it's always worked unquoted in my tests in bash; I wonder whether I've been lucking out with specific software versions?) - Basinet
@Basinet Did your $target test strings contain interesting characters like *, ? or spaces? - Nowicki
Found a related link: mywiki.wooledge.org/BashPitfalls#cp_.24file_.24target - Nowicki
@Nowicki OK, thank you, good to have a reference. (I don't do this often and just forgot that with spaces it's of course parsed into words if unprotected, and the other "interesting" chars may as well mess it up). Thank you, now I need to go to some previous posts and correct it at a few more places I think (recalled one so far). - Basinet
Yes, thanks - { ... } gets over the problem in perl with s/.../.../ - although I had to add an extra backslash to fix the syntax: target='{'; replacement='&'; echo '{' | perl -s -pe"s{\Q\$patt}{$replacement}g" -- -patt="$target" - Diaeresis
@Diaeresis Thank you for the well-meant edit suggestion -- but I can't do that: what I posted works in my tests, as it should. I don't see why you'd need to escape that, and how it would work then? That $patt is a variable in the perl script, set by courtesy of -s to whatever $target was evaluated to in the shell. In my tests I've set it in a bash script to hey* wa?\no, for example, and it gets into the Perl script as it should. I've also tried it in a regex as a short string having / and it works. Doesn't need the {}{} btw, it works with s/// as well. - Basinet
@Diaeresis Also: pass the $replacement as well. If you use the form with -s then just add another switch, say -repl="$replacement" and in the regex you can use the variable $repl. See the linked post for a little more explanation. - Basinet
@Diaeresis Added a full example, so that we are on the same page. This is on CentOS7, Perl v5.16, GNU bash 4.2.46(1) - Basinet
Hmmm interesting. I wonder if it's the different environment - I'm on fedora-29 with perl-interpreter-5.28.1-427.fc29.x86_64 but I've also tried them out on fedora-24 and perl-5.22.4-372.fc24.x86_64 and a debian system with perl-5.24 with complete congruence. The command I gave above worked as expected with \$patt - it's only soft-quoted so without the backslash the shell expands $patt instead of passing it for perl to expand. Also, piping into perl -p works fine for me - maybe a change in perl since 5.16? You're quite right that s/.../.../ works but, again, only if I use \$patt. - Diaeresis
I think your new examples look better as they have $patt hard-quoted - I'll have a play with them. - Diaeresis
@Diaeresis From what I see you do have double-quotes, as you must; so "$target" is evaluated by the shell into a string that is intended, and then that's what Perl receives (in either way). (I missed the piping in your comment, sorry -- -p sure works with it. Removed that comment.) - Basinet
@Diaeresis (1) Tried again, it works for me either way (with data piped, or in file, or in script) (2) Passing arguments directly via @ARGV (and extracting them in the BEGIN block) works for me and should be most reliable. - Basinet
This is the correct answer, but it didn't explain why perl doesn't interpret the string from a shell script correctly so I wrote my own answer explaining that part: https://mcmap.net/q/1157523/-how-to-search-amp-replace-arbitrary-literal-strings-in-sed-and-awk-and-perl - Irrational
@QuinnComendant Yeah, that's what tripleee's (first) comment in this thread asserted and what the second sentence in the answer refers to. Good to have it spelled out in your answer :) - Basinet
S
1

With awk you could do it like this:

awk -v t="$target" -v r="$replacement" '{gsub(t,r)} 1' file

The above treats t as a regular expression; to treat it as a literal string instead, you can use

awk -v t="$target" -v r="$replacement" '{while(i=index($0,t)){$0 = substr($0,1,i-1) r substr($0,i+length(t))} print}' file

Inspired by this post

Note that this loops forever if the replacement string contains the target, because every pass finds the target again in the text it has just inserted. The above link has solutions for that too.
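
One way around the replacement-contains-target problem is to consume each line from left to right, so that inserted text is never scanned again. The sketch below is an assumption-laden illustration, not the linked solution: the function name awk_literal and the environment-variable names T and R are made up, and the strings are passed through ENVIRON because awk -v would interpret backslash escapes in them.

```shell
#!/bin/sh
# awk_literal TARGET REPLACEMENT [FILE...]
# Hypothetical helper: literal, single-line search and replace using index().
awk_literal() {
    al_t=$1 al_r=$2
    shift 2
    T="$al_t" R="$al_r" awk '
        BEGIN { t = ENVIRON["T"]; r = ENVIRON["R"]; n = length(t) }
        n == 0 { print; next }          # nothing to search for
        {
            out = ""
            rest = $0
            # Move matched text from rest to out; out is never searched again.
            while ((i = index(rest, t)) > 0) {
                out = out substr(rest, 1, i - 1) r
                rest = substr(rest, i + n)
            }
            print out rest
        }
    ' "$@"
}

# Demo: the replacement contains the target.
printf '%s\n' 'banana' | awk_literal 'a' 'aa'
```

Since neither string is ever used as a regex or as a gsub() replacement, no escaping is needed at all. Targets that contain a newline are not handled, because awk reads one line per record here.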

Straightedge answered 6/1, 2019 at 8:3 Comment(2)
@Sundeep: You are right, thankfully glenn jackman found a solution for that case. - Straightedge
Thanks for the link - that solution is pretty complex but since it's all in awk, it's probably faster than the xxd solution, if performance ever becomes an issue. I do think the xxd solution is simpler to read and understand. - Diaeresis
P
1

This is an enhancement of wef’s answer.

We can remove the issue of the special meaning of various special characters and strings (^, ., [, *, $, \(, \), \{, \}, \+, \?, &, \1, …, whatever, and the / delimiter) by removing the special characters.  Specifically, we can convert everything to hex; then we have only 0-9 and a-f to deal with.  This example demonstrates the principle:

$ echo -n '3.14' | xxd
0000000: 332e 3134                                3.14

$ echo -n 'pi'   | xxd
0000000: 7069                                     pi

$ echo '3.14 is a transcendental number.  3614 is an integer.' | xxd
0000000: 332e 3134 2069 7320 6120 7472 616e 7363  3.14 is a transc
0000010: 656e 6465 6e74 616c 206e 756d 6265 722e  endental number.
0000020: 2020 3336 3134 2069 7320 616e 2069 6e74    3614 is an int
0000030: 6567 6572 2e0a                           eger..

$ echo "3.14 is a transcendental number.  3614 is an integer." | xxd -p \
      | sed 's/332e3134/7069/g' | xxd -p -r
pi is a transcendental number.  3614 is an integer.

whereas, of course, sed 's/3.14/pi/g' would also change 3614.

The above is a slight oversimplification; it doesn’t account for boundaries.  Consider this (somewhat contrived) example:

$ echo -n 'E' | xxd
0000000: 45                                       E

$ echo -n 'g' | xxd
0000000: 67                                       g

$ echo '$Q Eak!' | xxd
0000000: 2451 2045 616b 210a                      $Q Eak!.

$ echo '$Q Eak!' | xxd -p | sed 's/45/67/g' | xxd -p -r
&q gak!

Because $ (24) and Q (51) combine to form 2451, the s/45/67/g command rips it apart from the inside.  It changes 2451 to 2671, which is &q (26 + 71).  We can prevent that by separating the bytes of data in the search text, the replacement text and the file with spaces.  Here’s a stylized solution:

encode() {
        xxd -p    -- "$@" | sed 's/../& /g' | tr -d '\n'
}
decode() {
        xxd -p -r -- "$@"
}
left=$( printf '%s' "$search"      | encode)
right=$(printf '%s' "$replacement" | encode)
encode file.txt | sed "s/$left/$right/g" | decode

I defined an encode function because I used that functionality three times, and then I defined decode for symmetry.  If you don’t want to define a decode function, just change the last line to

encode file.txt | sed "s/$left/$right/g" | xxd -p -r

Note that the encode function triples the size of the data (text) in the file, and then sends it through sed as a single line — without even having a newline at the end.  GNU sed seems to be able to handle this; other versions might not be able to.

As an added bonus, this solution handles multi-line search and replace (in other words, search and replacement strings that contain newline(s)).

Pamphylia answered 1/4, 2020 at 4:31 Comment(4)
Quite correct AFAICS even though it's an extreme edge case - the right thing to do for the completely general case. My original use-case was moderately long targets unlikely to occur. Please accept an upvote and thanks for the thoughtful answer. - Diaeresis
Hex-encoding is a brilliant idea, but encoding and decoding a large input file just to substitute literals is overkill when hex-encoding the regex alone will work just fine. - Timaru
Turns out hex-encoding the regex doesn't even work after some stress testing. sigh However, it's worth looking at answers on the earlier sed-specific question, since this approach still might be prohibitive for large inputs. - Timaru
@djpohly: Can you / will you explain what went wrong?   Unless you got heavily downvoted, it might be more useful to undelete your answer and include a “disclosures” or “limitations” paragraph describing when it does and doesn’t work. - Kaspar
I
1

I can explain why this doesn't work:

perl(1) has \Q ... \E quotes but even that can't cope with the '/' separator in $target.

The reason is that the \Q and \E (quotemeta) escapes are processed after the regex is parsed, and a regex is not parsed unless valid pattern delimiters define it.

As an example, here's an attempt to replace the string /etc/ in /etc/hosts by using a variable in a string passed to perl:

target="/etc/"
perl -pe "s/\Q$target\E/XXX/" <<<"/etc/hosts"

After the shell expands the variable in the string, perl receives the command s/\Q/etc/\E/XXX/ which is not a valid regex because it doesn't contain three pattern delimiters (perl sees five delimiters, i.e., s/…/…/…/…/). Therefore, the \Q and \E are never even executed.
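
To see exactly what perl is handed, the same strings can be printed instead of executed. This is only an illustration of the shell's quoting rules; it does not run perl at all.

```shell
#!/bin/sh
target='/etc/'

# Double quotes: the shell substitutes $target before perl would ever run.
printf '%s\n' "s/\Q$target\E/XXX/"

# Single quotes: the text would reach perl untouched, so $target is left
# for perl to treat as one of its own variables.
printf '%s\n' 's/\Q$target\E/XXX/'
```

The first line printed contains five slashes, which is the malformed command described above; the second still contains the literal text $target.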

The solution, as @zdim suggested, is to pass the variables to perl in a way that includes them in the regex after the regex is parsed, for example:

perl -s -pe 's/\Q$target\E/XXX/ig' -- -target="/etc/" <<<"/etc/123"
Irrational answered 5/10, 2021 at 21:41 Comment(0)
V
0

awk escaping is not all that complex either:

For the search regex, just these 2 steps suffice for any and all awk variants: simply "cage" every character in a bracket expression, with additional backslash-escaping performed only for the circumflex/caret and the backslash itself:

-- technically you don't need to escape space at all. Sometimes I like caging it to mark an unambiguous anchoring point for the character, instead of letting awk be too agile about how it handles spaces and tabs. Swap the space for "!" inside the regex if you like.

  jot -s '' -c - 32 126 |
  mawk 'gsub("[[-\140{-~:-@ -/]", "[&]") \
        gsub(/\\|\^/, "\\\\&")^_' FS='^$' RS='^$'

  • \140 is (`) - I'm just not a fan of having those exposed in my code

which prints:

  [ ][!]["][#][$][%][&]['][(][)][*][+] [,] [-][.] [/]   # re-aligned for 
  0123456789                    [:][;] [<] [=][>] [?]   # readability
  [@]ABCDEFGHIJKLMNOPQRSTUVWXYZ    [[][\\] []][\^][_]
  [`]abcdefghijklmnopqrstuvwxyz    [{] [|] [}][~]

As for the replacement, only a literal "&" needs to be escaped:

gsub(target_regex, "&")         # nothing escaped
      matched text
gsub(target_regex, "\\&")       # 2 backslashes
      literal "&"
gsub("[[:punct:]]", "\\\\&")    # 4 backslashes 
  \!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~

-- (personally I prefer using square brackets, i.e. char classes, as an escaping mechanism over having backslashes galore)

gsub("[[:punct:]]", "\\\\\\&")   # 6 backslashes
  \&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&\&

Use 6 backslashes only if you're planning to feed this output further down to another gsub()/match() function call.
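
The caging idea can be wrapped in a small awk function. The sketch below is a variation, not the exact code above: it cages every character except \ and ^, and backslash-escapes those two outside any brackets. The names awk_starts_with, re_quote and P are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# awk_starts_with PREFIX [FILE...]
# Hypothetical helper: prints the lines that begin with the literal PREFIX.
awk_starts_with() {
    as_p=$1
    shift
    P="$as_p" awk '
        # Cage every character except \ and ^ in its own bracket expression,
        # then backslash-escape \ and ^ where they stand.
        function re_quote(s) {
            gsub(/[^\\^]/, "[&]", s)
            gsub(/[\\^]/, "\\\\&", s)
            return s
        }
        BEGIN { re = "^" re_quote(ENVIRON["P"]) }
        $0 ~ re
    ' "$@"
}

# Demo: the prefix is matched literally, so "axbb" is not a match.
printf '%s\n' 'a.b* rest' 'axbb rest' | awk_starts_with 'a.b*'
```

Quoting the string as a regex is mainly useful when the literal has to be combined with real regex syntax, such as the ^ anchor here; for a plain literal search, index() avoids regexes altogether.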

Vinni answered 1/8, 2022 at 8:17 Comment(0)
