Will sed (and others) corrupt non-ASCII files?

If I write scripts that manipulate files, for example doing search/replace with sed, and the files can be in various charsets, can the files get corrupted?

The text I want to replace is ASCII and occurs only on lines that contain nothing but ASCII; the other lines in the files contain characters in other charsets.

Dacey answered 12/3, 2012 at 16:26 Comment(6)
My answer was to your only question. However, it appears that other charsets MIGHT work. As you've recv'd no answers here so far, it seems your best bet would be to search elsewhere.Geomorphology
You can easily test this by copying some of your files to a temporary directory, modifying them with sed, and then checking whether the modified files break the programs that use them. Good luck.Cathay
Well, the files are too many and too big to test thoroughly. Was hoping for an expert opinion. :)Dacey
Would "diff" be able to tell me if any non-ASCII content in the files got changed (on a line-by-line basis)? Does the -a switch handle non-ASCII charsets?Dacey
These things are not properly standardized, so the general answer is that this will depend on your platform. In practical terms, Linux tends to be more robust with 8-bit data than, say, BSD.Haymaker
Also, Perl is probably more robust than sed. There is a script s2p in the Perl distribution to translate sed to Perl, but simple search and replace scripts are basically identical.Haymaker

If your charsets are single-byte encodings (like the ISO-8859-n family) or UTF-8, where the newline character is encoded the same as in ASCII and the NUL character (\0) doesn't occur, your operation is likely to work. If the files use UTF-16, it will not (because of the NUL bytes). The reason it should work for simple search and replacement of ASCII strings: assuming your encoding is a superset of ASCII, for a simple match like this sed mostly works at the byte level and just replaces one byte sequence with another.
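
The byte-level claim can be checked on a small scale. The following is only a sketch, assuming a POSIX-like shell with printf, sed, od and mktemp; the file names and sample text are made up for illustration. It writes one ASCII-only line and one line containing ISO-8859-2 bytes, runs an ASCII replacement in the C locale, and compares hex dumps of the non-ASCII line before and after:

```shell
# Hypothetical sample file: line 1 is pure ASCII, line 2 holds Latin-2 bytes
# (\352 = 0xEA = "ę", \266 = 0xB6 = "ś" in ISO-8859-2).
tmpdir=$(mktemp -d)
printf 'name=foo\ng\352\266\n' > "$tmpdir/sample.txt"

# Replace an ASCII string; with LC_ALL=C sed treats every byte as one character.
LC_ALL=C sed -e 's/foo/bar/' "$tmpdir/sample.txt" > "$tmpdir/sample.out"

# Hex-dump line 2 of both files to see whether the non-ASCII bytes survived.
before=$(LC_ALL=C sed -n '2p' "$tmpdir/sample.txt" | od -An -tx1 | tr -d ' \n')
after=$(LC_ALL=C sed -n '2p' "$tmpdir/sample.out" | od -An -tx1 | tr -d ' \n')
first=$(LC_ALL=C sed -n '1p' "$tmpdir/sample.out")

echo "line 1 after:  $first"
echo "line 2 before: $before"
echo "line 2 after:  $after"
rm -rf "$tmpdir"
```

If the two hex strings for line 2 are identical, the replacement touched only the ASCII line.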

But with more complex operations, for example when the search or replacement strings contain special characters, your results may vary. For instance, the accented characters you type on your command line might not match the encoding of your file if the console encoding/locale differs from the file encoding. One can work around this, but it requires care.
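
To illustrate the mismatch, here is a rough sketch that assumes a POSIX-like shell with printf, sed and od. The helper name hex and the variable names are invented for this example, and the bytes are spelled out as octal escapes instead of being converted with a tool such as iconv, so the script itself stays pure ASCII. The input line is "gęś" stored as ISO-8859-2, and the same letter ę is searched for once as UTF-8 bytes and once as the Latin-2 byte:

```shell
# Bytes of "ę" in the two encodings:
e_utf8=$(printf '\304\231')   # U+0119 as UTF-8: c4 99
e_latin2=$(printf '\352')     # the same letter in ISO-8859-2: ea

# Helper (hypothetical name): hex-dump stdin as one compact string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Input "gęś" in ISO-8859-2 is 67 ea b6, followed by newline 0a.
# A pattern typed in a UTF-8 terminal does not match the Latin-2 bytes...
miss=$(printf 'g\352\266\n' | LC_ALL=C sed -e "s/$e_utf8/e/" | hex)
# ...while the same letter expressed in the file's encoding does.
hit=$(printf 'g\352\266\n' | LC_ALL=C sed -e "s/$e_latin2/e/" | hex)

echo "UTF-8 pattern:   $miss"   # 67eab60a (line unchanged)
echo "Latin-2 pattern: $hit"    # 6765b60a (ea replaced by 65, i.e. "e")
```

Nothing is corrupted in the first case; the replacement simply never happens, which can be just as surprising in a script.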

Some operations in sed depend on your locale, for example which characters are considered alphanumeric. Compare the following replacement performed in a Polish UTF-8 locale and in the C locale, which uses ASCII:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/[[:alnum:]]/X/g'
XXX XXXXXX
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/[[:alnum:]]/X/g'
Xęś XęXXłX

But if you only want to replace literal strings, it works as expected:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/g/G/g'
Gęś GęGała
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/g/G/g'
Gęś GęGała

As the first pair of commands shows, the result of a character-class match differs because accented characters are treated differently depending on the locale, while the literal replacement gives the same result in both locales. In short: replacing literal ASCII strings will most probably work fine; more complex operations need a closer look and may or may not work.
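
When the files are too many or too large to inspect by hand, a run can be spot-checked mechanically. The sketch below is one possible approach, not a definitive recipe. It assumes the -a (treat as text) options of diff and grep as found in the GNU and BSD tools, and the file names and sample data are hypothetical. It counts the changed lines that contain anything other than printable ASCII and whitespace:

```shell
# Hypothetical before/after pair; line 2 carries Latin-2 bytes and, on purpose,
# also contains the search string so that the check has something to report.
work=$(mktemp -d)
printf 'name=foo\ng\352\266 foo\nplain\n' > "$work/before.txt"
LC_ALL=C sed -e 's/foo/bar/' "$work/before.txt" > "$work/after.txt"

# diff -a forces text mode; changed lines start with "< " (old) or "> " (new).
changed=$(LC_ALL=C diff -a "$work/before.txt" "$work/after.txt" \
  | LC_ALL=C grep -a '^[<>] ' || true)

# Count changed lines containing a byte outside printable ASCII and whitespace.
suspicious=$(printf '%s\n' "$changed" \
  | LC_ALL=C grep -a -c '[^[:print:][:space:]]' || true)

echo "changed lines with non-ASCII bytes: $suspicious"
rm -rf "$work"
```

A count of 0 would mean that every changed line was ASCII-only, which is the situation described in the question; any other number points at lines worth examining.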

Debarath answered 13/3, 2012 at 17:34 Comment(2)
+100 for the helpful explanation and the effort you have taken to help a stranger. So it looks like changing the environment language/charset to match each target file will help. Additionally, my search/replace is only on ASCII characters in lines that have only ASCII characters. Am I right to understand that lines with other characters should only pose a potential problem if the charset is such that newlines may be confused?Dacey
@Dacey Yes. If you only replace literal strings with other literal strings, it's enough for newline to be the same as in ASCII and for NUL not to appear for the replacements to work. Characters outside of ASCII should be OK. Of course, even if the replacement is correct byte-wise, you need to make sure it means what you want it to mean in the target's encoding. If you insert ę encoded as UTF-8 into a file which uses Latin-2, the replacement will correctly insert the 2 bytes which represent ę in UTF-8, but these bytes will show as junk when displayed as Latin-2 along with the rest of the file.Thegn
