sed: matching unicode blocks with

I am desperately trying to replace certain Unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely those from the following Unicode blocks:

\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

I tried (in a sed config file loaded via the -f switch):

s/\p{InHigh_Surrogates}/###/  --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ -> error message 'Invalid content of \{\}'

Does anybody have a suggestion? I am not set on using the blocks, but I also failed when trying to define a character range of the form \xd800-\xdfff.

Thanks, Thomas

Pine answered 17/3, 2014 at 9:21 Comment(1)
The reason might be that surrogates are invalid in UTF-8. - Vanderpool
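
The comment above points at why a range such as \xd800-\xdfff fails: in a UTF-8 locale, surrogate code points are not valid characters, so sed has nothing to match as a character. One possible workaround is to match the raw bytes in the C locale instead. The sketch below rests on two assumptions: GNU sed is in use (the `\xHH` escapes are a GNU extension), and the file contains surrogates written as three-byte sequences (0xED, then 0xA0-0xBF, then 0x80-0xBF). The function name `replace_surrogates` is hypothetical.

```shell
# Hypothetical helper: replace each encoded surrogate (U+D800-U+DFFF,
# i.e. the bytes ED, A0..BF, 80..BF) with "###".
# LC_ALL=C makes sed work on single bytes; \xHH is a GNU sed extension.
replace_surrogates() {
    LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g' "$@"
}

# Sample input: "a", the bytes ED A0 80 (an encoded U+D800), then "b".
printf 'a\355\240\200b\n' | replace_surrogates
```

With GNU sed, the sample line should come out as `a###b`. Valid characters just below the surrogate range, such as U+D7FF (bytes ED 9F BF), are left alone because their second byte falls outside A0-BF.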

Try using the -r flag for sed:

$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

From man sed:

-r, --regexp-extended

use extended regular expressions in the script.

Unglue answered 17/3, 2014 at 9:27 Comment(4)
Thanks! I tried that, and it required changing some other lines as well, but InHigh_Surrogates still seems to be the problem... - Pine
But is it working for you or not? If not, please update your question with the exact problem you are facing. If it is, note that you can mark the answer as accepted. - Unglue
Sorry for being imprecise: no, it did not work using -r either. It seems to me that sed does not know about Unicode blocks, or I am too dumb to make it work ;) I cannot give any clearer explanation than the one provided. Either way, I get the same error message described in my initial posting. - Pine
I am sorry to say I don't know what else it could be. You can try checking this site for possible options, for example "Remove unicode characters from textfiles - sed, other bash/shell methods". - Unglue
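
A note on why the `-r` command in the answer works on its own sample yet does not help with real data: in extended syntax, `\\p\{InHigh_Surrogates\}` is an escaped literal string, so it matches the text `\p{InHigh_Surrogates}` rather than any character from that block. POSIX regular expressions, which sed uses, have no `\p{...}` property classes. A minimal sketch, assuming GNU sed (for `-r`); the variable names are illustrative only.

```shell
# The pattern is a literal backslash, "p", and literal braces: it matches
# the text "\p{InHigh_Surrogates}", not characters from that Unicode block.
pattern='s/\\p\{InHigh_Surrogates\}/###/g'

# A line that contains the block name as plain text is rewritten...
literal_result=$(printf '%s\n' '\p{InHigh_Surrogates}: U+D800' | sed -r "$pattern")
printf '%s\n' "$literal_result"

# ...while a line without that literal text passes through unchanged.
plain_result=$(printf '%s\n' 'no block name here' | sed -r "$pattern")
printf '%s\n' "$plain_result"
```

The first line becomes `###: U+D800`, matching the output shown in the answer, and the second is printed as it was. Without `-r`, the same `\{...\}` is read as a repetition interval, which is consistent with the 'Invalid content of \{\}' error in the question.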
