sed: matching unicode blocks with


Asked 17/3, 2014 at 9:21 Answered 17/3, 2014 at 9:27

I am desperately trying to replace certain Unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely the ones from these Unicode blocks:

\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

I tried (in a sed config file loaded via the -f switch):

s/\p{InHigh_Surrogates}/###/  --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ -> error message 'Invalid content of \{\}'

Does anybody have a suggestion? I am not necessarily set on using the blocks, but I also failed when trying to define a character range of the form \xd800-\xdfff.

Thanks, Thomas

Pine asked 17/3, 2014 at 9:21 Comment(1)

The reason might be that surrogates are invalid in UTF-8. – Vanderpool 17/3, 2014 at 15:8
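
Following up on this comment, a minimal sketch rather than a confirmed fix. As far as I know, sed's regex syntax has no \p{...} Unicode property classes, and U+D800 to U+DFFF are not valid in UTF-8. One option is therefore to match the raw bytes that show up when a broken encoder writes surrogates anyway: ED, then A0 to BF, then 80 to BF. This assumes GNU sed (for the \xHH escapes) and that the file really contains those byte sequences; strip_surrogates is a made-up name.

```shell
# Hypothetical helper: replace UTF-8-style encoded surrogates with ###.
# Assumes GNU sed (\xHH escapes). LC_ALL=C makes sed treat input as raw bytes,
# so the invalid sequences can be matched byte by byte.
strip_surrogates() {
  LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g'
}

# Sample input: "a", an encoded U+D800 (bytes ED A0 80, written in octal), "b".
printf 'a\355\240\200b\n' | strip_surrogates   # with GNU sed: a###b
```

To target only the high surrogates (U+D800 to U+DBFF), the second byte range would narrow to \xA0-\xAF; for the low ones (U+DC00 to U+DFFF) it would be \xB0-\xBF.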

Try using the -r flag for sed:

$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

From man sed:

-r, --regexp-extended

use extended regular expressions in the script.
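
A note on what this command actually matches, sketched under the assumption of GNU sed: with -r, \\p is a literal backslash followed by p, and \{ and \} are literal braces. The expression therefore replaces the text "\p{InHigh_Surrogates}" itself, not characters from that block. literal_match_demo is a hypothetical name used only for illustration.

```shell
# Hypothetical wrapper around the command from the answer.
literal_match_demo() {
  sed -r 's/\\p\{InHigh_Surrogates\}/###/g'
}

# The literal text "\p{InHigh_Surrogates}" is replaced...
printf '%s\n' '\p{InHigh_Surrogates}: x' | literal_match_demo   # ###: x

# ...but a line holding an encoded U+D800 (bytes ED A0 80) passes through unchanged.
printf 'a\355\240\200b\n' | literal_match_demo
```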

Unglue answered 17/3, 2014 at 9:27 Comment(4)

Thanks! I tried that, which required changing some other lines as well, but InHigh_Surrogates still seems to be the problem... – Pine 17/3, 2014 at 12:34

But is it working for you or not? If not, please update your question with the exact problem you are facing. If it is, note that you can mark the answer as accepted. – Unglue 17/3, 2014 at 12:51

Sorry for being imprecise: no, it did not work using -r either. It seems to me that sed does not know about Unicode blocks, or I am too dumb to make it work ;) I cannot give any clearer explanation than the one provided. Either way, I get the same error message described in my initial posting. – Pine 18/3, 2014 at 16:18

I am sorry to say I don't know what else it could be. You can try checking this site for other options. For example, Remove unicode characters from textfiles - sed , other bash/shell methods – Unglue 18/3, 2014 at 16:20
