tr [:upper:] [:lower:] with Cyrillic text

Asked 14/11, 2012 at 15:25 Answered 19/5, 2021 at 10:29

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a PuTTY terminal). This is in Cygwin's bash shell, which should (should!) work identically to Bash on Linux.

What is a portable, reliable way to lowercase Unicode text in a pipe?

Espionage answered 14/11, 2012 at 15:25 Comment(7)

Conversion with sed works for me: echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/' – Bovine 14/11, 2012 at 15:37

echo "Г" | tr [:upper:] [:lower:] outputs "г" properly on a Mac OS X 10.8 system. – Report 14/11, 2012 at 16:8

Thanks @LevLevitsky . That's a suitable fix for me (feel free to promote it into an answer). I wonder why tr doesn't work. – Espionage 14/11, 2012 at 16:8

@Report Interesting, what version of tr is it? – Bovine 14/11, 2012 at 16:57

OS X tr is BSD tr. The man page says that historically LC_ALL was ignored, and now it is not. I guess that implies Unicode is supported. developer.apple.com/library/mac/#documentation/Darwin/Reference/… – Espionage 14/11, 2012 at 17:3

uname | tr "[:upper:]" "[:lower:]" outputs Linlx on OpenWrt. tr is BusyBox 1.34.1 – Charlottecharlottenburg 27/11, 2021 at 21:38

Per the macOS man page: The LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of tr as described in environ(7). – Ovid 26/7, 2023 at 7:11

This is what I found at Wikipedia (without any reference, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
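
To see why a byte-oriented tr cannot lowercase this, it may help to look at how Г is actually encoded. A small sketch, assuming the script and terminal use UTF-8 (byte_count is a made-up helper name, not a standard command):

```shell
#!/bin/sh
# Hypothetical helper: print the number of bytes in its argument.
byte_count() {
    printf '%s' "$1" | wc -c
}

byte_count 'G'   # Latin G: 1 byte
byte_count 'Г'   # Cyrillic Г (U+0413): 2 bytes in UTF-8 (0xD0 0x93)

# Show the individual bytes that a single-byte tr would see.
printf '%s' 'Г' | od -An -tx1
```

A tool that maps one byte to one byte has no single input byte that means "Г", so the [:upper:] class never matches it.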

Also, this is old but related.

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
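
Plugged into the pipeline from the question, this could look roughly like the sketch below. It assumes GNU sed (\L and \n in the replacement are GNU extensions) and a UTF-8 locale already set in the environment, e.g. LC_ALL=ru_RU.utf8. The function name wordlist is made up for illustration, and the lowercasing step matches the whole line (.*) instead of only a leading run of capitals, so words with leading or mixed-case characters are handled too.

```shell
#!/bin/sh
# Hypothetical helper: print the unique, lowercased words read from stdin.
# Assumes GNU sed and a UTF-8 locale in the environment.
wordlist() {
    # 1. Put each whitespace-separated word on its own line.
    # 2. Strip the punctuation listed in the question.
    # 3. Lowercase every line with sed instead of tr.
    # 4. Sort and drop duplicates.
    sed -E 's/[[:space:]]+/\n/g' |
    sed 's/[.!,—()«»;:?]//g' |
    sed 's/.*/\L&/' |
    sort -u
}

# Usage: wordlist < story.txt   (story.txt is a placeholder file name)
```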

Bovine answered 14/11, 2012 at 16:40 Comment(4)

Yes, the single-byte issue is true. I once reported this as a bug to GNU and they explained this is so by design (i.e. they would have to break compatibility with old software in order to fix it). I also discussed it on a mailing list and was similarly told it was supposed to be that way. – Menfolk 14/11, 2012 at 18:28

Remember to add g flag to the regular expression, if you want to replace all occurrences. – Novellanovello 9/1, 2015 at 15:22

If you add a space to the beginning, it will not work: echo ' СТЭК' | sed 's/[[:upper:]]*/\L&/' => ' СТЭК'. It seems this one works better: echo ' СТЭК' | sed 's/.*/\L&/' => ' стэк'. Tested on GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) – Appendicle 7/7, 2018 at 23:36

Just need to add g: 's/[[:upper:]]*/\L&/g' – Mabellemable 28/4, 2020 at 3:28

This works for me:

echo ЫЕРУНКЫКТ | sed -e 's/\(.*\)/\L\1/'

Loveland answered 19/5, 2021 at 10:29 Comment(0)
