tr [:upper:] [:lower:] with Cyrillic text

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a PuTTY terminal). This is in Cygwin's bash shell; it should work identically to Bash on Linux (should!).

What is a portable, reliable way to lowercase Unicode text in a pipe?

Espionage answered 14/11, 2012 at 15:25 Comment(7)
Conversion with sed works for me: echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/' - Bovine
echo "Г" | tr [:upper:] [:lower:] outputs "г" properly on a Mac OS X 10.8 system. - Report
Thanks @LevLevitsky. That's a suitable fix for me (feel free to promote it into an answer). I wonder why tr doesn't work. - Espionage
@Report Interesting, what version of tr is it? - Bovine
OS X tr is BSD tr. The manpage says that historically LC_ALL was ignored, and now it is not. I guess that implies Unicode is supported. developer.apple.com/library/mac/#documentation/Darwin/Reference/… - Espionage
uname | tr "[:upper:]" "[:lower:]" outputs Linlx on OpenWrt. tr is BusyBox 1.34.1. - Charlottecharlottenburg
Per the macOS manpage: The LANG, LC_ALL, LC_CTYPE and LC_COLLATE environment variables affect the execution of tr as described in environ(7). - Ovid

This is what I found on Wikipedia (without any reference, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.

Also, this is old but related.
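
To see what "single-byte" means in practice, it may help to look at the bytes themselves. This is a minimal sketch, assuming the text is UTF-8 encoded; hex_bytes is a hypothetical helper name, not a standard tool:

```shell
# hex_bytes (hypothetical helper): print the bytes of its argument in hex.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

hex_bytes 'Г'; echo   # d093 (capital letter, two bytes)
hex_bytes 'г'; echo   # d0b3 (small letter, two bytes)
```

Each Cyrillic letter is two bytes in UTF-8. A tr that works byte by byte sees 0xd0 and 0x93 separately, and neither byte on its own is an uppercase letter, so the input passes through unchanged.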

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
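
Dropped into the pipeline from the question, the sed approach might look like the sketch below. It assumes GNU sed (for \s, \n in the replacement, and \L) and a UTF-8 locale such as the question's ru_RU.utf8; wordlist is a hypothetical function name, and the sed step uses .* so that the whole line is lowercased:

```shell
#!/bin/sh
# Assumes GNU sed and a UTF-8 locale, e.g. as in the question:
#   export LC_ALL=ru_RU.utf8

# wordlist (hypothetical name): read text on stdin, print unique lowercased words.
wordlist() {
    sed -E 's/\s+/\n/g' |          # one word per line
    sed 's/[.!,—()«»;:?]//g' |     # strip punctuation
    sed 's/.*/\L&/' |              # lowercase the whole line (GNU extension)
    sort -u                        # same result as sort | uniq
}

# Usage: wordlist < story.txt
```

The only change in behaviour from the original script is that tr is replaced by a sed step; sort -u is shorthand for sort | uniq.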
Bovine answered 14/11, 2012 at 16:40 Comment(4)
Yes, the single-byte issue is true. I once reported this as a bug to GNU and they explained this is so by design (i.e. they would have to break compatibility with old software in order to fix it). I also discussed it on a mailing list and was similarly told it was supposed to be that way. - Menfolk
Remember to add the g flag to the substitution if you want to replace all occurrences. - Novellanovello
If you add a space to the beginning, it will not work: echo ' СТЭК' | sed 's/[[:upper:]]*/\L&/' => ' СТЭК'. It seems this one works better: echo ' СТЭК' | sed 's/.*/\L&/' => ' стэк'. Tested on GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu) - Appendicle
Just need to add g: 's/[[:upper:]]*/\L&/g' - Mabellemable
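
The difference between the variants in these comments can be reproduced with plain ASCII input, which does not depend on the locale. A small sketch, assuming GNU sed:

```shell
# Without g, [[:upper:]]* matches the empty string before the space and stops.
echo ' ABC' | sed 's/[[:upper:]]*/\L&/'     # prints ' ABC'

# With g, the later match on ABC is lowercased as well.
echo ' ABC' | sed 's/[[:upper:]]*/\L&/g'    # prints ' abc'

# Matching the whole line avoids the problem without needing g.
echo ' ABC' | sed 's/.*/\L&/'               # prints ' abc'
```

The same reasoning applies to Cyrillic input under a UTF-8 locale, since the leading space is what produces the empty first match.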

This works for me:

echo ЫЕРУНКЫКТ | sed -e 's/\(.*\)/\L\1/'
Loveland answered 19/5, 2021 at 10:29 Comment(0)
