I'm trying to extract a word list from a Russian short story.
#!/bin/sh
export LC_ALL=ru_RU.utf8
sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq
However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!
$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г
In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a PuTTY terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).
What is a portable, reliable way to lowercase Unicode text in a pipe?
sed works for me: echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/' – Bovine
echo "Г" | tr [:upper:] [:lower:] outputs "г" properly on a Mac OS X 10.8 system. – Report
tr is it? – Bovine
uname | tr "[:upper:]" "[:lower:]" outputs Linlx on openwrt. tr is busybox 1.34.1 – Charlottecharlottenburg
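For what it's worth, here is a minimal sketch building on the sed comment above. It assumes GNU sed (the \L escape is a GNU extension, so this is not portable to every sed) and that a UTF-8 locale such as the ru_RU.utf8 from the question is actually installed (check with locale -a). The function name to_lower is just a placeholder. As far as I know, GNU tr processes its input byte by byte, which would explain why multibyte Cyrillic letters pass through unchanged.

```sh
#!/bin/sh
# Sketch only: lowercase every line of stdin with GNU sed instead of tr.
# Assumptions: GNU sed is in use, and the caller has already exported a
# UTF-8 locale (e.g. `export LC_ALL=ru_RU.utf8`) that exists on the system.
to_lower() {
    # `.*` matches the whole line; in the replacement, `&` is that match
    # and `\L` (GNU extension) converts what follows to lowercase.
    sed 's/.*/\L&/'
}

# Hypothetical usage, replacing the tr step in the pipeline:
#   sed -re 's/\s+/\n/g' | sed 's/[.!,—()«»;:?]//g' | to_lower | sort | uniq
```

If the locale is not installed, sed falls back to the C locale and only ASCII letters are lowercased, so it is worth confirming the locale name first.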