Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. ".", "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not "'", "-", "#", "$" or "%"). Now, "." is a sentence-ending character but you said that $27.00 should be considered a "word", so "." needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.
So you need a solution that will convert this:
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".
into this:
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
[email protected]
Is that correct?
Try this using GNU awk so we can set RS to more than one character:
$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".
$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
[email protected]
Try to come up with some other test cases to see if this always does what you want.
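One way to do that is to wrap the command in a small shell function and feed it awkward input. This is only a sketch: split_words and the AWK override variable are names I made up for illustration, and the sample sentence is an invented test case, not something from your data. The awk program itself is the same one as above, just spread over several lines so each part can carry a comment.

```shell
#!/bin/sh
# split_words: read text on stdin, print one "word" per line.
# AWK is a hypothetical override; it defaults to gawk because a
# regular-expression RS is what makes the record splitting work.
split_words() {
    "${AWK:-gawk}" -v RS='[[:space:]?!]+' '
    {
        # Trim leading characters that cannot start a word (keep "$" and "#")
        # and trailing characters that cannot end one (keep "%").
        gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/, "")
    }
    $0 != ""    # print what is left, skipping records that became empty
    '
}

# Quoted here-document so the shell leaves "$" and the apostrophes alone.
split_words <<'EOF'
Mr. O'Hara paid $27 for a 27lbs X-ray (over-priced?) of the dog's leg.
EOF
```

With that sample line this should print Mr, O'Hara, paid, $27, for, a, 27lbs, X-ray, over-priced, of, the, dog's and leg, one per line. Note that "Mr." loses its period, so abbreviations are one case where you may decide you want different behaviour.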
O'Hara, X-ray, over-priced, dog's, 27, $27, $27.00, 27lbs? – Beldam
cat file | sed "s/ /\n/g" – Cabasset