How to split a file into words on the unix command line?
Asked Answered
M

11

29

I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar with pipes to split a text file into words and save them to another file, one word per line. For example, my file contains:

Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.

The output file should contain:

Hola
mundo
hablo
español
...

Thanks!

Microfiche answered 19/3, 2013 at 14:3 Comment(4)
Are these one word or 2: O'Hara, X-ray, over-priced, dog's, 27, $27, $27.00, 27lbs?Beldam
Then what distinguishes "word"s from "word-separators"?Beldam
I posted an "answer" to show what I think you need and which none of the posted solutions will give you. Think about it and let us know...Beldam
cat file | sed "s/ /\n/g"Cabasset
E
58

Using tr:

tr -s '[[:punct:][:space:]]' '\n' < file
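
On the sample text above this should produce something like (a sketch; exact handling of accented characters depends on your tr implementation and locale):

Hola
mundo
hablo
español
...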
Educate answered 19/3, 2013 at 14:13 Comment(4)
Simple and clean. Nice solution.Hornsby
+1 as I think this is probably closest to what the poster wants but he did say that O'Hara and X-ray and some other combinations that include [:punct:] characters should be considered as one word which this solution would not do. He'd probably also want the output piped to "sort" so he just gets each word once in the output but now I'm guessing.Beldam
Perhaps expand [:punct:] and remove - and ', making: tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_`\{|\}~][:space:]]' '\n' < file; optionally as Ed Morton also suggests sort and maybe add frequency: tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_`\{|\}~][:space:]]' '\n' < file | sort | uniq -c | sort -nr. A bit tangled but perhaps good. Also think about character case. Proper tokenizing can be tricky :)Hornsby
You can save the result to a file using: tr -s '[[:punct:][:space:]]' '\n' < file > temp && mv temp file , supposing that filename is fileSororate
D
14

The simplest tool is fmt:

fmt -1 <your-file

fmt is designed to break lines to fit the specified width, and if you pass -1 it breaks after every word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
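
For example, on the sample text from the question you should see something like this (a sketch; note that fmt keeps punctuation attached to the words, so commas and full stops are not stripped):

Hola
mundo,
hablo
español
...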

Diez answered 1/2, 2017 at 11:44 Comment(1)
don't use this if you have extra spaces.Proconsul
P
4

Using sed:

$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile

Basically this deletes all punctuation and replaces any run of whitespace with a newline. It also assumes your flavor of sed understands \n. Some do not, in which case you can just use a literal newline instead (i.e. by embedding it inside your quotes), as in the sketch below.
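
A more portable sketch for such seds, using \{1,\} instead of \+ and an escaped literal newline in the replacement:

sed -e 's/[[:punct:]]*//g' -e 's/[[:space:]]\{1,\}/\
/g' < inputfile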

Pantelegraph answered 19/3, 2013 at 14:6 Comment(0)
C
4

grep -o prints only the parts of a matching line that match the pattern

grep -o '[[:alpha:]]*' file
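
If words may also contain digits and apostrophes, a hedged ERE variant (the character class is only a guess at what should count as a word character):

grep -oE "[[:alnum:]']+" file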
Circumlunar answered 19/3, 2013 at 14:19 Comment(5)
Can you explain more, please? I don't understand the pattern, thank you.Microfiche
It's a standard named character class that grep can use. This one, [:alpha:], for example, means "all alphabetic characters", just like [A-Za-z] except that it is aware of the current locale. Also, it is [:alpha:], not :alpha:; the brackets are part of the named class.Circumlunar
* means zero or more repetitions. Probably don't want to include words with zero characters :-). A BRE for 1-or-more would be [[:alpha:]][[:alpha:]]* while an ERE would be [[:alpha:]]+Beldam
This only matches the first word per line in the input file. Not a solution. Also, while 'word' is not defined, perhaps it would be a good thing to assume that a word can contain other characters than those in the alphabet, such as digits, apostrophes...?Hornsby
grep with -o option will just omit empty matches so it's completely legal. Still, in other utilities/languages it could be significant, thanks for correction.Circumlunar
H
1

Using perl:

perl -ne 'print join("\n", split)' < file

Hornsby answered 19/3, 2013 at 14:7 Comment(2)
No punctuation handling :/Fretwell
Nothing about special treatment of punctuation was requested. One definition of 'word' is anything separated by a space character. Different languages have different punctuation. Sometimes punctuation is important information to retain when tokenizing. Hence, simple implementation which is easy to extend, if needed.Hornsby
E
1
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v

tr -d ",." deletes , and .

tr " \t" "\n" changes spaces and tabs to newlines

grep -e "^$" -v deletes empty lines (in case of two or more spaces)

Ephemerality answered 19/3, 2013 at 14:12 Comment(3)
I'm using Ubuntu; is there tr in Ubuntu? What package should I install?Microfiche
I'm using debian stable and cat, tr and grep are there by default, it is the same with ubuntu imho. tr is part of "coreutils" package in both debian and ubuntu.Ephemerality
@Microfiche You picked a solution which will consider "stop!" and "stop?" as 2 different "words". I doubt that is what you want, and there are MANY other issues with this solution. If you can just tell us in words what distinguishes "word"s from "word-separators" in your mind, then we can probably give you a solution.Beldam
T
1

this awk line may work too?

awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1'  inputfile
Tetchy answered 19/3, 2013 at 14:16 Comment(1)
The $1=$1 assignment forces awk to rebuild the record using OFS, so the fields are printed one per line instead of with their original separators, as in the sketch below.Fretwell
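
A minimal illustration of that rebuild trick, on a hypothetical one-line input and with + instead of * so the separator can never match an empty string:

echo 'a   b,,c' | awk 'BEGIN{FS="[[:punct:] ]+";OFS="\n"}{$1=$1}1'
a
b
c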
B
1

Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence ending character but you said that $27.00 should be considered a "word" so . needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.

So you need a solution that will convert this:

I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".

into this:

I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at 
[email protected]

Is that correct?

Try this using GNU awk so we can set RS to more than one character:

$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".

$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
[email protected]

Try to come up with some other test cases to see if this always does what you want.

Beldam answered 19/3, 2013 at 16:56 Comment(3)
Yes Ed Morton, I had not thought about these cases; it is important for me to solve this problem now and I have no idea what rules could work.Microfiche
Heh. Covered a lot of cases there. But there are probably a zillion more... not to mention differences between languages. But a good solution demands a good understanding of the requirements. Question needs to be more detailed for someone to give a good solution. At this stage I'd recommend having a look at what libraries are available for natural language parsing. Perhaps there is a good tokenizer out there that already covers many of the common pitfalls. Have a look at Ruby, Python, Perl maybe.Hornsby
Agreed. You can't do this job robustly with a quick script, as so much in natural language depends on context, so the best the OP can hope for is a solution that's "good enough" for their needs.Beldam
E
0

A very simple first option would be:

sed 's,\(\w*\),\1\n,g' file

Beware that it handles neither apostrophes nor punctuation.

Evelinevelina answered 19/3, 2013 at 14:7 Comment(0)
G
0

Using perl:

perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file

Output

Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojalá
me
puedan
entender
y
ayudar
Adiós
Grudge answered 19/3, 2013 at 14:13 Comment(0)
M
0

perl -ne 'print join("\n", split)'

Sorry @jsageryd

That one-liner does not give the correct answer, as it joins the last word on a line with the first word on the next.

This is better, but it generates a blank line for each blank line in the source. Pipe through | sed '/^$/d' to fix that:

perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'
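
Putting the two together (a sketch, assuming the input is in a file named file):

perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }' file | sed '/^$/d'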

Mccarver answered 17/10, 2014 at 12:5 Comment(1)
perl -nle 'print if $_=join($\,split)'Grindery
