How to split a file into words on the unix command line?
Asked Answered
M

11

29

I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar with pipes to split a text file into words and save them to another file, one word per line. For example, my file contains:

Hola mundo, hablo español y no sé si escribí bien la
pregunta, ojalá me puedan entender y ayudar
Adiós.

The output file should contain:

Hola
mundo
hablo
español
...

Thanks!

Microfiche answered 19/3, 2013 at 14:3 Comment(4)
Are these one word or 2: O'Hara, X-ray, over-priced, dog's, 27, $27, $27.00, 27lbs?Beldam
Then what distinguishes "word"s from "word-separators"?Beldam
I posted an "answer" to show what I think you need and which none of the posted solutions will give you. Think about it and let us know...Beldam
cat file | sed "s/ /\n/g"Cabasset
E
58

Using tr:

tr -s '[[:punct:][:space:]]' '\n' < file
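
On the sample text above this should produce something like (a sketch; exact handling of accented characters depends on your tr implementation and locale):

Hola
mundo
hablo
español
...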
Educate answered 19/3, 2013 at 14:13 Comment(4)
Simple and clean. Nice solution.Hornsby
+1 as I think this is probably closest to what the poster wants but he did say that O'Hara and X-ray and some other combinations that include [:punct:] characters should be considered as one word which this solution would not do. He'd probably also want the output piped to "sort" so he just gets each word once in the output but now I'm guessing.Beldam
Perhaps expand [:punct:] and remove - and ', making: tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_`\{|\}~][:space:]]' '\n' < file; optionally as Ed Morton also suggests sort and maybe add frequency: tr -s '[*!"#\$%&\(\)\+,\\\.\/:;<=>\?@\[\\\\]^_`\{|\}~][:space:]]' '\n' < file | sort | uniq -c | sort -nr. A bit tangled but perhaps good. Also think about character case. Proper tokenizing can be tricky :)Hornsby
You can save the result to a file using: tr -s '[[:punct:][:space:]]' '\n' < file > temp && mv temp file , supposing that filename is fileSororate
D
14

The simplest tool is fmt:

fmt -1 <your-file

fmt is designed to break lines to fit the specified width, and if you pass -1 it breaks after every word. See man fmt for documentation. Inspired by http://everythingsysadmin.com/2012/09/unorthodoxunix.html
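
For example, on the sample text from the question you should see something like this (a sketch; note that fmt keeps punctuation attached to the words, so commas and full stops are not stripped):

Hola
mundo,
hablo
español
...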

Diez answered 1/2, 2017 at 11:44 Comment(1)
don't use this if you have extra spaces.Proconsul
P
4

Using sed:

$ sed -e 's/[[:punct:]]*//g;s/[[:space:]]\+/\n/g' < inputfile

Basically this deletes all punctuation and replaces any run of whitespace with a newline. It also assumes your flavor of sed understands \n. Some do not, in which case you can just use a literal newline instead (i.e. by embedding it inside your quotes), as in the sketch below.
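
A more portable sketch for such seds, using \{1,\} instead of \+ and an escaped literal newline in the replacement:

sed -e 's/[[:punct:]]*//g' -e 's/[[:space:]]\{1,\}/\
/g' < inputfile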

Pantelegraph answered 19/3, 2013 at 14:6 Comment(0)
C
4

grep -o prints only the parts of a matching line that match the pattern

grep -o '[[:alpha:]]*' file
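
If words may also contain digits and apostrophes, a hedged ERE variant (the character class is only a guess at what should count as a word character):

grep -oE "[[:alnum:]']+" file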
Circumlunar answered 19/3, 2013 at 14:19 Comment(5)
Can you explain more, please? I don't understand the pattern, thank you.Microfiche
It's a standard named character class that grep can use. This one, [:alpha:], for example, means "all alphabetic characters", just like [A-Za-z] except that it is aware of the current locale. Also, it is [:alpha:], not :alpha:; the brackets are part of the named class.Circumlunar
* means zero or more repetitions. Probably don't want to include words with zero characters :-). A BRE for 1-or-more would be [[:alpha:]][[:alpha:]]* while an ERE would be [[:alpha:]]+Beldam
This only matches the first word per line in the input file. Not a solution. Also, while 'word' is not defined, perhaps it would be a good thing to assume that a word can contain other characters than those in the alphabet, such as digits, apostrophes...?Hornsby
grep with -o option will just omit empty matches so it's completely legal. Still, in other utilities/languages it could be significant, thanks for correction.Circumlunar
H
1

Using perl:

perl -ne 'print join("\n", split)' < file

Hornsby answered 19/3, 2013 at 14:7 Comment(2)
No punctuation handling :/Fretwell
Nothing about special treatment of punctuation was requested. One definition of 'word' is anything separated by a space character. Different languages have different punctuation. Sometimes punctuation is important information to retain when tokenizing. Hence, simple implementation which is easy to extend, if needed.Hornsby
E
1
cat input.txt | tr -d ",." | tr " \t" "\n" | grep -e "^$" -v

tr -d ",." deletes , and .

tr " \t" "\n" changes spaces and tabs to newlines

grep -e "^$" -v deletes empty lines (in case of two or more spaces)

Ephemerality answered 19/3, 2013 at 14:12 Comment(3)
I'm using Ubuntu; is there tr in Ubuntu? What package should I install?Microfiche
I'm using debian stable and cat, tr and grep are there by default, it is the same with ubuntu imho. tr is part of "coreutils" package in both debian and ubuntu.Ephemerality
@Microfiche You picked a solution which will consider "stop!" and "stop?" as 2 different "words". I doubt that is what you want, and there are MANY other issues with this solution. If you can just tell us in words what distinguishes "word"s from "word-separators" in your mind, then we can probably give you a solution.Beldam
T
1

this awk line may work too?

awk 'BEGIN{FS="[[:punct:] ]*";OFS="\n"}{$1=$1}1'  inputfile
Tetchy answered 19/3, 2013 at 14:16 Comment(1)
The $1=$1 assignment forces awk to rebuild the record using OFS, so the fields are printed one per line instead of with their original separators, as in the sketch below.Fretwell
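
A minimal illustration of that rebuild trick, on a hypothetical one-line input and with + instead of * so the separator can never match an empty string:

echo 'a   b,,c' | awk 'BEGIN{FS="[[:punct:] ]+";OFS="\n"}{$1=$1}1'
a
b
c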
B
1

Based on your responses so far, I THINK what you probably are looking for is to treat words as sequences of characters separated by spaces, commas, sentence-ending characters (i.e. "." "!" or "?" in English) and other characters that you would NOT normally find in combination with alpha-numeric characters (e.g. "<" and ";" but not ' - # $ %). Now, "." is a sentence ending character but you said that $27.00 should be considered a "word" so . needs to be treated differently depending on context. I think the same is probably true for "-" and maybe some other characters.

So you need a solution that will convert this:

I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".

into this:

I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at 
[email protected]

Is that correct?

Try this using GNU awk so we can set RS to more than one character:

$ cat file
I have $27.00. We're 20% under-budget, right? This is #2 - mail me at "[email protected]".

$ gawk -v RS="[[:space:]?!]+" '{gsub(/^[^[:alnum:]$#]+|[^[:alnum:]%]+$/,"")} $0!=""' file
I
have
$27.00
We're
20%
under-budget
right
This
is
#2
mail
me
at
[email protected]

Try to come up with some other test cases to see if this always does what you want.

Beldam answered 19/3, 2013 at 16:56 Comment(3)
Yes Ed Morton, I had not thought about these cases; it is important for me to solve this problem now and I have no idea what rules could work.Microfiche
Heh. Covered a lot of cases there. But there are probably a zillion more... not to mention differences between languages. But a good solution demands a good understanding of the requirements. Question needs to be more detailed for someone to give a good solution. At this stage I'd recommend having a look at what libraries are available for natural language parsing. Perhaps there is a good tokenizer out there that already covers many of the common pitfalls. Have a look at Ruby, Python, Perl maybe.Hornsby
Agreed. You can't do this job robustly with a quick script, as so much in natural language depends on context, so the best the OP can hope for is a solution that's "good enough" for their needs.Beldam
E
0

A very simple first option would be:

sed 's,\(\w*\),\1\n,g' file

Beware that it handles neither apostrophes nor punctuation.

Evelinevelina answered 19/3, 2013 at 14:7 Comment(0)
G
0

Using perl:

perl -pe 's/(?:\p{Punct}|\s+)+/\n/g' file

Output

Hola
mundo
hablo
español
y
no
sé
si
escribí
bien
la
pregunta
ojalá
me
puedan
entender
y
ayudar
Adiós
Grudge answered 19/3, 2013 at 14:13 Comment(0)
M
0

perl -ne 'print join("\n", split)'

Sorry @jsageryd

That one-liner does not give the correct answer, as it joins the last word on a line with the first word on the next.

This is better, but it generates a blank line for each blank line in the source. Pipe through | sed '/^$/d' to fix that:

perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }'
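
Putting the two together (a sketch, assuming the input is in a file named file):

perl -ne '{ print join("\n",split(/[[:^word:]]+/)),"\n"; }' file | sed '/^$/d'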

Mccarver answered 17/10, 2014 at 12:5 Comment(1)
perl -nle 'print if $_=join($\,split)'Grindery
