Split sentences into separate lines
K

8

13

I'm trying to split sentences in a file into separate lines using a shell script.

The file I want to read from is my_text.txt, and it contains:

you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.

Now I would like to split the text at "!", "?" or ".". The output should look like this:

you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script

I used this script :

while read p
do
   echo $p | tr "? ! ." "\n " 
done < my_text.txt

But the output is:

you want to learn shell script

First, you want to learn Linux command then you can learn shell script

Can somebody help?

Kempis answered 16/12, 2020 at 9:3 Comment(1)
Regarding "I used this script": you should read why-is-using-a-shell-loop-to-process-text-considered-bad-practice. - Pentagrid
G
2

You can chain three tr commands to split on ?, ! and .:

cat test_string.txt | tr "!" "\n" | tr "?" "\n" | tr "." "\n"
Glassine answered 16/12, 2020 at 9:12 Comment(1)
You don't need a UUOC (useless use of cat), 3 pipes, and 3 calls to tr for this task. Using double quotes like that also asks the shell to interpret the text between the quotes, so some shells with some settings will treat ! as a history request. - Pentagrid
T
5

This can be done in a single awk call using its global substitution function, gsub (written and tested with the shown samples only, in GNU awk). It replaces every ?, ! or . with ORS, the output record separator, which is a newline by default.

awk '{gsub(/\?|!|\./,ORS)} 1' Input_file
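
If the space after each punctuation mark and the empty line left by the final period are unwanted, the same gsub idea can be extended. This is only a sketch under assumptions: the function name split_sentences is made up for illustration, and it reads standard input when no file is given.

```shell
# Hypothetical helper: split text into one sentence per line.
# Reads the files given as arguments, or standard input if there are none.
split_sentences() {
    awk '{
        gsub(/[?!.][ \t]*/, "\n")   # punctuation plus trailing blanks -> newline
        sub(/\n$/, "")              # drop the newline left by the final mark
        print
    }' "$@"
}

printf '%s\n' 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' | split_sentences
```

The sub() call only touches the very end of the record, so newlines created in the middle of the line are kept.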
Tolan answered 16/12, 2020 at 9:22 Comment(0)
P
4
$ sed 's/[!?.]/\n/g' file
you want to learn shell script
 First, you want to learn Linux command
 then
 you can learn shell script
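
Note that \n in the replacement part is understood by GNU sed, but POSIX does not specify it and some other sed implementations print a literal n instead. A sketch of a more portable variant, assuming a POSIX shell and sed (the function name is hypothetical): it uses a backslash followed by a real newline in the replacement, also swallows the spaces after each mark, and removes the newline left by the last period.

```shell
# Hypothetical helper: one sentence per line using only POSIX sed features.
# Reads the files given as arguments, or standard input if there are none.
split_sentences_sed() {
    # First command: replace each mark and the spaces after it with a newline.
    # Second command: delete a newline left at the very end of the pattern space.
    sed 's/[!?.] */\
/g
s/\n$//' "$@"
}

printf '%s\n' 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' | split_sentences_sed
```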
Pentagrid answered 16/12, 2020 at 20:22 Comment(0)
E
2

Awk is ideal for this:

awk -F '[?.!]' '{ for (i=1;i<=NF;i++) { print $i } }' file

Set the field delimiters to ? or . or ! and then loop through each field and print the entry.
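
With the sample line, the fields after the first keep their leading space and the final period yields one empty field. A minimal sketch that trims and skips those, assuming nothing beyond POSIX awk (the function name is made up for illustration):

```shell
# Hypothetical helper: print each ?/./!-delimited field on its own line,
# without leading blanks and without empty fields.
print_fields() {
    awk -F '[?.!]' '{
        for (i = 1; i <= NF; i++) {
            s = $i
            sub(/^[ \t]+/, "", s)     # trim leading blanks from this field
            if (s != "") print s      # skip the empty field after the last mark
        }
    }' "$@"
}

printf '%s\n' 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' | print_fields
```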

Egg answered 16/12, 2020 at 9:12 Comment(0)
T
1

That's not how you use tr. Both arguments should be of the same length; otherwise the second one is extended to the length of the first by repeating its last character* (in this case, a space) to make one-to-one transliteration possible. In other words, given "? ! ." and "\n " as arguments, tr will replace ? with a line feed, and !, space and . with a space. What you're looking for is, I guess:

$ tr '?!.' '\n' <file
you want to learn shell script
 First, you want to learn Linux command
 then
 you can learn shell script

Or, more portably:

tr '?!.' '[\n*]' <file

*This is what GNU tr does; POSIX leaves the behavior unspecified when the arguments aren't of the same length.
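
To see the effective mapping without relying on any padding rule, the short second argument can be written out in full. A small sketch, assuming a POSIX shell; the function names and the sample string are made up for illustration:

```shell
# What the script in the question effectively does: set1 is the five
# characters "?", " ", "!", " ", "." and set2, written out at full length,
# is a newline followed by four spaces.
question_tr() { tr '? ! .' '\n    '; }

# The intended mapping from this answer: all three marks become newlines.
fixed_tr() { tr '?!.' '[\n*]'; }

printf 'a? b! c. d\n' | question_tr
printf 'a? b! c. d\n' | fixed_tr
```

Only the ? produces a line break in the first call; the ! and . quietly turn into spaces, which matches the output shown in the question.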

Technicolor answered 16/12, 2020 at 9:36 Comment(0)
S
0

In GNU awk we can do it with the gensub() function:

awk '{print gensub(/([.?!]\s*)/, "\n", "g", $0)}' file
you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script

Sapotaceous answered 16/12, 2020 at 15:0 Comment(0)
P
0

Why limit yourself to newline \n being the RS? Maybe something like this:

  • \056 is the period and \040 is the space. I add the + in case of the legacy practice of typing 2 spaces after each sentence, which you may want to standardize.
  • I presume the question mark \077 is more frequent than the exclamation mark \041. The only reason I'm using octal throughout is that these characters can wreak havoc on a terminal at the slightest slip in quoting or escaping.
  • Unlike FS and RS, OFS and ORS are plain strings rather than regular expressions, so typing the characters literally is safe there.
  • The periods are taken care of by RS, with no special processing needed. So if a record contains neither ? nor !, just print it as is and move on (ORS supplies the ". \n").

mawk 'BEGIN { RS = "[\056][\040]+" ; ORS = ". \n"; 
              FS = "[\077][\040]+";  OFS = "? \n"; }
      ($0 !~ /[\041\077]/) { 
                              print; next; } 
             /[\041]/      { 
                              gsub("[\041][\040]+", "\041 \n"); }  
      ( NF==1 ) || ( $1=$1 )'

As fast as mawk is, a gsub() or $1=$1 still costs time, so skip the costly parts unless the record actually has a ? or ! mark.

The last line is the fun trick, done outside the braces. The ! has already been handled on the line before, so if no ? is found (that is, NF is 1), the first condition is true, awk short-circuits without evaluating the second part, and the record is simply printed.

But if there are any ? marks, the assignment $1=$1 rebuilds the record with OFS between the fields. Because it is an assignment rather than an equality comparison, its value is the assigned string, which counts as true whenever $1 is non-empty and not numerically zero, so it doubles as the pattern that triggers the default print.
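
The last-line trick can be tried in isolation. A minimal sketch, assuming any POSIX awk; the function name and the sample input are made up for illustration:

```shell
# Hypothetical demo of the "(NF == 1) || ($1 = $1)" pattern with a ?-based FS.
rebuild_questions() {
    awk 'BEGIN { FS = "[?] +"; OFS = "? \n" }
         (NF == 1) || ($1 = $1)' "$@"
}

# The first line has no "?" and is printed untouched; the second is split
# into three fields and rebuilt with "? " plus a newline between them.
printf 'no marks here\nfirst?  second? third\n' | rebuild_questions
```

Since the pattern relies on the value of $1, a record whose first field is empty or 0 would not be printed; writing { $1 = $1 } 1 instead avoids that dependency at the cost of the short circuit.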

Phonemic answered 19/12, 2020 at 1:34 Comment(0)
M
0

Awk's record separator variable RS should do the trick.

echo 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' |
awk 'BEGIN{RS="[?.!] "}1'
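
With that RS the final period is followed by a newline rather than a space, so the last record keeps its "." and an empty line is printed after it. A sketch of one way around this, assuming an awk that accepts a regular expression as RS (gawk and mawk do; POSIX only defines the single-character case); the function name is hypothetical:

```shell
# Hypothetical helper: treat a mark plus any following spaces or newlines as
# the record separator, and skip records that end up empty.
split_records() {
    awk 'BEGIN { RS = "[?.!][ \n]*" } $0 != ""' "$@"
}

printf '%s\n' 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' | split_records
```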
Mandler answered 19/12, 2020 at 12:26 Comment(0)
