How to remove duplicate words from both sentences using a shell script?

Asked 18/12, 2020 at 11:31 Answered 27/1, 2021 at 9:32

I have two sentences containing duplicate words. For example, the input data in file my_text.txt:

The Unix and Linux operating system.
The Unix and Linux system was to create an environment that promoted efficient program.

I used this script:

while read p
do
echo "$p"|sort -u | uniq
done < my_text.txt

But the output is the same as the content of the input file:

The Unix and Linux operating system.
The Unix and Linux system was to create an environment that promoted efficient program.

How can I remove the duplicate words from both sentences?

Microcrystalline answered 18/12, 2020 at 11:31 Comment(2)

Could you please post clearer samples of the input and the expected output in your question, so that it is easier to understand? – Scummy 18/12, 2020 at 11:34

I want to remove any duplicate words in both sentences. In my example, there are 5 words repeated in both sentences (The, Unix, and, Linux, system), but I need a more general script that works for any two sentences containing duplicate words. – Microcrystalline 18/12, 2020 at 11:40

Your code would remove repeated lines; both sort and uniq operate on lines, not words. (And even then, the loop is superfluous; if you wanted to do that, your code should be simplified to just sort -u my_text.txt.)

The usual fix is to split the input to one word per line; there are some complications with real-world text, but the first basic Unix 101 implementation looks like

tr ' ' '\n' <my_text.txt | sort -u

Of course, this gives you the words in sorted order rather than in their original order, and keeps one occurrence of every word. If you wanted to discard any words which occur more than once, maybe try

tr ' ' '\n' <my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'

(If your tr doesn't recognize \n as newline, maybe try '\012'.)
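
If the original word order matters, one possible variation (a sketch, not something from the pipelines above) swaps sort -u for the common awk '!seen[$0]++' idiom, which prints a line only the first time it appears. The function name words_in_order is made up for illustration, and tr -s is used on the assumption that repeated spaces should not produce empty lines.

```shell
# Hypothetical helper: print each distinct word once, in order of first appearance.
words_in_order() {
  # tr -s turns spaces into newlines and squeezes repeated newlines;
  # the awk idiom lets a line through only the first time it is seen.
  tr -s ' ' '\n' < "$1" | awk '!seen[$0]++'
}

# Sample input from the question.
printf '%s\n' \
  'The Unix and Linux operating system.' \
  'The Unix and Linux system was to create an environment that promoted efficient program.' \
  > my_text.txt

words_in_order my_text.txt
```

Note that system. and system still come out as two different words here, because nothing strips the punctuation.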

Here is a dead simple two-pass Awk script which hopefully is a little bit more useful. It collects all the words into memory during the first pass over the file, then on the second, removes any words which occurred more than once.

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
{ for (i=1; i<=NF; ++i) if (a[$i] > 1) $i="" } 1' my_text.txt my_text.txt

This leaves whitespace where words were removed; fixing that should be easy enough with a final sub().
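
For what it's worth, the whitespace cleanup could look roughly like the sketch below. It is the same two-pass script with a gsub() and two sub() calls added before printing; the wrapper name remove_repeated is hypothetical, and the file name is the one from the question.

```shell
# Hypothetical wrapper around the two-pass script, plus whitespace cleanup.
remove_repeated() {
  awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
  { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i=""
    gsub(/ +/, " ")      # squeeze runs of spaces left by the blanked fields
    sub(/^ /, "")        # drop a leading space
    sub(/ $/, "")        # drop a trailing space
    print }' "$1" "$1"
}

# Sample input from the question.
printf '%s\n' \
  'The Unix and Linux operating system.' \
  'The Unix and Linux system was to create an environment that promoted efficient program.' \
  > my_text.txt

remove_repeated my_text.txt
```

On the sample input the word system survives in both lines, because system. (with the full stop) and system are counted as different words.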

A somewhat more useful program would split off any punctuation, and reduce words to lowercase (so that Word, word, Word!, and word? don't count as separate).
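
A rough sketch of that idea follows, under a few assumptions: the punctuation set [.,;:!?] is just a guess at what matters for this kind of text, the names remove_repeated_normalized and key are invented for the example, and the surviving words are printed in their original spelling (punctuation included).

```shell
# Hypothetical helper: count words case-insensitively and without punctuation,
# then print only the words whose normalized form occurs exactly once.
remove_repeated_normalized() {
  awk '
  # Lowercase a word and strip the assumed punctuation set,
  # so that "Word", "word!" and "word?" all share one key.
  function key(w) { w = tolower(w); gsub(/[.,;:!?]/, "", w); return w }
  NR==FNR { for (i=1; i<=NF; ++i) ++count[key($i)]; next }
  { out = ""
    for (i=1; i<=NF; ++i)
      if (count[key($i)] == 1) out = out (out == "" ? "" : " ") $i
    print out }' "$1" "$1"
}

# Sample input from the question.
printf '%s\n' \
  'The Unix and Linux operating system.' \
  'The Unix and Linux system was to create an environment that promoted efficient program.' \
  > my_text.txt

remove_repeated_normalized my_text.txt
```

Building the output string word by word also avoids the leftover whitespace, so no final sub() is needed in this variant.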

Bauer answered 18/12, 2020 at 12:6 Comment(0)

You can use this command to remove duplicate words from both sentences (it keeps one copy of each word, in sorted order, joined on a single line):

tr ' ' '\n' <my_text.txt | sort | uniq | xargs

Freese answered 27/1, 2021 at 9:32 Comment(0)

To output processed lines with preserved order of word occurrences, you can use awk to parse and remove the duplicates. This script supports multiple sentences, taking into account words followed by common punctuation marks (.,;):

File remove_duplicates.awk:

#!/usr/bin/awk -f

{
    # Count occurrences of each word across all lines, keyed by the word itself
    for (i=1; i<=NF; i++) {
        sub(/[.,;]/, "", $i)
        seen_words[$i]++
    }
    # Store line, keyed by line number
    lines[NR]=$0
}
END {
    # Process stored lines
    for (i=1; i<=NR; i++) {
        # split() returns the number of words in the stored line
        n = split(lines[i], word, " ")
        output_line=""
        for (j=1; j<=n; j++){
            sub(/[.,;]/, "", word[j])
            if (seen_words[word[j]] <= 1) {
                # Join kept words with single spaces, without a leading space
                output_line = output_line (output_line == "" ? "" : " ") word[j]
            }
        }
        print output_line
    }
}

Usage:

./remove_duplicates.awk < input_text

Output:

operating
was to create an environment that promoted efficient program

Caddish answered 18/12, 2020 at 12:5 Comment(1)

Solutions to homework should probably try to explain what's wrong and only hint at a solution, rather than dump a complex program with little explanation. See also meta.#335322 – Bauer 18/12, 2020 at 12:20

Using awk (GNU awk):

 awk '{
        for (i=1;i<=NF;i++) { # Loop over each word on each line
          gsub(/[[:punct:]]/,"",$i); # Strip out any punctuation
          cnt++; # Increment a running word counter
          if (!map[$i]) { # If there is no entry for the word in the array yet,
            map[$i]=cnt # add one, with the word as the index and cnt as the value
          }
         }
      }
  END {
        PROCINFO["sorted_in"]="@val_num_asc"; # Traverse the array in ascending numeric order of value
        for (i in map) {
           printf "%s ",i # Print each word followed by a space
        }
       }' filename

One liner:

 awk '{ for (i=1;i<=NF;i++) { gsub(/[[:punct:]]/,"",$i);cnt++;if (!map[$i]) { map[$i]=cnt } } } END { PROCINFO["sorted_in"]="@val_num_asc";for (i in map) { printf "%s ",i } }' filename

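NOTE - This will strip out any punctuation (full stops after words). It keeps the first occurrence of each word, in the original order, rather than removing repeated words entirely.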
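
PROCINFO["sorted_in"] is specific to GNU awk. If that is not available, a similar result can probably be had by printing each word the first time it is seen, as in the sketch below. This is an alternative technique, not the script above: the name unique_words_in_order is hypothetical, and the punctuation set [.,;:!?] is an assumption standing in for [[:punct:]].

```shell
# Hypothetical helper: print every distinct word once, in order of first
# appearance, on a single line, without relying on GNU awk extensions.
unique_words_in_order() {
  awk '{
    for (i=1; i<=NF; i++) {
      w = $i
      gsub(/[.,;:!?]/, "", w)           # strip the assumed punctuation set
      if (w != "" && !(w in seen)) {    # first time this word appears
        seen[w] = 1
        printf "%s%s", sep, w           # sep is empty before the first word
        sep = " "
      }
    }
  }
  END { if (sep != "") print "" }' "$1"   # finish the line if anything was printed
}

# Sample input from the question.
printf '%s\n' \
  'The Unix and Linux operating system.' \
  'The Unix and Linux system was to create an environment that promoted efficient program.' \
  > my_text.txt

unique_words_in_order my_text.txt
```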

Frail answered 18/12, 2020 at 12:25 Comment(0)
