sed: removing alphanumeric words from a file

Asked 13/12, 2010 at 20:30 Answered 14/12, 2010 at 1:36

I have a file with a lot of text, and I want to remove all alphanumeric words from it.

Example of words to be removed:

gr8  
2006  
sdlfj435ljsa  
232asa  
asld213  
ladj2343asda
asd!32

What is the best way to do this?

Mcnew answered 13/12, 2010 at 20:30 Comment(0)

If you want to remove all words that consist of letters and digits, leaving only words that consist of all digits or all letters:

sed 's/\<\([[:alpha:]]\+[[:digit:]]\+[[:alnum:]]*\|[[:digit:]]\+[[:alpha:]]\+[[:alnum:]]*\) \?//g' inputfile

Example:

$ echo 'abc def ghi 111 222 ab3 a34 43a a34a 4ab3' | sed 's/\<\([[:alpha:]]\+[[:digit:]]\+[[:alnum:]]*\|[[:digit:]]\+[[:alpha:]]\+[[:alnum:]]*\) \?//g'
abc def ghi 111 222
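
The \<, \+, \| and \? escapes used above are GNU extensions to basic regular expressions, so the command may not behave the same in other sed implementations. As a rough alternative, here is a word-by-word awk sketch of the same idea. The function name strip_mixed_words is made up for illustration, and it assumes words are separated by whitespace (runs of whitespace are collapsed to a single space in the output):

```shell
# Hypothetical helper: drop words that are purely alphanumeric AND contain
# at least one letter and at least one digit; keep everything else.
strip_mixed_words() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++) {
            mixed = ($i ~ /^[A-Za-z0-9]+$/ && $i ~ /[A-Za-z]/ && $i ~ /[0-9]/)
            if (!mixed) out = (out == "" ? $i : out " " $i)
        }
        print out   # printed even when empty, so the line count is preserved
    }'
}

result=$(echo 'abc def ghi 111 222 ab3 a34 43a a34a 4ab3' | strip_mixed_words)
echo "$result"   # abc def ghi 111 222
```

Unlike the sed version, this one leaves no trailing space after the last surviving word.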

Incorporate answered 13/12, 2010 at 23:15 Comment(0)

Assuming the only output you wanted from your sample text is 2006 and you have one word per line:

 sed '/[[:alpha:]]\+/{/[[:digit:]]\+/d}' /path/to/alnum/file

Input

$ cat alnum
gr8
2006
sdlFj435ljsa
232asa
asld213
ladj2343asda
asd!32
alpha

Output

$ sed '/[[:alpha:]]\+/{/[[:digit:]]\+/d}' ./alnum
2006
alpha
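
To make the one-word-per-line assumption concrete: the command deletes any line that contains at least one letter and at least one digit anywhere on it. A small sketch (the function name filter_lines and the extra 'abc 123' line are made up for illustration; the \+ is dropped because a single match is enough for an address):

```shell
# Delete every line that has both a letter and a digit somewhere on it.
filter_lines() {
    sed '/[[:alpha:]]/{/[[:digit:]]/d;}'
}

# 'abc 123' holds two "clean" words, but the whole line is still deleted.
result=$(printf '%s\n' 'gr8' '2006' 'asd!32' 'alpha' 'abc 123' | filter_lines)
echo "$result"
# 2006
# alpha
```

So with several words per line you would need one of the word-based approaches instead.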

Banyan answered 14/12, 2010 at 1:36 Comment(3)

Appending the ;/^$/d command would clean up the output. For example, sed '/[[:alpha:]]\+/{/[[:digit:]]\+/s/.*//g};/^$/d' alnum would return 2006 and alpha on single lines – Linkboy 1/10, 2015 at 15:16

Appreciate the comment. Haven't looked at this answer for almost 5 years, but now that I've looked at it in light of your comment, I delete the line instead of replacing it with an empty line. – Banyan 2/10, 2015 at 7:32

nice work, that one even removed the command chaining. I'm impressed and learnt something new at the same time. +1 – Linkboy 2/10, 2015 at 7:56

If the goal is actually to remove all alphanumeric words (strings consisting entirely of letters and digits) then this sed command will work. It replaces all alphanumeric strings with nothing.

sed 's/[[:alnum:]]*//g' < inputfile

Note that other character classes besides alnum are also available (see man 7 regex).

For your given example data, this leaves only 6 blank lines and a single ! (since that is the only non-alphanumeric character in the example data). Is this actually what you're trying to do?
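
A quick way to check that claim, as a sketch that feeds the question's sample words (without trailing spaces) through the same substitution:

```shell
# Run the sample words through the substitution and capture the output.
result=$(printf '%s\n' 'gr8' '2006' 'sdlfj435ljsa' '232asa' 'asld213' \
                       'ladj2343asda' 'asd!32' | sed 's/[[:alnum:]]*//g')

# Count the empty lines and the lines that still contain a "!".
blank=$(printf '%s\n' "$result" | grep -c '^$')
bang=$(printf '%s\n' "$result" | grep -c '!')
echo "blank lines: $blank, lines with !: $bang"   # blank lines: 6, lines with !: 1
```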

Sumach answered 13/12, 2010 at 21:5 Comment(0)

AWK solution:

BEGIN { # Statement that will be executed once at the beginning.
    FS="[ \t]" # Set space and tab characters to be treated as word separator.
}
# Code below will execute for each line in file.
{
    x=1  # Set initial word index to 1 (0 is the original string in array)
    fw=1 # Indicate that future matched word is a first word. This is needed to put newline and spaces correctly.
    while ( x<=NF )
    {
        gsub(/[ \t]*/,"",$x) # Strip word. Remove any leading and trailing white-spaces.
        if (!match($x,"^[A-Za-z0-9]*$")) # Print word only if it does not match pure alphanumeric set of characters.
        {
            if (fw == 0)
            {
                printf (" %s", $x) # Print the word offsetting it with space in case if this is not a first match.
            }
            else
            {
                printf ("%s", $x) # Print word as is...
                fw=0 # ...and indicate that future matches are not first occurrences
            }
        }
        x++ # Increase word index number.
    }
    if (fw == 0) # Print newline only if we had matched some words and printed something.
    {
        printf ("\n")
    }
}

Assuming you have this script in test.awk and the data in data.txt, invoke awk like this:

awk -f ./test.awk ./data.txt

For your file it will produce:

asd!32

For more complex cases like this:

gr8
2006
sdlfj435ljsa
232asa  he!he lol
asld213  f
ladj2343asda
asd!32  ab acd!s

... it will produce this:

he!he
asd!32 acd!s
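
For comparison, the same filtering can probably be written more compactly by relying on awk's default field splitting, which already ignores runs of spaces and tabs. This is only a sketch, and the function name keep_non_alnum_words is made up for illustration:

```shell
# Keep only the words that are NOT purely alphanumeric; skip lines
# where nothing survives.
keep_non_alnum_words() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i !~ /^[A-Za-z0-9]+$/) out = (out == "" ? $i : out " " $i)
        if (out != "") print out
    }'
}

result=$(printf '%s\n' 'gr8' '2006' 'sdlfj435ljsa' '232asa  he!he lol' \
                       'asld213  f' 'ladj2343asda' 'asd!32  ab acd!s' |
         keep_non_alnum_words)
echo "$result"
# he!he
# asd!32 acd!s
```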

Hope it helps. Good luck!

Awakening answered 13/12, 2010 at 22:2 Comment(0)
