Removing all special characters from a string in Bash

I have a lot of text in lowercase. The only problem is that it contains a lot of special characters, which I want to remove entirely, along with the numbers.

The following command is not strong enough:

tr -cd '[:alpha:]\n '

For characters such as éćščž and some others it returns "?", but I want to remove all of them. Is there a stronger command?

I use Linux Mint, with Bash 4.3.8(1)-release.

Stylographic answered 28/4, 2016 at 23:12 Comment(2)
Every character is special in its own way. - Physiognomy
Your question is not very clear. Giving a bit more context might draw more helpful responses. - Accordant

You can use tr to keep only the printable characters of the input. Run the following command on your input file:

tr -cd "[:print:]\n" < file1   

The -d flag deletes the characters in the given set from the input stream, and -c complements that set (inverts what is provided). So without -c the command would delete all printable characters; with -c it deletes everything that is not printable. The newline character \n is included in the set to preserve the line endings of the input file; without it, the whole output would end up on one long line.
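To make the effect of -c concrete, here is a minimal sketch using a made-up sample string (the variable names are only for illustration). It runs the same set once with -d alone and once with -cd.

```shell
# Hypothetical sample input: letters followed by digits.
sample='abc123'

# -d alone deletes every character in the set (here: the digits).
letters_only=$(printf '%s\n' "$sample" | tr -d '[:digit:]')

# -c complements the set first, so -cd deletes everything EXCEPT digits and newline.
digits_only=$(printf '%s\n' "$sample" | tr -cd '[:digit:]\n')

printf '%s\n' "$letters_only"   # abc
printf '%s\n' "$digits_only"    # 123
```

The same pattern applies to any set: list what you want to keep, then let -c turn it into "everything else".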

[:print:] is a POSIX character class. In the C/POSIX locale it is equivalent to the combination of [:alnum:], [:punct:] and the space character. In that locale [:alnum:] is the same as [0-9A-Za-z], and [:punct:] includes the characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
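As a rough check of that equivalence, the sketch below forces the C locale with LC_ALL=C and filters a made-up byte sequence both ways. The sample string and variable names are assumptions for illustration. Outside the C locale the two forms are not guaranteed to agree, since the locale decides which characters each class contains.

```shell
# Sample bytes: "café" (UTF-8 é = \303\251), an escape byte (\033), then ASCII text.
sample=$(printf 'caf\303\251 \033[31mred!')

# Form 1: the single [:print:] class.
by_print=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[:print:]\n')

# Form 2: the spelled-out union of [:alnum:], [:punct:] and a literal space.
by_union=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[:alnum:][:punct:] \n')

printf '%s\n' "$by_print"    # caf [31mred!
[ "$by_print" = "$by_union" ] && echo "same result in the C locale"
```

Note that in the C locale the two bytes of é are dropped as well, because every byte above 0x7E counts as non-printable there.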

Rectocele answered 29/4, 2016 at 5:17 Comment(2)
Unfortunately tr's character class [:print:] does not include all letters in each locale, e.g. ä in Finnish. - Rhone
It is not true that print is the same as alnum, punct and space. I had a problem with characters messing up my terminal: tr -dc "[:alnum:][:punct:] \n" fixed it, but tr -dc "[:print:]\n" did not. So alnum, punct and space must be a subset of print. - Cushing

I am not exactly certain where the text in your question is coming from, but let's say the "lot of text in lowercase" is in a file called special.txt. You could then do something like the following, which focuses on the characters you want to keep:

sed 's/[^a-zA-Z ]//g' special.txt

It is a bit like doing surgery with an axe though.
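A small sketch of that keep-list approach, wrapped in a hypothetical helper function (the name letters_and_spaces and the sample input are not from the original answer). It assumes byte-wise matching is acceptable, which is why it sets LC_ALL=C.

```shell
# Hypothetical helper: keep only ASCII letters and spaces, delete everything else.
# LC_ALL=C makes sed treat the input as bytes, so a-z and A-Z mean ASCII ranges only.
letters_and_spaces() {
    LC_ALL=C sed 's/[^a-zA-Z ]//g'
}

# "wörld" contains a two-byte UTF-8 ö (\303\266); both bytes are removed.
printf 'Hello, w\303\266rld 42!\n' | letters_and_spaces   # prints "Hello wrld " (with a trailing space)
```

This shows the "axe" quality: accented letters are not transliterated, they simply disappear, and the spaces around removed tokens stay behind.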

Another possible solution is in the post Remove non-ascii characters from ...

If the above doesn't solve your problem, please provide a few more details and I might be able to give a more actionable answer.

Accordant answered 29/4, 2016 at 1:17 Comment(0)

Just wanted to add my bit. The command below handles the punctuation characters listed above differently: instead of deleting them it replaces each one with a space, squeezes repeated spaces into a single one, and preserves your newline characters:

    tr -s "[:punct:]" " "

From the manual entry for -s:

Squeeze multiple occurrences of the characters listed in the last operand (either string1 or string2) in the input into a single instance of the character. This occurs after all deletion and translation is completed.
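A minimal sketch of the translate-then-squeeze behaviour on a made-up input line. It assumes an implementation such as GNU tr, which pads the one-character second set to the length of the first; POSIX leaves that case unspecified, so other implementations may behave differently.

```shell
# Translate every punctuation character to a space, then squeeze runs of spaces.
# Assumption: tr pads the single-space second set to match [:punct:] (GNU tr does).
squeezed=$(printf 'one,,two -- three\n' | tr -s '[:punct:]' ' ')

printf '%s\n' "$squeezed"   # one two three
```

Digits and non-ASCII bytes pass through unchanged here, so this would need to be combined with one of the deletion commands above to remove them too.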

Mentally answered 15/8, 2018 at 15:26 Comment(0)
