How to remove strange characters using gsub in R? [duplicate]

Asked 8/8, 2016 at 11:57 Answered 17/5, 2018 at 18:15

I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters like:

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> ðŸ˜œðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escape sequences at the end: \u009f, \u0098, etc.

I can't find the right command and regular expression to get rid of these. I've tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I also tried specifying the Unicode characters directly, but I believe they're being interpreted as plain text:

gsub('\u009', '', text) # Unchanged

Hockey asked 8/8, 2016 at 11:57

If you want to use regular expressions, you can keep only the characters you want by matching a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that ðŸ˜œðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

The printable ASCII characters run from x20 (SPACE) to x7E (~); see the table of ASCII codes at asciitable.com. The pattern above removes every character outside that range.
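One thing to be aware of with this range is that tab (x09) and newline (x0A) fall outside it, so they are removed too. Below is a minimal sketch of the same pattern wrapped in a helper; the function name `strip_non_printable` and its `keep_whitespace` argument are hypothetical, not part of the original answer.

```r
# Hypothetical helper (name and argument are illustrative only): drop every
# character outside printable ASCII, 0x20 (space) through 0x7E (~).
strip_non_printable <- function(x, keep_whitespace = FALSE) {
  # With keep_whitespace = TRUE, tab (0x09) and newline (0x0A) are kept as well.
  pattern <- if (keep_whitespace) "[^\t\n\x20-\x7E]" else "[^\x20-\x7E]"
  gsub(pattern, "", x)
}

x <- "cray cray \u263a\tleave it at that\n"

strip_non_printable(x)
# [1] "cray cray leave it at that"

strip_non_printable(x, keep_whitespace = TRUE)
# [1] "cray cray \tleave it at that\n"
```

Whether to keep tabs and newlines depends on the later processing steps; if the text is tokenised on whitespace afterwards, keeping them is probably the safer choice.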

Lanellelanette answered 17/5, 2018 at 18:15 Comment(1)

This is much better than iconv! – Vogue 5/10, 2021 at 1:43

The easiest way to get rid of these characters is to convert the text from UTF-8 to ASCII, dropping anything that cannot be converted:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
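As a small illustration of the `sub` argument used above, the sketch below runs `iconv` on two made-up strings (the sample data is an assumption, not from the question). If `sub` is left at its default, `iconv` returns `NA` for any string that cannot be converted, which is why `sub = ''` is needed here.

```r
# Made-up sample input containing non-ASCII characters and a tab.
x <- c("caf\u00e9 ok", "cray cray \u263a\tend")

# sub = "" silently drops every byte that has no ASCII equivalent.
clean <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# sub = "byte" instead leaves a visible <xx> hex marker for each such byte,
# which can help when checking what was removed.
marked <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

print(clean)
print(marked)
```

Unlike the `[^\x20-\x7E]` pattern, this conversion leaves ASCII control characters such as tabs and newlines in place, since they are valid ASCII.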

Hockey answered 8/8, 2016 at 11:57
