Delete non-UTF characters from a string in Ruby?

T

8

41

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that byte so that the string becomes valid UTF-8.

This:

text = x = "foo\xC2bar"
text.gsub!(/\xC2/, '')

returns an error:

incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.

Twoway answered 27/8, 2012 at 18:27 Comment(1)

You might find #11375842 useful – Continuant 27/8, 2012 at 22:14

B

121

You can use encode for that:

text.encode('UTF-8', :invalid => :replace, :undef => :replace)

Or text.scrub

By default, each invalid byte is replaced with U+FFFD, the replacement character (usually displayed as a question mark box). See the Ruby docs for more info.
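A minimal sketch of both calls, assuming Ruby 2.1 or later (where String#scrub is available). As far as I know, older versions could treat a UTF-8 to UTF-8 encode as a no-op, which may explain reports that the encode approach did not work for some people. The variable names are only for illustration.

```ruby
# force_encoding keeps the example independent of the source file's encoding.
text = "foo\xC2bar".dup.force_encoding(Encoding::UTF_8)

# \xC2 is a lead byte with no continuation byte, so the string is invalid.
valid_before = text.valid_encoding?

# Replace each invalid byte with U+FFFD (the default replacement).
replaced = text.scrub

# Drop invalid bytes entirely by passing an empty replacement string.
removed = text.scrub("")

# Same idea through encode; :replace overrides the default U+FFFD.
encoded = text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")

puts replaced.inspect
puts removed
puts encoded
```

Passing an empty replacement is what actually deletes the bad bytes; without it the string becomes valid but still contains a placeholder character.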

Brigittebriley answered 27/8, 2012 at 20:48 Comment(2)

Every time you see that you got 10 points from this answer you must know how much head banging against a desk you've just saved someone. – Drown 12/3, 2015 at 18:12

Yup. Here, 10 more points for you. – Unsociable 17/9, 2021 at 16:6

R

11

You could do it like this:

# encoding: utf-8

class String
  # Returns a copy of the string with invalid characters dropped.
  def validate_encoding
    chars.select(&:valid_encoding?).join
  end
end

puts "testing\xC2 a non UTF-8 string".validate_encoding
# => testing a non UTF-8 string

Repression answered 27/8, 2012 at 19:32 Comment(3)

.select(&:valid_encoding?) instead of .collect{} is a lot shorter. – Levania 27/8, 2012 at 20:48

You're right, ephemient, and it stays comprehensible. Thanks, I've adapted my answer. – Repression 27/8, 2012 at 21:6

This actually works, unlike the most voted answer. – Staley 15/11, 2021 at 19:37

P

7

Your text has ASCII-8BIT encoding, so you should use this instead:

text.delete!("^\u{0000}-\u{007F}")

It serves the same purpose, although it deletes every non-ASCII character, not only the invalid bytes.

Pierette answered 23/3, 2017 at 14:24 Comment(2)

I ended up using mystring.delete() since delete! returns nil if the string was not modified, see apidock.com/ruby/v2_5_5/String/delete%21 – Exogenous 7/1, 2023 at 8:39

Removes characters with diacriticals and stuff like that, but works :) – Historied 28/2, 2023 at 15:25

L

5

You can use /n, as in

text.gsub!(/\xC2/n, '')

to force the Regexp to operate on bytes.

Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
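A hedged sketch of one way to make the byte-level regexp usable: match against a binary copy of the string (String#b, assuming Ruby 2.0 or later) and relabel the result as UTF-8 afterwards. The second half illustrates the caveat above with U+00A9, whose UTF-8 form is \xC2\xA9.

```ruby
# force_encoding keeps the example independent of the source file's encoding.
text = "foo\xC2bar".dup.force_encoding(Encoding::UTF_8)

# String#b returns a binary (ASCII-8BIT) copy, so the /n regexp can match bytes.
cleaned = text.b.gsub(/\xC2/n, "").force_encoding(Encoding::UTF_8)

# Caveat: stripping \xC2 from a valid character leaves a stray \xA9 behind.
copyright = "\u00A9".b.gsub(/\xC2/n, "").force_encoding(Encoding::UTF_8)
broken = !copyright.valid_encoding?

puts cleaned
puts broken
```

This removes one specific byte value everywhere, so it only makes sense when you know that byte never appears as part of a valid character in your data.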

Levania answered 27/8, 2012 at 19:24 Comment(1)

This gives me incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string) – Twoway 28/8, 2012 at 20:15

V

4

Try Iconv

1.9.3p194 :001 > require 'iconv'
# => true 
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string" 
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290> 
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"

Viveca answered 27/8, 2012 at 20:25 Comment(2)

one note: Iconv is (will be) deprecated from Rails 3.2 in favour of String#encode – Toothpaste 21/2, 2013 at 14:42

Appears Iconv was deprecated in ruby 1.9 FWIW...guess it's just a gem now for 2.x – Historied 28/2, 2023 at 16:2

J

3

The best solution to this problem that I've found is this answer to the same question: https://mcmap.net/q/392251/-is-there-a-way-in-ruby-1-9-to-remove-invalid-byte-sequences-from-strings.

In short: "€foo\xA0".chars.select(&:valid_encoding?).join

Jawbone answered 17/12, 2015 at 14:37 Comment(0)

C

0

Use String#encode with the :replace option set to an empty string to return a copy without the invalid characters:

'MyString'.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')

Or use the bang version to modify the string in place:

'MyString'.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')

Ruby Doc

Cabaret answered 11/6 at 17:35 Comment(0)

T

-2

data = '' if not (data.force_encoding("UTF-8").valid_encoding?)
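The line above throws the whole string away when it is invalid. If the goal is to keep the valid part, a sketch along these lines may be closer, assuming Ruby 2.1 or later for String#scrub; repair_utf8 is a hypothetical helper name, not something Ruby provides.

```ruby
# repair_utf8 is a hypothetical helper, named here only for illustration.
def repair_utf8(data)
  # Work on a copy so the caller's string keeps its original encoding label.
  text = data.dup.force_encoding(Encoding::UTF_8)
  # Only scrub when needed; valid input is returned unchanged.
  text.valid_encoding? ? text : text.scrub("")
end

kept = repair_utf8("abc\xC2".b)
untouched = repair_utf8("plain ascii")

puts kept
puts untouched
```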

Tug answered 11/10, 2014 at 7:41 Comment(1)

This doesn't actually repair a string? – Historied 28/2, 2023 at 15:27
