Invalid multibyte string in read.csv

I am trying to import a csv that is in Japanese. This code:

url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)

returns the following error:

Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) : 
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>񓙂̏󋵁@(<8f>T<8e><9f><81>E<8e>w<92><e8><95>񍐋@<8a>փx<81>[<83>X<81>j'

I tried changing the encoding (Encoding(url) <- 'UTF-8' and also to latin1) and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?

Dovev answered 16/1, 2013 at 16:29 Comment(2)
Have you tried setting the argument encoding = "UTF-8" in read.csv()? – Wag
Yes, with the same result. – Dovev

Encoding sets the encoding of a character string. It doesn't set the encoding of the file represented by the character string, which is what you want.

This worked for me, after trying "UTF-8":

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")

And you may want to skip the first 16 lines, and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.

x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
  fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])    # get rid of odd characters
x[,-1] <- as.data.frame(lapply(x[,-1],  # convert to numbers
  function(d) type.convert(gsub(d, pattern=",", replacement=""), as.is=TRUE)))
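The Encoding-versus-fileEncoding distinction can be seen in a small self-contained experiment (the temporary file and sample text below are made up for illustration): Encoding() only re-tags an R character string, while fileEncoding tells read.csv how to decode the bytes inside the file.

```r
# Write a tiny CSV as latin1 bytes, then read it back.
tmp <- tempfile(fileext = ".csv")
writeLines(iconv("name\ncaf\u00e9", from = "UTF-8", to = "latin1"),
           tmp, useBytes = TRUE)

# Encoding() on the *path string* changes nothing about the file's bytes:
Encoding(tmp) <- "UTF-8"

# fileEncoding is what actually decodes the file's contents:
x <- read.csv(tmp, fileEncoding = "latin1", stringsAsFactors = FALSE)
x$name[1]   # "café", decoded correctly
```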
Ephesian answered 16/1, 2013 at 16:39 Comment(3)
Thanks. From this question I tried setting the locale to Japanese with Sys.setlocale, but that didn't work either ("OS reports request to set locale to "japanese" cannot be honored"). – Dovev
Yes, read.csv("foobar.csv", fileEncoding = "latin1") worked for me. I had an Excel file, saved it as CSV, then had to set fileEncoding to "latin1" to read that CSV in R. – Flournoy
@Joshua Ulrich, what if my code looks like this? file.list <- list.files(pattern = '*.txt'); file.list <- file.list[order(nchar(file.list), file.list)]; df.list <- lapply(file.list, read_file); df_virgi <- do.call(rbind.data.frame, df.list) Where shall I place fileEncoding = "latin1"? Thanks a lot! – Barbabas
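Regarding the last comment: fileEncoding belongs inside each individual read call, which with lapply() means passing it through the ... arguments. A sketch assuming plain read.csv as the reader and made-up file names, since the commenter's read_file helper isn't shown:

```r
# Hypothetical demo directory with two small .txt CSV files
dir <- file.path(tempdir(), "enc_demo")
dir.create(dir, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(dir, "f1.txt"))
writeLines("a,b\n3,4", file.path(dir, "f2.txt"))

file.list <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
file.list <- file.list[order(nchar(file.list), file.list)]

# Extra arguments to lapply() are forwarded to read.csv for every file:
df.list <- lapply(file.list, read.csv,
                  fileEncoding = "latin1", stringsAsFactors = FALSE)
df <- do.call(rbind.data.frame, df.list)
df   # two rows, one per file
```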

You may have encountered this issue because of an incompatible system locale. Try setting the locale with:

Sys.setlocale("LC_ALL", "C")

Schoolroom answered 12/4, 2015 at 5:27 Comment(0)

The readr package from the tidyverse might help.

You can set the encoding via the locale argument of the read_csv() function, using the locale() function and its encoding argument:

read_csv(file = "http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv",
         skip = 14,
         locale = locale(encoding = "latin1"))
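When the encoding is unknown, readr can also make an educated guess with readr::guess_encoding(), which ranks candidate encodings by confidence (the sample file below is made up for the demo):

```r
library(readr)

# Hypothetical sample: a few latin1-encoded accented words in a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(iconv("r\u00e9gion\n\u00e9t\u00e9", from = "UTF-8", to = "latin1"),
           tmp, useBytes = TRUE)

guess_encoding(tmp)   # a tibble of candidate encodings with confidence scores
```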
Judsen answered 30/5, 2017 at 11:31 Comment(0)

The simplest solution I found for this issue, without losing any data or special characters (for example, with fileEncoding="latin1", characters like the euro sign € are lost), is to open the file first in a text editor like Sublime Text and use "Save with encoding - UTF-8".

Then R can import the file with no issue and no character loss.
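The euro sign is a Windows-1252 character rather than a true latin1 (ISO-8859-1) one, which is likely why it disappears with fileEncoding="latin1". Reading with fileEncoding = "cp1252" should keep it, as this sketch with a temporary file shows:

```r
# Write "10€" as Windows-1252 bytes (the euro is byte 0x80 there)
tmp <- tempfile(fileext = ".csv")
writeLines(iconv("price\n10\u20ac", from = "UTF-8", to = "CP1252"),
           tmp, useBytes = TRUE)

# Decoding as cp1252 instead of latin1 preserves the euro sign:
x <- read.csv(tmp, fileEncoding = "cp1252", stringsAsFactors = FALSE)
x$price[1]   # "10€"
```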

Involution answered 18/12, 2018 at 14:20 Comment(0)

For those using Rattle with this issue, here is how I solved it:

  1. First make sure to quit Rattle so you're at the R command prompt
  2. > library(rattle) (if not done so already)
  3. > crv$csv.encoding="latin1"
  4. > rattle()
  5. You should now be able to carry on, i.e. import your csv > Execute > Model > Execute etc.

That worked for me, hopefully that helps a weary traveller

Larghetto answered 17/4, 2015 at 0:20 Comment(0)

I had a similar problem with scientific articles and found a good solution here: http://tm.r-forge.r-project.org/faq.html

By using the following line of code:

tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

you convert any invalid multibyte bytes into visible hex escapes. I hope this helps.
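The same substitution works outside tm as well; sub = "byte" is a plain iconv() option that replaces each non-convertible byte with its <hex> escape. A small demo with a deliberately invalid byte:

```r
# Build a string containing the invalid byte 0xb0 between "a" and "b"
bad <- rawToChar(as.raw(c(0x61, 0xb0, 0x62)))

iconv(bad, from = "UTF-8", to = "UTF-8", sub = "byte")
# the invalid byte comes back as the visible escape: "a<b0>b"
```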

Ideally answered 20/9, 2015 at 7:19 Comment(0)

If the file you are trying to import into R was originally an Excel file, open the original file and save it as a CSV. That fixed this error for me when importing into R.

Adermin answered 21/2, 2017 at 20:2 Comment(0)

I had the same error and tried all the above to no avail. The issue vanished when I upgraded from R 3.4.0 to 3.4.3, so if your R version is not up to date, update it!

Vermiculite answered 9/3, 2018 at 11:11 Comment(0)

I came across this error (invalid multibyte string 1) recently, but my problem was a bit different:

We had forgotten to save a csv.gz file with an extension, and tried to use read_csv() to read it. Adding the extension solved the problem.
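readr's read_csv() appears to decide whether to decompress from the file extension, which is why the missing .gz suffix mattered. One workaround when renaming isn't an option is to hand the reader an explicit gzfile() connection; a sketch with a temporary, extension-less file:

```r
tmp <- tempfile()              # note: no .gz extension
con <- gzfile(tmp, "w")        # but the contents are gzip-compressed
writeLines("a,b\n1,2", con)
close(con)

# An explicit connection bypasses extension-based detection:
x <- read.csv(gzfile(tmp))
x
```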

Localize answered 1/1, 2020 at 12:49 Comment(0)

To reproduce the read.csv error on a multibyte character repeatably:

R's read.csv() will puke on all multi-byte characters if it is expecting a number.

I'm using Version: R version 4.2.1 (2022-06-23)

Put this data in file named: /tmp/foo.csv

#year,someval 
2022,0.1389 
2021,0.0000°
2020,0.2857

If you look closely you can see the 0.0000 value on the 2021 line has a 'degree' symbol appended to it.

Load it this way using read.csv:

> read.csv('/tmp/foo.csv')

Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec,  : 
  invalid multibyte string at '<b0>0'
Calls: read.csv -> read.table -> type.convert -> type.convert.default
Execution halted

What does cat have to say about that guff:

$ cat /tmp/foo.csv 
#year,someval
2022,0.1389
2021,0.0000�
2020,0.2857

We do not tolerate that "Degrees" symbol. Changing the encoding does nothing to help. You could try telling read.csv to interpret everything as a string, but now you've got string to number conversion issues downstream.
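One hedged workaround for a file like this: give read.csv an encoding that accepts the stray byte, read everything as character, and strip non-numeric characters before converting. The sample file is rebuilt in a temp location for the demo:

```r
# Recreate /tmp/foo.csv, with the degree sign written as a latin1 byte
tmp <- tempfile(fileext = ".csv")
writeLines(c("#year,someval",
             "2022,0.1389",
             iconv("2021,0.0000\u00b0", from = "UTF-8", to = "latin1"),
             "2020,0.2857"),
           tmp, useBytes = TRUE)

# latin1 accepts any single byte, so no "invalid multibyte string" error;
# colClasses = "character" defers the number conversion until after cleanup
x <- read.csv(tmp, fileEncoding = "latin1", colClasses = "character")
x$someval <- as.numeric(gsub("[^0-9.-]", "", x$someval))
x$someval   # 0.1389 0.0000 0.2857
```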

What does read.csv2 have to say?:

> read.csv2('/tmp/foo.csv')
  X.year.someval
1    2022,0.1389
2 2021,0.000\xb0
3    2020,0.2857

https://www.codetable.net/hex/b0

Nominative answered 4/3, 2023 at 21:4 Comment(0)
H
0

Did you use copy-paste to create the CSV file? I had the same error and successfully tried the most popular solution from this thread (fileEncoding="latin1"). After I re-saved the data frame to a CSV file, I found that some cells had an extra space after the cell value (encoded as an A-tilde, Ã). I removed these spaces in the original file and was then able to read it without fileEncoding="latin1" and without any error.

Hydrothorax answered 22/3, 2023 at 5:7 Comment(1)
This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. – From Review (Rebatement)

I had this problem with a DBI connection while reading a SQL file with read_lines, but the file seems to have had nothing to do with it. Refreshing my SQL connection (re-connecting) solved the issue.

I have no idea why it behaves so strangely.

Sys.info()
       sysname        release        version             machine 
     "Windows"       "10 x64"  "build 19044"             "x86-64" 
Ammonic answered 8/5, 2023 at 18:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.