How to source() .R file saved using UTF-8 encoding?

The following, when copied and pasted directly into R works fine:

> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."

However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:

> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") : 
  C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2: 
  ^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
  invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'

Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.

> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_2.12.1

and

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252
Calore answered 17/2, 2011 at 16:23 Comment(8)
Well, it seems to work well here. I run Linux with a UTF-8 locale. Maybe the problem comes from the locale on your system. Did you try changing it to a UTF-8 one? – Actinochemistry
Works on MacOS 10.6.6 as well. – Calve
@Actinochemistry How would I go about changing R on Windows to a UTF-8 locale? – Calore
Well, my knowledge of Windows is quite limited, but maybe you can take a look at the Sys.setlocale R function, and find some information in the R installation and administration guide: cran.r-project.org/doc/manuals/R-admin.html#Locales – Actinochemistry
@Actinochemistry - many thanks, but even after looking at that otherwise rather useful document, I can't see how to set it to a UTF-8 locale. – Calore
How did you create the file, and how do you know it's really in UTF-8 format? Do you know the characters in that file are correctly encoded? – Morganstein
@Morganstein The file was created in Notepad and saved by changing the encoding from ANSI to UTF-8. – Calore
@Morganstein I'm sure this is an R on Windows thing; it will work fine on Linux, I'm sure. The file I've been working with (you can see it in my answer) just came from copying some sample Unicode text from a website offering such a thing. These text editors (Notepad, Notepad2, Notepad++) can all encode UTF-8 easily enough. All this talk of locales seems bizarre to me (I'm just a Windows developer). On Windows you no longer worry about locales because we've stopped using the old ANSI API calls. Text is UTF-16LE and it all just works. I can't understand why there is a problem! – Seisin
B
32

We talked about this a lot in the comments to my previous post, but I don't want this to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, as the following shows:

The file "myfile.r" contains:

russian <- function() print ("Американские с...");

The console contains:

source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."

Note that reading the file in fails, and the error points to the same character as the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
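
If you would rather not leave the session in a different locale, the switch can be wrapped so that the previous setting is restored afterwards. This is only a sketch: with_ctype is a made-up helper name, not something R provides, and which locale names are accepted ("ru", "chinese", ...) depends on the operating system.

```r
# Hypothetical helper: run `code` with LC_CTYPE temporarily set to `locale`.
with_ctype <- function(locale, code) {
    old <- Sys.getlocale("LC_CTYPE")
    # Put the previous locale back when the function exits, even on error.
    on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
    new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
    # Sys.setlocale() returns an empty string when the request is refused.
    if (!nzchar(new)) stop("locale not available on this system: ", locale)
    # `code` is evaluated lazily, so it runs only after the locale is set.
    force(code)
}

# On Windows, something along these lines (file name taken from the question):
# with_ctype("chinese", source("character_test.R", encoding = "UTF-8"))
```

The explicit check on the return value is there because a refused locale otherwise only produces a warning, which is easy to miss.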

Bartholemy answered 21/2, 2011 at 13:53 Comment(9)
Many thanks, this worked! I used Sys.setlocale("LC_CTYPE","chinese") – Calore
Anytime, sir. ("chinese", not "Chinese"; interesting how inconsistent they are. Good that you found out.) – Bartholemy
How do you load a file that contains multiple languages? Something is wrong in R! – Seisin
You just switch the locale multiple times inside that file. I'm not sure the problem is with R; some commenters said that it's fine on Linux (without locale switching). It may be R, but it may be the Windows API (widechar instead of UTF-8) or a combination thereof. – Bartholemy
@David @eznme Just saw this on the official R-help list, in which Prof Ripley says something about UTF-8 locales on Windows: goo.gl/cUZCm – Calore
@Tony Prof. Ripley is talking out of his hat! Windows supports UTF-8 just fine. Windows has supported Unicode since 1991 and the reason it uses UTF-16 rather than UTF-8 as on Linux is that it supported Unicode before UTF-8 was even invented! My Windows app eats all these characters for breakfast. Locales should be irrelevant when you specify an encoding. I'm fingering iconv as the culprit here, but I'm afraid that if Prof. Ripley is taking that attitude then R on Windows has little hope of ever supporting Unicode properly. – Seisin
@eznme There just should be no need for locales. That might be how it's done on Linux but it makes no sense on Windows. You just use the WideChar versions of all the API functions, hold the text as LPWSTR, and convert to different encodings at the boundaries (file import/export). It's not that difficult, but I understand that it becomes more difficult if you want to support Linux and Windows from a single codebase! – Seisin
@eznme Of course I can't get this locale thing to go because I can't select the ru locale on my machine. What a mess! – Seisin
The solution doesn't work for me. If I have this in my R source: boxplot(weight~Diet,data=ChickWeight,subset = Time ==21,col = "yellow", main="Gewicht van kuikens in gram op dag 21 bij verschillende diëten", xlab="dieet", ylab="gewicht in gram", sub="bron:package datasets in R") I still get INCOMPLETE_STRING. Also, is there a way to make RStudio source in UTF-8 by default? – Chinchilla
B
40

On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
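
A rough way to see which characters fall outside a given code page is to try converting them with iconv(), which returns NA when the input cannot be represented in the target encoding. This only approximates what happens inside source(); the helper name representable and the choice of CP1252 (the code page shown in the question's l10n_info()) are assumptions for illustration.

```r
# Hypothetical helper: TRUE if every character of x exists in `codepage`.
representable <- function(x, codepage = "CP1252") {
    # iconv() yields NA for strings it cannot convert to the target encoding.
    !is.na(iconv(enc2utf8(x), from = "UTF-8", to = codepage))
}

# Strings are written with \u escapes so this snippet itself stays ASCII.
representable("Sk\u00f8nt")          # Danish word containing o-slash
representable("\u0410\u043c\u0435")  # first letters of the Russian sample
representable("R\u540c\u65f6")       # start of the Chinese sample
```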

This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:

eval(parse(filename, encoding="UTF-8"))

This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
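
Wrapped up as a function, that might look like the sketch below. source_utf8 and its envir argument are names made up for this example, and the default of evaluating in the global environment is my assumption about what you would usually want.

```r
# Hypothetical helper: evaluate a UTF-8 encoded R file without source().
source_utf8 <- function(file, envir = globalenv()) {
    # parse() with encoding = "UTF-8" marks the strings as UTF-8;
    # it does not re-encode the input into the current locale.
    exprs <- parse(file, encoding = "UTF-8")
    # Evaluating the expression vector runs each expression in turn;
    # the value of the last one is returned invisibly.
    invisible(eval(exprs, envir = envir))
}

# source_utf8("character_test.R")
# character_test()
```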

Belloir answered 7/4, 2011 at 22:40 Comment(3)
I confirm that this works. source() requires setting Sys.setlocale() all along the file. eval does the job without this requirement. – Sext
source forwards the encoding argument to file, which, in turn, converts the textual input in memory to whatever locale encoding is specified (and fails) – this seems to be the culprit. parse by contrast doesn’t do this, it reads the file as-is and just marks the bytes in memory with the correct encoding. – I’m not entirely sure what this tells us, except that R’s internal handling of encodings is messy (we already knew that), and should be fixed, backwards compatibility be damned. – Visitant
Is this still true in the latest R releases where UCRT is used to deal with the encoding in Windows? – Homeland
S
6

I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files with many non-ASCII characters in. But some characters cause it to fail. For example the following

danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")

is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.

Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?

Seisin answered 20/2, 2011 at 22:13 Comment(1)
I'm going to post my question to the official R-help list, just in case it really is an error of R on Windows. – Calore
C
6

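On Windows, I use the following:

source.utf8 <- function(f) {
    # Read the file as text, marking each line as UTF-8
    l <- readLines(f, encoding="UTF-8")
    # Parse the lines and evaluate them in the global environment
    eval(parse(text=l),envir=.GlobalEnv)
}

It works fine.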

Colquitt answered 27/9, 2014 at 13:47 Comment(0)
F
2

I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, merely setting "LC_CTYPE" to "chinese" was not enough, but setting "LC_ALL" to "chinese" worked well.

Note that getting the encoding right is not enough when you read or write a plain text file with non-ASCII characters in RStudio (or R?). The locale setting counts too.

PS. The command is Sys.setlocale(category = "LC_ALL", locale = "chinese"). Please replace the locale value as appropriate.

Feldt answered 21/5, 2015 at 4:20 Comment(0)
S
2

Building on crow's answer, this solution makes RStudio's Source button work.

When you hit that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:

source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding=encoding)
    eval(parse(text=l),envir=.GlobalEnv)
}

You can then add that script to an .Rprofile file, so it will execute on startup.
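
One thing to watch with this override is that it drops source()'s other arguments. A sketch of a variant that keeps local working and hands any non-UTF-8 call back to base::source() is below. The fallback logic is my own assumption about what is sensible here, not something R or RStudio prescribes, and extra arguments such as echo are silently ignored on the UTF-8 path.

```r
# Sketch: mask base::source() so UTF-8 files are parsed without locale conversion.
source <- function(file, local = FALSE, encoding = "UTF-8", ...) {
    # Resolve `local` the way base::source() documents it:
    # an environment, TRUE (the caller's environment) or FALSE (global).
    if (is.environment(local)) {
        envir <- local
    } else if (isTRUE(local)) {
        envir <- parent.frame()
    } else {
        envir <- globalenv()
    }
    # Anything that is not UTF-8 is handed back to the original function.
    if (!toupper(encoding) %in% c("UTF-8", "UTF8")) {
        return(base::source(file, local = envir, encoding = encoding, ...))
    }
    # parse() marks the strings as UTF-8 instead of re-encoding them.
    invisible(eval(parse(file, encoding = "UTF-8"), envir = envir))
}
```

Resolving the environment before the fallback matters: passing local = TRUE straight through would make base::source() evaluate in the wrapper's own frame rather than in the caller's.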

Simonson answered 6/3, 2018 at 9:11 Comment(1)
The readLines call is redundant. See Joe Cheng’s answer. Furthermore, when replacing the source function it’s a good idea to handle the remaining arguments, e.g. local, correctly. – Visitant
B
1

On Windows, when you copy-paste a Unicode or UTF-8 encoded string into a text control that is set to single-byte input (ASCII... depending on the locale), the unknown bytes will be replaced by question marks. If I take the first 5 characters of your string, copy-paste them into e.g. Notepad and then save the file, it becomes, in hex:

52 3F 3F 3F 3F

What you have to do is find an editor that you can set to UTF-8 before copy-pasting the text into it. The saved file (of your first 5 characters) then becomes:

52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB

This will then be recognized as valid UTF-8 by R.

I used "Notepad2" for trying this, but I am sure there are many more.
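
You can also check the bytes from within R instead of using a hex viewer. A minimal sketch, assuming the file sits at a path you supply; first_bytes and has_utf8_bom are helper names invented here, and the BOM check is included because some editors write the three bytes EF BB BF in front of UTF-8 text.

```r
# Hypothetical helper: return the first n bytes of a file as a raw vector.
first_bytes <- function(path, n = 16L) {
    readBin(path, what = "raw", n = n)
}

# Hypothetical helper: TRUE if the file starts with a UTF-8 byte-order mark.
has_utf8_bom <- function(path) {
    identical(first_bytes(path, 3L), as.raw(c(0xEF, 0xBB, 0xBF)))
}

# The bytes a correctly saved file should start with, built from \u escapes
# so that this snippet itself stays ASCII (R plus the first four Chinese characters).
expected <- charToRaw(enc2utf8("R\u540c\u65f6\u4e5f\u88ab"))
print(expected)
# [1] 52 e5 90 8c e6 97 b6 e4 b9 9f e8 a2 ab

# identical(first_bytes("character_test.R", 13L), expected)
```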

Bartholemy answered 20/2, 2011 at 22:8 Comment(15)
I just tried WinEdt (for which there is an often used R plugin, RWinEdt) and it does not work (version 5.5). So, you might want to try it with "Notepad2" first. You can also write the UTF-8 text file yourself using R's writeChar(); I think it uses the encoding you set in Sys.setlocale(). – Bartholemy
It doesn't matter which text editor writes the file, they can all write the file correctly, R on Windows just fails to read it. – Seisin
@David Heffernan The problem the original poster is having is different from yours. Yes, R can read UTF-8 files, but the way his editor is set up doesn't even create a UTF-8 file. He uses an editor that is not set to UTF-8 mode and thus if he copies "R同时也" into it, the file becomes the bytes [52 3F 3F 3F] "R???". – Bartholemy
@eznme I don't think so. OP states that the file is saved with UTF-8 encoding. I save the same file with UTF-8 encoding (or indeed UTF-16) and get the same error. The problem is with R. – Seisin
@eznme Just take a look at my answer and try to get R to source the file with the Russian in! – Seisin
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.") russian() [1] "Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями." – Bartholemy
To do that use: Sys.setlocale("LC_CTYPE","ru") – Bartholemy
@eznme Cheers, but as @David says, my file was originally saved in Notepad, set to UTF-8 format mode. I installed Notepad2 to try it out (quite nice, thanks for mentioning it, didn't know about it before), changed it to UTF-8 and still have the same issue. – Calore
@Tony Notepad2 is nice, Notepad++ is even nicer! – Seisin
@eznme @Tony What does locale have to do with anything? It's just a file read. Anyway, my machine says "OS reports request to set locale to "ru" cannot be honored". How did you get it to work? – Seisin
@David I actually agree with you in that the locale shouldn't matter because I'm specifically telling R to read in the file as UTF-8 encoding, but I'm not an R expert and so am very willing to try different things out if they work. I get the same "cannot be honored" message as you. Also, just downloaded Notepad++ and very nice it is too! – Calore
@Tony Really, how can this be anything other than a bug in R, as I suggest in my answer? – Seisin
In my screenshot you can see that when I set the locale to "ru" the Russian text displays correctly; when I set it to "German" it does not. – Bartholemy
@eznme I don't see you calling source on a UTF-8 file with that text in it in that screenshot. That's what doesn't work. The use of locales you are illustrating is for dealing with 8-bit character sets. A modern Unicode program uses Unicode text and so locales are only used for things like date/time/number formatting preferences. – Seisin
Yes. R 3.1.1 also can't do source(file, encoding="UTF-8") for Russian. – Colquitt
