Reading Rdata file with different encoding
I have .RData files to read on my Linux (UTF-8) machine, and I know their text is in Latin-1 because I created them myself on Windows. Unfortunately, I no longer have access to the original files or to a Windows machine, and I need to read those files on my Linux machine.

To read an .RData file, the normal procedure is to run load("file.Rdata"). Functions such as read.csv have an encoding argument that you can use to solve this kind of issue, but load has no such thing. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:

Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")

What else can I do? My files load with text variables containing accents that get corrupted when opened in a UTF-8 environment.

Phototelegraphy answered 1/12, 2015 at 16:2 Comment(3)
RData files do not have encodings. You need to load the serialized .RData and then re-encode the values once they are inside the R workspace. If this remains unclear after reading ?Encoding, then do the load and post the output of dput(head(object)). – Espalier
@42, this seems to solve the problem; too bad I apparently need to apply Encoding(x) to each vector in my dataframe. I'll take a better look at it and get back to you. – Phototelegraphy
You can record the names in the workspace before and after the load and then work on the difference for the items that have character values. – Espalier
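
A minimal sketch of that "diff the workspace" approach, assuming the loaded objects are plain character vectors (names here are illustrative only; data-frame columns are handled in the answers below):

before <- ls()
load("file.Rdata")
# objects that appeared because of the load
new.objects <- setdiff(ls(), c(before, "before"))
for (nm in new.objects) {
  obj <- get(nm)
  if (is.character(obj)) {
    # declare the encoding; this does not change the underlying bytes
    Encoding(obj) <- "latin1"
    assign(nm, obj)
  }
}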
Thanks to 42's comment, I've managed to write a function to recode the file:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  # declare the encoding of every column (assumes all columns are character)
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}

The meat here is the command Encoding(df[, col]) <- "latin1", which declares the encoding of column col of dataframe df as Latin-1 (it doesn't change the underlying bytes; it just tells R how to interpret them). Unfortunately, Encoding only works on character vectors, not on whole data frames, so I had to write a function that sweeps every column of the dataframe and applies the transformation.

Of course, if your problem is in just a couple of columns, you're better off applying Encoding to those columns alone instead of to the whole dataframe (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading into Windows an R object created on Linux or macOS, you should use originalEncoding = "UTF-8".
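
For example, a small variant along those lines might look like this (the function name and column names are just illustrative):

fix.encoding.cols <- function(df, cols, originalEncoding = "latin1") {
  # only declare the encoding for the columns you name
  for (col in cols) Encoding(df[[col]]) <- originalEncoding
  return(df)
}

df <- fix.encoding.cols(df, c("name", "address"))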

Phototelegraphy answered 2/12, 2015 at 15:27 Comment(0)
Following up on the previous answers, this is a minor update that makes the function work on factors and on dplyr tibbles. Thanks for the inspiration.

library(dplyr)  # for as_data_frame()

fix.encoding <- function(df, originalEncoding = "UTF-8") {
  numCols <- ncol(df)
  df <- data.frame(df)
  for (col in 1:numCols) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- originalEncoding
    }
    if (is.factor(df[, col])) {
      Encoding(levels(df[, col])) <- originalEncoding
    }
  }
  return(as_data_frame(df))
}
Nariko answered 8/11, 2016 at 2:28 Comment(0)
Thank you for posting this. I took the liberty of modifying your function for the case where the dataframe has some character and some non-character columns. Otherwise, an error occurs:

> fix.encoding(adress)
Error in `Encoding<-`(`*tmp*`, value = "latin1") :
 a character vector argument expected

So here is the modified function:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- originalEncoding
    }
  }
  return(df)
}

However, this will not change the encoding of the level names of a "factor" column. Luckily, I found the following to change all factors in your dataframe to character (which may not be the best approach, but in my case it's what I needed):

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
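
If you'd rather keep the factors, the same idea can be applied to the levels in place (a sketch using the base-R approach that also appears in other answers here):

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], function(x) {
  # re-declare the encoding of the level names, keeping the factor class
  Encoding(levels(x)) <- "latin1"
  x
})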
Mattheus answered 21/10, 2016 at 7:38 Comment(0)
Another option using dplyr's mutate_if:

library(dplyr)

fix_encoding <- function(x) {
  Encoding(x) <- "latin1"
  return(x)
}

data <- data %>%
  mutate_if(is.character, fix_encoding)

And for factor variables that have to be recoded:

fix_encoding_factor <- function(x) {
  x <- as.character(x)
  Encoding(x) <- "latin1"
  x <- as.factor(x)
  return(x)
}
data <- data %>%
  mutate_if(is.factor, fix_encoding_factor)
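
As a side note, mutate_if is superseded in dplyr 1.0 and later; a sketch of the equivalent with across, assuming the same helper functions as above, would be:

data <- data %>%
  mutate(across(where(is.character), fix_encoding),
         across(where(is.factor), fix_encoding_factor))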
Instant answered 27/8, 2020 at 12:52 Comment(0)
This can also be a problem on Windows with files created in older versions of R (< 4.2). To avoid it, I use the following code to declare the encoding and then save the file again (so no reprocessing is needed later):

file <- "file.RData"
df.encoding <- "latin1"

# Load data.frame
df.name <- load(file) 
df <- get(df.name[1])

# Names
Encoding(names(df)) <- df.encoding

# Variable labels (if present)
if (!is.null(vlabels <- attr(df, "variable.labels"))) {
  Encoding(vlabels) <- df.encoding  
  Encoding(names(vlabels)) <- df.encoding
  attr(df, "variable.labels") <- vlabels  
}

# Character variables
vchar <- sapply(df, is.character)
df[vchar] <- lapply(df[vchar],  function(x) {
  Encoding(x) <- df.encoding
  x
})

# Factors
vcat <- sapply(df, is.factor)
df[vcat] <- lapply(df[vcat],  function(x) {
  Encoding(levels(x)) <- df.encoding
  x
})

# Save
assign(df.name[1], df)
save(list = df.name[1], file = file)
Longlived answered 9/10, 2023 at 17:7 Comment(0)
