Importing data with special characters in R
The following picture shows the data before I import it into R (in Notepad) and after importing.

[Image: the data in Notepad (left) and after import into R (right)]

I use the following command to import it in R:

Data <- read.csv('data.csv',stringsAsFactors = FALSE,header = TRUE,quote = "")

As can be seen, special characters such as æ are replaced with something like Ã¦ (line 19 on the left, line 18 on the right). Is there a way to import the CSV file as it is, using R?

Kozhikode answered 13/11, 2015 at 14:35 Comment(6)
Have you tried install.packages("data.table"); library(data.table); fread()? - Brandnew
If you know the encoding, you can set it with the encoding argument of readLines(). - Scyphus
@Scyphus the data come from web scraping, so I guess they don't have a standard format. Is that right, or could they? - Kozhikode
@MpizosDimitris correct. Often you can check the encoding (depending on which browser you are using). Since it's not in English, you'd have to look up which encoding is most common for that language. If you can't figure this out, there is always the option of finding the patterns and just gsubbing. Maybe this helps: htmlpurifier.org/docs/enduser-utf8.html#findcharset - Scyphus
I had an issue not dissimilar to this a while back. Some of the suggestions I received may help narrow down the source of the error: https://mcmap.net/q/1316750/-reading-foreign-characters - Karie
Thanks for the answers. Following your suggestions about the encoding, I solved it by doing the following: Encoding(Data$Column) <- "UTF-8" - Kozhikode
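
The fix from the last comment can be sketched as follows. This is a minimal illustration, not the poster's actual data: the data frame `Data` and its `Column` are built by hand here from the two UTF-8 bytes of "æ", on the assumption that the imported strings hold UTF-8 bytes without an encoding mark.

```r
# Assumption: the column holds the UTF-8 bytes of "æ" (c3 a6) with no
# encoding mark, as can happen when read.csv() runs in a non-UTF-8 locale
x <- rawToChar(as.raw(c(0xc3, 0xa6)))
Data <- data.frame(Column = x, stringsAsFactors = FALSE)

# Declare that the bytes are UTF-8; this changes only the mark, not the bytes
Encoding(Data$Column) <- "UTF-8"

print(Encoding(Data$Column))
print(Data$Column == "\u00e6")
```

Marking only works when the bytes really are UTF-8. If the file was saved in another encoding, the strings need converting (for example with iconv()) rather than relabelling.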

Your problem is an encoding issue, and there are two aspects to it. First, the encoding that Notepad++ actually used to save the text file may not be the one you expect. Second, read.csv() may be reading the file under a different encoding. The second is especially likely here: using Notepad++ suggests you are on Windows, where you may be unable to use UTF-8 as the system locale for R.
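
To see how such a mismatch garbles text, here is a small sketch in base R. It assumes the file was saved as UTF-8 and then interpreted as Latin-1, which is one common way this kind of problem arises, though not necessarily what happened in the question.

```r
# "æ" written as a Unicode escape so the script does not depend on its own encoding
ae <- "\u00e6"

# In UTF-8, "æ" is stored as two bytes: c3 a6
utf8_bytes <- charToRaw(enc2utf8(ae))
print(utf8_bytes)

# Treat those two bytes as Latin-1 text and convert to UTF-8:
# each byte becomes a separate character, so one letter turns into two
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
print(misread)
print(nchar(misread))
```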

So taking each issue in turn:

  1. Getting Notepad++ to save your file in a specific encoding. You can set the encoding for a new file using these instructions. I always use UTF-8, but since your texts are Danish, Latin-1 should work too.

    To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. This will tell you something about the probable encoding of your file from the command line, although it is not perfect. (OS X and Linux users already have this without needing to install additional utilities.)

  2. Setting the encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "latin1" (R's name for Latin-1). You may also want to check what your system encoding is and match it. You can do this with Sys.getlocale() (and set it with Sys.setlocale()). On my system, for instance:

    > Sys.getlocale()
    [1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
    

    You could of course set this to Windows-1252, but you might then have trouble with portability to other platforms. UTF-8 is the most portable choice.
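
The two steps above can be sketched end to end in base R. The sample rows and the temporary file are hypothetical stand-ins for the real data.csv, and validUTF8() (available from R 3.3.0) is used here as an in-R alternative to the file utility.

```r
# Hypothetical sample data; "\u00e6" is "æ" and "\u00f8" is "ø"
lines <- c("id,name", "1,L\u00e6s\u00f8", "2,Aarhus")

# Write the file as UTF-8 bytes regardless of the current locale
path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(lines), path, useBytes = TRUE)

# Step 1: check whether each line of the file is valid UTF-8
raw_lines <- readLines(path, warn = FALSE)
print(validUTF8(raw_lines))

# Step 2: tell read.csv() that the strings it reads are UTF-8
dat <- read.csv(path, stringsAsFactors = FALSE, encoding = "UTF-8")
print(Encoding(dat$name))
print(dat$name[1] == "L\u00e6s\u00f8")
```

Note that encoding only marks the strings as UTF-8 or Latin-1 and does not convert anything. read.csv() also accepts a fileEncoding argument, which re-encodes the file's contents into the session's native encoding while reading.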

Beauty answered 13/11, 2015 at 16:32 Comment(0)

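In my case, setting only the encoding parameter (encoding = "latin1", which is how R spells Latin-1) was enough. Thanks.

# src (the directory) and x (the file name without extension) are defined elsewhere
read.csv(file.path(src, sprintf("%s.csv", x)), header = TRUE,
         stringsAsFactors = FALSE, encoding = "latin1")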
Anglesite answered 21/2, 2022 at 3:38 Comment(1)
This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review - Paterfamilias
