Getting an SPSS data file into R
Asked Answered
S

5

5

At my company, we are thinking of gradually phasing out SPSS in choice of R. During the transition though we'll still be having the data coming in SPSS data file format (.sav).

I'm having issues importing this SPSS datafile into R. When I import an SPSS file into R, I want to retain both the values and value labels for the variables. The read.spss() function from foreign package gives me option to retain either values OR value labels of a variable but not both.

AFAIK, R does allow factor variables to have values (levels) and value labels (level labels). I was just wondering if it's possible to somehow modify the read.spss() function to incorporate this.

Alternatively, I came across spss.system.file() function from memisc package which supposedly allows this to happen, but it asks for a separate syntax file (codes.file), which is not necessarily available to me always.

Here's a sample data file.

I'd appreciate any help resolving this issue.

Thanks.

Soemba answered 30/1, 2013 at 14:51 Comment(10)
Please provide an example data file.Exocentric
As requested, an example data file would be useful. According to read.spss, "There may be attributes "label.table" and "variable.labels". Attribute "label.table" is a named list of value labels with one element per variable, either NULL or a names character vector. Attribute "variable.labels" is a named character vector with names the short variable names and elements the long names."Necolenecro
Not sure if it would serve your purpose, but I usually just use as.numeric(varX) if I need to use the labeled variable as number.Schulte
you are looking for the r function spss.get() in the Hmisc packageInterinsurance
Have added a link to sample data file. spss.get() from Hmisc package internally uses read.spss(), so not useful.Soemba
I just found out that foreign::read.spss("C:/temp/data_expt.sav") converts the variables with value labels in SPSS data to R factors. I did not know that.Bulla
@Bulla Yes, I had already mentioned that in the question.. :)Soemba
@Soemba Than what the question is about? The factors store both - values and labels (try as.numeric() and as.character() on a factor). And this becomes a wrong statement "gives me option to retain either values OR value labels".Bulla
@steadyfish, I understood where the problem is. The original values are lost when a variable is converted to a factor, for example as.numeric(factor(4:6)) == 1:3.Bulla
And it makes R factors quite useless for storing information. Would like to have SPSS value labels feature available in R.Bulla
B
2

There is a solution to read SPSS data file in R by ODBC driver.

1) There is a IBM SPSS Statistics Data File Driver. I could not find the download link. I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.

2) Create a DSN for the SPSS data driver.

3) Using RODBC package you can read in R any SPSS data file. It will be possible to get value labels for each variable as separate tables. Then it is possible to use the labels in R in any way as you wish.

Here is a working example on Windows (I do not have SPSS on my computer now) to read in R your example data file. I have not testted this on Linux. It probably works also on Linux, because there is a SPSS data driver also for Linux.

require(RODBC)

# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")

# List of tables
Tables <- sqlTables(con)
Tables

# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]

# Function to query a table by name
sqlQuery.tab.name <- function(table) {
  sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}

# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)

# See the data
lapply(Data, head)

# Close connection
close(con)

For example we can that value labels are defined for two variables:

[[5]]
  VAR00002 VAR00002_label
1        1           Male
2        2         Female

[[6]]
  VAR00003 VAR00003_label
1        2        Student
2        3       Employed
3        4     Unemployed

Additional information

Here is a function that allows to read SPSS data after the connection has been made to the SPSS data file. The function allows to specify the list of variables to be selected. If value.labels=T the selected variables with value labels in SPSS data file are converted to the R factors with labels attached.

I have to say I am not satisfied with the performance of this solution. It work good for small data files. The RAM limit is reached quite often for large SPSS data files (even the subset of variables is selected).

get.spss <- function(channel, variables = NULL, value.labels = F) {

  VarNames <- sqlQuery(channel = channel,
    query = "SELECT VarName FROM [Variables]", as.is = T)$VarName

  if (is.null(variables)) variables <- VarNames else {
    if (any(!variables %in% VarNames)) stop("Wrong variable names")
  }

  if (value.labels) {
    ValueLabelTableName <- sqlQuery(channel = channel,
      query = "SELECT VarName FROM [Variables]
                                    WHERE ValueLabelTableName is not null",
      as.is = T)$VarName
    ValueLabelTableName <- intersect(variables, ValueLabelTableName)
  }

  variables <- paste(variables, collapse = ", ")

  data <- sqlQuery(channel = channel,
    query = paste("SELECT", variables, "FROM [Cases]"),
    as.is = T)

  if (value.labels) {
    for (var in ValueLabelTableName) {
      VL <- sqlQuery(channel = channel,
        query = paste0("SELECT * FROM [VLVAR", var,"]"),
        as.is = T)
      data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
    }
  }

  return(data)
}
Bulla answered 26/2, 2013 at 15:19 Comment(0)
C
4

I do not know how to read in SPSS metadata; I usually read .csv files and add metadata back, or write a small one-off PERL script to do the job. What I wanted to mention is that a recently published R package, Rz, may assist you with bringing SPSS data into R. I have had a quick look at it and seems useful.

Connivent answered 30/1, 2013 at 17:15 Comment(0)
C
2

My work is going through the same transition.

read.spss() returns the variable labels as an attribute of the object you create with it. So in the example below I have a data frame called rvm which was created by read.spss() with to.data.frame=TRUE. It has 3,500 variables with short names a1, a2 etc but long labels for each variable in SPSS. I can access the variable labels by

cbind(attributes(rvm)$variable.labels)

which returns a list of all 3,500 variables full names up to

…
x23      "Other Expenditure Uncapped Daily Expenditure In Region"          
x24      "Accommodation Expenditure In Region"                             
x25      "Food/Meals/Drink Expenditure In Region"                          
x26      "Local Transport Expenditure In Region"                           
x27      "Sightseeing/Attractions Expenditure In Region"                   
x28      "Event/Conference Expenditure In Region"                          
x29      "Gambling/Casino Expenditure In Region"                           
x30      "Gifts/Souvenirs Expenditure In Region"                           
x31      "Other Shopping Expenditure In Region"                            
x0       "Accommodation Daily Expenditure In Region"  

What to do with these is another matter, but at least I have them, and if I want I can put them in some other object for safekeeping, searching with grep, etc.

Carrell answered 30/1, 2013 at 20:29 Comment(3)
I guess the question is about value labels, not variable labels.Bulla
Ah, yes. Should I delete my answer?Carrell
Do not delete it. It is useful to know.Bulla
B
2

There is a solution to read SPSS data file in R by ODBC driver.

1) There is a IBM SPSS Statistics Data File Driver. I could not find the download link. I got it from my SPSS provider. The Standalone Driver is all you need. You do not need SPSS to install or use the driver.

2) Create a DSN for the SPSS data driver.

3) Using RODBC package you can read in R any SPSS data file. It will be possible to get value labels for each variable as separate tables. Then it is possible to use the labels in R in any way as you wish.

Here is a working example on Windows (I do not have SPSS on my computer now) to read in R your example data file. I have not testted this on Linux. It probably works also on Linux, because there is a SPSS data driver also for Linux.

require(RODBC)

# Create connection
# Change the DSN name and CP_CONNECT_STRING according to your setting
con <- odbcDriverConnect("DSN=spss_ehsis;SDSN=SAVDB;HST=C:\\Program Files\\IBM\\SPSS\\StatisticsDataFileDriver\\20\\Standalone\\cfg\\oadm.ini;PRT=StatisticsSAVDriverStandalone;CP_CONNECT_STRING=C:\\temp\\data_expt.sav")

# List of tables
Tables <- sqlTables(con)
Tables

# List of table names to extract
table.names <- Tables$TABLE_NAME[Tables$TABLE_SCHEM != "SYSTEM"]

# Function to query a table by name
sqlQuery.tab.name <- function(table) {
  sqlQuery(con, paste0("SELECT * FROM [", table, "]"))
}

# Retrieve all tables
Data <- lapply(table.names, sqlQuery.tab.name)

# See the data
lapply(Data, head)

# Close connection
close(con)

For example we can that value labels are defined for two variables:

[[5]]
  VAR00002 VAR00002_label
1        1           Male
2        2         Female

[[6]]
  VAR00003 VAR00003_label
1        2        Student
2        3       Employed
3        4     Unemployed

Additional information

Here is a function that allows to read SPSS data after the connection has been made to the SPSS data file. The function allows to specify the list of variables to be selected. If value.labels=T the selected variables with value labels in SPSS data file are converted to the R factors with labels attached.

I have to say I am not satisfied with the performance of this solution. It work good for small data files. The RAM limit is reached quite often for large SPSS data files (even the subset of variables is selected).

get.spss <- function(channel, variables = NULL, value.labels = F) {

  VarNames <- sqlQuery(channel = channel,
    query = "SELECT VarName FROM [Variables]", as.is = T)$VarName

  if (is.null(variables)) variables <- VarNames else {
    if (any(!variables %in% VarNames)) stop("Wrong variable names")
  }

  if (value.labels) {
    ValueLabelTableName <- sqlQuery(channel = channel,
      query = "SELECT VarName FROM [Variables]
                                    WHERE ValueLabelTableName is not null",
      as.is = T)$VarName
    ValueLabelTableName <- intersect(variables, ValueLabelTableName)
  }

  variables <- paste(variables, collapse = ", ")

  data <- sqlQuery(channel = channel,
    query = paste("SELECT", variables, "FROM [Cases]"),
    as.is = T)

  if (value.labels) {
    for (var in ValueLabelTableName) {
      VL <- sqlQuery(channel = channel,
        query = paste0("SELECT * FROM [VLVAR", var,"]"),
        as.is = T)
      data[, var] <- factor(data[, var], levels = VL[, 1], labels = VL[, 2])
    }
  }

  return(data)
}
Bulla answered 26/2, 2013 at 15:19 Comment(0)
I
1

Since you have SPSS available, I recommend installing the "Essentials for R" plugin (free of charge, but you need to register, also see the installation instructions) which allows you to run R within SPSS. The plugin includes an R package with functions that transfer the active SPSS data frame to R (and back) - including labeled factor levels, dates, German umlauts - details that are otherwise notoriously difficult. In my experience, it is more reliable than R's own foreign package.

Once you have everything set up, open the data in SPSS, and run something like the following code in the syntax window:

begin program r.
myDf <- spssdata.GetDataFromSPSS(missingValueToNA=TRUE,
                                 factorMode="labels",
                                 rDate="POSIXct")
save(myDf, file="d:/path/to/your/myDf.Rdata")
end program.

Essentials for R plugin link (apparently breaks markdown link syntax):

https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM®%20SPSS®%20Statistics?lang=en
Interlay answered 30/1, 2013 at 19:41 Comment(1)
I believe this is not an option in a long run -- "we are thinking of gradually phasing out SPSS"Bulla
N
0

Nowadays, the package haven provides the functionality to achieve what you want (and much more).

The function read_sav() can import *.sav and *.zsav files and returns a tibble. The variable labels are automatically stored in the labels attribute of the corresponding variables within that tibble. The class labelled preserves the original semantics and allows us to associate arbitrary labels with numeric or character vectors. If needed, we can use the function as_factor() to coerce labeled objects, i.e. objects of the class labelled, and even all labeled vectors within data.frames or tibbles (at once) to factors.

Nickynico answered 22/1, 2020 at 22:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.