Asked 24/8, 2012 at 15:41 Answered 31/5, 2020 at 13:29

How do I read/write libsvm data into/from R?

The libsvm format is sparse data like

<class/target>[ <attribute number>:<attribute value>]*

(cf. Compressed Row Storage (CRS)) e.g.,

1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:03

I am sure I can whip some something myself, but I would much rather use something off the shelf. However, R library foreign does not seem to provide the necessary functionality.

Microscopic answered 24/8, 2012 at 15:41 Comment(2)

library(sos); findFn("libsvm") suggests e1071::write.svm, although I'm not sure if that does what you want? – Methaemoglobin 24/8, 2012 at 15:55

e1071::write.svm writes the svm model into 2 files – Microscopic 24/8, 2012 at 16:3

`e1071` is off the shelf:

install.packages("e1071")
library(e1071)
read.matrix.csr(...)
write.matrix.csr(...)

Note: it is implemented in R, not in C, so it is dog-slow.

It even have a special vignette Support Vector Machines—the Interface to libsvm in package e1071.

`r.vw` is bundled with `vowpal_wabbit`

Note: it is implemented in R, not in C, so it is dog-slow.

Microscopic answered 24/8, 2012 at 20:36 Comment(0)

I have been running a job using the zygmuntz solution on a dataset with 25k observations (rows) for almost 5 hrs now. It has done 3k-ish rows. It was taking so long that I coded this up in the meantime (based on zygmuntz's code):

require(Matrix)
read.libsvm = function( filename ) {
  content = readLines( filename )
  num_lines = length( content )
  tomakemat = cbind(1:num_lines, -1, substr(content,1,1))

  # loop over lines
  makemat = rbind(tomakemat,
  do.call(rbind, 
    lapply(1:num_lines, function(i){
       # split by spaces, remove lines
           line = as.vector( strsplit( content[i], ' ' )[[1]])
           cbind(i, t(simplify2array(strsplit(line[-1],
                          ':'))))   
})))
class(makemat) = "numeric"

#browser()
yx = sparseMatrix(i = makemat[,1], 
              j = makemat[,2]+2, 
          x = makemat[,3])
return( yx )
}

This ran in minutes on the same machine (there may have been memory issues with zygmuntz solution too, not sure). Hope this helps anyone with the same problem.

Remember, if you need to do big computations in R, VECTORIZE!

EDIT: fixed an indexing error I found this morning.

Gastrulation answered 25/2, 2014 at 8:9 Comment(4)

Output? Do you mean the y values? They are stored in the first column. – Gastrulation 26/2, 2014 at 1:51

I mean write.libsvm. – Microscopic 26/2, 2014 at 2:47

write.libsvm = function(data, filename= "out.dat", class = 1) { out = file(filename) writeLines(apply(data, 1, function(X)

paste(X[class],apply(cbind(which(X!=0)[-class], X[which(X!=0)[-class]]), 1, paste, collapse=":"), collapse=" ") }), out) close(out)

} Works with matrix and sparsematrix, assumes output vector is column in matrix, defaults to column 1. And to compare:

> system.time(write.libsvm(data, 'test5.dat'))    user  system elapsed     0.17    0.02    0.87  > system.time(write.matrix.csr(data, 'test6.dat'))    user  system elapsed     1.46   13.09  182.21

– Chare 17/6, 2014 at 9:55

I'm sorry for replying to such an old post but Nicholas McCarthy, your function does not seem to give the correct format, given: 'data2 <- Matrix(c(1,35,35,2,241,11,1,23,31), sparse = TRUE, nrow = 3)', it gives: '1 2:2 1 3:1' on the first row, but the second "1" should not be present right? – Flatwise 2/10, 2015 at 12:38

I came up with my own ad hoc solution leveraging some data.table utilities,

It ran in almost no time on the test data set I found (Boston Housing data).

Converting that to a data.table (orthogonal to solution, but adding here for easy reproducibility):

library(data.table)
x = fread("/media/data_drive/housing.data.fw",
          sep = "\n", header = FALSE)
#usually fixed-width conversion is harder, but everything here is numeric
columns =  c("CRIM", "ZN", "INDUS", "CHAS",
             "NOX", "RM", "AGE", "DIS", "RAD", 
             "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
                   header = FALSE, sep = "\t",
                   col.names = columns))

Here it is:

DT[ , fwrite(as.data.table(paste0(
  MEDV, " | ", sapply(transpose(lapply(
    names(.SD), function(jj)
      paste0(jj, ":", get(jj)))),
    paste, collapse = " "))), 
  "/path/to/output", col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]
#what gets sent to as.data.table:
#[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575 
#  AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98 MEDV:24"      
#[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421 
#  AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14 MEDV:21.6"
# ...

There may be a better way to get this understood by fwrite than as.data.table, but I can't think of one (until setDT works on vectors).

I replicated this to test its performance on a bigger data set (just blow up the current data set):

DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))

The operation was pretty fast compared to some of the times reported here (I haven't bothered comparing directly yet):

system.time(.)
#    user  system elapsed 
#   8.392   0.000   8.385

I also tested using writeLines instead of fwrite, but the latter was better.

I am looking again and seeing it might take a while to figure out what's going on. Maybe the magrittr-piped version will be easier to follow:

DT[ , 
    #1) prepend each column's values with the column name
    lapply(names(.SD), function(jj)
      paste0(jj, ":", get(jj))) %>%
      #2) transpose this list (using data.table's fast tool)
      #   (was column-wise, now row-wise)
      #3) concatenate columns, separated by " "
      transpose %>% sapply(paste, collapse = " ") %>%
      #4) prepend each row with the target value
      #   (with Vowpal Wabbit in mind, separate with a pipe)
      paste0(MEDV, " | ", .) %>%
      #5) convert this to a data.table to use fwrite
      as.data.table %>%
      #6) fwrite it; exclude nonsense column name,
      #   and force quotes off
      fwrite("/path/to/data", 
             col.names = FALSE, quote = FALSE),
  .SDcols = !"MEDV"]

reading in such files is much easier**

#quickly read data; don't split within lines
x = fread("/path/to/data", sep = "\n", header = FALSE)

#tstrsplit is transpose(strsplit(.))
dt1 = x[ , tstrsplit(V1, split = "[| :]+")]

#even columns have variable names
nms = c("target_name", 
        unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), 
                   with = FALSE]))

#odd columns have values
DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
#add meaningful names
setnames(DT, nms)

**this will not work with "ragged"/sparse input data. I don't think there's a way to extend this to work in such cases.

Melisma answered 19/12, 2016 at 3:59 Comment(7)

@Microscopic added something. Neither of my approaches are sparse-data-friendly. The writer can probably be adjusted for this... – Melisma 19/12, 2016 at 4:31

the whole idea behind the format is to handle sparse data. however, the requirement is to interact with libsvm/liblinear, so, I guess, this is a valid solution. – Microscopic 19/12, 2016 at 4:35

Agreed it's especially sub-par for reading data. For writing, IIUC I think it will work correctly (?) if we simply replace 0 with NA. – Melisma 19/12, 2016 at 4:49

It's come to my attention there's a bug in data.table screwing this up from working perfectly. Until that's fixed, you'll have to assign DT$MEDV outside DT (say as mv <- DT$MEDV) and replace it in the [.data.table call – Melisma 19/12, 2016 at 4:54

fwiw i just discovered the r.vw package that's a part of the vowpal_wabbit official repository: github.com/JohnLangford/vowpal_wabbit/tree/master/R/r.vw – Melisma 5/12, 2017 at 6:28

@ MichaelChirico Hello, the link to the Boston dataset is no longer valid. I have loaded the dataset from library(mlbench) and command data(BostonHousing). If I replace your x with this dataset and run your code from DT = with(x, fread(paste(gsub etc, I receive: Error in is.factor(x) : object 'V1' not found. Do you know what can be the cause? Thank you! – Buckley 26/9, 2022 at 14:36

ps: i also have downloaded the file in csv from here : kaggle.com/datasets/vikrishnan/… and tried to adapt your code to csv. Although the column MEDV does not appear in DT in my case. (Sorry for chasing you with this old post!) – Buckley 27/9, 2022 at 10:40

Based on some comments. I add it as an aswer so it's easier for others to use. This is to write data in libsvm format.

Function to write a data.frame to svm light format. I've added a train={TRUE, FALSE} argument in case the data doesn't have labels. In this case, the class index is ignored.

write.libsvm = function(data, filename= "out.dat", class = 1, train=TRUE) {
  out = file(filename)
  if(train){
    writeLines(apply(data, 1, function(X) {
      paste(X[class], 
            apply(cbind(which(X!=0)[-class], 
                        X[which(X!=0)[-class]]), 
                  1, paste, collapse=":"), 
            collapse=" ") 
      }), out)
  } else {
    # leaves 1 as default for the new data without predictions. 
    writeLines(apply(data, 1, function(X) {
      paste('1',
            apply(cbind(which(X!=0), X[which(X!=0)]), 1, paste, collapse=":"), 
            collapse=" ") 
      }), out)
  }
  close(out) 
}

** EDIT **

Another option - In case you already have the data in a data.table object

libfm and SVMlight have the same format, so this function should work.

library(data.table)

data.table.fm <- function (data = X, fileName = "../out.fm", target = "y_train", 
    train = TRUE) {
    if (train) {
        if (is.logical(data[[target]]) | sum(levels(factor(data[[target]])) == 
            levels(factor(c(0, 1)))) == 2) {
            data[[target]][data[[target]] == TRUE] = 1
            data[[target]][data[[target]] == FALSE] = -1
        }
    }
    specChar = "\\(|\\)|\\||\\:"
    specCharSpace = "\\(|\\)|\\||\\:| "
    parsingNames <- function(x) {
        ret = c()
        for (el in x) ret = append(ret, gsub(specCharSpace, "_", 
            el))
        ret
    }
    parsingVar <- function(x, keepSpace, hard_parse) {
        if (!keepSpace) 
            spch = specCharSpace
        else spch = specChar
        if (hard_parse) 
            gsub("(^_( *|_*)+)|(^_$)|(( *|_*)+_$)|( +_+ +)", 
                " ", gsub(specChar, "_", gsub("(^ +)|( +$)", 
                  "", x)))
        else gsub(spch, "_", x)
    }
    setnames(data, names(data), parsingNames(names(data)))
    target = parsingNames(target)
    format_vw <- function(column, formater) {
        ifelse(as.logical(column), sprintf(formater, j, column), 
            "")
    }
    all_vars = names(data)[!names(data) %in% target]
    cat("Reordering data.table if class isn't first\n")
    target_inx = which(names(data) %in% target)
    rest_inx = which(!names(data) %in% target)
    cat("Adding Variable names to data.table\n")
    for (j in rest_inx) {
        column = data[[j]]
        formater = "%s:%f"
        set(data, i = NULL, j = j, value = format_vw(column, 
            formater))
        cat(sprintf("Fixing %s\n", j))
    }
    data = data[, c(target_inx, rest_inx), with = FALSE]
    drop_extra_space <- function(x) {
        gsub(" {1,}", " ", x)
    }
    cat("Pasting data - Removing extra spaces\n")
    data = apply(data, 1, function(x) drop_extra_space(paste(x, 
        collapse = " ")))
    cat("Writing to disk\n")
    write.table(data, file = fileName, sep = " ", row.names = FALSE, 
        col.names = FALSE, quote = FALSE)
}

Tollmann answered 25/8, 2015 at 1:8 Comment(4)

is this faster than e1071? – Microscopic 25/8, 2015 at 1:49

The first one is slow. The second one, was orders of magnitude faster. But you need ram. Anyway, it's easy to extend the function and make it work in chunks. I've used it for fm, so i'm not sure if the output format is totally correct. – Tollmann 26/8, 2015 at 0:0

As marbel says, the second one is really fast. I agree with that. I tested in on one of my files with 50k rows x 750 columns. The e1071 function (write.matrix.csr) takes 18seconds, the one from user user3350186 takes 11s and this one 6s. In my case that saves me hours, many thanks. – Aggress 15/3, 2016 at 13:55

well, it's should be faster to work with the entire vectors in memory rather than read it line by line. There is a cleaner version in the VW repo. You can also use python for this, it's much more convenient. – Tollmann 23/12, 2016 at 23:11

Try these functions and examples:

https://github.com/zygmuntz/r-libsvm-format-read-write

Rees answered 22/2, 2013 at 19:14 Comment(3)

this is just as slow as the e1071 solution. – Microscopic 22/2, 2013 at 19:51

an advantage is no memory overhead; it is saved line by line. – Erda 1/2, 2014 at 21:4

Write.matrix.csr first convert your data to a sparse matrix, and it could fail if not enough memory. – Erda 1/2, 2014 at 21:15

I went with a two-hop solution - convert R data to another format first, and then to LIBSVM:

Used R package foreign to convert (and write out) data frame to ARFF format (modified write.arff changing write.table to na="0.0" instead of na="?" otherwise step 2 fails)
Used https://github.com/dat/svm-tools/blob/master/arff2svm.py to convert ARFF format to LIBSVM

My data set is 200K x 500 and this only took 3-5 minutes.

Bimbo answered 11/1, 2016 at 19:58 Comment(0)

The question was asked a long time ago and has several answer. Most answers didn't work for me since my data comes in a long format, and I cant one-hot encode it in R. So here is my take. I wrote a function to one-hot encode the data, and save it without having to first transform the matrix into a sparse one.

RCPP code:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace Rcpp;

// Reading data frame from R and saving it as an libFM file

// [[Rcpp::export]] 
std::string createNumber(int x, double y) {
  std::string s1 = std::to_string(x); 
  std::string s2 = std::to_string(y); 
  std::string X_elem = s1 + ":" + s2; 
  return X_elem;
}

// [[Rcpp::export]]
std::string createRowLibFM(arma::rowvec row_to_fm, arma::vec factor_levels, arma::vec position) {
  int n = factor_levels.n_elem; 
  std::string total =  std::to_string(row_to_fm[0]); 
  for (int i = 1; i < n; i++) { 
    if (factor_levels[i] > 1) { 
      total = total + " " + createNumber(position[i - 1] + row_to_fm[i], 1);
    } 
    if (factor_levels[i] == 1) {
      total = total + " " + createNumber(position[i], row_to_fm[i]);
    }
  }
  return total; 
}

// [[Rcpp::export]]
void writeFile(std::string file, arma::mat all_data, arma::vec factor_levels) {
  int n = all_data.n_rows;
  arma::vec position = arma::cumsum(factor_levels);
  std::ofstream temp_file;
  temp_file.open (file.c_str());
  for (int i = 0; i < n; i++) {
    std::string temp_row = createRowLibFM(all_data.row(i), factor_levels, position);
    temp_file << temp_row + "\n";
  }
  temp_file.close();
}

R function acting as wrapper for it:

writeFileFM <- function(temp.data, path = 'test.txt') { 
  ### Dealing with y function 
  if (!(any(colnames(temp.data) %in% 'y'))) { 
    stop('No y column is given')  
  } else { 
    temp.data <- temp.data %>% select(y, everything()) ## y is required to be first column for writeFile 
  }
  ### Dealing with factors/strings 
  temp.classes <- sapply(temp.data, class) 
  class.num    <- rep(0, length(temp.classes))
  map.list     <- list()
  for (i in 2:length(temp.classes)) { ### since y is always the first column 
    if (any(temp.classes[i] %in% c('factor', 'character'))) {
      temp.col         <- as.factor(temp.data[ ,i]) ### incase it is character 
      temp.unique      <- levels(temp.col)
      factors.new      <- seq(0, length(temp.unique) - 1, 1)
      levels(temp.col) <- factors.new 
      temp.data[ ,i]   <- temp.col
      ### Saving changes 
      class.num[i]  <- length(temp.unique)
      map.list[[i - 1]] <- data.frame('original.value'  = temp.unique, 
                                      'transform.value' = factors.new)
    } else { 
      class.num[i]  <- 1  ### Numeric values require only 1 column 
    }
  }
  ### Writing file 
  print('Writing file to disc')
  writeFile(all_data = sapply(temp.data, as.numeric), file = path, factor_levels = class.num)
  return(map.list) 
}

Comparing it on fake data.

### Creating data to save 
set.seed(999)
n <- 10000 
factor.lvl1 <- 3
factor.lvl2 <- 2 
temp.data <- data.frame('x1' = sample(stri_rand_strings(factor.lvl1, 7), n, replace = TRUE),
                        'x2' = sample(stri_rand_strings(factor.lvl2, 4), n, replace = TRUE), 
                        'x3' = rnorm(n), 
                        'x4' = rnorm(n),
                        'y'  = rnorm(n))

### Comparing to other method 
library(data.table)
library(e1071)

microbenchmark::microbenchmark(
  temp.data.table <- model.matrix( ~ 0 + x1 + x2 + x3 + x4, data = temp.data,
                                   contrasts = list(x2 = contrasts(temp.data$x2, contrasts = FALSE))),
  write.matrix.csr(temp.data.table, 'out.txt'), 
  writeFileFM(temp.data))

Results.

  min       lq       mean    median        uq
   1.3061   1.6725   1.890942   1.92475   2.07725
 629.9863 653.4345 676.108548 672.52510 687.88330
 270.8217 275.1353 283.537898 281.42100 289.39160
      max neval cld
   3.2328   100 a  
 793.7040   100   c
 328.0863   100  b

It is faster than the e1071 option, and while that option fails when increasing the number of observations, the method suggested is still applicable.

Ichnology answered 31/5, 2020 at 13:29 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

e1071 is off the shelf:

r.vw is bundled with vowpal_wabbit

Another option - In case you already have the data in a data.table object

libfm and SVMlight have the same format, so this function should work.

Recommended topics

Hot tags

`e1071` is off the shelf:

`r.vw` is bundled with `vowpal_wabbit`