Converting CSV file to LIBSVM compatible data file using python
I am doing a project using libsvm and am preparing my data for the library. How can I convert a CSV file to LIBSVM-compatible data?

CSV File: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/data/iris.csv

From the LIBSVM FAQ:

How to convert other data formats to LIBSVM format?

It depends on your data format. A simple way is to use libsvmwrite in the libsvm matlab/octave interface. Take a CSV (comma-separated values) file in UCI machine learning repository as an example. We download SPECTF.train. Labels are in the first column. The following steps produce a file in the libsvm format.

matlab> SPECTF = csvread('SPECTF.train'); % read a csv file
matlab> labels = SPECTF(:, 1); % labels from the 1st column
matlab> features = SPECTF(:, 2:end); 
matlab> features_sparse = sparse(features); % features must be in a sparse matrix
matlab> libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);
The transformed data are stored in SPECTFlibsvm.train.
Alternatively, you can use convert.c to convert CSV format to libsvm format.

But I don't want to use MATLAB; I use Python.

I also found this solution, which uses Java.

Can anyone recommend a way to tackle this problem?

Euchologion answered 19/4, 2014 at 12:37 Comment(5)
Are you going to use the libsvm executables or the Python binding? – Sarracenia
If the executables, you need to convert the CSV to libsvm data. If the Python binding, you need to load the CSV into Python. – Sarracenia
I am going to use the libsvm executables. I found this one (github.com/seamusabshere/vector_embed) and am figuring out whether it's helpful. But I want to split the predictors from the target (which is one of the columns). Does this affect anything? – Euchologion
It seems to treat the first column as the target, so you would need to modify the code accordingly. It's Ruby code. Do you need a Python version? – Sarracenia
This is my first interaction with libsvm; I just need to know how to separate the predictors (many columns) from the target (one specific column). I'd use this script (github.com/zygmuntz/phraug/blob/master/csv2libsvm.py). I would be pleased if you could explain more. – Euchologion

You can use csv2libsvm.py to convert the CSV to libsvm data:

python csv2libsvm.py iris.csv libsvm.data 4 True

where 4 is the zero-based index of the target column (here the fifth column) and True indicates that the CSV has a header row.

This produces libsvm.data:

0 1:5.1 2:3.5 3:1.4 4:0.2
0 1:4.9 2:3.0 3:1.4 4:0.2
0 1:4.7 2:3.2 3:1.3 4:0.2
0 1:4.6 2:3.1 3:1.5 4:0.2
...

from iris.csv

150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
...
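
For readers who want to see what happens to each row, here is a minimal sketch of the conversion. It is not the csv2libsvm.py source: `csv_row_to_libsvm` is a name made up for illustration, and the sketch assumes numeric features and a zero-based label index, so a file with N columns accepts indices 0 to N-1.

```python
import csv
import io


def csv_row_to_libsvm(row, label_index):
    """Turn one CSV row (a list of strings) into a libsvm-format line."""
    fields = list(row)                 # copy so the caller's row is untouched
    label = fields.pop(label_index)    # zero-based: 4 is the fifth column
    pairs = [
        "%d:%s" % (position, value)    # libsvm feature indices start at 1
        for position, value in enumerate(fields, start=1)
        if value != "" and float(value) != 0.0   # zeros are left out (sparse format)
    ]
    return " ".join([label] + pairs)


# The header and the first two data rows of iris.csv, as shown above.
sample = io.StringIO("150,4,setosa,versicolor,virginica\n"
                     "5.1,3.5,1.4,0.2,0\n"
                     "4.9,3.0,1.4,0.2,0\n")
reader = csv.reader(sample)
next(reader)                           # skip the header row
converted = [csv_row_to_libsvm(row, 4) for row in reader]
for line in converted:
    print(line)
```

Because the label is removed before the features are numbered, the remaining columns are always numbered 1, 2, 3, ... regardless of where the label column sat in the CSV.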
Sarracenia answered 19/4, 2014 at 13:53 Comment(2)
I have 16 columns altogether, and the 16th is the class attribute. I have no headers. How can I convert my CSV using the above script? – Sima
I tried with a 2-column CSV file and it didn't work. I ran python3 csv2libsvm.py P0.txt P0.data 2 True and got: Traceback (most recent call last): File "csv2libsvm.py", line 71, in <module> label = line.pop(label_index) IndexError: pop index out of range – Aggarwal

csv2libsvm.py does not work with Python 3, and it also does not support label targets (string targets). I have slightly modified it; it should now work with Python 3 as well as with string targets. I am very new to Python, so my code may not follow best practices, but I hope it is good enough to help someone.

#!/usr/bin/env python

"""
Convert a CSV file to libsvm format. Features must be numeric; labels may be
numeric or strings (each distinct string is mapped to an integer id).
Put -1 as label index (argv[3]) if there are no labels in your file.
Expecting no headers. If present, headers can be skipped with argv[4] == 1.

"""

import sys
import csv

def construct_line(label, line, labels_dict):
    """Build one libsvm line: '<label> <index>:<value> ...'."""
    try:
        # Numeric label: keep it as written, normalising zero to "0".
        if float(label) == 0.0:
            label = "0"
    except ValueError:
        # String label: map each distinct name to an integer id (0, 1, 2, ...).
        if label not in labels_dict:
            labels_dict[label] = str(len(labels_dict))
        label = labels_dict[label]
    new_line = [label]

    for i, item in enumerate(line):
        if item == 'NaN':
            item = "0.0"  # missing values are written as an explicit 0.0
        elif item == '' or float(item) == 0.0:
            continue  # sparse format: empty and zero features are omitted
        new_line.append("%s:%s" % (i + 1, item))
    return " ".join(new_line) + "\n"

# ---

input_file = sys.argv[1]
try:
    output_file = sys.argv[2]
except IndexError:
    output_file = input_file+".out"


try:
    label_index = int( sys.argv[3] )
except IndexError:
    label_index = 0

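try:
    # Any non-empty string is truthy, so parse the flag explicitly:
    # "0", "false", "no" and "" mean "no header"; 1 or True skips it.
    skip_headers = sys.argv[4].strip().lower() not in ('', '0', 'false', 'no')
except IndexError:
    skip_headers = False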

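labels_dict = {}

# Context managers close both files even if a row fails to convert.
with open(input_file, 'rt', newline='') as i, open(output_file, 'wb') as o:
    reader = csv.reader(i)

    if skip_headers:
        next(reader)  # discard the header row

    for line in reader:
        if not line:
            continue  # ignore blank lines
        if label_index == -1:
            label = '1'  # no label column: use a dummy label
        else:
            label = line.pop(label_index)

        new_line = construct_line(label, line, labels_dict)
        o.write(new_line.encode('utf-8'))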
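
One way to sanity-check the output is to read a converted line back. This is a rough sketch that assumes the layout the script writes (label first, then space-separated index:value pairs). `parse_libsvm_line` is a hypothetical helper, not part of libsvm or of the script.

```python
def parse_libsvm_line(line):
    """Split a libsvm line into (label, {feature_index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)   # indices are 1-based in the file
    return label, features


label, features = parse_libsvm_line("0 1:5.1 2:3.5 3:1.4 4:0.2\n")
print(label, features)
```

For string targets, the label read back is the integer id assigned in order of first appearance, not the original name, so the mapping in labels_dict has to be kept if the names are needed later.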
Varien answered 29/6, 2016 at 9:26 Comment(0)
