LIBSVM Data Preparation: Excel data to LIBSVM format

I want to learn how to use LIBSVM for regression, and I'm stuck on preparing my data. I currently have the data below in .csv and .xlsx format, and I want to convert it into the LIBSVM data format.

Current Data

So far, I understand that the data should be in this format so that it can be used in LIBSVM:

LIBSVM format

Based on what I read, for regression, "label" is the target value which can be any real number.

I am doing an electric load prediction study. Can anyone tell me what the label would be in my case? And finally, how should I organize my columns and rows?

Jaredjarek answered 5/11, 2016 at 9:33 Comment(0)

The LIBSVM data format is given by:

<label> <index1>:<value1> <index2>:<value2> ...
...
...

As you can see, this forms a matrix with LineCount rows and (IndexCount + 1) columns; more precisely, a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices, as in <label> 5:<value> 8:<value>, only indices 5 and 8 (and of course the label) have a custom value; all other values are set to 0. This is just for notational simplicity and to save space, since datasets can be huge.
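
To make the sparse-versus-dense point concrete, here is a minimal Python sketch that expands one sparse line into a dense row. The function name `parse_libsvm_line` and the `n_features` parameter are my own for illustration and are not part of LIBSVM:

```python
def parse_libsvm_line(line, n_features):
    """Expand one sparse LIBSVM line into (label, dense feature list)."""
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * n_features            # indices that are not listed stay at 0
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index) - 1] = float(value)   # LIBSVM indices are 1-based
    return label, dense


# Only indices 5 and 8 are given; everything else becomes 0.
label, features = parse_libsvm_line("0.5 5:1.2 8:3", n_features=8)
print(label, features)
```

LIBSVM does this expansion internally, so you never need to write the zeros yourself.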

For the meaning of the tags, I cite the README file:

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. <index> is an integer starting from 1, <value> is a real number. The indices must be in an ascending order.
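
The rules in that quote are easy to check mechanically before handing a file to LIBSVM. The following is only a sketch of such a check; `check_libsvm_line` is a hypothetical helper of mine, not a LIBSVM function, and it covers only the rules quoted above (numeric label, integer index of at least 1, numeric value, ascending indices):

```python
def check_libsvm_line(line):
    """Return a list of problems found in one LIBSVM line (empty if none)."""
    tokens = line.split()
    if not tokens:
        return ["empty line"]
    problems = []
    try:
        float(tokens[0])
    except ValueError:
        problems.append(f"label {tokens[0]!r} is not a number")
    previous = 0                      # largest index seen so far
    for token in tokens[1:]:
        index_text, sep, value_text = token.partition(":")
        try:
            index = int(index_text)
            float(value_text)
        except ValueError:
            problems.append(f"{token!r} is not of the form <index>:<value>")
            continue
        if not sep:
            problems.append(f"{token!r} is missing the ':' separator")
        elif index < 1:
            problems.append(f"index {index} is below 1")
        elif index <= previous:
            problems.append(f"index {index} is not in ascending order")
        previous = max(previous, index)
    return problems


print(check_libsvm_line("1.5 1:0.2 2:0.7"))   # well-formed line
print(check_libsvm_line("1.5 3:0.2 2:0.7"))   # indices out of order
```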

As you can see, the label is the data you want to predict. The index marks a feature of your data and its value. A feature is simply an indicator to associate or correlate your target value with, so a better prediction can be made.

Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out that the outside temperature from the day before is a good indicator for that, so he selects temperature as a feature with index 1. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then, to his surprise, he notices that the day of the week (Monday to Sunday, or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:

<myLoad_Value> 1:<outsideTemperatureFromYesterday_Value> 2:<dayOfTheWeek_Value>

Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):

0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...

Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then, we predict tomorrow's load as follows. You know today's temperature, let us say 23°C, and today is Tuesday, which is 1, so tomorrow is 2. So this is the line or vector to use with the model:

0 1:23 2:2

Here, you can set the <label> value arbitrarily, because it does not influence the prediction; LIBSVM only uses it to report the error against its own predicted value. I hope this helps.
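
Since the question started from a .csv file, here is a rough Python sketch of the conversion using only the standard library. It assumes the CSV has a header row and purely numeric feature columns; the function name `csv_to_libsvm` and the column names `load`, `temperature` and `weekday` are made up for this example and will differ in your data:

```python
import csv
import io


def csv_to_libsvm(csv_file, out_file, label_column):
    """Write each CSV row as '<label> <index>:<value> ...' and return the feature order."""
    reader = csv.DictReader(csv_file)
    # Every column except the label becomes a feature; index 1 is the first one.
    feature_columns = [c for c in reader.fieldnames if c != label_column]
    for row in reader:
        parts = [row[label_column]]
        for index, column in enumerate(feature_columns, start=1):
            value = float(row[column])
            if value != 0.0:          # zeros may be omitted in the sparse format
                parts.append(f"{index}:{value}")
        out_file.write(" ".join(parts) + "\n")
    return feature_columns


# In-memory stand-ins for real files, mirroring the first two example rows.
sample = io.StringIO("load,temperature,weekday\n0.72,25,0\n0.65,21,1\n")
out = io.StringIO()
columns = csv_to_libsvm(sample, out, label_column="load")
print(out.getvalue())
print(columns)   # position in this list + 1 is the LIBSVM index
```

The LIBSVM file itself stores only the numeric indices, so keep the returned column order somewhere if you need to know later which index belongs to which feature. For real files, pass objects from `open(...)` instead of the `io.StringIO` stand-ins.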

Jugendstil answered 7/11, 2016 at 14:26 Comment(3)
wow! thanks a lot for this very comprehensive explanation.. i was perfectly clueless about the data format for libsvm but this really helped me understand it.. thank you so much! - Jaredjarek
One of the best explanations found on the web. What prominence does libSVM format hold when I try to build a model using SVM? Can't I just scale the data and run it through the algorithm to get a trained model? - Meldon
@thatguy, I think I'm missing something... is the index-feature correspondence kept in a different file? It seems strange to me that only arbitrary indexes are used in the libsvm file... - Revolutionist
