The LIBSVM data format is given by:
<label> <index1>:<value1> <index2>:<value2> ...
...
...
As you can see, this forms a matrix with LineCount rows and (IndexCount + 1) columns; more precisely, a sparse matrix. If you specify a value for each index, you have a dense matrix, but if you only specify a few indices, like <label> 5:<value> 8:<value>, only indices 5 and 8 and, of course, the label will have a custom value; all other values are set to 0. This is just for notational simplicity or to save space, since datasets can be huge.
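To make the sparse-to-dense idea concrete, here is a minimal Python sketch that expands one line into a full row. The function name parse_libsvm_line and the n_features parameter are my own for illustration; they are not part of LIBSVM.

```python
def parse_libsvm_line(line, n_features):
    """Expand one sparse LIBSVM line into (label, dense feature list).

    Hypothetical helper for illustration only, not a LIBSVM function.
    """
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * n_features  # indices that are not listed stay at 0
    for token in tokens[1:]:
        index, value = token.split(":")
        dense[int(index) - 1] = float(value)  # LIBSVM indices are 1-based
    return label, dense


label, features = parse_libsvm_line("1 5:0.5 8:2", n_features=8)
print(label, features)
# prints: 1.0 [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 2.0]
```

Only positions 5 and 8 carry a value; everything else is filled with 0.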
For the meaning of the tags, I quote the README file:
<label> is the target value of the training data. For classification,
it should be an integer which identifies a class (multi-class
classification is supported). For regression, it's any real
number. For one-class SVM, it's not used so can be any number.
<index> is an integer starting from 1, <value> is a real number. The indices
must be in an ascending order.
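The rules quoted above can be checked mechanically. The following is a rough Python sketch, assuming the ordinary (non-precomputed-kernel) case; check_libsvm_line is a name I made up, not something LIBSVM ships.

```python
def check_libsvm_line(line):
    """Raise ValueError if a line breaks the quoted format rules.

    Hypothetical helper; assumes ordinary feature vectors
    (the precomputed-kernel case, where index 0 is used, is not handled).
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    float(tokens[0])  # the label must be a number (raises ValueError if not)
    previous = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":")
        index = int(index_text)
        float(value_text)  # the value must be a real number
        if index < 1:
            raise ValueError(f"index {index} is below 1")
        if index <= previous:
            raise ValueError(f"index {index} is not in ascending order")
        previous = index
    return True


print(check_libsvm_line("1 2:0.5 7:1.0"))  # prints: True
```

A line such as "1 3:0.5 2:1.0" would be rejected, because index 2 comes after index 3.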
As you can see, the label is the value you want to predict. Each index marks a feature of your data, and value is that feature's value. A feature is simply an indicator that correlates with your target value, so that a better prediction can be made.
Totally fictional story time: Gabriel Luna (a totally fictional character) wants to predict his energy consumption for the next few days. He found out that the outside temperature from the day before is a good indicator for that, so he selects Temperature with index 1 as a feature. Important: indices always start at one; zero can sometimes cause strange LIBSVM behaviour. Then he notices, to his surprise, that the day of the week (Monday to Sunday, or 0 to 6) also affects his load, so he selects it as a second feature with index 2. A matrix row for LIBSVM now has the following format:
<myLoad_Value> 1:<outsideTemperatureFromYesterday_Value> 2:<dayOfTheWeek_Value>
Gabriel Luna (he is Batman at night) now captures these data over a few weeks, which could look something like this (load in kWh, temperature in °C, day as mentioned above):
0.72 1:25 2:0
0.65 1:21 2:1
0.68 1:29 2:2
...
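If the measurements live in a script rather than a text file, rows like these can be generated. This is only a sketch: the samples list holds the three rows from above read as (load, temperature, day), and to_libsvm_line is a hypothetical helper name. It drops zero-valued features, which the sparse format allows.

```python
# (load in kWh, temperature in °C, day of week 0..6), taken from the rows above
samples = [
    (0.72, 25, 0),
    (0.65, 21, 1),
    (0.68, 29, 2),
]


def to_libsvm_line(load, temperature, day):
    """Format one sample as a LIBSVM line (hypothetical helper)."""
    features = {1: temperature, 2: day}  # index 1: temperature, index 2: day
    parts = [f"{load:g}"]
    for index in sorted(features):  # indices must be ascending
        value = features[index]
        if value != 0:  # zeros may be omitted in the sparse format
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


lines = [to_libsvm_line(*sample) for sample in samples]
print("\n".join(lines))
# prints:
# 0.72 1:25
# 0.65 1:21 2:1
# 0.68 1:29 2:2
```

The first row comes out without its day entry, since a day value of 0 is implied when the index is missing.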
Notice that we could leave out 2:0 because of the sparse matrix format. This would be your training data to train a LIBSVM model. Then we predict tomorrow's load as follows. You know today's temperature, let us say 23 °C, and today is Tuesday, which is 1, so tomorrow is 2. So, this is the line or vector to use with the model:
0 1:23 2:2
Here, you can set the <label> value arbitrarily. It is not used to make the prediction; the predicted value takes its place in the output. I hope this helps.
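For completeness, building the query line from today's values could look roughly like this in Python. The function build_query_line and its parameters are invented for this example; it only produces the text line and does not call LIBSVM.

```python
def build_query_line(temperature_today, day_today, placeholder_label=0):
    """Build the LIBSVM line used to predict tomorrow's load.

    Hypothetical helper: today's temperature is the "day before" feature
    for tomorrow, and the label is only a placeholder.
    """
    day_tomorrow = (day_today + 1) % 7  # wrap from Sunday (6) back to Monday (0)
    return f"{placeholder_label} 1:{temperature_today:g} 2:{day_tomorrow}"


print(build_query_line(23, 1))  # prints: 0 1:23 2:2
```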