LibSVM Input format

I want to represent a set of labelled instances (data) in a file to be fed into LibSVM as training data, for the problem mentioned in this question. Each instance will include:

  1. Login date
  2. Login time
  3. Location (country code?)
  4. Day of the week
  5. Authenticity (0 - Non Authentic, 1 - Authentic) - The Label

How can I format this data to be input to the SVM?

Municipality answered 13/3, 2011 at 18:51 Comment(2)
Location and IP address overlap, so you might want to pick only one of them (experiment to find out which is best).Keil
You don't need to remove it from the question :)Keil

Are you asking about the data format or how to convert the data? For the latter, you're going to have to experiment to find the right way to do this. The general idea is to convert each field into a numeric attribute with nominal or ordinal values. Some of these are simple (#4, #6); some are going to be tough (#1-#3).

For example, you could represent #1 as three attributes of day, month and year, or just one by converting it to a UNIX like timestamp.
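
A rough sketch of both options, assuming the login date arrives as a dd/mm/yyyy string and is interpreted as UTC. The helper names are made up for illustration:

```python
from datetime import datetime, timezone

def date_to_parts(date_str):
    """Option 1: split a dd/mm/yyyy date into three numeric attributes."""
    d = datetime.strptime(date_str, "%d/%m/%Y")
    return d.day, d.month, d.year

def date_to_timestamp(date_str):
    """Option 2: collapse the date into one attribute, seconds since the UNIX epoch.

    Assumption: the date is treated as midnight UTC.
    """
    d = datetime.strptime(date_str, "%d/%m/%Y").replace(tzinfo=timezone.utc)
    return int(d.timestamp())

print(date_to_parts("13/03/2011"))      # (13, 3, 2011)
print(date_to_timestamp("13/03/2011"))  # 1299974400
```

Which of the two works better is something to find out by experiment. The timestamp keeps ordering in one dimension, while the three-attribute form exposes month and day-of-month patterns directly.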

The IP is even harder - there's no straightforward way to convert that into a meaningful ordinal value. Using every IP as a nominal attribute might not be useful depending on your problem.

Once you figure this out, convert your data and check the LibSVM docs. The general format is a label followed by index:value pairs, e.g., +1 1:0 2:0 .. etc
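
A minimal sketch of writing one instance in that layout (a label followed by index:value pairs). The function name and the sample feature values are hypothetical. It assumes the features have already been converted to numbers:

```python
def to_libsvm_line(label, features):
    """Format one instance as 'label index:value index:value ...'.

    `features` is a list of numeric values. Indices start at 1 and are
    emitted in ascending order; zero-valued features are skipped, which
    the sparse format permits.
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

# Hypothetical instance: day of week = 5, minutes since midnight = 510,
# a third feature that is zero, and label 1 (authentic).
print(to_libsvm_line(1, [5, 510, 0]))  # 1 1:5 2:510
```

Writing one such line per login to a text file gives a training file in this format.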

Pushy answered 14/3, 2011 at 0:6 Comment(8)
IP address equals previous (or most common) IP address for user might be a good feature, and is only binary.Keil
So.. Simply I should be able to use Date (dd/mm/yyyy), Time (hh:mm - 24h format), Location (Country Code - For the sake of simplicity), Day of the week (0-6), with authenticity (1 or 0) to achieve this.. Shouldn't I?Municipality
You won't be able to use any non-numeric formats like a date. The distinction here is that you have 3 dimensions (day/month/year) vs. one dimension (seconds since 1970). You'll have to do a conversion either way.Pushy
The problem here is, I need the SVM to identify the patterns (relationships) involving day of the week and time of the day (e.g. Friday 0830h OR Sunday 1845h), Will it be more flexible to use time since midnight rather than UNIX time stamp?Municipality
Yes. If you have lots of data, don't be afraid to include redundant dimensions. Getting SVMs to do what you want is more of an art than a science, you'll have to experiment to find out what works best.Pushy
@spinning_plate: What exactly are 'redundant dimensions' ?Municipality
A totally redundant dimension might be one that is perfectly correlated with another dimension. Such as having two variables, one for temperature in C and another in temp. in F.Pushy
@spinning_plate, @larsmans: Once I have done the training and created a model, I need to apply it to a single instance (one line of data), i.e. I need to predict the authenticity of a single visit by a user. Is this possible?Municipality

I believe there is an unstated assumption in the previous answers. The unstated assumption is that users of libSVM know that they should avoid putting categorical data into the classifier.

For example, libSVM will not know what to do with country codes. If you are trying to predict which visitors are most likely to buy something on your site then you could have problems if USA is between Chad and Niger in your country code list. The bulge from USA will likely skew predictions for the countries located near it.

To fix this I would create one category for each country under consideration (and perhaps an 'other' category). Then, for each instance you want to classify, I would set all the country categories to zero except the one to which the instance belongs. (With the libSVM sparse file format, this isn't really a big deal, since the zero-valued categories can simply be left out.)
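
A small sketch of that one-category-per-country idea. The country list, the function name, and the starting index are assumptions for illustration only:

```python
# Hypothetical list of countries under consideration; anything else maps to "other".
COUNTRIES = ["US", "GB", "LK", "other"]

def country_feature(code, first_index=1):
    """Return the single non-zero 'index:value' pair for a country.

    Each country gets its own feature index, starting at `first_index`.
    Only the matching category is written (value 1); all other country
    categories are zero and are omitted in the sparse format.
    """
    name = code if code in COUNTRIES else "other"
    return f"{first_index + COUNTRIES.index(name)}:1"

print(country_feature("GB", first_index=3))  # 4:1
print(country_feature("TD"))                 # 4:1 (falls into "other")
```

With this encoding no country is numerically "between" two others, so one country's pattern does not leak into its neighbours in the code list.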

Clientele answered 16/4, 2011 at 22:58 Comment(0)
