Representing data for usage with libsvm (sparse or not)

I'd like to do some data mining on function stack traces. For this I am using libsvm and representing the data in a sparse format for processing speed. Each stack trace is an instance and the variables are functions, i.e.:

class1 F1,F2,F1,F456,F3  
class2 F4,F4,F4,F56,F3000  
...

Somewhere I have an ever-growing registry of the unique functions seen so far; this is where the function indexes come from. Ideally, I'd like to represent the instances above in a sparse format, partitioned into 5 variables, like this:

1 1:1 2:1 1:1 456:1 3:1  
2 4:1 4:1 4:1 56:1 3000:1

This is not possible in libsvm's format, because the same index cannot appear twice in one instance. So I am adding the total length of the function registry to each successive group to avoid index clashes. If we suppose there are 3000 functions in total:

1 1:1 3002:1 6001:1 9456:1 12003:1

This is how the first instance looks now. It works as long as the number of functions doesn't change, but that is not the case: new functions are added all the time, and each addition means redoing the whole thing.
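To make the offset scheme concrete, here is a minimal Python sketch of it. The function name `encode_with_offsets` and its parameters are hypothetical, chosen only for illustration; the sketch assumes function ids are 1-based and that every value is 1, as in the example above.

```python
def encode_with_offsets(label, function_ids, registry_size):
    """Build one libsvm line: the function at position p gets index p * registry_size + id."""
    indices = [pos * registry_size + f for pos, f in enumerate(function_ids)]
    features = " ".join(f"{idx}:1" for idx in indices)
    return f"{label} {features}"


# First instance (class1 F1,F2,F1,F456,F3) with 3000 known functions.
print(encode_with_offsets(1, [1, 2, 1, 456, 3], registry_size=3000))

# Same instance after one new function is registered: most indices shift,
# which is why every previously written line has to be regenerated.
print(encode_with_offsets(1, [1, 2, 1, 456, 3], registry_size=3001))
```

Because the offsets are multiples of the registry size, every index after the first position depends on how many functions are currently known.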

I am using a sparse format, but suggestions on other formats are welcome too. I am able to use the data with Weka in a dense format, using the function names as variables, and it works, just much slower than with libsvm.

thanks!

Lukey answered 9/9, 2013 at 10:49 Comment(0)

You have a few options:

a) Redo the whole thing each time (I think generating the inputs for libsvm is faster than running libsvm itself :))

b) Use even numbers for the first thing and odd numbers for the other thing. So your example would look like:

1 2:1 3:1 1:1 911:1 5:1

This avoids collisions and you don't have to redo the whole thing :)
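One possible reading of b), generalized from two groups (even/odd) to the five positions in the question, is to interleave the indices: the function with id `f` at position `p` gets index `(f - 1) * 5 + p + 1`. The sketch below is an assumption about how that could look, not something spelled out in the answer; `encode_interleaved` and `num_positions` are hypothetical names. Indices are sorted because libsvm expects them in ascending order within a line.

```python
def encode_interleaved(label, function_ids, num_positions=5):
    """Build one libsvm line where each (function id, position) pair has a fixed index.

    The index depends only on the function id and the position, not on the
    registry size, so existing lines stay valid when new functions appear.
    """
    if len(function_ids) != num_positions:
        raise ValueError("expected exactly num_positions function ids")
    # Function ids are assumed to be 1-based; positions are 0-based.
    indices = sorted((f - 1) * num_positions + pos + 1
                     for pos, f in enumerate(function_ids))
    return f"{label} " + " ".join(f"{idx}:1" for idx in indices)


print(encode_interleaved(1, [1, 2, 1, 456, 3]))      # class1 F1,F2,F1,F456,F3
print(encode_interleaved(2, [4, 4, 4, 56, 3000]))    # class2 F4,F4,F4,F56,F3000
```

With `num_positions=2` this reduces to an odd/even split: position 0 uses index `2f - 1` and position 1 uses `2f`. The trade-off is that the number of positions must be fixed up front.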

Dichotomize answered 9/9, 2013 at 20:0 Comment(1)
Thanks @usamec, a) is what I am doing at present; I'm not sure I get b) though. I think a more general question is: how do I represent repeating features in libsvm's sparse format, e.g. A B A A C? In the previous example I used 5 different attributes, so I'm not sure how odd/even indexes would help. Otherwise I'm thinking of resorting to the normal dense format and making it 1:2 2:3 3:2 4:2 5:3000 ...Lukey
