Representing data for use with libsvm (sparse or not)

I'd like to do some data mining on function stack traces. For this I am using libsvm and representing the data in a sparse format for speed of processing. Each stack trace is an instance and the variables are functions, i.e.:

class1 F1,F2,F1,F456,F3  
class2 F4,F4,F4,F56,F3000  
...

Somewhere I have an ever-growing registry of the unique functions seen so far; this is where the function indexes come from. Ideally, I'd like to represent the instances above in a sparse format, partitioned into 5 variables (one per position in the trace), like this:

1 1:1 2:1 1:1 456:1 3:1  
2 4:1 4:1 4:1 56:1 3000:1

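This is not possible in libsvm's format, because the same index cannot appear twice in one instance. To avoid index clashes I am offsetting each group by the size of the function registry: the first group keeps its indexes, the second adds the registry size once, the third adds it twice, and so on. If we suppose there are 3000 functions in total, the first instance now looks like this:

1 1:1 3002:1 6001:1 9456:1 12003:1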

This works as long as the number of functions doesn't change, but that is not the case: new functions are added all the time, and every time the registry grows the offsets shift, so I would have to redo the whole thing.
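To make the scheme and its problem concrete, here is a minimal Python sketch of the encoding described above. The names `encode_trace` and `registry` are made up for illustration, and the registry is assumed to be a plain dict mapping each function name to a 1-based index.

```python
def encode_trace(label, trace, registry, registry_size):
    """Encode one stack trace as a libsvm line, offsetting each position."""
    features = []
    for position, name in enumerate(trace):
        # Group 0 keeps the raw index; group k is shifted by k * registry_size.
        index = position * registry_size + registry[name]
        features.append(f"{index}:1")
    return f"{label} " + " ".join(features)


# Hypothetical registry: F1..F3000 mapped to indexes 1..3000.
registry = {f"F{i}": i for i in range(1, 3001)}
trace = ["F1", "F2", "F1", "F456", "F3"]

# With 3000 known functions this reproduces the line shown above.
line_before = encode_trace(1, trace, registry, registry_size=3000)

# One new function is seen: the registry grows to 3001 entries, and the
# same trace now maps to different indexes in every group but the first.
registry["F3001"] = 3001
line_after = encode_trace(1, trace, registry, registry_size=len(registry))

print(line_before)  # 1 1:1 3002:1 6001:1 9456:1 12003:1
print(line_after)   # 1 1:1 3003:1 6003:1 9459:1 12007:1
```

The second line shows why previously written data becomes stale: the same stack trace gets different feature indexes as soon as a single function is added to the registry.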

I am using a sparse format, but suggestions for other formats are welcome too. I am able to use the data with Weka in a dense format, using the function names as variables, and it works; it is just much slower than with libsvm.

Thanks!
