Support Vector Machine for Java?

Asked 25/3, 2013 at 21:43 Answered 5/3, 2015 at 10:54

I'd like to write a "smart monitor" in Java that sends out an alert any time it detects oncoming performance issues. My Java app is writing data in a structured format to a log file:

<datetime> | <java-method> | <seconds-to-execute>

So, for example, if I had a Widget#doSomething(String) method that took 812ms to execute, it would be logged as:

2013-03-24 11:39:21 | Widget#doSomething(String) | 812

As performance starts to degrade (such as during a major collection, during peak loads, or if the system is just slowing to a crawl), method execution timings start to slow down; so the right-most column starts to see huge numbers (sometime 20 - 40 seconds to execute a single method).

In college - for a machine learning exercise - I wrote what my professor called a linear dichotomizer that took simple test data (the height, weight and gender of a person) and "learned" how to categorize a person as male or female based on their height/weight. Then, once it had all its training data, we fed it new data to see how accurately it could determine gender.

I think the multivariate version of a linear dichotomizer is something called a support vector machine (SVM). If I'm wrong, then please clarify and I'll change the title of my question to something more appropriate. Regardless, I need this app to do the following things:

Run in a "test mode" where I feed it the structured log file from my main Java app (the one I wish to monitor) and it takes each log entry (as shown above) and uses it for test data
Only the java-method and seconds-to-execute columns are important as inputs/test data; I don't care about the datetime
Run in "monitor mode" where it is actively reading new log data from the log file, and using similar "machine learning" techniques to determine if a a performance degradation is looming

It's important to note that the seconds-to-execute column is not the only important factor here, as I've seen horrible timings for certain methods during periods of awesome performance, and really great timings for other methods at times when the server seemed like it was about to die and push daisies. So obviously certain methods are "weighted"/more important to performance than others.

My question

Googling for "linear dichotomizer" or "support vector machines" turns up some really scary, highly-academic, ultra-cerebral white papers that I just don't have the mental energy (nor time) to consume - unless they truly are my only options; so I ask is there a laymen's introduction to this stuff, or a great site/article/tutorial for building such a system in Java?
Are there any solid/stable open source Java libraries? I was only able to find jlibsvm and svmlearn but the former looks to be in a pure beta state and the latter seems to only support binary decisions (like my old linear dichotomizer). I know there's Mahout but that sits on top of Hadoop, and I don't think I have enough data to warrant the time and mental energy into setting up my own Hadoop cluster.

Bigener answered 25/3, 2013 at 21:43 Comment(4)

This is not directly related to your question, but... scary academic-papers are the norm for these kinds of domains, as is a decent amount of math. For example, I had been interested in Neural Nets for a long time. For the final project in one of my masters classes, I decided to implement one to do handwritten-digit recognition. It took me 3 solid days of skimming papers (reading core sections), reading code, and writing code before I got a decent handle on the subject. My advice is to pick the most introductory material and start from there. Basically, expect to have to read! Good luck! – Pus 25/3, 2013 at 23:20

Good point @VivinPaliath (+1) - I guess I'm not opposed to rolling up my sleeves, I just know that if there's anything out there open sourced, that it will have been written by smarter (and more informed) people than myself! – Bigener 25/3, 2013 at 23:25

I prefer Java for development, but when I face machine learning challenges, I turn to R. The packages are usually very mature and well documented. Here is a documentation with examples that decsribes SVM use in R. I use a bridge that is called JRI to embed R into my Java code. – Lucky 26/3, 2013 at 8:1

I'm not sure SVM is the best tool here, but I like how it is explained in this video by Andrew Ng : youtube.com/watch?v=dzuuNPmGAVU – Monolayer 26/3, 2013 at 14:42

A "smart monitor" you describe is exactly time-series classification.

There are many classification algorithms. They all basically take an matrix, where the rows are observations and the columns are "features" that somehow describe the observation, and a label vector of length rows that is valued either 0 or 1. In your problem an observation might be a minute sample, and your label vector will be valued 1 for the time periods that are experiencing performance issues and 0 otherwise.

Implicit in this definition is the need to resample your data(using the mode/median/mean if necessary) such that each observation is defined evenly, such as seconds or minutes or hours.

Generating features is the crucial part. I'd probably start with 2 features, the raw values and the (once) differenced values between observation x_i and x_i-1. We'll define these for a lag of 2. Technically making this 4 features. Each feature can't look into the future. Each feature must represent the same thing for each observation.

For example consider the time-series of length 10:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

If we want to produce a set of features using lag two intervals in the past then the first two element of the time-series are considered a burnt-in sample. We can't use the observations associated with them to train out algorithm.

The raw values, of 8 rows by 2 columns would be

[[ 1.,  0.]
 [ 2.,  1.],
 [ 3.,  2.],
 [ 4.,  3.],
 [ 5.,  4.],
 [ 6.,  5.],
 [ 7.,  6.],
 [ 8.,  7.]]

The differenced values

[[ 1.,  1.],
 [ 1.,  1.],
 [ 1.,  1.],
 [ 1.,  1.],
 [ 1.,  1.],
 [ 1.,  1.],
 [ 1.,  1.]])

These get column stacked. There are many additional features you could explore. Rolling mean would be my next pick.

If you want to predict further in the future then your training data should be lagging further from your label vector.

If performance isn't satisfactory then try adding more features by choosing a rolling mean over a bigger window, or add further back in the future. A clever trick to improve the performance of time-series algorithms is to include the value of the prediction for the previous time interval.

Fit your classifier on some early part of the data, then observe its accuracy over a later part of the data. There are many metrics for classifiers you can use. If you choose to use a classifier that outputs probabilities instead of hard 1/0, then your options even broaden. (As does the uses of your classifier.)

Precision and recall are intuitive performance metrics of classifiers.

Train on the first (early) half of your data and test on the second half (later).

As far as algorithms go, I'd look into logistic regression. I'd only look elsewhere if the performance isn't satisfactory and you've exhausted feature extraction options.

Mallet appears to be a good library for the task. See this bit of the docs.

I recently discovered JSAT, which looks promising.

There are more specific approaches to time-series classification that explicitly take into account the sequential nature of the observations and labels. This is a general purpose adaptation of classification to time-series.

Stenopetalous answered 25/3, 2013 at 23:33 Comment(2)

One up for mentioning logistic regression in this context. – Field 26/3, 2013 at 13:17

Yah. Linear models should be preferred over nonlinear ones unless there is evidence to suggest a performance increase. Logistic regression also has the added bonus of being perfectly interpretable. – Stenopetalous 26/3, 2013 at 15:13

If you are interested in using support vector machines, there's a guide that is very oriented for beginners and you might find useful (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)

That guide is from the same guys of libsvm which is a very mature library for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and they do have binding for Java (http://www.csie.ntu.edu.tw/~cjlin/libsvm/#java)

Poltroon answered 26/3, 2013 at 13:11 Comment(0)

Weka is a popular machine learning/data mining package in Java. This book http://guidetodatamining.com/ might be useful. It does not really address SVM's but it certainly has good classification algorithms, and it is certainly not at all esoteric.

Trepidation answered 26/3, 2013 at 7:47 Comment(0)

Perhaps Apache Spark MLlib will help you:

The linear SVM is a standard method for large-scale classification tasks. It is a linear method as described above in equation (1), with the loss function in the formulation given by the hinge loss:

L(w;x,y):=max{0,1−ywTx}.

By default, linear SVMs are trained with an L2 regularization. We also support alternative L1 regularization. In this case, the problem becomes a linear program.

The linear SVMs algorithm outputs an SVM model. Given a new data point, denoted by x, the model makes predictions based on the value of wTx. By the default, if wTx≥0 then the outcome is positive, and negative otherwise.

Hadrian answered 5/3, 2015 at 10:54 Comment(0)

My question

Recommended topics

Hot tags