Support vector machine in Python using libsvm example of features

I have scraped a lot of ebay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way

B M C S NA

where B=Brand (Apple), M=Model (iPhone 5), C=Color (White), S=Size (16GB), NA=Not Assigned (Dual-Core)

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in the eBay titles.

I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification one. In this way I can predict new models.

I want to consider these features:

* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized 
....

I can't understand how to give all this information to the library. The official documentation lacks a lot of information.

My classes are Brand, Model, Size, Color, NA.

What must the input file of the SVM algorithm contain?

How can I create it? Could I have an example of that file using the 4 features from my question? Could I also have an example of the code that I must use to build the input file?

* UPDATE * I want to represent these features. How should I do it?

  1. Identity of the current word

I think that I can interpret it in this way

0 --> Brand
1 --> Model
2 --> Color
3 --> Size
4 --> NA

If I know that the word is a Brand I will set that variable to 1 (true). That is fine in the training set (because I have tagged all the words), but how can I do that for the test set? I don't know the category of a word (this is why I'm learning it :D).

  2. N-gram substring features of the current word (N=4,5,6). No idea, what does it mean?

  3. Identity of the 2 words before the current word. How can I model this feature?

Considering the legend that I created for the 1st feature, I have 5^2 = 25 combinations:

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

How can I convert it to a format that libsvm (or scikit-learn) can understand?

  4. Membership in the 4 dictionaries of attributes

Again, how can I do it? Having 4 dictionaries (for color, size, model and brand), I think that I must create a bool variable that I set to true if and only if the current word matches an entry in one of the 4 dictionaries.

  5. Exclusive membership in the dictionary of brand names

I think that, as with feature 4, I must use a bool variable. Do you agree?
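
For illustration only, here is one possible reading of the features listed above, written as a small Python sketch. It assumes that "identity" means the word itself (not its tag), that N-gram substrings are character substrings of length 4, 5 and 6, and that each dictionary gets its own 0/1 flag. The BRANDS and COLORS sets and the helper names are hypothetical; scikit-learn's DictVectorizer is used to turn the string-valued features into numeric columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical dictionaries; real ones would be built from your tagged data.
BRANDS = {"apple", "samsung"}
COLORS = {"white", "black"}


def char_ngrams(word, sizes=(4, 5, 6)):
    """All character substrings of the given lengths (empty for short words)."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def word_features(tokens, i):
    """Feature dict for the token at position i of a tokenized title."""
    word = tokens[i].lower()
    feats = {
        "word": word,  # identity of the current word
        "prev1": tokens[i - 1].lower() if i >= 1 else "<start>",
        "prev2": tokens[i - 2].lower() if i >= 2 else "<start>",
        "in_brand_dict": int(word in BRANDS),
        "in_color_dict": int(word in COLORS),
    }
    for gram in char_ngrams(word):
        feats["ngram=" + gram] = 1
    return feats


tokens = "Apple iPhone 5 White 16GB Dual-Core".split()
rows = [word_features(tokens, i) for i in range(len(tokens))]

# String values become one 0/1 column per distinct value (one-hot encoding).
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(rows)
print(X.shape)
```

With this reading, nothing in the feature dict depends on the unknown tag of a word, so the same function can be applied to untagged test titles; the vectorizer fitted on the training rows would then be reused with transform() on the test rows.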

Projective answered 22/6, 2015 at 23:34 Comment(3)
I suggest you also take a look at sklearn. Their SVM library is a little more comprehensive and the documentation is very good. - Regression
Yes, I know it... but I need an example similar to my problem :D - Projective
I don't know if I missed something, but I would tackle this similarly to a multi-label classification. Although, I would iterate over every word and ask my classifier what it thinks that word is, based on the entire phrase and placement within the phrase. You could also see if nltk gives you a better starting point, but unless you are building a new grammar it would entail the same word-by-word classifier that I described above. Hope this makes sense! - Ventricose

Here's a step-by-step guide to training an SVM on your data and then evaluating it on the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the intermediate output and the resulting accuracy (it's an IPython notebook).

Step 0: Install dependencies

You need to install the following libraries:

  • pandas
  • scikit-learn

From command line:

pip install pandas
pip install scikit-learn

Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a csv and then load it.

We will train the SVM with train.csv and get test labels with test.csv

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
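
The CSV above was typed by hand for illustration. As a rough sketch, the four feature columns could be computed from a title like this, assuming tokens are split on whitespace, distances are counted in tokens starting at 0, and "capitalized" means the first character is uppercase. The function name token_features is made up for this example.

```python
def token_features(title):
    """Return one feature dict per whitespace-separated token of a title."""
    tokens = title.split()
    rows = []
    for i, token in enumerate(tokens):
        rows.append({
            "distance_from_beginning": i,
            "distance_from_end": len(tokens) - 1 - i,
            "contains_digit": int(any(ch.isdigit() for ch in token)),
            "capitalized": int(token[:1].isupper()),
        })
    return rows


rows = token_features("Apple iPhone 5 White 16GB Dual-Core")
for row in rows:
    print(row)
```

Each dict corresponds to one row of the CSV; the class_label column would come from your manual tags.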

Step 2: Process the data

We will convert our dataframe into numpy arrays, which is a format that scikit-learn understands.

We also need to convert the labels "B", "M", "C", ... to numbers, because the SVM does not understand strings.

Then we will train an SVM on the data. (Note that svm.SVC() uses an RBF kernel by default; pass kernel='linear' if you want a linear SVM.)

import numpy as np

train_labels = train_dataframe.class_label
# Sort the distinct labels so the label-to-index mapping is the same on every run.
labels = sorted(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
# Every column after the first one (class_label) is a feature.
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print("train labels: ")
print(train_labels)
print("")
print("train features:")
print(train_features)

We see here that the length of train_labels (5) exactly matches the number of rows in train_features. Each item in train_labels corresponds to a row.
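
As an alternative to building the label-to-index mapping by hand, scikit-learn's LabelEncoder can do the same job and also map predictions back to the original strings. A minimal sketch, using the five labels from the sample CSV:

```python
from sklearn.preprocessing import LabelEncoder

raw_labels = ["B", "M", "C", "S", "N"]

encoder = LabelEncoder()
# fit_transform sorts the distinct labels and assigns them 0, 1, 2, ...
encoded = encoder.fit_transform(raw_labels)
# inverse_transform turns numeric predictions back into the original strings.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

The fitted encoder should be kept and reused on the test labels so that each number means the same class in both sets.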

Step 3: Train the SVM

from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)
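
Because the distance columns take much larger values than the 0/1 columns, it may help to scale the features before training. The following is only a sketch of that idea, not part of the original notebook: it wraps a StandardScaler and a linear-kernel SVC in a scikit-learn pipeline and fits it on the same five sample rows.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The five sample rows from train.csv (features only) and their numeric labels.
train_features = np.array([
    [1, 10, 1, 0],
    [10, 1, 0, 1],
    [2, 3, 0, 1],
    [23, 2, 0, 0],
    [12, 0, 0, 1],
], dtype=float)
train_labels = np.array([0, 1, 2, 3, 4])

# Scale each column to zero mean and unit variance, then fit a linear SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(train_features, train_labels)

predictions = model.predict(train_features)
print(predictions)
```

The pipeline applies the same scaling automatically when predict() is called on new rows.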

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

test_labels = test_dataframe.class_label
# Reuse the label-to-index mapping built from the training data ("labels" from Step 2)
# so that each number means the same class at training and test time.
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
# float() avoids integer division under Python 2.
recall = num_correct / float(len(test_labels))
print("model accuracy (%): {} %".format(recall * 100))

Links & Tips

You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.

Note that since you're evaluating on the same data you trained on, the accuracy will be unusually high.
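
To get a more honest estimate, part of the tagged data can be held out and used only for testing. Below is a minimal sketch using scikit-learn's train_test_split; the feature values are random placeholders (not real eBay data), so the printed accuracy itself is meaningless and only the mechanics matter.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Placeholder data: 40 random rows with 4 features, 8 rows per class (0..4).
rng = np.random.RandomState(0)
features = rng.rand(40, 4)
labels = np.repeat([0, 1, 2, 3, 4], 8)

# Hold out 25% of the rows; stratify keeps the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0)

classifier = svm.SVC()
classifier.fit(X_train, y_train)

# score() returns the fraction of held-out rows that were predicted correctly.
accuracy = classifier.score(X_test, y_test)
print("held-out accuracy: {:.2f}".format(accuracy))
```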

Sesqui answered 27/6, 2015 at 22:0 Comment(11)
Thank you for your great answer, and thanks also for the great list of links and tips; I really appreciate it. Scikit-learn has good documentation and your example is very understandable. Thank you for your time. I have updated my question with some doubts about modelling some features. Could you please help me represent them in a format that scikit-learn can understand? Thanks - Projective
I read the update to your question. Your original question asked how to train an SVM using a set of features, but your update is now asking a follow-up question: how to extract features from your raw data. I have some thoughts on things you could do, but it will be hard for me to work this into my existing answer. I'd recommend posting a separate question asking how to extract the features you want from your raw data. You're more likely to get a good answer if you keep your question specific. Feel free to post a link to the new question here and I can take a look if I have time. - Sesqui
Sure, right. Thank you so much for your time. #31104606 - Projective
OK, I took a quick look and will try to take a more detailed look later. You may get some comments asking you to be a bit more specific, which I think will also help you solve your problem. You might also want to remove the Update part of this question. But maybe @erik-e will provide some additional help, so probably wait a day or so :-) - Sesqui
Yes, sorry, I added more info to the linked question! - Projective
Good feedback on the separate question, @juliasaurus; that is too much to address in the same answer. Usi, I recommend this solution over using LibSVM directly. LibSVM is great software, but I wouldn't reach for it first; I much prefer this approach with scikit-learn and pandas. - Kania
@Kania Thank you very much for your answer. Both of you have been very kind. Erik, your answer perfectly answers my question (LIBSVM), but juliasaurus's answer is very complete and precise. That's why I chose her answer as the best. Please understand, Erik, it was a very difficult choice! Anyway, I still have some doubts about modelling some features; all the details are in my second question (#31104606). Could someone help me? That's very important for me! Thanks to both of you :D - Projective
No problem at all. I am glad you followed @juliasaurus's answer; it is a better way to approach this problem. The tools recommended are great starting points for this kind of analysis, whereas LibSVM is not as fully featured. - Kania
Thanks @UsiUsi! I'll try to take a look at the features tonight or this weekend. Sorry, work leaves me very tired in the evenings, so I don't often do much SO then. - Sesqui
Thanks :D! I also started a bounty for that question! As I said, it is really important for me. - Projective
@juliasaurus great answer! I just want to add a small comment: change recall = num_correct / len(test_labels) to recall = num_correct / float(len(test_labels)), otherwise the recall percentage will be 0 all the time. - Octet

I echo the comment of @MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python library I recommend the README on GitHub.

An important point is that there is a Sparse format, in which all features that are 0 are removed, and a Dense format, in which they are kept. These two equivalent examples, one of each, are taken from the README.

# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]

The y variable stores a list of all the categories for the data.

The x variable stores the feature vector.

assert len(y) == len(x), "Both lists should be the same length"

The format found in the Heart Scale Example is a Sparse format where the dictionary key is the feature index and the dictionary value is the feature value while the first value is the category.
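
To make the relationship between the two formats and the on-disk file concrete, here is a small sketch that converts the dense rows from the README example into sparse dictionaries and then into text lines of the form "label index:value ...". The helper names to_sparse and to_libsvm_line are made up for this example and are not part of LibSVM.

```python
def to_sparse(row):
    """Dense list -> {1-based feature index: value}, dropping zero values."""
    return {i: v for i, v in enumerate(row, start=1) if v != 0}


def to_libsvm_line(label, row):
    """Format one sample as a line of a LibSVM-style text file."""
    sparse = to_sparse(row)
    pairs = " ".join("{}:{}".format(i, sparse[i]) for i in sorted(sparse))
    return "{} {}".format(label, pairs).strip()


y = [1, -1]
x = [[1, 0, 1], [-1, 0, -1]]

sparse_x = [to_sparse(row) for row in x]
lines = [to_libsvm_line(label, row) for label, row in zip(y, x)]

print(sparse_x)
for line in lines:
    print(line)
```

Writing those lines to a text file, one sample per line, gives a file in the same layout as the Heart Scale example.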

The Sparse format is incredibly useful while using a Bag of Words Representation for your feature vector.

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
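
As a rough illustration of that sparsity, the sketch below builds a bag-of-words matrix with scikit-learn's CountVectorizer. The first title is the one from the question; the other two are invented for this example. The result is a SciPy sparse matrix, which stores only the non-zero counts.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "Apple iPhone 5 White 16GB Dual-Core",  # from the question
    "Samsung Galaxy S4 Black 16GB",         # invented example
    "Apple iPad Mini Silver 32GB",          # invented example
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(titles)

# Fraction of cells that are non-zero; it shrinks as the vocabulary grows.
density = matrix.nnz / float(matrix.shape[0] * matrix.shape[1])

print(matrix.shape)
print("non-zero fraction: {:.2f}".format(density))
```

Note that CountVectorizer's default tokenizer lowercases the text and ignores single-character tokens such as "5".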

For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used but may help in showing how to create and test a model.

from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])

# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # Class labels can be any integers; we start at 1 here to mirror LibSVM's 1-based feature indices.
    categories[name] = Category(index + 1, name)
categories

Out[0]: {'B': Category(index=1, name='B'),
   'C': Category(index=3, name='C'),
   'M': Category(index=2, name='M'),
   'NA': Category(index=5, name='NA'),
   'S': Category(index=4, name='S')}

# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]

# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))

features

Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
 Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
 Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
 Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
 Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]

# Y is the category index used in training for each Feature. It is a list (order matters) of all the trained indexes.
y = [f.category_index for f in features]
# X is the list of feature vectors; for each one we take all the named tuple's values except the category, which is at index 0.
x = [list(f)[1:] for f in features]

# If LibSVM was installed with pip, this import may need to be "from libsvm.svmutil import ..." instead.
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model=svm_train(prob, param)

# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)

Out[3]: Accuracy = 100% (5/5) (classification)

I hope this example helps. It is meant only as an illustration and shouldn't be used for your actual training, because it is inefficient.

Kania answered 27/6, 2015 at 11:51 Comment(1)
Thank you for your great answer. It is what I was looking for. Thank you for your time. I have updated my question with some doubts about modelling some features. Could you please help me represent them in a format that the libsvm library can understand? Thanks - Projective
