I have scraped a lot of ebay titles like this one:
Apple iPhone 5 White 16GB Dual-Core
and I have manually tagged all of them in this way
B M C S NA
where B=Brand (Apple) M=Model (iPhone 5) C=Color (White) S=Size (Size) NA=Not Assigned (Dual Core)
Now I need to train a SVM classifier using the libsvm library in python to learn the sequence patterns that occur in the ebay titles.
I need to extract new value for that attributes (Brand, Model, Color, Size) by considering the problem as a classification one. In this way I can predict new models.
I want to considering this features:
* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized
....
I can't understand how can I give all this info to the library. The official doc lacks a lot of information
My class are Brand, Model, Size, Color, NA
what does the input file of the SVM algo must contain?
how can I create it? could I have an example of that file considering the 4 features that I put as example in my question? Can I also have an example of the code that I must use to elaborate the input file ?
* UPDATE * I want to represent these features... How can I must do?
- Identity of the current word
I think that I can interpret it in this way
0 --> Brand
1 --> Model
2 --> Color
3 --> Size
4 --> NA
If I know that the word is a Brand I will set that variable to 1 (true). It is ok to do it in the training test (because I have tagged all the words) but how can I do that for the test set? I don't know what is the category of a word (this is why I'm learning it :D).
N-gram substring features of current word (N=4,5,6) No Idea, what does it means?
Identity of 2 words before the current word. How can I model this feature?
Considering the legend that I create for the 1st feature I have 5^(5) combination)
00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44
How can I convert it to a format that the libsvm (or scikit-learn) can understand?
- Membership to the 4 dictionaries of attributes
Again how can I do it? Having 4 dictionaries (for color, size, model and brand) I thing that I must create a bool variable that I will set to true if and only if I have a match of the current word in one of the 4 dictionaries.
- Exclusive membership to dictionary of brand names
I think that like in the 4. feature I must use a bool variable. Do you agree?