The solution proposed in your previous question gave insufficient results (I assume) because the features were poor for this problem.
If I understand correctly, what you want is the following:
Given the sentence:
Apple iPhone 5 White 16GB Dual-Core
you want to get:
B M C S NA
The problem you are describing here is equivalent to part-of-speech (POS) tagging in Natural Language Processing.
Consider the following sentence in English:
We saw the yellow dog
The task of POS tagging is to assign the appropriate tag to each word. In this case:
We(PRP) saw(VBD) the(DT) yellow(JJ) dog(NN)
Don't spend time on understanding the English tags here; I give them only to show you that your problem and POS tagging are equivalent.
Before I explain how to solve it using an SVM, I want to make you aware of other approaches. Consider the sentence `Apple iPhone 5 White 16GB Dual-Core` as test data. The tag you assign to the word `Apple` must be given as input to the tagger when you tag the word `iPhone`. However, with a per-word classifier, once you have tagged a word you never change that tag. Hence, models that tag the whole sequence usually achieve better results. The simplest example is Hidden Markov Models (HMM). Here is a short intro to HMM in POS.
Now we model this problem as a classification problem. Let's define what a window is:
`W-2,W-1,W0,W1,W2`
Here, we have a window of size 2 (two words on each side of `W0`). When classifying the word `W0`, we need the features of all the words in the window, concatenated. Note that for the first word of the sentence we use:
`START-2,START-1,W0,W1,W2`
in order to model the fact that this is the first word. For the second word we have:
`START-1,W-1,W0,W1,W2`
And similarly for the words at the end of the sentence. The tags `START-2`, `START-1`, `STOP1`, `STOP2` must be added to the model too.
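As a minimal sketch of the padding described above (the function name `windows` is my own, not from any library), this is one way to produce the window for every token of a title, assuming the title is split by spaces:

```python
def windows(tokens, size=2):
    """Yield the (2 * size + 1)-token window around every token, padding the edges."""
    left = [f"START-{k}" for k in range(size, 0, -1)]   # START-2, START-1
    right = [f"STOP{k}" for k in range(1, size + 1)]    # STOP1, STOP2
    padded = left + list(tokens) + right
    for i in range(len(tokens)):
        # Token i of the original list sits at index i + size in the padded list,
        # so its window starts at index i.
        yield tuple(padded[i:i + 2 * size + 1])


title = "Apple iPhone 5 White 16GB Dual-Core".split()
for window in windows(title):
    print(window)
```

The first window printed is the one centred on `Apple`, padded with `START-2` and `START-1` on the left; the last one is centred on `Dual-Core`, padded with `STOP1` and `STOP2` on the right.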
Now, let's describe the features used for tagging W0:
Features(W-2),Features(W-1),Features(W0),Features(W1),Features(W2)
The features of a token should be the word itself and, for tokens that come before W0, the tag already given to that token. We shall use binary features.
Example - how to build the feature representation:
Step 1 - building the word representation for each token:
Let's take a window size of 1. When classifying a token, we use `W-1,W0,W1`. Say you build a dictionary and give every word in the corpus a number:
n['Apple'] = 0
n['iPhone 5'] = 1
n['White'] = 2
n['16GB'] = 3
n['Dual-Core'] = 4
n['START-1'] = 5
n['STOP1'] = 6
Step 2 - a feature index for each tag:
We create features for the following tags:
n['B'] = 7
n['M'] = 8
n['C'] = 9
n['S'] = 10
n['NA'] = 11
n['START-1'] = 12
n['STOP1'] = 13
Let's build a feature vector for `START-1,Apple,iPhone 5`. The first token is a word with a known tag (`START-1` will always have the tag `START-1`). So the features for this token are:
(0,0,0,0,0,1,0,0,0,0,0,0,1,0)
(The features that are 1: having the word `START-1`, and the tag `START-1`.)
For the token `Apple`:
(1,0,0,0,0,0,0)
Note that the already-calculated tag feature is used only for words before W0 (since only those tags have been computed), so W0 and W1 carry just the 7 word features. Similarly, the features of the token `iPhone 5`:
(0,1,0,0,0,0,0)
Step 3 - concatenate all the features:
Generally, the features for a window of size 1 will be:
word(W-1),tag(W-1),word(W0),word(W1)
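To make the concatenation concrete, here is a small sketch that builds the full vector for the window `START-1,Apple,iPhone 5`. The helper names (`one_hot`, `features`) are my own. I also assume two separate index tables instead of one shared dictionary, so tag indices start again at 0; after concatenation the tag of W-1 still lands at positions 7 to 13, as in the numbering above.

```python
WORDS = ["Apple", "iPhone 5", "White", "16GB", "Dual-Core", "START-1", "STOP1"]
TAGS = ["B", "M", "C", "S", "NA", "START-1", "STOP1"]
word_id = {w: i for i, w in enumerate(WORDS)}
tag_id = {t: i for i, t in enumerate(TAGS)}


def one_hot(index, length):
    """Binary vector of the given length with a single 1 at `index`."""
    vec = [0] * length
    vec[index] = 1
    return vec


def features(prev_word, prev_tag, word, next_word):
    """word(W-1), tag(W-1), word(W0), word(W1) concatenated into one vector."""
    return (one_hot(word_id[prev_word], len(WORDS))
            + one_hot(tag_id[prev_tag], len(TAGS))
            + one_hot(word_id[word], len(WORDS))
            + one_hot(word_id[next_word], len(WORDS)))


x = features("START-1", "START-1", "Apple", "iPhone 5")
print(len(x))                                   # 7 + 7 + 7 + 7 positions
print([i for i, v in enumerate(x) if v == 1])   # positions that are switched on
```

With a real corpus the vectors become very long and almost entirely zero, which is why implementations usually store only the active features (for example as a set of strings) instead of the dense vector.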
Regarding your question: I would use one more tag, `number`, so that when you tag the word `5` (since you split the title by space), the features of `W0` will have a 1 on some number feature, and a 1 on the `model` tag of `W-1`, in case the previous token was tagged correctly as model.
To sum up, what you should do:
- give a number to each word in the data
- build the feature representation for the training data (using the tags you already assigned manually)
- train a model
- label the test data
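The steps above can be sketched end to end as follows. To keep the example free of dependencies, a plain multiclass perceptron stands in for the SVM, features are stored as sets of strings rather than numbered vectors, and the training set is a single toy title that I made up from the example in the question. All function names here are hypothetical.

```python
from collections import defaultdict


def featurize(i, tokens, prev_tag):
    """Active binary features for token i, using a window of size 1."""
    prev_word = tokens[i - 1] if i > 0 else "START-1"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "STOP1"
    return {"bias", "w0=" + tokens[i], "w-1=" + prev_word,
            "w+1=" + next_word, "t-1=" + prev_tag}


def predict(weights, tags, feats):
    """Return the tag with the highest summed weight over the active features."""
    scores = {t: sum(weights[(f, t)] for f in feats) for t in tags}
    return max(tags, key=lambda t: scores[t])


def train(data, epochs=5):
    """Perceptron training; the gold previous tag is used as a feature."""
    tags = sorted({t for _, gold in data for t in gold})
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, gold in data:
            prev = "START-1"
            for i, gold_tag in enumerate(gold):
                feats = featurize(i, tokens, prev)
                guess = predict(weights, tags, feats)
                if guess != gold_tag:
                    # Reward the correct tag, penalise the wrong guess.
                    for f in feats:
                        weights[(f, gold_tag)] += 1.0
                        weights[(f, guess)] -= 1.0
                prev = gold_tag
    return weights, tags


def tag(weights, tags, tokens):
    """Greedy left-to-right tagging: each prediction feeds the next token."""
    out, prev = [], "START-1"
    for i in range(len(tokens)):
        prev = predict(weights, tags, featurize(i, tokens, prev))
        out.append(prev)
    return out


train_data = [("Apple iPhone 5 White 16GB Dual-Core".split(),
               ["B", "M", "NUMBER", "C", "S", "NA"])]
weights, tags = train(train_data)
print(tag(weights, tags, "Apple iPhone 5 Black 16GB Dual-Core".split()))
```

Note how `tag` passes its own previous prediction to the next token, which is exactly the dependency described earlier: an early mistake can propagate, and that is the weakness sequence models such as HMMs address. With a real training set you would swap the perceptron for whichever classifier you prefer.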
Final note - a warm tip for existing code:
You can find a POS tagger implemented in Python here. It includes an explanation of the problem and the code, and it also does the feature extraction I just described. Additionally, it uses a `set` to represent the features of each word, so the code is much simpler to read.
The data this tagger receives should look like this:
Apple_B iPhone_M 5_NUMBER White_C 16GB_S Dual-Core_NA
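If your labelled titles are stored in this `word_TAG` form, a small helper like the one below can turn a line into parallel word and tag lists. The name `parse_tagged` is my own, and I assume the tag is whatever follows the last underscore, so words that themselves contain an underscore are kept intact.

```python
def parse_tagged(line):
    """Split 'word_TAG word_TAG ...' into a list of words and a list of tags."""
    # rsplit with maxsplit=1 cuts only at the last underscore of each token.
    pairs = [token.rsplit("_", 1) for token in line.split()]
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return words, tags


words, tags = parse_tagged("Apple_B iPhone_M 5_NUMBER White_C 16GB_S Dual-Core_NA")
print(words)
print(tags)
```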
The feature extraction is done in this manner (see more at the link above):
def get_features(i, word, context, prev):
    '''Map tokens-in-contexts into a feature representation, implemented as a
    set. If the features change, a new model must be trained.'''
    def add(name, *args):
        features.add('+'.join((name,) + tuple(args)))

    features = set()
    add('bias') # This acts sort of like a prior
    add('i suffix', word[-3:])
    add('i-1 tag', prev)
    add('i word', context[i])
    add('i-1 word', context[i-1])
    add('i+1 word', context[i+1])
    return features
For the example above:
context = ["Apple","iPhone","5","White","16GB","Dual-Core"]
prev = "B"
i = 1
word = "iPhone"
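Generally, `word` is the str of the current word, `context` is the title split into a list, and `prev` is the tag you assigned to the previous word. Since the function reads `context[i-1]` and `context[i+1]`, pad the list with start and stop tokens (and shift `i` accordingly) before extracting features for the first and last words.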
I have used this code in the past; it ran fast and gave very good results.
Hope it's clear. Have fun tagging!