Are start and end states in an HMM necessary when implementing the Viterbi algorithm for POS tagging?

I do not fully understand how to use the start and end states in the Hidden Markov Model. Are these necessary in order to design and implement the transition and emission matrices?

Kokanee answered 15/2, 2014 at 16:29 Comment(0)

The start/end states are necessary for modeling whether a tag is likely to come at the beginning or end of a sentence.

For example, if you had a five-word sentence and you were considering two taggings

  1. Det Noun Verb Det Noun
  2. Det Noun Verb Det Adj

Both of these look pretty good in terms of transitions because Det->Noun and Det->Adj are both very likely. But it is much less likely for a sentence to end in an Adj than in a Noun, and the model cannot capture that without an end tag. So what you really want to compare is

  1. START Det Noun Verb Det Noun END
  2. START Det Noun Verb Det Adj END

Then you will be computing P(END|Noun) and P(END|Adj).
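Looking only at the transition part of the score (emissions left out for brevity), the two taggings share every factor except the last two, so the comparison reduces to the ratio below. Without the END state, the second factor in the numerator and denominator would be missing and only Det->Noun versus Det->Adj would be compared.

```latex
\frac{P(\text{tagging 1})}{P(\text{tagging 2})}
  = \frac{P(\text{Noun}\mid\text{Det})\,P(\text{END}\mid\text{Noun})}
         {P(\text{Adj}\mid\text{Det})\,P(\text{END}\mid\text{Adj})}
```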


If you're doing supervised training, then getting the probabilities for START/END is no different from getting them for the other tags: you just add the special tags to the beginning and end of each sentence before counting. So if your training corpus has:

Det Noun Verb
Det Noun Verb Det Noun

Then you would modify it to be

START Det Noun Verb END
START Det Noun Verb Det Noun END

And compute, for example:

  • P(Det|START) = 2/2
  • P(END|Verb) = 1/2
  • P(END|Noun) = 1/3

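Also, emissions for the special states are trivial: START emits only the START pseudo-word and END emits only the END pseudo-word, so P(START|START)=1 and P(END|END)=1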
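The transition counts in this example can be reproduced with a short script. This is a minimal sketch rather than a reference implementation: the function name `transition_probs` and the list-of-tag-lists corpus format are assumptions made for illustration, and no smoothing is applied. START and END are added on the fly while counting, so the corpus itself is never edited.

```python
from collections import Counter

START, END = "START", "END"


def transition_probs(tagged_sentences):
    """Estimate bigram transition probabilities P(next | prev) by relative frequency.

    `tagged_sentences` is a list of tag sequences (hypothetical format).
    START and END are added per sentence during counting.
    """
    bigram_counts = Counter()
    prev_counts = Counter()
    for tags in tagged_sentences:
        padded = [START, *tags, END]
        # Count every adjacent (prev, next) pair, including START->first and last->END.
        for prev, nxt in zip(padded, padded[1:]):
            bigram_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {
        (prev, nxt): count / prev_counts[prev]
        for (prev, nxt), count in bigram_counts.items()
    }


corpus = [
    ["Det", "Noun", "Verb"],
    ["Det", "Noun", "Verb", "Det", "Noun"],
]
probs = transition_probs(corpus)
print(probs[(START, "Det")])  # P(Det|START) = 2/2
print(probs[("Verb", END)])   # P(END|Verb)  = 1/2
print(probs[("Noun", END)])   # P(END|Noun)  = 1/3
```

The keys are `(prev, next)` pairs, so `probs[("Noun", END)]` corresponds to P(END|Noun) in the notation used here.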

Pressor answered 15/2, 2014 at 16:57 Comment(3)
Right, but I don't have that information in either the transition matrix or the emission matrix. Should I also keep counts of the POS tags that appear after a full stop? - Kokanee
Sure! I am trying to implement a bigram tagger, though, so do I have to insert these START and END states into every sentence in my corpus? - Kokanee
You don't actually need to edit your corpus; you can just "add" them on the fly as you're doing your counting. - Pressor

I think this really depends on your corpus. If the corpus you are using consists of full sentences (semantically speaking), then I suggest you add the start and end states to improve the language model. But if the corpus is full of sentence fragments, then I don't think start/end states will help. They may even backfire.

Basically, in POS tagging, start states model which tags are more likely to appear at the beginning of a sentence, and end states do the same for the end of a sentence. So if the sentences in your corpus really are sentences, these start/end states will teach your language model how to begin and finish a sentence.
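To show where the start and end states enter Viterbi decoding, here is a minimal sketch of a bigram decoder. It is an illustration under stated assumptions, not anyone's actual implementation: the function name `viterbi`, the dictionary layout of `trans` and `emit`, and all probabilities in the toy example are made up, the input is assumed to be non-empty, and unseen events get probability 0 (no smoothing).

```python
import math

START, END = "START", "END"


def viterbi(words, tags, trans, emit):
    """Return the most probable tag sequence for a non-empty list of words.

    trans[(prev, nxt)] and emit[(tag, word)] hold probabilities; missing
    entries count as probability 0 (hypothetical layout, no smoothing).
    """
    def logp(table, key):
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # Initialisation: the first tag is reached from START.
    best = [{t: logp(trans, (START, t)) + logp(emit, (t, words[0])) for t in tags}]
    back = [{}]

    # Recursion over the remaining words.
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + logp(trans, (p, t)))
            best[i][t] = (best[i - 1][prev] + logp(trans, (prev, t))
                          + logp(emit, (t, words[i])))
            back[i][t] = prev

    # Termination: each path also pays for the transition into END.
    last = max(tags, key=lambda t: best[-1][t] + logp(trans, (t, END)))

    # Follow the back-pointers from the last tag to the first.
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model with made-up numbers: "red" is equally likely as Noun or Adj.
tags = ["Det", "Noun", "Adj"]
trans = {
    (START, "Det"): 1.0,
    ("Det", "Noun"): 0.4, ("Det", "Adj"): 0.5,
    ("Noun", END): 0.5, ("Adj", END): 0.05,
}
emit = {("Det", "the"): 1.0, ("Noun", "red"): 0.1, ("Adj", "red"): 0.1}

print(viterbi(["the", "red"], tags, trans, emit))  # ['Det', 'Noun']
```

In this toy model Det->Adj is slightly more likely than Det->Noun, so the Adj reading would win if the termination step were dropped; the low P(END|Adj) is what tips the result to Noun.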

Sludgy answered 17/2, 2014 at 9:18 Comment(0)
