Suppose we're training a neural network model to learn the mapping from the following input to output, where the output is the Named Entity (NE) tag sequence.
Input: EU rejects German call to boycott British lamb .
Output: ORG O MISC O O O MISC O O
A sliding window is created to capture context information, and its outputs are fed into the training model as model_input. The sliding window produces the following results:
[['<s>', '<s>', 'EU', 'rejects', 'German'],
 ['<s>', 'EU', 'rejects', 'German', 'call'],
 ['EU', 'rejects', 'German', 'call', 'to'],
 ['rejects', 'German', 'call', 'to', 'boycott'],
 ['German', 'call', 'to', 'boycott', 'British'],
 ['call', 'to', 'boycott', 'British', 'lamb'],
 ['to', 'boycott', 'British', 'lamb', '.'],
 ['boycott', 'British', 'lamb', '.', '</s>'],
 ['British', 'lamb', '.', '</s>', '</s>']]
<s> represents the start-of-sentence token and </s> represents the end-of-sentence token, and every sliding window corresponds to one NE tag in the output.
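For reference, here is a minimal sketch of how such windows could be generated (the helper name make_windows and the window size of 5 are illustrative, not part of the actual setup):

```python
def make_windows(tokens, size=5, pad_left="<s>", pad_right="</s>"):
    half = size // 2
    padded = [pad_left] * half + tokens + [pad_right] * half
    # One window per original token, centered on that token.
    return [padded[i:i + size] for i in range(len(tokens))]

sentence = "EU rejects German call to boycott British lamb .".split()
for window in make_windows(sentence):
    print(window)
```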
To process these tokens, a pre-trained embedding model (e.g., GloVe) is used to convert words to vectors, but such pre-trained models do not include tokens such as <s> and </s>. I don't think random initialization for <s> and </s> is a good idea here, because the scale of those random vectors might not be consistent with the other GloVe embeddings.
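To make the scale concern concrete, a rough check could compare the typical norm of the GloVe vectors with that of standard-normal random vectors (the file name glove.6B.100d.txt and the use of NumPy here are assumptions for illustration):

```python
import numpy as np

# Load GloVe vectors from a plain-text file (path is an assumption).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.array(parts[1:], dtype=np.float32)

vectors = np.stack(list(glove.values()))
dim = vectors.shape[1]

# If the GloVe components are on a smaller scale than N(0, 1),
# the average norms printed below will differ noticeably.
print("mean GloVe norm:  ", np.linalg.norm(vectors, axis=1).mean())
print("mean N(0,1) norm: ", np.linalg.norm(np.random.randn(1000, dim), axis=1).mean())
```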
Question:
What would you suggest for setting up the embeddings for <s> and </s>, and why?