I've been experimenting with stacking language models recently and noticed something interesting: the output embeddings of BERT and XLNet are not the same as the input embeddings. For example, this code snippet:
import torch
import transformers

bert = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")
tok = transformers.BertTokenizer.from_pretrained("bert-base-cased")
sent = torch.tensor(tok.encode("I went to the store the other day, it was very rewarding."))
enc = bert.get_input_embeddings()(sent)  # token ids -> input embedding vectors
dec = bert.get_output_embeddings()(enc)  # embedding vectors -> vocabulary logits
print(tok.decode(dec.softmax(-1).argmax(-1)))
Outputs this for me:
,,,,,,,,,,,,,,,,,
I would have expected the (formatted) input sequence to be returned since I was under the impression that the input and output token embeddings were tied.
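One way to sanity-check that assumption (just a quick sketch, using the same get_input_embeddings/get_output_embeddings accessors as above) is to compare the two weight matrices directly:

import torch
import transformers

bert = transformers.BertForMaskedLM.from_pretrained("bert-base-cased")

in_emb = bert.get_input_embeddings()    # typically an nn.Embedding over the vocab
out_emb = bert.get_output_embeddings()  # typically an nn.Linear projecting back to the vocab

# Same underlying storage / identical values would indicate the weights are tied
print(in_emb.weight.data_ptr() == out_emb.weight.data_ptr())
print(torch.equal(in_emb.weight, out_emb.weight))

# The output projection may also carry its own bias term, separate from the tied weights
print(getattr(out_emb, "bias", None) is not None)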
What's interesting is that most other models do not exhibit this behavior. For example, if you run the same snippet on GPT-2, ALBERT, or RoBERTa, it outputs the input sequence.
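For reference, the GPT-2 version of that snippet would look roughly like this (a sketch, assuming the standard GPT2LMHeadModel and GPT2Tokenizer classes):

import torch
import transformers

gpt2 = transformers.GPT2LMHeadModel.from_pretrained("gpt2")
tok = transformers.GPT2Tokenizer.from_pretrained("gpt2")
sent = torch.tensor(tok.encode("I went to the store the other day, it was very rewarding."))
enc = gpt2.get_input_embeddings()(sent)  # token ids -> input embedding vectors
dec = gpt2.get_output_embeddings()(enc)  # embedding vectors -> vocabulary logits
print(tok.decode(dec.softmax(-1).argmax(-1)))  # echoes the input sentence for me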
Is this a bug, or is it expected behavior for BERT/XLNet?