These names were inherited from the original Google `word2vec.c` implementation, upon which the gensim `Word2Vec` class was based. (I believe `syn0` only exists in recent versions for backward-compatibility.)
The `syn0` array essentially holds raw word-vectors. From the perspective of the neural network used to train word-vectors, these vectors are a 'projection layer' that can convert a one-hot encoding of a word into a dense embedding-vector of the right dimensionality.
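For example, here's a minimal sketch of peeking at that array, assuming a gensim-3.x-style API (where `syn0` still exists as a deprecated alias for `model.wv.vectors`) and a made-up toy corpus:

```python
from gensim.models import Word2Vec

# Tiny made-up corpus; any iterable of token-lists works.
sentences = [["hello", "world"], ["world", "peace"], ["hello", "peace"]]
model = Word2Vec(sentences, size=50, min_count=1)

# syn0 is shaped (vocab_size, vector_size); each row is one word's raw vector.
print(model.wv.syn0.shape)

# Looking a word up by its string key just indexes into that same array.
idx = model.wv.vocab["world"].index
assert (model.wv.syn0[idx] == model.wv["world"]).all()
```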
Similarity operations tend to be done on the unit-normalized versions of the word-vectors: that is, vectors that have all been scaled to have a magnitude of 1.0. (This makes the cosine-similarity calculation easier.) The `syn0norm` array is filled with these unit-normalized vectors the first time they're needed.
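As a rough illustration of that normalization, with plain numpy arrays standing in for `syn0`/`syn0norm`:

```python
import numpy as np

raw = np.random.rand(4, 8).astype(np.float32)  # stand-in for a tiny syn0

# Scale each row to unit length (magnitude 1.0), as in syn0norm.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# With unit-length rows, cosine-similarity reduces to a plain dot product.
cos_via_dot = unit[0].dot(unit[1])
cos_explicit = raw[0].dot(raw[1]) / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1]))
assert np.isclose(cos_via_dot, cos_explicit)
```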
This `syn0norm` will be empty until either you do an operation (like `most_similar()`) that requires it, or you explicitly call `init_sims()`. If you explicitly call `init_sims(replace=True)`, you'll actually clobber the raw vectors, in place, with the unit-normed vectors. This saves the memory that storing both vectors for every word would otherwise require. (However, some word-vector uses may still be interested in the original raw vectors of varying magnitudes, so only do this when you're sure `most_similar()` cosine-similarity operations are all you'll need.)
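Continuing the sketch above (same gensim-3.x-style assumption), you can see the replace-in-place effect:

```python
model.init_sims()
# Two separate arrays now exist: the raw vectors plus unit-normed copies.
print(model.wv.syn0norm is model.wv.syn0)    # False

model.init_sims(replace=True)
# Now there's one array: the raw vectors were overwritten with unit-normed ones.
print(model.wv.syn0norm is model.wv.syn0)    # True
```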
The `syn1` (or `syn1neg`, in the more common case of negative-sampling training) properties, when they exist on a full model (and not on a plain `KeyedVectors` object of only word-vectors), are the model neural network's internal 'hidden' weights leading to the output nodes. They're needed during model training, but aren't part of the typical word-vectors collected after training.
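For instance, again under the same gensim-3.x-style assumption (where a deprecated alias like `model.syn1neg` may still work; in some versions the array lives at `model.trainables.syn1neg`), with the default negative-sampling training:

```python
# The full model keeps the hidden-to-output weights used during training...
print(model.syn1neg.shape)        # same shape as syn0

# ...but a bare KeyedVectors object carries only the word-vectors.
kv = model.wv
print(hasattr(kv, "syn1neg"))     # False
```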
I believe the `syn` prefix is just a convention from neural-network variable-naming, likely derived from 'synapse'.