I am using gensim to load pre-trained fasttext model. I downloaded the English wikipedia trained model from fasttext website.
here is the code I wrote to load the pre-trained model:
from gensim.models import FastText as ft
model=ft.load_fasttext_format("wiki.en.bin")
I try to check if the following phrase exists in the vocal(which rare chance it would as these are pre-trained model).
print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)
False
True
So the phrase "internal executive" is not present in the vocabulary but we still have the word vector corresponding to that.
model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
-0.06304111, 0.07833624, -0.17026938, -0.21922196, 0.01146349,
-0.13639058, 0.17283678, -0.09251394, -0.17875175, 0.01339212,
-0.26683623, 0.05487974, -0.11843193, -0.01982722, 0.37037706,
-0.24370994, 0.14269598, -0.16363597, 0.00328478, -0.16560239,
-0.1450972 , -0.24787527, -0.01318423, 0.03277111, 0.16175713,
-0.19367714, 0.16955379, 0.1972683 , 0.09044111, 0.01731548,
-0.0034324 , -0.04834719, 0.14321515, 0.01422525, -0.08803893,
-0.29411593, -0.1033244 , 0.06278021, 0.16452256, 0.0650492 ,
0.1506474 , -0.14194389, 0.10778475, 0.16008648, -0.07853138,
0.2183501 , -0.25451994, -0.0345991 , -0.28843886, 0.19964759,
-0.10923116, 0.26665714, -0.02544454, 0.30637854, 0.04568949,
-0.04798719, -0.05769338, 0.25762403, -0.05158515, -0.04426906,
-0.19901046, 0.00894193, -0.17269588, -0.24747233, -0.19061406,
0.14322804, -0.10804397, 0.4002605 , 0.01409482, -0.04675362,
0.10039093, 0.07260711, -0.0938239 , -0.20434211, 0.05741301,
0.07592541, -0.02921724, 0.21137556, -0.23188967, -0.23164661,
-0.4569614 , 0.07434579, 0.10841205, -0.06514647, 0.01220404,
0.02679767, 0.11840229, 0.2247431 , -0.1946325 , -0.0990666 ,
-0.02524677, 0.0801085 , 0.02437297, 0.00674876, 0.02088535,
0.21464555, -0.16240154, 0.20670174, -0.21640894, 0.03900698,
0.21772243, 0.01954809, 0.04541844, 0.18990673, 0.11806394,
-0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
0.1967368 , 0.09804545, 0.1440366 , -0.08401451, -0.03715726,
0.27826542, -0.25195453, -0.16737154, 0.3561183 , -0.15756823,
0.06724873, -0.295487 , 0.28395334, -0.04908851, 0.09448399,
0.10877471, -0.05020981, -0.24595442, -0.02822314, 0.17862654,
0.06452435, -0.15105674, -0.31911567, 0.08166212, 0.2634299 ,
0.17043628, 0.10063848, 0.0687021 , -0.12210461, 0.10803893,
0.13644943, 0.10755012, -0.09816817, 0.11873955, -0.03881042,
0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
0.05387992, 0.0497043 , 0.06922272, -0.0089245 , 0.24790663,
0.27209425, -0.04925154, -0.08621719, 0.15918174, 0.25831223,
0.01654229, -0.03617229, -0.13490392, 0.08033483, 0.34922174,
-0.01744722, -0.16894792, -0.10506647, 0.21708378, -0.22582002,
0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
-0.06613475, -0.08779443, -0.10732629, 0.05967236, -0.02455976,
0.2229451 , -0.19476262, -0.2720119 , 0.03687386, -0.01220259,
0.07704347, -0.1674307 , 0.2400516 , 0.07338555, -0.2000631 ,
0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
0.41587186, 0.04643605, 0.03352945, -0.13700874, 0.16430037,
-0.13630766, -0.18546128, -0.04692861, 0.37308362, -0.30846512,
0.5535561 , -0.11573419, 0.2332801 , -0.07236694, -0.01018955,
0.05936847, 0.25877884, -0.2959846 , -0.13610311, 0.10905041,
-0.18220575, 0.06902339, -0.10624941, 0.33002165, -0.12087796,
0.06742091, 0.20762768, -0.34141317, 0.0884434 , 0.11247049,
0.14748637, 0.13261876, -0.07357208, -0.11968047, -0.22124515,
0.12290633, 0.16602683, 0.01055585, 0.04445777, -0.11142147,
0.00004863, 0.22543314, -0.14342701, -0.23209116, -0.00003538,
0.19272381, -0.13767233, 0.04850799, -0.281997 , 0.10343244,
0.16510887, 0.08671653, -0.24125539, 0.01201926, 0.0995285 ,
0.09807415, -0.06764816, -0.0206733 , 0.04697794, 0.02000999,
0.05817033, 0.10478792, 0.0974884 , -0.01756372, -0.2466861 ,
0.02877498, 0.02499748, -0.00370895, -0.04728201, 0.00107118,
-0.21848503, 0.2033032 , -0.00076264, 0.03828803, -0.2929495 ,
-0.18218371, 0.00628893, 0.20586628, 0.2410889 , 0.02364616,
-0.05220835, -0.07040054, -0.03744286, -0.06718048, 0.19264086,
-0.06490505, 0.27364203, 0.05527219, -0.27494466, 0.22256687,
0.10330909, -0.3076979 , 0.04852265, 0.07411488, 0.23980476,
0.1590279 , -0.26712465, 0.07580928, 0.05644221, -0.18824042],
Now my confusion is that Fastext creates vectors for character ngrams of a word too. So for a word "internal" it will create vectors for all its character ngrams including the full word and then the final word vector for the word is the sum of its character ngrams.
However, how it is still able to give me vector of a word or even the whole sentence? Isn't fastext vector is for a word and its ngram? So what are these vector I am seeing for the phrase when its clearly two words?