Relation between Word2Vec vector size and total number of words scanned?

What is the optimal vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion?

I am using Apache Spark MLlib 1.6.0 for word2vec.

Sample code:

import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {

  public static void main(String[] args) throws IOException {

    SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
    conf.setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(jsc);

    // Input data: each row is a bag of words from a sentence or document.
    JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
      RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
      RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
      RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
      RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
    ));
    StructType schema = new StructType(new StructField[]{
      new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
    });
    DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

    // Learn a mapping from words to vectors.
    Word2Vec word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3) // What is the optimal value to set here?
      .setMinCount(0);
    Word2VecModel model = word2Vec.fit(documentDF);
    DataFrame result = model.transform(documentDF);
    result.show(false);
    for (Row r : result.select("result").take(3)) {
      System.out.println(r);
    }

    jsc.stop();
  }
}
Eyeopener answered 4/10, 2017 at 8:58

There's no one answer: it will depend on your dataset and goals.

Common values for the dimensionality of word-vectors are 300-400, based on the values preferred in some of the original papers.

But, the best approach is to create some sort of project-specific quantitative quality score – are the word-vectors performing well in your intended application? – and then optimize the size like any other meta-parameter.
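
For example, here is a minimal sketch of that kind of search, written against the same Spark 1.6 ML API as the question. The candidate sizes and the scoreOnMyTask() method are hypothetical placeholders – the quality score has to come from your own application (an analogy set, a retrieval benchmark, downstream-task accuracy, etc.):

// Hypothetical sketch: try several vector sizes and keep the one that scores best
// on a project-specific evaluation. scoreOnMyTask() is a placeholder you would
// implement yourself; the candidate sizes are only illustrative.
int[] candidateSizes = {50, 100, 200, 300};
int bestSize = candidateSizes[0];
double bestScore = Double.NEGATIVE_INFINITY;
for (int size : candidateSizes) {
  Word2VecModel candidate = new Word2Vec()
    .setInputCol("text")
    .setOutputCol("result")
    .setVectorSize(size)
    .setMinCount(5)
    .fit(documentDF);
  double score = scoreOnMyTask(candidate); // placeholder: your own quality metric
  if (score > bestScore) {
    bestScore = score;
    bestSize = size;
  }
}
System.out.println("Best vector size on this task: " + bestSize);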

Separately, if you truly have 1 billion unique words – a 1-billion-word vocabulary – it will be hard to train those vectors in typical system environments. (That vocabulary is 333 times larger than Google's released 3-million-vector dataset.)

1 billion 300-dimensional word-vectors would require (1 billion * 300 float dimensions * 4 bytes/float =) 1.2TB of addressable memory (essentially, RAM) just to store the raw vectors during training. (The neural network will need another 1.2TB for output-weights during training, plus other supporting structures.)
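
As a sanity check, that arithmetic is easy to reproduce (4-byte floats assumed; the 1-billion-word vocabulary is of course the hypothetical figure from the question):

// Back-of-the-envelope memory estimate for the raw vectors during training.
long vocabSize = 1_000_000_000L;                 // hypothetical 1-billion-word vocabulary
int vectorSize = 300;
long inputBytes = vocabSize * vectorSize * 4L;   // 4 bytes per float dimension
System.out.printf("Input vectors alone: ~%.1f TB%n", inputBytes / 1e12);      // ~1.2 TB
System.out.printf("Plus output weights: ~%.1f TB%n", 2 * inputBytes / 1e12);  // ~2.4 TB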

Relatedly, words with very few occurrences can't get quality word-vectors from those few contexts, but still tend to interfere with the training of nearby words – so a minimum-count of 0 is never a good idea, and throwing away more lower-frequency words tends to speed training, lower memory-requirements, and improve the quality of the remaining words.
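
In the Spark 1.6 ML API from the question, that just means raising setMinCount above 0. A minimal sketch – the threshold of 2 is chosen only so the snippet still runs on the four toy sentences above; on a real corpus you would typically use 5 or more:

// Drop words that appear fewer than 2 times before training (illustrative threshold).
Word2Vec prunedWord2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(2);
Word2VecModel prunedModel = prunedWord2Vec.fit(documentDF);
// getVectors() returns one row per surviving vocabulary word, so its count shows
// how much the vocabulary shrank.
System.out.println("Vocabulary size after pruning: " + prunedModel.getVectors().count());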

Karate answered 4/10, 2017 at 20:20

According to research, the quality of vector representations improves as you increase the vector size until you reach about 300 dimensions; after that, the quality starts to decrease. You can find an analysis of different vector and vocabulary sizes here (see Table 2, where SG refers to the Skip-gram model, which is the model behind Word2Vec).

Your choice of vector size also depends on your computational power: even though 300 probably gives you the most reliable vectors, you may need to lower the size if your machine is too slow at computing them.
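
If compute is the constraint, a crude but practical check is to time a fit at a couple of candidate sizes on a sample of your corpus. A sketch against the question's code (the sizes are just examples):

// Rough timing comparison of the training cost at different vector sizes.
for (int size : new int[]{100, 300}) {
  long start = System.currentTimeMillis();
  new Word2Vec()
    .setInputCol("text")
    .setOutputCol("result")
    .setVectorSize(size)
    .setMinCount(0)
    .fit(documentDF);
  long elapsedMs = System.currentTimeMillis() - start;
  System.out.println("vectorSize=" + size + " trained in " + elapsedMs + " ms");
}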

Phyllida answered 4/10, 2017 at 15:19

Comment: The optimal size can depend a lot on the training data and end application, so it shouldn't be considered generally true that 300 is the best size. (If someone were using the same algorithm and training data, and a similar end application, as the linked 'GloVe' paper, then sure, 300 would likely be a good size. But then, they could just re-use that project's downloadable vectors.) – Karate
