What is the optimum vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion?
I am using Apache Spark Mllib 1.6.0 for word2vec.
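For context on why this matters at that scale, a rough back-of-the-envelope estimate (assuming single-precision 4-byte floats and one vector per vocabulary word): with a vector size of 300, the word-vector matrix alone would need about 1,000,000,000 words × 300 dimensions × 4 bytes ≈ 1.2 TB, so the vector size directly scales the memory cost at this vocabulary size.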
Sample code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // Input data: each row is a bag of words from a sentence or document.
        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
            RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        ));
        StructType schema = new StructType(new StructField[]{
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

        // Learn a mapping from words to vectors.
        Word2Vec word2Vec = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(3) // What is the optimum value to set here?
            .setMinCount(0);
        Word2VecModel model = word2Vec.fit(documentDF);

        // Append a "result" column holding the averaged word vectors per row.
        DataFrame result = model.transform(documentDF);
        result.show(false);
        for (Row r : result.select("result").take(3)) {
            System.out.println(r);
        }
        jsc.stop();
    }
}
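Since there is no single correct value, one option is to tune the size empirically. Below is a minimal sketch reusing documentDF from the snippet above; the candidate sizes and the probe word "Spark" are illustrative assumptions, not recommendations:

    // Hypothetical tuning loop: fit a model per candidate size and inspect
    // the nearest neighbours of a probe word to judge embedding quality.
    for (int size : new int[]{50, 100, 200, 300}) {
        Word2VecModel m = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(size)   // candidate size under test
            .setMinCount(0)
            .fit(documentDF);
        System.out.println("vectorSize = " + size);
        // findSynonyms returns a DataFrame of (word, similarity) rows in MLlib 1.6
        m.findSynonyms("Spark", 2).show(false);
    }

On a real corpus you would replace the eyeballing with a measurable downstream criterion, such as accuracy on your actual end task.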
The best size can depend a lot on the training data & end application, so it shouldn't be considered generally true that 300 is the best size. (If someone were using the same algorithm and training data, and a similar end application, as those used in the 'GloVe' paper linked, then sure, 300 would likely be a good size. But then, they could just reuse that project's downloadable vectors.) – Karate