Predicting next word with text2vec in R

Asked 21/4, 2016 at 21:6 Answered 11/8, 2017 at 23:5

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple n-gram model with Kneser-Ney smoothing. It predicts the next word by finding the n-gram with the maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order n-grams, which can be advantageous in cases where higher-order n-grams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
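To make the failure case concrete, here is a minimal sketch in Python of a count-based n-gram predictor with simple back-off. It deliberately leaves out Kneser-Ney smoothing, and the three-sentence corpus and the function names `train_ngrams` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_ngrams(sentences, max_n=3):
    """Count next-word frequencies for every context of length 0..max_n-1."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break
                context = tuple(tokens[i - k:i])
                counts[context][word] += 1
    return counts

def predict_next(counts, text, max_n=3):
    """Back off from the longest context to shorter ones until one was seen."""
    tokens = text.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

# Toy corpus (invented for this example).
corpus = [
    "it is warm and sunny outside let's go to the beach",
    "it is cold and raining outside let's go to the cinema",
    "let's go to the beach",
]
counts = train_ngrams(corpus, max_n=3)

warm = predict_next(counts, "it is warm and sunny outside let's go to the")
cold = predict_next(counts, "it is cold and raining outside let's go to the")
print(warm, cold)
```

Both calls look only at the last two words ("to the"), so the weather words earlier in the sentence cannot influence the result.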

I am looking into more advanced methods and I found the text2vec package, which maps words into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next-word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool to use for next-word prediction and, if so, what prediction algorithm is suitable for this task.
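A rough illustration of why close vectors might help, sketched in Python: average the vectors of the context words and rank candidate next words by cosine similarity. The 3-dimensional vectors below are invented for the example (real embeddings would come from a trained model such as GloVe), and this heuristic is only one possible way to use them, not a text2vec API or a trained language model.

```python
import math

# Toy vectors, invented for illustration only.
vectors = {
    "sunny":   [0.90, 0.10, 0.00],
    "warm":    [0.80, 0.20, 0.10],
    "raining": [0.10, 0.90, 0.00],
    "cold":    [0.20, 0.80, 0.10],
    "beach":   [0.85, 0.10, 0.20],
    "cinema":  [0.15, 0.80, 0.30],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def context_vector(words):
    """Average the vectors of the context words that have an embedding."""
    known = [vectors[w] for w in words if w in vectors]
    return [sum(column) / len(known) for column in zip(*known)]

def rank_candidates(context_words, candidates):
    """Sort candidate next words by similarity to the averaged context."""
    ctx = context_vector(context_words)
    return sorted(candidates, key=lambda w: cosine(ctx, vectors[w]), reverse=True)

warm_rank = rank_candidates(["warm", "sunny"], ["beach", "cinema"])
cold_rank = rank_candidates(["cold", "raining"], ["beach", "cinema"])
print(warm_rank[0], cold_rank[0])
```

With these toy vectors the weather words pull the ranking in different directions, which is the kind of context an n-gram over the last few words misses.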

Walker answered 21/4, 2016 at 21:6 Comment(0)

You can try char-rnn or word-rnn (a quick search will turn up both). For a character-level model in R, take a look at the mxnet examples. It is probably possible to extend that code to a word-level model using text2vec GloVe embeddings.
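The difference between the character-level and word-level setups is mostly in how the training pairs are built. A small Python sketch, where the helper name `make_training_pairs` and the window size are assumptions made for this example rather than anything taken from char-rnn or mxnet:

```python
def make_training_pairs(tokens, context_size):
    """Slide a window over the tokens: (previous tokens) -> next token."""
    pairs = []
    for i in range(context_size, len(tokens)):
        pairs.append((tuple(tokens[i - context_size:i]), tokens[i]))
    return pairs

text = "let's go to the beach"

# Character-level: every character (including spaces) is a token.
char_pairs = make_training_pairs(list(text), context_size=4)

# Word-level: every whitespace-separated word is a token; in a word-level
# model each word would then be replaced by its embedding vector.
word_pairs = make_training_pairs(text.split(), context_size=2)

print(char_pairs[0])
print(word_pairs)
```

A recurrent model consumes the same pairs but carries a hidden state across the whole sequence instead of a fixed window.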

If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to build such a model/experiment myself, but still haven't had time for it.

Separatrix answered 27/4, 2016 at 11:21 Comment(0)

There is one implemented solution, a complete example using word embeddings. The paper by Makarenkov et al. (2017), Language Models with Pre-Trained (GloVe) Word Embeddings, presents a step-by-step implementation of training a language model using a recurrent neural network (RNN) and pre-trained GloVe word embeddings.

In the paper the authors provide the instructions to run the code:

1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
   (a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
   (b) run_experiment(tokenized_file_name)
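To give a feel for what turning a text into vectors involves, here is a hedged Python sketch. It is not the paper's tokenize_file_to_vectors; the helper names `load_glove_lines` and `tokens_to_vectors`, the zero vector for unknown words, and the two sample vectors are assumptions for illustration. It relies only on the plain-text GloVe format, where each line holds a word followed by its space-separated vector components.

```python
import io

def load_glove_lines(lines):
    """Parse GloVe-style text lines ("word v1 v2 ...") into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def tokens_to_vectors(text, vectors):
    """Map each lowercased token to its vector; unknown words get zeros."""
    dim = len(next(iter(vectors.values())))
    unknown = [0.0] * dim  # assumption: zero vector for out-of-vocabulary words
    return [vectors.get(tok, unknown) for tok in text.lower().split()]

# In-memory stand-in for a downloaded GloVe file (values are made up).
sample = io.StringIO("the 0.1 0.2 0.3\nbeach 0.4 0.5 0.6\n")
glove = load_glove_lines(sample)

sequence = tokens_to_vectors("The beach today", glove)
print(sequence)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`.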

The code in Python is here https://github.com/vicmak/ProofSeer.

I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful for addressing the problem from the R perspective. (It would be great if he could comment further.)

Hundredfold answered 11/8, 2017 at 23:5 Comment(0)

Your intuition is right that word embedding vectors can be used to improve language models by incorporating long-distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/
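The core idea of a recurrent language model can be sketched in a few lines of Python. This is an untrained, Elman-style forward pass with random weights, not the RNNLM toolkit itself; the class name `TinyRNNLM`, the dimensions, and the random embedding table (a stand-in for pre-trained vectors) are all assumptions for illustration. The point is only that the hidden state carries information from every earlier word into the next-word distribution.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyRNNLM:
    """Untrained recurrent language model: forward pass only."""

    def __init__(self, vocab, embed_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.index = {w: i for i, w in enumerate(vocab)}
        self.hidden_dim = hidden_dim
        self.embed = random_matrix(len(vocab), embed_dim, rng)  # stand-in for GloVe
        self.w_xh = random_matrix(hidden_dim, embed_dim, rng)   # input -> hidden
        self.w_hh = random_matrix(hidden_dim, hidden_dim, rng)  # hidden -> hidden
        self.w_hy = random_matrix(len(vocab), hidden_dim, rng)  # hidden -> output

    def next_word_distribution(self, words):
        h = [0.0] * self.hidden_dim
        for word in words:
            x = self.embed[self.index[word]]
            a = matvec(self.w_xh, x)
            b = matvec(self.w_hh, h)
            # The new state mixes the current word with everything seen so far.
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        return softmax(matvec(self.w_hy, h))

vocab = ["warm", "cold", "go", "to", "the", "beach", "cinema"]
model = TinyRNNLM(vocab)

p_warm = model.next_word_distribution(["warm", "go", "to", "the"])
p_cold = model.next_word_distribution(["cold", "go", "to", "the"])
print(max(abs(a - b) for a, b in zip(p_warm, p_cold)))
```

The two contexts share their last three words, yet the distributions differ because the first word still influences the hidden state. Training (for example, backpropagation through time) would be needed before the probabilities mean anything.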

Kept answered 21/4, 2016 at 22:29 Comment(2)

Do you know if there is an R implementation of RNNLM? – Walker 22/4, 2016 at 2:35

Probably not would be my guess. – Kept 22/4, 2016 at 23:50
