Tensorflow vocabularyprocessor
I am following the WildML blog post on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in this statement:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

Also, how can I extract the vocabulary from the vocab_processor?

Fillin answered 17/11, 2016 at 17:45 Comment(1)
I am trying to follow the same tutorial but there are a few things which I don't understand. Maybe you can take a look at my question and help me out? – Suffuse
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.

import operator
import numpy as np
from tensorflow.contrib import learn

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))

## Extract the word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping

## Sort the vocabulary dictionary on the basis of values (ids).
## Both statements perform the same task.
# sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

## Treat the ids as indices into a list and build a list of words in ascending order of id:
## the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])

print(vocabulary)
print(x)
Fillin answered 22/11, 2016 at 12:17 Comment(1)
If you see the vocab_dict, you can see that "This" is indexed as 1, "is" as 2 and so on. I would like to pass my own index. For example, frequency based. Do you know how to do this? – Dasi
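Regarding the comment above: VocabularyProcessor assigns ids in order of first appearance, but a frequency-based mapping can be built by hand. Here is a minimal pure-Python sketch (the `frequency_vocab` helper is not part of any TensorFlow API, just an illustration) that counts tokens across all documents and assigns ids in descending-frequency order, reserving id 0 for padding:

```python
from collections import Counter

def frequency_vocab(docs):
    # Count tokens across all documents and assign ids in
    # descending-frequency order; id 0 is reserved for padding.
    counts = Counter(tok for d in docs for tok in d.split())
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
vocab = frequency_vocab(x_text)
print(vocab['This'])  # 1 -- appears 3 times
print(vocab['a'])     # 2 -- also 3 times, but inserted later
print(vocab['is'])    # 3 -- appears 2 times
```

The resulting dictionary can then be used in place of `vocab_processor.vocabulary_._mapping` when converting documents to id vectors.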
not able to understand the purpose of max_document_length

The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.

Your input data records may not (or probably won't) all be the same length. For example, if you're working with sentences for sentiment analysis, they'll be of various lengths.

You provide this parameter to the VocabularyProcessor so that it can adjust the length of the output vectors. According to the documentation:

max_document_length: Maximum length of documents. if documents are longer, they will be trimmed, if shorter - padded.

Check out the source code.

  def transform(self, raw_documents):
    """Transform documents to word-id matrix.
    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.
    Args:
      raw_documents: An iterable which yield either str or unicode.
    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
      word_ids = np.zeros(self.max_document_length, np.int64)
      for idx, token in enumerate(tokens):
        if idx >= self.max_document_length:
          break
        word_ids[idx] = self.vocabulary_.get(token)
      yield word_ids

Note the line word_ids = np.zeros(self.max_document_length, np.int64).

Each document in raw_documents will be mapped to a vector of length max_document_length.
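The pad/trim behavior can be sketched without TensorFlow at all. The following is a minimal re-implementation of the transform loop above (the `transform` function and the toy `vocab` dictionary are illustrative, not part of the real API), showing a long document being trimmed and a short one being zero-padded:

```python
import numpy as np

def transform(docs, vocab, max_document_length):
    # Mimic VocabularyProcessor.transform: map tokens to ids,
    # trimming long documents and padding short ones with 0.
    for tokens in (d.split() for d in docs):
        word_ids = np.zeros(max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= max_document_length:
                break
            word_ids[idx] = vocab.get(token, 0)
        yield word_ids

vocab = {'this': 1, 'is': 2, 'a': 3, 'cat': 4, 'very': 5, 'small': 6}
rows = list(transform(['this is a very small cat', 'a cat'], vocab, 4))
print(rows[0])  # [1 2 3 5] -> six tokens trimmed to four ids
print(rows[1])  # [3 4 0 0] -> two tokens padded with zeros
```

Both output rows have length 4 regardless of the input length, which is exactly why the constructor needs max_document_length up front.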

Treillage answered 28/12, 2017 at 19:54 Comment(0)