not able to understand the purpose of max_document_length
The VocabularyProcessor
maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor
so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
"""Transform documents to word-id matrix.
Convert words to ids with vocabulary fitted with fit or the one
provided in the constructor.
Args:
raw_documents: An iterable which yield either str or unicode.
Yields:
x: iterable, [n_samples, max_document_length]. Word-id matrix.
"""
for tokens in self._tokenizer(raw_documents):
word_ids = np.zeros(self.max_document_length, np.int64)
for idx, token in enumerate(tokens):
if idx >= self.max_document_length:
break
word_ids[idx] = self.vocabulary_.get(token)
yield word_ids
Note the line word_ids = np.zeros(self.max_document_length)
.
Each row in raw_documents
variable will be mapped to a vector of length max_document_length
.