I am researching embedding inputs for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification example, dennybritz used the function learn.preprocessing.VocabularyProcessor. The documentation says it "maps documents to sequences of word ids". I am not quite sure how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids and only return the ids when the function is run?
Vocabulary Processor function
Let's say that you have just two documents, "I like pizza" and "I like pasta". Your whole vocabulary consists of these words: (I, like, pizza, pasta). Every word in the vocabulary has an associated index, like so: (1, 2, 3, 4). Now a document like "I like pasta" can be converted into the vector [1, 2, 4]. This is what learn.preprocessing.VocabularyProcessor does. The parameter max_document_length makes sure that every document is represented by a vector of length max_document_length, either by padding it with zeros if the document is shorter than max_document_length or by truncating it if the document is longer.
Hope this helps.
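To make the mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the TensorFlow implementation: the class name SimpleVocabularyProcessor and its fit/transform methods are hypothetical, tokenization is a plain whitespace split, and the sketch assumes id 0 is reserved for both padding and unknown words. It does show that the word-to-id dictionary is built once from the training documents and that only the ids are returned afterwards.

```python
class SimpleVocabularyProcessor:
    """Hypothetical stand-in for illustration; not the TensorFlow class."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # Assumption: id 0 is reserved for padding and unknown words,
        # so real words are numbered from 1.
        self.word_to_id = {}

    def fit(self, documents):
        # Build the dictionary: each new word gets the next free id.
        for doc in documents:
            for word in doc.split():
                if word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            # Look up each word; words not seen during fit map to 0.
            ids = [self.word_to_id.get(word, 0) for word in doc.split()]
            # Truncate long documents, then pad short ones with zeros.
            ids = ids[: self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            rows.append(ids)
        return rows


docs = ["I like pizza", "I like pasta"]
processor = SimpleVocabularyProcessor(max_document_length=4).fit(docs)
encoded = processor.transform(["I like pasta", "I really like pizza and pasta"])
print(processor.word_to_id)  # {'I': 1, 'like': 2, 'pizza': 3, 'pasta': 4}
print(encoded)               # [[1, 2, 4, 0], [1, 0, 2, 3]]
```

With max_document_length=4, the three-word document is padded with one zero, while the six-word document is cut to its first four tokens, two of which ("really" and, had it survived truncation, "and") are unknown and map to 0.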
Thanks Kashyap, so it only encodes the document as a vector of ids. Does this have a name in Natural Language Processing? –
Laodicean
@Laodicean As far as I know there is none... This is one of the pre-processing steps that is done in most of the natural language processing systems. –
Culm
max_document_length should be the number of distinct words –
Polak