Vocabulary Processor function

I am researching input embeddings for convolutional neural networks, and I understand Word2vec. However, in the CNN text classification code, dennybritz uses the function learn.preprocessing.VocabularyProcessor. The documentation says it "maps documents to sequences of word ids." I am not quite sure how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids, so that running the function only returns the ids?

Laodicean answered 3/10, 2016 at 5:24 Comment(0)

Let's say that you have just two documents: "I like pizza" and "I like pasta". Your whole vocabulary consists of the words (I, like, pizza, pasta). Every word in the vocabulary has an associated index, like so: (1, 2, 3, 4). Given a document like "I like pasta", it can now be converted into the vector [1, 2, 4]. This is what learn.preprocessing.VocabularyProcessor does. In other words, it builds a dictionary from words to ids when it is fitted, and when you transform a document you only get the ids back. The parameter max_document_length makes sure that every document is represented by a vector of length max_document_length: documents shorter than max_document_length are padded (with zeros, as far as I know), and documents longer than max_document_length are clipped. Hope this helps.
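To make the mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the TensorFlow implementation: the class name SimpleVocabularyProcessor, the lowercasing, and the regex tokenizer are assumptions made for illustration, and reserving id 0 for both padding and unknown words reflects my understanding of the library's behaviour rather than something stated above.

```python
import re


class SimpleVocabularyProcessor:
    """Hypothetical stand-in that illustrates the idea; not the TensorFlow class."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        self.word_to_id = {}  # id 0 is kept free for padding and unknown words

    def _tokenize(self, document):
        # Assumption: lowercase and split on word characters.
        return re.findall(r"\w+", document.lower())

    def fit(self, documents):
        # Assign the next free id (starting at 1) to every word not seen before.
        for document in documents:
            for word in self._tokenize(document):
                if word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        rows = []
        for document in documents:
            # Words missing from the vocabulary map to 0.
            ids = [self.word_to_id.get(w, 0) for w in self._tokenize(document)]
            ids = ids[: self.max_document_length]               # clip long documents
            ids += [0] * (self.max_document_length - len(ids))  # pad short documents
            rows.append(ids)
        return rows


processor = SimpleVocabularyProcessor(max_document_length=4)
processor.fit(["I like pizza", "I like pasta"])
encoded = processor.transform(["I like pasta", "I really like pizza and pasta"])

print(processor.word_to_id)  # {'i': 1, 'like': 2, 'pizza': 3, 'pasta': 4}
print(encoded)               # [[1, 2, 4, 0], [1, 0, 2, 3]]
```

The first document is shorter than max_document_length, so it is padded with one 0. The second is longer and contains words that were never seen during fitting, so those words become 0 and the sequence is cut off after four ids.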

Culm answered 3/10, 2016 at 7:20 Comment(3)
Thanks Kashyap, so it only encodes the document into a vector space. Does this have a name in natural language processing? - Laodicean
@Laodicean As far as I know, there is none. It is one of the pre-processing steps done in most natural language processing systems. - Culm
max_document_length should be the number of distinct words. - Polak
