Vocabulary Processor function


Asked 3/10, 2016 at 5:24 Answered 3/10, 2016 at 7:20

Solved python tensorflow text-classification

I am researching input embeddings for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification example, dennybritz uses the function learn.preprocessing.VocabularyProcessor. The documentation says it "maps documents to sequences of word ids". I am not quite sure how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids and return only the ids when the function is run?

Laodicean asked 3/10, 2016 at 5:24 Comment(0)

Let's say that you have just two documents: "I like pizza" and "I like Pasta". Your whole vocabulary consists of the words (I, like, pizza, pasta). Every word in the vocabulary has an index associated with it, like so: (1, 2, 3, 4). Now a document like "I like pasta" can be converted into the vector [1, 2, 4]. This is what learn.preprocessing.VocabularyProcessor does. The parameter max_document_length makes sure that all documents are represented by a vector of length max_document_length, either by padding them if they are shorter than max_document_length or by clipping them if they are longer. Hope this helps.
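To make the mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the library's actual implementation: the class name SimpleVocabularyProcessor, the lowercasing, the regex tokenizer, and the use of id 0 for both padding and unknown words are all assumptions made for illustration.

```python
import re


class SimpleVocabularyProcessor:
    """Hypothetical stand-in that illustrates the word-to-id mapping."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # word -> id; id 0 is kept free for padding and unknown words (assumption)
        self.vocabulary = {}

    def _tokenize(self, document):
        # Assumption: lowercase the text and split on runs of word characters
        return re.findall(r"\w+", document.lower())

    def fit(self, documents):
        # Assign the next free id to each word the first time it is seen
        for document in documents:
            for word in self._tokenize(document):
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary) + 1
        return self

    def transform(self, documents):
        vectors = []
        for document in documents:
            ids = [self.vocabulary.get(word, 0) for word in self._tokenize(document)]
            ids = ids[: self.max_document_length]  # clip long documents
            ids += [0] * (self.max_document_length - len(ids))  # pad short ones
            vectors.append(ids)
        return vectors


processor = SimpleVocabularyProcessor(max_document_length=4)
processor.fit(["I like pizza", "I like Pasta"])
vectors = processor.transform(["I like pasta", "I like pizza and pasta too"])
print(processor.vocabulary)
print(vectors)
```

So, in terms of the question: the sketch keeps a dictionary of words and their ids, and transforming a document only returns the ids. With max_document_length set to 4, the three-word document is padded with one trailing 0 and the six-word document is cut off after its fourth word.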

Culm answered 3/10, 2016 at 7:20 Comment(3)

Thanks Kashyap, so it only encodes the document into a vector space. Does this have a name in natural language processing? – Laodicean 3/10, 2016 at 8:41

@Laodicean As far as I know there is none. This is one of the pre-processing steps done in most natural language processing systems. – Culm 3/10, 2016 at 14:38

max_document_length should be the number of distinct words – Polak 19/5, 2017 at 17:33
