How to implement word embeddings for the Persian language

I have this code that works for English but does not work for Persian:

from gensim.models import Word2Vec as wv

tokenized = []
for sentence in sentences:
    tokens = sentence.strip().lower().split(" ")
    tokenized.append(tokens)

model = wv(tokenized, size=5, min_count=1)
print('done2')
model.save('F:/text8/text8-phrases1')
print('done3')
print(model)
model = wv.load('F:/text8/text8-phrases1')

print(model.wv.vocab)

Output:

> 'بر': <gensim.models.keyedvectors.Vocab object at 0x0000027716EEB0B8>,
> 'اساس': <gensim.models.keyedvectors.Vocab object at
> 0x0000027716EEB160>, 'قوانين': <gensim.models.keyedvectors.Vocab
> object at 0x0000027716EEB198>, 'دانشگاه':
> <gensim.models.keyedvectors.Vocab object at 0x0000027716EEB1D0>,
> 'اصفهان،': <gensim.models.keyedvectors.Vocab object at
> 0x0000027716EEB208>, 'نويسنده': <gensim.models.keyedvectors.Vocab
> object at 0x0000027716EEB240>, 'مسؤول':
> <gensim.models.keyedvectors.Vocab object at 0x0000027716EEB278>,
> 'مقاله': <gensim.models.keyedvectors.Vocab object at
> 0x0000027716EEB2B0>, 'بايد'

Please give an example with code. Thanks.

Ola answered 23/7, 2018 at 19:57 Comment(2)
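
One thing visible in the output above: tokens such as 'اصفهان،' still carry the Persian comma, because a bare whitespace split leaves punctuation attached to words (and .lower() is a no-op for Persian script), so training runs but the vocabulary is noisy. Below is a minimal punctuation-aware tokenizer, sketched with only the standard library; the punctuation set is an assumed starting point, not exhaustive, and 'sentences' is the same raw-sentence list as in the question.

# Assumed punctuation set: Persian marks plus common ASCII punctuation;
# extend as needed for your corpus
PUNCT = '،؛؟«»!?.,:;()[]"'

def tokenize_fa(sentence):
    # Whitespace split, then strip punctuation from the edges of each token
    tokens = sentence.strip().split()
    tokens = [t.strip(PUNCT) for t in tokens]
    return [t for t in tokens if t]

tokenized = [tokenize_fa(s) for s in sentences]
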
Can you post sample text from the file text8-phrases1? It could be a problem if your input doesn't use spaces, or if there's not enough text. – Holley
As long as you have a clear separator between words, such as a space, it should work just as well as English. You can also look through the nltk documentation; there is a section on phrase recognition that automatically detects fixed multi-word expressions (such as "New York Times") in a text. This could work here as well if Persian has phrases of two or more words that count as a single word; see the sketch below. – Pruitt
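
Building on that phrase-recognition suggestion: gensim itself ships a Phrases model that learns frequent multi-word expressions from a tokenized corpus and joins them into single tokens; it is script-agnostic, so it should apply to Persian as well. A minimal sketch against the gensim 3.x API, with illustrative min_count and threshold values:

from gensim.models.phrases import Phrases, Phraser

# Learn frequent bigrams from the tokenized sentences
phrases = Phrases(tokenized, min_count=5, threshold=10)
bigram = Phraser(phrases)

# Re-tokenize so detected phrases become single '_'-joined tokens
tokenized_phrases = [bigram[sent] for sent in tokenized]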

@AminST, I know it's too late to answer your question, but there might be other people with the same problem, so I'll put some useful code here. I used the code below on Digikala comments. I assume you have already done your preprocessing (removing stopwords, HTML, emojis, and so on) and the data is ready for vectorizing.

from hazm import word_tokenize
import pandas as pd

from gensim.models.word2vec import Word2Vec


# Read the dataset
df = pd.read_csv('data/cleaned/data.csv')
df.title = df.title.apply(str)
df.comment = df.comment.apply(str)

# Store comments in a list
comments = df.comment.tolist()

# Convert each comment to a list of words
sents = [word_tokenize(comment) for comment in comments]

# Train Word2Vec (gensim 3.x API; gensim 4.x renamed size= to vector_size=)
model = Word2Vec(sentences=sents, size=64, window=10, min_count=5, seed=42, workers=5)

model.save('digikala_words.w2v')

# Check the vector for a word (index model.wv; indexing the model
# directly is deprecated)
model.wv['دیجیکالا']
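
To reload the saved model later and query it, you can use the same gensim 3.x API. A short usage sketch; the word below is just an example and must exist in the trained vocabulary:

from gensim.models.word2vec import Word2Vec

# Load the trained model back from disk
model = Word2Vec.load('digikala_words.w2v')

# Nearest neighbours of a word by cosine similarity
print(model.wv.most_similar('دیجیکالا', topn=5))
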

I really hope this helps, my friend. And if you are interested in more detail, please see this link: digikala comment verification

Sensation answered 29/11, 2019 at 9:25 Comment(0)
