I'm trying to get the text with its punctuation as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieve only the text. After searching the web I found these pages:
- Page from gensim github issues section. It was a question by someone where the answer was to subclass WikiCorpus (answered by Piskvorky). Luckily, in the same page, there was a code representing the suggested 'subclass' solution. The code was provided by Rhazegh. (link)
- Page from stackoverflow with a title: "Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus". However, no clear answer was provided and was treated in the context of spaCy. (link)
I decided to use the code provided in page 1. My current code (mywikicorpus.py):
import sys
import os
sys.path.append('C:\\Users\\Ghaliamus\\Anaconda2\\envs\\wiki\\Lib\\site-packages\\gensim\\corpora\\')
from wikicorpus import *
def tokenize(content):
# override original method in wikicorpus.py
return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
if len(token) <= 15 and not token.startswith('_')]
def process_article(args):
# override original method in wikicorpus.py
text, lemmatize, title, pageid = args
text = filter_wiki(text)
if lemmatize:
result = utils.lemmatize(text)
else:
result = tokenize(text)
return result, title, pageid
class MyWikiCorpus(WikiCorpus):
def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)
def get_texts(self):
articles, articles_all = 0, 0
positions, positions_all = 0, 0
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes)
for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
for tokens, title, pageid in pool.imap(process_article, group): # chunksize=10):
articles_all += 1
positions_all += len(tokens)
if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
continue
articles += 1
positions += len(tokens)
if self.metadata:
yield (tokens, (pageid, title))
else:
yield tokens
pool.terminate()
logger.info(
"finished iterating over Wikipedia corpus of %i documents with %i positions"
" (total %i articles, %i positions before pruning articles shorter than %i words)",
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
self.length = articles # cache corpus length
And then, I used another code by Pan Yang (link). This code initiates WikiCorpus object and retrieve the text. The only change in my current code is initiating MyWikiCorpus instead of WikiCorpus. The code (process_wiki.py):
from __future__ import print_function
import logging
import os.path
import six
import sys
import mywikicorpus as myModule
if __name__ == '__main__':
program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)
logger.info("running %s" % ' '.join(sys.argv))
# check and process input arguments
if len(sys.argv) != 3:
print("Using: python process_wiki.py enwiki-20180601-pages- articles.xml.bz2 wiki.en.text")
sys.exit(1)
inp, outp = sys.argv[1:3]
space = " "
i = 0
output = open(outp, 'w')
wiki = myModule.MyWikiCorpus(inp, lemmatize=False, dictionary={})
for text in wiki.get_texts():
if six.PY3:
output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
else:
output.write(space.join(text) + "\n")
i = i + 1
if (i % 10000 == 0):
logger.info("Saved " + str(i) + " articles")
output.close()
logger.info("Finished Saved " + str(i) + " articles")
Through command line I ran the process_wiki.py code. I got text of the corpus with the last line in the command prompt:
(2018-06-05 09:18:16,480: INFO: Finished Saved 4526191 articles)
When I read the file in python, I checked the first article and it was without punctuation. Example:
(anarchism is a political philosophy that advocates self governed societies based on voluntary institutions these are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical or free associations anarchism holds the state to be undesirable unnecessary and harmful while opposition to the state is central anarchism specifically entails opposing authority or hierarchical)
My two relevant questions, and I wish you can help me with them, please:
- is there any thing wrong in my reported pipeline above?
- regardless such pipeline, if I opened the gensim wikicorpus python code (wikicorpus.py) and wanted to edit it, what is the line that I should add it or remove it or update it (with what if possible) to get the same results but with punctuation?
Many thanks for your time reading this long post.
Best wishes,
Ghaliamus