I want to scrape a Hindi (Indian language) PDF file with Python

I have written Python code that scrapes all the text from a PDF file. The problem is that once the text is extracted, the Hindi words are no longer spelled correctly: some characters come out wrong. How can I fix this? My code is below.

# Python 3, using pdfminer.six
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text in memory
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos = set()  # an empty set means "all pages"

    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()

    # Release resources only after the text has been read out
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("S24A276P001.pdf"))

Here is a screenshot of the PDF: [PDF screenshot]

Patsy answered 14/3, 2016 at 18:50 Comment(12)
If you copy and paste from a PDF viewer, do you see the correct text? - Fissirostral
No, the text is not correct after I copy and paste it. - Patsy
In some PDF files, for some languages, the text is not stored as is; I have seen this before. That means you need to write a custom decoder for it. Without knowledge of the language, there is not much I can do here. - Fissirostral
How do I write a general custom decoder? If you can help me with that, maybe I can figure out a way. - Patsy
To write a decoder, one needs to understand the language and its grammar, and I don't speak it. Maybe you can post sets of correct and incorrect texts, but there is a good chance that I won't have a clue. - Fissirostral
नाम gets changed into नपम, and राम changes to रपम. - Patsy
Only one character changes: u"\u0928\u093e\u092e" becomes u"\u0928\u092a\u092e", that is, \u093e changes to \u092a. So changing \u092a to \u093e, or \u092a\u092e to \u093e\u092e, might solve that particular case. - Fissirostral
Can you please elaborate? I didn't get you. And what would be the logic behind the decoder? - Patsy
Let us continue this discussion in chat. - Patsy
Decoding is converting one set of characters to another. In this case it is just a replace function, e.g. text = text.replace(u"\u092a\u092e", u"\u093e\u092e"). - Fissirostral
@YOU: What if I have dynamic Hindi (Indian) words in the PDF? What is the best way to extract them? Even if we copy the text from the PDF and paste it into another document, it has the same problem. Can you please guide me on this? Thanks. - Caballero
@AbhinavMishra do you have code to convert Hindi text? - Misspeak
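The replace-based "decoder" suggested in the comments can be sketched as a small lookup table applied to the extracted text. This is only an illustration: the names `FIXES` and `apply_fixes` are hypothetical, and the single mapping comes from the one example given in the thread (नपम should be नाम). A real PDF would likely need many more entries, and a blanket replacement like this will also alter any word that legitimately contains the sequence being replaced.

```python
# Hypothetical table of mis-extracted sequences and their intended forms.
# Only this one pair is taken from the thread; extend it per document.
FIXES = {
    "\u092a\u092e": "\u093e\u092e",  # PA + MA  ->  vowel sign AA + MA
}


def apply_fixes(text, fixes=FIXES):
    """Return text with every known wrong sequence replaced."""
    # Replace longer sequences first so a shorter key cannot
    # break up a longer one before it gets a chance to match.
    for wrong in sorted(fixes, key=len, reverse=True):
        text = text.replace(wrong, fixes[wrong])
    return text
```

For instance, `apply_fixes(convert_pdf_to_txt("S24A276P001.pdf"))` would post-process the output of the function from the question.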

The best way to solve the problem is to use the textract module for Python, load the Hindi test data from its GitHub repository, and write the extracted text to a .txt file. This solved my problem.

Patsy answered 21/3, 2016 at 20:9 Comment(1)
Can you please elaborate on the solution? A simple example would help us. Thanks. - Caballero
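A minimal sketch of what the textract approach might look like follows. The answer does not show its code, so several things here are assumptions: that OCR through textract's `tesseract` method was used, that Tesseract and its Hindi language data (language code `hin`) are installed on the machine, and the helper names `save_text` and `extract_hindi_pdf`, which are made up for illustration.

```python
import io


def save_text(raw_bytes, out_path):
    """Decode UTF-8 bytes and write them to a text file."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(raw_bytes.decode("utf-8"))


def extract_hindi_pdf(pdf_path, out_path):
    """OCR a PDF with textract and save the result as UTF-8 text."""
    # Imported here so the helper above works without textract installed.
    import textract  # third-party: pip install textract

    # Assumption: method="tesseract" forces OCR of the rendered pages,
    # and language="hin" selects Tesseract's Hindi language data.
    raw = textract.process(pdf_path, method="tesseract", language="hin")
    save_text(raw, out_path)
```

Calling `extract_hindi_pdf("S24A276P001.pdf", "output.txt")` would then write the recognized text to `output.txt`. Because OCR reads the rendered page image rather than the character codes stored in the PDF, it can sidestep the font-mapping problem discussed in the question, at the cost of possible recognition errors.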
