I want to scrape a Hindi (Indian language) PDF file with Python - McMap

I want to scrape a Hindi (Indian language) PDF file with Python

Asked 14/3, 2016 at 18:50 Answered 21/3, 2016 at 20:9

Solved python pdf ocr pdfminer pdf-scraping

I have written Python code that scrapes all the text from a PDF file. The problem is that once the text is scraped, the words are no longer spelled correctly. How can I fix this? My code is below.

# Ported to Python 3 (pdfminer.six); the original used cStringIO and a print statement.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "all pages"

        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    # Indentation fixed: cleanup and return now sit at function-body level.
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("S24A276P001.pdf"))

Here is a screenshot of the PDF.

Patsy answered 14/3, 2016 at 18:50 Comment(12)

If you copy and paste from a PDF viewer, do you see the correct text? – Fissirostral 15/3, 2016 at 1:58

No, the text is not correct after I copy and paste it. – Patsy 15/3, 2016 at 1:59

Sometimes text is not stored as-is in PDF files for some languages, as I have seen before. That means you need to write a custom decoder for it. Without knowledge of the language, there is not much I can do here. – Fissirostral 15/3, 2016 at 2:9

How do I write a general custom decoder? If you can help me with that, maybe I can figure out a way. – Patsy 15/3, 2016 at 2:12

To write a decoder, one needs to understand the language and grammar, which I don't speak. Maybe you can post sets of correct and incorrect texts, but there is a good chance that I won't have a clue. – Fissirostral 15/3, 2016 at 2:19

नाम gets changed into नपम, राम changes to रपम – Patsy 15/3, 2016 at 2:39

It's only one character that changes: u"\u0928\u093e\u092e" becomes u"\u0928\u092a\u092e", i.e. \u093e changes to \u092a. So maybe changing \u092a back to \u093e, or \u092a\u092e to \u093e\u092e, could solve that particular case. – Fissirostral 15/3, 2016 at 5:33
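
The comparison described in the comment above can be reproduced by walking the two strings side by side. This is a minimal sketch using only the standard library; the function name `diff_code_points` is made up for illustration, and it assumes the correct and garbled strings have the same length, as they do in the नाम / नपम example.

```python
import unicodedata


def diff_code_points(expected, actual):
    """Return (index, expected_char, actual_char) for each position that differs.

    Assumes both strings have the same length; zip() silently ignores
    any extra characters in the longer string.
    """
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


if __name__ == "__main__":
    # Correct word vs. the garbled form reported in the thread.
    for index, good, bad in diff_code_points("\u0928\u093e\u092e", "\u0928\u092a\u092e"):
        print(index,
              "U+%04X" % ord(good), unicodedata.name(good),
              "->",
              "U+%04X" % ord(bad), unicodedata.name(bad))
```

Running this over several correct/garbled word pairs may show whether the same substitution occurs consistently, which is what a replacement-based fix would rely on.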

Can you please elaborate? I didn't get you! And what would be the logic behind the decoder? – Patsy 15/3, 2016 at 7:12

Let us continue this discussion in chat. – Patsy 15/3, 2016 at 7:15

Decoding is converting one set of characters to another. In this case, it is just a replace function, e.g. text = text.replace(u"\u092a\u092e", u"\u093e\u092e") – Fissirostral 15/3, 2016 at 9:46
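
The replace-based idea in the comment above could be generalised into a small lookup table. This is only a sketch: the table `REPLACEMENTS` and the function `fix_text` are hypothetical names, and the single entry is the one pair mentioned in the thread. A real PDF would likely need more pairs, collected by comparing garbled output against the correct text.

```python
# Hypothetical table of (garbled, correct) sequences. Only this pair comes
# from the thread; extend it with pairs observed in your own PDF.
REPLACEMENTS = [
    ("\u092a\u092e", "\u093e\u092e"),  # पम -> ाम, as in नपम -> नाम
]


def fix_text(text, replacements=REPLACEMENTS):
    """Apply each (garbled, correct) replacement in order and return the result."""
    for garbled, correct in replacements:
        text = text.replace(garbled, correct)
    return text


if __name__ == "__main__":
    # The two garbled words reported in the thread.
    print(fix_text("\u0928\u092a\u092e \u0930\u092a\u092e"))
```

Note that a blind replacement also rewrites any word that legitimately contains पम, so this approach is only safe when the garbled sequences cannot occur in correct text.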

@YOU: What if I have dynamic Hindi (Indian) words in a PDF? What is the best way to extract them? Even if we copy the text from the PDF and paste it into another document, it has the same problem. Can you please guide me on this? Thanks – Caballero 22/11, 2017 at 17:10

@AbhinavMishra do you have code to convert Hindi text? – Misspeak 20/1, 2021 at 19:4

The best way to solve the problem is to use the textract module for Python, load the Hindi test data from its GitHub repository, and write the extracted text to a .txt file. This solved my problem.
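
The answer does not say which textract options were used. The sketch below assumes the OCR route, i.e. `textract.process` with `method="tesseract"` and `language="hin"`, which requires textract, Tesseract, and Tesseract's Hindi language data to be installed. The helper names `extract_with_textract` and `save_extracted_text` are made up for illustration.

```python
def extract_with_textract(pdf_path):
    """Run textract's Tesseract OCR path with Hindi language data (an assumption)."""
    # Imported lazily so the rest of the module works without textract installed.
    import textract

    return textract.process(pdf_path, method="tesseract", language="hin")


def save_extracted_text(pdf_path, out_path, extract=extract_with_textract):
    """Extract text from pdf_path, write it to out_path as UTF-8, and return it."""
    raw = extract(pdf_path)      # textract.process returns bytes
    text = raw.decode("utf-8")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text


if __name__ == "__main__":
    # File name taken from the question; the output name is arbitrary.
    save_extracted_text("S24A276P001.pdf", "S24A276P001.txt")
```

Because OCR reads the rendered page image rather than the PDF's internal character codes, it is not affected by the font-encoding issue discussed in the comments, though its accuracy depends on scan quality.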

Patsy answered 21/3, 2016 at 20:9 Comment(1)

Could you please elaborate on the solution? A simple example would help us. Thanks – Caballero 22/11, 2017 at 15:4
