How to extract text from a table in an image?
I have data in a structured table image. The data looks like the following:

[image: screenshot of the table]

I tried to extract the text from this image using this code:

import pytesseract
from PIL import Image

# Run Tesseract OCR directly on the table image
value = Image.open("data/pic_table3.png")
text = pytesseract.image_to_string(value, lang="eng")
print(text)

and here is the output:

EA Domains

Traditional role

Future role

Technology e Closed platforms ¢ Open platforms

e Physical e Virtualized Applicationsand |e Proprietary e Inter-organizational Integration e Siloed composite e P2P integrations applications

e EAI technology e Software asa Service

e Enterprise Systems e Service-Oriented

e Automating transactions Architecture

e “Informating”

interactions

However, the expected output should keep the text aligned according to its column and row. How can I do that?

Residence answered 17/12/2019 at 8:55
You can also try PP-Structure if you need document layout analysis / mixed tabular and other content, as you indicated: github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/… – Mousy
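
A minimal sketch of what that PP-Structure route might look like, assuming the PPStructure pipeline from PaddleOCR 2.x (the result keys shown are assumptions based on its table output; the path comes from the question):

import cv2
from paddleocr import PPStructure  # pip install "paddleocr>=2.6"

# Layout analysis + table recognition in one pipeline
table_engine = PPStructure(show_log=False)
img = cv2.imread("data/pic_table3.png")
result = table_engine(img)

# Each detected region has a type ('table', 'text', ...) and a bounding box;
# for table regions the reconstructed table is returned as HTML
for region in result:
    print(region["type"], region["bbox"])
    if region["type"] == "table":
        print(region["res"]["html"])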
You must preprocess the image to remove the table lines and dots before throwing it into OCR. Here's an approach using OpenCV.

  1. Load image, grayscale, and Otsu's threshold
  2. Remove horizontal lines
  3. Remove vertical lines
  4. Dilate to connect text and remove dots using contour area filtering
  5. Bitwise-and to reconstruct image
  6. OCR

Here's the processed image:

[image: processed result with table lines and bullet dots removed]

Result from Pytesseract

EA Domains Traditional role Future role
Technology Closed platforms Open platforms
Physical Virtualized
Applications and Proprietary Inter-organizational
Integration Siloed composite
P2P integrations applications
EAI technology Software as a Service
Enterprise Systems Service-Oriented
Automating transactions Architecture
“‘Informating”
interactions

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, and Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, (0,0,0), 2)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,15))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, (0,0,0), 3)

# Dilate to connect text and remove dots
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (10,1))
dilate = cv2.dilate(thresh, kernel, iterations=2)
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area < 500:
        cv2.drawContours(dilate, [c], -1, (0,0,0), -1)

# Bitwise-and to reconstruct image
result = cv2.bitwise_and(image, image, mask=dilate)
result[dilate==0] = (255,255,255)

# OCR
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.imshow('dilate', dilate)
cv2.waitKey()
Lubberly answered 17/12/2019 at 21:12
Thank you very much for your answer @Lubberly! However, your result still doesn't meet my expectation. Is there any other way that I can align the text according to each column and row? – Residence
Your code gives the entire text of the image; how can I get only the table's data? – Cida
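
One way to approach the alignment question from these comments is pytesseract's image_to_data, which returns per-word bounding boxes that can be grouped into rows. A rough sketch (the 10 px tolerance is a guess and would need tuning for other images):

import pytesseract
from PIL import Image
from pytesseract import Output

# image_to_data gives per-word text plus bounding boxes
img = Image.open("data/pic_table3.png")
d = pytesseract.image_to_data(img, lang="eng", output_type=Output.DICT)

# Group words into rows by their top coordinate
rows = {}
for i, word in enumerate(d["text"]):
    if not word.strip():
        continue
    top = d["top"][i]
    # Attach the word to the nearest existing row, or start a new one
    key = min(rows, key=lambda k: abs(k - top), default=None)
    if key is None or abs(key - top) > 10:
        key = top
        rows[key] = []
    rows[key].append((d["left"][i], word))

# Print each row with its words ordered left to right
for top in sorted(rows):
    print(" ".join(word for _, word in sorted(rows[top])))

Columns could be recovered the same way by clustering the left coordinates, for example against the vertical line positions found during preprocessing.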
You might want to detect the cells first. You can do that with a Hough line transform, which OpenCV provides as cv2.HoughLinesP. After that, you can use the detected lines to select each cell's ROI and then extract the text cell by cell.

For a detailed explanation, kindly visit my blog post.
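
A rough sketch of that pipeline (not the blog post's exact code), assuming a cleanly ruled table; the Canny/Hough thresholds, pixel tolerances, and file name are illustrative and would need tuning:

import cv2
import numpy as np
import pytesseract

# Detect the table's ruling lines with a probabilistic Hough transform
image = cv2.imread('table.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                        minLineLength=100, maxLineGap=10)

# Separate (nearly) horizontal and vertical lines into grid coordinates
xs, ys = [], []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        if abs(y1 - y2) < 5:
            ys.append(y1)
        elif abs(x1 - x2) < 5:
            xs.append(x1)

# Merge coordinates that belong to the same ruling line
def cluster(vals, tol=10):
    if not vals:
        return []
    vals = sorted(vals)
    merged = [vals[0]]
    for v in vals[1:]:
        if v - merged[-1] > tol:
            merged.append(v)
    return merged

xs, ys = cluster(xs), cluster(ys)

# Crop each cell ROI between neighbouring lines and OCR it separately
for r in range(len(ys) - 1):
    row = []
    for c in range(len(xs) - 1):
        cell = image[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
        row.append(pytesseract.image_to_string(cell, config='--psm 6').strip())
    print(row)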

Rawlins answered 24/6/2020 at 15:56
Thanks a lot! I followed your guide and had some decent results, but if the structure of the page is not completely tabular (e.g. goods manifests, bills, etc.), I run into issues that cause the ROI to be out of bounds. Any hint on how to address such semi-structured images? – Calutron
