How to extract text from table in image?

2024/10/4 17:22:36

I have data in an image of a structured table. The data looks like this:

[image: the source table]

I tried to extract the text from this image using this code:

import pytesseract
from PIL import Image

value = Image.open("data/pic_table3.png")
text = pytesseract.image_to_string(value, lang="eng")
print(text)

Here is the output:

EA Domains

Traditional role

Future role

Technology e Closed platforms ¢ Open platforms

e Physical e VirtualizedApplicationsand |e Proprietary e Inter-organizationalIntegration e Siloed compositee P2P integrations applications

e EAI technology e Software asa Service

e Enterprise Systems e Service-Oriented

e Automating transactions Architecture

e “Informating”

interactions

However, I expect the output to be aligned by row and column. How can I do that?

Answer

You must preprocess the image to remove the table lines and dots before throwing it into OCR. Here's an approach using OpenCV.

  1. Load image, grayscale, and Otsu's threshold
  2. Remove horizontal lines
  3. Remove vertical lines
  4. Dilate to connect text and remove dots using contour area filtering
  5. Bitwise-and to reconstruct image
  6. OCR
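
For step 1, Otsu's method picks the gray level that best separates the dark pixels from the light ones. Here's a rough pure-Python sketch of the idea, not OpenCV's actual implementation. The names `otsu_threshold` and `binarize_inverted` and the sample pixel values are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return a threshold t for 8-bit gray values; the dark class is values <= t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize_inverted(pixels, t):
    # Same convention as THRESH_BINARY_INV: values <= t become 255, the rest 0.
    return [255 if p <= t else 0 for p in pixels]


# Hypothetical sample: four dark "ink" pixels and four light "paper" pixels.
pixels = [20, 25, 30, 22, 240, 250, 245, 235]
t = otsu_threshold(pixels)
mask = binarize_inverted(pixels, t)
print(t, mask)
```

The inverted output matters for the later steps: text and table lines end up white on black, which is what the morphological operations and contour search expect.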

Here's the processed image:

[image: the table after removing lines and dots]

Result from Pytesseract

EA Domains Traditional role Future role
Technology Closed platforms Open platforms
Physical Virtualized
Applications and Proprietary Inter-organizational
Integration Siloed composite
P2P integrations applications
EAI technology Software as a Service
Enterprise Systems Service-Oriented
Automating transactions Architecture
“‘Informating”
interactions

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, and Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50,1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, (0,0,0), 2)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,15))
detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, (0,0,0), 3)

# Dilate to connect text and remove dots
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (10,1))
dilate = cv2.dilate(thresh, kernel, iterations=2)
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area < 500:
        cv2.drawContours(dilate, [c], -1, (0,0,0), -1)

# Bitwise-and to reconstruct image
result = cv2.bitwise_and(image, image, mask=dilate)
result[dilate==0] = (255,255,255)

# OCR
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.imshow('dilate', dilate)
cv2.waitKey()
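
The line-removal steps work because a morphological opening with a long, thin kernel keeps only foreground runs at least as long as the kernel. Table rules survive it, while the short strokes of text do not. Below is a simplified pure-Python sketch of that idea on a tiny binary grid. It is not what `cv2.morphologyEx` does internally (border handling and `iterations` are ignored), and `open_horizontal` plus the sample grid are made up for illustration.

```python
def open_horizontal(img, k):
    """Binary opening with a 1-row, k-column kernel.

    Keeps only pixels that belong to a horizontal run of at least k ones.
    """
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= k:  # run is long enough to contain the kernel
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


# Hypothetical 3x12 binary image: rows 0 and 2 mimic text strokes,
# row 1 mimics a horizontal table rule.
img = [
    [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0],
]

lines = open_horizontal(img, 8)  # detected rule pixels only
# Erase the detected rule from the image; the text rows are left alone.
cleaned = [[p & (1 - l) for p, l in zip(row, line_row)]
           for row, line_row in zip(img, lines)]

for row in cleaned:
    print(row)
```

The same reasoning applies to the vertical kernel with rows and columns swapped. The kernel sizes in the answer's code fit that particular image, so they will likely need adjusting for images with a different resolution or font size.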