I have an image of a table (in my case a .gif) and want to extract the table from it (ideally as .ods).
Is there any way to do so? (Doing it manually is ruled out, since the table has more than 1000 rows and 6 columns.)
You will be able to get most of it through OCR, but you'll need to manually verify the data and fix some inaccuracies that will be there. It definitely won't be perfect.
First thing to do is to ensure you have a good quality image for the OCR software:
Here's what I did with your sample png (I'm using Windows):
Removed the orange/blue backgrounds:
a) Select -> By Color and clicked the blue background
b) I held down Shift and clicked the orange background (this will add it to the current selection)
c) Edit -> Fill With BG Color (this sets it to white)
d) Ctrl-Shift-A to cancel the selection
I removed the partially cut off '305' line:
a) used the Rectangular Select tool button from the palette, and filled the selection with BG Color, as above
Let's remove the table border:
a) Click the 'Fuzzy Select' tool button from the palette
b) Click somewhere on the table border (you should see the 'marching ants' instead of the border)
c) Edit -> Fill With BG Color
d) Ctrl-Shift-A to cancel the selection again
Next, increase the number of pixels that the numbers use so that the OCR can better detect their shapes:
a) Image -> Scale Image. I chose to scale by 1000% with Linear Interpolation (the other interpolations won't work as well)
Download and install Tesseract from GitHub
a) At the command prompt type (include the double-quotes to cope with spaces within the path, & change your paths as necessary): "D:\Program Files (x86)\Tesseract-OCR\tesseract" "d:\temp\your_image.png" "d:\temp\your_txt_file_output"
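If you'd rather script this step than type it at the prompt, here is a minimal Python sketch that builds the same command and runs it. The function names `build_tesseract_command` and `run_ocr` are my own, and the paths in the usage comment are just the example ones from above, so change them as necessary.

```python
import subprocess

def build_tesseract_command(tesseract_exe, image_path, output_base):
    """Return the argument list for: tesseract <image> <output base>."""
    # Passing a list means no manual quoting is needed for paths with spaces.
    return [tesseract_exe, image_path, output_base]

def run_ocr(tesseract_exe, image_path, output_base):
    """Run Tesseract and return the path of the text file it writes."""
    command = build_tesseract_command(tesseract_exe, image_path, output_base)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    # Tesseract appends .txt to the output base name itself.
    return output_base + ".txt"

# Example call (needs Tesseract installed at this path):
# run_ocr(r"D:\Program Files (x86)\Tesseract-OCR\tesseract",
#         r"d:\temp\your_image.png",
#         r"d:\temp\your_txt_file_output")
```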
The output will be a text file with an appended .txt extension. It will still have a few artifacts but we can easily correct those in Notepad++ (or similar):
a) The commas were seen as full-stops, so I did a Find and Replace of "." with "," (I'm assuming you don't have any decimal points in the data!)
b) There were some spaces before a few commas, so I did Find and Replace " ," with "," (note I included a space before the comma in the Find)
c) There were still some spaces in the numbers, so I did a Find and Replace of " " with "" (a space with an empty replace)
This gave the following result:
298
299
300
301
302
303
304
910,820,000
920,820,000
930,820,000
941,820,000
952,820,000
983,820,000
9?4,820,000
210,000
220,000
220,000
220,000
220,000
220,000
220,000
2,500
2,500
3,000
3,000
3,000
3,000
3,000
19,000
19,000
20,000
20,000
20,000
20,000
20,000
Note the question mark in the place of 7 in the second block of text. Things like that still need to be tidied up.
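With more than 1000 rows, the Find and Replace steps and the hunt for leftovers like that question mark could also be scripted. The following is only a sketch: `clean_ocr_line` and `needs_review` are names I made up, the `raw` lines are invented for illustration (apart from the question-mark value shown above), and it assumes, as above, that the data contains no decimal points.

```python
import re

def clean_ocr_line(line):
    """Apply the three Find and Replace steps to one line of OCR output."""
    line = line.replace(".", ",")   # full stops were really commas
    line = line.replace(" ,", ",")  # drop a space in front of a comma
    line = line.replace(" ", "")    # drop any remaining spaces
    return line

def needs_review(line):
    """True if the line contains anything other than digits and commas."""
    return re.fullmatch(r"[0-9,]+", line) is None

# Made-up raw lines for illustration only.
raw = ["910.820 ,000", "9?4,820,000", "2 500"]
cleaned = [clean_ocr_line(s) for s in raw]

# Report the lines that still need a manual look.
for number, text in enumerate(cleaned, start=1):
    if needs_review(text):
        print(f"line {number}: check {text!r}")
```

This only flags characters that cannot be part of a number. A digit misread as another digit will pass the check, so the values still need to be compared against the image.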
Lastly, you'd copy and paste the rows of text into your spreadsheet etc.
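Note that the text comes out column by column (all the values of the first column, then all of the second, and so on), so it has to be regrouped into rows first. Here is a rough Python sketch of that regrouping, assuming every column has the same number of values; the function names are hypothetical and the short `values` list is just an excerpt of the result above. It writes CSV, which LibreOffice Calc can open and save as .ods.

```python
import csv
import io

def blocks_to_rows(values, rows_per_block):
    """Regroup values listed column by column into table rows."""
    if rows_per_block <= 0 or len(values) % rows_per_block != 0:
        raise ValueError("value count must be a multiple of rows_per_block")
    # Cut the flat list into one chunk per column...
    columns = [values[i:i + rows_per_block]
               for i in range(0, len(values), rows_per_block)]
    # ...then transpose so each row takes one value from every column.
    return [list(row) for row in zip(*columns)]

def rows_to_csv(rows):
    """Return the rows as CSV text; fields containing commas get quoted."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Excerpt of the OCR result: three columns of two values each.
values = ["298", "299", "910,820,000", "920,820,000", "210,000", "220,000"]
rows = blocks_to_rows(values, rows_per_block=2)
print(rows_to_csv(rows), end="")
```

The csv module quotes any field that contains a comma, so the thousands separators don't get mistaken for column breaks. Whether the spreadsheet then treats those cells as numbers or as text depends on the import settings.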
I wanted to post another option I finally found online.
That said, I think K Scandrett's answer deserves to be the correct one, since it doesn't rely on a URL, which might go down.