How to recognize entities in text that is the output of optical character recognition (OCR)?

I am trying to do multi-class classification on textual data. The problem I am facing is that the data is unstructured. I'll explain with an example. Consider this image:

[Image: example data]

I want to extract and classify the text information in the image. The problem is that when I extract it, the OCR engine gives output something like this:

18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS

The target classes here are:

18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name

The problem is that the input text is not separable by line: multiple lines can belong to the same class (e.g. "KEEP AWAY" and "FROM FIRE"), and a single line can contain multiple classes.

So I don't know how to split or merge lines before passing them to a classification model.
Is there any way, using NLP, to split a paragraph based on the target classes? In other words, given an input paragraph, split it into segments according to the target labels.

Condone answered 3/3, 2019 at 10:52 Comment(0)

If you only consider the text, this is a Named Entity Recognition (NER) task.

What you can do is train a spaCy model to perform NER for your particular problem.

Here is what you will need to do:

  1. First gather a list of training text data
  2. Label that data with corresponding entity types
  3. Split the data into training set and testing set
  4. Train a model with spaCy NER using the training set
  5. Score the model using the testing set
  6. ...
  7. Profit!
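
Steps 1 to 3 can be sketched in plain Python. This is only an illustration under a few assumptions: the OCR lines have been labeled by hand, consecutive lines with the same label belong to one entity, and the lines are joined with single spaces. The helper names `to_ner_example` and `train_test_split` are made up for this sketch. The `(text, {"entities": [(start, end, label)]})` shape is the character-offset annotation format commonly used for spaCy NER training data.

```python
import random

# OCR lines from the question, each paired with a hand-assigned target class.
labeled_lines = [
    ("18", "size"),
    ("EURO 46", "price"),
    ("KEEP AWAY", "usage_instructions"),
    ("FROM FIRE", "usage_instructions"),
    ("MADE IN CHINA", "manufacturing_location"),
    ("2226249917581", "product_id"),
    ("7412501", "style_id"),
    ("DOROTHY", "brand_name"),
    ("PERKINS", "brand_name"),
]


def to_ner_example(lines):
    """Join labeled lines into one text and compute character-offset spans.

    Assumption: consecutive lines sharing a label form a single entity.
    """
    text_parts, entities = [], []
    offset = 0
    for line, label in lines:
        start, end = offset, offset + len(line)
        if entities and entities[-1][2] == label:
            # Same label as the previous line: extend the previous span.
            entities[-1] = (entities[-1][0], end, label)
        else:
            entities.append((start, end, label))
        text_parts.append(line)
        offset = end + 1  # +1 for the joining space
    return " ".join(text_parts), {"entities": entities}


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a held-out test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


text, annotations = to_ner_example(labeled_lines)
for start, end, label in annotations["entities"]:
    print(f"{text[start:end]!r} -> {label}")
```

Because the whole label is treated as one text and entities are character spans, the model no longer depends on where the OCR engine happened to break lines, which is the split/merge problem raised in the question.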

See the spaCy documentation on training custom NER models.
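
For step 5, one common way to score an NER model is entity-level precision, recall and F1 with exact matching on span and label. The following is a minimal sketch of that metric in plain Python. The function name `score_entities` and the sample predictions are invented for illustration, and it is not a replacement for whatever scorer your NER library provides.

```python
def score_entities(gold, predicted):
    """Entity-level precision/recall/F1 using exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold annotations and model output for one label text.
gold = [(0, 2, "size"), (3, 10, "price"), (11, 30, "usage_instructions")]
predicted = [(0, 2, "size"), (3, 10, "price"), (11, 20, "usage_instructions")]

scores = score_entities(gold, predicted)
print(scores)
```

Here the third prediction covers only "KEEP AWAY", so it counts as wrong under exact matching even though the label is right. This strictness is worth keeping in mind when interpreting a target such as 80%.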

Good luck!

Anteversion answered 5/3, 2019 at 13:21 Comment(1)
Any idea on the amount of training data required for a decent accuracy (>80%)? - Shamanism
