Help: Extracting data tuples from text... Regex or Machine learning?
I would really appreciate your thoughts on the best approach to the following problem. To give an idea, I am using a car classified-listing example, which is similar in nature to my actual data.

Problem: Extract a data tuple from the given text.

Here are some characteristics of the data.

  1. The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at most.

  2. The text that needs to be parsed is a headline, like the car ad data shown below. Each record corresponds to one tuple (row).

  3. In some cases, some of the attributes may be missing. For example, in raw data row #5 below, the year is missing.

  4. Some words go together (bigrams). Like "Low miles".

  5. Historical data available = 10,000 records

  6. Incoming New Data volume = 1000-1500 records / week

The expected output should be in the form (Year, Make, Model, Feature). So the output should look like:

1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....

Raw Headline Data:


1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic


Possible choices:

  1. A machine learning Text Classifier (Naive Bayes etc)
  2. Regex

What I am trying to figure out is whether regex would be too complicated for the job, and whether a text classifier would be overkill.

If the choice is to go with a text classifier, which one would you consider the easiest to implement?
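To make the regex option concrete, here is a minimal Python sketch of one sub-task only: stripping the price/location tail and normalizing the year. The names `PRICE_TAIL`, `YEAR`, `strip_tail`, `extract_year` and the `pivot` cutoff for two-digit years are all assumptions made for illustration, not part of any library, and make/model/feature extraction is not covered.

```python
import re

# Trailing "- $7000 (san jose east) pic" part of a headline.
PRICE_TAIL = re.compile(r"\s*-\s*\$\d+.*$")
# A 4-digit year (19xx/20xx) or a 2-digit year, optionally written as '00.
# The lookarounds reject digits that are part of a longer token (430, 1.8T).
YEAR = re.compile(r"(?<![\w.$])(?:((?:19|20)\d{2})|'?(\d{2}))(?![\w.])")

def strip_tail(headline):
    """Drop the price/location tail and any '#' decoration."""
    return PRICE_TAIL.sub("", headline).strip("# ")

def extract_year(headline, pivot=30):
    """Return the first year found, or None if the headline has none.

    Assumption: 2-digit years below `pivot` mean 20xx, the rest 19xx.
    """
    m = YEAR.search(strip_tail(headline))
    if m is None:
        return None
    if m.group(1):
        return int(m.group(1))
    yy = int(m.group(2))
    return 2000 + yy if yy < pivot else 1900 + yy
```

On the sample rows this handles the 4-digit, `'00`, bare `93`, and missing-year cases, but it simply takes the first year-like token, so a headline that mentions a registration year before the model year would be misread.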

Thanks in advance for your kind help.

Guesswarp answered 12/6, 2011 at 18:28 Comment(2)
Do you have labeled data for training/testing any algorithms? This may limit the type of approaches you are able to apply from a machine learning perspective (e.g. language modeling requires a good sized corpus). – Venetic
Yes. I do have a lot of data for training purposes... – Guesswarp
S
4

This is a well-studied problem called information extraction. It is not straightforward to do what you want to do, and it is not as simple as you make it sound (i.e. machine learning is not overkill). There are several techniques; you should read an overview of the research area.

Strangle answered 13/6, 2011 at 0:54 Comment(1)
Unfortunately I have to agree with this. If you have a lot of labelled training data, you have a chance, but it's certainly not trivial to build/configure and test such a system. You definitely need some sort of Named Entity Recognition (dictionary-based or otherwise), and looking for common terms using word n-grams is a good idea. Because you have a limited domain, I think you have a decent shot at getting this to work: if you manually label 1000 examples, build a decent feature set, tokenize each headline, and run it through MALLET, I think that will work. – Glooming
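As a rough illustration of the tokenize / dictionary lookup / n-gram steps mentioned in this comment, here is a small Python sketch. The gazetteers (`MAKES`, `MODELS`, `BIGRAMS`) are tiny hypothetical samples, and the feature names are invented for this example; the resulting per-token features would still have to be converted into whatever input format the chosen sequence labeler expects.

```python
import re

# Hypothetical mini-gazetteers; a real system would load complete lists.
MAKES = {"ford", "honda", "mitsubishi", "lexus", "audi"}
MODELS = {"fusion", "taurus", "mirage", "accord", "expedition"}
BIGRAMS = {("low", "miles"), ("crew", "cab")}

def tokenize(headline):
    """Lowercase, split into tokens, and merge known bigrams into one token."""
    tokens = [t.lower() for t in re.findall(r"[\w'$.]+", headline)]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAMS:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def token_features(tokens, i):
    """Features for the token at position i, for use by a sequence labeler."""
    tok = tokens[i]
    return {
        "word": tok,
        "position": i,
        "is_make": tok in MAKES,        # dictionary-based entity hint
        "is_model": tok in MODELS,
        "has_digit": any(c.isdigit() for c in tok),
        "prev": tokens[i - 1] if i > 0 else "<s>",
    }
```

For example, `tokenize("Honda Accord ex low miles - $2800")` keeps `low_miles` as a single token, so a labeler sees the bigram as one unit rather than two unrelated words.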
T
3

Check this IE library for writing extraction rules; I think it will work best for your problem. There is also an example of how to create fast dictionary matching.

Torsibility answered 13/6, 2011 at 7:17 Comment(1)
+1 I'd also like to draw attention to the 'Alternatives' given in the first link (GATE, UIMA, NLTK, Lingpipe and MALLET), which is a very nice list of other possibilities. – Photoreconnaissance
B
0

I think that the ARX or Phoebus systems may suit your needs if you already have annotated data and a list of words associated with each field. Their approach is a mix of information extraction and information integration.

Baldhead answered 14/6, 2011 at 8:27 Comment(0)
E
0

There are a few good entity recognition libraries. Have you taken a look at Apache OpenNLP?

Eloiseloisa answered 16/6, 2011 at 22:38 Comment(0)
D
0

As a user looking for a specific model of car, the task is easier. I'm pretty sure I could classify, say, most Ford Rangers, since I know what to look for with a regexp.

I think your best bet is to write a function for each car model with type String -> Maybe Tuple. Then run all these on each input and throw away those inputs resulting in zero or too many tuples.
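The idea above can be sketched in Python, with `Optional` standing in for `Maybe`. Everything here is an assumption for illustration: the helper names `make_extractor` and `extract_unique`, the per-model feature lists, and the choice to look for the year only directly before the make.

```python
import re
from typing import Callable, List, Optional, Tuple

Record = Tuple[Optional[int], str, str, str]
Extractor = Callable[[str], Optional[Record]]

def make_extractor(make: str, model: str, features: List[str]) -> Extractor:
    """Build one String -> Optional[Record] function for a single car model."""
    feature_alt = "|".join(re.escape(f) for f in features)
    pattern = re.compile(
        r"(?:(?P<year>(?:19|20)\d{2})\s+)?"            # optional 4-digit year
        + re.escape(make) + r"\s+" + re.escape(model)
        + r"(?:\s+(?P<feature>" + feature_alt + r")\b)?",  # optional trim level
        re.IGNORECASE,
    )
    canonical = {f.lower(): f for f in features}

    def extract(headline: str) -> Optional[Record]:
        m = pattern.search(headline)
        if m is None:
            return None
        year = int(m.group("year")) if m.group("year") else None
        feature = canonical.get((m.group("feature") or "").lower(), "")
        return (year, make, model, feature)

    return extract

def extract_unique(headline: str, extractors: List[Extractor]) -> Optional[Record]:
    """Run every extractor; keep the result only if exactly one of them fires."""
    hits = [r for r in (e(headline) for e in extractors) if r is not None]
    return hits[0] if len(hits) == 1 else None
```

A usage sketch: with `extractors = [make_extractor("Ford", "Fusion", ["SE"]), make_extractor("Honda", "Accord", ["EX", "LX"])]`, calling `extract_unique("2009 Ford Fusion SE - $7000", extractors)` yields a Ford Fusion record, while a headline matched by no extractor (or by several) is discarded. Headlines that put the year after the model, such as row 12, would come back without a year under this simplification.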

Drillstock answered 4/10, 2012 at 1:39 Comment(0)
R
0

You should use a human microtasking tool like Amazon Mechanical Turk for this. Another alternative is to hire a data entry freelancer; Upwork is a great place to look. You can get excellent quality results, and the cost is very reasonable for either option.

Ryon answered 25/8, 2015 at 12:9 Comment(0)
