Best machine learning approach to automate text/fuzzy matching
Asked Answered
H

2

7

I'm reasonably new to machine learning, I've done a few projects in python. I'm looking for advice on how to approach the below problem which I believe could be automated.

A user in a data quality team in my organisation has a daily task of taking a list of company names (with addresses) that have been manually entered, he has to then search a database of companies to find the matching result, using his judgement - i.e. no hard and fast rule.

An example of the input would be:

Company Name, Address Line 1, Country

Of this, the user takes the company name and enters it into the search tool. Where he is presented with a list of results and he picks the best match but may choose not to pick any match. The search tool is built in house and talks to an external API, I have access to the source code so I can modify the search tool to capture the input, the list of results, and I could add a checkbox to see which result was used, and a check box to signify that none was chosen. Therefore this would become my labelled training data.

The columns used from the results to make the judgement are roughly the same:

Company Name, Address Line 1, Country

Given a company name like Stack Overflow, the results may return Stack Overflow Ltd., Stacking Overflowing Shelves Ltd. etc. The input data is reasonably good, so the results usually yield about 10 matches, and to a human, it's fairly obvious which one to pick.

My thought is that with enough training data I could call the API directly with the search term, and then choose the appropriate result from the list of results.

Is this something that could be achieved through ML? I'm struggling with the fact that the data will be different every time. Thoughts on the best way to achieve this are welcome, in particular how to structure the data for the model and what kind of classifier to use etc.

Haymo answered 16/2, 2017 at 16:40 Comment(5)
This sounds more like fuzzy matching than text classification. You will likely receive poor ML classification results due to a huge number of labels, which would be all possible company names you want to match.Zaneta
Thanks, I've updated the description. I wonder if a there's a way to give the results of a fuzzy match in combination with which one was chosen to enhance it. There is a bit of logic used to decide which result to take when there are similar results, or multiple for the same company. E.g. they take the head office over the branch of a company when available, which is signified in another field.Haymo
I guess this can also be perceived as a binary classification problem, where for each pair of descriptions of companies you must answer whether they correspond to the same company or not. Levenshtein distance, tfidf or ngrams matches can be used as features. Even if the solution will be as simple as choosing a threshold in Levenshtein distance, or applying some combination of stemming/stop words, it would still be nice to use ML approach to choose that threshold value and measure the quality of the classification.Cleveland
@NickP Can you share the data (company names)?Crop
Hi @Nickp, Were you able get solution for this using ML?Dupery
K
8

To frame it as a ML problem, you could learn a similarity function.

Instead of classifying "Acme Corp" as matching the target class "Acme" (classifier), you would instead learn a function that learns to tell that "Acme Corp" is similar to "Acme", but dissimilar to "ABC Corp".

This is usually called "Similarity Learning", in your case, maybe more specifically "Ranking similarity learning" since your goal is not to learn a function that will output a similarity value, but instead rank potential candidates.

But before using full ML algorithms, I would start first by using a string distance metric, for instance the Levenshtein distance metric (very common and easy to find). Transform your data in positive and negative examples (a positive example: Acme is a match to Acme Corp). The simplest learning function would be finding the Edit Distance threshold that maximizes your score. You can also add parameters like: "remove Corp.", "remove Ltd", etc. and find what combination works best.

Kith answered 16/2, 2017 at 17:11 Comment(6)
Hi Pascal, this sounds like the kind of thing I'm after, I don't suppose you have seen a decent example of this anywhere have you? Also, are you aware of any good libraries for similarity learning. Ta.Haymo
@NickP maybe you don't need "real" ML at all. I would start first by using a string distance metric, for instance the Levenshtein distance metric (very common and easy to find). Transform your data in positive and negative examples (a positive example: Acme is a match to Acme Corp). The simplest learning function would be finding the Edit Distance threshold that maximizes your score. You can also add parameters like: "remove Corp.", "remove Ltd", etc. and find what combination works best. You probably don't need full ML here.Kith
@NickP take a look at the overview of the dedupe library for a description of how to use ML for this problemBrieta
@PascalSoucy can you give any suggestion on how to get positive and negative examples? For most classification algorithms the percentage of the classes in the train data must be the same as in the actual data. If we treat the task as the binary classification problem and for each pair of companies we decide whether they are the same or not, that would mean that the number of negative examples should be close to the number of companies, as each company usually matches with one company and doesn't match with all the other companies. Am I correct in my assumptions?Cleveland
@PascalSoucy I'm just a bit worried that that would create too many examples and complicate the computations while training the model.Cleveland
Hi, there. I'm now a fair bit more experienced in machine learning, So if i were tackle this problem again, I'd probably use a character level recurrent neural network. Using ML at all is probably a bit overkill, but it's fun to do it this way nonetheless. I don't think class imbalance will be an issue with a Char RNN. Your input to this kind of model will be a one hot vector, your output will be true false, and you'll need some kind of marker to label company 1 then company 2 e.g. 1:Acme Corp2:AcmeHaymo
K
0

I am glad to see there are people who are working on a similar solution.

I am using fuzzywuzzy for this but this I would want to create a recommender system the suggests the companies based on the information available but since you have only 2 data point I would suggest the below:

Prepare 2 independent fuzzy lookup scripts. One for the company name and other for the address. Pick the closest results and try finding the distance of the their respective objects. Example - address1 match to address2 is 92% check what is the distance of the company name of address1 to the company name of address2. if the match is good enough you got your match.

The mistake I did while trying to implement this solution was preparing only 1 script heavily dependent on the company name and later on matched the address which reduced my chances of finding the match.

Thank you,

Koy answered 19/11, 2019 at 13:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.