Function that returns affinity between texts?

Asked 24/1, 2011 at 23:21 Answered 3/2, 2011 at 0:55

Solved algorithm text full-text-search relevance

consider I have a

string1 = "hello hi goodmorning evening [...]"

and I have some minor keywords

compare1 = "hello evening"
compare2 = "hello hi"

I need a function that returns the affinity between the text and keywords. Example:

function(string1,compare1);  // returns: 4
function(string1,compare2);  // returns: 5 (more relevant)

Please note 5 and 4 are just for example.

You could say - write a function that counts occurrences - but for this example this would not work because both got 2 occurrences, but compare1 is less relevant because "hello evening" isn't exactly found in string1 (the 2 words hello and evening are more distant than hello hi)

are there any known-algorithm to do this?

ADD1:

algos like Edit Distance in this case would NOT work. Because string1 is a complete text (like 300-400 words) and the comparing strings are max 4-5 word.

Hearttoheart answered 24/1, 2011 at 23:21 Comment(4)

Are you looking for simple string edit-distance comparison or full on semantic equivalence? e.g. is cat more similar to cart or feline? – Pentosan 24/1, 2011 at 23:26

none of both.. I would need something like count occurences+give weight based on the words distances (as i explained previos: string1 is an article with 300-400words and the comparing strings are just 3-4 words) – Hearttoheart 24/1, 2011 at 23:42

Do your keywords always come in pairs? What's more important, having more matching words or better proximity? – Skiba 27/1, 2011 at 10:11

1. not always in pairs, the comparing string can be up to 5-6 words 2. 50%-50% – Hearttoheart 31/1, 2011 at 23:29

A Dynamic Programing Algorithm

It seems what you are looking for is very similar to what the Smith–Waterman algorithm does.

From Wikipedia:

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).

Let's see a practical example, so you can evaluate its usefulness.

Suppose we have a text:

text = "We the people of the United States, in order to form a more 
perfect union, establish justice, insure domestic tranquility, 
provide for the common defense, 

  promote the general welfare, 

  and secure the blessings of liberty to ourselves and our posterity, 
do ordain and establish this Constitution for the United States of 
America.";

I isolated the segment we are going to match, just for your easy of reading.

We will compare the affinity (or similarity) with a list of strings:

list = {
   "the general welfare",
   "my personal welfare",
   "general utopian welfare",
   "the general",
   "promote welfare",
   "stackoverflow rulez"
   };

I have the algorithm already implemented, so I'll calculate the similarity and normalize the results:

sw = SmithWatermanSimilarity[ text, #] & /@ list;
swN = (sw - Min[sw])/(Max[sw] - Min[sw])

Then we Plot the results:

enter image description here

I think it's very similar to your expected result.

HTH!

Some implementations (w/source code)

Mahalia answered 3/2, 2011 at 0:55 Comment(23)

This might be the solution! But why ur function is called "SmithWatermanSimilarity" and not just "SmithWaterman": I mean do you have some customization? Anyway do you have a pseudocode description for this? So I can translate in my language (php) – Hearttoheart 3/2, 2011 at 18:16

@yes123 At the end of my answer there are a few links (you may find a lot) including three implementations and two "educative" resources. I can't provide mine because it isn't open source. – Mahalia 3/2, 2011 at 18:22

Ah ok. Maybe I will try the cuda or java implementation to see if it fits my needs (like your solution). If I will translate to php I will post a link here, thanks – Hearttoheart 3/2, 2011 at 19:17

I am testing some java implementations. It seems those implementations consider chars and not the entry words. That's a big problem for me – Hearttoheart 3/2, 2011 at 19:37

@yes123 it works well with words too, try it. It is designed to match DNA sequences that are like words – Mahalia 3/2, 2011 at 20:13

@yes123 Once you do some testing, please return and leave some comments. I'm very interested on knowing how the algorithm represent your specific requirements! – Mahalia 4/2, 2011 at 15:58

the implementations i am using in php gives me the same score of "1" to "the general welfare" and "the general", if u want see the php code i will post the link.. – Hearttoheart 5/2, 2011 at 0:28

@yes123 So it is very badly broken, because that kind of alignment is the most basic thing the algorithm do right. You can check at baba.sourceforge.net what should be the intermediate results and how increasing the text match length increases the score. – Mahalia 5/2, 2011 at 6:49

here there is the code: google.com/codesearch/p?hl=en#I08I-vVKhsw/trunk/biostor/swa.php too bad it's not that great =/ – Hearttoheart 5/2, 2011 at 17:51

@yes123 The code is not working properly. It works well only for perfect alignment. :( – Mahalia 5/2, 2011 at 18:25

i don't get why such an important algos for search (score based on proximity) isn't that known, especially for php that works with website/searches =/ – Hearttoheart 5/2, 2011 at 18:37

@yes123 You may try to find the Needleman-Wunsch algorithm, but this one is much better for your purpose. I'm not savvy in PHP, so can't help much with that. Sorry. – Mahalia 5/2, 2011 at 18:49

nothing neither for needleman. The problem here is that those 2 algos are used in Bioinformatics... that's why in PHP are completly unknown – Hearttoheart 6/2, 2011 at 18:21

well i have write a function on my own to do this. Results are very close to your results. I would like to share and discuss my solution with you, can we talk privately (email,chat,msn,fb) ? this little comment space here hurts me rly. – Hearttoheart 6/2, 2011 at 19:35

@yes123 It's preferred that the communication happens on Stack Overflow so that the solution to your problem can be shared by the community. That's what we're here for, to make the programming world a better place by sharing our knowledge publicly. – Stride 7/2, 2011 at 13:30

@george i am perfectly agree. but this little space is absoluty awful to discuss such things. thus stackoverflow (by rules) is a Q&A website only, not a discussion board :) (and that why the comments here are so limited) – Hearttoheart 7/2, 2011 at 13:43

@belisarius :( are you there? :( – Hearttoheart 9/2, 2011 at 1:19

@yes123 yes, I'm here. Check if you can establish a private chat room in chat.stackoverflow.com and invite me – Mahalia 9/2, 2011 at 9:19

@belisarius can't do it with chat :( Anyway I have created a link for you. Go here you can test my algo. As you can see (at the bottom of the codepad page) "the general welfare" got the highest score of 8 points. If you have any suggestions tell me. Please note I have not used any of well known algo.. I just wrote it myself – Hearttoheart 10/2, 2011 at 0:29

@yes123 If your algorithm gives you consistent results go ahead! It's not of course the SW algo, and seems to favor too little for longer matches. – Mahalia 10/2, 2011 at 13:3

@belisarius yes I am currently using it with good results. Anyway thanks for your help. – Hearttoheart 10/2, 2011 at 14:28

@belisarius These implementations of the Smith–Waterman and Needleman-Wunsch algorithms are much cleaner starting point Python implementations than the one you link to for S-W in your post. – Sierra 30/5, 2012 at 22:0

@Sierra Thanks! Feel free to edit my answer if you think they are better than those I found – Mahalia 30/5, 2012 at 23:47

Take a look into creating N-grams out of your input data and then matching on the N-grams. I have a solution where I regard each n-gram as a dimension in a vector space (which becomes a space of 4000 dimensions in my case) and then affinity is the cosine of the angle between two vectors (the dot-product is involved here).

The hard part is to come up with a metric defining the affinity in a way you want.

An alternative is to look at a sliding window and score based on how many words in your compare_x data is in the window. The final score is the sum.

Appal answered 25/1, 2011 at 1:42 Comment(3)

Hm even if it's interesting i dont think u can use this sliding window because you cant choose a winodws size that alyawys works imo. I was thinking something more like this: first counts only occorunces then remove some points based on how much chars there are between found keys. But i cannot belive there isnt already a known algo for this job... Its so important in search-related stuff. Regsrding other solution using ngrams i dont have a clear idea on how to make it – Hearttoheart 25/1, 2011 at 2:14

There is no clear-cut solution because different metrics are wanted for different purposes. – Appal 25/1, 2011 at 12:39

why you say that? I am pretty sure someone else already wrote algos for this (i don't think it would be too hard) I found some mathematics solution to the problem here: domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/… At page 31 there is a solution "For every term t of a query q= {t1, . . . , tn}, we compute an accumulator acc that contains proximity score of t within the current element e:" – Hearttoheart 25/1, 2011 at 16:1

py-editdist will give you the Levenshtein edit distance between two strings, which is one metric that might be helpful.

See: http://www.mindrot.org/projects/py-editdist/

The code example from that page:

import editdist

# Calculate the edit distance between two strings
d = editdist.distance("abc", "bcdef")

Semiquaver answered 24/1, 2011 at 23:32 Comment(0)

I think there is a pretty good and complete answer to this question here http://answers.google.com/answers/threadview?id=337832

Sorry its on google answers!

Nodule answered 2/2, 2011 at 5:42 Comment(2)

Half of the link on that page don't work. And the other half don't speak at all about proximity – Hearttoheart 2/2, 2011 at 10:16

If you mean proximity, as in meaning, – Nodule 3/2, 2011 at 23:29

Here you can find a list of metrics to calculate distance between strings, and an opensource java library that just do that. http://en.wikipedia.org/wiki/String_metric In particular, take a look at the Smith–Waterman algorithm, keeping in mind that what they call "Alphabet" can be composed by what we call Strings : so, given the alphabet

{A = "hello", B = "hi",C = "goodmorning",D = "evening"}

and called d the distance, your function tries to calculate

d(ABCD,AB) vs d(ABCD,AC)

Tomchay answered 27/1, 2011 at 23:58 Comment(3)

HM, I read Smith–Waterman works great on characters.. I need this algo for text =/ – Hearttoheart 31/1, 2011 at 9:26

As pointed out before, think at each word as a consant char, and you should be done – Tomchay 31/1, 2011 at 12:6

Hm so I should build my Alphabet based on my text and my comparing string and give them to the algos? I am not sure it can work because algos supports stuff like "insertion"/"deletion" that work for chars – Hearttoheart 31/1, 2011 at 13:28

Well, you can count the occurrences of pieces of the comparing text, ie:

"a-b-c" -> "a" , "b" , "c" , "a-b" ," b-c" , "a-b-c" (possible "a-c", if you wanted that)

And then count occurrences of each of those, and sum them, possibly with a weight of (length of string) / (length of whole string).

Then you just need a way to produce those pieces, and run a check for all of them.

Armlet answered 31/1, 2011 at 1:18 Comment(0)

While the Levenshtein distance as it stands may not suit your purposes, a modification of it might: Try implementing it by storing the insertions, deletions, and substitutions separately.

The distance will then be the sum of the following:

All Substutions
The number of spaces minus one in each set of consecutive insertions/deletions:
- (In your case: " hi goodmorning " only counts as two edits, and ' [...] ' counts as one.)

You'd have to test this, of course, but if it doesn't work well try simply using the sum of consecutive insertions/deletions (so, " hi good morning " is only 1 edit).

EDIT

P.S.: this assumes a relatively major change to how Levenshtein works, you'd want to 'align' your data first (finding out where there's significant (more than two characters) overlap and inserting 'null' characters that would count as insertions).

Also, this is just an untested idea, so any ideas for improvements are welcome.

Charlsiecharlton answered 1/2, 2011 at 15:4 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

A Dynamic Programing Algorithm

Recommended topics

Hot tags