Synonym finder algorithm

Asked 26/5, 2011 at 6:22 Answered 16/8, 2021 at 2:1

Solved php python algorithm synonym data-analysis

I think example will be much better than loooong description :)

Let's assume we have an array of arrays:

("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")

Each line contains strings which are synonyms. And as a result of processing of this array I want to get this:

("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")

So I think I need a kind of recursive algorithm. Programming language actually doesn't matter — I need only a little help with idea in general. I'm going to use php or python.

Thank you!

Anallise answered 26/5, 2011 at 6:22 Comment(5)

so you are saying Server1 and 192.168.0.5 are synonyms? – Hambrick 26/5, 2011 at 6:32

Yes. Because Server1 = 192.168.0.3 and 192.168.0.3 = 192.168.0.5 – Anallise 26/5, 2011 at 6:34

and you are also selecting unique ones, right?like distinct. – Hambrick 26/5, 2011 at 6:35

This actually doesn't matter. It's always possible to apply array_unique. And I don't need information about how often this or that word is mentioned as a synonym. Besides, some smart SQL query can be a good solution as well! Not only php/python… – Anallise 26/5, 2011 at 6:38

well my idea is creating a database table with a Value column and a Type column Value holds Server1 Type holds 1, Value holds 192.168.0.3 Type 1 and Value holds Server_2 Type holds 192.168.0.4.then with some linq query in C# I guess we could group them but I do not know if you are ok with using tables or how many strings you will have.you can also create a class with attributes Value and Type and use the same system, I mean at least I firstly would try this.However, there might be special algorithms for such kind of work, I just like to create my own algorithms. – Hambrick 26/5, 2011 at 6:46

This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.

An efficient way to solve this problem is doing a "flood fill" algorithm, which is essentially a recursive breath first search. This wikipedia entry describes the flood fill algorithm and how it applies to solving the problem of finding connected regions of a graph.

To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.

Server1 is Node #0
Server_1 is Node #1
Server_2 is Node #2

Then edge[0][1] = edge[1][0] = 1, indicated that there is an edge between nodes #0 and #1 ( which means that they are synonyms ). While edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.

Complexity Analysis

Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to crate it. If you store the mapping of strings to node numbers in a dictionary then this would be a O(n log n) step.

Doing the flood fill is O(n), you only visit each node in the graph once. So, the algorithm in all is O(n log n).

Fuchsin answered 26/5, 2011 at 6:40 Comment(6)

Yes, that seems to be an option. Thank you! – Anallise 26/5, 2011 at 6:58

If algorithm is O(n) and you have to visit every node once, then overall complexity is O(n^2). If you tries with graphs, then some closure calculation where more appropriate. – Confraternity 26/5, 2011 at 8:43

@p4553d: I don't follow why it is O(n^2). You visit each node once and you are done - as soon as you have visited a node you know which sub-graph it belongs to. The key is of course to skip over nodes that you have already visited, which the flood fill algorithm already does. – Fuchsin 26/5, 2011 at 11:18

@Himadri: But you have to find synonym group for every word. Flood fill algorithm will find synonyms for one word only. "Disconnected" regions of graph stay undiscovered. – Confraternity 26/5, 2011 at 11:22

@p4553d: Yes. That's why I mention that you should skip over the nodes that you have already visited. Say 1/2 the graph is visited by the first iteration of the flood fill, the outer for loop that finds all disconnected region then will just skip over that 1/2 of the graph in linear time. It's still O(n) – Fuchsin 26/5, 2011 at 11:25

@Himadri: Ok, I got it. I still believe that union-find approach is more practicable here, but your solution is fine too. Thanks for elaboration. – Confraternity 26/5, 2011 at 11:56

Introduce integer marking, which indicates synonym groups. On start one marks all words with different marks from 1 to N.

Then search trough your collection and if you find two words with indexes i and j are synonym, then remark all of words with marking i and j with lesser number of both. After N iteration you get all groups of synonyms.

It is some dirty and not throughly efficient solution, I believe one can get more performance with union-find structures.

Confraternity answered 26/5, 2011 at 6:34 Comment(1)

That's what I was thinking. Union-find. It's got amortized constant time (effectively). – Gent 26/5, 2011 at 7:13

Edit: This probably is NOT the most efficient way of solving your problem. If you are interested in max performance (e.g., if you have millions of values), you might be interested in writing more complex algorithm.

PHP, seems to be working (at least with data from given example):

$data = array(
    array("Server1", "Server_1", "Main Server", "192.168.0.3"),
    array("Server_1", "VIP Server", "Main Server"),
    array("Server_2", "192.168.0.4"),
    array("192.168.0.3", "192.168.0.5"),
    array("Server_2", "Backup"),
);

do {
    $foundSynonyms = false;
    foreach ( $data as $firstKey => $firstValue ) {
        foreach ( $data as $secondKey => $secondValue ) {
            if ( $firstKey === $secondKey ) {
                continue;
            }
            if ( array_intersect($firstValue, $secondValue) ) {
                $data[$firstKey] = array_unique(array_merge($firstValue, $secondValue));
                unset($data[$secondKey]);
                $foundSynonyms = true;
                break 2; // outer foreach
            }
        }
    }
} while ( $foundSynonyms );

print_r($data);

Output:

Array
(
    [0] => Array
        (
            [0] => Server1
            [1] => Server_1
            [2] => Main Server
            [3] => 192.168.0.3
            [4] => VIP Server
            [6] => 192.168.0.5
        )

    [2] => Array
        (
            [0] => Server_2
            [1] => 192.168.0.4
            [3] => Backup
        )

)

Helicograph answered 26/5, 2011 at 6:52 Comment(9)

Wow thanks a lot! I've got the idea. I'll test this on full data dump (2 millions of records, besides :)) a little bit later. – Anallise 26/5, 2011 at 6:56

This algorithm seems O(n^3) to me though which is a lot slower than O( n log n) algorithms that exist. – Fuchsin 26/5, 2011 at 6:58

2 million records will be problematic with this algorithm. The 2 for loops itself will cause the inner loop to run 4*10^12, already not practical unless you have a supercomputer handy. Then you have the array_intersect() function each pass through the inner loop. – Fuchsin 26/5, 2011 at 7:0

You might want to skip break 2; line, I'm not sure what is its impact on performance. Also, if performance is too weak, take a look at more complex solutions. This was just the very first thing that came into my mind without deep knowledge about data analysis (aargh, I sometimes hate that I don't have to solve complex problems at job). – Helicograph 26/5, 2011 at 7:2

Yes, you're right, but it needs testing. Dataset is supposed to change monthly and data collecting takes at least several hours so even if script will do the job in an hour it's definitely an acceptable result. – Anallise 26/5, 2011 at 7:3

@Himadri, this was just the very first thing that came into my mind and that takes like 5 minutes to write. It's practical only for small sets of data and/or for using it only once, i.e., there's no need to spend 2 hours for writing algorithm that would process 2 millions of records in 5 seconds, if you can spend 5 minutes to write an algorithm that will do the job in 20 minutes. – Helicograph 26/5, 2011 at 7:4

@DEgorov, modifying $data[$firstKey] might be done differently - instead of making new array with array_unique(array_merge()), you might calculate array_diff() and update $data[$firstKey] with "new" entries. – Helicograph 26/5, 2011 at 7:7

@binaryLV: Generally I agree with you, better to do things the simplest possible way ... but no simpler!. This solution will take a lot more than 5 seconds. An O(N^3) algorithm will probably take weeks or months on this problem for 2,000,000 points. – Fuchsin 26/5, 2011 at 11:13

@Himadri, 5 seconds was for much better algorithm :) – Helicograph 26/5, 2011 at 11:26

This would yield lower complexity then the PHP example (Python 3):

a = [set(("Server1", "Server_1", "Main Server", "192.168.0.3")),
    set(("Server_1", "VIP Server", "Main Server")),
    set(("Server_2", "192.168.0.4")),
    set(("192.168.0.3", "192.168.0.5")),
    set(("Server_2", "Backup"))]

b = {}
c = set()
for s in a:
    full_s = s.copy()
    for d in s:
        if b.get(d):
            full_s.update(b[d])
    for d in full_s:
        b[d] = full_s
    c.add(frozenset(full_s))

for k,v in b.items():
    fsv = frozenset(v)
    if fsv in c:
        print(list(fsv))
        c.remove(fsv)

Androclinium answered 26/5, 2011 at 8:45 Comment(0)

I was looking for a solution in python, so I came up with this solution. If you are willing to use python data structures like sets you can use this solution too. "It's so simple a cave man can use it."

Simply this is the logic behind it.

foreach set_of_values in value_collection:
    alreadyInSynonymSet = false
    foreach synonym_set in synonym_collection:
        if set_of_values in synonym_set:
            alreadyInSynonymSet = true
            synonym_set = synonym_set.union(set_of_values)
    if not alreadyInSynonymSet:
        synonym_collection.append(set(set_of_values))

vals = (
    ("Server1", "Server_1", "Main Server", "192.168.0.3"),
    ("Server_1", "VIP Server", "Main Server"),
    ("Server_2", "192.168.0.4"),
    ("192.168.0.3", "192.168.0.5"),
    ("Server_2", "Backup"),
)

value_sets = (set(value_tup) for value_tup in vals)

synonym_collection = []

for value_set in value_sets:
    isConnected = False # If connected to a term in the graph

    print(f'\nCurrent Value Set: {value_set}')
    
    for synonyms in synonym_collection:
        # IF two sets are disjoint, they don't have common elements
        if not set(synonyms).isdisjoint(value_set):
            isConnected = True
            synonyms |= value_set # Appending elements of new value_set to synonymous set
            break

    # If it's not related to any other term, create a new set
    if not isConnected:
        print ('Value set not in graph, adding to graph...')
        synonym_collection.append(value_set)

print('\nDone, Completed Graphing Synonyms')
print(synonym_collection)

This will have a result of

Current Value Set: {'Server1', 'Main Server', '192.168.0.3', 'Server_1'}
Value set not in graph, adding to graph...

Current Value Set: {'VIP Server', 'Main Server', 'Server_1'}

Current Value Set: {'192.168.0.4', 'Server_2'}
Value set not in graph, adding to graph...

Current Value Set: {'192.168.0.3', '192.168.0.5'}

Current Value Set: {'Server_2', 'Backup'}

Done, Completed Graphing Synonyms
[{'VIP Server', 'Main Server', '192.168.0.3', '192.168.0.5', 'Server1', 'Server_1'}, {'192.168.0.4', 'Server_2', 'Backup'}]

Unpracticed answered 16/8, 2021 at 2:1 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags