Search keywords efficiently when keywords are multi-word
I need to match a really large list of keywords (>1,000,000) in a string efficiently using Python. I found some really good libraries which try to do this fast:

1) FlashText (https://github.com/vi3k6i5/flashtext)

2) Aho-Corasick algorithm implementations, etc.

However, I have a peculiar requirement: in my context, a keyword, say 'XXXX YYYY', should return a match if my string is 'XXXX is a very good indication of YYYY'. Note that 'XXXX YYYY' does not occur here as a substring, but XXXX and YYYY are both present in the string, and that is good enough for me for a match.

I know how to do it naively. What I am looking for is efficiency. Are there any more good libraries for this?
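For concreteness, here is roughly what I mean by "naively" (a sketch; naive_match is a made-up name, and tokens are assumed to split on word characters):

import re

def naive_match(keywords, text):
    """Return the keywords whose individual words all appear in text."""
    # Tokenize on word characters, so punctuation acts as a separator.
    tokens = set(re.findall(r"\w+", text.lower()))
    matches = []
    for kw in keywords:
        # A multi-word keyword matches if every one of its words is
        # present, in any order and at any distance.
        if all(word in tokens for word in kw.lower().split()):
            matches.append(kw)
    return matches

This checks every keyword against every string, which is exactly the linear scan I want to avoid.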

Martinet answered 15/1, 2018 at 7:38 Comment(10)
It'd be nice to know your naive solutions, so as not to repeat them. One idea could be to remove everything not in your list of keywords from the string and then apply one of the fast libraries.Highhat
@Highhat By naive I mean converting multi-word keywords into a list and matching each element separately with an and condition (this is without using the fast libraries). What you suggest assumes that YYYY occurs after XXXX, which might not be true either.Martinet
OK, I see. It was not clear from your question.Highhat
Are you looking for patterns which are space-separated or something more general?Narine
@Narine Space- and punctuation-separated would be best for my case.Martinet
Split into tokens, look for the longest or most uncommon one, and then see if the match also contains the other(s)? (See the sketch after these comments.)Narine
How long are your keywords? For "xx yy", "xx is yy" is OK, but what about "yy is xx"? And "xxyy"?Douma
so say you have 2 keywords {cost benefit analysis} and {machine learning} - you want to find documents that contain those keywords or any permutation of those keywords, possibly with some non-keyword characters in between the constituents of the keyword. Does 'the cost benefit of different machine learning techniques requires more analysis' match only {machine learning} or both keywords?Fervidor
@MattiLyra BothMartinet
@Douma "xx is yy" and "yy is xx" are OK. Not the other one.Martinet
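The filtering idea from the comments can be sketched as follows: index each keyword under its rarest word, and only verify keywords whose rarest word actually occurs in the text. This is a sketch, not a library; build_candidates and match are made-up names, and corpus_freq is assumed to be a precomputed token-frequency mapping for your corpus.

def build_candidates(keywords, corpus_freq):
    """Map each keyword's rarest word to the keywords containing it."""
    by_rare = {}
    for kw in keywords:
        words = kw.split()
        rare = min(words, key=lambda w: corpus_freq.get(w, 0))
        by_rare.setdefault(rare, []).append(kw)
    return by_rare

def match(text, by_rare):
    tokens = set(text.split())
    hits = []
    for tok in tokens:
        for kw in by_rare.get(tok, []):
            # The rare word is present; verify the keyword's other words.
            if all(w in tokens for w in kw.split()):
                hits.append(kw)
    return hits

Only keywords whose rarest word appears in the text are ever verified in full, so common words never trigger the expensive check.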
What you're asking for sounds like a full-text search task. There's a Python search package called Whoosh. @derek's corpus can be indexed and searched in memory like the following.

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser
from whoosh import fields


texts = [
    "Here's a sentence with dog and apple in it",
    "Here's a sentence with dog and poodle in it",
    "Here's a sentence with poodle and apple in it",
    "Here's a dog with and apple and a poodle in it",
    "Here's an apple with a dog to show that order is irrelevant"
]

# A schema with one indexed text field; stored=True keeps the original
# text so it can be displayed in the results.
schema = fields.Schema(text=fields.TEXT(stored=True))

# Build the index entirely in memory. create_index() already returns a
# usable index, so no separate open_index() call is needed.
storage = RamStorage()
index = storage.create_index(schema)

writer = index.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# The default query parser ANDs terms together, so 'dog apple' matches
# documents containing both words, in any order.
query = QueryParser('text', schema).parse('dog apple')
results = index.searcher().search(query)

for r in results:
    print(r)

This produces:

<Hit {'text': "Here's a sentence with dog and apple in it"}>
<Hit {'text': "Here's a dog with and apple and a poodle in it"}>
<Hit {'text': "Here's an apple with a dog to show that order is irrelevant"}>

You can also persist your index using FileStorage as described in How to index documents.
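A minimal sketch of the on-disk variant (assuming an indexdir directory; create_in and open_dir are from whoosh.index, and schema and texts are as above):

import os
from whoosh import index

# Create the directory for the on-disk index if needed.
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

writer = ix.writer()
for t in texts:
    writer.add_document(text=t)
writer.commit()

# Later, e.g. in another process, reopen the same index from disk.
ix = index.open_dir("indexdir")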

Mandibular answered 23/1, 2018 at 12:41 Comment(0)
This falls into the "naive" camp, but here's a method that uses sets as food for thought:

docs = [
    """ Here's a sentence with dog and apple in it """,
    """ Here's a sentence with dog and poodle in it """,
    """ Here's a sentence with poodle and apple in it """,
    """ Here's a dog with and apple and a poodle in it """,
    """ Here's an apple with a dog to show that order is irrelevant """
]

query = ['dog', 'apple']

def get_similar(query, docs):
    res = []
    query_set = set(query)
    for doc in docs:
        # A document matches if every query word appears among its
        # whitespace-separated tokens, in any order.
        if query_set.issubset(doc.split()):
            res.append(doc)
    return res

This returns:

[" Here's a sentence with dog and apple in it ", 
" Here's a dog with and apple and a poodle in it ", 
" Here's an apple with a dog to show that order is irrelevant "]

Of course, this still scans every document per query, but set membership tests are O(1) on average, so it's much, much quicker than repeated list scans.
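If the same documents are matched against many multi-word keywords, a natural refinement (a sketch; match_keywords and doc_sets are made-up names) is to tokenize each document once up front:

# Tokenize each document once, instead of once per query.
doc_sets = [set(doc.split()) for doc in docs]

def match_keywords(keywords, doc_sets):
    """Map each multi-word keyword to the indices of matching documents."""
    hits = {}
    for kw in keywords:
        kw_set = set(kw.split())
        hits[kw] = [i for i, s in enumerate(doc_sets) if kw_set <= s]
    return hits

print(match_keywords(['dog apple', 'poodle apple'], doc_sets))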


Part 2: Elasticsearch is a great candidate for this if you're willing to put in the effort and you're dealing with lots and lots of data.
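For illustration, a sketch with the official elasticsearch Python client (assumptions: a local cluster on the default port, an index named 'docs', and the 8.x client API; adjust for your setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index the example documents from above.
for i, doc in enumerate(docs):
    es.index(index="docs", id=i, document={"text": doc})
es.indices.refresh(index="docs")

# A match query with operator "and" requires all terms to be present,
# in any order, which mirrors the multi-word keyword requirement.
resp = es.search(index="docs", query={
    "match": {"text": {"query": "dog apple", "operator": "and"}}
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["text"])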

Germanous answered 19/1, 2018 at 0:58 Comment(0)
