Python Query Processing and Boolean Search
Asked Answered
S

3

7

I have an inverted index (as a dictionary) and I want to take a boolean search query as an input to process it and produce a result.

The inverted index is like this:

{
Test : { FileName1: [213, 1889, 27564], FileName2: [133, 9992866, 27272781, 78676818], FileName3: [9211] },
Try : { FileName4 ...
.....
}

Now, given a boolean search query, I have to return the result.

Examples:

Boolean Search Query: test AND try The result should be all documents that have the words test and try.

Boolean Search Query: test OR try The result should be all documents that have either test or try.

Boolean Search Query: test AND NOT try The result should be all documents that have test but not try.

How can I build this search engine to process the given boolean search query?

Thanks in advance!

Sabatier answered 27/10, 2017 at 14:54 Comment(7)
Can you be more specific in the search part? Where is test and try going to be searched?Archduchy
@TimGivois In the inverted index. As for the first example, the program has to locate the postings of "test", and the postings of "try", and then take the intersection of them.Sabatier
This really would come much more naturally when done in SQL. Nevertheless the key question here is how do you expect to construct queries? Where and how do the AND/OR combos come from?Gradin
@Gradin I just saw your comment. The boolean operators come as an input from the user. Let us assume that I am looking for the documents having the words : Project AND Gutenberg. So the query is project AND Gutenberg.Sabatier
I see now. However, since your words is a single dictionary, it looks as if you have no difference between AND and OR, and both should act as OR. Ie. if I say "test OR try" or "test AND try" you want me to get the document that have dict key "test" and document that has dict key "try"?Gradin
Scratch that, I think I got it now. You want to get the list of your Filenames, right?Gradin
Yes, I want to get the filenames of documents that match the entered query. So if the entered query is "Test AND NOT Gnudiff". The result should be filenames of documents that contain the word test but don't contain Gnudiff.Sabatier
G
3

EDIT: I am retaining the first part of my answer, because if this WASN'T a school assignment, this would be in my opinion still a better way to go about the task. I replace the second part of the answer with update matching OP's question.

What you appear to want to do is to create a query string parser, which would read the query string and translate it into a series of AND/OR/NOT combos to return the correct keys.

There are 2 approaches to this.

  1. According to what you wrote that you need, by far the simplest solution would be to load the data into any SQL database (SQLite, for example, which does not require a full-blown running SQL server), load dictionary keys as a separate field (the rest of your data may all be in a single another field, if you don't care about normal forms &c), and translate incoming queries to SQL, approximately like this:

SQL table has at least this:

CREATE TABLE my_data(
dictkey text,
data text);

python_query="foo OR bar AND NOT gazonk"
sql_keywords=["AND","NOT","OR"]
sql_query=[]
for word in python_query.split(" "):
    if word in sql_keywords:
        sql_query+=[ word ]
    else:
        sql_query+=["dictkey='%s'" % word]

real_sql_query=" ".join(sql_query)

This needs some escaping and control checking for SQL injections and special chars, but in general it would just translate your query into SQL, which, when run against the SQL datbase would return the keys (and possibly data) for further processing.

  1. Now for the pure Python version.

What you need to do is to analyze the string you get and apply the logic to your existing Python data.

Analyzing the string to reduce it to specific components (and their interactions) is parsing. If you actually wanted to build your own fully fledged parser, there would be Python modules for that, however, for a school assignment, I expect you are tasked to build your own.

From your description, the query can be expressed in quasi BNF form as:

(<[NOT] word> <AND|OR>)...

Since you say that priority of is not relevant all, you can do it the easy way and parse word by word.

Then you have to match the keywords to the filenames, which, as mentioned in another answer, is easiest to do with sets.

So, it could go approximately like this:

import re

query="foo OR bar AND NOT gazonk"

result_set=set()
operation=None

for word in re.split(" +(AND|OR) +",query):
    #word will be in ['foo', 'OR', 'bar', 'AND', 'NOT gazonk']

    inverted=False # for "NOT word" operations

    if word in ['AND','OR']:
        operation=word
        continue

    if word.find('NOT ') == 0:
        if operation is 'OR':
        # generally "OR NOT" operation does not make sense, but if it does in your case, you 
        # should update this if() accordingly
            continue

        inverted=True
        # the word is inverted!
        realword=word[4:]
    else:
        realword=word

    if operation is not None:
        # now we need to match the key and the filenames it contains:
        current_set=set(inverted_index[realword].keys())

        if operation is 'AND':
            if inverted is True:
                result_set -= current_set
            else:
                result_set &= current_set
        elif operation is 'OR':
            result_set |= current_set

    operation=None

print result_set

Note that this is not a complete solution (for example it does not include dealing with the first term of the query, and it requires the boolean operators to be in uppercase), and is not tested. However, it should serve the primary purpose of showing you how to go about it. Doing more would be writing your course work for you, which would be bad for you. Because you are expected to learn how to do it so you can understand it. Feel free to ask for clarifications.

Gradin answered 27/10, 2017 at 18:6 Comment(9)
It's in python not in sqlArchduchy
So what? Am I forbidden to suggest different approach? Creating a parser is not that trivial compared to moving data to SQL.Gradin
Hi Gnudiff. Thanks for passing by. I am actually required to solve this using Python. Regarding precedence rules, lets just ignore them. I firstly want to know how to read the string query and divide it into parts according to the number of operators.Sabatier
@Sabatier then you have it partially in my answer already. I will update it a bit to make it more clear.Gradin
Okay, I got what you want now, according to my comment on OP. That makes more sense, but means my answer needs to be rewritten. Too tired to do it now, if nobody answers by tomorrow I will try to update it.Gradin
Thanks man, I will be waiting for your answer. To make it more clear, I have about 12 text files of Gutenberg's plays. From those files, I have constructed an inverted index that contains words and their occurrences in different text files (each word is a dictionary key, each word could appear in different documents, and each document could have different appearances of the same word). Now, having the inverted index, the program should allow users to input a boolean search query to find filenames that satisfy the query by breaking the query into terms and processing them.Sabatier
Well, it would still make much more sense to put the data into a database file and let users run queries against it. I can only understand doing it in pure python if it is a school task or something similarGradin
@Gradin Yes, it is a course assignment and I am stuck in the middle.Sabatier
Have you measured how db performs better or worse than manual index file? I supposed using db in this case doesn't work well with large scale indexPaddle
L
3

Another approach could be an in-memory intersection of the posting lists (for your AND cases, you can enhance this for OR, NOT, etc).

Attached a simple merge algorithm to be performed on the posting lists, assuming that the lists are sorted (increasing doc_id order, this can be easily achieved if we index our terms correctly) - this will improve time complexity (O(n1+n2)) as we will perform linear-merge on sorted list and might stop earlier.

Now assume our positional inverted index looks like this: (similar to yours but with posting lists as lists and not dict's- this will be allow compression in future uses) positional index where it maps- String > Terms, while each term consists of (tf, posting list ([P1, P2,...])) and each Posting has (docid, df, position list). Now we can perform a simple AND to all of our postings lists iteratively:

def search(self, sq: BoolQuery) -> list:

    # Performs a search from a given query in boolean retrieval model,
    # Supports AND queries only and returns sorted document ID's as result:

    if sq.is_empty():
        return super().search(sq)

    terms = [self.index[term] for term in sq.get_terms() if term in self.index]
    if not terms:
        return []

    # Iterate over posting lists and intersect:
    result, terms = terms[0].pst_list, terms[1:]
    while terms and result:
        result = self.intersect(result, terms[0].pst_list)
        terms = terms[1:]
    return [p.id for p in result]

Now lets look at the intersection:

def intersect(p1: list, p2: list) -> list:

    # Performs linear merge of 2x sorted lists of postings,
    # Returns the intersection between them (== matched documents):

    res, i, j = list(), 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i].id == p2[j].id:
            res.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i].id < p2[j].id:
            i += 1
        else:
            j += 1
    return res

This simple algorithm can be later expanded when performing phrase search (edit the intersection to calculate slop distance, e.g: |pos1-pos2| < slop)

Lamellirostral answered 18/6, 2019 at 11:23 Comment(0)
A
0

Taking into account you have that inverted index and that is a dict with test and try as keys you can define the following functions and play with them:

def intersection(list1, list2):
    return list(set(list1).intersection(list2))
def union(list1, list2):
    return list(set(list1).union(list2))
def notin(list1, list2)
    return [filter(lambda x: x not in list1, sublist) for sublist in list2]

intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys()))
Archduchy answered 27/10, 2017 at 15:40 Comment(6)
Hi Tim. Ok that makes sense. But I need the code to be flexible to any given words as an input. Plus, it could have more than two operators like: test AND try AND people ....Sabatier
yes, you can create functions and construct the query with it, the hardest part is to figure out how to actually do the intersections. I think that was the question.Archduchy
😂😂 yes this is it. If you can just help me figure it out I would be thankful.Sabatier
Yes, the answer has the functions with the intersection, union, etc. For three character and you can: intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys())), etcArchduchy
@Lamellirostral Im not sure, it seems that someone wanted his answe to have more votes :(Archduchy
better implementation can be a simple linear merge (as we have sorted posting lists), i'll post my try on it as well. about the NOT part- why not using set difference operation?Lamellirostral

© 2022 - 2024 — McMap. All rights reserved.