Constructing a tree using Python

Asked 8/9, 2012 at 3:31 Answered 24/10, 2012 at 20:25

Solved python algorithm search-engine information-retrieval booleanquery

I am trying to implement a unranked boolean retrieval. For this, I need to construct a tree and perform a DFS to retrieve documents. I have the leaf nodes but I am having difficulty to construct the tree.

Eg: query = OR ( AND (maria sharapova) tennis)

Result:

      OR
     |   |
     AND tennis
     | | 
  maria sharapova

I traverse the tree using DFS and calculate the boolean equivalent of certain document ids to identify the required document from the corpus. Can someone help me with the design of this using python? I have parsed the query and retrieved the leaf nodes for now.

EDIT: I am new here, so apologies for lacking clarity. I am basically trying to build a very naive search engine. So, the user enters any boolean query like: OR ( AND (maria sharapova) tennis). I have a corpus of wikipedia documents that gets displayed to the user depending on the query you type.

Till now, I have parsed the query to retrieve individual operators (like OR, AND, etc). And, the individual search terms(maria, tennis, etc). The code for parsing is just a function that would basically group all the operators and query terms as typed. i.e (maria sharapova), (tennis), OR, AND. I parsed this function this way so as to create a tree bottom-up. Now, using the inverted lists for the corresponding keywords like tennis, maria, sharapova, etc I perform the boolean operation with the inverted list to get a certain "documentid". This documentid is then passed to an API which would then retrieve the correct wikipedia page.

Just to explain the topic in more detail, please refer to this document for more information about my problem in hand: http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/boolean.pdf

Derbent answered 8/9, 2012 at 3:31 Comment(5)

Welcome to Stack Overflow! What have you tried? What worked? What didn't? Throw us a bone! – Disorderly 8/9, 2012 at 3:34

Perhaps you could share the code you used for parsing the query so far? – Concubinage 8/9, 2012 at 3:40

I would look at the AST module or dalkescientific.com/Python/python4ply.html and use a proper lexer/grammar that will parse this stuff for you... – Fremitus 8/9, 2012 at 3:41

What exactly do you need help with? Your task seems to have several parts, and it's not clear which ones you need assistance with and which you can do on your own. Do you need help parsing the query string? Do you need help designing the tree data structure that you build from the query? Do you need help running your depth first search on the query tree? "All of the above" is also OK, as long as you're clear about it. – Tamathatamaulipas 9/9, 2012 at 1:21

Thanks, All of the above. since I am unclear if my approach is correct in the first place. So helping me understand the design aspects would be useful. (I am unhappy with my current design and I want to redesign from scratch) – Derbent 9/9, 2012 at 2:23

First if you want a fancy syntax of your query language to support many operators, range queries or wildcard, you definitely should refer to lex/yacc solution as Joran pointed out.

Second, from the lecture slides you posted, I think you care more about how to implement the boolean query model than constructing a tree in python. Then you don't need to worry about the query itself. Suppose the query is well formatted as below:

"OR ( AND ( maria sharapova ) tennis )"

That is, you have space between operator (AND/OR) and keywords/parenthesis. Then you only need two stacks (without using DFS on tree-data-structure) to parse the query and get the combined search results from them.

The first stack holds the operators (AND/OR) and the operands (e.g., maria, tennis). You treat the parenthesis as open/close condition to process the current operands on top of the stack. You only process the search operation when you see a close parenthesis ).

The second stack holds the current search results.

Let's do a step-by-step demo using the above example. You scan the query from left to right.

Step 1. You push the "OR" operator into the stack.

+               +
+               +
+    OR         +
+ + + + + + + + +

Step 2. You see an open parenthesis (, just skip it.

Step 3. You push the "AND" operator into your stack. Now the stack looks like below:

+               +
+    AND        +
+    OR         +
+ + + + + + + + +

Step 4. You skip another (.

Step 5. You push "maria" to your stack.

Step 6. You push "sharapova" to your stack. Now the stack looks like below:

+   sharapova   +
+    maria      +
+    AND        +
+    OR         +
+ + + + + + + + +

Step 7. You see a close parenthesis ). Now it's time to do the first operation. You pop all items on top of the stack until you see an operator. Pop the operator as well to get the current operator. Now you process the search for "sharapova" and "maria" separately and combine the search results using the operator "AND". Assume for "maria", you get 3 doc ids: [1, 2, 3]. For "sharapova", you get another 5 doc ids: [2, 3, 8, 9, 10]. After you combine the results with "AND", you have [2,3] in the
second stack which holds the current search results. The current situation looks like below: on the right it is the result buffer.

+               +           +         +
+               +           +         +
+               +           +         +
+    OR         +           +  [2,3]  +
+ + + + + + + + +           + + + + + +

Step 8. You push tennis to the stack.

+               +           +         +
+               +           +         +
+    tennis     +           +         +
+    OR         +           +  [2,3]  +
+ + + + + + + + +           + + + + + +

Step 9. You see another close parenthesis ). Again, you pop all items on top of the stack until you see "OR". You start search using "tennis" and suppose you get resulting doc ids: [3, 5, 7]. At this time, you combine this result with the previous results in your buffer using operator "OR", so that finally get doc ids: [2,3,5,7].

My sample code is here. Note I simulate the searching and returning doc ids by randomly sample len(word) integers.

The printout from the code shows step-by-step, how the system looks like before processing the current query item (1st column), the status of the result buffer(2nd column), the items in stack (3rd column) and the immediate search result (4th column).

Ye answered 24/10, 2012 at 20:25 Comment(0)

A list of lists is a natural way to represent trees in Python (without creating classes):

>>> query = ['OR', ['AND', 'maria', 'sharapova'], 'tennis']

Astrosphere answered 8/9, 2012 at 6:32 Comment(1)

Thanks Raymond. But, how would you traverse the tree? These queries are different usually. So, once I construct this list of lists if i need to substitute the 'maria' and 'sharapova' with corresponding inverted lists and evaluate the boolean expression, is it possible? – Derbent 8/9, 2012 at 23:38

Recommended topics

Hot tags