Boolean text search in Python [closed]

I'm looking for an existing module (or modules) that would let me write basic boolean queries for matching and searching text, WITHOUT writing my own parser etc.

For example,

president AND (ronald OR (george NOT bush))

would match True against "the president ronald ragen", "the president ronald ragen and bush", and "max bush was not a president",

but False against "george bush was a president" and "i don't know how to spell ronald ragen".

(So far I have found Booleano, which seems a bit overkill but could do the task. However, its user group is inactive, and I couldn't figure out from the documentation what to do.)

thanks

Edit: the exact style or grammar is not critical. My aim is to give non-technical users the ability to search certain texts a bit beyond keyword search.

Bobsleigh answered 11/2, 2010 at 17:53 Comment(5)
+1 for “ragan” … and for the question. ;) - Cordwain
By the way, for the expression that you have, "max bush was not a president" should be false, I would think. - Jody
What's up with your NOT operator? Bit suss. - Toxin
@Matt Joiner: I didn't understand your question. Maybe it should be "(george AND NOT bush)"? Dunno... - Bobsleigh
Yeah, that's kind of what I was thinking. I dunno either. - Toxin

DISCLAIMER: I am the creator of the package presented below.

For the people who might come to this page: I built a package to do just that (still in beta).

pip install eldar

Your query would be translated into the following code:

from eldar import Query

eldar = Query('"president" AND ("ronald" OR ("george" AND NOT "bush"))')

print(eldar("President Bush"))
# >>> False
print(eldar("President George"))
# >>> True

You can use it on a pandas DataFrame as well; check the GitHub page for more info: https://github.com/kerighan/eldar

Urgent answered 18/10, 2019 at 12:42 Comment(8)
While this answer may look like spam to some casual observers/reviewers, I think it does actually meet the requirements for self-promotion (which is clearly stated). It also directly addresses the question asked. - Gender
Should I edit my answer and remove the GitHub link? I don't care about promotion, I just wanted to be transparent. - Urgent
You don't need to remove or change anything! Your post came up in review as "a new user's answer to an old question" and I put that comment in to help other reviewers. - Gender
I accepted this answer because it seems right. However, I haven't tested it. - Bobsleigh
@BerryTsakala Perhaps if you had tested it, you would have noticed that a literal copy-paste of this answer, or of the first "basic usage" example at the GitHub link, doesn't work properly. - Jabin
@Jabin Thanks for the bug report. There was a mistake in the latest version. No need to be condescending, I'm only trying to help the community. It's fixed in 0.0.6. Next time, post a little something on GitHub. - Urgent
@Urgent Thanks for the effort and the update, but the GitHub basic example is still not working properly: eldar(documents[2]) returns True. - Jabin
@okko Thanks for the comment. OK, so it works as expected, since "movies" is not "movie": by default match_word=True, so words must match exactly. It's a README mistake; I'll update it. Thanks for pointing it out. - Urgent
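
For readers puzzled by the "movies" versus "movie" exchange above, here is a minimal standard-library sketch of the difference between whole-word and substring matching. It is not eldar's implementation; the function names contains_word and contains_substring are made up for illustration, and the sketch only mirrors the behaviour described in the comment (exact word match by default).

```python
import re


def contains_word(text, word, ignore_case=True):
    """Return True if `word` occurs in `text` as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b anchors the match at word boundaries, so "movie" will not match "movies".
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags) is not None


def contains_substring(text, word, ignore_case=True):
    """Return True if `word` occurs anywhere in `text`, even inside a longer word."""
    if ignore_case:
        text, word = text.lower(), word.lower()
    return word in text


doc = "I watched both movies last night"
print(contains_word(doc, "movie"))       # False
print(contains_substring(doc, "movie"))  # True
```

Which behaviour is preferable depends on the users: whole-word matching avoids surprising hits inside longer words, while substring matching is more forgiving of plurals and suffixes.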

It would be pretty lucky to find a pre-existing library that happens to be ready to parse the example expression that you provided. I recommend making your expression format a bit more machine readable, while retaining all of its clarity. A Lisp S-expression (which uses prefix notation) is compact and clear:

(and "president" (or "ronald" "george" "sally"))

Writing a parser for this format is easier than for your format. Or you could just switch to Lisp and it will parse it natively. :)

Side note: I assume you didn't mean to make your "NOT" operator binary, right?
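
To illustrate why the prefix format is easy to handle, here is a rough sketch of a parser and evaluator for it in plain Python. The helper names (tokenize, parse, evaluate) are hypothetical, terms are matched as case-insensitive substrings, and "not" is treated as a unary operator; malformed input (for example a missing closing parenthesis) is not handled gracefully.

```python
import re


def tokenize(expr):
    # Tokens are parentheses, double-quoted strings, or bare words.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', expr)


def parse(tokens):
    """Parse one expression from the front of `tokens` (the list is consumed)."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return tok.strip('"')


def evaluate(node, text):
    """Evaluate a parsed expression against `text`."""
    if isinstance(node, str):
        return node.lower() in text.lower()  # substring match
    op, *args = node
    if op == "and":
        return all(evaluate(a, text) for a in args)
    if op == "or":
        return any(evaluate(a, text) for a in args)
    if op == "not":
        return not evaluate(args[0], text)
    raise ValueError(f"unknown operator: {op}")


query = parse(tokenize('(and "president" (or "ronald" (and "george" (not "bush"))))'))
print(evaluate(query, "the president ronald reagan"))  # True
print(evaluate(query, "george bush was a president"))  # False
```

Because every operator comes first inside its own parentheses, there is no precedence to resolve, which is what keeps the parser this short.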

Let answered 11/2, 2010 at 18:5 Comment(0)

You might want to take a look at the simpleBool.py code on this page that uses the pyparsing module. Otherwise, here's some simple code I wrote.

This isn't a module, but it might get you in the right direction.

def found(s, searchstr):
    # True if searchstr occurs anywhere in s (plain substring match)
    return searchstr in s

def booltest1(s):
    # president AND (ronald OR (george AND NOT bush))
    tmp = found(s, 'george') and not found(s, 'bush')
    return found(s, 'president') and (found(s, 'ronald') or tmp)

print(booltest1('the president ronald reagan'))  # True
print(booltest1('george bush was a president'))  # False

You can test other strings the same way. I used tmp because the line was getting too long.
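
For a more general routine, the hard-coded test above could be replaced by a small recursive-descent parser. The following is only a sketch under several assumptions: operators are the uppercase words AND, OR and NOT, terms are single words matched exactly (case-insensitively), and two adjacent operands are implicitly ANDed, so "george NOT bush" is read as "george AND NOT bush". All names (Parser, compile_query, matches) are invented for illustration.

```python
import re


def tokenize(query):
    # Tokens are parentheses or runs of non-space, non-parenthesis characters.
    return re.findall(r'\(|\)|[^\s()]+', query)


class Parser:
    """Recursive descent with precedence NOT > AND > OR."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_or(self):
        node = self.parse_and()
        while self.peek() == "OR":
            self.take()
            node = ("or", node, self.parse_and())
        return node

    def parse_and(self):
        node = self.parse_not()
        # Keep going until the end, an OR, or a closing parenthesis.
        while self.peek() not in (None, "OR", ")"):
            if self.peek() == "AND":
                self.take()  # explicit AND; otherwise AND is implied
            node = ("and", node, self.parse_not())
        return node

    def parse_not(self):
        if self.peek() == "NOT":
            self.take()
            return ("not", self.parse_not())
        return self.parse_atom()

    def parse_atom(self):
        tok = self.take()
        if tok == "(":
            node = self.parse_or()
            if self.take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok in (None, ")", "AND", "OR"):
            raise ValueError(f"unexpected token: {tok!r}")
        return ("term", tok.strip('"').lower())


def compile_query(query):
    parser = Parser(tokenize(query))
    tree = parser.parse_or()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token: {parser.peek()!r}")
    return tree


def matches(tree, text):
    words = set(re.findall(r"\w+", text.lower()))

    def ev(node):
        kind = node[0]
        if kind == "term":
            return node[1] in words
        if kind == "not":
            return not ev(node[1])
        if kind == "and":
            return ev(node[1]) and ev(node[2])
        return ev(node[1]) or ev(node[2])  # kind == "or"

    return ev(tree)


tree = compile_query('president AND (ronald OR (george NOT bush))')
print(matches(tree, "the president ronald ragen"))    # True
print(matches(tree, "george bush was a president"))   # False
print(matches(tree, "max bush was not a president"))  # False
```

Under this reading of NOT, "max bush was not a president" evaluates to False, which agrees with the comment on the question rather than with the question's own expected result.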

Jody answered 11/2, 2010 at 18:7 Comment(1)
Thanks, but your example is not a general-purpose routine. simpleBool seems interesting, but requires a lot of work to adapt to the text domain. - Bobsleigh

I use Sphinx for full-text search from Python on my website. It has a simple syntax that supports boolean matching, but with operators rather than words. For example, your query would be president (ronald|(george -bush)).

Lucene has the same feature.
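
If non-technical users prefer typing AND/OR/NOT, a word-style query could be rewritten into the operator style before being handed to the search engine. The sketch below assumes an operator syntax in which adjacent terms are implicitly ANDed, | means OR, and a leading - negates a term or group; to_sphinx is a hypothetical helper, not part of any Sphinx client library, and it does no validation of the input.

```python
import re


def to_sphinx(query):
    """Rewrite an AND/OR/NOT query into operator style (assumed syntax)."""
    out = []
    negate = False
    for tok in re.findall(r'\(|\)|[^\s()]+', query):
        upper = tok.upper()
        if upper == "AND":
            continue  # AND is implicit between adjacent terms
        if upper == "OR":
            out.append("|")
        elif upper == "NOT":
            negate = True  # applied to the next term or group
        else:
            out.append(("-" if negate else "") + tok)
            negate = False
    text = " ".join(out)
    # Tidy the spacing around parentheses.
    return text.replace("( ", "(").replace(" )", ")")


print(to_sphinx('president AND (ronald OR (george AND NOT bush))'))
# president (ronald | (george -bush))
```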

Levana answered 11/2, 2010 at 18:29 Comment(0)