Cassandra Full-Text Search
Asked Answered
C

7

17

Full-Text search in Cassandra;

I am fairly new to Cassandra, and wish to understand it more properly. I am attempting to perform a Full-Text search in Cassandra, but after some research I have found that there may not be a "simple" approach for this.. and I say maybe because the first page of Google hasn't said much of anything.

So I am trying to understand now instead, what is the best approach here.. This sort of lead me to take make up my own assumptions based on what I've learned so far about Cassandra, that is based on these two principals; a) design your tables based on your queries, rather than the data, and b) more-data is a good thing, as long as it is being used properly.

With that being said, I've come up with a couple of solutions I'd like to share, and also ask that if anyone has a better idea, please fill me on it before I commit to anything unreasonable/naive.

First Solution: Create a Column Family(CF), with two primary keys and an Index like so:

CREATE TABLE "FullTextSearch" (
"PartialText" text,
"TargetIdentifier" uuid,
"CompleteText" text,
"Type" int,
PRIMARY KEY ("PartialText","TargetIdentifier")
);
CREATE INDEX IX_FullTextSearch_Type "keyspace"."FullTextSearch" ("Type");

With the above table, I would need to insert rows for the text "Hello World" as follows:

BATCH APPLY;
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("H",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("He",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hel",000000000-0000-0000-0000-000000000,"Hello World",1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello Wor",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello Worl",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello World",000000000-0000-0000-0000-000000000,"Hello World",1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Wor",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Worl",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("World",000000000-0000-0000-0000-000000000,"Hello World",1);
END BATCH;

Basically, the above will satisfy the following wildcards/partialtext "%o W%", "Hello%", "Worl%"; However it will not satisfy partial words such as "%ell%" for "Hello", which I can feel alright about for now..... (OCD sorta kicks in here)

This approach sort of sucks for me because I would now have to delete/re-insert any time a save/name change occurs on the "TargetIdentifier";

The Second Solution, would be very similar only this time making use of wide-columns; where the table might look like:

CREATE TABLE "FullTextSearch" (
"TargetIdentifier" uuid,
"Type" int,
"CompleteText" text,
PRIMARY KEY("TargetIdentifier")
);

and now during a search something like:

SELECT * FROM "FullTextSearch" WHERE "He" = 1;

so that if the column exists, the respective rows are returned;

Third Solution: similar to the one above, only this time instead of using wide-columns we use a set column such as map for the partial texts, and perform a query like:

SELECT * FROM "FullTextSearch" WHERE "PartialTexts"['He'] = 1;

Anyways, I am all out of ideas, it is late, and I can only hope for a great response! Please, let me know what I should be doing here... am I even on the right path?

Consumable answered 21/7, 2014 at 5:18 Comment(1)
See also #3153864Fled
X
9

AFAIK Datastax Enterprise Search is the (commercial) successor of Solandra.

Cassandra 2.0 supports so called "custom secondary indexes". Custom secondary indexes are Java code. Your own implementation has to implement the abstract class org.apache.cassandra.db.index.SecondaryIndex (See http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_index_r.html)

I'm not sure whether implementations exist for Elasticsearch or Solr.

I would not recommend to code all the weird full text search logic like stemming, multiple/exotic language support or even geo spatial stuff.

But SecondaryIndexwould be a good point to start integrating your favorite search engine.

Xavierxaviera answered 23/7, 2014 at 20:57 Comment(0)
P
2

Use elassandra which comes elasticsearch as a plugin in cassandra. An example can be found from here

Pascia answered 19/7, 2019 at 22:28 Comment(0)
M
1

If your dataset is relative small you can simply use a inmemory instance of lucene, update the index at a set interval and you are ready to go.

Mcadoo answered 21/7, 2014 at 8:55 Comment(0)
P
0

Check out SOLANDRA (former Lucandra)

But I think Solandra is not being actively developed any more, the author moved to Datastax and continued his work there.

So You can also take a look at Datastax Enterprise Search

There are some limitation also, look at DistributedSearch

Plaintive answered 21/7, 2014 at 6:35 Comment(0)
T
0

The very basic thing about cassandra is if you want to use where clause for filtration of records that column is either primary key or you have to assign index to it, so what i can see is you have given primary key to "TargetIdentifier" field and index to "Type" and using "CompleteText" in where clause so this may not work..

Assign secondary index to "CompleteTex" and check whether you are getting desired output or not.

Trevino answered 21/7, 2014 at 6:42 Comment(0)
J
0

A couple other options you have: Stratio Lucene Plugin. This uses Lucene for implementing a native secondary index.

You also have SSTable Attached Secondary Index (SASI) available to use for free text searching.

Be forewarned that both of these strategies use locally distributed indexes such that queries will not be very performant since searches will end up being broadcast across the entire cluster. For SASI, you can avoid this if you can use a partition key as part of your query.

Julenejulep answered 7/10, 2017 at 0:21 Comment(0)
A
0

Use Solr for fullText search Cassandra is not good for fullText. 1 Db Cassandra for Archive 2 Solr for full text search

Aliber answered 29/4, 2022 at 8:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.