Cassandra CQL searching for element in list
Asked Answered
O

1

8

I have a table that has a column of list type (tags):

CREATE TABLE "Videos" (
    video_id UUID,
    title VARCHAR,
    tags LIST<VARCHAR>,
    PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);

I have plenty of rows containing various values in the tags column, ie. ["outdoor","funny cats","funny mice"].

I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?

Odine answered 5/11, 2014 at 13:59 Comment(0)
B
22

To directly answer your question, yes there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (while adding a definition for upload_timestamp timeuuid) and put some values in it.

aploetz@cqlsh:stackoverflow> SELECT * FROM videos ;

 video_id                             | upload_timestamp                     | tags                                          | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
 2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d |             ['sci-fi', 'action', 'adventure'] |                 Star Wars
 ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d |                               ['documentary'] | The Witches of Whitewater
 15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] |              Pulp Fiction

(3 rows)

Next, I'll create a secondary index on the tags column:

aploetz@cqlsh:stackoverflow> CREATE INDEX ON videos (tags);

Now, if I want to query the videos that contain the tag "action," I can accomplish this with the CONTAINS keyword:

aploetz@cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';

 video_id                             | upload_timestamp                     | tags                                          | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
 2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d |             ['sci-fi', 'action', 'adventure'] |    Star Wars
 15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction

(2 rows)

With this all being said, I should pass along a couple of warnings:

  • Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you are expecting to have to query by tag often, then the right way to solve this would be to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag,video_id)
  • You don't need the double-quotes in your table name. In fact, having it in quotes may cause you problems (ok, maybe minor irritations) down the road.
Ballenger answered 5/11, 2014 at 14:55 Comment(2)
It looks like a perfect solution for my problem. Unfortunately I cannot use the elegant solution with index on tag column because DataStax Enterprise still uses Cassandra 2.0.10.x. Looks like I will need to go with the query table solution. Anyway, thank you for a very detailed answer. It fixed my problem and I could learn something.Odine
Great! contains on a list helps me a lot.Hayseed

© 2022 - 2024 — McMap. All rights reserved.