Cassandra CQL range query rejected despite equality operator and secondary index
Asked Answered
G

1

4

From the table schema below, I am trying to select all pH readings that are below 5.

I have followed these three pieces of advice:

  1. Use ALLOW FILTERING
  2. Include an equality comparison
  3. Create a secondary index on the reading_value column.

Here is my query:

select * from todmorden_numeric where sensor_name = 'pHradio' and reading_value < 5  allow filtering;

Which is rejected with this message:

Bad Request: No indexed columns present in by-columns clause with Equal operator

I tried adding a secondary index to the sensor_name column and was told that it was already part of the key and therefore already indexed.

I created the index after the table had been in use for a while - could that be the problem? I ran "nodetool refresh" in the hope it would make the index available but this did not work. Here is the output of describe table todmorden_numeric :

CREATE TABLE todmorden_numeric (
  sensor_name text,
  reading_time timestamp,
  reading_value float,
  PRIMARY KEY ((sensor_name), reading_time)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='Data that suits being stored as floats' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX todmorden_numeric_reading_value_idx ON todmorden_numeric (reading_value);
Goddaughter answered 22/7, 2014 at 17:52 Comment(0)
A
9

Cassandra allows range search only on:

a) Partition Key only if ByteOrderPartitioner is used (default now is murmur3).

b) any single clustering key ONLY IF any clustering keys defined BEFORE the target column in the primary key definition are already specified by an = operator in the predicate.

They don't work on secondary indices.

Consider the following table definition:

CREATE TABLE tod1 (name text, time timestamp, 
    val float, PRIMARY KEY (name, time));

You CAN'T do a range on the val in this case.

Consider this one:

CREATE TABLE tod2 (name text, time timestamp, 
    val float, PRIMARY KEY (name, time, val));

Then the following is valid:

SELECT * FROM tod2 WHERE name='X' AND time='timehere' AND val < 5; 

Kinda pointless, but this is not valid:

SELECT * from tod2 WHERE name='X' AND val < 5; 

It's not valid as you haven't filtered by a previous clustering key in the primary key def (in this case, time).

For your query, you may want to do this:

CREATE TABLE tod3 (name text, time timestamp, 
    val float, PRIMARY KEY (name, val, time));

Note the order of columns in the primary key: val's before time.

This will allow you to do:

SELECT * from tod3 WHERE name='asd' AND val < 5;

On a different note, how long do you intend to hold data? How frequently do you get readings? This can cause your partition to grow quite large quite quickly. You may want to bucket it readings into multiple partitions (manual sharding). Perhaps one partition per day? Of course, such things would greatly depend on your access patterns.

Hope that helps.

Aron answered 23/7, 2014 at 17:18 Comment(3)
Also worth noting that you can do a range query based on the timestamp and load the values into memory, and aggregate client side. Another option would be to use something like Spark to do post aggregation.Aron
Thanks for the explanation. I had no idea Cassandra was so restrictive regarding what queries you can run. I'm from an SQL background (aren't we all?) and have a lot to learn! I picked Cassandra because it was recommended for storing time series data. I think it's time for an architectural review ... I'll look into your suggestions about sharding also.Goddaughter
Cassandra's very flexible in terms of use case, but you do need to do a bit of data modelling. When dealing with cassandra, you think of your queries and access patterns first, rather than having data and querying it arbitrarily (which even in SQL land, can be quite detrimental to performance). It's quite an interesting challenge, but yes, there's a bit to learn. Have fun :)Aron

© 2022 - 2024 — McMap. All rights reserved.