what does `create index` do in cassandra tables?
Asked Answered
P

3

8

consider this example:

create table bite (
      id varchar PRIMARY KEY,
      feedid varchar,
      score bigint,
      data varchar
  );

create index bite_feedid on bite (feedid);
create index bite_score on bite (score);

I am not sure what the last two lines create index.. do? why is it important? Does it create a new table? If so, how can I look up using that?

Thanks

Pe answered 25/7, 2014 at 22:27 Comment(0)
G
7

A secondary index creates a new table using the indexed column as primary key. Advantages of this approach is that your write/delete operations on a table will be automatically translated into multiples operations, you don't have to care about it. Now that Cassandra support logged batches it may not seem a big advantage but in Cassandra 0.7 ... 1.1 was a big stuff.

Secondary indexes should not be used when the query on the index will retrieve always one result (eg: putting secondary index on a uuid).

A good feature of s.i. is that you can both query a single column without knowing anything of the primary key and combine part of the primary key with a secondary index (using AND operator).

You can't perform WHERE clause with multiple secondary indexes combined in AND.

HTH, Carlo

Gardal answered 26/7, 2014 at 7:9 Comment(8)
can you help with this: #24965173Pe
can you give an example to illustrate this point: Secondary indexes should not be used when the query on the index will retrieve always one result (eg: putting secondary index on a uuid).Pe
Imagine you have such table: UserTweets(user_id uuid, tweet_id uuid, user_name text, tweet_text text, primary KEY (user_id, tweet_id)); -- where user_id and tweet_id are unique identifier. You might want to perform query only on tweet_id without knowing user_id, to do so you should add a secondary index on this. But performing a secondary index query on tweet_id will ALWAYS return a single row, due to nature of information. If for instance you put an index on user_name you possibly retrieve multiple rows.Gardal
what is wrong if I retrieve multiple rows on index? It is possible for a user to have many tweets..Pe
It is not wrong, it is correct. What should be avoided is a secondary index that always return a single value. So in my example before the index on the user_name may retrieve all tweets of a user (possibly many rows) -- and this is good. Contrariwise the index on tweet_id will ALWAYS return a single Row, and this is a deprecated behaviour.Gardal
can you help with this post: #25087148Pe
I'm sorry but could you help with this #29660064Rabi
@Carlo Bertuccini: What if you need to create a secondary index that returns a single row? Why should this be avoided? In you example above, what if I indeed want to query by tweed_id. How else could I achieve this?Limpet
M
6

create index creates a secondary index for the table. In cassandra, data is stored in partitions across nodes - one partition corresponds to one partition key - which is the first key of the primary key. Remaining keys in the primary key constitute the clustering keys. For example, if you had the following:

CREATE TABLE foo.people ( id int, name text, age int, job text, PRIMARY KEY (id, name, job) )

id would be the partition key, and name and job would be the clustering keys.

Data in a partition is stored in order of the clustering keys. When querying with filters, you specify a partition key, and then you can filter down based on clustering keys. For multiple clustering keys, you must specify previous clustering in order to use a particular one. For example, in the mentioned scenario, you can do

where id = 2 and name = 'john' and job = 'dev' or where id = 2 and name = 'john'

but not where id = 2 and job = 'dev' as name appears before job in the clustering key.

You can't do a filter on age as it's not part of a key. This is where the secondary index comes in. If you then do: create index blah on people(age)

you will be allowed to do this: select * from people where age = 45;

This can potentially be expensive as it will query across your cluster. The following though, can be efficient: select * from people where id=2 and age = 45;

This is useful for time series or other wide row formats.

Queries on secondary indices are restrictive - you can't do range queries for example - you're limited to = checks.

Secondary indices in cassandra can save you the hassle of maintaining index tables yourself, and are more efficient than if you'd done so manually. They are eventually consistent (your writes won't wait for indices to be updated to return success) and currently, index info for a node's data is stored locally.

Lastly, you can find the indexes currently in place from the "IndexInfo" table in the system keyspace.

Hope that helps.

Methodist answered 25/7, 2014 at 23:32 Comment(0)
B
2

Usually in traditional databases creating index will use a data structure for example say HashMap whose keys will be the indexed column and the value points to the actual row in the table . So that it allows the query to fetch results based on the index key in approximately O(1).

How is the index created? Each key in the indexed column is hashed using a hashing function wich will return a value and it is used as index.

In Cassandra database since the data (ie) a particular column itself is distributed it used special mechanism to achieve the above indexing.

Indexing means fast retrieval or fast read. But the caveat is too much of indexing also leads to its bad things like collisions in the indexed keys.

Benildas answered 25/7, 2014 at 22:35 Comment(2)
is the index another table? If it datastructure like hashMap it must be persisted somewhere to look up? how would one access it? can you give an example?Pe
Yes the index is another table. You can access all the indexes from the system keyspace. I do not have an example. But this would really help you #21093024Benildas

© 2022 - 2024 — McMap. All rights reserved.