Paging large resultsets in Cassandra with CQL3 with varchar keys

I’m working on updating old Thrift-based code to CQL3.

One part of the code is walking through the entire dataset of a table consisting of 20M+ rows. This part was initially crashing the program due to memory usage, so I created a RowIterator class which iterated through the column family using TokenRanges (and Hector).

When trying to rewrite this using CQL3, I’m having trouble paging through the data. I found some info over at http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html, but when trying this code for the first "page"

resultSet = session.execute("select * from " + TABLE + " where token(key) <= token(" + offset + ")");

I get the error

com.datastax.driver.core.exceptions.InvalidTypeException: Invalid type for value 0 of CQL type varchar, expecting class java.lang.String but class java.lang.Integer provided

Granted, the example at the link uses numerical keys. Is there a way to do this with varchar (UTF8Type) keys?

It seems there is now built-in functionality for this (https://issues.apache.org/jira/browse/CASSANDRA-4415), but I can’t find examples that get me going. Besides, I have to solve it for Cassandra 1.2.9 for now.

Reactor answered 12/5, 2014 at 13:45 Comment(0)

The easy answer is to upgrade to Cassandra 2.0.x and use the new built-in paging functionality. To get it done on Cassandra 1.2, though, you are on the right path. Your syntax should work; if you run the same query in cqlsh, do you get the same error? When paging like this it is best to use ">" as in the example, and that might be the issue. Start with select * from table limit 100, then continue with select * from table where token(key) > token('last key') limit 100.

Also I would try it with a prepared statement. The string manipulations may be doing something funny to the offset.
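To make the two queries above concrete, here is a minimal sketch of how the query strings could be built in Java. The class and method names (TokenPaging, firstPageQuery, nextPageQuery, quote) are made up for illustration and are not part of the DataStax driver. It assumes the partition key column is called key and is a varchar, so the last key has to appear as a quoted string literal with any embedded single quotes doubled.

```java
// Hypothetical helper; names are illustrative, not part of any driver API.
final class TokenPaging {
    private TokenPaging() {}

    // Turns a value into a CQL string literal: wraps it in single quotes
    // and doubles any single quote inside it.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // First page: no token restriction, just a LIMIT.
    static String firstPageQuery(String table, int limit) {
        return "SELECT * FROM " + table + " LIMIT " + limit;
    }

    // Following pages: continue after the token of the last key already seen.
    // Assumes the partition key column is named "key".
    static String nextPageQuery(String table, String lastKey, int limit) {
        return "SELECT * FROM " + table
                + " WHERE token(key) > token(" + quote(lastKey) + ")"
                + " LIMIT " + limit;
    }
}
```

Each resulting string would be passed to session.execute(...); after every page, remember the key of the last row returned and feed it into nextPageQuery until a page comes back empty. With a prepared statement, the quoted literal can instead be a bind marker bound to a String.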

Glochidiate answered 13/5, 2014 at 4:7 Comment(3)
Thanks for answering. Yes, we are upgrading to 2.0 soon, I've been told, but I have to make this work before that. I am using prepared statements; I only switched to strings while trying to get it working. The < was just for the first chunk; the next would use >. But I wasn't aware that a simple limit 100 would sort the results the same way as when using tokens. That makes the initial query simpler. - Reactor
And yes, I get the same error message in cqlsh. But when I tried your suggestion (... > token(lastkey) limit 100), I got it working! Sort of. It seems I am missing rows. I should point out that the primary key is a composite of key and column1. Using the CLI there is only one row per "key" (which is an IP address), with all data in columns, but using cqlsh there can be upwards of 10 "rows" for each IP. If the first chunk returns 2 rows from that IP and I use the IP for the next chunk, the remaining CQL "rows" (the columns) are discarded... - Reactor
That was probably not very clear. I'll accept this as the answer to my question (since I now have the select working) and create a new question for the current problem. Thanks! - Reactor
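On the missing rows described in the comments: one common way to handle a composite primary key (key, column1) is to finish the current partition before moving on to the next token. The sketch below only builds the two query strings. The class and method names are hypothetical, the column names key and column1 are taken from the comment above, and the use of ? bind markers assumes the queries are run as prepared statements.

```java
// Hypothetical helper; names are illustrative, not part of any driver API.
final class PartitionAwarePaging {
    private PartitionAwarePaging() {}

    // Step 1: fetch the remaining CQL rows of the partition the last page ended in.
    // Bind the last seen key and the last seen column1 value, in that order.
    static String restOfPartitionQuery(String table, int limit) {
        return "SELECT * FROM " + table
                + " WHERE key = ? AND column1 > ?"
                + " LIMIT " + limit;
    }

    // Step 2: once step 1 returns fewer rows than the limit, move on to the
    // partitions whose token is greater than that of the last seen key.
    static String nextPartitionsQuery(String table, int limit) {
        return "SELECT * FROM " + table
                + " WHERE token(key) > token(?)"
                + " LIMIT " + limit;
    }
}
```

The idea is to alternate between the two: run step 1 until the partition is exhausted, then step 2 for the next page, and repeat from the last row of that page.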
