Get distinct partition keys from C* table

Asked 24/12, 2015 at 15:10 Answered 27/12, 2015 at 15:16

Solved cassandra datastax-enterprise cqlsh

cqlsh doesn't allow nested queries so I cant export selected data to csv.. I'm trying to export the selected data (about 200,000 rows with a single column) from cassandra using:

echo "SELECT distinct imei FROM listener.snapshots;" > select.cql bin/cqlsh -f select.cql > output.txt

and it just stuck forever without any error, and the file isn't growing.

if I use strace on the last line I got many rows like:

select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)

and --debug give me only:

cqlsh --debug -f select.cql > output.txt

Using CQL driver: <module 'cassandra' from '/usr/share/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-2.5.1.zip/cassandra-driver-2.5.1/cassandra/__init__.py'>

what is wrong? Is there better way to get distinct partition keys from large C* table?

Hydrocellulose answered 24/12, 2015 at 15:10 Comment(4)

Try strace to see where it gets stuck. cqlsh is probably going to set a limit to your query by default so it might not be the best way to do this. – Nitrogen 24/12, 2015 at 15:17

@Nitrogen please see my edit :) – Hydrocellulose 24/12, 2015 at 15:35

Maybe try cqlsh --debug. I'm not sure why there aren't file descriptors in your strace select gist.github.com/tonyc/1384523 – Nitrogen 24/12, 2015 at 17:1

@Nitrogen please see my edit again. :/ – Hydrocellulose 24/12, 2015 at 17:9

I used capture:

cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;
---MORE---
---MORE---
---MORE---
---MORE---
.
.
.
cqlsh> 
cqlsh>

And press enter until it finished.

Even faster option is to use paging:

cqlsh> PAGING off
Disabled Query paging.
cqlsh> CAPTURE 'temp.csv'                                              
Now capturing query output to 'temp.csv'.
cqlsh> SELECT distinct imei FROM listener.snapshots;

It would immediately extract the data to the file (if you get a OperationTimedOut you should edit the timeout settings in cassandra.yaml).

I cant believe that it is the fasts way there is... I know I can export data using spark by using CassandraSQLContext but its not so fast when I need to create the rdd querying C* for distinct column out of very large table(2B rows~), and print them to file:

    val conf = new SparkConf().setAppName("ExtractDistinctImeis")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val connector = CassandraConnector(conf)
    val cc = new CassandraSQLContext(sc)

    val snapshots_imeis = cc.sql("select distinct imei from listener.snapshots").map(row => row(0).toString)

    val imeis = snapshots_imeis.collect 

    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit) {
        val p = new java.io.PrintWriter(f)
        try { op(p) } finally { p.close() }
    }

    printToFile(new File("/path/to/file.txt")) { p => imeis.foreach(p.println) }

It took 3.5 hours with spark! With capture I manage to get my file after 3 min/3 sec.

Hydrocellulose answered 27/12, 2015 at 15:16 Comment(0)

Usually you would use the cqlsh "COPY ... TO ..." command to export data from a table to a csv file.

See the documentation here.

I'm not sure why what you're trying is getting stuck. To debug I'd suggest not using DISTINCT and adding a LIMIT clause. It may be your table is large and so cqlsh is paging the results, but there is no one there to press return to get to the next page of results, so it waits forever.

Sack answered 26/12, 2015 at 21:46 Comment(7)

how can I use 'copy to' without nested queries? I want to get distinct partition keys of the table, not all of them, so- "COPY listener.snapshots (imei) TO 'temp.csv';" isn't enough. – Hydrocellulose 27/12, 2015 at 12:43

you were right about why it got stck- I tried with limit and it worked. Still does not know how to get distinct partition keys. – Hydrocellulose 27/12, 2015 at 13:30

I'd use "sort -u" to remove duplicates after you have created the csv file. – Sack 27/12, 2015 at 13:31

but what if the table is very large? exporting it all to csv and then sorting, is the only way to get the distinct partitions keys of it? there is no better and simple way to do it? – Hydrocellulose 27/12, 2015 at 13:40

It looks like cqlsh has an option to disable paging. Try adding "PAGING OFF" as the first line of your cqlsh command file. – Sack 27/12, 2015 at 13:44

Don't put parenthesis around OFF. It's just "PAGING OFF;" with no parenthesis. This should generate the message, "Disabled Query paging." – Sack 27/12, 2015 at 13:59

I edit my file to: "PAGING OFF; SELECT distinct imei FROM listener.snapshots;" and when I use it it shows "select.cql:3:OperationTimedOut: errors={}" so it didn't work. :/ any ideas? – Hydrocellulose 27/12, 2015 at 14:26

Recommended topics

Hot tags