select count(*) runs into timeout issues in Cassandra

Maybe it is a stupid question, but I'm not able to determine the size of a table in Cassandra.

This is what I tried:

select count(*) from articles;

It works fine if the table is small but once it fills up, I always run into timeout issues:

cqlsh:

  • OperationTimedOut: errors={}, last_host=127.0.0.1

DBeaver:

  • Run 1: 225,000 (7477 ms)
  • Run 2: 233,637 (8265 ms)
  • Run 3: 216,595 (7269 ms)

I assume that it hits some timeout and just aborts. The actual number of entries in the table is probably much higher.

I'm testing against a local Cassandra instance which is completely idle. I would not mind if it has to do a full table scan and is unresponsive during that time.

Is there a way to reliably count the number of entries in a Cassandra table?

I'm using Cassandra 2.1.13.

Unhappy answered 20/4, 2016 at 12:44 Comment(0)
19

As far as I can see, your problem is connected to the cqlsh timeout: OperationTimedOut: errors={}, last_host=127.0.0.1

You can simply increase it with these options (an example invocation follows the list):

 --connect-timeout=CONNECT_TIMEOUT
                       Specify the connection timeout in seconds (default: 5
                       seconds).
 --request-timeout=REQUEST_TIMEOUT
                       Specify the default request timeout in seconds
                       (default: 10 seconds).
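
For example, to run the count once with larger timeouts (a sketch using the asker's articles table; adjust the values to your data size):

cqlsh --connect-timeout=60 --request-timeout=3600 -e "SELECT count(*) FROM articles;"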
Astringent answered 21/4, 2016 at 11:4 Comment(3)
Thanks! Yes, with an increased timeout, SELECT returns the expected number of elements.Godparent
Increasing both timeouts didn't work for me :/ I set both to 100 seconds, but cqlsh failed after 16 seconds with ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses]. Using COPY TO per @PhilippClaßen's answer was the only way I could compute the row count.Postnatal
Increasing both timeouts did not work for me either. I also used COPY tablename TO '/dev/null'; to get the count. Maybe there is a specific format for providing --connect-timeout and --request-timeout?Isosteric
23

Here is my current workaround:

COPY articles TO '/dev/null';
...
3568068 rows exported to 1 files in 2 minutes and 16.606 seconds.

Background: Cassandra supports exporting a table to a text file, for instance:

COPY articles TO '/tmp/data.csv';
Output: 3568068 rows exported to 1 files in 2 minutes and 25.559 seconds

That also matches the number of lines in the generated file:

$ wc -l /tmp/data.csv
3568068
Unhappy answered 20/4, 2016 at 13:15 Comment(2)
The COPY command is essentially doing a full cluster scan, like COUNT(*) would, so this solution will not scale either. Like others have mentioned, you'll have to settle for an approximate row count to maintain speed and stability in your cluster. Options are: counter columns, HyperLogLog, or reading your table's metadata, which will give you an estimate of how many partitions there are.Tirrell
I'm OK with sacrificing scalability, as my intention was only to understand what is going on in my local development environment. I agree that it is not a solution that can be used in production.Godparent
13

Is there a way to reliably count the number of entries in a Cassandra table?

The plain answer is no. It is not a Cassandra limitation; counting unique items reliably is a hard challenge for any distributed system.

That's the challenge that approximation algorithms like HyperLogLog address.

One possible solution is to use a counter in Cassandra to count the number of distinct rows, but even counters can miscount in some corner cases, so you'll get a few percent of error.

Liven answered 20/4, 2016 at 13:26 Comment(0)
4

This is a good utility for counting rows that avoids the timeout issues that happen when running a large COUNT(*) in Cassandra:

https://github.com/brianmhess/cassandra-count

Concision answered 20/4, 2016 at 14:40 Comment(0)
1

The reason is simple:

When you're using:

SELECT count(*) FROM articles;

it has the same effect on the database as:

SELECT * FROM articles;

You have to query over all your nodes. Cassandra simply runs into a timeout.

You can change the timeout, but it isn't a good solution. (It's fine as a one-off, but don't rely on it in your regular queries.)

There's a better solution: make your client count the rows itself. You can create a Java app that counts your rows as you insert them and stores the result in a counter column in a Cassandra table, as sketched below.
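
A minimal sketch of that idea in CQL (the row_counts table and its column names are made up for illustration):

CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    row_count counter
);

-- whenever the application inserts an article, it also bumps the counter:
UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'articles';

-- reading the total is then a cheap single-partition query:
SELECT row_count FROM row_counts WHERE table_name = 'articles';

Keep in mind that counter updates are not idempotent, so retried writes can skew the total slightly.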

Interviewer answered 20/4, 2016 at 13:25 Comment(0)
0

You can use COPY to avoid the Cassandra timeout that usually happens on count(*).

Use this bash one-liner:

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'
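
The first sed picks the line of cqlsh's output that carries the export summary ("N rows exported to 1 files in ..."; the exact line number can vary between cqlsh versions), and the second sed strips everything after the leading number, leaving only the row count.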

Raila answered 23/8, 2018 at 11:56 Comment(0)
0

You can set the request timeout in seconds as a command-line option when starting DataStax cqlsh. The default value is 10 seconds.

$ cqlsh --request-timeout=3600
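
If you need the longer timeout every time, it can also be set persistently in cqlsh's configuration file instead of on each invocation. A sketch, assuming a reasonably recent cqlsh where the key is named request_timeout (2.1-era releases used client_timeout instead):

# ~/.cassandra/cqlshrc
[connection]
request_timeout = 3600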
Belenbelesprit answered 11/8, 2023 at 14:52 Comment(0)
