How to obtain number of rows in Cassandra table
A

11

66

This is a super basic question but it's actually been bugging me for days. Is there a good way to obtain the equivalent of a COUNT(*) of a given table in Cassandra?

I will be moving several hundreds of millions of rows into C* for some load testing and I'd like to at least get a row count on some sample ETL jobs before I move massive amounts of data over the network.

The best idea I have is to basically loop over each row with Python and increment a counter. Is there a better way to determine (or even estimate) the row count of a C* table? I've also poked around DataStax OpsCenter to see if I can find a row count there. If it's possible there, I don't see how.

Anyone else needed to get a count(*) of a table in C*? If so, how'd you go about doing it?

Adaiha answered 28/10, 2014 at 23:13 Comment(1)
see #1952343Rollway
R
75

Yes, you can use COUNT(*). Here's the documentation.

A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users table:

SELECT COUNT(*) FROM users;
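On large tables a single COUNT(*) often times out. A common workaround (a sketch, not part of this answer; the table and key names in the comments are hypothetical) is to split the Murmur3 token ring into subranges and sum one small COUNT(*) per range:

```python
MIN_TOKEN = -2**63      # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1

def token_subranges(n):
    """Split the full Murmur3 token ring into n contiguous (lo, hi] subranges."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    ranges = []
    lo = MIN_TOKEN
    for i in range(n):
        hi = MAX_TOKEN if i == n - 1 else lo + span
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Each subrange then becomes one small query, e.g. against a hypothetical
# table "users" with partition key "id":
#   SELECT COUNT(*) FROM users WHERE token(id) > %s AND token(id) <= %s
# and the total is the sum of the per-range counts.
ranges = token_subranges(16)
```

Each subrange query only scans a slice of the ring, so no single request has to walk the whole table before the timeout fires.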
Rodroda answered 29/10, 2014 at 0:4 Comment(6)
I'm curious to hear back how long your count takes. Maybe you can report back. :)Gladis
in terms of benchmarks, right now I'm running a 5 node cluster with SSDs and 32 GB of RAM. For a million rows, it's taking about a minute and a half to return that COUNT(*). I've got a lot of tweaking and fine-tuning to do. But in terms of where it is right now, "straight out of the box" if you will, it's definitely a bit on the slow side.Adaiha
So that seems to point back to the postrelational mantra - design for your queries. Create a table that keeps a counter, and bump the count whenever you add a row. Then the query will come back quickly. I'd expect it to be on the order of a few milliseconds.Gladis
If I needed to execute count(*)s against my tables often at all, I'd certainly agree with you. This is just sanity checking and control totals. "I sent it a million records... did all of them arrive?" kind of thingAdaiha
Generally speaking, running aggregate queries in Cassandra is not good practice even though it is technically supported. Also, if you end up with millions of rows this query will likely timeout. Using some of the nodetool commands below is generally a better approach.Lanalanae
I don't know of one.Rodroda
G
25

You can use COPY to avoid the read timeout that COUNT(*) usually hits on large tables:

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'

This exports only the partition key column to /dev/null (so nothing is actually written anywhere), then the first sed keeps the fifth line of cqlsh's output, which is the summary line that begins with the exported row count, and the second sed strips everything after the first space, leaving just the number.

Goingson answered 23/8, 2018 at 11:53 Comment(2)
Nice! This is helpfulDevitalize
@shubham can you please explain the code. What this code is doing. Is there any side-effect?Pentastich
U
17

nodetool tablestats can be pretty handy for quickly getting row estimates (and other table stats).

nodetool tablestats <keyspace.table> for a specific table

Unsteady answered 13/7, 2017 at 14:6 Comment(3)
I don't see an estimated count using tablestats. COPY seems a good option, but for larger tables (with little free storage left on disk) it's a bit difficult to manage space. Any other way?Carryingon
I believe originally I thought "Number of keys" was sufficient but I see now that it's referring to partition keys. You could possibly try using Memtable Cell Count if you know the column size. docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/…Unsteady
I don't see row estimates under nodetool tablestatsDiantha
N
14

You can use dsbulk count to retrieve the total row count of a table. I struggled with read timeouts on all of the commands mentioned above, and was finally able to obtain the count using the command below.

e.g.,

dsbulk count -k <keyspace_name> -t <table_name>

More information about dsbulk can be found here

Nuncio answered 26/6, 2020 at 18:44 Comment(2)
great tip! This was super helpful!Gat
For anyone looking to download: downloads.datastax.com/#bulk-loader or github.com/datastax/dsbulkCarver
B
10

You can also get some estimates from nodetool cfhistograms if you don't need an exact count (these values are estimates).

You can also use spark if you're running DSE.

Blasius answered 30/10, 2014 at 19:29 Comment(3)
How can you use Spark to make the count faster? For example, I have about a billion records and I am using Spark to count the number of rows. It took me 1 hr 30 mins to complete. Here are more details #40779436Sfax
There is no cfhistograms option now and the tablehistograms does not seem to offer much useful info in terms of number of rows.Radiotelegraphy
cfhistograms and tablehistograms are the same thing, like I said you get some general estimates, not a strict count.Blasius
R
5
$ nodetool settimeout read 360000
$ cqlsh -e "SELECT COUNT(*) FROM table;" --request-timeout=3600

The first command raises the server-side read timeout to 360,000 ms so the coordinator doesn't abort the count; the second raises cqlsh's client-side request timeout to 3,600 seconds so the shell keeps waiting for the result.
Richardo answered 26/10, 2018 at 9:32 Comment(1)
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.Lx
R
4

I've been working with Elasticsearch, and this can be an answer to this problem, assuming you are willing to use Elassandra instead of plain Cassandra.

The search system maintains many statistics and within seconds of the last updates it should have a good idea of how many rows you have in a table.

Here is a Match All Query request that gives you the information:

curl -XGET \
     -H 'Content-Type: application/json' \
     "http://127.0.0.1:9200/<search-keyspace>/_search/?pretty=true" \
     -d '{ "size": 1, "query": { "match_all": {} } }'

Where the <search-keyspace> is a keyspace that Elassandra creates. It generally is named something like <keyspace>_<table>, so if you have a keyspace named foo and a table named bar in that keyspace, the URL will use .../foo_bar/.... If you want to get the total number of rows in all your tables, then just use /_search/.

The output is a JSON which looks like this:

{
  "took" : 124,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 519659,                <-- this is your number
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo_bar",
        "_type" : "content",
        "_id" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284",
        "_score" : 1.0,
        "_source" : {
          "date" : "2018-12-29T00:06:27.710Z",
          "key" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284"
        }
      }
    ]
  }
}
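If you are scripting this check, the count can be pulled out of the response in a few lines. A sketch (the JSON here is a trimmed copy of the sample above; note that in Elasticsearch 7+ hits.total is an object with a value field rather than a bare number):

```python
import json

# Trimmed sample response, mirroring the output shown above.
response = json.loads("""
{
  "took": 124,
  "timed_out": false,
  "hits": { "total": 519659, "max_score": 1.0, "hits": [] }
}
""")

total = response["hits"]["total"]
# Elasticsearch 7+ returns {"value": N, "relation": "eq"} here instead,
# so a robust reader handles both shapes:
if isinstance(total, dict):
    total = total["value"]
print(total)  # 519659
```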

And in terms of speed, this takes milliseconds, regardless of the number of rows. I have tables with many millions of rows and it works like a charm. No need to wait hours or anything like that.

As others have mentioned, Elassandra is a distributed system, updated in parallel by many clients. The counters change quickly if you have many updates all the time, so the numbers you get from Elasticsearch are exact only if you prevent further updates long enough for the counters to settle. Otherwise it's always going to be an approximate result.

Radiotelegraphy answered 31/12, 2018 at 22:26 Comment(0)
F
3

For COUNT(*) on big tables, you can use Presto on top of Cassandra. I have tested it and it works well.

select count(*) from table1

For a worked example, search for: Cassandra question v3.11.3

Funda answered 9/5, 2019 at 10:37 Comment(0)
L
1

For those using the DataStax C# driver's LINQ component, you can use:

var t = new Table<T>(session);
var count = t.Count().Execute();
Lauter answered 28/8, 2018 at 2:9 Comment(1)
what is the equivalent for java driver? Would you know?Conventional
F
-2

Consider using ALLOW FILTERING with column constraints and then summing the values.

For example:

SELECT count(*)
FROM my_table
WHERE datetime_id >= '2020-09-16' ALLOW FILTERING;

SELECT count(*)
FROM my_table
WHERE datetime_id < '2020-09-16' ALLOW FILTERING;
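The split-and-sum idea can be scripted. A sketch (the table name, column, and driver session are assumptions, and as the comment below warns, filtered counts still scan data and can stress the cluster); the bucket helper is the pure-Python part:

```python
from datetime import date, timedelta

def day_buckets(start, end):
    """Yield (lo, hi) day boundaries covering [start, end), one per day."""
    cur = start
    while cur < end:
        nxt = cur + timedelta(days=1)
        yield (cur, min(nxt, end))
        cur = nxt

# Each bucket then becomes one smaller filtered count, e.g. against the
# hypothetical table from this answer:
#   SELECT COUNT(*) FROM my_table
#   WHERE datetime_id >= %s AND datetime_id < %s ALLOW FILTERING
# and the total is the sum of the per-bucket counts.
buckets = list(day_buckets(date(2020, 9, 14), date(2020, 9, 17)))
```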
Fortuitous answered 18/9, 2020 at 15:41 Comment(1)
please don't do that - you'll kill your cluster or it will simply timeoutMultinational
C
-5
nodetool cfstats | grep -A 1000 KEYSPACE

Replace KEYSPACE with your keyspace name to get the details of all tables in that keyspace.

Chesty answered 14/8, 2018 at 9:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.