How to obtain number of rows in Cassandra table
A

11

66

This is a super basic question but it's actually been bugging me for days. Is there a good way to obtain the equivalent of a COUNT(*) of a given table in Cassandra?

I will be moving several hundreds of millions of rows into C* for some load testing and I'd like to at least get a row count on some sample ETL jobs before I move massive amounts of data over the network.

The best idea I have is to basically loop over each row with Python and increment a counter. Is there a better way to determine (or even estimate) the row count of a C* table? I've also poked around DataStax OpsCenter to see if I can find a row count there. If it's possible there, I don't see how.

Anyone else needed to get a count(*) of a table in C*? If so, how'd you go about doing it?

Adaiha answered 28/10, 2014 at 23:13 Comment(1)
see #1952343Rollway
R
75

Yes, you can use COUNT(*). Here's the documentation.

A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users table:

SELECT COUNT(*) FROM users;
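On large tables a single COUNT(*) often times out. A common workaround (a sketch, not part of this answer; the table and key names in the comments are hypothetical) is to split the Murmur3 token ring into subranges and sum one small COUNT(*) per range:

```python
MIN_TOKEN = -2**63      # Murmur3Partitioner token range
MAX_TOKEN = 2**63 - 1

def token_subranges(n):
    """Split the full Murmur3 token ring into n contiguous (lo, hi] subranges."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    ranges = []
    lo = MIN_TOKEN
    for i in range(n):
        hi = MAX_TOKEN if i == n - 1 else lo + span
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Each subrange then becomes one small query, e.g. against a hypothetical
# table "users" with partition key "id":
#   SELECT COUNT(*) FROM users WHERE token(id) > %s AND token(id) <= %s
# and the total is the sum of the per-range counts.
ranges = token_subranges(16)
```

Each subrange query only scans a slice of the ring, so no single request has to walk the whole table before the timeout fires.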
Rodroda answered 29/10, 2014 at 0:4 Comment(6)
I'm curious to hear back how long your count takes. Maybe you can report back. :)Gladis
in terms of benchmarks, right now I'm running a 5 node cluster with SSDs and 32 GB of RAM. For a million rows, it's taking about a minute and a half to return that COUNT(*). I've got a lot of tweaking and fine-tuning to do. But in terms of where it is right now, "straight out of the box" if you will, it's definitely a bit on the slow side.Adaiha
So that seems to point back to the postrelational mantra - design for your queries. Create a table that keeps a counter, and bump the count whenever you add a row. Then the query will come back quickly. I'd expect it to be on the order of a few milliseconds.Gladis
If I needed to execute count(*)s against my tables often at all, I'd certainly agree with you. This is just sanity checking and control totals. "I sent it a million records... did all of them arrive?" kind of thingAdaiha
Generally speaking, running aggregate queries in Cassandra is not good practice even though it is technically supported. Also, if you end up with millions of rows this query will likely timeout. Using some of the nodetool commands below is generally a better approach.Lanalanae
I don't know of one.Rodroda
G
25

You can use COPY to avoid the read timeout that COUNT(*) usually hits on large tables:

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'

This exports only the partition key column to /dev/null (so nothing is actually written anywhere), then the first sed keeps the fifth line of cqlsh's output, which is the summary line that begins with the exported row count, and the second sed strips everything after the first space, leaving just the number.

Goingson answered 23/8, 2018 at 11:53 Comment(2)
Nice! This is helpfulDevitalize
@shubham can you please explain the code. What this code is doing. Is there any side-effect?Pentastich
U
17

nodetool tablestats can be pretty handy for quickly getting row estimates (and other table stats).

nodetool tablestats <keyspace.table> for a specific table

Unsteady answered 13/7, 2017 at 14:6 Comment(3)
I don't see an estimated count using tablestats. COPY seems a good option, but for larger tables (with little free storage left on disk) it's a bit difficult to manage space. Any other way?Carryingon
I believe originally I thought "Number of keys" was sufficient but I see now that it's referring to partition keys. You could possibly try using Memtable Cell Count if you know the column size. docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/…Unsteady
I don't see row estimates under nodetool tablestatsDiantha
N
14

You can use dsbulk count to retrieve the total row count of a table. I struggled with read timeouts on all of the commands mentioned above, and was finally able to obtain the count using the command below.

e.g.,

dsbulk count -k <keyspace_name> -t <table_name>

More information about dsbulk can be found here

Nuncio answered 26/6, 2020 at 18:44 Comment(2)
great tip! This was super helpful!Gat
For anyone looking to download: downloads.datastax.com/#bulk-loader or github.com/datastax/dsbulkCarver
B
10

You can also get some estimates from nodetool cfhistograms if you don't need an exact count (these values are estimates).

You can also use spark if you're running DSE.

Blasius answered 30/10, 2014 at 19:29 Comment(3)
How can you use Spark to make the count faster? For example, I have about a billion records and I am using Spark to count the number of rows. It took me 1 hr 30 mins to complete. Here are more details #40779436Sfax
There is no cfhistograms option now and the tablehistograms does not seem to offer much useful info in terms of number of rows.Radiotelegraphy
cfhistograms and tablehistograms are the same thing, like I said you get some general estimates, not a strict count.Blasius
R
5
$ nodetool settimeout read 360000
$ cqlsh -e "SELECT COUNT(*) FROM table;" --request-timeout=3600

The first command raises the server-side read timeout to 360,000 ms so the coordinator doesn't abort the count; the second raises cqlsh's client-side request timeout to 3,600 seconds so the shell keeps waiting for the result.
Richardo answered 26/10, 2018 at 9:32 Comment(1)
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.Lx
R
4

I've been working with Elasticsearch, and this can be an answer to this problem, assuming you are willing to use Elassandra instead of plain Cassandra.

The search system maintains many statistics and within seconds of the last updates it should have a good idea of how many rows you have in a table.

Here is a Match All Query request that gives you the information:

curl -XGET \
     -H 'Content-Type: application/json' \
     "http://127.0.0.1:9200/<search-keyspace>/_search/?pretty=true" \
     -d '{ "size": 1, "query": { "match_all": {} } }'

Where the <search-keyspace> is a keyspace that Elassandra creates. It generally is named something like <keyspace>_<table>, so if you have a keyspace named foo and a table named bar in that keyspace, the URL will use .../foo_bar/.... If you want to get the total number of rows in all your tables, then just use /_search/.

The output is a JSON which looks like this:

{
  "took" : 124,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 519659,                <-- this is your number
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "foo_bar",
        "_type" : "content",
        "_id" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284",
        "_score" : 1.0,
        "_source" : {
          "date" : "2018-12-29T00:06:27.710Z",
          "key" : "cda683e5-d5c7-4769-8e2c-d0a30eca1284"
        }
      }
    ]
  }
}
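If you are scripting this check, the count can be pulled out of the response in a few lines. A sketch (the JSON here is a trimmed copy of the sample above; note that in Elasticsearch 7+ hits.total is an object with a value field rather than a bare number):

```python
import json

# Trimmed sample response, mirroring the output shown above.
response = json.loads("""
{
  "took": 124,
  "timed_out": false,
  "hits": { "total": 519659, "max_score": 1.0, "hits": [] }
}
""")

total = response["hits"]["total"]
# Elasticsearch 7+ returns {"value": N, "relation": "eq"} here instead,
# so a robust reader handles both shapes:
if isinstance(total, dict):
    total = total["value"]
print(total)  # 519659
```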

And in terms of speed, this takes milliseconds, regardless of the number of rows. I have tables with many millions of rows and it works like a charm. No need to wait hours or anything like that.

As others have mentioned, Elassandra is a distributed system, updated in parallel by many clients. The counters change quickly if you have many updates all the time, so the numbers you get from Elasticsearch are exact only if you prevent further updates long enough for the counters to settle. Otherwise it's always going to be an approximate result.

Radiotelegraphy answered 31/12, 2018 at 22:26 Comment(0)
F
3

For COUNT(*) on big tables, you can use Presto on top of Cassandra. I have tested it and it works well.

select count(*) from table1

For a worked example, search for: Cassandra question v3.11.3

Funda answered 9/5, 2019 at 10:37 Comment(0)
L
1

For those using the DataStax C# driver's LINQ component, you can use:

var t = new Table<T>(session);
var count = t.Count().Execute();
Lauter answered 28/8, 2018 at 2:9 Comment(1)
what is the equivalent for java driver? Would you know?Conventional
F
-2

Consider using ALLOW FILTERING with column constraints and then summing the values.

For example:

SELECT count(*)
FROM my_table
WHERE datetime_id >= '2020-09-16' ALLOW FILTERING;

SELECT count(*)
FROM my_table
WHERE datetime_id < '2020-09-16' ALLOW FILTERING;
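The split-and-sum idea can be scripted. A sketch (the table name, column, and driver session are assumptions, and as the comment below warns, filtered counts still scan data and can stress the cluster); the bucket helper is the pure-Python part:

```python
from datetime import date, timedelta

def day_buckets(start, end):
    """Yield (lo, hi) day boundaries covering [start, end), one per day."""
    cur = start
    while cur < end:
        nxt = cur + timedelta(days=1)
        yield (cur, min(nxt, end))
        cur = nxt

# Each bucket then becomes one smaller filtered count, e.g. against the
# hypothetical table from this answer:
#   SELECT COUNT(*) FROM my_table
#   WHERE datetime_id >= %s AND datetime_id < %s ALLOW FILTERING
# and the total is the sum of the per-bucket counts.
buckets = list(day_buckets(date(2020, 9, 14), date(2020, 9, 17)))
```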
Fortuitous answered 18/9, 2020 at 15:41 Comment(1)
please don't do that - you'll kill your cluster or it will simply timeoutMultinational
C
-5
nodetool cfstats | grep -A 1000 KEYSPACE

Replace KEYSPACE with your keyspace name to get the details of all tables in that keyspace.

Chesty answered 14/8, 2018 at 9:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.