Counting all entries with KSQL

Asked 14/11, 2017 at 18:52 Answered 28/10, 2021 at 6:18

Solved apache-kafka apache-kafka-streams ksqldb

Is it possible to use KSQL to not only count entries of a specific column via GROUP BY but instead get an aggregate over all the entries that stream through the application?

I'm searching for something like this:

| Count all | Count id1 | count id2 |
| ---245----|----150----|----95-----|

Or more like this in KSQL:

[some timestamp] | Count all | 245   
[some timestamp] | Count id1 | 150   
[some timestamp] | Count id2 | 95   
.   
.   
.

Thank you
- Tim

Schiff answered 14/11, 2017 at 18:52 Comment(2)

so in pseudo-SQL, something like SELECT COUNT(*), COUNT(DISTINCT ID1), COUNT(DISTINCT ID2) FROM FOO? is that what you're after? – Lawson 14/11, 2017 at 21:21

Yes, that was the initial idea here. However the final idea was to simply get the whole amount of records so only count(*). @Hojjat answer seems like it's working perfectly fine! – Schiff 14/11, 2017 at 21:23

You cannot have both counts for the all and count for each key in the same query. You can have two queries here, one for counting each value in the given column and another for counting all values in the given column. Let's assume you have a stream with two columns, col1 and col2. To count each value in col1 with infinite window size you can use the following query:

SELECT col1, count(*) FROM mystream1 GROUP BY col1;

To count all the rows you need to write two queries since KSQL always needs GROUP BY clause for aggregation. First you create a new column with constant value and then you can count the values in new column and since it is a constant, the count will represent the count of all rows. Here is an example:

CREATE STREAM mystream2 AS SELECT 1 AS col3 FROM mystream1;
SELECT col3, count(*) FROM mystream2 GROUP BY col3;

Guarded answered 14/11, 2017 at 19:53 Comment(6)

Perfect, thank you very much, the second part was exactly what I was searching for! Thanks – Schiff 14/11, 2017 at 21:29

Note that if you have a topic with high data rate aggregation with one value for key will repartition the topic and send all of the messages to one partition which may result in scaling issues. So you should be careful in using this approach. – Guarded 14/11, 2017 at 21:33

Good to know, thank you very much. This will be mainly used in testing, so luckily not production use. – Schiff 14/11, 2017 at 21:39

Is there a ksql ticket for this rather obvious usecase of COUNT without grouping? – Crissman 14/10, 2019 at 18:20

I don't really like the solution as it creates an orphan stream. Maybe it could be adopted to at least drop the stream after doing the counting? – Crissman 14/10, 2019 at 18:23

will SELECT COUNT(*) FROM mystream GROUP BY 1 work? – Kary 5/11, 2019 at 8:48

This works too to get total rows count for a table:

ksql> SELECT COUNT(*) FROM `mytable` GROUP BY 1 EMIT CHANGES;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|KSQL_COL_0                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2298

Vernation answered 28/10, 2021 at 6:18 Comment(0)

you can do a extended describe on the stream or table to see the total messages

ksql> describe extended <stream or table name>

sample output

Local runtime statistics
------------------------
messages-per-sec:         0   total-messages:   2415888     last-message: 2019-12-06T02:29:43.005Z

Shafer answered 6/12, 2019 at 2:34 Comment(0)

Recommended topics

Hot tags