Apache Spark SQL is taking forever to count a billion rows from Cassandra?
I have the following code. I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864

code:

scala> val df = spark.sql("SELECT test from hello") // hello has about a billion rows and the test column is 1 KB

df: org.apache.spark.sql.DataFrame = [test: binary]

scala> df.count

[Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean precisely.

If I invoke spark-shell as follows:

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134

code:

val df = spark.sql("SELECT test from hello") // This has about a billion rows

scala> df.count


[Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?

Neither of these versions worked; Spark keeps running forever. I have been waiting for more than 15 minutes with no response. Any ideas on what could be wrong and how to fix it?

I am using Spark 2.0.2 and spark-cassandra-connector_2.11-2.0.0-M3.jar.

Rhizogenic answered 24/11, 2016 at 5:54 Comment(1)
With a billion rows you probably would have had to wait 2 or 3 hours to get the approximate answer. – Logarithm

Dataset.count is slow here because it is not very smart about external data sources. It rewrites the query as (which is good):

SELECT COUNT(1) FROM table

but instead of pushing the COUNT down to the source, it executes only:

SELECT 1 FROM table

against the source (so it will fetch a billion ones in your case) and then aggregates locally to get the final result. The numbers you see are task counters in the form (completed + running) / total: [Stage 0:=> (686 + 2) / 24686] means 686 tasks have finished and 2 are running, out of 24,686 in total.
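You can verify this from the physical plan in the same spark-shell session (using the hello table from your example; the exact plan text varies by version):

scala> spark.sql("SELECT COUNT(1) FROM hello").explain
// The plan ends in a full scan of the Cassandra source feeding a local
// HashAggregate, i.e. every row is shipped to Spark and the COUNT itself
// is not pushed down to Cassandra.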

There is an optimized cassandraCount operation on CassandraRDD:

sc.cassandraTable(keyspace, table).cassandraCount

More about server-side operations can be found in the documentation.
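For completeness, a minimal sketch of what that looks like in spark-shell (the keyspace name ks is an assumption; substitute your own):

scala> import com.datastax.spark.connector._ // brings cassandraTable into scope

scala> sc.cassandraTable("ks", "hello").cassandraCount // count is computed server-side by Cassandra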

Portwine answered 24/11, 2016 at 8:34 Comment(0)
