I'm using Spark with Java, and I have an RDD of 5 million rows. Is there a solution that allows me to calculate the number of rows in my RDD? I've tried RDD.count(), but it takes a lot of time. I've seen that I can use the function fold, but I couldn't find any Java documentation for it.
Could you please show me how to use it, or show me another way to get the number of rows of my RDD?
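For reference, here is my best guess at how fold would be used to count rows through the Java API (just a sketch, assuming an existing JavaRDD<String> named rdd; as far as I understand, fold still scans every partition, so I'm not sure it can be faster than count()):

// Map every row to 1L, then fold the ones together.
// fold needs a neutral zero value because it runs once per partition
// and once more to combine the per-partition results.
long total = rdd.map(x -> 1L).fold(0L, (a, b) -> a + b);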
Here is my code:

JavaPairRDD<String, String> lines = getAllCustomers(sc).cache();
JavaPairRDD<String, String> CFIDNotNull = lines.filter(notNull()).cache();
JavaPairRDD<String, Tuple2<String, String>> join = lines.join(CFIDNotNull).cache();
double count_ctid = (double) join.count(); // I want to get the count of these three RDDs
double all = (double) lines.count();
double count_cfid = all - CFIDNotNull.count();
System.out.println("********** :" + count_cfid * 100 / all + "% and now : " + count_ctid * 100 / all + "%");
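One alternative I came across is countApprox, which returns an estimate within a time budget instead of an exact count. A sketch against the lines RDD above (the 1000 ms timeout and 0.95 confidence are placeholder values I picked, not tuned):

import org.apache.spark.partial.BoundedDouble;
import org.apache.spark.partial.PartialResult;

// Ask for an estimate that is returned after at most 1000 ms,
// together with a 95% confidence interval around the true count.
PartialResult<BoundedDouble> approx = lines.countApprox(1000, 0.95);
double estimate = approx.initialValue().mean();
System.out.println("approximate row count: " + estimate);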
Thank you.