I am new to Apache Spark (version 1.4.1). I wrote a small program that reads a text file and stores its data in an RDD.
Is there a way to get the size of the data in an RDD?
This is my code:
import org.apache.spark.SparkContext

object RddSize {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    // One RDD element per line of the file.
    val peopleRdd = sc.textFile(FILE_LOCATION)
    // Keep only the lines that contain ",M,".
    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data.
  }
}
I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).
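If "size" means the number of records, something like count() may be what I need; a minimal sketch, reusing peopleRdd and newRdd from the code above:

// count() is an action: each call triggers a job over the data.
val rowsBefore: Long = peopleRdd.count() // lines in the file
val rowsAfter: Long  = newRdd.count()    // lines that contain ",M,"
println(s"rows before filter: $rowsBefore, rows after filter: $rowsAfter")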
"def count(): Long. Return the number of elements in the RDD.", from the Spark doc. – Food
newRDD: if you want the size, you'll force it to be evaluated and probably do too much work. – Food
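Building on that comment, a sketch under stated assumptions: caching the base RDD keeps the two count() jobs from re-reading the file, and if "size" means bytes rather than rows, summing per-line UTF-8 lengths gives a rough figure. The storage level and the byte-length fold are illustrative choices, not anything prescribed by the question.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object RddSizeSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "data size")

    // Cache the lines so the actions below reuse them instead of
    // re-reading the file for every job.
    val peopleRdd = sc.textFile("src/main/resources/employees.csv")
      .persist(StorageLevel.MEMORY_ONLY)
    val newRdd = peopleRdd.filter(_.contains(",M,"))

    println(s"rows before filter: ${peopleRdd.count()}")
    println(s"rows after filter:  ${newRdd.count()}")

    // Rough data size in bytes: sum of per-line UTF-8 lengths.
    // This ignores JVM object overhead, so it is only an estimate.
    val bytesAfter = newRdd.map(_.getBytes("UTF-8").length.toLong).fold(0L)(_ + _)
    println(s"approx. bytes after filter: $bytesAfter")

    sc.stop()
  }
}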