Find the size of data stored in an RDD from a text file in Apache Spark

I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD.

Is there a way I can get the size of the data in the RDD?

This is my code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row

object RddSize {

  def main(args: Array[String]) {

    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)

    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
}

I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).

Unwitnessed answered 24/8, 2015 at 9:52 Comment(12)
What do you mean by "size"? The number of rows in the RDD? If so, the count RDD function does that: def count(): Long returns the number of elements in the RDD, per the Spark docs.Food
@Paul No, here size doesn't mean the number of rows. Suppose my file is 100 MB; I get the file data into an RDD and apply a filter. The data must have shrunk in size. I want to get that size (in MB).Unwitnessed
Not sure you want to do that. RDDs are lazy, so nothing has been executed yet for newRDD. If you want the size, you'll force it to be evaluated and probably do too much work.Food
Thanks for replying @Paul, I am a newbie in Spark. I have no idea how I can force it to evaluate the size. Can you give some suggestions?Unwitnessed
You don't want to force it to evaluate the size, but to evaluate the answer overall. Why do you care what the size is?Food
@Paul I am working on an application which takes a file and filters some data. In response I want to show the size of the data which is actually useful after processing a large file (which can be 1 GB).Unwitnessed
Sorry, again: why? Unless that's the final answer and no more processing is needed, in which case write it to a file and look at the file size. I can't see why knowing the size it consumes in memory is of interest.Food
Actually that's not the final answer. It depends on the user: if they want to, @Paul, they can do some more filtering based upon the remaining size. For that I need to show the initial size and the remaining size. Once the transformations are done the data has to be stored in Spark SQL (this part I know; I am stuck at finding the size).Unwitnessed
Please explain further why you need the size for the remaining filtering. Spark works (and gains its performance) by being 'lazy' and only doing the actual computation when the result is needed. So, normally, there is no "remaining" size in the middle of a computation, because the computation hasn't been done yet. So please explain what you are trying to do overall - because very probably, knowing the size in the middle is not necessary, or will negatively impact performance.Food
@Paul Suppose I have a file of 100 MB. I read it into an RDD, then I do some transformations on the RDD. After that I create a table in Spark SQL from that RDD. I want to know the size of the data (increased or decreased) in the RDD just before creating the table in Spark SQL. I hope my issue is clear now. Thanks a lot for looking into my problem :) Please let me know whether it is even possible or not.Unwitnessed
No, it's not clear. WHY do you want to know the size? You keep telling me you want to know the size, but why? What are you going to do with the answer? If you want to know what the size of the table will be, in memory, before you create it, then no, I don't think it's possible. I'm not being deliberately obtuse; I have no idea how knowing the RDD is 100 MB or 90 MB or 110 MB will help you in any way.Food
A Spark SQL table compresses the data, so that won't be helpful; otherwise I would have gone with the in-memory size. Thanks a lot for the help :)Unwitnessed

There are multiple ways to get the RDD size:

1. Add a Spark listener to your Spark context:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// register the listener on the SparkContext (sc in the question's code)
sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    stageCompleted.stageInfo.rddInfos.foreach(info => {
      println("rdd memSize  " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    })
  }
})
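
The listener only reports stages that have actually completed, so as a minimal usage sketch (reusing the RDDs from the question) you still need to trigger an action; to my understanding, memSize/diskSize also stay at zero unless the RDD is persisted:

// minimal usage sketch, assuming the peopleRdd/newRdd from the question
newRdd.cache()   // persist so the completed stage reports a non-zero memSize
newRdd.count()   // any action works; it just forces the stage to run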

2. Save your RDD as a text file:

myRDD.saveAsTextFile("person.txt")

and then call the Apache Spark REST API:

/applications/[app-id]/stages
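
For example, a small sketch of calling that endpoint from Scala; the host and port assume the default local Spark UI, and [app-id] is a placeholder for the real application id shown in the Spark UI:

import scala.io.Source

// [app-id] is a placeholder: substitute the id shown in the Spark UI / REST API
val url = "http://localhost:4040/api/v1/applications/[app-id]/stages"
println(Source.fromURL(url).mkString) // JSON with per-stage input/output byte counts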

3. You can also try SizeEstimator:

val rddSize = SizeEstimator.estimate(myRDD)
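
Note that, as far as I understand, SizeEstimator.estimate(myRDD) measures the RDD object on the driver (its metadata and lineage), not the distributed data itself. A rough sketch for estimating the in-memory size of the data is to run the estimator inside each partition and sum the results:

import org.apache.spark.util.SizeEstimator

// rough estimate of the materialised size of the data itself;
// this evaluates the whole RDD, so it does real work
val estimatedBytes = myRDD
  .mapPartitions(iter => Iterator(SizeEstimator.estimate(iter.toArray)))
  .reduce(_ + _)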
Sand answered 27/8, 2015 at 19:6 Comment(3)
Thanks!! Will try these and let you know in case of any issues.Unwitnessed
@Sand: Nice explanation :)Janeyjangle
Thanks @RamPrasadG.Sand

I'm not sure you need to do this. You could cache the RDD and check its size in the Spark UI. But let's say that you do want to do this programmatically; here is a solution.

def calcRDDSize(rdd: RDD[String]): Long = {
  // map each line to its size in bytes (text files are read as UTF-8 by default)
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}

You can then call this function for your two RDDs:

println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")

This solution should work even if the file size is larger than the memory available in the cluster.

Menes answered 26/8, 2015 at 18:0 Comment(2)
I don't want to cache the RDD. It will pull the data into memory, which is not required.Unwitnessed
Hi, is it possible to convert it into a DataFrame as well?Mussolini

The Spark API docs say that:

  1. You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
  2. The RDD info includes memory and disk size: see the RDDInfo doc (a short sketch follows below)
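
A minimal sketch of how that might look for the question's newRdd; to my understanding getRDDStorageInfo only reports RDDs that have been persisted and materialised, so cache and run an action first:

// cache and trigger an action so the RDD shows up in the storage info
newRdd.cache()
newRdd.count()

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.name} (id ${info.id}): memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
}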
Wherefore answered 27/8, 2015 at 7:44 Comment(0)
