Find the size of data stored in an RDD from a text file in Apache Spark

I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD.

Is there a way I can get the size of the data in the RDD?

This is my code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator
import org.apache.spark.sql.Row

object RddSize {

  def main(args: Array[String]) {

    val sc = new SparkContext("local", "data size")
    val FILE_LOCATION = "src/main/resources/employees.csv"
    val peopleRdd = sc.textFile(FILE_LOCATION)

    val newRdd = peopleRdd.filter(str => str.contains(",M,"))
    // Here I want to find the size of the remaining data
  }
}

I want to get the size of the data before the filter transformation (peopleRdd) and after it (newRdd).

Unwitnessed answered 24/8, 2015 at 9:52 Comment(12)
What do you mean by "size"? The number of rows in the RDD? If so, the count RDD function does that: def count(): Long returns the number of elements in the RDD, per the Spark docs.Food
@Paul No, here size doesn't mean the number of rows. Suppose my file is 100 MB; I get the file data into an RDD and apply a filter. The data must have shrunk in size. I want to get that size (in MB).Unwitnessed
Not sure you want to do that. RDDs are lazy, so nothing has been executed yet for newRDD. If you want the size, you'll force it to be evaluated and probably do too much work.Food
Thanks for replying @Paul, I am a newbie in Spark. I have no idea how I can force it to evaluate the size. Can you give some suggestions?Unwitnessed
You don't want to force it to evaluate the size, but to evaluate the answer overall. Why do you care what the size is?Food
@Paul I am working on an application which takes a file and filters some data. In response I want to show the size of the data which is actually useful after processing a large file (which can be 1 GB).Unwitnessed
Sorry, again: why? Unless that's the final answer and no more processing is needed, in which case write it to a file and look at the file size. I can't see why knowing the size it consumes in memory is of interest.Food
Actually that's not the final answer. It depends on the user: if they want to, @Paul, they can do some more filtering based upon the remaining size. For that I need to show the initial size and the remaining size. Once the transformations are done the data has to be stored in Spark SQL (this part I know; I am stuck at finding the size).Unwitnessed
Please explain further why you need the size for the remaining filtering. Spark works (and gains its performance) by being 'lazy' and only doing the actual computation when the result is needed. So, normally, there is no "remaining" size in the middle of a computation, because the computation hasn't been done yet. So please explain what you are trying to do overall - because very probably, knowing the size in the middle is not necessary, or will negatively impact performance.Food
@Paul Suppose I have a file of 100 MB. I read it into an RDD, then I do some transformations on the RDD. After that I create a table in Spark SQL from that RDD. I want to know the size of the data (increased or decreased) in the RDD just before creating the table in Spark SQL. I hope my issue is clear now. Thanks a lot for looking into my problem :) Please let me know whether it is even possible or not.Unwitnessed
No, it's not clear. WHY do you want to know the size? You keep telling me you want to know the size, but why? What are you going to do with the answer? If you want to know what the size of the table will be, in memory, before you create it, then no, I don't think it's possible. I'm not being deliberately obtuse; I have no idea how knowing the RDD is 100 MB or 90 MB or 110 MB will help you in any way.Food
A Spark SQL table compresses the data, so that won't be helpful; otherwise I would have gone with the in-memory size. Thanks a lot for the help :)Unwitnessed

There are multiple ways to get the RDD size:

1. Add a Spark listener to your Spark context:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// register the listener on the SparkContext (sc in the question's code)
sc.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
    stageCompleted.stageInfo.rddInfos.foreach(info => {
      println("rdd memSize  " + info.memSize)
      println("rdd diskSize " + info.diskSize)
    })
  }
})
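
The listener only reports stages that have actually completed, so as a minimal usage sketch (reusing the RDDs from the question) you still need to trigger an action; to my understanding, memSize/diskSize also stay at zero unless the RDD is persisted:

// minimal usage sketch, assuming the peopleRdd/newRdd from the question
newRdd.cache()   // persist so the completed stage reports a non-zero memSize
newRdd.count()   // any action works; it just forces the stage to run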

2. Save your RDD as a text file:

myRDD.saveAsTextFile("person.txt")

and then call the Apache Spark REST API:

/applications/[app-id]/stages
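
For example, a small sketch of calling that endpoint from Scala; the host and port assume the default local Spark UI, and [app-id] is a placeholder for the real application id shown in the Spark UI:

import scala.io.Source

// [app-id] is a placeholder: substitute the id shown in the Spark UI / REST API
val url = "http://localhost:4040/api/v1/applications/[app-id]/stages"
println(Source.fromURL(url).mkString) // JSON with per-stage input/output byte counts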

3. You can also try SizeEstimator:

val rddSize = SizeEstimator.estimate(myRDD)
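
Note that, as far as I understand, SizeEstimator.estimate(myRDD) measures the RDD object on the driver (its metadata and lineage), not the distributed data itself. A rough sketch for estimating the in-memory size of the data is to run the estimator inside each partition and sum the results:

import org.apache.spark.util.SizeEstimator

// rough estimate of the materialised size of the data itself;
// this evaluates the whole RDD, so it does real work
val estimatedBytes = myRDD
  .mapPartitions(iter => Iterator(SizeEstimator.estimate(iter.toArray)))
  .reduce(_ + _)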
Sand answered 27/8, 2015 at 19:6 Comment(3)
Thanks!! Will try these and let you know in case of any issues.Unwitnessed
@Sand: Nice explanation :)Janeyjangle
Thanks @RamPrasadG.Sand

I'm not sure you need to do this. You could cache the RDD and check its size in the Spark UI. But let's say that you do want to do this programmatically; here is a solution.

def calcRDDSize(rdd: RDD[String]): Long = {
  // map each line to its size in bytes (text files are read as UTF-8 by default)
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}

You can then call this function for your two RDDs:

println(s"peopleRdd is [${calcRDDSize(peopleRdd)}] bytes in size")
println(s"newRdd is [${calcRDDSize(newRdd)}] bytes in size")

This solution should work even if the file size is larger than the memory available in the cluster.

Menes answered 26/8, 2015 at 18:0 Comment(2)
I don't want to cache the RDD. It will pull the data into memory, which is not required.Unwitnessed
Hi, is it possible to convert it into a DataFrame as well?Mussolini

The Spark API docs say that:

  1. You can get info about your RDDs from the Spark context: sc.getRDDStorageInfo
  2. The RDD info includes memory and disk size: see the RDDInfo doc (a short sketch follows below)
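
A minimal sketch of how that might look for the question's newRdd; to my understanding getRDDStorageInfo only reports RDDs that have been persisted and materialised, so cache and run an action first:

// cache and trigger an action so the RDD shows up in the storage info
newRdd.cache()
newRdd.count()

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.name} (id ${info.id}): memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
}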
Wherefore answered 27/8, 2015 at 7:44 Comment(0)
