Spark losing println() on stdout

I have the following code:

val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)

output.saveAsTextFile("myOutput")

Afterwards blueCount is not zero, but I get no println() output! Am I missing anything here? Thanks!

Elnoraelnore answered 20/10, 2015 at 0:14 Comment(0)

This is a conceptual question...

Imagine you have a big cluster composed of many workers, say n of them, each storing a partition of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:

  • Where will that data be printed out?
  • Which node has priority, and which partition?
  • If all nodes are running in parallel, whose output will be printed first?
  • How would this print queue be created?

Those are too many questions, so the designers/maintainers of apache-spark logically decided to drop any support for print statements inside any map-reduce operation (this includes accumulators and even broadcast variables).

This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?

To see this in action, you can run this Scala code, for example:

// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// It doesn't print anything on your console!
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
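
As a side note, if you really do want to see per-record output, a minimal sketch of two common workarounds (assuming a plain Int RDD like the one above) could look like this:

import org.apache.spark.TaskContext

// Option 1: run the transformation, then bring a small sample of the mapped
// data back to the driver and print it there, where it is actually visible
rdd.map(printStuff).take(10).foreach(println)

// Option 2: keep the println on the executors, but tag each line with its
// partition so it can be found in that executor's stdout log
rdd.foreach { x =>
  println(s"[partition ${TaskContext.get().partitionId()}] value = $x")
}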
Prudence answered 20/10, 2015 at 0:28 Comment(2)
I believe println works just fine: it just goes to the stdout/stderr on the computer that is running a Spark executor. So unless you have a way to capture what is in those logs you will never see it. If you are using yarn there is a command to print it all out for you however.Bethannbethanne
While the argumentation is valid, Spark doesn't perform any kind of static analysis to drop code. The output just doesn't go to the driver's STDOUT, as explained by @BethannbethanneSanferd

I was able to work it around by making a utility function:

object PrintUtility {
  def print(data: String): Unit = {
    println(data)
  }
}
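
For completeness, here is a sketch of how such a helper would be called (the call site isn't shown above, so this is an assumption):

// Hypothetical call site: the helper runs inside a distributed operation, so
// the println still executes wherever the task runs; in cluster mode the text
// shows up in that executor's stdout log rather than on the driver console.
val rdd = sc.parallelize(1 to 100)
rdd.foreach(x => PrintUtility.print("value = " + x))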
Elnoraelnore answered 28/10, 2015 at 3:3 Comment(2)
Because Spark thinks it is calling a utility function instead of calling the print function. Spark apparently didn't (and couldn't practically) check every line in its utility function.Elnoraelnore
What you are doing is instantiating an object in your driver program. I would not count on this behavior without a clear model of exactly what is going on. Expect the behavior to change unpredictably with any change to your program or how you invoke the PrintUtility object. If you want to collect logs, use standard methods to do it; don't invent random mechanisms that you don't understand. Your explanation for why it works is dangerously wrong - there is no prohibition from doing what you did, and there is no code checker to make sure you don't cheat: all behavior follows from the system design.Bethannbethanne
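
As an aside, a minimal sketch of what a "standard method" could look like here (assuming log4j on the classpath, which stock Spark ships with) is to route per-record messages through a logger instead of println, so they land in the executor logs with a level and timestamp and can be collected with the cluster's usual log tooling:

import org.apache.log4j.Logger

// Sketch: log instead of println; these lines end up in each executor's log,
// where YARN or standalone log collection can pick them up.
val rdd = sc.parallelize(1 to 100)
rdd.foreach { x =>
  Logger.getLogger("debug").info("value = " + x)
}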
