How to redirect Scala Spark Dataset.show to a log4j logger

The Spark API docs show how to pretty-print a snippet of a Dataset or DataFrame to stdout.

Can this output be directed to a log4j logger? Alternatively, can someone share code that produces output formatted like df.show()?

Is there a way to do this that still lets stdout go to the console both before and after the .show() output is pushed to the logger?

http://spark.apache.org/docs/latest/sql-programming-guide.htm

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Kori answered 11/1, 2017 at 20:42 Comment(4)
You can always implement a similar function: github.com/apache/spark/blob/master/sql/core/src/main/scala/org/… – Nameless
That got me there. TY – Kori
For Java, see #8708842 for how to redirect console output to a String. – Boutonniere
You can see how to use the internal showString function by reflection here: #51218939 – Tumefaction

The showString() function from teserecter comes from Spark code (Dataset.scala).

You can't call that function from your own code because it's package-private (private[sql]), but you can place the following snippet in a file named DatasetShims.scala in your source tree and mix the trait into your classes to access it.

package org.apache.spark.sql

trait DatasetShims {
  implicit class DatasetHelper[T](ds: Dataset[T]) {
    def toShowString(numRows: Int = 20, truncate: Int = 20, vertical: Boolean = false): String =
      "\n" + ds.showString(numRows, truncate, vertical)
  }
}
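
As a rough sketch of how the trait might be used, the example below mixes DatasetShims into an application object and passes the rendered table to a logger. The object name ShowToLogger, the sample data, and the use of an SLF4J logger (which Spark normally has on its classpath and which typically forwards to log4j) are assumptions for illustration. The three-argument showString is assumed to match your Spark version (the vertical flag appeared in Spark 2.3).

```scala
import org.apache.spark.sql.{DatasetShims, SparkSession}
import org.slf4j.LoggerFactory

// Hypothetical application object; mixing in DatasetShims brings toShowString into scope.
object ShowToLogger extends DatasetShims {
  // Assumption: SLF4J is on the classpath and is bound to your log4j configuration.
  private val log = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("show-to-logger")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small made-up DataFrame standing in for your real data.
    val df = Seq((30, "Andy"), (19, "Justin")).toDF("age", "name")

    // toShowString prepends a newline, so the table starts on its own line in the log.
    log.info(df.toShowString())

    spark.stop()
  }
}
```

Because nothing is written to stdout here, regular console output before and after the log call is unaffected.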
Lakia answered 6/11, 2019 at 22:25 Comment(0)

Put this utility method somewhere in your code to produce a string in the same format as dataframe.show().

Then just include it in your logging output like:

log.info("at this point the dataframe named df shows as \n"+showString(df,100,-40))

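import org.apache.spark.sql.DataFrame

/**
  * Compose the string representing rows for output, in the df.show() format.
  *
  * @param df            DataFrame to render
  * @param _numRows      Number of rows to show
  * @param truncateWidth If set to more than 0, truncates strings to `truncateWidth` characters
  *                      and all cells will be aligned right.
  */
def showString(df: DataFrame, _numRows: Int = 20, truncateWidth: Int = 20): String = {
  val numRows = _numRows.max(0)
  val takeResult = df.take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)

  // For array values, replace Seq and Array with square brackets.
  // For cells that are beyond `truncateWidth` characters, replace them with the
  // first `truncateWidth - 3` characters and "...".
  val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.toSeq.map { row =>
    row.toSeq.map { cell =>
      val str = cell match {
        case null => "null"
        case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
        case array: Array[_] => array.mkString("[", ", ", "]")
        case seq: Seq[_] => seq.mkString("[", ", ", "]")
        case _ => cell.toString
      }
      if (truncateWidth > 0 && str.length > truncateWidth) {
        // Do not show ellipses for strings shorter than 4 characters.
        if (truncateWidth < 4) str.substring(0, truncateWidth)
        else str.substring(0, truncateWidth - 3) + "..."
      } else {
        str
      }
    }: Seq[String]
  }

  // The table layout below is modelled on Spark's own Dataset.showString.
  // Width of each column: at least 3, otherwise the widest cell in that column.
  val colWidths = Array.fill(df.schema.fieldNames.length)(3)
  for (row <- rows; (cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }

  // Right-align cells when truncating, left-align them otherwise.
  def pad(cell: String, width: Int): String =
    if (truncateWidth > 0) " " * (width - cell.length) + cell
    else cell + " " * (width - cell.length)
  def formatRow(row: Seq[String]): String =
    row.zipWithIndex.map { case (cell, i) => pad(cell, colWidths(i)) }.mkString("|", "|", "|\n")

  val sep = colWidths.map("-" * _).mkString("+", "+", "+\n")
  val sb = new StringBuilder
  sb.append(sep).append(formatRow(rows.head)).append(sep)
  rows.tail.foreach(row => sb.append(formatRow(row)))
  sb.append(sep)

  // Add a note when the DataFrame has more than `numRows` records.
  if (hasMoreData) {
    val rowsString = if (numRows == 1) "row" else "rows"
    sb.append(s"only showing top $numRows $rowsString\n")
  }
  sb.toString()
}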
Kori answered 22/1, 2019 at 21:14 Comment(1)
I think the answer is missing at least a curly bracket but I think it is not retrieving as string and the concatenation of the variable named rows is missing. – Tumefaction

It looks like a crutch but...

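import java.io.ByteArrayOutputStream
import scala.util.Using // available from Scala 2.13

Using(new ByteArrayOutputStream()) { dfOutputStream =>
  // Scala's println (which df.show() uses) is redirected only inside this block
  Console.withOut(dfOutputStream) {
    df.show()
  }
  val dfOutput = new String(dfOutputStream.toByteArray)
  logger.info(dfOutput)
}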

(redirecting console output to output stream)
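
To illustrate that the redirect is scoped to the block, here is a minimal sketch that needs no Spark at all. The names CaptureDemo and captureStdout are made up for this example, and a plain println stands in for df.show(). Note that Console.withOut only affects Scala's Console (and therefore println), not code that writes to java.lang.System.out directly.

```scala
import java.io.ByteArrayOutputStream

object CaptureDemo {
  /** Runs `block` with Scala's Console.out redirected and returns what it printed. */
  def captureStdout(block: => Unit): String = {
    val buffer = new ByteArrayOutputStream()
    Console.withOut(buffer)(block)
    buffer.toString()
  }

  def main(args: Array[String]): Unit = {
    println("before: goes to the console")
    val captured = captureStdout {
      println("inside: captured instead of printed") // stand-in for df.show()
    }
    println("after: goes to the console again")
    // The captured text could be handed to logger.info(...) instead.
    println(s"captured text: ${captured.trim}")
  }
}
```

With a helper like this, the call site would reduce to something like logger.info(captureStdout(df.show())), and console output before and after the call is left alone.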

Hidie answered 12/6 at 12:23 Comment(0)
