Scala Spark DataFrame: dataFrame.select multiple columns given a Sequence of column names
val columnName = Seq("col1", "col2", ..., "coln")

Is there a way to do a dataframe.select operation to get a DataFrame containing only the specified column names? I know I can do dataframe.select("col1", "col2", ...), but columnName is generated at runtime. I could call dataframe.select() repeatedly for each column name in a loop. Would that have any performance overhead? Is there a simpler way to accomplish this?

Halflife answered 21/3, 2016 at 12:59 Comment(2)
duplicate? #34939270 – Atalanta
@Atalanta That is a duplicate of this question. See the timeline. – Halflife
import org.apache.spark.sql.functions.col

val columnNames = Seq("col1", "col2", ..., "coln")

// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)

// or, equivalently, using Column objects:
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Expiratory answered 21/3, 2016 at 13:3 Comment(4)
tail returns the sequence excluding the first item (head); : _* transforms a collection into a vararg argument – used when calling a method expecting a vararg, like select does: def select(col: String, cols: String*) – Expiratory
It's called repeated parameters; you can read more about it here – chapter 4, section 2. – Dacoity
@V.Samma that won't compile, check the signatures of select – it's either select(col: String, cols: String*): DataFrame for Strings, or select(cols: Column*): DataFrame for Columns; there's no select(cols: String*): DataFrame. See spark.apache.org/docs/latest/api/scala/… – Expiratory
Is there a way to add an alias for other columns like this? dataframe.select(columnNames.head, columnNames.tail: _*, col("abc").as("def")) ? – Leucite
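A sketch addressing that last comment, assuming columnNames is the Seq[String] from the answer and df is any DataFrame containing those columns: the String overload cannot take an extra Column argument, but converting everything to Columns first and appending the aliased one before the vararg expansion compiles fine ("abc" and "def" are the hypothetical names from the comment):

import org.apache.spark.sql.functions.col

// Column overload only: map the names to Columns, append the aliased
// Column, then expand the whole Seq[Column] to varargs
val withAlias = df.select((columnNames.map(col) :+ col("abc").as("def")): _*)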

Since dataFrame.select() expects a sequence of Column objects and we have a sequence of strings, we need to map each name to a Column and expand the result as varargs. columnName.map(name => col(name)): _* turns a sequence of strings into a vararg sequence of columns, which can be passed as a parameter to select():

  import org.apache.spark.sql.functions.col

  val columnName = Seq("col1", "col2")
  val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Lotty answered 15/5, 2018 at 7:40 Comment(2)
Please add some context and explanation to this answer. – Septic
@UserszrKs I am using Spark 2.3.1; when I use the above it gives an error: "type mismatch: found: org.apache.spark.sql.Column, required: Seq[?]". What is wrong here? – Concepcion
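A guess at the cause, since the commenter's code isn't shown: vararg type mismatches around select usually mean the : _* ascription landed inside map (or was dropped) instead of being applied to the mapped sequence as a whole:

// wrong: ascribing inside the lambda does not compile
// val broken = DF.select(columnName.map(name => col(name): _*))

// right: map first, then expand the resulting Seq[Column] to varargs
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)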

You can pass (List(F.col("*")) ++ updatedColumns): _* to select to keep every existing column and append the new ones.

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{functions => F}

// scale each input column by the 0/1 flag, keeping its original name
val updatedColumns: List[Column] = inputColumnNames.map(x => (F.col(x) * F.col("is_t90d")).alias(x))

val outputSDF = {
  inputSDF
    .withColumn("is_t90d", F.col("original_date").between(firstAllowedDate, lastAllowedDate).cast(IntegerType))
    .select( // select existing and additional columns
      (List(F.col("*")) ++ updatedColumns): _*
    )
}
Smallminded answered 23/4 at 19:34 Comment(0)
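One caveat worth flagging (my observation, not part of the answer): because each entry in updatedColumns is aliased back to an existing input name, select(F.col("*"), ...) yields a frame holding both the original and the recomputed column under the same name. If replacement is what you want, a foldLeft over withColumn is a common alternative, sketched here with the same names and imports as the answer above:

// hypothetical replace-variant: recompute the columns in place instead of appending
val flagged = inputSDF.withColumn("is_t90d",
  F.col("original_date").between(firstAllowedDate, lastAllowedDate).cast(IntegerType))

val replacedSDF = inputColumnNames.foldLeft(flagged) { (df, name) =>
  df.withColumn(name, F.col(name) * F.col("is_t90d"))
}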

Alternatively, you can also write it like this:

val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)
Viscardi answered 5/5, 2021 at 4:9 Comment(0)
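A side note, mine rather than the answerer's: DF("col1") (i.e. DF.apply) returns a Column resolved against that specific DataFrame, whereas functions.col("col1") is an unresolved reference; for a plain select on a single DataFrame the two are interchangeable:

import org.apache.spark.sql.functions.col

// both select the same columns from DF
val viaApply = DF.select(columnName.map(DF(_)): _*)
val viaCol   = DF.select(columnName.map(col): _*)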
