Scala Spark DataFrame: dataFrame.select multiple columns given a Sequence of column names
val columnName = Seq("col1", "col2", ..., "coln")

Is there a way to do a dataframe.select operation to get a DataFrame containing only the specified column names? I know I can do dataframe.select("col1", "col2", ...), but columnName is generated at runtime. I could call dataframe.select() repeatedly for each column name in a loop. Would that have any performance overhead? Is there a simpler way to accomplish this?

Halflife answered 21/3, 2016 at 12:59 Comment(2)
duplicate? #34939270 – Atalanta
@Atalanta That is a duplicate of this question. See the timeline. – Halflife
import org.apache.spark.sql.functions.col

val columnNames = Seq("col1", "col2", ..., "coln")

// using the string column names:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)

// or, equivalently, using Column objects:
val result = dataframe.select(columnNames.map(c => col(c)): _*)
Expiratory answered 21/3, 2016 at 13:3 Comment(4)
tail returns the sequence excluding the first item (head); : _* transforms a collection into a vararg argument – used when calling a method expecting a vararg, like select does: def select(col: String, cols: String*) – Expiratory
It's called repeated parameters; you can read more about it here – chapter 4, section 2. – Dacoity
@V.Samma that won't compile, check the signatures of select – it's either select(col: String, cols: String*): DataFrame for Strings, or select(cols: Column*): DataFrame for Columns; there's no select(cols: String*): DataFrame. See spark.apache.org/docs/latest/api/scala/… – Expiratory
Is there a way to add an alias for other columns like this? dataframe.select(columnNames.head, columnNames.tail: _*, col("abc").as("def")) ? – Leucite
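A sketch addressing that last comment, assuming columnNames is the Seq[String] from the answer and df is any DataFrame containing those columns: the String overload cannot take an extra Column argument, but converting everything to Columns first and appending the aliased one before the vararg expansion compiles fine ("abc" and "def" are the hypothetical names from the comment):

import org.apache.spark.sql.functions.col

// Column overload only: map the names to Columns, append the aliased
// Column, then expand the whole Seq[Column] to varargs
val withAlias = df.select((columnNames.map(col) :+ col("abc").as("def")): _*)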

Since dataFrame.select() expects a sequence of Column objects and we have a sequence of strings, we need to map each name to a Column and expand the result as varargs. columnName.map(name => col(name)): _* turns a sequence of strings into a vararg sequence of columns, which can be passed as a parameter to select():

  import org.apache.spark.sql.functions.col

  val columnName = Seq("col1", "col2")
  val DFFiltered = DF.select(columnName.map(name => col(name)): _*)
Lotty answered 15/5, 2018 at 7:40 Comment(2)
Please add some context and explanation to this answer. – Septic
@UserszrKs I am using Spark 2.3.1; when I use the above it gives an error: "type mismatch: found: org.apache.spark.sql.Column, required: Seq[?]". What is wrong here? – Concepcion
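A guess at the cause, since the commenter's code isn't shown: vararg type mismatches around select usually mean the : _* ascription landed inside map (or was dropped) instead of being applied to the mapped sequence as a whole:

// wrong: ascribing inside the lambda does not compile
// val broken = DF.select(columnName.map(name => col(name): _*))

// right: map first, then expand the resulting Seq[Column] to varargs
val DFFiltered = DF.select(columnName.map(name => col(name)): _*)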

You can pass (List(F.col("*")) ++ updatedColumns): _* to select to keep every existing column and append the new ones.

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{functions => F}

// scale each input column by the 0/1 flag, keeping its original name
val updatedColumns: List[Column] = inputColumnNames.map(x => (F.col(x) * F.col("is_t90d")).alias(x))

val outputSDF = {
  inputSDF
    .withColumn("is_t90d", F.col("original_date").between(firstAllowedDate, lastAllowedDate).cast(IntegerType))
    .select( // select existing and additional columns
      (List(F.col("*")) ++ updatedColumns): _*
    )
}
Smallminded answered 23/4 at 19:34 Comment(0)
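One caveat worth flagging (my observation, not part of the answer): because each entry in updatedColumns is aliased back to an existing input name, select(F.col("*"), ...) yields a frame holding both the original and the recomputed column under the same name. If replacement is what you want, a foldLeft over withColumn is a common alternative, sketched here with the same names and imports as the answer above:

// hypothetical replace-variant: recompute the columns in place instead of appending
val flagged = inputSDF.withColumn("is_t90d",
  F.col("original_date").between(firstAllowedDate, lastAllowedDate).cast(IntegerType))

val replacedSDF = inputColumnNames.foldLeft(flagged) { (df, name) =>
  df.withColumn(name, F.col(name) * F.col("is_t90d"))
}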

Alternatively, you can also write it like this:

val columnName = Seq("col1", "col2")
val DFFiltered = DF.select(columnName.map(DF(_)): _*)
Viscardi answered 5/5, 2021 at 4:9 Comment(0)
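A side note, mine rather than the answerer's: DF("col1") (i.e. DF.apply) returns a Column resolved against that specific DataFrame, whereas functions.col("col1") is an unresolved reference; for a plain select on a single DataFrame the two are interchangeable:

import org.apache.spark.sql.functions.col

// both select the same columns from DF
val viaApply = DF.select(columnName.map(DF(_)): _*)
val viaCol   = DF.select(columnName.map(col): _*)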
