Unpacking a list to select multiple columns from a Spark DataFrame

8

60

I have a Spark DataFrame df. Is there a way of sub-selecting a few columns using a list of these columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"). Is there a way to pass this to df.select? df.select(cols) throws an error. I am looking for something like df.select(*cols) in Python.

Sumerology answered 22/1, 2016 at 3:59 Comment(0)
103

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The cols: String* entry takes a variable number of arguments. :_* unpacks the arguments so that they can be handled by this parameter. Very similar to unpacking in Python with *args. See here and here for other examples.
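For reference, here is a minimal, self-contained sketch of this pattern; the local SparkSession and the sample data are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
val cols = List("b", "c")

// select(col: String, cols: String*): cols.head fills the single String parameter,
// and cols.tail: _* expands the remaining names into the varargs parameter
val selected = df.select(cols.head, cols.tail: _*)
selected.show()  // prints only columns b and c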

Groundwork answered 22/1, 2016 at 4:15 Comment(4)
Thanks! Worked like a charm. Could you explain a bit more about the syntax? Specifically, what does cols.tail: _* do?Sumerology
I think I understand now. The key is the method signature of select: select(col: String, cols: String*). The cols: String* entry takes a variable number of arguments. :_* unpacks arguments so that they can be handled by this argument. Very similar to unpacking in Python with *args. See here and here for other examples.Sumerology
Cool! You got it right :) Sorry I got both the notifications just now so couldn't reply earlier. :)Groundwork
No problem. Thanks again!Sumerology
33

You can map each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
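To unpack this a bit: col(name) from org.apache.spark.sql.functions builds a Column from a column name, cols.map(col) turns the list of names into a list of Columns, and : _* expands that list into select's Column* varargs. A minimal sketch, assuming the df and cols from the question:

import org.apache.spark.sql.functions.col

val cols = List("b", "c")
// List("b", "c") -> List(col("b"), col("c")), expanded into select(cols: Column*)
df.select(cols.map(col): _*)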
Inquest answered 9/6, 2016 at 11:45 Comment(1)
can you elaborate plz ?Kirit
25

Another option that I've just learnt.

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selectedDf = df.select(colNames: _*)
Calices answered 1/10, 2016 at 20:33 Comment(0)
3

First convert the String array to a List of Spark Column objects, as below:

String[] strColNameArray = new String[]{"a", "b", "c", "d"};

List<Column> colNames = new ArrayList<>();

for(String strColName : strColNameArray){
    colNames.add(new Column(strColName));
}

Then convert the List within the select statement using the JavaConversions functions, as below. You need the following import statement:

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));
Connective answered 27/3, 2019 at 6:51 Comment(0)
2

You can pass arguments of type Column* to select:

import org.apache.spark.sql.Column

val df = spark.read.json("example.json")
val cols: List[String] = List("a", "b")
// convert each String to a Column
val colList: List[Column] = cols.map(df(_))
df.select(colList: _*)
Jaipur answered 16/1, 2017 at 13:15 Comment(1)
What about a bit shorter version: df.select(cols.map(df(_)): _*) ?Swinford
2

You can do it like this:

String[] originCols = ds.columns();
ds.selectExpr(originCols);

Spark selectExpr source code:

  /**
   * Selects a set of SQL expressions. This is a variant of `select` that accepts
   * SQL expressions.
   *
   * {{{
   *   // The following are equivalent:
   *   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   *   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
   * }}}
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
  }
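Note that selectExpr is annotated @scala.annotation.varargs, which is why the Java call above can pass a String[] directly. In Scala you expand the array explicitly; a small sketch, assuming a DataFrame df:

val originCols = df.columns      // Array[String]
df.selectExpr(originCols: _*)    // expand the array into the String* parameter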
Orphanage answered 10/5, 2018 at 6:21 Comment(0)
2

Yes, you can make use of .select in Scala.

Use .head and .tail to pass all of the column names in the List().

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)

Explanation

cols.head supplies the first parameter of select(col: String, cols: String*), and cols.tail: _* expands the remaining names into the varargs parameter.
Foreman answered 13/2, 2019 at 4:30 Comment(1)
Can you please share how to do the same(pass the column names) in java while doing dataframeResult = inpDataframe.select("col1","col2",....)Picoline
0

Prepare a list of all the required features, then use * to unpack it inside Spark's built-in select, as shown below.

lst = ["col1", "col2", "col3"]
result = df.select(*lst)

Sometimes you get an error like "AnalysisException: cannot resolve 'col1' given input columns". In that case, add any missing columns as null columns of string type, as shown below:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
for i in lst:
    if i not in df.columns:
        df = df.withColumn(i, lit(None).cast(StringType()))

Finally, you get a DataFrame with the required features.
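For completeness, since the question is about Scala, here is a rough sketch of the same pattern with the Scala API; the column names and df are placeholders:

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

val lst = Seq("col1", "col2", "col3")

// add any missing columns as null string columns, then select the list
val withAll = lst.foldLeft(df) { (acc, c) =>
  if (acc.columns.contains(c)) acc else acc.withColumn(c, lit(null).cast(StringType))
}
val result = withAll.select(lst.map(col): _*)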

Archiearchiepiscopacy answered 2/6, 2022 at 9:25 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Phenolic
