Column name with dot in Spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector].

The problem is that some of my columns have a dot in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val df = spark.read.format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val columns = df.columns.map(c => s"`$c`")
val rows = new VectorAssembler().setInputCols(columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat: RowMatrix = new RowMatrix(data)
// Compute all singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(mat.numCols().toInt, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s     // The singular values are stored in a local dense vector.
val V: Matrix = svd.V     // The V factor is a local dense matrix.

println(V)

Any help with handling columns that have a dot in their name would be appreciated. Thanks.

Unpractical answered 5/6, 2017 at 10:33 Comment(3)
Did you try changing the column names? – Ancel
@RameshMaharjan It worked with columns without a dot, but I need it to work with the dots. Any help? Thanks. – Unpractical
What I would suggest is to save the schema with the dots, change the column names, and once you are done with the new names, change them back to the dotted ones. Doesn't that work? – Ancel
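
A quick sketch of the round-trip suggested in the last comment (assuming the question's df; the middle step is whatever transformation needed the dot-free names):

val originalNames = df.columns                                  // remember the dotted names
val safe = df.toDF(originalNames.map(_.replace(".", "_")): _*)  // work with safe names
// ... transformations on safe ...
val restored = safe.toDF(originalNames: _*)                     // restore the dotted names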

If your problem is the . (dot) in the column name, you could use ` (backticks) to enclose the column name:

df.select("`col0.1`")
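
For instance, with the sample data from the question, that would look roughly like this (a sketch, assuming the df loaded in the question):

// Backticks make Spark treat the whole string as a single column name,
// instead of parsing "col0.1" as field access on a struct column "col0".
df.select("`col0.1`").show()

// The same escaping works inside expression strings:
df.filter("`col0.1` > 1").show()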

Muleteer answered 19/7, 2018 at 19:18 Comment(4)
Take care, however: the drop function behaves well with no backticks. In this case: df.drop("col0.1") – Jevons
@Jevons withColumn also works fine with "dotted" names. Apparently only referencing such a column in expressions needs to be escaped with backticks. – Pettish
Note this also works: df.select("parent_column.`child.column.with.dots.i.dont.want.expanded`") – Broderickbrodeur
@Broderickbrodeur Further selection from the DataFrame using the same syntax does not work; note that the complained-about column name is actually in the suggested list: AnalysisException: Column 'USER_DEFINED_EXTENSIONS.UDX.EDXF.REACH.UDX.EDXF.REACH.LISTDATE' does not exist. Did you mean one of the following? [USER_DEFINED_EXTENSIONS.UDX.EDXF.REACH.UDX.EDXF.REACH.LISTDATE] – Metacarpus
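
A short sketch of the behavior described in the comments above: drop and withColumn take literal names, while expression contexts need backticks (assuming the question's df):

import org.apache.spark.sql.functions.lit

val dropped = df.drop("col0.1")              // literal name, no backticks
val added = df.withColumn("col4.5", lit(0))  // new column literally named "col4.5"
added.selectExpr("`col4.5` + 1").show()      // expressions still need backticks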

The problem here is the VectorAssembler implementation, not the columns per se. You can, for example, skip the header:

val df = spark.read.format("csv")
  // the header line starts with a double quote, so setting " as the comment
  // character makes the reader skip it (columns then default to _c0 ... _c3)
  .options(Map("inferSchema" -> "true", "comment" -> "\""))
  .load(path)

new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("vs")
  .transform(df)

or rename the columns before passing them to VectorAssembler:

val renamed = df.toDF(df.columns.map(_.replace(".", "_")): _*)

new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)

Finally, the best approach is to provide the schema explicitly:

import org.apache.spark.sql.types._

val schema = StructType((0 until 4).map(i => StructField(s"_$i", DoubleType)))

val dfExplicit = spark.read.format("csv")
  .options(Map("header" -> "true"))
  .schema(schema)
  .load(path)

new VectorAssembler()
  .setInputCols(dfExplicit.columns)
  .setOutputCol("vs")
  .transform(dfExplicit)
Venerate answered 5/6, 2017 at 11:35 Comment(5)
Subject for a JIRA? – Tabulate
@Venerate I appreciate your help. Do you mean there is no other solution that keeps the same column names? Thanks. – Unpractical
None that I am aware of, but I never looked intensively for one. In general you shouldn't use anything other than [a-zA-Z0-9_] in column names, to keep reasonable compatibility with other tools (including Parquet). And for VectorAssembler it shouldn't really matter. – Venerate
I'm not able to rename it either; it fails even when I just try to rename and show. – Jared
df.toDF(df.columns.map(_.replace(".", "_")): _*) does not go deeper into nested columns. I had to use an ugly trick to flatten the df (which also needed patching because of the dots in column names), and then use the above to replace the dots after the flattening. – Metacarpus
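
Regarding the nested-columns caveat in the last comment: toDF only renames top-level columns, so one possible workaround (an assumed approach, not part of the answer above) is to rebuild the schema with sanitized field names and re-apply it with a cast:

import org.apache.spark.sql.types._

// Recursively replace dots in struct field names; a sketch that ignores MapType.
def sanitize(dt: DataType): DataType = dt match {
  case StructType(fields) => StructType(fields.map(f =>
    f.copy(name = f.name.replace(".", "_"), dataType = sanitize(f.dataType))))
  case ArrayType(elem, containsNull) => ArrayType(sanitize(elem), containsNull)
  case other => other
}

// Casting a struct to a struct with the same layout renames its fields.
val sanitized = df.select(df.schema.fields.map { f =>
  df.col(s"`${f.name}`").cast(sanitize(f.dataType)).as(f.name.replace(".", "_"))
}: _*)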

For Spark SQL:

spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'")
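
A minimal, self-contained version (a sketch; the reg_data view and the dotted column name come from the answer, the DataFrame registration is assumed):

// Register a DataFrame as a temporary view, then escape the dotted
// column name with backticks inside the SQL string.
df.createOrReplaceTempView("reg_data")
spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'").show()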
Claiborne answered 30/12, 2020 at 13:31 Comment(2)
Could you add some additional information on why your post solves the problem? Please consider checking How to Answer. – Truitt
This is an example for Spark SQL. – Claiborne
