Column name with dot in Spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector].

The problem is that some of my columns have a dot in their name, as in the following dataset:

"col0.1","col1.2","col2.3","col3.4"
1,2,3,4
10,12,15,3
1,12,10,5

This is what I'm doing:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val df = spark.read.format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .load("C:/Users/mhattabi/Desktop/donnee/test.txt")
val columns = df.columns.map(c => s"`$c`")
val rows = new VectorAssembler().setInputCols(columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
val data = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat: RowMatrix = new RowMatrix(data)
// Compute all singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(mat.numCols().toInt, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s     // The singular values are stored in a local dense vector.
val V: Matrix = svd.V     // The V factor is a local dense matrix.

println(V)

Any help with handling columns that have a dot in their name would be appreciated. Thanks.

Unpractical answered 5/6, 2017 at 10:33 Comment(3)
Did you try changing the column names? – Ancel
@RameshMaharjan It worked with columns without a dot, but I need it to work with the dots. Any help? Thanks. – Unpractical
What I would suggest is to save the schema with the dots, change the column names, and once you are done with the new names, change them back to the dotted ones. Doesn't that work? – Ancel
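
A quick sketch of the round-trip suggested in the last comment (assuming the question's df; the middle step is whatever transformation needed the dot-free names):

val originalNames = df.columns                                  // remember the dotted names
val safe = df.toDF(originalNames.map(_.replace(".", "_")): _*)  // work with safe names
// ... transformations on safe ...
val restored = safe.toDF(originalNames: _*)                     // restore the dotted names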

If your problem is the . (dot) in the column name, you could use ` (backticks) to enclose the column name:

df.select("`col0.1`")
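
For instance, with the sample data from the question, that would look roughly like this (a sketch, assuming the df loaded in the question):

// Backticks make Spark treat the whole string as a single column name,
// instead of parsing "col0.1" as field access on a struct column "col0".
df.select("`col0.1`").show()

// The same escaping works inside expression strings:
df.filter("`col0.1` > 1").show()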

Muleteer answered 19/7, 2018 at 19:18 Comment(4)
Take care, however: the drop function behaves well with no backticks. In this case: df.drop("col0.1") – Jevons
@Jevons withColumn also works fine with "dotted" names. Apparently only referencing such a column in expressions needs to be escaped with backticks. – Pettish
Note this also works: df.select("parent_column.`child.column.with.dots.i.dont.want.expanded`") – Broderickbrodeur
@Broderickbrodeur Further selection from the DataFrame using the same syntax does not work; note that the complained-about column name is actually in the suggested list: AnalysisException: Column 'USER_DEFINED_EXTENSIONS.UDX.EDXF.REACH.UDX.EDXF.REACH.LISTDATE' does not exist. Did you mean one of the following? [USER_DEFINED_EXTENSIONS.UDX.EDXF.REACH.UDX.EDXF.REACH.LISTDATE] – Metacarpus
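
A short sketch of the behavior described in the comments above: drop and withColumn take literal names, while expression contexts need backticks (assuming the question's df):

import org.apache.spark.sql.functions.lit

val dropped = df.drop("col0.1")              // literal name, no backticks
val added = df.withColumn("col4.5", lit(0))  // new column literally named "col4.5"
added.selectExpr("`col4.5` + 1").show()      // expressions still need backticks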

The problem here is the VectorAssembler implementation, not the columns per se. You can, for example, skip the header:

val df = spark.read.format("csv")
  // the header line starts with a double quote, so setting " as the comment
  // character makes the reader skip it (columns then default to _c0 ... _c3)
  .options(Map("inferSchema" -> "true", "comment" -> "\""))
  .load(path)

new VectorAssembler()
  .setInputCols(df.columns)
  .setOutputCol("vs")
  .transform(df)

or rename the columns before passing them to VectorAssembler:

val renamed = df.toDF(df.columns.map(_.replace(".", "_")): _*)

new VectorAssembler()
  .setInputCols(renamed.columns)
  .setOutputCol("vs")
  .transform(renamed)

Finally, the best approach is to provide the schema explicitly:

import org.apache.spark.sql.types._

val schema = StructType((0 until 4).map(i => StructField(s"_$i", DoubleType)))

val dfExplicit = spark.read.format("csv")
  .options(Map("header" -> "true"))
  .schema(schema)
  .load(path)

new VectorAssembler()
  .setInputCols(dfExplicit.columns)
  .setOutputCol("vs")
  .transform(dfExplicit)
Venerate answered 5/6, 2017 at 11:35 Comment(5)
Subject for a JIRA? – Tabulate
@Venerate I appreciate your help. Do you mean there is no other solution that keeps the same column names? Thanks. – Unpractical
None that I am aware of, but I never looked intensively for one. In general you shouldn't use anything other than [a-zA-Z0-9_] in column names, to keep reasonable compatibility with other tools (including Parquet). And for VectorAssembler it shouldn't really matter. – Venerate
I'm not able to rename it either; it fails even when I just try to rename and show. – Jared
df.toDF(df.columns.map(_.replace(".", "_")): _*) does not go deeper into nested columns. I had to use an ugly trick to flatten the df (which also needed patching because of the dots in column names), and then use the above to replace the dots after the flattening. – Metacarpus
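
Regarding the nested-columns caveat in the last comment: toDF only renames top-level columns, so one possible workaround (an assumed approach, not part of the answer above) is to rebuild the schema with sanitized field names and re-apply it with a cast:

import org.apache.spark.sql.types._

// Recursively replace dots in struct field names; a sketch that ignores MapType.
def sanitize(dt: DataType): DataType = dt match {
  case StructType(fields) => StructType(fields.map(f =>
    f.copy(name = f.name.replace(".", "_"), dataType = sanitize(f.dataType))))
  case ArrayType(elem, containsNull) => ArrayType(sanitize(elem), containsNull)
  case other => other
}

// Casting a struct to a struct with the same layout renames its fields.
val sanitized = df.select(df.schema.fields.map { f =>
  df.col(s"`${f.name}`").cast(sanitize(f.dataType)).as(f.name.replace(".", "_"))
}: _*)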

For Spark SQL:

spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'")
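
A minimal, self-contained version (a sketch; the reg_data view and the dotted column name come from the answer, the DataFrame registration is assumed):

// Register a DataFrame as a temporary view, then escape the dotted
// column name with backticks inside the SQL string.
df.createOrReplaceTempView("reg_data")
spark.sql("select * from reg_data where reg_data.`createdResource.type` = 'Berlin'").show()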
Claiborne answered 30/12, 2020 at 13:31 Comment(2)
Could you add some additional information on why your post solves the problem? Please consider checking How to Answer. – Truitt
This is an example for Spark SQL. – Claiborne
