SparkSQL : Can I explode two different variables in the same query?

I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I want to explode another field "colors", so the final output would be the cartesian product of names and colors. So I did:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got the errors:

 Only one generator allowed per select but Generate and and Explode found.;

Does anyone have any idea?


I can actually make it work by doing two steps:

   data1 = sqlContext.sql("select explode(names) as name from data")
   data1.registerTempTable('data1')
   data1 = sqlContext.sql("select explode(colors) as color from data1")

But I am wondering if it is possible to do it in one step? Thanks a lot!

Luckily answered 26/4, 2016 at 22:30 Comment(1)
This solved my problem: #38602605 (Vanderhoek)

The correct syntax is

select name, color 
from data 
lateral view explode(names) exploded_names as name 
lateral view explode(colors) exploded_colors as color

The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.

Explanation

Think of it this way: LATERAL VIEW works like an implicit JOIN with an ephemeral table created for every row from the structs in the collection being "viewed". So, the way to parse the syntax is:

LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...
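As a plain-Python sketch of these semantics (using hypothetical row data, not the Spark API), stacking two LATERAL VIEW explodes over one source row amounts to taking the cartesian product of the two collections:

```python
from itertools import product

# One source row with two array columns, as in the question.
row = {"names": ["Alice", "Bob"], "colors": ["Red", "Green", "Blue"]}

# Two stacked LATERAL VIEW explodes behave like an implicit join of the
# row against each of its exploded collections: every name is paired
# with every color.
exploded = [
    {"name": name, "color": color}
    for name, color in product(row["names"], row["colors"])
]

print(len(exploded))   # 2 names x 3 colors = 6 rows
print(exploded[0])     # {'name': 'Alice', 'color': 'Red'}
```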

Multiple output columns

If you use a table-generating function such as posexplode(), you still get one output table, but with multiple output columns:

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
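In plain Python (again a sketch over a hypothetical orders array, not the Spark API), posexplode behaves like enumerate: each element comes out paired with its position:

```python
# posexplode(orders) yields (position, element) pairs, which LATERAL VIEW
# exposes as two output columns -- sketched here with enumerate.
orders = ["o-100", "o-200", "o-300"]

exploded_orders = [
    {"order_number": pos, "order": order}
    for pos, order in enumerate(orders)
]

print(exploded_orders[1])   # {'order_number': 1, 'order': 'o-200'}
```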

Nesting

You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
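The nesting above behaves like nested loops: the inner LATERAL VIEW runs over the collection inside each element produced by the outer one. A plain-Python sketch with hypothetical order data:

```python
# Two stacked LATERAL VIEW posexplode calls act like nested loops: the
# inner loop iterates over the items inside each exploded outer order.
orders = [
    {"items": ["apple", "pear"]},
    {"items": ["milk"]},
]

rows = [
    {"order_number": on, "item_number": inum, "item": item}
    for on, order in enumerate(orders)
    for inum, item in enumerate(order["items"])
]

print(len(rows))   # 2 items + 1 item = 3 rows
```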

Performance considerations

While we are on the topic of LATERAL VIEW it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility as you can return any supported type from explode, which means that you can perform a more complicated transformation in one step.

Update

In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also an explode_outer(), which will produce output rows even if the input to be exploded is null. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above as Spark can perform the transformation entirely using internal row data representations.
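The difference between explode and explode_outer can be sketched in plain Python (these helper functions are illustrative stand-ins, not the Spark API):

```python
def explode(values):
    # Like Spark's explode: a null or empty collection
    # contributes no output rows at all.
    return list(values) if values else []

def explode_outer(values):
    # Like Spark's explode_outer: a null or empty collection
    # still yields a single null output value.
    return list(values) if values else [None]

print(explode(None))           # []
print(explode_outer(None))     # [None]
print(explode_outer(["Red"]))  # ['Red']
```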

Kirovograd answered 20/5, 2016 at 15:25 Comment(3)
Thanks! Your comment about LATERAL VIEW being more efficient than .explode() explains something I'd observed but couldn't explain. (Mcevoy)
The note about performance is completely wrong. Schema inference is performed on the logical plan after the parse phase, so there is no difference between using SQL and the DSL. (Tamp)
@Tamp your comment suggests that you either misunderstood my note about performance, or are confused about Spark internals, or both. The performance problem has nothing to do with schema inference during plan creation. It has to do with execution-time type conversion from Spark's InternalRow into the input data type of the function passed to explode, e.g., Row. If that doesn't help you understand, read the Spark codebase, e.g., UserDefinedGenerator, which is used in df.explode(). (Kirovograd)

Try lateral view explode instead.

select name, color from data lateral view explode(names) as name lateral view explode(colors) as color;
Jp answered 27/4, 2016 at 1:41 Comment(2)
select name, color from data lateral view explode(names) as name lateral view explode(colors) as color; ^ SyntaxError: invalid syntax (Luckily)
I probably messed up the syntax; please look at this page for the right syntax: cwiki.apache.org/confluence/display/Hive/… (Jp)

There's a simple way to do explode on multiple columns using df.withColumn.

scala> val data = spark.sparkContext.parallelize(Seq((Array("Alice", "Bob"), Array("Red", "Green", "Blue"))))
  .toDF("names", "colors")
data: org.apache.spark.sql.DataFrame = [names: array<string>, colors: array<string>]

scala> data.show
+------------+------------------+                                               
|       names|            colors|
+------------+------------------+
|[Alice, Bob]|[Red, Green, Blue]|
+------------+------------------+

scala> data.withColumn("name", explode('names))
  .withColumn("color", explode('colors))
  .show

+------------+------------------+-----+-----+
|       names|            colors| name|color|
+------------+------------------+-----+-----+
|[Alice, Bob]|[Red, Green, Blue]|Alice|  Red|
|[Alice, Bob]|[Red, Green, Blue]|Alice|Green|
|[Alice, Bob]|[Red, Green, Blue]|Alice| Blue|
|[Alice, Bob]|[Red, Green, Blue]|  Bob|  Red|
|[Alice, Bob]|[Red, Green, Blue]|  Bob|Green|
|[Alice, Bob]|[Red, Green, Blue]|  Bob| Blue|
+------------+------------------+-----+-----+
Triplex answered 31/10, 2019 at 13:58 Comment(1)
I noticed @Kirovograd had mentioned .withColumn in his comment after I posted this answer. I hope my answer makes it easier for someone looking for similar solutions, like me. (Triplex)

More than one explode is not allowed in Spark SQL as it is too confusing. This is because you get an implicit cartesian product of the two things you are exploding. If you want to do more than one explode, you have to use more than one select. Hive has a lateral view which can achieve what you need (explained by Rashid Ali in his answer here). I would personally recommend two selects with data frames, as it is much more efficient in Spark. Now assuming 'data' is a data frame:

val data1 = data.select($"id", $"names", explode($"colors").alias("colors"))
            // select required columns from colors
            .select($"id", $"colors.field1", explode($"names").alias("names"))
            // now select required cols from names
            .select($"id", $"field1", $"names.col1", $"names.col2")

You can do the above selects in multiple dataframes or in a single one like above; it doesn't make a difference in terms of performance.

Mireielle answered 27/4, 2016 at 17:49 Comment(12)
Thanks! And I am wondering: what is the $"id" here? And what are $"names.col1" and $"names.col2"? (Luckily)
@Luckily $"id" is a sample column I added. $"names.col1" and $"names.col2" are nested columns within your names JSON field. Since you are using an explode, I assumed your sample data looked like this: names<array<struct<col1:some_value,col2:some_value1>> (Mireielle)
I just tried and got the error below: data1 = data.select($"id",$"names",$explode($"colors").alias("colors")) ^ SyntaxError: invalid syntax Could it be because I am using Python instead of Scala? Thanks! (Luckily)
@Luckily no, not really; Scala and Python are not that different in terms of dataframes. In the sample code you posted, you are using the field 'id' in the select; do you have a field named 'id' in your data? If not, remove it. Also, update the post with a sample of your source data; that would help me write the code according to the data. (Mireielle)
I removed the $, then the syntax error is gone. Guess Python doesn't take $. After that I got another error: name 'explode' is not defined ... perhaps it is another format in Python? Thanks! (Luckily)
@Luckily can you post your exact query here? (Mireielle)
@Mireielle You are incorrect, at least when it comes to Spark 1.6.1+. More than one LATERAL VIEW explode(...) is supported perfectly well when using HiveQL in Spark. I've used it many times. (Kirovograd)
@Kirovograd I am not sure why you downvoted my answer. I would recommend reading my answer again. I am using Spark 1.6.1; you cannot use two explodes in the same select for a "Spark data frame". I was not talking about a plain SQL or Hive statement like you mentioned. I also mentioned in my answer initially that this can be achieved with "Lateral Views" in Hive. (Mireielle)
@Mireielle I downvoted your answer because it is not correct. You can use as many explode() methods as you want in the Scala DSL API, as long as you have collection columns left. The return type of explode() is DataFrame, not any other special type that would impose a restriction that explode() cannot be called again. Read the code. (Kirovograd)
@Sim, the return type of the explode() DSL is a Column, not a DataFrame. The explode function is a generator which generates new rows, and multiple generators in the DSL are not supported in Spark. (Dissidence)
@RameshMaharjan multiple generators are not supported only in a single select(). That's because the DSL hides LATERAL JOINs as projections via explode(). It's a coding convenience and a source of much confusion, as explode() is not a projection. To avoid confusion, many people prefer withColumn("...", explode(...)), and you can chain as many of these as you want, each becoming a generator in the physical plan. That's what I meant by explode() "returning" a DataFrame: I referred to the result of the explosion, not of the function itself (all functions return Column in the DSL). (Kirovograd)
@Kirovograd - nice trick! Just one note for others: when you do withColumn("foo", explode($"bar")), I found that you need to have $"bar" in the select column list. (Papyraceous)
