Spark INLINE Vs. LATERAL VIEW EXPLODE differences?
Asked Answered
T

1

7

In Spark, for the following use case, I'd like to understand what are the main differences between using the INLINE and EXPLODE ... I'm not sure if there are any performance implications or if one method is preferred over the other one or if there are any other uses cases where one is appropriate and the other is not...

The use case is to select 2 fields from a complex data type (array of structs), my instinct was to use INLINE since it explodes an array of structs

For example:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
),

inline_data AS (
SELECT id,
        INLINE(array_of_structs)
FROM sample
)

SELECT id,
        name AS person_name,
        age AS person_age
FROM inline_data

And using LATERAL VIEW EXPLODE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people as person 

The documentation clearly states what each one of these do but I'd like to better understand when to pick one over the other one.

Tideway answered 27/5, 2020 at 22:22 Comment(0)
G
10

EXPLODE UDTF will generate rows of struct (single column of type struct), and to get person name you need to use person.name:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people as person

Result:

id,name,age
1,frank,40
1,maria,51

And INLINE UDTF will generate a row-set with N columns (N = number of top level elements in the struct), so you do not need to use dot notation person.name because name and other struct elements are already extracted by INLINE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        name,
        age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

Result:

id,name,age
1,frank,40
1,maria,51 

Both INLINE and EXPLODE are UDTFs and require LATERAL VIEW in Hive. In Spark it works fine without lateral view. The only difference is that EXPLODE returns dataset of array elements(struct in your case) and INLINE is used to get struct elements already extracted. You need to define all struct elements in case of INLINE like this: LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

From performance perspective both INLINE and EXPLODE work the same, you can use EXPLAIN command to check the plan. Extraction of struct elements in the UDTF or after UDTF does not affect performance.

INLINE requires to describe all struct elements (in Hive) and EXPLODE does not, so, explode may be more convenient if you do not need to extract all struct elements of if you do not need to extract elements at all. INLINE is convenient when you need to extract all or most of struct elements.

Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because lateral view required.

In Spark this will work also:

inline_data AS (
SELECT id,
        EXPLODE(array_of_structs) as person
FROM sample
)

And to get age column you need to use person.age

Grus answered 29/5, 2020 at 11:54 Comment(3)
Thanks for your answer, it definitely describes what I needed to understand but I'd like to clarify something, in my "code example" I did not use the LATERAL VIEW for the INLINE and it worked perfectly, when you say it "requires" LATERAL VIEW, are you saying that behind the scenes a LATERAL VIEW is happening even if I don't specify it in my code?Tideway
@dim_user Tested in Hive 2.1.1 this causing exception: SELECT id, INLINE(array_of_structs) - UDTF's are not supported outside the SELECT clause, nor nested in expressions, in Spark 2.4.5 it works both INLINE and EXPLODE, the same. In Spark it does not require lateral view. Edited the answerGrus
@dim_user So, in Spark the difference is that INLINE extracts struct elements and EXPLODE returns only set of array elements. No other differenceGrus

© 2022 - 2024 — McMap. All rights reserved.