Hive Explode / Lateral View multiple arrays
Asked Answered
O

5

22

I have a hive table with the following schema:

COOKIE  | PRODUCT_ID | CAT_ID |    QTY    
1234123   [1,2,3]    [r,t,null]  [2,1,null]

How can I normalize the arrays so I get the following result

COOKIE  | PRODUCT_ID | CAT_ID |    QTY

1234123   [1]          [r]         [2]

1234123   [2]          [t]         [1] 

1234123   [3]          null       null 

I have tried the following:

select concat_ws('|',visid_high,visid_low) as cookie
,pid
,catid 
,qty
from table
lateral view explode(productid) ptable as pid
lateral view explode(catalogId) ptable2 as catid 
lateral view explode(qty) ptable3 as qty

however the result comes out as a Cartesian product.

Owlet answered 18/12, 2013 at 20:11 Comment(0)
T
16

You can use the numeric_range and array_index UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

Using those UDFs, the query would be something like

select cookie,
   array_index( product_id_arr, n ) as product_id,
   array_index( catalog_id_arr, n ) as catalog_id,
   array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;
Texas answered 18/12, 2013 at 21:35 Comment(2)
@Jerome .. would this be able to work if the array is of different sizes.. ?Covet
I'm not sure different array sizes would make sense. Then you would have to check if n was larger than the current array. Something like . SELECT cookie, IF( n >= size(array1), array_index( array1, n), null), IF(n > =size(array2), array_index(array2,n ) .....Texas
N
27

I found a very good solution to this problem without using any UDF, posexplode is a very good solution :

SELECT COOKIE ,
ePRODUCT_ID,
eCAT_ID,
eQTY
FROM TABLE 
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported
WHERE seqp = seqc AND seqc = seqq;
Norse answered 15/11, 2016 at 16:22 Comment(2)
Hey this works! I guess with the caveat that your arrays need to have the same length, and if they don't it's going to truncate them to the length of the shortest one. I'm not sure on the performance of this either, especially with more and more lateral views.Guenevere
Yeah, it will take the shortest one because of "WHERE seqp = seqc AND seqc = seqq", for sure the performances will be affected.Norse
T
16

You can use the numeric_range and array_index UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

Using those UDFs, the query would be something like

select cookie,
   array_index( product_id_arr, n ) as product_id,
   array_index( catalog_id_arr, n ) as catalog_id,
   array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;
Texas answered 18/12, 2013 at 21:35 Comment(2)
@Jerome .. would this be able to work if the array is of different sizes.. ?Covet
I'm not sure different array sizes would make sense. Then you would have to check if n was larger than the current array. Something like . SELECT cookie, IF( n >= size(array1), array_index( array1, n), null), IF(n > =size(array2), array_index(array2,n ) .....Texas
C
15

You can do this by using posexplode, which will provide an integer between 0 and n to indicate the position in the array for each element in the array. Then use this integer - call it pos (for position) to get the matching values in other arrays, using block notation, like this:

select 
  cookie, 
  n.pos as position, 
  n.prd_id as product_id,
  cat_id[pos] as catalog_id,
  qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;

This avoids the using imported UDF's as well as joining various arrays together (this has much better performance).

Carbonic answered 22/5, 2018 at 11:28 Comment(2)
This deserves more votes! While the alternative answers with multiple posexplodes may be fine with smaller tables, this is the way to go with larger tables and larger number of variables (to be passed to posexplode).Tetrabrach
Feels like this should be the accepted answer. Quicker, cleaner and scales upQuartet
P
1

If you are using Spark 2.4 in pyspark, use arrays_zip with posexplode:

df = (df
    .withColumn('zipped', arrays_zip('col1', 'col2'))
    .select('id', posexplode('zipped')))
Phocomelia answered 25/6, 2020 at 13:0 Comment(0)
C
0

I tried to work out on your scenario... please try this code -

create table info(cookie string,productid int,catid string,qty string);

insert into table info
select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table
lateral view posexplode(productid) pro as myprod,pro
lateral view posexplode(categoryid) cate as mycat,cate
lateral view posexplode(qty) q as myqty,q
where myprod=mycat and mycat=myqty;

Note - In the above statements, if you place - select cookie,myprod,mycat,myqty from table in place of select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table in the output you will get the index of the element in the array of productid, categoryid and qty. Hope this will be helpful.

Complacence answered 25/5, 2018 at 20:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.