Is there a way to transpose data in Hive?
Can data in Hive be transposed? As in, the rows become columns and columns are the rows? If there is no function straight up, is there a way to do it in a couple of steps?

I have a table like this:

 | ID   |   Names   |  Proc1   |   Proc2 |  Proc3  |
 | 1    |    A1     |   x      |   b     |  f      |
 | 2    |    B1     |   y      |   c     |  g      |
 | 3    |    C1     |   z      |   d     |  h      |
 | 4    |    D1     |   a      |   e     |  i      |

I want it to be like this:

 | A1   |   B1   |  C1   |   D1 |  
 | x    |    y   |   z   |   a  |
 | b    |    c   |   d   |   e  |
 | f    |    g   |   h   |   i  |

I have been looking up other related questions and they all mention using lateral views and explode, but is there a way to selectively choose columns for lateral(ly) view(ing) and explod(ing)?

Also, what might be the rough process to achieve what I would like to do? Please help me out. Thanks!

Edit: I have been reading this link: https://cwiki.apache.org/Hive/languagemanual-lateralview.html and it shows me half of what I want to achieve. The first example in the link is basically what I'd like, except that I don't want the rows to repeat and I want them as column names. Any ideas on how to get the data into a form such that an explode would produce my desired output? Or the other way around, i.e., explode first, followed by another step that leads to my desired output table. Thanks again!

Dantzler answered 4/9, 2013 at 22:41 Comment(0)

I don't know of a way out of the box in hive to do this, sorry. You get close with explode etc. but I don't think it can get the job done.

Overall, conceptually, I think it's hard to do a transpose without knowing in advance what the columns of the destination table are going to be. This is true in particular for Hive, because the metadata about a table (how many columns it has, their types, their names, etc.) is stored in a database, the metastore. And it's true in general, because not knowing the columns beforehand would require some sort of in-memory holding of data (OK, sure, with spills), and users would need to be careful not to overflow memory (just like with dynamic partitioning in Hive).

In any case, long story short, if you know the columns of the destination table beforehand, life is good. There isn't a set command for this in Hive per se, to the best of my knowledge, but you could use a bunch of IF clauses and CASE statements in the select clause to transpose the data (ugly, I know, but that's how I have done it in the past). Something along the lines of the question "SQL - How to transpose?".
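A minimal sketch of that CASE-based approach, assuming the destination columns (A1 to D1) are known up front. To keep it runnable, the SQL is executed against an in-memory SQLite database through Python's sqlite3 module; SQLite is only a stand-in for Hive, and the table name input_table and the output aliases are assumptions. The query sticks to UNION ALL, CASE, MAX and GROUP BY, so it should carry over to Hive with little or no change, but it has not been run there.

```python
import sqlite3

# In-memory SQLite stands in for Hive (assumption); only portable SQL is used.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_table "
    "(id INTEGER, names TEXT, proc1 TEXT, proc2 TEXT, proc3 TEXT)"
)
conn.executemany(
    "INSERT INTO input_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A1", "x", "b", "f"),
        (2, "B1", "y", "c", "g"),
        (3, "C1", "z", "d", "h"),
        (4, "D1", "a", "e", "i"),
    ],
)

TRANSPOSE_SQL = """
SELECT proc,
       MAX(CASE WHEN names = 'A1' THEN val END) AS a1,
       MAX(CASE WHEN names = 'B1' THEN val END) AS b1,
       MAX(CASE WHEN names = 'C1' THEN val END) AS c1,
       MAX(CASE WHEN names = 'D1' THEN val END) AS d1
FROM (
    -- Step 1: unpivot, one (names, proc, val) row per original cell.
    SELECT names, 'proc1' AS proc, proc1 AS val FROM input_table
    UNION ALL
    SELECT names, 'proc2' AS proc, proc2 AS val FROM input_table
    UNION ALL
    SELECT names, 'proc3' AS proc, proc3 AS val FROM input_table
) unpivoted
-- Step 2: pivot, one output row per original proc column.
GROUP BY proc
ORDER BY proc
"""

rows = conn.execute(TRANSPOSE_SQL).fetchall()
for row in rows:
    print(row)
```

The source column names are hardcoded in the query, so with many more proc columns the UNION ALL branches would probably be generated by a script rather than typed by hand.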

Do let me know how it goes!

Calumnious answered 10/9, 2013 at 4:5 Comment(5)
Thanks Mark, for the answer. Yes, I do have the column names and in fact, the entire table prepared in advance. I just want it to transpose columns 2-61 into the new table without losing the relation/order. I'll surely look into the if and case statements. Please do let me know if you have additional thoughts. Thanks! - Dantzler
Hey Mark.. In the link you pointed me to, it says t.fieldname and t.fieldvalue. How do we access these properties in Hive? Any ideas? Thanks! - Dantzler
I don't think you can access them, these are stored in the metastore db. - Leatriceleave
Indeed, Lorand is right. I'd just hardcode them (i.e. the original table's column names) in your query. - Calumnious
The MySQL approach (using MAX, CASE, GROUP BY) works fine in Hive. Thanks! - Inferno

As Mark pointed out, there's no easy way to do this in Hive since PIVOT isn't available there, and you may also run into issues with the CASE/WHEN 'trick' since you have multiple value columns (proc1, proc2, proc3).

For testing purposes, you could try a different approach:

select v, o1, o2, o3 from (
  select k, 
         v,
         LEAD(v,3) OVER() as o1,
         LEAD(v,6) OVER() as o2,
         LEAD(v,9) OVER() as o3
  from (select transform(names,proc1,proc2,proc3) using 'python strm.py' AS (k, v) 
    from input_table) q1
) q2 where k = 'A1';

where strm.py:

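import sys

# Emit one (name, value) row per proc column of each tab-separated input row.
for line in sys.stdin:
  line = line.strip()
  name, proc1, proc2, proc3 = line.split('\t')
  # print() with a single argument works under both Python 2 and Python 3.
  print('%s\t%s' % (name, proc1))
  print('%s\t%s' % (name, proc2))
  print('%s\t%s' % (name, proc3))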

The trick here is to use a Python script in the map phase which emits each proc column of a row as a separate row. Then every third row (since we have 3 proc columns) belongs to the same output row, which we assemble by peeking forward with LEAD.

Although this query does the job, it has a drawback: as the input grows, the query has to keep peeking ahead at every 3rd following element, which may lead to a performance hit. Still, you can evaluate it for testing purposes.
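To see why offsets of 3, 6 and 9 line up, here is a small pure-Python sketch that mimics the stream strm.py emits and the LEAD lookups. It assumes the rows reach the window function in the original table order, which OVER() without an ORDER BY does not guarantee; the helper lead and the variable names are made up for illustration.

```python
# Rows of the example table: (name, proc1, proc2, proc3).
table = [
    ("A1", "x", "b", "f"),
    ("B1", "y", "c", "g"),
    ("C1", "z", "d", "h"),
    ("D1", "a", "e", "i"),
]
n_procs = 3  # number of proc columns per input row

# What strm.py emits: one (k, v) pair per proc column, in input order.
stream = [(row[0], v) for row in table for v in row[1:]]
values = [v for _, v in stream]


def lead(vals, index, offset):
    """Mimic LEAD(v, offset): the value `offset` rows ahead, or None past the end."""
    target = index + offset
    return vals[target] if target < len(vals) else None


# Keep only the rows keyed by the first name, as the WHERE k = 'A1' filter does,
# and read the other names' values at multiples of n_procs ahead.
transposed = [
    tuple(lead(values, i, n_procs * j) for j in range(len(table)))
    for i, (k, _) in enumerate(stream)
    if k == "A1"
]

for row in transposed:
    print(row)
```

The offset for the j-th name is n_procs * j, so with 60 proc columns the lookups would sit 60, 120, 180, ... rows ahead.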

Leatriceleave answered 11/9, 2013 at 14:23 Comment(1)
3 was just for illustrative purposes. It's more like 60! :) Thanks for your answer. I will test it out and let you know in a few days. Right now, I'm using a map and multiple lateral views to kinda get it halfway there. I'll try this out. Thanks a lot! :) - Dantzler
