How can I add row numbers for rows in PIG or HIVE?
F

8

5

I'm having trouble adding row numbers using Apache Pig. I have a STR_ID column, and I want to add a ROW_NUM column that holds the row number of each STR_ID value.

For example, here is the input:

STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA

How do I get output like this:

   STR_ID    |   ROW_NUM
----------------------------
3D64B18BC842 |     1
BAECEFA8EFB6 |     2
346B13E4E240 |     3
6D8A9D0249B4 |     4
9FD024AA52BA |     5

Answers using Pig or Hive are acceptable. Thank you.

Finished answered 15/2, 2012 at 5:58 Comment(0)
S
3

Facebook posted a number of Hive UDFs, including NumberRows. Depending on your Hive version (0.8, I believe), you may need to add an attribute (stateful=true) to the class.
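
To illustrate why a row-numbering UDF has to be marked stateful: it keeps a counter between calls, so it returns a different value for each row even though it takes no arguments. Here is a minimal plain-Java sketch of that idea. It has no Hive dependencies, and the class name RowSequenceSketch and its methods are hypothetical, not the Facebook code. In a real Hive UDF the class would extend UDF and, as far as I know, the attribute goes in the @UDFType annotation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a stateful row-numbering UDF (not the Facebook class).
// No Hive dependencies: this only shows the counter logic.
class RowSequenceSketch {
    // The "state": a counter that survives between calls.
    private long current = 0;

    // Called once per row; returns 1, 2, 3, ...
    public long next() {
        current += 1;
        return current;
    }

    // Pairs each input value with the next sequence number.
    public static List<String> numberRows(List<String> ids) {
        RowSequenceSketch seq = new RowSequenceSketch();
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            out.add(id + " | " + seq.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("3D64B18BC842", "BAECEFA8EFB6", "346B13E4E240");
        for (String line : numberRows(ids)) {
            System.out.println(line);
        }
    }
}
```

Keep in mind that each instance has its own counter. My understanding is that if Hive runs the query in several tasks, each task gets its own UDF instance, so the numbers are only unique within one task.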

Shastashastra answered 15/2, 2012 at 7:20 Comment(2)
Oh really? Can you give me a link where I can get the UDF? I can upgrade Hive if necessary. Thank you very much for the help! - Finished
Sorry, I hadn't noticed that you already gave the link. Thank you, it's helpful! - Finished
R
5

In Hive:

Query

select str_id,row_number() over() from tabledata;

Output

3D64B18BC842      1
BAECEFA8EFB6      2
346B13E4E240      3
6D8A9D0249B4      4
9FD024AA52BA      5
Riyadh answered 23/6, 2017 at 5:32 Comment(0)
I
2

Pig 0.11 introduced a RANK operator that can be used for this purpose.

Intricacy answered 6/6, 2013 at 20:52 Comment(1)
Yes - you will just need to order by col, rand() if you want to ensure different row numbers are assigned to identical rows. - Morion
K
1

For folks wondering about Pig, I found the best way (currently) is to write your own UDF. I wanted to add row numbers for tuples in a bag. This is the code for that:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Takes a bag and returns a bag in which every tuple has its
// 1-based row number appended as the last field.
public class RowCounter extends EvalFunc<DataBag> {
    private final BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            DataBag bg = (DataBag) input.get(0);
            int count = 1;
            // Note: append() modifies the input tuple in place.
            for (Tuple t : bg) {
                t.append(count);
                output.add(t);
                count++;
            }
            return output;
        } catch (ExecException ee) {
            // error handling goes here
            throw ee;
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            // Declares the result as a bag; the inner schema is left generic.
            Schema bagSchema = new Schema();
            bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));

            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    bagSchema, DataType.BAG));
        } catch (Exception e) {
            // A null schema tells Pig the output schema is unknown.
            return null;
        }
    }
}

This code is for reference only and might not be error-proof.

Khan answered 9/7, 2012 at 20:49 Comment(0)
D
1

Here is a solution, based on my own example.

Step 1. Define the row_sequence() function to generate an auto-incrementing ID

add jar /Users/trongtran/research/hadoop/dev/hive-0.9.0-bin/lib/hive-contrib-0.9.0.jar;
drop temporary function row_sequence;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

Step 2. Insert the unique ID and STR_ID

INSERT OVERWRITE TABLE new_table
SELECT 
    row_sequence(),
    STR_ID
FROM old_table;
Diego answered 3/5, 2013 at 11:8 Comment(0)
J
1

Since version 0.11, Hive supports analytic functions such as LEAD and LAG, and also ROW_NUMBER:

https://issues.apache.org/jira/browse/HIVE-896

Jem answered 22/7, 2014 at 8:59 Comment(0)
M
1

Hive solution -

select *
  ,rank() over (order by rand()) as row_num
  from table

Or, if you want to have rows ascending by STR_ID -

select *
  ,rank() over (order by STR_ID, rand()) as row_num
  from table
Morion answered 15/1, 2015 at 19:7 Comment(1)
This didn't work in Hive 1.2.1.2.3.4.7-4. What version are you using this on? Also, I get the Superman reference in your name. That made me feel good to actually understand something on Stack Overflow. - Sulphuryl
R
1

In Hive:

select
str_id, ROW_NUMBER() OVER() as row_num 
from myTable;
Rhodonite answered 11/11, 2016 at 18:46 Comment(0)
