partition column in hive
I have to partition a table in Hive on a column that is also part of the table.

For example:

Table: employee

Columns: employeeId, employeeName, employeeSalary

I have to partition the table using employeeSalary. So I write the following query:

 CREATE TABLE employee (employeeId INT, employeeName STRING, employeeSalary INT) PARTITIONED BY (ds INT); 

I just used the name "ds" here because Hive didn't allow me to reuse the name employeeSalary.

Is what I am doing right? Also, while inserting values into the table, I have to use a comma-separated file. The file consists of rows like: 2019,John,2000

as one row. If I partition by salary, my first partition would be all the people with salary 2000, so the query would be:

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=2000);

Then, after 100 entries with salary 2000, I have the next 500 entries with salary 4000, so I would again run:

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (ds=4000);

Please let me know if I am right.

Biennial answered 15/3, 2011 at 19:32 Comment(0)

Here's how to create a Hive table partitioned on the column you specified:

CREATE TABLE employee (employeeId INT, employeeName STRING) PARTITIONED BY (employeeSalary INT);

The partition column is specified in the PARTITIONED BY section.
In the Hive shell you can run describe employee; and it will show all the columns in the table. With your original CREATE TABLE (with the extra ds column) you would see 4 columns, not the 3 you are trying to get.
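For instance, with the partitioned CREATE TABLE above, describe employee; lists the partition column alongside the regular columns. The exact output shape varies by Hive version; roughly it looks like:

describe employee;

employeeid      int
employeename    string
employeesalary  int

# Partition Information
employeesalary  int

Note that employeeSalary shows up both as a table column and under the partition information, because a partition column is queryable like any other column even though it is stored in the directory layout rather than in the data files.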

For your load command, note that each LOAD statement writes into a single partition, so you issue one LOAD per partition value. (I'm not very familiar with these, mostly basing this off of http://wiki.apache.org/hadoop/Hive/LanguageManual/DML#Syntax.)

So something like:

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=2000);
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=4000);
Karie answered 15/3, 2011 at 21:10 Comment(3)
Hi, thanks for your reply. So you mean to say that in the upload file there would be just 2 values, employeeId and employeeName?Biennial
Also, for the employeeSalary, I will have to specify the two values in separate queries like: LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=2000); LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE employee PARTITION (employeeSalary=4000);Biennial
There will be 3 columns in the table, but the file you LOAD into a given partition should only contain the 2 non-partition columns per row; the employeeSalary value comes from the PARTITION clause, not from the file. And you will need a separate LOAD statement for each partition value.Karie

Here is how a partitioned table in Hive works:

1) The partition column's value is not stored in the actual data files in the warehouse; instead it is stored in the Hive metastore (and in the partition's directory name).

2) So you should not have the partition column's data inside the data files in the Hive warehouse directory.

For your problem, these should be the steps:

1)

CREATE TABLE employee (employeeId INT, employeeName STRING ) PARTITIONED BY (employeeSalary INT) stored as <your choice of format>;

This creates an entry in the Hive metastore for a table with 2 regular columns (employeeId INT, employeeName STRING) and one partition column (employeeSalary INT).

2) Create a temp table, let's say emp_temp:

CREATE TABLE emp_temp (employeeId INT, employeeName STRING, employeeSalary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

I am assuming your input files are comma-separated text, as in your example row.

3) Copy all your files into the warehouse location of the emp_temp table, OR run the following query (I am assuming you have all your data files in the ./examples/files folder):

LOAD DATA LOCAL INPATH './examples/files/*.txt' OVERWRITE INTO TABLE emp_temp;
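For reference, each row of the input files for emp_temp carries all three columns, matching the emp_temp schema. Taking the row from the question plus a second, made-up one:

2019,John,2000
2020,Jane,4000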

4) Now run the following HQL (this will create the partitions dynamically for you):

INSERT OVERWRITE TABLE employee PARTITION (employeeSalary)
  SELECT employeeId, employeeName, employeeSalary
  FROM emp_temp;
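One caveat: Hive restricts dynamic partitioning by default (strict mode requires at least one static partition column), so a fully dynamic insert like the one above usually needs these session settings first:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Strict mode exists to guard against a typo accidentally creating thousands of partitions, so switch it off only for inserts like this one where dynamic partitioning is intended.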

Thanks, aditya

Sioux answered 12/2, 2018 at 13:17 Comment(0)

I think you should first load all the data into one table, then use the Hive multi-insert extension:

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT          EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] 
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] 
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Then, if you want, you can

from big_data_table
insert overwrite table table1 partition (ds=2000)
  select * where employeeId > 0 and employeeId < 101
insert overwrite table table2 partition (ds=4000)
  select * where employeeId >= 101 and employeeId <= 600;
Fachanan answered 1/8, 2012 at 2:36 Comment(0)
