How can I store data efficiently in Hive, and how do I store and retrieve compressed data in Hive? Currently I am storing it as a TextFile. I was reading Bejoy's article and found that LZO compression is a good fit for storing the files, and that it is also splittable.
I have a HiveQL SELECT query that generates some output, and I store that output so that one of my Hive tables (quality) can use the data and I can then query the quality table.
Below is the quality table, into which I load the data from the SELECT query below, using INSERT OVERWRITE on a partition of quality.
create table quality
(id bigint,
total bigint,
error bigint
)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality'
;
insert overwrite table quality partition (ds='20120709')
SELECT id , count2 , coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;
So currently I am storing it as a TextFile. Should I make this a SequenceFile and start storing the data in LZO-compressed format, or will a text file be fine here as well? The SELECT query produces several GB of data, which needs to be loaded into the quality table on a daily basis.
So which way is best? Should I store the output as a TextFile or as a SequenceFile (with LZO compression), so that queries against the Hive quality table return results faster?
Update:
What if I store it as a SequenceFile with block compression, like below?
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
Do I need to set anything else to enable BLOCK compression apart from the above? I am also creating the table in SequenceFile format.
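To make the question concrete, this is a minimal sketch of the full sequence I have in mind. It assumes that Hive's `hive.exec.compress.output` flag also has to be switched on for INSERT output to be compressed, which is my understanding rather than something I have verified on this cluster. `quality_seq` is a hypothetical table name used only for illustration, and the codec class is simply the one from my settings above; it has to match whatever LZO library is actually installed.

```sql
-- Hypothetical SequenceFile copy of the quality table (name is illustrative only).
CREATE TABLE quality_seq
(id BIGINT,
 total BIGINT,
 error BIGINT
)
PARTITIONED BY (ds STRING)
STORED AS SEQUENCEFILE;

-- Assumption: Hive compresses the output of an INSERT only when this is true.
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
-- BLOCK compresses batches of records together instead of one record at a time.
SET mapred.output.compression.type=BLOCK;
-- Codec class copied from the settings above; it must exist on the cluster's classpath.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- Same daily load as before, now targeting the SequenceFile table.
INSERT OVERWRITE TABLE quality_seq PARTITION (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;
```

As far as I understand, reading the data back should need no extra settings, since a SequenceFile records its codec in the file header and a plain SELECT on the table would decompress it transparently.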
Update again:
Should I create the table like this? Or do other changes need to be made to enable BLOCK compression with a SequenceFile?
create table lipy
( buyer_id bigint,
total_chkout bigint,
total_errpds bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy'
;
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
Or is there something else I need to set? – Susette