Efficiently storing data in Hive
How can I efficiently store data in Hive, and also store and retrieve compressed data in Hive? Currently I am storing it as a TextFile. I was going through Bejoy's article, and I found that LZO compression is a good fit for storing the files and is also splittable.

I have one HiveQL SELECT query that generates some output, and I am storing that output somewhere so that one of my Hive tables (quality) can use that data, so that I can query that quality table.

Below is the quality table, into which I load the output of the SELECT query that follows, using insert overwrite with a partition.

create table quality (
  id bigint,
  total bigint,
  error bigint
)
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality';

insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;

So, given that I am currently storing it as a TextFile, should I switch to a SequenceFile and start storing the data LZO-compressed? Or is a text file fine here as well? The SELECT query produces a few GB of data that needs to be loaded into the quality table on a daily basis.

So which way is best? Should I store the output in TextFile or SequenceFile format (with LZO compression), so that querying the Hive quality table is faster?

Update:

What if I store it as a SequenceFile with BLOCK compression, like below?

set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

Do I need to set anything else to enable BLOCK compression, apart from the above? I am also creating the table in SequenceFile format.

Update Again

Should I create the table like this below, or do I need to make other changes to enable BLOCK compression with SequenceFile?

create table lipy (
  buyer_id bigint,
  total_chkout bigint,
  total_errpds bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy';
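
For reference, here is the full sequence I am planning to run, combining the settings from my first update with this table; source_table below is just a placeholder for whatever produces my daily data, and the partition value is only an example:

set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- load one day's partition (example date); source_table is a placeholder name
insert overwrite table lipy partition (dt='20120712')
select buyer_id, total_chkout, total_errpds from source_table;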
Susette asked 1/8, 2012 at 19:43 Comment(0)

I have not used Hive much, but from experience with Hadoop and structured data I was getting the best performance from SequenceFiles with BLOCK compression. The default is RECORD (per-row) compression, which is not as efficient as BLOCK compression when you store structured data and the rows are not particularly big. To switch it on I used mapred.output.compression.type=BLOCK.
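
In Hive that should boil down to a session like the sketch below. I have mostly used these properties from plain MapReduce jobs, so take it as a sketch; hive.exec.compress.output is Hive's own switch for compressing final output, and the query is the one from your question:

-- enable compressed final output in Hive, then choose type and codec
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- an insert into a table stored as sequencefile now writes block-compressed files
insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;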

Senlac answered 2/8, 2012 at 11:10 Comment(6)
Thanks Alex for the suggestions. So if I need to use SequenceFiles with BLOCK compression, which parameters do I need to set? Are these the ones? set mapred.output.compress=true; set mapred.output.compression.type=BLOCK; set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec; Or do I need to set something else? – Susette
Yes, I use these three with Hadoop 0.20.2 and that is enough. – Senlac
Thanks for the comment. I tried the above three commands and they worked fine. My question now is: if I need to view the file that got compressed with the LzoCodec, what do I do? When I tried vi filename, I got weird characters in the file. So do I need to decompress the file somehow and then view it? If yes, how can I decompress it? Below is the file name I got using ls: /apps/hdmi-technology/b_apdpds/lip-data-quality/dt=20120711/attempt_201207311206_10800_r_000000_0 – Susette
There are two options: 1) if your records are actually strings, you can use "hadoop fs -text /hadoop_file_path > output_file.txt"; 2) SequenceFiles can be accessed from any Java program: just put hadoop-*.jar into your classpath and do not forget the native libraries; for usage examples, look into the source code or #7561015. – Senlac
Thanks Alex for the suggestions, that was helpful. I have one more question: so in my case I should create the table as a SequenceFile, right? Just as I have updated in my question. Can you take a look and let me know whether it is right or not? – Susette
I have not used Hive, but in Apache Pig and in Java Hadoop jobs it was enough to set the properties before executing the script/job. If you look into the source code of SequenceFileOutputFormat, you will find that mapred.output.compression.type is used to set up the SequenceFile.Writer, and the parameter is taken from the job context. To test it, you can set it to RECORD, run the job, then set it to BLOCK. In my case I was getting a noticeable difference in total file sizes. If you see the same, it works. Or write a Java program to read the properties and inspect them. – Senlac
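
In Hive, that RECORD-versus-BLOCK comparison could look roughly like the sketch below; source_table and the partition values are placeholders, and the settings are the ones from the question:

set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- first run: per-record compression
set mapred.output.compression.type=RECORD;
insert overwrite table lipy partition (dt='test_record')
select buyer_id, total_chkout, total_errpds from source_table;

-- second run: block compression
set mapred.output.compression.type=BLOCK;
insert overwrite table lipy partition (dt='test_block')
select buyer_id, total_chkout, total_errpds from source_table;

-- then compare the output sizes of the two partitions, e.g. with:
--   hadoop fs -du /apps/hdmi-technology/lipy/dt=test_record
--   hadoop fs -du /apps/hdmi-technology/lipy/dt=test_block

If the dt='test_block' partition comes out noticeably smaller, the property is being picked up.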
