I've been playing with Hive for few days now but I still have a hard time with partition.
I've been recording Apache logs (Combine format) in Hadoop for few months. They are stored in row text format, partitioned by date (via flume): /logs/yyyy/mm/dd/hh/*
Example:
/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)
The date in the combined log file is following this format [10/Feb/2012:00:00:00 -0800]
How can I create a external table with partition in Hive that use my physical partition. I can't find any good documentation on Hive partition. I found related Question such as:
If I load my logs in an external table with Hive, I cannot partition with the time, since it's not the good format (Feb <=> 02). Even if if it was in a good format how do i transform a string "10/02/2012:00:00:00 -0800" into multiple directory "/2012/02/10/00"?
I could eventually use pig script to convert my raw logs into hive tables but at this point I should just be using pig instead of hive to do my reporting.