Loading data with Hive, S3, EMR, and Recover Partitions
Asked Answered
P

1

11

SOLVED: See Update #2 below for the 'solution' to this issue.

~~~~~~~

In s3, I have some log*.gz files stored in a nested directory structure like:

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I'm attempting to load these into Hive on Elastic Map Reduce (EMR), using a multi-level partition spec like:

create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

Creation of the table works. I then attempt to recover all of the existing partitions:

alter table logs recover partitions;

This seems to work and it does drill down through my s3 structure and add all the various levels of directories:

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08
y=2012/m=11/d=06/h=09
y=2012/m=11/d=06/h=10
y=2012/m=11/d=06/h=11
y=2012/m=11/d=06/h=12
y=2012/m=11/d=06/h=13
y=2012/m=11/d=06/h=14
y=2012/m=11/d=06/h=15
y=2012/m=11/d=06/h=16
...

So it seems that Hive can see and interpret my file layout successfully. However, no actual data ever gets loaded. If I try to do a simple count or select *, I get nothing:

hive> select count(*) from logs;
...
OK
0

hive> select * from logs limit 10;
OK

hive> select * from logs where y = '2012' and m = '11' and d = '06' and h='16' limit 10;
OK

Thoughts? Am I missing some additional command to load data beyond recovering the partitions?

If I manually add a partition with an explicit location, then that works:

alter table logs2 add partition (y='2012', m='11', d='09', h='10') location 's3://($BUCKET)/y=2012/m=11/d=09/H=10/'

I can just write a script to do this, but it feels like I'm missing something fundamental w.r.t 'recover partitions'.

UPDATE #1

Thanks to a brilliant and keen observation by Joe K in a comment below, I think that case sensitivity issues might be involved here.

The files are definitely organized like the following path spec, with a capitalized H (I think this might be some nod to iso8601 formatting):

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I create my external table with a partition spec that does the proper capitalization:

partitioned by (y string, m string, d string, H string)

(Notice the 'H'). I do a recover partitions, which does seem to recurse through the directories and find the partitions appropriately, but somehow (despite using 'H' in all instructive places so far), it indeed seems that Hive saves it as a lower case 'h':

hive> show partitions logs;
OK
y=2012/m=11/d=06/h=08

(Note the 'h'). So it seems that Hive is able to discover the partitions, but then stores them in a lowercase form ... Later when it goes to look for data, these paths are (of course) empty because S3 is case sensitive.

I am going to move my data into an all-lowercase directory structure and see if that works...

UPDATE #2

Indeed, I have confirmed that the capitalized 'H' as a partition name (in the s3 file layout) was the problem here. As far as I can tell, this is what was happening:

  • My layout on S3 had a case-sensitive partition name (H=)
  • Running RECOVER PARTITIONS correctly discovers these partitions...
  • But then they are stored internally as lowercase (h)

The 'recover partitions' command is an extension of Hive authored by Amazon. I strongly suspect that the bug is in this component. To my knowledge native Hive has no concept of exploring a file root for partition discovery...

Pina answered 10/11, 2012 at 3:53 Comment(4)
In case someone else is venturing down this path, here are two very helpful blog posts: pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json and sites.google.com/a/khanacademy.org/forge/technical/data_n/…Pina
One thing I'm noticing is that your hour partition column has different cases in different parts of your post. Could this be part of the problem? I'm not sure which parts of S3/Hive are or aren't case-sensitive. Also, what is the output when you do describe formatted logs2 partition (y='2012', m='11', d='06', h='08')?Inwrought
Joe! Thank you! I think you are on to something. I am definitely now seeing behavior that is symptomatic of case-sensitivity issues. I am going to update my question with more info.Pina
Joe - I owe you the Accept on this. Please add an answer about case sensitivity and I'll checkmark it. Thanks again.Pina
I
10

It's a case issue on the hour field!

Inwrought answered 13/11, 2012 at 21:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.