Hive create table with inputs from nested sub-directories
I have data in Avro format in HDFS in file paths like: /data/logs/[foldername]/[filename].avro. I want to create a Hive table over all these log files, i.e. all files of the form /data/logs/*/*. (They're all based on the same Avro schema.)

I'm running the query below with the flag mapred.input.dir.recursive=true:

CREATE EXTERNAL TABLE default.testtable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://.../data/*/*'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://.../schema.avsc') 

The table ends up being empty unless I change LOCATION to a less nested path, i.e. to 'hdfs://.../data/[foldername]/' with a specific foldername. With that less nested LOCATION the table loads without a problem.

I'd like to be able to source data from all these different [foldername] folders. How do I make the recursive input selection go further in my nested directories?

Dongdonga answered 26/6, 2014 at 18:59 Comment(0)

Use these Hive settings to enable recursive directory reads:

set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;

Create the external table and specify the root directory as its location:

LOCATION 'hdfs://.../data'

You will then be able to query data from the table's location and all of its subdirectories.
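Putting the pieces together, a full session might look like the following sketch. It reuses the table name, SerDe, and (elided) paths from the question; note that LOCATION points at the root data directory rather than a wildcard path, since the recursive settings handle the subdirectories:

```sql
-- Allow Hive/MapReduce to descend into subdirectories under the table location
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

-- LOCATION is the root directory, not a wildcard like /data/logs/*/*
CREATE EXTERNAL TABLE default.testtable
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://.../data/logs'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://.../schema.avsc');

-- Rows from every /data/logs/[foldername]/[filename].avro are now visible
SELECT COUNT(*) FROM default.testtable;
```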

Coverup answered 5/5, 2017 at 12:27 Comment(4)
hive.input.dir.recursive? hive.supports.subdirectories? It seems you have copied these from other (wrong) answers. I suggest doing some research and testing. – Barometrograph
@Dudu Markovitz: I have tested this on Hive 1.2.1 and it works great. Hive supports subdirectories. Maybe not all of these settings are necessary, but this works for me. – Coverup
It is not just the unnecessary parameters, which make this answer bad by itself; it is non-existent parameters. Furthermore, using non-existent parameters throws exceptions since Hive 0.14 when hive.conf.validation is set to true, which is the default. issues.apache.org/jira/browse/HIVE-7211 – Barometrograph
I probably don't have hive.conf.validation set in my Hive config. I tested on 1.2.1 on AWS. Thanks for pointing me to this parameter; I will check my configuration. – Coverup

One thing that would solve your problem is adding the folder name as a partition column of the external table. Then you can create the table pointing just at the data directory. Alternatively, you can flatten these nested files into a single directory.

I don't think you'll otherwise be able to ask Hive to treat the input from all these folders as one table.
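The partition approach above might be sketched as follows. The SerDe and elided paths are taken from the question; the table name and the example partition value '2014-06-25' are hypothetical, and each [foldername] subdirectory would be registered the same way:

```sql
-- External table partitioned by the folder name
CREATE EXTERNAL TABLE default.testtable_partitioned
  PARTITIONED BY (foldername STRING)
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://.../data/logs'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://.../schema.avsc');

-- Register each existing subdirectory as a partition
-- ('2014-06-25' is a hypothetical example folder name)
ALTER TABLE default.testtable_partitioned
  ADD PARTITION (foldername='2014-06-25')
  LOCATION 'hdfs://.../data/logs/2014-06-25';
```

Queries can then filter on the foldername column, and Hive will read only the matching subdirectories.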

This question seems to address a similar issue: when creating an external table in Hive, can I point the location to specific files in a directory?

There is an open JIRA issue on the same topic: https://issues.apache.org/jira/browse/HIVE-951

Browsing further, I saw a post suggesting SymlinkTextInputFormat as an alternative. I am not sure how well this would work with your Avro format. https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html

Lehrer answered 11/12, 2014 at 3:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.