Create Hive external table from partitioned Parquet files in Azure HDInsight

I have data saved as Parquet files in Azure Blob storage. The data is partitioned by year, month, day and hour, like:

cont/data/year=2017/month=02/day=01/

I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference.

CREATE EXTERNAL TABLE table_name (uid string, title string, value string) 
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET 
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';

This creates the table, but queries against it return no rows. I tried the same CREATE statement without the PARTITIONED BY clause and that seems to work, so it looks like the issue is with partitioning.

What am I missing?

Pauperism answered 11/4, 2017 at 12:46 Comment(0)

After you create the partitioned table, run the following to register the existing directories as partitions in the metastore:

MSCK REPAIR TABLE table_name;

If you have a large number of partitions, you might need to set hive.msck.repair.batch.size. From the Hive documentation:

When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)


Written by the OP:

This will probably fix your issue; however, if the data is very large, it won't work. See relevant issue here.

As a workaround, you can add partitions to the Hive metastore one by one, like:

alter table table_name add partition(year=2016, month=10, day=11, hour=11);

We wrote a simple script to automate this ALTER statement, and it seems to work for now.
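The OP's script is not shown, so the following is only a rough sketch of what such automation might look like, not the script they used. It assumes you already have a list of partition directories relative to the table location (how you list them from blob storage is left out), and the function names `partition_spec_from_path` and `alter_statement` are made up for illustration. It emits an explicit LOCATION for each partition, on the assumption that an int partition value such as month=2 would otherwise not map back to a directory named month=02.

```python
import re

# Matches one Hive-style partition directory segment such as "year=2017".
_SEGMENT = re.compile(r"^(\w+)=(\d+)$")


def partition_spec_from_path(path):
    """Return [(column, int_value), ...] for every key=value segment in path."""
    spec = []
    for segment in path.strip("/").split("/"):
        match = _SEGMENT.match(segment)
        if match:
            spec.append((match.group(1), int(match.group(2))))
    return spec


def alter_statement(table, base_location, relative_path):
    """Build one ALTER TABLE ... ADD PARTITION statement for a directory."""
    spec = partition_spec_from_path(relative_path)
    if not spec:
        raise ValueError("no key=value segments in %r" % relative_path)
    columns = ", ".join("%s=%d" % (name, value) for name, value in spec)
    # Point the partition at the directory exactly as it is named on storage.
    location = base_location.rstrip("/") + "/" + relative_path.strip("/")
    return "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (%s) LOCATION '%s';" % (
        table, columns, location)


if __name__ == "__main__":
    # Hypothetical directory list; in practice it would come from a storage listing.
    base = "wasb://cont@storage_name.blob.core.windows.net/data"
    for directory in ["year=2017/month=02/day=01", "year=2017/month=02/day=02"]:
        print(alter_statement("table_name", base, directory))
```

The printed statements can be saved to a file and run with the Hive CLI or Beeline. IF NOT EXISTS makes the script safe to re-run when some partitions are already registered. The partition columns in each statement have to match the PARTITIONED BY clause of the table.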

Mersey answered 11/4, 2017 at 15:50 Comment(6)
Thanks for the answer. I just found that statement in one of the docs. However, I am getting this error when running it: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. So it looks like there is some other issue. - Pauperism
@Pauperism It might be caused by the metastore schema verification; please refer to cwiki.apache.org/confluence/display/Hive/… to check whether the property hive.metastore.schema.verification is true in hive-site.xml. Or it might be caused by SQL Azure locking or denying some operations. - Pape
@Pauperism - not enough information. Check the logs or run in DEBUG mode. - Bryner
Try changing the logging level to DEBUG. - Bryner
Have you tried setting hive.msck.repair.batch.size? - Bryner
Setting hive.msck.repair.batch.size is available only in a newer version (2.2.0) of Hive, which HDInsight doesn't support yet. Even then there seems to be some issue, which I linked in the updated answer. - Pauperism
