Confusion with the external tables in hive
Asked Answered
M

1

3

I have created the hive external table using below command:

 use hive2;

create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," 
stored as textfile location '/dataDir/';  

Now, when I view the HDFS I can see the db but there is no depTable inside the warehouse.

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
[cloudera@quickstart ~]$ 

Above you can see that there is no table created in this DB. As far as I know, external tables are not stored in the hive warehouse.So am I correct ?? If yes then where is it stored ??
But if I create external table first and then load the data then I am able to see the file inside hive2.db.

hive> create external table depTable (depId int comment 'This is the unique id for each dep', depName string,location string) comment 'department table' row format delimited fields terminated by "," stored as textfile;
OK
Time taken: 0.056 seconds
hive> load data inpath '/dataDir/department_data.txt' into table depTable;
Loading data to table default.deptable
Table default.deptable stats: [numFiles=1, totalSize=90]
OK
Time taken: 0.28 seconds
hive> select * from deptable;
OK
1001    FINANCE SYDNEY
2001    AUDIT   MELBOURNE
3001    MARKETING   PERTH
4001    PRODUCTION  BRISBANE

Now, if I fire the hadoop fs query I can see this table under database as below:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2019-01-17 09:07 /user/hive/warehouse/hive2.db/deptable  

If I delete the table still I am able to see table in the HDFS as below:

[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/hive2.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2019-01-17 09:11 /user/hive/warehouse/hive2.db/deptable  

So, what is the exact behavior of the external tables ?? When I create using LOCATION keyword where does it get stored and when I create using load statement why it is getting stored in the HDFS and after deleting why it doesn't get deleted.

Mccutchen answered 17/1, 2019 at 17:15 Comment(0)
H
4

The main difference between EXTERNAL and MANAGED tables is in Drop table/partition behavior. When you drop MANAGED table/partition, the location with data files also removed. When you drop EXTERNAL table, the location with data files remains as is.

UPDATE: TBLPROPERTIES ("external.table.purge"="true") in release 4.0.0+ (HIVE-19981) when set on external table would delete the data as well.

EXTERNAL table as well as MANAGED is being stored in the location specified in DDL. You can create table on top of existing location with data files already in the location and it will work for both EXTERNAL or MANAGED, does not matter.

You even can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://mcmap.net/q/197345/-how-to-create-n-number-of-external-tables-with-a-single-hdfs-path-using-hive

If you specified location, the data will be stored in that location for both types of tables. If you did not specify location, the data will be in default location: /user/hive/warehouse/database_name.db/table_name for both managed and external tables.

Update: Also there can be some restrictions on location depending on platform/vendor, see https://mcmap.net/q/1924294/-unable-to-create-managed-hive-table-after-hortonworks-hdp-to-cloudera-cdp-migration, you may not be allowed to create manged/external tables outside their default allowed root location.

See also official Hive docs on Managed vs External Tables

Handle answered 17/1, 2019 at 18:52 Comment(10)
I created the external table pointing towards the data in HDFS but, I can't see it in warehouse folder . So my question is where does the external table gets stored ?? and if I create external table using load command it behaves as managed table. Can you help me here ?Mccutchen
Your answer didn't help me. I know about the basic concepts and all but had some doubts regarding the storage in HDFS. I hope I am loud and clearMccutchen
@Mccutchen What is the location '/dataDir/' means ? It should be HDFS dir in both cases. Iocal dir cannot be table locationHandle
yeah I know that but my question is different. Why I am not able to see existing external table in the warehouse?? Is there some different directory to see it ??Mccutchen
@Mccutchen create external table does not create folder, only modifies metadata. After you loaded file, it appeared in HDFSHandle
Okay but when I create external table using load command and then I try to delete it, it gets deleted from hive but in HDFS warehouse it still shows the table under warehouse folder . why is it so ?? Could you please try it on my behalf ??Mccutchen
@Mccutchen I already explained this in my answer. Because it is external table. You created external table and loaded data. Then you deleted table and data remain untouched- this how external table worksHandle
Then why it gets deleted from hive since it is an external table it shouldn't get deletedMccutchen
Only table is deleted (why shouldn't it? It should if you do drop table), data remains untouched in hdfs. Hive table - is a metastore information about table schema, location with data, statistics, table type=External, etc. Data is in HDFS. When you delete External table in Hive, only table get deleted, not dataHandle
Ok understood finally . Thanks for answering my stupid questions, mate :)Mccutchen

© 2022 - 2024 — McMap. All rights reserved.