I could not understand the difference between the following options in Sqoop. It would help if someone could explain with small examples.
--warehouse-dir and --target-dir
Thanks
The parameter below can be pointed at the default Hive table location. It can be used for development, where you just want to run some tests on internal (managed) tables.
--warehouse-dir
The parameter below points to an HDFS location of your choice, on top of which you can create external Hive tables. This is useful in a production environment, where you want all data to live in an external directory backed by an external table.
--target-dir
As far as I understand, in the case of import:
--warehouse-dir: creates a directory that acts as the database directory (sqoop_db_movies). A directory named after the table (as given in the import command) is created automatically inside it and holds the imported files.
Example: sqoop import --options-file /home/cloudera/sqoop/conn --table movies --warehouse-dir /sqoop_db_movies -m 1
Output as:
/sqoop_db_movies/movies
/sqoop_db_movies/movies/_SUCCESS
/sqoop_db_movies/movies/part-m-00000
--target-dir: creates a directory that acts as the table directory (sqoop_table_movies) and holds the imported files directly.
Example: sqoop import --options-file /home/cloudera/sqoop/conn --table movies --target-dir /sqoop_table_movies -m 1
Output as:
/sqoop_table_movies/_SUCCESS
/sqoop_table_movies/part-m-00000
--warehouse-dir
You generally use this option when importing all tables with Sqoop's import-all-tables tool. The directory can be anything: your Hive /data/warehouse directory or some other parent directory. All the tables are imported under this parent directory.
--target-dir
This option is used when you import a single table with the import tool. You have to specify a directory for each table, and that directory must not already exist.
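To make the contrast concrete, here is a small dry-run sketch that only prints the commands rather than running them. The table names (movies, ratings) and the /imports parent directory are hypothetical; the options file path is borrowed from the example earlier in this thread.

```shell
#!/bin/sh
# Dry run: prints Sqoop commands instead of executing them.
# Table names and /imports are hypothetical placeholders.
OPTIONS_FILE=/home/cloudera/sqoop/conn

# One --target-dir per table: the directory changes on every invocation.
per_table_commands() {
    for table in "$@"; do
        echo "sqoop import --options-file $OPTIONS_FILE --table $table --target-dir /imports/$table -m 1"
    done
}

# One shared parent: Sqoop creates a subdirectory per table beneath it.
all_tables_command() {
    echo "sqoop import-all-tables --options-file $OPTIONS_FILE --warehouse-dir $1 -m 1"
}

per_table_commands movies ratings
all_tables_command /imports
```

Copy a printed line into a terminal once the paths match your cluster.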
If you want to run multiple Sqoop jobs for multiple tables, you will need to change the --target-dir parameter with every invocation.
As an alternative, Sqoop offers another parameter by which to select the output directory. Instead of directly specifying the final directory, the parameter --warehouse-dir allows you to specify only the parent directory.
Rather than writing data into the warehouse directory, Sqoop will create a directory with the same name as the table inside the warehouse directory and import data there.
This is similar to the default case where Sqoop imports data to your home directory on HDFS, with the notable exception that the --warehouse-dir parameter allows you to use a directory other than the home directory. Note that this parameter does not need to change with every table import unless you are importing tables with the same name.
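The rule described above can be modelled with a tiny shell function. This is only an illustration of the naming rule, not Sqoop's own code: the function name sqoop_dest_dir and the HDFS home directory /user/cloudera are assumptions. The Sqoop user guide describes --target-dir and --warehouse-dir as incompatible, so the sketch refuses that combination.

```shell
#!/bin/sh
# Illustrative model of where an import ends up; not Sqoop's actual code.
# HDFS_HOME is an assumed home directory on HDFS.
HDFS_HOME="/user/cloudera"

# Usage: sqoop_dest_dir TABLE [TARGET_DIR] [WAREHOUSE_DIR]
sqoop_dest_dir() {
    table=$1
    target_dir=$2
    warehouse_dir=$3
    if [ -n "$target_dir" ] && [ -n "$warehouse_dir" ]; then
        # The two options are documented as incompatible.
        echo "use either --target-dir or --warehouse-dir, not both" >&2
        return 1
    elif [ -n "$target_dir" ]; then
        # --target-dir names the final directory itself.
        printf '%s\n' "$target_dir"
    elif [ -n "$warehouse_dir" ]; then
        # --warehouse-dir is only the parent; the table name is appended.
        printf '%s/%s\n' "${warehouse_dir%/}" "$table"
    else
        # Default: a directory named after the table under the HDFS home.
        printf '%s/%s\n' "$HDFS_HOME" "$table"
    fi
}

sqoop_dest_dir movies /sqoop_table_movies ""
sqoop_dest_dir movies "" /sqoop_db_movies
sqoop_dest_dir movies "" ""
```

The three calls correspond to --target-dir, --warehouse-dir, and neither option being given.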
--warehouse-dir points to the Hive folder to import data into (I've used it when importing tables wholesale), while --target-dir is needed when importing into Hive via a query (Sqoop errors out asking for it). In the latter scenario it serves as a temporary staging area for the mappers' output, which is then moved into Hive with LOAD DATA INPATH. After switching from a whole-table import to a query import, I was setting --target-dir to the same path as --warehouse-dir and was getting empty tables. I removed --warehouse-dir from the sqoop command and changed --target-dir to /tmp/newfolder, and my imports into Hive worked.
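A query import of that shape might look roughly like the command printed by this dry-run sketch. The query text and the Hive table name are hypothetical, the options file path is borrowed from the earlier example in this thread, and /tmp/newfolder is the staging path mentioned above. With -m 1 no --split-by column is needed.

```shell
#!/bin/sh
# Dry run: builds and prints a free-form query import into Hive.
# The query and the Hive table name are hypothetical placeholders.
OPTIONS_FILE=/home/cloudera/sqoop/conn
STAGING_DIR=/tmp/newfolder

query_import_command() {
    # $CONDITIONS must reach Sqoop literally, hence the single quotes
    # in the printed command and the escaped dollar sign here.
    printf '%s\n' "sqoop import --options-file $OPTIONS_FILE \
--query 'SELECT * FROM movies WHERE \$CONDITIONS' \
--target-dir $STAGING_DIR \
--hive-import --hive-table movies -m 1"
}

query_import_command
```

Note that the printed command carries no --warehouse-dir, matching the fix described above.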
We generally use --warehouse-dir; it works fine for both single-table and multi-table imports.
Another advantage: in our experience only --warehouse-dir works with S3, which matters when you want the data behind external tables to be stored in S3.