Questions about Oozie/Sqoop
I have a few questions:

1. Why is there a MapReduce process in Sqoop when loading data from HDFS to MySQL?

e.g.

The data is in HDFS in the directory /foo/bar.

To load the data into a MySQL table, why does a MapReduce process run?

sqoop export --connect jdbc:mysql://localhost/hduser --table foo -m 1 --export-dir /foo/bar

After entering the above command, a MapReduce job executes.

2. How can I enable/disable keys in MySQL using Sqoop/Oozie?

Since a huge amount of data is being loaded into MySQL, we want to disable keys before the load and re-enable them afterwards. How do I achieve this?

3. How to run multiple Oozie jobs in parallel? 

4. How do I run Oozie jobs on a cron-like schedule?

Feel free to answer one or more of the questions.

Thank you.

Cozza answered 7/4, 2014 at 18:39 Comment(0)

I'll go through your questions one by one. Feel free to ask more questions in the comments and I will elaborate on the things that are unclear to you.

1. Why is there a MapReduce process in Sqoop when loading data from HDFS to MySQL?

This is because Sqoop is built on MapReduce. Consider how files are stored in HDFS: they are split into blocks, and these blocks are distributed across the cluster (some blocks may end up on the same node). So it makes perfect sense to run a MapReduce job whose map tasks read these blocks in parallel and write the rows to MySQL.
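As a sketch of how this parallelism is controlled: the -m (alias for --num-mappers) flag in the question's own command sets the number of map tasks. With -m 1 the export runs as a single map task; raising it splits the export-dir files across several map tasks, each writing to MySQL over its own JDBC connection. Connection string, table, and path below are taken from the question:

    # Export /foo/bar into MySQL with 4 parallel map tasks instead of 1.
    # Each map task reads a subset of the HDFS files and inserts rows via JDBC.
    sqoop export \
      --connect jdbc:mysql://localhost/hduser \
      --table foo \
      --export-dir /foo/bar \
      -m 4

Note that more mappers means more concurrent connections to MySQL, so the useful upper limit depends on what the database can absorb.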

2. How can I enable/disable keys in MySQL using Sqoop/Oozie?

I don't know the answer to this one, and the question seems a little ambiguous to me. Please try adding some more details; if I find something, I'll update this answer.

3. How to run multiple Oozie jobs in parallel?

Each Oozie job is defined by a workflow.xml and a job.properties.

  • If you're talking about manually executing multiple Oozie workflows (jobs), simply run the start command once for each job you want to run in parallel. Sample command: oozie job -config job.properties -run

  • If you're talking about running multiple actions within one Oozie workflow in parallel, you can use a fork to launch multiple actions in parallel, and a join node where the parallel paths meet on completion. Example:

    <fork name="sampleFork">
       <path start="sampleAction1"/>
       <path start="sampleAction2"/>
    </fork>

    <action name="sampleAction1">
      ..
      <ok to="joinActions"/>
      <error to="fail"/>
    </action>

    <action name="sampleAction2">
      ..
      <ok to="joinActions"/>
      <error to="fail"/>
    </action>

    <join name="joinActions" to="seqAction3"/>
    

4. How do I run Oozie jobs on a cron-like schedule?

If you want to automate the execution of Oozie jobs, I suggest you look into the Oozie coordinator. Using a coordinator, you can schedule workflows to trigger at a fixed interval (every 10 minutes, hourly, daily, etc.).
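As a rough sketch of what that looks like (the app name, date window, and workflow path below are made-up placeholders, not from the question), a minimal coordinator definition that fires an existing workflow once a day could be:

    <!-- coordinator.xml: triggers the workflow at app-path once a day.
         Name, start/end, and app-path are hypothetical placeholders. -->
    <coordinator-app name="sample-coord" frequency="${coord:days(1)}"
                     start="2014-04-12T00:00Z" end="2015-04-12T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <app-path>${nameNode}/apps/sample-wf</app-path>
            </workflow>
        </action>
    </coordinator-app>

You submit it the same way as a workflow (oozie job -config job.properties -run), with the job.properties pointing oozie.coord.application.path at the HDFS directory containing this coordinator.xml.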

Baler answered 11/4, 2014 at 22:28 Comment(5)
Hi, thank you for sharing your thoughts. Regarding #2: during our ETL process we disable keys, load the data, and then re-enable keys. How do I achieve the following in Sqoop? alter table table_name disable keys; ... load data into the table ... alter table table_name enable keys;Cozza
Another question: can I have a Hive action and a Sqoop action together?Cozza
What do you mean by together?Baler
Can I have one workflow.xml file containing both a Hive action and a Sqoop action? I am using the Hive action to load data into HDFS and the Sqoop action to load data from HDFS to MySQL.Cozza
Yes, you can. A workflow can contain many actions, and each one can be of any action type.Baler
