How to find the optimal number of mappers when running Sqoop import and export?

I'm using Sqoop version 1.4.2 and Oracle database.

When running a Sqoop command, for example like this:

./sqoop import                               \
    --fs <name node>                         \
    --jt <job tracker>                       \
    --connect <JDBC string>                  \
    --username <user> --password <password>  \
    --table <table> --split-by <cool column> \
    --target-dir <where>                     \
    --verbose --m 2

We can specify --m, the number of parallel tasks we want Sqoop to run (each of which may be hitting the database at the same time). The same option is available for ./sqoop export <...>
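For reference, a minimal export sketch (the table, directory, and connection values below are placeholders, not values from my actual job); --num-mappers is the long form of the same flag:

./sqoop export                               \
    --connect <JDBC string>                  \
    --username <user> --password <password>  \
    --table <table>                          \
    --export-dir <HDFS dir with the data>    \
    --num-mappers 2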

Is there some heuristic (probably based on the size of the data) that will help guess the optimal number of tasks to use?

Thank you!

Anstus answered 17/5, 2013 at 22:23 Comment(2)
No, it depends on the number of CPUs/cores your database server has, the amount of disk access each task will require, the speed of those disks, how much of each task is being performed in RAM, the amount of RAM, the amount of extra temporary tablespace being taken up by what you cannot store in RAM, the filesystem you're using, the amount of RAM assigned to the OS as opposed to the database, potentially the size of your switches and network cables and the number of additional processes being run against the database and/or server and how all the above factors affect them, etc. Test it.Zapata
@Zapata - I'd submit that as the answerPalmira
6

This is taken from the Apache Sqoop Cookbook (O'Reilly Media) and seems to be the most logical answer.

The optimal number of mappers depends on many variables: you need to take into account your database type, the hardware that is used for your database server, and the impact to other requests that your database needs to serve. There is no optimal number of mappers that works for all scenarios. Instead, you’re encouraged to experiment to find the optimal degree of parallelism for your environment and use case. It’s a good idea to start with a small number of mappers, slowly ramping up, rather than to start with a large number of mappers, working your way down.
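A rough way to run that experiment (a sketch only; the connection details are placeholders and the target directories are made up) is to time the same import at increasing parallelism and watch both the job duration and the load on the database:

for m in 1 2 4 8; do
    echo "=== mappers: $m ==="
    # one target dir per run, so --target-dir never already exists
    time ./sqoop import                          \
        --connect <JDBC string>                  \
        --username <user> --password <password>  \
        --table <table> --split-by <cool column> \
        --target-dir /tmp/sqoop_mapper_test_$m   \
        --num-mappers $m
done

Stop ramping up as soon as the wall-clock time stops improving or the database starts to struggle.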

Unrelenting answered 31/1, 2014 at 20:17 Comment(0)
0

In "Hadoop: The Definitive Guide," they explain that when setting up your maximum map/reduce task on each Tasktracker consider the processor and its cores to define the number of tasks for your cluster, so I would apply the same logic to this and take a look at how many processes you can run on your processor(s) (Counting HyperTreading, Cores) and set your --m to this value - 1 (leave one open for other tasks that may pop up during the export) BUT this is only if you have a large dataset and want to get the export done in a timely manner.

If you don't have a large dataset, remember that your output will consist of --m files, so if you are exporting a 100-row table you may want to set --m to 1 to keep all the data localized in one file.
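As an illustration (placeholders again, not actual values from the question), a single-mapper import of a tiny table writes one output file:

./sqoop import                               \
    --connect <JDBC string>                  \
    --username <user> --password <password>  \
    --table <small table>                    \
    --target-dir <where>                     \
    -m 1

With -m 1 Sqoop does not need to split the data (no --split-by required), and the result is a single part-m-00000 file under <where>.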

Hostility answered 18/5, 2013 at 1:41 Comment(5)
If you are going to downvote, please leave me constructive criticism so that I may improve my answer.Hostility
That's an answer to a different question. I did not downvote it. But the problem with Sqoop is that it hits the database from each mapper. So if I have 30 machines with, let's say, 2 mapper slots each and I use -m 60, the database will be very unhappy with it :)Anstus
You wouldn't set it to 60. You would set it to 2... since this setting would apply to each machine in your cluster. So each machine would be using two mappers for a total of 60 mappers deployed, depending on whether the sqoop api takes this as a suggestion or a hard setting.Hostility
@Engineiro, are you sure that this setting applies to each data node in the cluster? We just tried an example on a 7 node cluster with mappers set to 15, but we ended up with 15 files, not 105 files (7 * 15).Mighell
Nope, got that completely wrong: sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html, "7.2.4. Controlling Parallelism". We don't control how many mappers each machine runs, just how many mappers we deploy in total. Thanks for questioning @DaveMorrisHostility
