Sqoop - Data splitting

Asked 5/5, 2016 at 4:27 Answered 25/11, 2016 at 12:5

Sqoop able to import data from multiple tables using --query clause but not clear whether it is able to import below query.

Select deptid, avg(salary) from emp group by deptid

Another question is

sqoop import --connect jdbc:mysql://myserver:1202/ --username=u1 --password=p1 --query 'Select * from emp where empid< 1123 and $CONDITIONS' --split-by empid --target-dir /uname/emp/salary

$CONDITIONS and split-by are used to perform parallel processing or we can say efficiently importing data. Former split the rows based on condition and later use min and max logic on primary key. What is the difference between these two ($CONDITIONS, split-by). If we use both in same sqoop statement, which clause would got the priority?

Thanks....

Turnout answered 5/5, 2016 at 4:27 Comment(0)

There is some gap in your understanding.

First of all, the degree of parallelism is controlled by -m <n> or --num-mappers <n>. By default value of --num-mappers is 4.

Second, --split-by <column-name>, will split your task on the basis of column-name.

Third, $CONDITIONS, it is used internally by sqoop to achieve this splitting task.

Example, You fired a query:

sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

Say, my empId is uniformly distributed from 1- 100.

Now, sqoop will take --split-by column and find its max and min value using query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

See it replaced $CONDITIONS with (1 = 1).

In our case, min, max values are 1 and 100.

As number of mappers are 4, sqoop will divide my query in 4 parts.

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'

Now $CONDITIONS will again come into the picture. It is replaced by above range queries.

First mapper will fire query like this:

Select * From emp WHERE empId >= 25' AND 'empId < 50

and so on for other 3 mappers.

Results from all the mappers is aggregated and written to a final HDFS directory.

Regarding your query :

select deptid, avg(salary) from emp group by deptid

you will specify

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

It will be first converted to

select deptid, avg(salary) from emp group by deptid where (1 = 0)

to fetch column metadata.

I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)) directly in Mysql.

So you will not be able to use this query to fetch data using Sqoop.

Sqoop is used for simpler SQL queries.

Mebane answered 5/5, 2016 at 10:38 Comment(6)

Thanks Dev...Please ans my 1 ques also. If I understood correctly and query has condition like empid<200 and no of mappers are 4, it split the data in (1-49),(50-99),(100-149),(150-200) all inclusive.. what if, primary key is not uniformly distributed and there is no such key in the table. Will it use the concept of sampling in that case? – Turnout 5/5, 2016 at 11:1

@Turnout you can give any column in --split-by (not necessarily primary key). But non-string columns are recommended. – Mebane 5/5, 2016 at 13:15

Thanks Dev....Really helpful. Can we import data Select deptid, avg(salary) from emp group by deptid using this statement? How sqoop equally distribute work to all mappers in case value is not uniform? – Turnout 5/5, 2016 at 14:17

@Turnout It will simply divide query in parts (as per the number of mappers). It may possible some mapper does not carry any data. To get good efficiency, split by column should be uniform. – Mebane 5/5, 2016 at 15:44

Understood Sqoop supports simpler query but how sqoop decide where to add (1=1) condition and where (1=0) condition.. – Turnout 6/5, 2016 at 2:14

@Turnout it's simple when sqoop wants to fetch metadata, it fires queries like select deptid, avg(salary) from emp group by deptid where (1 = 0) here where 1=0 is false so no records are fetched but only metadata. When it needs to fetch all the records it puts 1=1 which is true so all the records are fetched – Mebane 6/5, 2016 at 2:38

Your query, 'Select * from emp where empid< 1123 and $CONDITIONS', works perfectly well to import data from RDBMS to HDFS.

Even though $CONDITIONS first evaluates to 1=0 in where clause, if you see the Sqoop import log on console, you will actually see another SQL Val boundary query that is replacing $CONDITIONS with 1=1 which will pass the query and hence the data can be imported.

Note that Sqoop CAN import even medium complex SQL queries like joins. I am not sure if it can support highly complex SQL queries though as I have not tested it myself.

Polo answered 25/11, 2016 at 12:5 Comment(0)

Recommended topics

Hot tags