As the GitHub page of Tez says, Tez is very simple and at its heart has just two components:
The data-processing pipeline engine, and
A master for the data-processing application, whereby one can put together the arbitrary data-processing 'tasks' described above into a task-DAG
My first question is: how are existing MapReduce jobs, like the wordcount in tez-examples.jar, converted into a task-DAG? Where does that conversion happen? Or are they not converted at all?
My second and more important question is about this part:
Every 'task' in tez has the following:
- Input to consume key/value pairs from.
- Processor to process them.
- Output to collect the processed key/value pairs.
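To make my mental model concrete, here is a minimal sketch (plain Python, not the real Tez API; all class and function names are my own invention) of how I currently picture the Input/Processor/Output model, with two chained tasks forming a tiny wordcount "DAG":

```python
# Conceptual sketch only -- NOT the Tez API. Names are invented to illustrate
# the Input -> Processor -> Output model quoted above.

class Task:
    """A 'task': consumes key/value pairs, processes them, collects output."""
    def __init__(self, processor):
        # processor: iterable of (key, value) -> iterable of (key, value)
        self.processor = processor

    def run(self, kv_pairs):
        return list(self.processor(kv_pairs))

def tokenize(kv_pairs):
    # Map-like processor: (offset, line) -> (word, 1)
    for _, line in kv_pairs:
        for word in line.split():
            yield (word, 1)

def count(kv_pairs):
    # Reduce-like processor: (word, 1) -> (word, total)
    totals = {}
    for word, n in kv_pairs:
        totals[word] = totals.get(word, 0) + n
    return totals.items()

# A two-vertex "DAG": the output of the tokenize task feeds the count task.
lines = [(0, "a b a"), (1, "b c")]
mapped = Task(tokenize).run(lines)
counted = dict(Task(count).run(mapped))
print(counted)  # {'a': 2, 'b': 2, 'c': 1}
```

What I cannot tell from the documentation is which layer decides how `lines` gets partitioned across parallel task instances in the first place, which is exactly my question below.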
Who is in charge of splitting the input data between the Tez tasks? Is it the code that the user provides, YARN (the resource manager), or Tez itself?
The same question applies to the output phase. Thanks in advance.