Hive explain plan understanding

I will try to explain a little of what I know.

The execution plan is a description of the tasks required for a query, the order in which they will be executed, and some details about each task. To see the execution plan for a query, prefix the query with the keyword EXPLAIN and run it. Execution plans can be long and complex, and fully understanding them requires a deep knowledge of MapReduce.

Example

EXPLAIN CREATE TABLE flights_by_carrier AS 
SELECT carrier, COUNT(flight) AS num 
FROM flights 
GROUP BY carrier;

This query is a CTAS statement that creates a new table named flights_by_carrier and populates it with the result of a SELECT query. The SELECT query groups the rows of the flights table by carrier and returns each carrier and the number of flights for that carrier.

Hive's output for the EXPLAIN statement in the example is shown here:

+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|   Stage-3 depends on stages: Stage-0               |
|   Stage-2 depends on stages: Stage-3               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: flights                         |
|             Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: carrier (type: string), flight (type: smallint) |
|               outputColumnNames: carrier, flight   |
|               Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(flight)        |
|                 keys: carrier (type: string)       |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: string) |
|                   Statistics: Num rows: 61392822 Data size: 962183360 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           keys: KEY._col0 (type: string)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 30696411 Data size: 481091680 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.TextInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                 name: fly.flights_by_carrier       |
|                                                    |
|   Stage: Stage-0                                   |
|     Move Operator                                  |
|       files:                                       |
|           hdfs directory: true                     |
|           destination: hdfs://localhost:8020/user/hive/warehouse/fly.db/flights_by_carrier |
|                                                    |
|   Stage: Stage-3                                   |
|       Create Table Operator:                       |
|         Create Table                               |
|           columns: carrier string, num bigint      |
|           input format: org.apache.hadoop.mapred.TextInputFormat |
|           output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat |
|           serde name: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           name: fly.flights_by_carrier             |
|                                                    |
|   Stage: Stage-2                                   |
|     Stats-Aggr Operator                            |
|                                                    |
+----------------------------------------------------+--+

Stage Dependencies

The example query will execute in four stages, Stage-0 to Stage-3. Each stage could be a MapReduce job, an HDFS action, a metastore action, or some other action performed by the Hive server.

The numbering does not imply an order of execution or dependency.

The dependencies between stages determine the order in which they must execute, and Hive specifies these dependencies explicitly at the start of the EXPLAIN results.

A root stage, like Stage-1 in this example, has no dependencies and is free to run first.

Non-root stages cannot run until the stages upon which they depend have completed.
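
To make the ordering rule concrete, here is a minimal Python sketch that reads the STAGE DEPENDENCIES lines from the example and derives a valid execution order with a topological sort. The helper names (parse_dependencies, execution_order) are made up for illustration and are not part of Hive. The parser only handles the two line shapes that appear in this example; real plans can contain other forms (for instance conditional stages), which this sketch ignores.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# STAGE DEPENDENCIES section copied from the EXPLAIN output in this example.
STAGE_DEPENDENCIES = """\
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-3 depends on stages: Stage-0
Stage-2 depends on stages: Stage-3
"""

def parse_dependencies(text):
    """Map each stage name to the set of stages it must wait for."""
    graph = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("is a root stage"):
            # A root stage has no prerequisites.
            graph[line.split()[0]] = set()
        elif " depends on stages: " in line:
            stage, parents = line.split(" depends on stages: ")
            graph[stage] = {p.strip() for p in parents.split(",")}
    return graph

def execution_order(text):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(parse_dependencies(text)).static_order())

order = execution_order(STAGE_DEPENDENCIES)
print(order)  # ['Stage-1', 'Stage-0', 'Stage-3', 'Stage-2']
```

The result shows why the stage numbers cannot be read as an execution order: Stage-0 runs second and Stage-2 runs last.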

Stage Plans

The stage plans part of the output describes each stage. In Hive, read each plan from top to bottom.

Stage-1 is identified as a MapReduce job.

The query plan shows that this job includes both a map phase (described by the Map Operator Tree) and a reduce phase (described by the Reduce Operator Tree). In the map phase, the map tasks read the flights table, select the carrier and flight columns, and compute a partial count per carrier (the Group By Operator with mode: hash).

These partial results are passed to the reduce phase, keyed and partitioned by carrier, where the reduce tasks merge them into the final count of flights for each carrier (the Group By Operator with mode: mergepartial).
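
The plan lists two Group By operators: one with mode: hash in the map tree and one with mode: mergepartial in the reduce tree. The following Python sketch imitates that two-step aggregation on a handful of invented rows. It is only an illustration of the idea, not how Hive implements it; the sample data and the function names are assumptions.

```python
from collections import Counter

# Hypothetical (carrier, flight) rows, split the way two map tasks might see them.
split_a = [("AA", 1), ("AA", 2), ("DL", 10), ("UA", None)]
split_b = [("AA", 3), ("DL", 11), ("DL", 12)]

def map_side_partial_count(rows):
    """Map phase: partial COUNT(flight) per carrier, like 'mode: hash'."""
    partial = Counter()
    for carrier, flight in rows:
        # COUNT(column) skips NULLs, but the carrier still forms a group.
        partial[carrier] += 0 if flight is None else 1
    return partial

def reduce_side_merge(partials):
    """Reduce phase: sum the partial counts per carrier, like 'mode: mergepartial'."""
    total = Counter()
    for partial in partials:
        for carrier, n in partial.items():
            total[carrier] += n
    return dict(total)

partials = [map_side_partial_count(split_a), map_side_partial_count(split_b)]
result = reduce_side_merge(partials)
print(result)  # {'AA': 3, 'DL': 3, 'UA': 0}
```

Each map task sends at most one row per carrier to the reducers instead of one row per flight, which is the point of aggregating on the map side first.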

Following Stage-1 is Stage-0, which is an HDFS action (Move).

In this stage, Hive moves the output of the previous stage to a new subdirectory in the warehouse directory in HDFS. This is the storage directory for the new table that will be named flights_by_carrier.

Following Stage-0 is Stage-3, which is a metastore action (Create Table).

In this stage, Hive creates a new table named flights_by_carrier in the fly database. The table has two columns: a STRING column named carrier and a BIGINT column named num.

The final stage, Stage-2, collects statistics.

The details of this final stage are not important, but it gathers information such as the number of rows in the table, the number of files that store the table data in HDFS, and the number of unique values in each column in the table. These statistics can be used to optimize Hive queries.
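
As a recap of the stages walked through above, this small Python sketch pulls the kind of each stage out of a STAGE PLANS listing by taking the first line that follows each "Stage:" header. The excerpt is trimmed from the example output, and stage_kinds is a made-up helper name. The sketch assumes the plan text has already had the table borders from the Beeline output removed.

```python
# Trimmed excerpt of the STAGE PLANS section from the example output.
STAGE_PLANS = """\
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
  Stage: Stage-0
    Move Operator
      files:
  Stage: Stage-3
      Create Table Operator:
        Create Table
  Stage: Stage-2
    Stats-Aggr Operator
"""

def stage_kinds(text):
    """Return {stage name: first descriptive line under its 'Stage:' header}."""
    kinds = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("Stage: "):
            current = stripped[len("Stage: "):]
        elif current is not None and current not in kinds:
            # Only the first line after the header names the stage's kind.
            kinds[current] = stripped.rstrip(":")
    return kinds

kinds = stage_kinds(STAGE_PLANS)
for stage, kind in kinds.items():
    print(stage, "->", kind)
```

For the example this gives Map Reduce for Stage-1, Move Operator for Stage-0, Create Table Operator for Stage-3, and Stats-Aggr Operator for Stage-2.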
