hive高阶1--sql和hive语句执行顺序、explain查看执行计划、groupby生成MR

16-08-08 来源：[db:作者]

收藏我要投稿

hive语句执行顺序

msyql语句执行顺序

代码写的顺序：

select ... from... where.... group by... having... order by.. 
    或者
from ... select ...

代码的执行顺序：

from... where...group by... having.... select ... order by...

hive 语句执行顺序

大致顺序
from... where.... select...group by... having ... order by...

explain查看执行计划

hive语句和mysql都可以通过explain查看执行计划，这样就可以查看执行顺序，比如代码

    explain
    select city,ad_type,device,sum(cnt) as cnt
    from tb_pmp_raw_log_basic_analysis
    where day = '2016-05-28' and type = 0 and media = 'sohu' and (deal_id = '' or deal_id = '-' or deal_id is NULL)    
    group by city,ad_type,device

显示执行计划如下

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tb_pmp_raw_log_basic_analysis
            Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (((deal_id = '') or (deal_id = '-')) or deal_id is null) (type: boolean)
              Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: city (type: string), ad_type (type: string), device (type: string), cnt (type: bigint)
                outputColumnNames: city, ad_type, device, cnt
                Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: sum(cnt)
                  keys: city (type: string), ad_type (type: string), device (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                    sort order: +++
                    Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                    Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col3 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

具体介绍如下
**stage1的map阶段**
        TableScan：from加载表，描述中有行数和大小等
        Filter Operator：where过滤条件筛选数据，描述有具体筛选条件和行数、大小等
        Select Operator：筛选列，描述中有列名、类型，输出类型、大小等。
        Group By Operator：分组，描述了分组后需要计算的函数，keys描述用于分组的列，outputColumnNames为输出的列名，可以看出列默认使用固定的别名_col0，以及其他信息
        Reduce Output Operator：map端本地的reduce，进行本地的计算，然后按列映射到对应的reduce
**stage1的reduce阶段Reduce Operator Tree**
        Group By Operator：总体分组，并按函数计算。map计算后的结果在reduce端的合并。描述类似。mode: mergepartial是说合并map的计算结果。map端是hash映射分组
        Select Operator：最后过滤列用于输出结果
        File Output Operator：输出结果到临时文件中，描述介绍了压缩格式、输出文件格式。
        stage0第二阶段没有，这里可以实现limit 100的操作。

总结

1，每个stage都是一个独立的MR，复杂的hql语句可以产生多个stage，可以通过执行计划的描述，看看具体步骤是什么。
2，执行计划有时预测数据量，不是真实运行，可能不准确

group by的MR

hive语句最好写子查询嵌套，这样分阶段的导入数据，可以逐步减少数据量。但可能会浪费时间。所以需要设计好。
group by本身也是一种数据筛选，可以大量减少数据，尤其用于去重等方面，功效显著。但group by产生MR有时不可控，不知道在哪个阶段更好。尤其，map端本地的reduce减少数据有很大作用。

尤其，hadoop的MR不患寡而患不均。数据倾斜将是MR计算的最大瓶颈。hive中可以设置分区、桶、distribute by等来控制分配数据给Reduce。
那么，group by生成MR是否可以优化呢?
下面两端代码，可以对比一下，

代码1

explain
select advertiser_id,crt_id,ad_place_id,channel,ad_type,rtb_type,media,count(1) as cnt
from (
  select 
    split(all,'\\\\|~\\\\|')[41] as advertiser_id,
    split(all,'\\\\|~\\\\|')[11] as crt_id,
    split(all,'\\\\|~\\\\|')[8] as ad_place_id,
    split(all,'\\\\|~\\\\|')[34] as channel,
    split(all,'\\\\|~\\\\|')[42] as ad_type,
    split(all,'\\\\|~\\\\|')[43] as rtb_type,
    split(split(all,'\\\\|~\\\\|')[5],'/')[1] as media
  from tb_pmp_raw_log_bid_tmp tb
) a 
group by advertiser_id,crt_id,ad_place_id,channel,ad_type,rtb_type,media;

代码2

 explain
  select 
    split(all,'\\\\|~\\\\|')[41] as advertiser_id,
    split(all,'\\\\|~\\\\|')[11] as crt_id,
    split(all,'\\\\|~\\\\|')[8] as ad_place_id,
    split(all,'\\\\|~\\\\|')[34] as channel,
    split(all,'\\\\|~\\\\|')[42] as ad_type,
    split(all,'\\\\|~\\\\|')[43] as rtb_type,
    split(split(all,'\\\\|~\\\\|')[5],'/')[1] as media
  from tb_pmp_raw_log_bid_tmp tb
  group by split(all,'\\\\|~\\\\|')[41],split(all,'\\\\|~\\\\|')[11],split(all,'\\\\|~\\\\|')[8],split(all,'\\\\|~\\\\|')[34],split(all,'\\\\|~\\\\|')[42],split(all,'\\\\|~\\\\|')[43],split(split(all,'\\\\|~\\\\|')[5],'/')[1]

先进行子查询，然后group by，还是直接group by，两种那个好一点，
我个人测试后认为，数据量小，第一种会好一点，如果数据量大，可能第二种会好。至于数据量多大。TB级以下的都是小数据。

两个执行计划对比如下，可以看出基本执行的步骤的数据分析量差不多。
group by一定要用，但内外，先后执行顺序效果差不多。

代码1

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tb
            Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: split(all, '\\|~\\|')[41] (type: string), split(all, '\\|~\\|')[11] (type: string), split(all, '\\|~\\|')[8] (type: string), split(all, '\\|~\\|')[34] (type: string), split(all, '\\|~\\|')[42] (type: string), split(all, '\\|~\\|')[43] (type: string), split(split(all, '\\|~\\|')[5], '/')[1] (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
              Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(1)
                keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
                  sort order: +++++++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
                  Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col7 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
          Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

代码2

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tb
            Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
            Select Operator


              expressions: all (type: string)
              outputColumnNames: all
              Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
              Group By Operator

                keys: split(all, '\\|~\\|')[41] (type: string), split(all, '\\|~\\|')[11] (type: string), split(all, '\\|~\\|')[8] (type: string), split(all, '\\|~\\|')[34] (type: string), split(all, '\\|~\\|')[42] (type: string), split(all, '\\|~\\|')[43] (type: string), split(split(all, '\\|~\\|')[5], '/')[1] (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
                  sort order: +++++++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
                  Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

      Reduce Operator Tree:
        Group By Operator

          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
          Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
            Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

点击复制链接与好友分享!回本站首页