
02 - Hive: Creating One Table from Another, Table Partitioning, and Bucketing

2016-08-08

This post covers:

creating one table from another, how Hive reads different file formats, partitioned tables, and bucketing

Let's get started!
1. Creating one table from another
Syntax: create table test3 like test2;
What we will run: create table testtext_c like testtext; (this copies no data; it only clones the table schema)
First, load some data into testtext:

[root@hadoop1 host]# cat testtext
wer 46
wer 89
weree   78
rr  89
hive> load data local inpath '/usr/host/testtext' into table testtext;
Copying data from file:/usr/host/testtext
Copying file: file:/usr/host/testtext
Loading data to table default.testtext
OK
Time taken: 0.294 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.186 seconds
hive> 

2. Now create testtext_c (the LIKE approach)

hive> create table testtext_c like testtext;
OK
Time taken: 0.181 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.204 seconds
hive> select * from testtext_c;
OK
Time taken: 0.158 seconds
hive> 

See? testtext_c really is empty. I wasn't kidding!
3. Hold on, there is another approach (AS)

hive> create table testtext_cc as select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 20:49:59,404 null map = 0%,  reduce = 0%
2016-06-01 20:50:20,644 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
2016-06-01 20:50:21,735 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
MapReduce Total cumulative CPU time: 1 seconds 300 msec
Ended Job = job_1464828076391_0004
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 1011778050, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_20-49-43_516_5205177189363939745/-ext-10001
Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/testtext_cc
Table default.testtext_cc stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 48.014 seconds

Why does this one run MapReduce, when create table testtext_c like testtext; didn't? The difference is the SELECT: a bare select * from … can be served by a plain file read, while essentially every other SELECT compiles to MapReduce jobs. Hive is built on MapReduce underneath; see the previous post if you don't believe me.
Now check whether the data made it over:

hive> select * from testtext_cc;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.116 seconds
hive> 

There it is!
So: create table testtext_cc as select name,addr from testtext; runs through MapReduce and copies the data as well.
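As an aside, the AS SELECT form is not limited to straight copies; it can filter or transform the rows while creating the table. A small sketch (the table name testtext_wer is made up):

```sql
-- Create a new table holding only the 'wer' rows of testtext.
create table testtext_wer as
  select name, addr from testtext where name = 'wer';
```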

4. Next up: how Hive handles different file formats
There is the textfile format, sequencefile, rcfile, and user-defined file formats.

hive> create table test_text(name string,val string) stored as textfile;
OK
Time taken: 0.098 seconds
hive> desc formatted test_text;
OK
# col_name              data_type               comment             

name                    string                  None                
val                     string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 21:11:15 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/test_text    
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464840675          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.2 seconds
hive> 

Look at the Storage Information section:
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

The input format is TextInputFormat; the output format is HiveIgnoreKeyTextOutputFormat.

hive> create table test_seq(name string,val string) stored as sequencefile;
OK
Time taken: 0.097 seconds
hive> desc formatted test_seq;
hive> create table test_rc(name string,val string) stored as rcfile;
OK
Time taken: 0.126 seconds
hive> desc formatted test_rc;

Custom file formats are beyond this post; we'll come back to them later.
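One caveat before moving on: LOAD DATA just moves files into place without converting them, so you cannot load a plain text file directly into a sequencefile or rcfile table. Populate those tables with an INSERT ... SELECT instead; a sketch using the tables above:

```sql
-- Hive rewrites the rows into the sequencefile format as it inserts them.
insert overwrite table test_seq select name, addr from testtext;
```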

5. Why partition? A Hive SELECT normally scans the whole table, which burns a lot of time on unnecessary work.
A partitioned table declares its partition layout when the table is created.
Partition syntax:
create table tablename(name string) partitioned by (key type, …)
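A minimal concrete sketch of that syntax (the table and column names are made up): the partition column goes in PARTITIONED BY, not in the regular column list, and shows up in queries like a virtual column.

```sql
-- A one-column table partitioned by a date string.
create table logs(name string)
partitioned by (dt string);
```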

6. Let's create a partitioned table:
The previous post created three tables: testtable, testtext, and xielaoshi. Run show tables first to see what exists:

hive> show tables;
OK
testtable
testtext
xielaoshi
Time taken: 0.264 seconds

If you want to drop a table:

hive> drop table testtable;

Now create the partitioned table:

hive> create table xielaoshi2(
    > name string,
    > salary float,
    > meinv array<string>,
    > haoche map<string,float>,
    > haoza struct<street:string,city:string,state:string,zip:int>
    > )
    > partitioned by (dt string,type string)
    > row format delimited
    > fields terminated by '\t'
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.353 seconds
hive>

Tip: type the DDL in a text editor first, then paste it into the hive prompt. Much less error-prone!

7. What, you're not sure what that syntax means? The parts most likely to be unfamiliar are collection items terminated by ',' and map keys terminated by ':'. Think about it: the elements inside a collection, and the key/value pairs inside a map, need their own separators, and here those are the comma and the colon.
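To make the delimiters concrete, here is what a hypothetical input file for this schema could look like, and how it would be loaded (the file path and all values are made up):

```sql
-- One row (tabs shown as <TAB>; array/struct items use ',', map entries use ':'):
--   wang<TAB>100.0<TAB>mei1,mei2<TAB>che1:20.0,che2:8.5<TAB>street1,cityA,stateB,10000
load data local inpath '/usr/host/xielaoshi2.txt'
  into table xielaoshi2 partition(dt='20160518', type='test');
```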
Let's look at the table description:

hive> desc formatted xielaoshi2;
OK
# col_name              data_type               comment             

name                    string                  None                
salary                  float                   None                
meinv                   array<string>           None                
haoche                  map<string,float>       None                
haoza                   struct<street:string,city:string,state:string,zip:int>  None                

# Partition Information      
# col_name              data_type               comment             

dt                      string                  None                
type                    string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 20:09:05 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/xielaoshi2   
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464836945          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    colelction.delim        ,                   
    field.delim             \t                  
    line.delim              \n                  
    mapkey.delim            :                   
    serialization.format    \t                  
Time taken: 0.194 seconds
hive> 

Notice the new Partition Information section? The table has two partition columns.
8. Adding partitions

hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test');
OK
Time taken: 0.188 seconds
hive> 


Not satisfied? Let's add more partitions:

hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test1');
OK
Time taken: 3.986 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160518',type='test2');
OK
Time taken: 0.327 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
Time taken: 0.273 seconds
hive> 

What, still not enough? Fine, let's keep going:

hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test');
OK
Time taken: 0.224 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test1');
OK
Time taken: 0.275 seconds
hive> alter table xielaoshi2 add if not exists partition(dt='20160519',type='test2');
OK
Time taken: 0.323 seconds
hive> show partitions xielaoshi2;
OK
dt=20160518/type=test
dt=20160518/type=test1
dt=20160518/type=test2
dt=20160519/type=test
dt=20160519/type=test1
dt=20160519/type=test2
Time taken: 0.308 seconds
hive> 

See? Under each dt there are type sub-partitions.

9. Dropping partitions

hive> alter table xielaoshi2 drop if exists partition(dt='20160519',type='test2');
Dropping the partition dt=20160519/type=test2
OK
Time taken: 0.541 seconds
hive> 

Drop every sub-partition under one dt value:

hive> alter table xielaoshi2 drop if exists partition(dt='20160519');
Dropping the partition dt=20160519/type=test
Dropping the partition dt=20160519/type=test1
OK
Time taken: 4.24 seconds
hive> 
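This is also where partitioning pays off at query time: a partition column can be used in a WHERE clause like any other column, and Hive then reads only the matching partition directories instead of scanning the whole table. For example:

```sql
-- Only the dt=20160518/type=test directory is scanned.
select * from xielaoshi2 where dt = '20160518' and type = 'test';
```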

10. Bucketing
Bucketing: within each table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data.
How are rows assigned to buckets?
Hive buckets on a chosen column: it hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record lands in.
Benefits: more efficient query processing for some operations, and much more efficient sampling (that's the real point!).
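That assignment rule can be sketched with Hive's own built-in functions (pmod keeps the result non-negative), using the existing testtext table and 4 buckets as an example:

```sql
-- The bucket a row would land in: pmod(hash(column), number_of_buckets).
select name, pmod(hash(name), 4) as bucket_no from testtext;
```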
Now let's build a bucketed table:

hive> create table bucketed_user(
    > id string,
    > name string
    > )
    > clustered by(id) sorted by(name) into 4 buckets
    > row format delimited fields terminated by '\t' lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.283 seconds
hive> 

Check the table description:

hive> desc formatted bucketed_user;
OK
# col_name              data_type               comment             

id                      string                  None                
name                    string                  None                

# Detailed Table Information         
Database:               default                  
Owner:                  root                     
CreateTime:             Wed Jun 01 20:31:39 PDT 2016     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop1:9000/user/hive/warehouse/bucketed_user    
Table Type:             MANAGED_TABLE            
Table Parameters:        
    transient_lastDdlTime   1464838299          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            4                        
Bucket Columns:         [id]                     
Sort Columns:           [Order(col:name, order:1)]   
Storage Desc Params:         
    field.delim             \t                  
    line.delim              \n                  
    serialization.format    \t                  
Time taken: 0.363 seconds
hive> 

Num Buckets: 4 confirms the table is split into 4 buckets.

hive> select * from bucketed_user;
OK
Time taken: 0.533 seconds
hive> 

Nothing there? Of course; we haven't inserted any data. Let's copy the rows from testtext into bucketed_user:

hive> insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-01 21:17:07,755 null map = 0%,  reduce = 0%
2016-06-01 21:17:22,171 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:23,308 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
2016-06-01 21:17:24,401 null map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1464828076391_0005
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Ended Job = 180668474, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2016-06-01_21-16-49_815_8186991974761152344/-ext-10000
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 37.79 seconds

hive> select * from bucketed_user;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.273 seconds
hive> 

Two jobs ran.
But the data was not actually bucketed (note num_files: 1 in the stats above). Why?
You need to set this first: hive> set hive.enforce.bucketing=true;
Then run the insert again:

hive> insert overwrite table bucketed_user select name,addr from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 4
2016-06-01 21:24:40,053 null map = 0%,  reduce = 0%
2016-06-01 21:24:54,729 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:55,909 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:57,256 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:58,531 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:24:59,631 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:00,930 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:02,208 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:03,485 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:04,781 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:05,983 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:07,272 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:08,697 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:09,782 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:11,017 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:12,292 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:13,606 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:14,870 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:17,433 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:18,929 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:20,801 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:22,429 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:24,508 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:26,192 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:27,256 null map = 100%,  reduce = 0%, Cumulative CPU 1.21 sec
2016-06-01 21:25:31,612 null map = 100%,  reduce = 51%, Cumulative CPU 1.21 sec
2016-06-01 21:25:33,544 null map = 100%,  reduce = 51%, Cumulative CPU 2.94 sec
2016-06-01 21:25:35,433 null map = 100%,  reduce = 94%, Cumulative CPU 4.92 sec
2016-06-01 21:25:39,269 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:40,312 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:41,730 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:42,927 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
2016-06-01 21:25:44,187 null map = 100%,  reduce = 100%, Cumulative CPU 6.23 sec
MapReduce Total cumulative CPU time: 6 seconds 230 msec
Ended Job = job_1464828076391_0006
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.bucketed_user
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/bucketed_user
Table default.bucketed_user stats: [num_partitions: 0, num_files: 4, num_rows: 0, total_size: 29, raw_data_size: 0]
OK
Time taken: 96.782 seconds
hive> 

Look at the line Hadoop job information for null: number of mappers: 1; number of reducers: 4: because there are 4 buckets, 4 reducers were launched.

Now look at the data:

hive> select * from bucketed_user;
OK
rr  89
weree   78
wer 89
wer 46
Time taken: 1.112 seconds
hive> select * from testtext where name = 'wer';
OK
wer 46
wer 89
Time taken: 31.796 seconds
hive> 
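And here is the sampling payoff: on a bucketed table, TABLESAMPLE can read a single bucket instead of the whole table. A sketch (bucket numbering starts at 1):

```sql
-- Read only the first of bucketed_user's 4 buckets.
select * from bucketed_user tablesample(bucket 1 out of 4 on id);
```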