hive高级查询(一)

阅读量：4161 次

发布时间：2019-05-26

本文共 6062 字，大约阅读时间需要 20 分钟。

文章目录

一.Hive查询 - SELECT基础

1.查询语法解析

SELECT用于映射符合指定查询条件的行

Hive SELECT是数据库标准SQL的子集

使用方法类似于MySQL

关键字和MySQL一样，不区分大小写

limit子句

where子句:运算符、like、rlike

group by子句

having子句
示例:

SELECT 1;SELECT * FROM table_name;SELECT id,name,age FROM people WHERE age>20;SELECT * FROM employee WHERE name!='Lucy'  LIMIT 5;select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

Hive查询 - CTE和嵌套查询

CTE（Common Table Expression）

-- CTE语法WITH t1 AS (SELECT …) SELECT * FROM t1;-- CTE演示with tab1 as (select id,name,age from people) select * from tab1;

嵌套查询

-- 嵌套查询示例SELECT * FROM (SELECT * FROM employee) a;

Hive JOIN - 关联查询

指对多表进行联合查询

JOIN用于将两个或多个表中的行组合在一起查询

类似于SQL JOIN，但是Hive仅支持等值连接

内连接：INNER JOIN

外连接：OUTER JOIN: RIGHT JOIN, LEFT JOIN, FULL OUTER JOIN

交叉连接：CROSS JOIN,形成笛卡尔积

隐式连接：Implicit JOIN,演示如下"

-- 隐式连接 不写join但是加条件就叫隐式连接 先出笛卡尔积数据量非常大,再次筛选 而join会先过滤效率高select * from emp_basic eb,emp_psn ep where eb.emp_id = ep.emp_id;

JOIN发生在WHERE子句之前

2.练习1：SELECT及关联查询

请分别实现以下需求

1.将顾客表、部门表、商品表数据存入Hive

-- 顾客表load data inpath '/data/retail_db/customers.csv' into table customers;select * from customers;-- 部门表load data inpath '/data/retail_db/departments.csv' into table departments;select * from departments;-- 商品表load data inpath '/data/retail_db/products.csv' into table products;select * from products;

2.查询顾客表中地区为“NY”所在城市为’New York’的用户

select * from  customerswhere customer_state = 'NY' and customer_city like 'New York';

3.查询订单表中共有多少不同顾客下过订单

select count(1)from (select count(1)from orders group by order_customer_id) a;

4.查询商品表中前5个商品

select * from products limit 5;

5.从顾客表中查询每位顾客的全名（分别使用CTE和子查询）

-- 用个毛子查询直接函数select concat(customer_fname,' ',customer_lname) from customers ;

6.使用正则表达式匹配顾客表中ID、姓名与所在城市列(rlike)

select * from customers where customer_id rlike '[1,2,3][4,5,6]' and customer_fname rlike '^Ma+[a,r]' and customer_lname rlike '^S' and customer_state rlike '^N' and customer_city rlike 'Green' ;

7.使用关联查询获取没有订单的所有顾客

select * from customers c where not exists(select * from orders o where order_customer_id = c.customer_id );

二. Hive JOIN – MAPJOIN

MapJoin操作在Map端完成

小表关联大表

可进行不等值连接

开启mapjoin操作

set hive.auto.convert.join=true; //设置 MapJoin 优化默认自动开启

set hive.mapjoin.smalltable.filesize=25000000 //设置小表不超过多大时开启 mapjoin 优化

运行时自动将连接转换为MAPJOIN

MAPJOIN操作不支持：

在UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY等操作后面

在UNION, JOIN 以及其他 MAPJOIN之前

三.Hive集合操作(UNION)

所有子集数据必须具有相同的名称和类型

UNION ALL：合并后保留重复项

UNION：合并后删除重复项（v1.2之后）目前使用的版本为1.1

可以在顶层查询中使用（0.13.0之后）

ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY 和LIMIT适用于合并后的整个结果

order by 全局排序只有一个reducer(多个reduce会有多个分区无法保证)只要是order by 设置reduce task 任务数是无效的

sort by 不是全局排序,在进入reduce之前排序,局部有序,任务数为1时和order by 一样一般不会单用

distribute by 控制在map端如何拆分数据给reduce端的配合 sort by 使用 ,相当于group by

cluster by 除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒叙排序，不能指定排序规则为ASC或者DESC

集合其他操作可以使用JOIN/OUTER JOIN来实现

差集、交集

四.装载数据

1.INSERT表插入数据

使用INSERT语句将数据插入表/分区

-- INSERT支持OVERWRITE(覆盖)和INTO(追加)INSERT OVERWRITE/INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select fileds,... from tb_other;

Hive支持从同一个表进行多次插入

INSERT INTO中TABLE关键字是可选的

INSERT INTO可以指定插入到哪些字段中

INSERT INTO table_name VALUES，支持插入值列表,数据插入必须与指定列数相同

INSERT OVERWRITE TABLE test select 'hello'; -- INSERT不支持的写法insert into employee select * from ctas_employee; -- 通过查询语句插入-- 多插入from ctas_employeeinsert overwrite table employee select *insert overwrite table employee_internal select *;-- 插入到分区from ctas_patitioned insert overwrite table employee PARTITION (year, month)select *,'2018','09';-- 通过指定列插入(insert into可以省略table关键字)insert into employee(name) select 'John' from test limit 1;-- 通过指定值插入insert into employee(name) value('Judy'),('John');

2. Hive数据插入文件

使用insert语句将数据插入/导出到文件

文件插入只支持OVERWRITE

支持来自同一个数据源/表的多次插入

LOCAL：写入本地文件系统

默认数据以TEXT格式写入，列由^A分隔

支持自定义分隔符导出文件为不同格式,CSV,JSON等

-- 从同一数据源插入本地文件，hdfs文件，表from ctas_employeeinsert overwrite local directory '/tmp/out1'  select *insert overwrite directory '/tmp/out1' select *insert overwrite table employee_internal select *;-- 以指定格式插入数据insert overwrite directory '/tmp/out3'row format delimited fields terminated by ','select * from ctas_employee;-- 其他方式从表获取文件hdfs dfs -getmerge

3. Hive数据交换 - IMPORT/EXPORT

IMPORT和EXPORT用于数据导入和导出

常用于数据迁移场景

除数据库，可导入导出所有数据和元数据

使用EXPORT导出数据

EXPORT TABLE employee TO '/tmp/output3'; -- 导出到hdfsEXPORT TABLE employee_partitioned partition (year=2014, month=11) TO '/tmp/output5';

使用IMPORT导入数据

IMPORT TABLE employee FROM '/tmp/output3';IMPORT TABLE employee_partitioned partition (year=2014, month=11) FROM '/tmp/output5';

五.hive数据排序

1.order by

ORDER BY (ASC|DESC)类似于标准SQL

只使用一个Reducer执行全局数据排序

速度慢,应提前做好数据过滤

支持使用CASE WHEN或表达式

支持按位置编号排序:set hive.groupby.orderby.position.alias=true;

2.SORT BY/DISTRIBUTE BY

SORT BY对每个Reducer中的数据进行排序

当Reducer数量设置为1时，等于ORDER BY

排序列必须出现在SELECT column列表中

DISTRIBUTE BY类似于标准SQL中的GROUP BY

根据相应列以及对应reduce的个数进行分发

默认是采用hash算法

根据分区字段的hash码与reduce的个数进行模除

通常使用在SORT BY语句之前

-- 默认ASC,正序 DESC,倒序SELECT department_id , name, employee_id, evaluation_scoreFROM employee_hr DISTRIBUTE BY department_id SORT BY evaluation_score DESC;

3. CLUSTER BY

CLUSTER BY = DISTRIBUTE BY + SORT BY

不支持ASC|DESC

排序列必须出现在SELECT column列表中

为了充分利用所有的Reducer来执行全局排序，可以先使用CLUSTER BY，然后使用ORDER BY

练习2：实现Hive数据加载及排序

1.将order_items.csv数据通过load方式加载到order_items表

load data inpath '/data/retail_db/order_items.csv' into table order_items;select * from order_items;

2.将order_items表中数据加载到order_test2表

create  table order_test2 like order_items;from order_items	insert  into table order_test2  select *;

3.将order_items表中数据同时加载到o_01和o_02表

create  table o_01 like order_items;create  table o_02 like order_items;from order_items	insert  into table o_01  select *insert  into table o_02  select *;

4.将order_items表中数据导出到本地以及hdfs

-- 本地from order_itemsinsert overwrite local directory '/root/order_items.txt' select *;-- 数据导出到hdfsEXPORT TABLE order_items TO  '/tmp/output1';

5.统计order_items表中订单数量排行（取前10）

select order_item_order_id, sum(order_item_quantity) efrom order_items o group by order_item_order_idorder by e desc limit 10;

转载地址：http://scjxi.baihongyu.com/

你可能感兴趣的文章

Docker启动报错：SELinux is not supported with the overlay2 graph driver

查看>>

CentOS7安装hadoop2.7.3-单机版

查看>>

深入理解HDFS原理

查看>>

Intellij IDEA连接Hadoop HDFS实现本地调试

查看>>

MongoDB数据类型

查看>>

MongoDB的java客户端连接池配置说明

查看>>

MongoDB在CentOS7下的yum方式和压缩包方式安装

倾斜数据之加载——SuperMap iDesktop

查看>>

倾斜数据之合并根节点——SuperMap iDesktop

查看>>

SuperMap iClient3D for WebGL教程（Entity）- RectangleGraphics