Integrating Phoenix 4.8 with Spark
This article uses the Spark shell to work with Phoenix (it follows the official guide at http://phoenix.apache.org/phoenix_spark.html and fixes a few problems in it).
Reading HBase through the phoenix-spark connector is more efficient than going through plain JDBC, because the connector schedules the scan at a finer granularity: Phoenix table splits become Spark partitions that are read in parallel, rather than one serial JDBC result set.
Software environment:
Phoenix 4.8
HBase 1.1.2
Spark 1.5.2 / 1.6.3 / 2.x
1. Create a table in Phoenix
CREATE TABLE TABLE1 (ID BIGINT NOT NULL PRIMARY KEY, COL1 VARCHAR);
UPSERT INTO TABLE1 (ID, COL1) VALUES (1, 'test_row_1');
UPSERT INTO TABLE1 (ID, COL1) VALUES (2, 'test_row_2');
2. Start spark-shell
spark-shell --jars /opt/phoenix4.8/phoenix-spark-4.8.0-HBase-1.1.jar,/opt/phoenix4.8/phoenix-4.8.0-HBase-1.1-client.jar
3. Load the table as a DataFrame with the Data Source API
import org.apache.phoenix.spark._
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "TABLE1", "zkUrl" -> "192.38.0.231:2181"))
df.filter(df("COL1") === "test_row_1" && df("ID") === 1L).select(df("ID")).show
For Spark 2.x, use the following instead (this assumes the phoenix-spark module has been patched for Spark 2.x compatibility and recompiled):
import org.apache.phoenix.spark._
val df = spark.read.format("org.apache.phoenix.spark").options(Map("table" -> "TABLE1", "zkUrl" -> "192.38.0.231:2181")).load
df.filter(df("COL1") === "test_row_1" && df("ID") === 1L).select(df("ID")).show
The same table can also be mapped from spark-sql:
spark-sql --jars /opt/phoenix4.8/phoenix-spark-4.8.0-HBase-1.1.jar,/opt/phoenix4.8/phoenix-4.8.0-HBase-1.1-client.jar
CREATE TABLE spark_ph
USING org.apache.phoenix.spark
OPTIONS (table "TABLE1", zkUrl "192.38.0.231:2181");
4. Load a DataFrame with a Configuration object
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._
val configuration = new Configuration()
// the Configuration can point at a specific hbase-site.xml if needed
val df = sqlContext.phoenixTableAsDataFrame("TABLE1", Array("ID", "COL1"), conf = configuration)
df.show
5. Load an RDD with a ZooKeeper URL
import org.apache.phoenix.spark._
// Load the ID and COL1 columns as an RDD
val rdd = sc.phoenixTableAsRDD("TABLE1", Seq("ID", "COL1"), zkUrl = Some("192.38.0.231:2181"))
rdd.count()
val firstId = rdd.first()("ID").asInstanceOf[Long]
val firstCol = rdd.first()("COL1").asInstanceOf[String]
6. Write to Phoenix from Spark (RDD API)
Create the table OUTPUT_TEST_TABLE in Phoenix:
CREATE TABLE OUTPUT_TEST_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
import org.apache.phoenix.spark._
// In spark-shell the SparkContext already exists as sc; only create one in a standalone application
val sc = new SparkContext("local", "phoenix-test")
val dataSet = List((1L, "1", 1), (2L, "2", 2), (3L, "3", 3))
sc.parallelize(dataSet).saveToPhoenix("OUTPUT_TEST_TABLE", Seq("ID", "COL1", "COL2"), zkUrl = Some("192.38.0.231:2181"))
0: jdbc:phoenix:localhost> select * from output_test_table;
+-----+-------+-------+
| ID | COL1 | COL2 |
+-----+-------+-------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
+-----+-------+-------+
3 rows selected (0.168 seconds)
7. Write to Phoenix from Spark (DataFrame API)
Create two tables in Phoenix:
CREATE TABLE INPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
upsert into input_table values(1,'col1',1);
upsert into input_table values(2,'col2',2);
CREATE TABLE OUTPUT_TABLE (id BIGINT NOT NULL PRIMARY KEY, col1 VARCHAR, col2 INTEGER);
import org.apache.phoenix.spark._
import org.apache.spark.sql.SaveMode
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE", "zkUrl" -> "192.38.0.231:2181"))
df.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "192.38.0.231:2181"))
With Spark 2.x, write to Phoenix like this:
df.write.format("org.apache.phoenix.spark").options(Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "192.38.0.231:2181")).mode("overwrite").save
Querying OUTPUT_TABLE in sqlline then shows:
+-----+-------+-------+
| ID | COL1 | COL2 |
+-----+-------+-------+
| 1 | col1 | 1 |
| 2 | col2 | 2 |
+-----+-------+-------+
2 rows selected (0.092 seconds)
Example: PageRank
1) Download the enron.csv.gz data set
2) Create two tables in Phoenix:
CREATE TABLE EMAIL_ENRON(MAIL_FROM BIGINT NOT NULL, MAIL_TO BIGINT NOT NULL CONSTRAINT pk PRIMARY KEY(MAIL_FROM, MAIL_TO));
CREATE TABLE EMAIL_ENRON_PAGERANK(ID BIGINT NOT NULL, RANK DOUBLE CONSTRAINT pk PRIMARY KEY(ID));
3) Load the data into Phoenix:
gunzip enron.csv.gz
./psql.py -t EMAIL_ENRON localhost /opt/enron.csv
4) Compute PageRank in Spark and save the result back to Phoenix
import org.apache.spark.graphx._
import org.apache.phoenix.spark._
val rdd = sc.phoenixTableAsRDD("EMAIL_ENRON", Seq("MAIL_FROM", "MAIL_TO"), zkUrl = Some("localhost"))
val rawEdges = rdd.map { e => (e("MAIL_FROM").asInstanceOf[VertexId], e("MAIL_TO").asInstanceOf[VertexId]) }
val graph = Graph.fromEdgeTuples(rawEdges, 1.0)
val pr = graph.pageRank(0.001)
pr.vertices.saveToPhoenix("EMAIL_ENRON_PAGERANK", Seq("ID", "RANK"), zkUrl = Some("localhost"))
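As a rough illustration of what graph.pageRank(0.001) computes, here is a toy Python sketch of the damped PageRank iteration. GraphX defaults to a reset probability of 0.15 and starts every vertex at rank 1.0, so the ranks are unnormalized (they sum to roughly the vertex count, which is why a heavily mailed vertex can score near 500 in the query results). The function name and sample edges below are illustrative only, not part of any Spark API.

```python
# Toy sketch of the damped PageRank iteration (hypothetical helper, not Spark code)
def pagerank(edges, tol=0.001, damping=0.85):
    """edges: list of (src, dst) pairs, like the MAIL_FROM -> MAIL_TO tuples."""
    nodes = {n for e in edges for n in e}
    out_deg = {n: 0 for n in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    ranks = {n: 1.0 for n in nodes}  # GraphX starts every vertex at rank 1.0
    while True:
        # each vertex splits its rank evenly among its outgoing edges
        contrib = {n: 0.0 for n in nodes}
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_deg[src]
        new_ranks = {n: (1 - damping) + damping * contrib[n] for n in nodes}
        delta = max(abs(new_ranks[n] - ranks[n]) for n in nodes)
        ranks = new_ranks
        if delta < tol:  # stop once every rank has stabilized within tol
            return ranks

# The vertex that everyone mails ends up with the highest rank:
edges = [(1, 3), (2, 3), (3, 1)]
ranks = pagerank(edges)
assert ranks[3] == max(ranks.values())
```

A vertex with no inbound edges settles at the reset value 0.15, matching GraphX's unnormalized formulation.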
5) Check the result:
0: jdbc:phoenix:localhost> select * from email_enron_pagerank order by rank desc limit 5;
+-------+---------------------+
| ID | RANK |
+-------+---------------------+
| 5038 | 497.2989872977676 |
| 273 | 117.18141799210386 |
| 140 | 108.63091596789913 |
| 458 | 107.2728800448782 |
| 588 | 106.11840798585399 |
+-------+---------------------+
5 rows selected (0.283 seconds)