Spark SQL: quick hands-on experience and debugging
Spark SQL delivers fast query performance, so what is the quickest way to experience, develop, and debug it? The standard route is to integrate Hive and then query Hive tables through the Hive metastore, but that drags in everything Hive-related. If the goal is only to learn Spark SQL's query features, the IDEA IDE alone is enough, and it runs quickly on Windows with no Hive data warehouse: just fabricate some data in an array, convert it to a DataFrame, and operate on it directly with Spark SQL.
First, the core dependencies in the pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xuele.bigdata</groupId>
    <artifactId>kp_diag</artifactId>
    <version>1.0.2</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.7.3</hadoop.version>
        <spark.version>2.0.2</spark.version>
        <spark.hive.version>2.0.2</spark.hive.version>
        <spark.sql.version>2.0.2</spark.sql.version>
        <neo4j-java-driver.version>1.0.5</neo4j-java-driver.version>
        <config.version>1.2.1</config.version>
        <jedis.version>2.9.0</jedis.version>
        <hbase.version>1.2.0</hbase.version>
        <kafka.version>0.9.0.0</kafka.version>
        <fastjson.version>1.2.15</fastjson.version>
        <elasticsearch.version>2.3.4</elasticsearch.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>${elasticsearch.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>${jedis.version}</version>
        </dependency>
        <dependency>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
            <version>1.3</version>
        </dependency>
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>${config.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <!-- Neo4j Java driver -->
        <dependency>
            <groupId>org.neo4j.driver</groupId>
            <artifactId>neo4j-java-driver</artifactId>
            <version>${neo4j-java-driver.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.sql.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.1</version>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>2.0.2</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
        <filters>
            <filter>src/main/filters/xuele-${build.profile.id}.properties</filter>
        </filters>
        <!-- Mark the directory below as resources -->
        <resources>
            <!-- Enable automatic placeholder substitution -->
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*</include>
                </includes>
                <!-- An excludes tag could be used here instead -->
                <!--<excludes></excludes>-->
                <!-- Turn on filtering -->
                <filtering>true</filtering>
            </resource>
        </resources>
    </build>

    <profiles>
        <!-- The dev profile is activated by default; index-dev.properties is used to substitute the actual keys in the files -->
        <profile>
            <id>dev</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <build.profile.id>dev</build.profile.id>
            </properties>
        </profile>
        <!-- Test environment profile -->
        <profile>
            <id>test</id>
            <properties>
                <build.profile.id>test</build.profile.id>
            </properties>
        </profile>
        <!-- Production environment profile -->
        <profile>
            <id>product</id>
            <properties>
                <build.profile.id>product</build.profile.id>
            </properties>
        </profile>
    </profiles>
</project>
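With these profiles and resource filtering in place, the active profile decides which properties file fills in the placeholders under src/main/resources. As a quick sketch of how to switch profiles at build time (assuming filter files such as src/main/filters/xuele-dev.properties actually exist, which the pom implies but this post does not show):

# dev is the default profile
mvn clean package
# build against the test or production profile instead
mvn clean package -Ptest
mvn clean package -Pproduct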
Next, a test example showing Spark SQL in action:
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Local Spark SQL test example
 */
object TestGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[1]")        // run in local mode
      .appName("Spark SQL basic example")  // set the application name
      .getOrCreate()
    import spark.implicits._  // import the implicit conversions (e.g. toDF)
    import spark.sql          // import the sql function

    // Fabricate three columns of data from a Seq
    val df = spark.sparkContext.parallelize(Seq(
      (0, "p", 30.9),
      (0, "u", 22.1),
      (1, "r", 19.6),
      (2, "cat40", 20.7),
      (2, "cat187", 27.9),
      (2, "cat183", 11.3),
      (3, "cat8", 35.6)
    )).toDF("id", "name", "price")  // convert to a DataFrame with three columns
    df.createTempView("pro")        // register a temp view named pro

    // Group by id; count each group, find its minimum price, and collect the rows inside each group
    val ds = sql("select id, count(*) as c, min(price) as min_price, collect_list(struct(name, price)) as res from pro group by id")
    ds.cache()  // data that will be queried more than once can be cached

    // Fetch the query result and iterate over the result set
    ds.select("id", "c", "res", "min_price").collect().foreach(line => {
      val id = line.getAs[Int]("id")                  // get the id
      val count = line.getAs[Long]("c")               // get the count
      val min_price = line.getAs[Double]("min_price") // get the minimum price
      val value = line.getAs[Seq[Row]]("res")         // get the rows collected per group; note each element is a Row
      println(id + " " + count + " " + min_price)     // print the group-level data
      value.foreach(row => {                          // iterate over the rows within the group and print them
        println(row.getAs[String]("name") + " " + row.getAs[Double]("price"))
      })
    })
    spark.stop()
  }
}
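For reference, running this locally prints output along these lines (the order of the groups may differ, since group by makes no ordering promise):

0 2 22.1
p 30.9
u 22.1
1 1 19.6
r 19.6
2 3 11.3
cat40 20.7
cat187 27.9
cat183 11.3
3 1 35.6
cat8 35.6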
At this point we have a small project covering a fairly complete set of Spark SQL features. The code above runs directly on Windows, and the data can be added to or trimmed at will, which makes it easy to compare Spark SQL's results against expectations. The SQL also uses an advanced grouping feature: collecting the rows inside each group. Note that when collecting within a group, a single field needs only collect_list or collect_set, but multiple fields require the struct type; each struct is converted into a Row object, so one group's data ends up as a Seq[Row] (a list of Rows) that can be traversed in code, as sketched below. Combined with the Scala language, Spark SQL becomes very flexible: handle what SQL is bad at in the programming language, and use SQL for what SQL is good at to get the data quickly. It makes for a very clean workflow.
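A minimal sketch of that single-field versus multi-field difference, reusing the pro view and imports from the example above (the names and pairs aliases are just illustrative):

// a single field: collect_list returns the values directly
val names = sql("select id, collect_list(name) as names from pro group by id")
names.collect().foreach { r =>
  println(r.getAs[Int]("id") + " -> " + r.getAs[Seq[String]]("names").mkString(","))
}

// multiple fields: wrap them in struct; each element comes back as a Row
val pairs = sql("select id, collect_list(struct(name, price)) as pairs from pro group by id")
pairs.collect().foreach { r =>
  r.getAs[Seq[Row]]("pairs").foreach(row =>
    println(row.getAs[String]("name") + " " + row.getAs[Double]("price")))
}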
If you have any questions, scan and follow the WeChat public account 我是攻城师 (woshigcs) and leave a message there. Never owe technical debt, and still less health debt; on the road of seeking the way, we travel together.