Spark SQL Study Notes (Data Source: JSON)
Source: Internet  Editor: 程序博客网  Time: 2024/05/18 00:13
Preparation
Data file: students.json
{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}
Storage location: hdfs://master:9000/student/2016113012/spark/students.json
Scala code
package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Administrator on 2017/2/12.
 */
// Create a DataFrame by loading a JSON data source
object JsonOperation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonOperation")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Read the JSON file directly
    val df1 = sqlContext.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
    // Alternatively, read it with load; the format must be given explicitly,
    // otherwise load defaults to reading Parquet files
    // sqlContext.read.format("json").load("hdfs://master:9000/student/2016113012/spark/students.json")
    df1.printSchema()
    df1.registerTempTable("t_students")
    val teenagers = sqlContext.sql("select name from t_students where age > 13 and age < 19")
    teenagers.write.parquet("hdfs://master:9000/student/2016113012/teenagers")
  }
}
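To make the effect of the query concrete, here is a minimal sketch that applies the same predicate (age > 13 and age < 19) to the three sample records using plain Scala collections. It does not use Spark; the `TeenagerQuerySketch` object, the `Student` case class, and the `teenagerNames` helper are hypothetical names introduced only for illustration.

```scala
object TeenagerQuerySketch {
  // Hypothetical in-memory mirror of the three records in students.json.
  final case class Student(id: Long, name: String, age: Long)

  val students: List[Student] = List(
    Student(1, "leo", 18),
    Student(2, "jack", 19),
    Student(3, "marry", 17)
  )

  // Same predicate as the SQL statement: both bounds are exclusive,
  // so a student aged exactly 19 is filtered out.
  def teenagerNames(rows: List[Student]): List[String] =
    rows.filter(s => s.age > 13 && s.age < 19).map(_.name)

  def main(args: Array[String]): Unit =
    teenagerNames(students).foreach(println) // prints leo, then marry
}
```

Because the upper bound is strict, jack (age 19) is excluded, so the Parquet output of the Spark job should contain only the names leo and marry.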
Submitting to the cluster
hadoop@master:~/wujiadong
17/02/14 10:58:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/14 10:58:56 INFO Slf4jLogger: Slf4jLogger started
17/02/14 10:58:56 INFO Remoting: Starting remoting
17/02/14 10:58:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:58268]
17/02/14 10:58:59 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/02/14 10:59:05 INFO FileInputFormat: Total input paths to process : 1
17/02/14 10:59:11 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/14 10:59:11 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/14 10:59:11 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/14 10:59:11 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/14 10:59:11 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
17/02/14 10:59:18 INFO FileInputFormat: Total input paths to process : 1
17/02/14 10:59:18 INFO CodecPool: Got brand-new compressor [.gz]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/02/14 10:59:19 INFO FileOutputCommitter: Saved output of task 'attempt_201702141059_0001_m_000000_0' to hdfs://master:9000/studnet/2016113012/teenagers/_temporary/0/task_201702141059_0001_m_000000
Common errors
Exception in thread "main" java.io.IOException: No input paths specified in job
This error means the data source could not be read, for example because the data source path was mistyped.
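One way to surface a mistyped path earlier is to check that the input exists before handing it to the reader. The following is only a sketch under stated assumptions: `InputPathGuard` and `requireInput` are hypothetical names, and the existence check is passed in as a function so the guard itself has no Spark or Hadoop dependency.

```scala
import java.io.IOException

object InputPathGuard {
  // Hypothetical helper: returns `path` unchanged when `exists(path)` is true,
  // otherwise throws an IOException that names the missing path.
  // `exists` is injected by the caller, so the guard works with any file system.
  def requireInput(path: String)(exists: String => Boolean): String = {
    if (!exists(path)) {
      throw new IOException(s"Input path does not exist: $path")
    }
    path
  }

  def main(args: Array[String]): Unit = {
    // Stand-in existence check for the demo: only this one path "exists".
    val known = Set("hdfs://master:9000/student/2016113012/spark/students.json")
    println(requireInput("hdfs://master:9000/student/2016113012/spark/students.json")(known.contains))
  }
}
```

In a real job, the `exists` function could be backed by Hadoop's `FileSystem.exists(Path)`, for example with a `FileSystem` obtained through `FileSystem.get(URI, Configuration)` and the configuration taken from `sc.hadoopConfiguration`. That wiring is left out here so the sketch stays runnable without a cluster.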