Two Ways to Create a DataFrame from an RDD
1. Inferring the Schema Using Reflection
Spark SQL's Scala interface supports automatically converting an RDD of case class objects into a DataFrame. The case class defines the schema of the table: the names of its constructor parameters are read via reflection and become the column names. Case classes can also be nested or contain complex types such as sequences or arrays. The RDD can be implicitly converted into a DataFrame and then registered as a table, which can be used in subsequent SQL statements.
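Before the full Spark example, the reflection idea can be illustrated in plain Scala (no Spark needed). This is only an illustrative sketch: `productElementNames` requires Scala 2.13 or later, and Spark's actual schema inference uses Scala runtime reflection rather than this method.

```scala
// The same case class used in the Spark examples below.
case class Person(name: String, age: Int)

object CaseClassFields {
  def main(args: Array[String]): Unit = {
    val p = Person("Justin", 19)
    // A case class is a Product, so its field names are discoverable at
    // runtime (Scala 2.13+) -- this is the kind of information Spark uses
    // to derive the column names of the DataFrame.
    val fieldNames = p.productElementNames.toList
    println(fieldNames) // List(name, age)
    println(p.productIterator.toList) // List(Justin, 19)
  }
}
```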
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

case class Person(name: String, age: Integer)

object RDD2DataFrameByReflection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val fileRDD: RDD[String] = sc.textFile("examples/src/main/resources/people.txt")
    val personRDD: RDD[Person] = fileRDD
      .map(_.split(","))
      .map(p => Person(p(0).trim, p(1).trim.toInt))

    import sqlContext.implicits._
    val df: DataFrame = personRDD.toDF()
    df.registerTempTable("student")

    val result: DataFrame = sqlContext.sql("select * from student")
    result.show()
    result.map(t => "name:" + t(0)).foreach(println)
    result.map(_.getAs[String]("name")).foreach(println)
    result.map(_.getValuesMap[Any](List("name", "age"))).foreach(println)
  }
}
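The comments above mention that Scala 2.10 case classes support at most 22 fields, and that a custom class implementing the `Product` interface is the workaround. A minimal sketch of such a class (the name `BigRecord` is hypothetical, and only two fields are shown for brevity):

```scala
// A plain class implementing Product, usable where a case class would be
// too limited (e.g. more than 22 fields under Scala 2.10).
class BigRecord(val name: String, val age: Int) extends Product with Serializable {
  // Number of fields in the record.
  def productArity: Int = 2
  // Positional access to the fields, in declaration order.
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  def canEqual(that: Any): Boolean = that.isInstanceOf[BigRecord]
}
```

With all fields exposed through `Product`, the class behaves like a case class for positional access: `new BigRecord("Ann", 30).productElement(0)` yields `"Ann"`.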
2. Programmatically Specifying the Schema (Converting the RDD to Rows)
When case classes cannot be defined ahead of time (for example, when the structure of the records is encoded in a string, or when a text dataset will be parsed with fields projected differently for different users), a DataFrame can be created programmatically in three steps:
(1) Create an RDD of Rows from the original RDD;
(2) Create a schema whose structure matches the Rows in the RDD created in step (1);
(3) Apply the schema to the RDD of Rows to produce a DataFrame.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object RDD2DataFrameByProgrammatically {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val fileRDD: RDD[String] = sc.textFile("examples/src/main/resources/people.txt")
    val rowRDD: RDD[Row] = fileRDD
      .map(_.split(","))
      .map(p => Row(p(0).trim, p(1).trim.toInt))

    val schema = StructType(Array(
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)
    ))

    val df: DataFrame = sqlContext.createDataFrame(rowRDD, schema)
    df.registerTempTable("student")
    sqlContext.sql("select * from student").show()
  }
}
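Both approaches parse each line of `people.txt` the same way before the conversion. That parsing step alone can be sketched in plain Scala (no Spark required); `ParsePeople` and `parseLine` are hypothetical names introduced here for illustration:

```scala
object ParsePeople {
  // Parse one "name, age" line into a (String, Int) pair,
  // mirroring the split/trim logic used in the RDD examples above.
  def parseLine(line: String): (String, Int) = {
    val parts = line.split(",")
    (parts(0).trim, parts(1).trim.toInt)
  }

  def main(args: Array[String]): Unit = {
    // Sample lines in the format of people.txt
    val sample = Seq("Michael, 29", "Andy, 30", "Justin, 19")
    sample.map(parseLine).foreach(println)
  }
}
```

Note that `toInt` throws a `NumberFormatException` on malformed input, so in a real job a malformed line would fail the whole task; the Spark examples above share this behavior.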