Spark: querying arbitrary fields and outputting the result as a DataFrame

When writing a Spark program, querying a particular field of a CSV file is usually done in one of the following ways:
(1) Query directly with a DataFrame

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
```
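The `customSchema` referenced above is not defined in the snippet. A minimal sketch of what it might look like, with the column names and types assumed purely for illustration:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema for cars.csv; the column names and types are assumptions
val customSchema = StructType(Array(
  StructField("year", IntegerType, nullable = true),
  StructField("make", StringType, nullable = true),
  StructField("model", StringType, nullable = true)))
```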

Reference: https://github.com/databricks/spark-csv

The CSV-reading code above is the Spark 1.x style; Spark 2.x is written a bit differently:
```scala
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
```
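Incidentally, Spark 2.x also ships a built-in CSV data source, so the external com.databricks.spark.csv package is optional there; a minimal equivalent sketch:

```scala
// Same read using Spark 2.x's built-in CSV reader
val df = sparkSession.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("people.csv")
  .cache()
```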

(2) Build a case class.

```scala
case class Person(name: String, age: Long)

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
```
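The columns of a result row can also be accessed by field name rather than by index; a short sketch reusing `teenagersDF` (it relies on the same `import spark.implicits._` as above):

```scala
// Access a column of each row by field name instead of position
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
```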

This is an example from the Spark 2.2.0 documentation.

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html

Both of the approaches above are fine for testing small files whose headers hold a modest number of columns (up to a few dozen), for example when I only query a user's Name, Age, and Sex fields.

In practice, however, you run into problems like these:
**(1) I don't know in advance which fields will be queried;
(2) I don't know in advance how many fields will be queried.**

Then the examples above are no longer enough. Fortunately there is a third approach, (3):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
```
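Note that `Row(attributes(0), attributes(1).trim)` hard-codes exactly two fields. When the number of fields is not known ahead of time, `Row.fromSeq` can build a Row from a sequence of any length; a minimal sketch:

```scala
// Build each Row from however many fields the split produces,
// so the schema length can be decided at runtime
val rowRDDDynamic = peopleRDD
  .map(_.split(","))
  .map(attributes => Row.fromSeq(attributes.map(_.trim)))
```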

This example also comes from the Spark documentation. It still uses a DataFrame, but the structure of the queried fields is built with StructField and StructType, and each field of a result row is accessed by its index rather than by a concrete field name such as Name or Age. In practice, though, example (3) behaves much like examples (1) and (2): it still does not solve the problems raised above and needs further improvement.
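One more building block worth noting: a DataFrame can already select a runtime-determined list of columns directly. A minimal sketch, where `queryCols` stands for column names chosen at runtime:

```scala
import org.apache.spark.sql.functions.col

// queryCols is assumed to arrive at runtime, e.g. from user input
val queryCols = Array("name", "age")
val selectedDF = peopleDF.select(queryCols.map(col): _*)
```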

Example (4):

```scala
import org.apache.spark.sql.types._

val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()

val schemaString = "name,age"

// Register a temporary view
df.createOrReplaceTempView("people")

// Run the SQL query
val dataDF = sparkSession.sql("select " + schemaString + " from people")

// Convert to an RDD
val dfrdd = dataDF.rdd

val fields = schemaString.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert the RDD back to a DataFrame
val newDF = sparkSession.createDataFrame(dfrdd, schema)
```

This solves the problems raised above: the field list is just a string, so both which fields and how many fields to query can be decided at runtime.
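For instance, `schemaString` itself can be assembled at runtime from an arbitrary array of field names (`queryArr` here is a hypothetical input):

```scala
// Build the comma-separated field list from a runtime array of column names
val queryArr = Array("name", "age")
val schemaString = queryArr.mkString(",")
```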

DataFrames are fast, especially in recent versions. In production, though, we may still use RDDs to transform the data into the shape we want. In that case, you can write something like this:

For example, to extract only the needed fields from the array that represents one full CSV row, and turn them into a new array, given a query such as:

```scala
val queryArr = Array("NAME", "AGE")
```

you can write:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

// rowRDD is assumed here to be an RDD[Array[String]], e.g. csvRDD.map(_.split(","))
val rowRDD2 = rowRDD.map(attributes => {
  val myattributes: Array[String] = attributes
  // Positions of the columns to query, i.e. which column each queried field sits in
  val mycolumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  val mycolumnsNameDataArrb: ArrayBuffer[String] = new ArrayBuffer[String]()
  for (i <- 0 until mycolumnsNameIndexArr.length) {
    mycolumnsNameDataArrb += myattributes(mycolumnsNameIndexArr(i)).toString
  }
  val mycolumnsNameDataArr: Array[String] = mycolumnsNameDataArrb.toArray
  mycolumnsNameDataArr
}).map(x => Row(x)).cache() // Row(x) wraps the whole array as a single field
```
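The snippet assumes a broadcast variable `colsNameIndexArrBroadcast` already exists. A minimal sketch of how it might be built, with the header layout assumed for illustration:

```scala
// Hypothetical setup: map each queried column name to its position in the CSV header
val header: Array[String] = Array("ID", "NAME", "AGE", "SEX") // assumed header row
val colsNameIndexArr: Array[Int] = queryArr.map(name => header.indexOf(name))
val colsNameIndexArrBroadcast = sparkSession.sparkContext.broadcast(colsNameIndexArr)
```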

This way, each row of the returned RDD is an array of just the queried fields; by iterating over the rows you can then turn rows into columns.
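One way to do that row-to-column conversion is to collect the per-row arrays and transpose them on the driver; this is only sensible for small result sets, so the following is a sketch under that assumption:

```scala
// Pull the Array[String] back out of each single-field Row, then transpose
// so that element i of every row becomes column i (one array per queried field)
val rowArrays: Array[Array[String]] = rowRDD2
  .map(row => row.getAs[Array[String]](0))
  .collect()
val columnArrays: Array[Array[String]] = rowArrays.transpose
```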
