Spark Components / SparkR Learning 3 -- Submitting the R script data-manipulation.R to a cluster with spark-submit
For more code, see: https://github.com/xubo245/SparkLearning
1. Data preparation
1.1 Download the data file:
wget http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
1.2 Upload it to HDFS:
hadoop fs -put flights.csv ./
2. Running
2.1 Default local run:
spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master local data-manipulation.R flights.csv
Run log:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master local data-manipulation.R flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 364ms :: artifacts dl 11ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)

DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows

                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"
[1] 227496
  dest cancelled
1  DFW         0
2  DFW         0
3  DFW         0
4  DFW         0
5  DFW         0
6  DFW         0
2.2 Cluster run:
Command:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master spark://MasterIP:7077 data-manipulation.R flights.csv
MasterIP must be replaced with the actual master IP address.
Running on the cluster is much faster than the default local run.
Run log:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master spark://MasterIP:7077 data-manipulation.R flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 342ms :: artifacts dl 12ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)

DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows

                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"
[1] 227496
  dest cancelled
1  DFW         0
2  DFW         0
3  DFW         0
4  DFW         0
5  DFW         0
6  DFW         0
2.3 Source file:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# For this example, we shall use the "flights" dataset
# The dataset consists of every flight departing Houston in 2011.
# The data set is made up of 227,496 rows x 14 columns.

# To run this example use
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
#   examples/src/main/r/data-manipulation.R <path_to_csv>

# Load SparkR library into your R session
library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  print("The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv ")
  q("no")
}

## Initialize SparkContext
sc <- sparkR.init(appName = "SparkR-data-manipulation-example")

## Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)

flightsCsvPath <- args[[1]]

# Create a local R dataframe
flights_df <- read.csv(flightsCsvPath, header = TRUE)
flights_df$date <- as.Date(flights_df$date)

## Filter flights whose destination is San Francisco and write to a local data frame
SFO_df <- flights_df[flights_df$dest == "SFO", ]

# Convert the local data frame into a SparkR DataFrame
SFO_DF <- createDataFrame(sqlContext, SFO_df)

# Directly create a SparkR DataFrame from the source data
flightsDF <- read.df(sqlContext, flightsCsvPath, source = "com.databricks.spark.csv", header = "true")

# Print the schema of this Spark DataFrame
printSchema(flightsDF)

# Cache the DataFrame
cache(flightsDF)

# Print the first 6 rows of the DataFrame
showDF(flightsDF, numRows = 6) ## Or
head(flightsDF)

# Show the column names in the DataFrame
columns(flightsDF)

# Show the number of rows in the DataFrame
count(flightsDF)

# Select specific columns
destDF <- select(flightsDF, "dest", "cancelled")

# Using SQL to select columns of data
# First, register the flights DataFrame as a table
registerTempTable(flightsDF, "flightsTable")
destDF <- sql(sqlContext, "SELECT dest, cancelled FROM flightsTable")

# Use collect to create a local R data frame
local_df <- collect(destDF)

# Print the newly created local data frame
head(local_df)

# Filter flights whose destination is JFK
jfkDF <- filter(flightsDF, "dest = \"JFK\"") ## OR
jfkDF <- filter(flightsDF, flightsDF$dest == "JFK")

# If the magrittr library is available, we can use it to
# chain data frame operations
if("magrittr" %in% rownames(installed.packages())) {
  library(magrittr)

  # Group the flights by date and then find the average daily delay
  # Write the result into a DataFrame
  groupBy(flightsDF, flightsDF$date) %>%
    summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF

  # Print the computed data frame
  head(dailyDelayDF)
}

# Stop the SparkContext now
sparkR.stop()
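Note that the run logs above show every column typed as string: spark-csv does not infer types unless asked, so avg() on dep_delay/arr_delay is really averaging strings. A minimal sketch of casting the columns first, assuming the same sqlContext and flightsDF as in the script (cast and the $ column setter are part of the SparkR column API; the avg_dep/avg_arr names are just illustrative):

# Cast the string-typed delay columns to double before aggregating.
flightsDF$dep_delay <- cast(flightsDF$dep_delay, "double")
flightsDF$arr_delay <- cast(flightsDF$arr_delay, "double")

# Average departure/arrival delay per destination, then inspect locally.
delayDF <- summarize(groupBy(flightsDF, flightsDF$dest),
                     avg_dep = avg(flightsDF$dep_delay),
                     avg_arr = avg(flightsDF$arr_delay))
head(delayDF)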
3. Problems encountered and fixes
3.1 The path looked correct but the file could not be read => Fix: put the file in the user directory and pass the bare name flights.csv. The reason, which was not obvious at first: the script's read.csv() call is plain R and reads only the local filesystem, so the HDFS-style absolute path below is looked up on local disk and fails; see the sketch after the log.
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R /xubo/spark/data/r/input/flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 357ms :: artifacts dl 11ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/9ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/xubo/spark/data/r/input/flights.csv': No such file or directory
Execution halted
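As the error shows, read.csv() fails through R's local file API while read.df() resolves paths through Hadoop, where a bare flights.csv means hdfs://Master:9000/user/hadoop/flights.csv (see 3.2 below). A sketch of making both locations explicit; the local path is assumed from the shell prompt, and the hdfs:// host/port should be adjusted to your cluster:

# read.csv() is ordinary R, so it needs a path on the local filesystem:
flights_df <- read.csv("/home/hadoop/cloud/testByXubo/spark/R/flights.csv", header = TRUE)

# read.df() goes through Hadoop, so a full HDFS URI removes the ambiguity:
flightsDF <- read.df(sqlContext, "hdfs://Master:9000/user/hadoop/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")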
3.2 "Input path does not exist" error => Fix: upload the file to HDFS (hadoop fs -put flights.csv ./).
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 371ms :: artifacts dl 12ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

16/04/20 12:41:53 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://Master:9000/user/hadoop/flights.csv
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RD
Calls: read.df -> callJStatic -> invokeJava
Execution halted
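To fail with a readable hint instead of a Java stack trace when the HDFS file is missing, the read.df call in the script could be wrapped in tryCatch. A sketch only, not part of the original script:

# Wrap the HDFS read so a missing input prints a hint and exits cleanly.
flightsDF <- tryCatch(
  read.df(sqlContext, flightsCsvPath,
          source = "com.databricks.spark.csv", header = "true"),
  error = function(e) {
    print(paste("Could not read", flightsCsvPath,
                "- did you run 'hadoop fs -put flights.csv ./'?"))
    q("no")
  }
)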
3.3 The com.databricks.spark.csv data source is not found => Fix: add it with --packages: spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
Run log:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit data-manipulation.R flights.csv
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

16/04/20 12:28:18 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
	at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:156)
	at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
	at org.apache.
Calls: read.df -> callJStatic -> invokeJava
Execution halted
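An alternative to passing --packages on the command line, if my reading of the Spark 1.5 SparkR API is right (treat this as an assumption and check your version's docs), is to request the package from inside the script via the sparkPackages argument of sparkR.init:

# Assumed alternative: declare spark-csv in the script itself.
sc <- sparkR.init(appName = "SparkR-data-manipulation-example",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")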
3.4 Not specifying --master makes the run very slow:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 342ms :: artifacts dl 25ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)

DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows

                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
0 0
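To avoid forgetting --master, the master URL can be pinned once, either in conf/spark-defaults.conf (a line such as spark.master spark://MasterIP:7077) or inside the script, since sparkR.init takes the master URL as its first argument. A sketch, with MasterIP again standing in for the real address:

# Pin the cluster master so a plain "spark-submit data-manipulation.R flights.csv"
# still runs on the cluster instead of crawling along without one.
sc <- sparkR.init(master = "spark://MasterIP:7077",
                  appName = "SparkR-data-manipulation-example")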