Spark WordCount: read a file from Hadoop HDFS and write the output back to HDFS
0 Set up the Spark development environment by following these posts:
http://blog.csdn.net/w13770269691/article/details/15505507
http://blog.csdn.net/qianlong4526888/article/details/21441131
1 Set up a Scala development environment in Eclipse (Juno or later)
Just install the Scala IDE plugin: Help -> Install New Software -> Add, with URL http://download.scala-ide.org/sdk/e38/scala29/stable/site
refer to: http://dongxicheng.org/framework-on-yarn/spark-eclipse-ide/
2 Write WordCount in Scala in Eclipse
Create a Scala project and a WordCount object as follows:
package com.qiurc.test

import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage: com.qiurc.test.WordCount <master> <input> <output>")
      return
    }
    // Spark 0.8-style constructor: master URL, app name, SPARK_HOME,
    // and the jars to ship to the workers (our exported job jar).
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"),
      Seq(System.getenv("SPARK_QIUTEST_JAR")))
    // Split each line into words, emit (word, 1), and sum the counts per word.
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
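For a quick check of the same logic before involving the cluster, you can run it against a local master. This is a minimal sketch of mine, not part of the original project: the WordCountLocal name and the local a.txt path are hypothetical, and it assumes the same Spark 0.8-era API.

package com.qiurc.test

import org.apache.spark.SparkContext
import SparkContext._

object WordCountLocal {
  def main(args: Array[String]) {
    // The "local" master runs everything in-process, so no cluster
    // and no jar shipping are needed.
    val sc = new SparkContext("local", "WordCountLocal")
    val counts = sc.textFile("a.txt")   // hypothetical local input file
      .flatMap(_.split(" "))            // lines -> words
      .map(word => (word, 1))           // word -> (word, 1)
      .reduceByKey(_ + _)               // sum the counts per word
    counts.collect().foreach(println)   // print (word, count) pairs to stdout
    sc.stop()
  }
}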
3 Export the project as a jar
Right-click the project and export it as spark_qiutest.jar, then put the jar into a directory such as SPARK_HOME/qiutest.
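If you prefer building from the command line instead of exporting from Eclipse, a minimal sbt build definition should also produce a usable jar. This is my alternative sketch, not the original post's workflow; it assumes the 0.8.0-incubating spark-core artifact for Scala 2.9.3.

// build.sbt -- hypothetical command-line alternative to the Eclipse export;
// "sbt package" then writes the jar under target/scala-2.9.3/.
name := "spark-qiutest"

version := "0.1"

scalaVersion := "2.9.3"

// Spark is already installed on the cluster, so mark it "provided"
// to keep spark-core out of the application jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-incubating" % "provided"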
4 Write a run script to launch the jar
Copy run-example (in SPARK_HOME) and adapt it:
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cp run-example run-qiu-test
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ vim run-qiu-test
____________________________________
SCALA_VERSION=2.9.3

# Figure out where the Scala framework is installed
FWDIR="$(cd `dirname $0`; pwd)"

# Export this as SPARK_HOME
export SPARK_HOME="$FWDIR"

# Load environment variables from conf/spark-env.sh, if it exists
if [ -e "$FWDIR/conf/spark-env.sh" ] ; then
  . "$FWDIR/conf/spark-env.sh"
fi

if [ -z "$1" ]; then
  echo "Usage: run-qiu-test <main-class> [<args>]" >&2
  exit 1
fi

# Figure out the JAR file that our job was packaged into.
QIUTEST_DIR="$FWDIR"/qiutest
SPARK_QIUTEST_JAR=""
if [ -e "$QIUTEST_DIR"/spark_qiutest.jar ]; then
  export SPARK_QIUTEST_JAR=`ls "$QIUTEST_DIR"/spark_qiutest.jar`
fi
if [[ -z $SPARK_QIUTEST_JAR ]]; then
  echo "Failed to find Spark qiutest jar assembly in $FWDIR/qiutest" >&2
  echo "You need to build the spark test jar assembly before running this program" >&2
  exit 1
fi

# Since the job JAR ideally shouldn't include spark-core (that dependency should be
# "provided"), also add our standard Spark classpath, built using compute-classpath.sh.
CLASSPATH=`$FWDIR/bin/compute-classpath.sh`
CLASSPATH="$SPARK_QIUTEST_JAR:$CLASSPATH"

# Find java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ `command -v java` ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

if [ "$SPARK_PRINT_LAUNCH_COMMAND" == "1" ]; then
  echo -n "Spark Command: "
  echo "$RUNNER" -cp "$CLASSPATH" "$@"
  echo "========================================"
  echo
fi

exec "$RUNNER" -cp "$CLASSPATH" "$@"
____________________________________
5 Run it on Spark against Hadoop HDFS
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ls
a.txt logs python spark-class2.cmd
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat a.txt
a
b
c
c
d
d
e
e
(note: put a.txt into hdfs)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -put a.txt ./
(note: check a.txt in hdfs)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 6 items
-rw-r--r-- 2 hadoop supergroup 4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r-- 2 hadoop supergroup 19 2014-04-14 15:58 /user/hadoop/a.txt
-rw-r--r-- 2 hadoop supergroup 0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r-- 2 hadoop supergroup 0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x - hadoop supergroup 0 2014-04-14 15:57 /user/hadoop/qiurc
drwxr-xr-x - hadoop supergroup 0 2013-07-06 19:48 /user/hadoop/temp
(note: create a directory named "qiurc" in hdfs to store the WordCount output)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -mkdir /user/hadoop/qiurc
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ hadoop fs -ls
Found 5 items
-rw-r--r-- 2 hadoop supergroup 4215 2014-04-14 10:27 /user/hadoop/README.md
-rw-r--r-- 2 hadoop supergroup 0 2013-05-29 17:17 /user/hadoop/dumpfile
-rw-r--r-- 2 hadoop supergroup 0 2013-05-29 17:19 /user/hadoop/dumpfiles
drwxr-xr-x - hadoop supergroup 0 2014-04-14 15:32 /user/hadoop/qiurc
drwxr-xr-x - hadoop supergroup 0 2013-07-06 19:48 /user/hadoop/temp
Now run our WordCount program, specifying the input and output locations; here args(0) is the master URL, args(1) the input path, and args(2) the output path. In my tests the output only lands in HDFS when the full hdfs:// absolute path is given (presumably because a bare path would otherwise resolve against the local filesystem when the Hadoop configuration is not picked up).
(note: the prefix "hdfs://debian-master:9000/user/hadoop/" must not be forgotten)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ ./run-qiu-test com.qiurc.test.WordCount spark://debian-master:7077 hdfs://debian-master:9000/user/hadoop/a.txt hdfs://debian-master:9000/user/hadoop/qiurc
(note: fetching the output with hadoop fs -get works too; here the output directory was copied to a local dir named localFile, which contains:)
part-00000 part-00001 part-00002 _SUCCESS
(note: let's look at the results)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localFile/part-00000
(,1)
(c,2)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localFile/part-00001
(d,2)
(a,1)
hadoop@debian-master:~/spark-0.8.0-incubating-bin-hadoop1$ cat localFile/part-00002
(e,3)
(b,1)
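As a final sanity check (my addition, assuming a spark-shell session where sc is already bound), the saved pairs can be read straight back from HDFS. saveAsTextFile wrote each tuple in its "(word,count)" toString form, so a simple parse recovers the counts (this naive split assumes words contain no commas):

// Hypothetical read-back of the WordCount output directory from HDFS.
val saved = sc.textFile("hdfs://debian-master:9000/user/hadoop/qiurc")
val parsed = saved.map { line =>
  // Each line looks like "(word,count)"; strip the parens and split once.
  val Array(word, count) = line.stripPrefix("(").stripSuffix(")").split(",")
  (word, count.toInt)
}
parsed.collect().foreach(println)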
Finished! ^_^