A Summary of Variable Access in Spark
Spark programs differ from conventional Java programs in how variables are accessed, which leads to differences in how values are passed and in the results obtained. This article analyzes how Spark handles variables of various kinds through a set of experiments.
The small program below creates:
- a static variable staticVariable, a static broadcast variable staticBroadcast, and a static accumulator staticAccumulator;
- a member variable objectVariable, a member broadcast variable objectBroadcast, and a member accumulator objectAccumulator;
- a local variable localVariable, a local broadcast variable localBroadcast, and a local accumulator localAccumulator;

nine variables in total. The program changes the string values twice ("banana" and "cat") and increments the accumulators inside the flatMap function.
```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.broadcast.Broadcast;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Created by Alex on 2016/9/25.
 */
public class TestSharedVariable implements Serializable {
    private static Logger logger = LoggerFactory.getLogger(TestSharedVariable.class);

    private static String staticVariable = "apple";
//    private static Broadcast<String> staticBroadcast;      // java.lang.NullPointerException
//    private static Accumulator<Integer> staticAccumulator; // java.lang.NullPointerException

    private String objectVariable = "apple";
    private Broadcast<String> objectBroadcast;
    private Accumulator<Integer> objectAccumulator;

    public void testVariables(JavaSparkContext sc) throws Exception {
        staticVariable = "banana";
//        staticBroadcast = sc.broadcast("banana");
//        staticAccumulator = sc.intAccumulator(0);
        objectVariable = "banana";
        objectBroadcast = sc.broadcast("banana");
        objectAccumulator = sc.intAccumulator(0);
        String localVariable = "banana";
        accessVariables(sc, localVariable);

        staticVariable = "cat";
//        staticBroadcast = sc.broadcast("cat");
        objectVariable = "cat";
        objectBroadcast = sc.broadcast("cat");
        localVariable = "cat";
        accessVariables(sc, localVariable);
    }

    public void accessVariables(JavaSparkContext sc, final String localVariable) throws Exception {
        final Broadcast<String> localBroadcast = sc.broadcast(localVariable);
        final Accumulator<Integer> localAccumulator = sc.intAccumulator(0);

        List<String> list = Arrays.asList("machine learning", "deep learning", "graphic model");
        JavaRDD<String> rddx = sc.parallelize(list).flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                List<String> list = new ArrayList<String>();
                if (s.equalsIgnoreCase("machine learning")) {
                    list.add("staticVariable:" + staticVariable);
                    list.add("objectVariable:" + objectVariable);
                    list.add("objectBroadcast:" + objectBroadcast.getValue());
                    list.add("localVariable:" + localVariable);
                    list.add("localBroadcast:" + localBroadcast.getValue());
                }
//                staticAccumulator.add(1);
                objectAccumulator.add(1);
                localAccumulator.add(1);
                return list;
            }
        });

        String desPath = "learn" + localVariable;
        HdfsOperate.deleteIfExist(desPath);
        HdfsOperate.openHdfsFile(desPath);
        List<String> resultList = rddx.collect();
        for (String str : resultList) {
            HdfsOperate.writeString(str);
        }
        HdfsOperate.writeString("objectAccumulator:" + objectAccumulator.value());
        HdfsOperate.writeString("localAccumulator:" + localAccumulator.value());
        HdfsOperate.closeHdfsFile();
    }
}
```
Running it produces two files: learnbanana and learncat.
Contents of learnbanana:
```
staticVariable:apple
objectVariable:banana
objectBroadcast:banana
localVariable:banana
localBroadcast:banana
objectAccumulator:3
localAccumulator:3
```
Contents of learncat:
```
staticVariable:apple
objectVariable:cat
objectBroadcast:cat
localVariable:cat
localBroadcast:cat
objectAccumulator:6
localAccumulator:3
```
The static broadcast variable staticBroadcast and the static accumulator staticAccumulator throw a java.lang.NullPointerException at runtime, so they have to be commented out; this is a significant difference from ordinary Java code. The reason is that a static field assigned on the driver is still null inside the executor JVMs. For the same reason, the static variable staticVariable cannot have its value changed from within the function once it has been initialized: the tasks still print "apple", because static fields are not serialized with the task closure and each executor JVM runs its own static initialization. The member variable objectVariable, member broadcast variable objectBroadcast, member accumulator objectAccumulator, local variable localVariable, local broadcast variable localBroadcast, and local accumulator localAccumulator were all modified normally and produced the values we wanted. The two accumulator readings also differ for a reason: objectAccumulator is created once in testVariables and keeps accumulating across both calls to accessVariables (3 + 3 = 6), whereas localAccumulator is created afresh on every call and therefore reads 3 each time.
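The static-variable behavior can be reproduced without a cluster. The sketch below is a plain Java simulation, not Spark code (the class name StaticFieldDemo is made up for illustration): it serializes an object the way Spark serializes a task closure, then resets the static field to its initial value to mimic a fresh executor JVM. The instance field travels with the serialized object; the static field does not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class StaticFieldDemo implements Serializable {
    static String staticVariable = "apple";   // per-JVM state, not part of serialized form
    String objectVariable = "apple";          // instance state, serialized with the object

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        StaticFieldDemo closure = new StaticFieldDemo();
        staticVariable = "banana";            // "driver-side" update
        closure.objectVariable = "banana";    // "driver-side" update
        byte[] shipped = serialize(closure);  // what would be sent to an executor

        staticVariable = "apple";             // a fresh executor JVM re-runs static init
        StaticFieldDemo remote = (StaticFieldDemo) deserialize(shipped);

        System.out.println("objectVariable:" + remote.objectVariable); // banana
        System.out.println("staticVariable:" + staticVariable);        // apple
    }
}
```

This matches the experiment: the member variable arrives as "banana"/"cat", while the static one stays at its initializer value "apple".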
Note that the local variable, local broadcast variable, and local accumulator must be declared final because they are accessed from an inner class. This does not affect their normal use; as the output shows, the local accumulator still increments correctly.
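The final requirement is plain Java, not something Spark adds: an anonymous inner class (like the FlatMapFunction above) may only capture local variables that are final (or, since Java 8, effectively final). A minimal sketch, with a made-up makeTagger helper standing in for the RDD function:

```java
import java.util.function.Function;

public class FinalCaptureDemo {
    // Returns a function object that captures a local variable,
    // the same way the FlatMapFunction above captures localVariable.
    static Function<String, String> makeTagger(String value) {
        final String localVariable = value; // must be final to be captured by the inner class
        return new Function<String, String>() {
            @Override
            public String apply(String s) {
                return localVariable + ":" + s;
            }
        };
    }

    public static void main(String[] args) {
        Function<String, String> tagger = makeTagger("banana");
        System.out.println(tagger.apply("machine learning")); // banana:machine learning
    }
}
```

Reassigning localVariable after it is captured would be a compile error, which is exactly why accessVariables takes localVariable as a final parameter instead of mutating it.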
To sum up briefly: when a Spark program needs to access and modify variables, prefer member variables and local variables. The difference between a plain variable and a broadcast variable is that a broadcast variable is optimized in transit: it is shipped to each executor once rather than serialized into every task closure, which avoids unnecessary waste of resources. For small variables, Broadcast is generally not needed. For a detailed explanation, see: http://g-chi.github.io/2015/10/21/Spark-why-use-broadcast-variables/
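A rough way to see why broadcast matters for large read-only data: anything captured as a plain variable is paid for once per task in serialized form, while sc.broadcast ships it once per executor. The sketch below (plain Java, class name ClosureSizeDemo made up for illustration; exact byte counts vary by JVM) just measures the Java-serialized size of a large lookup table versus a small string like the ones in this article:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class ClosureSizeDemo {
    // Size of an object in Java-serialized form, i.e. roughly what a
    // task closure capturing it would add to every task sent over the wire.
    static int serializedSize(Serializable o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        // A large read-only lookup table, e.g. a dictionary used inside a map function.
        HashMap<Integer, String> lookup = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            lookup.put(i, "value-" + i);
        }

        int tableBytes = serializedSize(lookup);   // shipped with EVERY task if captured directly
        int stringBytes = serializedSize("banana"); // a small value, like those in this article

        System.out.println("lookup table: " + tableBytes + " bytes per task");
        System.out.println("small string: " + stringBytes + " bytes per task");
        // With hundreds of tasks, re-shipping tableBytes each time is the waste
        // that sc.broadcast(lookup) avoids; at stringBytes scale it is not worth it.
    }
}
```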
The HdfsOperate class uses Hadoop's FileSystem API to operate on HDFS files; the full code is as follows:
```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.Serializable;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Created by Alex on 2016/8/30.
 */
public class HdfsOperate implements Serializable {
    private static Logger logger = LoggerFactory.getLogger(HdfsOperate.class);
    private static Configuration conf = new Configuration();
    private static BufferedWriter writer = null;

    public static boolean isExist(String path) {
        try {
            FileSystem fileSystem = FileSystem.get(conf);
            Path path1 = new Path(path);
            if (fileSystem.exists(path1)) {
                return true;
            }
        } catch (Exception e) {
            logger.error("[HdfsOperate]>>>isExist error", e);
        }
        return false;
    }

    public static void deleteIfExist(String path) {
        try {
            FileSystem fileSystem = FileSystem.get(conf);
            Path path1 = new Path(path);
            if (fileSystem.exists(path1)) {
                fileSystem.delete(path1, true);
            }
        } catch (Exception e) {
            logger.error("[HdfsOperate]>>>deleteHdfsFile error", e);
        }
    }

    public static void openHdfsFile(String path) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        writer = new BufferedWriter(new OutputStreamWriter(fs.create(new Path(path))));
        if (null != writer) {
            logger.info("[HdfsOperate]>> initialize writer succeed!");
        }
    }

    public static void writeString(String line) {
        try {
            writer.write(line + "\n");
        } catch (Exception e) {
            logger.error("[HdfsOperate]>> write a line error:", e);
        }
    }

    public static void closeHdfsFile() {
        try {
            if (null != writer) {
                writer.close();
                logger.info("[HdfsOperate]>> closeHdfsFile close writer succeed!");
            } else {
                logger.error("[HdfsOperate]>> closeHdfsFile writer is null");
            }
        } catch (Exception e) {
            logger.error("[HdfsOperate]>> closeHdfsFile close hdfs error:", e);
        }
    }
}
```