Spark Notes
Spark development comes with plenty of pitfalls. I jot a few of them down here for future reference.
1. Setting up a Spark environment in IDEA: the pom.xml contents (including repository mirrors for dependency downloads)
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>xxx</groupId>
  <artifactId>xxx</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.main.version>2.11</scala.main.version>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.7.3</hadoop.version>
  </properties>

  <repositories>
    <repository>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    </repository>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <repository>
      <id>repo.boundlessgeo.com</id>
      <name>repo.boundlessgeo.com</name>
      <url>https://repo.boundlessgeo.com/main</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_${scala.main.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-hbase-handler</artifactId>
      <version>2.1.1</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.39</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.3</version>
    </dependency>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.10.3</version>
    </dependency>
    <dependency>
      <groupId>io.dropwizard.metrics</groupId>
      <artifactId>metrics-core</artifactId>
      <version>3.2.5</version>
    </dependency>
    <dependency>
      <groupId>org.geotools</groupId>
      <artifactId>gt-main</artifactId>
      <version>18.0</version>
    </dependency>
    <dependency>
      <groupId>org.geotools</groupId>
      <artifactId>gt-opengis</artifactId>
      <version>18.0</version>
    </dependency>
    <dependency>
      <groupId>org.geotools</groupId>
      <artifactId>gt-shapefile</artifactId>
      <version>18.0</version>
    </dependency>
    <dependency>
      <groupId>com.vividsolutions</groupId>
      <artifactId>jts-core</artifactId>
      <version>1.14.0</version>
    </dependency>
    <dependency>
      <groupId>javax.measure</groupId>
      <artifactId>jsr-275</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <!-- Scala 2.11 no longer accepts jvm-1.5; target 1.7 to match the compiler plugin below -->
            <arg>-target:jvm-1.7</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
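With the pom in place, a tiny word-count application is a quick way to confirm that IDEA, Scala, and the Spark dependencies are wired up correctly. This is only a smoke-test sketch; the object name SparkEnvCheck and the input path are placeholders, not part of the project above:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal smoke test: run it from IDEA with master "local[*]" so nothing
// beyond the pom dependencies is required. The input path is a placeholder.
object SparkEnvCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkEnvCheck").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("data/sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}

If the word counts print without ClassNotFoundException or Scala version mismatch errors, the environment is good to go.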
2. Loading a configuration file from inside the jar
import java.util.Properties
import scala.io.Source

// Load a .properties file packaged inside the application jar.
// Lines are split on "=" by hand so that values which themselves contain "="
// (e.g. JVM option strings) survive intact; lines starting with "#" are comments.
def loadConfigInJar(path: String): Properties = {
  val properties = new Properties()
  val url = this.getClass.getClassLoader.getResource(path) // e.g. "conf.properties"
  val source = Source.fromURL(url)
  source.getLines().toArray
    .filter(item => item.trim.length > 0 && item.trim.indexOf("#") != 0)
    .map(item => item.split("="))
    .foreach(d => {
      if (d.length == 2) {
        System.err.println(d(0) + "=" + d(1))
        properties.put(d(0).trim, d(1).trim)
      } else if (d.length == 1) {
        System.err.println(d(0) + "=null")
        properties.put(d(0).trim, "")
      } else if (d.length > 2) {
        // the value itself contains "=": rejoin everything after the key
        val key = d(0)
        val value = d.toIndexedSeq.slice(1, d.length).mkString("=")
        System.err.println(key + "=" + value)
        properties.put(key.trim, value.trim)
      }
    })
  properties
}
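For reference, a small usage sketch. It assumes a conf.properties under src/main/resources, so that the file lands at the root of the packaged jar:

// conf.properties lives in src/main/resources and is packaged at the jar root
val props = loadConfigInJar("conf.properties")
// getProperty with a default guards against keys missing from the file
val timeout = props.getProperty("spark.network.timeout", "300")
System.err.println("spark.network.timeout=" + timeout)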
3. Common tuning parameters for Spark initialization
import java.text.SimpleDateFormat
import java.util.{Calendar, Date}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val conf = new SparkConf()
  .setAppName(appName)
  .set("hive.metastore.schema.verification", "false")
  .set("spark.network.timeout", Util.getProperty("spark.network.timeout"))
  .set("spark.yarn.executor.memoryOverhead", Util.getProperty("spark.yarn.executor.memoryOverhead"))
  .set("spark.storage.memoryFraction", Util.getProperty("spark.storage.memoryFraction"))
  .set("spark.driver.extraJavaOptions", Util.getProperty("spark.driver.extraJavaOptions"))
  .set("spark.executor.extraJavaOptions", Util.getProperty("spark.executor.extraJavaOptions"))
  .set("spark.driver.maxResultSize", "12g")
  .set("spark.rpc.io.backLog", "10000")
  .set("spark.cleaner.referenceTracking.blocking", "false")

/* Sample values for the properties read through Util.getProperty above:
hive.mapreduce.map.memory.mb=10240
hive.mapreduce.reduce.memory.mb=10240
hive.yarn.nodemanager.vmem-pmem-ratio=2.1
spark.network.timeout=300
spark.storage.memoryFraction=0.4
spark.yarn.executor.memoryOverhead=40960
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+UseG1GC -XX:PermSize=2048M -XX:MaxPermSize=12288M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:PermSize=2048M -XX:MaxPermSize=8192M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
zookeeper.znode.parent=/hbase
hbase.zookeeper.quorum=HOST1,HOST2,...
hbase.zookeeper.property.clientPort=2181
*/

val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

val mmm = Util.getProperty("hive.mapreduce.map.memory.mb")
val mrm = Util.getProperty("hive.mapreduce.reduce.memory.mb")
val ynv = Util.getProperty("hive.yarn.nodemanager.vmem-pmem-ratio")
spark.sql("set mapred.max.split.size=256000000")
spark.sql("set mapred.min.split.size.per.node=100000000")
spark.sql("set mapred.min.split.size.per.rack=100000000")
spark.sql("set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
spark.sql("set hive.merge.mapfiles=true")
spark.sql("set hive.merge.mapredfiles=true")
spark.sql("set hive.merge.size.per.task=256000000") // 256 * 1000 * 1000
spark.sql("set hive.merge.smallfiles.avgsize=16000000")
spark.sql("set mapreduce.map.memory.mb=" + mmm)
spark.sql("set mapreduce.reduce.memory.mb=" + mrm)
spark.sql("SET spark.sql.shuffle.partitions=10")

scala.sys.addShutdownHook {
  sc.stop()
}

// Timer helpers for measuring how long a stage of the job takes
var t0 = 0L

def memTime() = {
  t0 = new Date().getTime
}

def formatTime(mss: Long) = {
  val days = mss / (1000 * 60 * 60 * 24)
  val hours = (mss % (1000 * 60 * 60 * 24)) / (1000 * 60 * 60)
  val minutes = (mss % (1000 * 60 * 60)) / (1000 * 60)
  val seconds = (mss % (1000 * 60)) / 1000
  var s = if (days > 0) days + "d " else ""
  s = s + (if (hours > 0) hours + "h " else "")
  s = s + (if (minutes > 0) minutes + "min " else "")
  s = s + (if (seconds > 0) seconds + "s" else (mss % (1000 * 60)) + "ms")
  s = s + " [" + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(Calendar.getInstance().getTime()) + "]"
  s
}

def showCost(title: String = "") = {
  val mss = new Date().getTime - t0
  val s = formatTime(mss)
  System.err.println(title + s)
  memTime() // reset the timer for the next measurement
  s
}
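One note on the Kryo line above: Kryo only pays off fully once the classes crossing shuffles are registered; unregistered classes force Kryo to write full class names into every record. A minimal sketch, where the TrackPoint case class is a hypothetical stand-in for your own shuffle-heavy types:

import org.apache.spark.SparkConf

// TrackPoint is a hypothetical placeholder — register the types that
// actually flow through your shuffles and broadcasts.
case class TrackPoint(id: Long, lon: Double, lat: Double)

val kryoConf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[TrackPoint], classOf[Array[TrackPoint]]))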
4. Hadoop 2.7.3 cluster configuration

mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>host:9001</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>host:10020</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>40</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>40</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>host</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>host:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>host:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>host:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>host:8090</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>host:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>
  </property>
  <property>
    <description>Classpath for typical applications. Run "hadoop classpath" to confirm the relevant Hadoop class paths.</description>
    <name>yarn.application.classpath</name>
    <value>/app/hadoop/lib:/app/hadoop/share/hadoop/yarn/lib:/app/rapidminer_libs/*.jar:/app/hadoop/contrib/capacity-scheduler/*.jar:/app/hadoop/etc/hadoop:/app/hadoop-2.7.3/share/hadoop/common/lib/*:/app/hadoop-2.7.3/share/hadoop/common/*:/app/hadoop-2.7.3/share/hadoop/hdfs:/app/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/app/hadoop-2.7.3/share/hadoop/hdfs/*:/app/hadoop-2.7.3/share/hadoop/yarn/lib/*:/app/hadoop-2.7.3/share/hadoop/yarn/*:/app/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/app/hadoop-2.7.3/share/hadoop/mapreduce/*</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>host:50090</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/tmp/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///data/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- The next two properties fix the error "User: root is not allowed to impersonate anonymous"
       when connecting to Hive; if you connect as a different user, replace "root" with that username -->
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>
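After distributing these files and restarting the cluster, a quick connectivity check against the NameNode saves time before submitting real jobs. A minimal sketch, where "host" is the same placeholder used in the configuration files above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the HDFS root to verify the fs.defaultFS from core-site.xml is reachable
val hdfsConf = new Configuration()
hdfsConf.set("fs.defaultFS", "hdfs://host:9000") // "host" is a placeholder
val fs = FileSystem.get(hdfsConf)
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
fs.close()

If this lists the root directories without an exception, the Spark-on-Hive jobs from section 3 should be able to reach HDFS as well.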