Interacting with MySQL from Spark


Background

After Spark finishes computing over the target data, results in RDD form are usually stored in HDFS, Hive, or HBase; non-RDD data, on the other hand, may end up in a traditional RDBMS such as MySQL.
However, all kinds of problems tend to show up during the write, and there are many threads about them on Stack Overflow:

  • Error writing spark dataframe to mysql table
  • JDBC batch insert performance

Some other related posts:
- Using Apache Spark and MySQL for Data Analysis
- Spark 1.3.0: writing DataFrame data into a partitioned Hive table
- Four ways for Spark to read from a database (MySQL), explained
- Complete JDBC database-connection code and steps in Java development
- Spark pitfalls: databases (HBase + MySQL)


RDD

For RDD-format results, Spark SQL reads and writes MySQL data over a JDBC connection.
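A minimal sketch of that path with the DataFrame API, assuming Spark 2.x's SparkSession (on Spark 1.x the same options go through SQLContext). The connection details (10.10.10.10:8888, tempDB, tempTableName, admin/admin) are reused from the JDBC example further down, and tempTableName_copy is a made-up target table:

import org.apache.spark.sql.SparkSession

object SparkSqlMysqlSketch {
    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("spark-mysql-demo").getOrCreate()
        val jdbcUrl = "jdbc:mysql://10.10.10.10:8888/tempDB"

        // Read tempTableName into a DataFrame over JDBC.
        val df = spark.read
            .format("jdbc")
            .option("url", jdbcUrl)
            .option("driver", "com.mysql.jdbc.Driver")
            .option("dbtable", "tempTableName")
            .option("user", "admin")
            .option("password", "admin")
            .load()

        // Write the (possibly transformed) DataFrame back, appending rows.
        df.write
            .format("jdbc")
            .option("url", jdbcUrl)
            .option("driver", "com.mysql.jdbc.Driver")
            .option("dbtable", "tempTableName_copy")
            .option("user", "admin")
            .option("password", "admin")
            .mode("append")
            .save()

        spark.stop()
    }
}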

Non-RDD

import java.sql.{Connection, DriverManager, PreparedStatement}

/*
    tableName = "tempTableName"
    columns   = [key : String, value : Int]
    DBIP      = 10.10.10.10
    DBPort    = 8888
    DB        = tempDB
*/
def connection2Mysql() = {
    var conn: Connection = null
    var ps: PreparedStatement = null
    val userName = "admin"
    val passwd = "admin"
    val key = "Tom"
    val value = 1024
    val sql = "INSERT INTO tempTableName(key, value) VALUES (?, ?)"
    try {
        // Register the MySQL JDBC driver with DriverManager.
        Class.forName("com.mysql.jdbc.Driver").newInstance
        conn = DriverManager.getConnection("jdbc:mysql://10.10.10.10:8888/tempDB", userName, passwd)
        ps = conn.prepareStatement(sql)
        ps.setString(1, key)   // key is a String, so setString rather than setDate
        ps.setInt(2, value)    // value is an Int
        ps.executeUpdate()
    } catch {
        case e: Exception => println("----> Exception!\t:\t" + e + "<-----")
    } finally {
        if (ps != null) ps.close()
        if (conn != null) conn.close()
    }
}
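The "JDBC batch insert performance" thread cited above is about exactly this kind of row-by-row write. Below is a hedged variant of the same method that batches many rows per round trip with addBatch/executeBatch; the rows parameter and the rewriteBatchedStatements URL flag are illustrative additions, not from the original post:

import java.sql.{Connection, DriverManager, PreparedStatement}

def batchInsert2Mysql(rows: Seq[(String, Int)]): Unit = {
    var conn: Connection = null
    var ps: PreparedStatement = null
    // rewriteBatchedStatements lets Connector/J rewrite the batch into multi-row inserts.
    val url = "jdbc:mysql://10.10.10.10:8888/tempDB?rewriteBatchedStatements=true"
    val sql = "INSERT INTO tempTableName(key, value) VALUES (?, ?)"
    try {
        Class.forName("com.mysql.jdbc.Driver").newInstance
        conn = DriverManager.getConnection(url, "admin", "admin")
        conn.setAutoCommit(false)            // commit once per batch, not once per row
        ps = conn.prepareStatement(sql)
        rows.foreach { case (k, v) =>
            ps.setString(1, k)
            ps.setInt(2, v)
            ps.addBatch()
        }
        ps.executeBatch()
        conn.commit()
    } catch {
        case e: Exception => println("----> Exception!\t:\t" + e + "<-----")
    } finally {
        if (ps != null) ps.close()
        if (conn != null) conn.close()
    }
}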

One problem you may run into here: when the client is started locally for a functional check, the method works fine, but once the job is submitted to YARN via spark-submit it fails with
java.sql.SQLException: No suitable driver found for jdbc:mysql://10.10.10.10:8888/tempDB
或是
Error:java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

This happens because the corresponding class cannot be found, i.e. the line Class.forName("com.mysql.jdbc.Driver").newInstance is what fails. The purpose of that line is to register the driver; the Driver source code looks like this:

/**
 * The Java SQL framework allows for multiple database drivers. Each driver should supply a class
 * that implements the Driver interface
 *
 * The DriverManager will try to load as many drivers as it can find and then for any given
 * connection request, it will ask each driver in turn to try to connect to the target URL.
 *
 * It is strongly recommended that each Driver class should be small and standalone so that the
 * Driver class can be loaded and queried without bringing in vast quantities of supporting code.
 *
 * When a Driver class is loaded, it should create an instance of itself and register it with the
 * DriverManager. This means that a user can load and register a driver by doing
 * Class.forName("foo.bah.Driver")
 */
public class Driver extends NonRegisteringDriver implements java.sql.Driver {
    // Register ourselves with the DriverManager
    static {
        try {
            java.sql.DriverManager.registerDriver(new Driver());
        } catch (SQLException E) {
            throw new RuntimeException("Can't register driver!");
        }
    }

    /**
     * Construct a new driver and register it with DriverManager
     *
     * @throws SQLException
     *             if a database error occurs.
     */
    public Driver() throws SQLException {
        // Required for Class.forName().newInstance()
    }
}
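In other words, simply loading the class is enough: the static block registers the driver with DriverManager. A minimal sketch of the two equivalent registration paths, assuming the connector JAR is actually on the driver's classpath (which is exactly what goes wrong below):

// Either line, placed in the job's setup code, registers the MySQL driver.
Class.forName("com.mysql.jdbc.Driver").newInstance                    // triggers the static block shown above
java.sql.DriverManager.registerDriver(new com.mysql.jdbc.Driver())    // explicit registration, no reflection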

There are several possible causes:
1. Referencing the MySQL JAR only through the --jars parameter does not help here. The driver's classpath has to be set by adding the --driver-class-path parameter:

$ bin/spark-submit --master local[2] --driver-class-path lib/mysql-connector-java-5.1.38.jar --class <your.main.Class> temp.jar

The reason is that the two options distribute the JAR to different nodes (link):
--driver-class-path: packages the driver depends on; separate multiple packages with a colon (:)
--jars: packages needed by both the driver and the executors; separate multiple packages with a comma (,)
A submit command that passes the connector through both options is sketched below.
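For the YARN case described above, a hedged example that ships the connector both ways; the main class and application JAR names are placeholders:

$ bin/spark-submit --master yarn \
      --driver-class-path lib/mysql-connector-java-5.1.38.jar \
      --jars lib/mysql-connector-java-5.1.38.jar \
      --class <your.main.Class> temp.jar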
2. The corresponding JAR was not bundled when the application was packaged with its dependencies.
When the code relies on the dependency, the connector JAR has to be packed into the final application JAR; only then can the class name be resolved and the driver registered successfully.
For example, after adding the dependency in build.sbt

// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38"

the corresponding JAR also has to be added under the project's Artifacts settings (in IntelliJ IDEA).
[Screenshot: Artifacts configuration showing Extracted 'mysql-connector-java-5.1.38.jar']
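Not part of the original post: if the JAR is built from sbt on the command line rather than through IDEA's Artifacts, the sbt-assembly plugin is a common way to get the same self-contained JAR. The plugin version shown is an assumption; use the release that matches your sbt:

// project/plugins.sbt -- assumed sbt-assembly version, adjust to your sbt release
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

Running sbt assembly then produces a fat JAR that already contains mysql-connector-java, so com.mysql.jdbc.Driver can be found and registered at runtime.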
