spark-broadcast

@(spark)[broadcast]
Spark’s broadcast variables, used to broadcast immutable datasets to all nodes.

Broadcast

/**
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
 * communication cost.
 *
 * Broadcast variables are created from a variable `v` by calling
 * [[org.apache.spark.SparkContext#broadcast]].
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the
 * `value` method. The interpreter session below shows this:
 *
 * {{{
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
 *
 * scala> broadcastVar.value
 * res0: Array[Int] = Array(1, 2, 3)
 * }}}
 *
 * After the broadcast variable is created, it should be used instead of the value `v` in any
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped
 * to a new node later).
 *
 * @param id A unique identifier for the broadcast variable.
 * @tparam T Type of the data contained in the broadcast variable.
 */
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {

/**
 * :: DeveloperApi ::
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */
@DeveloperApi
trait BroadcastFactory {

There are currently two implementations; the latter (TorrentBroadcast) is the default.

HttpBroadcast

/**
 * A [[org.apache.spark.broadcast.Broadcast]] implementation that uses HTTP server
 * as a broadcast mechanism. The first time a HTTP broadcast variable (sent as part of a
 * task) is deserialized in the executor, the broadcasted data is fetched from the driver
 * (through a HTTP server running at the driver) and stored in the BlockManager of the
 * executor to speed up future accesses.
 */
private[spark] class HttpBroadcast[T: ClassTag](

TorrentBroadcast

/**
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
 *
 * The mechanism is as follows:
 *
 * The driver divides the serialized object into small chunks and
 * stores those chunks in the BlockManager of the driver.
 *
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
 * other executors if available. Once it gets the chunks, it puts the chunks in its own
 * BlockManager, ready for other executors to fetch from.
 *
 * This prevents the driver from being the bottleneck in sending out multiple copies of the
 * broadcast data (one per executor) as done by the [[org.apache.spark.broadcast.HttpBroadcast]].
 *
 * When initialized, TorrentBroadcast objects read SparkEnv.get.conf.
 *
 * @param obj object to broadcast
 * @param id A unique identifier for the broadcast variable.
 */
private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)
  extends Broadcast[T](id) with Logging with Serializable {

Randomly choosing which remote node to fetch from is handled by the BlockManager.
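
To make the torrent-style flow more concrete, here is a minimal sketch in plain Scala that does not use Spark at all. It only imitates the idea described in the TorrentBroadcast comment: the driver splits the serialized object into chunks, and each executor checks its own store first, then pulls missing chunks from a randomly ordered list of peers and keeps them so that later executors can fetch from it. `BlockStore`, `writeBlocks`, `readBlocks`, the `served` counter, and the chunk naming are all hypothetical names chosen for illustration; they are not Spark's classes, and the real BlockManager logic is considerably more involved.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable
import scala.util.Random

object TorrentSketch {
  // Hypothetical stand-in for a per-node BlockManager: block id -> bytes.
  final class BlockStore(val name: String) {
    private val blocks = mutable.Map.empty[String, Array[Byte]]
    var served = 0 // how many chunks this node has handed to other nodes

    def put(id: String, bytes: Array[Byte]): Unit = blocks(id) = bytes
    def getLocal(id: String): Option[Array[Byte]] = blocks.get(id)
    def serve(id: String): Option[Array[Byte]] = {
      val found = blocks.get(id)
      if (found.isDefined) served += 1
      found
    }
  }

  private def serialize(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteArray
  }

  private def deserialize[T](bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }

  // Illustrative chunk naming, assumed for this sketch only.
  private def pieceId(id: Long, i: Int): String = s"broadcast_${id}_piece$i"

  // Driver side: split the serialized object into chunks and store them locally.
  // Returns the number of chunks written.
  def writeBlocks(driver: BlockStore, id: Long, obj: AnyRef, chunkSize: Int): Int = {
    val chunks = serialize(obj).grouped(chunkSize).toVector
    chunks.zipWithIndex.foreach { case (c, i) => driver.put(pieceId(id, i), c) }
    chunks.size
  }

  // Executor side: use the local copy of each chunk if present; otherwise ask
  // peers in random order, then keep the chunk so other nodes can fetch it.
  def readBlocks[T](self: BlockStore, peers: Seq[BlockStore], id: Long,
                    numChunks: Int, rnd: Random): T = {
    val pieces = (0 until numChunks).map { i =>
      val pid = pieceId(id, i)
      self.getLocal(pid).getOrElse {
        val bytes = rnd.shuffle(peers).iterator
          .map(_.serve(pid))
          .collectFirst { case Some(b) => b }
          .getOrElse(throw new NoSuchElementException(s"no peer holds $pid"))
        self.put(pid, bytes)
        bytes
      }
    }
    deserialize[T](Array.concat(pieces: _*))
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val driver = new BlockStore("driver")
    val executors = Vector("exec-1", "exec-2", "exec-3").map(new BlockStore(_))
    val data = Array.tabulate(1000)(i => i)
    val n = writeBlocks(driver, 0L, data, chunkSize = 1024)

    executors.foreach { e =>
      val peers = driver +: executors.filterNot(_ eq e)
      val copy = readBlocks[Array[Int]](e, peers, 0L, n, rnd)
      assert(copy.toSeq == data.toSeq)
    }
    // How the load is split between nodes depends on the random peer order.
    (driver +: executors).foreach(s => println(s"${s.name} served ${s.served} chunk(s)"))
  }
}
```

Running `main` prints how many chunks each node served. The first executor can only get chunks from the driver, while later executors may get some of them from executors that already hold a copy, which is the effect the comment describes as keeping the driver from becoming the bottleneck.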
