JStorm Source Code Analysis: Picking Up Assignments

Source: Internet | Editor: 程序博客网 | Time: 2024/04/28 23:18

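Picking up assignments

Every JStorm worker machine periodically scans the assignment directory in ZooKeeper to see whether any topology has been assigned to it. If so, it writes the corresponding information to a designated directory on the local machine. This work is done mainly by the run method of the SyncSupervisorEvent thread, which is the function analyzed here. Before that, let's look at the members of this class to make the later analysis easier.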

    // Unique id of the supervisor. A machine runs only one supervisor,
    // so this id is also used to identify the machine.
    private String supervisorId;

    private EventManager processEventManager;

    private EventManager syncSupEventManager;

    // Storm cluster state (interface for operating on the data in ZK)
    private StormClusterState stormClusterState;

    // Interface to the local state
    private LocalState localState;

The run method:

    @Override
    public void run() {
        LOG.debug("Synchronizing supervisor, interval seconds:" + TimeUtils.time_delta(lastTime));
        lastTime = TimeUtils.current_time_secs();

        try {
            RunnableCallback syncCallback = new EventManagerZkPusher(this, syncSupEventManager);

            /**
             * Step 1: get all assignments and register /ZK-dir/assignment and every assignment watch
             */
            // Read the ZK directory to get every topology in the cluster: topology_id --> assignment
            Map<String, Assignment> assignments = Cluster.get_all_assignment(stormClusterState, syncCallback);
            LOG.debug("Get all assignments " + assignments);

            /**
             * Step 2: get topologyIds list from STORM-LOCAL-DIR/supervisor/stormdist/
             */
            // Read the local directory to get every topology already on this machine
            List<String> downloadedTopologyIds = StormConfig.get_supervisor_toplogy_list(conf);
            LOG.debug("Downloaded storm ids: " + downloadedTopologyIds);

            /**
             * Step 3: get <port,LocalAssignments> from ZK local node's assignment
             */
            // From the ZK data, get every worker assigned to this machine (iterate over all
            // workers of all topologies and check whether the worker's nodeId equals supervisorId)
            Map<Integer, LocalAssignment> zkAssignment = getLocalAssign(stormClusterState, supervisorId, assignments);
            Map<Integer, LocalAssignment> localAssignment;
            Set<String> updateTopologys;

            /**
             * Step 4: writer local assignment to LocalState
             */
            try {
                LOG.debug("Writing local assignment " + zkAssignment);
                localAssignment = (Map<Integer, LocalAssignment>) localState.get(Common.LS_LOCAL_ASSIGNMENTS);
                if (localAssignment == null) {
                    localAssignment = new HashMap<Integer, LocalAssignment>();
                }
                // Update the state
                localState.put(Common.LS_LOCAL_ASSIGNMENTS, zkAssignment);

                // Compare old and new state to find the topologies that need updating (judged by timestamp)
                updateTopologys = getUpdateTopologys(localAssignment, zkAssignment, assignments);
                Set<String> reDownloadTopologys = getNeedReDownloadTopologys(localAssignment);
                // Topologies that need re-downloading are added to the update set as well
                if (reDownloadTopologys != null) {
                    updateTopologys.addAll(reDownloadTopologys);
                }
            } catch (IOException e) {
                LOG.error("put LS_LOCAL_ASSIGNMENTS " + zkAssignment + " of localState failed");
                throw e;
            }

            /**
             * Step 5: download code from ZK
             */
            Map<String, String> topologyCodes = getTopologyCodeLocations(assignments, supervisorId);

            //  downloadFailedTopologyIds which can't finished download binary from nimbus
            Set<String> downloadFailedTopologyIds = new HashSet<String>();

            downloadTopology(topologyCodes, downloadedTopologyIds, updateTopologys, assignments, downloadFailedTopologyIds);

            /**
             * Step 6: remove any downloaded useless topology
             */
            // Remove useless topologies (still present in the local path, but no longer in the code locations)
            removeUselessTopology(topologyCodes, downloadedTopologyIds);

            /**
             * Step 7: push syncProcesses Event
             */
            // processEventManager.add(syncProcesses);
            syncProcesses.run(zkAssignment, downloadFailedTopologyIds);

            // If everything is OK, set the trigger to update heartbeat of
            // supervisor
            heartbeat.updateHbTrigger(true);
        } catch (Exception e) {
            LOG.error("Failed to Sync Supervisor", e);
            // throw new RuntimeException(e);
        }
    }

Getting all assignments from ZooKeeper

First, the code:

    public static Map<String, Assignment> get_all_assignment(StormClusterState stormClusterState, RunnableCallback callback) throws Exception {
        Map<String, Assignment> ret = new HashMap<String, Assignment>();

        // get /assignments {topology_id}
        // List every topology under the assignments directory in ZooKeeper
        List<String> assignments = stormClusterState.assignments(callback);
        if (assignments == null) {
            LOG.debug("No assignment of ZK");
            return ret;
        }

        // For each topology, fetch its detailed assignment
        for (String topology_id : assignments) {
            Assignment assignment = stormClusterState.assignment_info(topology_id, callback);
            if (assignment == null) {
                LOG.error("Failed to get Assignment of " + topology_id + " from ZK");
                continue;
            }
            ret.put(topology_id, assignment);
        }

        return ret;
    }

The first step scans the assignments directory in ZK to get the names of all topologies. The implementation is shown in the code below:

    @Override
    public List<String> assignments(RunnableCallback callback) throws Exception {
        if (callback != null) {
            assignments_callback.set(callback);
        }
        return cluster_state.get_children(Cluster.ASSIGNMENTS_SUBTREE, callback != null);
    }

    @Override
    public List<String> get_children(String path, boolean watch) throws Exception {
        return zkobj.getChildren(zk, path, watch);
    }

    public List<String> getChildren(CuratorFramework zk, String path, boolean watch) throws Exception {
        String npath = PathUtils.normalize_path(path);
        if (watch) {
            return zk.getChildren().watched().forPath(npath);
        } else {
            return zk.getChildren().forPath(npath);
        }
    }

The second step fetches the details of each topology by its name:

    @Override
    public Assignment assignment_info(String topologyId, RunnableCallback callback) throws Exception {
        if (callback != null) {
            assignment_info_callback.put(topologyId, callback);
        }

        String assgnmentPath = Cluster.assignment_path(topologyId);

        return (Assignment) getObject(assgnmentPath, callback != null);
    }

The topology name is turned into the path of its assignment node, the node's data is read, and the data is finally deserialized into an Assignment object.

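Note that a callback parameter, syncCallback, is also passed in here. It is invoked when the assignment directory in ZK changes; the details still need a closer look (TODO).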
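The watch side of this can be illustrated without a ZooKeeper cluster. Standard ZooKeeper watches fire once and are then discarded, so the code that handles a notification has to read the node again with the watch set in order to keep receiving changes. The sketch below models only that one-shot behavior in memory. `OneShotWatchSketch` and its methods are hypothetical names, not JStorm or Curator APIs, and the idea that syncCallback simply re-queues the sync event (suggested by the name EventManagerZkPusher) is an assumption that the code shown here does not confirm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a ZK tree with one-shot child watches.
// Single-threaded sketch; not a JStorm or Curator API.
final class OneShotWatchSketch {
    private final Map<String, List<String>> children = new HashMap<>();
    private final Map<String, Runnable> watches = new HashMap<>();

    // Return a copy of the children of path; if watch is non-null, register it.
    List<String> getChildren(String path, Runnable watch) {
        if (watch != null) {
            watches.put(path, watch);
        }
        return new ArrayList<>(children.getOrDefault(path, new ArrayList<>()));
    }

    // Add a child, then fire the registered watch once and forget it.
    void addChild(String path, String child) {
        children.computeIfAbsent(path, p -> new ArrayList<>()).add(child);
        Runnable watch = watches.remove(path);
        if (watch != null) {
            watch.run();
        }
    }

    public static void main(String[] args) {
        OneShotWatchSketch zk = new OneShotWatchSketch();
        // The "sync" re-reads the directory with the watch set again,
        // which is what keeps notifications flowing after the first change.
        Runnable sync = new Runnable() {
            @Override
            public void run() {
                System.out.println("sync sees " + zk.getChildren("/assignments", this));
            }
        };
        zk.getChildren("/assignments", sync);
        zk.addChild("/assignments", "topology-1");
        zk.addChild("/assignments", "topology-2");
    }
}
```

A watch that does not re-register itself is notified of the first change only, which is why the sync pass passes the callback on every read.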

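Getting the local topology information

The files on the local machine are read to get the list of all topologies present locally: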

    @SuppressWarnings("rawtypes")
    public static List<String> get_supervisor_toplogy_list(Map conf) throws IOException {

        // get the path: STORM-LOCAL-DIR/supervisor/stormdist/
        String path = StormConfig.supervisor_stormdist_root(conf);

        List<String> topologyids = PathUtils.read_dir_contents(path);

        return topologyids;
    }

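The local topology path starts from: Config.STORM_LOCAL_DIR + FILE_SEPERATEOR + "supervisor" (the code comment above gives the full path as STORM-LOCAL-DIR/supervisor/stormdist/). Its contents are read as follows: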

    public static List<String> read_dir_contents(String dir) {
        ArrayList<String> rtn = new ArrayList<String>();
        if (exists_file(dir)) {
            File[] list = (new File(dir)).listFiles();
            for (File f : list) {
                rtn.add(f.getName());
            }
        }
        return rtn;
    }
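One detail worth noting about directory listing in Java: `File.list()` and `File.listFiles()` return null when the path is not a directory or an I/O error occurs, so a listing helper usually guards against that. The following is a minimal standalone sketch of such a helper, not JStorm's implementation; `DirListingSketch` is a hypothetical name, and sorting the result is an extra choice made here only to get a stable order.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: list the entry names of a directory, null-safe.
final class DirListingSketch {
    static List<String> readDirContents(String dir) {
        // list() returns null if dir does not exist, is not a directory,
        // or cannot be read.
        String[] names = new File(dir).list();
        if (names == null) {
            return new ArrayList<>();
        }
        List<String> result = new ArrayList<>(Arrays.asList(names));
        Collections.sort(result); // stable order, convenient for logging
        return result;
    }

    public static void main(String[] args) {
        // Each entry name under the stormdist root would be one topology id.
        System.out.println(readDirContents(System.getProperty("java.io.tmpdir")));
    }
}
```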

Getting all workers assigned to this machine

    private Map<Integer, LocalAssignment> getLocalAssign(StormClusterState stormClusterState, String supervisorId, Map<String, Assignment> assignments)
            throws Exception {
        Map<Integer, LocalAssignment> portLA = new HashMap<Integer, LocalAssignment>();

        // Iterate over all topologies
        for (Entry<String, Assignment> assignEntry : assignments.entrySet()) {
            String topologyId = assignEntry.getKey();
            Assignment assignment = assignEntry.getValue();

            // Iterate over all workers of one topology and keep those on this machine
            // (worker nodeId == supervisorId)
            Map<Integer, LocalAssignment> portTasks = readMyTasks(stormClusterState, topologyId, supervisorId, assignment);
            if (portTasks == null) {
                continue;
            }

            // a port must be assigned one storm
            for (Entry<Integer, LocalAssignment> entry : portTasks.entrySet()) {
                Integer port = entry.getKey();
                LocalAssignment la = entry.getValue();

                if (!portLA.containsKey(port)) {
                    portLA.put(port, la);
                } else {
                    throw new RuntimeException("Should not have multiple topologys assigned to one port");
                }
            }
        }

        return portLA;
    }

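This iterates over all the assignments fetched from ZK in step 1 (those of the whole cluster) and checks, for every worker of every topology, whether it belongs to this machine (by comparing the worker's nodeId with the supervisor id). The result is the set of all workers assigned to this machine, keyed by port.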
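The filter-and-merge logic can be sketched with simplified types. In the sketch below, `Slot` is a hypothetical stand-in for a worker slot (only node id and port), and the map value is just the topology id rather than a LocalAssignment; it illustrates the port-uniqueness rule under those assumptions and is not JStorm code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of "which ports on this supervisor
// belong to which topology".
final class LocalAssignSketch {
    // Stand-in for a worker slot: the node it runs on and its port.
    static final class Slot {
        final String nodeId;
        final int port;

        Slot(String nodeId, int port) {
            this.nodeId = nodeId;
            this.port = port;
        }
    }

    // assignments: topology id -> all worker slots of that topology.
    static Map<Integer, String> localPorts(Map<String, List<Slot>> assignments, String supervisorId) {
        Map<Integer, String> portToTopology = new HashMap<>();
        for (Map.Entry<String, List<Slot>> entry : assignments.entrySet()) {
            String topologyId = entry.getKey();
            for (Slot slot : entry.getValue()) {
                if (!supervisorId.equals(slot.nodeId)) {
                    continue; // worker lives on another machine
                }
                String previous = portToTopology.putIfAbsent(slot.port, topologyId);
                if (previous != null && !previous.equals(topologyId)) {
                    // Same rule as in the original: one port, one topology.
                    throw new IllegalStateException(
                            "port " + slot.port + " assigned to both " + previous + " and " + topologyId);
                }
            }
        }
        return portToTopology;
    }
}
```

Failing fast on a port conflict mirrors the RuntimeException in getLocalAssign: two topologies sharing a port would mean the assignment data itself is inconsistent.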

Updating the local worker information

    try {
        LOG.debug("Writing local assignment " + zkAssignment);
        localAssignment = (Map<Integer, LocalAssignment>) localState.get(Common.LS_LOCAL_ASSIGNMENTS);
        if (localAssignment == null) {
            localAssignment = new HashMap<Integer, LocalAssignment>();
        }
        // Update the state
        localState.put(Common.LS_LOCAL_ASSIGNMENTS, zkAssignment);

        // Compare old and new state to find the topologies that need updating (judged by timestamp)
        updateTopologys = getUpdateTopologys(localAssignment, zkAssignment, assignments);
        Set<String> reDownloadTopologys = getNeedReDownloadTopologys(localAssignment);
        // Topologies that need re-downloading are added to the update set as well
        if (reDownloadTopologys != null) {
            updateTopologys.addAll(reDownloadTopologys);
        }
    } catch (IOException e) {
        LOG.error("put LS_LOCAL_ASSIGNMENTS " + zkAssignment + " of localState failed");
        throw e;
    }

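This block does three things:
1. It updates the local worker information.
2. By comparison, it finds the topologies that need to be updated.
3. By comparison, it finds the topologies that need to be re-downloaded.
Both 2 and 3 result in the corresponding topology being updated.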

How is a topology judged to have been updated?

    private Set<String> getUpdateTopologys(Map<Integer, LocalAssignment> localAssignments, Map<Integer, LocalAssignment> zkAssignments,
            Map<String, Assignment> assignments) {
        Set<String> ret = new HashSet<String>();
        if (localAssignments != null && zkAssignments != null) {
            for (Entry<Integer, LocalAssignment> entry : localAssignments.entrySet()) {
                Integer port = entry.getKey();
                LocalAssignment localAssignment = entry.getValue();

                LocalAssignment zkAssignment = zkAssignments.get(port);

                if (localAssignment == null || zkAssignment == null)
                    continue;

                Assignment assignment = assignments.get(localAssignment.getTopologyId());
                if (localAssignment.getTopologyId().equals(zkAssignment.getTopologyId()) && assignment != null
                        && assignment.isTopologyChange(localAssignment.getTimeStamp()))
                    if (ret.add(localAssignment.getTopologyId())) {
                        LOG.info("Topology-" + localAssignment.getTopologyId() + " has been updated. LocalTs=" + localAssignment.getTimeStamp() + ", ZkTs="
                                + zkAssignment.getTimeStamp());
                    }
            }
        }
        return ret;
    }

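Judging from the code, a port is considered only when its local assignment and its ZK assignment belong to the same topology. That topology counts as updated when isTopologyChange returns true for the local timestamp, that is, when the assignment is of the update type or the scale-up type and the local update time is earlier than the update time on ZK.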
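The body of isTopologyChange is not shown in this article, so the following is only a sketch of the rule as described in the text: the assignment must be an update or a scale-up, and the timestamp stored locally must be older than the one on ZK. The enum `Kind` and the method signature are hypothetical and chosen for illustration.

```java
// Hypothetical restatement of the "has the topology changed" rule.
final class TopologyChangeSketch {
    // Assumed assignment kinds: a fresh assignment, an update, a scale-up.
    enum Kind { ASSIGN, UPDATE, SCALE }

    // zkTimestamp: timestamp of the assignment stored in ZK.
    // localTimestamp: timestamp recorded locally at the last download.
    static boolean isTopologyChange(Kind kind, long zkTimestamp, long localTimestamp) {
        boolean updateOrScale = kind == Kind.UPDATE || kind == Kind.SCALE;
        return updateOrScale && zkTimestamp > localTimestamp;
    }

    public static void main(String[] args) {
        System.out.println(isTopologyChange(Kind.UPDATE, 200L, 100L));
        System.out.println(isTopologyChange(Kind.ASSIGN, 200L, 100L));
    }
}
```

Under this rule a plain reassignment with a newer timestamp does not trigger a code re-download; only an update or scale-up does.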

Similarly, how are the topologies that need to be re-downloaded obtained?

    private Set<String> getNeedReDownloadTopologys(Map<Integer, LocalAssignment> localAssignment) {
        Set<String> reDownloadTopologys = syncProcesses.getTopologyIdNeedDownload().getAndSet(null);
        if (reDownloadTopologys == null || reDownloadTopologys.size() == 0)
            return null;
        Set<String> needRemoveTopologys = new HashSet<String>();
        Map<Integer, String> portToStartWorkerId = syncProcesses.getPortToWorkerId();
        for (Entry<Integer, LocalAssignment> entry : localAssignment.entrySet()) {
            if (portToStartWorkerId.containsKey(entry.getKey()))
                needRemoveTopologys.add(entry.getValue().getTopologyId());
        }
        LOG.debug("worker is starting on these topology, so delay download topology binary: " + needRemoveTopologys);
        reDownloadTopologys.removeAll(needRemoveTopologys);
        if (reDownloadTopologys.size() > 0)
            LOG.info("Following topologys is going to re-download the jars, " + reDownloadTopologys);
        return reDownloadTopologys;
    }

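From all the topologies flagged for download, those that already have a worker starting on this machine are excluded (their download is delayed); the remaining ones still need to be re-downloaded.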
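Stripped of the JStorm types, this is a set difference driven by ports. A minimal sketch follows, with hypothetical names (`ReDownloadSketch`, `stillNeeded`) and topology ids standing in for LocalAssignment values; unlike the original, it copies the input set instead of modifying it in place.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: drop topologies that have a worker starting locally.
final class ReDownloadSketch {
    // requested: topology ids flagged for re-download.
    // localPortToTopology: port -> topology id from the old local assignment.
    // startingPorts: ports on which a worker is currently being started.
    static Set<String> stillNeeded(Set<String> requested,
                                   Map<Integer, String> localPortToTopology,
                                   Set<Integer> startingPorts) {
        Set<String> result = new HashSet<>(requested);
        for (Map.Entry<Integer, String> entry : localPortToTopology.entrySet()) {
            if (startingPorts.contains(entry.getKey())) {
                // Replacing jars under a starting worker is risky, so wait.
                result.remove(entry.getValue());
            }
        }
        return result;
    }
}
```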

Downloading the code

    Map<String, String> topologyCodes = getTopologyCodeLocations(assignments, supervisorId);

    //  downloadFailedTopologyIds which can't finished download binary from nimbus
    Set<String> downloadFailedTopologyIds = new HashSet<String>();

    downloadTopology(topologyCodes, downloadedTopologyIds, updateTopologys, assignments, downloadFailedTopologyIds);

The first step is to find the topologies that have at least one worker assigned to the current machine:

    public static Map<String, String> getTopologyCodeLocations(Map<String, Assignment> assignments, String supervisorId) throws Exception {

        Map<String, String> rtn = new HashMap<String, String>();
        for (Entry<String, Assignment> entry : assignments.entrySet()) {
            String topologyid = entry.getKey();
            Assignment assignmenInfo = entry.getValue();

            Set<ResourceWorkerSlot> workers = assignmenInfo.getWorkers();
            for (ResourceWorkerSlot worker : workers) {
                String node = worker.getNodeId();
                if (supervisorId.equals(node)) {
                    rtn.put(topologyid, assignmenInfo.getMasterCodeDir());
                    break;
                }
            }
        }
        return rtn;
    }

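The process is similar to before: for every topology, check whether it has a worker on the current machine, and if so put the topology and its master code directory into the result.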

The second part is the download itself:

    public void downloadTopology(Map<String, String> topologyCodes, List<String> downloadedTopologyIds, Set<String> updateTopologys,
                                 Map<String, Assignment> assignments, Set<String> downloadFailedTopologyIds) throws Exception {
        Set<String> downloadTopologys = new HashSet<String>();

        // Process every topology
        for (Entry<String, String> entry : topologyCodes.entrySet()) {
            String topologyId = entry.getKey();
            String masterCodeDir = entry.getValue();

            // Not downloaded yet, or needs updating
            if (!downloadedTopologyIds.contains(topologyId) || updateTopologys.contains(topologyId)) {
                LOG.info("Downloading code for storm id " + topologyId + " from " + masterCodeDir);

                int retry = 0;
                while (retry < 3) {
                    try {
                        downloadStormCode(conf, topologyId, masterCodeDir);
                        // Update assignment timeStamp
                        StormConfig.write_supervisor_topology_timestamp(conf, topologyId, assignments.get(topologyId).getTimeStamp());
                        break;
                    } catch (IOException e) {
                        LOG.error(e + " downloadStormCode failed " + "topologyId:" + topologyId + "masterCodeDir:" + masterCodeDir);
                    } catch (TException e) {
                        LOG.error(e + " downloadStormCode failed " + "topologyId:" + topologyId + "masterCodeDir:" + masterCodeDir);
                    }
                    retry++;
                }
                if (retry < 3) {
                    LOG.info("Finished downloading code for storm id " + topologyId + " from " + masterCodeDir);
                    downloadTopologys.add(topologyId);
                } else {
                    LOG.error("Cann't  download code for storm id " + topologyId + " from " + masterCodeDir);
                    downloadFailedTopologyIds.add(topologyId);
                }
            }
        }

        // clear directory of topologyId is dangerous , so it only clear the topologyId which
        // isn't contained by downloadedTopologyIds
        for (String topologyId : downloadFailedTopologyIds) {
            if (!downloadedTopologyIds.contains(topologyId)) {
                try {
                    String stormroot = StormConfig.supervisor_stormdist_root(conf, topologyId);
                    File destDir = new File(stormroot);
                    FileUtils.deleteQuietly(destDir);
                } catch (Exception e) {
                    LOG.error("Cann't  clear directory about storm id " + topologyId + " on supervisor ");
                }
            }
        }

        updateTaskCleanupTimeout(downloadTopologys);
    }

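From the code, two kinds of topologies need to be downloaded: those that have not been downloaded yet, and those that need updating (computed above). The actual download, downloadStormCode, fetches the code from masterCodeDir (the source comments name both ZK and nimbus as the origin), writes it to local files, and then records the assignment timestamp in a local file. Each download is attempted up to three times. On success the topology is added to downloadTopologys; on failure it is recorded in downloadFailedTopologyIds.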

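For topologies whose download failed and which are not in the list of already downloaded topologies, the local directory is deleted.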
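The retry-and-record pattern used by the download loop can be isolated into a small helper. The sketch below is a generic version under stated assumptions: `RetrySketch` and `runWithRetry` are hypothetical names, any exception counts as a retryable failure (the original catches only IOException and TException), and there is no delay between attempts, as in the original loop.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: run an action up to maxAttempts times.
final class RetrySketch {
    // Returns true as soon as one attempt succeeds, false if all fail.
    static boolean runWithRetry(int maxAttempts, Callable<Void> action) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.call();
                return true;
            } catch (Exception e) {
                // A real implementation would log the failure here.
                System.err.println("attempt " + attempt + " failed: " + e);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = runWithRetry(3, () -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated download failure");
            }
            return null;
        });
        System.out.println("succeeded=" + ok + " after " + calls[0] + " calls");
    }
}
```

The boolean result plays the role of the `retry < 3` check: the caller adds the topology to the success set or to the failure set depending on it.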

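Finally, the cleanup timeout of every downloaded topology is updated. The timeout is taken from the topology's own configuration first; if the topology does not configure it, the system-wide setting is used. The result is written to the local state.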
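The lookup order described here (topology configuration first, then the system-wide configuration) is a common fallback pattern. The code of updateTaskCleanupTimeout is not shown in this article, so the sketch below is only an illustration: the class name, the configuration key string, and the extra hard-coded fallback are all assumptions.

```java
import java.util.Map;

// Hypothetical sketch of "topology setting wins, else cluster setting".
final class CleanupTimeoutSketch {
    // Assumed key name, for illustration only.
    static final String KEY = "task.cleanup.timeout.sec";

    static int resolve(Map<String, Object> topologyConf, Map<String, Object> clusterConf, int fallback) {
        Object value = topologyConf.get(KEY);
        if (value == null) {
            value = clusterConf.get(KEY); // topology did not configure it
        }
        // Anything that is not a number falls through to the fallback.
        return value instanceof Number ? ((Number) value).intValue() : fallback;
    }
}
```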

Removing useless topologies

    public void removeUselessTopology(Map<String, String> topologyCodes, List<String> downloadedTopologyIds) {
        for (String topologyId : downloadedTopologyIds) {

            if (!topologyCodes.containsKey(topologyId)) {

                LOG.info("Removing code for storm id " + topologyId);

                String path = null;
                try {
                    path = StormConfig.supervisor_stormdist_root(conf, topologyId);
                    PathUtils.rmr(path);
                } catch (IOException e) {
                    String errMsg = "rmr the path:" + path + "failed\n";
                    LOG.error(errMsg, e);
                }
            }
        }
    }

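If a topology is present in the locally downloaded list but absent from the code locations derived from ZK, it is considered no longer valid and is removed locally (its directory is deleted).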
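The selection rule can be stated as a pure function that is easy to test separately from the file deletion. This is a sketch with hypothetical names (`UselessTopologySketch`, `toRemove`); it only computes which topology ids would be removed and leaves the actual deletion to a routine such as PathUtils.rmr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: which downloaded topologies are no longer assigned here.
final class UselessTopologySketch {
    // downloaded: topology ids found in the local stormdist directory.
    // assignedHere: topology ids that still have a worker on this machine.
    static List<String> toRemove(List<String> downloaded, Set<String> assignedHere) {
        List<String> result = new ArrayList<>();
        for (String topologyId : downloaded) {
            if (!assignedHere.contains(topologyId)) {
                result.add(topologyId);
            }
        }
        return result;
    }
}
```

Note that downloadedTopologyIds was read in step 2, before the download step, so a topology downloaded during the current pass is not a removal candidate in the same pass.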

1 0