JStorm source code analysis: picking up assignments


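Picking up assignments

Each JStorm worker machine periodically scans the assignment directory on ZooKeeper to see whether it has been given any work. If it has, it writes the corresponding information to a designated directory on the local machine. Most of this is done by the run method of the SyncSupervisorEvent thread, which is the function analysed here. Before that, a look at the members of the class makes the later analysis easier to follow: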

    // Unique id of the supervisor. A machine runs only one supervisor,
    // so this id is also used to identify the machine.
    private String supervisorId;

    private EventManager processEventManager;

    private EventManager syncSupEventManager;

    // State of the Storm cluster (interface for operating on the data in ZK)
    private StormClusterState stormClusterState;

    // Interface to the local state
    private LocalState localState;

The run method:

    @Override
    public void run() {
        LOG.debug("Synchronizing supervisor, interval seconds:" + TimeUtils.time_delta(lastTime));
        lastTime = TimeUtils.current_time_secs();

        try {
            RunnableCallback syncCallback = new EventManagerZkPusher(this, syncSupEventManager);

            /**
             * Step 1: get all assignments and register /ZK-dir/assignment and every assignment watch
             *
             */
            // Read every assignment in the cluster from the ZK directory: topology_id --> assignment
            Map<String, Assignment> assignments = Cluster.get_all_assignment(stormClusterState, syncCallback);
            LOG.debug("Get all assignments " + assignments);

            /**
             * Step 2: get topologyIds list from STORM-LOCAL-DIR/supervisor/stormdist/
             */
            // Read the local directory to find every topology already on this machine
            List<String> downloadedTopologyIds = StormConfig.get_supervisor_toplogy_list(conf);
            LOG.debug("Downloaded storm ids: " + downloadedTopologyIds);

            /**
             * Step 3: get <port,LocalAssignments> from ZK local node's assignment
             */
            // From the ZK data, collect every worker assigned to this machine (walk all workers
            // of all topologies and check whether the worker's nodeId equals supervisorId)
            Map<Integer, LocalAssignment> zkAssignment = getLocalAssign(stormClusterState, supervisorId, assignments);
            Map<Integer, LocalAssignment> localAssignment;
            Set<String> updateTopologys;

            /**
             * Step 4: writer local assignment to LocalState
             */
            try {
                LOG.debug("Writing local assignment " + zkAssignment);
                localAssignment = (Map<Integer, LocalAssignment>) localState.get(Common.LS_LOCAL_ASSIGNMENTS);
                if (localAssignment == null) {
                    localAssignment = new HashMap<Integer, LocalAssignment>();
                }
                // Update the state
                localState.put(Common.LS_LOCAL_ASSIGNMENTS, zkAssignment);

                // Compare the old and new state to find the topologies that need updating
                // (decided by the assignment timestamp)
                updateTopologys = getUpdateTopologys(localAssignment, zkAssignment, assignments);
                Set<String> reDownloadTopologys = getNeedReDownloadTopologys(localAssignment);
                // Topologies that need re-downloading are added to the update set as well
                if (reDownloadTopologys != null) {
                    updateTopologys.addAll(reDownloadTopologys);
                }
            } catch (IOException e) {
                LOG.error("put LS_LOCAL_ASSIGNMENTS " + zkAssignment + " of localState failed");
                throw e;
            }

            /**
             * Step 5: download code from ZK
             */
            Map<String, String> topologyCodes = getTopologyCodeLocations(assignments, supervisorId);

            //  downloadFailedTopologyIds which can't finished download binary from nimbus
            Set<String> downloadFailedTopologyIds = new HashSet<String>();

            downloadTopology(topologyCodes, downloadedTopologyIds, updateTopologys, assignments, downloadFailedTopologyIds);

            /**
             * Step 6: remove any downloaded useless topology
             */
            // Delete useless topologies (still present in the local directory,
            // but no longer present in the code locations)
            removeUselessTopology(topologyCodes, downloadedTopologyIds);

            /**
             * Step 7: push syncProcesses Event
             */
            // processEventManager.add(syncProcesses);
            syncProcesses.run(zkAssignment, downloadFailedTopologyIds);

            // If everything is OK, set the trigger to update heartbeat of
            // supervisor
            heartbeat.updateHbTrigger(true);
        } catch (Exception e) {
            LOG.error("Failed to Sync Supervisor", e);
            // throw new RuntimeException(e);
        }
    }

Fetching all assignments from ZooKeeper

The code first:

    public static Map<String, Assignment> get_all_assignment(StormClusterState stormClusterState, RunnableCallback callback) throws Exception {
        Map<String, Assignment> ret = new HashMap<String, Assignment>();

        // get /assignments {topology_id}
        // List every topology under the assignments directory on ZooKeeper
        List<String> assignments = stormClusterState.assignments(callback);
        if (assignments == null) {
            LOG.debug("No assignment of ZK");
            return ret;
        }

        // For each topology, fetch the assignment details
        for (String topology_id : assignments) {
            Assignment assignment = stormClusterState.assignment_info(topology_id, callback);
            if (assignment == null) {
                LOG.error("Failed to get Assignment of " + topology_id + " from ZK");
                continue;
            }
            ret.put(topology_id, assignment);
        }

        return ret;
    }

The first step scans the assignment directory on ZK to obtain the names of all topologies. The implementation is in the following code:

    @Override
    public List<String> assignments(RunnableCallback callback) throws Exception {
        if (callback != null) {
            assignments_callback.set(callback);
        }
        return cluster_state.get_children(Cluster.ASSIGNMENTS_SUBTREE, callback != null);
    }

    @Override
    public List<String> get_children(String path, boolean watch) throws Exception {
        return zkobj.getChildren(zk, path, watch);
    }

    public List<String> getChildren(CuratorFramework zk, String path, boolean watch) throws Exception {
        String npath = PathUtils.normalize_path(path);
        if (watch) {
            return zk.getChildren().watched().forPath(npath);
        } else {
            return zk.getChildren().forPath(npath);
        }
    }

The second step fetches the details of each topology by its name:

```
@Override
public Assignment assignment_info(String topologyId, RunnableCallback callback) throws Exception {
    if (callback != null) {
        assignment_info_callback.put(topologyId, callback);
    }

    String assgnmentPath = Cluster.assignment_path(topologyId);
    return (Assignment) getObject(assgnmentPath, callback != null);
}
```

The topology name is turned into the ZK path of its assignment, the data at that path is read, and the result is deserialized into an Assignment object.

Note also the callback argument passed in here, syncCallback: it is invoked when the assignment directory on ZK changes. The details still need a closer look (TODO).
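One detail that helps with the TODO: ZooKeeper's standard watches (the kind set through getChildren) are one-time triggers, so a watcher fires once and has to be registered again on the next read. That fits the way run() passes syncCallback on every pass. The toy model below only illustrates this one-shot behaviour in memory; ToyWatchedNode and its methods are hypothetical names and are not part of ZooKeeper, Curator or JStorm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory stand-in for a ZK node with a one-shot children watch (illustration only)
final class ToyWatchedNode {
    private List<String> children = new ArrayList<>();
    private Runnable watcher; // at most one pending watcher in this toy model

    // Read the children; a non-null callback is left behind as a one-shot watcher
    List<String> getChildren(Runnable callback) {
        if (callback != null) {
            watcher = callback;
        }
        return new ArrayList<>(children);
    }

    // Change the children; the pending watcher fires once and is then forgotten
    void setChildren(List<String> newChildren) {
        children = new ArrayList<>(newChildren);
        Runnable pending = watcher;
        watcher = null;
        if (pending != null) {
            pending.run();
        }
    }

    public static void main(String[] args) {
        ToyWatchedNode assignments = new ToyWatchedNode();
        assignments.getChildren(() -> System.out.println("assignments changed, schedule a sync"));
        assignments.setChildren(Arrays.asList("topo-1"));           // watcher fires
        assignments.setChildren(Arrays.asList("topo-1", "topo-2")); // no watcher registered any more
    }
}
```

In this model the second change goes unnoticed unless the reader calls getChildren with a callback again, which is what a periodic sync that always passes the callback achieves.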

Reading the local topology information

The files on the local machine give the list of all topologies present locally:

    @SuppressWarnings("rawtypes")
    public static List<String> get_supervisor_toplogy_list(Map conf) throws IOException {

        // get the path: STORM-LOCAL-DIR/supervisor/stormdist/
        String path = StormConfig.supervisor_stormdist_root(conf);

        List<String> topologyids = PathUtils.read_dir_contents(path);

        return topologyids;
    }

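The local topology directory is STORM_LOCAL_DIR + FILE_SEPERATEOR + "supervisor" + FILE_SEPERATEOR + "stormdist", where STORM_LOCAL_DIR is the value configured under Config.STORM_LOCAL_DIR. This matches the STORM-LOCAL-DIR/supervisor/stormdist/ comment in the code above.

The names of all entries directly under this directory are then read (the code does not distinguish subdirectories from plain files):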

    public static List<String> read_dir_contents(String dir) {
        ArrayList<String> rtn = new ArrayList<String>();
        if (exists_file(dir)) {
            File[] list = (new File(dir)).listFiles();
            for (File f : list) {
                rtn.add(f.getName());
            }
        }
        return rtn;
    }
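A side note on read_dir_contents: File.listFiles() returns null when the path is not a directory or an I/O error occurs, so the loop above depends on exists_file implying a listable directory. The following is a minimal defensive sketch of the same idea, not JStorm code; the DirContents class and the readDirContents name are made up for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class DirContents {
    // Names of the entries directly under dir; empty if dir cannot be listed
    static List<String> readDirContents(String dir) {
        List<String> names = new ArrayList<>();
        File[] entries = new File(dir).listFiles(); // null if dir is missing, not a directory, or unreadable
        if (entries == null) {
            return names;
        }
        for (File entry : entries) {
            names.add(entry.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Lists the current working directory as a quick manual check
        System.out.println(readDirContents("."));
    }
}
```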

Collecting all workers assigned to this machine

    private Map<Integer, LocalAssignment> getLocalAssign(StormClusterState stormClusterState, String supervisorId, Map<String, Assignment> assignments)
            throws Exception {
        Map<Integer, LocalAssignment> portLA = new HashMap<Integer, LocalAssignment>();

        // Walk through all topologies
        for (Entry<String, Assignment> assignEntry : assignments.entrySet()) {
            String topologyId = assignEntry.getKey();
            Assignment assignment = assignEntry.getValue();

            // Walk through all workers of one topology and keep those on this machine
            // (worker nodeId == supervisorId)
            Map<Integer, LocalAssignment> portTasks = readMyTasks(stormClusterState, topologyId, supervisorId, assignment);
            if (portTasks == null) {
                continue;
            }

            // a port must be assigned one storm
            for (Entry<Integer, LocalAssignment> entry : portTasks.entrySet()) {
                Integer port = entry.getKey();
                LocalAssignment la = entry.getValue();

                if (!portLA.containsKey(port)) {
                    portLA.put(port, la);
                } else {
                    throw new RuntimeException("Should not have multiple topologys assigned to one port");
                }
            }
        }

        return portLA;
    }

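This walks through all the assignments fetched from ZK in step 1 (those of the whole cluster) and checks, for every worker of every topology, whether it belongs to this machine by comparing the worker's nodeId with the supervisor id. The result is the set of all workers assigned to this machine, keyed by port.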
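The filtering and the one-topology-per-port check can be sketched in isolation. The Worker class below is a simplified, hypothetical stand-in for the real worker slot type, and the map value is just the topology id rather than a LocalAssignment, so this is an illustration of the logic rather than the JStorm implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocalAssignSketch {
    // Hypothetical, simplified view of one worker slot in an assignment
    static final class Worker {
        final String topologyId;
        final String nodeId;
        final int port;

        Worker(String topologyId, String nodeId, int port) {
            this.topologyId = topologyId;
            this.nodeId = nodeId;
            this.port = port;
        }
    }

    // Returns port -> topologyId for the workers whose nodeId equals supervisorId
    static Map<Integer, String> localPorts(List<Worker> allWorkers, String supervisorId) {
        Map<Integer, String> portToTopology = new HashMap<>();
        for (Worker w : allWorkers) {
            if (!supervisorId.equals(w.nodeId)) {
                continue; // worker lives on another machine
            }
            String previous = portToTopology.putIfAbsent(w.port, w.topologyId);
            if (previous != null) {
                // Simplification: any second claim on a port is rejected
                throw new IllegalStateException("port " + w.port + " claimed by " + previous + " and " + w.topologyId);
            }
        }
        return portToTopology;
    }

    public static void main(String[] args) {
        List<Worker> cluster = Arrays.asList(
                new Worker("topo-a", "sup-1", 6800),
                new Worker("topo-a", "sup-2", 6800),
                new Worker("topo-b", "sup-1", 6801));
        System.out.println(localPorts(cluster, "sup-1"));
    }
}
```

Failing fast on a duplicate port mirrors the RuntimeException in getLocalAssign: two topologies on one port means the assignment data itself is inconsistent.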

Updating the local worker information

    try {
        LOG.debug("Writing local assignment " + zkAssignment);
        localAssignment = (Map<Integer, LocalAssignment>) localState.get(Common.LS_LOCAL_ASSIGNMENTS);
        if (localAssignment == null) {
            localAssignment = new HashMap<Integer, LocalAssignment>();
        }
        // Update the state
        localState.put(Common.LS_LOCAL_ASSIGNMENTS, zkAssignment);

        // Compare the old and new state to find the topologies that need updating
        // (decided by the assignment timestamp)
        updateTopologys = getUpdateTopologys(localAssignment, zkAssignment, assignments);
        Set<String> reDownloadTopologys = getNeedReDownloadTopologys(localAssignment);
        // Topologies that need re-downloading are added to the update set as well
        if (reDownloadTopologys != null) {
            updateTopologys.addAll(reDownloadTopologys);
        }
    } catch (IOException e) {
        LOG.error("put LS_LOCAL_ASSIGNMENTS " + zkAssignment + " of localState failed");
        throw e;
    }

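This block does three things:
1. It updates the local worker information.
2. It compares the old and new state to find the topologies that need updating.
3. It works out which topologies need to be downloaded again.

The topologies found in items 2 and 3 all need their local copy of the topology refreshed.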

How is a topology judged to have been updated?

    private Set<String> getUpdateTopologys(Map<Integer, LocalAssignment> localAssignments, Map<Integer, LocalAssignment> zkAssignments,
            Map<String, Assignment> assignments) {
        Set<String> ret = new HashSet<String>();
        if (localAssignments != null && zkAssignments != null) {
            for (Entry<Integer, LocalAssignment> entry : localAssignments.entrySet()) {
                Integer port = entry.getKey();
                LocalAssignment localAssignment = entry.getValue();

                LocalAssignment zkAssignment = zkAssignments.get(port);

                if (localAssignment == null || zkAssignment == null)
                    continue;

                Assignment assignment = assignments.get(localAssignment.getTopologyId());
                if (localAssignment.getTopologyId().equals(zkAssignment.getTopologyId()) && assignment != null
                        && assignment.isTopologyChange(localAssignment.getTimeStamp()))
                    if (ret.add(localAssignment.getTopologyId())) {
                        LOG.info("Topology-" + localAssignment.getTopologyId() + " has been updated. LocalTs=" + localAssignment.getTimeStamp() + ", ZkTs="
                                + zkAssignment.getTimeStamp());
                    }
            }
        }
        return ret;
    }

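Judging from the code, the same port must carry the same topology both locally and on ZK, and isTopologyChange must return true. That requires the assignment to be of the update type or the scale type, and the local timestamp to be earlier than the timestamp of the assignment on ZK.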
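The body of isTopologyChange is not quoted in this article, so the sketch below only encodes the rule as described in the text: an update or scale assignment whose ZK timestamp is newer than the local one. The enum and its constant names are assumptions made for the example and may not match the real JStorm identifiers.

```java
final class TopologyChangeSketch {
    // Hypothetical assignment types; only the last two count as a change
    enum AssignmentType { ASSIGN, UPDATE_TOPOLOGY, SCALE_TOPOLOGY }

    // True when the assignment on ZK is an update/scale that is newer than the local copy
    static boolean isTopologyChange(AssignmentType type, long zkTimeStamp, long localTimeStamp) {
        boolean updateOrScale = type == AssignmentType.UPDATE_TOPOLOGY || type == AssignmentType.SCALE_TOPOLOGY;
        return updateOrScale && zkTimeStamp > localTimeStamp;
    }

    public static void main(String[] args) {
        System.out.println(isTopologyChange(AssignmentType.UPDATE_TOPOLOGY, 200L, 100L));
        System.out.println(isTopologyChange(AssignmentType.ASSIGN, 200L, 100L));
    }
}
```

Under this rule a plain reassignment with a newer timestamp does not trigger a download by itself; only an update or a scale operation does.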

Likewise, how are the topologies that need to be downloaded again determined?

    private Set<String> getNeedReDownloadTopologys(Map<Integer, LocalAssignment> localAssignment) {
        Set<String> reDownloadTopologys = syncProcesses.getTopologyIdNeedDownload().getAndSet(null);
        if (reDownloadTopologys == null || reDownloadTopologys.size() == 0)
            return null;
        Set<String> needRemoveTopologys = new HashSet<String>();
        Map<Integer, String> portToStartWorkerId = syncProcesses.getPortToWorkerId();
        for (Entry<Integer, LocalAssignment> entry : localAssignment.entrySet()) {
            if (portToStartWorkerId.containsKey(entry.getKey()))
                needRemoveTopologys.add(entry.getValue().getTopologyId());
        }
        LOG.debug("worker is starting on these topology, so delay download topology binary: " + needRemoveTopologys);
        reDownloadTopologys.removeAll(needRemoveTopologys);
        if (reDownloadTopologys.size() > 0)
            LOG.info("Following topologys is going to re-download the jars, " + reDownloadTopologys);
        return reDownloadTopologys;
    }

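The method takes all topologies that syncProcesses has flagged for download and excludes those that currently have a worker starting on this machine. Whatever remains has to be downloaded again.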
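The getAndSet(null) call deserves a remark: it takes the pending set and clears the reference in a single atomic step. The sketch below reproduces the drain-then-filter pattern with plain collections. The names are hypothetical, the value type is reduced to a topology id, and unlike the original it returns an empty set instead of null.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

final class ReDownloadSketch {
    // pending: topology ids flagged for re-download by another component
    // portToTopology: local assignment, reduced to port -> topologyId
    // startingPorts: ports on which a worker is currently starting
    static Set<String> drain(AtomicReference<Set<String>> pending,
                             Map<Integer, String> portToTopology,
                             Set<Integer> startingPorts) {
        Set<String> requested = pending.getAndSet(null); // take the set and clear the reference atomically
        if (requested == null || requested.isEmpty()) {
            return new HashSet<>();
        }
        Set<String> result = new HashSet<>(requested);
        for (Map.Entry<Integer, String> e : portToTopology.entrySet()) {
            if (startingPorts.contains(e.getKey())) {
                result.remove(e.getValue()); // a worker is starting, so skip this topology for now
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicReference<Set<String>> pending =
                new AtomicReference<>(new HashSet<>(Arrays.asList("topo-a", "topo-b")));
        Map<Integer, String> local = new HashMap<>();
        local.put(6800, "topo-a");
        local.put(6801, "topo-b");
        System.out.println(drain(pending, local, Collections.singleton(6800)));
    }
}
```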

Downloading the code

    Map<String, String> topologyCodes = getTopologyCodeLocations(assignments, supervisorId);

    //  downloadFailedTopologyIds which can't finished download binary from nimbus
    Set<String> downloadFailedTopologyIds = new HashSet<String>();

    downloadTopology(topologyCodes, downloadedTopologyIds, updateTopologys, assignments, downloadFailedTopologyIds);

The first step finds the topologies that have at least one worker assigned to the current machine:

    public static Map<String, String> getTopologyCodeLocations(Map<String, Assignment> assignments, String supervisorId) throws Exception {

        Map<String, String> rtn = new HashMap<String, String>();
        for (Entry<String, Assignment> entry : assignments.entrySet()) {
            String topologyid = entry.getKey();
            Assignment assignmenInfo = entry.getValue();

            Set<ResourceWorkerSlot> workers = assignmenInfo.getWorkers();
            for (ResourceWorkerSlot worker : workers) {
                String node = worker.getNodeId();
                if (supervisorId.equals(node)) {
                    rtn.put(topologyid, assignmenInfo.getMasterCodeDir());
                    break;
                }
            }
        }
        return rtn;
    }

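The procedure is similar to before: for every topology, check whether any of its workers is on the current machine, and if so put the topology and its master code directory into the result.

The second part is the download itself: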

    public void downloadTopology(Map<String, String> topologyCodes, List<String> downloadedTopologyIds, Set<String> updateTopologys,
                                 Map<String, Assignment> assignments, Set<String> downloadFailedTopologyIds) throws Exception {
        Set<String> downloadTopologys = new HashSet<String>();

        // Process every topology
        for (Entry<String, String> entry : topologyCodes.entrySet()) {
            String topologyId = entry.getKey();
            String masterCodeDir = entry.getValue();

            // Not downloaded yet, or needs updating
            if (!downloadedTopologyIds.contains(topologyId) || updateTopologys.contains(topologyId)) {
                LOG.info("Downloading code for storm id " + topologyId + " from " + masterCodeDir);

                int retry = 0;
                while (retry < 3) {
                    try {
                        downloadStormCode(conf, topologyId, masterCodeDir);
                        // Update assignment timeStamp
                        StormConfig.write_supervisor_topology_timestamp(conf, topologyId, assignments.get(topologyId).getTimeStamp());
                        break;
                    } catch (IOException e) {
                        LOG.error(e + " downloadStormCode failed " + "topologyId:" + topologyId + "masterCodeDir:" + masterCodeDir);
                    } catch (TException e) {
                        LOG.error(e + " downloadStormCode failed " + "topologyId:" + topologyId + "masterCodeDir:" + masterCodeDir);
                    }
                    retry++;
                }
                if (retry < 3) {
                    LOG.info("Finished downloading code for storm id " + topologyId + " from " + masterCodeDir);
                    downloadTopologys.add(topologyId);
                } else {
                    LOG.error("Cann't  download code for storm id " + topologyId + " from " + masterCodeDir);
                    downloadFailedTopologyIds.add(topologyId);
                }
            }
        }

        // clear directory of topologyId is dangerous , so it only clear the topologyId which
        // isn't contained by downloadedTopologyIds
        for (String topologyId : downloadFailedTopologyIds) {
            if (!downloadedTopologyIds.contains(topologyId)) {
                try {
                    String stormroot = StormConfig.supervisor_stormdist_root(conf, topologyId);
                    File destDir = new File(stormroot);
                    FileUtils.deleteQuietly(destDir);
                } catch (Exception e) {
                    LOG.error("Cann't  clear directory about storm id " + topologyId + " on supervisor ");
                }
            }
        }

        updateTaskCleanupTimeout(downloadTopologys);
    }

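Judging from the code, two kinds of topologies are downloaded: those that have not been downloaded yet, and those that need updating (computed above). The download itself fetches the code from masterCodeDir and writes it to local files, then records the assignment timestamp in a local file. The Step 5 comment in run() says the code comes from ZK, but the comment next to downloadFailedTopologyIds and the TException handler indicate that the binary is fetched from Nimbus, with ZK only supplying the location. Each download is tried up to three times. On success the topology is added to downloadTopologys; on failure it is recorded in downloadFailedTopologyIds.

For topologies whose download failed and that were not among the already downloaded ones, the local directory is deleted.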
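The retry loop in downloadTopology is a bounded retry without any delay between attempts. A minimal standalone sketch of that pattern follows; the Download interface and the method name are hypothetical, and the real code additionally catches TException and writes the timestamp inside the same try block.

```java
import java.io.IOException;

final class RetrySketch {
    // Hypothetical stand-in for one download attempt
    interface Download {
        void run() throws IOException;
    }

    // Returns true if one of at most maxAttempts attempts succeeded
    static boolean downloadWithRetry(Download download, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                download.run();
                return true; // success, stop retrying
            } catch (IOException e) {
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        return false; // all attempts failed; the caller records the topology as failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = downloadWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
        }, 3);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

Returning a boolean instead of checking the counter afterwards, as the original does with retry < 3, keeps the success condition in one place.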

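Finally, updateTaskCleanupTimeout refreshes the cleanup timeout of every topology that was just downloaded. If the topology configures its own timeout, that value is used; otherwise the system-wide setting applies. The result is saved to the local state.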

Removing useless topologies

    public void removeUselessTopology(Map<String, String> topologyCodes, List<String> downloadedTopologyIds) {
        for (String topologyId : downloadedTopologyIds) {

            if (!topologyCodes.containsKey(topologyId)) {

                LOG.info("Removing code for storm id " + topologyId);

                String path = null;
                try {
                    path = StormConfig.supervisor_stormdist_root(conf, topologyId);
                    PathUtils.rmr(path);
                } catch (IOException e) {
                    String errMsg = "rmr the path:" + path + "failed\n";
                    LOG.error(errMsg, e);
                }
            }
        }
    }

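If a topology exists among the locally downloaded ones but has no entry in the code locations derived from ZK, it is considered no longer valid and is removed locally by deleting its directory.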
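The selection part of removeUselessTopology is a set difference: downloaded ids minus ids that still have a code location. The sketch below shows only that selection, with hypothetical names, and leaves the deletion of directories out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UselessTopologySketch {
    // Ids that are present locally but no longer assigned to this supervisor
    static List<String> staleTopologies(List<String> downloaded, Set<String> assignedHere) {
        List<String> stale = new ArrayList<>();
        for (String id : downloaded) {
            if (!assignedHere.contains(id)) {
                stale.add(id);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<String> downloaded = Arrays.asList("topo-a", "topo-b", "topo-c");
        Set<String> assignedHere = new HashSet<>(Arrays.asList("topo-a", "topo-c"));
        System.out.println(staleTopologies(downloaded, assignedHere));
    }
}
```

Note that downloadedTopologyIds was read in step 2, before the downloads of step 5, so a topology downloaded in the current pass is never a removal candidate in the same pass.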
