How Apache Zeppelin runs a paragraph

Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 11:42

This is a repost of an article by Jongyoul Lee, a core Zeppelin committer, explaining how Zeppelin runs a paragraph. Original article: https://medium.com/apache-zeppelin-stories/how-apache-zeppelin-runs-a-paragraph-783a0a612ba9#.a85u5nlh4

Apache Zeppelin is one of the most popular open source projects. It helps users create their own notebooks easily and share reports simply. Most users appreciate Apache Zeppelin’s functionality and extensibility. Most contributors and administrators, however, have reported difficulty debugging Apache Zeppelin because of its complicated structure. This post describes how Apache Zeppelin handles a user’s request to run a paragraph, from the server module down to the interpreters.

Before diving into the details, I’ll clarify some terms that will help you understand this article. The first term is paragraph: the minimum unit of execution. The second is note, which is a set of paragraphs and also a member of a notebook. One Zeppelin instance has exactly one notebook, which contains many notes. You can see the notebook on the home page.
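The relationship between these three terms can be pictured with a small model. This is only an illustrative sketch: the class NotebookModelSketch, its fields, and the generated paragraph ids are hypothetical and are not Zeppelin's actual classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the terms above; not Zeppelin's real classes.
final class NotebookModelSketch {

    // A paragraph is the minimum unit of execution.
    static final class Paragraph {
        final String id;
        final String text;

        Paragraph(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // A note is an ordered set of paragraphs.
    static final class Note {
        final String id;
        final List<Paragraph> paragraphs = new ArrayList<>();

        Note(String id) {
            this.id = id;
        }

        Paragraph addParagraph(String text) {
            // The id scheme here is made up for the sketch.
            Paragraph p = new Paragraph(id + "-p" + (paragraphs.size() + 1), text);
            paragraphs.add(p);
            return p;
        }
    }

    // One notebook per instance; it owns all notes.
    static final class Notebook {
        final Map<String, Note> notes = new LinkedHashMap<>();

        Note createNote(String noteId) {
            // Returns the existing note if the id is already known.
            return notes.computeIfAbsent(noteId, Note::new);
        }
    }

    public static void main(String[] args) {
        Notebook notebook = new Notebook();
        Note note = notebook.createNote("note1");
        note.addParagraph("%md Hello");
        note.addParagraph("%spark sc.version");
        System.out.println(notebook.notes.size() + " note, " + note.paragraphs.size() + " paragraphs");
    }
}
```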

We also need to know what an interpreter is. An interpreter in Apache Zeppelin is the gateway that connects to a specific framework to run the actual code. For instance, SparkInterpreter is a gateway to run Apache Spark, and JDBCInterpreter handles JDBC drivers. Apache Zeppelin has 19 interpreters on the master branch.

The server module of Apache Zeppelin consists of three parts: handling REST/WebSocket requests, storing and loading data, and managing interpreters. This post focuses on the last part, managing interpreters, but it will also help you understand the whole path for running a paragraph. There are two entry points for running paragraphs.

  /**
   * Run asynchronously paragraph job REST API
   *
   * @param message - JSON with params if user wants to update dynamic form's value
   *                null, empty string, empty json if user doesn't want to update
   * @return JSON with status.OK
   * @throws IOException, IllegalArgumentException
   */
  @POST
  @Path("job/{notebookId}/{paragraphId}")
  @ZeppelinApi
  public Response runParagraph(@PathParam("notebookId") String notebookId,
      @PathParam("paragraphId") String paragraphId, String message)
      throws IOException, IllegalArgumentException {
    LOG.info("run paragraph job asynchronously {} {} {}", notebookId, paragraphId, message);

    Note note = notebook.getNote(notebookId);
    if (note == null) {
      return new JsonResponse<>(Status.NOT_FOUND, "note not found.").build();
    }

    Paragraph paragraph = note.getParagraph(paragraphId);
    if (paragraph == null) {
      return new JsonResponse<>(Status.NOT_FOUND, "paragraph not found.").build();
    }

    // handle params if presented
    handleParagraphParams(message, note, paragraph);

    note.run(paragraph.getId());
    return new JsonResponse<>(Status.OK).build();
  }

NotebookRestApi.java on github
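
From a client's point of view, the REST entry point above is a POST to job/{notebookId}/{paragraphId}. The following sketch only builds such a request with the JDK's java.net.http API (Java 11 or later) and does not send it. The helper class RunParagraphRequest, the base URL http://localhost:8080/api/notebook, and the ids NOTE_ID and PARAGRAPH_ID are assumptions for illustration; check your own deployment for the real values.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds, but does not send, a "run paragraph" request.
final class RunParagraphRequest {

    // baseUrl is an assumption, e.g. "http://localhost:8080/api/notebook".
    static HttpRequest build(String baseUrl, String noteId, String paragraphId, String jsonBody) {
        URI uri = URI.create(baseUrl + "/job/" + noteId + "/" + paragraphId);

        // Per the Javadoc above, a null or empty message means
        // "do not update the dynamic form's value".
        HttpRequest.BodyPublisher body = (jsonBody == null || jsonBody.isEmpty())
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(jsonBody);

        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(body)
                .build();
    }

    public static void main(String[] args) {
        // NOTE_ID and PARAGRAPH_ID are placeholders, not real identifiers.
        HttpRequest request = build("http://localhost:8080/api/notebook", "NOTE_ID", "PARAGRAPH_ID", "");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request you would pass it to java.net.http.HttpClient; that step is left out so the sketch runs without a Zeppelin server.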

  private void runParagraph(NotebookSocket conn, HashSet<String> userAndRoles, Notebook notebook,
      Message fromMessage) throws IOException {
    final String paragraphId = (String) fromMessage.get("id");
    if (paragraphId == null) {
      return;
    }

    String noteId = getOpenNoteId(conn);
    final Note note = notebook.getNote(noteId);
    NotebookAuthorization notebookAuthorization = notebook.getNotebookAuthorization();
    if (!notebookAuthorization.isWriter(noteId, userAndRoles)) {
      permissionError(conn, "write", fromMessage.principal,
          userAndRoles, notebookAuthorization.getWriters(noteId));
      return;
    }

    Paragraph p = note.getParagraph(paragraphId);
    String text = (String) fromMessage.get("paragraph");
    p.setText(text);
    p.setTitle((String) fromMessage.get("title"));
    if (!fromMessage.principal.equals("anonymous")) {
      AuthenticationInfo authenticationInfo = new AuthenticationInfo(fromMessage.principal,
          fromMessage.ticket);
      p.setAuthenticationInfo(authenticationInfo);
    } else {
      p.setAuthenticationInfo(new AuthenticationInfo());
    }

    Map<String, Object> params = (Map<String, Object>) fromMessage
       .get("params");
    p.settings.setParams(params);
    Map<String, Object> config = (Map<String, Object>) fromMessage
       .get("config");
    p.setConfig(config);

    // if it's the last paragraph, let's add a new one
    boolean isTheLastParagraph = note.isLastParagraph(p.getId());
    if (!(text.trim().equals(p.getMagic()) || Strings.isNullOrEmpty(text)) &&
        isTheLastParagraph) {
      note.addParagraph();
    }

    AuthenticationInfo subject = new AuthenticationInfo(fromMessage.principal);
    note.persist(subject);
    try {
      note.run(paragraphId);
    } catch (Exception ex) {
      LOG.error("Exception from run", ex);
      if (p != null) {
        p.setReturn(
            new InterpreterResult(InterpreterResult.Code.ERROR, ex.getMessage()),
            ex);
        p.setStatus(Status.ERROR);
        broadcast(note.getId(), new Message(OP.PARAGRAPH).put("paragraph", p));
      }
    }
  }

NotebookServer.java on github

Both methods above call note.run(id) at the end. That method finds the actual paragraph by its id and submits it to the scheduler of the interpreter determined from the paragraph and the note. This is the flow on the front side.

  /**
   * Run a single paragraph.
   *
   * @param paragraphId ID of paragraph
   */
  public void run(String paragraphId) {
    Paragraph p = getParagraph(paragraphId);
    p.setListener(jobListenerFactory.getParagraphJobListener(this));
    String requiredReplName = p.getRequiredReplName();
    Interpreter intp = factory.getInterpreter(getId(), requiredReplName);

    if (intp == null) {
      String intpExceptionMsg =
          p.getJobName() + "'s Interpreter " + requiredReplName + " not found";
      InterpreterException intpException = new InterpreterException(intpExceptionMsg);
      InterpreterResult intpResult =
          new InterpreterResult(InterpreterResult.Code.ERROR, intpException.getMessage());
      p.setReturn(intpResult, intpException);
      p.setStatus(Job.Status.ERROR);
      throw intpException;
    }
    if (p.getConfig().get("enabled") == null || (Boolean) p.getConfig().get("enabled")) {
      intp.getScheduler().submit(p);
    }
  }

Note.java on github

From the code above, you can infer the code-level relationship between a note and its interpreters. Every note has its own interpreter mapping, which is stored in InterpreterFactory; every interpreter has its own scheduler, which runs the paragraphs submitted to it; and the status of a paragraph is managed through jobListenerFactory. I’ll write another post about jobListenerFactory and the lifecycle of a paragraph.
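
To make the "every interpreter has its own scheduler" idea concrete, here is a minimal sketch of a scheduler that runs submitted jobs one at a time in submission order. It assumes a simple FIFO policy backed by a single worker thread; the names FifoSchedulerSketch, Job, and FifoScheduler are stand-ins written for this example and are not Zeppelin's scheduler classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not Zeppelin's scheduler implementation.
final class FifoSchedulerSketch {

    // Stand-in for a job such as a paragraph.
    interface Job {
        Object jobRun() throws Exception;
    }

    // One scheduler per interpreter: a single worker thread runs jobs in order.
    static final class FifoScheduler {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        void submit(Job job) {
            worker.submit(() -> {
                try {
                    job.jobRun();
                } catch (Exception e) {
                    // A real scheduler would record the failure on the job.
                    e.printStackTrace();
                }
            });
        }

        void shutdownAndWait() {
            worker.shutdown();
            try {
                worker.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Submits one job per id and returns the order in which they actually ran.
    static List<String> runInOrder(String... paragraphIds) {
        List<String> executed = Collections.synchronizedList(new ArrayList<>());
        FifoScheduler scheduler = new FifoScheduler();
        for (String id : paragraphIds) {
            scheduler.submit(() -> {
                executed.add(id);
                return null;
            });
        }
        scheduler.shutdownAndWait();
        return new ArrayList<>(executed);
    }

    public static void main(String[] args) {
        System.out.println(runInOrder("p1", "p2", "p3")); // prints [p1, p2, p3]
    }
}
```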
As a first step towards understanding interpreters, we should know how they are initialized when Apache Zeppelin starts up. InterpreterFactory manages the lifecycle of interpreters. When you start the server, InterpreterFactory initializes in two major steps. First, it reads the ${ZEPPELIN_HOME}/interpreter directory, whose subdirectories contain all the jars, including third-party frameworks. From this, InterpreterFactory builds the list of available interpreters with their default configuration, which is used when creating a new interpreter setting. Second, InterpreterFactory reads ${ZEPPELIN_HOME}/conf/interpreter.json, which stores the actual configurations of interpreters and includes the mapping between notes and interpreters. This is the same information shown in the interpreter tab of the UI. With that, InterpreterFactory is ready to run paragraphs. Here is the link to the code:
Before proceeding to the next step, you should know how Apache Zeppelin launches an interpreter. Apache Zeppelin supports three modes for managing interpreters, and the main purpose of supporting different modes is to manage memory usage and CPU load efficiently. No one wants to run a MarkdownInterpreter per note, but most users would like to run SparkInterpreter with their own instances. Shared mode is the traditional model: it shares all resources. If you use SparkInterpreter in this mode, all running paragraphs use one Spark instance. Scoped mode uses different class loaders within the same process, which lets each note own separate resources inside that process. With JDBCInterpreter, for example, every note has its own connection. Isolated mode means that notes run their paragraphs in different processes. Two main functions decide the mode.

  private String getInterpreterProcessKey(String noteId) {
    if (getOption().isExistingProcess) {
      return Constants.EXISTING_PROCESS;
    } else if (getOption().isPerNoteProcess()) {
      return noteId;
    } else {
      return SHARED_PROCESS;
    }
  }

InterpreterSetting.java on github

  private String getInterpreterInstanceKey(String noteId, InterpreterSetting setting) {
    if (setting.getOption().isExistingProcess()) {
      return Constants.EXISTING_PROCESS;
    } else if (setting.getOption().isPerNoteSession() || setting.getOption().isPerNoteProcess()) {
      return noteId;
    } else {
      return SHARED_SESSION;
    }
  }

InterpreterFactory.java on github
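
Read together, the two functions above yield a different pair of keys for each mode. The sketch below restates that logic with a hypothetical Mode enum, assuming that scoped mode corresponds to isPerNoteSession() and isolated mode to isPerNoteProcess(). The existing-process case is left out, and the constant values are placeholders rather than Zeppelin's real ones.

```java
// Hypothetical restatement of the two key functions; not Zeppelin source.
final class InterpreterModeKeys {

    enum Mode { SHARED, SCOPED, ISOLATED }

    // Placeholder values chosen for the sketch.
    static final String SHARED_PROCESS = "shared_process";
    static final String SHARED_SESSION = "shared_session";

    // Which process hosts the interpreter: only isolated mode uses one per note.
    static String processKey(Mode mode, String noteId) {
        return mode == Mode.ISOLATED ? noteId : SHARED_PROCESS;
    }

    // Which interpreter instance is used: scoped and isolated modes use one per note.
    static String instanceKey(Mode mode, String noteId) {
        return mode == Mode.SHARED ? SHARED_SESSION : noteId;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode
                    + " process=" + processKey(mode, "noteA")
                    + " instance=" + instanceKey(mode, "noteA"));
        }
    }
}
```

Under these assumptions, scoped mode is the one case where the process key is shared but the instance key is per note, which matches the description of separate resources inside the same process.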

Now let’s look at the getInterpreter method. Basically, it returns the interpreter that runs a paragraph. It determines the specific interpreter by distinguishing three cases. You will encounter a new term, replName, when you dig into the code. It is a kind of alias for a specific interpreter. getInterpreter behaves differently depending on the form of replName. If it is null, getInterpreter returns the default interpreter. If it contains no dot, getInterpreter treats it as the name of an interpreter in the default interpreter group. For example, “%pyspark” means the same as “%spark.pyspark”. Finally, if it consists of two words separated by a dot, getInterpreter handles it as “%{group_name}.{interpreter_name}” and returns that specific interpreter. Here is the link to the code:
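
As a rough illustration of the three cases just described (this is not the Zeppelin source itself), the resolution can be sketched as a pure function. The class ReplNameResolver and its parameters defaultGroup and defaultInterpreter are hypothetical, the leading "%" is assumed to have been stripped already, and lookups against the registered interpreters are omitted.

```java
// Hypothetical sketch of the three replName cases; not Zeppelin's getInterpreter.
final class ReplNameResolver {

    // Returns "group.interpreter" for a replName such as null, "pyspark" or "spark.pyspark".
    static String resolve(String replName, String defaultGroup, String defaultInterpreter) {
        // Case 1: no replName, so use the default interpreter.
        if (replName == null || replName.trim().isEmpty()) {
            return defaultGroup + "." + defaultInterpreter;
        }

        String[] parts = replName.trim().split("\\.");

        // Case 2: no dot, so the name is looked up in the default group.
        if (parts.length == 1) {
            return defaultGroup + "." + parts[0];
        }

        // Case 3: "{group_name}.{interpreter_name}".
        if (parts.length == 2) {
            return parts[0] + "." + parts[1];
        }

        throw new IllegalArgumentException("Unexpected replName: " + replName);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "spark", "spark"));
        System.out.println(resolve("pyspark", "spark", "spark"));
        System.out.println(resolve("spark.sql", "spark", "spark"));
    }
}
```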

Another role of getInterpreter is to create a RemoteInterpreter. Apache Zeppelin launches separate processes for different interpreters and manages them via Apache Thrift, which avoids conflicts among the interpreters’ dependencies. If the resulting interpreter has never been requested before, getInterpreter creates a RemoteInterpreter for it. RemoteInterpreter is a wrapper that contains the Thrift client interface and acts as a connector between the server process and the interpreter processes.
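
The "create it only if it was never requested before" behaviour is a lazy cache. A minimal sketch, assuming the cache is keyed by an instance key plus the interpreter class name, could look as follows. LazyRemoteInterpreters and RemoteStub are hypothetical names, and RemoteStub only stands in for RemoteInterpreter without any Thrift code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lazy cache; not Zeppelin's InterpreterFactory.
final class LazyRemoteInterpreters {

    // Stand-in for RemoteInterpreter: it only remembers which class it wraps.
    static final class RemoteStub {
        final String className;

        RemoteStub(String className) {
            this.className = className;
        }
    }

    private final Map<String, RemoteStub> created = new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();

    // Creates the wrapper on first use and returns the cached one afterwards.
    RemoteStub getOrCreate(String instanceKey, String className) {
        return created.computeIfAbsent(instanceKey + "/" + className, key -> {
            creations.incrementAndGet();
            return new RemoteStub(className);
        });
    }

    int creationCount() {
        return creations.get();
    }

    public static void main(String[] args) {
        LazyRemoteInterpreters registry = new LazyRemoteInterpreters();
        RemoteStub first = registry.getOrCreate("noteA", "SparkInterpreter");
        RemoteStub second = registry.getOrCreate("noteA", "SparkInterpreter");
        System.out.println((first == second) + " " + registry.creationCount());
    }
}
```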

It’s time to find out where a paragraph is executed. Let’s go back to note.run(id). That method calls intp.getScheduler().submit(p) at the end. Paragraph implements the Job interface, and the scheduler executes Jobs one by one. When paragraphs are submitted to the scheduler of an interpreter, the scheduler runs Paragraph.jobRun().

  @Override
  protected Object jobRun() throws Throwable {
    String replName = getRequiredReplName();
    Interpreter repl = getRepl(replName);
    logger.info("run paragraph {} using {} " + repl, getId(), replName);
    if (repl == null) {
      logger.error("Can not find interpreter name " + repl);
      throw new RuntimeException("Can not find interpreter for " + getRequiredReplName());
    }

    if (this.noteHasUser() && this.noteHasInterpreters()) {
      InterpreterSetting intp = getInterpreterSettingById(repl.getInterpreterGroup().getId());
      if (intp != null &&
        interpreterHasUser(intp) &&
        isUserAuthorizedToAccessInterpreter(intp.getOption()) == false) {
        logger.error("{} has no permission for {} ", authenticationInfo.getUser(), repl);
        return new InterpreterResult(Code.ERROR, authenticationInfo.getUser() +
          " has no permission for " + getRequiredReplName());
      }
    }

    String script = getScriptBody();
    // inject form
    if (repl.getFormType() == FormType.NATIVE) {
      settings.clear();
    } else if (repl.getFormType() == FormType.SIMPLE) {
      String scriptBody = getScriptBody();
      // inputs will be built from script body
      Map<String, Input> inputs = Input.extractSimpleQueryParam(scriptBody);

      final AngularObjectRegistry angularRegistry = repl.getInterpreterGroup()
              .getAngularObjectRegistry();

      scriptBody = extractVariablesFromAngularRegistry(scriptBody, inputs, angularRegistry);

      settings.setForms(inputs);
      script = Input.getSimpleQuery(settings.getParams(), scriptBody);
    }
    logger.debug("RUN : " + script);
    try {
      InterpreterContext context = getInterpreterContext();
      InterpreterContext.set(context);
      InterpreterResult ret = repl.interpret(script, context);

      if (Code.KEEP_PREVIOUS_RESULT == ret.code()) {
        return getReturn();
      }

      String message = "";

      context.out.flush();
      InterpreterResult.Type outputType = context.out.getType();
      byte[] interpreterOutput = context.out.toByteArray();

      if (interpreterOutput != null && interpreterOutput.length > 0) {
        message = new String(interpreterOutput);
      }

      if (message.isEmpty()) {
        return ret;
      } else {
        String interpreterResultMessage = ret.message();
        if (interpreterResultMessage != null && !interpreterResultMessage.isEmpty()) {
          message += interpreterResultMessage;
          return new InterpreterResult(ret.code(), ret.type(), message);
        } else {
          return new InterpreterResult(ret.code(), outputType, message);
        }
      }
    } finally {
      InterpreterContext.remove();
    }
  }

Paragraph.java on github

It looks complicated, but we will focus only on repl.interpret(script, context). Paragraph gets repl by calling getInterpreter and invokes its interpret method. RemoteInterpreter then passes the script and context to the Interpreter in a different process and obtains the result from that process.
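
The delegation described here is essentially a proxy: the caller sees an ordinary interpreter, while the call is forwarded elsewhere. The sketch below shows the shape of that pattern with an in-process fake instead of a second process. The class RemoteDelegationSketch, the Transport interface, and all other names in it are hypothetical; in Zeppelin the transport role is played by a Thrift client, and the context argument is omitted here for brevity.

```java
// Hypothetical proxy sketch; not Zeppelin's RemoteInterpreter.
final class RemoteDelegationSketch {

    // Simplified interpreter contract: script in, result text out.
    interface Interpreter {
        String interpret(String script);
    }

    // Stand-in for the channel to the interpreter process.
    interface Transport {
        String call(String className, String script);
    }

    // The server-side wrapper: it holds no interpreter logic and forwards every call.
    static final class RemoteInterpreterProxy implements Interpreter {
        private final String className;
        private final Transport transport;

        RemoteInterpreterProxy(String className, Transport transport) {
            this.className = className;
            this.transport = transport;
        }

        @Override
        public String interpret(String script) {
            return transport.call(className, script);
        }
    }

    // In-process fake of the "other process", so the sketch runs on its own.
    static final class FakeRemoteProcess implements Transport {
        @Override
        public String call(String className, String script) {
            return className + " ran: " + script;
        }
    }

    public static void main(String[] args) {
        Interpreter repl = new RemoteInterpreterProxy("EchoInterpreter", new FakeRemoteProcess());
        System.out.println(repl.interpret("1 + 1"));
    }
}
```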

This is the very basic flow of running a paragraph, and this article skipped some steps to keep it easy to follow. Through this article, I tried to describe how Apache Zeppelin selects the correct interpreter and how an interpreter receives a script and executes it. There are also many new concepts that I didn’t explain; I’ll explain them in another article. Apache Zeppelin is an emerging project and changes fast. Still, I hope this article helps you understand Apache Zeppelin better and contribute to it more easily.

0 0