Ceph学习——Ceph网络通信机制与源码分析

来源:互联网 发布:ubuntu sublime text2 编辑:程序博客网 时间:2024/06/06 03:09

  • Ceph 网络通信总体设计
  • 基础类介绍
    • Message
  • Simple的通信机制
    • SimpleMessager
    • Connection
    • Pipe
    • Dispatcher
    • 流程分析
      • 消息的发送
      • 消息的接收
  • 参考文献

Ceph 网络通信总体设计

作为一个分布式存储系统,Ceph自然需要一个稳定的网络通信模块,用于客户端和服务端,以及各个节点之间的消息通信。Ceph的网络模块位于源代码的ceph/src/msg 下,该模块构造了网络通信的基本框架。在文件夹下还包含了三种接口的实现:simple、async、xio。由于simple比较简单,也是目前生产环境中可以使用的,所以就只介绍它。
simple对于每一个连接,都会创建两个线程,其中一个用于监听和读取该终端的读事件,另一个用于写事件。读线程得到请求以后会解析网络流并开始构建消息,然后派发到后面的 Dispatcher。写线程在大部分时候会处于 Sleep 状态,直到有新的消息需要发送才会被唤醒。
Messenger是网络模块的核心数据结构,负责接收/发送消息。OSD主要有两个Messenger:ms_public处于与客户端的消息,ms_cluster处理与其它OSD的消息


这里写图片描述

基础类介绍

Message

该类是消息的基类,所有要发送的消息都是继承自该类的,它由消息头(header)、数据(user_data)、结束标记(footer)构成。

打开 ceph/src/msg/Message.h

class Message : public RefCountedObject {protected:  ceph_msg_header  header;      // headerelope  ceph_msg_footer  footer;  bufferlist       payload;  // "front" unaligned blob  bufferlist       middle;   // "middle" unaligned blob  bufferlist       data;     // data payload (page-alignment will be preserved where possible)  ...  ...  ...

类中封装了ceph_msg_header 、ceph_msg_footer 。他们就是信息头和信息结束标志的结构体。数据则是由三部分组成的,分别是playload(一般保存相关元数据)、middle(留用)、data(读写数据)。

ceph_msg_header 主要是封装数据相关信息

struct ceph_msg_header {    __le64 seq;       /* message seq# for this session ## 当前session内 消息的唯一 序号*/                   __le64 tid;        /* transaction id ## 消息的全局唯一的 id*/    __le16 type;      /* message type ## 消息类型 */                   __le16 priority;    /* priority.  higher value == higher priority */    __le16 version;    /* version of message encoding */    __le32 front_len;   /* bytes in main payload ## payload 的长度*/                        __le32 middle_len;  /* bytes in middle payload ## middle 的长度*/                       __le32 data_len;    /* bytes of data payload ## data 的 长度 */            ...    ...    ...          } __attribute__ ((packed));

ceph_msg_footer 主要是封装结束标记和CRC校验码

struct ceph_msg_footer {    __le32 front_crc, middle_crc, data_crc;//三个部分的 crc 效验码                     __le64  sig;      // 消息的64位  signature    __u8 flags;       //结束标志} __attribute__ ((packed));

Simple的通信机制

在http://www.wzxue.com/ceph-network/ 中介绍了Ceph的网络通信的设计,这里采用这篇文章中的图来介绍通信的流程。


这里写图片描述

接下来按照图片来介绍其中的类。

SimpleMessager

该类实现了Messager的接口。这只是三种实现接口方式中的一种。Messager是整个网络模块的抽闲类,可以看成是一个顶层的控制模块,它定义了网络模块的基本API接口,实现发送和接收消息。

class SimpleMessenger : public SimplePolicyMessenger {   //是Messager接口的实现public:  Accepter accepter;//用来监听请求  DispatchQueue dispatch_queue;//请求的队列  friend class Accepter;  //用于创建一个Pipe  Pipe *connect_rank(const entity_addr_t& addr, int type, PipeConnection *con,             Message *first);  /**   * Queue up a Message for delivery to the entity specified   * by addr and dest_type.   * submit_message() is responsible for creating   * new Pipes (and closing old ones) as necessary.   * 提交发送消息到发送队列,在必要时创建Pipe   */  void submit_message(Message *m, PipeConnection *con,              const entity_addr_t& addr, int dest_type,              bool already_locked);  friend class Pipe; //在已存在的Pipe中查找  Pipe *_lookup_pipe(const entity_addr_t& k) {    ceph::unordered_map<entity_addr_t, Pipe*>::iterator p = rank_pipe.find(k);    if (p == rank_pipe.end())      return NULL;    // see lock cribbing in Pipe::fault()    if (p->second->state_closed)      return NULL;    return p->second;  }} ;

Connection

看类的名字就知道他是干嘛的了,是用来发送消息、接收消息的的。

struct Connection : public RefCountedObject {  mutable Mutex lock;  Messenger *msgr;  RefCountedObject *priv;  int peer_type;  entity_addr_t peer_addr;  utime_t last_keepalive, last_keepalive_ack;  ....  ....  ....  /**   * 判断是否能够发送消息了。。。。   * @return true if ready to send, or false otherwise   */  virtual bool is_connected() = 0;  ...  ...  /**   * @param m The Message to send. The Messenger consumes a single reference   * when you pass it in.   * 主要功能就是这个 !!!用来发送消息的!!!!   * @return 0 on success, or -errno on failure.   */  virtual int send_message(Message *m) = 0;  int send_message(boost::intrusive_ptr<Message> m)  {    return send_message(m.detach()); /* send_message(Message *m) consumes a reference */  }  ...  ...  ...};

Pipe

Pipe实现的就是开头提到的,对于每一个连接,都会在内部创建一个读线程、一个写线程用来处理接收消息和发送消息。它的层次位于Connetion和Dispatcher的中间,其中拥有读写线程pipe:: reader_thread和pipe::writer_thread,他们的入口函数分别为Pipe::reader和Pipe::writer函数。可以看到,其实具体的封装socket是在这个部分的吧。。。

class Pipe : public RefCountedObject {    /**     * The Reader thread handles all reads off the socket      * 读线程,用于接收消息     */     class Reader : public Thread {      Pipe *pipe;    public:      explicit Reader(Pipe *p) : pipe(p) {}      void *entry() override { pipe->reader(); return 0; }    } reader_thread;    /**     * The Writer thread handles all writes to the socket (after startup).     * 写线程用于发送消息     */    class Writer : public Thread {      Pipe *pipe;    public:      explicit Writer(Pipe *p) : pipe(p) {}      void *entry() override { pipe->writer(); return 0; }    } writer_thread;    ....    ....    ....    map<int, list<Message*> > out_q;  // priority queue 准备发送的消息队列    DispatchQueue *in_q;//接收到消息的队列    list<Message*> sent;//当前要发送的消息    Cond cond;    bool send_keepalive;    bool send_keepalive_ack;    utime_t keepalive_ack_stamp;    bool halt_delivery; //if a pipe's queue is destroyed, stop adding to it    ....    __u32 connect_seq, peer_global_seq;    uint64_t out_seq;//发送序号    uint64_t in_seq, in_seq_acked;//接收序号、ACK信号    ....    void set_socket_options();    ....    int read_message(Message **pm,             AuthSessionHandler *session_security_copy);    int write_message(const ceph_msg_header& h, const ceph_msg_footer& f, bufferlist& body);    void start_reader();    void start_writer();    void shutdown_socket() {      recv_reset();      if (sd >= 0)        ::shutdown(sd, SHUT_RDWR);    }    ...    ...    ...  };

Dispatcher

这个类用于消息的分发,在Pipe中接收、发送队列中有很多请求,他就是负责把Message的请求分发给具体的应用层。

lass Dispatcher {public:  explicit Dispatcher(CephContext *cct_)    : cct(cct_)  {  }  virtual ~Dispatcher() { }  /**   * The Messenger calls this function to query if you are capable   * of "fast dispatch"ing a message. Indicating that you can fast   * dispatch it requires that you:   * 1) Handle the Message quickly and without taking long-term contended   * locks. (This function is likely to be called in-line with message   * receipt.)   * 2) Be able to accept the Message even if you have not yet received   * an ms_handle_accept() notification for the Connection it is associated   * with, and even if you *have* called mark_down() or received an   * ms_handle_reset() (or similar) call on the Connection. You will   * not receive more than one dead "message" (and should generally be   * prepared for that circumstance anyway, since the normal dispatch can begin,   * then trigger Connection failure before it's percolated through your system).   * We provide ms_handle_fast_[connect|accept] calls if you need them, under   * similar speed and state constraints as fast_dispatch itself.   * 3) Be able to make a determination on fast_dispatch without relying   * on particular system state -- the ms_can_fast_dispatch() call might   * be called multiple times on a single message; the state might change between   * calling ms_can_fast_dispatch and ms_fast_dispatch; etc.   *   * @param m The message we want to fast dispatch.   * @returns True if the message can be fast dispatched; false otherwise.   */  /**   * Perform a "fast dispatch" on a given message. See   * ms_can_fast_dispatch() for the requirements.   *   * @param m The Message to fast dispatch.   */  virtual void ms_fast_dispatch(Message *m) { ceph_abort(); }  /**   * The Messenger calls this function to deliver a single message.   *   * @param m The message being delivered. You (the Dispatcher)   * are given a single reference count on it.   */  virtual bool ms_dispatch(Message *m) = 0;  /**   * This function will be called whenever a Connection is newly-created   * or reconnects in the Messenger.   *   * @param con The new Connection which has been established. You are not   * granted a reference to it -- take one if you need one!   */  virtual void ms_handle_connect(Connection *con) {}  /**   * This function will be called synchronously whenever a Connection is   * newly-created or reconnects in the Messenger, if you support fast   * dispatch. It is guaranteed to be called before any messages are   * dispatched.   *   * @param con The new Connection which has been established. You are not   * granted a reference to it -- take one if you need one!   */  virtual void ms_handle_fast_connect(Connection *con) {}  /**   * Callback indicating we have accepted an incoming connection.   *   * @param con The (new or existing) Connection associated with the session   */  virtual void ms_handle_accept(Connection *con) {}  ...  ...  ...};

流程分析

消息的发送:


这里写图片描述
图片来自 http://www.linuxidc.com/Linux/2015-10/124549.htm

  1. SimpleMessager 首先获取对应的Connection
    首先查找已有的Pipe,若没有则创建一个Pipe

  2. 获得connection后,调用发送函数 _send_message

  3. _send_message 最终调用submit_message函数,它会查看Pipe的状态,若空则创建一个Pipe,若不空且状态不是关闭状态,那就把调用_send 把消息发送到out_q发送队列中,触发发送线程。

  4. 触发发送线程后 ,每个的Pipe负责使用Write_thread 来发送out_q的消息,入口函数为Pipi::write。
    进入发送线程后,使用_get_next_outgoing来获取out_q中的一个消息,使用write_message来把消息发送出去

消息的接收


这里写图片描述

  1. 使用入口函数 Pipe::reader 启动接收线程

    1) 判断状态后,使用tcp_read 获取tag
    2)根据tag的类型获取不同类型的消息。
    3) 调用 read_message 来接收消息,函数返回后,消息接收完成。

  2. 判断能不能使用fast_dispatch ,能则不将消息加入mqueue (DispatchQueue中)直接由Pipe调用ms_fast_dispatch函数处理,否则加入该队列,让DispatchQueue调用ms_dispatch 处理。

参考文献

http://www.linuxidc.com/Linux/2015-10/124549.htm
http://blog.csdn.net/changtao381/article/details/50915328
http://www.wzxue.com/ceph-network/