Traffic Server Process Model


http://www.cnblogs.com/liushaodong/archive/2013/02/26/2933280.html

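1. Overview

Traffic Server consists of three cooperating processes that serve requests and manage, control, and monitor the health of the system. Figure 1 shows how the three processes relate; each is described below.

Figure 1: Relationships among the processes

1) The traffic_server process is Traffic Server's transaction processing engine. It accepts connections, processes protocol requests, and serves content from the local cache or from the origin server.

2) The traffic_manager process is the command-and-control facility of Traffic Server. It launches, monitors, and reconfigures the traffic_server process. It is also responsible for the proxy auto-configuration port, the statistics interface, cluster administration, and virtual IP (VIP) failover.

If traffic_manager detects that traffic_server has failed, it restarts the process immediately and also maintains a connection queue for all incoming requests. Connections that arrive in the few seconds before traffic_server is back up are held in that queue and processed in first-in, first-out (FIFO) order. The queue therefore covers the window in which the server is restarting after a failure.

3) The traffic_cop process monitors the health of both traffic_server and traffic_manager. It periodically (several times per minute) queries both processes by issuing a heartbeat request that fetches a synthetic web page. If a failure occurs (no response arrives within the timeout interval, or an incorrect response arrives), traffic_cop restarts traffic_manager and traffic_server. The benefit of this design is that traffic_server, the worker process that must be kept running, is protected twice: once by traffic_manager and once by traffic_cop.

4) Traffic Server uses a multithreaded, asynchronous event-processing model. Rather than creating one thread per connection, it creates a configurable number of worker threads in advance, and each worker runs its own asynchronous event handler. traffic_server creates several groups of Threads and dispatches each Event, by type, to the event queue of a matching Thread. The Thread executes the callback in the Event's Continuation, which advances a state machine; the transitions from the initial state to the final state make up the complete handling of the event. A Thread never exits; it waits for the next event to arrive.

This article focuses on the relationships among the three Traffic Server processes and on how they are implemented; it does not analyze the multithreaded asynchronous event model in depth. The process model is shown in the figure below: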
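The continuation idea described above can be illustrated with a deliberately simplified, single-threaded sketch. All names here (State, Continuation, run_event_loop) are hypothetical and are not Traffic Server's real classes; a real worker thread would block waiting for new events instead of returning when its queue is empty.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical states of one transaction's state machine.
enum class State { Init, Reading, Done };

// Hypothetical continuation: holds state plus the callback that advances it.
struct Continuation {
    State state = State::Init;
    std::vector<std::string> trace;  // records the transitions that were taken

    // Callback run by the worker loop. Returns true if another event is needed.
    bool handle_event() {
        switch (state) {
        case State::Init:
            trace.push_back("init->reading");
            state = State::Reading;
            return true;
        case State::Reading:
            trace.push_back("reading->done");
            state = State::Done;
            return false;
        case State::Done:
            return false;
        }
        return false;
    }
};

// One worker's loop: pop an event, run its continuation, re-queue it if unfinished.
// Returns the number of callbacks that were dispatched.
int run_event_loop(std::deque<Continuation*>& queue) {
    int dispatched = 0;
    while (!queue.empty()) {
        Continuation* c = queue.front();
        queue.pop_front();
        assert(c != nullptr);
        ++dispatched;
        if (c->handle_event())
            queue.push_back(c);
    }
    return dispatched;
}
```

Two continuations placed on the same queue are interleaved by the loop, which is the point of the model: a single thread makes progress on many transactions without blocking on any one of them.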

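2. Implementation Principles

Basic principle: traffic_manager and traffic_server are each assigned a lock file, manager_lockfile and server_lockfile respectively. traffic_cop monitors both processes through these two lock files, and traffic_manager likewise monitors traffic_server through server_lockfile. Figure 2 shows these relationships:

Figure 2: Relationships between the processes and the lockfiles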
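The lockfile principle can be tried out with a small sketch. It assumes POSIX fcntl() record locks, which is what Lockfile::Open in the source listing uses; the helper names try_lock and monitor_sees_holder are hypothetical. A child process stands in for the monitored process and the parent plays the monitor.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Try to take a whole-file write lock without blocking.
// Returns the open descriptor on success (the lock lasts until it is closed),
// or -1 if another process holds the lock or the file cannot be opened.
int try_lock(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    struct flock lk = {};
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;  // 0 means "through end of file"
    if (fcntl(fd, F_SETLK, &lk) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// A child holds the lock while the parent probes it. Returns true if the probe
// fails while the child is alive and succeeds once the child has exited.
bool monitor_sees_holder(const char* path) {
    int ready[2], done[2];
    if (pipe(ready) < 0 || pipe(done) < 0) return false;
    pid_t child = fork();
    if (child < 0) return false;
    if (child == 0) {
        close(ready[0]);
        close(done[1]);
        char ok = (try_lock(path) >= 0) ? 'y' : 'n';
        if (write(ready[1], &ok, 1) != 1) _exit(1);  // tell the parent the lock is held
        char c;
        if (read(done[0], &c, 1) < 0) _exit(1);      // wait until the parent has probed
        _exit(0);                                    // exiting releases the lock
    }
    close(ready[1]);
    close(done[0]);
    char ok = 'n';
    if (read(ready[0], &ok, 1) != 1) ok = 'n';
    bool blocked = (ok == 'y') && (try_lock(path) < 0);
    close(done[1]);                                  // EOF lets the child exit
    waitpid(child, nullptr, 0);
    int fd = try_lock(path);
    bool free_after = (fd >= 0);
    if (fd >= 0) close(fd);
    close(ready[0]);
    return blocked && free_after;
}
```

Because the kernel drops the lock when the holder exits for any reason, including a crash, the monitor needs no cooperation from the monitored process to notice that it is gone.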

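Key implementation:

Key class: Lockfile

Lockfile::Open(pid_t * holding_pid) in detail:

Explanation: Lockfile::Open(pid_t * holding_pid) has three kinds of return value. It also sets the close-on-exec flag on the lockfile descriptor: when the process later calls one of the exec() family of functions, the descriptor is closed automatically, so it is not passed on to the new program.

(1) Returns 1: the lock on the lockfile was acquired. This means that the process associated with the lockfile is not running; if it were running, it would hold the lock and the attempt would fail.

(2) Returns 0: the lockfile is held by another process. The ID of the holding process is stored in holding_pid. That ID was written into the lockfile by Get() when the holding process started.

(3) Returns a negative value (-errno) in three cases: opening fname failed, reading the close-on-exec flag failed, or setting the close-on-exec flag failed. (In the source listing, a failed read of the holder's PID also returns -errno.)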
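The close-on-exec step and the three kinds of return value can be sketched as follows. The fcntl() calls are standard POSIX; describe_open_result is a hypothetical helper that only illustrates how a caller might branch on the result.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Set FD_CLOEXEC on fd while preserving the other descriptor flags.
// Returns 0 on success, -1 on failure (errno is set by fcntl).
int set_cloexec(int fd) {
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

// Hypothetical helper: map an Open()-style return code to its meaning.
const char* describe_open_result(int rc) {
    if (rc == 1) return "acquired: no holder is running";
    if (rc == 0) return "held: holding_pid identifies the holder";
    return "error: rc is -errno";
}
```

Without the flag, a child started with fork() and exec() would inherit the open lockfile descriptor, which is what the comment in Lockfile::Open warns against.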

The important functions for killing processes are described briefly below:

// kill
// Sends signal sig to the process with the given pid (used here to kill it)
// return: 0 -- okay, -1 -- error

1. int kill(pid_t pid, int sig);

// ink_killall
// Kills all processes whose program name is pname
// return: 0 -- okay, -1 -- error
2. int ink_killall(const char *pname, int sig);

ink_killall calls:

    3. ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
    4. ink_killall_kill_pidv(pidv, pidvcnt, sig);

// ink_killall_get_pidv_xmalloc
// Looks up the IDs of all running processes named pname, stores them in the
// pidv array, and stores the number of processes in pidvcnt
// return: -1 error (pidv: set to NULL; pidvcnt: set to 0);
//          0 okay (pidv: ats_malloc'd pid vector; pidvcnt: number of pid's;
//          if pidvcnt is set to 0, then pidv will be set to NULL)

3. int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);

// ink_killall_kill_pidv
// Calls kill(pidv[i], sig) for each process ID recorded in pidv
// return: 0 -- okay, -1 -- error
4. int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig);

ink_killall_kill_pidv calls:

    1. kill(pid_t pid, int sig);

// safe_kill
// Safely kills all processes named pname. lockfile_name is the lockfile
// associated with the process. group indicates whether the child processes
// created by pname must be killed as well, since they belong to the same
// process group.
// return: void

5. static void safe_kill(const char *lockfile_name, const char *pname, bool group);

safe_kill calls:

    6. Lockfile::Kill(killsig, coresig, pname);

    7. Lockfile::KillGroup(killsig, coresig, pname);

// Lockfile::Kill
// Checks the corresponding lockfile and kills all processes named pname.
// sig is normally the kill signal. initial_sig defaults to 0; if it is
// greater than 0 it is sent to the init_pid process first.
// return: void

6. void Lockfile::Kill(int sig, int initial_sig, const char *pname);

Lockfile::Kill calls:

    8. lockfile_kill_internal(pid, initial_sig, pid, pname, sig);

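// Lockfile::KillGroup
// Checks the corresponding lockfile and kills the process named pname together
// with the child processes it created (including, of course, the threads
// created by those children). sig is the kill signal.
// initial_sig has the same meaning as in Kill.
// return: void

7. void Lockfile::KillGroup(int sig, int initial_sig, const char *pname);

Lockfile::KillGroup calls (here pid is normally the negated process group ID of the lock holder):

    8. lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);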

// lockfile_kill_internal
// First sends init_sig to the init_pid process (if init_sig > 0), then
// repeatedly sends sig to pid and to all processes named pname until pid
// no longer exists
// return: void

8. static void lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig);

lockfile_kill_internal calls:

    1. kill(init_pid, init_sig);

    3. ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
    4. ink_killall_kill_pidv(pidv, pidvcnt, sig);

For the full implementation details, see the source code.
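The two uses of kill() that the functions above rely on, probing with signal 0 and actually terminating a process, can be sketched with standard POSIX calls. The helper names process_exists and spawn_and_kill are hypothetical; the child process here is a stand-in that only sleeps.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// True if a process with this pid exists. Signal 0 delivers nothing and only
// performs the existence and permission checks. A child that has exited but
// has not been reaped yet (a zombie) still counts as existing.
bool process_exists(pid_t pid) {
    assert(pid > 0);
    return kill(pid, 0) == 0 || errno == EPERM;
}

// Fork a child that only sleeps, kill it with sig, and reap it.
// Returns the signal that terminated the child, or -1 on any failure.
int spawn_and_kill(int sig) {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        for (;;) pause();  // stand-in for a long-running server process
    }
    if (!process_exists(child)) return -1;
    if (kill(child, sig) < 0) return -1;
    int status = 0;
    if (waitpid(child, &status, 0) != child) return -1;
    if (!WIFSIGNALED(status)) return -1;
    return WTERMSIG(status);
}
```

check_programs() in the source listing uses the same signal-0 probe: when a lockfile reports a holder but kill(holding_pid, 0) fails, it falls back to ink_killall() on the program name.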

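2. Simulating traffic_cop's monitoring of traffic_manager and traffic_server

After traffic_cop starts, it enters main(), which calls check(). check() periodically calls check_programs() to monitor traffic_manager and traffic_server. check_programs() is somewhat complex; its flow is shown in the flowchart below.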
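The decision logic of check_programs(), as it appears in the traffic_cop.cpp listing, can be condensed into a small pure function. This is only a sketch: the Snapshot and Action types and the decide function are hypothetical names, and the stale-PID cleanup via ink_killall() is left out.

```cpp
#include <cassert>

// What the monitor decides to do in one pass.
enum class Action { RestartManager, WaitForManager, KillManager, HeartbeatServer, Nothing };

// Observations gathered during one pass (hypothetical type).
struct Snapshot {
    bool manager_lock_free;     // Open() on manager_lockfile returned > 0
    bool manager_heartbeat_ok;  // heartbeat_manager() did not fail
    bool server_expected_up;    // server_up() returned > 0
    bool server_lock_free;      // Open() on server_lockfile returned > 0
};

// server_not_found persists across passes, like the static counter in the listing.
Action decide(const Snapshot& s, int& server_not_found) {
    assert(server_not_found >= 0);
    if (s.manager_lock_free)
        return Action::RestartManager;      // after killing any stray traffic_server
    if (!s.manager_heartbeat_ok || !s.server_expected_up)
        return Action::Nothing;
    if (s.server_lock_free) {
        // Give traffic_manager one pass to restart the server by itself.
        if (++server_not_found > 1) {
            server_not_found = 0;
            return Action::KillManager;     // it is restarted on a later pass
        }
        return Action::WaitForManager;
    }
    return Action::HeartbeatServer;
}
```

The counter gives traffic_manager one full check interval to bring traffic_server back before traffic_cop steps in and kills the manager's process group.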

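3. Simulation Test

Following the principles above, I wrote imitations of the three processes traffic_cop, traffic_manager, and traffic_server, with traffic_cop implemented as a daemon. The way traffic_manager monitors traffic_server is similar to the way traffic_cop monitors traffic_manager and traffic_server, so it is not described again. The functions that test the health of traffic_manager and traffic_server, heartbeat_manager(), server_up(), and heartbeat_server(), involve port communication. Since that part does not affect the simulation of the principle, their code is omitted and they simply return a normal value. (The programs need the files manager_lockfile and server_lockfile at run time; readers should create these two files in the directory that contains the executables.)

After the programs start, running the command ps -axj | grep binary produces the output shown in the figure below:

The first four columns are: parent process ID / process ID / process group ID / session ID.

The figure shows the expected relationships among the processes.

When the traffic_manager process exits abnormally, traffic_cop restarts it. This can be seen in the log file (excerpt below):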

==============traffic_server is running, pid:'5443'!

----------------traffic_manager is running, pid:'5436'!

==============traffic_server is running, pid:'5443'!

---------------traffic_manager has a expcetion and eixt!

Entering check_programs()

traffic_manager not running, making sure traffic_server is dead

Entering safe_kill

Leaving safe_kill

Entering spwan_manager()!

Leaving spwan_manager()!

Leaving check_programs

----------------traffic_manager is running, pid:'5463'!

Entering spwan_server()!

Leaving spwan_server()!

==============traffic_server is running, pid:'5467'!

The log shows that at one moment the traffic_manager PID is 5436 and the traffic_server PID is 5443. Next, traffic_manager fails ("---------------traffic_manager has a expcetion and eixt!"). During its periodic check_programs() call, traffic_cop finds "traffic_manager not running", kills the traffic_server process ("making sure traffic_server is dead"), and then creates a new traffic_manager process ("Entering spwan_manager()!"), whose PID is now 5463. Once the new traffic_manager is running, it finds that traffic_server is not running and creates a new traffic_server process ("Entering spwan_server()!"), whose PID becomes 5467. This shows that traffic_cop's monitoring works as intended.

When the traffic_server process exits abnormally, traffic_manager detects it and restarts traffic_server. This can also be seen in the log file (excerpt below):

==============traffic_server is running, pid:'7703'!

----------------traffic_manager is running, pid:'7699'!

=================traffic_server has a expcetion and exit!

Entering safe_kill

Leaving safe_kill

--------------Entering spwan_server()!

--------------Leaving spwan_server()!

----------------traffic_manager is running, pid:'7699'!

==============traffic_server is running, pid:'7712'!

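The log shows that at one moment the traffic_manager PID is 7699 and the traffic_server PID is 7703. Then traffic_server exits abnormally, and traffic_manager starts a new traffic_server process ("Entering spwan_server()!") with PID 7712. The traffic_manager PID is still 7699, so the traffic_manager process itself has not changed. This shows that traffic_manager does monitor the traffic_server process.

4. Summary

Why were three processes designed rather than two, with traffic_manager supervising traffic_server directly? The role that traffic_manager plays in the system means that two processes alone cannot meet the requirements. In particular, when traffic_manager detects that traffic_server has failed, it temporarily queues incoming requests, so it must itself listen on the port for a while. A process that does this cannot be guaranteed never to fail, which means traffic_manager can fail too. The traffic_cop daemon was introduced for this reason: its only role is to monitor the other two processes, and in theory this daemon does not terminate abnormally. This three-level design is safer and more reliable than a two-level one. When the three processes work together, server failures are transparent to clients: a client does not notice that anything went wrong with its request, although it may see somewhat higher latency. (This is the design intent rather than an absolute guarantee. In the unlikely event that traffic_manager and traffic_server fail at the same time, client requests cannot be accepted during the few seconds traffic_cop needs to restart them.) The architecture shows that a server should be as stable and safe as possible and that failure cases must be considered thoroughly.

Source code: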

1.lock_and_kill.h

#ifndef LOCK_AND_KILL_H
#define LOCK_AND_KILL_H
#include <sys/types.h>
#include <string.h>
#define PATH_NAME_MAX 4096

/*-------------------------------------------------------------------------
   ink_killall
   - Sends signal 'sig' to all processes with the name 'pname'
   - Returns: -1 error
               0 okay
  -------------------------------------------------------------------------*/
int ink_killall(const char *pname, int sig);

/*-------------------------------------------------------------------------
   ink_killall_get_pidv_xmalloc
   - Get all pid's named 'pname' and stores into ats_malloc'd
     pid_t array, 'pidv'
   - Returns: -1 error (pidv: set to NULL; pidvcnt: set to 0)
               0 okay (pidv: ats_malloc'd pid vector; pidvcnt: number of pid's;
                   if pidvcnt is set to 0, then pidv will be set to NULL)
  -------------------------------------------------------------------------*/
int ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt);

/*-------------------------------------------------------------------------
   ink_killall_kill_pidv
   - Kills all pid's in 'pidv' with signal 'sig'
   - Returns: -1 error
               0 okay
  -------------------------------------------------------------------------*/
int ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig);



class Lockfile
{
public:

  // fd starts at -1 (no open file) so that Close() never closes descriptor 0.
  Lockfile(void):fd(-1)
  {
    fname[0] = '\0';
  }


  // coverity[uninit_member]
  Lockfile(const char *filename):fd(-1)
  {
    strcpy(fname, filename);
  }


  ~Lockfile(void)
  {
  }

  void SetLockfileName(const char *filename)
  {
    strcpy(fname, filename);
  }

  const char *GetLockfileName(void)
  {
    return fname;
  }

  // Open() ----- a very important function
  //
  // Tries to open a lock file, returning:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Open(pid_t * holding_pid);

  // Get()
  //
  // Gets write access to a lock file, and if successful, truncates
  // file, and writes the current process ID.  Returns:
  //   -errno on error
  //   0 if someone is holding the lock (with holding_pid set)
  //   1 if we now have a writable lock file
  int Get(pid_t * holding_pid);

  // Close()
  //
  // Closes the file handle on the opened Lockfile.
  void Close(void);

  // Kill()
  // KillGroup()
  //
  // Ensures no one is holding the lock. It tries to open the lock file
  // and if that does not succeed, it kills the process holding the lock.
  // If the lock file open succeeds, it closes the lock file releasing
  // the lock.
  //
  // The initial signal can be used to generate a core from the process while
  // still ensuring it dies.
  void Kill(int sig, int initial_sig = 0, const char *pname = NULL);
  void KillGroup(int sig, int initial_sig = 0, const char *pname = NULL);

private:
  char fname[PATH_NAME_MAX];
  int fd;
};


#endif

2.lock_and_kill.cpp

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <errno.h>
#include <signal.h>

#include "lock_and_kill.h"


#define PROC_BASE "/proc"
#define INITIAL_PIDVSIZE 32
#define LOCKFILE_BUF_LEN 16
#ifndef LINE_MAX
#define LINE_MAX 1024 // guarded because <limits.h> may already define LINE_MAX
#endif

int
ink_killall(const char *pname, int sig)
{
  int err;
  pid_t *pidv;
  int pidvcnt;

  if (ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt) < 0) {
    return -1;
  }

  if (pidvcnt == 0) {
    free(pidv);
    return 0;
  }

  err = ink_killall_kill_pidv(pidv, pidvcnt, sig);
  free(pidv);
  return err;
}

int
ink_killall_get_pidv_xmalloc(const char *pname, pid_t ** pidv, int *pidvcnt)
{
  DIR *dir;
  FILE *fp;
  struct dirent *de;
  pid_t pid, self;
  char buf[LINE_MAX], *p, *comm;
  int pidvsize = INITIAL_PIDVSIZE;

  if (!pname || !pidv || !pidvcnt)
    goto l_error;

  self = getpid();
  if (!(dir = opendir(PROC_BASE)))
    goto l_error;

  *pidvcnt = 0;
  *pidv = (pid_t *)malloc(pidvsize * sizeof(pid_t));
  if (!*pidv) {
    closedir(dir);
    goto l_error;
  }

  // Walk /proc and compare the command name in /proc/<pid>/stat with pname.
  while ((de = readdir(dir))) {
    if (!(pid = (pid_t) atoi(de->d_name)) || pid == self)
      continue;
    snprintf(buf, sizeof(buf), PROC_BASE "/%d/stat", pid);
    if ((fp = fopen(buf, "r"))) {
      if (fgets(buf, sizeof buf, fp) == 0)
        goto l_close;
      if ((p = strchr(buf, '('))) {
        comm = p + 1;
        if ((p = strchr(comm, ')')))
          *p = '\0';
        else
          goto l_close;
        if (strcmp(comm, pname) == 0) {
          if (*pidvcnt >= pidvsize) {
            pid_t *pidv_realloc;
            pidvsize *= 2;
            if (!(pidv_realloc = (pid_t *)realloc(*pidv, pidvsize * sizeof(pid_t)))) {
              free(*pidv);
              fclose(fp);
              closedir(dir);
              goto l_error;
            } else {
              *pidv = pidv_realloc;
            }
          }
          (*pidv)[*pidvcnt] = pid;
          (*pidvcnt)++;
        }
      }
    l_close:
      fclose(fp);
    }
  }
  closedir(dir);

  if (*pidvcnt == 0) {
    free(*pidv);
    *pidv = 0;
  }
  return 0;
l_error:
  // Guard the output pointers: either may be NULL when we get here.
  if (pidv)
    *pidv = NULL;
  if (pidvcnt)
    *pidvcnt = 0;
  return -1;
}

int
ink_killall_kill_pidv(pid_t * pidv, int pidvcnt, int sig)
{
  int err = 0;
  if (!pidv || (pidvcnt <= 0))
    return -1;
  while (pidvcnt > 0) {
    pidvcnt--;
    if (kill(pidv[pidvcnt], sig) < 0)
      err = -1;
  }
  return err;
}


//////////////////// Lockfile member functions follow ////////////////////
////////////////////////////////////////////////////////////////////////
int
Lockfile::Open(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  pid_t val;
  int err;
  *holding_pid = 0;

#define FAIL(x) \
{ \
  if (fd > 0) \
    close (fd); \
  return (x); \
}

  struct flock lock;
  char *t;
  int size; // holds no valid value until the read loop sets it

  // Try and open the Lockfile. Create it if it does not already
  // exist.
  do {
    fd = open(fname, O_RDWR | O_CREAT, 0644);
  } while ((fd < 0) && (errno == EINTR));

  if (fd < 0)
    return (-errno);

  // Lock it. Note that if we can't get the lock EAGAIN will be the
  // error we receive.
  lock.l_type = F_WRLCK;
  lock.l_start = 0;
  lock.l_whence = SEEK_SET;
  lock.l_len = 0;

  do {
    err = fcntl(fd, F_SETLK, &lock);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    // We couldn't get the lock. Try and read the process id of the
    // process holding the lock from the lockfile.
    t = buf;

    for (size = 15; size > 0;) {
      do {
        err = read(fd, t, size);
      } while ((err < 0) && (errno == EINTR));

      if (err < 0)
        FAIL(-errno);
      if (err == 0)
        break;

      size -= err;
      t += err;
    }
    *t = '\0';

    // coverity[secure_coding]
    if (sscanf(buf, "%d\n", (int*)&val) != 1) {
      *holding_pid = 0;
    } else {
      *holding_pid = val;
    }
    FAIL(0);

  }
  // If we did get the lock, then set the close on exec flag so that
  // we don't accidentally pass the file descriptor to a child process
  // when we do a fork/exec.
  do {
    err = fcntl(fd, F_GETFD, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  val = err | FD_CLOEXEC;

  do {
    err = fcntl(fd, F_SETFD, val);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0)
    FAIL(-errno);

  // Return the file descriptor of the opened lockfile. When this file
  // descriptor is closed the lock will be released.
  return (1);                   // success
#undef FAIL
}

int
Lockfile::Get(pid_t * holding_pid)
{
  char buf[LOCKFILE_BUF_LEN];
  int err;
  *holding_pid = 0;

  fd = -1;

  // Open the Lockfile and get the lock. If we are successful, the
  // return value will be the file descriptor of the opened Lockfile.
  err = Open(holding_pid);
  if (err != 1)
    return err;

  if (fd < 0) {
    return -1;
  }

  // Truncate the Lockfile effectively erasing it.
  do {
    err = ftruncate(fd, 0);
  } while ((err < 0) && (errno == EINTR));

  if (err < 0) {
    close(fd);
    return (-errno);
  }

  // Write our process id to the Lockfile.
  snprintf(buf, sizeof(buf), "%d\n", (int) getpid());

  do {
    err = write(fd, buf, strlen(buf));
  } while ((err < 0) && (errno == EINTR));

  if (err != (int) strlen(buf)) {
    close(fd);
    return (-errno);
  }
  return (1);                   // success
}

void
Lockfile::Close(void)
{
  if (fd != -1) {
    close(fd);
  }
}

//-------------------------------------------------------------------------
// Lockfile::Kill() and Lockfile::KillAll()
//
// Open the lockfile. If we succeed it means there was no process
// holding the lock. We'll just close the file and release the lock
// in that case. If we don't succeed in getting the lock, the
// process id of the process holding the lock is returned. We
// repeatedly send the KILL signal to that process until doing so
// fails. That is, until kill says that the process id is no longer
// valid (we killed the process), or that we don't have permission
// to send a signal to that process id (the process holding the lock
// is dead and a new process has replaced it).
//
// INKqa11325 (Kevlar: linux machine hosed up if specific threads
// killed): Unfortunately, it's possible on Linux that the main PID of
// the process has been successfully killed (and is waiting to be
// reaped while in a defunct state), while some of the other threads
// of the process just don't want to go away.  Integrate ink_killall
// into Kill() and KillAll() just to make sure we really kill
// everything and so that we don't spin hard while trying to kill a
// defunct process.
//-------------------------------------------------------------------------


static void
lockfile_kill_internal(pid_t init_pid, int init_sig, pid_t pid, const char *pname, int sig)
{
  int err;

#if defined(linux)

  // Initialized so that free(pidv) below is safe when pname is NULL.
  pid_t *pidv = NULL;
  int pidvcnt = 0;

  // Need to grab pname's pid vector before we issue any kill signals.
  // Specifically, this prevents the race-condition in which
  // traffic_manager spawns a new traffic_server while we still think
  // we're killall'ing the old traffic_server.
  if (pname) {
    // Collect the PIDs of all processes named pname (nothing is killed yet):
    // pidv receives the PID array and pidvcnt the number of processes.
    ink_killall_get_pidv_xmalloc(pname, &pidv, &pidvcnt);
  }

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    if ((err = kill(pid, sig)) == 0) {
      sleep(1);
    }
    if (pname && (pidvcnt > 0)) {
      ink_killall_kill_pidv(pidv, pidvcnt, sig);
      sleep(1);
    }
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

  free(pidv);

#else

  if (init_sig > 0) {
    kill(init_pid, init_sig);
    // sleep for a bit and give time for the first signal to be
    // delivered
    sleep(1);
  }

  do {
    err = kill(pid, sig);
  } while ((err == 0) || ((err < 0) && (errno == EINTR)));

#endif  // linux check

}

/////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////
void
Lockfile::Kill(int sig, int initial_sig, const char *pname)
{
  int err;
  int pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1)                 // success getting the lock file: no matching process is running
  {
    Close();                    // nothing to kill, so just close the file
  } else if (err == 0)          // someone else has the lock
  {
    pid = holding_pid;          // PID of the process that holds the lock
    if (pid != 0) {             // only kill when the PID is valid

      lockfile_kill_internal(pid, initial_sig, pid, pname, sig);
    }
  }
}


/////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////////////////////
// Author's note: I did not fully understand this function.
void
Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
{
  int err;
  pid_t pid;
  pid_t holding_pid;

  err = Open(&holding_pid);
  if (err == 1)                 // success getting the lock file
  {
    Close();
  } else if (err == 0)          // someone else has the lock
  {
    do {
      pid = getpgid(holding_pid); // process group ID of the lock holder
    } while ((pid < 0) && (errno == EINTR));

    if ((pid < 0) || (pid == getpid()))
      pid = holding_pid;
    else
      pid = -pid;               // a negative pid makes kill() signal the whole group

    if (pid != 0) {
      // We kill the holding_pid instead of the process_group
      // initially since there is no point trying to get core files
      // from a group since the core file of one overwrites the core
      // file of another one
      lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
    }
  }
}

3.log.h

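#ifndef LOG_H
#define LOG_H
#include <stdio.h>

// Appends the message to log.txt in the current working directory.
// Takes const char* so that string literals can be passed in C++.
void write_to_log(const char* c){

    FILE* fd;
    fd = fopen("log.txt", "ab");
    if (fd)
      {
        fputs(c, fd);
        fclose(fd);
      }
}

#endif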

4.traffic_cop.cpp

  1 #include "lock_and_kill.h"  2 #include "log.h"  3 #include <sys/types.h>  4 #include <sys/ipc.h>  5 #include <sys/sem.h>  6 #include <signal.h>  7 #include <sys/param.h>  8 #include <unistd.h>  9 #include <stdlib.h> 10 #include <sys/wait.h> 11 #include <time.h> 12 #include <string.h> 13 #include <stdio.h> 14 #include <sys/stat.h>  15  16  17 #define    NOWARN_UNUSED(x)    (void)(x) 18  19 static char cop_lockfile[PATH_NAME_MAX]; 20 static char manager_lockfile[PATH_NAME_MAX]; 21 static char server_lockfile[PATH_NAME_MAX]; 22  23 static char manager_binary[PATH_NAME_MAX] = "traffic_manager"; 24 static char server_binary[PATH_NAME_MAX] = "traffic_server"; 25 static int killsig=SIGKILL; 26 static int coresig=0; 27 static int server_not_found = 0; 28 static int server_failures=0; 29 static int manager_failures =0; 30  31 static const int sleep_time = 10;       // 10 sec 32 static const int manager_timeout = 3 * 60;      //  3 min 33 static const int server_timeout = 3 * 60;       //  3 min 34 static const int kill_timeout = 1 * 60; //  1 min 35  36  37 static void sig_alarm_warn(int signum=0) 38 { 39      alarm(kill_timeout); 40 } 41  42  43 static void sig_fatal(int signum) 44 { 45     abort(); 46 } 47  48  49 static void set_alarm_warn() 50 { 51     struct sigaction action; 52     action.sa_handler = sig_alarm_warn; 53      sigemptyset(&action.sa_mask); 54      action.sa_flags = 0; 55     sigaction(SIGALRM, &action, NULL); 56 } 57  58 static void set_alarm_death() 59 { 60     struct sigaction action; 61     action.sa_handler = sig_fatal; 62       sigemptyset(&action.sa_mask); 63       action.sa_flags = 0; 64     sigaction(SIGALRM, &action, NULL); 65 } 66  67 static void sig_child(int signum) 68 { 69   NOWARN_UNUSED(signum); 70   pid_t pid = 0; 71   int status = 0; 72   for (;;) { 73     pid = waitpid(WAIT_ANY, &status, WNOHANG); 74  75     if (pid <= 0) { 76       break; 77     } 78     // TSqa03086 - We can not log the child status signal from 79     //   the 
signal handler since syslog can deadlock.  Record 80     //   the pid and the status in a global for logging 81     //   next time through the event loop.  We will occasionally 82     //   lose some information if we get two sig childs in rapid 83     //   succession 84    // child_pid = pid; 85     //child_status = status; 86   } 87 } 88  89  90 static void init_signals() 91 { 92       struct sigaction action; 93       write_to_log("Entering init_signals()\n"); 94       action.sa_handler = sig_child; 95       sigemptyset(&action.sa_mask); 96       action.sa_flags = 0; 97       sigaction(SIGCHLD, &action, NULL); 98       action.sa_handler = sig_fatal; 99       sigemptyset(&action.sa_mask);100       action.sa_flags = 0;101       write_to_log("leaving init_signals()\n\n");102 }103 104 105 static void safe_kill(const char* lockfile_name,const char * pname,bool group)106 {107     Lockfile lockfile(lockfile_name);108     write_to_log("Entering safe_kill\n");109     set_alarm_warn();110       alarm(kill_timeout);111 112       if (group == true) {113         lockfile.KillGroup(killsig, coresig, pname);114       } else {115         lockfile.Kill(killsig, coresig, pname);116       }117       alarm(0);118       set_alarm_death();119      write_to_log("Leaving safe_kill\n\n");120 121 }122 123 124 //为了简单化，直接返回0125 static int server_up()126 {127     return 1;128 129 }130 131 132 static int heartbeat_manager()133 {134     //safe_kill(manager_lockfile, manager_binary, true);135     return 1;136 }137 138 static int heartbeat_server()139 {140     //safe_kill(server_lockfile, server_binary, false);141     //server_failures = 0;142     return 1;143 }144 145 146 147 static void spawn_manager()148 {149       int err;150       int key;151       err = fork();152   write_to_log("Entering spwan_manager()!\n\n");153   if (err == 0) {154     err = execv(manager_binary, NULL);155   write_to_log("somehow execv failed!\n");156     exit(1);157   } else if (err == -1) {158     
write_to_log("unable to fork !\n");159     exit(1);160   } 161   162   manager_failures = 0;163   write_to_log("Leaving spwan_manager()!\n\n");164 }165 166 167 static void init_lockfiles()168 {169  // Layout::relative_to(cop_lockfile, sizeof(cop_lockfile), Layout::get()->runtimedir, COP_LOCK);170  // Layout::relative_to(manager_lockfile, sizeof(manager_lockfile), Layout::get()->runtimedir,      MANAGER_LOCK);171  // Layout::relative_to(server_lockfile, sizeof(server_lockfile), Layout::get()->runtimedir, SERVER_LOCK);172 173  write_to_log("Entering init_lockfiles()\n");174  strcpy(cop_lockfile,"cop_lockfile");175  strcpy(manager_lockfile,"manager_lockfile");176  strcpy(server_lockfile,"server_lockfile");177 178  strcpy(manager_binary,"manager_binary");179  strcpy(server_binary,"server_binary");180 181 182  write_to_log("leaving init_lockfiles()\n\n");183 184  //manager_lockfile="manager_lockfile";185  //server_lockfile="server_lockfile";186  //manager_binary="manager_binary";187  //server_binary="server_binary";188 189 }190 191 192 static void check_lockfile()193 {194 195   write_to_log("Entering check_lockfile()\n");196   int err;197   pid_t holding_pid;198   Lockfile cop_lf(cop_lockfile);199   err = cop_lf.Get(&holding_pid);200 201 202   if (err < 0) {203     write_to_log("leaving check_lockfile(),and err<0\n\n");204     exit(1);205   } else if (err == 0) {206     write_to_log("leaving check_lockfile(),and err==0\n\n");207     exit(1);208   }209     write_to_log("leaving check_lockfile()\n\n");210 211 }212 213 214 215 static void check_programs()216 {217     int err;218     pid_t holding_pid;219 220     write_to_log("Entering check_programs()\n");221     printf("Entering check_programs()\n");222   //尝试去获取 manager的lockfile，如果成功，说明没有manager进程在运行223     Lockfile manager_lf(manager_lockfile);224         err = manager_lf.Open(&holding_pid);225 226    //通过检测err的值来判断manager进程的运行情况227    if(err==0){228         write_to_log("in 
check_programs(),manager_lockfile,err==0\n");229 230         printf("in check_programs(),manager_lockfile,err==0\n");231         232         if(kill(holding_pid,0)==-1){233           234            printf("holding_pid is %d,and invalid\n",holding_pid);235 236                 ink_killall(manager_binary, killsig);237                 sleep(1);                 // give signals a chance to be received 238                  err = manager_lf.Open(&holding_pid);239             }240 241    }242 243 244     if(err>0){//说明可以获得manager lockfile245         // 'lockfile_open' returns the file descriptor of the opened246         // lockfile.  We need to close this before spawning the247         // manager so that the manager can grab the lock. 248             manager_lf.Close(); 249             // Make sure we don't have a stray traffic server running.250 251             write_to_log("traffic_manager not running, making sure traffic_server is dead\n");252             safe_kill(server_lockfile,server_binary,false);253             spawn_manager();254     }255     else256     {257 258             259             260 261             //err<0,Open中返回负值，说明可能是加锁成功，但是设置lockfile的文件信息失败262             // If there is a manager running we want to heartbeat it to263             // make sure it hasn't wedged. If the manager test succeeds we264             // check to see if the server is up. (That is, it hasn't been265             // brought down via the UI).  If the manager thinks the server266             // is up, we make sure there is actually a server process267             // running. 
        // If there is, we test it.
        alarm(2 * manager_timeout);
        err = heartbeat_manager();
        alarm(0);

        if (err < 0) {
            // The manager heartbeat failed; give up on this round.
            return;
        }

        if (server_up() <= 0) {
            // The manager is running. If the server is down we expect the
            // manager to spawn a new one, so return.
            return;
        }

        Lockfile server_lf(server_lockfile);
        err = server_lf.Open(&holding_pid);

        if (err == 0) {
            if (kill(holding_pid, 0) == -1) {
                ink_killall(server_binary, killsig);
                sleep(1); // give signals a chance to be received
                err = server_lf.Open(&holding_pid);
            }
        }

        if (err > 0) {
            server_lf.Close();
            server_not_found += 1;

            if (server_not_found > 1) {
                server_not_found = 0;
                safe_kill(manager_lockfile, manager_binary, true);
            }
        } else {
            alarm(2 * server_timeout);
            heartbeat_server();
            alarm(0);
        }
    }
    printf("Leaving check_programs\n\n");
    write_to_log("Leaving check_programs\n\n");
}


static void init()
{
    write_to_log("Entering init()\n");
    init_signals();
    init_lockfiles();
    check_lockfile();
    write_to_log("Leaving init()\n\n");
}

static void millisleep(int ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms - ts.tv_sec * 1000) * 1000 * 1000;
    nanosleep(&ts, NULL);
}

// Changed function from taking no argument and returning void
// to taking a void* and returning a void*. The change was made
// so that we can call ink_thread_create() on this function
// in the case of running cop as a win32 service.

static void* check(void* arg)
{
    write_to_log("Entering check()\n\n");
    for (;;) {
        // problems with the ownership of this file as root Make sure it is
        // owned by the admin user

        // Re-arm the watchdog alarm for one full round of checks.
        alarm(2 * (sleep_time + manager_timeout * 2 + server_timeout));

        check_programs();
        millisleep(sleep_time * 1000);
    }
    // Not reached: the loop above never exits.
    write_to_log("Leaving check()\n\n");
    return arg;
}

void init_daemon(void)
{
    int i;
    pid_t pid;
    struct rlimit rl;
    struct sigaction sa;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
        exit(1);
    }

    if ((pid = fork()) < 0) {
        exit(1); // fork failed, exit
    } else if (pid > 0) {
        exit(0); // parent process: exit
    }

    // First child: keep running in the background.
    setsid(); // the first child becomes session leader and process group
              // leader, and detaches from the controlling terminal
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    if (sigaction(SIGHUP, &sa, NULL) < 0) {
        exit(1);
    }

    if ((pid = fork()) < 0) {
        exit(1); // fork failed, exit
    } else if (pid > 0) {
        exit(0); // parent (the first child): exit
    }

    // Second child: continue.
    // The second child is no longer a session leader.
    umask(0);
    if (rl.rlim_max == RLIM_INFINITY) {
        rl.rlim_max = 1024;
    }

    for (i = 0; i < (int)rl.rlim_max; ++i) { // close all open file descriptors
        close(i);
    }

    //chdir("/tmp"); // change the working directory to /tmp
    return;
}


int main()
{
    init_daemon(); // daemon initialization
    write_to_log("Entering main()\n");
    signal(SIGHUP, SIG_IGN);
    signal(SIGTSTP, SIG_IGN);
    signal(SIGTTOU, SIG_IGN);
    signal(SIGTTIN, SIG_IGN);
    init();
    check(NULL);
    write_to_log("Leaving main()\n\n");
    return 0;
}
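
The last part of check_programs() above decides what traffic_cop does after probing server_lockfile. That decision can be restated as a small pure function, which makes the "tolerate one miss, then kill the manager group" rule easier to see. This is only a sketch of the control flow in the listing: CopState, CopAction and on_server_probe are hypothetical names that do not appear in the original code, and lock_is_free stands in for `err > 0` after server_lf.Open().

```cpp
#include <cassert>

// Hypothetical names: CopState, CopAction and on_server_probe are not part
// of the original listing; they only restate its control flow.
struct CopState {
    int server_not_found = 0; // misses counted since the last manager kill
};

enum class CopAction { Wait, HeartbeatServer, KillManagerGroup };

// lock_is_free mirrors `err > 0` after server_lf.Open(): nobody holds
// server_lockfile, so no traffic_server appears to be running.
CopAction on_server_probe(CopState& state, bool lock_is_free) {
    assert(state.server_not_found >= 0);
    if (!lock_is_free) {
        // The server holds its lockfile: go on to the heartbeat test.
        return CopAction::HeartbeatServer;
    }
    state.server_not_found += 1;
    if (state.server_not_found > 1) {
        // Second miss: the manager failed to restart the server, so the
        // listing kills the whole manager process group.
        state.server_not_found = 0;
        return CopAction::KillManagerGroup;
    }
    // First miss: give traffic_manager a chance to respawn the server.
    return CopAction::Wait;
}
```

As in the listing, a successful probe does not reset the counter, so two misses separated by healthy rounds still lead to the kill. Resetting the counter on success would be one possible variation.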

5.traffic_manager.cpp

#include "lock_and_kill.h"
#include "log.h"
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <string.h>
#include <stdio.h>

#define    NOWARN_UNUSED(x)    (void)(x)
static char manager_lockfile[4096] = "manager_lockfile";
static char server_lockfile[4096] = "server_lockfile";
static int server_failures = 0;
static int killsig = SIGKILL;
static int coresig = 0;
static char server_binary[4096] = "server_binary";
static const int sleep_time = 10;               // 10 sec
static const int manager_timeout = 3 * 60;      //  3 min
static const int server_timeout = 3 * 60;       //  3 min
static const int kill_timeout = 1 * 60;         //  1 min

static void sig_alarm_warn(int signum = 0)
{
    NOWARN_UNUSED(signum);
    alarm(kill_timeout);
}


static void sig_fatal(int signum)
{
    NOWARN_UNUSED(signum);
    abort();
}


static void set_alarm_warn()
{
    struct sigaction action;
    action.sa_handler = sig_alarm_warn;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void set_alarm_death()
{
    struct sigaction action;
    action.sa_handler = sig_fatal;
    sigemptyset(&action.sa_mask);
    action.sa_flags = 0;
    sigaction(SIGALRM, &action, NULL);
}

static void sig_child(int signum)
{
    NOWARN_UNUSED(signum);
    pid_t pid = 0;
    int status = 0;
    for (;;) {
        pid = waitpid(WAIT_ANY, &status, WNOHANG);

        if (pid <= 0) {
            break;
        }
        // TSqa03086 - We can not log the child status signal from
        //   the signal handler since syslog can deadlock.  Record
        //   the pid and the status in a global for logging
        //   next time through the event loop.  We will occasionally
        //   lose some information if we get two sig childs in rapid
        //   succession
        // child_pid = pid;
        // child_status = status;
    }
}

static void safe_kill(const char* lockfile_name, const char* pname, bool group)
{
    Lockfile lockfile(lockfile_name);
    write_to_log("Entering safe_kill\n");
    set_alarm_warn();
    alarm(kill_timeout);

    if (group == true) {
        lockfile.KillGroup(killsig, coresig, pname);
    } else {
        lockfile.Kill(killsig, coresig, pname);
    }
    alarm(0);
    set_alarm_death();
    write_to_log("Leaving safe_kill\n\n");
}

static void spawn_server()
{
    int err;
    write_to_log("--------------Entering spawn_server()!\n\n");
    err = fork();
    if (err == 0) {
        // execv() needs a NULL-terminated argv whose first entry is the
        // program name; passing NULL as argv is not portable.
        char* const argv[] = { server_binary, NULL };
        err = execv(server_binary, argv);

        // Only reached if execv() failed.
        write_to_log("--------------somehow execv failed!\n");
        exit(1);
    } else if (err == -1) {
        write_to_log("--------------unable to fork server !\n");
        exit(1);
    }

    server_failures = 0;
    write_to_log("--------------Leaving spawn_server()!\n\n");
}


void check_server()
{
    int err;
    pid_t holding_pid;
    Lockfile server_lf(server_lockfile);
    err = server_lf.Get(&holding_pid);

    if (err == 0) {
        // Some process claims the lock; check that it really exists.
        if (kill(holding_pid, 0) == -1) {
            ink_killall(server_binary, killsig);
            sleep(1);
            err = server_lf.Open(&holding_pid);
        }
    }

    if (err > 0) {
        // Nobody holds server_lockfile: clean up and start a new server.
        server_lf.Close();
        safe_kill(server_lockfile, server_binary, false);
        spawn_server();
    }
}


int main()
{
    pid_t holding_pid = 0;
    Lockfile manager_lf(manager_lockfile);
    manager_lf.Get(&holding_pid);

    while (1) {
        char buf[100];
        snprintf(buf, sizeof(buf), "----------------traffic_manager is running, pid:'%d'!\n", getpid());
        write_to_log(buf);

        printf("----------------traffic_manager is running, pid: %d\n", getpid());

        sleep(5);
        int c = rand() % 10;

        if (c == 1) { // simulate a failure of the manager process
            write_to_log("----------------traffic_manager has an exception and exits!\n");
            exit(1);
        } else { // check the server process
            check_server();
        }
    }
}
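
Both check_server() and check_programs() rely on kill(holding_pid, 0) to ask whether the process recorded in a lockfile still exists. Signal 0 delivers nothing; the call only performs the existence and permission checks. The sketch below isolates that probe. process_alive and probe_sees_reaped_child_as_dead are hypothetical helper names, and treating EPERM as "alive" is an assumption of this sketch (the listing above treats any -1 return as "not running").

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: true if a process with this pid currently exists.
bool process_alive(pid_t pid) {
    if (pid <= 0) {
        return false; // 0 and negative values address process groups
    }
    if (kill(pid, 0) == 0) {
        return true; // the process exists and we may signal it
    }
    // EPERM: the process exists but belongs to another user.
    // ESRCH: no such process.
    return errno == EPERM;
}

// Demo helper: fork a child that exits at once, reap it, then check that
// the probe reports it as gone. An unreaped (zombie) child would still
// count as existing, which is why the waitpid() call matters.
bool probe_sees_reaped_child_as_dead() {
    pid_t child = fork();
    if (child < 0) {
        return false; // fork failed
    }
    if (child == 0) {
        _exit(0);
    }
    int status = 0;
    pid_t reaped = waitpid(child, &status, 0);
    assert(reaped == child);
    return !process_alive(child);
}
```

A pid read from a stale lockfile may have been reused by an unrelated process, so a probe like this can only say that some process with that pid exists. That is one reason the listings combine it with the lock on the file itself.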

6.traffic_server.cpp

#include "log.h"
#include "lock_and_kill.h"
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>


static char server_lockfile[4096] = "server_lockfile";

int main()
{
    // Take server_lockfile so that traffic_manager and traffic_cop can see
    // that the server is running.
    pid_t holding_pid = 0;
    Lockfile server_lf(server_lockfile);
    server_lf.Get(&holding_pid);

    while (1) {
        char buf[100];
        snprintf(buf, sizeof(buf), "==============traffic_server is running, pid:'%d'!\n", getpid());
        write_to_log(buf);
        sleep(5);
        int c = rand() % 100;

        if (c < 30) { // simulate a failure of the server process
            write_to_log("=================traffic_server has an exception and exits!\n");
            exit(1);
        }
    }
    return 0;
}
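
The server above does nothing with its lockfile except hold it, so it may help to see what "holding" could look like at the system-call level. The following is a minimal sketch, assuming the lock is a POSIX fcntl() record lock and that the holder writes its pid into the file, as the description of Get() earlier suggests. acquire_pid_lock and read_lock_pid are hypothetical names; this is not the Lockfile class from lock_and_kill.h.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: take an exclusive lock on `path` and record our pid
// in it. Returns the open descriptor (keep it open to keep the lock), or -1
// if the file cannot be opened or another process holds the lock.
int acquire_pid_lock(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    // Set close-on-exec so a process started with exec() does not inherit
    // the descriptor.
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;    // exclusive lock
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;           // 0 means "to end of file": the whole file
    if (fcntl(fd, F_SETLK, &lock) < 0) {
        close(fd);            // another process holds the lock
        return -1;
    }
    char buf[32];
    int n = snprintf(buf, sizeof(buf), "%d\n", (int)getpid());
    if (ftruncate(fd, 0) < 0 || write(fd, buf, n) != n) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper for the watching process: read the recorded pid.
// Do not call this in the lock holder itself: POSIX drops a process's
// record locks on a file when it closes any descriptor for that file.
pid_t read_lock_pid(const char* path) {
    FILE* f = fopen(path, "r");
    if (f == nullptr) {
        return -1;
    }
    int pid = -1;
    if (fscanf(f, "%d", &pid) != 1) {
        pid = -1;
    }
    fclose(f);
    return (pid_t)pid;
}
```

Because the kernel releases the record lock when the holder exits or crashes, a watcher that manages to take the lock can conclude that the previous holder is gone, which matches the meaning of a positive return value from Open() described earlier.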

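The notes above were written during some earlier study of Traffic Server. I hope they help anyone interested, and corrections are welcome. This is only a brief analysis of how the Traffic Server processes control one another; much of the test code is simplified (the heartbeat checks, for example), as noted in the code.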