How to use epoll? A complete example in C

来源:互联网 发布:詹姆斯全明星数据 编辑:程序博客网 时间:2024/05/19 01:09

How to use epoll? A complete example in C

Thursday, 2 June 2011 @ 1238 GMT by Mukund Sivaraman

Network servers are traditionally implemented using a separateprocess or thread per connection. For high performance applications thatneed to handle a very large number of clients simultaneously, thisapproach won't work well, because factors such as resource usage andcontext-switching time influence the ability to handle many clients at atime. An alternate method is to performnon-blockingI/O in a single thread, along with some readiness notificationmethod which tells you when you can read or write more data on asocket.

This article is an introduction to Linux's epoll(7)facility, which is the best readiness notification facility in Linux. Wewill write sample code for a complete TCP server implementation in C. Iassume you have C programming experience, know how to compile and runprograms on Linux, and can read manpages of the various C functions thatare used.

epoll was introduced in Linux 2.6, and is not available in otherUNIX-like operating systems. It provides a facility similar totheselect(2) andpoll(2)functions:

  • select(2) can monitor up to FD_SETSIZEnumber of descriptors at a time, typically a small number determined atlibc's compile time.
  • poll(2) doesn't have a fixed limit of descriptorsit can monitor at a time, but apart from other things, even we have toperform a linear scan of all the passed descriptors every time to checkreadiness notification, which is O(n) and slow.

epoll has no such fixed limits, and does not perform any linearscans. Hence it is able to perform better and handle a larger number ofevents.

An epoll instance is created by epoll_create(2)or epoll_create1(2) (they take different arguments),which return an epoll instance.epoll_ctl(2) is used toadd/remove descriptors to be watched on the epoll instance. To wait forevents on the watched set,epoll_wait(2) is used, whichblocks until events are available. Please see their manpages for moreinfo.

When descriptors are added to an epoll instance, they can be added intwo modes:level triggered andedgetriggered. When you use level triggered mode, and data isavailable for reading,epoll_wait(2) will always returnwith ready events. If you don't read the data completely, andcallepoll_wait(2) on the epoll instance watching thedescriptor again, it will return again with a ready event because datais available. In edge triggered mode, you will only get a readinessnotfication once. If you don't read the data fully, andcallepoll_wait(2) on the epoll instance watching thedescriptor again, it will block because the readiness event was alreadydelivered.

The epoll event structure that you passto epoll_ctl(2) is shown below. With every descriptorbeing watched, you can associate an integer or a pointer as userdata.

typedef union epoll_data{  void        *ptr;  int          fd;  __uint32_t   u32;  __uint64_t   u64;} epoll_data_t;struct epoll_event{  __uint32_t   events; /* Epoll events */  epoll_data_t data;   /* User data variable */};

Let's write code now. We'll implement a tiny TCP server that printseverything sent to the socket on standard output. We'll begin by writinga functioncreate_and_bind() which creates and binds aTCP socket:

static intcreate_and_bind (char *port){  struct addrinfo hints;  struct addrinfo *result, *rp;  int s, sfd;  memset (&hints, 0, sizeof (struct addrinfo));  hints.ai_family = AF_UNSPEC;     /* Return IPv4 and IPv6 choices */  hints.ai_socktype = SOCK_STREAM; /* We want a TCP socket */  hints.ai_flags = AI_PASSIVE;     /* All interfaces */  s = getaddrinfo (NULL, port, &hints, &result);  if (s != 0)    {      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (s));      return -1;    }  for (rp = result; rp != NULL; rp = rp->ai_next)    {      sfd = socket (rp->ai_family, rp->ai_socktype, rp->ai_protocol);      if (sfd == -1)        continue;      s = bind (sfd, rp->ai_addr, rp->ai_addrlen);      if (s == 0)        {          /* We managed to bind successfully! */          break;        }      close (sfd);    }  if (rp == NULL)    {      fprintf (stderr, "Could not bind\n");      return -1;    }  freeaddrinfo (result);  return sfd;}

create_and_bind() contains a standard code block fora portable way of getting a IPv4 or IPv6 socket. It acceptsaport argument as a string, whereargv[1] canbe passed. Thegetaddrinfo(3) function returns a bunchofaddrinfo structures inresult, which arecompatible with the hints passed in thehintsargument. Theaddrinfo struct looks like this:

struct addrinfo{  int              ai_flags;  int              ai_family;  int              ai_socktype;  int              ai_protocol;  size_t           ai_addrlen;  struct sockaddr *ai_addr;  char            *ai_canonname;  struct addrinfo *ai_next;};

We walk through the structures one by one and try creating socketsusing them, until we are able to both create and bind a socket. If wewere successful,create_and_bind() returns the socketdescriptor. If unsuccessful, it returns -1.

Next, let's write a function to make a socketnon-blocking. make_socket_non_blocking() setstheO_NONBLOCK flag on the descriptor passed inthesfd argument:

static intmake_socket_non_blocking (int sfd){  int flags, s;  flags = fcntl (sfd, F_GETFL, 0);  if (flags == -1)    {      perror ("fcntl");      return -1;    }  flags |= O_NONBLOCK;  s = fcntl (sfd, F_SETFL, flags);  if (s == -1)    {      perror ("fcntl");      return -1;    }  return 0;}

Now, on to the main() function of the program whichcontains the event loop. This is the bulk of the program:

#define MAXEVENTS 64intmain (int argc, char *argv[]){  int sfd, s;  int efd;  struct epoll_event event;  struct epoll_event *events;  if (argc != 2)    {      fprintf (stderr, "Usage: %s [port]\n", argv[0]);      exit (EXIT_FAILURE);    }  sfd = create_and_bind (argv[1]);  if (sfd == -1)    abort ();  s = make_socket_non_blocking (sfd);  if (s == -1)    abort ();  s = listen (sfd, SOMAXCONN);  if (s == -1)    {      perror ("listen");      abort ();    }  efd = epoll_create1 (0);  if (efd == -1)    {      perror ("epoll_create");      abort ();    }  event.data.fd = sfd;  event.events = EPOLLIN | EPOLLET;  s = epoll_ctl (efd, EPOLL_CTL_ADD, sfd, &event);  if (s == -1)    {      perror ("epoll_ctl");      abort ();    }  /* Buffer where events are returned */  events = calloc (MAXEVENTS, sizeof event);  /* The event loop */  while (1)    {      int n, i;      n = epoll_wait (efd, events, MAXEVENTS, -1);      for (i = 0; i < n; i++){  if ((events[i].events & EPOLLERR) ||              (events[i].events & EPOLLHUP) ||              (!(events[i].events & EPOLLIN)))    {              /* An error has occured on this fd, or the socket is not                 ready for reading (why were we notified then?) */      fprintf (stderr, "epoll error\n");      close (events[i].data.fd);      continue;    }  else if (sfd == events[i].data.fd)    {              /* We have a notification on the listening socket, which                 means one or more incoming connections. */              while (1)                {                  struct sockaddr in_addr;                  socklen_t in_len;                  int infd;                  char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];                  in_len = sizeof in_addr;                  infd = accept (sfd, &in_addr, &in_len);                  if (infd == -1)                    {                      if ((errno == EAGAIN) ||                          (errno == EWOULDBLOCK))                        {                          /* We have processed all incoming                             connections. */                          break;                        }                      else                        {                          perror ("accept");                          break;                        }                    }                  s = getnameinfo (&in_addr, in_len,                                   hbuf, sizeof hbuf,                                   sbuf, sizeof sbuf,                                   NI_NUMERICHOST | NI_NUMERICSERV);                  if (s == 0)                    {                      printf("Accepted connection on descriptor %d "                             "(host=%s, port=%s)\n", infd, hbuf, sbuf);                    }                  /* Make the incoming socket non-blocking and add it to the                     list of fds to monitor. */                  s = make_socket_non_blocking (infd);                  if (s == -1)                    abort ();                  event.data.fd = infd;                  event.events = EPOLLIN | EPOLLET;                  s = epoll_ctl (efd, EPOLL_CTL_ADD, infd, &event);                  if (s == -1)                    {                      perror ("epoll_ctl");                      abort ();                    }                }              continue;            }          else            {              /* We have data on the fd waiting to be read. Read and                 display it. We must read whatever data is available                 completely, as we are running in edge-triggered mode                 and won't get a notification again for the same                 data. */              int done = 0;              while (1)                {                  ssize_t count;                  char buf[512];                  count = read (events[i].data.fd, buf, sizeof buf);                  if (count == -1)                    {                      /* If errno == EAGAIN, that means we have read all                         data. So go back to the main loop. */                      if (errno != EAGAIN)                        {                          perror ("read");                          done = 1;                        }                      break;                    }                  else if (count == 0)                    {                      /* End of file. The remote has closed the                         connection. */                      done = 1;                      break;                    }                  /* Write the buffer to standard output */                  s = write (1, buf, count);                  if (s == -1)                    {                      perror ("write");                      abort ();                    }                }              if (done)                {                  printf ("Closed connection on descriptor %d\n",                          events[i].data.fd);                  /* Closing the descriptor will make epoll remove it                     from the set of descriptors which are monitored. */                  close (events[i].data.fd);                }            }        }    }  free (events);  close (sfd);  return EXIT_SUCCESS;}

main() firstcalls create_and_bind() which sets up the socket. Itthen makes the socket non-blocking, and thencallslisten(2). It then creates an epoll instanceinefd, to which it adds the listeningsocketsfd to watch for input events in an edge-triggeredmode.

The outer while loop is the main events loop. Itcalls epoll_wait(2), where the thread remains blockedwaiting for events. When events areavailable,epoll_wait(2) returns the events intheevents argument, which is a bunchofepoll_event structures.

The epoll instance in efd is continuously updated in theevent loop when we add new incoming connections to watch, and removeexisting connections when they die.

When events are available, they can be of three types:

  • Errors: When an error condition occurs, or theevent is not a notification about data available for reading, we simplyclose the associated descriptor. Closing the descriptor automaticallyremoves it from the watched set of epoll instanceefd.
  • New connections: When the listeningdescriptor sfd is ready for reading, it means one or morenew connections have arrived. While there are newconnections,accept(2) the connections, print a messageabout it, make the incoming socket non-blocking and add it to thewatched set of epoll instanceefd.
  • Client data: When data is available for reading onany of the client descriptors, we useread(2) to readthe data in pieces of 512 bytes in an inner while loop. This is becausewe have to read all the data that is available now, as we won't getfurther events about it as the descriptor is watched in edge-triggeredmode. The data which is read is written to stdout (fd=1)usingwrite(2). Ifread(2) returns 0,it means an EOF and we can close the client's connection. If -1 isreturned, anderrno is set toEAGAIN, it means that all datafor this event was read, and we can go back to the main loop.

That's that. It goes around and around in a loop, adding and removingdescriptors in the watched set.

Download the epoll-example.cprogram.

Update1: Level and edge triggered definitions wereerroneously reversed (though the code was correct). It was noticed byReddit userbodski. The article hasbeen corrected now. I should have proof-read it beforeposting. Apologies, and thank you for pointing out the mistake. :)

Update2: The code has been modified to runaccept(2) until it says it would block, so that ifmultiple connections have arrived, we accept all of them. It wasnoticed by Reddit usercpitchford. Thank youfor the comments. :)



accept():
在non-blocking模式下,如果返回值为-1,且errno == EAGAIN或errno == EWOULDBLOCK表示no connections没有新连接请求;

recv()/recvfrom():
在non-blocking模式下,如果返回值为-1,且errno == EAGAIN表示没有可接受的数据或很在接受尚未完成;

send()/sendto():
在non-blocking模式下,如果返回值为-1,且errno == EAGAIN或errno == EWOULDBLOCK表示没有可发送数据或数据发送正在进行没有完成。

read/write:
在non-blocking模式下,如果返回-1,且errno == EAGAIN表示没有可读写数据或可读写正在进行尚未完成。

connect():
在non-bloking模式下,如果返回-1,且errno = EINPROGRESS表示正在连接。