.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" %%%LICENSE_START(GPLv2+_SW_3_PARA)
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public
.\" License along with this manual; if not, see
.\" <http://www.gnu.org/licenses/>.
.\" %%%LICENSE_END
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH EPOLL 7 2021-03-22 "Linux" "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.nf
.B #include <sys/epoll.h>
.fi
.SH DESCRIPTION
The
.B epoll
API performs a similar task to
.BR poll (2):
monitoring multiple file descriptors to see if I/O is possible on any of them.
The
.B epoll
API can be used either as an edge-triggered or a level-triggered
interface and scales well to large numbers of watched file descriptors.
.PP
The central concept of the
.B epoll
API is the
.B epoll
.IR instance ,
an in-kernel data structure which, from a user-space perspective,
can be considered as a container for two lists:
.IP \(bu 2
The
.I interest
list (sometimes also called the
.B epoll
set): the set of file descriptors that the process has registered
an interest in monitoring.
.IP \(bu
The
.I ready
list: the set of file descriptors that are "ready" for I/O.
The ready list is a subset of
(or, more precisely, a set of references to)
the file descriptors in the interest list.
The ready list is dynamically populated
by the kernel as a result of I/O activity on those file descriptors.
.PP
The following system calls are provided to
create and manage an
.B epoll
instance:
.IP \(bu 2
.BR epoll_create (2)
creates a new
.B epoll
instance and returns a file descriptor referring to that instance.
(The more recent
.BR epoll_create1 (2)
extends the functionality of
.BR epoll_create (2).)
.IP \(bu
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2),
which adds items to the interest list of the
.B epoll
instance.
.IP \(bu
.BR epoll_wait (2)
waits for I/O events,
blocking the calling thread if no events are currently available.
(This system call can be thought of as fetching items from
the ready list of the
.B epoll
instance.)
.\"
.SS Level-triggered and edge-triggered
The
.B epoll
event distribution interface is able to behave both as edge-triggered
(ET) and as level-triggered (LT).
The difference between the two mechanisms
can be described as follows.
Suppose that
this scenario happens:
.IP 1. 3
The file descriptor that represents the read side of a pipe
.RI ( rfd )
is registered on the
.B epoll
instance.
.IP 2.
A pipe writer writes 2\ kB of data on the write side of the pipe.
.IP 3.
A call to
.BR epoll_wait (2)
is done that will return
.I rfd
as a ready file descriptor.
.IP 4.
The pipe reader reads 1\ kB of data from
.IR rfd .
.IP 5.
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.I rfd
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
(edge-triggered)
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer;
meanwhile the remote peer might be expecting a response based on the
data it already sent.
The reason for this is that edge-triggered mode
delivers events only when changes occur on the monitored file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.I rfd
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
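.PP
The five steps above can be replayed on a real pipe.
The following is a minimal sketch, not part of the original page;
the function name
.I second_wait_result()
is hypothetical, and the second
.BR epoll_wait (2)
is called with a zero timeout so that the demonstration returns 0
instead of blocking where a real reader would hang.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs steps 1-5 on a fresh pipe.
 * 'flags' is OR-ed into EPOLLIN when registering rfd: pass EPOLLET for
 * edge-triggered mode, 0 for level-triggered mode.
 * Returns the result of the step-5 epoll_wait(), or -1 on error. */
int second_wait_result(unsigned int flags)
{
    int pfd[2], epfd, n = -1;
    char buf[2048];
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    /* Step 1: register the read side of the pipe. */
    ev.events = EPOLLIN | flags;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto out_all;

    /* Step 2: the writer writes 2 kB. */
    memset(buf, 'x', sizeof(buf));
    if (write(pfd[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        goto out_all;

    /* Step 3: the first wait reports rfd as ready in both modes. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto out_all;

    /* Step 4: the reader consumes only 1 kB. */
    if (read(pfd[0], buf, 1024) != 1024)
        goto out_all;

    /* Step 5: 1 kB is still buffered, but nothing new has been written. */
    n = epoll_wait(epfd, &out, 1, 0);

out_all:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

Calling the helper with 0 should report one ready descriptor at step 5,
while calling it with
.B EPOLLET
should report none, which is the situation in which a blocking wait would stall.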
.PP
An application that employs the
.B EPOLLET
flag should use nonblocking file descriptors to avoid having a blocking
read or write starve a task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an edge-triggered
.RB ( EPOLLET )
interface is as follows:
.IP a) 3
with nonblocking file descriptors; and
.IP b)
by waiting for an event only after
.BR read (2)
or
.BR write (2)
return
.BR EAGAIN .
.PP
By contrast, when used as a level-triggered interface
(the default, when
.B EPOLLET
is not specified),
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
.PP
Since even with edge-triggered
.BR epoll ,
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.PP
If multiple threads
(or processes, if child processes have inherited the
.B epoll
file descriptor across
.BR fork (2))
are blocked in
.BR epoll_wait (2)
waiting on the same epoll file descriptor and a file descriptor
in the interest list that is marked for edge-triggered
.RB ( EPOLLET )
notification becomes ready,
just one of the threads (or processes) is awoken from
.BR epoll_wait (2).
This provides a useful optimization for avoiding "thundering herd" wake-ups
in some scenarios.
.\"
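.PP
The
.B EPOLLONESHOT
rearm cycle described earlier in this subsection can be sketched as follows.
This is an illustration under stated assumptions rather than a reference
implementation: the names
.I rearm_oneshot()
and
.I oneshot_demo()
are hypothetical, and zero timeouts are used so that nothing blocks.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: re-enable a one-shot descriptor after its event
 * has been handled. 'events' is the mask to wait for (e.g., EPOLLIN). */
int rearm_oneshot(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev;

    ev.events = events | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical demo: stores the results of three zero-timeout waits in
 * results[0..2]. Returns 0 on success, -1 on error. */
int oneshot_demo(int results[3])
{
    int pfd[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto out_all;
    if (write(pfd[1], "x", 1) != 1)
        goto out_all;

    results[0] = epoll_wait(epfd, &out, 1, 0);  /* event is delivered */
    results[1] = epoll_wait(epfd, &out, 1, 0);  /* descriptor is disabled */
    if (rearm_oneshot(epfd, pfd[0], EPOLLIN) == -1)
        goto out_all;
    results[2] = epoll_wait(epfd, &out, 1, 0);  /* unread byte seen again */
    rc = 0;

out_all:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

In a multithreaded server, the thread that received the event would
typically call the rearm helper only after it has finished with the
descriptor, so that no other thread is woken for it in the meantime.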
.SS Interaction with autosleep
If the system is in
.B autosleep
mode via
.I /sys/power/autosleep
and an event happens which wakes the device from sleep, the device
driver will keep the device awake only until that event is queued.
To keep the device awake until the event has been processed,
it is necessary to use the
.BR epoll_ctl (2)
.B EPOLLWAKEUP
flag.
.PP
When the
.B EPOLLWAKEUP
flag is set in the
.B events
field for a
.IR "struct epoll_event" ,
the system will be kept awake from the moment the event is queued,
through the
.BR epoll_wait (2)
call which returns the event until the subsequent
.BR epoll_wait (2)
call.
If the event should keep the system awake beyond that time,
then a separate
.I wake_lock
should be taken before the second
.BR epoll_wait (2)
call.
.SS /proc interfaces
The following interfaces can be used to limit the amount of
kernel memory consumed by epoll:
.\" Following was added in 2.6.28, but then removed in 2.6.29
.\" .TP
.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
.\" This specifies an upper limit on the number of epoll instances
.\" that can be created per real user ID.
.TP
.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
This specifies a limit on the total number of
file descriptors that a user can register across
all epoll instances on the system.
The limit is per real user ID.
Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
and roughly 160 bytes on a 64-bit kernel.
Currently,
.\" 2.6.29 (in 2.6.28, the default was 1/32 of lowmem)
the default value for
.I max_user_watches
is 1/25 (4%) of the available low memory,
divided by the registration cost in bytes.
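.PP
As a worked example of that formula, the sketch below computes the
approximate default for a given amount of low memory and reads the limit
actually in force.
Both function names are hypothetical, and the arithmetic only follows the
description above (the kernel's own rounding may differ slightly).

```c
#include <stdio.h>

/* Approximate default: 1/25 of low memory divided by the per-watch cost.
 * 'cost_bytes' is roughly 90 on 32-bit and 160 on 64-bit kernels. */
unsigned long long default_max_user_watches(unsigned long long lowmem_bytes,
                                            unsigned int cost_bytes)
{
    return lowmem_bytes / 25 / cost_bytes;
}

/* Reads the current limit from /proc; returns -1 if it is unavailable. */
long long read_max_user_watches(void)
{
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    long long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For instance, with 1\ GiB of low memory and a cost of 160 bytes,
the first function yields 268435 watches per real user ID.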
.SS Example for suggested usage
While the usage of
.B epoll
when employed as a level-triggered interface has the same
semantics as
.BR poll (2),
the edge-triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example, listener is a
nonblocking socket on which
.BR listen (2)
has been called.
The function
.I do_use_fd()
uses the new ready file descriptor until
.B EAGAIN
is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
.BR EAGAIN ,
record its current state so that at the next call to
.I do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.PP
.in +4n
.EX
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd;

/* Code to set up listening socket, \(aqlisten_sock\(aq,
   (socket(), bind(), listen()) omitted. */

epollfd = epoll_create1(0);
if (epollfd == \-1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
    if (nfds == \-1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &addr, &addrlen);
            if (conn_sock == \-1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                        &ev) == \-1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.EE
.in
.PP
When used as an edge-triggered interface, for performance reasons, it is
possible to add the file descriptor inside the
.B epoll
interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SS Questions and answers
.IP 0. 4
What is the key used to distinguish the file descriptors registered in an
interest list?
.IP
The key is the combination of the file descriptor number and
the open file description
(also known as an "open file handle",
the kernel's internal representation of an open file).
.IP 1.
What happens if you register the same file descriptor on an
.B epoll
instance twice?
.IP
You will probably get
.BR EEXIST .
However, it is possible to add a duplicate
.RB ( dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD )
file descriptor to the same
.B epoll
instance.
.\" But a file descriptor duplicated by fork(2) can't be added to the
.\" set, because the [file *, fd] pair is already in the epoll set.
.\" That is a somewhat ugly inconsistency. On the one hand, a child process
.\" cannot add the duplicate file descriptor to the epoll set. (In every
.\" other case that I can think of, file descriptors duplicated by fork have
.\" similar semantics to file descriptors duplicated by dup() and friends.) On
.\" the other hand, the very fact that the child has a duplicate of the
.\" file descriptor means that even if the parent closes its file descriptor,
.\" then epoll_wait() in the parent will continue to receive notifications for
.\" that file descriptor because of the duplicated file descriptor in the child.
.\"
.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
.\" "epoll design problems with common fork/exec patterns"
.\"
This can be a useful technique for filtering events,
if the duplicate file descriptors are registered with different
.I events
masks.
.IP 2.
Can two
.B epoll
instances wait for the same file descriptor?
If so, are events reported to both
.B epoll
file descriptors?
.IP
Yes, and events would be reported to both.
However, careful programming may be needed to do this correctly.
.IP 3.
Is the
.B epoll
file descriptor itself poll/epoll/selectable?
.IP
Yes.
If an
.B epoll
file descriptor has events waiting, then it will
be reported as readable.
.IP 4.
What happens if one attempts to put an
.B epoll
file descriptor into its own file descriptor set?
.IP
The
.BR epoll_ctl (2)
call fails
.RB ( EINVAL ).
However, you can add an
.B epoll
file descriptor inside another
.B epoll
file descriptor set.
.IP 5.
Can I send an
.B epoll
file descriptor over a UNIX domain socket to another process?
.IP
Yes, but it does not make sense to do this, since the receiving process
would not have copies of the file descriptors in the interest list.
.IP 6.
Will closing a file descriptor cause it to be removed from all
.B epoll
interest lists?
.IP
Yes, but be aware of the following point.
A file descriptor is a reference to an open file description (see
.BR open (2)).
Whenever a file descriptor is duplicated via
.BR dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD ,
or
.BR fork (2),
a new file descriptor referring to the same open file description is
created.
An open file description continues to exist until all
file descriptors referring to it have been closed.
.IP
A file descriptor is removed from an
interest list only after all the file descriptors referring to the underlying
open file description have been closed.
This means that even after a file descriptor that is part of an
interest list has been closed,
events may be reported for that file descriptor if other file
descriptors referring to the same underlying file description remain open.
To prevent this happening,
the file descriptor must be explicitly removed from the interest list (using
.BR epoll_ctl (2)
.BR EPOLL_CTL_DEL )
before it is duplicated.
Alternatively,
the application must ensure that all file descriptors are closed
(which may be difficult if file descriptors were duplicated
behind the scenes by library functions that used
.BR dup (2)
or
.BR fork (2)).
.IP 7.
If more than one event occurs between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.IP
They will be combined.
.IP 8.
Does an operation on a file descriptor affect the
already collected but not yet reported events?
.IP
You can do two operations on an existing file descriptor.
Remove would be meaningless for
this case.
Modify will reread available I/O.
.IP 9.
Do I need to continuously read/write a file descriptor
until
.B EAGAIN
when using the
.B EPOLLET
flag (edge-triggered behavior)?
.IP
Receiving an event from
.BR epoll_wait (2)
should suggest to you that the
file descriptor is ready for the requested I/O operation.
You must consider it ready until the next (nonblocking)
read/write yields
.BR EAGAIN .
When and how you will use the file descriptor is entirely up to you.
.IP
For packet/token-oriented files (e.g., datagram socket,
terminal in canonical mode),
the only way to detect the end of the read/write I/O space
is to continue to read/write until
.BR EAGAIN .
.IP
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be detected by
checking the amount of data read from / written to the target file
descriptor.
For example, if you call
.BR read (2)
by asking to read a certain amount of data and
.BR read (2)
returns a lower number of bytes, you
can be sure of having exhausted the read I/O space for the file
descriptor.
The same is true when writing using
.BR write (2).
(Avoid this latter technique if you cannot guarantee that
the monitored file descriptor always refers to a stream-oriented file.)
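.PP
The behavior described in answer 6 can be observed with a pipe whose read
end has been duplicated.
The sketch below is illustrative only; the function name
.I events_after_close()
is hypothetical.
It registers the read end, closes it while the duplicate stays open,
and then checks whether an event is still reported.
With a nonzero argument it first removes the descriptor using
.BR EPOLL_CTL_DEL ,
here done just before the
.BR close (2).

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns the number of events reported for a
 * registered descriptor after it has been closed while a dup() of it
 * remains open, or -1 on error. If 'del_first' is nonzero, the
 * descriptor is removed with EPOLL_CTL_DEL before being closed. */
int events_after_close(int del_first)
{
    int pfd[2], dupfd = -1, epfd = -1, n = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    dupfd = dup(pfd[0]);        /* second reference to the description */
    epfd = epoll_create1(0);
    if (dupfd == -1 || epfd == -1)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;
    if (del_first && epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], NULL) == -1)
        goto done;

    close(pfd[0]);              /* the description survives via dupfd */
    pfd[0] = -1;

    if (write(pfd[1], "x", 1) != 1)
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);

done:
    if (pfd[0] != -1)
        close(pfd[0]);
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    close(pfd[1]);
    return n;
}
```

Without the explicit removal, the event still arrives and carries the
number of the descriptor that was closed, which the application can no
longer use to deregister it.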
.SS Possible pitfalls and ways to avoid them
.TP
.B o Starvation (edge-triggered)
.PP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed, causing starvation.
(This problem is not specific to
.BR epoll .)
.PP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for file descriptors that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the file descriptors returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
its closure dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the file descriptor for event #13, then your
event cache might still say there are events waiting for that
file descriptor, causing confusion.
.PP
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete file descriptor 13 and
.BR close (2),
then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for file descriptor 13 in your batch processing,
you will discover the file descriptor had been
previously removed and there will be no confusion.
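.PP
The cleanup-list approach can be sketched as follows.
All names here
.RI ( "struct conn" ,
.IR conn_retire() ,
.IR process_batch() ,
.IR event_cache_demo() )
are hypothetical, and the sketch assumes that each registration stores a
pointer to its per-descriptor structure in
.IR data.ptr .

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

struct conn {
    int fd;
    bool removed;               /* set once the descriptor is closed */
    struct conn *next_cleanup;  /* link in the cleanup list */
};

static struct conn *cleanup_list;

/* Deregister and close a connection, but keep its structure alive
 * until the current batch of events has been fully processed. */
void conn_retire(int epfd, struct conn *c)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = true;
    c->next_cleanup = cleanup_list;
    cleanup_list = c;
}

/* Process one batch; 'handler' may retire any connection.
 * Returns the number of events that were actually handled. */
int process_batch(int epfd, struct epoll_event *events, int nfds,
                  void (*handler)(int, struct conn *))
{
    int handled = 0;

    for (int i = 0; i < nfds; i++) {
        struct conn *c = events[i].data.ptr;

        if (c->removed)         /* stale event for a retired descriptor */
            continue;
        handler(epfd, c);
        handled++;
    }
    while (cleanup_list != NULL) {      /* batch done: release entries */
        struct conn *c = cleanup_list;

        cleanup_list = c->next_cleanup;
        c->next_cleanup = NULL;         /* heap objects would be freed here */
    }
    return handled;
}

static struct conn demo[2];

static void retire_other(int epfd, struct conn *c)
{
    conn_retire(epfd, c == &demo[0] ? &demo[1] : &demo[0]);
}

/* Two readable pipes; handling whichever event comes first retires the
 * other connection. Error paths leak descriptors (demonstration only). */
int event_cache_demo(void)
{
    int p[2][2], epfd, nfds, handled = -1;
    struct epoll_event ev, events[2];

    epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    for (int i = 0; i < 2; i++) {
        if (pipe(p[i]) == -1)
            return -1;
        demo[i] = (struct conn) { .fd = p[i][0] };
        ev.events = EPOLLIN;
        ev.data.ptr = &demo[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[i][0], &ev) == -1 ||
            write(p[i][1], "x", 1) != 1)
            return -1;
    }
    nfds = epoll_wait(epfd, events, 2, 0);
    if (nfds == 2)
        handled = process_batch(epfd, events, nfds, retire_other);

    for (int i = 0; i < 2; i++) {
        if (!demo[i].removed)
            close(demo[i].fd);
        close(p[i][1]);
    }
    close(epfd);
    return handled;
}
```

In the demonstration both descriptors are ready, yet only one event is
handled, because the second one is recognized as belonging to a
connection that was retired earlier in the same batch.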
.SH VERSIONS
The
.B epoll
API was introduced in Linux kernel 2.5.44.
.\" Its interface should be finalized in Linux kernel 2.5.66.
Support was added to glibc in version 2.3.2.
.SH CONFORMING TO
The
.B epoll
API is Linux-specific.
Some other systems provide similar
mechanisms, for example, FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH NOTES
The set of file descriptors that is being monitored via
an epoll file descriptor can be viewed via the entry for
the epoll file descriptor in the process's
.IR /proc/[pid]/fdinfo
directory.
See
.BR proc (5)
for further details.
.PP
The
.BR kcmp (2)
.B KCMP_EPOLL_TFD
operation can be used to test whether a file descriptor
is present in an epoll instance.
.SH SEE ALSO
.BR epoll_create (2),
.BR epoll_create1 (2),
.BR epoll_ctl (2),
.BR epoll_wait (2),
.BR poll (2),
.BR select (2)