.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH epoll 7 (date) "Linux man-pages (unreleased)"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.nf
.B #include <sys/epoll.h>
.fi
.SH DESCRIPTION
The
.B epoll
API performs a similar task to
.BR poll (2):
monitoring multiple file descriptors to see if I/O is possible on any of them.
The
.B epoll
API can be used either as an edge-triggered or a level-triggered
interface and scales well to large numbers of watched file descriptors.
.P
The central concept of the
.B epoll
API is the
.B epoll
.IR instance ,
an in-kernel data structure which, from a user-space perspective,
can be considered as a container for two lists:
.IP \[bu] 3
The
.I interest
list (sometimes also called the
.B epoll
set): the set of file descriptors that the process has registered
an interest in monitoring.
.IP \[bu]
The
.I ready
list: the set of file descriptors that are "ready" for I/O.
The ready list is a subset of
(or, more precisely, a set of references to)
the file descriptors in the interest list.
The ready list is dynamically populated
by the kernel as a result of I/O activity on those file descriptors.
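.P
The following system calls are provided to
create and manage an
.B epoll
instance:
.IP \[bu] 3
.BR epoll_create (2)
creates a new
.B epoll
instance and returns a file descriptor referring to that instance.
(The more recent
.BR epoll_create1 (2)
extends the functionality of
.BR epoll_create (2).)
.IP \[bu]
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2),
which adds items to the interest list of the
.B epoll
instance.
.IP \[bu]
.BR epoll_wait (2)
waits for I/O events,
blocking the calling thread if no events are currently available.
(This system call can be thought of as fetching items from
the ready list of the
.B epoll
instance.)
.\"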
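.P
The three calls can be combined as in the following minimal sketch.
The helper name \fIwait_readable()\fP is an assumption made for illustration;
it is not part of the
.B epoll
API, and a real program would normally keep one
.B epoll
instance open rather than create one per call.
.P
.in +4n
.EX
```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of the epoll API): report whether fd
   becomes readable within timeout_ms milliseconds.
   Returns 1 if ready, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    epfd = epoll_create1(0);        /* create the epoll instance */
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);                /* fd could not join the interest list */
        return -1;
    }

    /* Fetch at most one item from the ready list. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```
.EE
.in
.P
Called on the read end of an empty pipe with a zero timeout,
the helper should return 0; once a byte has been written it should return 1.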
.SS Level-triggered and edge-triggered
The
.B epoll
event distribution interface is able to behave both as edge-triggered
(ET) and as level-triggered (LT).
The difference between the two mechanisms
can be described as follows.
Suppose that
this scenario happens:
.IP (1) 5
The file descriptor that represents the read side of a pipe
.RI ( rfd )
is registered on the
.B epoll
instance.
.IP (2)
A pipe writer writes 2\ kB of data on the write side of the pipe.
.IP (3)
A call to
.BR epoll_wait (2)
is done that will return
.I rfd
as a ready file descriptor.
.IP (4)
The pipe reader reads 1\ kB of data from
.IR rfd .
.IP (5)
A call to
.BR epoll_wait (2)
is done.
.P
If the
.I rfd
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
(edge-triggered)
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer;
meanwhile the remote peer might be expecting a response based on the
data it already sent.
The reason for this is that edge-triggered mode
delivers events only when changes occur on the monitored file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.I rfd
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
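.P
The scenario can be replayed with a short sketch such as the following.
The function name \fIsecond_wait_result()\fP is hypothetical,
and a zero timeout is passed to
.BR epoll_wait (2)
so that the edge-triggered case reports "no events" instead of blocking.
.P
.in +4n
.EX
```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: replay steps (1) to (5) on a fresh pipe and return
   the result of the second epoll_wait() call, or -1 on setup failure. */
int second_wait_result(int edge_triggered)
{
    char buf[2048];
    struct epoll_event ev, out;
    int pfd[2], epfd, n = -1;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    /* (1) Register the read end of the pipe. */
    ev.events = EPOLLIN;
    if (edge_triggered)
        ev.events |= EPOLLET;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto close_all;

    /* (2) The writer sends 2 kB. */
    memset(buf, 'x', sizeof(buf));
    if (write(pfd[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        goto close_all;

    /* (3) The first wait reports the read end as ready. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto close_all;

    /* (4) The reader consumes only 1 kB. */
    if (read(pfd[0], buf, 1024) != 1024)
        goto close_all;

    /* (5) The second wait: 1 kB is still buffered. */
    n = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```
.EE
.in
.P
With a nonzero argument (edge-triggered) the function is expected to return 0,
because no new write occurred after step (3);
with a zero argument (level-triggered) it is expected to return 1.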
.P
An application that employs the
.B EPOLLET
flag should use nonblocking file descriptors to avoid having a blocking
read or write starve a task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an edge-triggered
.RB ( EPOLLET )
interface is as follows:
.IP (1) 5
with nonblocking file descriptors; and
.IP (2)
by waiting for an event only after
.BR read (2)
or
.BR write (2)
return
.BR EAGAIN .
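.P
The two rules might be implemented along the following lines.
Both helper names, \fIset_nonblocking()\fP and \fIdrain_fd()\fP,
are assumptions made for this sketch,
which only covers the read side and simply discards the data it reads.
.P
.in +4n
.EX
```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Rule (1): switch fd to nonblocking mode.  Returns 0, or -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Rule (2): keep reading until read() fails with EAGAIN; only then is it
   safe to go back to epoll_wait().  Returns the number of bytes consumed,
   or -1 on any other error.  *eof is set to 1 on end of file. */
long drain_fd(int fd, int *eof)
{
    char buf[4096];
    long total = 0;

    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;             /* a real program would process buf */
            continue;
        }
        if (n == 0) {
            *eof = 1;               /* peer closed its end */
            return total;
        }
        if (errno == EINTR)
            continue;               /* interrupted, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;           /* input exhausted for now */
        return -1;
    }
}
```
.EE
.in
.P
A descriptor that has been drained this way will produce a new
edge-triggered event when further data arrives.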
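.P
By contrast, when used as a level-triggered interface
(the default, when
.B EPOLLET
is not specified),
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
.P
Since even with edge-triggered
.BR epoll ,
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .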
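.P
The one-shot behavior and the rearm step can be observed with a sketch like
the one below.
The function name \fIoneshot_demo()\fP is hypothetical;
zero timeouts are used so that no call blocks.
.P
.in +4n
.EX
```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: store the results of three zero-timeout
   epoll_wait() calls in res[0..2].
   Returns 0 on success, -1 on setup failure. */
int oneshot_demo(int res[3])
{
    struct epoll_event ev, out;
    int pfd[2], epfd, rc = -1;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto close_all;
    if (write(pfd[1], "x", 1) != 1)
        goto close_all;

    /* The event is delivered once; the descriptor is then disabled. */
    res[0] = epoll_wait(epfd, &out, 1, 0);

    /* Nothing is reported, although the byte is still unread. */
    res[1] = epoll_wait(epfd, &out, 1, 0);

    /* Rearm with EPOLL_CTL_MOD; the pending data is reported again. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == -1)
        goto close_all;
    res[2] = epoll_wait(epfd, &out, 1, 0);
    rc = 0;

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```
.EE
.in
.P
The expected results are 1, 0, and 1.
In a multithreaded server, rearming only after a worker has finished with the
descriptor keeps two threads from handling the same descriptor at once.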
.P
If multiple threads
(or processes, if child processes have inherited the
.B epoll
file descriptor across
.BR fork (2))
are blocked in
.BR epoll_wait (2)
waiting on the same epoll file descriptor and a file descriptor
in the interest list that is marked for edge-triggered
.RB ( EPOLLET )
notification becomes ready,
just one of the threads (or processes) is awoken from
.BR epoll_wait (2).
This provides a useful optimization for avoiding "thundering herd" wake-ups
in some scenarios.
.\"
.SS Interaction with autosleep
If the system is in
.B autosleep
mode via
.I /sys/power/autosleep
and an event happens which wakes the device from sleep, the device
driver will keep the device awake only until that event is queued.
To keep the device awake until the event has been processed,
it is necessary to use the
.BR epoll_ctl (2)
.B EPOLLWAKEUP
flag.
.P
When the
.B EPOLLWAKEUP
flag is set in the
.B events
field for a
.IR "struct epoll_event" ,
the system will be kept awake from the moment the event is queued,
through the
.BR epoll_wait (2)
call which returns the event until the subsequent
.BR epoll_wait (2)
call.
If the event should keep the system awake beyond that time,
then a separate
.I wake_lock
should be taken before the second
.BR epoll_wait (2)
call.
.SS /proc interfaces
The following interfaces can be used to limit the amount of
kernel memory consumed by epoll:
.\" Following was added in Linux 2.6.28, but then removed in Linux 2.6.29
.\" .TP
.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
.\" This specifies an upper limit on the number of epoll instances
.\" that can be created per real user ID.
.TP
.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
This specifies a limit on the total number of
file descriptors that a user can register across
all epoll instances on the system.
The limit is per real user ID.
Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
and roughly 160 bytes on a 64-bit kernel.
Presently,
.\" Linux 2.6.29 (in Linux 2.6.28, the default was 1/32 of lowmem)
the default value for
.I max_user_watches
is 1/25 (4%) of the available low memory,
divided by the registration cost in bytes.
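.P
As a rough illustration, the limit can be read from the file and compared
with the rule of thumb given above.
Both function names are hypothetical, and the estimate deliberately ignores
any rounding the kernel applies, so it should be treated as an approximation.
.P
.in +4n
.EX
```c
#include <stdio.h>

/* Hypothetical helper: return the current limit,
   or -1 if the file cannot be read or parsed. */
long read_max_user_watches(void)
{
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical estimate of the default: 1/25 of low memory divided by
   the per-registration cost (about 90 or 160 bytes, as stated above). */
long long estimate_default_watches(long long lowmem_bytes, int cost_bytes)
{
    return lowmem_bytes / 25 / cost_bytes;
}
```
.EE
.in
.P
For example, 1 GiB of low memory and a cost of 160 bytes give
1073741824 / 25 / 160, that is, about 268435 registrations.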
.SS Example for suggested usage
While the usage of
.B epoll
when employed as a level-triggered interface does have the same
semantics as
.BR poll (2),
the edge-triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example, listener is a
nonblocking socket on which
.BR listen (2)
has been called.
The function
.I do_use_fd()
uses the new ready file descriptor until
.B EAGAIN
is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
.BR EAGAIN ,
record its current state so that at the next call to
.I do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.P
.in +4n
.EX
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
struct sockaddr_storage addr;   /* peer address filled in by accept() */
socklen_t addrlen;
int listen_sock, conn_sock, nfds, epollfd, n;
\&
/* Code to set up listening socket, \[aq]listen_sock\[aq],
   (socket(), bind(), listen()) omitted. */
epollfd = epoll_create1(0);
if (epollfd == \-1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}
ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}
\&
for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
    if (nfds == \-1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }
    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            addrlen = sizeof(addr);
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &addr, &addrlen);
            if (conn_sock == \-1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == \-1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.EE
.in
.P
When used as an edge-triggered interface, for performance reasons, it is
possible to add the file descriptor inside the
.B epoll
interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
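.P
A sketch of this single registration on one end of a UNIX domain socket pair
follows.
The function name \fIevents_after_peer_write()\fP is hypothetical;
zero timeouts are used so that no call blocks.
.P
.in +4n
.EX
```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: register one socket once for both directions in
   edge-triggered mode and return the event mask reported after the peer
   has written.  Returns 0 on failure. */
unsigned int events_after_peer_write(void)
{
    struct epoll_event ev, out;
    unsigned int mask = 0;
    int sv[2], epfd;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_socks;

    /* One EPOLL_CTL_ADD covers both reading and writing. */
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == -1)
        goto close_all;

    /* A fresh socket is writable, so the first wait reports EPOLLOUT. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto close_all;

    /* Data from the peer is a new edge; no EPOLL_CTL_MOD is needed. */
    if (write(sv[1], "x", 1) != 1)
        goto close_all;
    if (epoll_wait(epfd, &out, 1, 0) == 1)
        mask = out.events;

close_all:
    close(epfd);
close_socks:
    close(sv[0]);
    close(sv[1]);
    return mask;
}
```
.EE
.in
.P
The returned mask is expected to contain both
.B EPOLLIN
and
.BR EPOLLOUT ,
since the socket is readable and still writable when the second event is
reported.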
.SS Questions and answers
.IP (1) 5
What is the key used to distinguish the file descriptors registered in an
interest list?
.IP
The key is the combination of the file descriptor number and
the open file description
(also known as an "open file handle",
the kernel's internal representation of an open file).
.IP (2)
What happens if you register the same file descriptor on an
.B epoll
instance twice?
.IP
You will probably get
.BR EEXIST .
However, it is possible to add a duplicate
.RB ( dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD )
file descriptor to the same
.B epoll
instance.
.\" But a file descriptor duplicated by fork(2) can't be added to the
.\" set, because the [file *, fd] pair is already in the epoll set.
.\" That is a somewhat ugly inconsistency. On the one hand, a child process
.\" cannot add the duplicate file descriptor to the epoll set. (In every
.\" other case that I can think of, file descriptors duplicated by fork have
.\" similar semantics to file descriptors duplicated by dup() and friends.) On
.\" the other hand, the very fact that the child has a duplicate of the
.\" file descriptor means that even if the parent closes its file descriptor,
.\" then epoll_wait() in the parent will continue to receive notifications for
.\" that file descriptor because of the duplicated file descriptor in the child.
.\"
.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
.\" "epoll design problems with common fork/exec patterns"
.\"
.\" mtk, Feb 2008
This can be a useful technique for filtering events,
if the duplicate file descriptors are registered with different
.I events
masks.
.IP (3)
Can two
.B epoll
instances wait for the same file descriptor?
If so, are events reported to both
.B epoll
file descriptors?
.IP
Yes, and events would be reported to both.
However, careful programming may be needed to do this correctly.
.IP (4)
Is the
.B epoll
file descriptor itself poll/epoll/selectable?
.IP
Yes.
If an
.B epoll
file descriptor has events waiting, then it will
be indicated as being readable.
.IP (5)
What happens if one attempts to put an
.B epoll
file descriptor into its own file descriptor set?
.IP
The
.BR epoll_ctl (2)
call fails
.RB ( EINVAL ).
However, you can add an
.B epoll
file descriptor inside another
.B epoll
file descriptor set.
.IP (6)
Can I send an
.B epoll
file descriptor over a UNIX domain socket to another process?
.IP
Yes, but it does not make sense to do this, since the receiving process
would not have copies of the file descriptors in the interest list.
.IP (7)
Will closing a file descriptor cause it to be removed from all
.B epoll
interest lists?
.IP
Yes, but be aware of the following point.
A file descriptor is a reference to an open file description (see
.BR open (2)).
Whenever a file descriptor is duplicated via
.BR dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD ,
or
.BR fork (2),
a new file descriptor referring to the same open file description is
created.
An open file description continues to exist until all
file descriptors referring to it have been closed.
.IP
A file descriptor is removed from an
interest list only after all the file descriptors referring to the underlying
open file description have been closed.
This means that even after a file descriptor that is part of an
interest list has been closed,
events may be reported for that file descriptor if other file
descriptors referring to the same underlying file description remain open.
To prevent this happening,
the file descriptor must be explicitly removed from the interest list (using
.BR epoll_ctl (2)
.BR EPOLL_CTL_DEL )
before it is duplicated.
Alternatively,
the application must ensure that all file descriptors are closed
(which may be difficult if file descriptors were duplicated
behind the scenes by library functions that used
.BR dup (2)
or
.BR fork (2)).
.IP (8)
If more than one event occurs between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.IP
They will be combined.
.IP (9)
Does an operation on a file descriptor affect the
already collected but not yet reported events?
.IP
You can do two operations on an existing file descriptor.
Remove would be meaningless for
this case.
Modify will reread available I/O.
.IP (10)
Do I need to continuously read/write a file descriptor
until
.B EAGAIN
when using the
.B EPOLLET
flag (edge-triggered behavior)?
.IP
Receiving an event from
.BR epoll_wait (2)
should suggest to you that such
file descriptor is ready for the requested I/O operation.
You must consider it ready until the next (nonblocking)
read/write yields
.BR EAGAIN .
When and how you will use the file descriptor is entirely up to you.
.IP
For packet/token-oriented files (e.g., datagram socket,
terminal in canonical mode),
the only way to detect the end of the read/write I/O space
is to continue to read/write until
.BR EAGAIN .
.IP
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be detected by
checking the amount of data read from / written to the target file
descriptor.
For example, if you call
.BR read (2)
by asking to read a certain amount of data and
.BR read (2)
returns a lower number of bytes, you
can be sure of having exhausted the read I/O space for the file
descriptor.
The same is true when writing using
.BR write (2).
(Avoid this latter technique if you cannot guarantee that
the monitored file descriptor always refers to a stream-oriented file.)
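.P
Some of the answers above can be checked with a small experiment.
The following sketch is only illustrative:
the names \fItry_add()\fP and \fIqa_demo()\fP are hypothetical,
and a pipe stands in for whatever descriptor an application would monitor.
.P
.in +4n
.EX
```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: add fd for EPOLLIN.
   Returns 0 on success, otherwise the errno value. */
static int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1 ? errno : 0;
}

/* Hypothetical demo.  On success (return value 0):
   res[0]  adding the same descriptor twice            (question 2)
   res[1]  adding a dup(2) of that descriptor          (question 2)
   res[2]  adding the epoll descriptor to itself       (question 5)
   res[3]  epoll_wait() result after the registered descriptor was
           closed while a duplicate remained open      (question 7) */
int qa_demo(int res[4])
{
    struct epoll_event out;
    int pfd[2], epfd, dupfd, rc = -1;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    dupfd = dup(pfd[0]);
    if (epfd == -1 || dupfd == -1 || try_add(epfd, pfd[0]) != 0)
        goto cleanup;

    res[0] = try_add(epfd, pfd[0]);
    res[1] = try_add(epfd, dupfd);
    res[2] = try_add(epfd, epfd);

    /* Drop the duplicate's own registration, then close the original.
       Its registration survives because dupfd keeps the open file
       description alive. */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, dupfd, NULL) == -1)
        goto cleanup;
    close(pfd[0]);
    pfd[0] = -1;
    if (write(pfd[1], "x", 1) != 1)
        goto cleanup;
    res[3] = epoll_wait(epfd, &out, 1, 0);
    rc = 0;

cleanup:
    if (pfd[0] != -1)
        close(pfd[0]);
    close(pfd[1]);
    close(dupfd);
    close(epfd);
    return rc;
}
```
.EE
.in
.P
The expected results are
.BR EEXIST ,
0,
.BR EINVAL ,
and 1.
In the last case the reported event still carries the number of the
descriptor that has already been closed.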
.SS Possible pitfalls and ways to avoid them
.IP \[bu] 3
.B Starvation (edge-triggered)
.IP
If there is a large amount of I/O space,
it is possible that by trying to drain
it, the other files will not get processed, causing starvation.
(This problem is not specific to
.BR epoll .)
.IP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for file descriptors that are already ready.
.IP \[bu]
.B If using an event cache...
.IP
If you use an event cache or store all the file descriptors returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
its closure dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the file descriptor for event #13, then your
event cache might still say there are events waiting for that
file descriptor, causing confusion.
.IP
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete file descriptor 13 and
.BR close (2),
then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for file descriptor 13 in your batch processing,
you will discover the file descriptor had been
previously removed and there will be no confusion.
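.P
The cleanup-list approach could look roughly like the sketch below.
The structure \fIstruct conn\fP and the three function names are assumptions
made for illustration;
the sketch further assumes that each registration stores a pointer to its
record in the
.I data.ptr
field of its
.IR "struct epoll_event" .
.P
.in +4n
.EX
```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record, allocated with malloc() and
   stored in epoll_event.data.ptr. */
struct conn {
    int fd;
    int removed;              /* set once the descriptor is closed */
    struct conn *next_dead;   /* link in the cleanup list */
};

/* Called while handling one event of a batch to shut down another
   connection from the same batch.  The record is not freed here, so
   later events in the batch can still inspect it. */
void conn_remove(int epfd, struct conn *c, struct conn **dead)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_dead = *dead;
    *dead = c;
}

/* Returns 1 if the event should be handled, 0 if its connection was
   removed earlier in the same batch. */
int conn_is_live(const struct epoll_event *ev)
{
    const struct conn *c = ev->data.ptr;

    return !c->removed;
}

/* Called after the whole batch has been processed. */
void conn_reap(struct conn **dead)
{
    while (*dead != NULL) {
        struct conn *c = *dead;

        *dead = c->next_dead;
        free(c);
    }
}
```
.EE
.in
.P
An event loop would test \fIconn_is_live()\fP before handling each event and
call \fIconn_reap()\fP once after the loop over the batch has finished.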
.SH VERSIONS
Some other systems provide similar mechanisms;
for example,
FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH STANDARDS
Linux.
.SH HISTORY
Linux 2.5.44.
.\" Its interface should be finalized in Linux 2.5.66.
glibc 2.3.2.
.SH NOTES
The set of file descriptors that is being monitored via
an epoll file descriptor can be viewed via the entry for
the epoll file descriptor in the process's
.IR /proc/ pid /fdinfo
directory.
See
.BR proc (5)
for further details.
.P
The
.BR kcmp (2)
.B KCMP_EPOLL_TFD
operation can be used to test whether a file descriptor
is present in an epoll instance.
.SH SEE ALSO
.BR epoll_create (2),
.BR epoll_create1 (2),