Documentation/networking/NAPI_HOWTO.txt

HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bounded by system capacity,
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

 Psize    Ipps       Tput     Rxint     Txint    Done     Ndone
 ---------------------------------------------------------------
   60    890000     409362        17     27622        7     6823
  128    758150     464364        21      9301       10     7738
  256    445632     774646        42     15507       21    12906
  512    232666     994445    241292     19147   241192     1062
 1024    119061    1000003    872519     19258   872511        0
 1440     85193    1000003    946576     19505   946569        0


Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Rxint" == receive interrupts seen
"Txint" == transmit completion interrupts seen
"Done" == The number of times that the poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load the more we could clean up the rx ring.
"Ndone" == is the converse of "Done". Note again, that the higher
the load the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from that table). So there is
a possibility that under low enough input, you get one poll call for each
input packet caused by a single interrupt each time. And if the system
can't handle an interrupt per packet ratio of 1, then it will just have to
chug along ....


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll() look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
        I)  what is known as Clear-on-read (COR).
        when you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move all to dev->poll()

        II) Clear-on-write (COW)
         i) you clear the status by writing a 1 in the bit-location you want.
                These are the majority of the NICs and work the best with NAPI.
                Put only receive events in dev->poll(); leave the rest in
                the old interrupt handler.
         ii) whatever you write in the status register clears every thing ;->
                Can't seem to find any supported by Linux which do this. If
                someone knows such a chip email us please.
                Move all to dev->poll()
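
The difference between the two ACK mechanisms can be sketched with a small
user-space model. This is only an illustration: the bit layout, the struct
and every function name below are made up for the example and do not
correspond to any real chip or kernel API. It shows why case II.i lets the
interrupt handler acknowledge everything except the receive events, while a
clear-on-read register leaves nothing behind for dev->poll() to see.

```c
#include <stdint.h>

/* Hypothetical status bits; real chips define their own layout. */
#define ST_RX      (1u << 0)   /* packet received */
#define ST_RXNOBUF (1u << 1)   /* packet arrived, no rx buffer free */
#define ST_TX      (1u << 2)   /* transmit completed */
#define ST_LINK    (1u << 3)   /* link state changed */

/* Model of a status/event register. */
struct w1c_reg {
        uint32_t bits;
};

/* Case II.i: a write clears exactly the bits that are set in 'mask'. */
static void w1c_write(struct w1c_reg *r, uint32_t mask)
{
        r->bits &= ~mask;
}

/*
 * What the interrupt handler can do on such hardware: acknowledge every
 * event except the receive ones, which stay pending for dev->poll().
 * Returns the status that was read.
 */
static uint32_t irq_ack_non_rx(struct w1c_reg *r)
{
        uint32_t status = r->bits;

        w1c_write(r, status & ~(ST_RX | ST_RXNOBUF));
        return status;
}

/* Case I: the read itself wipes every bit, rx events included. */
static uint32_t cor_read(struct w1c_reg *r)
{
        uint32_t status = r->bits;

        r->bits = 0;
        return status;
}
```

With the first kind of register the rx bit survives the interrupt handler;
with the second, whoever reads the register first must handle all events,
which is why everything has to move to dev->poll().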

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). We only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows for it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Add device to the poll list for this CPU; assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for device specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
Remove interface from the CPU poll list: it must be in the poll list
on current cpu. This primitive is called by dev->poll(), when
it completes its work. The device cannot be out of poll list at this
call, if it is then clearly it is a BUG(). You'll know ;->
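
How a), b), c) and e) fit together can be sketched with a tiny state model.
This is a user-space illustration under stated assumptions, not the kernel
implementation: struct sched_dev and the sched_* functions are hypothetical
stand-ins, and plain booleans are used where a real implementation has to
use an atomic operation so that only one CPU can win the _prep step.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few bits of device state that matter. */
struct sched_dev {
        bool running;       /* interface is up */
        bool sched;         /* "poll scheduled" state */
        bool on_poll_list;  /* linked into this CPU's poll list */
};

/* Models b): returns 1 only if the device is up and not yet scheduled. */
static int sched_prep(struct sched_dev *d)
{
        if (!d->running || d->sched)
                return 0;
        d->sched = true;
        return 1;
}

/* Models c): only valid after sched_prep() returned 1. */
static void sched_add_to_poll_list(struct sched_dev *d)
{
        d->on_poll_list = true;
}

/* Models a): simply the two halves glued together. */
static void sched_schedule(struct sched_dev *d)
{
        if (sched_prep(d))
                sched_add_to_poll_list(d);
}

/* Models e): leave the poll list and allow a later schedule again. */
static void sched_complete(struct sched_dev *d)
{
        d->on_poll_list = false;
        d->sched = false;
}
```

A second sched_prep() before sched_complete() returns 0, which is the
situation the interrupt handler further below reports as an interrupt
arriving while the device is already being polled.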

All these above methods are used below. So keep reading for clarity.

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
        Total number of packets cannot exceed dev->quota.

dev->poll() method is invoked by the top layer, the driver just sends if it
can to the stack the packet quantity requested.
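
The accounting rules above can be sketched in user space as follows. This is
a simplified model, not driver code: struct quota_dev and quota_poll() are
hypothetical names, the rx ring is reduced to a counter, and taking the
smaller of dev->quota and *budget as the work limit is an assumption of the
sketch (the text only requires that the total not exceed dev->quota).

```c
/* Hypothetical device model: just the two numbers the rules talk about. */
struct quota_dev {
        int quota;    /* packets this device may still send up the stack */
        int pending;  /* packets currently waiting on the rx ring */
};

/*
 * Model of a dev->poll() method.
 * Returns 0 when the ring was emptied ("done") and 1 when packets are
 * left because the limit was reached ("not done").
 */
static int quota_poll(struct quota_dev *dev, int *budget)
{
        int limit = dev->quota < *budget ? dev->quota : *budget;
        int received = 0;

        while (dev->pending > 0 && received < limit) {
                dev->pending--;   /* stands in for passing one skb up */
                received++;
        }

        /* The driver, not the core, does the decrementing. */
        dev->quota -= received;
        *budget -= received;

        return dev->pending > 0;
}
```

For example, with a quota of 16, a budget of 300 (an arbitrary example
value) and 40 packets on the ring, one call sends 16 packets, leaves quota
at 0 and budget at 284, and returns 1 so that the core calls again later.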

more on dev->poll() below after the interrupt changes are explained.

2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;



3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packet off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{

        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        int work_count = my_work_count;
        u32 status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE; /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;      /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
                acknowledge_ints_ASAP();

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }

                if (status & rx_interrupt) {
                        receive_packets(dev);
                }

                if (status & rx_nobufs) {
                        make_rx_buffs_avail();
                }

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

        /* keep looping only while there is no error and work remains */
        } while (!(status & error) && more_work_to_be_done);
        return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        u32 status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;         /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;      /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
/************************ start note *********************************/
                acknowledge_ints_ASAP();  /* don't ack rx and rxnobuff here */
/************************ end note *********************************/

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }
/************************ start note *********************************/
                if ((status & rx_interrupt) || (status & rx_nobufs)) {
                        if (netif_rx_schedule_prep(dev)) {

                                /* disable interrupts caused
                                 *      by arriving packets */
                                disable_rx_and_rxnobuff_ints();
                                /* tell system we have work to be done. */
                                __netif_rx_schedule(dev);
                        } else {
                                printk("driver bug! interrupt while in poll\n");
                                /* FIX by disabling interrupts  */
                                disable_rx_and_rxnobuff_ints();
                        }
                }
/************************ end note *********************************/

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);

                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

/************************ start note *********************************/
        } while (!(status & error) && more_work_to_be_done(status));
/************************ end note *********************************/
        return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This means also acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI; at the poll() and refill_rx_ring()
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in running state and
can be added to the core poll list (it was not already scheduled). If we
get a zero value we can _almost_ assume we are already added to the list
(instead of not running; the logic is based on the fact that you shouldn't
get an interrupt if not running).
We rectify this by disabling rx and rxnobuf interrupts.

II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are still around actually......

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll()
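
The effect of turning rx interrupts off in the handler can be seen in a toy
simulation. It is only a sketch under simplifying assumptions: struct
sim_nic and the sim_* functions are invented for the illustration, there is
a single CPU, and "interrupts" are plain function calls. It reproduces the
behaviour described in the introduction: one interrupt for a whole burst
under load, one interrupt per packet when the input is slow.

```c
#include <stdbool.h>

/* Hypothetical NIC + driver state for the simulation. */
struct sim_nic {
        int ring;        /* packets waiting on the rx ring */
        bool rx_irq_on;  /* rx interrupt enabled in the NIC */
        bool scheduled;  /* device is on the poll list */
        int irqs;        /* rx interrupts taken so far */
        int delivered;   /* packets handed to the stack */
};

/* Interrupt handler: mask rx interrupts and schedule the poll. */
static void sim_irq(struct sim_nic *n)
{
        n->irqs++;
        n->rx_irq_on = false;
        n->scheduled = true;
}

/* A packet arrives; it raises an interrupt only if rx irqs are enabled. */
static void sim_packet_arrives(struct sim_nic *n)
{
        n->ring++;
        if (n->rx_irq_on)
                sim_irq(n);
}

/* One poll run: drain up to 'quota' packets, re-enable irqs when empty. */
static void sim_poll(struct sim_nic *n, int quota)
{
        if (!n->scheduled)
                return;
        while (n->ring > 0 && quota-- > 0) {
                n->ring--;
                n->delivered++;
        }
        if (n->ring == 0) {
                n->scheduled = false;
                n->rx_irq_on = true;
        }
}
```

If 100 packets arrive before the first poll runs, only the first one raises
an interrupt; if every packet is followed by a poll, each packet costs one
interrupt.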

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        /* local copies of the ring state (pseudocode, types omitted) */
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
                        netif_rx (skb);
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... some just pass
                        whatever is on the ring already. */
                }

                /* move to the next skb on the ring; advance the local
                   copy, it is written back to tp->cur_rx below */
                entry = (++cur_rx) % RX_RING_SIZE;
                received++ ;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring();
        .
        .

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{

        struct my_private *tp = (struct my_private *)dev->priv;
        /* local copies of the ring state (pseudocode, types omitted) */
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        /* maximum packets to send to the stack */
/************************ note note *********************************/
        int rx_work_limit = dev->quota;

/************************ end note note *********************************/
    do {  /* outer loop starts here */

        clear_rx_status_register_bit();

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return 1;
                }

/************************ note note *********************************/
                if (--rx_work_limit < 0) { /* we got packets, but no quota */
                        /* store current ring pointer state */
                        tp->cur_rx = cur_rx;

                        /* Refill the Rx ring buffers if they are needed */
                        refill_rx_ring(dev);
                        goto not_done;
                }
/**********************  end note **********************************/

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
/************************ note note *********************************/
                        netif_receive_skb (skb);
/**********************  end note **********************************/
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... common is just pass
                        whatever is on the ring already. */
                }

                /* move to the next skb on the ring; advance the local
                   copy, it is written back to tp->cur_rx below */
                entry = (++cur_rx) % RX_RING_SIZE;
                received++ ;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring(dev);

        /* no packets on ring; but new ones can arrive since we last
           checked  */
        status = read_interrupt_status_reg();
        if (!rx_status_is_set) {
                        /* If something arrives in this narrow window,
                        an interrupt will be generated */
                        goto done;
        }
        /* done! at least that's what it looks like ;->
        if new packets came in after our last check on status bits
        they'll be caught by the while check and we go back and clear them
        since we haven't exceeded our quota */
    } while (rx_status_is_set);

done:

/************************ note note *********************************/
        dev->quota -= received;
        *budget -= received;

        /* If RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

        /* we are happy/done, no more packets on ring; put us back
        to where we can start processing interrupts again */
        netif_rx_complete(dev);
        enable_rx_and_rxnobuff_ints();

       /* The last op happens after poll completion. Which means the following:
        * 1. it can race with disabling irqs in irq handler (which are done to
        * schedule polls)
        * 2. it can race with dis/enabling irqs in other poll threads
        * 3. if an irq was raised after the beginning of the outer loop
        * (marked in the code above), it will be immediately
        * triggered here.
        *
        * Summarizing: the logic may result in some redundant irqs both
        * due to races in masking and due to too late acking of already
        * processed irqs. The good news: no events are ever lost.
        */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
        start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done"*/

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) we have a new way of handling the oom condition
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in, after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        /* If RX ring is not full we are still out of memory.
           Restart the timer again. Else we re-add ourselves
           to the master poll list.
         */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();
        else
                netif_rx_schedule(dev);  /* we are back on the poll list */
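
The decision the poll timer has to make can be sketched as a small
user-space model. Everything here is hypothetical and simplified: struct
oom_ring replaces the real ring by two counters, free_skbs stands for how
many buffers the allocator can currently hand out, and the sketch always
tries to refill rather than applying the "more than half empty" test.

```c
/* What the timer handler decides to do next. */
enum oom_action {
        OOM_RESTART_TIMER,    /* still out of memory, try again later */
        OOM_RESCHEDULE_POLL   /* ring is full again, go back to polling */
};

/* Hypothetical ring model. */
struct oom_ring {
        int size;       /* number of slots in the rx ring */
        int filled;     /* slots that currently own a buffer */
        int free_skbs;  /* buffers the allocator can still provide */
};

/* Stand-in for refill_rx_ring(): fill slots while buffers are available. */
static void oom_refill(struct oom_ring *r)
{
        while (r->filled < r->size && r->free_skbs > 0) {
                r->filled++;
                r->free_skbs--;
        }
}

/* Body of the timer handler: refill, then pick the next step. */
static enum oom_action oom_timer_fired(struct oom_ring *r)
{
        oom_refill(r);
        if (r->filled < r->size)
                return OOM_RESTART_TIMER;
        return OOM_RESCHEDULE_POLL;
}
```

Note that in both outcomes rx interrupts stay disabled; they are only turned
back on by the poll routine once it finds the ring drained and refilled.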

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this. The top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW flow control (FC)
=============================================================
Most chips with FC only send a pause packet when they run out of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence falls
under the same category as using NIC based mitigation. Also experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but
is not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated.  [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
the CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things;  then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new can arrive while we are doing this*/
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                        *  where the comments are ;-> irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
the status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting on CSR5 being set in that small window of opportunity
and that by re-enabling interrupts, we would actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose, if you have better wording
that will make this more understandable, please suggest it.]
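
The level-triggered behaviour that the tulip scheme relies on can be modelled
in a few lines of user-space C. The sketch is an illustration only: struct
lvl_nic, the lvl_* functions and the single RX_BIT are hypothetical, with
the status and mask fields playing the roles that CSR5 and CSR7 play on the
tulip.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BIT 0x1u   /* hypothetical rx bit, same position in both registers */

struct lvl_nic {
        uint32_t status;  /* pending events (role of CSR5) */
        uint32_t mask;    /* interrupt-enable bits (role of CSR7) */
        int ring;         /* packets waiting on the rx ring */
        int irqs;         /* interrupts delivered to the CPU */
};

/* Level triggered: asserted whenever a pending event is also enabled. */
static bool lvl_irq_asserted(const struct lvl_nic *n)
{
        return (n->status & n->mask) != 0;
}

/* The status bit is set on arrival whether or not the irq is masked. */
static void lvl_packet_arrives(struct lvl_nic *n)
{
        n->ring++;
        n->status |= RX_BIT;
}

/* "unmask irqs" at the end of poll: fires at once if status is set. */
static void lvl_unmask_rx(struct lvl_nic *n)
{
        n->mask |= RX_BIT;
        if (lvl_irq_asserted(n))
                n->irqs++;
}
```

A packet that sneaks in between the status read and the unmask leaves its
status bit set, so the unmask itself produces the interrupt and the packet
does not rot on the ring.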

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
        .
        .
restart_poll:
        while (ring_is_not_empty()) {
                work-work-work
                if quota is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        netif_rx_complete(dev);
        if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
                disable_rx_and_rxnobufs()
                goto restart_poll
        }
---------

Basically netif_rx_complete() removes us from the poll list, but because a
new packet might come in that would never be caught due to the race,
we attempt to re-add ourselves to the poll list.
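
Why the second look at the ring is needed on such hardware can be shown with
another small model. As before this is a sketch with invented names (struct
edge_nic, edge_*), a single CPU and a counter instead of a ring; the
edge_reschedule() helper merely stands in for netif_rx_reschedule() and
only models "succeeds unless the device is already scheduled".

```c
#include <stdbool.h>

struct edge_nic {
        bool rx_irq_on;   /* rx interrupt enabled in the NIC */
        bool scheduled;   /* device is on the poll list */
        int ring;         /* packets waiting on the rx ring */
        int irqs;         /* rx interrupts actually delivered */
};

/* An event that happens while the interrupt is masked is simply lost. */
static void edge_packet_arrives(struct edge_nic *n)
{
        n->ring++;
        if (n->rx_irq_on) {
                n->irqs++;
                n->rx_irq_on = false;   /* irq handler masks rx ... */
                n->scheduled = true;    /* ... and schedules the poll */
        }
}

/* Stand-in for netif_rx_reschedule(): succeeds unless already scheduled. */
static bool edge_reschedule(struct edge_nic *n)
{
        if (n->scheduled)
                return false;
        n->scheduled = true;
        return true;
}

/* Leaving poll without the second look at the ring. */
static void edge_leave_naive(struct edge_nic *n)
{
        n->rx_irq_on = true;
        n->scheduled = false;
}

/* Leaving poll as in the sample code; true means "goto restart_poll". */
static bool edge_leave_checked(struct edge_nic *n)
{
        n->rx_irq_on = true;            /* enable_rx_interrupts() */
        n->scheduled = false;           /* netif_rx_complete() */
        if (n->ring > 0 && edge_reschedule(n)) {
                n->rx_irq_on = false;   /* disable_rx_and_rxnobufs() */
                return true;
        }
        return false;
}
```

With edge_leave_naive() a packet that arrived while masked stays on the ring
with no interrupt pending and nobody scheduled; with edge_leave_checked()
the same packet puts the device back into polling.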



APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt and by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It also has the effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
and networking, to get the proper softirq/user balance. Increasing ksoftirqd
priority to 0 (eventually more) is reported to cure problems with low network
performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
=================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton  <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>
 765 Donald Becker <becker@scyld.com>
 766 Jeff Garzik <jgarzik@pobox.com>