Interrupts and Network Drivers

Devices and the kernel can use two main techniques for exchanging data: polling and interrupts.

Polling
With this technique, the kernel constantly keeps checking whether the device has anything to say. It can do that by continually reading a memory register on the device, for instance, or returning to check it when a timer expires.
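
As a purely illustrative sketch (all mydev_* names and register offsets are made up, not taken from any real driver), timer-based polling might look roughly like this in a 2.6-era driver:

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/io.h>

#define MYDEV_STATUS   0x00  /* offset of the imaginary status register */
#define MYDEV_RX_READY 0x01  /* bit set when a frame is waiting */

static void __iomem *mydev_regs;          /* mapped elsewhere with ioremap() */
static struct timer_list mydev_poll_timer;

static void mydev_poll(unsigned long unused)
{
        u32 status = readl(mydev_regs + MYDEV_STATUS);

        if (status & MYDEV_RX_READY) {
                /* Dequeue the waiting frame(s) and hand them to the kernel here. */
        }

        /* Come back and check the register again at the next timer tick. */
        mod_timer(&mydev_poll_timer, jiffies + 1);
}

static void mydev_start_polling(void)
{
        setup_timer(&mydev_poll_timer, mydev_poll, 0);
        mod_timer(&mydev_poll_timer, jiffies + 1);
}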

Interrupts

With this technique, the device asynchronously notifies the kernel (by raising a hardware interrupt), and the kernel runs the driver's interrupt handler. When the event is the reception of a frame, the handler queues the frame somewhere and notifies the kernel about it. This technique, which is quite common, still represents the best option under low traffic loads. Unfortunately, it does not perform well under high traffic loads: forcing an interrupt for each frame received can easily make the CPU waste all of its time handling interrupts.

The code that takes care of an input frame is split into two parts: first the driver copies the frame into an input queue accessible by the kernel, and then the kernel processes it (usually passing it to a handler dedicated to the associated protocol such as IP). The first part is executed in interrupt context and can preempt the execution of the second part. This means that the code that accepts input frames and copies them into the queue has higher priority than the code that actually processes the frames. Under a high traffic load, the interrupt code would keep preempting the processing code. The consequence is obvious: at some point the input queue will be full, but since the code that is supposed to dequeue and process those frames does not have a chance to run due to its lower priority, the system collapses. New frames cannot be queued since there is no space, and old frames cannot be processed because there is no CPU available for them. This condition is called receive-livelock in the literature.

Processing Multiple Frames During an Interrupt

This approach is used by quite a few Linux device drivers. When an interrupt is notified and the driver handler is executed, the latter keeps downloading frames and queuing them to the kernel input queue, up to a maximum number of frames (or a window of time). Of course, it would be possible to keep doing that until the queue gets empty, but let’s remember that device drivers should behave as good citizens. They have to share the CPU with other subsystems and IRQ lines with other devices. Polite behavior is especially important because interrupts are disabled while the driver handler is running.

Storage limitations also apply, as they did in the previous section. Each device has a limited amount of memory, and therefore the number of frames it can store is limited. If the driver does not process them in a timely manner, the buffers can get full and new frames (or old ones, depending on the driver policies) could be dropped. If a loaded device kept processing incoming frames until its queue emptied out, this form of starvation could happen to other devices.

Timer-Driven Interrupts

This technique is an enhancement to the previous ones. Instead of having the device asynchronously notify the driver about frame receptions, the driver instructs the device to generate an interrupt at regular intervals. The handler then checks whether any frames have arrived since the previous interrupt and handles all of them in one shot. Even better would be to have the device generate interrupts at intervals, but only if it has something to say.

Combinations

A good combination would use the interrupt technique under low load and switch to the timer-driven interrupt under high load. The tulip driver included in the Linux kernel, for instance, can do this (see drivers/net/tulip/interrupt.c*).

Example
A balanced approach to processing multiple frames is shown in the following piece of code, taken from the drivers/net/3c59x.c Ethernet driver:
static irqreturn_t vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        int work_done = max_interrupt_work;

        ioaddr = dev->base_addr;
        ... ... ...
        status = inw(ioaddr + EL3_STATUS);
        do {
                ... ... ...
                if (status & RxComplete)
                        vortex_rx(dev);
                if (--work_done < 0) {
                        /* Disable all pending interrupts. */
                        ... ... ...
                        /* The timer will re-enable interrupts. */
                        mod_timer(&vp->timer, jiffies + 1*HZ);
                        break;
                }
                ... ... ...
        } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete));
        ... ... ...
}
In vortex_interrupt, the driver reads from the device the reasons for the interrupt and stores them in status. Network devices can generate an interrupt for different reasons, and several reasons can be grouped together in a single interrupt. If RxComplete (a symbol specially defined by this driver to mean a new frame has been received) is among those reasons, the code invokes vortex_rx.* During its execution, interrupts are disabled for the device. However, the driver can read a hardware register on the card and find out whether, in the meantime, a new interrupt was posted. The IntLatch flag is true when a new interrupt has been posted (and it is cleared by the driver when it is done processing it).

vortex_interrupt keeps processing incoming frames as long as the register says there is an interrupt pending (IntLatch) and that it is due to the reception of a frame (RxComplete). This also means that only multiple occurrences of RxComplete interrupts can be handled in one shot. Other types of interrupts, which are much less frequent, can wait.

Finally—here is where good citizenship enters—the loop terminates if it reaches the maximum number of input frames that can be processed, whose initial value is stored in work_done. This driver uses a default value of 32 and allows that value to be tuned at module load time.
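
Such a tunable is typically exposed as a module parameter; a minimal sketch of how the knob could be declared (the real 3c59x code may differ in details) is:

#include <linux/module.h>

/* Upper bound on events handled per interrupt; tunable at load time,
 * e.g. "modprobe 3c59x max_interrupt_work=64". */
static int max_interrupt_work = 32;
module_param(max_interrupt_work, int, 0);
MODULE_PARM_DESC(max_interrupt_work, "maximum events handled per interrupt");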

Interrupt Handlers

Bottom Halves Solutions

The kernel provides different mechanisms for implementing bottom halves and for deferring work in general. These mechanisms differ mainly with regard to the following points:

Running context
Interrupts are seen by the kernel as having a different running context from userspace processes or other kernel code. When the function executed by a bottom half is capable of going to sleep, it is restricted to mechanisms allowed in process context, as opposed to interrupt context.
Concurrency and locking
When a mechanism can take advantage of SMP, this has implications for how serialization is enforced (if necessary) and how locking influences scalability.

Concurrency and Locking

• Only one old-style bottom half can run at any time, regardless of the number of CPUs (kernel 2.2).
• Only one instance of each tasklet can run at any time. Different tasklets can run concurrently on different CPUs. This means that given any tasklet, there is no need to enforce any serialization, because it is already enforced by the kernel: you cannot have multiple instances of the same tasklet running concurrently.
• Only one instance of each softirq can run at the same time on a CPU. However, the same softirq can run on different CPUs concurrently. This means that for any given softirq you need to make sure that accesses to shared data by different CPUs use proper locking. To increase parallelization, softirqs should be designed to access only per-CPU data as much as possible, reducing the need for locking considerably (see the sketch after this list).
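
The following minimal sketch illustrates that per-CPU pattern; all names (my_softirq_action, frames_handled, shared_work_list) are hypothetical. Per-CPU data needs no lock, while data genuinely shared between CPUs still does:

#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/list.h>

/* Per-CPU statistics: each CPU touches only its own copy, so the
 * handler needs no lock for this. */
static DEFINE_PER_CPU(unsigned long, frames_handled);

/* Truly shared state still needs locking, because the same softirq
 * can run on several CPUs at once. */
static LIST_HEAD(shared_work_list);
static DEFINE_SPINLOCK(shared_work_lock);

static void my_softirq_action(struct softirq_action *h)
{
        /* Lock-free: only the local CPU updates its counter. */
        __get_cpu_var(frames_handled)++;

        /* Locked: another CPU may be running this same handler right now. */
        spin_lock(&shared_work_lock);
        /* ... consume entries from shared_work_list ... */
        spin_unlock(&shared_work_lock);
}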

Preemption

System calls and other kernel tasks can be preempted by other kernel tasks with higher priorities. Once preemption was added, developers just had to define explicitly where to disable it (in hardware and software interrupt code, in the scheduler itself, in the code protected by spin locks and read/write locks, etc.).

preempt_disable
Disables preemption for the current task. Can be called repeatedly, incrementing a reference counter.
preempt_enable
preempt_enable_no_resched
The reverse of preempt_disable, allowing preemption to be enabled again. preempt_enable_no_resched simply decrements a reference counter, which allows preemption to be re-enabled when it reaches zero. preempt_enable, in addition, checks whether the counter is zero and forces a call to schedule( ) to allow any higher-priority task to run.
preempt_check_resched
This function is called by preempt_enable and differentiates it from preempt_enable_no_resched.

A counter for each process, named preempt_count and embedded in the thread_info structure, indicates whether a given process allows preemption. The field can be read with preempt_count( ) and is manipulated indirectly through the inc_preempt_count and dec_preempt_count functions defined in include/linux/preempt.h. There are situations in which the kernel should not be preempted. These include when it is servicing hardware, as well as when it uses one of the calls just shown to disable preemption. Therefore, preempt_count is split into three components. Each byte is a counter for a different condition that requires nonpreemption: hardware interrupts, software interrupts, and general nonpreemption.

In addition to serving a distinct purpose, each byte is manipulated by its own set of functions. The high-order byte is not fully used at the moment, but its second least significant bit is set before calling the schedule function and tells that function that it has been called to preempt the current task.* In include/asm-xxx/hardirq.h you can find several macros that make it easier to read and write preempt_count. Despite all this complexity, whenever a check has to be done on the current process to see if it can be preempted, all the kernel needs to know is whether preempt_count is zero (it does not really matter why preemption is disabled).
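
A small example of how these calls are typically used (the per-CPU variable my_counter is made up for illustration): disabling preemption keeps the task from being preempted, and thus possibly migrated to another CPU, in the middle of an update to its per-CPU data.

#include <linux/preempt.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, my_counter);   /* illustrative per-CPU data */

static void update_my_counter(void)
{
        /* No preemption (and no CPU migration) between the read and the write. */
        preempt_disable();
        __get_cpu_var(my_counter)++;
        preempt_enable();   /* may end up calling schedule() if a reschedule is due */
}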

Bottom-Half Handlers
The infrastructure for bottom halves must address the following needs:
• Classifying the bottom half as the proper type
• Registering the association between a bottom half type and its handler
• Scheduling a bottom half for execution
• Notifying the kernel about the presence of scheduled BHs

Bottom-half handlers in kernel 2.2

The 2.2 model for bottom-half handlers divides them into a large number of types, which are differentiated by when and how often the kernel checks for them and runs them. The 2.2 list is as follows, taken from include/linux/interrupt.h. In this book, we are most interested in NET_BH.

enum {
        TIMER_BH = 0,
        CONSOLE_BH,
        TQUEUE_BH,
        DIGI_BH,
        SERIAL_BH,
        RISCOM8_BH,
        SPECIALIX_BH,
        AURORA_BH,
        ESP_BH,
        NET_BH,
        SCSI_BH,
        IMMEDIATE_BH,
        KEYBOARD_BH,
        CYCLADES_BH,
        CM206_BH,
        JS_BH,
        MACSERIAL_BH,
        ISICOM_BH
};

Each bottom-half type is associated with a function handler by means of init_bh. The networking code, for instance, initializes the NET_BH bottom-half type to the net_bh handler in net_dev_init:
__initfunc(int net_dev_init(void))
{
        ... ... ...
        init_bh(NET_BH, net_bh);
        ... ... ...
}
The main function used to unregister a BH handler is remove_bh. (There are other related functions too, such as enable_bh/disable_bh, but we do not need to see all of them.)

Whenever an interrupt handler wants to trigger the execution of a bottom half handler, it has to set the corresponding flag with mark_bh. This function is very simple: it sets a bit into a global bitmap bh_active, which, as we will see in a moment, is tested in several places.
extern inline void mark_bh(int nr)
{
        set_bit(nr, &bh_active);
}
For instance, every time a network device driver has successfully received a frame, it signals the kernel about it with a call to netif_rx. The latter queues the newly received frame into the ingress queue backlog (shared by all the CPUs) and marks the NET_BH bottom-half handler flag.

skb_queue_tail(&backlog, skb);
mark_bh(NET_BH);
return;

During several routine operations, the kernel checks whether any bottom halves are scheduled for execution. If any are waiting, the kernel runs the function do_bottom_half (currently in kernel/softirq.c), to execute them. The checks are performed during:

do_IRQ
Whenever the kernel is notified by an IRQ about a hardware interrupt, it calls do_IRQ to execute the associated handler. Since a good number of bottom halves are scheduled for execution by interrupt handlers, what could give them less latency than an invocation right at the end of do_IRQ? For this reason, the regular timer interrupt that expires with frequency HZ represents an upper bound on the time between two consecutive executions of do_bottom_half.
Returns from interrupts and exceptions (which include system calls)
See arch/XXX/kernel/entry.S for the assembly language code that takes care of this case.

schedule
This function, which decides what to execute next on the CPU, checks whether any bottom-half handlers are pending and gives them priority over other tasks.
asmlinkage void schedule(void)
{
        /* Do "administrative" work here while we don't hold any locks */
        if (bh_mask & bh_active)
                goto handle_bh;
handle_bh_back:
        ... ... ...
handle_bh:
        do_bottom_half();
        goto handle_bh_back;
        ... ... ...
}

run_bottom_half, the function used by do_bottom_half to execute the pending bottom-half handlers, looks like this:
active = get_active_bhs();
clear_active_bhs(active);
bh = bh_base;
do {
        if (active & 1)
                (*bh)();
        bh++;
        active >>= 1;
} while (active);

The order in which the pending handlers are invoked depends on the positions of the associated flags inside the bitmap and the direction used to scan those flags (returned by get_active_bhs). In other words, bottom halves are not run on a first-come-first-served basis. And since networking bottom halves can take a long time, those that have the misfortune to be dequeued last can experience high latency. Bottom halves in 2.2 and earlier kernels suffer from a ban on concurrency: only one bottom half can run at any time, regardless of the number of CPUs.

Bottom-half handlers in kernel 2.4 and above: the introduction of the softirq

The biggest improvement between kernels 2.2 and 2.4, as far as interrupt handling is concerned, was the introduction of software interrupts (softirqs), which can be seen as the multithreaded version of bottom half handlers. Not only can many softirqs run concurrently, but also the same softirq can run on different CPUs concurrently. The only restriction on concurrency is that only one instance of each softirq can run at the same time on a CPU. The new softirq model has only six types (from include/linux/interrupt.h):
enum
{
        HI_SOFTIRQ=0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        SCSI_SOFTIRQ,
        TASKLET_SOFTIRQ
};

All the XXX_BH bottom-half types in the old model are still available to old drivers, but have been reimplemented to run as softirqs of the HI_SOFTIRQ type (which means they take priority over the other softirq types).

Softirqs, like the old bottom halves, run with interrupts enabled and therefore can be suspended at any time to handle a new, incoming interrupt. However, the kernel does not allow a new request for a softirq to run on a CPU if another instance of that softirq has been suspended on that CPU; this drastically reduces the amount of locking needed. Each softirq type can maintain per-CPU data structures to hold state information (the networking code, for instance, uses an array of softnet_data structures, one per CPU). Since different instances of the same type of softirq can run simultaneously on different CPUs, the functions run by softirqs still need to lock other data structures that are shared, to avoid race conditions.

Softirq handlers are registered with the open_softirq function, which, unlike init_bh, accepts an extra parameter so that the function handler can be passed some input data if needed. None of the softirqs, however, currently uses that extra parameter, and a proposal has been floated to remove it. open_softirq simply copies the input parameters into the global array softirq_vec, declared in kernel/softirq.c, which holds the associations between types and handlers.
static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp;

void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
{
        softirq_vec[nr].data = data;
        softirq_vec[nr].action = action;
}

A softirq can be scheduled for execution on the local CPU by the following functions:
__raise_softirq_irqoff
This function, the counterpart of mark_bh in 2.2, simply sets the bit flag associated with the softirq to be run. Later on, when the flag is checked, the associated handler will be invoked.
raise_softirq_irqoff
This is a wrapper around __raise_softirq_irqoff that additionally schedules the ksoftirqd thread if the function is not called from a hardware or software interrupt context and preemption has not been disabled. If the function is called from interrupt context, invoking the thread is not necessary because, as we will see, do_softirq will be called anyway.
raise_softirq
This is a wrapper around raise_softirq_irqoff that executes the latter with hardware interrupts disabled.
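
Based on the description above, the relationship between the three functions can be sketched roughly as follows; this is a simplified illustration, not the verbatim kernel source (in particular, the real raise_softirq_irqoff also takes the preemption state into account):

/* Simplified sketch of the raise_softirq family, following the text above. */
static inline void __raise_softirq_irqoff(unsigned int nr)
{
        /* Just mark the softirq as pending in the local CPU's bitmap. */
        local_softirq_pending() |= (1UL << nr);
}

void raise_softirq_irqoff(unsigned int nr)
{
        __raise_softirq_irqoff(nr);

        /* Outside interrupt context do_softirq will not run on the way out
         * of an interrupt, so wake up ksoftirqd instead. */
        if (!in_interrupt())
                wakeup_softirqd();
}

void raise_softirq(unsigned int nr)
{
        unsigned long flags;

        /* Same thing, but callable with hardware interrupts enabled. */
        local_irq_save(flags);
        raise_softirq_irqoff(nr);
        local_irq_restore(flags);
}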

The only difference between this early stage of softirq handling and the 2.2 bottom-half model is that the softirq version has to check the flags on a per-CPU basis, since each CPU has its own bitmap of pending softirqs.

Here are the main points where do_softirq may be invoked:*

do_IRQ
The skeleton for do_IRQ, which is defined in the per-architecture files arch/archname/kernel/irq.c, is:
fastcall unsigned int do_IRQ(struct pt_regs *regs)
{
        irq_enter();
        ... ... ...
        /* handle the IRQ number "irq" with the registered handler */
        ... ... ...
        irq_exit();
        return 1;
}

Since nested calls to irq_enter are allowed, irq_exit calls invoke_softirq only when all the usual conditions are met (there are softirqs pending, etc.) and the reference count associated with the interrupt context has reached zero, indicating that the kernel is leaving the interrupt context. Here is the generic definition of irq_exit from kernel/softirq.c, but there are architectures that define their own versions:
void irq_exit(void)
{
        ...
        sub_preempt_count(IRQ_EXIT_OFFSET);
        if (!in_interrupt() && local_softirq_pending())
                invoke_softirq();
        preempt_enable_no_resched();
}

smp_apic_timer_interrupt, which handles SMP timers in arch/XXX/kernel/apic.c, also uses irq_enter/irq_exit.

local_bh_enable
When softirqs are re-enabled on a CPU, pending requests are processed (if any) with a call to do_softirq.
The kernel threads, ksoftirqd_CPUn
To keep softirqs from monopolizing all the CPUs (which could happen easily on a heavily loaded network because the NET_TX_SOFTIRQ and NET_RX_SOFTIRQ interrupts have a higher priority than user processes), developers introduced a new set of per-CPU threads. These have the names ksoftirqd_CPU0, ksoftirqd_CPU1, and so on, and can be seen by a ps command.

Tasklets

A tasklet is a function that some interrupt or other task has deferred to execute later. Tasklets are built on top of softirqs and are usually kicked off by interrupt handlers (but other parts of the kernel, such as the neighboring subsystem, use them as well).*

HI_SOFTIRQ is used to implement high-priority tasklets, and TASKLET_SOFTIRQ is used for lower-priority ones. Each time a request for a deferred execution is issued, an instance of a tasklet_struct structure is queued onto either a list processed by HI_SOFTIRQ or another one that is instead processed by TASKLET_SOFTIRQ.

Since softirqs are handled independently by each CPU, it should not be a surprise that there are two lists of pending tasklet_structs for each CPU, one associated with HI_SOFTIRQ and one with TASKLET_SOFTIRQ. These are their definitions from  kernel/softirq.c:
static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec) = { NULL };
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec) = { NULL };
At first sight, tasklets may seem to be just like the old bottom halves, but there actually are substantial differences:
• There is no limit on the number of different tasklets, whereas the old bottom halves were limited to one type for each bit flag of bh_base.
• Tasklets provide two levels of priority.
• Different tasklets can run concurrently on different CPUs.
• Tasklets, unlike old bottom halves and softirqs, are dynamic and do not need to be statically declared in an XXX_BH or XXX_SOFTIRQ enumeration list.

The tasklet_struct data structure is defined in include/linux/interrupt.h as follows:
struct tasklet_struct
{
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
};

struct tasklet_struct *next
A pointer used to link together the pending structures associated with the same CPU. New elements are added at the head by the functions tasklet_hi_schedule and tasklet_schedule.
unsigned long state
A bitmap flag whose possible values are represented by the TASKLET_STATE_XXX enums listed in include/linux/interrupt.h:
TASKLET_STATE_SCHED
The tasklet has been scheduled for execution, and the data structure is already in the list associated with HI_SOFTIRQ or TASKLET_SOFTIRQ, based on the priority assigned. The same tasklet cannot be scheduled concurrently on different CPUs. If other requests to execute the tasklet arrive when the first one has not started its execution yet, they will be dropped. Since for any given tasklet, there can be only one instance in execution, there is no reason to schedule it for execution more than once.

TASKLET_STATE_RUN
The tasklet is being executed. This flag is used to prevent multiple instances of the same tasklet from being executed concurrently. It is meaningful only for SMP systems. The flag is manipulated with the three locking functions tasklet_trylock, tasklet_unlock, and tasklet_unlock_wait.

atomic_t count
There are cases where you may need to temporarily disable and later re-enable a tasklet. This is accomplished by this counter: a value of zero means that the tasklet is enabled (and thus executable), and a nonzero value means that the tasklet is disabled. Its value is incremented by the tasklet_disable functions and decremented by the tasklet[_hi]_enable functions described later in this section.
void (*func)(unsigned long)
unsigned long data
func is the function to execute and data is an optional input that can be passed to func.

The following are some important kernel functions that handle tasklets, from kernel/softirq.c and include/linux/interrupt.h:
tasklet_init
Fills in the fields of a tasklet_struct structure with the func and data values provided as arguments.
tasklet_action, tasklet_hi_action
Execute low-priority and high-priority tasklets, respectively.
tasklet_schedule, tasklet_hi_schedule
Schedule a low-priority and a high-priority tasklet, respectively, for execution. They add the tasklet_struct structure to the list of pending tasklets associated with the local CPU and then schedule the associated softirq (TASKLET_SOFTIRQ or HI_SOFTIRQ). If the tasklet is already scheduled (but not running), these APIs do nothing (see the TASKLET_STATE_SCHED flag).
tasklet_enable, tasklet_hi_enable
These two functions are identical and are used to enable a tasklet.

tasklet_disable, tasklet_disable_nosync
Both of these functions disable a tasklet and can be used with low- and high-priority tasklets. tasklet_disable is a wrapper around tasklet_disable_nosync. While the latter returns right away (it is asynchronous), the former returns only when the tasklet has terminated its execution, in case it was running (it is synchronous). tasklet_enable, tasklet_hi_enable, and tasklet_disable_nosync manipulate the value of the count field to declare the tasklet enabled or disabled. Nested calls are allowed.
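
A typical usage pattern, sketched with made-up mydrv_* names: the tasklet is declared once, scheduled from the interrupt handler, and killed when the driver shuts down.

#include <linux/interrupt.h>

/* Deferred half of the (hypothetical) driver: runs later in softirq
 * context, with hardware interrupts enabled. */
static void mydrv_do_rx(unsigned long data)
{
        /* ... dequeue and process the frames the interrupt handler left behind ... */
}

static DECLARE_TASKLET(mydrv_rx_tasklet, mydrv_do_rx, 0);

static irqreturn_t mydrv_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* Acknowledge the hardware quickly, then defer the real work. */
        tasklet_schedule(&mydrv_rx_tasklet);
        return IRQ_HANDLED;
}

static void mydrv_shutdown(void)
{
        /* Make sure the tasklet is neither pending nor running anymore. */
        tasklet_kill(&mydrv_rx_tasklet);
}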

Softirq Initialization

During kernel initialization, softirq_init initializes the software IRQ layer with the two general-purpose softirqs: tasklet_action and tasklet_hi_action, which are associated with TASKLET_SOFTIRQ and HI_SOFTIRQ, respectively.

void __init softirq_init(void)
{
        open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
        open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}
The two softirqs used by the networking code, NET_RX_SOFTIRQ and NET_TX_SOFTIRQ, are initialized in net_dev_init, one of the networking initialization functions.

HI_SOFTIRQ is mainly used by sound card device drivers.*
Users of TASKLET_SOFTIRQ include:
• Drivers for network interface cards (not only Ethernets)
• Numerous other device drivers
• Media layers (USB, IEEE 1394, etc.)
• Networking subsystems (Neighboring, ATM qdisc, etc.)

Pending softirq Handling

do_softirq stops and does nothing if the CPU is currently serving a hardware or software interrupt. The function checks for this by calling in_interrupt, which is equivalent to if (in_irq() || in_softirq()).
If do_softirq decides to proceed, it saves the pending softirqs in the local variable pending with local_softirq_pending.
#ifndef __ARCH_HAS_DO_SOFTIRQ
asmlinkage void do_softirq(void)
{
        __u32 pending;
        unsigned long flags;

        if (in_interrupt())
                return;

        local_irq_save(flags);

        pending = local_softirq_pending();
        if (pending)
                __do_softirq();

        local_irq_restore(flags);
}
EXPORT_SYMBOL(do_softirq);
#endif
From the preceding snapshot, it could seem that do_softirq runs with IRQs disabled, but that’s not true. IRQs are kept disabled only when manipulating the bitmap of pending softirqs (i.e., accessing the softnet_data structure). You will see in a moment that __do_softirq internally re-enables IRQs when running the softirq handlers.

__do_softirq function
It is possible for the same softirq type to be scheduled multiple times while do_softirq is running. Since IRQs are enabled when running the softirq handlers, the bitmap of pending softirqs can be manipulated while serving an interrupt, and therefore any of the softirq handlers that has been executed by __do_softirq could be re-scheduled during the execution of __do_softirq itself.
For this reason, before __do_softirq re-enables IRQs, it saves the current bitmap of pending softirqs in the local variable pending and clears it from the softnet_data instance associated with the local CPU using local_softirq_pending() = 0. Then, based on pending, it calls all the necessary handlers.
Once all the handlers have been called, __do_softirq checks whether in the meantime any softirqs were scheduled again (this check is done with IRQs disabled). If there is at least one pending softirq, it repeats the whole process. However, __do_softirq repeats it only up to MAX_SOFTIRQ_RESTART times (experimentation has found that 10 times works well).

The use of MAX_SOFTIRQ_RESTART is a design decision made to keep a single type of interrupt—particularly a stream of networking interrupts—from starving other interrupts out of one of the CPUs. Without the limit in _ _do_softirq, starvation could easily happen when a server is highly loaded by network traffic and the number of NET_RX_SOFTIRQ interrupts goes through the roof.
Let’s see how starvation could take place. do_IRQ would raise a NET_RX_SOFTIRQ interrupt that would cause do_softirq to be executed. __do_softirq would clear the NET_RX_SOFTIRQ flag, but before it ended it would be interrupted by an interrupt that would set NET_RX_SOFTIRQ again, and so on, indefinitely.
Let’s see now how the central part of __do_softirq manages to invoke the softirq handlers. Every time one softirq type is served, its bit is cleared from the local copy of the active softirqs, pending. h is initialized to point to the global data structure softirq_vec that holds the associations between softirq types and their handlers (for instance, NET_RX_SOFTIRQ is handled by net_rx_action). The loop ends when the bitmap is cleared.
Finally, if there are pending softirqs that cannot be handled because do_softirq must return, having repeated its job MAX_SOFTIRQ_RESTART times already, the ksoftirqd thread is awakened and given the responsibility of handling them later. Because do_softirq is invoked at so many points within the kernel, it is actually likely that a later invocation of do_softirq will handle these interrupts before the ksoftirqd thread is scheduled.
#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
        struct softirq_action *h;
        __u32 pending;
        int max_restart = MAX_SOFTIRQ_RESTART;
        int cpu;

        pending = local_softirq_pending();
        local_bh_disable();
        cpu = smp_processor_id();
restart:
        /* Reset the pending bitmask before enabling irqs */
        local_softirq_pending() = 0;
        local_irq_enable();

        h = softirq_vec;

        do {
                if (pending & 1) {
                        h->action(h);
                        rcu_bh_qsctr_inc(cpu);
                }
                h++;
                pending >>= 1;
        } while (pending);

        local_irq_disable();

        pending = local_softirq_pending();
        if (pending && --max_restart)
                goto restart;

        if (pending)
                wakeup_softirqd();

        __local_bh_enable();
}

Per-Architecture Processing of softirq
The do_softirq function provided in kernel/softirq.c can be overridden by another function provided by the architecture code (which ends up calling __do_softirq anyway).
This explains why the definition of do_softirq in kernel/softirq.c is wrapped with the check on __ARCH_HAS_DO_SOFTIRQ.
A few architectures, including i386 (see arch/i386/kernel/irq.c), define their own version of do_softirq. Such architecture-specific versions are used when the architecture is configured with 4 KB stacks (instead of 8 KB) and handles hard IRQs and softirqs on separate stacks.

ksoftirqd Kernel Threads
Background kernel threads are assigned the job of checking for softirqs that have been left unexecuted by the functions previously described, and executing as many of those softirqs as they can before needing to give that CPU back to other activities. There is one kernel thread for each CPU, named ksoftirqd_CPU0, ksoftirqd_CPU1, and so on.
The function ksoftirqd associated with these threads is pretty simple and is defined in the same file, kernel/softirq.c:

static int ksoftirqd(void *__bind_cpu)
{
        set_user_nice(current, 19);
        ...
        while (!kthread_should_stop()) {
                if (!local_softirq_pending())
                        schedule();

                __set_current_state(TASK_RUNNING);

                while (local_softirq_pending()) {
                        /* Preempt disable stops cpu going offline.
                           If already offline, we'll be on wrong CPU:
                           don't process */
                        preempt_disable();
                        if (cpu_is_offline((long)__bind_cpu))
                                goto wait_to_die;
                        do_softirq();
                        preempt_enable();
                        cond_resched();
                }

                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);
        return 0;
        ...
}

The priority of a process, also called the nice priority, is a number ranging from –20 (maximum) to +19 (minimum). The ksoftirqd threads are given a low priority of 19. This is done so that frequently running softirqs such as NET_RX_SOFTIRQ cannot completely kidnap the CPU, which would leave almost no resources to other processes. We already saw that do_softirq can be invoked from different places in the code, so this low priority doesn’t represent a handicap. Once started, the loop simply keeps calling do_softirq (always with preemption disabled) until one of the following conditions becomes true:
• There are no more pending softirqs to handle (local_softirq_pending( ) returns FALSE).

In this case, the function sets the thread’s state to TASK_INTERRUPTIBLE and calls schedule() to release the CPU. The thread can be awakened by means of wakeup_softirqd, which can be called from both __do_softirq itself and raise_softirq_irqoff.
• The thread has run for too long and is asked to release the CPU.
The handler associated with the timer interrupt, among other things, sets the need_resched flag to signal that the current process/thread has used its time slot. In this case, ksoftirqd releases the CPU, keeping its state as TASK_RUNNING, and will soon be resumed.
Starting the threads

There is one ksoftirqd thread for each CPU. When the system’s first CPU comes online, the first thread is started at kernel boot time inside do_pre_smp_initcalls.*
The ksoftirqd threads for the other CPUs that come up at boot time, and for any other CPU that may be enabled later on a system that can handle hot-pluggable CPUs, are taken care of through the cpu_chain notification chain.
The cpu_chain chain lets various subsystems know when a CPU is up and running or when one dies. The softirq subsystem registers to the cpu_chain with spawn_ksoftirqd, called from the function do_pre_smp_initcalls mentioned previously. The callback routine cpu_callback that processes notifications from cpu_chain is used to initialize the necessary per-CPU data structures and start the ksoftirqd thread on the CPU.
The complete list of CPU_XXX notifications is in include/linux/notifier.h, but we need only four of them in the context of this chapter:

CPU_UP_PREPARE
Generated when the CPU starts coming up, but is not ready yet.
CPU_ONLINE
Generated when the CPU is ready.

CPU_UP_CANCELLED
CPU_DEAD
These two messages are generated only when the kernel is compiled with support for hot-pluggable CPUs. The first is used when one of the tasks triggered by a previous CPU_UP_PREPARE notification failed and therefore does not allow the CPU to go online. The second one is used when a CPU dies. CPU_UP_PREPARE creates the thread and binds it to the associated CPU, but does not wake up the thread. CPU_ONLINE wakes up the thread. When a CPU dies, its associated ksoftirqd instance is killed:
static int __devinit cpu_callback(struct notifier_block *nfb, unsigned long action,
                                  void *hcpu)
{
        ...
        switch (action) {
        ...
        }
        return NOTIFY_OK;
}

static struct notifier_block __devinitdata cpu_nfb = {
        .notifier_call = cpu_callback
};

__init int spawn_ksoftirqd(void)
{
        void *cpu = (void *)(long)smp_processor_id();

        cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
        cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
        register_cpu_notifier(&cpu_nfb);
        return 0;
}
Note that spawn_ksoftirqd places two direct calls to cpu_callback before registering with cpu_chain via register_cpu_notifier. This is necessary because CPU notifications are not generated for the first CPU that comes online.

Tasklet Processing
The two handlers for low-priority tasklets (TASKLET_SOFTIRQ) and high-priority tasklets (HI_SOFTIRQ) are identical; they simply work on two different lists. For this reason, we will describe only one of them: tasklet_action, the one associated with TASKLET_SOFTIRQ.
Only one instance of each tasklet can be waiting for execution at any time. When tasklet_schedule or tasklet_hi_schedule schedules a tasklet, the function sets the TASKLET_STATE_SCHED bit. Attempts to reschedule the same tasklet will be ignored because TASKLET_STATE_SCHED is already set. The bit is cleared only when the tasklet starts its execution; thus, during or after its execution another instance can be scheduled.
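
Following that description, the scheduling path can be sketched roughly as follows (simplified from the 2.6 sources; details such as the fastcall annotation are omitted):

static inline void tasklet_schedule(struct tasklet_struct *t)
{
        /* Only the first scheduling request does real work; later requests
         * find TASKLET_STATE_SCHED already set and are ignored. */
        if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
                __tasklet_schedule(t);
}

void __tasklet_schedule(struct tasklet_struct *t)
{
        unsigned long flags;

        /* Add the tasklet at the head of the local CPU's pending list and
         * mark TASKLET_SOFTIRQ as pending, all with IRQs disabled. */
        local_irq_save(flags);
        t->next = __get_cpu_var(tasklet_vec).list;
        __get_cpu_var(tasklet_vec).list = t;
        raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_restore(flags);
}
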
The tasklet_action function starts by copying the list of tasklets waiting to be processed into a local variable first; it then clears the global list.* This is the only part of the function that is executed with interrupts disabled. Disabling them is necessary to avoid race conditions with interrupt handlers that could add new elements to the list while tasklet_action accesses it.
At this point, the function goes through the list tasklet by tasklet. For each element it invokes the handler if both of the following are true:

• The tasklet is not already running—in other words, TASKLET_STATE_RUN is clear. (The function runs tasklet_trylock to see whether TASKLET_STATE_RUN is already set; if not, tasklet_trylock sets the bit.)
• The tasklet is enabled (count is zero).
The part of the function implementing these activities follows:
struct tasklet_struct *list;

local_irq_disable();
list = __get_cpu_var(tasklet_vec).list;
__get_cpu_var(tasklet_vec).list = NULL;
local_irq_enable();

while (list) {
        struct tasklet_struct *t = list;

        list = list->next;

        if (tasklet_trylock(t)) {
                if (!atomic_read(&t->count)) {

At this stage, since the tasklet was not already being executed and it was extracted from the list of pending tasklets, it must have the TASKLET_STATE_SCHED flag set:
                        if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
                                BUG();
                        t->func(t->data);
                        tasklet_unlock(t);
                        continue;
                }
                tasklet_unlock(t);
        }
If the handler cannot be executed, the tasklet is put back into the list and TASKLET_SOFTIRQ is rescheduled to take care of all of those tasklets that for one of the two reasons listed earlier cannot be handled now:
        local_irq_disable();
        t->next = __get_cpu_var(tasklet_vec).list;
        __get_cpu_var(tasklet_vec).list = t;
        __raise_softirq_irqoff(TASKLET_SOFTIRQ);
        local_irq_enable();
}
}

How the Networking Code Uses softirqs

The networking subsystem has been assigned two different softirqs. NET_RX_SOFTIRQ handles incoming traffic and NET_TX_SOFTIRQ handles outgoing traffic. Both are registered in net_dev_init (described in Chapter 5) through the following lines:
open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
Because different instances of the same softirq handler can run concurrently on different CPUs (unlike tasklets), networking code is both low latency and scalable.
Both networking softirqs are higher in priority than normal tasklets (TASKLET_SOFTIRQ) but are lower in priority than high-priority tasklets (HI_SOFTIRQ). This prioritization guarantees that other high-priority tasks can proceed in a responsive and timely manner even when a system is under a high network load.

softnet_data Structure

Each CPU has its own queue for incoming frames. Because each CPU has its own data structure to manage ingress and egress traffic, there is no need for any locking among different CPUs. The data structure for this queue, softnet_data, is defined in include/linux/netdevice.h as follows:
struct softnet_data
{
        int throttle;
        int cng_level;
        int avg_blog;
        struct sk_buff_head input_pkt_queue;
        struct list_head poll_list;
        struct net_device *output_queue;
        struct sk_buff *completion_queue;
        struct net_device backlog_dev;
};
The structure includes both fields used for reception and fields used for transmission.
In other words, both the NET_RX_SOFTIRQ and NET_TX_SOFTIRQ softirqs refer to the structure. Ingress frames are queued to input_pkt_queue,* and egress frames are placed into the specialized queues handled by Traffic Control (the QoS layer) instead of being handled by softirqs and the softnet_data structure, but softirqs are still used to clean up transmitted buffers afterward, to keep that task from slowing transmission.

throttle
avg_blog
cng_level

These three parameters are used by the congestion management algorithm and are further described following this list. All three, by default, are updated with the reception of every frame.
input_pkt_queue
This queue, initialized in net_dev_init, is where incoming frames are stored before being processed by the driver. It is used by non-NAPI drivers; those that have been upgraded to NAPI use their own private queues.
backlog_dev
This is an entire embedded data structure (not just a pointer to one) of type net_device, which represents a device that has scheduled net_rx_action for execution on the associated CPU. This field is used by non-NAPI drivers. The name stands for “backlog device.”

poll_list
This is a bidirectional list of devices with input frames waiting to be processed.
output_queue
completion_queue
output_queue is the list of devices that have something to transmit, and completion_queue is the list of buffers that have been successfully transmitted and therefore can be released.

throttle is treated as a Boolean variable whose value is true when the CPU is overloaded and false otherwise. Its value depends on the number of frames in input_pkt_queue. When the throttle flag is set, all input frames received by this CPU are dropped, regardless of the number of frames in the queue.* avg_blog represents the weighted average length of the input_pkt_queue queue; it can range from 0 to the maximum length represented by netdev_max_backlog. avg_blog is used to compute cng_level, which represents the congestion level. As avg_blog crosses one of the congestion thresholds, cng_level changes value. The definitions of the NET_RX_XXX enum values are in include/linux/netdevice.h, and the definitions of the congestion levels mod_cong, lo_cong, and no_cong are in net/core/dev.c. avg_blog and cng_level are recalculated with each frame, by default, but recalculation can be postponed and tied to a timer to avoid adding too much overhead.

avg_blog and cng_level are associated with the CPU and therefore apply to non-NAPI devices, which all share the same per-CPU queue, input_pkt_queue.
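
The following is only a rough, illustrative sketch of the idea, not the exact arithmetic used in net/core/dev.c: avg_blog is kept as a weighted running average of the queue length, and cng_level is derived from it by comparison against the no_cong, lo_cong, and mod_cong thresholds.

/* Illustrative only: the real update in net/core/dev.c differs in details. */
static void update_congestion(struct softnet_data *queue)
{
        int qlen = skb_queue_len(&queue->input_pkt_queue);

        /* Weighted running average of the backlog length. */
        queue->avg_blog = (queue->avg_blog + qlen) / 2;

        /* Map the average onto a congestion level using the thresholds
         * defined in net/core/dev.c. */
        if (queue->avg_blog <= no_cong)
                queue->cng_level = NET_RX_SUCCESS;
        else if (queue->avg_blog <= lo_cong)
                queue->cng_level = NET_RX_CN_LOW;
        else if (queue->avg_blog <= mod_cong)
                queue->cng_level = NET_RX_CN_MOD;
        else
                queue->cng_level = NET_RX_CN_HIGH;
}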

Initialization of softnet_data
Each CPU’s softnet_data structure is initialized by net_dev_init, which runs at boot time. The initialization code is:
for (i = 0; i < NR_CPUS; i++) {
        struct softnet_data *queue;

        queue = &per_cpu(softnet_data, i);
        skb_queue_head_init(&queue->input_pkt_queue);
        queue->throttle = 0;
        queue->cng_level = 0;
        queue->avg_blog = 10; /* arbitrary non-zero */
        queue->completion_queue = NULL;
        INIT_LIST_HEAD(&queue->poll_list);
        set_bit(__LINK_STATE_START, &queue->backlog_dev.state);
        queue->backlog_dev.weight = weight_p;
        queue->backlog_dev.poll = process_backlog;
        atomic_set(&queue->backlog_dev.refcnt, 1);
}
NR_CPUS is the maximum number of CPUs the Linux kernel can handle and softnet_data is a vector of struct softnet_data structures.

The code also initializes the fields of softnet_data.backlog_dev, a structure of type net_device, a special device representing non-NAPI devices.
