task---task definition, switch, creation, termination

====================================================================================


1. Intro


2. Reference


3. task definition
    3.1. task descriptor
    3.2. task state machine / task exit state machine
    3.3. kernel mode stack / "thread_union" / "thread_info", and "current"


4. task switch
    4.1. what is "task switch context"?
        4.1.1. "thread_struct" - "task switch context"
    4.2. task switch logic - schedule(), context_switch()
    4.3. switch_to() - x86
        4.3.1. x86 "tss_struct"
    4.4. switch_to() - mips


5. task creation
    5.1. system calls which create a task and corresponding service routine
    5.2. do_fork()


6. task termination
    6.1. _exit() system call - terminating one task
    6.2. exit_group() system call - terminating a task group
    6.3. release_task()


7. wait-like system call


8. misc tips
    8.1. When forking a new thread, how pthread_create() handles thread_func




====================================================================================


1. Intro


This doc introduces the basic concepts of the linux task, and describes some aspects of task behavior.




====================================================================================


2. Reference


    [1]. ulk - OReilly.Understanding.The.Linux.Kernel.3rd.Edition
        Chapter 3. Processes




    [2]. ia32 - CPU manual from intel, x86 architecture
        vol.3. ->CHAPTER 6. TASK MANAGEMENT




    [3]. mips32 - CPU manual from mips, 32-bit architecture




====================================================================================


3. task definition


We usually hear about some theoretical concepts, like:
    "process"
    "thread"
These are theoretical concepts, in Operating System or POSIX specification terms.




When it comes to the linux kernel implementation, things are different: there is no explicit structure definition corresponding to "process" or "thread", there is ONLY:
    "struct task_struct"        # task descriptor


This is because the linux kernel implements:


    multi-thread process    -   a thread group, consisting of multiple tasks which share resources;
                                in this case, each task is called a LWP(LightWeight Process).


    single-thread process   -   just one task.
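

For example, this sharing can be observed from user level: in a multi-thread process, getpid() returns the shared "tgid" of the LWP group, while the raw gettid system call returns each LWP's own task PID. Below is a minimal user-level sketch(!^^__assumes Linux + glibc; gettid is issued via syscall() because old glibc has no wrapper):


    /* build: gcc lwp_demo.c -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>


    static void *thread_func(void *arg)
    {
        /* Both LWPs print the same getpid() (the tgid), but different gettid()s. */
        printf("LWP : getpid()=%d gettid()=%ld\n", getpid(), (long)syscall(SYS_gettid));
        return NULL;
    }


    int main(void)
    {
        pthread_t t;
        printf("main: getpid()=%d gettid()=%ld\n", getpid(), (long)syscall(SYS_gettid));
        pthread_create(&t, NULL, thread_func, NULL);
        pthread_join(t, NULL);
        return 0;
    }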




[*] To avoid confusion:
        We talk about "process" and "thread" ONLY in its theoretical meaning.
        When a task is a LWP, then, we use "LWP task" specifically.
        We mainly use "task".
        We don't use "thread group", instead, we use "LWP group".




[*] When we talk about "task", by default, we mean the task definition in linux kernel.


    Some architectures define the concept of task, such as:
        <<x86 manual.vol.3>>
            ->CHAPTER 6. TASK MANAGEMENT
                "A task is made up of two parts: a task execution space and a task-state segment (TSS). "
    However, we don't use this definition by default, because as we will see:
        linux's implementation of "task" is based on the architecture's definition of "task", but is not necessarily the same; linux's implementation may cheat the architecture.


    When we talk about architecture's definition of "task", then, we will use "$(arch) task" explicitly.




====================================================================================


3.1. task descriptor


In linux, a task is represented by a task descriptor "task_struct"; below we talk about some fields of "task_struct", not all of them.




Refer to:
    <<ulk>>
        /3.2. Process Descriptor
            Figure 3-1. The Linux process descriptor




    #
    # task descriptor, one for each task.
    #
    struct task_struct {


        #
        # task state machine.
        #
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */




        #
        # task exit state machine.
        #
        int exit_state;




        #
        # A quick way to reference the kernel mode stack of this task.
        #
        void *stack;




        #
        # The MM descriptor, including VMA descriptors and page table set(__describes the linear 
        # address space which this task sees).
        #
        # See:
        #       <<memory - process page table, user-level range, page fault handler.txt>>
        #
        #
        # MM descriptor "mm_struct" is reference-counting, if this task is a LWP task, then, all LWPs in the LWP group
        # share the MM descriptor, thus the linear address space, and all LWPs hold a reference to the MM descriptor.
        #
        #
        # If a user-level task, then, it has:
        #       "mm" == "active_mm"
        #
        # If a kernel task, then, its "mm" is NULL, and "active_mm" is borrowed from the "prev" task which switch it in.
        #
        struct mm_struct *mm, *active_mm;




        #
        # The task PID, uniquely identifies this task(!^^__within its "pid_namespace").
        #
        pid_t pid;


        #
        # The task group ID - the task PID of the task group leader.
        #
        pid_t tgid;




        #
        # "thread_struct" keep the "task context", mainly about CPU registers value, during task switch
        #       "prev" task => "next" task
        # Linux kernel saves the "task context" of "prev" into its "thead_struct", restores the "task context" of "next"
        # from its "thread_struct".
        #
    /* CPU-specific state of this task */
        struct thread_struct thread;




        #
        # The filesystem view of this "task".
        #
        # "fs_struct" is reference-counting.
        #
        # If LWP task, then all the LWP tasks in this LWP group share the same filesystem view.
        #       
    /* filesystem information */
        struct fs_struct *fs;




        #
        # The table of open VFS "file" objects.
        #
        # "files_struct" is reference-counting.
        #
        # If LWP task, then all the LWP tasks in this LWP task group share the same "files_struct".
        #
        # Besides LWP task group, POSIX specifies that a newly-forked "process" inherits the "file" object of its parent
        # "process", so, in this case, the newly-forked "process" also shares the same "files_struct" and hold a reference
        # to it
        #
    /* open file information */
        struct files_struct *files;




        #
        # signal structures, reference-counting and LWP task shared.
        #
    /* signal handlers */
        struct signal_struct *signal;
        struct sighand_struct *sighand;
    };
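

To illustrate how these reference-counting structures are shared, below is a simplified sketch of the CLONE_FILES decision inside copy_files()(!^^__a paraphrase for illustration, not the verbatim kernel source):


    static int copy_files_sketch(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct files_struct *oldf = current->files;
        int error = 0;


        if (clone_flags & CLONE_FILES) {
            /* LWP case: share the parent's open files table, just take a reference. */
            atomic_inc(&oldf->count);
            tsk->files = oldf;
            return 0;
        }


        /* fork() case: the child gets its own private copy of the table. */
        tsk->files = dup_fd(oldf, &error);
        return error;
    }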




====================================================================================


3.2. task state machine / task exit state machine


    struct task_struct
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
        int exit_state;


are called:
        "task state machine"
        "task exit state machine"
respectively.




As for the task state machine "task_struct->state", there are the following possible task states:


        TASK_RUNNING                # ready state, either running on CPUs or in rq waiting to be scheduled.


        TASK_INTERRUPTIBLE          # interruptible wait


        TASK_UNINTERRUPTIBLE        # uninterruptible wait


        TASK_STOPPED                # stopped


        TASK_TRACED                 # traced by a tracer
        


As for the task exit state machine "task_struct->exit_state", there are the following possible task exit states:


        EXIT_ZOMBIE                 # the task is terminated, in zombie state, waiting for its parent to issue a wait()-like
                                    # system call to get to know its termination status.


        EXIT_DEAD                   # the parent has already got the termination status, so the lifetime of this 
                                    # task is finally over, and it is being destroyed.




[*] The transitions of the task state machine are somewhat complicated; we don't describe them in detail in this doc. We talk about the task exit state machine when we talk about wait-like system calls.
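

As a quick illustration, the TASK_RUNNING <=> TASK_INTERRUPTIBLE transitions are usually driven by the classic sleep/wake idiom(!^^__a minimal sketch; "condition" and "sleeper" are hypothetical, while set_current_state() / schedule() / wake_up_process() are the real kernel APIs):


    #
    # sleeper side: go to sleep until the condition becomes true.
    #
    set_current_state(TASK_INTERRUPTIBLE);      # TASK_RUNNING => TASK_INTERRUPTIBLE
    while (!condition) {
        schedule();                             # give up the CPU, we are off the rq now
        set_current_state(TASK_INTERRUPTIBLE);
    }
    set_current_state(TASK_RUNNING);


    #
    # waker side: make the condition true, then wake the sleeper up.
    #
    condition = 1;
    wake_up_process(sleeper);                   # TASK_INTERRUPTIBLE => TASK_RUNNING, back onto the rq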




====================================================================================


3.3. kernel mode stack / "thread_union" / "thread_info", and "current"




A task (if a user-level task) usually has 2 stacks:
        user mode stack     # exists in the user linear address range
        kernel mode stack   # exists in the kernel linear address range


If the task is a kernel task, which doesn't have a user linear address range, then, of course, it doesn't have a user mode stack.




When a task switches between user mode and kernel mode, there is also a stack switch between the user mode and kernel mode stacks. Usually, this stack switch is performed by the CPU automatically(!^^__not sure if all architectures are like this, but at least x86 is).


As we can see in:
    <<x86 manual.vol.1>>
        //6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures


             If the code segment for the handler procedure has the same privilege level as the
            currently executing program or task, the handler procedure uses the current stack; if
            the handler executes at a more privileged level, the processor switches to the stack
            for the handler’s privilege level.


            If a stack switch does occur, the processor does the following:


                2. Loads the segment selector and stack pointer for the new stack (that is, the stack
            for the privilege level being called) from the TSS into the SS and ESP registers
            and switches to the new stack.






The user mode stack is controlled under mmap() system call and page fault handler. 
See:
        <<memory - process page table, user-level range, page fault handler.txt>>




The kernel mode stack is allocated when the task is created.






See:
    <<ulk>>
        ///3.2.2.1. Process descriptors handling
            Figure 3-2. Storing the thread_info structure and the process kernel stack in two page frames


The kernel mode stack is represented by "thread_union":


    #
    # Note that, "thread_union" is a C union, not a C struct.
    #
    # Note that, the definition of "thread_union" is in:
    #       /include/linux/sched.h
    # So, its definition is architecture-generic.
    #
    union thread_union {
        struct thread_info thread_info;
        unsigned long stack[THREAD_SIZE/sizeof(long)];  # THREAD_SIZE tells the size of kernel mode stack, usually,
    };                                                  # it is 2 page frames


So the layout of kernel mode stack is:


            ____________________________    high address
            |                           |
            |                           |
            |       used range          |
            |                           |
            |            |              |
            |   ---------|-----------   | <---- (SP)stack pointer
            |            | increase     |
            |            \/             |
            |                           |
            |                           |
            |        available range    |
            |                           |
            |___________________________|
            |                           |
            |       thread_info         |
            |___________________________|   low address






The "thread_info" stores some basic information about a task. 


Note that the definition of "thread_info" is in:
        /arch/$(arch)/include/asm/thread_info.h
so, its definition is architecture-specific.


However, there are some common fields of "thread_info" across different architectures:


        struct thread_info {
            #
            # Refer back to task descriptor
            #
            struct task_struct  *task;      /* main task structure */


            struct exec_domain  *exec_domain;   /* execution domain */


            #
            # TIF_XXX flags set
            #
            __u32           flags;      /* low level flags */


            #
            # If the task is currently running on a CPU, then, record the SMP ID of this CPU.
            # This field is set when the task is switched in.
            #
            __u32           cpu;        /* current CPU */


            #
            # The "preempt_count" of this task:
            #   | hardirq counter | softirq counter | preemption counter |
            #
            # See:
            #       <<task - preemption.txt>>
            #
            int         preempt_count;  /* 0 => preemptable, <0 => BUG */
        };
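

For example, the "flags" field is manipulated through generic helpers from /include/linux/thread_info.h(!^^__a minimal sketch; TIF_NEED_RESCHED is the flag the scheduler cares about most):


    set_ti_thread_flag(ti, TIF_NEED_RESCHED);       # mark: this task should be rescheduled soon
    test_ti_thread_flag(ti, TIF_NEED_RESCHED);      # e.g., checked on the way back to user mode
    clear_ti_thread_flag(ti, TIF_NEED_RESCHED);


    #
    # and the "current"-based convenience wrappers:
    #
    set_thread_flag(TIF_NEED_RESCHED);              # == set_ti_thread_flag(current_thread_info(), ...)
    test_thread_flag(TIF_NEED_RESCHED);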






Sometimes, we need to get the "thread_info" of the task currently running on the local CPU; in this case, we use:
        current_thread_info() macro
Its implementation is architecture-specific, and in:
        /arch/$(arch)/include/asm/thread_info.h


    #
    # x86 - /arch/x86/include/asm/thread_info.h
    #
    # It seems that x86 maintains a per-CPU variable "kernel_stack", which records the 
    # kernel mode stack when a user mode => kernel mode switch happens. So, we can get the current "thread_info" from
    # the "kernel_stack".
    #
    static inline struct thread_info *current_thread_info(void)
    {
        struct thread_info *ti;
        ti = (void *)(percpu_read_stable(kernel_stack) +
                  KERNEL_STACK_OFFSET - THREAD_SIZE);
        return ti;
    }
    DEFINE_PER_CPU(unsigned long, kernel_stack) =
        (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;




    #
    # mips - /arch/mips/include/asm/thread_info.h
    #
    # On mips, register $28 is "global pointer".
    # It seems $28 SHOULD be set to the "thread_info" of the task to be switched in during each process switch.
    # And in current kernel-level context, $28 remains the same, until the next process switch.
    #
    /* How to get the thread information struct from C.  */
    register struct thread_info *__current_thread_info __asm__("$28");
    #define current_thread_info()  __current_thread_info




    #
    # ppc - /arch/ppc/include/asm/thread_info.h
    #
    # The implementation of ppc current_thread_info() is the conventional way:
    #       Get the current SP(stack pointer) value, then, mask the value to the location of "thread_info".
    #
    /* how to get the thread information struct from C */
    static inline struct thread_info *current_thread_info(void)
    {
        register unsigned long sp asm("r1");
    
        /* gcc4, at least, is smart enough to turn this into a single
         * rlwinm for ppc32 and clrrdi for ppc64 */
        return (struct thread_info *)(sp & ~(THREAD_SIZE-1));
    }






Usually, we want to know which task is currently running on the local CPU; in this case, we use the "current" macro, which gets the task descriptor of the currently-running task.


The implementation of "current" macro is also architecture-specific.


    #
    # generic - /include/asm-generic/current.h
    #
    # If an architecture doesn't provide its own customized implementation of get_current(), then, just get the task descriptor
    # from "thread_info", and the implementation of current_thread_info() is architecture-specific.
    #


    #define current get_current()
    #define get_current() (current_thread_info()->task)




    #
    # x86 - /arch/x86/include/asm/current.h
    #
    # On x86, it seems that the kernel maintains a per-CPU variable "current_task", and during each task switch,
    # the kernel modifies the local-CPU element of "current_task" to the task to be switched in.
    # So, the local-CPU element of "current_task" is just the one we want.
    #
    
    #define current get_current()
    static __always_inline struct task_struct *get_current(void)
    {
        return percpu_read_stable(current_task);
    }
    DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
        &init_task;




    #
    # mips - /arch/mips/include/asm/current.h
    #
    # mips kernel just uses the generic implementation of "current".
    #


    #include <asm-generic/current.h>




    #
    # ppc - /arch/ppc/include/asm/current.h
    #
    # It seems ppc32 "current" is just mips current_thread_info(), it uses a specific register "r2" to keep something,
    # and this register is changed to the task to be switched in at each task switch.
    #


    /*
     * We keep `current' in r2 for speed.
     */
    register struct task_struct *current asm ("r2");




====================================================================================


4. task switch


In this chapter, we are gonna talk about task switch.




====================================================================================


4.1. what is "task switch context"?


First of all, we need to discuss the concept of "task switch context".


We talk about "context" frequently, basically, it includes:
    The runtime values of CPU registers of a task.




However, in fact, the "context" can be extended:


    #a. When an exception or IRQ is generated, then:
            If currently running under user mode, we need to switch from user mode to kernel mode to execute a certain kernel control path.
            If currently running under kernel mode, the currently-running kernel control path is interrupted to execute another kernel control path.


    In this case, the CPU will automatically save some context onto the kernel mode stack(!^^__this is done by the CPU at HW-level), and in addition to that, the kernel will also save some context onto the kernel mode stack.


    These contexts belong to the code path that was interrupted.


    
    In this case, the context is called:


            "task interrupted context"


    See:
        <<kernel control path - exception & IRQ handing.txt>>




    #b. When kernel is doing task switch, it needs to save the context of the currently-running task, and restore the context of the task to be switched in.


    In this case, the context is called:


            "task switch context"




[*] Note that, "task switch context" and "task interrupted context" are different, though they are both about CPU registers, they are different in concept, and kernel treat them differently.




====================================================================================


4.1.1. "thread_struct" - "task switch context"


The "task switch context" is represented by "thread_struct" struct, whose definition is architecture-specific. This is straightforward, because each architecture defines its own register model.




The "thread_struct" struct is defined in the:
        /arch/$(arch)/include/asm/processor.h


And the task descriptor "task_struct" contains an embedded field of type "thread_struct":


    struct task_struct {


        /* CPU-specific state of this task */
            struct thread_struct thread;
    };


Basically, the task switch is to:
    
    Save the "task switch context" of the currently-running task(!^^__"current", or called "prev") into "prev->thread". This means saving the CPU register values into "prev->thread".


    Restore the "task switch context" of the task to be switched in(!^^__called "next") from "next->thread". This means reloading the CPU registers with the values kept in "next->thread".


    Jump to the restored EIP of "next", thus finishing the task switch.




#
# /arch/x86/include/asm/processor.h
#
#       x86 "thread_struct"
#
    struct thread_struct {
        /* Cached TLS descriptors: */
        struct desc_struct  tls_array[GDT_ENTRY_TLS_ENTRIES];
        unsigned long       sp0;
        unsigned long       sp;
    #ifdef CONFIG_X86_32
        unsigned long       sysenter_cs;
    #else
        unsigned long       usersp; /* Copy from PDA */
        unsigned short      es;
        unsigned short      ds;
        unsigned short      fsindex;
        unsigned short      gsindex;
    #endif
    #ifdef CONFIG_X86_32
        unsigned long       ip;
    #endif
    #ifdef CONFIG_X86_64
        unsigned long       fs;
    #endif
        unsigned long       gs;
        /* Save middle states of ptrace breakpoints */
        struct perf_event   *ptrace_bps[HBP_NUM];
        /* Debug status used for traps, single steps, etc... */
        unsigned long           debugreg6;
        /* Keep track of the exact dr7 value set by the user */
        unsigned long           ptrace_dr7;
        /* Fault info: */
        unsigned long       cr2;
        unsigned long       trap_no;
        unsigned long       error_code;
        /* floating point and extended processor state */
        struct fpu      fpu;
    
    #ifdef CONFIG_X86_32
        /* Virtual 86 mode info */
        struct vm86_struct __user *vm86_info;
        unsigned long       screen_bitmap;
        unsigned long       v86flags;
        unsigned long       v86mask;
        unsigned long       saved_sp0;
        unsigned int        saved_fs;
        unsigned int        saved_gs;
    #endif
        /* IO permissions: */
        unsigned long       *io_bitmap_ptr;
        unsigned long       iopl;
        /* Max allowed port in the bitmap, in bytes: */
        unsigned        io_bitmap_max;
    };




#
# /arch/mips/include/asm/processor.h
#
#       x86 "thread_struct"
#
    struct thread_struct {
        /* Saved main processor registers. */
        unsigned long reg16;
        unsigned long reg17, reg18, reg19, reg20, reg21, reg22, reg23;
        unsigned long reg29, reg30, reg31;
    
        /* Saved cp0 stuff. */
        unsigned long cp0_status;
    
        /* Saved fpu/fpu emulator stuff. */
        struct mips_fpu_struct fpu;
    
        /* Saved state of the DSP ASE, if available. */
        struct mips_dsp_state dsp;
    
        /* Saved watch register state, if available. */
        union mips_watch_reg_state watch;
    
        /* Other stuff associated with the thread. */
        unsigned long cp0_badvaddr; /* Last user fault */
        unsigned long cp0_baduaddr; /* Last kernel fault accessing USEG */
        unsigned long error_code;
        unsigned long irix_trampoline;  /* Wheee... */
        unsigned long irix_oldctx;
    #ifdef CONFIG_CPU_CAVIUM_OCTEON
        struct octeon_cop2_state cp2 __attribute__ ((__aligned__(128)));
        struct octeon_cvmseg_state cvmseg __attribute__ ((__aligned__(128)));
    #endif
        struct mips_abi *abi;
    };




From the definition of "thread_struct", we can see which CPU registers belong to the "task switch context".




====================================================================================


4.2. task switch logic - schedule(), context_switch()


The basic logic of task switch is:
    
    Save the "task switch context" of the currently-running task(!^^__"current", or called "prev") into "prev->thread". This means saving the CPU register values into "prev->thread".


    Restore the "task switch context" of the task to be switched in(!^^__called "next") from "next->thread". This means reloading the CPU registers with the values kept in "next->thread".


    Jump to the restored EIP of "next", thus finishing the task switch.




The logic above is the core of task switch. However, due to the complexity of modern CPUs and the kernel, the implementation is definitely not that simple and straightforward. Before doing that core logic, there is a lot of work to do.




The ONLY function to perform task switch is:


        asmlinkage void __sched schedule(void)


No matter under what conditions, there is just schedule(); no other function performs the task switch.
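

For example, a long-running kernel code path voluntarily gives up the CPU by calling schedule(); below is a minimal sketch of the voluntary reschedule point idiom(!^^__"process_one_item" / "nr_items" are hypothetical; this is roughly what the real cond_resched() helper boils down to):


    for (i = 0; i < nr_items; i++) {
        process_one_item(i);            # hypothetical per-item work


        #
        # voluntary reschedule point: if someone set our TIF_NEED_RESCHED,
        # call schedule() -- still the one and only task switch function.
        #
        if (need_resched())
            schedule();
    }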




@@trace - schedule() - the ONLY task switch function


#
# In:
#       /kernel/sched.c
#
# schedule() is the ONLY function to perform task switch, no matter in kernel preemption or not.
#
# It is architecture-independent itself, but it calls a lot of architecture-specific functions and macros to do its job.
#
# As you can see, schedule() takes no argument. In fact, the logic of schedule() is:
#
#    "current" is the candidate task to be switched out. We need to check whether "current" is really suitable to be
# switched out, and if it is, choose an appropriate task(!^^__"next") to be switched in, perform a lot of preparation work
# for the task switch, and then actually do the task switch.
#
/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)


    #
    # When implementing task switch, we conventionally call:
    #
    #       task to be switched out     -   "prev"      -   it is, in fact, "current" which calls schedule()
    #       task to be switched in      -   "next"
    #
    struct task_struct *prev, *next;




need_resched:


    #
    # We MUST disable preemption, because the following operations are critical, and we are doing the task switch ourselves.
    # We do NOT want the following case:
    #       If preemption were enabled, an IRQ could happen, and when the IRQ handler returns, "current" would be preempted.
    #       The logic would become a mess.
    #
    preempt_disable();


    #
    # Get local rq.
    #
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);


    #
    # "rq->curr" is just "current".
    #
    prev = rq->curr;


    #
    # Acquire the spinlock of the local rq.
    #
    # Note that, at this point, it is "prev" who acquires this local rq lock.
    #
    raw_spin_lock_irq(&rq->lock);




    #
    # Pick the task to be switched in, that is, "next".
    #
    next = pick_next_task(rq);




    #
    # Clear the TIF_NEED_RESCHED flag of "prev".
    #
    clear_tsk_need_resched(prev);
        clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);




    if (likely(prev != next)) { 
        #
        # If "next" is a different task, then, we call context_switch() to do the actual job.
        #
        context_switch(rq, prev, next); /* unlocks the rq */
    }
    else {
        #
        # Otherwise, "prev" == "next", we do nothing.
        #
        raw_spin_unlock_irq(&rq->lock);
    }


    #
    # At this point, we are "prev", and there are 2 possibilities:
    #
    #   #1. Most likely, context_switch() performed one task switch: "prev" was switched out, "next" was switched in.
    # The CPU then runs "next" for a while; after some time, another task switch happens, so "prev" takes over control again.
    #   
    #   This is due to some magic inside context_switch() / switch_to().
    #
    #
    #   #2. Otherwise, we have "prev" == "next", so we reach here directly.
    #




    #
    # Anyway, we are out of the critical operations, so we enable preemption again.
    #
    preempt_enable_no_resched();




    #
    # If the TIF_NEED_RESCHED flag is set for "current", then we jump to the "need_resched" label, to perform the whole
    # logic over again from the beginning.
    #
    # This might be due to kernel preemption: some higher-priority task became ready while we were doing the operations above,
    # such as:
    #       An IRQ handler / softirq woke up a higher-priority RT task.
    #
    if (need_resched())         # unlikely(test_thread_flag(TIF_NEED_RESCHED));
        goto need_resched;




------------------------------------------------------------------------------------


@@trace - context_switch()
/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next)


    struct mm_struct *mm, *oldmm;


    #
    # Performs misc work, none of it related to the core logic.
    #
    prepare_task_switch(rq, prev, next);
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);
        fire_sched_out_preempt_notifiers(prev, next);
        prepare_lock_switch(rq, next);
        prepare_arch_switch(next);
        trace_sched_switch(prev, next);


    
    #
    # Get the MM descriptor of "prev" and "next".
    #
    mm = next->mm;
    oldmm = prev->active_mm;


    
    #
    # If "next->mm" is NULL, then, we could tell it is a kernel task, which doesn't have its own MM descriptor, so,
    # it reuses the MM descriptor of "prev", not that, it is "prev->active_mm", because "prev" might also be a kernel task.
    #
    #
    # Otherwise, if "next->mm" is not NULL, which means it is not a user-level task, then, we need to switch to
    # the linear address space and page tables set of "next". 
    #
    # The implementation of switch_mm() is architecture-specific, but basically, it is to load page tables set 
    # "mm_struct->pgd" "next" into related MMU register.
    #
    # [*] Note that, switch_mm() is in fact to switch linear address space. After switch_mm() is called, we are using
    # linear address space of "next", but we are still running "prev", this is OK, because we are under kernel mode,
    # and all linear address space shares their "kernel-level linear address range"(!^^__more accurately, "direct mapping
    # range", and all the data structures referenced happend during task switch are just inside "direct mapping range").
    #
    if (!mm) {
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);


    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }




    #
    # [*] switch_to() is the real magic. Its implementation is architecture-specific.
    #
    # switch_to() performs the core logic of task switch, as we said previously:
    #
    #   #a. Save the "task switch context" of the currently-running task(!^^__"current", or called "prev")
    # into "prev->thread". This means saving the CPU register values into "prev->thread".
    #
    #   #b. Restore the "task switch context" of the task to be switched in(!^^__called "next") from
    # "next->thread". This means reloading the CPU registers with the values kept in "next->thread".
    #
    #   #c. Jump to the restored EIP of "next", thus finishing the task switch.
    #
    # Note the <comment> below, and we are gonna describe switch_to() in detail later.
    #
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);


    #
    # [*] When we reach here, we are still "prev", but there has been a time interval:
    #
    #       Let us call:
    #           "prev"  -   task_a
    #           "next"  -   task_b
    #
    #       Inside switch_to(), the task switch actually happened, and the CPU was taken over by "task_b"; after some time,
    # another task switch happens, and "task_a" is switched in again and continues to execute, to this point.
    #


    barrier();


    #
    # [*] Note that, inside schedule(), we(!^^__"task_a") acquired "rq->lock"; here, we release it.
    #
    # Some may wonder: we said "there is a time interval" above, so, is "rq->lock" held too long, always 
    # by "task_a"?? That wouldn't make sense.
    #
    # But that situation will not happen; the truth is:
    #       
    #       "task_a" acquires "rq->lock" in schedule().
    #       
    #       In switch_to(), "task_b" is switched in, and "task_b" continues to run to this point; it releases "rq->lock"
    # on behalf of "task_a".
    #
    #       Later, when "task_b" calls schedule(), it acquires "rq->lock"; then, switch_to() switches in "task_c",
    # and "task_c" releases "rq->lock" on behalf of "task_b".
    #
    #       Later, when "task_c"...
    #
    # Well, it is a delicate relay.
    #
    /*
     * this_rq must be evaluated again because prev may have moved
     * CPUs since it called schedule(), thus the 'rq' on its stack
     * frame will be invalid.
     */
    finish_task_switch(this_rq(), prev);


        finish_lock_switch(rq, prev);
            raw_spin_unlock_irq(&rq->lock);


        if (mm)
            mmdrop(mm);




====================================================================================


4.3. switch_to() - x86
4.3.1. x86 "tss_struct"


    Refer to:
        <<ulk>>
            ///3.3.3.1. The switch_to macro


[*] <!!attention>


    RTFS.


    One thing to note:


        The x86 architecture defines "TASK" at HW-level; it requires each "TASK" to provide a TSS segment.


        However, linux mangles the x86 "TASK": it maintains just one "TASK" per CPU, and reuses that "TASK TSS segment":


            It uses "thread_struct" (!^^__"task_struct->thread") to save the "task switch context", instead of the "TASK TSS segment".


            At each task switch, it refills the necessary context information from "next->thread" into the "TASK TSS segment".
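

For example, below is a simplified excerpt-style sketch of this refill inside the x86 __switch_to()(!^^__paraphrased for illustration, not verbatim): the kernel patches the per-CPU TSS so that "sp0"(the kernel stack pointer the CPU loads from the TSS on a user mode => kernel mode switch) points into "next":


    struct tss_struct *tss = &per_cpu(init_tss, cpu);


    /* Reload sp0: the CPU will fetch this from the TSS at the next ring 3 => ring 0 switch. */
    load_sp0(tss, &next->thread);       /* roughly: tss->x86_tss.sp0 = next->thread.sp0 */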




====================================================================================


4.4. switch_to() - mips


#
# /arch/mips/include/asm/system.h
#
#define switch_to(prev, next, last)                 \
do {                                    \
    __mips_mt_fpaff_switch_to(prev);                \
    if (cpu_has_dsp)                        \
        __save_dsp(prev);                   \
    __clear_software_ll_bit();                  \
    (last) = resume(prev, next, task_thread_info(next));        \
} while (0)




mips switch_to() internally calls resume(), which is an ASM function, and there is:


    /arch/mips/kernel/r4k_switch.S
            LEAF(resume)


    /arch/mips/kernel/octeon_switch.S
            LEAF(resume)
    
    /arch/mips/kernel/r2300_switch.S
            LEAF(resume)




these XXX_switch.S files only implement the resume() ASM function, for different mips CPU models.




Well, just RTFS if interested.




====================================================================================


5. task creation


A task is forked in 3 cases:


        #a. When a process forks a new process.
        
        #b. When a multi-thread process creates a new thread.
        
        #c. When creating a kernel task.




As for case #a, it corresponds to fork() / vfork() libc functions.


[*] An interesting thing is: when a thread in a multi-thread process calls fork() / vfork(), then what about the new process to be forked, is it also a multi-thread process??? The answer is "NO": the newly-forked process is a single-thread process, it runs the same code path as its parent thread, and the address space is COW.
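

A minimal user-level sketch demonstrating this(!^^__assumes Linux + glibc; gettid via syscall()): in the child forked from a secondary thread, only the forking thread survives, so gettid() equals getpid():


    /* build: gcc fork_from_thread.c -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>


    static void *thread_func(void *arg)
    {
        pid_t child = fork();
        if (child == 0) {
            /* single-threaded child: the one LWP is also the group leader */
            printf("child: getpid()=%d gettid()=%ld\n",
                   getpid(), (long)syscall(SYS_gettid));
            _exit(0);
        }
        waitpid(child, NULL, 0);
        return NULL;
    }


    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, thread_func, NULL);
        pthread_join(t, NULL);
        return 0;
    }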




As for case #b, it corresponds to pthread_create() libc function.
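

Underneath, pthread_create() issues the clone() system call with a set of CLONE_* resource-sharing flags. Below is a minimal user-level sketch of creating a LWP with raw clone()(!^^__only roughly the flag set glibc's pthread_create() uses; the real one also passes CLONE_SETTLS / CLONE_PARENT_SETTID / CLONE_CHILD_CLEARTID for TLS and joining):


    /* build: gcc clone_lwp.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>


    static int lwp_main(void *arg)
    {
        printf("LWP: getpid()=%d\n", getpid());     /* same tgid as the creator */
        return 0;
    }


    int main(void)
    {
        size_t stack_size = 1024 * 1024;
        char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (stack == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }


        /* roughly the resource-sharing flags pthread_create() passes */
        if (clone(lwp_main, stack + stack_size,     /* stack grows down on x86 */
                  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                  CLONE_THREAD | CLONE_SYSVSEM, NULL) == -1) {
            perror("clone");
            exit(1);
        }


        sleep(1);       /* crude: give the LWP time to print before the group exits */
        return 0;
    }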




As for case #c, it corresponds to the kernel_thread() function. We are not gonna describe kernel tasks in detail in this doc; they SHOULD be described in:
        <<kernel utility - kernel task.txt>>




====================================================================================


5.1. system calls which create a task and corresponding service routine


Case #a:
    fork() libc function                -   fork() system call          -   service routine sys_fork()
    vfork() libc function               -   vfork() system call         -   service routine sys_vfork()


Case #b:
    pthread_create() libc function      -   clone() system call         -   service routine sys_clone()




Case #c:
    kernel_thread() function




The implementations of these functions differ slightly among architectures:




#
# x86 - /arch/x86/kernel/process.c
#
    int sys_fork(struct pt_regs *regs)
    {
        return do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL);
    }
    
    int sys_vfork(struct pt_regs *regs)
    {
        return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->sp, regs, 0,
                   NULL, NULL);
    }
    
    long
    sys_clone(unsigned long clone_flags, unsigned long newsp,
          void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)
    {
        if (!newsp)
            newsp = regs->sp;
        return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid);
    }
    
    int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
    {
            ...
    
        /* Ok, create the new process.. */
        return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
    }




#
# ppc - /arch/powerpc/kernel/process.c
#
    int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3,
             unsigned long p4, unsigned long p5, unsigned long p6,
             struct pt_regs *regs)
    {
        CHECK_FULL_REGS(regs);
        return do_fork(SIGCHLD, regs->gpr[1], regs, 0, NULL, NULL);
    }
    
    int sys_vfork(unsigned long p1, unsigned long p2, unsigned long p3,
              unsigned long p4, unsigned long p5, unsigned long p6,
              struct pt_regs *regs)
    {
        CHECK_FULL_REGS(regs);
        return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->gpr[1],
                regs, 0, NULL, NULL);
    }


    int sys_clone(unsigned long clone_flags, unsigned long usp,
              int __user *parent_tidp, void __user *child_threadptr,
              int __user *child_tidp, int p6,
              struct pt_regs *regs)
    {
        CHECK_FULL_REGS(regs);
        if (usp == 0)
            usp = regs->gpr[1]; /* stack pointer for child */


        return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
    }


    #
    # /arch/powerpc/kernel/misc_32.S
    #
    /*
     * Create a kernel thread
     *   kernel_thread(fn, arg, flags)
     */
    _GLOBAL(kernel_thread)
        mr  r30,r3      /* function */
        mr  r31,r4      /* argument */
        ori r3,r5,CLONE_VM  /* flags */
        oris    r3,r3,CLONE_UNTRACED>>16
        li  r4,0        /* new sp (unused) */
        li  r0,__NR_clone
        sc                                  # issue system call request, to sys_clone()




#
# mips - /arch/mips/kernel/syscall.c
#
    save_static_function(sys_fork);
    static int __used noinline
    _sys_fork(nabi_no_regargs struct pt_regs regs)
    {
        return do_fork(SIGCHLD, regs.regs[29], &regs, 0, NULL, NULL);
    }
    
    save_static_function(sys_clone);
    static int __used noinline
    _sys_clone(nabi_no_regargs struct pt_regs regs)
    {
        clone_flags = regs.regs[4];
        newsp = regs.regs[5];
        if (!newsp)
            newsp = regs.regs[29];
        parent_tidptr = (int __user *) regs.regs[6];
    
        return do_fork(clone_flags, newsp, &regs, 0,
                       parent_tidptr, child_tidptr);
    }


    #
    # /arch/mips/kernel/process.c
    #
    long kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
    {
        ..........


        /* Ok, create the new process.. */
        return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
    }




====================================================================================


5.2. do_fork()


As we can see above, no matter the case (#a / #b / #c), creating a new task is essentially done by the do_fork() function. We are gonna describe do_fork() here.




@@trace - do_fork() forks a new task
/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          struct pt_regs *regs,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr)




    struct task_struct *p;


    #
    # copy_process() creates the child task using the parent task as a template, and returns the task descriptor of the child task.
    #
    p = copy_process(clone_flags, stack_start, regs, stack_size,
             child_tidptr, NULL, trace);


        #
        # A lot of sanity checks in copy_process()
        #


        #
        # dup_task_struct() creates the task descriptor / kernel mode stack of the child task, using the
        # parent task as a template.
        #
        p = dup_task_struct(current);


            #
            # Allocate task descriptor for the new task to be forked.
            #
            tsk = alloc_task_struct_node(node);


            #
            # Allocate kernel mode stack(!^^__including "thread_info") for the new task.
            #
            ti = alloc_thread_info_node(tsk, node);


            #
            # Perform a byte-to-byte copy from parent task descriptor to child task descriptor.
            #
            # [*] Note that, this byte-to-byte copy is interesting; a lot of the operations that follow are based on it,
            # or overwrite its content.
            #
            # One important part of this byte-to-byte copy is "thread_struct" - "task_struct->thread"; so,
            # when the child task is switched in for the first time, it would have the same register context as the parent
            # task.
            #
            err = arch_dup_task_struct(tsk, orig);
                *dst = *src;


            #
            # Perform a byte-to-byte copy from parent "thread_info" to child "thread_info".
            #
            # And associate child "thread_info" to child task descriptor, by setting child "thread_info->task".
            #
            setup_thread_stack(tsk, orig);
                *task_thread_info(p) = *task_thread_info(org);
                task_thread_info(p)->task = p;


            #
            # Set the stack end of the child's kernel mode stack to a magic value; this helps detect kernel mode
            # stack overflow.
            #
            stackend = end_of_stack(tsk);
            *stackend = STACK_END_MAGIC;    /* for overflow detection */
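

            #
            # Later, debug code can detect an overflow by checking whether the magic value survived, e.g.
            # (!^^__a hypothetical check for illustration; the real checks live behind debug config options):
            #
            #       if (*end_of_stack(tsk) != STACK_END_MAGIC)
            #           /* the stack grew down into "thread_info" - kernel mode stack overflow */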


            #
            # Note <comment> below.
            #
            # Set the refcnt of the task descriptor to 2:
            #
            #       One is for the new task itself, which holds a reference to its own task descriptor.
            #
            #       One is for the task to which the child task reports its termination status when it runs over;
            # that task will use a wait()-like system call to get the child task's termination status.
            #       Usually, that task is the parent task, but we can not be sure for now.
            #
            /*
             * One for us, one for whoever does the "release_task()" (usually
             * parent)
             */
            atomic_set(&tsk->usage, 2);




        #
        # Some code blocks overwrite the byte-by-byte-copied content which the child task descriptor took from the parent task descriptor.
        #


        #
        # The new task hasn't called the execve() system call yet, so we set its "did_exec" field to 0.
        #
        p->did_exec = 0;




        /* Perform scheduler related setup. Assign this task to a CPU. */
        sched_fork(p);
            __sched_fork(p);    # just clear the scheduling-related fields of the new task descriptor.
            /*
             * We mark the process as running here. This guarantees that
             * nobody will actually run it, and a signal or other external
             * event cannot wake it up and insert it on the runqueue either.
             */
            p->state = TASK_RUNNING;


            #
            # The child task inherits the priority of the parent task.
            #
            /*
             * Make sure we do not leak PI boosting priority to the child.
             */
            p->prio = current->normal_prio;




        #
        # [*] Note these copy_XXX() functions.....
        #
        /* copy all the process information */
        retval = copy_semundo(clone_flags, p);          # IPC semaphore undo list


        retval = copy_files(clone_flags, p);            # open files table


        retval = copy_fs(clone_flags, p);               # filesystem view


        retval = copy_sighand(clone_flags, p);          # signal structures
        retval = copy_signal(clone_flags, p);


        retval = copy_mm(clone_flags, p);               # MM descriptor / address space


        retval = copy_namespaces(clone_flags, p);       # nsproxy


        retval = copy_io(clone_flags, p);               # I/O context
    


        #
        # [*] copy_thread() is important: it sets the content of the kernel mode stack of the new task.
        #
        # The implementation of copy_thread() is architecture-specific; here, we ONLY describe the generic parts:
        #
        #       Set "pt_regs" in the new task's kernel mode stack properly.
        #
        # [*]   Set "thread_struct->ip" of the new task to "ret_from_fork"; then, when the new task is switched in
        # for the first time, it will execute from the "ret_from_fork" ASM label.
        #
        retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
    
            p->thread.ip = (unsigned long) ret_from_fork;
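

            #
            # For example, on x86_32, a simplified excerpt-style sketch of copy_thread()(!^^__paraphrased, not verbatim):
            #
            #       childregs = task_pt_regs(p);
            #       *childregs = *regs;         # the child starts from a copy of the parent's "pt_regs"
            #       childregs->ax = 0;          # [*] this is why fork() returns 0 in the child
            #       childregs->sp = sp;         # the user stack pointer passed in as "stack_start"
            #
            #       p->thread.sp = (unsigned long) childregs;
            #       p->thread.ip = (unsigned long) ret_from_fork;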




        #
        # Allocate a task PID for the new task.
        #
        pid = alloc_pid(p->nsproxy->pid_ns);
        p->pid = pid_nr(pid);
        p->tgid = p->pid;
        if (clone_flags & CLONE_THREAD)     # we are forking a thread, so set its "tgid" properly
            p->tgid = current->tgid;
        
    
        #
        # Set the parent-child relationship for the new task
        #   
        /* CLONE_PARENT re-uses the old parent */
        if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
            p->real_parent = current->real_parent;
            p->parent_exec_id = current->parent_exec_id;
        } else {
            p->real_parent = current;
            p->parent_exec_id = current->self_exec_id;
        }
    
        #
        # Link the new task into various lists which represent sibling relationships.
        #
        # Link the new task into pidhash hashtable.
        #
            ...list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
            ...attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
            ...attach_pid(p, PIDTYPE_SID, task_session(current));
            ...list_add_tail(&p->sibling, &p->real_parent->children);
            ...list_add_tail_rcu(&p->tasks, &init_task.tasks);
        attach_pid(p, PIDTYPE_PID, pid);






    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {


        #
        # If the parent task wants to get the task PID of the child task, then put_user() copies the child task PID to
        # "parent_tidptr", which specifies a location in the user-level linear range of the parent task.
        #
        # [*] When forking a new process, we have COW, so put_user() will generate a page fault, and the page fault
        # handler will perform the COW.
        #
        nr = task_pid_vnr(p);


        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);




        #
        # If we are in the sys_vfork() trace, then we initialize a "completion"; the parent task will wait on this 
        # "completion" until the child task runs over and completes it.
        #
        struct completion vfork;
        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
        }       




        #
        # Wake up the new task, insert it into rq, so it can be scheduled and run.
        #
        wake_up_new_task(p);
            set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
            activate_task(rq, p, 0);
                enqueue_task(rq, p, flags);
            #ifdef CONFIG_SMP
            if (p->sched_class->task_woken)
                p->sched_class->task_woken(rq, p);
            #endif




        #
        # If we are in the sys_vfork() trace, then the parent task will wait on this "completion" for the termination of the
        # new child task.
        #
        # After the child task runs over, its do_exit() function will complete this "completion", so the parent task can continue.
        #
        if (clone_flags & CLONE_VFORK) {
            freezer_do_not_count();
            wait_for_completion(&vfork);
            freezer_count();
            ptrace_event(PTRACE_EVENT_VFORK_DONE, nr);
        }


    }   //if (!IS_ERR(p))


    
    #
    # Return the child task PID.
    #
    return nr;
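

The CLONE_VFORK handshake above is just the generic kernel "completion" synchronization pattern. Below is a minimal self-contained sketch of the same pattern(!^^__assumes a kthread plays the "child"; init_completion() / wait_for_completion() / complete() are the real kernel APIs):


    #include <linux/completion.h>
    #include <linux/kthread.h>


    static struct completion done;


    static int child_fn(void *arg)
    {
        /* ... the child does its work ... */
        complete(&done);                /* like do_exit() completing "vfork_done" */
        return 0;
    }


    static void vfork_style_wait(void)
    {
        init_completion(&done);
        kthread_run(child_fn, NULL, "child");
        wait_for_completion(&done);     /* like the CLONE_VFORK wait in do_fork() */
    }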




====================================================================================


6. task termination


Refer to:
    <<ulk>>
        /3.5. Destroying Processes


There are 2 system calls in the linux kernel to perform task termination:


    #a. exit_group() system call, corresponding to the glibc exit() function, which terminates all the LWP tasks in a multi-thread application.


    #b. _exit() system call, corresponding to the pthread_exit() function, which terminates just the calling task itself.




====================================================================================


6.1. _exit() system call - terminating one task


@@trace - sys_exit() - the service routine of _exit() system call


asmlinkage long sys_exit(int error_code)


    do_exit((error_code&0xff)<<8);


        #
        # sys_exit() is to terminate the task which is calling it, that is, "current".
        #
        struct task_struct *tsk = current;


        #
        # profiling the task exit event.
        #
        profile_task_exit(tsk);
            blocking_notifier_call_chain(&task_exit_notifier, 0, task);


        
        #
        # ptrace the terminating task.
        #
        ptrace_event(PTRACE_EVENT_EXIT, code);


        
        #
        # Set PF_EXITING flag to indicate I am exiting.
        #
        exit_signals(tsk);  /* sets PF_EXITING */
            tsk->flags |= PF_EXITING;




        #
        # Keep the exit code in "task_struct->exit_code".
        # Later, when my parent issues a wait() system call, it will be interested in it.
        #
        tsk->exit_code = code;


    
        #
        # Release the reference held by me on the MM descriptor / linear address space.
        #
        # If I am the last task in the task group, then the MM descriptor / linear address space will be destroyed.
        #
        exit_mm(tsk);
            mm_release(tsk, mm);
            mmput(mm);


        #
        # Release the references to various resources, such as IPC semaphores, shared memory, the open files table, and the
        # filesystem view. These resources may be shared by other tasks, so they may not be deallocated right away.
        #
        exit_sem(tsk);
        exit_shm(tsk);
        exit_files(tsk);
        exit_fs(tsk);




        #
        # Don't be scared by this exit_thread(); it just releases some architecture-specific resources which this 
        # task has.
        #
        # For example, on x86, it clears the IO bitmap of this task in the local-CPU TSS; on mips, it is simply a NULL
        # implementation.
        #
        exit_thread(tsk);




        #
        # exit_notify() does massive things; see its internals for details.
        #
        exit_notify(tsk, group_dead);


            /*
             * This does two things:
             *
             * A.  Make init inherit all the child processes
             * B.  Check to see if any process groups have become orphaned
             *  as a result of our exiting, and if they have any stopped
             *  jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
             */


            #
            # Note <comment> above; forget_original_parent() just reparents the children of this exiting task to 
            # a new "child_reaper", usually process 1 "init"(which is calling a wait-like system call in a loop).
            #   
            forget_original_parent(tsk);
                
                exit_ptrace(father);
                reaper = find_new_reaper(father);
                
                list_for_each_entry_safe(p, n, &father->children, sibling) {
                    struct task_struct *t = p;
                    do {
                        t->real_parent = reaper;
                        if (t->parent == father) {
                            BUG_ON(t->ptrace);
                            t->parent = t->real_parent;
                        }
                        if (t->pdeath_signal)
                            group_send_sig_info(t->pdeath_signal,
                                        SEND_SIG_NOINFO, t);
                    } while_each_thread(p, t);
                    reparent_leader(father, p, &dead_children);
                }


                list_for_each_entry_safe(p, n, &dead_children, sibling) {
                    list_del_init(&p->sibling);
                    release_task(p);
                }




            #
            # Release the various namespaces this exiting task references.
            #
            exit_task_namespaces(tsk);
                switch_task_namespaces(p, NULL);




            /* Let father know we died
             *
             * Thread signals are configurable, but you aren't going to use
             * that to send signals to arbitrary processes.
             * That stops right now.
             *
             * If the parent exec id doesn't match the exec id we saved
             * when we started then we know the parent has changed security
             * domain.
             *
             * If our self_exec id doesn't match our parent_exec_id then
             * we have changed execution domain as these two values started
             * the same after a fork.
             */
            tsk->exit_signal = SIGCHLD;


            #
            # We don't consider the case of this task being ptraced.
            #
            # Most usually, if the exiting task is a task group leader, and it is currently the last task remaining in
            # the task group, then it needs to notify its parent with "tsk->exit_signal"; later, its parent
            # will release it in a wait-like system call.
            #
            # Otherwise, if the exiting task is simply a LWP, then it has "autoreap" set, and will be released 
            # silently.
            #
            # Note that, an important flag "autoreap" is set to TRUE or FALSE here.
            #
            if (unlikely(tsk->ptrace)) {
                int sig = thread_group_leader(tsk) &&
                        thread_group_empty(tsk) &&
                        !ptrace_reparented(tsk) ?
                    tsk->exit_signal : SIGCHLD;
                autoreap = do_notify_parent(tsk, sig);
            } else if (thread_group_leader(tsk)) {
                autoreap = thread_group_empty(tsk) &&
                    do_notify_parent(tsk, tsk->exit_signal);
            } else {
                autoreap = true;
            }


                #
                # do_notify_parent() just fills in a siginfo, sends it to the parent of the exiting task, and wakes up
                # the parent.
                #
                # Then, after some time, when the parent is scheduled and runs, the parent will handle the signal which
                # its exited child sent to it, and release the exited task.
                #
                #
                # Of course, sometimes the parent has set in its signal handlers "psig" that it doesn't care about its 
                # children's exit status; in that case, "autoreap" will be set to true, and the exiting task will release itself
                # later in exit_notify().
                #
                do_notify_parent()


                    struct siginfo info;
                    
                    psig = tsk->parent->sighand;
                    if (!tsk->ptrace && sig == SIGCHLD &&
                        (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN ||
                         (psig->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDWAIT))) {
                        /*
                         * We are exiting and our parent doesn't care.  POSIX.1
                         * defines special semantics for setting SIGCHLD to SIG_IGN
                         * or setting the SA_NOCLDWAIT flag: we should be reaped
                         * automatically and not left for our parent's wait4 call.
                         * Rather than having the parent do it as a magic kind of
                         * signal handler, we just set this to tell do_exit that we
                         * can be cleaned up without becoming a zombie.  Note that
                         * we still call __wake_up_parent in this case, because a
                         * blocked sys_wait4 might now return -ECHILD.
                         *
                         * Whether we send SIGCHLD or not for SA_NOCLDWAIT
                         * is implementation-defined: we do (if you don't want
                         * it, just use SIG_IGN instead).
                         */
                        autoreap = true;
                        if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN)
                            sig = 0;
                    }                                       


                    if (valid_signal(sig) && sig)
                        __group_send_sig_info(sig, &info, tsk->parent);
                    __wake_up_parent(tsk, tsk->parent);


                    return autoreap;




            #
            # See that if "autoreap" is TRUE, the task's exit state machine goes directly to EXIT_DEAD,
            # and the task is released right here, silently.
            #
            # Otherwise, its exit state machine goes to EXIT_ZOMBIE, waiting for its parent to release it.
            #
            #
            # As seen above, the "autoreap" flag may be set for various reasons.
            #
            tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;


            /* If the process is dead, release it - nobody will wait for it */
            if (autoreap)
                release_task(tsk);


        #
        # At this point, we set the task state machine to TASK_DEAD.
        #
        # And as we saw in exit_notify() above, the task exit state machine has already been set to either EXIT_DEAD
        # or EXIT_ZOMBIE.
        #
        # The task state TASK_DEAD puts this task out of reach of the scheduler; we call schedule() right away to
        # give up the CPU, and this task will never be switched in again. If schedule() ever returns here, we
        # consider it a serious bug (hence the BUG() below).
        #
        #
        # [*] Note that, previously, we saw in exit_notify() that when "autoreap" is set, the exiting task calls
        # release_task() to release itself, yet the code path here still references the exiting task. Will this
        # cause trouble??
        #
        # No, because release_task() does NOT deallocate the task descriptor directly; it appends it to an RCU
        # callback list, and only later, when RCU runs its callbacks (after a grace period), is the task descriptor
        # deallocated by the RCU mechanism.
        #


        /* causes final put_task_struct in finish_task_switch(). */
        tsk->state = TASK_DEAD;


        schedule();


        BUG();
        /* Avoid "noreturn function does return".  */
        for (;;)
            cpu_relax();    /* For when BUG is null */
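

The POSIX semantics that do_notify_parent() implements above can be observed from user space. Below is a minimal sketch (ordinary user-space C, not kernel code; the file name is made up for illustration): when the parent sets SIGCHLD to SIG_IGN, the exiting child takes the "autoreap" path, no zombie is left behind, and wait() returns -ECHILD, exactly as the kernel comment in do_notify_parent() describes.


    /* autoreap_demo.c - a user-space sketch (not kernel code) of the POSIX
     * semantics implemented by do_notify_parent(): SIGCHLD set to SIG_IGN
     * means children are reaped automatically, and wait() returns ECHILD. */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Tell the kernel we don't care about our children's exit state. */
        signal(SIGCHLD, SIG_IGN);

        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: takes the "autoreap" path */

        sleep(1);                   /* give the child time to exit */

        /* No zombie is left behind: wait() fails with ECHILD. */
        if (wait(NULL) == -1 && errno == ECHILD)
            printf("child was autoreaped; wait() returned ECHILD\n");
        return 0;
    }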


====================================================================================


6.2. exit_group() system call - terminating a task group


@@trace - sys_exit_group() - the service routine of exit_group() system call


/*
 * this kills every thread in the thread group. Note that any externally
 * wait4()-ing process will get the correct exit code - even if this
 * thread is not the thread group leader.
 */
asmlinkage long sys_exit_group(int error_code)


    /*
     * Take down every thread in the group.  This is called by fatal signals
     * as well as by sys_exit_group (below).
     */
    do_group_exit((error_code & 0xff) << 8);


        struct signal_struct *sig = current->signal;


        #
        # signal_group_exit() tells whether or not sig->flags has SIGNAL_GROUP_EXIT set.
        #
        # If set, it means another task in the same task group has already called the exit_group() system call, so
        # the current task just calls do_exit() to take care of itself.
        #
        # Otherwise, the current task is the first one to call the exit_group() system call; it then sets
        # SIGNAL_GROUP_EXIT in "signal_struct" (which is shared by all the tasks in the same task group), and calls
        # zap_other_threads() to send SIGKILL to every other task in the same task group. SIGKILL cannot be caught;
        # when each of those tasks next processes its pending signals, the kernel's handling of the fatal signal
        # takes it into do_exit() as well.
        #
        if (signal_group_exit(sig))
            exit_code = sig->group_exit_code;
        else if (!thread_group_empty(current)) {        # there are other tasks in the same task group.


            struct sighand_struct *const sighand = current->sighand;
            spin_lock_irq(&sighand->siglock);
            if (signal_group_exit(sig))
                /* Another thread got here before we took the lock.  */
                exit_code = sig->group_exit_code;
            else {
                sig->group_exit_code = exit_code;
                sig->flags = SIGNAL_GROUP_EXIT;


                zap_other_threads(current);


                    while_each_thread(p, t) {
                        sigaddset(&t->pending.signal, SIGKILL);
                        signal_wake_up(t, 1);
                    }
            }
            spin_unlock_irq(&sighand->siglock);
        }


        #
        # The current task calls do_exit() to take care of itself.
        #
        do_exit(exit_code);
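

The difference between sys_exit() and sys_exit_group() can be seen from user space. Here is a minimal sketch (assumptions: glibc's exit() ends up in exit_group(), and the raw SYS_exit syscall is reachable via syscall(2); compile with -pthread). A worker thread that calls raw SYS_exit terminates only itself; exit() would instead run do_group_exit() and take down every task in the group.


    /* exit_group_demo.c - a sketch: raw SYS_exit terminates only the calling
     * task (sys_exit -> do_exit), while exit()/SYS_exit_group takes down the
     * whole task group via do_group_exit(). Compile with: gcc -pthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        syscall(SYS_exit, 0);       /* only this task runs do_exit() */
        return NULL;                /* never reached */
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);                   /* the worker task is gone by now */
        printf("process still alive after a thread's raw SYS_exit\n");

        /* exit() is the glibc wrapper for exit_group(): it runs
         * do_group_exit() and terminates every remaining task. */
        exit(0);
    }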




====================================================================================


6.3. release_task()


We saw in the trace of the _exit() system call that if "autoreap" is set, the exiting task calls release_task() to release itself; if "autoreap" is not set, it is the parent that calls release_task() to release its exited children(!^^__as we will see later in the wait-like system call).


As we see in the description of <<ulk>>//3.5.2. Process Removal:


     The release_task( ) function detaches the last data structures from the descriptor of a zombie process; it is 
    applied on a zombie process in two possible ways: by the do_exit( ) function if the parent is not interested in 
    receiving signals from the child, or by the wait4( ) or waitpid( ) system calls after a signal has been sent to the 
    parent. In the latter case, the function also will reclaim the memory used by the process descriptor, while in the 
    former case the memory reclaiming will be done by the scheduler (see Chapter 7). This function executes the following 
    steps:
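

Before tracing release_task(), the EXIT_ZOMBIE case is easy to observe from user space. A minimal sketch (file name made up; the ps invocation is just for illustration): a child that exits while its parent has not yet waited shows up as "Z" in ps until the parent reaps it.


    /* zombie_demo.c - a sketch to observe EXIT_ZOMBIE: the child exits, the
     * parent delays wait(), so the child lingers as a zombie ("Z" in ps). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: exit_state becomes EXIT_ZOMBIE */

        sleep(1);
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "ps -o pid,stat,comm -p %d", (int)pid);
        system(cmd);                /* STAT column shows "Z" for the child */

        wait(NULL);                 /* reap: the kernel runs release_task() */
        return 0;
    }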




@@trace - release_task()


void release_task(struct task_struct * p)


    /*
     * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
     * @task: task that should be flushed.
     *
     * When flushing dentries from proc, one needs to flush them from global
     * proc (proc_mnt) and from all the namespaces' procs this task was seen
     * in. This call is supposed to do all of this job.
     *
     * Looks in the dcache for
     * /proc/@pid
     * /proc/@tgid/task/@pid
     * if either directory is present flushes it and all of its children
     * from the dcache.
     */
    proc_flush_task(p);




    write_lock_irq(&tasklist_lock);


    /*
    * <<ulk>>//3.5.2. Process Removal
    *
    *   3. Invokes __exit_signal() to cancel any pending signal and to release the signal_struct descriptor of the process. 
    *   If the descriptor is no longer used by other lightweight processes, the function also removes this data structure. 
    *   Moreover, the function invokes exit_itimers() to detach any POSIX interval timer from the process.
    */
    __exit_signal(p);




    /*
     * If we are the last non-leader member of the thread
     * group, and the leader is zombie, then notify the
     * group leader's parent process. (if it wants notification.)
     */
    zap_leader = 0;
    leader = p->group_leader;


    /**
    * <<ulk>>//3.5.2. Process Removal
    *
    *   6. If the process is not a thread group leader, the leader is a zombie, and the process is the last member of the 
    *   thread group, the function sends a signal to the parent of the leader to notify it of the death of the process.
    */
    if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
        /*
         * If we were the last child thread and the leader has
         * exited already, and the leader's parent ignores SIGCHLD,
         * then we are the one who should release the leader.
         */
        zap_leader = do_notify_parent(leader, leader->exit_signal);
        if (zap_leader)
            leader->exit_state = EXIT_DEAD;
    }


    write_unlock_irq(&tasklist_lock);




    #
    # Don't be scared: release_thread() is just like exit_thread(); it performs some architecture-specific cleanup.
    #
    release_thread(p);
    
    #
    # Release the reference to this task via RCU (the actual drop happens in the RCU callback below).
    #
    # This reference is supposed to represent the task itself.
    #
    # [*] We saw previously in exit_notify() that if "autoreap" is TRUE, the exiting task calls release_task() on
    # itself, yet after that, the do_exit() code path still operates on the exiting task for a while. This works
    # because the task descriptor is not deallocated directly, but through RCU.
    #
    call_rcu(&p->rcu, delayed_put_task_struct);


        #
        # delayed_put_task_struct() is run by RCU; it simply releases a reference to the task descriptor, and this
        # reference in fact represents the task itself.
        #
        # If the refcount of the task descriptor drops to 0, the task descriptor and the kernel mode stack are
        # deallocated.
        #
        # If other parties still hold references to this task, the task descriptor simply stays around.
        #
        static void delayed_put_task_struct(struct rcu_head *rhp)


            put_task_struct(tsk);


            if (atomic_dec_and_test(&t->usage))
                __put_task_struct(t);


                    exit_creds(tsk);
                    
                    put_signal_struct(tsk->signal);
    
                    free_task(tsk);
            
                        free_thread_info(tsk->stack);


                        free_task_struct(tsk);
                            kmem_cache_free(task_struct_cachep, (tsk))
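

The put_task_struct() pattern above is just reference counting with a free on the last drop. Below is a simplified user-space sketch of the same idea; this is NOT the kernel API, and the struct and function names are invented for illustration (C11 atomics).


    /* refcount_sketch.c - an invented, simplified analogue of
     * get_task_struct()/put_task_struct(): whoever drops the last
     * reference frees the object, no matter who that is. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
        atomic_int usage;               /* plays the role of tsk->usage */
    };

    static struct obj *get_obj(struct obj *o)
    {
        atomic_fetch_add(&o->usage, 1); /* like get_task_struct() */
        return o;
    }

    static void put_obj(struct obj *o)
    {
        /* like put_task_struct(): free on the 1 -> 0 transition */
        if (atomic_fetch_sub(&o->usage, 1) == 1) {
            printf("last reference dropped, freeing\n");
            free(o);
        }
    }

    int main(void)
    {
        struct obj *o = malloc(sizeof(*o));
        atomic_init(&o->usage, 1);      /* the "self" reference */

        get_obj(o);                     /* a second holder, e.g. a waiter */
        put_obj(o);                     /* delayed_put_task_struct() analogue */
        put_obj(o);                     /* last drop: the object is freed */
        return 0;
    }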
                            


====================================================================================


7. wait-like system call


As for the wait-like user-level functions, just refer to the <glibc> documentation. Here, we care mainly about their kernel-level counterpart service routines.
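

For orientation, though, here is a minimal user-space usage sketch (plain POSIX C; nothing beyond standard calls is assumed) of the path that ends in sys_wait4() -> do_wait():


    /* wait_demo.c - a minimal usage sketch: waitpid() in user space ends up
     * in sys_wait4() -> do_wait() in the kernel. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0)
            _exit(42);              /* child exits with code 42 */

        int status;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            printf("child %d exited with code %d\n",
                   (int)pid, WEXITSTATUS(status));
        return 0;
    }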


    /*
     * sys_waitpid() remains for compatibility. waitpid() should be
     * implemented by calling sys_wait4() from libc.a.
     */
    asmlinkage long sys_waitpid(pid_t pid, int __user *stat_addr, int options)
    
        return sys_wait4(pid, stat_addr, options, NULL);
    
    
    asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
                    int options, struct rusage __user *ru)
    
        ret = do_wait(&wo);
    
    
    asmlinkage long sys_waitid(int which, pid_t pid,
                   struct siginfo __user *infop,
                   int options, struct rusage __user *ru)
    
        ret = do_wait(&wo);




Then, as we see above, the core is do_wait() function.


@@trace - do_wait()


static long do_wait(struct wait_opts *wo)


    #
    # Well, a wait-like system call is allowed to wait interruptibly, until it meets an exited
    # child task that matches the criteria.
    #
    init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
    wo->child_wait.private = current;
    add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);


repeat:
    /*
     * If there is nothing that can match our criteria just get out.
     * We will clear ->notask_error to zero if we see any child that
     * might later match our criteria, even if we are not able to reap
     * it yet.
     */
    wo->notask_error = -ECHILD;
    if ((wo->wo_type < PIDTYPE_MAX) &&
       (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
        goto notask;


    #
    # Perform interruptible wait, if needed.
    #
    set_current_state(TASK_INTERRUPTIBLE);


    #
    # do_wait_thread() and ptrace_do_wait() do the real job: the former handles exited children on the
    # "tsk->children" list, the latter handles exited children on the "tsk->ptraced" list.
    #
    # They both call wait_consider_task() to examine and handle the children.
    #
    tsk = current;
    do {
        retval = do_wait_thread(wo, tsk);
            list_for_each_entry(p, &tsk->children, sibling) {
                ret = wait_consider_task(wo, 0, p);
            }


        retval = ptrace_do_wait(wo, tsk);
            list_for_each_entry(p, &tsk->ptraced, ptrace_entry) {
                int ret = wait_consider_task(wo, 1, p);
            }
    
    } while_each_thread(current, tsk);




        /*
         * Consider @p for a wait by @parent.
         *
         * -ECHILD should be in ->notask_error before the first call.
         * Returns nonzero for a final return, when we have unlocked tasklist_lock.
         * Returns zero if the search for a child should continue;
         * then ->notask_error is 0 if @p is an eligible child,
         * or another error from security_task_wait(), or still -ECHILD.
         */     
        static int wait_consider_task(struct wait_opts *wo, int ptrace,
                struct task_struct *p)


            #
            # If a child's exit state machine is set to EXIT_DEAD, then it MUST have taken care of itself in its
            # do_exit() trace, with "autoreap" as TRUE.
            #
            /* dead body doesn't have much to contribute */
            if (p->exit_state == EXIT_DEAD)
                return 0;




            #
            # If a child's exit state machine is EXIT_ZOMBIE, then this child expects its parent to release it.
            #
            # The parent calls wait_task_zombie(), which essentially performs some accounting and calls release_task().
            #
            /* slay zombie? */
            if (p->exit_state == EXIT_ZOMBIE) {


                ...
                wait_task_zombie(wo, p);


                    if (p != NULL)
                        release_task(p);


            #
            # We don't care about STOPPED tasks here, because stopping is another story besides task destruction...
            #


            /*
             * Wait for stopped.  Depending on @ptrace, different stopped state
             * is used and the two don't interact with each other.
             */
            ret = wait_task_stopped(wo, ptrace, p);


            /*
             * Wait for continued.  There's only one continued state and the
             * ptracer can consume it which can confuse the real parent.  Don't
             * use WCONTINUED from ptracer.  You don't need or want it.
             */
            return wait_task_continued(wo, p);
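

The stopped and continued states that wait_task_stopped() and wait_task_continued() report can be exercised from user space with the WUNTRACED and WCONTINUED options. A minimal sketch (plain POSIX C; the file name is made up):


    /* wstop_demo.c - a sketch exercising the stopped/continued states that
     * wait_task_stopped() and wait_task_continued() report. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            pause();                /* child just sits waiting for signals */
            _exit(0);
        }

        int status;
        kill(pid, SIGSTOP);
        waitpid(pid, &status, WUNTRACED);
        if (WIFSTOPPED(status))
            printf("child stopped by signal %d\n", WSTOPSIG(status));

        kill(pid, SIGCONT);
        waitpid(pid, &status, WCONTINUED);
        if (WIFCONTINUED(status))
            printf("child continued\n");

        kill(pid, SIGKILL);         /* clean up */
        waitpid(pid, &status, 0);
        return 0;
    }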




====================================================================================


8. misc tips


====================================================================================


8.1. When forking a new thread, how pthread_create() handles thread_func


See
    <<ulk>>//3.4.1 The clone(), fork(), and vfork() System Calls




 clone() is actually a wrapper function defined in the C library (see the section "POSIX APIs and System Calls" in Chapter 10), which sets up the stack of the new lightweight process and invokes a clone() system call hidden to the programmer. The sys_clone() service routine that implements the clone() system call does not have the fn and arg parameters. In fact, the wrapper function saves the pointer fn into the child's stack position corresponding to the return address of the wrapper function itself; the pointer arg is saved on the child's stack right below fn. When the wrapper function terminates, the CPU fetches the return address from the stack and executes the fn(arg) function.
(!^^
    The user-level clone() function is a wrapper function; internally it invokes the clone() system call.


    What the clone() function actually does internally is: set up the stack of the LWP to be created, and then invoke the clone() system call.


__<!!attention>
    However, the sys_clone() service routine, which actually implements the clone() system call, has no 'fn' and 'arg' parameters.


    What really happens is: the user-level clone() wrapper function first creates the User mode stack that the child will use (!^^__it allocates a memory area and copies the contents of the parent's current stack into it; at this point, the clone() system call has not yet been invoked to actually create the child). Then, the wrapper function saves the "fn" and "arg" parameters into the stack just created for the child-to-be, and the position of "fn" within the child stack is exactly the position of the clone() function's return address in the child (!^^__<!!attention>that is, the return instruction pointer).


    This way, when the clone() function returns in the child (!^^__the RET instruction is executed), the CPU fetches the previously saved "fn" from the child's stack, and begins executing the fn(arg) function.
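

To make the description above concrete, here is a minimal sketch using the glibc clone(3) wrapper (a sketch under stated assumptions: CLONE_VM plus SIGCHLD, a malloc'ed stack, and an x86-style downward-growing stack; the raw sys_clone underneath never sees fn or arg):


    /* clone_demo.c - a sketch of the wrapper-vs-syscall split described
     * above: the glibc clone() wrapper takes fn/arg and a caller-supplied
     * stack, and arranges for fn(arg) to run in the child; the raw clone
     * system call has no fn/arg parameters. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (1024 * 1024)

    static int child_fn(void *arg)
    {
        printf("child LWP running fn(arg): %s\n", (char *)arg);
        return 0;
    }

    int main(void)
    {
        /* The caller allocates the child's User mode stack; clone() is
         * handed its top, since the stack grows downward on x86. */
        char *stack = malloc(STACK_SIZE);
        if (!stack)
            return 1;

        pid_t pid = clone(child_fn, stack + STACK_SIZE,
                          CLONE_VM | SIGCHLD, "hello from the child");
        if (pid == -1)
            return 1;

        waitpid(pid, NULL, 0);      /* reap the child LWP */
        free(stack);
        return 0;
    }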

