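Linux Process Creation
Preparation
Taking the fork function as an example, let's walk through how Linux creates a process.
Below is a short C program that creates a process with fork: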
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main (int argc, char *argv[])
{
pid_t pid;
pid = fork ();
if (pid < 0){
printf("fork() error\n");
}else if (pid == 0){
printf("child process \n");
}else {
printf("parent process \n");
}
exit(0);
}
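Process creation flow
Tracing with strace
- The user program calls the fork function provided by glibc, and fork triggers the clone system call to create the process. Below, strace traces the compiled and linked executable: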
# strace ./fork
...
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
....
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb0be0eca10) = 16273
...
- The trace log shows that fork() maps to the clone system call rather than fork. From the fork(2) man page:
> Since version 2.3.3, rather than invoking the kernel’s fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).
Check the glibc version:
[root@centosgpt first]# /lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.17
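- Because the program is dynamically linked, the trace shows the glibc shared library (libc.so.6) being opened; with static linking this would not appear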
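Flow summary
- Application
  - Compile
    - The source code calls the API provided by glibc
    - The source is compiled into an object file
    - The object file is relocatable, built with -fPIC
    - The symbol table is generated
  - Link
    - Relocation addresses are computed according to the relocation type
    - The glibc API can be linked either statically or dynamically
- glibc wraps the system call
  - Wrapping approaches
    - Script-generated wrappers
      - syscalls.list supplies the list of calls
      - make-syscall.sh generates the macro definitions
      - syscall-template.S generates the system call API
    - Inline wrappers: assembly embedded in C code
  - Trap into the kernel
    - x86 32-bit uses int 0x80
    - x86 64-bit uses syscall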
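- Kernel handling
  - System call
    - At boot, trap_init sets up the system call entry points and the functions that handle them
    - 64-bit mode additionally uses an MSR to hold the system call entry information
  - Process creation
    - clone -> the sys_call_table entry leads to _do_fork
    - Copy and initialize the task_struct: copy_process()
      - dup_task_struct: allocate the task_struct, stack, and thread_info
      - copy_creds: allocate and copy the cred structure (credentials and permissions)
      - Check whether the process limit has been reached
      - RCU setup: the list entries and lock state used by preemptible RCU
      - Initialize run-time statistics: utime, start_time, real_start_time
      - sched_fork, scheduler state: initialize the sched_entity; set the priority and scheduling class; set the task state; call the scheduling class hook
      - LSM information: allocate the security blob
      - If CLONE_SYSVSEM is set, share the semaphore undo_list
      - Initialize file and filesystem state
        - copy_files: copy the open-file information, kept in files_struct
        - copy_fs: copy the directory information (root directory/root filesystem, pwd, and so on), kept in fs_struct
      - Initialize signal state: copy signals and handlers
        - copy_sighand: with CLONE_SIGHAND increment the reference count, otherwise allocate a new one
        - copy_signal: with CLONE_THREAD do nothing, otherwise allocate a new one
      - Copy the address space: allocate and copy the mm_struct and the memory mappings; with CLONE_VM share it, otherwise duplicate it
      - Copy namespaces: check the flags CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWCGROUP; if none is set share the parent's, otherwise create new ones
      - Copy the I/O context: with CLONE_IO share it, otherwise create a new one
      - Copy thread TLS: with CLONE_SETTLS set up TLS for the child thread; set sp, sp0, io_bitmap_ptr, gs, TIF_IO_BITMAP
      - Allocate a pid
    - Wake the new process: wake_up_new_task()
      - state = TASK_RUNNING; activate_task uses the scheduling class to enqueue the child
      - enqueue_entity calls update_curr to update the run-time statistics, then adds the task to the queue
      - check_preempt_curr checks whether the child may preempt: if task_fork_fair has already marked the parent for rescheduling because sysctl_sched_child_runs_first is set, it returns immediately; otherwise it compares further and calls resched_curr to mark the preemption
      - If the parent has been marked for preemption, the child gets scheduled as the fork system call returns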
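Application side
Compile stage
The object file leaves a placeholder for each function call that needs relocation and records it in the relocation and symbol tables.
gcc option: -fPIC, as in 'gcc -g -c -fPIC $<'
- The call site in the object file holds a placeholder: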
# objdump -S -d ./fork.o
...
int
main (int argc, char *argv[])
{
...
pid = fork ();
f: e8 00 00 00 00 callq 14 <main+0x14>
...
- The symbol table is generated; an entry whose Ndx is UND is undefined and awaits relocation, so its value is 0000000000000000
# readelf -s ./fork
...
Symbol table '.symtab' contains 71 entries:
Num: Value Size Type Bind Vis Ndx Name
70: 0000000000000000 0 FUNC GLOBAL DEFAULT UND fork@@GLIBC_2.2.5
...
- Inspect the relocation records; the relocation type is R_X86_64_PLT32
> /arch/x86/include/asm/elf.h
> #define R_X86_64_PLT32 4 /* 32 bit PLT address */
# objdump -r fork.o
fork.o: file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
...
OFFSET TYPE VALUE
0000000000000010 R_X86_64_PLT32 fork-0x0000000000000004
...
Link stage
- The relocation address is computed according to the relocation type
Name | Value | Field | Calculation |
---|---|---|---|
R_X86_64_PLT32 | 4 | word32 | L + A - P |
- A: Represents the addend used to compute the value of the relocatable field.
- B: Represents the base address at which a shared object has been loaded into memory during execution. Generally, a shared object is built with a 0 base virtual address, but the execution address will be different.
- G: Represents the offset into the global offset table at which the relocation entry’s symbol will reside during execution.
- GOT: Represents the address of the global offset table.
- L: Represents the place (section offset or address) of the Procedure Linkage Table entry for a symbol.
- P: Represents the place (section offset or address) of the storage unit being relocated (computed using r_offset).
- S: Represents the value of the symbol whose index resides in the relocation entry.
x86_64_abi: the relocation definitions for x86_64 are the same as in the AMD64 ABI; more to be added later.
- Verify the relocated function address
- Get the PLT address
L = 00000000004004a0
# objdump -S -D ./fork
...
00000000004004a0 <fork@plt>:
4004a0: ff 25 8a 0b 20 00 jmpq *0x200b8a(%rip) # 601030 <fork@GLIBC_2.2.5>
4004a6: 68 03 00 00 00 pushq $0x3
4004ab: e9 b0 ff ff ff jmpq 400460 <.plt>
...
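- The addend. x86_64 uses RELA relocations, so the addend comes from the relocation record (fork-0x0000000000000004 in the objdump -r output), not from the zeroed placeholder bytes in the object file
A = -4
# objdump -S -d ./fork.o
...
pid = fork ();
f: e8 00 00 00 00 callq 14 <main+0x14>
...
- The place being relocated is the 4-byte operand of callq, one byte past the opcode at 4005bc: P = 4005bd
# objdump -S -D ./fork
00000000004005ad <main>:
pid = fork ();
4005bc: e8 df fe ff ff callq 4004a0 <fork@plt>
4005c1: 89 45 fc mov %eax,-0x4(%rbp)
- The relocated value
ADDR = L + A - P = 0x4004a0 + (-4) - 0x4005bd = 0xfffffedf (as a 32-bit value, i.e. -0x121)
Because x86_64 stores data in little-endian order, the bytes in the instruction are df fe ff ff.
callq uses relative addressing: the target is the address of the next instruction plus the relative displacement, which gives the function entry address.
0x4005c1 + 0xfffffffffffffedf = 0x4004a0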
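- Loading the shared library
- The function address is obtained through the PLT, which jumps through the GOT to reach the real address
> The PLT is executable but not writable; only the GOT is writable.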
# objdump -S -D ./fork
00000000004004a0 <fork@plt>:
4004a0: ff 25 8a 0b 20 00 jmpq *0x200b8a(%rip) # 601030 <fork@GLIBC_2.2.5>
4004a6: 68 03 00 00 00 pushq $0x3
4004ab: e9 b0 ff ff ff jmpq 400460 <.plt>
- The addresses stored in the GOT. objdump disassembles this data as if it were instructions; the bytes starting at 601030 are a6 04 40, i.e. 0x4004a6
# objdump -S -D ./fork
Disassembly of section .got:
0000000000600ff8 <.got>:
...
Disassembly of section .got.plt:
0000000000601000 <_GLOBAL_OFFSET_TABLE_>:
601000: 28 0e sub %cl,(%rsi)
601002: 60 (bad)
...
601017: 00 76 04 add %dh,0x4(%rsi)
60101a: 40 00 00 add %al,(%rax)
60101d: 00 00 add %al,(%rax)
60101f: 00 86 04 40 00 00 add %al,0x4004(%rsi)
601025: 00 00 add %al,(%rax)
601027: 00 96 04 40 00 00 add %dl,0x4004(%rsi)
60102d: 00 00 add %al,(%rax)
60102f: 00 a6 04 40 00 00 add %ah,0x4004(%rsi)
601035: 00 00 add %al,(%rax)
...
- Run-time loading
- Start debugging and set a breakpoint at the fork call
[root@centosgpt first]# gdb fork
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
...
(gdb) b 10
Breakpoint 1 at 0x4005bc: file fork.c, line 10.
(gdb) run
Starting program: /root/src/first/fork
Breakpoint 1, __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:55
55 {
- Inspect the PLT entry for fork: it jumps to the address stored in GOT[6] (0x601030)
(gdb) x/i 0x4004a0
0x4004a0 <fork@plt>: jmpq *0x200b8a(%rip) # 0x601030
- Inspect GOT[6]: it initially holds the address of the instruction that follows the jmpq in fork@plt
(gdb) x/10a 0x601000
0x601000: 0x600e28 0x0
0x601010: 0x0 0x400476 <puts@plt+6>
0x601020: 0x400486 <__libc_start_main@plt+6> 0x400496 <exit@plt+6>
0x601030: 0x4004a6 <fork@plt+6> 0x0
- The next instruction in fork@plt pushes the argument 3 and jumps to PLT[0]
(gdb) x/2i 0x4004a6
0x4004a6 <fork@plt+6>: pushq $0x3
0x4004ab <fork@plt+11>: jmpq 0x400460
- PLT[0] pushes GOT[1] (0x601008) as an argument and jumps to the address stored in GOT[2] (0x601010)
(gdb) x/2i 0x400460
0x400460: pushq 0x200ba2(%rip) # 0x601008
0x400466: jmpq *0x200ba4(%rip) # 0x601010
- With everything in place, watch how the address of fork gets resolved
(gdb) set output-radix 16
Output radix now set to decimal 16, hex 10, octal 20.
(gdb) watch *0x601030
Hardware watchpoint 2: *0x601030
(gdb) cont
Continuing.
Hardware watchpoint 2: *0x601030
Old value = 0x4004a6
New value = 0xf7ad2fe0
_dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:149
149 }
- The GOT has been updated; from now on a jump through the PLT reaches the function address directly
(gdb) x/10a 0x601000
0x601000: 0x600e28 0x7ffff7ffe150
0x601010: 0x7ffff7df1890 <_dl_runtime_resolve_xsave> 0x400476 <puts@plt+6>
0x601020: 0x7ffff7a303a0 <__libc_start_main> 0x400496 <exit@plt+6>
0x601030: 0x7ffff7ad2fe0 <__libc_fork> 0x0
0x601040: 0x0 0x0
gdb prints this warning:
warning: the debug information found in “/usr/lib/debug//lib64/ld-2.17.so.debug” does not match “/lib64/ld-linux-x86-64.so.2” (CRC mismatch).
Installing the debuginfo package as suggested fixes it:
debuginfo-install glibc-2.17-260.el7_6.6.x86_64
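After repeating the debugging session several times, I found that the GOT entries were already resolved to real addresses as soon as the program started under the debugger.
I then ran the following: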
rm /etc/ld.so.cache
glibc side
Wrapping approaches
Script-generated wrappers
- The syscalls.list file
# File name Caller Syscall name # args Strong name Weak names
socket - socket i:iii __socket socket
- make-syscall.sh processes the file above and generates macro definitions
#define SYSCALL_NAME socket
- syscall-template.S uses the macros above to generate the system call stub; DO_CALL implements the architecture-specific call
#define PSEUDO(name, syscall_name, args) \
lose: \
jmp JUMPTARGET(syscall_error) \
.globl syscall_error; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
jb lose
sysdeps\x86_64\sysdep.h
Trapping into kernel mode
x86 (32-bit) mode:
- syscall-template.S uses the macros generated by make-syscall.sh to define how the system call is implemented
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
ret
T_PSEUDO_END (SYSCALL_SYMBOL)
sysdeps\unix\syscall-template.S
- The system call itself is made through DO_CALL
#define PSEUDO(name, syscall_name, args) \
.text; \
ENTRY (name) \
DO_CALL (syscall_name, args); \
cmpl $-4095, %eax; \
jae SYSCALL_ERROR_LABEL
- SYS_ify yields the system call number, after which ENTER_KERNEL is invoked
#define DO_CALL(syscall_name, args) \
PUSHARGS_##args \
DOARGS_##args \
movl $SYS_ify (syscall_name), %eax; \
ENTER_KERNEL \
POPARGS_##args
- On x86 the trap is triggered with int $0x80
#define ENTER_KERNEL int $0x80
sysdeps\unix\sysv\linux\i386\sysdep.h
- int 0x80 traps into the kernel as a software interrupt
x86_64 mode:
- In x86_64 mode the syscall instruction traps into the kernel
# undef DO_CALL
# define DO_CALL(syscall_name, args) \
DOARGS_##args \
movl $SYS_ify (syscall_name), %eax; \
syscall;
sysdeps\unix\sysv\linux\x86_64\sysdep.h
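Inline system call wrappers
The assembly for the system call is embedded directly in the C code; the call path below uses this approach.
Call path
glibc wraps the commonly used Linux system calls and hides the architecture-specific complexity from the user.
- Definition of fork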
weak_alias (__libc_fork, fork)
- __libc_fork calls arch_fork
pid_t
__libc_fork (void)
{
pid_t pid;
...
pid = arch_fork (&THREAD_SELF->tid);
...
return pid;
}
glibc-2.29\sysdeps\nptl\fork.c
- arch_fork issues the clone system call
static inline pid_t
arch_fork (void *ctid)
{
const int flags = CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD;
long int ret;
#ifdef __ASSUME_CLONE_BACKWARDS
# ifdef INLINE_CLONE_SYSCALL
ret = INLINE_CLONE_SYSCALL (flags, 0, NULL, 0, ctid);
# else
ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, 0, ctid);
# endif
#elif defined(__ASSUME_CLONE_BACKWARDS2)
ret = INLINE_SYSCALL_CALL (clone, 0, flags, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE_BACKWARDS3)
ret = INLINE_SYSCALL_CALL (clone, flags, 0, 0, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE2)
ret = INLINE_SYSCALL_CALL (clone2, flags, 0, 0, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE_DEFAULT)
ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, ctid, 0);
#else
# error "Undefined clone variant"
#endif
return ret;
}
sysdeps\unix\sysv\linux\kernel-features.h
Version 2.29
*__ASSUME_CLONE_BACKWARDS: for variant 1.*
*__ASSUME_CLONE_BACKWARDS2: for variant 2 (s390).*
*__ASSUME_CLONE_BACKWARDS3: for variant 3 (microblaze).*
*__ASSUME_CLONE_DEFAULT: for variant 4.*
*__ASSUME_CLONE2: for clone2 with variant 3 (ia64).*
- INLINE_SYSCALL_CALL
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
sysdeps\unix\sysdep.h
- __INLINE_SYSCALL_NARGS_X yields the number of arguments, and __SYSCALL_CONCAT pastes that number onto the macro name to select the matching macro
#define __SYSCALL_CONCAT_X(a,b) a##b
#define __SYSCALL_CONCAT(a,b) __SYSCALL_CONCAT_X (a, b)
#define __INLINE_SYSCALL_NARGS_X(a,b,c,d,e,f,g,h,n,...) n
#define __INLINE_SYSCALL_NARGS(...) \
__INLINE_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
- One macro per argument count
#define __INLINE_SYSCALL0(name) \
INLINE_SYSCALL (name, 0)
#define __INLINE_SYSCALL1(name, a1) \
INLINE_SYSCALL (name, 1, a1)
#define __INLINE_SYSCALL2(name, a1, a2) \
INLINE_SYSCALL (name, 2, a1, a2)
#define __INLINE_SYSCALL3(name, a1, a2, a3) \
INLINE_SYSCALL (name, 3, a1, a2, a3)
#define __INLINE_SYSCALL4(name, a1, a2, a3, a4) \
INLINE_SYSCALL (name, 4, a1, a2, a3, a4)
#define __INLINE_SYSCALL5(name, a1, a2, a3, a4, a5) \
INLINE_SYSCALL (name, 5, a1, a2, a3, a4, a5)
#define __INLINE_SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
INLINE_SYSCALL (name, 6, a1, a2, a3, a4, a5, a6)
#define __INLINE_SYSCALL7(name, a1, a2, a3, a4, a5, a6, a7) \
INLINE_SYSCALL (name, 7, a1, a2, a3, a4, a5, a6, a7)
- x86_64 defines INLINE_SYSCALL, which calls INTERNAL_SYSCALL internally
# undef INLINE_SYSCALL
# define INLINE_SYSCALL(name, nr, args...) \
({ \
unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
{ \
__set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
resultvar = (unsigned long int) -1; \
} \
(long int) resultvar; })
sysdeps\unix\sysv\linux\x86_64\sysdep.h
- INTERNAL_SYSCALL expands to one of internal_syscall0 through internal_syscall6, and each internal_syscallN executes the syscall instruction
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
internal_syscall##nr (SYS_ify (name), err, args)
...
#undef internal_syscall0
#define internal_syscall0(number, err, dummy...) \
({ \
unsigned long int resultvar; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
- SYS_ify maps the name of the glibc API to its system call number (__NR_*), which is defined in the system header /usr/include/asm/unistd.h
#undef SYS_ify
#define SYS_ify(syscall_name) __NR_##syscall_name
sysdeps\unix\sysv\linux\x86_64\sysdep.h
Kernel side
System calls
x86 (32-bit) mode:
int 0x80 traps into the kernel as a software interrupt
void __init trap_init(void)
{
...
idt_setup_traps();
...
}
arch/x86/kernel/traps.c
...
void __init idt_setup_traps(void)
{
idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
}
- When a 0x80 interrupt arrives, entry_INT80_32 is invoked
static const __initconst struct idt_data def_idts[] = {
#if defined(CONFIG_IA32_EMULATION)
SYSG(IA32_SYSCALL_VECTOR, entry_INT80_compat),
#elif defined(CONFIG_X86_32)
SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
#endif
};
/arch/x86/kernel/idt.c
- 32-bit system calls take up to 6 arguments; SAVE_ALL saves all registers, then do_int80_syscall_32 is called
ENTRY(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1 /* save rest */
...
movl %esp, %eax
call do_int80_syscall_32
...
INTERRUPT_RETURN
linux-5.3-rc3\arch\x86\entry\entry_32.S
- Execution ends up in do_syscall_32_irqs_on, which uses the system call number saved from eax to look up the matching function in ia32_sys_call_table and call it
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
...
do_syscall_32_irqs_on(regs);
}
static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
...
struct thread_info *ti = current_thread_info();
unsigned int nr = (unsigned int)regs->orig_ax;
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
syscall_return_slowpath(regs);
...
}
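- After the system call the kernel returns to user mode: the iret instruction restores the saved user-mode context, including the code segment and the instruction pointer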
#define INTERRUPT_RETURN iret
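x86_64 mode:
The syscall instruction traps into the kernel using a model-specific register (MSR), which the kernel writes and reads through wrmsrl and rdmsr. When the syscall instruction executes, the CPU loads the kernel entry address from this register (MSR_LSTAR).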
void __init trap_init(void)
{
cpu_init();
}
void cpu_init(void)
{
...
syscall_init();
..
}
void syscall_init(void)
{
...
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
...
}
ENTRY(entry_SYSCALL_64)
...
call do_syscall_64 /* returns with IRQs disabled */
USERGS_SYSRET64
...
END(entry_SYSCALL_64)
linux-5.3-rc3\arch\x86\entry\entry_64.S
Both modes dispatch through a system call table:
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
...
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
...
}
syscall_return_slowpath(regs);
}
The kernel returns to user mode through sysretq:
#define USERGS_SYSRET64 \
swapgs; \
sysretq;
Process creation
- The clone system call ends up calling _do_fork
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
int __user *, parent_tidptr,
int __user *, child_tidptr,
unsigned long, tls)
{
...
return _do_fork(&args);
}
linux-5.3-rc3\kernel\fork.c
- _do_fork copies the process information, then wakes the new process
long _do_fork(struct kernel_clone_args *args)
{
...
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
...
wake_up_new_task(p);
...
}
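- Copying: copy_process()
- The flags argument
Flag | Description |
---|---|
CLONE_CHILD_CLEARTID | Clear the child TID at the user-space address child_tidptr when the child exits, and wake any waiter on that address |
CLONE_CHILD_SETTID | Store the child TID at the user-space address child_tidptr |
CLONE_FILES | Share the table of open files |
CLONE_FS | Share filesystem information |
CLONE_IO | Share the I/O context |
CLONE_NEWCGROUP | Create the process in a new cgroup namespace |
CLONE_NEWIPC | Use a new IPC namespace |
CLONE_NEWNET | Use a new network namespace |
CLONE_NEWNS | Give the child its own mount namespace |
CLONE_NEWPID | Use a new PID namespace |
CLONE_NEWUSER | Use a new user namespace |
CLONE_PARENT | The new process becomes a sibling of the caller (it shares the caller's parent) |
CLONE_PARENT_SETTID | Store the child TID at parent_tidptr in the parent |
CLONE_PID | The child gets the same PID as the caller (obsolete) |
CLONE_PTRACE | If the caller is being traced, trace the child as well |
CLONE_SETTLS | Set up TLS for the new thread |
CLONE_SIGHAND | Share the table of signal handlers |
CLONE_STOPPED | Force the child to start in the stopped state |
CLONE_SYSVSEM | Share System V semaphore undo values |
CLONE_THREAD | Parent and child are in the same thread group |
CLONE_UNTRACED | A tracer cannot force CLONE_PTRACE on the child (used for kernel threads) |
CLONE_VFORK | The parent sleeps until the child exits or calls execve |
CLONE_VM | Parent and child share the address space |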
- dup_task_struct: allocates the task_struct, stack, and thread_info
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
    struct task_struct *tsk;
    unsigned long *stack;
    struct vm_struct *stack_vm_area __maybe_unused;
    int err;
    /* allocate the task_struct */
    tsk = alloc_task_struct_node(node);
    /* allocate the stack */
    stack = alloc_thread_stack_node(tsk, node);
    stack_vm_area = task_stack_vm_area(tsk);
    err = arch_dup_task_struct(tsk, orig);
    tsk->stack = stack;
    /* set up thread_info */
    setup_thread_stack(tsk, orig);
    clear_user_return_notifier(tsk);
    clear_tsk_need_resched(tsk);
    set_task_stack_end_magic(tsk);
    ...
}
- copy_creds: allocates a cred structure and copies the credentials and permission information
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
    ...
    new = prepare_creds();
    ...
    atomic_inc(&new->user->processes);
    p->cred = p->real_cred = get_cred(new);
    alter_cred_subscribers(new, 2);
    validate_creds(new);
    ...
}
- Check whether the process limit has been reached
if (atomic_read(&p->real_cred->user->processes) >= task_rlimit(p, RLIMIT_NPROC)) {
- RCU setup: the list entries and lock state used by preemptible RCU
static inline void rcu_copy_process(struct task_struct *p)
{
#ifdef CONFIG_PREEMPT_RCU
    p->rcu_read_lock_nesting = 0;
    p->rcu_read_unlock_special.s = 0;
    p->rcu_blocked_node = NULL;
    INIT_LIST_HEAD(&p->rcu_node_entry);
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_TASKS_RCU
    p->rcu_tasks_holdout = false;
    INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
    p->rcu_tasks_idle_cpu = -1;
#endif /* #ifdef CONFIG_TASKS_RCU */
}
- Initialize the run-time statistics utime, start_time, real_start_time
...
p->utime = p->stime = p->gtime = 0;
...
- sched_fork, scheduler state: initialize the sched_entity; set the priority and scheduling class; set the task state; call the scheduling class hook
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    __sched_fork(clone_flags, p);
    p->state = TASK_NEW;
    p->prio = current->normal_prio;
    ...
    if (dl_prio(p->prio))
        return -EAGAIN;
    else if (rt_prio(p->prio))
        p->sched_class = &rt_sched_class;
    else
        p->sched_class = &fair_sched_class;
    ...
    __set_task_cpu(p, smp_processor_id());
    ...
}
/* back in copy_process(): initialize the perf event context of the task */
perf_event_init_task(p);
/* allocate the audit context block */
audit_alloc(p);
- LSM information: allocate the security blob
int security_task_alloc(struct task_struct *task, unsigned long clone_flags)
{
    ...
    lsm_task_alloc(task);
    call_int_hook(task_alloc, 0, task, clone_flags);
    ...
}
- If CLONE_SYSVSEM is set, the semaphore undo_list is shared
int copy_semundo(unsigned long clone_flags, struct task_struct *tsk)
{
    struct sem_undo_list *undo_list;
    int error;
    if (clone_flags & CLONE_SYSVSEM) {
        error = get_undo_list(&undo_list);
        if (error)
            return error;
        refcount_inc(&undo_list->refcnt);
        tsk->sysvsem.undo_list = undo_list;
    } else
        tsk->sysvsem.undo_list = NULL;
    return 0;
}
- Initialize file and filesystem state, depending on CLONE_FILES and CLONE_FS
- copy_files: copies the open-file information, kept in files_struct
static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
    struct files_struct *oldf, *newf;
    ...
    oldf = current->files;
    ...
    if (clone_flags & CLONE_FILES) {
        ...
        atomic_inc(&oldf->count);
        ...
    }
    ...
    newf = dup_fd(oldf, &error);
    tsk->files = newf;
    ...
}
- copy_fs: copies the directory information (root directory/root filesystem, pwd, and so on), kept in fs_struct
static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
    struct fs_struct *fs = current->fs;
    ...
    if (clone_flags & CLONE_FS) {
        ...
        spin_lock(&fs->lock);
        fs->users++;
        spin_unlock(&fs->lock);
        return 0;
    }
    tsk->fs = copy_fs_struct(fs);
    ...
    return 0;
}
- Initialize signal state: copy signals and handlers, depending on CLONE_SIGHAND
- copy_sighand: with CLONE_SIGHAND increment the reference count, otherwise allocate a new one
static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
{
    struct sighand_struct *sig;
    if (clone_flags & CLONE_SIGHAND) {
        refcount_inc(&current->sighand->count);
        return 0;
    }
    sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
    rcu_assign_pointer(tsk->sighand, sig);
    ...
    refcount_set(&sig->count, 1);
    ...
    memcpy(sig->action, current->sighand->action, sizeof(sig->action));
    ...
    return 0;
}
- copy_signal: with CLONE_THREAD do nothing, otherwise allocate a new one
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    struct signal_struct *sig;
    if (clone_flags & CLONE_THREAD)
        return 0;
    sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
    tsk->signal = sig;
    ...
    sig->nr_threads = 1;
    atomic_set(&sig->live, 1);
    refcount_set(&sig->sigcnt, 1);
    ...
}
- Copy the address space: allocate and copy the mm_struct and the memory mappings; with CLONE_VM share it, otherwise duplicate it
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    ...
    oldmm = current->mm;
    ...
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }
    retval = -ENOMEM;
    mm = dup_mm(tsk, current->mm);
    ...
}
- Copy namespaces: if none of CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWCGROUP is set, the parent's nsproxy is shared; otherwise new namespaces are created
int copy_namespaces(unsigned long flags, struct task_struct *tsk)
{
    struct nsproxy *old_ns = tsk->nsproxy;
    struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
    struct nsproxy *new_ns;
    if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
                          CLONE_NEWPID | CLONE_NEWNET |
                          CLONE_NEWCGROUP)))) {
        get_nsproxy(old_ns);
        return 0;
    }
    ...
    new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
    ...
}
- Copy the I/O context: with CLONE_IO the parent's io_context is shared, otherwise a new one is created
static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
{
    ...
    if (clone_flags & CLONE_IO) {
        ioc_task_link(ioc);
        tsk->io_context = ioc;
    } else if (ioprio_valid(ioc->ioprio)) {
        new_ioc = get_task_io_context(tsk, GFP_KERNEL, NUMA_NO_NODE);
        ...
    }
    ...
}
- Copy thread TLS: with CLONE_SETTLS set up TLS for the child thread; set sp, sp0, io_bitmap_ptr, gs, TIF_IO_BITMAP
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
                    unsigned long arg, struct task_struct *p, unsigned long tls)
{
    ...
    p->thread.sp = (unsigned long) fork_frame;
    p->thread.sp0 = (unsigned long) (childregs+1);
    ...
    if (unlikely(p->flags & PF_KTHREAD)) {
        ...
        p->thread.io_bitmap_ptr = NULL;
        ...
    }
    ...
    if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
        p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
                                          IO_BITMAP_BYTES, GFP_KERNEL);
        if (!p->thread.io_bitmap_ptr) {
            p->thread.io_bitmap_max = 0;
            ...
        }
        set_tsk_thread_flag(p, TIF_IO_BITMAP);
    }
    ...
    task_user_gs(p) = get_user_gs(current_pt_regs());
    ...
    if (clone_flags & CLONE_SETTLS)
        err = do_set_thread_area(p, -1, (struct user_desc __user *)tls, 0);
    ...
}
- Allocate a pid
struct pid *alloc_pid(struct pid_namespace *ns)
{
    ...
    pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
    tmp = ns;
    pid->level = ns->level;
    for (i = ns->level; i >= 0; i--) {
        int pid_min = 1;
        if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
            pid_min = RESERVED_PIDS;
        nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
        pid->numbers[i].nr = nr;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
    }
    ...
    return pid;
}
- Wake-up: wake_up_new_task()
void wake_up_new_task(struct task_struct *p)
{
...
p->state = TASK_RUNNING;
...
check_preempt_curr(rq, p, WF_FORK);
...
}
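- state = TASK_RUNNING; activate_task uses the scheduling class to enqueue the child
- The scheduling class function is called; for a normal process this is the CFS enqueue function, which updates the run-time statistics and adds the task to the queue
- check_preempt_curr checks whether the child may preempt: if task_fork_fair has already marked the parent for rescheduling because sysctl_sched_child_runs_first is set, it returns immediately; otherwise it compares further and calls resched_curr to mark the preemption
- If the parent has been marked for preemption, the child gets scheduled as the fork system call returns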
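Reading kernel code was overwhelming at first. I had been following Liu Chao's column 《趣谈Linux操作系统》 on 极客时间 (Geek Time), and this post is my summary of it. Listening to the column is not enough; you have to try things yourself. While writing this up I found many concepts were still unclear to me and kept searching on Google. Without the ARTS check-in I really would have given up :(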