Linux Process Creation

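Preparation

Taking the fork function as the example, this post walks through how Linux creates a process.

The following C program creates a process with fork: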

#include <unistd.h>   /* fork(), pid_t */
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char *argv[])
{
  pid_t pid;

  pid = fork ();
  if (pid < 0){
      printf("fork() error\n");
  }else if (pid == 0){
      printf("child process \n");
  }else {
      printf("parent process \n");
  }
  exit(0);
}

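Process creation flow

Tracing with strace

  • The user program calls the fork function provided by glibc, and fork issues the clone system call to create the process. Running strace on the compiled and linked executable shows this: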
# strace  ./fork
...
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
....
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
,child_tidptr=0x7fb0be0eca10) = 16273
...
  • The trace shows that fork() ends up in the clone system call rather than the fork system call

From the fork(2) man page:
Since version 2.3.3, rather than invoking the kernel’s fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).
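
The equivalence described in the man page can be tried out directly. The following is a minimal sketch, not glibc's implementation: it issues clone through the generic syscall() wrapper with only SIGCHLD as the flags. The helper names fork_via_clone and child_status_via_clone are made up for this example, and the argument order (flags, stack, parent_tid, child_tid, tls) is the x86-64 one, so other architectures may need a different order.

```c
#define _GNU_SOURCE
#include <signal.h>      /* SIGCHLD */
#include <sys/syscall.h> /* SYS_clone */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>      /* syscall(), _exit() */

/* Hypothetical helper: acts like fork() by calling clone with SIGCHLD only.
 * Assumes the x86-64 argument order: flags, stack, parent_tid, child_tid, tls. */
pid_t fork_via_clone(void)
{
    return (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);
}

/* Creates a child with the helper, lets it exit with `code`, and returns
 * the exit status collected by the parent (or -1 on any error). */
int child_status_via_clone(int code)
{
    int status = 0;
    pid_t pid = fork_via_clone();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                 /* child: leave immediately */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because this bypasses the glibc wrapper, handlers registered with pthread_atfork(3) do not run, so the child in this sketch does nothing except call _exit().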

Check the glibc version
[root@centosgpt first]# /lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.17

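  • Because the program is dynamically linked, the trace shows the glibc shared library (libc.so.6) being opened at startup; with static linking this would not appear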

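Flow summary

  • Application
    • Compile
      • The source calls the API provided by glibc
      • The source is compiled into an object file
      • The object file carries relocation entries (-fPIC)
      • The symbol table is generated
    • Link
      • The relocated address is computed according to the relocation type
      • The glibc API can be linked either statically or dynamically
  • glibc wraps the system call
    • Wrapping methods
      • Script-generated wrappers
        • syscalls.list provides the list of calls
        • make-syscall.sh generates the macro definitions
        • syscall-template.S produces the system call API
      • Inline wrappers: assembly embedded in C code
    • Trap into the kernel
      • x86 32-bit uses int 0x80
      • x86 64-bit uses syscall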
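  • Kernel handling
    • System call
      • At boot, trap_init sets up the system call entry points and the functions that handle them
      • 64-bit mode additionally uses an MSR to hold the system call entry information
    • Process creation
      • clone -> sys_call_table -> _do_fork
      • Copy and initialize the task_struct: copy_process()
        • dup_task_struct: allocate the task_struct, stack and thread_info
        • copy_creds: allocate a cred structure and copy the credentials into it
        • Check whether the process limit has been reached
        • RCU setup: the list heads and the lock state used by preemptible RCU
        • Initialize the run-time statistics: utime, start_time, real_start_time
        • sched_fork sets up the scheduling structures: initialize the sched_entity; set the priority and scheduling class; set the task state; call the scheduling-class hook
        • LSM state: allocate the security data
        • With CLONE_SYSVSEM, share the semaphore undo_list
        • Initialize the file and file system state
          • copy_files: copy the table of open files, kept in files_struct
          • copy_fs: copy the directory information of the process (root directory/root file system, pwd and so on), kept in fs_struct
        • Initialize the signal state: copy the signals and their handlers
          • copy_sighand: with CLONE_SIGHAND the reference count is incremented, otherwise a new structure is created
          • copy_signal: with CLONE_THREAD nothing is done, otherwise a new structure is created
        • Copy the memory space: allocate and copy the mm_struct and the memory mappings; with CLONE_VM the address space is shared, otherwise a new one is created
        • Copy the namespaces: check the flags CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWCGROUP; if none is set the existing namespaces are shared, otherwise new ones are created
        • Copy the I/O context: with CLONE_IO it is shared, otherwise a new one is created
        • Copy the thread state and TLS: with CLONE_SETTLS a TLS is set up for the child; sets sp, sp0, io_bitmap_ptr, gs and TIF_IO_BITMAP
        • Allocate the pid
      • Wake the new process: wake_up_new_task()
        • state = TASK_RUNNING; activate_task uses the scheduling class to enqueue the child
        • enqueue_entity calls update_curr to update the run-time statistics before adding the task to the queue
        • check_preempt_curr decides whether the new task may preempt the current one; if task_fork_fair has already handled sysctl_sched_child_runs_first it returns immediately, otherwise it compares further and calls resched_curr to mark the current task for preemption
        • If the parent has been marked for preemption, the child is scheduled on the return path of the fork system call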

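Application part

Compile stage

The object file leaves a placeholder at each call to a function that must be relocated, and records the function in the relocation and symbol tables.

gcc option: -fPIC, for example 'gcc -g -c -fPIC $<'

  • The call site in the object file holds a placeholder: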
# objdump -S -d ./fork.o
...
int
main (int argc, char *argv[])
{
...
  pid = fork ();
   f:   e8 00 00 00 00          callq  14 <main+0x14>
...
  • The matching symbol table entry is generated; UND marks a symbol that still has to be resolved, and its value is 0000000000000000
# readelf -s ./fork
...
Symbol table '.symtab' contains 71 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    70: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fork@@GLIBC_2.2.5
...
  • Inspect the relocation information; the relocation type is R_X86_64_PLT32
    > /arch/x86/include/asm/elf.h
    > #define R_X86_64_PLT32 4 /* 32 bit PLT address */
# objdump -r fork.o
fork.o:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text]:
...
OFFSET           TYPE              VALUE
0000000000000010 R_X86_64_PLT32    fork-0x0000000000000004
...

Link stage

  • The relocated address is computed according to the relocation type
Name            Value  Field   Calculation
R_X86_64_PLT32  4      word32  L + A - P

A
Represents the addend used to compute the value of the relocatable field.
B
Represents the base address at which a shared object has been loaded into memory during execution. Generally, a shared object is built with a 0 base virtual address, but the execution address will be different.
G
Represents the offset into the global offset table at which the relocation entry’s symbol will reside during execution.
GOT
Represents the address of the global offset table.
L
Represents the place (section offset or address) of the Procedure Linkage Table entry for a symbol.
P
Represents the place (section offset or address) of the storage unit being relocated (computed using r_offset).
S
Represents the value of the symbol whose index resides in the relocation entry.

x86_64_abi: the relocation definitions for x86_64 are the same as those in the AMD64 ABI; more details to be added later

  • Verify the relocated address of the function
    • Get the PLT address: L = 00000000004004a0
# objdump -S -D  ./fork
...
00000000004004a0 <fork@plt>:
  4004a0:       ff 25 8a 0b 20 00       jmpq   *0x200b8a(%rip)        # 601030 <fork@GLIBC_2.2.5>
  4004a6:       68 03 00 00 00          pushq  $0x3
  4004ab:       e9 b0 ff ff ff          jmpq   400460 <.plt>
...
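  • The addend A: the 4-byte operand in the object file holds 00 00 00 00, and because x86_64 uses RELA relocations the addend comes from the relocation record instead (fork-0x4 in the objdump -r output), so A = -4
# objdump -S -d ./fork.o
...
  pid = fork ();
   f:   e8 00 00 00 00          callq  14 <main+0x14>
...
  • The location of the storage unit being relocated: P = 4005bd, the 4-byte operand that follows the e8 opcode at 4005bc
# objdump -S -D  ./fork
00000000004005ad <main>:
  pid = fork ();
  4005bc:       e8 df fe ff ff          callq  4004a0 <fork@plt>
  4005c1:       89 45 fc                mov    %eax,-0x4(%rbp)

  • The relocated value
    ADDR = L + A - P = 00000000004004a0 + (-4) - 00000000004005bd = fffffffffffffedf, truncated to the 32-bit field as fffffedf
    Because x86_64 is little-endian, the field is stored as the bytes df fe ff ff

callq uses relative addressing: the target is the address of the instruction after the call plus the relative displacement, which gives the function entry address 00000000004005c1 + fffffffffffffedf = 4004a0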
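
The arithmetic can be replayed in a few lines of C. This is only a sketch of the L + A - P rule and of how a near call computes its target. The function names are invented for the example, and the inputs are read off the objdump output above: L = 0x4004a0, A = -4 from the relocation record fork-0x4, and P = 0x4005bd, the operand right after the e8 opcode.

```c
#include <stdint.h>

/* R_X86_64_PLT32: the 32-bit result of L + A - P. */
uint32_t reloc_plt32(uint64_t L, int64_t A, uint64_t P)
{
    return (uint32_t)(L + (uint64_t)A - P);
}

/* Target of a near call: address of the next instruction plus the
 * sign-extended 32-bit displacement. */
uint64_t call_target(uint64_t next_insn, uint32_t disp32)
{
    return next_insn + (uint64_t)(int64_t)(int32_t)disp32;
}

/* Byte `index` (0..3) of the displacement as stored in memory on a
 * little-endian machine: least significant byte first. */
unsigned le32_byte(uint32_t value, int index)
{
    return (value >> (8 * index)) & 0xffu;
}
```

With the values above, reloc_plt32(0x4004a0, -4, 0x4005bd) yields 0xfffffedf, whose bytes in memory order are df fe ff ff, matching the e8 df fe ff ff encoding in the linked executable.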

  • Loading the dynamic library
    • The function address is obtained through the PLT, which jumps via the GOT to reach the real address
      > The PLT is executable but not writable; only the GOT is writable.
# objdump -S -D  ./fork
00000000004004a0 <fork@plt>:
  4004a0:       ff 25 8a 0b 20 00       jmpq   *0x200b8a(%rip)        # 601030 <fork@GLIBC_2.2.5>
  4004a6:       68 03 00 00 00          pushq  $0x3
  4004ab:       e9 b0 ff ff ff          jmpq   400460 <.plt>
  • The addresses stored in the GOT
# objdump -S -D  ./fork
Disassembly of section .got:

0000000000600ff8 <.got>:
        ...

Disassembly of section .got.plt:

0000000000601000 <_GLOBAL_OFFSET_TABLE_>:
  601000:       28 0e                   sub    %cl,(%rsi)
  601002:       60                      (bad)
        ...
  601017:       00 76 04                add    %dh,0x4(%rsi)
  60101a:       40 00 00                add    %al,(%rax)
  60101d:       00 00                   add    %al,(%rax)
  60101f:       00 86 04 40 00 00       add    %al,0x4004(%rsi)
  601025:       00 00                   add    %al,(%rax)
  601027:       00 96 04 40 00 00       add    %dl,0x4004(%rsi)
  60102d:       00 00                   add    %al,(%rax)
  60102f:       00 a6 04 40 00 00       add    %ah,0x4004(%rsi)
  601035:       00 00                   add    %al,(%rax)
        ...

  • Run-time loading
  • Start debugging with a breakpoint at the fork call
[root@centosgpt first]# gdb fork
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
...
(gdb) b 10
Breakpoint 1 at 0x4005bc: file fork.c, line 10.
(gdb) run
Starting program: /root/src/first/fork

Breakpoint 1, __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:55
55      {
  • Inspect the PLT entry for fork: it jumps to the address stored in GOT[6]
(gdb) x/i 0x4004a0
0x4004a0 <fork@plt>: jmpq   *0x200b8a(%rip)        # 0x601030
  • Inspect GOT[6]: it holds the address of the instruction that follows the jump in fork@plt
(gdb) x/10a 0x601000
0x601000:       0x600e28        0x0
0x601010:       0x0     0x400476 <puts@plt+6>
0x601020:       0x400486 <__libc_start_main@plt+6>      0x400496 <exit@plt+6>
0x601030:       0x4004a6 <fork@plt+6>   0x0
  • The next instruction in fork@plt pushes the argument 3 and jumps to PLT[0]
(gdb) x/2i 0x4004a6
0x4004a6 <fork@plt+6>:       pushq  $0x3
0x4004ab <fork@plt+11>:      jmpq   0x400460
  • PLT[0] pushes GOT[1] as an argument and jumps to the address in GOT[2]
(gdb) x/2i 0x400460
0x400460:    pushq  0x200ba2(%rip)        # 0x601008
0x400466:    jmpq   *0x200ba4(%rip)        # 0x601010

Everything is ready; the next step is to watch how the address of fork gets loaded

(gdb) set output-radix 16
Output radix now set to decimal 16, hex 10, octal 20.
(gdb) watch *0x601030
Hardware watchpoint 2: *0x601030

(gdb) cont
Continuing.
Hardware watchpoint 2: *0x601030

Old value = 0x4004a6
New value = 0xf7ad2fe0
_dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:149
149     }
  • The GOT has been updated; from now on the jump in the PLT reaches the function address directly
(gdb) x/10a 0x601000
0x601000:       0x600e28        0x7ffff7ffe150
0x601010:       0x7ffff7df1890 <_dl_runtime_resolve_xsave>      0x400476 <puts@plt+6>
0x601020:       0x7ffff7a303a0 <__libc_start_main>      0x400496 <exit@plt+6>
0x601030:       0x7ffff7ad2fe0 <__libc_fork>    0x0
0x601040:       0x0     0x0
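
The lazy binding seen in the gdb session can be modelled in plain C with a function pointer that starts out pointing at a resolver stub. This is a user-space illustration only; none of these names exist in glibc or in the dynamic linker, and the real resolver works on relocation indexes rather than on a single hard-wired slot.

```c
/* A toy model of lazy binding; every name here is made up for the sketch. */
typedef int (*func_t)(int);

int resolver_runs = 0;          /* how many times the "dynamic linker" ran */

/* Stands in for the real function (the role of __libc_fork above). */
int real_function(int x)
{
    return x + 1;
}

/* Stands in for fork@plt+6 -> PLT[0] -> resolver. */
int lazy_stub(int x);

/* The "GOT slot": initially it points back at the stub. */
func_t got_slot = lazy_stub;

int lazy_stub(int x)
{
    resolver_runs++;
    got_slot = real_function;   /* patch the slot once */
    return got_slot(x);         /* then continue into the real function */
}

/* Stands in for the indirect jump through the GOT at the start of the PLT entry. */
int call_through_plt(int x)
{
    return got_slot(x);
}
```

The first call to call_through_plt goes through lazy_stub and rewrites got_slot; every later call lands in real_function directly, which is the same before/after picture as the two x/10a dumps of 0x601030.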

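gdb prints this warning:
warning: the debug information found in "/usr/lib/debug//lib64/ld-2.17.so.debug" does not match "/lib64/ld-linux-x86-64.so.2" (CRC mismatch).
Installing the package named in the hint fixes it:
debuginfo-install glibc-2.17-260.el7_6.6.x86_64
After repeating the debugging session a few times, the GOT entry turned out to hold the real address as soon as the program started.
The following command was then run:
rm /etc/ld.so.cache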

glibc part

Wrapping methods

Script-generated wrappers

  • The syscalls.list file
# File name Caller  Syscall name    # args  Strong name Weak names
socket      -   socket      i:iii   __socket    socket
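  • make-syscall.sh processes the file above and generates macro definitions such as #define SYSCALL_NAME socket
  • syscall-template.S uses those macros to generate the system call; DO_CALL implements the call for each architecture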

#define    PSEUDO(name, syscall_name, args)                      \
lose:                                         \
  jmp JUMPTARGET(syscall_error)                           \
  .globl syscall_error;                               \
  ENTRY (name)                                    \
  DO_CALL (syscall_name, args);                           \
  jb lose

sysdeps\x86_64\sysdep.h

trapping to kernel mode

x86 (32-bit) mode:

  • syscall-template.S uses the macros generated by make-syscall.sh to define how the system call is implemented
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
    ret
T_PSEUDO_END (SYSCALL_SYMBOL)

sysdeps\unix\syscall-template.S

  • The system call itself is made through DO_CALL
#define    PSEUDO(name, syscall_name, args)                      \
  .text;                                      \
  ENTRY (name)                                    \
    DO_CALL (syscall_name, args);                         \
    cmpl $-4095, %eax;                                \
    jae SYSCALL_ERROR_LABEL
  • SYS_ify yields the system call number, then ENTER_KERNEL is invoked
#define DO_CALL(syscall_name, args)                              \
    PUSHARGS_##args                               \
    DOARGS_##args                                 \
    movl $SYS_ify (syscall_name), %eax;                       \
    ENTER_KERNEL                                  \
    POPARGS_##args
  • On x86 the trap is triggered with int $0x80
#define ENTER_KERNEL int $0x80

sysdeps\unix\sysv\linux\i386\sysdep.h
The int 0x80 software interrupt traps into the kernel

x86_64 mode:

  • In x86_64 mode the syscall instruction is used to trap into the kernel
# undef DO_CALL
# define DO_CALL(syscall_name, args)        \
    DOARGS_##args               \
    movl $SYS_ify (syscall_name), %eax;     \
    syscall;

sysdeps\unix\sysv\linux\x86_64\sysdep.h
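
To see the same mechanism outside glibc, the sketch below issues a zero-argument system call with inline assembly, in the spirit of the DO_CALL macro above. It is an illustration under stated assumptions, not glibc code: raw_syscall0 and raw_getpid are invented names, the assembly path is taken only on x86-64 with a GCC-compatible compiler, and other architectures fall back to the generic syscall() wrapper.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Issue a system call that takes no arguments. */
long raw_syscall0(long number)
{
#if defined(__x86_64__)
    long result;
    /* rax carries the number in and the result out; the syscall
     * instruction itself overwrites rcx and r11. */
    __asm__ volatile ("syscall"
                      : "=a" (result)
                      : "0" (number)
                      : "rcx", "r11", "memory");
    return result;
#else
    /* Other architectures: use the generic libc wrapper instead. */
    return syscall(number);
#endif
}

/* getpid is a convenient target because its result is easy to check. */
long raw_getpid(void)
{
    return raw_syscall0(SYS_getpid);
}
```

Calling raw_getpid() should return the same value as getpid(); unlike the glibc wrappers, this sketch does not translate a negative kernel return value into errno.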

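Inline system calls

The assembly for the system call is embedded directly in the C code. The call path traced below uses this method.

Call path

glibc wraps the commonly used Linux kernel system calls and hides the differences between architectures from the user.

  • Definition of fork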
weak_alias (__libc_fork, fork)
  • __libc_fork calls arch_fork
pid_t
__libc_fork (void)
{
  pid_t pid;
...
  pid = arch_fork (&THREAD_SELF->tid);
...
  return pid;
}

glibc-2.29\sysdeps\nptl\fork.c

  • arch_fork issues the clone system call
static inline pid_t
arch_fork (void *ctid)
{
  const int flags = CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD;
  long int ret;
#ifdef __ASSUME_CLONE_BACKWARDS
# ifdef INLINE_CLONE_SYSCALL
  ret = INLINE_CLONE_SYSCALL (flags, 0, NULL, 0, ctid);
# else
  ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, 0, ctid);
# endif
#elif defined(__ASSUME_CLONE_BACKWARDS2)
  ret = INLINE_SYSCALL_CALL (clone, 0, flags, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE_BACKWARDS3)
  ret = INLINE_SYSCALL_CALL (clone, flags, 0, 0, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE2)
  ret = INLINE_SYSCALL_CALL (clone2, flags, 0, 0, NULL, ctid, 0);
#elif defined(__ASSUME_CLONE_DEFAULT)
  ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, ctid, 0);
#else
# error "Undefined clone variant"
#endif
  return ret;
}

sysdeps\unix\sysv\linux\kernel-features.h
glibc 2.29
__ASSUME_CLONE_BACKWARDS: for variant 1.
__ASSUME_CLONE_BACKWARDS2: for variant 2 (s390).
__ASSUME_CLONE_BACKWARDS3: for variant 3 (microblaze).
__ASSUME_CLONE_DEFAULT: for variant 4.
__ASSUME_CLONE2: for clone2 with variant 3 (ia64).

  • INLINE_SYSCALL_CALL
    #define INLINE_SYSCALL_CALL(...) \
    __INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
    

    sysdeps\unix\sysdep.h

  • __INLINE_SYSCALL_NARGS_X picks out the number of arguments, and __SYSCALL_CONCAT pastes that number onto the macro name to select the matching macro
#define __SYSCALL_CONCAT_X(a,b)     a##b
#define __SYSCALL_CONCAT(a,b)       __SYSCALL_CONCAT_X (a, b)
#define __INLINE_SYSCALL_NARGS_X(a,b,c,d,e,f,g,h,n,...) n
#define __INLINE_SYSCALL_NARGS(...) \
  __INLINE_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
#define __INLINE_SYSCALL_DISP(b,...) \
  __SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
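
The counting trick is easier to follow in isolation. The sketch below rebuilds the same three macros under DEMO_ names and uses them to dispatch to add0 ... add3, which are made-up stand-ins for __INLINE_SYSCALL0 ... __INLINE_SYSCALL7. As in glibc, the first argument is a name that is not counted.

```c
/* All DEMO_* and add* names are invented for this sketch. */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b)   DEMO_CONCAT_X(a, b)

/* The ninth argument is returned; the real arguments push the
 * descending list 7..0 to the right, so the ninth slot ends up
 * holding the number of arguments that follow the name. */
#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) \
    DEMO_NARGS_X(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Paste the count onto the base name and call the resulting macro. */
#define DEMO_DISP(b, ...) \
    DEMO_CONCAT(b, DEMO_NARGS(__VA_ARGS__))(__VA_ARGS__)

#define add0(name)          0
#define add1(name, a)       (a)
#define add2(name, a, b)    ((a) + (b))
#define add3(name, a, b, c) ((a) + (b) + (c))

#define ADD(...) DEMO_DISP(add, __VA_ARGS__)

/* ADD(demo, 1, 2, 3) expands to add3(demo, 1, 2, 3). */
int add_demo(void)
{
    return ADD(demo, 1, 2, 3);
}
```

The extra DEMO_CONCAT_X level matters: it lets DEMO_NARGS(...) expand to a number before ## pastes it, otherwise the result would be the literal token addDEMO_NARGS.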
  • One macro for each argument count
#define __INLINE_SYSCALL0(name) \
  INLINE_SYSCALL (name, 0)
#define __INLINE_SYSCALL1(name, a1) \
  INLINE_SYSCALL (name, 1, a1)
#define __INLINE_SYSCALL2(name, a1, a2) \
  INLINE_SYSCALL (name, 2, a1, a2)
#define __INLINE_SYSCALL3(name, a1, a2, a3) \
  INLINE_SYSCALL (name, 3, a1, a2, a3)
#define __INLINE_SYSCALL4(name, a1, a2, a3, a4) \
  INLINE_SYSCALL (name, 4, a1, a2, a3, a4)
#define __INLINE_SYSCALL5(name, a1, a2, a3, a4, a5) \
  INLINE_SYSCALL (name, 5, a1, a2, a3, a4, a5)
#define __INLINE_SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
  INLINE_SYSCALL (name, 6, a1, a2, a3, a4, a5, a6)
#define __INLINE_SYSCALL7(name, a1, a2, a3, a4, a5, a6, a7) \
  INLINE_SYSCALL (name, 7, a1, a2, a3, a4, a5, a6, a7)
  • On x86_64, INLINE_SYSCALL is a wrapper that calls INTERNAL_SYSCALL
# undef INLINE_SYSCALL
# define INLINE_SYSCALL(name, nr, args...) \
  ({                                          \
    unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);        \
    if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, )))        \
      {                                       \
    __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));           \
    resultvar = (unsigned long int) -1;                   \
      }                                       \
    (long int) resultvar; })

sysdeps\unix\sysv\linux\x86_64\sysdep.h

  • INTERNAL_SYSCALL expands to one of internal_syscall0 through internal_syscall6,
    each of which executes the syscall instruction
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...)           \
    internal_syscall##nr (SYS_ify (name), err, args)
...
#undef internal_syscall0
#define internal_syscall0(number, err, dummy...)           \
({                                  \
    unsigned long int resultvar;                    \
    asm volatile (                          \
    "syscall\n\t"                           \
    : "=a" (resultvar)                          \
    : "0" (number)                          \
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);            \
    (long int) resultvar;                       \
})
  • SYS_ify turns the API name provided by glibc into the matching system call number; the numbers come from the system header /usr/include/asm/unistd.h
#undef SYS_ify
#define SYS_ify(syscall_name)  __NR_##syscall_name

sysdeps\unix\sysv\linux\x86_64\sysdep.h

Kernel part

System call

x86 (32-bit) mode:

  • The int 0x80 software interrupt traps into the kernel
void __init trap_init(void)
{
    ...
    idt_setup_traps();
    ...
}

arch/x86/kernel/traps.c

...
void __init idt_setup_traps(void)
{
    idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
}
  • When interrupt 0x80 arrives, entry_INT80_32 is called
static const __initconst struct idt_data def_idts[] = {
#if defined(CONFIG_IA32_EMULATION)
    SYSG(IA32_SYSCALL_VECTOR,   entry_INT80_compat),
#elif defined(CONFIG_X86_32)
    SYSG(IA32_SYSCALL_VECTOR,   entry_INT80_32),
#endif
};

/arch/x86/kernel/idt.c

  • A 32-bit system call takes up to 6 arguments; SAVE_ALL saves all registers, then do_int80_syscall_32 is called
ENTRY(entry_INT80_32)
    ASM_CLAC
    pushl   %eax            /* pt_regs->orig_ax */
    SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1    /* save rest */
    ...
    movl    %esp, %eax
    call    do_int80_syscall_32
    ...
    INTERRUPT_RETURN

linux-5.3-rc3\arch\x86\entry\entry_32.S

  • Execution eventually reaches do_syscall_32_irqs_on, which uses the system call number saved from eax to look up the handler in ia32_sys_call_table and call it
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
    ...
    do_syscall_32_irqs_on(regs);
}

static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
    ...
    struct thread_info *ti = current_thread_info();
    unsigned int nr = (unsigned int)regs->orig_ax;

    regs->ax = ia32_sys_call_table[nr](
        (unsigned int)regs->bx, (unsigned int)regs->cx,
        (unsigned int)regs->dx, (unsigned int)regs->si,
        (unsigned int)regs->di, (unsigned int)regs->bp);

    syscall_return_slowpath(regs);
    ...
}
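  • After the system call, the iret instruction returns to user mode and restores the saved user-mode context, including the code segment and the instruction pointer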
#define INTERRUPT_RETURN       iret

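x86_64 mode:

The syscall instruction traps into the kernel. It relies on a model-specific register (MSR), which is written and read with wrmsr/rdmsr (wrmsrl in the kernel code below); when syscall executes, the CPU loads the system call entry address from this register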

void __init trap_init(void)
{
    cpu_init();
}
void cpu_init(void)
{
   ...
    syscall_init();
  ..
}
void syscall_init(void)
{
  ...
    wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
  ...
}

ENTRY(entry_SYSCALL_64)
...
  call  do_syscall_64       /* returns with IRQs disabled */
    USERGS_SYSRET64
...
END(entry_SYSCALL_64)

linux-5.3-rc3\arch\x86\entry\entry_64.S

Both modes dispatch through a system call table

__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
    struct thread_info *ti;
...
        nr = array_index_nospec(nr, NR_syscalls);
        regs->ax = sys_call_table[nr](regs);
...
    }
syscall_return_slowpath(regs);

}
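
The dispatch step amounts to a bounds check followed by an indexed call through a table of function pointers. The following user-space sketch mirrors that shape; demo_regs, demo_table and the two handlers are invented for the example, and the speculation barrier that array_index_nospec provides in the kernel is left out.

```c
#include <errno.h>
#include <stddef.h>

/* A tiny stand-in for pt_regs: two argument slots and the result slot. */
struct demo_regs {
    long di, si;
    long ax;
};

typedef long (*demo_syscall_fn)(const struct demo_regs *);

static long demo_add(const struct demo_regs *r) { return r->di + r->si; }
static long demo_neg(const struct demo_regs *r) { return -r->di; }

/* The "system call table": the call number is the index. */
static const demo_syscall_fn demo_table[] = { demo_add, demo_neg };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Look up the handler for `nr` and store its result in regs->ax;
 * unknown numbers leave -ENOSYS there. */
long demo_dispatch(unsigned long nr, struct demo_regs *regs)
{
    regs->ax = -ENOSYS;
    if (nr < DEMO_NR_SYSCALLS)
        regs->ax = demo_table[nr](regs);
    return regs->ax;
}

/* Convenience wrapper that builds the register block from two values. */
long demo_call(unsigned long nr, long a, long b)
{
    struct demo_regs regs = { a, b, 0 };
    return demo_dispatch(nr, &regs);
}
```

For example, demo_call(0, 2, 3) runs demo_add and returns 5, while an out-of-range number such as demo_call(99, 0, 0) returns -ENOSYS.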

sysretq returns to user mode

#define USERGS_SYSRET64                \
    swapgs;                 \
    sysretq;

Process creation

  • The clone system call also ends up in _do_fork

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
         int __user *, parent_tidptr,
         int __user *, child_tidptr,
         unsigned long, tls)
{
    ...
    return _do_fork(&args);
}

linux-5.3-rc3\kernel\fork.c

  • Copy the process information, then wake the new process
long _do_fork(struct kernel_clone_args *args)
{
  ...
    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
  ...
    wake_up_new_task(p);
  ...
}
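  • Copying: copy_process()
    • The flags parameter
Flag                  Description
CLONE_CHILD_CLEARTID  Clear the child's TID at the user-space address child_tidptr when the child exits, and wake a waiter on that address
CLONE_CHILD_SETTID    Store the child's TID at the user-space address child_tidptr
CLONE_FILES           Share the table of open files
CLONE_FS              Share the file system information
CLONE_IO              Share the I/O context
CLONE_NEWCGROUP       Create the process in a new cgroup namespace
CLONE_NEWIPC          Use a new IPC namespace
CLONE_NEWNET          Use a new network namespace
CLONE_NEWNS           Give the child its own mount namespace
CLONE_NEWPID          Use a new PID namespace
CLONE_NEWUSER         Use a new user namespace
CLONE_PARENT          The child gets the caller's parent, so caller and child are siblings
CLONE_PARENT_SETTID   Store the child's TID at parent_tidptr in the parent's memory
CLONE_PID             The child has the same PID as the caller (obsolete)
CLONE_PTRACE          If the caller is being traced, trace the child as well
CLONE_SETTLS          Set up thread-local storage (TLS) for the child
CLONE_SIGHAND         Share the table of signal handlers
CLONE_STOPPED         Force the child to start in the stopped state (obsolete)
CLONE_SYSVSEM         Share the System V semaphore undo list
CLONE_THREAD          Place the child in the caller's thread group
CLONE_UNTRACED        A tracer cannot force CLONE_PTRACE on the child
CLONE_VFORK           Suspend the parent until the child calls execve or exits
CLONE_VM              Parent and child share the address space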
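
Two of the flags in the table are easy to observe from user space. The sketch below passes CLONE_PARENT_SETTID and CLONE_CHILD_SETTID to a raw clone call and checks that the kernel filled in both TID slots. It is an experiment rather than a recommended way to create processes: settid_flags_work is an invented name, the flag values come from <linux/sched.h>, and the argument order is the x86-64 one.

```c
#define _GNU_SOURCE
#include <linux/sched.h>   /* CLONE_* flag values */
#include <signal.h>        /* SIGCHLD */
#include <sys/syscall.h>   /* SYS_clone, SYS_gettid */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 when both TID slots were filled in as the flags promise,
 * 0 when they were not, and -1 on error.
 * Assumes the x86-64 order: flags, stack, parent_tid, child_tid, tls. */
int settid_flags_work(void)
{
    pid_t parent_slot = 0, child_slot = 0;
    int status = 0;
    unsigned long flags = CLONE_PARENT_SETTID | CLONE_CHILD_SETTID | SIGCHLD;

    long pid = syscall(SYS_clone, flags, 0UL, &parent_slot, &child_slot, 0UL);
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the kernel stored our TID in our copy of child_slot. */
        _exit(child_slot == (pid_t)syscall(SYS_gettid) ? 0 : 1);
    }
    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid)
        return -1;
    /* Parent: CLONE_PARENT_SETTID stored the child's TID in parent_slot. */
    return parent_slot == (pid_t)pid
        && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Flags that share state, such as CLONE_VM, need a separate stack for the child and are better exercised through the glibc clone() wrapper, so they are left out of this sketch.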
  • dup_task_struct: allocates the task_struct, the stack and the thread_info
    static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
    {
        struct task_struct *tsk;
        unsigned long *stack;
        struct vm_struct *stack_vm_area __maybe_unused;
        int err;

        // Allocate the task_struct
        tsk = alloc_task_struct_node(node);

        // Allocate the stack
        stack = alloc_thread_stack_node(tsk, node);

        stack_vm_area = task_stack_vm_area(tsk);
        err = arch_dup_task_struct(tsk, orig);
        tsk->stack = stack;

        // Set up the thread_info
        setup_thread_stack(tsk, orig);

        clear_user_return_notifier(tsk);
        clear_tsk_need_resched(tsk);
        set_task_stack_end_magic(tsk);
        ...
        return tsk;
    }
    
  • copy_creds: allocates a cred structure and copies the credentials into it
    int copy_creds(struct task_struct *p, unsigned long clone_flags)
    {
     ...
     new = prepare_creds();
     ...
     atomic_inc(&new->user->processes);
     p->cred = p->real_cred = get_cred(new);
     alter_cred_subscribers(new, 2);
     validate_creds(new);
     ...
    }
    
  • Check whether the process limit has been reached
    if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
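
The limit the kernel compares against here is the same RLIMIT_NPROC value a process can read with getrlimit(2); when it is exceeded, fork fails with EAGAIN. A small sketch for reading it follows. The function names are made up for this example.

```c
#include <sys/resource.h>

/* Reads the per-user process limit.
 * Returns 0 on success and fills in the soft and hard values. */
int read_nproc_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NPROC, &lim) != 0)
        return -1;
    *soft = lim.rlim_cur;   /* the value task_rlimit() compares against */
    *hard = lim.rlim_max;   /* ceiling for raising the soft limit */
    return 0;
}

/* Sanity check: the soft limit never exceeds the hard limit.
 * Returns 1 if that holds, 0 if not, -1 if the limit cannot be read. */
int nproc_soft_within_hard(void)
{
    rlim_t soft, hard;

    if (read_nproc_limit(&soft, &hard) != 0)
        return -1;
    return soft <= hard;
}
```

RLIM_INFINITY is the largest rlim_t value, so the comparison in nproc_soft_within_hard also holds when no limit is set.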
    
  • RCU setup: the list heads and the lock state used by preemptible RCU
    static inline void rcu_copy_process(struct task_struct *p)
    {
    #ifdef CONFIG_PREEMPT_RCU
      p->rcu_read_lock_nesting = 0;
      p->rcu_read_unlock_special.s = 0;
      p->rcu_blocked_node = NULL;
      INIT_LIST_HEAD(&p->rcu_node_entry);
    #endif /* #ifdef CONFIG_PREEMPT_RCU */
    #ifdef CONFIG_TASKS_RCU
      p->rcu_tasks_holdout = false;
      INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
      p->rcu_tasks_idle_cpu = -1;
    #endif /* #ifdef CONFIG_TASKS_RCU */
    }
    
  • Initialize the run-time statistics: utime, start_time, real_start_time
    ...
    p->utime = p->stime = p->gtime = 0;
    ...
    
  • sched_fork sets up the scheduling structures: initialize the sched_entity; set the priority and scheduling class; set the task state; call the scheduling-class hook
    int sched_fork(unsigned long clone_flags, struct task_struct *p)
    {
        __sched_fork(clone_flags, p);
        p->state = TASK_NEW;
        p->prio = current->normal_prio;
        ...
        else if (rt_prio(p->prio))
            p->sched_class = &rt_sched_class;
        else
            p->sched_class = &fair_sched_class;
        ...
        __set_task_cpu(p, smp_processor_id());
        ...
    }
    // Initialize the perf event context of the task
    perf_event_init_task(p);
    // Allocate the audit context block
    audit_alloc(p);
    
  • LSM state: allocate the security data
    int security_task_alloc(struct task_struct *task, unsigned long clone_flags)
    {
    ...
    lsm_task_alloc(task);
    call_int_hook(task_alloc, 0, task, clone_flags);
    ...
    }
    
  • With CLONE_SYSVSEM set, the semaphore undo_list is shared
    int copy_semundo(unsigned long clone_flags, struct task_struct *tsk)
    {
    struct sem_undo_list *undo_list;
    
    if (clone_flags & CLONE_SYSVSEM) {
        error = get_undo_list(&undo_list);
        if (error)
            return error;
        refcount_inc(&undo_list->refcnt);
        tsk->sysvsem.undo_list = undo_list;
      } else
        tsk->sysvsem.undo_list = NULL;
    
    return 0;
    }
    
  • Initialize the file and file system state, depending on CLONE_FILES and CLONE_FS
    • copy_files: copies the table of open files, kept in files_struct
      static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
      {
      struct files_struct *oldf, *newf;
      ...
      oldf = current->files;
      ...
      if (clone_flags & CLONE_FILES) {
      ...
      atomic_inc(&oldf->count);
      ...
      }
      ...
      newf = dup_fd(oldf, &error);
      tsk->files = newf;
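
From user space, the effect of the dup_fd path (no CLONE_FILES, which is what fork uses) is that the child gets its own descriptor table. The sketch below checks this with a pipe: the child closes its copy of the write end, and the parent can still write through its own. The function name is invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent's descriptor still works after the child closed
 * its own copy, 0 if it does not, -1 on setup errors. */
int parent_fd_survives_child_close(void)
{
    int fds[2];
    char byte = 'x';
    int ok;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[1]);          /* only changes the child's descriptor table */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    ok = (write(fds[1], &byte, 1) == 1);   /* parent's copy is still open */
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

The two tables are separate, but their entries refer to the same open files, which is why a file offset moved by the child is also seen by the parent.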
      
    • copy_fs: copies the directory information of the process (root directory/root file system, pwd and so on), kept in fs_struct
      static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
      {
      struct fs_struct *fs = current->fs;
      ...
      if (clone_flags & CLONE_FS) {
      ...
          spin_lock(&fs->lock);
          fs->users++;
        spin_unlock(&fs->lock);
      }
      tsk->fs = copy_fs_struct(fs);
      ...
      return 0;
      }
      
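  • Initialize the signal state: copy the signals and their handlers, depending on CLONE_SIGHAND
    • copy_sighand: with CLONE_SIGHAND the reference count is incremented, otherwise a new structure is created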
      static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
      {
      struct sighand_struct *sig;
      
      if (clone_flags & CLONE_SIGHAND) {
          refcount_inc(&current->sighand->count);
            return 0;
        }
      sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
      rcu_assign_pointer(tsk->sighand, sig);
      
      refcount_set(&sig->count, 1);
      memcpy(sig->action, current->sighand->action, sizeof(sig->action));
      }
      
    • copy_signal: with CLONE_THREAD nothing is done, otherwise a new structure is created
      static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
      {
       struct signal_struct *sig;
      
       if (clone_flags & CLONE_THREAD)
         return 0;
      
       sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
       tsk->signal = sig;
       sig->nr_threads = 1;
       atomic_set(&sig->live, 1);
       refcount_set(&sig->sigcnt, 1);
       ...
      
  • Copy the memory space: allocate and copy the mm_struct and the memory mappings; with CLONE_VM the address space is shared, otherwise a new one is created
    static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct mm_struct *mm, *oldmm;
        ...
        oldmm = current->mm;
        ...
        if (clone_flags & CLONE_VM) {
            mmget(oldmm);
            mm = oldmm;
            goto good_mm;
        }

        retval = -ENOMEM;
        mm = dup_mm(tsk, current->mm);
        ...
    }
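
Without CLONE_VM the child works on its own copy of the address space, so a write in the child never shows up in the parent. The sketch below demonstrates that with an ordinary fork; the variable and function names are invented for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in the data segment, which dup_mm duplicates for the child. */
static volatile int private_value = 1;

/* The child overwrites private_value; the function returns what the parent
 * sees afterwards (1 if the address spaces are separate), or -1 on error. */
int parent_value_after_child_write(void)
{
    int status = 0;
    pid_t pid;

    private_value = 1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        private_value = 99;             /* touches only the child's copy */
        _exit(private_value == 99 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return private_value;               /* unchanged in the parent */
}
```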
    
  • Copy the namespaces: check the flags CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWCGROUP; if none is set the existing namespaces are shared, otherwise new ones are created
    int copy_namespaces(unsigned long flags, struct task_struct *tsk)
    {
      struct nsproxy *old_ns = tsk->nsproxy;
      struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
      struct nsproxy *new_ns;
    
      if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
          CLONE_NEWPID | CLONE_NEWNET |
          CLONE_NEWCGROUP)))) {
      get_nsproxy(old_ns);
      return 0;
     }
     ...
      new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
      ...
    }
    
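  • Copy the I/O context: with CLONE_IO it is shared, otherwise a new one is created
    static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct io_context *ioc = current->io_context;
        struct io_context *new_ioc;
        ...
        if (clone_flags & CLONE_IO) {
            ioc_task_link(ioc);
            tsk->io_context = ioc;
        } else if (ioprio_valid(ioc->ioprio)) {
            new_ioc = get_task_io_context(tsk, GFP_KERNEL, NUMA_NO_NODE);
            ...
        }
        ...
    }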
    
  • Copy the thread state and TLS: with CLONE_SETTLS a TLS is set up for the child; sets sp, sp0, io_bitmap_ptr, gs and TIF_IO_BITMAP
    int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
        unsigned long arg, struct task_struct *p, unsigned long tls)
    {
        ...
        p->thread.sp = (unsigned long) fork_frame;
        p->thread.sp0 = (unsigned long) (childregs+1);

        if (unlikely(p->flags & PF_KTHREAD)) {
            ...
            p->thread.io_bitmap_ptr = NULL;
            ...
        }

        if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
            p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
                            IO_BITMAP_BYTES, GFP_KERNEL);
            if (!p->thread.io_bitmap_ptr) {
                p->thread.io_bitmap_max = 0;
                ...
            }
            set_tsk_thread_flag(p, TIF_IO_BITMAP);
        }

        task_user_gs(p) = get_user_gs(current_pt_regs());

        if (clone_flags & CLONE_SETTLS)
            err = do_set_thread_area(p, -1,
                (struct user_desc __user *)tls, 0);
        ...
    }
    
  • Allocate the pid
    struct pid *alloc_pid(struct pid_namespace *ns)
    {
      pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
      tmp = ns;
      pid->level = ns->level;
    
      for (i = ns->level; i >= 0; i--) {
        int pid_min = 1;
    
        if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
               pid_min = RESERVED_PIDS;
    
        nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
              pid_max, GFP_ATOMIC);
        pid->numbers[i].nr = nr;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
      }
    
      ...
        return pid;
     }
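
The cyclic allocation can be imitated with a small array. The sketch below is a simplified model, not the kernel's idr: it handles a single namespace, uses a tiny id range so wrap-around is easy to see, and all DEMO_ and demo_ names are invented. DEMO_RESERVED_PIDS plays the role of RESERVED_PIDS: once the cursor has moved past it, the search wraps around to that floor instead of to 1.

```c
#include <stdbool.h>

#define DEMO_PID_MAX       16   /* ids 1..15; tiny so wrap-around is visible */
#define DEMO_RESERVED_PIDS 4    /* stand-in for the kernel's RESERVED_PIDS */

struct demo_pid_pool {
    bool used[DEMO_PID_MAX];
    int cursor;                 /* next id to try, like the idr cursor */
};

/* Cyclic allocation: search upward from the cursor, then wrap around to
 * the floor. Returns the id, or -1 when every id in range is taken. */
int demo_alloc_pid(struct demo_pid_pool *pool)
{
    int floor = (pool->cursor > DEMO_RESERVED_PIDS) ? DEMO_RESERVED_PIDS : 1;
    int start = (pool->cursor < floor) ? floor : pool->cursor;
    int span = DEMO_PID_MAX - floor;

    for (int step = 0; step < span; step++) {
        int id = floor + (start - floor + step) % span;
        if (!pool->used[id]) {
            pool->used[id] = true;
            pool->cursor = id + 1;
            return id;
        }
    }
    return -1;
}

void demo_free_pid(struct demo_pid_pool *pool, int id)
{
    pool->used[id] = false;
}

/* Scenario: fill the pool, free ids 2 and 5, allocate once more. */
int demo_wraparound_id(void)
{
    struct demo_pid_pool pool = {0};

    for (int i = 1; i < DEMO_PID_MAX; i++)
        demo_alloc_pid(&pool);          /* hands out 1..15 in order */
    demo_free_pid(&pool, 2);
    demo_free_pid(&pool, 5);
    return demo_alloc_pid(&pool);       /* 5: id 2 lies below the floor */
}
```

In the scenario, the freed id 2 is skipped after the wrap-around because it lies below the reserved floor, so the next id handed out is 5.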
    
  • Wake up: wake_up_new_task()
void wake_up_new_task(struct task_struct *p)
{
  ...
    p->state = TASK_RUNNING;
  ...

    check_preempt_curr(rq, p, WF_FORK);
  ...
}

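  • state = TASK_RUNNING; activate_task uses the scheduling class to enqueue the child
  • The scheduling-class hook is called; for a normal process that is the CFS code, which updates the run-time statistics and adds the task to the queue
  • check_preempt_curr decides whether the new task may preempt the current one; if task_fork_fair has already handled sysctl_sched_child_runs_first it returns immediately, otherwise it compares further and calls resched_curr to mark the current task for preemption
  • If the parent has been marked for preemption, the child is scheduled on the return path of the fork system call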

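Reading kernel code was overwhelming at first. I had been following Liu Chao's column 《趣谈Linux操作系统》 on Geek Time (极客时间), and this post serves as a summary of it. Listening to the column is not enough; hands-on practice is necessary. Many concepts were still unclear to me while I put this together and I kept googling. Without the ARTS check-in I really would have given up :(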


