Linux内核参数sysctl_sched_child_runs_first

概述

Linux2.6.23版本引入了CFS调度器,通过sched_child_runs_first设置是否子进程优先运行, 下面是 SUSE Documentation 关于此参数的说明:

A freshly forked child runs before the parent continues execution. Setting this parameter to 1 is beneficial for an application in which the child performs an execution after fork. For example make -j performs better when sched_child_runs_first is turned off. The default value is 0.

make -j<cpu个数>支持并行编译

设置这个参数后子进程会优先在父进程前执行.

验证

...
int
main (int argc, char *argv[])
{
  ...
  for (j = 0; j < numChildren; j++)
    {
      switch (childPid = fork ())
        {
        case -1:
          exit(-1);

        case 0:
          getpid();
          printf ("%d child\n", j);
          _exit (EXIT_SUCCESS);

        default:
          getpid();
          printf ("%d parent\n", j);
          wait (NULL);          /* Wait for child to terminate */
          break;
        }
    }

  exit (EXIT_SUCCESS);
}
 sysctl -w kernel.sched_child_runs_first=1
 ./fork_whos_on_first 10000 > fork.txt 
 awk -f ./fork_whos_on_first.count.awk fork.txt
All done
parent  99983  99.98%
child      17   0.02%

虽然设置了 sched_child_runs_first, 并不是子进程每次都能先于父进程被调度。

由于虚拟机设置了2个CPU,可以通过代码或taskset命令设置一下cpu亲和性,使得任务指定在一个核心上运行

#define _GNU_SOURCE
...
 cpu_set_t mask;
 CPU_ZERO (&mask);
 int i;
 CPU_SET (0, &mask);
 if (sched_setaffinity (0, sizeof (mask), &mask) == -1)
 {
    printf ("sched_setaffinity error\n");
    return;
 }

taskset -c 0 ./fork_whos_on_first 10000

  • 通过perf查看

开始记录2s的事件:

# perf sched record -- sleep 2

同时运行

(fork_whos_on_first 改名为first)
# ./first
0 parent [36785]
0 child [36786]

看下调度延时,可以看到很多任务信息


#perf sched latency ----------------------------------------------------------------------------------------------------------------- Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at | ----------------------------------------------------------------------------------------------------------------- kworker/0:0-eve:34530 | 2.544 ms | 13 | avg: 0.164 ms | max: 1.343 ms | max at: 95680.330955 s mysqld:(14) | 1.581 ms | 46 | avg: 0.088 ms | max: 0.195 ms | max at: 95680.811329 s systemd:(2) | 0.498 ms | 2 | avg: 0.086 ms | max: 0.146 ms | max at: 95680.919657 s irq/16-vmwgfx:451 | 0.000 ms | 8 | avg: 0.066 ms | max: 0.076 ms | max at: 95680.858616 s first:(2) | 1.776 ms | 3 | avg: 0.058 ms | max: 0.086 ms | max at: 95680.330472 s perf:36780 | 6.203 ms | 1 | avg: 0.018 ms | max: 0.018 ms | max at: 95681.795302 s sleep:36783 | 1.712 ms | 2 | avg: 0.015 ms | max: 0.022 ms | max at: 95681.794708 s sshd:15689 | 0.836 ms | 6 | avg: 0.007 ms | max: 0.020 ms | max at: 95680.328447 s bash:15704 | 1.280 ms | 2 | avg: 0.004 ms | max: 0.005 ms | max at: 95680.328570 s ... -----------------------------------------------------------------------------------------------------------------

[root@centosgpt ~]# perf sched timehist -n |grep first Samples do not have callchains. 96628.689986 [0000] first[37799] 0.155 0.000 1.763 next: irq/16-vmwgfx[451] 96628.690042 [0000] kworker/0:2-eve[36968] 207.496 1.366 0.036 next: first[37799] 96628.690379 [0000] first[37799] 0.056 0.000 0.336 next: first[37801] 96628.690898 [0000] first[37801] 0.000 0.172 0.519 next: kworker/u256:0[7177] 96628.690961 [0000] kworker/u256:0-[7177] 312.054 0.540 0.062 next: first[37799] 96628.691255 [0000] first[37799] 0.582 0.071 0.294 next: swapper/0[0]

可以看到调用顺序.

源码

涉及到子进程与父进程的关系,容易想到创建新任务,常用的系统调用fork, clone等,这些系统调用,最终调用_do_fork函数, 他的主要流程就是复制,起名(分配pid), 唤醒.

long _do_fork(struct kernel_clone_args *args)
{
    ...
    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
    add_latent_entropy();
    ...
    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);
    ...
    wake_up_new_task(p);
    ...
    put_pid(pid);
    return nr;
}

sysctl_sched_child_runs_first,可以在task_fork_fair看到

copy_process -> sched_fork (p->sched_class->task_fork(p))-> task_fork_fair

static void task_fork_fair(struct task_struct *p)
{
    ...
    curr = cfs_rq->curr;
    if (curr) {
        update_curr(cfs_rq);
        se->vruntime = curr->vruntime;
    }
    place_entity(cfs_rq, se, 1);

    if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
        swap(curr->vruntime, se->vruntime);
        resched_curr(rq);
    }
    se->vruntime -= cfs_rq->min_vruntime;
    ...
}

设置该标识后,如果父进程的vruntime小于子进程vruntime,则交换,保证子进程的vruntime比父进程小,从而能够被优先调度,resched_curr通过TIF_NEED_RESCHED的设置, 标识父进程可被抢占。

关于child run first

child run first,支持者主要的理由是由于COW(copy-on-write)的应用,如果父进程先运行,那么父进程触发的写操作会造成针对子进程一些没有必要的内存拷贝, 而子进程可能本来就不需要哪些。

linux-2.4.4-pre2: fork should run child first 这封邮件里可以看到在CFS引入前,已经出现了相关的讨论.下面对比一下下copy-on-write与立即复制模式。

Immediately && copy-on-write

  • 立即复制
    fork()一个子进程为例,父进程的内存空间将复制给子进程,子进程和父进程拥有不同的物理页面,如下图 :

+-------+     +-------+     +-------+
| page1 |     | page1 | <-- | page1 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page2 | --> | page2 |     | page2 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page3 |  +> | page3 |     | page3 |
+-------+  |  +-------+     +-------+
|       |  |  |       |     |       |
| page4 | -+  | page4 |  +- | page4 |
+-------+     +-------+  |  +-------+
|       |     |       |  |  |       |
| page5 |     | page5 | <+  | page5 |
+-------+     +-------+     +-------+
 Parent     Phyical RAM       Child       
  • 写时复制 copy-on-write

写时复制分为两个阶段:
仅拷贝页表项,不拷贝原始页,所有页表只读,看起来像这个样子, 共享page2,page3:


+-------+     +-------+     +-------+
| page1 |     | page1 |     | page1 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page2 | --> | page2 | <-- | page2 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page3 |  +> | page3 | <+  | page3 |
+-------+  |  +-------+  |  +-------+
|       |  |  |       |  |  |       |
| page4 | -+  | page4 |  +- | page4 |
+-------+     +-------+     +-------+
| page5 |     | page5 |     | page5 |
+-------+     +-------+     +-------+
 Parent     Phyical RAM       Child       

如果其中一个进程对一个页面Page3进行修改:

  • 给子进程拷贝这个页面到新页面Page4
  • 更新子进程页表项
  • 标识两个进程页表项可读可写

+-------+     +-------+     +-------+
| page1 |     | page1 |     | page1 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page2 | --> | page2 | <-- | page2 |
+-------+     +-------+     +-------+
|       |     |       |     |       |
| page3 |  +> | page3 |     | page3 |
+-------+  |  +-------+     +-------+
|       |  |  |       |     |       |
| page4 | -+  | page4 | <-- | page4 |
+-------+     +-------+     +-------+
| page5 |     | page5 |     | page5 |
+-------+     +-------+     +-------+
 Parent     Phyical RAM       Child       

“Child-runs-first is now off”

随后的Linux2.6.32版本sched_child_runs_first其默认设置0,Child-runs-first is now off, 针对一些人的疑问Linus Torvalds做出了回应。主要是出于之前遇到到bash bug以及parent run in first 能更好的利用TLB及cache考虑做出的决定。当然内核也提供kernel.sched_child_runs_first参数,可供需要的用户根据需要设置。

后续关于此相关的讨论我也很少搜到,这期间软硬件环境也发生了很大的变化,站在今天的角度很难说当时提出child run first的那些hacker们说得对或者错,或许不久将来调度器也会变化。

参考

Does /proc/sys/kernel/sched_child_runs_first work?
Kernel Tuning: kernel.sched_child_runs_first
Linus Torvalds 关于 Child-runs-first is now off 回复的邮件
PATCH(?): linux-2.4.4-pre2: fork should run child first
Linux写时拷贝技术(copy-on-write)
COW奶牛!Copy On Write机制了解一下
fork()父子进程运行先后顺序(这篇文章对这个问题讲的比较细致,网上收索了一下资料发现和作者帖子的差不多,而且文章中含有相关代码说明)
Various scheduler-related topics
Virtual Memory, Part III

Be First to Comment

发表回复