Archive for the ‘oom killer’ Category

patch discussion: mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()

December 31, 2015

This post discusses mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process().

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what does oom_kill_process() do
As discussed in kernel: mm: oom_kill_process, it does the following three things.

  1. It may replace the victim with one of its eligible child thread groups.
  2. It kills the other thread groups whose mm is the same as the victim's.
  3. It finally kills the victim.

how does the patch change behaviour for the first item
For the first item, oom_kill_process() looks for a replacement victim among the victim's children, considering only those children whose mm differs from the victim's.

Before this patch, if a child thread group sharing the victim's mm was killed recently, its group leader might already be a zombie, waiting for the rest of its threads to do_exit(). The leader's mm is NULL, so it no longer equals the victim's mm, and the oom-killer may pick this already-killed child group as the new victim.

After this patch, in the same situation, the oom-killer skips such a child thread group as long as any thread in the group still has the same mm as the victim.

But if all threads of the killed child thread group have already cleared their mm in do_exit(), the oom-killer will still replace the victim with this killed group. The patch cannot prevent this case.

how does the patch change behaviour for the second item
For the second item, oom_kill_process() kills all other thread groups whose mm is the same as the victim's.

Before this patch, if such a thread group was killed recently, its group leader might already be a zombie, waiting for the rest of its threads to do_exit(). Because the leader's mm is now NULL, the oom-killer skips killing this thread group.

After this patch, the oom-killer still kills the thread group as long as any thread in the group has the same mm as the victim.

But if all threads of the killed thread group have already cleared their mm in do_exit(), the oom-killer will not kill the group again. The patch cannot prevent this case either.

One might think that if a thread group has already been killed, killing it again is pointless. However, the commit log of this patch argues that skipping such a group is incorrect.

conclusion
This post discusses how mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() changes the behaviour of oom_kill_process().


kernel: mm: oom_kill_process

December 29, 2015

This post discusses oom_kill_process().

reference code base
linux 4.3

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> wake_all_kswapds()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> __alloc_pages_may_oom()
      -> get_page_from_freelist()
      -> out_of_memory()
         -> select_bad_process()
         -> oom_kill_process()
   -> __alloc_pages_direct_compact()

when is oom_kill_process() called
The page allocation slow path tries to get pages with the min watermark, then via compaction and direct reclaim. If it still fails to allocate pages and no progress is made, it calls __alloc_pages_may_oom(). If that also fails, it may repeat the flow above or try a final compaction, depending on the gfp_mask and page order of the allocation.

In __alloc_pages_may_oom(), it first tries to get pages with the high watermark. If that fails, it calls select_bad_process() to pick a suitable thread group as the victim and oom_kill_process() to kill it.

implementation of oom_kill_process()
oom_kill_process() traverses all children of every thread in the chosen thread group. If a child has a different mm, is killable, and has the highest badness score, it replaces the victim with that child; the child is the group leader of another thread group. I think this tries to minimise the impact of killing a thread group: the lower a thread group sits in the process hierarchy, the less work is lost by killing it.

Then, oom_kill_process() traverses all other thread groups. If a thread group is a user-space one, has the same mm as the victim's, and has oom_score_adj > OOM_SCORE_ADJ_MIN, oom_kill_process() also kills it.

Finally, oom_kill_process() sends SIGKILL to the victim thread group.

#define K(x) ((x) << (PAGE_SHIFT-10))
/*
 * Must be called while holding a reference to p, which will be released upon
 * returning.
 */
void oom_kill_process(struct oom_control *oc, struct task_struct *p,
                      unsigned int points, unsigned long totalpages,
                      struct mem_cgroup *memcg, const char *message)
{
        struct task_struct *victim = p;
        struct task_struct *child;
        struct task_struct *t;
        struct mm_struct *mm;
        unsigned int victim_points = 0;
        static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                              DEFAULT_RATELIMIT_BURST);

        /*
         * If the task is already exiting, don't alarm the sysadmin or kill
         * its children or threads, just set TIF_MEMDIE so it can die quickly
         */
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

        if (__ratelimit(&oom_rs))
                dump_header(oc, p, memcg);

        task_lock(p);
        pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
                message, task_pid_nr(p), p->comm, points);
        task_unlock(p);

        /*
         * If any of p's children has a different mm and is eligible for kill,
         * the one with the highest oom_badness() score is sacrificed for its
         * parent.  This attempts to lose the minimal amount of work done while
         * still freeing memory.
         */
        read_lock(&tasklist_lock);
        for_each_thread(p, t) {
                list_for_each_entry(child, &t->children, sibling) {
                        unsigned int child_points;

                        if (child->mm == p->mm)
                                continue;
                        /*
                         * oom_badness() returns 0 if the thread is unkillable
                         */
                        child_points = oom_badness(child, memcg, oc->nodemask,
                                                                totalpages);
                        if (child_points > victim_points) {
                                put_task_struct(victim);
                                victim = child;
                                victim_points = child_points;
                                get_task_struct(victim);
                        }
                }
        }
        read_unlock(&tasklist_lock);

        p = find_lock_task_mm(victim);
        if (!p) {
                put_task_struct(victim);
                return;
        } else if (victim != p) {
                get_task_struct(p);
                put_task_struct(victim);
                victim = p;
        }

        /* mm cannot safely be dereferenced after task_unlock(victim) */
        mm = victim->mm;
        mark_oom_victim(victim);
        pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
                task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
                K(get_mm_counter(victim->mm, MM_ANONPAGES)),
                K(get_mm_counter(victim->mm, MM_FILEPAGES)));
        task_unlock(victim);

        /*
         * Kill all user processes sharing victim->mm in other thread groups, if
         * any.  They don't get access to memory reserves, though, to avoid
         * depletion of all memory.  This prevents mm->mmap_sem livelock when an
         * oom killed thread cannot exit because it requires the semaphore and
         * its contended by another thread trying to allocate memory itself.
         * That thread will now get access to memory reserves since it has a
         * pending fatal signal.
         */
        rcu_read_lock();
        for_each_process(p)
                if (p->mm == mm && !same_thread_group(p, victim) &&
                    !(p->flags & PF_KTHREAD)) {
                        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                                continue;

                        task_lock(p);   /* Protect ->comm from prctl() */
                        pr_err("Kill process %d (%s) sharing same memory\n",
                                task_pid_nr(p), p->comm);
                        task_unlock(p);
                        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
                }
        rcu_read_unlock();

        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
        put_task_struct(victim);
}
#undef K

conclusion
This post discusses when oom_kill_process() is called and its implementation.

