Archive for the ‘patch discussion’ Category

patch discussion: mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()

December 31, 2015

This post discusses mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process().

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what does oom_kill_process() do
As discussed in kernel: mm: oom_kill_process, it does the following three things.

  1. It may replace the victim with one of its eligible child thread groups.
  2. It kills those thread groups whose mm is the same as the victim’s.
  3. It finally kills the victim.

how does the patch change behaviour for the first item
When choosing a replacement victim, oom_kill_process() finds candidate child thread groups by checking whether a child’s mm differs from the victim’s.

Before this patch, if a child thread group sharing the victim’s mm was killed recently, its thread group leader might now be a zombie waiting for the other threads in the group to do_exit(). Because the leader’s mm is already NULL, the child passes the child->mm == p->mm check, so the oom-killer may still pick this already-killed child as the new victim.

After this patch, in the same situation, the oom-killer will not pick such a child if any thread in that group still has a non-NULL mm equal to the victim’s.

But if all threads in that killed child thread group have already set their mm to NULL in do_exit(), the oom-killer may still replace the victim with this already-killed group. The patch cannot prevent this case.
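The check the patch relies on examines every thread in a thread group rather than only the group leader, whose mm may already be NULL. Below is a minimal sketch of what such a per-thread check could look like; the helper name process_shares_mm() and its exact shape are my reconstruction from the description above, not code quoted in this post.

/*
 * Sketch: task->mm can be NULL if the task is an exited group leader, so
 * decide whether a thread group uses a given mm by looking at all of its
 * threads instead of only the leader.
 */
static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
{
	struct task_struct *t;

	for_each_thread(p, t) {
		struct mm_struct *t_mm = READ_ONCE(t->mm);
		if (t_mm)
			return t_mm == mm;
	}
	return false;
}

The same per-thread test is what changes the behaviour described for the second item below.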

how does the patch change behaviour for the second item
For the second item, the oom-killer kills those thread groups whose mm is the same as the victim’s.

Before this patch, if such a thread group was killed recently, its thread group leader might now be a zombie waiting for the other threads in the group to do_exit(). Since the leader’s mm is already NULL, the p->mm == mm check fails and the oom-killer skips killing this thread group.

After this patch, in the same situation, the oom-killer still kills that thread group if any thread in the group has a non-NULL mm equal to the victim’s.

But if all threads in that killed thread group have already set their mm to NULL in do_exit(), the oom-killer will not kill the group. The patch does not change this case.

I wonder whether it is useless to kill a thread group that has already been killed; however, the commit log of this patch says that skipping it based only on the group leader’s mm is incorrect.

conclusion
This post discusses how mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() changes the behaviour of oom_kill_process().

patch discussion: mm/oom_kill: cleanup the “kill sharing same memory” loop

December 28, 2015

This post discusses mm/oom_kill: cleanup the “kill sharing same memory” loop.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> wake_all_kswapds()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> __alloc_pages_may_oom()
      -> get_page_from_freelist()
      -> out_of_memory()
         -> select_bad_process()
         -> oom_kill_process()
   -> __alloc_pages_direct_compact()

how oom-killer kills thread groups other than victim
out_of_memory() calls select_bad_process() to select the victim thread and calls oom_kill_process() to kill it.

oom_kill_process() iterates over all thread groups and kills a thread group only if none of the conditions below is satisfied.

  • p->mm != vicitim->mm
  • same_thread_group(p, victim)
  • (p->flags & PF_KTHREAD)
  • p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

This implies the oom-killer also kills thread groups that share the same mm as the victim, provided the group leader is not a kernel thread and its oom_score_adj is not OOM_SCORE_ADJ_MIN.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c837d06..2b6e880 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -574,14 +574,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * pending fatal signal.
 	 */
 	rcu_read_lock();
-	for_each_process(p)
-		if (p->mm == mm && !same_thread_group(p, victim) &&
-		    !(p->flags & PF_KTHREAD)) {
-			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-				continue;
+	for_each_process(p) {
+		if (p->mm != mm)
+			continue;
+		if (same_thread_group(p, victim))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
 
-			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-		}
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+	}
 	rcu_read_unlock();
 
 	mmdrop(mm);

conclusion
This post discusses mm/oom_kill: cleanup the “kill sharing same memory” loop and how the oom-killer kills thread groups other than the victim.

patch discussion: mm: fix the racy mm->locked_vm change in

December 28, 2015

This post discusses mm: fix the racy mm->locked_vm change in.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what locks are held in acct_stack_growth()
kernel: mm: why expand_stack() expands VMAs but only holds down_read(task->mm->mmap_sem) shows that expand_downwards() holds down_read(&task->mm->mmap_sem) and vma_lock_anon_vma(vma) before expanding the VMA. acct_stack_growth() is then called to account the growth statistics in current->mm.

do_page_fault()
down_read(&task->mm->mmap_sem)
-> __do_page_fault()
   -> expand_stack()
      -> expand_downwards()
        -> vma_lock_anon_vma(vma)
        -> acct_stack_growth()
        -> vma_unlock_anon_vma(vma)
   -> handle_mm_fault()
      -> __handle_mm_fault()
up_read(&task->mm->mmap_sem)

what’s the problem
Since VMA expansions always go in the same direction, an expansion has no effect on the relative order of any two VMAs. Therefore all VMAs may expand at the same time while holding only down_read(&task->mm->mmap_sem).

But it is incorrect to allow concurrent updates to the statistics of task->mm, such as mm->locked_vm, mm->total_vm, and mm->shared_vm.

how does the patch fix it
It is safe for a thread to modify task->mm->locked_vm while holding down_write(&task->mm->mmap_sem). But here a thread modifies task->mm->locked_vm while holding only down_read(&task->mm->mmap_sem), so many threads could hold the read lock and modify task->mm->locked_vm concurrently.

This patch moves the code that modifies the statistics of task->mm into the section protected by spin_lock(&vma->vm_mm->page_table_lock). All threads holding down_read(&task->mm->mmap_sem) then have to contend for spin_lock(&vma->vm_mm->page_table_lock) before modifying the statistics of task->mm.
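To see why many down_read() holders doing an unserialized += is a problem, here is a small user-space analogy with POSIX threads. It only illustrates the lost-update race; it is not kernel code, and the names mmap_sem, page_table_lock and locked_vm are borrowed from the discussion above for readability.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t page_table_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long locked_vm;

/* Pass a non-NULL argument to serialize the update, as the patch does. */
static void *grow_stack(void *use_fix)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_rwlock_rdlock(&mmap_sem);	/* shared, like down_read() */
		if (use_fix) {
			pthread_mutex_lock(&page_table_lock);
			locked_vm += 1;			/* serialized update */
			pthread_mutex_unlock(&page_table_lock);
		} else {
			locked_vm += 1;			/* racy read-modify-write */
		}
		pthread_rwlock_unlock(&mmap_sem);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, grow_stack, NULL);
	pthread_create(&b, NULL, grow_stack, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* Without the serialized update this typically prints less than 2000000. */
	printf("locked_vm = %lu (expected 2000000)\n", locked_vm);
	return 0;
}

Built with cc -pthread, the unserialized variant usually ends up below 2000000; that silent drift in the accounting is exactly what the patch avoids by doing the update under page_table_lock.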

diff --git a/mm/mmap.c b/mm/mmap.c
index 3ec19b6..d1ac224 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2138,10 +2138,6 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
 	if (security_vm_enough_memory_mm(mm, grow))
 		return -ENOMEM;
 
-	/* Ok, everything looks good - let it rip */
-	if (vma->vm_flags & VM_LOCKED)
-		mm->locked_vm += grow;
-	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
 	return 0;
 }
 
@@ -2202,6 +2198,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					vma->vm_mm->locked_vm += grow;
+				vm_stat_account(vma->vm_mm, vma->vm_flags,
+						vma->vm_file, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				anon_vma_interval_tree_post_update_vma(vma);
@@ -2273,6 +2273,10 @@ int expand_downwards(struct vm_area_struct *vma,
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					vma->vm_mm->locked_vm += grow;
+				vm_stat_account(vma->vm_mm, vma->vm_flags,
+						vma->vm_file, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;

conclusion
This post discusses mm: fix the racy mm->locked_vm change in. I am not sure this is the best approach, since it puts the code that updates the statistics of task->mm under spin_lock(&vma->vm_mm->page_table_lock), but the resulting accuracy is acceptable.

patch discussion: mm/vmscan.c: fix types of some locals

December 27, 2015

This post discusses mm/vmscan.c: fix types of some locals.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

zone_page_state(), zone_unmapped_file_pages()
Both functions return page counts with type unsigned long.

what does the patch do
The patch fixes a possible overflow that occurs when an int or long local variable receives the return value of a function returning unsigned long. It does so by changing the types of those local variables to unsigned long.
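As a user-space illustration of the issue (not kernel code): on a 64-bit system a large unsigned long page count does not fit in an int, so the assignment silently produces a bogus value, while an unsigned long local preserves it.

#include <stdio.h>

int main(void)
{
	unsigned long nr_pages = 3UL * 1024 * 1024 * 1024;	/* ~3 billion pages */
	int nr_int = nr_pages;		/* exceeds INT_MAX, value is mangled */
	unsigned long nr_ok = nr_pages;	/* preserves the value */

	printf("int           : %d\n", nr_int);
	printf("unsigned long : %lu\n", nr_ok);
	return 0;
}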

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ceede0..55721b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,7 +194,7 @@ static bool sane_reclaim(struct scan_control *sc)
 
 static unsigned long zone_reclaimable_pages(struct zone *zone)
 {
-	int nr;
+	unsigned long nr;
 
 	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
 	     zone_page_state(zone, NR_INACTIVE_FILE);
@@ -3693,10 +3693,10 @@ static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 }
 
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static long zone_pagecache_reclaimable(struct zone *zone)
+static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 {
-	long nr_pagecache_reclaimable;
-	long delta = 0;
+	unsigned long nr_pagecache_reclaimable;
+	unsigned long delta = 0;
 
 	/*
 	 * If RECLAIM_UNMAP is set, then all file pages are considered

conclusion
This post discusses mm/vmscan.c: fix types of some locals. If a function returns unsigned long, the caller should use an unsigned long variable to receive the value and avoid overflow.

patch discussion: mm, oom: remove task_lock protecting comm printing

December 27, 2015

This post discusses mm, oom: remove task_lock protecting comm printing.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

task->comm and task_lock()
As discussed in kernel: task_struct: comm and task_lock, comm is protected by task_lock() and task_unlock().

what does this patch do
This patch removes the task_lock()/task_unlock() pair that protected task->comm while it is printed. The rationale is that although task->comm could be updated concurrently with the printk, the worst case is an inconsistent name in the log, and it is more efficient not to take the task’s alloc_lock.
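The trade-off can be pictured with a small user-space analogy (an intentional data race, for illustration only, not kernel code): a fixed-size, always NUL-terminated name buffer like task->comm may at worst be printed as a mixture of the old and the new name, which is the usual argument for why dropping the lock here is acceptable.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define COMM_LEN 16	/* like TASK_COMM_LEN */

static char comm[COMM_LEN] = "old-name";

static void *renamer(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		/* like prctl(PR_SET_NAME): bounded copy, last byte stays NUL */
		strncpy(comm, (i & 1) ? "new-name" : "old-name", COMM_LEN - 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, renamer, NULL);
	for (int i = 0; i < 100000; i++)
		printf("\r%s", comm);	/* may print a torn name, never overruns */
	pthread_join(t, NULL);
	printf("\n");
	return 0;
}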

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c170d9f..58f3d27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -377,13 +377,11 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 static void dump_header(struct oom_control *oc, struct task_struct *p,
 			struct mem_cgroup *memcg)
 {
-	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, oc->order,
 		current->signal->oom_score_adj);
-	cpuset_print_task_mems_allowed(current);
-	task_unlock(current);
+	cpuset_print_current_mems_allowed();
 	dump_stack();
 	if (memcg)
 		mem_cgroup_print_oom_info(memcg, p);
@@ -509,10 +507,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
 
-	task_lock(p);
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
 
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
@@ -586,10 +582,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			if (fatal_signal_pending(p))
 				continue;
 
-			task_lock(p);	/* Protect ->comm from prctl() */
 			pr_info("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(p), p->comm);
-			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
 	rcu_read_unlock();

conclusion
This post discusses mm, oom: remove task_lock protecting comm printing. It accepts that task->comm may occasionally be printed inconsistently in exchange for avoiding the overhead of taking the task lock.

patch discussion: mm/oom_kill.c: suppress unnecessary “sharing same memory” message

December 26, 2015

This post discusses mm/oom_kill.c: suppress unnecessary “sharing same memory” message.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

do_send_sig_info() and fatal_signal_pending()
kernel: signal: how a fatal signal kills a thread group shows that do_send_sig_info() queues the signal on the shared pending queue of the target thread’s thread group when its group argument is true. Each thread in that thread group can then pick the signal up in do_signal(). When one thread processes SIGKILL, it sends SIGKILL to all the other threads in the same thread group.

fatal_signal_pending() returns true when the thread has a pending SIGKILL. If a thread already has a pending fatal signal, it is meaningless to send it the signal again.

2888 static inline int signal_pending(struct task_struct *p)
2889 {
2890         return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
2891 }
2892 
2893 static inline int __fatal_signal_pending(struct task_struct *p)
2894 {
2895         return unlikely(sigismember(&p->pending.signal, SIGKILL));
2896 }
2897 
2898 static inline int fatal_signal_pending(struct task_struct *p)
2899 {
2900         return signal_pending(p) && __fatal_signal_pending(p);
2901 }
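For context, a common kernel idiom (a generic, hypothetical helper, not code from this patch) is to poll fatal_signal_pending() inside long-running loops so that a task that has already received SIGKILL stops doing useless work early:

/* Hypothetical example of the usual fatal_signal_pending() pattern. */
static int process_many_pages(struct page **pages, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		if (fatal_signal_pending(current))
			return -EINTR;	/* the task is dying, stop the work */
		/* ... do the per-page work on pages[i] ... */
		cond_resched();
	}
	return 0;
}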

what does the patch do
oom_kill_process() sends SIGKILL to a thread group if that group’s mm is the same as the victim’s. If the group leader of such a thread group already has a fatal signal pending, there is no need to kill it again or to log another message for it, so the patch skips these groups; it also downgrades the “sharing same memory” message from pr_err to pr_info.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5ba743a..c170d9f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -583,9 +583,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		    !(p->flags & PF_KTHREAD)) {
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
+			if (fatal_signal_pending(p))
+				continue;
 
 			task_lock(p);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
+			pr_info("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);

conclusion
This post discusses mm/oom_kill.c: suppress unnecessary “sharing same memory” message.

patch discussion: mm/oom_kill.c: fix potentially killing unrelated process

December 22, 2015

This post discusses mm/oom_kill.c: fix potentially killing unrelated process.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what is the problem
The victim thread is found by find_lock_task_mm(), which returns a thread in the same thread group that still has an mm. victim->mm is then assigned to the local variable mm, so mm points to the victim’s mm. oom_kill_process() will also kill other processes that share the same mm as the victim.

However, since oom_kill_process() does not increase mm->mm_count, it is possible that the victim’s mm_count reaches zero and the mm_struct is freed. If the same memory is then allocated for another thread’s mm, the local mm pointer in oom_kill_process() may point to an mm unrelated to the victim. Under this condition, oom_kill_process() will kill other processes that share the same mm as that unrelated thread.
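A small user-space analogy of the pointer-reuse hazard (plain C, not kernel code; struct fake_mm is made up for the illustration):

#include <stdio.h>
#include <stdlib.h>

struct fake_mm { int id; };

int main(void)
{
	struct fake_mm *victim_mm = malloc(sizeof(*victim_mm));
	struct fake_mm *stale = victim_mm;	/* kept without taking a reference */
	struct fake_mm *unrelated;

	free(victim_mm);		/* last reference dropped, memory may be reused */

	unrelated = malloc(sizeof(*unrelated));	/* allocator may hand back the same address */
	if (stale == unrelated)
		printf("stale pointer matches an unrelated object\n");
	else
		printf("no reuse this time, but nothing prevents it\n");

	free(unrelated);
	return 0;
}

Whether the addresses actually collide depends on the allocator, which is the point: nothing prevents it, so oom_kill_process() must not compare p->mm against a potentially freed mm.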

480 /*
481  * Must be called while holding a reference to p, which will be released upon
482  * returning.
483  */
484 void oom_kill_process(struct oom_control *oc, struct task_struct *p,
485                       unsigned int points, unsigned long totalpages,
486                       struct mem_cgroup *memcg, const char *message)
487 {
488         struct task_struct *victim = p;
489         struct task_struct *child;
490         struct task_struct *t;
491         struct mm_struct *mm;
492         unsigned int victim_points = 0;
493         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
494                                               DEFAULT_RATELIMIT_BURST);
495 
496         /*
497          * If the task is already exiting, don't alarm the sysadmin or kill
498          * its children or threads, just set TIF_MEMDIE so it can die quickly
499          */
500         task_lock(p);
501         if (p->mm && task_will_free_mem(p)) {
502                 mark_oom_victim(p);
503                 task_unlock(p);
504                 put_task_struct(p);
505                 return;
506         }
507         task_unlock(p);
508 
509         if (__ratelimit(&oom_rs))
510                 dump_header(oc, p, memcg);
511 
512         task_lock(p);
513         pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
514                 message, task_pid_nr(p), p->comm, points);
515         task_unlock(p);
516 
517         /*
518          * If any of p's children has a different mm and is eligible for kill,
519          * the one with the highest oom_badness() score is sacrificed for its
520          * parent.  This attempts to lose the minimal amount of work done while
521          * still freeing memory.
522          */
523         read_lock(&tasklist_lock);
524         for_each_thread(p, t) {
525                 list_for_each_entry(child, &t->children, sibling) {
526                         unsigned int child_points;
527 
528                         if (child->mm == p->mm)
529                                 continue;
530                         /*
531                          * oom_badness() returns 0 if the thread is unkillable
532                          */
533                         child_points = oom_badness(child, memcg, oc->nodemask,
534                                                                 totalpages);
535                         if (child_points > victim_points) {
536                                 put_task_struct(victim);
537                                 victim = child;
538                                 victim_points = child_points;
539                                 get_task_struct(victim);
540                         }
541                 }
542         }
543         read_unlock(&tasklist_lock);
544 
545         p = find_lock_task_mm(victim);
546         if (!p) {
547                 put_task_struct(victim);
548                 return;
549         } else if (victim != p) {
550                 get_task_struct(p);
551                 put_task_struct(victim);
552                 victim = p;
553         }
554 
555         /* mm cannot safely be dereferenced after task_unlock(victim) */
556         mm = victim->mm;
557         mark_oom_victim(victim);
558         pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
559                 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
560                 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
561                 K(get_mm_counter(victim->mm, MM_FILEPAGES)));
562         task_unlock(victim);
563 
564         /*
565          * Kill all user processes sharing victim->mm in other thread groups, if
566          * any.  They don't get access to memory reserves, though, to avoid
567          * depletion of all memory.  This prevents mm->mmap_sem livelock when an
568          * oom killed thread cannot exit because it requires the semaphore and
569          * its contended by another thread trying to allocate memory itself.
570          * That thread will now get access to memory reserves since it has a
571          * pending fatal signal.
572          */
573         rcu_read_lock();
574         for_each_process(p)
575                 if (p->mm == mm && !same_thread_group(p, victim) &&
576                     !(p->flags & PF_KTHREAD)) {
577                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
578                                 continue;
579 
580                         task_lock(p);   /* Protect ->comm from prctl() */
581                         pr_err("Kill process %d (%s) sharing same memory\n",
582                                 task_pid_nr(p), p->comm);
583                         task_unlock(p);
584                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
585                 }
586         rcu_read_unlock();
587 
588         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
589         put_task_struct(victim);
590 }
591 #undef K

how to fix the problem
oom_kill_process() increases mm->mm_count while it still holds the task’s lock. After using mm to find which processes share the same mm, oom_kill_process() calls mmdrop() to decrease mm->mm_count.

mm: mmput and mmdrop shows more information about mmdrop() and mm_count.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ad35aa..5ba743a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -552,8 +552,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		victim = p;
 	}
 
-	/* mm cannot safely be dereferenced after task_unlock(victim) */
+	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
+	atomic_inc(&mm->mm_count);
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
@@ -591,6 +592,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		}
 	rcu_read_unlock();
 
+	mmdrop(mm);
 	put_task_struct(victim);
 }
 #undef K

conclusion
This post discusses mm/oom_kill.c: fix potentially killing unrelated process, which prevents the oom-killer from killing the wrong processes.

patch discussion: mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL

December 21, 2015

This post discusses mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what is TIF_MEMDIE
mm: TIF_MEMDIE shows that the oom-killer sets the TIF_MEMDIE flag in the victim thread’s thread_info flags and sends SIGKILL to kill the victim thread. The TIF_MEMDIE flag indicates that the thread has been chosen to die by the oom-killer.

what is the problem
After the oom-killer calls mark_oom_victim(victim) to set the TIF_MEMDIE flag in the victim thread’s thread_info flags, it may take several seconds before it calls do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true) to kill the victim thread.

If threads in other thread groups share the same mm as the victim, the oom-killer also needs to kill them. If there are many such threads, the oom-killer might spend seconds between mark_oom_victim(victim) and do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true).

Before the victim thread is actually killed, it can still allocate pages, and once its TIF_MEMDIE flag is set it can allocate pages without checking the watermarks. Under this condition the victim thread can keep depleting memory until it is actually killed.
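The window can be pictured with the following timeline of the pre-patch ordering (my illustration, written as a comment):

/*
 * oom_kill_process()                        victim (user space)
 * ------------------                        -------------------
 * mark_oom_victim(victim)                   TIF_MEMDIE is set; further
 *                                           allocations get
 *                                           ALLOC_NO_WATERMARKS and eat
 *                                           into the memory reserves
 * ...loop killing every process that
 *    shares victim->mm (may take long)...
 * do_send_sig_info(SIGKILL, ...)            only now is the victim told
 *                                           to die
 */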

479 #define K(x) ((x) << (PAGE_SHIFT-10))
480 /*
481  * Must be called while holding a reference to p, which will be released upon
482  * returning.
483  */
484 void oom_kill_process(struct oom_control *oc, struct task_struct *p,
485                       unsigned int points, unsigned long totalpages,
486                       struct mem_cgroup *memcg, const char *message)
487 {
488         struct task_struct *victim = p;
489         struct task_struct *child;
490         struct task_struct *t;
491         struct mm_struct *mm;
492         unsigned int victim_points = 0;
493         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
494                                               DEFAULT_RATELIMIT_BURST);
495 
496         /*
497          * If the task is already exiting, don't alarm the sysadmin or kill
498          * its children or threads, just set TIF_MEMDIE so it can die quickly
499          */
500         task_lock(p);
501         if (p->mm && task_will_free_mem(p)) {
502                 mark_oom_victim(p);
503                 task_unlock(p);
504                 put_task_struct(p);
505                 return;
506         }
507         task_unlock(p);
508 
509         if (__ratelimit(&oom_rs))
510                 dump_header(oc, p, memcg);
511 
512         task_lock(p);
513         pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
514                 message, task_pid_nr(p), p->comm, points);
515         task_unlock(p);
516 
517         /*
518          * If any of p's children has a different mm and is eligible for kill,
519          * the one with the highest oom_badness() score is sacrificed for its
520          * parent.  This attempts to lose the minimal amount of work done while
521          * still freeing memory.
522          */
523         read_lock(&tasklist_lock);
524         for_each_thread(p, t) {
525                 list_for_each_entry(child, &t->children, sibling) {
526                         unsigned int child_points;
527 
528                         if (child->mm == p->mm)
529                                 continue;
530                         /*
531                          * oom_badness() returns 0 if the thread is unkillable
532                          */
533                         child_points = oom_badness(child, memcg, oc->nodemask,
534                                                                 totalpages);
535                         if (child_points > victim_points) {
536                                 put_task_struct(victim);
537                                 victim = child;
538                                 victim_points = child_points;
539                                 get_task_struct(victim);
540                         }
541                 }
542         }
543         read_unlock(&tasklist_lock);
544 
545         p = find_lock_task_mm(victim);
546         if (!p) {
547                 put_task_struct(victim);
548                 return;
549         } else if (victim != p) {
550                 get_task_struct(p);
551                 put_task_struct(victim);
552                 victim = p;
553         }
554 
555         /* mm cannot safely be dereferenced after task_unlock(victim) */
556         mm = victim->mm;
557         mark_oom_victim(victim);
558         pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
559                 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
560                 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
561                 K(get_mm_counter(victim->mm, MM_FILEPAGES)));
562         task_unlock(victim);
563 
564         /*
565          * Kill all user processes sharing victim->mm in other thread groups, if
566          * any.  They don't get access to memory reserves, though, to avoid
567          * depletion of all memory.  This prevents mm->mmap_sem livelock when an
568          * oom killed thread cannot exit because it requires the semaphore and
569          * its contended by another thread trying to allocate memory itself.
570          * That thread will now get access to memory reserves since it has a
571          * pending fatal signal.
572          */
573         rcu_read_lock();
574         for_each_process(p)
575                 if (p->mm == mm && !same_thread_group(p, victim) &&
576                     !(p->flags & PF_KTHREAD)) {
577                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
578                                 continue;
579 
580                         task_lock(p);   /* Protect ->comm from prctl() */
581                         pr_err("Kill process %d (%s) sharing same memory\n",
582                                 task_pid_nr(p), p->comm);
583                         task_unlock(p);
584                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
585                 }
586         rcu_read_unlock();
587 
588         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
589         put_task_struct(victim);
590 }
591 #undef K
404 /**
405  * mark_oom_victim - mark the given task as OOM victim
406  * @tsk: task to mark
407  *
408  * Has to be called with oom_lock held and never after
409  * oom has been disabled already.
410  */
411 void mark_oom_victim(struct task_struct *tsk)
412 {
413         WARN_ON(oom_killer_disabled);
414         /* OOM killer might race with memcg OOM */
415         if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
416                 return;
417         /*
418          * Make sure that the task is woken up from uninterruptible sleep
419          * if it is frozen because OOM killer wouldn't be able to free
420          * any memory and livelock. freezing_slow_path will tell the freezer
421          * that TIF_MEMDIE tasks should be ignored.
422          */
423         __thaw_task(tsk);
424         atomic_inc(&oom_victims);
425 }

why could a thread with TIF_MEMDIE flag set allocate pages without watermark check
If a thread’s TIF_MEMDIE flag is set, then alloc_flags |= ALLOC_NO_WATERMARKS and the thread can allocate pages in the slow path without checking the watermarks. If the min watermark is very low, the thread could exhaust the whole of system memory.

2945 static inline int
2946 gfp_to_alloc_flags(gfp_t gfp_mask)
2947 {
2948         int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
2949         const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
2950 
2951         /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
2952         BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
2953 
2954         /*
2955          * The caller may dip into page reserves a bit more if the caller
2956          * cannot run direct reclaim, or if the caller has realtime scheduling
2957          * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
2958          * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
2959          */
2960         alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
2961 
2962         if (atomic) {
2963                 /*
2964                  * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
2965                  * if it can't schedule.
2966                  */
2967                 if (!(gfp_mask & __GFP_NOMEMALLOC))
2968                         alloc_flags |= ALLOC_HARDER;
2969                 /*
2970                  * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
2971                  * comment for __cpuset_node_allowed().
2972                  */
2973                 alloc_flags &= ~ALLOC_CPUSET;
2974         } else if (unlikely(rt_task(current)) && !in_interrupt())
2975                 alloc_flags |= ALLOC_HARDER;
2976 
2977         if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
2978                 if (gfp_mask & __GFP_MEMALLOC)
2979                         alloc_flags |= ALLOC_NO_WATERMARKS;
2980                 else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
2981                         alloc_flags |= ALLOC_NO_WATERMARKS;
2982                 else if (!in_interrupt() &&
2983                                 ((current->flags & PF_MEMALLOC) ||
2984                                  unlikely(test_thread_flag(TIF_MEMDIE))))
2985                         alloc_flags |= ALLOC_NO_WATERMARKS;
2986         }
2987 #ifdef CONFIG_CMA
2988         if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
2989                 alloc_flags |= ALLOC_CMA;
2990 #endif
2991         return alloc_flags;
2992 }

how does the patch fix the problem
It moves do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true) above mark_oom_victim(victim). The victim then already has a pending SIGKILL by the time TIF_MEMDIE is set, so it is on its way to being killed rather than continuing to allocate pages without a watermark check.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..8ad35aa 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -554,6 +554,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
+	/*
+	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
+	 * the OOM victim from depleting the memory reserves from the user
+	 * space under its control.
+	 */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -585,7 +591,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
 }
 #undef K

questions
If the victim thread runs do_exit()->exit_mm(), the point where TIF_MEMDIE would normally be cleared, before oom_kill_process() calls mark_oom_victim(), then the flag is set afterwards and will never be cleared. However, this behaviour is unlikely to happen and should not be harmful to the system.

conclusion
This post discusses mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL. It explains the symptom, root cause, and how the patch fixes it.

patch discussion: mm/compaction.c: add an is_via_compact_memory() helper

December 21, 2015

This post discusses mm/compaction.c: add an is_via_compact_memory() helper.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

/proc/sys/vm/compact_memory
The core compaction function is compact_zone(), which uses its compaction control argument to decide how to compact. compact_zone() is reached in three ways: the allocation slow path, kswapd, and writing to /proc/sys/vm/compact_memory.

If the order in the compaction control is -1, the compaction was triggered by /proc/sys/vm/compact_memory. Many places in the compaction code use if (cc.order == -1) to detect this case.

This patch adds a helper function that explicitly checks whether a compaction was triggered via /proc/sys/vm/compact_memory. The helper’s implementation still uses (cc.order == -1) to make that decision.
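Based on that description, the helper can be expected to look roughly like the sketch below (reconstructed from the post, not copied from the patch):

/* A compaction request from /proc/sys/vm/compact_memory is encoded as
 * order == -1 in struct compact_control. */
static inline bool is_via_compact_memory(int order)
{
	return order == -1;
}

Call sites such as the two cc->order == -1 checks in __compact_pgdat() below would then read is_via_compact_memory(cc->order) instead.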

1714 /* The written value is actually unused, all memory is compacted */
1715 int sysctl_compact_memory;
1716 
1717 /* This is the entry point for compacting all nodes via /proc/sys/vm */
1718 int sysctl_compaction_handler(struct ctl_table *table, int write,
1719                         void __user *buffer, size_t *length, loff_t *ppos)
1720 {
1721         if (write)
1722                 compact_nodes();
1723 
1724         return 0;
1725 }
1702 /* Compact all nodes in the system */
1703 static void compact_nodes(void)
1704 {
1705         int nid;
1706 
1707         /* Flush pending updates to the LRU lists */
1708         lru_add_drain_all();
1709 
1710         for_each_online_node(nid)
1711                 compact_node(nid);
1712 }
1691 static void compact_node(int nid)
1692 {
1693         struct compact_control cc = {
1694                 .order = -1,
1695                 .mode = MIGRATE_SYNC,
1696                 .ignore_skip_hint = true,
1697         };
1698 
1699         __compact_pgdat(NODE_DATA(nid), &cc);
1700 }
1638 /* Compact all zones within a node */
1639 static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
1640 {
1641         int zoneid;
1642         struct zone *zone;
1643 
1644         for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
1645 
1646                 zone = &pgdat->node_zones[zoneid];
1647                 if (!populated_zone(zone))
1648                         continue;
1649 
1650                 cc->nr_freepages = 0;
1651                 cc->nr_migratepages = 0;
1652                 cc->zone = zone;
1653                 INIT_LIST_HEAD(&cc->freepages);
1654                 INIT_LIST_HEAD(&cc->migratepages);
1655 
1656                 /*
1657                  * When called via /proc/sys/vm/compact_memory
1658                  * this makes sure we compact the whole zone regardless of
1659                  * cached scanner positions.
1660                  */
1661                 if (cc->order == -1)
1662                         __reset_isolation_suitable(zone);
1663 
1664                 if (cc->order == -1 || !compaction_deferred(zone, cc->order))
1665                         compact_zone(zone, cc);
1666 
1667                 if (cc->order > 0) {
1668                         if (zone_watermark_ok(zone, cc->order,
1669                                                 low_wmark_pages(zone), 0, 0))
1670                                 compaction_defer_reset(zone, cc->order, false);
1671                 }
1672 
1673                 VM_BUG_ON(!list_empty(&cc->freepages));
1674                 VM_BUG_ON(!list_empty(&cc->migratepages));
1675         }
1676 }

conclusion
This post discusses mm/compaction.c: add an is_via_compact_memory() helper. Before this patch, the compaction code used (cc.order == -1) to know whether a compaction was triggered by /proc/sys/vm/compact_memory. After this patch, the compaction code can call the helper is_via_compact_memory() instead.

patch discussion: mm, migrate: count pages failing all retries in vmstat and tracepoint

December 8, 2015

This post discusses mm, migrate: count pages failing all retries in vmstat and tracepoint.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what is the problem of migrate_pages() in v4.3
Let’s consider the following cases to understand how migrate_pages() works.

Assume the input argument from is a linked list of SWAP_CLUSTER_MAX = 32 pages. migrate_pages() will then try to migrate these 32 pages into free pages isolated by the free scanner.

I also assume that if a page fails to migrate with -EAGAIN once, it will keep failing with -EAGAIN on every retry. This makes the examples below easier to follow.

1. All 32 pages are migrated successfully.
   1.1 (nr_succeeded, nr_failed, retry, rc) = (32, 0, 0, 0).
   1.2 The from list is empty when it returns.
   1.3 The return value is 0.
2. If 4 of the 32 pages fail to be migrated permanently, and the other 28 are migrated successfully.
   2.1 (nr_succeeded, nr_failed, retry, rc) = (28, 4, 0, 4).
   2.2 The from list is empty when it returns.
   2.3 The return value is 4.
3. If the first 5 pages are migrated successfully, but the 6th page cannot be migrated due to -EAGAIN.
   3.1 (nr_succeeded, nr_failed, retry, rc) = (5, 0, 1, 1).
   3.2 The from list still holds 27 pages when it returns.
   3.3 The return value is 1.
4. If the first 5 pages are migrated successfully, the 6th page fails permanently, and the 7th page cannot be migrated due to -EAGAIN.
   4.1 (nr_succeeded, nr_failed, retry, rc) = (5, 1, 1, 2).
   4.2 The from list still holds 26 pages when it returns.
   4.3 The return value is 2.

In the 4th case, unmap_and_move() permanently fails to migrate the 6th page and then fails to migrate the 7th page with -EAGAIN. The return value rc = nr_failed + retry = 1 + 1 = 2 correctly indicates how many pages were not migrated, but only nr_failed = 1 page is accounted into the pgmigrate_fail counter in /proc/vmstat. The correct behaviour would be to increase pgmigrate_fail by 2 in this case.
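The arithmetic for the 4th case can be spelled out in a few lines (a stand-alone illustration of the counters, not kernel code):

#include <stdio.h>

int main(void)
{
	int nr_succeeded = 5;	/* pages 1-5 migrated */
	int nr_failed    = 1;	/* page 6: permanent failure */
	int retry        = 1;	/* page 7: still -EAGAIN on the last pass */

	/* before the patch: rc counts both, but vmstat only sees nr_failed */
	printf("before: rc=%d, pgmigrate_fail += %d, pgmigrate_success += %d\n",
	       nr_failed + retry, nr_failed, nr_succeeded);

	/* after the patch: nr_failed absorbs retry before anything is reported */
	nr_failed += retry;
	printf("after : rc=%d, pgmigrate_fail += %d, pgmigrate_success += %d\n",
	       nr_failed, nr_failed, nr_succeeded);
	return 0;
}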

1099 /*
1100  * migrate_pages - migrate the pages specified in a list, to the free pages
1101  *                 supplied as the target for the page migration
1102  *
1103  * @from:               The list of pages to be migrated.
1104  * @get_new_page:       The function used to allocate free pages to be used
1105  *                      as the target of the page migration.
1106  * @put_new_page:       The function used to free target pages if migration
1107  *                      fails, or NULL if no special handling is necessary.
1108  * @private:            Private data to be passed on to get_new_page()
1109  * @mode:               The migration mode that specifies the constraints for
1110  *                      page migration, if any.
1111  * @reason:             The reason for page migration.
1112  *
1113  * The function returns after 10 attempts or if no pages are movable any more
1114  * because the list has become empty or no retryable pages exist any more.
1115  * The caller should call putback_lru_pages() to return pages to the LRU
1116  * or free list only if ret != 0.
1117  *
1118  * Returns the number of pages that were not migrated, or an error code.
1119  */
1120 int migrate_pages(struct list_head *from, new_page_t get_new_page,
1121                 free_page_t put_new_page, unsigned long private,
1122                 enum migrate_mode mode, int reason)
1123 {
1124         int retry = 1;
1125         int nr_failed = 0;
1126         int nr_succeeded = 0;
1127         int pass = 0;
1128         struct page *page;
1129         struct page *page2;
1130         int swapwrite = current->flags & PF_SWAPWRITE;
1131         int rc;
1132 
1133         if (!swapwrite)
1134                 current->flags |= PF_SWAPWRITE;
1135 
1136         for(pass = 0; pass < 10 && retry; pass++) {
1137                 retry = 0;
1138 
1139                 list_for_each_entry_safe(page, page2, from, lru) {
1140                         cond_resched();
1141 
1142                         if (PageHuge(page))
1143                                 rc = unmap_and_move_huge_page(get_new_page,
1144                                                 put_new_page, private, page,
1145                                                 pass > 2, mode);
1146                         else
1147                                 rc = unmap_and_move(get_new_page, put_new_page,
1148                                                 private, page, pass > 2, mode,
1149                                                 reason);
1150 
1151                         switch(rc) {
1152                         case -ENOMEM:
1153                                 goto out;
1154                         case -EAGAIN:
1155                                 retry++;
1156                                 break;
1157                         case MIGRATEPAGE_SUCCESS:
1158                                 nr_succeeded++;
1159                                 break;
1160                         default:
1161                                 /*
1162                                  * Permanent failure (-EBUSY, -ENOSYS, etc.):
1163                                  * unlike -EAGAIN case, the failed page is
1164                                  * removed from migration page list and not
1165                                  * retried in the next outer loop.
1166                                  */
1167                                 nr_failed++;
1168                                 break;
1169                         }
1170                 }
1171         }
1172         rc = nr_failed + retry;
1173 out:
1174         if (nr_succeeded)
1175                 count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
1176         if (nr_failed)
1177                 count_vm_events(PGMIGRATE_FAIL, nr_failed);
1178         trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
1179 
1180         if (!swapwrite)
1181                 current->flags &= ~PF_SWAPWRITE;
1182 
1183         return rc;
1184 }

how does this patch fix the problem in branch since-4.3
The patch folds retry into nr_failed (nr_failed += retry) before computing rc. The 4th case then becomes the following: nr_failed = 2, so the pgmigrate_fail counter in /proc/vmstat is also increased by 2.

 
4. If the first 5 pages are migrated successfully, the 6th page fails permanently, and the 7th page cannot be migrated due to -EAGAIN.
   4.1 (nr_succeeded, nr_failed, retry, rc) = (5, 2, 1, 2).
   4.2 The from list still holds 26 pages when it returns.
   4.3 The return value is 2.

diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7..94961f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1169,7 +1169,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 			}
 		}
 	}
-	rc = nr_failed + retry;
+	nr_failed += retry;
+	rc = nr_failed;
 out:
 	if (nr_succeeded)
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);

conclusion
This post discusses mm, migrate: count pages failing all retries in vmstat and tracepoint, which fixes the accounting of the pgmigrate_fail counter when pages fail to migrate with -EAGAIN even after all retries.

