Archive for the ‘memory management’ Category

kernel: mm: debug_pagealloc

January 25, 2016

This post discusses how to enable debug_pagealloc and how it detects single-bit errors and memory corruption.

reference code base
linux 4.3

how to enable debug_pagealloc

  • If CONFIG_DEBUG_PAGEALLOC is not set, then debug_pagealloc is not available: the feature is compiled out of the kernel.
  • If CONFIG_DEBUG_PAGEALLOC is set, then debug_pagealloc is disabled by default. Adding debug_pagealloc=on to the kernel command line enables this feature. android: add arguments in kernel command line shows how to add arguments to the kernel command line.
853         debug_pagealloc=
854                         [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
855                         parameter enables the feature at boot time. In
856                         default, it is disabled. We can avoid allocating huge
857                         chunk of memory for debug pagealloc if we don't enable
858                         it at boot time and the system will work mostly same
859                         with the kernel built without CONFIG_DEBUG_PAGEALLOC.
860                         on: enable the feature
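
For example, a build that has both pieces in place might look like the fragment below (the other command-line arguments and the console/root devices are illustrative, not taken from a real device):

# kernel .config
CONFIG_DEBUG_PAGEALLOC=y

# kernel command line passed by the bootloader
console=ttyS0 root=/dev/sda2 ro debug_pagealloc=on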

allocate/free pages and debug_pagealloc
Freeing pages poisons them with 0xaa, and allocating pages unpoisons them. While unpoisoning, the kernel checks whether every byte of the poisoned pages is still 0xaa. If exactly one bit is incorrect, the kernel log shows “pagealloc: single bit error”. If more than one bit is incorrect, the kernel log shows “pagealloc: memory corruption”. The allocation path, which unpoisons and checks the pages, is:

__alloc_pages_nodemask()
-> get_page_from_freelist()
   -> prep_new_page()
      -> kernel_map_pages()
         -> __kernel_map_pages()
The free paths, which poison the pages, are as follows. For order-0 pages:
__free_pages()
-> free_hot_cold_page()
   -> free_pages_prepare()

For high-order pages:
__free_pages()
-> __free_pages_ok()
   -> free_pages_prepare()
128 void __kernel_map_pages(struct page *page, int numpages, int enable)
129 {
130         if (!page_poisoning_enabled)
131                 return;
132 
133         if (enable)
134                 unpoison_pages(page, numpages);
135         else
136                 poison_pages(page, numpages);
137 }
138 
 32 /********** mm/debug-pagealloc.c **********/
 33 #define PAGE_POISON 0xaa
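
To make the detection rule concrete, below is a minimal user-space sketch of the same classification idea. It is not the kernel’s implementation in mm/debug-pagealloc.c; it only shows how an intact poison pattern, a single flipped bit, and anything worse are told apart.

#include <stdio.h>
#include <string.h>

#define PAGE_POISON 0xaa
#define PAGE_SIZE   4096

/* Classify damage to a poisoned page: exactly one flipped bit in exactly one
 * byte is reported as a single bit error, anything else as memory corruption. */
static void check_poison(const unsigned char *mem, size_t bytes)
{
        size_t i, bad_bytes = 0;
        int flipped_bits = 0;

        for (i = 0; i < bytes; i++) {
                if (mem[i] != PAGE_POISON) {
                        bad_bytes++;
                        flipped_bits += __builtin_popcount(mem[i] ^ PAGE_POISON);
                }
        }

        if (bad_bytes == 0)
                return;                                 /* poison pattern intact */
        if (bad_bytes == 1 && flipped_bits == 1)
                printf("pagealloc: single bit error\n");
        else
                printf("pagealloc: memory corruption\n");
}

int main(void)
{
        unsigned char page[PAGE_SIZE];

        memset(page, PAGE_POISON, sizeof(page));        /* "poison" at free time */
        page[123] ^= 0x08;                              /* one flipped bit */
        check_poison(page, sizeof(page));

        memset(page, PAGE_POISON, sizeof(page));
        page[42] = 0x00;                                /* whole byte clobbered */
        check_poison(page, sizeof(page));
        return 0;
}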

conclusion
To enable debug_pagealloc, the kernel must be built with CONFIG_DEBUG_PAGEALLOC=y and debug_pagealloc=on must be added to the kernel command line. This feature poisons pages when they are freed and unpoisons them when they are allocated. While unpoisoning pages, if the pages’ contents are not intact, the kernel log shows “pagealloc: single bit error” or “pagealloc: memory corruption”.


patch discussion: mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process()

December 31, 2015

This post discusses mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process().

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what does oom_kill_process() do
As discussed in kernel: mm: oom_kill_process, it does the following three things.

  1. It may replace the victim with one of its eligible child thread groups.
  2. It kills the other thread groups whose mm is the same as the victim’s.
  3. It finally kills the victim.

how does the patch change behaviour for the first item
When choosing a replacement victim, oom_kill_process() only considers child thread groups whose mm is different from the victim’s.

Before this patch, if a child thread group sharing the victim’s mm was killed recently, its thread group leader might already be a zombie waiting for the other threads in the group to finish do_exit(). The leader’s mm is then NULL, so the child passes the child->mm != victim->mm check, and the oom-killer may still pick this already-killed child thread group.

After this patch, in the same situation, the oom-killer will not pick the child thread group as long as any thread in the group still has a non-NULL mm (which, for such a child, is the victim’s mm).

But if all threads of that killed child thread group have already cleared their mm in do_exit(), the oom-killer may still replace the victim with this already-killed thread group. This patch cannot prevent that case.

how does the patch change behaviour for the second item
Besides the victim, the oom-killer also kills the other thread groups whose mm is the same as the victim’s.

Before this patch, if such a thread group was killed recently, its thread group leader might already be a zombie waiting for the other threads in the group to finish do_exit(). Since the leader’s mm is NULL, the oom-killer skips killing this thread group even though some of its threads may still be running with the shared mm.

After this patch, in the same situation, the oom-killer still kills the thread group as long as any thread in the group has a non-NULL mm.

But if all threads of that killed thread group have already cleared their mm in do_exit(), the oom-killer will not kill the thread group again. This patch does not change that case.
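
To make the difference concrete, here is a small user-space analogy (all names and types here are made up for illustration; this is not the patch’s code). It contrasts a check that looks only at the group leader’s mm with a check that scans every thread in the group:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* A "thread group" whose leader has already cleared its mm in do_exit(),
 * while a sibling thread is still exiting and still references the mm. */
struct fake_thread {
        void *mm;
};

/* Old idea: only the group leader's mm decides. */
static bool leader_only_check(const struct fake_thread *leader, void *victim_mm)
{
        return leader->mm == victim_mm;
}

/* New idea: any thread that still has a live mm equal to the victim's counts. */
static bool any_thread_check(const struct fake_thread *threads, size_t n,
                             void *victim_mm)
{
        for (size_t i = 0; i < n; i++)
                if (threads[i].mm && threads[i].mm == victim_mm)
                        return true;
        return false;
}

int main(void)
{
        int victim_mm;                          /* stands in for struct mm_struct */
        struct fake_thread group[] = {
                { .mm = NULL },                 /* zombie leader, mm already NULL */
                { .mm = &victim_mm },           /* sibling thread still using the mm */
        };

        printf("leader-only check: shares mm = %d\n",
               leader_only_check(&group[0], &victim_mm));  /* 0: group wrongly skipped */
        printf("any-thread check:  shares mm = %d\n",
               any_thread_check(group, 2, &victim_mm));    /* 1: group still handled */
        return 0;
}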

I wonder whether it is useless to kill a thread group that has already been killed, but the commit log of this patch says that the old check is incorrect.

conclusion
This post discusses how mm/oom_kill: fix the wrong task->mm == mm checks in oom_kill_process() changes the behaviour of oom_kill_process().

kernel: mm: oom_kill_process

December 29, 2015

The post discusses oom_kill_process().

reference code base
linux 4.3

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> wake_all_kswapds()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> __alloc_pages_may_oom()
      -> get_page_from_freelist()
      -> out_of_memory()
         -> select_bad_process()
         -> oom_kill_process()
   -> __alloc_pages_direct_compact()

when is oom_kill_process() called
The page allocation slow path tries to get pages with the min watermark, with compaction, and with reclaim. If it still fails to allocate pages and no progress is made, it calls __alloc_pages_may_oom() to allocate pages. If that also fails, it may repeat the above flow or try a final round of compaction, depending on the gfp_mask and the page order of this allocation.

In __alloc_pages_may_oom(), it tries to get pages with the high watermark. If that fails, it calls select_bad_process() to select a suitable thread group and oom_kill_process() to kill the chosen thread group.

implementation of oom_kill_process()
oom_kill_process() traverses all child thread groups of the chosen thread group. If a child has a different mm, is killable, and has the highest badness, the victim is replaced with that child. Each such child is the group leader of another thread group. I think this tries to minimise the effect of killing a thread group: the lower a thread group is in the process hierarchy, the less work is lost by killing it.

Then, oom_kill_process() traverses all other thread groups. If a thread group is a user-space one, has the same mm as the victim thread group, and has oom_score_adj > OOM_SCORE_ADJ_MIN, then oom_kill_process() also kills it.

Finally, oom_kill_process() sends SIGKILL to the victim thread group.

479 #define K(x) ((x) << (PAGE_SHIFT-10))
480 /*
481  * Must be called while holding a reference to p, which will be released upon
482  * returning.
483  */
484 void oom_kill_process(struct oom_control *oc, struct task_struct *p,
485                       unsigned int points, unsigned long totalpages,
486                       struct mem_cgroup *memcg, const char *message)
487 {
488         struct task_struct *victim = p;
489         struct task_struct *child;
490         struct task_struct *t;
491         struct mm_struct *mm;
492         unsigned int victim_points = 0;
493         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
494                                               DEFAULT_RATELIMIT_BURST);
495 
496         /*
497          * If the task is already exiting, don't alarm the sysadmin or kill
498          * its children or threads, just set TIF_MEMDIE so it can die quickly
499          */
500         task_lock(p);
501         if (p->mm && task_will_free_mem(p)) {
502                 mark_oom_victim(p);
503                 task_unlock(p);
504                 put_task_struct(p);
505                 return;
506         }
507         task_unlock(p);
508 
509         if (__ratelimit(&oom_rs))
510                 dump_header(oc, p, memcg);
511 
512         task_lock(p);
513         pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
514                 message, task_pid_nr(p), p->comm, points);
515         task_unlock(p);
516 
517         /*
518          * If any of p's children has a different mm and is eligible for kill,
519          * the one with the highest oom_badness() score is sacrificed for its
520          * parent.  This attempts to lose the minimal amount of work done while
521          * still freeing memory.
522          */
523         read_lock(&tasklist_lock);
524         for_each_thread(p, t) {
525                 list_for_each_entry(child, &t->children, sibling) {
526                         unsigned int child_points;
527 
528                         if (child->mm == p->mm)
529                                 continue;
530                         /*
531                          * oom_badness() returns 0 if the thread is unkillable
532                          */
533                         child_points = oom_badness(child, memcg, oc->nodemask,
534                                                                 totalpages);
535                         if (child_points > victim_points) {
536                                 put_task_struct(victim);
537                                 victim = child;
538                                 victim_points = child_points;
539                                 get_task_struct(victim);
540                         }
541                 }
542         }
543         read_unlock(&tasklist_lock);
544 
545         p = find_lock_task_mm(victim);
546         if (!p) {
547                 put_task_struct(victim);
548                 return;
549         } else if (victim != p) {
550                 get_task_struct(p);
551                 put_task_struct(victim);
552                 victim = p;
553         }
554 
555         /* mm cannot safely be dereferenced after task_unlock(victim) */
556         mm = victim->mm;
557         mark_oom_victim(victim);
558         pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
559                 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
560                 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
561                 K(get_mm_counter(victim->mm, MM_FILEPAGES)));
562         task_unlock(victim);
563 
564         /*
565          * Kill all user processes sharing victim->mm in other thread groups, if
566          * any.  They don't get access to memory reserves, though, to avoid
567          * depletion of all memory.  This prevents mm->mmap_sem livelock when an
568          * oom killed thread cannot exit because it requires the semaphore and
569          * its contended by another thread trying to allocate memory itself.
570          * That thread will now get access to memory reserves since it has a
571          * pending fatal signal.
572          */
573         rcu_read_lock();
574         for_each_process(p)
575                 if (p->mm == mm && !same_thread_group(p, victim) &&
576                     !(p->flags & PF_KTHREAD)) {
577                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
578                                 continue;
579 
580                         task_lock(p);   /* Protect ->comm from prctl() */
581                         pr_err("Kill process %d (%s) sharing same memory\n",
582                                 task_pid_nr(p), p->comm);
583                         task_unlock(p);
584                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
585                 }
586         rcu_read_unlock();
587 
588         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
589         put_task_struct(victim);
590 }
591 #undef K

conclusion
This post discusses when oom_kill_process() is called and its implementation.

patch discussion: mm/oom_kill: cleanup the “kill sharing same memory” loop

December 28, 2015

This post discusses mm/oom_kill: cleanup the “kill sharing same memory” loop.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> wake_all_kswapds()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> __alloc_pages_may_oom()
      -> get_page_from_freelist()
      -> out_of_memory()
         -> select_bad_process()
         -> oom_kill_process()
   -> __alloc_pages_direct_compact()

how oom-killer kills thread groups other than victim
out_of_memory() calls select_bad_process() to select the victim thread and calls oom_kill_process() to kill it.

oom_kill_process() iterates over all thread groups and kills a thread group unless it satisfies any of the conditions below.

  • p->mm != victim->mm
  • same_thread_group(p, victim)
  • (p->flags & PF_KTHREAD)
  • p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

This implies that the oom-killer also kills thread groups which share the same mm as the victim, provided their group leader is not a kernel thread and their oom_score_adj is not OOM_SCORE_ADJ_MIN.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c837d06..2b6e880 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -574,14 +574,18 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * pending fatal signal.
 	 */
 	rcu_read_lock();
-	for_each_process(p)
-		if (p->mm == mm && !same_thread_group(p, victim) &&
-		    !(p->flags & PF_KTHREAD)) {
-			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-				continue;
+	for_each_process(p) {
+		if (p->mm != mm)
+			continue;
+		if (same_thread_group(p, victim))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
 
-			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-		}
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+	}
 	rcu_read_unlock();
 
 	mmdrop(mm);

conclusion
This post discusses mm/oom_kill: cleanup the “kill sharing same memory” loop and how the oom-killer kills thread groups other than the victim.

patch discussion: mm: fix the racy mm->locked_vm change in

December 28, 2015

This post discusses mm: fix the racy mm->locked_vm change in.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what locks are held in acct_stack_growth()
kernel: mm: why expand_stack() expands VMAs but only holds down_read(task->mm->mmap_sem) shows that expand_downwards() holds down_read(&task->mm->mmap_sem) and vma_lock_anon_vma(vma) before expanding the VMA. acct_stack_growth() is called to check limits and account the statistics in current->mm just before the VMA is actually updated.

do_page_fault()
down_read(&task->mm->mmap_sem)
-> __do_page_fault()
   -> expand_stack()
    -> expand_downwards()
        -> vma_lock_anon_vma(vma)
        -> acct_stack_growth()
        -> vma_unlock_anon_vma(vma)
   -> handle_mm_fault()
      -> __handle_mm_fault()
up_read(&task->mm->mmap_sem)

what’s the problem
Since VMA expansions always go in the same direction, they have no effect on the relative order of any two VMAs. Therefore all VMAs may expand at the same time while their expanders hold only down_read(&task->mm->mmap_sem).

But it is incorrect to allow concurrent updates to the statistics of task->mm, such as locked_vm, total_vm, and shared_vm.
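
The race itself is the classic lost-update problem. The following user-space analogy (plain C with pthreads, not kernel code) shows why an unsynchronised read-modify-write such as mm->locked_vm += grow loses increments when several threads run it concurrently:

#include <pthread.h>
#include <stdio.h>

/* Shared counter standing in for mm->locked_vm. volatile only keeps the
 * compiler from collapsing the loop; it does not make the update atomic. */
static volatile unsigned long locked_vm;

static void *grow(void *arg)
{
        (void)arg;
        for (int i = 0; i < 1000000; i++)
                locked_vm += 1;                 /* load, add, store: not atomic */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, grow, NULL);
        pthread_create(&b, NULL, grow, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        printf("expected 2000000, got %lu\n", locked_vm);
        return 0;
}

Compiled with gcc -pthread, the printed value is usually well below 2000000, which is exactly the kind of silent statistics drift this patch is about.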

how does the patch fix it
It is safe for a thread to modify task->mm->locked_vm while holding down_write(&task->mm->mmap_sem). But here a thread modifies task->mm->locked_vm while holding only down_read(&task->mm->mmap_sem), and many threads could hold down_read(&task->mm->mmap_sem) and modify task->mm->locked_vm concurrently.

This patch moves the code which modifies the statistics of task->mm into the section protected by spin_lock(&vma->vm_mm->page_table_lock). Then every thread holding down_read(&task->mm->mmap_sem) must acquire spin_lock(&vma->vm_mm->page_table_lock) before modifying the statistics of task->mm.

diff --git a/mm/mmap.c b/mm/mmap.c
index 3ec19b6..d1ac224 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2138,10 +2138,6 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
 	if (security_vm_enough_memory_mm(mm, grow))
 		return -ENOMEM;
 
-	/* Ok, everything looks good - let it rip */
-	if (vma->vm_flags & VM_LOCKED)
-		mm->locked_vm += grow;
-	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
 	return 0;
 }
 
@@ -2202,6 +2198,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					vma->vm_mm->locked_vm += grow;
+				vm_stat_account(vma->vm_mm, vma->vm_flags,
+						vma->vm_file, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
 				anon_vma_interval_tree_post_update_vma(vma);
@@ -2273,6 +2273,10 @@ int expand_downwards(struct vm_area_struct *vma,
 				 * against concurrent vma expansions.
 				 */
 				spin_lock(&vma->vm_mm->page_table_lock);
+				if (vma->vm_flags & VM_LOCKED)
+					vma->vm_mm->locked_vm += grow;
+				vm_stat_account(vma->vm_mm, vma->vm_flags,
+						vma->vm_file, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;

conclusion
This post discusses mm: fix the racy mm->locked_vm change in. I am not sure whether this method is the best one, since it puts the code that updates the statistics of task->mm under spin_lock(&vma->vm_mm->page_table_lock). But the resulting accuracy is acceptable.

kernel: mm: why expand_stack() expands VMAs but only holds down_read(task->mm->mmap_sem)

December 28, 2015

This post discusses why expand_stack() expands VMAs but only holds down_read(task->mm->mmap_sem).

reference code base
linux 4.3

kernel config assumption
# CONFIG_STACK_GROWSUP is not set

do_page_fault()
-> __do_page_fault()
   -> expand_stack()
    -> expand_downwards()
   -> handle_mm_fault()
      -> __handle_mm_fault()

expand_stack()
kernel: arm64: mm: how user space stack grows shows that expand_stack() is triggered when the stack grows and its VMA needs to be expanded.

down_read(&task->mm->mmap_sem) must be held before entering expand_stack(). expand_stack() calls expand_downwards() to expand the VMA downwards, typically by one page, so that it covers the faulting address.

2335 int expand_stack(struct vm_area_struct *vma, unsigned long address)
2336 {
2337         struct vm_area_struct *prev;
2338 
2339         address &= PAGE_MASK;
2340         prev = vma->vm_prev;
2341         if (prev && prev->vm_end == address) {
2342                 if (!(prev->vm_flags & VM_GROWSDOWN))
2343                         return -ENOMEM;
2344         }
2345         return expand_downwards(vma, address);
2346 }

expand_downwards() may update the VMA by expanding it downwards. It calls vma_lock_anon_vma(vma) before the update and vma_unlock_anon_vma(vma) after it.

2226 /*
2227  * vma is the first one with address < vma->vm_start.  Have to extend vma.
2228  */
2229 int expand_downwards(struct vm_area_struct *vma,
2230                                    unsigned long address)
2231 {
2232         int error;
2233 
2234         /*
2235          * We must make sure the anon_vma is allocated
2236          * so that the anon_vma locking is not a noop.
2237          */
2238         if (unlikely(anon_vma_prepare(vma)))
2239                 return -ENOMEM;
2240 
2241         address &= PAGE_MASK;
2242         error = security_mmap_addr(address);
2243         if (error)
2244                 return error;
2245 
2246         vma_lock_anon_vma(vma);
2247 
2248         /*
2249          * vma->vm_start/vm_end cannot change under us because the caller
2250          * is required to hold the mmap_sem in read mode.  We need the
2251          * anon_vma lock to serialize against concurrent expand_stacks.
2252          */
2253 
2254         /* Somebody else might have raced and expanded it already */
2255         if (address < vma->vm_start) {
2256                 unsigned long size, grow;
2257 
2258                 size = vma->vm_end - address;
2259                 grow = (vma->vm_start - address) >> PAGE_SHIFT;
2260 
2261                 error = -ENOMEM;
2262                 if (grow <= vma->vm_pgoff) {
2263                         error = acct_stack_growth(vma, size, grow);
2264                         if (!error) {
2265                                 /*
2266                                  * vma_gap_update() doesn't support concurrent
2267                                  * updates, but we only hold a shared mmap_sem
2268                                  * lock here, so we need to protect against
2269                                  * concurrent vma expansions.
2270                                  * vma_lock_anon_vma() doesn't help here, as
2271                                  * we don't guarantee that all growable vmas
2272                                  * in a mm share the same root anon vma.
2273                                  * So, we reuse mm->page_table_lock to guard
2274                                  * against concurrent vma expansions.
2275                                  */
2276                                 spin_lock(&vma->vm_mm->page_table_lock);
2277                                 anon_vma_interval_tree_pre_update_vma(vma);
2278                                 vma->vm_start = address;
2279                                 vma->vm_pgoff -= grow;
2280                                 anon_vma_interval_tree_post_update_vma(vma);
2281                                 vma_gap_update(vma);
2282                                 spin_unlock(&vma->vm_mm->page_table_lock);
2283 
2284                                 perf_event_mmap(vma);
2285                         }
2286                 }
2287         }
2288         vma_unlock_anon_vma(vma);
2289         khugepaged_enter_vma_merge(vma, vma->vm_flags);
2290         validate_mm(vma->vm_mm);
2291         return error;
2292 }

why does expand_stack() expand VMAs while only holding down_read(task->mm->mmap_sem)
expand_downwards() may traverse VMAs while holding down_read(&task->mm->mmap_sem). After finding the relevant VMA, expand_downwards() takes vma_lock_anon_vma(vma) to modify that single VMA. Since all VMAs in current->mm->mmap can only expand downwards, all VMAs may be expanded concurrently: their relative order stays the same even after concurrent expansions. Each VMA expansion therefore only needs to hold vma_lock_anon_vma(vma) to protect that VMA against concurrent modification.

conclusion
This post discusses why expand_stack() expands VMAs while only holding down_read(task->mm->mmap_sem). Since all VMAs expand in the same direction, the relative order of all VMAs stays the same even when they are expanded at the same time. Therefore, expand_stack() only needs to hold down_read(task->mm->mmap_sem) to read the VMAs and vma_lock_anon_vma(vma) to protect a single VMA against concurrent modification.

kernel: arm64: mm: how user space stack grows

December 28, 2015

This post discusses how user space stack grows.

reference code base
linux 4.3

kernel config assumption
# CONFIG_STACK_GROWSUP is not set

how does user space stack grow
As discussed in kernel: mm: task->mm->mmap_sem, the virtual address space is represented by intervals of VMAs. The stack itself corresponds to a VMA whose vm_flags has the VM_GROWSDOWN flag set.

If the stack grows below its corresponding VMA’s range, a page fault is triggered. __do_page_fault() calls expand_stack() to extend the corresponding VMA downwards, and then calls handle_mm_fault() to allocate a page and update the page tables.

do_page_fault()
-> __do_page_fault()
   -> expand_stack()
    -> expand_downwards()
   -> handle_mm_fault()
      -> __handle_mm_fault()
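
A trivial user-space program can exercise this path: deep recursion keeps pushing the stack pointer below the current stack VMA, so every newly touched stack page goes through the fault path above until RLIMIT_STACK is reached (the frame size and recursion depth here are arbitrary illustration values):

#include <stdio.h>

static long consume(int depth)
{
        volatile char frame[4096];              /* roughly one page per call */

        frame[0] = (char)depth;                 /* touch the newly faulted page */
        if (depth == 0)
                return frame[0];
        return frame[0] + consume(depth - 1);
}

int main(void)
{
        printf("%ld\n", consume(1000));         /* grows the stack by roughly 4MB */
        return 0;
}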

__do_page_fault
If vma->vm_start > addr, __do_page_fault() checks whether the VM_GROWSDOWN flag is set in vma->vm_flags. If it is, it calls expand_stack(vma, addr) to expand the VMA downwards to cover addr. If that succeeds, it calls handle_mm_fault() to allocate a page and update the page tables.

157 static int __do_page_fault(struct mm_struct *mm, unsigned long addr,
158                            unsigned int mm_flags, unsigned long vm_flags,
159                            struct task_struct *tsk)
160 {
161         struct vm_area_struct *vma;
162         int fault;
163 
164         vma = find_vma(mm, addr);
165         fault = VM_FAULT_BADMAP;
166         if (unlikely(!vma))
167                 goto out;
168         if (unlikely(vma->vm_start > addr))
169                 goto check_stack;
170 
171         /*
172          * Ok, we have a good vm_area for this memory access, so we can handle
173          * it.
174          */
175 good_area:
176         /*
177          * Check that the permissions on the VMA allow for the fault which
178          * occurred. If we encountered a write or exec fault, we must have
179          * appropriate permissions, otherwise we allow any permission.
180          */
181         if (!(vma->vm_flags & vm_flags)) {
182                 fault = VM_FAULT_BADACCESS;
183                 goto out;
184         }
185 
186         return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
187 
188 check_stack:
189         if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
190                 goto good_area;
191 out:
192         return fault;
193 }

conclusion
This post discusses how the user space stack grows. When the stack reaches below the lowest address covered by its VMA, a page fault is triggered; the fault handler extends the corresponding VMA downwards by one page, allocates a page, and updates the page tables.

kernel: mm: task->mm->mmap_sem

December 28, 2015

This post discusses task->mm->mmap_sem.

reference code base
linux 4.3

what is task->mm->mmap_sem
Each thread is represented by a struct task_struct, and task->mm describes the address space of this thread. The virtual address space of a thread is represented by intervals of VMAs, i.e., virtual memory areas.

task->mm->mmap is the head of the linked list of VMAs and task->mm->mm_rb is the root of a red-black tree of VMAs. task->mm->mmap_sem is a read-write semaphore that synchronises access to the VMAs of a thread.

370 struct mm_struct {
371         struct vm_area_struct *mmap;            /* list of VMAs */
372         struct rb_root mm_rb;
373         u32 vmacache_seqnum;                   /* per-thread vmacache */
374 #ifdef CONFIG_MMU
375         unsigned long (*get_unmapped_area) (struct file *filp,
376                                 unsigned long addr, unsigned long len,
377                                 unsigned long pgoff, unsigned long flags);
378 #endif
379         unsigned long mmap_base;                /* base of mmap area */
380         unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
381         unsigned long task_size;                /* size of task vm space */
382         unsigned long highest_vm_end;           /* highest vma end address */
383         pgd_t * pgd;
384         atomic_t mm_users;                      /* How many users with user space? */
385         atomic_t mm_count;                      /* How many references to "struct mm_struct" (users count as 1) */
386         atomic_long_t nr_ptes;                  /* PTE page table pages */
387 #if CONFIG_PGTABLE_LEVELS > 2
388         atomic_long_t nr_pmds;                  /* PMD page table pages */
389 #endif
390         int map_count;                          /* number of VMAs */
391 
392         spinlock_t page_table_lock;             /* Protects page tables and some counters */
393         struct rw_semaphore mmap_sem;
394 
395         struct list_head mmlist;                /* List of maybe swapped mm's.  These are globally strung
396                                                  * together off init_mm.mmlist, and are protected
397                                                  * by mmlist_lock
398                                                  */

how to use task->mm->mmap_sem
The caller should call down_read(&task->mm->mmap_sem) before reading the VMAs of a thread and up_read(&task->mm->mmap_sem) after reading them.

The caller should call down_write(&task->mm->mmap_sem) before updating the VMAs of a thread and up_write(&task->mm->mmap_sem) after updating them.

system call mmap()
User space calls mmap() to request adding a VMA interval to current->mm->mmap. do_mmap() inserts this VMA; down_write(&current->mm->mmap_sem) must be held before calling do_mmap().

sys_mmap()
-> sys_mmap_pgoff()
   -> vm_mmap_pgoff()
      -> down_write(&mm->mmap_sem)
      -> do_mmap_pgoff()
         -> do_mmap()
      -> up_write(&mm->mmap_sem)
1260 /*
1261  * The caller must hold down_write(&current->mm->mmap_sem).
1262  */
1263 unsigned long do_mmap(struct file *file, unsigned long addr,
1264                         unsigned long len, unsigned long prot,
1265                         unsigned long flags, vm_flags_t vm_flags,
1266                         unsigned long pgoff, unsigned long *populate)
1267 {
1268         struct mm_struct *mm = current->mm;

system call munmap()
User space calls munmap() to request removing a VMA interval from current->mm->mmap. do_munmap() removes this VMA; down_write(&current->mm->mmap_sem) must be held before calling do_munmap().

sys_munmap()
-> vm_munmap()
   -> down_write(&mm->mmap_sem)
   -> do_munmap()
   -> up_write(&mm->mmap_sem)
2619 int vm_munmap(unsigned long start, size_t len)
2620 {
2621         int ret;
2622         struct mm_struct *mm = current->mm;
2623 
2624         down_write(&mm->mmap_sem);
2625         ret = do_munmap(mm, start, len);
2626         up_write(&mm->mmap_sem);
2627         return ret;
2628 }
2629 EXPORT_SYMBOL(vm_munmap);
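
Seen from user space, the two paths above are exercised by an ordinary mmap()/munmap() pair, for example:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);    /* ends up in do_mmap() */
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        strcpy(p, "hello");                     /* first touch faults the page in */
        printf("%s from %p\n", (char *)p, p);

        if (munmap(p, len))                     /* ends up in do_munmap() */
                perror("munmap");
        return 0;
}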

page fault
When a page fault happens in user space, the exception handler reads the VMAs to determine whether the access is valid. If the accessed address is valid, it allocates a page for this address. If it is not valid, it sends a segmentation fault signal to the current thread. down_read(&mm->mmap_sem) must be held before entering __do_page_fault().

195 static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
196                                    struct pt_regs *regs)
197 {
198         struct task_struct *tsk;
199         struct mm_struct *mm;
200         int fault, sig, code;
201         unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
202         unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
203 
204         tsk = current;
205         mm  = tsk->mm;
206 
207         /* Enable interrupts if they were enabled in the parent context. */
208         if (interrupts_enabled(regs))
209                 local_irq_enable();
210 
211         /*
212          * If we're in an interrupt or have no user context, we must not take
213          * the fault.
214          */
215         if (faulthandler_disabled() || !mm)
216                 goto no_context;
217 
218         if (user_mode(regs))
219                 mm_flags |= FAULT_FLAG_USER;
220 
221         if (esr & ESR_LNX_EXEC) {
222                 vm_flags = VM_EXEC;
223         } else if ((esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM)) {
224                 vm_flags = VM_WRITE;
225                 mm_flags |= FAULT_FLAG_WRITE;
226         }
227 
228         /*
229          * PAN bit set implies the fault happened in kernel space, but not
230          * in the arch's user access functions.
231          */
232         if (IS_ENABLED(CONFIG_ARM64_PAN) && (regs->pstate & PSR_PAN_BIT))
233                 goto no_context;
234 
235         /*
236          * As per x86, we may deadlock here. However, since the kernel only
237          * validly references user space from well defined areas of the code,
238          * we can bug out early if this is from code which shouldn't.
239          */
240         if (!down_read_trylock(&mm->mmap_sem)) {
241                 if (!user_mode(regs) && !search_exception_tables(regs->pc))
242                         goto no_context;
243 retry:
244                 down_read(&mm->mmap_sem);
245         } else {
246                 /*
247                  * The above down_read_trylock() might have succeeded in which
248                  * case, we'll have missed the might_sleep() from down_read().
249                  */
250                 might_sleep();
251 #ifdef CONFIG_DEBUG_VM
252                 if (!user_mode(regs) && !search_exception_tables(regs->pc))
253                         goto no_context;
254 #endif
255         }
256 
257         fault = __do_page_fault(mm, addr, mm_flags, vm_flags, tsk);
258 
259         /*
260          * If we need to retry but a fatal signal is pending, handle the
261          * signal first. We do not need to release the mmap_sem because it
262          * would already be released in __lock_page_or_retry in mm/filemap.c.
263          */
264         if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
265                 return 0;
266 
267         /*
268          * Major/minor page fault accounting is only done on the initial
269          * attempt. If we go through a retry, it is extremely likely that the
270          * page will be found in page cache at that point.
271          */
272 
273         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
274         if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
275                 if (fault & VM_FAULT_MAJOR) {
276                         tsk->maj_flt++;
277                         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
278                                       addr);
279                 } else {
280                         tsk->min_flt++;
281                         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
282                                       addr);
283                 }
284                 if (fault & VM_FAULT_RETRY) {
285                         /*
286                          * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
287                          * starvation.
288                          */
289                         mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
290                         mm_flags |= FAULT_FLAG_TRIED;
291                         goto retry;
292                 }
293         }
294 
295         up_read(&mm->mmap_sem);
296 
297         /*
298          * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
299          */
300         if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
301                               VM_FAULT_BADACCESS))))
302                 return 0;
303 
304         /*
305          * If we are in kernel mode at this point, we have no context to
306          * handle this fault with.
307          */
308         if (!user_mode(regs))
309                 goto no_context;
310 
311         if (fault & VM_FAULT_OOM) {
312                 /*
313                  * We ran out of memory, call the OOM killer, and return to
314                  * userspace (which will retry the fault, or kill us if we got
315                  * oom-killed).
316                  */
317                 pagefault_out_of_memory();
318                 return 0;
319         }
320 
321         if (fault & VM_FAULT_SIGBUS) {
322                 /*
323                  * We had some memory, but were unable to successfully fix up
324                  * this page fault.
325                  */
326                 sig = SIGBUS;
327                 code = BUS_ADRERR;
328         } else {
329                 /*
330                  * Something tried to access memory that isn't in our memory
331                  * map.
332                  */
333                 sig = SIGSEGV;
334                 code = fault == VM_FAULT_BADACCESS ?
335                         SEGV_ACCERR : SEGV_MAPERR;
336         }
337 
338         __do_user_fault(tsk, addr, esr, sig, code, regs);
339         return 0;
340 
341 no_context:
342         __do_kernel_fault(mm, addr, esr, regs);
343         return 0;
344 }

conclusion
This post discusses task->mm->mmap_sem, which protects current->mm->mmap against concurrent access. down_write(&mm->mmap_sem) must be held before entering do_mmap() and do_munmap(), which modify current->mm->mmap. down_read(&mm->mmap_sem) must be held before entering __do_page_fault(), which reads current->mm->mmap.

patch discussion: mm/vmscan.c: fix types of some locals

December 27, 2015

This post discusses mm/vmscan.c: fix types of some locals.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

zone_page_state(), zone_unmapped_file_pages()
Both functions return page counts with type unsigned long.

what does the patch do
The patch fixes possible overflow and underflow when a signed int or long local variable stores the return value of a function that returns unsigned long. It does so by changing the types of those local variables to unsigned long.
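
The underlying C issue is easy to reproduce in user space; the page count below is a made-up value chosen only to show the truncation:

#include <stdio.h>

int main(void)
{
        unsigned long nr_pages = 3UL * 1024 * 1024 * 1024;

        int as_int = nr_pages;              /* overflows: typically ends up negative */
        long as_long = nr_pages;            /* fine where long is 64-bit, overflows where it is 32-bit */
        unsigned long as_ulong = nr_pages;  /* always preserved */

        printf("int: %d  long: %ld  unsigned long: %lu\n", as_int, as_long, as_ulong);
        return 0;
}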

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ceede0..55721b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,7 +194,7 @@ static bool sane_reclaim(struct scan_control *sc)
 
 static unsigned long zone_reclaimable_pages(struct zone *zone)
 {
-	int nr;
+	unsigned long nr;
 
 	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
 	     zone_page_state(zone, NR_INACTIVE_FILE);
@@ -3693,10 +3693,10 @@ static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 }
 
 /* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static long zone_pagecache_reclaimable(struct zone *zone)
+static unsigned long zone_pagecache_reclaimable(struct zone *zone)
 {
-	long nr_pagecache_reclaimable;
-	long delta = 0;
+	unsigned long nr_pagecache_reclaimable;
+	unsigned long delta = 0;
 
 	/*
 	 * If RECLAIM_UNMAP is set, then all file pages are considered

conclusion
This post discusses mm/vmscan.c: fix types of some locals. If a function returns unsigned long, the caller should store the result in an unsigned long variable to avoid overflow or underflow.

patch discussion: mm, oom: remove task_lock protecting comm printing

December 27, 2015

This post discusses mm, oom: remove task_lock protecting comm printing.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

task->comm and task_lock()
As discussed in kernel: task_struct: comm and task_lock, comm is protected by task_lock() and task_unlock().

what does this patch do
This patch removes the task_lock()/task_unlock() pairs that protect task->comm during printing. The reasoning is that, although task->comm could be updated while it is being printed, it is more efficient not to hold the task’s alloc_lock just for printing.
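
As a side note, task->comm really can change at any moment, because a thread may rename itself with prctl(PR_SET_NAME). A small user-space demonstration (the new name string is arbitrary):

#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
        char name[16];                          /* TASK_COMM_LEN is 16 in the kernel */

        prctl(PR_GET_NAME, (unsigned long)name, 0, 0, 0);
        printf("before: %s\n", name);

        prctl(PR_SET_NAME, (unsigned long)"renamed", 0, 0, 0);  /* updates current->comm */
        prctl(PR_GET_NAME, (unsigned long)name, 0, 0, 0);
        printf("after:  %s\n", name);
        return 0;
}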

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c170d9f..58f3d27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -377,13 +377,11 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 static void dump_header(struct oom_control *oc, struct task_struct *p,
 			struct mem_cgroup *memcg)
 {
-	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, oc->order,
 		current->signal->oom_score_adj);
-	cpuset_print_task_mems_allowed(current);
-	task_unlock(current);
+	cpuset_print_current_mems_allowed();
 	dump_stack();
 	if (memcg)
 		mem_cgroup_print_oom_info(memcg, p);
@@ -509,10 +507,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
 
-	task_lock(p);
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
 
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
@@ -586,10 +582,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			if (fatal_signal_pending(p))
 				continue;
 
-			task_lock(p);	/* Protect ->comm from prctl() */
 			pr_info("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(p), p->comm);
-			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
 	rcu_read_unlock();

conclusion
This post discusses mm, oom: remove task_lock protecting comm printing. It accepts the possibility of printing a stale task->comm and avoids the performance cost of taking the task lock.

