Archive for the ‘process management’ Category

kernel: thread and thread group

December 29, 2015

This post discusses thread and thread group.

reference code base
linux 4.3

process and thread vs. thread group and thread
In user space, a process has many threads. Each process has an ID called pid, and each thread has an ID called tid. These IDs can be obtained via getpid() and gettid() on a POSIX.1-compliant operating system. The counterpart of a user space process in kernel space is called a thread group.

thread group and group_leader
In kernel space, each thread is represented by a task_struct. A thread group contains many threads. Each thread can find its thread group leader through group_leader. All threads with the same group_leader are in the same thread group.

1492         /*
1493          * children/sibling forms the list of my natural children
1494          */
1495         struct list_head children;      /* list of my children */
1496         struct list_head sibling;       /* linkage in my parent's children list */
1497         struct task_struct *group_leader;       /* threadgroup leader */

task fields: signal, thread_node, and thread_group
All threads in the same thread group share the same signal (a struct signal_struct). In signal_struct, nr_threads indicates the number of threads in the thread group, thread_head is a list_head that links all threads in the thread group, and leader_pid is the pid of the group leader.

A thread joins a thread group by inserting task->thread_node into task->signal->thread_head.

For a group leader, task->thread_group is a list_head that links all the threads in the thread group. A thread that is not a group leader joins the thread group by inserting its task->thread_group into the group leader's task->thread_group.

1507         /* PID/PID hash table linkage. */
1508         struct pid_link pids[PIDTYPE_MAX];
1509         struct list_head thread_group;
1510         struct list_head thread_node;
......
1563 /* signal handlers */
1564         struct signal_struct *signal;
1565         struct sighand_struct *sighand;
641 struct signal_struct {
642         atomic_t                sigcnt;
643         atomic_t                live;
644         int                     nr_threads;
645         struct list_head        thread_head;
......
687         struct pid *leader_pid;
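
To make these relationships concrete, here is a minimal sketch of mine (not code from the kernel source above) that inspects a task's thread-group membership using helpers such as task_pid_nr(), task_tgid_nr(), and same_thread_group() from include/linux/sched.h:

#include <linux/sched.h>

/* Sketch: inspect a task's thread-group membership (illustration only). */
static void inspect_thread_group(struct task_struct *t)
{
	struct task_struct *leader = t->group_leader;

	pr_info("%s: tid=%d tgid=%d leader=%d nr_threads=%d\n",
		t->comm, task_pid_nr(t), task_tgid_nr(t),
		task_pid_nr(leader), t->signal->nr_threads);

	/* All threads sharing the leader's signal_struct are in the group. */
	WARN_ON(!same_thread_group(t, leader));
}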

traverse thread groups and threads
for_each_process(p) traverses all thread group leaders. Callers reference p for each thread group leader. init_task, i.e., the idle process, is the first thread group leader, and its pid is 0.

for_each_thread(p, t) traverses all threads in the same thread group as p. Callers reference t for each thread. The implementation traverses p->signal->thread_head, into which each thread in the same thread group inserts its thread_node.

for_each_process_thread(p, t) traverses all threads by leveraging for_each_process(p) and for_each_thread(p, t). Callers reference t for each thread and p for thread group leader of t.

2650 #define for_each_process(p) \
2651         for (p = &init_task ; (p = next_task(p)) != &init_task ; )
2652 
2653 extern bool current_is_single_threaded(void);
2654 
2655 /*
2656  * Careful: do_each_thread/while_each_thread is a double loop so
2657  *          'break' will not work as expected - use goto instead.
2658  */
2659 #define do_each_thread(g, t) \
2660         for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
2661 
2662 #define while_each_thread(g, t) \
2663         while ((t = next_thread(t)) != g)
2664 
2665 #define __for_each_thread(signal, t)    \
2666         list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)
2667 
2668 #define for_each_thread(p, t)           \
2669         __for_each_thread((p)->signal, t)
2670 
2671 /* Careful: this is a double loop, 'break' won't work as expected. */
2672 #define for_each_process_thread(p, t)   \
2673         for_each_process(p) for_each_thread(p, t)
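
As a usage sketch (an illustration of mine, not code from the kernel), the following walks every thread in the system together with its group leader. The task lists must be traversed under rcu_read_lock() or tasklist_lock:

/* Sketch: dump every thread together with its thread group leader. */
static void dump_all_threads(void)
{
	struct task_struct *p, *t;

	rcu_read_lock();
	for_each_process_thread(p, t) {
		/* p is the group leader, t iterates over its threads */
		pr_info("leader %d (%s) -> thread %d (%s)\n",
			task_tgid_nr(p), p->comm, task_pid_nr(t), t->comm);
	}
	rcu_read_unlock();
}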

conclusion
This post discusses thread and thread group. It shows the related data structures and macros.

kernel: task_struct: comm and task_lock

December 27, 2015

This post discusses comm and task_lock() of task_struct.

reference code base
linux 4.3

struct task_struct
Each thread is represented by a struct task_struct. In this post, we care about two fields of task_struct, comm and alloc_lock.

comm is a NUL-terminated character string. Its size is at most 16 bytes, including the NUL terminator. comm is protected by task_lock().

314 /* Task command name length */
315 #define TASK_COMM_LEN 16
1378 struct task_struct {
1379         volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
......
1542         char comm[TASK_COMM_LEN]; /* executable name excluding path
1543                                      - access with [gs]et_task_comm (which lock
1544                                        it with task_lock())
1545                                      - initialized normally by setup_new_exec */
......
1588 /* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
1589  * mempolicy */
1590         spinlock_t alloc_lock;

task_lock() and task_unlock()
task_lock() and task_unlock() acquire and release ->alloc_lock, which protects ->fs, ->files, ->mm, ->group_info, ->comm, and keyring subscriptions. What we care about here is that it protects ->comm.

2716 /*
2717  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
2718  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
2719  * pins the final release of task.io_context.  Also protects ->cpuset and
2720  * ->cgroup.subsys[]. And ->vfork_done.
2721  *
2722  * Nests both inside and outside of read_lock(&tasklist_lock).
2723  * It must not be nested with write_lock_irq(&tasklist_lock),
2724  * neither inside nor outside.
2725  */
2726 static inline void task_lock(struct task_struct *p)
2727 {
2728         spin_lock(&p->alloc_lock);
2729 }
2730 
2731 static inline void task_unlock(struct task_struct *p)
2732 {
2733         spin_unlock(&p->alloc_lock);
2734 }
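
As a small usage sketch (my own example, not from the kernel tree), a safe way to take a snapshot of another task's name is to copy it under task_lock(), which is essentially what get_task_comm() does internally:

/* Sketch: copy another task's name into a local buffer under task_lock(). */
static void snapshot_task_name(struct task_struct *p)
{
	char name[TASK_COMM_LEN];

	task_lock(p);				/* protects p->comm */
	strncpy(name, p->comm, TASK_COMM_LEN);
	name[TASK_COMM_LEN - 1] = '\0';
	task_unlock(p);

	pr_info("task %d is running %s\n", task_pid_nr(p), name);
}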

conclusion
This post discusses comm and task_lock() of task_struct. comm is the name of a thread. task_lock() and task_unlock() protect task->comm.

mm: TIF_MEMDIE

December 21, 2015

This post discusses TIF_MEMDIE.

reference code base
linux v4.3

what is thread_info
thread_info is a per-thread data structure. In arm64, the thread_info of each thread lies at the lowest address of its kernel stack. Each flag in thread_info->flags indicates a different runtime property of the thread.

 42  * low level task data that entry.S needs immediate access to.
 43  * __switch_to() assumes cpu_context follows immediately after cpu_domain.
 44  */
 45 struct thread_info {
 46         unsigned long           flags;          /* low level flags */
 47         mm_segment_t            addr_limit;     /* address limit */
 48         struct task_struct      *task;          /* main task structure */
 49         int                     preempt_count;  /* 0 => preemptable, <0 => bug */
 50         int                     cpu;            /* cpu */
 51 };

what is TIF_MEMDIE
TIF_MEMDIE is a flag in thread_info->flags. If TIF_MEMDIE is set, it implies that the OOM killer is killing the thread.

 89 /*
 90  * thread information flags:
 91  *  TIF_SYSCALL_TRACE   - syscall trace active
 92  *  TIF_SYSCALL_TRACEPOINT - syscall tracepoint for ftrace
 93  *  TIF_SYSCALL_AUDIT   - syscall auditing
 94  *  TIF_SECOMP          - syscall secure computing
 95  *  TIF_SIGPENDING      - signal pending
 96  *  TIF_NEED_RESCHED    - rescheduling necessary
 97  *  TIF_NOTIFY_RESUME   - callback before returning to user
 98  *  TIF_USEDFPU         - FPU was used by this task this quantum (SMP)
 99  */
100 #define TIF_SIGPENDING          0
101 #define TIF_NEED_RESCHED        1
102 #define TIF_NOTIFY_RESUME       2       /* callback before returning to user */
103 #define TIF_FOREIGN_FPSTATE     3       /* CPU's FP state is not current's */
104 #define TIF_NOHZ                7
105 #define TIF_SYSCALL_TRACE       8
106 #define TIF_SYSCALL_AUDIT       9
107 #define TIF_SYSCALL_TRACEPOINT  10
108 #define TIF_SECCOMP             11
109 #define TIF_MEMDIE              18      /* is terminating due to OOM killer */
110 #define TIF_FREEZE              19
111 #define TIF_RESTORE_SIGMASK     20
112 #define TIF_SINGLESTEP          21
113 #define TIF_32BIT               22      /* 32bit process */
114 #define TIF_SWITCH_MM           23      /* deferred switch_mm */

when is TIF_MEMDIE of a thread set
oom_kill_process() kills a thread due to out of memory. Once oom_kill_process() determines the victim to kill, it calls mark_oom_victim(victim) to indicate that the thread is being killed by the OOM killer, and do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true) to kill the victim thread.

mark_oom_victim() sets the TIF_MEMDIE flag. The kernel can test a thread's TIF_MEMDIE flag to know whether the thread is being killed by the OOM killer.

479 #define K(x) ((x) << (PAGE_SHIFT-10))
480 /*
481  * Must be called while holding a reference to p, which will be released upon
482  * returning.
483  */
484 void oom_kill_process(struct oom_control *oc, struct task_struct *p,
485                       unsigned int points, unsigned long totalpages,
486                       struct mem_cgroup *memcg, const char *message)
487 {
488         struct task_struct *victim = p;
489         struct task_struct *child;
490         struct task_struct *t;
491         struct mm_struct *mm;
492         unsigned int victim_points = 0;
493         static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
494                                               DEFAULT_RATELIMIT_BURST);
495 
496         /*
497          * If the task is already exiting, don't alarm the sysadmin or kill
498          * its children or threads, just set TIF_MEMDIE so it can die quickly
499          */
500         task_lock(p);
501         if (p->mm && task_will_free_mem(p)) {
502                 mark_oom_victim(p);
503                 task_unlock(p);
504                 put_task_struct(p);
505                 return;
506         }
507         task_unlock(p);
508 
509         if (__ratelimit(&oom_rs))
510                 dump_header(oc, p, memcg);
511 
512         task_lock(p);
513         pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
514                 message, task_pid_nr(p), p->comm, points);
515         task_unlock(p);
516 
517         /*
518          * If any of p's children has a different mm and is eligible for kill,
519          * the one with the highest oom_badness() score is sacrificed for its
520          * parent.  This attempts to lose the minimal amount of work done while
521          * still freeing memory.
522          */
523         read_lock(&tasklist_lock);
524         for_each_thread(p, t) {
525                 list_for_each_entry(child, &t->children, sibling) {
526                         unsigned int child_points;
527 
528                         if (child->mm == p->mm)
529                                 continue;
530                         /*
531                          * oom_badness() returns 0 if the thread is unkillable
532                          */
533                         child_points = oom_badness(child, memcg, oc->nodemask,
534                                                                 totalpages);
535                         if (child_points > victim_points) {
536                                 put_task_struct(victim);
537                                 victim = child;
538                                 victim_points = child_points;
539                                 get_task_struct(victim);
540                         }
541                 }
542         }
543         read_unlock(&tasklist_lock);
544 
545         p = find_lock_task_mm(victim);
546         if (!p) {
547                 put_task_struct(victim);
548                 return;
549         } else if (victim != p) {
550                 get_task_struct(p);
551                 put_task_struct(victim);
552                 victim = p;
553         }
554 
555         /* mm cannot safely be dereferenced after task_unlock(victim) */
556         mm = victim->mm;
557         mark_oom_victim(victim);
558         pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
559                 task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
560                 K(get_mm_counter(victim->mm, MM_ANONPAGES)),
561                 K(get_mm_counter(victim->mm, MM_FILEPAGES)));
562         task_unlock(victim);
563 
564         /*
565          * Kill all user processes sharing victim->mm in other thread groups, if
566          * any.  They don't get access to memory reserves, though, to avoid
567          * depletion of all memory.  This prevents mm->mmap_sem livelock when an
568          * oom killed thread cannot exit because it requires the semaphore and
569          * its contended by another thread trying to allocate memory itself.
570          * That thread will now get access to memory reserves since it has a
571          * pending fatal signal.
572          */
573         rcu_read_lock();
574         for_each_process(p)
575                 if (p->mm == mm && !same_thread_group(p, victim) &&
576                     !(p->flags & PF_KTHREAD)) {
577                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
578                                 continue;
579 
580                         task_lock(p);   /* Protect ->comm from prctl() */
581                         pr_err("Kill process %d (%s) sharing same memory\n",
582                                 task_pid_nr(p), p->comm);
583                         task_unlock(p);
584                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
585                 }
586         rcu_read_unlock();
587 
588         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
589         put_task_struct(victim);
590 }
591 #undef K
404 /**
405  * mark_oom_victim - mark the given task as OOM victim
406  * @tsk: task to mark
407  *
408  * Has to be called with oom_lock held and never after
409  * oom has been disabled already.
410  */
411 void mark_oom_victim(struct task_struct *tsk)
412 {
413         WARN_ON(oom_killer_disabled);
414         /* OOM killer might race with memcg OOM */
415         if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
416                 return;
417         /*
418          * Make sure that the task is woken up from uninterruptible sleep
419          * if it is frozen because OOM killer wouldn't be able to free
420          * any memory and livelock. freezing_slow_path will tell the freezer
421          * that TIF_MEMDIE tasks should be ignored.
422          */
423         __thaw_task(tsk);
424         atomic_inc(&oom_victims);
425 }
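
As a hedged sketch of how other kernel code can consume this flag (an illustration of mine; for example, the page allocator tests the current thread's TIF_MEMDIE when deciding whether to grant access to memory reserves), a thread can be tested for OOM-victim status like this:

/* Sketch: decide whether a task is currently an OOM victim. */
static bool task_is_oom_victim(struct task_struct *p)
{
	/* TIF_MEMDIE is set by mark_oom_victim() and lives in thread_info. */
	return test_tsk_thread_flag(p, TIF_MEMDIE);
}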

when is TIF_MEMDIE of a thread unset
While a thread executes do_exit(), it calls exit_mm() to release the memory resources related to it. exit_mm() calls mmput() to decrease the reference count of task->mm; the mm is freed once no references remain.

At the end, exit_mm() clears the TIF_MEMDIE flag of the current thread if the flag is set.

383 /*
384  * Turn us into a lazy TLB process if we
385  * aren't already..
386  */
387 static void exit_mm(struct task_struct *tsk)
388 {
389         struct mm_struct *mm = tsk->mm;
390         struct core_state *core_state;
391 
392         mm_release(tsk, mm);
393         if (!mm)
394                 return;
395         sync_mm_rss(mm);
396         /*
397          * Serialize with any possible pending coredump.
398          * We must hold mmap_sem around checking core_state
399          * and clearing tsk->mm.  The core-inducing thread
400          * will increment ->nr_threads for each thread in the
401          * group with ->mm != NULL.
402          */
403         down_read(&mm->mmap_sem);
404         core_state = mm->core_state;
405         if (core_state) {
406                 struct core_thread self;
407 
408                 up_read(&mm->mmap_sem);
409 
410                 self.task = tsk;
411                 self.next = xchg(&core_state->dumper.next, &self);
412                 /*
413                  * Implies mb(), the result of xchg() must be visible
414                  * to core_state->dumper.
415                  */
416                 if (atomic_dec_and_test(&core_state->nr_threads))
417                         complete(&core_state->startup);
418 
419                 for (;;) {
420                         set_task_state(tsk, TASK_UNINTERRUPTIBLE);
421                         if (!self.task) /* see coredump_finish() */
422                                 break;
423                         freezable_schedule();
424                 }
425                 __set_task_state(tsk, TASK_RUNNING);
426                 down_read(&mm->mmap_sem);
427         }
428         atomic_inc(&mm->mm_count);
429         BUG_ON(mm != tsk->active_mm);
430         /* more a memory barrier than a real lock */
431         task_lock(tsk);
432         tsk->mm = NULL;
433         up_read(&mm->mmap_sem);
434         enter_lazy_tlb(mm, current);
435         task_unlock(tsk);
436         mm_update_next_owner(mm);
437         mmput(mm);
438         if (test_thread_flag(TIF_MEMDIE))
439                 exit_oom_victim();
440 }
427 /**
428  * exit_oom_victim - note the exit of an OOM victim
429  */
430 void exit_oom_victim(void)
431 {
432         clear_thread_flag(TIF_MEMDIE);
433 
434         if (!atomic_dec_return(&oom_victims))
435                 wake_up_all(&oom_victims_wait);
436 }

conclusion
This post discusses the TIF_MEMDIE flag in thread_info->flags. The flag implies that the thread is being killed by the OOM killer and is ready to die. The flag is cleared while a thread is finishing exit_mm().

kernel: freezer

December 6, 2015

This post discusses freezer in v3.10.49.

reference code base
LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_FREEZER=y
CONFIG_SUSPEND_FREEZER=y

how to know if a thread is frozen
A task is frozen if PF_FROZEN in task->flags is set.

/*
 * Check if a process has been frozen
 */
static inline bool frozen(struct task_struct *p)
{
	return p->flags & PF_FROZEN;
}

suspend and freeze_processes()
When the system suspends to save power, try_to_freeze_tasks() helps freeze all user space tasks.

pm_suspend()
   -> enter_state()
      -> suspend_prepare()
         -> suspend_freeze_processes()
            -> freeze_processes()
               -> try_to_freeze_tasks()
                  -> freeze_task()
      -> suspend_devices_and_enter()
      -> suspend_finish()

kernel threads can't be frozen by default
kthreadd itself can't be frozen because PF_NOFREEZE is set in its task->flags. All kernel threads are forked by kthreadd and share the same property.

int kthreadd(void *unused)
{
	struct task_struct *tsk = current;

	/* Setup a clean context for our children to inherit. */
	set_task_comm(tsk, "kthreadd");
	ignore_signals(tsk);
	set_cpus_allowed_ptr(tsk, cpu_all_mask);
	set_mems_allowed(node_states[N_MEMORY]);

	current->flags |= PF_NOFREEZE;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (list_empty(&kthread_create_list))
			schedule();
		__set_current_state(TASK_RUNNING);

		spin_lock(&kthread_create_lock);
		while (!list_empty(&kthread_create_list)) {
			struct kthread_create_info *create;

			create = list_entry(kthread_create_list.next,
					    struct kthread_create_info, list);
			list_del_init(&create->list);
			spin_unlock(&kthread_create_lock);

			create_kthread(create);

			spin_lock(&kthread_create_lock);
		}
		spin_unlock(&kthread_create_lock);
	}

	return 0;
}
/**
 * freezing_slow_path - slow path for testing whether a task needs to be frozen
 * @p: task to be tested
 *
 * This function is called by freezing() if system_freezing_cnt isn't zero
 * and tests whether @p needs to enter and stay in frozen state.  Can be
 * called under any context.  The freezers are responsible for ensuring the
 * target tasks see the updated state.
 */
bool freezing_slow_path(struct task_struct *p)
{
	if (p->flags & PF_NOFREEZE)
		return false;

	if (pm_nosig_freezing || cgroup_freezing(p))
		return true;

	if (pm_freezing && !(p->flags & PF_KTHREAD))
		return true;

	return false;
}
EXPORT_SYMBOL(freezing_slow_path);

kernel thread and set_freezable()
A kernel thread can call set_freezable() to make itself freezable. set_freezable() clears PF_NOFREEZE in the caller's task->flags and then calls try_to_freeze(). If the system has already entered suspend, the kernel thread is frozen right away; if not, it can still be frozen when the system suspends later.

set_freezable()
   -> try_to_freeze()
      -> try_to_freeze_unsafe()
         -> freezing()
         -> __refrigerator()
/**
 * set_freezable - make %current freezable
 *
 * Mark %current freezable and enter refrigerator if necessary.
 */
bool set_freezable(void)
{
	might_sleep();

	/*
	 * Modify flags while holding freezer_lock.  This ensures the
	 * freezer notices that we aren't frozen yet or the freezing
	 * condition is visible to try_to_freeze() below.
	 */
	spin_lock_irq(&freezer_lock);
	current->flags &= ~PF_NOFREEZE;
	spin_unlock_irq(&freezer_lock);

	return try_to_freeze();
}
EXPORT_SYMBOL(set_freezable);
/*
 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
 */
static inline bool try_to_freeze_unsafe(void)
{
/* This causes problems for ARM targets and is a known
 * problem upstream.
 *	might_sleep();
 */
	if (likely(!freezing(current)))
		return false;
	return __refrigerator(false);
}

static inline bool try_to_freeze(void)
{
	if (!(current->flags & PF_NOFREEZE))
		debug_check_no_locks_held();
	return try_to_freeze_unsafe();
}

/*
 * Check if there is a request to freeze a process
 */
static inline bool freezing(struct task_struct *p)
{
	if (likely(!atomic_read(&system_freezing_cnt)))
		return false;
	return freezing_slow_path(p);
}
/* Refrigerator is place where frozen processes are stored :-). */
bool __refrigerator(bool check_kthr_stop)
{
	/* Hmm, should we be allowed to suspend when there are realtime
	   processes around? */
	bool was_frozen = false;
	long save = current->state;

	pr_debug("%s entered refrigerator\n", current->comm);

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		spin_lock_irq(&freezer_lock);
		current->flags |= PF_FROZEN;
		if (!freezing(current) ||
		    (check_kthr_stop && kthread_should_stop()))
			current->flags &= ~PF_FROZEN;
		spin_unlock_irq(&freezer_lock);

		if (!(current->flags & PF_FROZEN))
			break;
		was_frozen = true;
		schedule();
	}

	pr_debug("%s left refrigerator\n", current->comm);

	/*
	 * Restore saved task state before returning.  The mb'd version
	 * needs to be used; otherwise, it might silently break
	 * synchronization which depends on ordered task state change.
	 */
	set_current_state(save);

	return was_frozen;
}
EXPORT_SYMBOL(__refrigerator);
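
As a hedged sketch of how a kernel thread typically uses this API (an illustration, not code from this tree), a freezable worker marks itself with set_freezable() once and then calls try_to_freeze() in its main loop:

#include <linux/freezer.h>
#include <linux/kthread.h>

/* Sketch: a freezable kernel thread main loop. */
static int my_freezable_worker(void *unused)
{
	set_freezable();	/* clear PF_NOFREEZE so suspend can freeze us */

	while (!kthread_should_stop()) {
		try_to_freeze();	/* park in __refrigerator() if freezing */

		/* ... do one unit of work ... */

		schedule_timeout_interruptible(HZ);
	}
	return 0;
}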

conclusion
This post discusses how suspend/resume freezes all user space processes. By default, a kernel thread can't be frozen because PF_NOFREEZE is set in its task->flags.

A kernel thread can call set_freezable() to make itself freezable: it clears PF_NOFREEZE and calls try_to_freeze(), so the thread is frozen immediately if the system is already suspending, or later when the system suspends.

kernel: arm64: mm: allocate kernel stack

November 21, 2015

This post discusses the allocation of the kernel stack in arm64.

reference code base
LA.BF64.1.2.1-02220-8×94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y

when is kernel stack created
A kernel stack is created for each thread at fork.

do_fork()
-> copy_process()
-> dup_task_struct()
-> alloc_thread_info_node()
-> alloc_pages_node()

arm64 kernel stack size and thread info
In arm64, each thread needs an order-2 page allocation (four 4 KB pages, 16 KB) as its kernel stack. The thread_info of each thread lies at the lowest address of this allocation. The SP of each thread initially points to the highest address of this allocation minus 16.

/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static struct task_struct *copy_process(unsigned long clone_flags,
                                        unsigned long stack_start,
                                        unsigned long stack_size,
                                        int __user *child_tidptr,
                                        struct pid *pid,
                                        int trace)
{
......
        p = dup_task_struct(current);
......
}

static struct task_struct *dup_task_struct(struct task_struct *orig)
{
        struct task_struct *tsk;
        struct thread_info *ti;
        unsigned long *stackend;
        int node = tsk_fork_get_node(orig);
        int err;

        tsk = alloc_task_struct_node(node);
        if (!tsk)
                return NULL;

        ti = alloc_thread_info_node(tsk, node);
        if (!ti)
                goto free_tsk;

static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                  int node)
{
        struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
                                             THREAD_SIZE_ORDER);

        return page ? page_address(page) : NULL;
}

#ifndef CONFIG_ARM64_64K_PAGES
#define THREAD_SIZE_ORDER       2
#endif

#define THREAD_SIZE             16384
#define THREAD_START_SP         (THREAD_SIZE - 16)

#ifdef __KERNEL__

#ifdef CONFIG_DEBUG_STACK_USAGE
# define THREADINFO_GFP         (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
#else
# define THREADINFO_GFP         (GFP_KERNEL | __GFP_NOTRACK)
#endif

#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
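
Because thread_info sits at the lowest address of this order-2 block, the kernel can locate it by masking the stack pointer with THREAD_SIZE. Here is a minimal sketch of the idea (a paraphrase of mine of arm64's current_thread_info(), not a verbatim quote):

/* Sketch: thread_info lives at the base of the 16 KB kernel stack,
 * so masking any in-stack address with ~(THREAD_SIZE - 1) finds it.
 */
static inline struct thread_info *stack_to_thread_info(unsigned long sp)
{
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}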

gfp_mask of this allocation

  • By default: gfp_mask of this allocation is (__GFP_NOTRACK | ___GFP_KMEMCG | ___GFP_FS | ___GFP_IO | ___GFP_WAIT) = 0x3000d0
  • According to kernel: alloc_page: how suspend resume controls gfp_mask, the gfp_mask of this allocation becomes (__GFP_NOTRACK | ___GFP_KMEMCG | ___GFP_WAIT) = 0x300010 while the system is suspended. Since all user space processes are frozen before entering suspend, this condition only happens to kthreadd.

conclusion
This post discusses the allocation of the kernel stack in arm64, including the order and gfp_mask parameters used for the allocation.

    kernel: mm: page_alloc: behaviors of page allocation while a thread forks

    November 18, 2015

    This post discusses the behavior of page allocation when a thread forks. In this case, the process enters the page allocation slowpath while allocating pages for the kernel stack with gfp_mask=0x3000d0.

    reference code base
    LA.BF64.1.2.1-02220-8×94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The memory has only one node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA ones.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack
    The process enters do_fork(), allocates an order-2 page, and enters the page allocation slowpath.

    <4>[122596.622892] c2  15688 gle.android.gms(15688:15688): alloc order:2 mode:0x3000d0, reclaim 60 in 0.030s pri 10, scan 60, lru 80228, trigger lmk 1 times
    <4>[122596.622921] c2  15688 CPU: 2 PID: 15688 Comm: gle.android.gms Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122596.622931] c2  15688 Call trace:
    <4>[122596.622954] c2  15688 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122596.622965] c2  15688 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122596.622981] c2  15688 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122596.622995] c2  15688 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122596.623009] c2  15688 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122596.623022] c2  15688 [<ffffffc00021a1e4>] copy_process.part.58+0xf4/0xdfc
    <4>[122596.623031] c2  15688 [<ffffffc00021b000>] do_fork+0xe0/0x358
    <4>[122596.623041] c2  15688 [<ffffffc00021b310>] SyS_clone+0x10/0x1c
    <4>[122596.685079] c1  15688 gle.android.gms(15688:15688): alloc order:2 mode:0x3000d0, reclaim 54 in 0.030s pri 10, scan 97, lru 79879, trigger lmk 1 times
    <4>[122596.685114] c1  15688 CPU: 1 PID: 15688 Comm: gle.android.gms Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122596.685127] c1  15688 Call trace:
    <4>[122596.685152] c1  15688 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122596.685163] c1  15688 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122596.685179] c1  15688 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122596.685193] c1  15688 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122596.685207] c1  15688 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122596.685220] c1  15688 [<ffffffc00021a1e4>] copy_process.part.58+0xf4/0xdfc
    <4>[122596.685230] c1  15688 [<ffffffc00021b000>] do_fork+0xe0/0x358
    <4>[122596.685241] c1  15688 [<ffffffc00021b310>] SyS_clone+0x10/0x1c
    

    why does fork allocate an order-2 page in arm64
    kernel: arm64: mm: allocate kernel stack

    behaviors of page allocation

  • GFP_KERNEL means this allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x3000d0 = (__GFP_NOTRACK | __GFP_KMEMCG | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0 
    prefered_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
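
    For reference, the gfp_mask shown above can be reconstructed from the flag definitions (a worked breakdown of mine, assuming the 3.10 ___GFP_* values in include/linux/gfp.h):

    /*
     * Worked breakdown of gfp_mask = 0x3000d0 (assumed 3.10 flag values):
     *
     *   GFP_KERNEL     = __GFP_WAIT | __GFP_IO | __GFP_FS
     *                  = 0x10 | 0x40 | 0x80                 = 0xd0
     *   ___GFP_KMEMCG                                       = 0x100000
     *   ___GFP_NOTRACK                                      = 0x200000
     *   ---------------------------------------------------------------
     *   THREADINFO_GFP_ACCOUNTED                            = 0x3000d0
     */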
    

    behaviors of page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wakeup kswapd
  • try get_page_from_freelist before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip trying __alloc_pages_high_priority which returns page if success
  • wait is true: enter rebalance which includes compaction and direct reclaim
  • Try compaction which returns page if success.
  • Try direct reclaim which returns page if success.
  • If neither compaction nor direct reclaim makes progress, trigger the OOM killer. The allocation then returns pages if any become available after OOM.
  • should_alloc_retry() always returns true and it goes back to rebalance again.
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is false, and the page order is 2, which is not greater than PAGE_ALLOC_COSTLY_ORDER. So should_alloc_retry() always returns 1.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discusses the behavior of page allocation when a thread forks. In arm64, each thread needs an order-2 page allocation as its kernel stack. In this case, a thread allocates an order-2 page with gfp_mask=0x3000d0, enters the page allocation slowpath, and performs direct reclaim twice. These reclaims take 0.06 seconds within fork.

    android: process main thread blocked at Zombie state because another thread blocked at D state

    November 15, 2015

    This post demonstrates a condition in which the thread group leader of a process is blocked in zombie (Z) state because another thread of the same process is blocked in uninterruptible sleep (D) state.

    example program
    android: example: process main thread blocked at Z state because another thread blocked at D state

    prerequisite to run this example
    kernel-module-hung

    software of testing device
    LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    hardware of testing device
    Architecture ARMv8, Cortex-A53.

    log of running this module
    4149:4149, process-zombie-thread-blocked starts
    4149:4151, void* thread_routine(void*):27
    4149:4149, process-zombie-thread-blocked ends

    code flow

  • main thread of process 4149 is starting
  • main thread 4149:4149 pthread_create 4149:4151
  • thread 4149:4151 writes data to /sys/kernel/debug/hung/mutex
  • thread 4149:4151 blocked at uninterruptible sleep (D) state due to mutex deadlock in kernel space
  • main thread 4149:4149 returns from main() and executes do_exit() in kernel space.
  • main thread 4149:4149 is blocked at zombie (Z) state because another thread of 4149 is blocked at uninterruptible sleep (D) state and doesn't call do_exit().

    analysis: why the main thread is stuck in zombie state
    The main thread of this process is ready to reap all its child threads. But one of its child threads is blocked at uninterruptible sleep (D) state and can't return from kernel space to user space to handle the fatal signal and call do_exit(). In this example, the child thread is blocked at uninterruptible sleep (D) state forever due to a mutex deadlock, so the main thread is blocked at zombie (Z) state forever until the kernel restarts.

    how to fix the problem
    If this condition already happens, the only way to recover is to restart the kernel.

    conclusion
    This post demonstrates a condition in which the thread group leader of a process is blocked at zombie (Z) state because another thread of the same process is blocked at uninterruptible sleep (D) state. Since the root cause is a mutex deadlock in kernel space, the only way to recover is to restart the kernel.

    android: kernel: thread blocked at uninterruptible sleep state while msleep

    November 15, 2015

    This post discusses why a thread is blocked at uninterruptible sleep state while it calls msleep().

    reference code base
    LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    why is the thread blocked at uninterruptible sleep state during msleep
    msleep() leverages schedule_timeout_uninterruptible(), which sets the current task's state to TASK_UNINTERRUPTIBLE and enters schedule_timeout().

    /**
     * msleep - sleep safely even with waitqueue interruptions
     * @msecs: Time in milliseconds to sleep for
     */
    void msleep(unsigned int msecs)
    {
    	unsigned long timeout = msecs_to_jiffies(msecs) + 1;
    
    	while (timeout)
    		timeout = schedule_timeout_uninterruptible(timeout);
    }
    
    signed long __sched schedule_timeout_uninterruptible(signed long timeout)
    {
    	__set_current_state(TASK_UNINTERRUPTIBLE);
    	return schedule_timeout(timeout);
    }
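
    By contrast, if the sleeping thread should show up in S state and be wakeable by signals, the interruptible variant can be used instead. A small sketch of mine using msleep_interruptible(), which the kernel also provides:

    /* Sketch: sleep in TASK_INTERRUPTIBLE so the thread shows S, not D,
     * and can be woken early by a signal.
     */
    #include <linux/delay.h>

    static void sleep_politely(unsigned int msecs)
    {
            unsigned long left = msleep_interruptible(msecs);

            if (left)
                    pr_info("woken early by a signal, %lu ms left\n", left);
    }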
    

    how is the thread woken up from uninterruptible sleep state
    In schedule_timeout(), the thread sets up a timer whose expiry is now (jiffies) plus the timeout and then calls schedule(). Since the thread's state is TASK_UNINTERRUPTIBLE, it can only be woken up explicitly. Sending a signal to this thread can't wake it up, because signal delivery only wakes threads sleeping in an interruptible state and skips threads in uninterruptible sleep (D) state.
    When the timeout expires, the timer fires and executes process_timeout(), which calls wake_up_process() to wake up the thread. After being woken up, the thread executes the code after schedule() and finally returns from schedule_timeout().

    static void process_timeout(unsigned long __data)
    {
    	wake_up_process((struct task_struct *)__data);
    }
    
    /**
     * schedule_timeout - sleep until timeout
     * @timeout: timeout value in jiffies
     *
     * Make the current task sleep until @timeout jiffies have
     * elapsed. The routine will return immediately unless
     * the current task state has been set (see set_current_state()).
     *
     * You can set the task state as follows -
     *
     * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
     * pass before the routine returns. The routine will return 0
     *
     * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
     * delivered to the current task. In this case the remaining time
     * in jiffies will be returned, or 0 if the timer expired in time
     *
     * The current task state is guaranteed to be TASK_RUNNING when this
     * routine returns.
     *
     * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
     * the CPU away without a bound on the timeout. In this case the return
     * value will be %MAX_SCHEDULE_TIMEOUT.
     *
     * In all cases the return value is guaranteed to be non-negative.
     */
    signed long __sched schedule_timeout(signed long timeout)
    {
    	struct timer_list timer;
    	unsigned long expire;
    
    	switch (timeout)
    	{
    	case MAX_SCHEDULE_TIMEOUT:
    		/*
    		 * These two special cases are useful to be comfortable
    		 * in the caller. Nothing more. We could take
    		 * MAX_SCHEDULE_TIMEOUT from one of the negative value
    		 * but I' d like to return a valid offset (>=0) to allow
    		 * the caller to do everything it want with the retval.
    		 */
    		schedule();
    		goto out;
    	default:
    		/*
    		 * Another bit of PARANOID. Note that the retval will be
    		 * 0 since no piece of kernel is supposed to do a check
    		 * for a negative retval of schedule_timeout() (since it
    		 * should never happens anyway). You just have the printk()
    		 * that will tell you if something is gone wrong and where.
    		 */
    		if (timeout < 0) {
    			printk(KERN_ERR "schedule_timeout: wrong timeout "
    				"value %lx\n", timeout);
    			dump_stack();
    			current->state = TASK_RUNNING;
    			goto out;
    		}
    	}
    
    	expire = timeout + jiffies;
    
    	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
    	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
    	schedule();
    	del_singleshot_timer_sync(&timer);
    
    	/* Remove the timer from the object tracker */
    	destroy_timer_on_stack(&timer);
    
    	timeout = expire - jiffies;
    
     out:
    	return timeout < 0 ? 0 : timeout;
    }
    

    example of a thread blocked while msleep
    android: kernel: example: hung is an example in which a thread is blocked in kernel space while calling msleep() for 1000 seconds. The ps command shows that the thread is blocked at uninterruptible sleep (D) state.

    static void do_msleep_hung(void)
    {
    	msleep(1 * 1000 * 1000);
    }
    
    $ adb shell "echo 1 > /sys/kernel/debug/hung/msleep"
    $ adb shell ps
    root      604   1     5848   376   ffffffff 0042c77c S /sbin/adbd
    root      6891  604   2868   1000  0022bf84 b5d5b8fc D /system/bin/sh
    $ adb shell cat /proc/6891/stack
    [<0000000000000000>] __switch_to+0x90/0x9c
    [<0000000000000000>] msleep+0x14/0x24
    [<0000000000000000>] msleep_store+0x38/0x50 [hung]
    [<0000000000000000>] vfs_write+0xcc/0x178
    [<0000000000000000>] SyS_write+0x44/0x74
    [<0000000000000000>] cpu_switch_to+0x48/0x4c
    [<0000000000000000>] 0xffffffffffffffff
    

    conclusion
    This post discusses why a thread is blocked at uninterruptible sleep (D) state while it calls msleep(). It ends with an example of a thread blocked at uninterruptible sleep (D) state while sleeping for 1000 seconds.

    android: kernel: thread blocked at uninterruptible sleep state while waiting for mutex

    November 15, 2015

    This post discusses why a thread is blocked at uninterruptible sleep state while waiting for a mutex.

    reference code base
    LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    mutex data structure
    In struct mutex, the field count implements the atomic lock state of the mutex. The field wait_list is the list of all threads waiting for this mutex.

    struct mutex {
    	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
    	atomic_t		count;
    	spinlock_t		wait_lock;
    	struct list_head	wait_list;
    #if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
    	struct task_struct	*owner;
    #endif
    #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
    	void			*spin_mlock;	/* Spinner MCS lock */
    #endif
    #ifdef CONFIG_DEBUG_MUTEXES
    	const char 		*name;
    	void			*magic;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    	struct lockdep_map	dep_map;
    #endif
    };
    

    how to determine whether the mutex is locked
    mutex_is_locked() returns true if a mutex is locked. mutex->count is the atomic variable that indicates whether the mutex is locked. If mutex->count is equal to 1, no thread is holding the mutex. If mutex->count is less than 1, some thread is holding the mutex.

    /**
     * mutex_is_locked - is the mutex locked
     * @lock: the mutex to be queried
     *
     * Returns 1 if the mutex is locked, 0 if unlocked.
     */
    static inline int mutex_is_locked(struct mutex *lock)
    {
    	return atomic_read(&lock->count) != 1;
    }
    

    how to get mutex lock
    mutex_lock() is the API to acquire a mutex. If the mutex is not available, it falls back to __mutex_lock_slowpath(), in which the calling thread stays at uninterruptible sleep state until the mutex becomes available.

    void __sched mutex_lock(struct mutex *lock)
    {
    	might_sleep();
    	/*
    	 * The locking fastpath is the 1->0 transition from
    	 * 'unlocked' into 'locked' state.
    	 */
    	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
    	mutex_set_owner(lock);
    }
    
    __mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
    {
    	if (unlikely(atomic_dec_return(count) < 0))
    		fail_fn(count);
    }
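
    As a minimal usage sketch (my own illustration) of the API discussed above:

    /* Sketch: basic mutex usage around a critical section. */
    static DEFINE_MUTEX(my_lock);
    static int shared_counter;

    static void update_counter(void)
    {
            mutex_lock(&my_lock);           /* may sleep in D state if contended */
            shared_counter++;
            mutex_unlock(&my_lock);
    }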
    

    why is the thread blocked at uninterruptible sleep state while waiting for mutex
    __mutex_lock_slowpath() leverages __mutex_lock_common() by passing TASK_UNINTERRUPTIBLE as the state argument.

    In __mutex_lock_common(), the thread stays in a for loop until the mutex is available, i.e., mutex->count is equal to 1. If the mutex is not available, the thread adds a mutex_waiter to the mutex's wait_list, sets its state to TASK_UNINTERRUPTIBLE, and calls schedule_preempt_disabled() to sleep. After getting the mutex, the thread removes the mutex_waiter from the mutex's wait_list and returns from __mutex_lock_common().

    static __used noinline void __sched
    __mutex_lock_slowpath(atomic_t *lock_count)
    {
    	struct mutex *lock = container_of(lock_count, struct mutex, count);
    
    	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
    }
    
    static inline int __sched
    __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
    		    struct lockdep_map *nest_lock, unsigned long ip)
    {
    	struct task_struct *task = current;
    	struct mutex_waiter waiter;
    	unsigned long flags;
    ......
    	spin_lock_mutex(&lock->wait_lock, flags);
    
    	debug_mutex_lock_common(lock, &waiter);
    	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
    
    	/* add waiting tasks to the end of the waitqueue (FIFO): */
    	list_add_tail(&waiter.list, &lock->wait_list);
    	waiter.task = task;
    
    	if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, -1) == 1))
    		goto done;
    ......
    for (;;) {
    		/*
    		 * Lets try to take the lock again - this is needed even if
    		 * we get here for the first time (shortly after failing to
    		 * acquire the lock), to make sure that we get a wakeup once
    		 * it's unlocked. Later on, if we sleep, this is the
    		 * operation that gives us the lock. We xchg it to -1, so
    		 * that when we release the lock, we properly wake up the
    		 * other waiters:
    		 */
    		if (MUTEX_SHOW_NO_WAITER(lock) &&
    		   (atomic_xchg(&lock->count, -1) == 1))
    			break;
    
    		/*
    		 * got a signal? (This code gets eliminated in the
    		 * TASK_UNINTERRUPTIBLE case.)
    		 */
    		if (unlikely(signal_pending_state(state, task))) {
    			mutex_remove_waiter(lock, &waiter,
    					    task_thread_info(task));
    			mutex_release(&lock->dep_map, 1, ip);
    			spin_unlock_mutex(&lock->wait_lock, flags);
    
    			debug_mutex_free_waiter(&waiter);
    			preempt_enable();
    			return -EINTR;
    		}
    		__set_task_state(task, state);
    
    		/* didn't get the lock, go to sleep: */
    		spin_unlock_mutex(&lock->wait_lock, flags);
    		schedule_preempt_disabled();
    		spin_lock_mutex(&lock->wait_lock, flags);
    	}
    
    done:
    	lock_acquired(&lock->dep_map, ip);
    	/* got the lock - rejoice! */
    	mutex_remove_waiter(lock, &waiter, current_thread_info());
    	mutex_set_owner(lock);
    
    	/* set it to 0 if there are no waiters left: */
    	if (likely(list_empty(&lock->wait_list)))
    		atomic_set(&lock->count, 0);
    
    	spin_unlock_mutex(&lock->wait_lock, flags);
    
    	debug_mutex_free_waiter(&waiter);
    	preempt_enable();
    
    	return 0;
    }
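
    Note that it is the TASK_UNINTERRUPTIBLE state that makes the waiter show up as D. A hedged sketch of mine using the interruptible alternative, mutex_lock_interruptible(), which lets a signal abort the wait:

    /* Sketch: waiting interruptibly, so the thread sleeps in S state and a
     * signal can abort the wait with -EINTR.
     */
    static int do_locked_work(struct mutex *lock)
    {
            int ret = mutex_lock_interruptible(lock);

            if (ret)
                    return ret;     /* -EINTR: interrupted before getting the lock */

            /* ... critical section ... */

            mutex_unlock(lock);
            return 0;
    }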
    

    how is the thread woken up from uninterruptible sleep state
    All threads calling mutex_lock() and waiting for the mutex sleep until the mutex becomes available. When the thread holding the mutex releases it and someone is waiting for it, the releasing thread enters __mutex_unlock_slowpath() to wake up a waiter on the mutex's wait_list.

    /**
     * mutex_unlock - release the mutex
     * @lock: the mutex to be released
     *
     * Unlock a mutex that has been locked by this task previously.
     *
     * This function must not be used in interrupt context. Unlocking
     * of a not locked mutex is not allowed.
     *
     * This function is similar to (but not equivalent to) up().
     */
    void __sched mutex_unlock(struct mutex *lock)
    {
    	/*
    	 * The unlocking fastpath is the 0->1 transition from 'locked'
    	 * into 'unlocked' state:
    	 */
    #ifndef CONFIG_DEBUG_MUTEXES
    	/*
    	 * When debugging is enabled we must not clear the owner before time,
    	 * the slow path will always be taken, and that clears the owner field
    	 * after verifying that it was indeed current.
    	 */
    	mutex_clear_owner(lock);
    #endif
    	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
    }
    
    /**
     *  __mutex_fastpath_unlock - try to promote the count from 0 to 1
     *  @count: pointer of type atomic_t
     *  @fail_fn: function to call if the original value was not 0
     *
     * Try to promote the count from 0 to 1. If it wasn't 0, call <fail_fn>.
     * In the failure case, this function is allowed to either set the value to
     * 1, or to set it to a value lower than 1.
     *
     * If the implementation sets it to a value of lower than 1, then the
     * __mutex_slowpath_needs_to_unlock() macro needs to return 1, it needs
     * to return 0 otherwise.
     */
    static inline void
    __mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
    {
    	if (unlikely(atomic_inc_return(count) <= 0))
    		fail_fn(count);
    }
    

    how is the waiter woken by the mutex unlock slowpath
    In __mutex_unlock_common_slowpath(), the thread releases the mutex by setting mutex->count to 1. If the mutex's wait_list is not empty, the thread wakes up the first waiter on the mutex's wait_list.

    /*
     * Release the lock, slowpath:
     */
    static inline void
    __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
    {
    	struct mutex *lock = container_of(lock_count, struct mutex, count);
    	unsigned long flags;
    
    	spin_lock_mutex(&lock->wait_lock, flags);
    	mutex_release(&lock->dep_map, nested, _RET_IP_);
    	debug_mutex_unlock(lock);
    
    	/*
    	 * some architectures leave the lock unlocked in the fastpath failure
    	 * case, others need to leave it locked. In the later case we have to
    	 * unlock it here
    	 */
    	if (__mutex_slowpath_needs_to_unlock())
    		atomic_set(&lock->count, 1);
    
    	if (!list_empty(&lock->wait_list)) {
    		/* get the first entry from the wait-list: */
    		struct mutex_waiter *waiter =
    				list_entry(lock->wait_list.next,
    					   struct mutex_waiter, list);
    
    		debug_mutex_wake_waiter(lock, waiter);
    
    		wake_up_process(waiter->task);
    	}
    
    	spin_unlock_mutex(&lock->wait_lock, flags);
    }
    
    /*
     * Release the lock, slowpath:
     */
    static __used noinline void
    __mutex_unlock_slowpath(atomic_t *lock_count)
    {
    	__mutex_unlock_common_slowpath(lock_count, 1);
    }
    

    example of a thread blocked while waiting for mutex
    android: kernel: example: hung is an example in which a thread is blocked in kernel space due to a mutex deadlock. The ps command shows that the thread is blocked at uninterruptible sleep (D) state.

    static void do_mutex_twice(void)
    {
    	mutex_lock(&mutex);
    	mutex_lock(&mutex);
    }
    
    $ adb shell "echo 1 > /sys/kernel/debug/hung/mutex"
    $ adb shell ps
    root      604   1     5848   376   ffffffff 0042c77c S /sbin/adbd
    root      6118  604   2868   1004  fc0080f8 9c4ef8fc D /system/bin/sh
    $ adb shell cat /proc/6118/stack
    [<0000000000000000>] __switch_to+0x90/0x9c
    [<0000000000000000>] mutex_store+0x44/0x5c [hung]
    [<0000000000000000>] vfs_write+0xcc/0x178
    [<0000000000000000>] SyS_write+0x44/0x74
    [<0000000000000000>] cpu_switch_to+0x48/0x4c
    [<0000000000000000>] 0xffffffffffffffff
    

    conclusion
    This post discusses why a thread is blocked at uninterruptible sleep (D) state while waiting for a mutex and how the blocked thread is woken up. It ends with an example of a thread hitting a mutex deadlock and blocked at uninterruptible sleep (D) state forever.

    android: kernel: example: hung

    November 8, 2015

    This post demonstrates a kernel module that reproduces several conditions in which a user space process is hung in kernel space.

    example program
    kernel-module-hung

    software of testing device
    LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    hardware of testing device
    Architecture ARMv8, Cortex-A53.

    build this module

    $ export ARCH=arm64
    $ export CROSS_COMPILE=aarch64-linux-android-
    $ make KERNEL_PATH=${android_source}/kernel O=${android_source}/out/target/product/${project}/obj/KERNEL_OBJ
    

    install and uninstall this module

    $ adb push hung.ko /data
    $ adb shell insmod /data/hung.ko
    $ adb shell rmmod hung
    

    how this module works

  • hung at mutex deadlock
    $ adb shell "echo 1 > /sys/kernel/debug/hung/mutex"
    $ adb shell ps
    root      604   1     5848   376   ffffffff 0042c77c S /sbin/adbd
    root      6118  604   2868   1004  fc0080f8 9c4ef8fc D /system/bin/sh
    $ adb shell cat /proc/6118/stack
    [<0000000000000000>] __switch_to+0x90/0x9c
    [<0000000000000000>] mutex_store+0x44/0x5c [hung]
    [<0000000000000000>] vfs_write+0xcc/0x178
    [<0000000000000000>] SyS_write+0x44/0x74
    [<0000000000000000>] cpu_switch_to+0x48/0x4c
    [<0000000000000000>] 0xffffffffffffffff
    
  • hung at msleep for 1000 seconds
    $ adb shell "echo 1 > /sys/kernel/debug/hung/msleep"
    $ adb shell ps
    root      604   1     5848   376   ffffffff 0042c77c S /sbin/adbd
    root      6891  604   2868   1000  0022bf84 b5d5b8fc D /system/bin/sh
    $ adb shell cat /proc/6891/stack
    [<0000000000000000>] __switch_to+0x90/0x9c
    [<0000000000000000>] msleep+0x14/0x24
    [<0000000000000000>] msleep_store+0x38/0x50 [hung]
    [<0000000000000000>] vfs_write+0xcc/0x178
    [<0000000000000000>] SyS_write+0x44/0x74
    [<0000000000000000>] cpu_switch_to+0x48/0x4c
    [<0000000000000000>] 0xffffffffffffffff
    
  • hung at mdelay for 1000 seconds
    $ adb shell "echo 1 > /sys/kernel/debug/hung/mdelay"
    $ adb shell ps
    root      604   1     5848   376   ffffffff 0042c77c S /sbin/adbd
    root      6595  604   2868   1000  00000000 b24778fc R /system/bin/sh
    $ adb shell cat /proc/6595/stack
    [<0000000000000000>] __switch_to+0x90/0x9c
    [<0000000000000000>] sched_account_irqtime+0xe0/0xfc
    [<0000000000000000>] irqtime_account_irq+0xec/0x108
    [<0000000000000000>] __do_softirq+0x244/0x284
    [<0000000000000000>] do_softirq+0x40/0x54
    [<0000000000000000>] uncached_logk_pc+0xdc/0xf8
    [<0000000000000000>] gic_handle_irq+0xb0/0xcc
    [<0000000000000000>] el1_irq+0x64/0xd4
    [<0000000000000000>] __delay+0x18/0x38
    [<0000000000000000>] __const_udelay+0x20/0x2c
    [<0000000000000000>] $x+0x48/0x64 [hung]
    [<0000000000000000>] vfs_write+0xcc/0x178
    [<0000000000000000>] SyS_write+0x44/0x74
    [<0000000000000000>] cpu_switch_to+0x48/0x4c
    [<0000000000000000>] 0xffffffffffffffff
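
    For completeness, here is a minimal sketch of how such a debugfs hook could be written (my own illustration; names like hung_mutex_write and hung_dir are hypothetical and not necessarily those used by the kernel-module-hung project):

    #include <linux/module.h>
    #include <linux/debugfs.h>
    #include <linux/fs.h>
    #include <linux/mutex.h>

    /* Sketch: a debugfs file whose write handler deadlocks on a mutex,
     * leaving the writing process stuck in kernel space in D state.
     */
    static DEFINE_MUTEX(hung_mutex);
    static struct dentry *hung_dir;

    static ssize_t hung_mutex_write(struct file *file, const char __user *buf,
                                    size_t count, loff_t *ppos)
    {
            mutex_lock(&hung_mutex);
            mutex_lock(&hung_mutex);        /* second lock never succeeds: deadlock */
            return count;                   /* never reached */
    }

    static const struct file_operations hung_mutex_fops = {
            .owner = THIS_MODULE,
            .write = hung_mutex_write,
    };

    static int __init hung_init(void)
    {
            hung_dir = debugfs_create_dir("hung", NULL);
            if (!hung_dir)
                    return -ENOMEM;
            debugfs_create_file("mutex", 0200, hung_dir, NULL, &hung_mutex_fops);
            return 0;
    }

    static void __exit hung_exit(void)
    {
            debugfs_remove_recursive(hung_dir);
    }

    module_init(hung_init);
    module_exit(hung_exit);
    MODULE_LICENSE("GPL");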
    

    conclusion
    This post demonstrates a kernel module that reproduces several conditions in which a user space process is hung in kernel space. Under these conditions, the processes stay hung until the kernel restarts.

