Archive for the ‘alloc_pages’ Category

kernel: mm: behaviors of page allocation while ipa_get_skb_ipa_rx allocates sock buffer

November 23, 2015

This post discusses the behaviors of page allocation when ipa_get_skb_ipa_rx() calls alloc_skb(). In this case, the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x2142d0 = (__GFP_NOTRACK | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_NOWARN | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS).

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y

environment setup
The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
Node 0, zone      DMA          143            8          468            2          106            0

call stack

<4>[164683.570864] c0  29125 kworker/u16:5(29125:29125): alloc order:2 mode:0x2142d0, reclaim 60 in 0.030s pri 10, scan 60, lru 80242, trigger lmk 1 times 
<4>[164683.570885] c0  29125 CPU: 0 PID: 29125 Comm: kworker/u16:5 Tainted: G        W    3.10.49-g4c6439a #12
<4>[164683.570903] c0  29125 Workqueue: iparepwq41 ipa_wq_repl_rx
<4>[164683.570911] c0  29125 Call trace:
<4>[164683.570924] c0  29125 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
<4>[164683.570933] c0  29125 [<ffffffc000207920>] show_stack+0x10/0x1c
<4>[164683.570945] c0  29125 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
<4>[164683.570956] c0  29125 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
<4>[164683.570967] c0  29125 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
<4>[164683.570977] c0  29125 [<ffffffc0002f0e1c>] new_slab+0x218/0x23c
<4>[164683.570987] c0  29125 [<ffffffc0002f279c>] __slab_alloc.isra.50.constprop.55+0x26c/0x300
<4>[164683.570997] c0  29125 [<ffffffc0002f4778>] __kmalloc_track_caller+0xd8/0x214
<4>[164683.571007] c0  29125 [<ffffffc000b3b148>] __kmalloc_reserve.isra.31+0x38/0x8c
<4>[164683.571016] c0  29125 [<ffffffc000b3b20c>] __alloc_skb+0x70/0x164
<4>[164683.571024] c0  29125 [<ffffffc000b3b3c4>] __netdev_alloc_skb+0xc4/0x100
<4>[164683.571034] c0  29125 [<ffffffc0009ebadc>] ipa_get_skb_ipa_rx+0x18/0x24
<4>[164683.571043] c0  29125 [<ffffffc0009ebe00>] ipa_wq_repl_rx+0xf0/0x2a8
<4>[164683.571053] c0  29125 [<ffffffc000237b88>] process_one_work+0x264/0x3dc
<4>[164683.571061] c0  29125 [<ffffffc000238ee0>] worker_thread+0x1f0/0x310
<4>[164683.571070] c0  29125 [<ffffffc00023e6d0>] kthread+0xac/0xb8

order and gfp_mask of this allocation

  • page order is 2, i.e., 16 KB
  • In my experience, the kmalloc-xxxx kmem_caches set gfp_mask with 0x204200 = (__GFP_NOTRACK | __GFP_COMP | __GFP_NOWARN)
  • According to kernel: mm: gfp_mask and kmalloc_reserve, kmalloc_reserve() sets gfp_mask with 0x010200 = (__GFP_NOMEMALLOC | __GFP_NOWARN)
  • ultimately, gfp_mask = 0x2142d0 = (__GFP_NOTRACK | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_NOWARN | GFP_KERNEL), as verified by the sketch below.
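As a sanity check, the mask arithmetic can be reproduced in userspace; a minimal sketch, with the flag values assumed from include/linux/gfp.h in 3.10:

    #include <stdio.h>

    /* gfp flag values assumed from include/linux/gfp.h (Linux 3.10) */
    enum {
    	___GFP_WAIT       = 0x10,
    	___GFP_IO         = 0x40,
    	___GFP_FS         = 0x80,
    	___GFP_NOWARN     = 0x200,
    	___GFP_COMP       = 0x4000,
    	___GFP_NOMEMALLOC = 0x10000,
    	___GFP_NOTRACK    = 0x200000,
    	GFP_KERNEL        = ___GFP_WAIT | ___GFP_IO | ___GFP_FS, /* 0xd0 */
    };

    int main(void)
    {
    	unsigned int mask = ___GFP_NOTRACK | ___GFP_NOMEMALLOC | ___GFP_COMP |
    			    ___GFP_NOWARN | GFP_KERNEL;

    	printf("0x%x\n", mask); /* prints 0x2142d0 */
    	return 0;
    }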
behaviors of this page allocation

  • GFP_KERNEL is set. This allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x2142d0 = (__GFP_NOTRACK | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_NOWARN | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
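
    The migratetype above falls out of the mobility bits in gfp_mask. Below is a minimal userspace re-implementation of the 3.10 helper allocflags_to_migratetype() (flag values assumed from include/linux/gfp.h; the page_group_by_mobility_disabled check is omitted):

    #include <stdio.h>

    /* mobility bits assumed from include/linux/gfp.h (3.10) */
    #define __GFP_MOVABLE     0x08u
    #define __GFP_RECLAIMABLE 0x80000u

    /* MIGRATE_UNMOVABLE = 0, MIGRATE_RECLAIMABLE = 1, MIGRATE_MOVABLE = 2 */
    static int allocflags_to_migratetype(unsigned int gfp_flags)
    {
    	/* group based on mobility, as the kernel helper does */
    	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
    	       ((gfp_flags & __GFP_RECLAIMABLE) != 0);
    }

    int main(void)
    {
    	printf("%d\n", allocflags_to_migratetype(0x2142d0));                 /* 0: unmovable */
    	printf("%d\n", allocflags_to_migratetype(0x2142d0 | __GFP_MOVABLE)); /* 2: movable */
    	return 0;
    }

    With neither __GFP_MOVABLE nor __GFP_RECLAIMABLE set, this allocation is served from the MIGRATE_UNMOVABLE freelist.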
    

    behaviors of this page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • If neither compaction nor direct reclaim makes progress, trigger OOM. It then returns a page if one is available after OOM.
  • should_alloc_retry() always returns true, so it goes back to rebalance again.
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is false, and the page order is 2, which is at most PAGE_ALLOC_COSTLY_ORDER (3). So should_alloc_retry() always returns true.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation when ipa_get_skb_ipa_rx() calls alloc_skb(): the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x2142d0 = (__GFP_NOTRACK | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_NOWARN | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS). The allocation is effectively guaranteed to succeed, since should_alloc_retry() always returns true and the slowpath keeps retrying.


    kernel: mm: page_alloc: behaviors of page allocation while sock_alloc_send_pskb allocates sock buffer

    November 23, 2015

    This post discusses the behaviors of page allocation when sock_alloc_send_pskb() calls alloc_skb(). In this case, the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x1146d0 = (__GFP_KMEMCG | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_REPEAT | __GFP_NOWARN | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS).

    reference code base
    software of the testing device: LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack

    <4>[164616.981519] c0  26339 Chrome_ChildIOT(16427:26339): alloc order:2 mode:0x1146d0, reclaim 66 in 0.030s pri 10, scan 71, lru 80385, trigger lmk 1 times 
    <4>[164616.981546] c0  26339 CPU: 0 PID: 26339 Comm: Chrome_ChildIOT Tainted: G        W    3.10.49-g4c6439a #12
    <4>[164616.981581] c0  26339 Call trace:
    <4>[164616.981614] c0  26339 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[164616.981628] c0  26339 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[164616.981648] c0  26339 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[164616.981665] c0  26339 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[164616.981680] c0  26339 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[164616.981693] c0  26339 [<ffffffc0002c24a0>] __get_free_pages+0x14/0x48
    <4>[164616.981705] c0  26339 [<ffffffc0002f01c8>] kmalloc_order_trace+0x3c/0xec
    <4>[164616.981716] c0  26339 [<ffffffc0002f46e8>] __kmalloc_track_caller+0x48/0x214
    <4>[164616.981735] c0  26339 [<ffffffc000b3b148>] __kmalloc_reserve.isra.31+0x38/0x8c
    <4>[164616.981745] c0  26339 [<ffffffc000b3b20c>] __alloc_skb+0x70/0x164
    <4>[164616.981757] c0  26339 [<ffffffc000b36064>] sock_alloc_send_pskb+0xd8/0x324
    <4>[164616.981768] c0  26339 [<ffffffc000b362c8>] sock_alloc_send_skb+0x18/0x24
    <4>[164616.981781] c0  26339 [<ffffffc000c0bac8>] unix_stream_sendmsg+0x158/0x2e8
    <4>[164616.981804] c0  26339 [<ffffffc000b327e0>] sock_sendmsg+0x8c/0xb0
    <4>[164616.981816] c0  26339 [<ffffffc000b346f8>] SyS_sendto+0x130/0x164
    

    order and gfp_mask of this allocation

  • page order is 2, i.e., 16 KB
  • According to kernel: mm: gfp_mask of kmalloc, kmalloc() sets gfp_mask with 0x1040d0 = (__GFP_KMEMCG | __GFP_COMP | GFP_KERNEL).
  • According to kernel: mm: gfp_mask and kmalloc_reserve, kmalloc_reserve() sets gfp_mask with 0x010200 = (__GFP_NOMEMALLOC | __GFP_NOWARN)
  • sock_alloc_send_pskb() sets gfp_mask with 0x000400 = (__GFP_REPEAT).
  • ultimately, gfp_mask = 0x1146d0 = (__GFP_KMEMCG | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_REPEAT | __GFP_NOWARN | GFP_KERNEL).
    /*
     *	Generic send/receive buffer handlers
     */
    
    struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
    				     unsigned long data_len, int noblock,
    				     int *errcode)
    {
    	struct sk_buff *skb;
    	gfp_t gfp_mask;
    	long timeo;
    	int err;
    	int npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
    
    	err = -EMSGSIZE;
    	if (npages > MAX_SKB_FRAGS)
    		goto failure;
    
    	gfp_mask = sk->sk_allocation;
    	if (gfp_mask & __GFP_WAIT)
    		gfp_mask |= __GFP_REPEAT;
    
    	timeo = sock_sndtimeo(sk, noblock);
    	while (1) {
    		err = sock_error(sk);
    		if (err != 0)
    			goto failure;
    
    		err = -EPIPE;
    		if (sk->sk_shutdown & SEND_SHUTDOWN)
    			goto failure;
    
    		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
    			skb = alloc_skb(header_len, gfp_mask);
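
    The gfp_mask above starts from sk->sk_allocation; for ordinary sockets this is initialized to GFP_KERNEL, so sock_alloc_send_pskb() sees __GFP_WAIT set and ORs in __GFP_REPEAT. A one-line condensed excerpt from sock_init_data() in net/core/sock.c:

    	/* default allocation mask for an ordinary socket */
    	sk->sk_allocation = GFP_KERNEL;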
    

    behaviors of this page allocation

  • GFP_KERNEL is set. This allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x1146d0 = (__GFP_KMEMCG | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_REPEAT | __GFP_NOWARN | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
    

    behaviors of this page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • If neither compaction nor direct reclaim makes progress, trigger OOM. It then returns a page if one is available after OOM.
  • should_alloc_retry() always returns true, so it goes back to rebalance again.
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is false, and the page order is 2. So should_alloc_retry() always returns true. Note that because order = 2 <= PAGE_ALLOC_COSTLY_ORDER (3), the __GFP_REPEAT test is never even reached; __GFP_REPEAT would only matter for orders above 3.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation when sock_alloc_send_pskb() calls alloc_skb(): the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x1146d0 = (__GFP_KMEMCG | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_REPEAT | __GFP_NOWARN | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS). The allocation is effectively guaranteed to succeed, since should_alloc_retry() always returns true and the slowpath keeps retrying.

    kernel: mm: gfp_mask and kmalloc_reserve

    November 23, 2015

    This post is to discuss gfp_mask and kmalloc_reserve.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    CONFIG_TRACING=y
    

    kmalloc_reserve implementation
    kmalloc_reserve() is defined in the net subsystem to allocate skb data buffers. It calls kmalloc at most twice. The first attempt adds __GFP_NOMEMALLOC, which forbids dipping into the pfmemalloc reserves below the watermarks. The second attempt is made only if the first one fails and gfp_pfmemalloc_allowed(flags) returns true.

    The rationale is to try a normal allocation first. If that fails, it falls back to an allocation that may skip the watermark checks and use the pfmemalloc reserves. The caller then stores the pfmemalloc flag reported by kmalloc_reserve() into skb->pfmemalloc. If skb->pfmemalloc is true, the network stack knows the buffer was allocated under high memory pressure and may drop the packet early. A condensed sketch of the caller side follows.
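
    A condensed sketch of the caller side in __alloc_skb() (net/core/skbuff.c, abbreviated; error handling omitted):

    	bool pfmemalloc;
    	u8 *data;

    	/* ask kmalloc_reserve() whether the pfmemalloc reserves were used */
    	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
    	if (!data)
    		goto nodata;

    	/* remember it so the stack can drop this skb early under pressure */
    	skb->pfmemalloc = pfmemalloc;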

  • According to kernel: mm: gfp_mask of kmalloc, kmalloc() sets gfp_mask with 0x104000 = (__GFP_KMEMCG | __GFP_COMP)
  • kmalloc_reserve() sets gfp_mask with 0x010200 = (__GFP_NOMEMALLOC | __GFP_NOWARN)
  • kmalloc_reserve(..., GFP_KERNEL, ...) allocates pages with gfp_mask = 0x1142d0 = (__GFP_KMEMCG | __GFP_NOMEMALLOC | __GFP_COMP | __GFP_NOWARN | GFP_KERNEL)
    /*
     * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
     * the caller if emergency pfmemalloc reserves are being used. If it is and
     * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
     * may be used. Otherwise, the packet data may be discarded until enough
     * memory is free
     */
    #define kmalloc_reserve(size, gfp, node, pfmemalloc) \
    	 __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
    
    static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
    			       unsigned long ip, bool *pfmemalloc)
    {
    	void *obj;
    	bool ret_pfmemalloc = false;
    
    	/*
    	 * Try a regular allocation, when that fails and we're not entitled
    	 * to the reserves, fail.
    	 */
    	obj = kmalloc_node_track_caller(size,
    					flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
    					node);
    	if (obj || !(gfp_pfmemalloc_allowed(flags)))
    		goto out;
    
    	/* Try again but now we are using pfmemalloc reserves */
    	ret_pfmemalloc = true;
    	obj = kmalloc_node_track_caller(size, flags, node);
    
    out:
    	if (pfmemalloc)
    		*pfmemalloc = ret_pfmemalloc;
    
    	return obj;
    }
    

    conclusion
    This post discussed gfp_mask and kmalloc_reserve. It shows how kmalloc_reserve() uses __GFP_NOMEMALLOC to control kmalloc behavior.

    kernel: mm: gfp_to_alloc_flags

    November 23, 2015

    This post discusses the transformation from gfp_mask to alloc_flags.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    CONFIG_TRACING=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    code flow and examples

  • If gfp_mask = 0x0000d0 = (GFP_KERNEL), then alloc_flags = 0x40 = (ALLOC_CPUSET | ALLOC_WMARK_MIN).
  • If gfp_mask = 0x000020 = (GFP_ATOMIC), then alloc_flags = 0x30 = (ALLOC_HIGH | ALLOC_HARDER | ALLOC_WMARK_MIN).
  • If gfp_mask = 0x300010 = (__GFP_NOTRACK | __GFP_KMEMCG | __GFP_WAIT), then alloc_flags = 0x40 = (ALLOC_CPUSET | ALLOC_WMARK_MIN).
  • If gfp_mask = 0x0020d0 = (__GFP_MEMALLOC | GFP_KERNEL), then alloc_flags = 0x44 = (ALLOC_CPUSET | ALLOC_NO_WATERMARKS).
  • If gfp_mask = 0x0000d0 = (GFP_KERNEL) and in_interrupt() is false and (current->flags & PF_MEMALLOC) is true, then alloc_flags = 0x44 = (ALLOC_CPUSET | ALLOC_NO_WATERMARKS).
  • If gfp_mask = 0x0120d0 = (__GFP_NOMEMALLOC | __GFP_MEMALLOC | GFP_KERNEL), then alloc_flags = 0x40 = (ALLOC_CPUSET | ALLOC_WMARK_MIN).
  • If gfp_mask = 0x0100d0 = (__GFP_NOMEMALLOC | GFP_KERNEL) and in_interrupt() is false and (current->flags & PF_MEMALLOC) is true, then alloc_flags = 0x40 = (ALLOC_CPUSET | ALLOC_WMARK_MIN). These examples are reproduced by the userspace sketch after the listing below.
    static inline int
    gfp_to_alloc_flags(gfp_t gfp_mask)
    {
    	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
    	const gfp_t wait = gfp_mask & __GFP_WAIT;
    
    	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
    	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
    
    	/*
    	 * The caller may dip into page reserves a bit more if the caller
    	 * cannot run direct reclaim, or if the caller has realtime scheduling
    	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
    	 * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
    	 */
    	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
    
    	if (!wait) {
    		/*
    		 * Not worth trying to allocate harder for
    		 * __GFP_NOMEMALLOC even if it can't schedule.
    		 */
    		if  (!(gfp_mask & __GFP_NOMEMALLOC))
    			alloc_flags |= ALLOC_HARDER;
    		/*
    		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
    		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
    		 */
    		alloc_flags &= ~ALLOC_CPUSET;
    	} else if (unlikely(rt_task(current)) && !in_interrupt())
    		alloc_flags |= ALLOC_HARDER;
    
    	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
    		if (gfp_mask & __GFP_MEMALLOC)
    			alloc_flags |= ALLOC_NO_WATERMARKS;
    		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
    			alloc_flags |= ALLOC_NO_WATERMARKS;
    		else if (!in_interrupt() &&
    				((current->flags & PF_MEMALLOC) ||
    				 unlikely(test_thread_flag(TIF_MEMDIE))))
    			alloc_flags |= ALLOC_NO_WATERMARKS;
    	}
    #ifdef CONFIG_CMA
    	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
    		alloc_flags |= ALLOC_CMA;
    #endif
    	return alloc_flags;
    }
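
    The examples above can be checked with a userspace port of the function; a minimal sketch, with the kernel-context predicates (in_interrupt(), rt_task(), PF_MEMALLOC, TIF_MEMDIE, in_serving_softirq()) folded into two boolean parameters and the CMA branch dropped (flag values assumed from the 3.10 headers):

    #include <stdio.h>
    #include <stdbool.h>

    /* gfp flags (include/linux/gfp.h) and alloc flags (mm/internal.h), 3.10 */
    #define __GFP_WAIT          0x10u
    #define __GFP_HIGH          0x20u
    #define __GFP_MEMALLOC      0x2000u
    #define __GFP_NOMEMALLOC    0x10000u
    #define ALLOC_WMARK_MIN     0x00
    #define ALLOC_NO_WATERMARKS 0x04
    #define ALLOC_HARDER        0x10
    #define ALLOC_HIGH          0x20
    #define ALLOC_CPUSET        0x40

    /* rt_task_not_irq   ~ rt_task(current) && !in_interrupt()
     * pf_memalloc_ctx   ~ a PF_MEMALLOC/TIF_MEMDIE context outside hard irq */
    static int gfp_to_alloc_flags(unsigned int gfp_mask, bool rt_task_not_irq,
    			      bool pf_memalloc_ctx)
    {
    	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
    	bool wait = gfp_mask & __GFP_WAIT;

    	alloc_flags |= (gfp_mask & __GFP_HIGH);
    	if (!wait) {
    		if (!(gfp_mask & __GFP_NOMEMALLOC))
    			alloc_flags |= ALLOC_HARDER;
    		alloc_flags &= ~ALLOC_CPUSET;
    	} else if (rt_task_not_irq) {
    		alloc_flags |= ALLOC_HARDER;
    	}
    	if (!(gfp_mask & __GFP_NOMEMALLOC) &&
    	    ((gfp_mask & __GFP_MEMALLOC) || pf_memalloc_ctx))
    		alloc_flags |= ALLOC_NO_WATERMARKS;
    	return alloc_flags;
    }

    int main(void)
    {
    	printf("0x%02x\n", gfp_to_alloc_flags(0x0000d0, false, false)); /* 0x40 */
    	printf("0x%02x\n", gfp_to_alloc_flags(0x000020, false, false)); /* 0x30 */
    	printf("0x%02x\n", gfp_to_alloc_flags(0x0020d0, false, false)); /* 0x44 */
    	printf("0x%02x\n", gfp_to_alloc_flags(0x0000d0, false, true));  /* 0x44 */
    	printf("0x%02x\n", gfp_to_alloc_flags(0x0120d0, false, true));  /* 0x40 */
    	return 0;
    }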
    

    conclusion
    This post discussed the transformation from gfp_mask to alloc_flags, showing the code flow and working through several examples.

    kernel: mm: page_alloc: behaviors of page allocation while drivers call kmalloc

    November 22, 2015

    This post discusses the behaviors of page allocation when drivers call kmalloc. In this case, the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x18c0d0 = (__GFP_KMEMCG | __GFP_RECLAIMABLE | __GFP_ZERO | __GFP_COMP | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS).

    reference code base
    software of the testing device: LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack

    <4>[178437.220958] c1   3673 kworker/u16:2(3673:3673): alloc order:2 mode:0x18c0d0, reclaim 60 in 0.030s pri 10, scan 60, lru 80219, trigger lmk 1 times
    <4>[178437.220981] c1   3673 CPU: 1 PID: 3673 Comm: kworker/u16:2 Tainted: G        W    3.10.49-g4c6439a #12
    <4>[178437.221004] c1   3673 Workqueue: msm_vidc_workerq_venus venus_hfi_core_work_handler
    <4>[178437.221013] c1   3673 Call trace:
    <4>[178437.221027] c1   3673 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[178437.221036] c1   3673 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[178437.221048] c1   3673 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[178437.221060] c1   3673 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[178437.221071] c1   3673 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[178437.221080] c1   3673 [<ffffffc0002c24a0>] __get_free_pages+0x14/0x48
    <4>[178437.221090] c1   3673 [<ffffffc0002f01c8>] kmalloc_order_trace+0x3c/0xec
    <4>[178437.221102] c1   3673 [<ffffffc00051fdcc>] venus_hfi_core_work_handler+0x430/0xb84
    <4>[178437.221113] c1   3673 [<ffffffc000237b88>] process_one_work+0x264/0x3dc
    <4>[178437.221122] c1   3673 [<ffffffc000238ee0>] worker_thread+0x1f0/0x310
    <4>[178437.221131] c1   3673 [<ffffffc00023e6d0>] kthread+0xac/0xb8
    

    order and gfp_mask of this allocation
    kernel: mm: code flow and gfp_mask of kmalloc shows that kzalloc() ultimately uses gfp_mask 0x10c0d0 = (__GFP_KMEMCG | __GFP_ZERO | __GFP_COMP | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS), to allocate pages when size > 8 KB.

    In this case, the driver calls kzalloc(12KB, GFP_TEMPORARY), which ultimately requests an order-2 page (16 KB) with gfp_mask 0x18c0d0 = (__GFP_KMEMCG | __GFP_RECLAIMABLE | __GFP_ZERO | __GFP_COMP | GFP_KERNEL); GFP_TEMPORARY is GFP_KERNEL plus __GFP_RECLAIMABLE. The sketch after the driver snippet verifies the arithmetic.

    static void venus_hfi_response_handler(struct venus_hfi_device *device)
    {
    	u8 *packet = NULL;
    	u32 rc = 0;
    	struct hfi_sfr_struct *vsfr = NULL;
    
    	packet = kzalloc(VIDC_IFACEQ_VAR_HUGE_PKT_SIZE, GFP_TEMPORARY);
    
    #define VIDC_IFACEQ_VAR_HUGE_PKT_SIZE          (1024*12)
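
    A minimal userspace check of the order and mask arithmetic (flag values assumed from include/linux/gfp.h in 3.10; get_order() re-implemented for 4 KB pages):

    #include <stdio.h>

    /* flag values assumed from include/linux/gfp.h (3.10) */
    #define __GFP_WAIT        0x10u
    #define __GFP_IO          0x40u
    #define __GFP_FS          0x80u
    #define __GFP_COMP        0x4000u
    #define __GFP_ZERO        0x8000u
    #define __GFP_RECLAIMABLE 0x80000u
    #define __GFP_KMEMCG      0x100000u
    #define GFP_TEMPORARY     (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_RECLAIMABLE)

    /* same result as the kernel's get_order() with 4 KB pages */
    static int get_order(unsigned long size)
    {
    	int order = 0;

    	for (size = (size - 1) >> 12; size; size >>= 1)
    		order++;
    	return order;
    }

    int main(void)
    {
    	/* kzalloc() adds __GFP_ZERO; kmalloc_order() adds __GFP_COMP | __GFP_KMEMCG */
    	unsigned int mask = GFP_TEMPORARY | __GFP_ZERO | __GFP_COMP | __GFP_KMEMCG;

    	printf("order = %d\n", get_order(12 * 1024)); /* 2, i.e. a 16 KB page */
    	printf("mask  = 0x%x\n", mask);               /* 0x18c0d0 */
    	return 0;
    }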
    

    behaviors of this page allocation

  • GFP_KERNEL is set. This allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_RECLAIMABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x18c0d0 = (__GFP_KMEMCG | __GFP_RECLAIMABLE | __GFP_ZERO | __GFP_COMP | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_RECLAIMABLE = 1
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
    

    behaviors of this page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • If neither compaction nor direct reclaim makes progress, trigger OOM. It then returns a page if one is available after OOM.
  • should_alloc_retry() always returns true, so it goes back to rebalance again.
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is false, and the page order is 2. So should_alloc_retry() always returns true.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation when drivers call kmalloc: the process enters the page allocation slowpath while allocating an order-2 page with gfp_mask = 0x18c0d0 = (__GFP_KMEMCG | __GFP_RECLAIMABLE | __GFP_ZERO | __GFP_COMP | GFP_KERNEL), in which GFP_KERNEL = (__GFP_WAIT | __GFP_IO | __GFP_FS). The allocation is effectively guaranteed to succeed, since should_alloc_retry() always returns true and the slowpath keeps retrying.

    kernel: mm: gfp_mask of kmalloc

    November 22, 2015

    This post discusses the gfp_mask of kmalloc.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    CONFIG_TRACING=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    code flow

  • If the allocation size is greater than 8 KB (KMALLOC_MAX_CACHE_SIZE), then the code flow of kmalloc is
  • kmalloc()
    -> kmalloc_large()
    -> kmalloc_order_trace()
    -> kmalloc_order()
    -> __get_free_pages()
    -> __alloc_pages_nodemask()
    
  • If the allocation size is equal to or less than 8 KB (KMALLOC_MAX_CACHE_SIZE), then the code flow is
  • kmalloc()
    -> kmem_cache_alloc_trace()
    
  • kzalloc calls kmalloc directly and sets the __GFP_ZERO flag in gfp_mask. The path selection is sketched right below.
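
    A small userspace sketch of that size test (the 8 KB SLUB limit assumed from include/linux/slab.h with 4 KB pages):

    #include <stdio.h>

    #define PAGE_SHIFT             12
    #define KMALLOC_SHIFT_HIGH     (PAGE_SHIFT + 1)            /* 13 with SLUB */
    #define KMALLOC_MAX_CACHE_SIZE (1UL << KMALLOC_SHIFT_HIGH) /* 8 KB */

    /* mirrors the size test kmalloc() applies to constant sizes */
    static const char *kmalloc_path(unsigned long size)
    {
    	if (size > KMALLOC_MAX_CACHE_SIZE)
    		return "kmalloc_large() -> page allocator";
    	return "kmem_cache_alloc_trace() -> kmalloc-* slab cache";
    }

    int main(void)
    {
    	printf("%6lu: %s\n", 4096UL,  kmalloc_path(4096));
    	printf("%6lu: %s\n", 8192UL,  kmalloc_path(8192));
    	printf("%6lu: %s\n", 12288UL, kmalloc_path(12288));
    	return 0;
    }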

    gfp masks

  • kzalloc() sets gfp_mask with 0x008000 = (__GFP_ZERO)
  • kmalloc_order() sets gfp_mask with 0x104000 = (__GFP_KMEMCG | __GFP_COMP)
  • kmalloc(size, GFP_KERNEL) allocates pages with gfp_mask = 0x1040d0 = (__GFP_KMEMCG | __GFP_COMP | GFP_KERNEL).
  • kzalloc(size, GFP_KERNEL) allocates pages with gfp_mask = 0x10c0d0 = (__GFP_KMEMCG | __GFP_ZERO | __GFP_COMP | GFP_KERNEL).
    #ifdef CONFIG_SLAB
    /*
     * The largest kmalloc size supported by the SLAB allocators is
     * 32 megabyte (2^25) or the maximum allocatable page order if that is
     * less than 32 MB.
     *
     * WARNING: Its not easy to increase this value since the allocators have
     * to do various tricks to work around compiler limitations in order to
     * ensure proper constant folding.
     */
    #define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
    				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
    #define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
    #ifndef KMALLOC_SHIFT_LOW
    #define KMALLOC_SHIFT_LOW	5
    #endif
    #else
    /*
     * SLUB allocates up to order 2 pages directly and otherwise
     * passes the request to the page allocator.
     */
    #define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
    #define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT)
    #ifndef KMALLOC_SHIFT_LOW
    #define KMALLOC_SHIFT_LOW	3
    #endif
    #endif
    
    /* Maximum allocatable size */
    #define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_MAX)
    /* Maximum size for which we actually use a slab cache */
    #define KMALLOC_MAX_CACHE_SIZE	(1UL << KMALLOC_SHIFT_HIGH)
    /* Maximum order allocatable via the slab allocagtor */
    #define KMALLOC_MAX_ORDER	(KMALLOC_SHIFT_MAX - PAGE_SHIFT)
    
    
    static __always_inline void *
    kmalloc_order(size_t size, gfp_t flags, unsigned int order)
    {
    	void *ret;
    
    	flags |= (__GFP_COMP | __GFP_KMEMCG);
    	ret = (void *) __get_free_pages(flags, order);
    	kmemleak_alloc(ret, size, 1, flags);
    	return ret;
    }
    
    #ifdef CONFIG_TRACING
    void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
    {
    	void *ret = slab_alloc(s, gfpflags, _RET_IP_);
    	trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
    	return ret;
    }
    EXPORT_SYMBOL(kmem_cache_alloc_trace);
    
    void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
    {
    	void *ret = kmalloc_order(size, flags, order);
    	trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << order, flags);
    	return ret;
    }
    EXPORT_SYMBOL(kmalloc_order_trace);
    
    #ifdef CONFIG_TRACING
    extern void *
    kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size);
    extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order);
    #else
    static __always_inline void *
    kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
    {
    	return kmem_cache_alloc(s, gfpflags);
    }
    
    static __always_inline void *
    kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
    {
    	return kmalloc_order(size, flags, order);
    }
    #endif
    
    static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
    {
    	unsigned int order = get_order(size);
    	return kmalloc_order_trace(size, flags, order);
    }
    
    static __always_inline void *kmalloc(size_t size, gfp_t flags)
    {
    	if (__builtin_constant_p(size)) {
    		if (size > KMALLOC_MAX_CACHE_SIZE)
    			return kmalloc_large(size, flags);
    
    		if (!(flags & GFP_DMA)) {
    			int index = kmalloc_index(size);
    
    			if (!index)
    				return ZERO_SIZE_PTR;
    
    			return kmem_cache_alloc_trace(kmalloc_caches[index],
    					flags, size);
    		}
    	}
    	return __kmalloc(size, flags);
    }
    
    /**
     * kzalloc - allocate memory. The memory is set to zero.
     * @size: how many bytes of memory are required.
     * @flags: the type of memory to allocate (see kmalloc).
     */
    static inline void *kzalloc(size_t size, gfp_t flags)
    {
    	return kmalloc(size, flags | __GFP_ZERO);
    }
    

    conclusion
    This post discussed the gfp_mask of kmalloc and kzalloc when the allocation size is greater than 8 KB.

    kernel: mm: page_alloc: behaviors of page allocation while kthread forks after system suspends

    November 21, 2015

    This post discusses the behaviors of page allocation when kthreadd forks after the system suspends. In this case, the process enters the page allocation slowpath while allocating pages for a kernel stack with gfp_mask = 0x300010 = (__GFP_NOTRACK | __GFP_KMEMCG | __GFP_WAIT).

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack
    kthreadd, whose PID is 2, stays in TASK_INTERRUPTIBLE until kthread_create_list becomes non-empty and it is woken up. For each entry in kthread_create_list, it creates a kernel thread by calling do_fork().

    <7>[122467.918049] c1   1135 PM: Entering mem sleep
    .....
    <4>[122468.083046] c0      2 kthreadd(2:2): alloc order:2 mode:0x300010, reclaim 45 in 0.030s pri 10, scan 89, lru 80323, trigger lmk 1 times
    <4>[122468.083068] c0      2 CPU: 0 PID: 2 Comm: kthreadd Tainted: G        W    3.10.49-g4c6439a #12
    <4>[122468.083076] c0      2 Call trace:
    <4>[122468.083093] c0      2 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122468.083103] c0      2 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122468.083115] c0      2 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122468.083127] c0      2 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122468.083138] c0      2 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122468.083148] c0      2 [<ffffffc00021a1e4>] copy_process.part.58+0xf4/0xdfc
    <4>[122468.083157] c0      2 [<ffffffc00021b000>] do_fork+0xe0/0x358
    <4>[122468.083166] c0      2 [<ffffffc00021b2a4>] kernel_thread+0x2c/0x38
    <4>[122468.083176] c0      2 [<ffffffc00023ef0c>] kthreadd+0xd8/0x108
    ......
    <7>[122468.154504] c0   1135 PM: Finishing wakeup.
    

    why does fork allocate an order-2 page in arm64
    kernel: arm64: mm: allocate kernel stack

    behaviors of this page allocation

  • Only __GFP_WAIT is set. This allocation could sleep, but couldn’t do IO/FS operations.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x300010 = (__GFP_NOTRACK | __GFP_KMEMCG | __GFP_WAIT)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
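
    Note that 0x300010 is exactly the normal kernel-stack mask 0x3000d0 (see the fork posts) with GFP_IOFS filtered out by gfp_allowed_mask during suspend; a minimal sketch of that filtering, with flag values assumed from include/linux/gfp.h (3.10):

    #include <stdio.h>

    #define __GFP_IO 0x40u
    #define __GFP_FS 0x80u
    #define GFP_IOFS (__GFP_IO | __GFP_FS)

    int main(void)
    {
    	unsigned int thread_gfp = 0x3000d0; /* THREADINFO_GFP_ACCOUNTED */

    	/* pm_restrict_gfp_mask() clears GFP_IOFS in gfp_allowed_mask, and
    	 * __alloc_pages_nodemask() masks every request with gfp_allowed_mask */
    	printf("0x%x\n", thread_gfp & ~GFP_IOFS); /* 0x300010, as in the log */
    	return 0;
    }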
    

    behaviors of this page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • __GFP_FS is stripped by gfp_allowed_mask here, so the OOM killer is not invoked.
  • If neither compaction nor direct reclaim makes progress, pm_suspended_storage() is true, so should_alloc_retry() returns false and the rebalance loop exits.
  • Try compaction one last time, which returns a page on success.
  • Return NULL page
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is true while suspended, and the page order is 2. If neither compaction nor direct reclaim makes progress, it returns false. Otherwise, it returns true.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation when kthreadd forks after the system suspends: the process enters the page allocation slowpath while allocating kernel-stack pages with gfp_mask = 0x300010 = (__GFP_NOTRACK | __GFP_KMEMCG | __GFP_WAIT). This allocation doesn't guarantee success, since pm_suspended_storage() is true: it gives up when compaction and direct reclaim fail to release a feasible page immediately.

    kernel: alloc_page: how suspend resume controls gfp_mask

    November 21, 2015

    This post discusses how suspend/resume controls gfp_mask to affect page allocation behaviors.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    CONFIG_PM_SLEEP=y
    

    how does suspend resume control gfp_allowed_mask
    suspend/resume affects allocation gfp_mask by controlling gfp_allowed_mask.

    suspend_devices_and_enter() is the main function of suspend/resume: devices are suspended inside it and resume from it, too. Before suspending, pm_restrict_gfp_mask() is called to suppress some flags in gfp_allowed_mask. After resuming, pm_restore_gfp_mask() is called to restore the suppressed flags in gfp_allowed_mask.

    static int enter_state(suspend_state_t state)
    {
    	int error;
    
    	if (!valid_state(state))
    		return -ENODEV;
    
    	if (!mutex_trylock(&pm_mutex))
    		return -EBUSY;
    
    	if (state == PM_SUSPEND_FREEZE)
    		freeze_begin();
    
    	printk(KERN_INFO "PM: Syncing filesystems ... ");
    	sys_sync();
    	printk("done.\n");
    
    	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
    	error = suspend_prepare(state);
    	if (error)
    		goto Unlock;
    
    	if (suspend_test(TEST_FREEZER))
    		goto Finish;
    
    	pr_debug("PM: Entering %s sleep\n", pm_states[state]);
    	pm_restrict_gfp_mask();
    	error = suspend_devices_and_enter(state);
    	pm_restore_gfp_mask();
    
     Finish:
    	pr_debug("PM: Finishing wakeup.\n");
    	suspend_finish();
     Unlock:
    	mutex_unlock(&pm_mutex);
    	return error;
    }
    

    which flags are suppressed during suspend

  • pm_restrict_gfp_mask() suppresses GFP_IOFS = (__GFP_IO | __GFP_FS).
  • pm_restore_gfp_mask() restores these flags to their original state.
  • pm_suspended_storage() reports whether the GFP_IOFS flags are disabled in gfp_allowed_mask.
    gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
    
    #ifdef CONFIG_PM_SLEEP
    /*
     * The following functions are used by the suspend/hibernate code to temporarily
     * change gfp_allowed_mask in order to avoid using I/O during memory allocations
     * while devices are suspended.  To avoid races with the suspend/hibernate code,
     * they should always be called with pm_mutex held (gfp_allowed_mask also should
     * only be modified with pm_mutex held, unless the suspend/hibernate code is
     * guaranteed not to run in parallel with that modification).
     */
    
    static gfp_t saved_gfp_mask;
    
    void pm_restore_gfp_mask(void)
    {
    	WARN_ON(!mutex_is_locked(&pm_mutex));
    	if (saved_gfp_mask) {
    		gfp_allowed_mask = saved_gfp_mask;
    		saved_gfp_mask = 0;
    	}
    }
    
    void pm_restrict_gfp_mask(void)
    {
    	WARN_ON(!mutex_is_locked(&pm_mutex));
    	WARN_ON(saved_gfp_mask);
    	saved_gfp_mask = gfp_allowed_mask;
    	gfp_allowed_mask &= ~GFP_IOFS;
    }
    
    bool pm_suspended_storage(void)
    {
    	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
    		return false;
    	return true;
    }
    #endif /* CONFIG_PM_SLEEP */
    

    how does gfp_allowed_mask control allocation behaviors
    __alloc_pages_nodemask() is the heart of the zoned buddy allocator. Users request pages by passing gfp_mask and order. __alloc_pages_nodemask() filters gfp_mask with gfp_allowed_mask: only the flags also set in gfp_allowed_mask survive.

    /*
     * This is the 'heart' of the zoned buddy allocator.
     */
    struct page *
    __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                struct zonelist *zonelist, nodemask_t *nodemask)
    {
        enum zone_type high_zoneidx = gfp_zone(gfp_mask);
        struct zone *preferred_zone;
        struct page *page = NULL;
        int migratetype = allocflags_to_migratetype(gfp_mask);
        unsigned int cpuset_mems_cookie;
        int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
        struct mem_cgroup *memcg = NULL;
    
        gfp_mask &= gfp_allowed_mask;
    
    

    should_alloc_retry() controls the retry loop in the allocation slowpath. If it returns true, the memory subsystem keeps working hard, via compaction and reclaim, to release memory and make the allocation succeed. If neither mechanism releases more memory and storage is suspended, it is probably useless to keep releasing memory, so should_alloc_retry() returns false to give up the allocation under this condition (unless __GFP_NOFAIL is set, which always retries). A userspace sketch after the listing demonstrates the decision.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
                    unsigned long did_some_progress,
                    unsigned long pages_reclaimed)
    {
        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
            return 0;
    
        /* Always retry if specifically requested */
        if (gfp_mask & __GFP_NOFAIL)
            return 1;
    
        /*
         * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
         * making forward progress without invoking OOM. Suspend also disables
         * storage devices so kswapd will not help. Bail if we are suspending.
         */
        if (!did_some_progress && pm_suspended_storage())
            return 0;
    
        /*
         * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
         * means __GFP_NOFAIL, but that may not be true in other
         * implementations.
         */
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
            return 1;
    
        /*
         * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
         * specified, then we retry until we no longer reclaim any pages
         * (above), or we've reclaimed an order of pages at least as
         * large as the allocation's order. In both cases, if the
         * allocation still fails, we stop retrying.
         */
        if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
            return 1;
    
        return 0;
    }
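
    To make the decision concrete, here is a minimal userspace sketch of the same logic, with pm_suspended_storage() folded into a boolean parameter (flag values assumed from the 3.10 headers):

    #include <stdio.h>
    #include <stdbool.h>

    #define __GFP_REPEAT  0x400u
    #define __GFP_NOFAIL  0x800u
    #define __GFP_NORETRY 0x1000u
    #define PAGE_ALLOC_COSTLY_ORDER 3

    static int should_alloc_retry(unsigned int gfp_mask, unsigned int order,
    			      unsigned long did_some_progress,
    			      unsigned long pages_reclaimed, bool suspended)
    {
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    	if (!did_some_progress && suspended)
    		return 0;
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    	if ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1UL << order))
    		return 1;
    	return 0;
    }

    int main(void)
    {
    	/* an order-2 GFP_KERNEL allocation keeps retrying while awake... */
    	printf("%d\n", should_alloc_retry(0x2142d0, 2, 0, 0, false)); /* 1 */
    	/* ...but bails out when reclaim made no progress during suspend */
    	printf("%d\n", should_alloc_retry(0x300010, 2, 0, 0, true));  /* 0 */
    	return 0;
    }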
    

    conclusion
    This post discussed how suspend/resume controls gfp_allowed_mask, which in turn controls the behaviors of page allocation. While the device is suspended, GFP_IOFS = (__GFP_IO | __GFP_FS) is masked out of every allocation's gfp_mask.

    kernel: mm: page_alloc: behaviors of page allocation while allocating unreclaimable slab pages

    November 21, 2015

    This post discusses the behaviors of page allocation while allocating unreclaimable slab pages with gfp_mask = 0x2052d0. In this case, the process enters the page allocation slowpath while allocating an inode object from the fuse_inode kmem_cache.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack
    The process couldn't find the inode of a path in a fuse filesystem, so it calls kmem_cache_alloc() to request an inode object from the fuse_inode kmem_cache. Since no slabs are available, the fuse_inode kmem_cache calls new_slab(), which allocates an order-3 page from the buddy system. The call stacks show the process entering direct reclaim within the page allocation slowpath twice.

    <4>[122586.862473] c2  15062 MediaScannerSer(14889:15062): alloc order:3 mode:0x2052d0, reclaim 66 in 0.030s pri 10, scan 68, lru 80329, trigger lmk 1 times
    <4>[122586.862494] c2  15062 CPU: 2 PID: 15062 Comm: MediaScannerSer Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122586.862503] c2  15062 Call trace:
    <4>[122586.862519] c2  15062 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122586.862528] c2  15062 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122586.862540] c2  15062 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122586.862552] c2  15062 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122586.862563] c2  15062 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122586.862572] c2  15062 [<ffffffc0002f0c84>] new_slab+0x80/0x23c
    <4>[122586.862582] c2  15062 [<ffffffc0002f279c>] __slab_alloc.isra.50.constprop.55+0x26c/0x300
    <4>[122586.862591] c2  15062 [<ffffffc0002f2ad8>] kmem_cache_alloc+0x94/0x1d4
    <4>[122586.862602] c2  15062 [<ffffffc0003e3a38>] fuse_alloc_inode+0x20/0xb8
    <4>[122586.862612] c2  15062 [<ffffffc00030f484>] alloc_inode+0x1c/0x90
    <4>[122586.862621] c2  15062 [<ffffffc000310150>] iget5_locked+0xa0/0x1c0
    <4>[122586.862629] c2  15062 [<ffffffc0003e3d1c>] fuse_iget+0x60/0x1bc
    <4>[122586.862639] c2  15062 [<ffffffc0003de158>] fuse_lookup_name+0x140/0x194
    <4>[122586.862648] c2  15062 [<ffffffc0003de1e0>] fuse_lookup+0x34/0x110
    <4>[122586.862656] c2  15062 [<ffffffc000302388>] lookup_real+0x30/0x54
    <4>[122586.862664] c2  15062 [<ffffffc000302f18>] __lookup_hash+0x30/0x48
    <4>[122586.862673] c2  15062 [<ffffffc000303798>] lookup_slow+0x44/0xbc
    <4>[122586.862682] c2  15062 [<ffffffc000304df4>] path_lookupat+0x104/0x710
    <4>[122586.862690] c2  15062 [<ffffffc000305428>] filename_lookup.isra.32+0x28/0x74
    <4>[122586.862699] c2  15062 [<ffffffc0003072e8>] user_path_at_empty+0x58/0x88
    <4>[122586.862708] c2  15062 [<ffffffc000307324>] user_path_at+0xc/0x18
    <4>[122586.862719] c2  15062 [<ffffffc0002f8780>] SyS_faccessat+0xc0/0x1bc
    
    <4>[122588.502620] c2  15062 MediaScannerSer(14889:15062): alloc order:3 mode:0x2052d0, reclaim 63 in 0.030s pri 10, scan 71, lru 80678, trigger lmk 1 times
    <4>[122588.502643] c2  15062 CPU: 2 PID: 15062 Comm: MediaScannerSer Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122588.502652] c2  15062 Call trace:
    <4>[122588.502670] c2  15062 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122588.502680] c2  15062 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122588.502693] c2  15062 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122588.502705] c2  15062 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122588.502716] c2  15062 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122588.502726] c2  15062 [<ffffffc0002f0c84>] new_slab+0x80/0x23c
    <4>[122588.502736] c2  15062 [<ffffffc0002f279c>] __slab_alloc.isra.50.constprop.55+0x26c/0x300
    <4>[122588.502746] c2  15062 [<ffffffc0002f2ad8>] kmem_cache_alloc+0x94/0x1d4
    <4>[122588.502757] c2  15062 [<ffffffc0003e3a38>] fuse_alloc_inode+0x20/0xb8
    <4>[122588.502768] c2  15062 [<ffffffc00030f484>] alloc_inode+0x1c/0x90
    <4>[122588.502777] c2  15062 [<ffffffc000310150>] iget5_locked+0xa0/0x1c0
    <4>[122588.502786] c2  15062 [<ffffffc0003e3d1c>] fuse_iget+0x60/0x1bc
    <4>[122588.502796] c2  15062 [<ffffffc0003de158>] fuse_lookup_name+0x140/0x194
    <4>[122588.502805] c2  15062 [<ffffffc0003de1e0>] fuse_lookup+0x34/0x110
    <4>[122588.502814] c2  15062 [<ffffffc000302388>] lookup_real+0x30/0x54
    <4>[122588.502822] c2  15062 [<ffffffc000302f18>] __lookup_hash+0x30/0x48
    <4>[122588.502830] c2  15062 [<ffffffc000303798>] lookup_slow+0x44/0xbc
    <4>[122588.502839] c2  15062 [<ffffffc000304df4>] path_lookupat+0x104/0x710
    <4>[122588.502848] c2  15062 [<ffffffc000305428>] filename_lookup.isra.32+0x28/0x74
    <4>[122588.502857] c2  15062 [<ffffffc0003072e8>] user_path_at_empty+0x58/0x88
    <4>[122588.502865] c2  15062 [<ffffffc000307324>] user_path_at+0xc/0x18
    <4>[122588.502878] c2  15062 [<ffffffc0002f8780>] SyS_faccessat+0xc0/0x1bc
    

    behaviors of page allocation

  • GFP_KERNEL means this allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 3 
    gfp_mask = 0x2052d0 = (__GFP_NOTRACK | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
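
    Where do __GFP_NORETRY and __GFP_NOWARN come from? SLUB adds them itself for the first, higher-order slab allocation so it can quietly fall back to the cache's minimum order; a condensed excerpt from allocate_slab() in mm/slub.c (3.10, abbreviated):

    	/* let the initial higher-order allocation fail silently under
    	 * pressure so we can fall back to the minimum order */
    	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;

    	page = alloc_slab_page(alloc_gfp, node, oo);
    	if (unlikely(!page)) {
    		oo = s->min;
    		/* allocation may have failed due to fragmentation;
    		 * try a lower order */
    		page = alloc_slab_page(flags, node, oo);
    	}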
    

    behaviors of page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • __GFP_NORETRY is set, so even if neither compaction nor direct reclaim makes progress, the OOM killer is not invoked.
  • __GFP_NORETRY is set: should_alloc_retry() returns false.
  • Try compaction one last time, which returns a page on success.
  • __GFP_NOWARN is set: skip printing the page allocation failure warning
  • Return NULL page
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    why does should_alloc_retry() return false
    __GFP_NORETRY is set, so should_alloc_retry() returns 0 at its very first check.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation while allocating unreclaimable slab pages with gfp_mask = 0x2052d0. Page allocation behaviors are whimsical and need case-by-case study. In this case, the process enters the page allocation slowpath while allocating an inode object from the fuse_inode kmem_cache.

    kernel: mm: page_alloc: behaviors of page allocation while a thread forks

    November 18, 2015

    This post discusses the behaviors of page allocation while a thread forks. In this case, the process enters the page allocation slowpath while allocating pages for a kernel stack with gfp_mask = 0x3000d0.

    reference code base
    LA.BF64.1.2.1-02220-8x94.0 with Android 5.1.0_r3(LMY47I) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    

    environment setup
    The system has a single memory node with one DMA zone. The zone has 727 pageblocks, 106 of which are CMA pageblocks.

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve          CMA      Isolate 
    Node 0, zone      DMA          143            8          468            2          106            0
    

    call stack
    The process enters do_fork(), allocates an order-2 page, and enters the page allocation slowpath.

    <4>[122596.622892] c2  15688 gle.android.gms(15688:15688): alloc order:2 mode:0x3000d0, reclaim 60 in 0.030s pri 10, scan 60, lru 80228, trigger lmk 1 times
    <4>[122596.622921] c2  15688 CPU: 2 PID: 15688 Comm: gle.android.gms Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122596.622931] c2  15688 Call trace:
    <4>[122596.622954] c2  15688 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122596.622965] c2  15688 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122596.622981] c2  15688 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122596.622995] c2  15688 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122596.623009] c2  15688 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122596.623022] c2  15688 [<ffffffc00021a1e4>] copy_process.part.58+0xf4/0xdfc
    <4>[122596.623031] c2  15688 [<ffffffc00021b000>] do_fork+0xe0/0x358
    <4>[122596.623041] c2  15688 [<ffffffc00021b310>] SyS_clone+0x10/0x1c
    <4>[122596.685079] c1  15688 gle.android.gms(15688:15688): alloc order:2 mode:0x3000d0, reclaim 54 in 0.030s pri 10, scan 97, lru 79879, trigger lmk 1 times
    <4>[122596.685114] c1  15688 CPU: 1 PID: 15688 Comm: gle.android.gms Tainted: G        W    3.10.49-g4c6439a #12 
    <4>[122596.685127] c1  15688 Call trace:
    <4>[122596.685152] c1  15688 [<ffffffc0002077dc>] dump_backtrace+0x0/0x134
    <4>[122596.685163] c1  15688 [<ffffffc000207920>] show_stack+0x10/0x1c
    <4>[122596.685179] c1  15688 [<ffffffc000cedc64>] dump_stack+0x1c/0x28
    <4>[122596.685193] c1  15688 [<ffffffc0002cb6d8>] try_to_free_pages+0x5f4/0x720
    <4>[122596.685207] c1  15688 [<ffffffc0002c219c>] __alloc_pages_nodemask+0x544/0x834
    <4>[122596.685220] c1  15688 [<ffffffc00021a1e4>] copy_process.part.58+0xf4/0xdfc
    <4>[122596.685230] c1  15688 [<ffffffc00021b000>] do_fork+0xe0/0x358
    <4>[122596.685241] c1  15688 [<ffffffc00021b310>] SyS_clone+0x10/0x1c
    

    why does fork allocate an order-2 page in arm64
    kernel: arm64: mm: allocate kernel stack
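
    For reference, a condensed excerpt from kernel/fork.c (3.10): this is where the kernel stack is allocated and where gfp_mask 0x3000d0 = (__GFP_NOTRACK | __GFP_KMEMCG | GFP_KERNEL) comes from.

    # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
    #define THREADINFO_GFP_ACCOUNTED	(THREADINFO_GFP | __GFP_KMEMCG)

    static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
    						  int node)
    {
    	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
    					     THREAD_SIZE_ORDER);

    	return page ? page_address(page) : NULL;
    }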

    behaviors of page allocation

  • GFP_KERNEL means this allocation could do IO/FS operations and sleep.
  • gfp_mask suggests allocation from ZONE_NORMAL, and the first feasible zone in the zonelist is ZONE_DMA.
  • gfp_mask suggests allocation from MIGRATE_UNMOVABLE freelist.
  • low watermark check is required.
  • page order = 2 
    gfp_mask = 0x3000d0 = (__GFP_NOTRACK | __GFP_KMEMCG | GFP_KERNEL)
    high_zoneidx = gfp_zone(gfp_mask) = ZONE_NORMAL = 1
    migratetype = allocflags_to_migratetype(gfp_mask) = MIGRATE_UNMOVABLE = 0
    preferred_zone = ZONE_DMA
    alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET
    

    behaviors of page allocation slowpath

  • __GFP_NO_KSWAPD is not set: wake up kswapd
  • try get_page_from_freelist() before entering rebalance
  • ALLOC_NO_WATERMARKS is not set: skip __alloc_pages_high_priority(), which would return a page on success
  • wait is true: enter rebalance, which includes compaction and direct reclaim
  • Try compaction, which returns a page on success.
  • Try direct reclaim, which returns a page on success.
  • If neither compaction nor direct reclaim makes progress, trigger OOM. It then returns a page if one is available after OOM.
  • should_alloc_retry() always returns true, so it goes back to rebalance again.
  • wait = gfp_mask & __GFP_WAIT = __GFP_WAIT
    alloc_flags = gfp_to_alloc_flags(gfp_mask) = 0x00000040 = (ALLOC_WMARK_MIN | ALLOC_CPUSET)
    

    behaviors of should_alloc_retry()
    __GFP_NORETRY is not set, __GFP_NOFAIL is not set, pm_suspended_storage() is false, and the page order is 2. So should_alloc_retry() always returns true.

    static inline int
    should_alloc_retry(gfp_t gfp_mask, unsigned int order,
    				unsigned long did_some_progress,
    				unsigned long pages_reclaimed)
    {
    	/* Do not loop if specifically requested */
    	if (gfp_mask & __GFP_NORETRY)
    		return 0;
    
    	/* Always retry if specifically requested */
    	if (gfp_mask & __GFP_NOFAIL)
    		return 1;
    
    	/*
    	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
    	 * making forward progress without invoking OOM. Suspend also disables
    	 * storage devices so kswapd will not help. Bail if we are suspending.
    	 */
    	if (!did_some_progress && pm_suspended_storage())
    		return 0;
    
    	/*
    	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
    	 * means __GFP_NOFAIL, but that may not be true in other
    	 * implementations.
    	 */
    	if (order <= PAGE_ALLOC_COSTLY_ORDER)
    		return 1;
    
    	/*
    	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
    	 * specified, then we retry until we no longer reclaim any pages
    	 * (above), or we've reclaimed an order of pages at least as
    	 * large as the allocation's order. In both cases, if the
    	 * allocation still fails, we stop retrying.
    	 */
    	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
    		return 1;
    
    	return 0;
    }
    

    conclusion
    This post discussed the behaviors of page allocation while a thread forks. In arm64, each thread needs an order-2 page as its kernel stack. In this case, a thread allocates an order-2 page with gfp_mask = 0x3000d0: the process enters the page allocation slowpath and performs direct reclaim twice. These reclaims take 0.06 seconds within fork.

