Archive for the ‘reclaim’ Category

kernel: mm: balance_pgdat

December 6, 2015

This post discusses balance_pgdat().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call flow of background reclaim

kswapd()
   -> kswapd_try_to_sleep()
      -> prepare_to_wait()
      -> prepare_kswapd_sleep()
      -> prepare_kswapd_sleep()
      -> finish_wait()
   -> try_to_freeze()
   -> balance_pgdat()
      -> shrink_zone()
         -> shrink_lruvec()
         -> vmpressure()
         -> should_continue_reclaim()
      -> shrink_slab()

balance_pgdat()
balance_pgdat() returns (order, classzone_idx).

If the input order is 3 and balance_pgdat() returns (3, 2), it implies that balance_pgdat() successfully rebalanced all zones from dma through normal to highmem against the order-3 high watermark.

If the input order is 3 and balance_pgdat() returns (0, 2), it implies that balance_pgdat() failed to rebalance all zones from dma through normal to highmem against the order-3 high watermark, but it did rebalance all of them against the order-0 high watermark.

kswapd() checks the return value of balance_pgdat(). If the returned order is less than the input order, kswapd() knows the rebalance failed, so it avoids refreshing (new_order, new_classzone_idx) from pgdat; this makes it possible for kswapd to go to sleep after a rebalance failure.
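
As a rough sketch, mirroring the kswapd() loop shown in the next post: only when the previous rebalance satisfied the pending request does kswapd accept a new one from pgdat; otherwise it keeps the old request and tries to sleep first.

if (balanced_classzone_idx >= new_classzone_idx &&
                        balanced_order == new_order) {
        /* the last rebalance satisfied the request: accept a new one */
        new_order = pgdat->kswapd_max_order;
        new_classzone_idx = pgdat->classzone_idx;
        pgdat->kswapd_max_order = 0;
        pgdat->classzone_idx = pgdat->nr_zones - 1;
}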

Simplified code flow of balance_pgdat():

set up scan_control sc for shrink_zone().

loop_again:
do {
    /* scan highmem->dma for the highest zone that is not balanced */
    for (i = pgdat->nr_zones - 1; i >= 0; i--) {
        struct zone *zone = pgdat->node_zones + i;

        if (!zone_balanced(zone, order, 0, 0)) {
            end_zone = i;
            break;
        }
    }

    if (i < 0) {
        pgdat_is_balanced = true;
        goto out;
    }

    /* now shrink the zones from dma up to end_zone */
    for (i = 0; i <= end_zone; i++) {
        shrink_zone();
        shrink_slab();
    }

    if (pgdat_balanced()) {
        pgdat_is_balanced = true;
        break;
    }

    if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
        break;
} while (--sc.priority >= 0);

out:
if (!pgdat_is_balanced) {
    if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
        order = sc.order = 0;
    goto loop_again;
}

if (order)
    try compact_pgdat() if needed;

*classzone_idx = end_zone;
return order;
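
zone_balanced(), used above with balance_gap = 0 and classzone_idx = 0, is roughly the following check (a simplified sketch, not verbatim 3.10.49 code):

static bool zone_balanced(struct zone *zone, int order,
                          unsigned long balance_gap, int classzone_idx)
{
        /* free pages must be above the high watermark plus the gap */
        if (!zone_watermark_ok_safe(zone, order,
                                    high_wmark_pages(zone) + balance_gap,
                                    classzone_idx, 0))
                return false;

        /* for order > 0 with compaction enabled, compaction must not be
           skipped for lack of order-0 pages */
        if (IS_ENABLED(CONFIG_COMPACTION) && order &&
            compaction_suitable(zone, order) == COMPACT_SKIPPED)
                return false;

        return true;
}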

/*
 * For kswapd, balance_pgdat() will work across all this node's zones until
 * they are all at high_wmark_pages(zone).
 *
 * Returns the final order kswapd was reclaiming at
 *
 * There is special handling here for zones which are full of pinned pages.
 * This can happen if the pages are all mlocked, or if they are all used by
 * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
 * What we do is to detect the case where all pages in the zone have been
 * scanned twice and there has been zero successful reclaim.  Mark the zone as
 * dead and from now on, only perform a short scan.  Basically we're polling
 * the zone for when the problem goes away.
 *
 * kswapd scans the zones in the highmem->normal->dma direction.  It skips
 * zones which have free_pages > high_wmark_pages(zone), but once a zone is
 * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
 * lower zones regardless of the number of free pages in the lower zones. This
 * interoperates with the page allocator fallback scheme to ensure that aging
 * of pages is balanced across the zones.
 */
static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
							int *classzone_idx)
{
	bool pgdat_is_balanced = false;
	int i;
	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
	struct reclaim_state *reclaim_state = current->reclaim_state;
	unsigned long nr_soft_reclaimed;
	unsigned long nr_soft_scanned;
	struct scan_control sc = {
		.gfp_mask = GFP_KERNEL,
		.may_unmap = 1,
		.may_swap = 1,
		/*
		 * kswapd doesn't want to be bailed out while reclaim. because
		 * we want to put equal scanning pressure on each zone.
		 */
		.nr_to_reclaim = ULONG_MAX,
		.order = order,
		.target_mem_cgroup = NULL,
	};
	struct shrink_control shrink = {
		.gfp_mask = sc.gfp_mask,
	};
loop_again:
	sc.priority = DEF_PRIORITY;
	sc.nr_reclaimed = 0;
	sc.may_writepage = !laptop_mode;
	count_vm_event(PAGEOUTRUN);

	do {
		unsigned long lru_pages = 0;

		/*
		 * Scan in the highmem->dma direction for the highest
		 * zone which needs scanning
		 */
		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
			struct zone *zone = pgdat->node_zones + i;

			if (!populated_zone(zone))
				continue;

			if (sc.priority != DEF_PRIORITY &&
			    !zone_reclaimable(zone))
				continue;

			/*
			 * Do some background aging of the anon list, to give
			 * pages a chance to be referenced before reclaiming.
			 */
			age_active_anon(zone, &sc);

			/*
			 * If the number of buffer_heads in the machine
			 * exceeds the maximum allowed level and this node
			 * has a highmem zone, force kswapd to reclaim from
			 * it to relieve lowmem pressure.
			 */
			if (buffer_heads_over_limit && is_highmem_idx(i)) {
				end_zone = i;
				break;
			}

			if (!zone_balanced(zone, order, 0, 0)) {
				end_zone = i;
				break;
			} else {
				/* If balanced, clear the congested flag */
				zone_clear_flag(zone, ZONE_CONGESTED);
			}
		}

		if (i < 0) {
			pgdat_is_balanced = true;
			goto out;
		}

		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;

			lru_pages += zone_reclaimable_pages(zone);
		}

		/*
		 * Now scan the zone in the dma->highmem direction, stopping
		 * at the last zone which needs scanning.
		 *
		 * We do this because the page allocator works in the opposite
		 * direction.  This prevents the page allocator from allocating
		 * pages behind kswapd's direction of progress, which would
		 * cause too much scanning of the lower zones.
		 */
		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;
			int testorder;
			unsigned long balance_gap;

			if (!populated_zone(zone))
				continue;

			if (sc.priority != DEF_PRIORITY &&
			    !zone_reclaimable(zone))
				continue;

			sc.nr_scanned = 0;

			nr_soft_scanned = 0;
			/*
			 * Call soft limit reclaim before calling shrink_zone.
			 */
			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
							order, sc.gfp_mask,
							&nr_soft_scanned);
			sc.nr_reclaimed += nr_soft_reclaimed;

			/*
			 * We put equal pressure on every zone, unless
			 * one zone has way too many pages free
			 * already. The "too many pages" is defined
			 * as the high wmark plus a "gap" where the
			 * gap is either the low watermark or 1%
			 * of the zone, whichever is smaller.
			 */
			balance_gap = min(low_wmark_pages(zone),
				(zone->managed_pages +
					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
				KSWAPD_ZONE_BALANCE_GAP_RATIO);
			/*
			 * Kswapd reclaims only single pages with compaction
			 * enabled. Trying too hard to reclaim until contiguous
			 * free pages have become available can hurt performance
			 * by evicting too much useful data from memory.
			 * Do not reclaim more than needed for compaction.
			 */
			testorder = order;
			if (IS_ENABLED(CONFIG_COMPACTION) && order &&
					compaction_suitable(zone, order) !=
						COMPACT_SKIPPED)
				testorder = 0;

			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
			    !zone_balanced(zone, testorder,
					   balance_gap, end_zone)) {
				shrink_zone(zone, &sc);

				reclaim_state->reclaimed_slab = 0;
				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
			}

			/*
			 * If we're getting trouble reclaiming, start doing
			 * writepage even in laptop mode.
			 */
			if (sc.priority < DEF_PRIORITY - 2)
				sc.may_writepage = 1;

			if (!zone_reclaimable(zone)) {
				if (end_zone && end_zone == i)
					end_zone--;
				continue;
			}

			if (zone_balanced(zone, testorder, 0, end_zone))
				/*
				 * If a zone reaches its high watermark,
				 * consider it to be no longer congested. It's
				 * possible there are dirty pages backed by
				 * congested BDIs but as pressure is relieved,
				 * speculatively avoid congestion waits
				 */
				zone_clear_flag(zone, ZONE_CONGESTED);
		}

		/*
		 * If the low watermark is met there is no need for processes
		 * to be throttled on pfmemalloc_wait as they should not be
		 * able to safely make forward progress. Wake them
		 */
		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
				pfmemalloc_watermark_ok(pgdat))
			wake_up(&pgdat->pfmemalloc_wait);

		if (pgdat_balanced(pgdat, order, *classzone_idx)) {
			pgdat_is_balanced = true;
			break;		/* kswapd: all done */
		}

		/*
		 * We do this so kswapd doesn't build up large priorities for
		 * example when it is freeing in parallel with allocators. It
		 * matches the direct reclaim path behaviour in terms of impact
		 * on zone->*_priority.
		 */
		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
			break;
	} while (--sc.priority >= 0);

out:
	if (!pgdat_is_balanced) {
		cond_resched();

		try_to_freeze();

		/*
		 * Fragmentation may mean that the system cannot be
		 * rebalanced for high-order allocations in all zones.
		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
		 * it means the zones have been fully scanned and are still
		 * not balanced. For high-order allocations, there is
		 * little point trying all over again as kswapd may
		 * infinite loop.
		 *
		 * Instead, recheck all watermarks at order-0 as they
		 * are the most important. If watermarks are ok, kswapd will go
		 * back to sleep. High-order users can still perform direct
		 * reclaim if they wish.
		 */
		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
			order = sc.order = 0;

		goto loop_again;
	}

	/*
	 * If kswapd was reclaiming at a higher order, it has the option of
	 * sleeping without all zones being balanced. Before it does, it must
	 * ensure that the watermarks for order-0 on *all* zones are met and
	 * that the congestion flags are cleared. The congestion flag must
	 * be cleared as kswapd is the only mechanism that clears the flag
	 * and it is potentially going to sleep here.
	 */
	if (order) {
		int zones_need_compaction = 1;

		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;

			if (!populated_zone(zone))
				continue;

			/* Check if the memory needs to be defragmented. */
			if (zone_watermark_ok(zone, order,
				    low_wmark_pages(zone), *classzone_idx, 0))
				zones_need_compaction = 0;
		}

		if (zones_need_compaction)
			compact_pgdat(pgdat, order);
	}

	/*
	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
	 * makes a decision on the order we were last reclaiming at. However,
	 * if another caller entered the allocator slow path while kswapd
	 * was awake, order will remain at the higher level
	 */
	*classzone_idx = end_zone;
	return order;
}

conclusion
This post discusses balance_pgdat() and gives its simplified code flow. The order returned by balance_pgdat() indicates whether the rebalance succeeded. balance_pgdat() also returns classzone_idx; if classzone_idx is 2, then balance_pgdat() shrank the dma, normal, and highmem zones. The loop in balance_pgdat() repeats until pgdat_balanced() returns true, but pgdat_balanced() might only succeed because order was reset to 0 after a failed high-order rebalance.

kernel: mm: kswapd

December 6, 2015

This post discusses kswapd().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call flow of background reclaim

kswapd()
   -> kswapd_try_to_sleep()
      -> prepare_to_wait()
      -> prepare_kswapd_sleep()
      -> prepare_kswapd_sleep()
      -> finish_wait()
   -> try_to_freeze()
   -> balance_pgdat()
      -> shrink_zone()
         -> shrink_lruvec()
      -> shrink_slab()

kswapd()
The kswapd thread enters kswapd() after it is forked. It can allocate pages without watermark checks because PF_MEMALLOC is set, and it can be frozen during system suspend because it calls set_freezable().

If the rebalance fails, (new_order, new_classzone_idx) is not refreshed from (pgdat->kswapd_max_order, pgdat->classzone_idx). As a result, kswapd falls into the else branch and always calls kswapd_try_to_sleep() to try to sleep.

If the rebalance succeeds, (new_order, new_classzone_idx) is refreshed. If the new request has a higher order or a tighter (lower) classzone_idx than the one just balanced, kswapd does not sleep and calls balance_pgdat() again directly.
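
The call flow above lists prepare_kswapd_sleep() twice because kswapd_try_to_sleep() first naps for a short interval and only then sleeps indefinitely. A simplified sketch of the 3.10-era kswapd_try_to_sleep() (not verbatim):

kswapd_try_to_sleep(pgdat, order, classzone_idx)
{
    prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

    /* first check: take a short nap (HZ/10) if the node looks balanced */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
        remaining = schedule_timeout(HZ/10);
        finish_wait(&pgdat->kswapd_wait, &wait);
        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
    }

    /* second check: if it still looks balanced, sleep until woken up */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx))
        schedule();
    else if (remaining)
        count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
    else
        count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);

    finish_wait(&pgdat->kswapd_wait, &wait);
}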

Simplified kswapd code flow:

    for (;;) {
        if (previous rebalance satisfied the request) {
            (new_order, new_classzone_idx) = (pgdat->kswapd_max_order, pgdat->classzone_idx);
            (pgdat->kswapd_max_order, pgdat->classzone_idx) = (0, pgdat->nr_zones - 1);
        }

        if ((new_order, new_classzone_idx) is harder than (order, classzone_idx)) {
            (order, classzone_idx) = (new_order, new_classzone_idx);
        } else {
            kswapd_try_to_sleep();
            (new_order, new_classzone_idx) = (order, classzone_idx) = (pgdat->kswapd_max_order, pgdat->classzone_idx);
            (pgdat->kswapd_max_order, pgdat->classzone_idx) = (0, pgdat->nr_zones - 1);
        }

        if (system enters suspend)
            try_to_freeze();

        if (just returned from the frozen state)
            continue;   /* skip balance_pgdat() to speed up resume while user space is being thawed */

        (balanced_order, balanced_classzone_idx) = balance_pgdat();
    }

/*
 * The background pageout daemon, started as a kernel thread
 * from the init process.
 *
 * This basically trickles out pages so that we have _some_
 * free memory available even if there is no other activity
 * that frees anything up. This is needed for things like routing
 * etc, where we otherwise might have all activity going on in
 * asynchronous contexts that cannot page things out.
 *
 * If there are applications that are active memory-allocators
 * (most normal use), this basically shouldn't matter.
 */
static int kswapd(void *p)
{
	unsigned long order, new_order;
	unsigned balanced_order;
	int classzone_idx, new_classzone_idx;
	int balanced_classzone_idx;
	pg_data_t *pgdat = (pg_data_t*)p;
	struct task_struct *tsk = current;

	struct reclaim_state reclaim_state = {
		.reclaimed_slab = 0,
	};
	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

	lockdep_set_current_reclaim_state(GFP_KERNEL);

	if (kswapd_cpu_mask == NULL && !cpumask_empty(cpumask))
		set_cpus_allowed_ptr(tsk, cpumask);
	current->reclaim_state = &reclaim_state;

	/*
	 * Tell the memory management that we're a "memory allocator",
	 * and that if we need more memory we should get access to it
	 * regardless (see "__alloc_pages()"). "kswapd" should
	 * never get caught in the normal page freeing logic.
	 *
	 * (Kswapd normally doesn't need memory anyway, but sometimes
	 * you need a small amount of memory in order to be able to
	 * page out something else, and this flag essentially protects
	 * us from recursively trying to free more memory as we're
	 * trying to free the first piece of memory in the first place).
	 */
	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
	set_freezable();

	order = new_order = 0;
	balanced_order = 0;
	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
	balanced_classzone_idx = classzone_idx;
	for ( ; ; ) {
		bool ret;

		/*
		 * If the last balance_pgdat was unsuccessful it's unlikely a
		 * new request of a similar or harder type will succeed soon
		 * so consider going to sleep on the basis we reclaimed at
		 */
		if (balanced_classzone_idx >= new_classzone_idx &&
					balanced_order == new_order) {
			new_order = pgdat->kswapd_max_order;
			new_classzone_idx = pgdat->classzone_idx;
			pgdat->kswapd_max_order =  0;
			pgdat->classzone_idx = pgdat->nr_zones - 1;
		}

		if (order < new_order || classzone_idx > new_classzone_idx) {
			/*
			 * Don't sleep if someone wants a larger 'order'
			 * allocation or has tigher zone constraints
			 */
			order = new_order;
			classzone_idx = new_classzone_idx;
		} else {
			kswapd_try_to_sleep(pgdat, balanced_order,
						balanced_classzone_idx);
			order = pgdat->kswapd_max_order;
			classzone_idx = pgdat->classzone_idx;
			new_order = order;
			new_classzone_idx = classzone_idx;
			pgdat->kswapd_max_order = 0;
			pgdat->classzone_idx = pgdat->nr_zones - 1;
		}

		ret = try_to_freeze();
		if (kthread_should_stop())
			break;

		/*
		 * We can speed up thawing tasks if we don't call balance_pgdat
		 * after returning from the refrigerator
		 */
		if (!ret) {
			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
			balanced_classzone_idx = classzone_idx;
			balanced_order = balance_pgdat(pgdat, order,
						&balanced_classzone_idx);
		}
	}

	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
	current->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();

	return 0;
}

conclusion
This post discusses kswapd() and explains its simple code flow.

kernel: mm: wakeup_kswapd

December 6, 2015

This post discusses wakeup_kswapd().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack of wakeup_kswapd

__alloc_pages_nodemask()
-> get_page_from_freelist()
-> __alloc_pages_slowpath()
   -> wake_all_kswapd()
      -> wakeup_kswapd()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> should_alloc_retry()

what is kswapd
kswapd is a kernel thread that reclaims pages in the background. Reclaim done by kswapd is called background reclaim; it is preferred over direct reclaim, in which the allocating process reclaims pages itself and suffers long allocation latency.

how many kswapd are there in a system
At init stage, each node whose state is N_MEMORY gets a kswapd daemon. The kswapd of node 0 is kswapd0, the kswapd of node 1 is kswapd1, and so on. Each kswapd thread is created by kswapd_run() via kthread_run().

/*
 * This kswapd start function will be called by init and node-hot-add.
 * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
 */
int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int ret = 0;

	if (pgdat->kswapd)
		return 0;

	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
	if (IS_ERR(pgdat->kswapd)) {
		/* failure at boot is fatal */
		BUG_ON(system_state == SYSTEM_BOOTING);
		pr_err("Failed to start kswapd on node %d\n", nid);
		ret = PTR_ERR(pgdat->kswapd);
		pgdat->kswapd = NULL;
	} else if (kswapd_cpu_mask) {
		if (set_kswapd_cpu_mask(pgdat))
			pr_warn("error setting kswapd cpu affinity mask\n");
	}
	return ret;
}

/*
 * Called by memory hotplug when all memory in a node is offlined.  Caller must
 * hold lock_memory_hotplug().
 */
void kswapd_stop(int nid)
{
	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

	if (kswapd) {
		kthread_stop(kswapd);
		NODE_DATA(nid)->kswapd = NULL;
	}
}

static int __init kswapd_init(void)
{
	int nid;

	swap_setup();
	for_each_node_state(nid, N_MEMORY)
 		kswapd_run(nid);
	if (kswapd_cpu_mask == NULL)
		hotcpu_notifier(cpu_callback, 0);
	return 0;
}

module_init(kswapd_init)

when is kswapd woken up
If a thread fails to allocate pages under the low watermark check, it enters the allocation slowpath. In the slowpath, the thread first wakes up kswapd, then retries the allocation under the min watermark check. If it still fails to get a page from the freelist, it enters direct reclaim and reclaims pages itself.

It’s better to let kswapd reclaim in the background than to rely on direct reclaim. A thread doing direct reclaim in the allocation slowpath can hurt system responsiveness if it is the main thread of an application, a kthread, or a thread holding a resource, such as a mutex, that other threads are waiting for.

To make kswapd's background reclaim kick in well before direct reclaim, we could widen the gap between the low watermark and the min watermark. A thread can still allocate pages while the min watermark is satisfied, but it wakes up kswapd as soon as the low watermark is not satisfied. If the gap is 100 MB, then when kswapd is woken up there are still roughly 100 MB of usable free pages before direct reclaim can happen.
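
For reference, the stock kernel derives all three watermarks from min_free_kbytes in __setup_per_zone_wmarks(); the sketch below is simplified (zone_min_pages() is a hypothetical helper standing in for the zone's share of min_free_kbytes), and many Android trees additionally carry an extra_free_kbytes knob precisely to widen this gap:

/* simplified sketch of __setup_per_zone_wmarks() */
unsigned long min = zone_min_pages(zone);       /* hypothetical helper */

zone->watermark[WMARK_MIN]  = min;
zone->watermark[WMARK_LOW]  = min + (min >> 2); /* min + 25% */
zone->watermark[WMARK_HIGH] = min + (min >> 1); /* min + 50% */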

wake_all_kswapd() and wakeup_kswapd()
wake_all_kswapd() traverses the zonelist downward from high_zoneidx, which is gfp_zone(gfp_mask) and indicates the highest zone index that satisfies the caller's request.

static inline
void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
						enum zone_type high_zoneidx,
						enum zone_type classzone_idx)
{
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
		wakeup_kswapd(zone, order, classzone_idx);
}

Walking the zonelist from high_zoneidx, if a zone is populated and its low watermark check is not satisfied, wakeup_kswapd() wakes up the kswapd of the zone's node. The low watermark check here doesn't take lowmem_reserve into account, because zone_watermark_ok_safe() is called with classzone_idx = 0 and zone->lowmem_reserve[0] is always 0.

Whether or not the low watermark check is satisfied, wakeup_kswapd() always tries to update kswapd_max_order and classzone_idx of the zone's node. If kswapd_max_order is 5 and classzone_idx is 2, it means some thread wanted an order-5 page with highmem as its highest usable zone but failed to get one anywhere down the zonelist.

/*
 * A zone is low on free memory, so wake its kswapd task to service it.
 */
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
	pg_data_t *pgdat;

	if (!populated_zone(zone))
		return;

	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
		return;
	pgdat = zone->zone_pgdat;
	if (pgdat->kswapd_max_order < order) {
		pgdat->kswapd_max_order = order;
		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
	}
	if (!waitqueue_active(&pgdat->kswapd_wait))
		return;
	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
		return;

	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
	wake_up_interruptible(&pgdat->kswapd_wait);
}

conclusion
This post discusses wakeup_kswapd(). It shows when kswapd is woken up and under what conditions it reclaims pages in the background.

patch discussion: mm: vmscan: rework compaction-ready signaling in direct reclaim

December 5, 2015

This post discusses mm: vmscan: rework compaction-ready signaling in direct reclaim.

merge time
v3.17

call flow of direct reclaim
kswapd also reaches shrink_zone(), but through balance_pgdat() rather than do_try_to_free_pages().

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                           -> shrink_active_list()
                           -> shrink_inactive_list()
                              -> shrink_page_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

do_try_to_free_pages() and shrink_zones() in v3.16
do_try_to_free_pages() keeps calling shrink_zones() until sc->nr_reclaimed >= sc->nr_to_reclaim, sc->priority drops below 0, or shrink_zones() returns true. The value returned by shrink_zones() is aborted_reclaim, which means shrink_zones() skipped at least one zone along the zonelist because that zone is ready for compaction.

Even if do_try_to_free_pages() couldn't reclaim any pages after repeatedly calling shrink_zones(), it doesn't return 0 when aborted_reclaim is true. This helps avoid triggering the OOM killer while compaction could still satisfy the allocation.

effects of this patch in v3.17
This patch encodes the compaction-ready information into scan_control. Therefore, shrink_zones() no longer needs to return aborted_reclaim.
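
A rough sketch of the shape of the change (the field name compaction_ready is my reading of the upstream patch; treat this as an illustration rather than the exact diff):

struct scan_control {
        ...
        /* set when at least one zone was skipped because it is ready for compaction */
        bool compaction_ready;
};

/* in shrink_zones(), instead of accumulating a return value: */
if (compaction_ready(zone, sc->order)) {
        sc->compaction_ready = true;
        continue;       /* skip this zone and let compaction handle it */
}

/* in do_try_to_free_pages(): */
if (sc->compaction_ready)
        return 1;       /* not a total failure; don't trigger the OOM killer */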

Another patch, mm: vmscan: remove all_unreclaimable(), also merged in v3.17, changes the return type of shrink_zones() back to bool, but the returned value now means reclaimable rather than aborted_reclaim.

conclusion
This post discusses mm: vmscan: rework compaction-ready signaling in direct reclaim which makes do_try_to_free_pages() more readable although it doesn’t change any logic.

kernel: mm: shrink_inactive_list

December 5, 2015

This post discusses shrink_inactive_list().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                           -> shrink_active_list()
                           -> shrink_inactive_list()
                              -> shrink_page_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

shrink_lruvec and shrink_inactive_list()
shrink_lruvec() calls get_scan_count() to evaluate how many pages to scan for each lru list: nr[0] is for inactive_anon, nr[1] is for active_anon, nr[2] is for inactive_file, and nr[3] is for active_file. If any of nr[0], nr[2], nr[3] is not zero, it calls shrink_list() for each evictable lru list. If the lru list is inactive, shrink_list() calls shrink_inactive_list() to shrink it. If the lru list is active, shrink_list() calls shrink_active_list() only if the inactive list is low. The inactive_anon list is low if zone->nr_inactive_anon * zone->inactive_ratio < zone->nr_active_anon. The inactive_file list is low if zone->nr_inactive_file < zone->nr_active_file.
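
The two "low" checks map to inactive_anon_is_low() and inactive_file_is_low(); with memcg disabled they boil down to roughly the following global helpers (a simplified sketch, not verbatim):

static int inactive_anon_is_low_global(struct zone *zone)
{
        unsigned long active   = zone_page_state(zone, NR_ACTIVE_ANON);
        unsigned long inactive = zone_page_state(zone, NR_INACTIVE_ANON);

        /* low when inactive is less than active / inactive_ratio */
        return inactive * zone->inactive_ratio < active;
}

static int inactive_file_is_low_global(struct zone *zone)
{
        unsigned long active   = zone_page_state(zone, NR_ACTIVE_FILE);
        unsigned long inactive = zone_page_state(zone, NR_INACTIVE_FILE);

        return inactive < active;
}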

what does shrink_inactive_list() do
shrink_inactive_list() calls isolate_lru_pages() to isolate some pages from the inactive_anon or inactive_file list into a local list, page_list. Then it calls shrink_page_list() to shrink page_list.

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
		     struct scan_control *sc, enum lru_list lru)
{
	LIST_HEAD(page_list);
	unsigned long nr_scanned;
	unsigned long nr_reclaimed = 0;
	unsigned long nr_taken;
	unsigned long nr_dirty = 0;
	unsigned long nr_writeback = 0;
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct zone *zone = lruvec_zone(lruvec);
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;
	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

	spin_lock_irq(&zone->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
				     &nr_scanned, sc, isolate_mode, lru);

	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);

	if (global_reclaim(sc)) {
		zone->pages_scanned += nr_scanned;
		if (current_is_kswapd())
			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
		else
			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
	}
	spin_unlock_irq(&zone->lru_lock);

	if (nr_taken == 0)
		return 0;

	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
					&nr_dirty, &nr_writeback, false);

	spin_lock_irq(&zone->lru_lock);

	reclaim_stat->recent_scanned[file] += nr_taken;

	if (global_reclaim(sc)) {
		if (current_is_kswapd())
			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
					       nr_reclaimed);
		else
			__count_zone_vm_events(PGSTEAL_DIRECT, zone,
					       nr_reclaimed);
	}

	putback_inactive_pages(lruvec, &page_list);

	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);

	spin_unlock_irq(&zone->lru_lock);

	free_hot_cold_page_list(&page_list, 1);

	/*
	 * If reclaim is isolating dirty pages under writeback, it implies
	 * that the long-lived page allocation rate is exceeding the page
	 * laundering rate. Either the global limits are not being effective
	 * at throttling processes due to the page distribution throughout
	 * zones or there is heavy usage of a slow backing device. The
	 * only option is to throttle from reclaim context which is not ideal
	 * as there is no guarantee the dirtying process is throttled in the
	 * same way balance_dirty_pages() manages.
	 *
	 * This scales the number of dirty pages that must be under writeback
	 * before throttling depending on priority. It is a simple backoff
	 * function that has the most effect in the range DEF_PRIORITY to
	 * DEF_PRIORITY-2 which is the priority reclaim is considered to be
	 * in trouble and reclaim is considered to be in trouble.
	 *
	 * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
	 * DEF_PRIORITY-1  50% must be PageWriteback
	 * DEF_PRIORITY-2  25% must be PageWriteback, kswapd in trouble
	 * ...
	 * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
	 *                     isolated page is PageWriteback
	 */
	if (nr_writeback && nr_writeback >=
			(nr_taken >> (DEF_PRIORITY - sc->priority)))
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
		zone_idx(zone),
		nr_scanned, nr_reclaimed,
		sc->priority,
		trace_shrink_flags(file));
	return nr_reclaimed;
}

/proc/vmstat and shrink_inactive_list()
When shrink_inactive_list() isolates lru pages from inactive_anon or inactive_file into a local list called page_list, the number of scanned lru pages is accounted in /proc/vmstat.

If the caller is kswapd in zone dma, then /proc/vmstat/pgscan_kswapd_dma is increased.
If the caller is kswapd in zone normal, then /proc/vmstat/pgscan_kswapd_normal is increased.
If the caller is kswapd in zone movable, then /proc/vmstat/pgscan_kswapd_movable is increased.
If the caller is direct reclaim in zone dma, then /proc/vmstat/pgscan_direct_dma is increased.
If the caller is direct reclaim in zone normal, then /proc/vmstat/pgscan_direct_normal is increased.
If the caller is direct reclaim in zone movable, then /proc/vmstat/pgscan_direct_movable is increased.

When shrink_inactive_list() calls shrink_page_list() to reclaim the isolated pages in page_list, the number of reclaimed pages is accounted in /proc/vmstat.

If the caller is kswapd in zone dma, then /proc/vmstat/pgsteal_kswapd_dma is increased.
If the caller is kswapd in zone normal, then /proc/vmstat/pgsteal_kswapd_normal is increased.
If the caller is kswapd in zone movable, then /proc/vmstat/pgsteal_kswapd_movable is increased.
If the caller is direct reclaim in zone dma, then /proc/vmstat/pgsteal_direct_dma is increased.
If the caller is direct reclaim in zone normal, then /proc/vmstat/pgsteal_direct_normal is increased.
If the caller is direct reclaim in zone movable, then /proc/vmstat/pgsteal_direct_movable is increased.

------ VIRTUAL MEMORY STATS (/proc/vmstat) ------
nr_free_pages 30067
nr_inactive_anon 5424
nr_active_anon 338576
nr_inactive_file 59481
nr_active_file 58591
nr_unevictable 18981
nr_mlock 18017
nr_anon_pages 337893
nr_mapped 116348
nr_file_pages 143209
nr_dirty 26
nr_writeback 72
nr_slab_reclaimable 19690
nr_slab_unreclaimable 25018
nr_page_table_pages 11961
nr_kernel_stack 3018
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 6168
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 6159
nr_dirtied 11523918
nr_written 11270195
nr_anon_transparent_hugepages 0
nr_free_cma 4287
nr_dirty_threshold 6442
nr_dirty_background_threshold 1288
pgpgin 790917801
pgpgout 75442400
pswpin 0
pswpout 0
pgalloc_dma 1373663299
pgalloc_normal 0
pgalloc_movable 0
pgfree 1375463846
pgactivate 117565768
pgdeactivate 52661597
pgfault 2722674372
pgmajfault 7010162
pgrefill_dma 89437883
pgrefill_normal 0
pgrefill_movable 0
pgsteal_kswapd_dma 186996602
pgsteal_kswapd_normal 0
pgsteal_kswapd_movable 0
pgsteal_direct_dma 7400060
pgsteal_direct_normal 0
pgsteal_direct_movable 0
pgscan_kswapd_dma 228700634
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 9064264
pgscan_direct_normal 0
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 568
slabs_scanned 197934252
kswapd_inodesteal 7122334
kswapd_low_wmark_hit_quickly 147149
kswapd_high_wmark_hit_quickly 77589
pageoutrun 323211
allocstall 156820
pgrotated 8807
pgmigrate_success 1698508
pgmigrate_fail 253
compact_migrate_scanned 24318527
compact_free_scanned 988378858
compact_isolated 3727646
compact_stall 15834
compact_fail 10133
compact_success 4497
unevictable_pgs_culled 42315
unevictable_pgs_scanned 0
unevictable_pgs_rescued 23334
unevictable_pgs_mlocked 46701
unevictable_pgs_munlocked 28684
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
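
From the sample above, reclaim efficiency can be estimated as pgsteal / pgscan: kswapd reclaimed 186996602 of the 228700634 pages it scanned in zone dma (about 82%), and direct reclaim reclaimed 7400060 of the 9064264 pages it scanned (also about 82%).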

conclusion
This post discusses when shrink_inactive_list() is called by shrink_lruvec(), what it does, and how /proc/vmstat is accounted while isolating lru pages and calling shrink_page_list().

kernel: mm: shrink_active_list

December 5, 2015

This post discusses shrink_active_list().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                           -> shrink_active_list()
                           -> shrink_inactive_list()
                              -> shrink_page_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

shrink_lruvec and shrink_active_list()
shrink_lruvec() calls get_scan_count() to evaluate how many pages to scan for each lru list: nr[0] is for inactive_anon, nr[1] is for active_anon, nr[2] is for inactive_file, and nr[3] is for active_file. If any of nr[0], nr[2], nr[3] is not zero, it calls shrink_list() for each evictable lru list. If the lru list is inactive, shrink_list() calls shrink_inactive_list() to shrink it. If the lru list is active, shrink_list() calls shrink_active_list() only if the inactive list is low. The inactive_anon list is low if zone->nr_inactive_anon * zone->inactive_ratio < zone->nr_active_anon. The inactive_file list is low if zone->nr_inactive_file < zone->nr_active_file.

what does shrink_active_list do
shrink_active_list() isolates pages from an active lru list, i.e., active_anon or active_file. The isolated pages are put into the local list l_hold. For each page in l_hold: if it is unevictable, it is put back to the appropriate lru by putback_lru_page(); if it is referenced and is the file cache of an executable (VM_EXEC), it is put back to the head of the original active lru; otherwise, it is moved to the corresponding inactive list.

static void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)
{
	unsigned long nr_taken;
	unsigned long nr_scanned;
	unsigned long vm_flags;
	LIST_HEAD(l_hold);	/* The pages which were snipped off */
	LIST_HEAD(l_active);
	LIST_HEAD(l_inactive);
	struct page *page;
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
	unsigned long nr_rotated = 0;
	isolate_mode_t isolate_mode = 0;
	int file = is_file_lru(lru);
	struct zone *zone = lruvec_zone(lruvec);

	lru_add_drain();

	if (!sc->may_unmap)
		isolate_mode |= ISOLATE_UNMAPPED;
	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

	spin_lock_irq(&zone->lru_lock);

	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
				     &nr_scanned, sc, isolate_mode, lru);
	if (global_reclaim(sc))
		zone->pages_scanned += nr_scanned;

	reclaim_stat->recent_scanned[file] += nr_taken;

	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
	spin_unlock_irq(&zone->lru_lock);

	while (!list_empty(&l_hold)) {
		cond_resched();
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (unlikely(!page_evictable(page))) {
			putback_lru_page(page);
			continue;
		}

		if (unlikely(buffer_heads_over_limit)) {
			if (page_has_private(page) && trylock_page(page)) {
				if (page_has_private(page))
					try_to_release_page(page, 0);
				unlock_page(page);
			}
		}

		if (page_referenced(page, 0, sc->target_mem_cgroup,
				    &vm_flags)) {
			nr_rotated += hpage_nr_pages(page);
			/*
			 * Identify referenced, file-backed active pages and
			 * give them one more trip around the active list. So
			 * that executable code get better chances to stay in
			 * memory under moderate memory pressure.  Anon pages
			 * are not likely to be evicted by use-once streaming
			 * IO, plus JVM can create lots of anon VM_EXEC pages,
			 * so we ignore them here.
			 */
			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
				list_add(&page->lru, &l_active);
				continue;
			}
		}

		ClearPageActive(page);	/* we are de-activating */
		list_add(&page->lru, &l_inactive);
	}

	/*
	 * Move pages back to the lru list.
	 */
	spin_lock_irq(&zone->lru_lock);
	/*
	 * Count referenced pages from currently used mappings as rotated,
	 * even though only some of them are actually re-activated.  This
	 * helps balance scan pressure between file and anonymous pages in
	 * get_scan_ratio.
	 */
	reclaim_stat->recent_rotated[file] += nr_rotated;

	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
	spin_unlock_irq(&zone->lru_lock);

	free_hot_cold_page_list(&l_hold, 1);
}

conclusion
This post discusses shrink_active_list().

kernel: mm: shrink_page_list

December 5, 2015

This post discusses shrink_page_list().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                           -> shrink_active_list()
                           -> shrink_inactive_list()
                              -> shrink_page_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

what does shrink_page_list() do
It's the final function in both direct and background reclaim that actually reclaims pages.

Input and output of this function.

input:
struct list_head *page_list: a list of isolated pages ready to be reclaimed
struct zone *zone: the zone in which these pages live
struct scan_control *sc: initialised by try_to_free_pages() and passed down the reclaim flow
output:
unsigned long *ret_nr_dirty: the number of dirty pages encountered
unsigned long *ret_nr_writeback: the number of pages under writeback
return value (unsigned long): nr_reclaimed, the number of reclaimed pages

Simplified code flow:

unsigned long shrink_page_list()
{
    while (!list_empty(page_list)) {
        page = lru_to_page(page_list);
        list_del(&page->lru);
        sc->nr_scanned++;

        if (page is still referenced)
            continue;

        if (page is mapped)
            unmap this page; continue if unmapping fails;

        if (page is dirty)
            call pageout() to write it back; continue on failure or if the page is still under writeback;

        if (page has buffers)
            release them; continue on failure;

        /* finally, we can reclaim the page */
        __clear_page_locked(page);
        nr_reclaimed++;
        list_add(&page->lru, &free_pages);
    }
}
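
The "page is still referenced" step corresponds to page_check_references() in the code below. Roughly (a condensed sketch, not the verbatim 3.10 function):

/* sketch: how page_check_references() decides a page's fate */
referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, &vm_flags);
referenced_page = TestClearPageReferenced(page);

if (referenced_ptes) {
        if (PageSwapBacked(page))
                return PAGEREF_ACTIVATE;        /* recently used anon page */
        SetPageReferenced(page);
        if (referenced_page || referenced_ptes > 1)
                return PAGEREF_ACTIVATE;        /* file page used more than once */
        if (vm_flags & VM_EXEC)
                return PAGEREF_ACTIVATE;        /* executable file cache */
        return PAGEREF_KEEP;                    /* one more trip around the inactive list */
}

if (referenced_page && !PageSwapCache(page))
        return PAGEREF_RECLAIM_CLEAN;           /* reclaim only if it is clean */

return PAGEREF_RECLAIM;
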
/*
 * shrink_page_list() returns the number of reclaimed pages
 */
static unsigned long shrink_page_list(struct list_head *page_list,
				      struct zone *zone,
				      struct scan_control *sc,
				      enum ttu_flags ttu_flags,
				      unsigned long *ret_nr_dirty,
				      unsigned long *ret_nr_writeback,
				      bool force_reclaim)
{
	LIST_HEAD(ret_pages);
	LIST_HEAD(free_pages);
	int pgactivate = 0;
	unsigned long nr_dirty = 0;
	unsigned long nr_congested = 0;
	unsigned long nr_reclaimed = 0;
	unsigned long nr_writeback = 0;

	cond_resched();

	mem_cgroup_uncharge_start();
	while (!list_empty(page_list)) {
		struct address_space *mapping;
		struct page *page;
		int may_enter_fs;
		enum page_references references = PAGEREF_RECLAIM_CLEAN;

		cond_resched();

		page = lru_to_page(page_list);
		list_del(&page->lru);

		if (!trylock_page(page))
			goto keep;

		VM_BUG_ON(PageActive(page));
		VM_BUG_ON(page_zone(page) != zone);

		sc->nr_scanned++;

		if (unlikely(!page_evictable(page)))
			goto cull_mlocked;

		if (!sc->may_unmap && page_mapped(page))
			goto keep_locked;

		/* Double the slab pressure for mapped and swapcache pages */
		if (page_mapped(page) || PageSwapCache(page))
			sc->nr_scanned++;

		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

		if (PageWriteback(page)) {
			/*
			 * memcg doesn't have any dirty pages throttling so we
			 * could easily OOM just because too many pages are in
			 * writeback and there is nothing else to reclaim.
			 *
			 * Check __GFP_IO, certainly because a loop driver
			 * thread might enter reclaim, and deadlock if it waits
			 * on a page for which it is needed to do the write
			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
			 * but more thought would probably show more reasons.
			 *
			 * Don't require __GFP_FS, since we're not going into
			 * the FS, just waiting on its writeback completion.
			 * Worryingly, ext4 gfs2 and xfs allocate pages with
			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
			 * testing may_enter_fs here is liable to OOM on them.
			 */
			if (global_reclaim(sc) ||
			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
				/*
				 * This is slightly racy - end_page_writeback()
				 * might have just cleared PageReclaim, then
				 * setting PageReclaim here end up interpreted
				 * as PageReadahead - but that does not matter
				 * enough to care.  What we do want is for this
				 * page to have PageReclaim set next time memcg
				 * reclaim reaches the tests above, so it will
				 * then wait_on_page_writeback() to avoid OOM;
				 * and it's also appropriate in global reclaim.
				 */
				SetPageReclaim(page);
				nr_writeback++;
				goto keep_locked;
			}
			wait_on_page_writeback(page);
		}

		if (!force_reclaim)
			references = page_check_references(page, sc);

		switch (references) {
		case PAGEREF_ACTIVATE:
			goto activate_locked;
		case PAGEREF_KEEP:
			goto keep_locked;
		case PAGEREF_RECLAIM:
		case PAGEREF_RECLAIM_CLEAN:
			; /* try to reclaim the page below */
		}

		/*
		 * Anonymous process memory has backing store?
		 * Try to allocate it some swap space here.
		 */
		if (PageAnon(page) && !PageSwapCache(page)) {
			if (!(sc->gfp_mask & __GFP_IO))
				goto keep_locked;
			if (!add_to_swap(page, page_list))
				goto activate_locked;
			may_enter_fs = 1;
		}

		mapping = page_mapping(page);

		/*
		 * The page is mapped into the page tables of one or more
		 * processes. Try to unmap it here.
		 */
		if (page_mapped(page) && mapping) {
			switch (try_to_unmap(page, ttu_flags)) {
			case SWAP_FAIL:
				goto activate_locked;
			case SWAP_AGAIN:
				goto keep_locked;
			case SWAP_MLOCK:
				goto cull_mlocked;
			case SWAP_SUCCESS:
				; /* try to free the page below */
			}
		}

		if (PageDirty(page)) {
			nr_dirty++;

			/*
			 * Only kswapd can writeback filesystem pages to
			 * avoid risk of stack overflow but do not writeback
			 * unless under significant pressure.
			 */
			if (page_is_file_cache(page) &&
					(!current_is_kswapd() ||
					 sc->priority >= DEF_PRIORITY - 2)) {
				/*
				 * Immediately reclaim when written back.
				 * Similar in principal to deactivate_page()
				 * except we already have the page isolated
				 * and know it's dirty
				 */
				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
				SetPageReclaim(page);

				goto keep_locked;
			}

			if (references == PAGEREF_RECLAIM_CLEAN)
				goto keep_locked;
			if (!may_enter_fs)
				goto keep_locked;
			if (!sc->may_writepage)
				goto keep_locked;

			/* Page is dirty, try to write it out here */
			switch (pageout(page, mapping, sc)) {
			case PAGE_KEEP:
				nr_congested++;
				goto keep_locked;
			case PAGE_ACTIVATE:
				goto activate_locked;
			case PAGE_SUCCESS:
				if (PageWriteback(page))
					goto keep;
				if (PageDirty(page))
					goto keep;

				/*
				 * A synchronous write - probably a ramdisk.  Go
				 * ahead and try to reclaim the page.
				 */
				if (!trylock_page(page))
					goto keep;
				if (PageDirty(page) || PageWriteback(page))
					goto keep_locked;
				mapping = page_mapping(page);
			case PAGE_CLEAN:
				; /* try to free the page below */
			}
		}

		/*
		 * If the page has buffers, try to free the buffer mappings
		 * associated with this page. If we succeed we try to free
		 * the page as well.
		 *
		 * We do this even if the page is PageDirty().
		 * try_to_release_page() does not perform I/O, but it is
		 * possible for a page to have PageDirty set, but it is actually
		 * clean (all its buffers are clean).  This happens if the
		 * buffers were written out directly, with submit_bh(). ext3
		 * will do this, as well as the blockdev mapping.
		 * try_to_release_page() will discover that cleanness and will
		 * drop the buffers and mark the page clean - it can be freed.
		 *
		 * Rarely, pages can have buffers and no ->mapping.  These are
		 * the pages which were not successfully invalidated in
		 * truncate_complete_page().  We try to drop those buffers here
		 * and if that worked, and the page is no longer mapped into
		 * process address space (page_count == 1) it can be freed.
		 * Otherwise, leave the page on the LRU so it is swappable.
		 */
		if (page_has_private(page)) {
			if (!try_to_release_page(page, sc->gfp_mask))
				goto activate_locked;
			if (!mapping && page_count(page) == 1) {
				unlock_page(page);
				if (put_page_testzero(page))
					goto free_it;
				else {
					/*
					 * rare race with speculative reference.
					 * the speculative reference will free
					 * this page shortly, so we may
					 * increment nr_reclaimed here (and
					 * leave it off the LRU).
					 */
					nr_reclaimed++;
					continue;
				}
			}
		}

		if (!mapping || !__remove_mapping(mapping, page))
			goto keep_locked;

		/*
		 * At this point, we have no other references and there is
		 * no way to pick any more up (removed from LRU, removed
		 * from pagecache). Can use non-atomic bitops now (and
		 * we obviously don't have to worry about waking up a process
		 * waiting on the page lock, because there are no references.
		 */
		__clear_page_locked(page);
free_it:
		nr_reclaimed++;

		/*
		 * Is there need to periodically free_page_list? It would
		 * appear not as the counts should be low
		 */
		list_add(&page->lru, &free_pages);
		continue;

cull_mlocked:
		if (PageSwapCache(page))
			try_to_free_swap(page);
		unlock_page(page);
		putback_lru_page(page);
		continue;

activate_locked:
		/* Not a candidate for swapping, so reclaim swap space. */
		if (PageSwapCache(page) && vm_swap_full())
			try_to_free_swap(page);
		VM_BUG_ON(PageActive(page));
		SetPageActive(page);
		pgactivate++;
keep_locked:
		unlock_page(page);
keep:
		list_add(&page->lru, &ret_pages);
		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
	}

	/*
	 * Tag a zone as congested if all the dirty pages encountered were
	 * backed by a congested BDI. In this case, reclaimers should just
	 * back off and wait for congestion to clear because further reclaim
	 * will encounter the same problem
	 */
	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
		zone_set_flag(zone, ZONE_CONGESTED);

	free_hot_cold_page_list(&free_pages, 1);

	list_splice(&ret_pages, page_list);
	count_vm_events(PGACTIVATE, pgactivate);
	mem_cgroup_uncharge_end();
	*ret_nr_dirty += nr_dirty;
	*ret_nr_writeback += nr_writeback;
	return nr_reclaimed;
}

conclusion
This post discusses shrink_page_list(), the function that ultimately reclaims pages. An isolated page can't be reclaimed if it is still referenced. Before it can be reclaimed successfully, it might need to be unmapped and written back.

kernel: mm: shrink_list

December 4, 2015

This post discusses shrink_list().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

shrink_zone(), shrink_lruvec(), and shrink_list()
shrink_zone() calls shrink_lruvec() at least once and repeats it as long as should_continue_reclaim() returns true. shrink_lruvec() calls shrink_list() for each evictable lru.

Assume settings are as below.

zone->inactive_ratio = 4.
# CONFIG_MEMCG  is not set
CONFIG_SWAP is set
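
As an aside, zone->inactive_ratio is not a tunable; the stock kernel derives it from the zone size at boot, roughly as below (a simplified sketch of calculate_zone_inactive_ratio() in mm/page_alloc.c). The assumed ratio of 4 corresponds to a zone of roughly 2 GB, since int_sqrt(10 * 2) = 4.

static void calculate_zone_inactive_ratio(struct zone *zone)
{
        unsigned int gb, ratio;

        /* zone size in gigabytes */
        gb = zone->managed_pages >> (30 - PAGE_SHIFT);
        if (gb)
                ratio = int_sqrt(10 * gb);
        else
                ratio = 1;

        zone->inactive_ratio = ratio;
}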

Based on above settings, the workflow of how shrink_lruvec() calls shrink_list() is:

shrink_list() for zone->lruvec[LRU_INACTIVE_ANON].
It calls shrink_inactive_list() for inactive_anon.

shrink_list() for zone->lruvec[LRU_ACTIVE_ANON].
If zone->nr_inactive_anon * 4 < zone->nr_active_anon, then it calls shrink_active_list() for active_anon.

shrink_list() for zone->lruvec[LRU_INACTIVE_FILE].
It calls shrink_inactive_list() for inactive_file.

shrink_list() for zone->lruvec[LRU_ACTIVE_FILE].
If zone->nr_inactive_file < zone->nr_active_file, then it calls shrink_active_list() for active_file.

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
				 struct lruvec *lruvec, struct scan_control *sc)
{
	if (is_active_lru(lru)) {
		if (inactive_list_is_low(lruvec, lru))
			shrink_active_list(nr_to_scan, lruvec, sc, lru);
		return 0;
	}

	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}

conclusion
This post discusses how shrink_list() works.

kernel: mm: shrink_lruvec

December 4, 2015

This post discusses shrink_lruvec().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                        -> get_scan_count()
                        -> shrink_list()
                        -> shrink_active_list()
                        -> throttle_vm_writeout()
                     -> vmpressure()
                     -> should_continue_reclaim()

shrink_zone() and shrink_lruvec()
The major loop of direct reclaim is in do_try_to_free_pages(), which iterates priority from 12 down to 0. shrink_zone() calls shrink_lruvec() at least once and repeats it only if should_continue_reclaim() returns true. One necessary condition for should_continue_reclaim() to return true is that the reclaim is in reclaim/compaction mode (see the sketch after the list below).

  • If page order > 3, then the caller is always in reclaim/compaction state.
  • If 1 <= page order <= 3, then the caller is in reclaim/compaction state only when sc->priority < DEF_PRIORITY - 2.
  • If page order == 0, then the caller is never in reclaim/compaction state.
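
These three cases follow from in_reclaim_compaction(); a simplified sketch (not verbatim 3.10.49 code):

static bool in_reclaim_compaction(struct scan_control *sc)
{
        if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
            (sc->order > PAGE_ALLOC_COSTLY_ORDER ||     /* order > 3 */
             sc->priority < DEF_PRIORITY - 2))          /* or pressure is already high */
                return true;

        return false;
}
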
    code flow of shrink_lruvec()
    The simplified code flow is as below. get_scan_count() determines nr_to_scan for each lru list. shrink_lruvec() calls shrink_list() to shrink inactive_anon, active_anon, inactive_file, and active_file.

    get_scan_count()
    shrink_list()
    throttle_vm_writeout()
    
    /*
     * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
     */
    static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
    {
    	unsigned long nr[NR_LRU_LISTS];
    	unsigned long nr_to_scan;
    	enum lru_list lru;
    	unsigned long nr_reclaimed = 0;
    	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
    	struct blk_plug plug;
    
    	get_scan_count(lruvec, sc, nr);
    
    	blk_start_plug(&plug);
    	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
    					nr[LRU_INACTIVE_FILE]) {
    		for_each_evictable_lru(lru) {
    			if (nr[lru]) {
    				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
    				nr[lru] -= nr_to_scan;
    
    				nr_reclaimed += shrink_list(lru, nr_to_scan,
    							    lruvec, sc);
    			}
    		}
    		/*
    		 * On large memory systems, scan >> priority can become
    		 * really large. This is fine for the starting priority;
    		 * we want to put equal scanning pressure on each zone.
    		 * However, if the VM has a harder time of freeing pages,
    		 * with multiple processes reclaiming pages, the total
    		 * freeing target can get unreasonably large.
    		 */
    		if (nr_reclaimed >= nr_to_reclaim &&
    		    sc->priority < DEF_PRIORITY)
    			break;
    	}
    	blk_finish_plug(&plug);
    	sc->nr_reclaimed += nr_reclaimed;
    
    	/*
    	 * Even if we did not try to evict anon pages at all, we want to
    	 * rebalance the anon lru active/inactive ratio.
    	 */
    	if (inactive_anon_is_low(lruvec))
    		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
    				   sc, LRU_ACTIVE_ANON);
    
    	throttle_vm_writeout(sc->gfp_mask);
    }
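    To make the round-robin draining of nr[] concrete, here is a minimal user-space sketch (not kernel code; SWAP_CLUSTER_MAX, the lru names and the starting counts are only illustrative, and the nr_reclaimed early-break is omitted). It scans each lru in chunks of at most SWAP_CLUSTER_MAX until the inactive anon list and both file lists are drained, mirroring the loop condition of shrink_lruvec():

    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32UL

    enum { INACTIVE_ANON, ACTIVE_ANON, INACTIVE_FILE, ACTIVE_FILE, NR_LISTS };

    static const char *names[NR_LISTS] = {
    	"inactive_anon", "active_anon", "inactive_file", "active_file"
    };

    int main(void)
    {
    	/* hypothetical scan targets, as if filled in by get_scan_count() */
    	unsigned long nr[NR_LISTS] = { 100, 40, 250, 70 };

    	/* same loop condition as shrink_lruvec(): active_anon alone does not keep the loop going */
    	while (nr[INACTIVE_ANON] || nr[ACTIVE_FILE] || nr[INACTIVE_FILE]) {
    		for (int lru = 0; lru < NR_LISTS; lru++) {
    			if (!nr[lru])
    				continue;
    			unsigned long nr_to_scan =
    				nr[lru] < SWAP_CLUSTER_MAX ? nr[lru] : SWAP_CLUSTER_MAX;
    			nr[lru] -= nr_to_scan;
    			printf("shrink_list(%s, %lu)\n", names[lru], nr_to_scan);
    		}
    	}
    	return 0;
    }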
    

    shrink_lruvec() and get_scan_count()
    get_scan_count() sets up nr_to_scan for each lru list.

    If CONFIG_MEMCG is disabled, the indices of nr[] map onto the global LRU lists, whose sizes show up as the following fields of /proc/vmstat:
    nr[0] = nr[LRU_INACTIVE_ANON], list size reported as nr_inactive_anon
    nr[1] = nr[LRU_ACTIVE_ANON], list size reported as nr_active_anon
    nr[2] = nr[LRU_INACTIVE_FILE], list size reported as nr_inactive_file
    nr[3] = nr[LRU_ACTIVE_FILE], list size reported as nr_active_file
    
    /*
     * We do arithmetic on the LRU lists in various places in the code,
     * so it is important to keep the active lists LRU_ACTIVE higher in
     * the array than the corresponding inactive lists, and to keep
     * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
     *
     * This has to be kept in sync with the statistics in zone_stat_item
     * above and the descriptions in vmstat_text in mm/vmstat.c
     */
    #define LRU_BASE 0
    #define LRU_ACTIVE 1
    #define LRU_FILE 2
    
    enum lru_list {
    	LRU_INACTIVE_ANON = LRU_BASE,
    	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    	LRU_UNEVICTABLE,
    	NR_LRU_LISTS
    };
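    The global sizes of these four lists can be checked from user space. A minimal reader of /proc/vmstat (an illustration only, not part of the kernel or Android source) might look like:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
    	const char *keys[] = { "nr_inactive_anon", "nr_active_anon",
    			       "nr_inactive_file", "nr_active_file" };
    	char name[64];
    	unsigned long val;
    	FILE *fp = fopen("/proc/vmstat", "r");

    	if (!fp)
    		return 1;
    	/* /proc/vmstat is "name value" per line; pick out the four LRU counters */
    	while (fscanf(fp, "%63s %lu", name, &val) == 2)
    		for (unsigned i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
    			if (!strcmp(name, keys[i]))
    				printf("%s = %lu pages\n", name, val);
    	fclose(fp);
    	return 0;
    }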
    
    /*
     * Determine how aggressively the anon and file LRU lists should be
     * scanned.  The relative value of each set of LRU lists is determined
     * by looking at the fraction of the pages scanned we did rotate back
     * onto the active list instead of evict.
     *
     * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
     * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
     */
    static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
    			   unsigned long *nr)
    {
    	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
    	u64 fraction[2];
    	u64 denominator = 0;	/* gcc */
    	struct zone *zone = lruvec_zone(lruvec);
    	unsigned long anon_prio, file_prio;
    	enum scan_balance scan_balance;
    	unsigned long anon, file, free;
    	bool force_scan = false;
    	unsigned long ap, fp;
    	enum lru_list lru;
    
    	/*
    	 * If the zone or memcg is small, nr[l] can be 0.  This
    	 * results in no scanning on this priority and a potential
    	 * priority drop.  Global direct reclaim can go to the next
    	 * zone and tends to have no problems. Global kswapd is for
    	 * zone balancing and it needs to scan a minimum amount. When
    	 * reclaiming for a memcg, a priority drop can cause high
    	 * latencies, so it's better to scan a minimum amount there as
    	 * well.
    	 */
    	if (current_is_kswapd() && !zone_reclaimable(zone))
    		force_scan = true;
    	if (!global_reclaim(sc))
    		force_scan = true;
    
    	/* If we have no swap space, do not bother scanning anon pages. */
    	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
    		scan_balance = SCAN_FILE;
    		goto out;
    	}
    
    	/*
    	 * Global reclaim will swap to prevent OOM even with no
    	 * swappiness, but memcg users want to use this knob to
    	 * disable swapping for individual groups completely when
    	 * using the memory controller's swap limit feature would be
    	 * too expensive.
    	 */
    	if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
    		scan_balance = SCAN_FILE;
    		goto out;
    	}
    
    	/*
    	 * Do not apply any pressure balancing cleverness when the
    	 * system is close to OOM, scan both anon and file equally
    	 * (unless the swappiness setting disagrees with swapping).
    	 */
    	if (!sc->priority && vmscan_swappiness(sc)) {
    		scan_balance = SCAN_EQUAL;
    		goto out;
    	}
    
    	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
    		get_lru_size(lruvec, LRU_INACTIVE_ANON);
    	file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
    		get_lru_size(lruvec, LRU_INACTIVE_FILE);
    
    	/*
    	 * If it's foreseeable that reclaiming the file cache won't be
    	 * enough to get the zone back into a desirable shape, we have
    	 * to swap.  Better start now and leave the - probably heavily
    	 * thrashing - remaining file pages alone.
    	 */
    	if (global_reclaim(sc)) {
    		free = zone_page_state(zone, NR_FREE_PAGES);
    		if (unlikely(file + free <= high_wmark_pages(zone))) {
    			scan_balance = SCAN_ANON;
    			goto out;
    		}
    	}
    
    	/*
    	 * There is enough inactive page cache, do not reclaim
    	 * anything from the anonymous working set right now.
    	 */
    	if (!IS_ENABLED(CONFIG_BALANCE_ANON_FILE_RECLAIM) &&
    			!inactive_file_is_low(lruvec)) {
    		scan_balance = SCAN_FILE;
    		goto out;
    	}
    
    	scan_balance = SCAN_FRACT;
    
    	/*
    	 * With swappiness at 100, anonymous and file have the same priority.
    	 * This scanning priority is essentially the inverse of IO cost.
    	 */
    	anon_prio = vmscan_swappiness(sc);
    	file_prio = 200 - anon_prio;
    
    	/*
    	 * OK, so we have swap space and a fair amount of page cache
    	 * pages.  We use the recently rotated / recently scanned
    	 * ratios to determine how valuable each cache is.
    	 *
    	 * Because workloads change over time (and to avoid overflow)
    	 * we keep these statistics as a floating average, which ends
    	 * up weighing recent references more than old ones.
    	 *
    	 * anon in [0], file in [1]
    	 */
    	spin_lock_irq(&zone->lru_lock);
    	if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
    		reclaim_stat->recent_scanned[0] /= 2;
    		reclaim_stat->recent_rotated[0] /= 2;
    	}
    
    	if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
    		reclaim_stat->recent_scanned[1] /= 2;
    		reclaim_stat->recent_rotated[1] /= 2;
    	}
    
    	/*
    	 * The amount of pressure on anon vs file pages is inversely
    	 * proportional to the fraction of recently scanned pages on
    	 * each list that were recently referenced and in active use.
    	 */
    	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
    	ap /= reclaim_stat->recent_rotated[0] + 1;
    
    	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
    	fp /= reclaim_stat->recent_rotated[1] + 1;
    	spin_unlock_irq(&zone->lru_lock);
    
    	fraction[0] = ap;
    	fraction[1] = fp;
    	denominator = ap + fp + 1;
    out:
    	for_each_evictable_lru(lru) {
    		int file = is_file_lru(lru);
    		unsigned long size;
    		unsigned long scan;
    
    		size = get_lru_size(lruvec, lru);
    		scan = size >> sc->priority;
    
    		if (!scan && force_scan)
    			scan = min(size, SWAP_CLUSTER_MAX);
    
    		switch (scan_balance) {
    		case SCAN_EQUAL:
    			/* Scan lists relative to size */
    			break;
    		case SCAN_FRACT:
    			/*
    			 * Scan types proportional to swappiness and
    			 * their relative recent reclaim efficiency.
    			 */
    			scan = div64_u64(scan * fraction[file], denominator);
    			break;
    		case SCAN_FILE:
    		case SCAN_ANON:
    			/* Scan one type exclusively */
    			if ((scan_balance == SCAN_FILE) != file)
    				scan = 0;
    			break;
    		default:
    			/* Look ma, no brain */
    			BUG();
    		}
    		nr[lru] = scan;
    	}
    }
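    As a worked example of the SCAN_FRACT case, the sketch below plugs assumed numbers (swappiness = 60, made-up recent_scanned/recent_rotated values and lru sizes, priority = DEF_PRIORITY = 12, force_scan minimum omitted) into the same arithmetic: ap = 60 * 2001 / 1501 = 79, fp = 140 * 8001 / 1001 = 1119, denominator = 79 + 1119 + 1 = 1199, and each lru's scan target is (size >> priority) * fraction / denominator.

    #include <stdio.h>

    int main(void)
    {
    	/* all numbers below are made up for illustration */
    	unsigned long swappiness = 60;
    	unsigned long anon_prio = swappiness, file_prio = 200 - swappiness;

    	/* recent_scanned / recent_rotated, anon in [0], file in [1] */
    	unsigned long recent_scanned[2] = { 2000, 8000 };
    	unsigned long recent_rotated[2] = { 1500, 1000 };

    	unsigned long ap = anon_prio * (recent_scanned[0] + 1) / (recent_rotated[0] + 1);
    	unsigned long fp = file_prio * (recent_scanned[1] + 1) / (recent_rotated[1] + 1);
    	unsigned long fraction[2] = { ap, fp };
    	unsigned long denominator = ap + fp + 1;

    	/* hypothetical lru sizes in pages: inactive_anon, active_anon, inactive_file, active_file */
    	unsigned long size[4] = { 40000, 30000, 60000, 50000 };
    	int priority = 12;

    	for (int lru = 0; lru < 4; lru++) {
    		int file = lru >= 2;	/* LRU_FILE == 2 in the real enum */
    		unsigned long scan = (size[lru] >> priority) * fraction[file] / denominator;
    		printf("nr[%d] = %lu\n", lru, scan);
    	}
    	return 0;
    }

    With these numbers both anon targets round down to 0 at DEF_PRIORITY while the file lists get 13 and 11 pages to scan; as the comment in get_scan_count() notes, such a small target simply causes a priority drop for global direct reclaim, whereas kswapd on an unreclaimable zone and memcg reclaim use force_scan to keep a minimum scan amount.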
    

    conclusion
    This post discusses shrink_lruvec(). It also discusses its relationship with shrink_zone(), get_scan_count(), and shrink_list().

kernel: mm: shrink_zone

December 3, 2015

This post discusses shrink_zone().

reference code base
LA.BF64.1.1-06510-8×94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack

__alloc_pages_nodemask()
-> __alloc_pages_slowpath()
   -> __alloc_pages_direct_reclaim()
      -> __perform_reclaim()
         -> try_to_free_pages()
            -> throttle_direct_reclaim()  
            -> do_try_to_free_pages()
               -> shrink_zones()
                  -> shrink_zone()
                     -> shrink_lruvec()
                     -> vmpressure()
                     -> should_continue_reclaim()

shrink_zones() and shrink_zone()
shrink_zones() is in the direct reclaim path. It is called by do_try_to_free_pages() with decreasing priority. It walks all zones in the zonelist down from gfp_zone(gfp_mask) and calls shrink_zone() on a zone if compaction_ready() is false for that zone. shrink_zones() returns true if compaction_ready() is true for at least one zone.
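A toy user-space model of the behaviour just described (the zones and their compaction-readiness flags are made up) shows the skip-and-remember pattern:

#include <stdio.h>
#include <stdbool.h>

/* toy zonelist: one zone where compaction could already succeed, one where it could not */
struct zone { const char *name; bool compaction_ready; };

int main(void)
{
	struct zone zonelist[] = {
		{ "Normal", true  },
		{ "DMA",    false },
	};
	bool aborted_reclaim = false;

	for (unsigned i = 0; i < sizeof(zonelist) / sizeof(zonelist[0]); i++) {
		if (zonelist[i].compaction_ready) {
			/* like shrink_zones(): skip reclaim for this zone, remember the fact */
			aborted_reclaim = true;
			continue;
		}
		printf("shrink_zone(%s)\n", zonelist[i].name);
	}
	printf("aborted_reclaim = %d\n", aborted_reclaim);
	return 0;
}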

shrink_zone() and shrink_lruvec()
shrink_zone() iterates over all mem_cgroups and calls shrink_lruvec() on each of them. In my case, CONFIG_MEMCG is not set, so it only shrinks the zone's lruvec. The code flow can be simplified as below.

do {
    shrink_lruvec() to shrink zone->lruvec
    vmpressure() to update system vm pressure 
} while (should_continue_reclaim());

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
	unsigned long nr_reclaimed, nr_scanned;

	do {
		struct mem_cgroup *root = sc->target_mem_cgroup;
		struct mem_cgroup_reclaim_cookie reclaim = {
			.zone = zone,
			.priority = sc->priority,
		};
		struct mem_cgroup *memcg;

		nr_reclaimed = sc->nr_reclaimed;
		nr_scanned = sc->nr_scanned;

		memcg = mem_cgroup_iter(root, NULL, &reclaim);
		do {
			struct lruvec *lruvec;

			lruvec = mem_cgroup_zone_lruvec(zone, memcg);

			shrink_lruvec(lruvec, sc);

			/*
			 * Direct reclaim and kswapd have to scan all memory
			 * cgroups to fulfill the overall scan target for the
			 * zone.
			 *
			 * Limit reclaim, on the other hand, only cares about
			 * nr_to_reclaim pages to be reclaimed and it will
			 * retry with decreasing priority if one round over the
			 * whole hierarchy is not sufficient.
			 */
			if (!global_reclaim(sc) &&
					sc->nr_reclaimed >= sc->nr_to_reclaim) {
				mem_cgroup_iter_break(root, memcg);
				break;
			}
			memcg = mem_cgroup_iter(root, memcg, &reclaim);
		} while (memcg);

		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
			   sc->nr_scanned - nr_scanned,
			   sc->nr_reclaimed - nr_reclaimed);

	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
					 sc->nr_scanned - nr_scanned, sc));
}

should_continue_reclaim() and shrink_zone()
shrink_zone() calls shrink_lruvec() in a loop which is controlled by should_continue_reclaim().

The code flow of should_continue_reclaim() could be simplified as below.

If the caller is not in reclaim/compaction mode, it returns false.
If __GFP_REPEAT is set and there was no progress in either reclaim or scan, it returns false.
If __GFP_REPEAT is not set and nothing was reclaimed, it returns false.
If not enough pages have been reclaimed for compaction while the inactive lists are still large enough, it returns true.
If compaction_suitable() reports that compaction can go ahead (COMPACT_PARTIAL or COMPACT_CONTINUE), it returns false; otherwise it returns true.

/*
 * Reclaim/compaction is used for high-order allocation requests. It reclaims
 * order-0 pages before compacting the zone. should_continue_reclaim() returns
 * true if more pages should be reclaimed such that when the page allocator
 * calls try_to_compact_zone() that it will have enough free pages to succeed.
 * It will give up earlier than that if there is difficulty reclaiming pages.
 */
static inline bool should_continue_reclaim(struct zone *zone,
					unsigned long nr_reclaimed,
					unsigned long nr_scanned,
					struct scan_control *sc)
{
	unsigned long pages_for_compaction;
	unsigned long inactive_lru_pages;

	/* If not in reclaim/compaction mode, stop */
	if (!in_reclaim_compaction(sc))
		return false;

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_REPEAT) {
		/*
		 * For __GFP_REPEAT allocations, stop reclaiming if the
		 * full LRU list has been scanned and we are still failing
		 * to reclaim pages. This full LRU scan is potentially
		 * expensive but a __GFP_REPEAT caller really wants to succeed
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/*
		 * For non-__GFP_REPEAT allocations which can presumably
		 * fail without consequence, stop if we failed to reclaim
		 * any pages from the last SWAP_CLUSTER_MAX number of
		 * pages that were scanned. This will return to the
		 * caller faster at the risk reclaim/compaction and
		 * the resulting allocation attempt fails
		 */
		if (!nr_reclaimed)
			return false;
	}

	/*
	 * If we have not reclaimed enough pages for compaction and the
	 * inactive lists are large enough, continue reclaiming
	 */
	pages_for_compaction = (2UL << sc->order);
	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
	if (get_nr_swap_pages() > 0)
		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
	if (sc->nr_reclaimed < pages_for_compaction &&
			inactive_lru_pages > pages_for_compaction)
		return true;

	/* If compaction would go ahead or the allocation would succeed, stop */
	switch (compaction_suitable(zone, sc->order)) {
	case COMPACT_PARTIAL:
	case COMPACT_CONTINUE:
		return false;
	default:
		return true;
	}
}
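As a quick numeric illustration with assumed values: for an order-4 request, pages_for_compaction = 2UL << 4 = 32 pages. While sc->nr_reclaimed is still below 32 and the zone has, say, a few thousand inactive file pages, the first test keeps reclaim going; once at least 32 pages have been reclaimed, the decision falls through to compaction_suitable(), and reclaim stops as soon as it reports COMPACT_PARTIAL or COMPACT_CONTINUE.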

in_reclaim_compaction() and reclaim/compaction state
in_reclaim_compaction() returns whether the caller is in reclaim/compaction state. The criteria are that CONFIG_COMPACTION is enabled, sc->order is non-zero, and either sc->order > 3 (PAGE_ALLOC_COSTLY_ORDER) or sc->priority < 10 (DEF_PRIORITY - 2).

If a thread enters direct reclaim due to an order-2 page allocation, then it will not be in reclaim/compaction state until priority drops to 9.

If a thread enters direct reclaim due to an order-4 page allocation, then it will always be in reclaim/compaction state.

/* Use reclaim/compaction for costly allocs or under memory pressure */
static bool in_reclaim_compaction(struct scan_control *sc)
{
	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
			(sc->order > PAGE_ALLOC_COSTLY_ORDER ||
			 sc->priority < DEF_PRIORITY - 2))
		return true;

	return false;
}
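The two statements above can be checked with a small user-space rendering of the same predicate (PAGE_ALLOC_COSTLY_ORDER = 3 and DEF_PRIORITY = 12 as in this tree, CONFIG_COMPACTION assumed enabled):

#include <stdio.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3
#define DEF_PRIORITY		12

/* same condition as in_reclaim_compaction(), with CONFIG_COMPACTION taken as enabled */
static bool reclaim_compaction(int order, int priority)
{
	return order && (order > PAGE_ALLOC_COSTLY_ORDER ||
			 priority < DEF_PRIORITY - 2);
}

int main(void)
{
	for (int order = 0; order <= 4; order++)
		for (int priority = DEF_PRIORITY; priority >= 0; priority--)
			if (reclaim_compaction(order, priority)) {
				printf("order %d: enters reclaim/compaction state at priority %d\n",
				       order, priority);
				break;
			}
	return 0;
}

It prints that orders 1 to 3 enter reclaim/compaction state only at priority 9, order 4 is in it from priority 12 onwards, and order 0 never enters it.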

conclusion
This post discusses shrink_zone(). It also discusses its relationship with other functions, such as shrink_zones(), should_continue_reclaim(), and in_reclaim_compaction().

