Archive for the ‘buddy system’ Category

patch discussion: mm/compaction.c: add an is_via_compact_memory() helper

December 21, 2015

This post discusses mm/compaction.c: add an is_via_compact_memory() helper.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

/proc/sys/vm/compact_memory
The core compaction function is compact_zone(), which uses its compaction control argument (struct compact_control) to determine how to compact. compact_zone() is reached in three ways: the allocation slow path, kswapd, and writing to /proc/sys/vm/compact_memory.

If the order in the compaction control is -1, the compaction was triggered by /proc/sys/vm/compact_memory. Several code paths in compaction therefore test (cc->order == -1) to detect this case.

This patch adds a helper, is_via_compact_memory(), that makes this check explicit. Its implementation still tests (order == -1); the gain is readability.
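
The helper is tiny; roughly, it just gives the existing test a name:

/*
 * order == -1 is expected when compacting via
 * /proc/sys/vm/compact_memory
 */
static inline bool is_via_compact_memory(int order)
{
	return order == -1;
}

With it, a check such as the scanner reset in __compact_pgdat() below reads if (is_via_compact_memory(cc->order)) instead of if (cc->order == -1).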

1714 /* The written value is actually unused, all memory is compacted */
1715 int sysctl_compact_memory;
1716 
1717 /* This is the entry point for compacting all nodes via /proc/sys/vm */
1718 int sysctl_compaction_handler(struct ctl_table *table, int write,
1719                         void __user *buffer, size_t *length, loff_t *ppos)
1720 {
1721         if (write)
1722                 compact_nodes();
1723 
1724         return 0;
1725 }
1702 /* Compact all nodes in the system */
1703 static void compact_nodes(void)
1704 {
1705         int nid;
1706 
1707         /* Flush pending updates to the LRU lists */
1708         lru_add_drain_all();
1709 
1710         for_each_online_node(nid)
1711                 compact_node(nid);
1712 }
1691 static void compact_node(int nid)
1692 {
1693         struct compact_control cc = {
1694                 .order = -1,
1695                 .mode = MIGRATE_SYNC,
1696                 .ignore_skip_hint = true,
1697         };
1698 
1699         __compact_pgdat(NODE_DATA(nid), &cc);
1700 }
1638 /* Compact all zones within a node */
1639 static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
1640 {
1641         int zoneid;
1642         struct zone *zone;
1643 
1644         for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
1645 
1646                 zone = &pgdat->node_zones[zoneid];
1647                 if (!populated_zone(zone))
1648                         continue;
1649 
1650                 cc->nr_freepages = 0;
1651                 cc->nr_migratepages = 0;
1652                 cc->zone = zone;
1653                 INIT_LIST_HEAD(&cc->freepages);
1654                 INIT_LIST_HEAD(&cc->migratepages);
1655 
1656                 /*
1657                  * When called via /proc/sys/vm/compact_memory
1658                  * this makes sure we compact the whole zone regardless of
1659                  * cached scanner positions.
1660                  */
1661                 if (cc->order == -1)
1662                         __reset_isolation_suitable(zone);
1663 
1664                 if (cc->order == -1 || !compaction_deferred(zone, cc->order))
1665                         compact_zone(zone, cc);
1666 
1667                 if (cc->order > 0) {
1668                         if (zone_watermark_ok(zone, cc->order,
1669                                                 low_wmark_pages(zone), 0, 0))
1670                                 compaction_defer_reset(zone, cc->order, false);
1671                 }
1672 
1673                 VM_BUG_ON(!list_empty(&cc->freepages));
1674                 VM_BUG_ON(!list_empty(&cc->migratepages));
1675         }
1676 }

conclusion
This post discusses mm/compaction.c: add an is_via_compact_memory() helper. Before this patch, compaction code tested (cc->order == -1) to know whether a compaction was triggered by /proc/sys/vm/compact_memory. After this patch is merged, compaction code can call the helper is_via_compact_memory() directly.


patch discussion: mm, migrate: count pages failing all retries in vmstat and tracepoint

December 8, 2015

This post discusses mm, migrate: count pages failing all retries in vmstat and tracepoint.

merge at
git: kernel/git/mhocko/mm.git
branch: since-4.3

what is the problem of migrate_pages() in v4.3
Let’s consider the following conditions to understand how migrate_pages() works.

Assume the input argument from is a linked list of pages whose size is SWAP_CLUSTER_MAX = 32. Then migrate_pages() will try to migrate these 32 pages into free pages isolated by the free scanner.

Also assume that once a page fails to migrate with -EAGAIN, it keeps failing with -EAGAIN on every retry. This makes the examples below easier to follow.

1. All 32 pages are migrated successfully.
   1.1 (nr_succeeded, nr_failed, retry, rc) = (32, 0, 0, 0).
   1.2 The from list is empty when it returns.
   1.3 Return value is 0.
2. If 4 of the 32 pages fail permanently, and the other 28 are migrated successfully.
   2.1 (nr_succeeded, nr_failed, retry, rc) = (28, 4, 0, 4).
   2.2 The from list is empty when it returns.
   2.3 Return value is 4.
3. If the first 5 pages are migrated successfully, but the 6th page cannot be migrated due to -EAGAIN.
   3.1 (nr_succeeded, nr_failed, retry, rc) = (5, 0, 1, 1).
   3.2 The size of the from list is 27 when it returns.
   3.3 Return value is 1.
4. If the first 5 pages are migrated successfully, the 6th page fails permanently, and the 7th page cannot be migrated due to -EAGAIN.
   4.1 (nr_succeeded, nr_failed, retry, rc) = (5, 1, 1, 2).
   4.2 The size of the from list is 26 when it returns.
   4.3 Return value is 2.

In the 4th case, unmap_and_move() fails to migrate the 6th page permanently and then fails to migrate the 7th page due to -EAGAIN. The return value rc = 2 correctly indicates how many pages were not migrated, since rc = nr_failed + retry = 1 + 1 = 2. However, only nr_failed = 1 page is accounted into pgmigrate_fail in /proc/vmstat, while the correct behaviour would be to increase pgmigrate_fail by 2 in this case.
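
To make the arithmetic of the 4th case concrete, here is a small stand-alone user-space sketch (not kernel code) that replays the v4.3 accounting:

#include <stdio.h>

/*
 * Replay case 4 with the pre-patch accounting: 5 pages succeed,
 * the 6th page fails permanently, the 7th keeps returning -EAGAIN.
 */
int main(void)
{
	int nr_succeeded = 5;	/* pages 1-5 */
	int nr_failed = 1;	/* page 6: permanent failure */
	int retry = 1;		/* page 7: still -EAGAIN after 10 passes */

	int rc = nr_failed + retry;	/* return value: 2, which is correct */
	int pgmigrate_fail = nr_failed;	/* vmstat counter: only 1, too small */

	printf("nr_succeeded=%d rc=%d pgmigrate_fail=%d\n",
	       nr_succeeded, rc, pgmigrate_fail);
	return 0;
}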

1099 /*
1100  * migrate_pages - migrate the pages specified in a list, to the free pages
1101  *                 supplied as the target for the page migration
1102  *
1103  * @from:               The list of pages to be migrated.
1104  * @get_new_page:       The function used to allocate free pages to be used
1105  *                      as the target of the page migration.
1106  * @put_new_page:       The function used to free target pages if migration
1107  *                      fails, or NULL if no special handling is necessary.
1108  * @private:            Private data to be passed on to get_new_page()
1109  * @mode:               The migration mode that specifies the constraints for
1110  *                      page migration, if any.
1111  * @reason:             The reason for page migration.
1112  *
1113  * The function returns after 10 attempts or if no pages are movable any more
1114  * because the list has become empty or no retryable pages exist any more.
1115  * The caller should call putback_lru_pages() to return pages to the LRU
1116  * or free list only if ret != 0.
1117  *
1118  * Returns the number of pages that were not migrated, or an error code.
1119  */
1120 int migrate_pages(struct list_head *from, new_page_t get_new_page,
1121                 free_page_t put_new_page, unsigned long private,
1122                 enum migrate_mode mode, int reason)
1123 {
1124         int retry = 1;
1125         int nr_failed = 0;
1126         int nr_succeeded = 0;
1127         int pass = 0;
1128         struct page *page;
1129         struct page *page2;
1130         int swapwrite = current->flags & PF_SWAPWRITE;
1131         int rc;
1132 
1133         if (!swapwrite)
1134                 current->flags |= PF_SWAPWRITE;
1135 
1136         for(pass = 0; pass < 10 && retry; pass++) {
1137                 retry = 0;
1138 
1139                 list_for_each_entry_safe(page, page2, from, lru) {
1140                         cond_resched();
1141 
1142                         if (PageHuge(page))
1143                                 rc = unmap_and_move_huge_page(get_new_page,
1144                                                 put_new_page, private, page,
1145                                                 pass > 2, mode);
1146                         else
1147                                 rc = unmap_and_move(get_new_page, put_new_page,
1148                                                 private, page, pass > 2, mode,
1149                                                 reason);
1150 
1151                         switch(rc) {
1152                         case -ENOMEM:
1153                                 goto out;
1154                         case -EAGAIN:
1155                                 retry++;
1156                                 break;
1157                         case MIGRATEPAGE_SUCCESS:
1158                                 nr_succeeded++;
1159                                 break;
1160                         default:
1161                                 /*
1162                                  * Permanent failure (-EBUSY, -ENOSYS, etc.):
1163                                  * unlike -EAGAIN case, the failed page is
1164                                  * removed from migration page list and not
1165                                  * retried in the next outer loop.
1166                                  */
1167                                 nr_failed++;
1168                                 break;
1169                         }
1170                 }
1171         }
1172         rc = nr_failed + retry;
1173 out:
1174         if (nr_succeeded)
1175                 count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
1176         if (nr_failed)
1177                 count_vm_events(PGMIGRATE_FAIL, nr_failed);
1178         trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
1179 
1180         if (!swapwrite)
1181                 current->flags &= ~PF_SWAPWRITE;
1182 
1183         return rc;
1184 }

how does this patch fix the problem in branch since-4.3
The patch folds retry into nr_failed (nr_failed += retry). This changes the 4th case as shown below: nr_failed = 2, so pgmigrate_fail in /proc/vmstat is increased by 2.

 
4. If the first 5 pages are migrated successfully, the 6th page fails permanently, and the 7th page cannot be migrated due to -EAGAIN.
   4.1 (nr_succeeded, nr_failed, retry, rc) = (5, 2, 1, 2).
   4.2 The size of the from list is 26 when it returns.
   4.3 Return value is 2.
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7..94961f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1169,7 +1169,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 			}
 		}
 	}
-	rc = nr_failed + retry;
+	nr_failed += retry;
+	rc = nr_failed;
 out:
 	if (nr_succeeded)
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);

conclusion
This post discusses mm, migrate: count pages failing all retries in vmstat and tracepoint, which fixes the accounting of pgmigrate_fail in /proc/vmstat when pages repeatedly fail to migrate due to -EAGAIN.

kernel: mm: balance_pgdat

December 6, 2015

This post discusses balance_pgdat().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call flow of background reclaim

kswapd()
   -> kswapd_try_to_sleep()
      -> prepare_to_wait()
      -> prepare_kswapd_sleep()
      -> prepare_kswapd_sleep()
      -> finish_wait()
   -> try_to_freeze()
   -> balance_pgdat()
      -> shrink_zone()
         -> shrink_lruvec()
         -> vmpressure()
         -> should_continue_reclaim()
      -> shrink_slab()

balance_pgdat()
balance_pgdat() returns an order and passes back a classzone_idx through its pointer argument, so it effectively returns the pair (order, classzone_idx).

If the input order is 3 and balance_pgdat() returns (3, 2), it means balance_pgdat() successfully rebalanced all zones from dma through normal and highmem against the order-3 high watermark.

If the input order is 3 and balance_pgdat() returns (0, 2), it means balance_pgdat() failed to rebalance those zones against the order-3 high watermark, but succeeded against the order-0 high watermark.

kswapd() checks the return value of balance_pgdat(). If the returned order is less than the input order, kswapd() knows that the rebalance failed, so it does not update (new_order, new_classzone_idx); this allows kswapd to go back to sleep after a rebalance failure.
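
As a caller-side illustration (the real check lives in kswapd(), shown in the kswapd post below), the result is consumed roughly like this:

	/* illustrative sketch only, mirroring the description above */
	balanced_classzone_idx = classzone_idx;
	balanced_order = balance_pgdat(pgdat, order, &balanced_classzone_idx);

	if (balanced_order < order) {
		/*
		 * Rebalancing at the requested order failed and balance_pgdat()
		 * fell back to order-0, so kswapd does not raise
		 * (new_order, new_classzone_idx) and may go back to sleep.
		 */
	}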

Simplified code flow of balance_pgdat():


set up scan_control sc for shrink_zone().

loop_again:
do {
    /* scan in the highmem -> dma direction for the highest unbalanced zone */
    for (i = pgdat->nr_zones - 1; i >= 0; i--) {
        zone = pgdat->node_zones + i;
        if (!zone_balanced(zone, order, 0, 0)) {
            end_zone = i;
            break;
        }
    }

    if (i < 0) {
        pgdat_is_balanced = true;
        goto out;
    }

    /* shrink in the dma -> highmem direction up to end_zone */
    for (i = 0; i <= end_zone; i++) {
        shrink_zone();
        shrink_slab();
    }

    if (pgdat_balanced()) {
        pgdat_is_balanced = true;
        break;
    }

    if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
        break;
} while (--sc.priority >= 0);

out:
if (!pgdat_is_balanced) {
    if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
        order = sc.order = 0;
    goto loop_again;
}

if (order)
    try compact_pgdat() if needed;

*classzone_idx = end_zone;
return order;

/*
 * For kswapd, balance_pgdat() will work across all this node's zones until
 * they are all at high_wmark_pages(zone).
 *
 * Returns the final order kswapd was reclaiming at
 *
 * There is special handling here for zones which are full of pinned pages.
 * This can happen if the pages are all mlocked, or if they are all used by
 * device drivers (say, ZONE_DMA).  Or if they are all in use by hugetlb.
 * What we do is to detect the case where all pages in the zone have been
 * scanned twice and there has been zero successful reclaim.  Mark the zone as
 * dead and from now on, only perform a short scan.  Basically we're polling
 * the zone for when the problem goes away.
 *
 * kswapd scans the zones in the highmem->normal->dma direction.  It skips
 * zones which have free_pages > high_wmark_pages(zone), but once a zone is
 * found to have free_pages <= high_wmark_pages(zone), we scan that zone and the
 * lower zones regardless of the number of free pages in the lower zones. This
 * interoperates with the page allocator fallback scheme to ensure that aging
 * of pages is balanced across the zones.
 */
static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
							int *classzone_idx)
{
	bool pgdat_is_balanced = false;
	int i;
	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
	struct reclaim_state *reclaim_state = current->reclaim_state;
	unsigned long nr_soft_reclaimed;
	unsigned long nr_soft_scanned;
	struct scan_control sc = {
		.gfp_mask = GFP_KERNEL,
		.may_unmap = 1,
		.may_swap = 1,
		/*
		 * kswapd doesn't want to be bailed out while reclaim. because
		 * we want to put equal scanning pressure on each zone.
		 */
		.nr_to_reclaim = ULONG_MAX,
		.order = order,
		.target_mem_cgroup = NULL,
	};
	struct shrink_control shrink = {
		.gfp_mask = sc.gfp_mask,
	};
loop_again:
	sc.priority = DEF_PRIORITY;
	sc.nr_reclaimed = 0;
	sc.may_writepage = !laptop_mode;
	count_vm_event(PAGEOUTRUN);

	do {
		unsigned long lru_pages = 0;

		/*
		 * Scan in the highmem->dma direction for the highest
		 * zone which needs scanning
		 */
		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
			struct zone *zone = pgdat->node_zones + i;

			if (!populated_zone(zone))
				continue;

			if (sc.priority != DEF_PRIORITY &&
			    !zone_reclaimable(zone))
				continue;

			/*
			 * Do some background aging of the anon list, to give
			 * pages a chance to be referenced before reclaiming.
			 */
			age_active_anon(zone, &sc);

			/*
			 * If the number of buffer_heads in the machine
			 * exceeds the maximum allowed level and this node
			 * has a highmem zone, force kswapd to reclaim from
			 * it to relieve lowmem pressure.
			 */
			if (buffer_heads_over_limit && is_highmem_idx(i)) {
				end_zone = i;
				break;
			}

			if (!zone_balanced(zone, order, 0, 0)) {
				end_zone = i;
				break;
			} else {
				/* If balanced, clear the congested flag */
				zone_clear_flag(zone, ZONE_CONGESTED);
			}
		}

		if (i < 0) {
			pgdat_is_balanced = true;
			goto out;
		}

		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;

			lru_pages += zone_reclaimable_pages(zone);
		}

		/*
		 * Now scan the zone in the dma->highmem direction, stopping
		 * at the last zone which needs scanning.
		 *
		 * We do this because the page allocator works in the opposite
		 * direction.  This prevents the page allocator from allocating
		 * pages behind kswapd's direction of progress, which would
		 * cause too much scanning of the lower zones.
		 */
		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;
			int testorder;
			unsigned long balance_gap;

			if (!populated_zone(zone))
				continue;

			if (sc.priority != DEF_PRIORITY &&
			    !zone_reclaimable(zone))
				continue;

			sc.nr_scanned = 0;

			nr_soft_scanned = 0;
			/*
			 * Call soft limit reclaim before calling shrink_zone.
			 */
			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
							order, sc.gfp_mask,
							&nr_soft_scanned);
			sc.nr_reclaimed += nr_soft_reclaimed;

			/*
			 * We put equal pressure on every zone, unless
			 * one zone has way too many pages free
			 * already. The "too many pages" is defined
			 * as the high wmark plus a "gap" where the
			 * gap is either the low watermark or 1%
			 * of the zone, whichever is smaller.
			 */
			balance_gap = min(low_wmark_pages(zone),
				(zone->managed_pages +
					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
				KSWAPD_ZONE_BALANCE_GAP_RATIO);
			/*
			 * Kswapd reclaims only single pages with compaction
			 * enabled. Trying too hard to reclaim until contiguous
			 * free pages have become available can hurt performance
			 * by evicting too much useful data from memory.
			 * Do not reclaim more than needed for compaction.
			 */
			testorder = order;
			if (IS_ENABLED(CONFIG_COMPACTION) && order &&
					compaction_suitable(zone, order) !=
						COMPACT_SKIPPED)
				testorder = 0;

			if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
			    !zone_balanced(zone, testorder,
					   balance_gap, end_zone)) {
				shrink_zone(zone, &sc);

				reclaim_state->reclaimed_slab = 0;
				shrink_slab(&shrink, sc.nr_scanned, lru_pages);
				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
			}

			/*
			 * If we're getting trouble reclaiming, start doing
			 * writepage even in laptop mode.
			 */
			if (sc.priority < DEF_PRIORITY - 2)
				sc.may_writepage = 1;

			if (!zone_reclaimable(zone)) {
				if (end_zone && end_zone == i)
					end_zone--;
				continue;
			}

			if (zone_balanced(zone, testorder, 0, end_zone))
				/*
				 * If a zone reaches its high watermark,
				 * consider it to be no longer congested. It's
				 * possible there are dirty pages backed by
				 * congested BDIs but as pressure is relieved,
				 * speculatively avoid congestion waits
				 */
				zone_clear_flag(zone, ZONE_CONGESTED);
		}

		/*
		 * If the low watermark is met there is no need for processes
		 * to be throttled on pfmemalloc_wait as they should not be
		 * able to safely make forward progress. Wake them
		 */
		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
				pfmemalloc_watermark_ok(pgdat))
			wake_up(&pgdat->pfmemalloc_wait);

		if (pgdat_balanced(pgdat, order, *classzone_idx)) {
			pgdat_is_balanced = true;
			break;		/* kswapd: all done */
		}

		/*
		 * We do this so kswapd doesn't build up large priorities for
		 * example when it is freeing in parallel with allocators. It
		 * matches the direct reclaim path behaviour in terms of impact
		 * on zone->*_priority.
		 */
		if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
			break;
	} while (--sc.priority >= 0);

out:
	if (!pgdat_is_balanced) {
		cond_resched();

		try_to_freeze();

		/*
		 * Fragmentation may mean that the system cannot be
		 * rebalanced for high-order allocations in all zones.
		 * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
		 * it means the zones have been fully scanned and are still
		 * not balanced. For high-order allocations, there is
		 * little point trying all over again as kswapd may
		 * infinite loop.
		 *
		 * Instead, recheck all watermarks at order-0 as they
		 * are the most important. If watermarks are ok, kswapd will go
		 * back to sleep. High-order users can still perform direct
		 * reclaim if they wish.
		 */
		if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
			order = sc.order = 0;

		goto loop_again;
	}

	/*
	 * If kswapd was reclaiming at a higher order, it has the option of
	 * sleeping without all zones being balanced. Before it does, it must
	 * ensure that the watermarks for order-0 on *all* zones are met and
	 * that the congestion flags are cleared. The congestion flag must
	 * be cleared as kswapd is the only mechanism that clears the flag
	 * and it is potentially going to sleep here.
	 */
	if (order) {
		int zones_need_compaction = 1;

		for (i = 0; i <= end_zone; i++) {
			struct zone *zone = pgdat->node_zones + i;

			if (!populated_zone(zone))
				continue;

			/* Check if the memory needs to be defragmented. */
			if (zone_watermark_ok(zone, order,
				    low_wmark_pages(zone), *classzone_idx, 0))
				zones_need_compaction = 0;
		}

		if (zones_need_compaction)
			compact_pgdat(pgdat, order);
	}

	/*
	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
	 * makes a decision on the order we were last reclaiming at. However,
	 * if another caller entered the allocator slow path while kswapd
	 * was awake, order will remain at the higher level
	 */
	*classzone_idx = end_zone;
	return order;
}

conclusion
This post discusses balance_pgdat() and gives a simplified code flow of it. The returned order indicates whether the rebalance succeeded, and balance_pgdat() also passes back a classzone_idx. If classzone_idx is 2, balance_pgdat() shrinks the dma, normal, and highmem zones. The main loop repeats until pgdat_balanced() returns true, but note that pgdat_balanced() may only succeed because order was dropped to 0 after the high-order rebalance failed.

kernel: mm: kswapd

December 6, 2015

This post discusses kswapd().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call flow of background reclaim

kswapd()
   -> kswapd_try_to_sleep()
      -> prepare_to_wait()
      -> prepare_kswapd_sleep()
      -> prepare_kswapd_sleep()
      -> finish_wait()
   -> try_to_freeze()
   -> balance_pgdat()
      -> shrink_zone()
         -> shrink_lruvec()
      -> shrink_slab()

kswapd()
The kswapd thread enters kswapd() after it is forked. It can allocate pages without watermark checks because PF_MEMALLOC is set. It can also be frozen while the system suspends because it calls set_freezable().

If the rebalance fails, (new_order, new_classzone_idx) is not refreshed from (pgdat->kswapd_max_order, pgdat->classzone_idx). Thus kswapd always calls kswapd_try_to_sleep() and tries to sleep.

If the rebalance succeeds, (new_order, new_classzone_idx) is refreshed. If the new request has a higher order or a lower preferred zone, kswapd does not sleep and calls balance_pgdat() directly.

Simplified kswapd code flow:

    for (;;) {
        if (rebalance success) {
            (new_order, new_classzone_idx) = (pgdat->kswapd_max_order, pgdat->classzone_idx);
            (pgdat->kswapd_max_order, pgdat->classzone_idx) = (0, pgdat->nr_zones - 1);
        }

        if ((new_order, new_classzone_idx) is harder than (order, classzone_idx)) {
           (order, classzone_idx) = (new_order, new_classzone_idx);
        } else {
           kswapd_try_to_sleep();
           (new_order, new_classzone_idx) = (order, classzone_idx) = (pgdat->kswapd_max_order, pgdat->classzone_idx);
           (pgdat->kswapd_max_order, pgdat->classzone_idx) = (0, pgdat->nr_zones - 1);
        }

        if (system enters suspend)
           freeze();

        if (returned from frozen state)
           continue; /* skip balance_pgdat() to speed up resume while thawing user-space processes */
        
        (balanced_order, balanced_classzone_idx) = balance_pgdat();
    }

/*
 * The background pageout daemon, started as a kernel thread
 * from the init process.
 *
 * This basically trickles out pages so that we have _some_
 * free memory available even if there is no other activity
 * that frees anything up. This is needed for things like routing
 * etc, where we otherwise might have all activity going on in
 * asynchronous contexts that cannot page things out.
 *
 * If there are applications that are active memory-allocators
 * (most normal use), this basically shouldn't matter.
 */
static int kswapd(void *p)
{
	unsigned long order, new_order;
	unsigned balanced_order;
	int classzone_idx, new_classzone_idx;
	int balanced_classzone_idx;
	pg_data_t *pgdat = (pg_data_t*)p;
	struct task_struct *tsk = current;

	struct reclaim_state reclaim_state = {
		.reclaimed_slab = 0,
	};
	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

	lockdep_set_current_reclaim_state(GFP_KERNEL);

	if (kswapd_cpu_mask == NULL && !cpumask_empty(cpumask))
		set_cpus_allowed_ptr(tsk, cpumask);
	current->reclaim_state = &reclaim_state;

	/*
	 * Tell the memory management that we're a "memory allocator",
	 * and that if we need more memory we should get access to it
	 * regardless (see "__alloc_pages()"). "kswapd" should
	 * never get caught in the normal page freeing logic.
	 *
	 * (Kswapd normally doesn't need memory anyway, but sometimes
	 * you need a small amount of memory in order to be able to
	 * page out something else, and this flag essentially protects
	 * us from recursively trying to free more memory as we're
	 * trying to free the first piece of memory in the first place).
	 */
	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
	set_freezable();

	order = new_order = 0;
	balanced_order = 0;
	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
	balanced_classzone_idx = classzone_idx;
	for ( ; ; ) {
		bool ret;

		/*
		 * If the last balance_pgdat was unsuccessful it's unlikely a
		 * new request of a similar or harder type will succeed soon
		 * so consider going to sleep on the basis we reclaimed at
		 */
		if (balanced_classzone_idx >= new_classzone_idx &&
					balanced_order == new_order) {
			new_order = pgdat->kswapd_max_order;
			new_classzone_idx = pgdat->classzone_idx;
			pgdat->kswapd_max_order =  0;
			pgdat->classzone_idx = pgdat->nr_zones - 1;
		}

		if (order < new_order || classzone_idx > new_classzone_idx) {
			/*
			 * Don't sleep if someone wants a larger 'order'
			 * allocation or has tigher zone constraints
			 */
			order = new_order;
			classzone_idx = new_classzone_idx;
		} else {
			kswapd_try_to_sleep(pgdat, balanced_order,
						balanced_classzone_idx);
			order = pgdat->kswapd_max_order;
			classzone_idx = pgdat->classzone_idx;
			new_order = order;
			new_classzone_idx = classzone_idx;
			pgdat->kswapd_max_order = 0;
			pgdat->classzone_idx = pgdat->nr_zones - 1;
		}

		ret = try_to_freeze();
		if (kthread_should_stop())
			break;

		/*
		 * We can speed up thawing tasks if we don't call balance_pgdat
		 * after returning from the refrigerator
		 */
		if (!ret) {
			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
			balanced_classzone_idx = classzone_idx;
			balanced_order = balance_pgdat(pgdat, order,
						&balanced_classzone_idx);
		}
	}

	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
	current->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();

	return 0;
}

conclusion
This post discusses kswapd() and explains its simple code flow.

kernel: mm: wakeup_kswapd

December 6, 2015

This post discusses wakeup_kswapd().

reference code base
LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

reference kernel config

# CONFIG_NUMA is not set
CONFIG_ZONE_DMA=y
# CONFIG_MEMCG is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_MEMORY_ISOLATION=y
CONFIG_CMA=y
# CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
# CONFIG_CPUSETS is not set

call stack of wakeup_kswapd

__alloc_pages_nodemask()
-> get_page_from_freelist()
-> __alloc_pages_slowpath()
   -> wake_all_kswapd()
      -> wakeup_kswapd()
   -> get_page_from_freelist()
   -> __alloc_pages_direct_compact()
   -> __alloc_pages_direct_reclaim()
   -> should_alloc_retry()

what is kswapd
kswapd is a kernel thread that reclaims pages in the background via balance_pgdat() and shrink_zone(). Reclaim done by kswapd is called background reclaim. It is preferred over direct reclaim, in which the allocating process reclaims pages itself and therefore suffers a long allocation latency.

how many kswapds are there in a system
At the init stage, each node whose state is N_MEMORY gets a kswapd daemon. The kswapd of node 0 is kswapd0, the kswapd of node 1 is kswapd1, and so on. Each kswapd thread is created by kthread_run().

/*
 * This kswapd start function will be called by init and node-hot-add.
 * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
 */
int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int ret = 0;

	if (pgdat->kswapd)
		return 0;

	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
	if (IS_ERR(pgdat->kswapd)) {
		/* failure at boot is fatal */
		BUG_ON(system_state == SYSTEM_BOOTING);
		pr_err("Failed to start kswapd on node %d\n", nid);
		ret = PTR_ERR(pgdat->kswapd);
		pgdat->kswapd = NULL;
	} else if (kswapd_cpu_mask) {
		if (set_kswapd_cpu_mask(pgdat))
			pr_warn("error setting kswapd cpu affinity mask\n");
	}
	return ret;
}

/*
 * Called by memory hotplug when all memory in a node is offlined.  Caller must
 * hold lock_memory_hotplug().
 */
void kswapd_stop(int nid)
{
	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

	if (kswapd) {
		kthread_stop(kswapd);
		NODE_DATA(nid)->kswapd = NULL;
	}
}

static int __init kswapd_init(void)
{
	int nid;

	swap_setup();
	for_each_node_state(nid, N_MEMORY)
 		kswapd_run(nid);
	if (kswapd_cpu_mask == NULL)
		hotcpu_notifier(cpu_callback, 0);
	return 0;
}

module_init(kswapd_init)

when is kswapd woken up
If a thread fails to allocate pages with the low watermark check, it enters the allocation slow path. In the slow path, the thread first wakes up kswapd, then retries the allocation with the min watermark check. If it still fails to allocate a page from the freelist, it enters direct reclaim and reclaims pages itself.

It's better to let kswapd reclaim in the background rather than rely on direct reclaim. Direct reclaim in the allocation slow path can hurt system responsiveness if the reclaiming thread is the main thread of an application, is a kthread, or holds resources such as a mutex that other threads are waiting for.

To make kswapd's background reclaim kick in before direct reclaim, the gap between the low and min watermarks can be enlarged. A thread can still allocate pages as long as the min watermark is satisfied, but it wakes up kswapd once the low watermark is no longer satisfied. If the gap is 100 MB, then when kswapd is woken up there are still about 100 MB of usable free pages before direct reclaim can happen.
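
The decision band between the two watermarks can be summarised with a small illustrative sketch (not kernel code; the function and enum names here are made up for the example):

/*
 * Illustrative only: how the min/low watermark gap separates
 * background reclaim (kswapd) from direct reclaim for one zone.
 */
enum alloc_action { ALLOC_OK, ALLOC_OK_WAKE_KSWAPD, ALLOC_DIRECT_RECLAIM };

static enum alloc_action classify(unsigned long free_pages,
				  unsigned long wmark_min,
				  unsigned long wmark_low)
{
	if (free_pages >= wmark_low)
		return ALLOC_OK;		/* plenty of free pages */
	if (free_pages >= wmark_min)
		return ALLOC_OK_WAKE_KSWAPD;	/* allocation still succeeds, kswapd is woken */
	return ALLOC_DIRECT_RECLAIM;		/* below min: the allocator reclaims itself */
}

With 4 KB pages, a 100 MB gap corresponds to 25600 pages in the middle band, which is the allocation headroom kswapd gets before direct reclaim starts.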

wake_all_kswapd() and wakeup_kswapd()
wake_all_kswapd() walks the zonelist and visits every zone at or below high_zoneidx, where high_zoneidx is gfp_zone(gfp_mask) and indicates the highest zone that can satisfy the caller's request.

static inline
void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
						enum zone_type high_zoneidx,
						enum zone_type classzone_idx)
{
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
		wakeup_kswapd(zone, order, classzone_idx);
}

Along the zonelist from high_zoneidx, if a zone is populated and its low watermark check is not satisfied, the kswapd of the zone's node is woken up. The low watermark check does not take the lowmem_reserve ratio into account.

Regardless of whether the low watermark check is satisfied, wakeup_kswapd() always tries to update the node's kswapd_max_order and classzone_idx. If kswapd_max_order is 5 and classzone_idx is 2, it means some thread wanted to allocate an order-5 page from the highmem zone but failed to allocate it anywhere down the zonelist.

/*
 * A zone is low on free memory, so wake its kswapd task to service it.
 */
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
	pg_data_t *pgdat;

	if (!populated_zone(zone))
		return;

	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
		return;
	pgdat = zone->zone_pgdat;
	if (pgdat->kswapd_max_order < order) {
		pgdat->kswapd_max_order = order;
		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
	}
	if (!waitqueue_active(&pgdat->kswapd_wait))
		return;
	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
		return;

	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
	wake_up_interruptible(&pgdat->kswapd_wait);
}

conclusion
This post discusses wakeup_kswapd(). It shows when kswapd is woken up and under what conditions it reclaims pages in the background.

patch discussion: mm, thp: restructure thp avoidance of light synchronous migration

December 5, 2015

This post discusses mm, thp: restructure thp avoidance of light synchronous migration.

call flow of compaction

__alloc_pages_nodemask()
   -> __alloc_pages_slowpath()
      -> wake_all_kswapds()
      -> get_page_from_freelist()
      -> __alloc_pages_direct_compact()
         -> try_to_compact_pages()
            -> compact_zone_order()
               -> compact_zone()
                  -> compact_finished()
                  -> isolate_migratepages()
                  -> migrate_pages()
      -> __alloc_pages_direct_reclaim()
      -> should_alloc_retry()
      -> warn_alloc_failed()
      -> return page

compaction in v3.15
The first call to __alloc_pages_direct_compact() uses asynchronous migration (sync_migration == false); subsequent calls use synchronous migration (sync_migration == true).

2576         /*
2577          * Try direct compaction. The first pass is asynchronous. Subsequent
2578          * attempts after direct reclaim are synchronous
2579          */
2580         page = __alloc_pages_direct_compact(gfp_mask, order,
2581                                         zonelist, high_zoneidx,
2582                                         nodemask,
2583                                         alloc_flags, preferred_zone,
2584                                         migratetype, sync_migration,
2585                                         &contended_compaction,
2586                                         &deferred_compaction,
2587                                         &did_some_progress);
2588         if (page)
2589                 goto got_pg;
2590         sync_migration = true;

compaction in v3.16
The first call to __alloc_pages_direct_compact() uses MIGRATE_ASYNC. Subsequent calls use MIGRATE_SYNC_LIGHT if __GFP_NO_KSWAPD is not set or the thread is a kernel thread.

__GFP_NO_KSWAPD here is used as a hint that the caller is allocating transparent hugepages. But it also affects other allocators that opportunistically try high-order allocations and do not want to disturb the system, such as the ion allocator and the kgsl allocator.

  • kernel: mm: gfp_mask and ion system heap allocation
  • kernel: mm: gfp_mask and kgsl allocator

The change here is due to mm, compaction: embed migration mode in compact_control and mm, thp: avoid excessive compaction latency during fault.

    2604         /*
    2605          * Try direct compaction. The first pass is asynchronous. Subsequent
    2606          * attempts after direct reclaim are synchronous
    2607          */
    2608         page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
    2609                                         high_zoneidx, nodemask, alloc_flags,
    2610                                         preferred_zone,
    2611                                         classzone_idx, migratetype,
    2612                                         migration_mode, &contended_compaction,
    2613                                         &deferred_compaction,
    2614                                         &did_some_progress);
    2615         if (page)
    2616                 goto got_pg;
    2617 
    2618         /*
    2619          * It can become very expensive to allocate transparent hugepages at
    2620          * fault, so use asynchronous memory compaction for THP unless it is
    2621          * khugepaged trying to collapse.
    2622          */
    2623         if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
    2624                 migration_mode = MIGRATE_SYNC_LIGHT;
    2625 
    

    compaction in v3.17
    The first call to __alloc_pages_direct_compact() uses MIGRATE_ASYNC. Subsequent calls use MIGRATE_SYNC_LIGHT if (gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE or the thread is a kernel thread.

    GFP_TRANSHUGE here indicates that the caller is allocating transparent hugepages. This change avoids affecting users who set __GFP_NO_KSWAPD merely to avoid waking up kswapd and disturbing the system.

    The change here is due to mm, thp: restructure thp avoidance of light synchronous migration.

    2627         /*
    2628          * Try direct compaction. The first pass is asynchronous. Subsequent
    2629          * attempts after direct reclaim are synchronous
    2630          */
    2631         page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
    2632                                         high_zoneidx, nodemask, alloc_flags,
    2633                                         preferred_zone,
    2634                                         classzone_idx, migratetype,
    2635                                         migration_mode, &contended_compaction,
    2636                                         &deferred_compaction,
    2637                                         &did_some_progress);
    2638         if (page)
    2639                 goto got_pg;
    2640 
    2641         /*
    2642          * If compaction is deferred for high-order allocations, it is because
    2643          * sync compaction recently failed. In this is the case and the caller
    2644          * requested a movable allocation that does not heavily disrupt the
    2645          * system then fail the allocation instead of entering direct reclaim.
    2646          */
    2647         if ((deferred_compaction || contended_compaction) &&
    2648                                                 (gfp_mask & __GFP_NO_KSWAPD))
    2649                 goto nopage;
    2650 
    2651         /*
    2652          * It can become very expensive to allocate transparent hugepages at
    2653          * fault, so use asynchronous memory compaction for THP unless it is
    2654          * khugepaged trying to collapse.
    2655          */
    2656         if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
    2657                                                 (current->flags & PF_KTHREAD))
    2658                 migration_mode = MIGRATE_SYNC_LIGHT;
    

    conclusion
    This post discusses how compaction changes synchronous migration conditions in v3.15, v3.16, and v3.17. These changes are due to two patches.

  • mm, thp: avoid excessive compaction latency during fault
  • mm, thp: restructure thp avoidance of light synchronous migration

    patch discussion: mm: vmscan: rework compaction-ready signaling in direct reclaim

    December 5, 2015

    This post discusses mm: vmscan: rework compaction-ready signaling in direct reclaim.

    merge time
    v3.17

    call flow of direct reclaim
    The call flow below is the direct reclaim path; background reclaim by kswapd reaches shrink_zone() through balance_pgdat() instead.

    __alloc_pages_nodemask()
    -> __alloc_pages_slowpath()
       -> __alloc_pages_direct_reclaim()
          -> __perform_reclaim()
             -> try_to_free_pages()
                -> throttle_direct_reclaim()  
                -> do_try_to_free_pages()
                   -> shrink_zones()
                      -> shrink_zone()
                         -> shrink_lruvec()
                            -> get_scan_count()
                            -> shrink_list()
                               -> shrink_active_list()
                               -> shrink_inactive_list()
                                  -> shrink_page_list()
                            -> shrink_active_list()
                            -> throttle_vm_writeout()
                         -> vmpressure()
                         -> should_continue_reclaim()
    

    do_try_to_free_pages() and shrink_zones() in v3.16
    do_try_to_free_pages() repeatedly calls shrink_zones() until sc->nr_reclaimed >= sc->nr_to_reclaim, sc->priority drops below 0, or shrink_zones() returns true. The value returned by shrink_zones() is aborted_reclaim, which means shrink_zones() skipped at least one zone along the zonelist because compaction could proceed there.

    Even if do_try_to_free_pages() could not reclaim any pages after repeatedly calling shrink_zones(), it does not return 0 when aborted_reclaim is true. This helps avoid triggering the OOM killer before the next compaction attempt.

    effects of this patch in 3.17
    This patch encodes the aborted_reclaim information into scan_control, as sc->compaction_ready, so shrink_zones() no longer needs to return it.

    Another patch, mm: vmscan: remove all_unreclaimable(), also merged in v3.17, changes the return type of shrink_zones() back to bool, but the returned value now means reclaimable rather than aborted_reclaim.
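
    A rough before/after sketch of the loop in do_try_to_free_pages(), simplified from the v3.16 and v3.17 sources (writeback throttling and other details are omitted, so treat this as an approximation rather than the exact code):

    	/* v3.16 (simplified): the abort signal is shrink_zones()'s return value */
    	do {
    		aborted_reclaim = shrink_zones(zonelist, sc);
    		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
    			goto out;
    	} while (--sc->priority >= 0 && !aborted_reclaim);
    out:
    	if (sc->nr_reclaimed)
    		return sc->nr_reclaimed;
    	if (aborted_reclaim)		/* compaction-ready: don't signal OOM */
    		return 1;

    	/* v3.17 (simplified): the signal lives in scan_control instead */
    	do {
    		shrink_zones(zonelist, sc);
    		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
    			break;
    		if (sc->compaction_ready)
    			break;
    	} while (--sc->priority >= 0);

    	if (sc->nr_reclaimed)
    		return sc->nr_reclaimed;
    	if (sc->compaction_ready)	/* same OOM avoidance, without return-value plumbing */
    		return 1;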

    conclusion
    This post discusses mm: vmscan: rework compaction-ready signaling in direct reclaim, which makes do_try_to_free_pages() more readable although it does not change any logic.

    kernel: mm: shrink_inactive_list

    December 5, 2015

    This post discusses shrink_inactive_list().

    reference code base
    LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    # CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
    CONFIG_COMPACTION=y
    CONFIG_MIGRATION=y
    # CONFIG_CPUSETS is not set
    

    call stack

    __alloc_pages_nodemask()
    -> __alloc_pages_slowpath()
       -> __alloc_pages_direct_reclaim()
          -> __perform_reclaim()
             -> try_to_free_pages()
                -> throttle_direct_reclaim()  
                -> do_try_to_free_pages()
                   -> shrink_zones()
                      -> shrink_zone()
                         -> shrink_lruvec()
                            -> get_scan_count()
                            -> shrink_list()
                               -> shrink_active_list()
                               -> shrink_inactive_list()
                                  -> shrink_page_list()
                            -> shrink_active_list()
                            -> throttle_vm_writeout()
                         -> vmpressure()
                         -> should_continue_reclaim()
    

    shrink_lruvec and shrink_inactive_list()
    shrink_lruvec() calls get_scan_count() to evaluate how many pages to scan on each lru list. nr[0] is for inactive_anon, nr[1] is for active_anon, nr[2] is for inactive_file, and nr[3] is for active_file. As long as any of nr[0], nr[2], nr[3] is non-zero, it calls shrink_list() for each evictable lru list. If the lru list is inactive, shrink_list() calls shrink_inactive_list() to shrink it. If the lru list is active, shrink_list() calls shrink_active_list() only when the corresponding inactive list is low. The inactive_anon list is low if zone->nr_inactive_anon * zone->inactive_ratio < zone->nr_active_anon. The inactive_file list is low if zone->nr_inactive_file < zone->nr_active_file.
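
    The dispatch described above is easiest to see in shrink_list() itself; roughly, from the 3.10 sources:

    static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
    				 struct lruvec *lruvec, struct scan_control *sc)
    {
    	if (is_active_lru(lru)) {
    		/* only age the active list when its inactive list is low */
    		if (inactive_list_is_low(lruvec, lru))
    			shrink_active_list(nr_to_scan, lruvec, sc, lru);
    		return 0;
    	}

    	return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
    }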

    what does shrink_inactive_list() do
    shrink_inactive_list() calls isolate_lru_pages() to isolate some pages from the inactive_anon or inactive_file list onto a local list, page_list. Then it calls shrink_page_list() to reclaim the pages on page_list.

    static noinline_for_stack unsigned long
    shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
    		     struct scan_control *sc, enum lru_list lru)
    {
    	LIST_HEAD(page_list);
    	unsigned long nr_scanned;
    	unsigned long nr_reclaimed = 0;
    	unsigned long nr_taken;
    	unsigned long nr_dirty = 0;
    	unsigned long nr_writeback = 0;
    	isolate_mode_t isolate_mode = 0;
    	int file = is_file_lru(lru);
    	struct zone *zone = lruvec_zone(lruvec);
    	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
    
    	while (unlikely(too_many_isolated(zone, file, sc))) {
    		congestion_wait(BLK_RW_ASYNC, HZ/10);
    
    		/* We are about to die and free our memory. Return now. */
    		if (fatal_signal_pending(current))
    			return SWAP_CLUSTER_MAX;
    	}
    
    	lru_add_drain();
    
    	if (!sc->may_unmap)
    		isolate_mode |= ISOLATE_UNMAPPED;
    	if (!sc->may_writepage)
    		isolate_mode |= ISOLATE_CLEAN;
    
    	spin_lock_irq(&zone->lru_lock);
    
    	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
    				     &nr_scanned, sc, isolate_mode, lru);
    
    	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
    
    	if (global_reclaim(sc)) {
    		zone->pages_scanned += nr_scanned;
    		if (current_is_kswapd())
    			__count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
    		else
    			__count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
    	}
    	spin_unlock_irq(&zone->lru_lock);
    
    	if (nr_taken == 0)
    		return 0;
    
    	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
    					&nr_dirty, &nr_writeback, false);
    
    	spin_lock_irq(&zone->lru_lock);
    
    	reclaim_stat->recent_scanned[file] += nr_taken;
    
    	if (global_reclaim(sc)) {
    		if (current_is_kswapd())
    			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
    					       nr_reclaimed);
    		else
    			__count_zone_vm_events(PGSTEAL_DIRECT, zone,
    					       nr_reclaimed);
    	}
    
    	putback_inactive_pages(lruvec, &page_list);
    
    	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
    
    	spin_unlock_irq(&zone->lru_lock);
    
    	free_hot_cold_page_list(&page_list, 1);
    
    	/*
    	 * If reclaim is isolating dirty pages under writeback, it implies
    	 * that the long-lived page allocation rate is exceeding the page
    	 * laundering rate. Either the global limits are not being effective
    	 * at throttling processes due to the page distribution throughout
    	 * zones or there is heavy usage of a slow backing device. The
    	 * only option is to throttle from reclaim context which is not ideal
    	 * as there is no guarantee the dirtying process is throttled in the
    	 * same way balance_dirty_pages() manages.
    	 *
    	 * This scales the number of dirty pages that must be under writeback
    	 * before throttling depending on priority. It is a simple backoff
    	 * function that has the most effect in the range DEF_PRIORITY to
    	 * DEF_PRIORITY-2 which is the priority reclaim is considered to be
    	 * in trouble and reclaim is considered to be in trouble.
    	 *
    	 * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
    	 * DEF_PRIORITY-1  50% must be PageWriteback
    	 * DEF_PRIORITY-2  25% must be PageWriteback, kswapd in trouble
    	 * ...
    	 * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
    	 *                     isolated page is PageWriteback
    	 */
    	if (nr_writeback && nr_writeback >=
    			(nr_taken >> (DEF_PRIORITY - sc->priority)))
    		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
    
    	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
    		zone_idx(zone),
    		nr_scanned, nr_reclaimed,
    		sc->priority,
    		trace_shrink_flags(file));
    	return nr_reclaimed;
    }
    

    /proc/vmstat and shrink_inactive_list()
    While shrink_inactive_list() isolates lru pages from inactive_anon or inactive_file into a local list called page_list, the number of isolated lru pages is accounted in /proc/vmstat.

    If the caller is kswapd in zone dma, then pgscan_kswapd_dma in /proc/vmstat is increased.
    If the caller is kswapd in zone normal, then pgscan_kswapd_normal in /proc/vmstat is increased.
    If the caller is kswapd in zone movable, then pgscan_kswapd_movable in /proc/vmstat is increased.
    If the caller is direct reclaim in zone dma, then pgscan_direct_dma in /proc/vmstat is increased.
    If the caller is direct reclaim in zone normal, then pgscan_direct_normal in /proc/vmstat is increased.
    If the caller is direct reclaim in zone movable, then pgscan_direct_movable in /proc/vmstat is increased.
    

    While shrink_inactive_list() calls shrink_page_list() to reclaim the isolated pages on page_list, the number of reclaimed pages is accounted in /proc/vmstat.

    If the caller is kswapd in zone dma, then pgsteal_kswapd_dma in /proc/vmstat is increased.
    If the caller is kswapd in zone normal, then pgsteal_kswapd_normal in /proc/vmstat is increased.
    If the caller is kswapd in zone movable, then pgsteal_kswapd_movable in /proc/vmstat is increased.
    If the caller is direct reclaim in zone dma, then pgsteal_direct_dma in /proc/vmstat is increased.
    If the caller is direct reclaim in zone normal, then pgsteal_direct_normal in /proc/vmstat is increased.
    If the caller is direct reclaim in zone movable, then pgsteal_direct_movable in /proc/vmstat is increased.
    
    ------ VIRTUAL MEMORY STATS (/proc/vmstat) ------
    nr_free_pages 30067
    nr_inactive_anon 5424
    nr_active_anon 338576
    nr_inactive_file 59481
    nr_active_file 58591
    nr_unevictable 18981
    nr_mlock 18017
    nr_anon_pages 337893
    nr_mapped 116348
    nr_file_pages 143209
    nr_dirty 26
    nr_writeback 72
    nr_slab_reclaimable 19690
    nr_slab_unreclaimable 25018
    nr_page_table_pages 11961
    nr_kernel_stack 3018
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 6168
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 6159
    nr_dirtied 11523918
    nr_written 11270195
    nr_anon_transparent_hugepages 0
    nr_free_cma 4287
    nr_dirty_threshold 6442
    nr_dirty_background_threshold 1288
    pgpgin 790917801
    pgpgout 75442400
    pswpin 0
    pswpout 0
    pgalloc_dma 1373663299
    pgalloc_normal 0
    pgalloc_movable 0
    pgfree 1375463846
    pgactivate 117565768
    pgdeactivate 52661597
    pgfault 2722674372
    pgmajfault 7010162
    pgrefill_dma 89437883
    pgrefill_normal 0
    pgrefill_movable 0
    pgsteal_kswapd_dma 186996602
    pgsteal_kswapd_normal 0
    pgsteal_kswapd_movable 0
    pgsteal_direct_dma 7400060
    pgsteal_direct_normal 0
    pgsteal_direct_movable 0
    pgscan_kswapd_dma 228700634
    pgscan_kswapd_normal 0
    pgscan_kswapd_movable 0
    pgscan_direct_dma 9064264
    pgscan_direct_normal 0
    pgscan_direct_movable 0
    pgscan_direct_throttle 0
    pginodesteal 568
    slabs_scanned 197934252
    kswapd_inodesteal 7122334
    kswapd_low_wmark_hit_quickly 147149
    kswapd_high_wmark_hit_quickly 77589
    pageoutrun 323211
    allocstall 156820
    pgrotated 8807
    pgmigrate_success 1698508
    pgmigrate_fail 253
    compact_migrate_scanned 24318527
    compact_free_scanned 988378858
    compact_isolated 3727646
    compact_stall 15834
    compact_fail 10133
    compact_success 4497
    unevictable_pgs_culled 42315
    unevictable_pgs_scanned 0
    unevictable_pgs_rescued 23334
    unevictable_pgs_mlocked 46701
    unevictable_pgs_munlocked 28684
    unevictable_pgs_cleared 0
    unevictable_pgs_stranded 0
    

    conclusion
    This post discusses when shrink_inactive_list() is called by shrink_lruvec(), what it does, and how /proc/vmstat counters are updated while isolating lru pages and calling shrink_page_list().

    kernel: mm: shrink_active_list

    December 5, 2015

    This post discusses shrink_active_list().

    reference code base
    LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2(LRX21M) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    # CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
    CONFIG_COMPACTION=y
    CONFIG_MIGRATION=y
    # CONFIG_CPUSETS is not set
    

    call stack

    __alloc_pages_nodemask()
    -> __alloc_pages_slowpath()
       -> __alloc_pages_direct_reclaim()
          -> __perform_reclaim()
             -> try_to_free_pages()
                -> throttle_direct_reclaim()  
                -> do_try_to_free_pages()
                   -> shrink_zones()
                      -> shrink_zone()
                         -> shrink_lruvec()
                            -> get_scan_count()
                            -> shrink_list()
                               -> shrink_active_list()
                               -> shrink_inactive_list()
                                  -> shrink_page_list()
                            -> shrink_active_list()
                            -> throttle_vm_writeout()
                         -> vmpressure()
                         -> should_continue_reclaim()
    

    shrink_lruvec and shrink_active_list()
    shrink_lruvec() calls get_scan_count() to evaluate how many pages to scan on each lru list. nr[0] is for inactive_anon, nr[1] is for active_anon, nr[2] is for inactive_file, and nr[3] is for active_file. As long as any of nr[0], nr[2], nr[3] is non-zero, it calls shrink_list() for each evictable lru list. If the lru list is inactive, shrink_list() calls shrink_inactive_list() to shrink it. If the lru list is active, shrink_list() calls shrink_active_list() only when the corresponding inactive list is low. The inactive_anon list is low if zone->nr_inactive_anon * zone->inactive_ratio < zone->nr_active_anon. The inactive_file list is low if zone->nr_inactive_file < zone->nr_active_file.

    what does shrink_active_list() do
    shrink_active_list() isolates pages from an active lru list, i.e., active_anon or active_file. The isolated pages are put on the local list l_hold. For each page in l_hold: if it is unevictable, it is put back to the appropriate lru by putback_lru_page(); if it is referenced and is the file cache of some executable (VM_EXEC), it is put back to the head of the original active lru; otherwise, it is moved to the corresponding inactive list.

    static void shrink_active_list(unsigned long nr_to_scan,
    			       struct lruvec *lruvec,
    			       struct scan_control *sc,
    			       enum lru_list lru)
    {
    	unsigned long nr_taken;
    	unsigned long nr_scanned;
    	unsigned long vm_flags;
    	LIST_HEAD(l_hold);	/* The pages which were snipped off */
    	LIST_HEAD(l_active);
    	LIST_HEAD(l_inactive);
    	struct page *page;
    	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
    	unsigned long nr_rotated = 0;
    	isolate_mode_t isolate_mode = 0;
    	int file = is_file_lru(lru);
    	struct zone *zone = lruvec_zone(lruvec);
    
    	lru_add_drain();
    
    	if (!sc->may_unmap)
    		isolate_mode |= ISOLATE_UNMAPPED;
    	if (!sc->may_writepage)
    		isolate_mode |= ISOLATE_CLEAN;
    
    	spin_lock_irq(&zone->lru_lock);
    
    	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
    				     &nr_scanned, sc, isolate_mode, lru);
    	if (global_reclaim(sc))
    		zone->pages_scanned += nr_scanned;
    
    	reclaim_stat->recent_scanned[file] += nr_taken;
    
    	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
    	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
    	spin_unlock_irq(&zone->lru_lock);
    
    	while (!list_empty(&l_hold)) {
    		cond_resched();
    		page = lru_to_page(&l_hold);
    		list_del(&page->lru);
    
    		if (unlikely(!page_evictable(page))) {
    			putback_lru_page(page);
    			continue;
    		}
    
    		if (unlikely(buffer_heads_over_limit)) {
    			if (page_has_private(page) && trylock_page(page)) {
    				if (page_has_private(page))
    					try_to_release_page(page, 0);
    				unlock_page(page);
    			}
    		}
    
    		if (page_referenced(page, 0, sc->target_mem_cgroup,
    				    &vm_flags)) {
    			nr_rotated += hpage_nr_pages(page);
    			/*
    			 * Identify referenced, file-backed active pages and
    			 * give them one more trip around the active list. So
    			 * that executable code get better chances to stay in
    			 * memory under moderate memory pressure.  Anon pages
    			 * are not likely to be evicted by use-once streaming
    			 * IO, plus JVM can create lots of anon VM_EXEC pages,
    			 * so we ignore them here.
    			 */
    			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
    				list_add(&page->lru, &l_active);
    				continue;
    			}
    		}
    
    		ClearPageActive(page);	/* we are de-activating */
    		list_add(&page->lru, &l_inactive);
    	}
    
    	/*
    	 * Move pages back to the lru list.
    	 */
    	spin_lock_irq(&zone->lru_lock);
    	/*
    	 * Count referenced pages from currently used mappings as rotated,
    	 * even though only some of them are actually re-activated.  This
    	 * helps balance scan pressure between file and anonymous pages in
    	 * get_scan_ratio.
    	 */
    	reclaim_stat->recent_rotated[file] += nr_rotated;
    
    	move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
    	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
    	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
    	spin_unlock_irq(&zone->lru_lock);
    
    	free_hot_cold_page_list(&l_hold, 1);
    }
    

    conclusion
    This post discusses shrink_active_list().

    kernel: mm: shrink_page_list

    December 5, 2015

    This post discusses shrink_page_list().

    reference code base
    LA.BF64.1.1-06510-8x94.0 with Android 5.0.0_r2 (LRX21M) and Linux kernel 3.10.49.

    reference kernel config

    # CONFIG_NUMA is not set
    CONFIG_ZONE_DMA=y
    # CONFIG_MEMCG is not set
    # CONFIG_TRANSPARENT_HUGEPAGE is not set
    CONFIG_MEMORY_ISOLATION=y
    CONFIG_CMA=y
    # CONFIG_ALLOC_BUFFERS_IN_4K_CHUNKS is not set
    CONFIG_COMPACTION=y
    CONFIG_MIGRATION=y
    # CONFIG_CPUSETS is not set
    

    call stack

    __alloc_pages_nodemask()
    -> __alloc_pages_slowpath()
       -> __alloc_pages_direct_reclaim()
          -> __perform_reclaim()
             -> try_to_free_pages()
                -> throttle_direct_reclaim()  
                -> do_try_to_free_pages()
                   -> shrink_zones()
                      -> shrink_zone()
                         -> shrink_lruvec()
                            -> get_scan_count()
                            -> shrink_list()
                               -> shrink_active_list()
                               -> shrink_inactive_list()
                                  -> shrink_page_list()
                            -> shrink_active_list()
                            -> throttle_vm_writeout()
                         -> vmpressure()
                         -> should_continue_reclaim()
    

    what does shrink_page_list() do
    It is the final function in the direct/background reclaim path, the one that actually reclaims pages.

    Input and output of this function (a sketch of a typical call site follows the list):

    input:
    struct list_head *page_list: a list of isolated pages ready to be reclaimed
    struct zone *zone: the zone in which these pages live
    struct scan_control *sc: initialised by try_to_free_pages() and passed down the reclaim path
    output:
    unsigned long *ret_nr_dirty: the number of dirty pages encountered
    unsigned long *ret_nr_writeback: the number of pages found under writeback
    return unsigned long nr_reclaimed: the number of reclaimed pages
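
    For reference, this is roughly how shrink_inactive_list() invokes it in this kernel. The snippet below is a trimmed sketch of the call site (the isolation of pages onto page_list and the put-back of unreclaimed pages are omitted):

    unsigned long nr_dirty = 0;
    unsigned long nr_writeback = 0;
    unsigned long nr_reclaimed;
    LIST_HEAD(page_list);	/* filled by isolate_lru_pages() beforehand */
    
    nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
    				&nr_dirty, &nr_writeback, false);

    In this kernel, shrink_inactive_list() uses the writeback count to decide whether to throttle via wait_iff_congested() when too many of the isolated pages are still under writeback.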
    

    Simple code flow

    
    unsigned long shrink_page_list()
    {
        while (!list_empty(page_list)) {
            page = lru_to_page(page_list);
            list_del(&page->lru);
            sc->nr_scanned++;
    
            if (page is still referenced)
                continue;   /* keep the page on the lru */
    
            if (page is mapped)
                Unmap it; keep the page if unmapping fails.
    
            if (page is dirty)
                Call pageout() to write it back; keep the page if pageout()
                fails or the page is still under writeback.
    
            if (page has buffers)
                Try to release them; keep the page on failure.
    
            /* Finally, we can reclaim the page */
            __clear_page_locked(page);
            nr_reclaimed++;
            list_add(&page->lru, &free_pages);
        }
    }
    
    /*
     * shrink_page_list() returns the number of reclaimed pages
     */
    static unsigned long shrink_page_list(struct list_head *page_list,
    				      struct zone *zone,
    				      struct scan_control *sc,
    				      enum ttu_flags ttu_flags,
    				      unsigned long *ret_nr_dirty,
    				      unsigned long *ret_nr_writeback,
    				      bool force_reclaim)
    {
    	LIST_HEAD(ret_pages);
    	LIST_HEAD(free_pages);
    	int pgactivate = 0;
    	unsigned long nr_dirty = 0;
    	unsigned long nr_congested = 0;
    	unsigned long nr_reclaimed = 0;
    	unsigned long nr_writeback = 0;
    
    	cond_resched();
    
    	mem_cgroup_uncharge_start();
    	while (!list_empty(page_list)) {
    		struct address_space *mapping;
    		struct page *page;
    		int may_enter_fs;
    		enum page_references references = PAGEREF_RECLAIM_CLEAN;
    
    		cond_resched();
    
    		page = lru_to_page(page_list);
    		list_del(&page->lru);
    
    		if (!trylock_page(page))
    			goto keep;
    
    		VM_BUG_ON(PageActive(page));
    		VM_BUG_ON(page_zone(page) != zone);
    
    		sc->nr_scanned++;
    
    		if (unlikely(!page_evictable(page)))
    			goto cull_mlocked;
    
    		if (!sc->may_unmap && page_mapped(page))
    			goto keep_locked;
    
    		/* Double the slab pressure for mapped and swapcache pages */
    		if (page_mapped(page) || PageSwapCache(page))
    			sc->nr_scanned++;
    
    		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
    			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
    
    		if (PageWriteback(page)) {
    			/*
    			 * memcg doesn't have any dirty pages throttling so we
    			 * could easily OOM just because too many pages are in
    			 * writeback and there is nothing else to reclaim.
    			 *
    			 * Check __GFP_IO, certainly because a loop driver
    			 * thread might enter reclaim, and deadlock if it waits
    			 * on a page for which it is needed to do the write
    			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
    			 * but more thought would probably show more reasons.
    			 *
    			 * Don't require __GFP_FS, since we're not going into
    			 * the FS, just waiting on its writeback completion.
    			 * Worryingly, ext4 gfs2 and xfs allocate pages with
    			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
    			 * testing may_enter_fs here is liable to OOM on them.
    			 */
    			if (global_reclaim(sc) ||
    			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
    				/*
    				 * This is slightly racy - end_page_writeback()
    				 * might have just cleared PageReclaim, then
    				 * setting PageReclaim here end up interpreted
    				 * as PageReadahead - but that does not matter
    				 * enough to care.  What we do want is for this
    				 * page to have PageReclaim set next time memcg
    				 * reclaim reaches the tests above, so it will
    				 * then wait_on_page_writeback() to avoid OOM;
    				 * and it's also appropriate in global reclaim.
    				 */
    				SetPageReclaim(page);
    				nr_writeback++;
    				goto keep_locked;
    			}
    			wait_on_page_writeback(page);
    		}
    
    		if (!force_reclaim)
    			references = page_check_references(page, sc);
    
    		switch (references) {
    		case PAGEREF_ACTIVATE:
    			goto activate_locked;
    		case PAGEREF_KEEP:
    			goto keep_locked;
    		case PAGEREF_RECLAIM:
    		case PAGEREF_RECLAIM_CLEAN:
    			; /* try to reclaim the page below */
    		}
    
    		/*
    		 * Anonymous process memory has backing store?
    		 * Try to allocate it some swap space here.
    		 */
    		if (PageAnon(page) && !PageSwapCache(page)) {
    			if (!(sc->gfp_mask & __GFP_IO))
    				goto keep_locked;
    			if (!add_to_swap(page, page_list))
    				goto activate_locked;
    			may_enter_fs = 1;
    		}
    
    		mapping = page_mapping(page);
    
    		/*
    		 * The page is mapped into the page tables of one or more
    		 * processes. Try to unmap it here.
    		 */
    		if (page_mapped(page) && mapping) {
    			switch (try_to_unmap(page, ttu_flags)) {
    			case SWAP_FAIL:
    				goto activate_locked;
    			case SWAP_AGAIN:
    				goto keep_locked;
    			case SWAP_MLOCK:
    				goto cull_mlocked;
    			case SWAP_SUCCESS:
    				; /* try to free the page below */
    			}
    		}
    
    		if (PageDirty(page)) {
    			nr_dirty++;
    
    			/*
    			 * Only kswapd can writeback filesystem pages to
    			 * avoid risk of stack overflow but do not writeback
    			 * unless under significant pressure.
    			 */
    			if (page_is_file_cache(page) &&
    					(!current_is_kswapd() ||
    					 sc->priority >= DEF_PRIORITY - 2)) {
    				/*
    				 * Immediately reclaim when written back.
    				 * Similar in principal to deactivate_page()
    				 * except we already have the page isolated
    				 * and know it's dirty
    				 */
    				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
    				SetPageReclaim(page);
    
    				goto keep_locked;
    			}
    
    			if (references == PAGEREF_RECLAIM_CLEAN)
    				goto keep_locked;
    			if (!may_enter_fs)
    				goto keep_locked;
    			if (!sc->may_writepage)
    				goto keep_locked;
    
    			/* Page is dirty, try to write it out here */
    			switch (pageout(page, mapping, sc)) {
    			case PAGE_KEEP:
    				nr_congested++;
    				goto keep_locked;
    			case PAGE_ACTIVATE:
    				goto activate_locked;
    			case PAGE_SUCCESS:
    				if (PageWriteback(page))
    					goto keep;
    				if (PageDirty(page))
    					goto keep;
    
    				/*
    				 * A synchronous write - probably a ramdisk.  Go
    				 * ahead and try to reclaim the page.
    				 */
    				if (!trylock_page(page))
    					goto keep;
    				if (PageDirty(page) || PageWriteback(page))
    					goto keep_locked;
    				mapping = page_mapping(page);
    			case PAGE_CLEAN:
    				; /* try to free the page below */
    			}
    		}
    
    		/*
    		 * If the page has buffers, try to free the buffer mappings
    		 * associated with this page. If we succeed we try to free
    		 * the page as well.
    		 *
    		 * We do this even if the page is PageDirty().
    		 * try_to_release_page() does not perform I/O, but it is
    		 * possible for a page to have PageDirty set, but it is actually
    		 * clean (all its buffers are clean).  This happens if the
    		 * buffers were written out directly, with submit_bh(). ext3
    		 * will do this, as well as the blockdev mapping.
    		 * try_to_release_page() will discover that cleanness and will
    		 * drop the buffers and mark the page clean - it can be freed.
    		 *
    		 * Rarely, pages can have buffers and no ->mapping.  These are
    		 * the pages which were not successfully invalidated in
    		 * truncate_complete_page().  We try to drop those buffers here
    		 * and if that worked, and the page is no longer mapped into
    		 * process address space (page_count == 1) it can be freed.
    		 * Otherwise, leave the page on the LRU so it is swappable.
    		 */
    		if (page_has_private(page)) {
    			if (!try_to_release_page(page, sc->gfp_mask))
    				goto activate_locked;
    			if (!mapping && page_count(page) == 1) {
    				unlock_page(page);
    				if (put_page_testzero(page))
    					goto free_it;
    				else {
    					/*
    					 * rare race with speculative reference.
    					 * the speculative reference will free
    					 * this page shortly, so we may
    					 * increment nr_reclaimed here (and
    					 * leave it off the LRU).
    					 */
    					nr_reclaimed++;
    					continue;
    				}
    			}
    		}
    
    		if (!mapping || !__remove_mapping(mapping, page))
    			goto keep_locked;
    
    		/*
    		 * At this point, we have no other references and there is
    		 * no way to pick any more up (removed from LRU, removed
    		 * from pagecache). Can use non-atomic bitops now (and
    		 * we obviously don't have to worry about waking up a process
    		 * waiting on the page lock, because there are no references.
    		 */
    		__clear_page_locked(page);
    free_it:
    		nr_reclaimed++;
    
    		/*
    		 * Is there need to periodically free_page_list? It would
    		 * appear not as the counts should be low
    		 */
    		list_add(&page->lru, &free_pages);
    		continue;
    
    cull_mlocked:
    		if (PageSwapCache(page))
    			try_to_free_swap(page);
    		unlock_page(page);
    		putback_lru_page(page);
    		continue;
    
    activate_locked:
    		/* Not a candidate for swapping, so reclaim swap space. */
    		if (PageSwapCache(page) && vm_swap_full())
    			try_to_free_swap(page);
    		VM_BUG_ON(PageActive(page));
    		SetPageActive(page);
    		pgactivate++;
    keep_locked:
    		unlock_page(page);
    keep:
    		list_add(&page->lru, &ret_pages);
    		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
    	}
    
    	/*
    	 * Tag a zone as congested if all the dirty pages encountered were
    	 * backed by a congested BDI. In this case, reclaimers should just
    	 * back off and wait for congestion to clear because further reclaim
    	 * will encounter the same problem
    	 */
    	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
    		zone_set_flag(zone, ZONE_CONGESTED);
    
    	free_hot_cold_page_list(&free_pages, 1);
    
    	list_splice(&ret_pages, page_list);
    	count_vm_events(PGACTIVATE, pgactivate);
    	mem_cgroup_uncharge_end();
    	*ret_nr_dirty += nr_dirty;
    	*ret_nr_writeback += nr_writeback;
    	return nr_reclaimed;
    }
    

    conclusion
    This post discusses shrink_page_list(), the function that ultimately reclaims pages. An isolated page cannot be reclaimed while it is still referenced by others, and before it can be reclaimed it may need to be unmapped and written back.

