linux/mm/swap_slots.c
Huang Ying 38d8b4e6bd mm, THP, swap: delay splitting THP during swap out
Patch series "THP swap: Delay splitting THP during swapping out", v11.

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Recently, the performance of storage devices has improved so fast that
a single logical CPU cannot saturate the disk bandwidth during page
swap out, even on a high-end server machine, because storage device
performance has improved faster than that of a single logical CPU.
This trend seems unlikely to change in the near future.  On the other
hand, THP is becoming more and more popular because of increased memory
sizes.  So it becomes necessary to optimize THP swap performance.

The advantages of the THP swap support include:

 - Batch the swap operations for the THP to reduce lock
   acquiring/releasing, including allocating/freeing the swap space,
   adding/deleting to/from the swap cache, and writing/reading the swap
   space, etc. This will help improve the performance of the THP swap.

 - The THP swap space read/write will be 2M sequential IO. It is
   particularly helpful for the swap read, which is usually 4k random
   IO. This will improve the performance of the THP swap too.

 - It will help reduce memory fragmentation, especially when the THP is
   heavily used by the applications. 2M of contiguous pages will be
   freed up after the THP is swapped out.

 - It will improve the THP utilization on the system with the swap
   turned on, because khugepaged is quite slow to collapse normal
   pages into a THP. If the THP is split during swapping out, it
   takes quite a long time for the normal pages to collapse back
   into a THP after being swapped in. High THP utilization helps the
   efficiency of the page based memory management too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, the THP swap in
should be turned on only when necessary.  For example, it can be
selected via "always/never/madvise" logic, to be turned on globally,
turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting the THP step by step, and finally to avoid splitting
the THP during swap out altogether and swap the THP out/in as a whole.

As the first step, in this patchset, splitting the huge page is delayed
from almost the first step of swapping out to after allocating the swap
space for the THP and adding the THP into the swap cache.  This will
reduce lock acquiring/releasing for the locks used for the swap cache
management.

With the patchset, the swap out throughput improves 15.5% (from about
3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To test
the sequential swapping out, the test case creates 8 processes, which
sequentially allocate and write to the anonymous pages until the RAM and
part of the swap device is used up.

This patch (of 5):

In this patch, splitting the huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the THP
(Transparent Huge Page) and adding the THP into the swap cache.  This
will batch the corresponding operations and thus improve THP swap out
throughput.

This is the first step for the THP swap optimization.  The plan is to
delay splitting the THP step by step and finally avoid splitting the
THP at all.

In this patch, one swap cluster is used to hold the contents of each THP
swapped out.  So, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on x86_64 architecture (512).  For other
architectures which want such THP swap optimization,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.  In effect, this will enlarge swap cluster size by 2
times on x86_64, which may make it harder to find a free cluster when
the swap space becomes fragmented.  In theory, this may reduce
contiguous swap space allocation and sequential writes.  The
performance test in 0day shows no regressions caused by this.
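
The cluster sizing described above comes down to simple arithmetic.  A
minimal sketch, assuming 4 KiB base pages and 2 MiB PMD-sized huge pages
on x86_64; every sketch_/SKETCH_ name is made up for illustration and is
not a kernel symbol, and the old size of 256 slots is inferred from the
"2 times" statement above:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86_64 geometry; illustrative values, not kernel macros. */
#define SKETCH_PAGE_SHIFT	12	/* 4 KiB base page */
#define SKETCH_HPAGE_PMD_SHIFT	21	/* 2 MiB huge page */
#define SKETCH_OLD_CLUSTER	256	/* inferred pre-patch cluster size, in slots */

/* Number of base-page swap slots needed to back one THP. */
static size_t sketch_thp_cluster_slots(void)
{
	return (size_t)1 << (SKETCH_HPAGE_PMD_SHIFT - SKETCH_PAGE_SHIFT);
}

/* Bytes of swap space covered by a cluster with the given slot count. */
static size_t sketch_cluster_bytes(size_t slots)
{
	/* Guard against overflow of the shift below. */
	assert(slots <= ((size_t)-1 >> SKETCH_PAGE_SHIFT));
	return slots << SKETCH_PAGE_SHIFT;
}

/* Growth factor of the cluster relative to the inferred pre-patch size. */
static size_t sketch_cluster_growth(void)
{
	return sketch_thp_cluster_slots() / SKETCH_OLD_CLUSTER;
}
```

With these assumptions one cluster holds exactly one THP, so a free
cluster is both necessary and sufficient to swap out a THP unsplit.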

In the future of THP swap optimization, some information of the swapped
out THP (such as compound map count) will be recorded in the
swap_cluster_info data structure.

The mem cgroup swap accounting functions are enhanced to support
charging or uncharging a swap cluster backing a THP as a whole.

The swap cluster allocate/free functions are added to allocate/free a
swap cluster for a THP.  A fairly simple algorithm is used for swap
cluster allocation, that is, only the first swap device in the priority
list will be tried to allocate the swap cluster.  The function will fail
if the attempt is not successful, and the caller will fall back to
allocating a single swap slot instead.  This works well enough for
normal cases.  If the difference in the number of free swap clusters
among multiple swap devices is significant, it is possible that some
THPs are split earlier than necessary.  For example, this could be
caused by a big size difference among multiple swap devices.
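
The allocation policy above can be modelled in user space.  This is only
a sketch under simplifying assumptions: each device is reduced to two
counters, the priority list is a plain array, and all sketch_ names are
hypothetical rather than kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CLUSTER_SLOTS 512	/* slots per cluster, as on x86_64 */

/* Hypothetical, heavily simplified view of one swap device. */
struct sketch_swap_dev {
	size_t free_clusters;	/* completely free clusters */
	size_t free_slots;	/* free single slots, clusters included */
};

/* Cluster allocation: only the first device in priority order is tried. */
static bool sketch_alloc_cluster(struct sketch_swap_dev *devs, size_t ndevs)
{
	if (ndevs == 0 || devs[0].free_clusters == 0)
		return false;
	assert(devs[0].free_slots >= SKETCH_CLUSTER_SLOTS);
	devs[0].free_clusters--;
	devs[0].free_slots -= SKETCH_CLUSTER_SLOTS;
	return true;
}

/* Single-slot allocation: walk every device in priority order. */
static bool sketch_alloc_slot(struct sketch_swap_dev *devs, size_t ndevs)
{
	size_t i;

	for (i = 0; i < ndevs; i++) {
		if (devs[i].free_slots > 0) {
			devs[i].free_slots--;
			return true;
		}
	}
	return false;
}

enum sketch_result { SKETCH_THP_WHOLE, SKETCH_THP_SPLIT, SKETCH_NO_SWAP };

/* Caller side: fall back to a single slot (and a split) on failure. */
static enum sketch_result sketch_swap_out_thp(struct sketch_swap_dev *devs,
					      size_t ndevs)
{
	if (sketch_alloc_cluster(devs, ndevs))
		return SKETCH_THP_WHOLE;
	if (sketch_alloc_slot(devs, ndevs))
		return SKETCH_THP_SPLIT;
	return SKETCH_NO_SWAP;
}
```

In this model, a first device with no free cluster forces a split even
when a later device still has free clusters, which is the "split earlier
than necessary" case mentioned above.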

The swap cache functions are enhanced to support adding/deleting a THP
to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may
be enhanced in the future with a multi-order radix tree.  But because we
will split the THP soon during swapping out, that optimization doesn't
make much sense for this first step.
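
To illustrate what "a set of (HPAGE_PMD_NR) sub-pages" means for the
swap cache, here is a user-space sketch.  It assumes a flat array
indexed by swap offset in place of the kernel's radix tree, and all
sketch_ names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_HPAGE_PMD_NR	512	/* sub-pages per THP on x86_64 */
#define SKETCH_CACHE_SLOTS	2048	/* size of the toy swap area */

struct sketch_page { int id; };

/* Toy swap cache: one page pointer per swap offset. */
struct sketch_swap_cache {
	struct sketch_page *slot[SKETCH_CACHE_SLOTS];
};

/* Insert all sub-pages of a THP at consecutive offsets, or none at all. */
static bool sketch_cache_add_thp(struct sketch_swap_cache *c, size_t first,
				 struct sketch_page *head)
{
	size_t i;

	if (first > SKETCH_CACHE_SLOTS - SKETCH_HPAGE_PMD_NR)
		return false;
	/* Check the whole range first so a failure leaves no partial state. */
	for (i = 0; i < SKETCH_HPAGE_PMD_NR; i++)
		if (c->slot[first + i])
			return false;
	for (i = 0; i < SKETCH_HPAGE_PMD_NR; i++)
		c->slot[first + i] = head + i;
	return true;
}

/* Remove all sub-pages of a THP that was added at offset 'first'. */
static void sketch_cache_del_thp(struct sketch_swap_cache *c, size_t first)
{
	size_t i;

	assert(first <= SKETCH_CACHE_SLOTS - SKETCH_HPAGE_PMD_NR);
	for (i = 0; i < SKETCH_HPAGE_PMD_NR; i++)
		c->slot[first + i] = NULL;
}
```

Because every sub-page has its own entry, a later split of the THP
leaves each 4k page already reachable by its own swap offset.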

The THP splitting functions are enhanced to support splitting a THP in
the swap cache during swapping out.  The page lock will be held while
allocating the swap cluster, adding the THP into the swap cache and
splitting the THP.  So in code paths other than swapping out, if the THP
needs to be split, PageSwapCache(THP) will always be false.

The swap cluster is only available for SSD, so the THP swap optimization
in this patchset has no effect for HDD.

[ying.huang@intel.com: fix two issues in THP optimize patch]
  Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00

/*
 * Manage cache of swap slots to be used for and returned from
 * swap.
 *
 * Copyright(c) 2016 Intel Corporation.
 *
 * Author: Tim Chen <tim.c.chen@linux.intel.com>
 *
 * We allocate the swap slots from the global pool and put
 * them into local per cpu caches.  This has the advantage
 * of not needing to acquire the swap_info lock every time
 * we need a new slot.
 *
 * There is also an opportunity to simply return the slot
 * to local caches without needing to acquire the swap_info
 * lock.  We do not reuse the returned slots directly but
 * move them back to the global pool in a batch.  This
 * allows the slots to coalesce and reduces fragmentation.
 *
 * The swap entry allocated is marked with SWAP_HAS_CACHE
 * flag in map_count that prevents it from being allocated
 * again from the global pool.
 *
 * The swap slots cache is protected by a mutex instead of
 * a spin lock as when we search for slots with scan_swap_map,
 * we can possibly sleep.
 */
#include <linux/swap_slots.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/vmalloc.h>
#include <linux/mutex.h>
#include <linux/mm.h>

#ifdef CONFIG_SWAP

static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
static bool swap_slot_cache_active;
bool swap_slot_cache_enabled;
static bool swap_slot_cache_initialized;
DEFINE_MUTEX(swap_slots_cache_mutex);
/* Serialize swap slots cache enable/disable operations */
DEFINE_MUTEX(swap_slots_cache_enable_mutex);

static void __drain_swap_slots_cache(unsigned int type);
static void deactivate_swap_slots_cache(void);
static void reactivate_swap_slots_cache(void);

#define use_swap_slot_cache (swap_slot_cache_active && \
		swap_slot_cache_enabled && swap_slot_cache_initialized)
#define SLOTS_CACHE 0x1
#define SLOTS_CACHE_RET 0x2

static void deactivate_swap_slots_cache(void)
{
	mutex_lock(&swap_slots_cache_mutex);
	swap_slot_cache_active = false;
	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
	mutex_unlock(&swap_slots_cache_mutex);
}

static void reactivate_swap_slots_cache(void)
{
	mutex_lock(&swap_slots_cache_mutex);
	swap_slot_cache_active = true;
	mutex_unlock(&swap_slots_cache_mutex);
}

/*
 * Must not be called with the cpu hotplug lock held.
 *
 * Returns with swap_slots_cache_enable_mutex held; the caller releases
 * it through reenable_swap_slots_cache_unlock().
 */
void disable_swap_slots_cache_lock(void)
{
	mutex_lock(&swap_slots_cache_enable_mutex);
	swap_slot_cache_enabled = false;
	if (swap_slot_cache_initialized) {
		/* serialize with cpu hotplug operations */
		get_online_cpus();
		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
		put_online_cpus();
	}
}

static void __reenable_swap_slots_cache(void)
{
	swap_slot_cache_enabled = has_usable_swap();
}

void reenable_swap_slots_cache_unlock(void)
{
	__reenable_swap_slots_cache();
	mutex_unlock(&swap_slots_cache_enable_mutex);
}

static bool check_cache_active(void)
{
	long pages;

	if (!swap_slot_cache_enabled || !swap_slot_cache_initialized)
		return false;

	pages = get_nr_swap_pages();
	if (!swap_slot_cache_active) {
		if (pages > num_online_cpus() *
		    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE)
			reactivate_swap_slots_cache();
		goto out;
	}

	/* if the global pool of free swap slots is too low, deactivate cache */
	if (pages < num_online_cpus() * THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE)
		deactivate_swap_slots_cache();
out:
	return swap_slot_cache_active;
}

static int alloc_swap_slot_cache(unsigned int cpu)
{
	struct swap_slots_cache *cache;
	swp_entry_t *slots, *slots_ret;

	/*
	 * Do allocation outside swap_slots_cache_mutex
	 * as kvzalloc could trigger reclaim and get_swap_page,
	 * which can lock swap_slots_cache_mutex.
	 */
	slots = kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE,
			 GFP_KERNEL);
	if (!slots)
		return -ENOMEM;

	slots_ret = kvzalloc(sizeof(swp_entry_t) * SWAP_SLOTS_CACHE_SIZE,
			     GFP_KERNEL);
	if (!slots_ret) {
		kvfree(slots);
		return -ENOMEM;
	}

	mutex_lock(&swap_slots_cache_mutex);
	cache = &per_cpu(swp_slots, cpu);
	if (cache->slots || cache->slots_ret)
		/* cache already allocated */
		goto out;
	if (!cache->lock_initialized) {
		mutex_init(&cache->alloc_lock);
		spin_lock_init(&cache->free_lock);
		cache->lock_initialized = true;
	}
	cache->nr = 0;
	cache->cur = 0;
	cache->n_ret = 0;
	cache->slots = slots;
	slots = NULL;
	cache->slots_ret = slots_ret;
	slots_ret = NULL;
out:
	mutex_unlock(&swap_slots_cache_mutex);
	if (slots)
		kvfree(slots);
	if (slots_ret)
		kvfree(slots_ret);
	return 0;
}

static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
				  bool free_slots)
{
	struct swap_slots_cache *cache;
	swp_entry_t *slots = NULL;

	cache = &per_cpu(swp_slots, cpu);
	if ((type & SLOTS_CACHE) && cache->slots) {
		mutex_lock(&cache->alloc_lock);
		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
		cache->cur = 0;
		cache->nr = 0;
		if (free_slots && cache->slots) {
			kvfree(cache->slots);
			cache->slots = NULL;
		}
		mutex_unlock(&cache->alloc_lock);
	}
	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
		spin_lock_irq(&cache->free_lock);
		swapcache_free_entries(cache->slots_ret, cache->n_ret);
		cache->n_ret = 0;
		if (free_slots && cache->slots_ret) {
			slots = cache->slots_ret;
			cache->slots_ret = NULL;
		}
		spin_unlock_irq(&cache->free_lock);
		if (slots)
			kvfree(slots);
	}
}

static void __drain_swap_slots_cache(unsigned int type)
{
	unsigned int cpu;

	/*
	 * This function is called during
	 *	1) swapoff, when we have to make sure no
	 *	   left over slots are in cache when we remove
	 *	   a swap device;
	 *	2) disabling of swap slot cache, when we run low
	 *	   on swap slots when allocating memory and need
	 *	   to return swap slots to global pool.
	 *
	 * We cannot acquire cpu hot plug lock here as
	 * this function can be invoked in the cpu
	 * hot plug path:
	 * cpu_up -> lock cpu_hotplug -> cpu hotplug state callback
	 *   -> memory allocation -> direct reclaim -> get_swap_page
	 *   -> drain_swap_slots_cache
	 *
	 * Hence the loop over current online cpu below could miss cpu that
	 * is being brought online but not yet marked as online.
	 * That is okay as we do not schedule and run anything on a
	 * cpu before it has been marked online. Hence, we will not
	 * fill any swap slots in slots cache of such cpu.
	 * There are no slots on such cpu that need to be drained.
	 */
	for_each_online_cpu(cpu)
		drain_slots_cache_cpu(cpu, type, false);
}

static int free_slot_cache(unsigned int cpu)
{
	mutex_lock(&swap_slots_cache_mutex);
	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
	mutex_unlock(&swap_slots_cache_mutex);
	return 0;
}

/*
 * Always returns 0: if the hotplug callbacks cannot be registered, a
 * warning is printed once and swap keeps working without the slots cache.
 */
int enable_swap_slots_cache(void)
{
	int ret = 0;

	mutex_lock(&swap_slots_cache_enable_mutex);
	if (swap_slot_cache_initialized) {
		__reenable_swap_slots_cache();
		goto out_unlock;
	}

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache",
				alloc_swap_slot_cache, free_slot_cache);
	if (WARN_ONCE(ret < 0, "Cache allocation failed (%s), operating "
			       "without swap slots cache.\n", __func__))
		goto out_unlock;

	swap_slot_cache_initialized = true;
	__reenable_swap_slots_cache();
out_unlock:
	mutex_unlock(&swap_slots_cache_enable_mutex);
	return 0;
}

/* called with swap slot cache's alloc lock held */
static int refill_swap_slots_cache(struct swap_slots_cache *cache)
{
	if (!use_swap_slot_cache || cache->nr)
		return 0;

	cache->cur = 0;
	if (swap_slot_cache_active)
		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, false,
					   cache->slots);

	return cache->nr;
}

int free_swap_slot(swp_entry_t entry)
{
	struct swap_slots_cache *cache;

	cache = &get_cpu_var(swp_slots);
	if (use_swap_slot_cache && cache->slots_ret) {
		spin_lock_irq(&cache->free_lock);
		/* Swap slots cache may be deactivated before acquiring lock */
		if (!use_swap_slot_cache) {
			spin_unlock_irq(&cache->free_lock);
			goto direct_free;
		}
		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
			/*
			 * Return slots to global pool.
			 * The current swap_map value is SWAP_HAS_CACHE.
			 * Set it to 0 to indicate it is available for
			 * allocation in global pool
			 */
			swapcache_free_entries(cache->slots_ret, cache->n_ret);
			cache->n_ret = 0;
		}
		cache->slots_ret[cache->n_ret++] = entry;
		spin_unlock_irq(&cache->free_lock);
	} else {
direct_free:
		swapcache_free_entries(&entry, 1);
	}
	put_cpu_var(swp_slots);

	return 0;
}

swp_entry_t get_swap_page(struct page *page)
{
	swp_entry_t entry, *pentry;
	struct swap_slots_cache *cache;

	entry.val = 0;

	/*
	 * A THP asks for a whole swap cluster and bypasses the per-CPU
	 * slots cache.  entry.val stays 0 if no cluster could be
	 * allocated or CONFIG_THP_SWAP is disabled.
	 */
	if (PageTransHuge(page)) {
		if (IS_ENABLED(CONFIG_THP_SWAP))
			get_swap_pages(1, true, &entry);
		return entry;
	}

	/*
	 * Preemption is allowed here, because we may sleep
	 * in refill_swap_slots_cache().  But it is safe, because
	 * accesses to the per-CPU data structure are protected by the
	 * mutex cache->alloc_lock.
	 *
	 * The alloc path here does not touch cache->slots_ret
	 * so cache->free_lock is not taken.
	 */
	cache = raw_cpu_ptr(&swp_slots);

	if (check_cache_active()) {
		mutex_lock(&cache->alloc_lock);
		if (cache->slots) {
repeat:
			if (cache->nr) {
				pentry = &cache->slots[cache->cur++];
				entry = *pentry;
				pentry->val = 0;
				cache->nr--;
			} else {
				if (refill_swap_slots_cache(cache))
					goto repeat;
			}
		}
		mutex_unlock(&cache->alloc_lock);
		if (entry.val)
			return entry;
	}

	get_swap_pages(1, false, &entry);

	return entry;
}

#endif /* CONFIG_SWAP */
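
The batching scheme described in the header comment of this file can be
tried out in user space.  The following is a minimal single-threaded
sketch, not the kernel implementation: the global pool is reduced to a
counter, locking is left out, the cache size is shrunk to 4, and every
sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_SIZE 4	/* tiny stand-in for SWAP_SLOTS_CACHE_SIZE */

/* Toy global pool: hands out increasing slot ids, 0 means "no slot". */
struct sketch_pool {
	unsigned long next;	/* next id to hand out, starts at 1 */
	size_t nr_free;		/* slots still available */
	size_t refills;		/* batched allocations from the pool */
	size_t flushes;		/* batched returns to the pool */
};

/* Toy per-CPU cache with separate allocation and return arrays. */
struct sketch_cache {
	unsigned long slots[SKETCH_CACHE_SIZE];
	size_t cur, nr;
	unsigned long slots_ret[SKETCH_CACHE_SIZE];
	size_t n_ret;
};

/* Take up to 'want' slots from the pool in one trip. */
static size_t sketch_pool_get(struct sketch_pool *pool, size_t want,
			      unsigned long *out)
{
	size_t n = want < pool->nr_free ? want : pool->nr_free;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = pool->next++;
	pool->nr_free -= n;
	if (n)
		pool->refills++;
	return n;
}

/* Give 'n' slots back in one trip; the toy pool only counts them. */
static void sketch_pool_put(struct sketch_pool *pool, size_t n)
{
	pool->nr_free += n;
	pool->flushes++;
}

/* Allocate one slot, refilling the local array only when it is empty. */
static unsigned long sketch_get_slot(struct sketch_cache *cache,
				     struct sketch_pool *pool)
{
	if (cache->nr == 0) {
		cache->cur = 0;
		cache->nr = sketch_pool_get(pool, SKETCH_CACHE_SIZE,
					    cache->slots);
	}
	if (cache->nr == 0)
		return 0;
	cache->nr--;
	return cache->slots[cache->cur++];
}

/* Free one slot locally; flush the return array only when it is full. */
static void sketch_free_slot(struct sketch_cache *cache,
			     struct sketch_pool *pool, unsigned long slot)
{
	if (cache->n_ret >= SKETCH_CACHE_SIZE) {
		sketch_pool_put(pool, cache->n_ret);
		cache->n_ret = 0;
	}
	assert(cache->n_ret < SKETCH_CACHE_SIZE);
	cache->slots_ret[cache->n_ret++] = slot;
}
```

Freed slots sit in slots_ret and are never handed out again directly,
which mirrors the "do not reuse the returned slots directly" rule in
the header comment.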