Commit graph

21710 commits

Author SHA1 Message Date
Marc Zyngier c466595c41 irqdomain: Make irq_domain_alloc_irqs_recursive available
We are soon going to need the MSI layer to call into the domain
allocators. Instead of open coding this, make the standard
irq_domain_alloc_irqs_recursive function available to the MSI
layer.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-12-16 15:29:44 +00:00
Rami Rosen fccd3af571 cgroup_pids: fix a typo.
This patch fixes a typo in pids_charge() method.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-14 14:54:37 -05:00
Tejun Heo 3fa4cc9c2d net, cgroup: cgroup_sk_updat_lock was missing initializer
bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup") added global
spinlock cgroup_sk_update_lock but erroneously skipped initializer
leading to uninitialized spinlock warning.  Fix it by using
DEFINE_SPINLOCK().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14 14:20:33 -05:00
Thomas Gleixner 425a5072dc genirq: Free irq_desc with rcu
The new VMD device driver needs to iterate over a list of
"demultiplexing" interrupts. Protecting that list with a lock is not
possible because the list is also required in code pathes which hold
irq descriptor lock. Therefor the demultiplexing interrupt handler
would create a lock inversion scenario if it calls a demux handler
with the list protection lock held.

A solution for this is to free the irq descriptor via RCU, so the
list can be walked with rcu read lock held.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
2015-12-14 10:03:46 +01:00
Thomas Gleixner abc7e40c81 genirq: Prevent chip buslock deadlock
If a interrupt chip utilizes chip->buslock then free_irq() can
deadlock in the following way:

CPU0				CPU1
				interrupt(X) (Shared or spurious)
free_irq(X)			interrupt_thread(X)
chip_bus_lock(X)
				   irq_finalize_oneshot(X)
				     chip_bus_lock(X)
synchronize_irq(X)
	
synchronize_irq() waits for the interrupt thread to complete,
i.e. forever.

Solution is simple: Drop chip_bus_lock() before calling
synchronize_irq() as we do with the irq_desc lock. There is nothing to
be protected after the point where irq_desc lock has been released.

This adds chip_bus_lock/unlock() to the remove_irq() code path, but
that's actually correct in the case where remove_irq() is called on
such an interrupt. The current users of remove_irq() are not affected
as none of those interrupts is on a chip which requires buslock.

Reported-by: Fredrik Markström <fredrik.markstrom@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2015-12-14 09:45:06 +01:00
Ingo Molnar 057032e457 Linux 4.4-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWbh6rAAoJEHm+PkMAQRiG2L4H/AmgfAPX8vYQTKXAnpxt7cLx
 2W9C+fyKRcHzqBCsUB4wxPwYwDGPmEZHo4Y/evaSdUbPhngRRRfEVZpzbTUqheEm
 A7h/1mCl5gcaGRzDNTAST3vfmNmwOHAWC+Ch3ZuuzxH+brtY0Ynb32CNa1XnmW9K
 7qknzpgyE3ZNQgwKzZ7F/+TscGcslalKRoAxPa7fumb1srW/Z04aGXYZdEQxOhow
 6Oc2op0IijTse5TdfW/MsbpvbH2uBLnQcYHvKXJ0wRmnQGeowguLSgyW356EAOQ9
 7L7xCvXyX5dFakZ9EApT3wsSP+cS7jUInDLDAxf14gyODrnl65KO3LU8vmZbJmI=
 =l5Rq
 -----END PGP SIGNATURE-----

Merge tag 'v4.4-rc5' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-14 09:31:23 +01:00
Peter Zijlstra dfd01f0260 sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed.  We must
instead pass the initial state along and use that.

Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-13 14:30:59 -08:00
Daniel Borkmann bb35a6ef7d bpf, inode: allow for rename and link ops
Add support for renaming and hard links to the fs. Most of this can be
implemented by using simple library operations under the same constraints
that we don't use a reserved name like elsewhere. Linking can be useful
to share/manage things like maps across subsystem users. It works within
the file system boundary, but is not allowed for directories.

Symbolic links are explicitly not implemented here, as it can be better
done already by doing bind mounts inside bpf fs to set up shared directories
f.e. useful when using volumes in docker containers that map a private
working directory into /sys/fs/bpf/ which contains itself a bind mounted
path from the host's /sys/fs/bpf/ mount that is shared among multiple
containers. For single maps instead of whole directory, hard links can
be easily used to do the same.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12 18:44:23 -05:00
Chris Wilson 86fffe4a61 kernel: remove stop_machine() Kconfig dependency
Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable.  This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.

For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.

This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.

Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
John Stultz 37cf4dc337 time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow
For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
sane. We might multiply them later which can cause an overflow
and undefined behavior.

This patch introduces new helper functions to simplify the
checking code and adds comments to clarify

Orginally this patch was by Sasha Levin, but I've basically
rewritten it, so he should get credit for finding the issue
and I should get the blame for any mistakes made since.

Also, credit to Richard Cochran for the phrasing used in the
comment for what is considered valid here.

Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-10 22:41:06 -08:00
Sasha Levin 52d189f1b3 ntp: Verify offset doesn't overflow in ntp_update_offset
We need to make sure that the offset is valid before manipulating it,
otherwise it might overflow on the multiplication.

Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Reworked one of the checks so it makes more sense]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-10 22:41:05 -08:00
Tejun Heo bd1060a1d6 sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound.  As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.

net_cls and net_prio controllers are examples of the latter.  They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical.  Both inherit
configuration from the parent on creation but there's no interaction
afterwards.  An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level.  net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used.  Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid.  This is to avoid adding yet another cgroup related field
to struct sock.

As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead.  It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs.  Non-critical inaccuracies from small race windows won't
make any noticeable difference.

This patch doesn't make use of the pointer yet.  The following patch
will implement netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
    cgroup specific field.

v3: Add comments explaining why sock_data_prioidx() and
    sock_data_classid() use different fallback values.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08 22:02:33 -05:00
David S. Miller bc9b145a09 Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Preparatory changes for some new socket cgroup infrastructure
and netfilter targets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08 22:01:38 -05:00
Linus Torvalds 5406812e59 Merge branch 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "More change than I'd have liked at this stage.  The pids controller
  and the changes made to cgroup core to support it introduced and
  revealed several important issues.

   - Assigning membership to a newly created task and migrating it can
     race leading to incorrect accounting.  Oleg fixed it by widening
     threadgroup synchronization.  It looks like we'll be able to merge
     it with a different percpu rwsem which is used in fork path making
     things simpler and cheaper.

   - The recent change to extend cgroup membership to zombies (so that
     pid accounting can extend till the pid is actually released) missed
     pinning the underlying data structures leading to use-after-free.
     Fixed.

   - v2 hierarchy was calling subsystem callbacks with the wrong target
     cgroup_subsys_state based on the incorrect assumption that they
     share the same target.  pids is the first controller affected by
     this.  Subsys callbacks updated so that they can deal with
     multi-target migrations"

* 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup_pids: don't account for the root cgroup
  cgroup: fix handling of multi-destination migration from subtree_control enabling
  cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
  cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
  cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
  cgroup: make css_set pin its css's to avoid use-afer-free
  cgroup: fix cftype->file_offset handling
2015-12-08 13:35:52 -08:00
Linus Torvalds 51825c8a86 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "This tree includes four core perf fixes for misc bugs, three fixes to
  x86 PMU drivers, and two updates to old email addresses"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Do not send exit event twice
  perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
  perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
  perf: Fix PERF_EVENT_IOC_PERIOD deadlock
  treewide: Remove old email address
  perf/x86: Fix LBR call stack save/restore
  perf: Update email address in MAINTAINERS
  perf/core: Robustify the perf_cgroup_from_task() RCU checks
  perf/core: Fix RCU problem with cgroup context switching code
2015-12-08 13:01:23 -08:00
Tejun Heo 82607adcf9 workqueue: implement lockup detector
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING.  These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.

To alleviate the situation, this patch implements workqueue lockup
detector.  It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.

 BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
 Showing busy workqueues and worker pools:
 workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
     pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
 workqueue events_power_efficient: flags=0x80
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
     pending: check_lifetime, neigh_periodic_work
 workqueue cgroup_pidlist_destroy: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     pending: cgroup_pidlist_destroy_work_fn
 ...

The detection mechanism is controller through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.

v2: Decoupled from softlockup control knobs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:29:47 -05:00
Tejun Heo 03e0d4610b watchdog: introduce touch_softlockup_watchdog_sched()
touch_softlockup_watchdog() is used to tell watchdog that scheduler
stall is expected.  One group of usage is from paths where the task
may not be able to yield for a long time such as performing slow PIO
to finicky device and coming out of suspend.  The other is to account
for scheduler and timer going idle.

For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, workqueue lockup detector is planned and it
can use the same signals from the former group while the latter would
spuriously prevent detection.  This patch introduces a new function
touch_softlockup_watchdog_sched() and convert the latter group to call
it instead.  For now, it just calls touch_softlockup_watchdog() and
there's no functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 11:29:42 -05:00
Tejun Heo fca839c00a workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue
Task or work item involved in memory reclaim trying to flush a
non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
deadlock.  Trigger WARN_ONCE() if such conditions are detected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2015-12-08 11:29:36 -05:00
Thomas Petazzoni f0cb322073 genirq: Implement irq_percpu_is_enabled()
Certain interrupt controller drivers have a register set that does not
make it easy to save/restore the mask of enabled/disabled interrupts
at suspend/resume time. At resume time, such drivers rely on the core
kernel irq subsystem to tell whether such or such interrupt is enabled
or not, in order to restore the proper state in the interrupt
controller register.

While the irqd_irq_disabled() provides the relevant information for
global interrupts, there is no similar function to query the
enabled/disabled state of a per-CPU interrupt.

Therefore, this commit complements the percpu_irq API with an
irq_percpu_is_enabled() function.

[ tglx: Simplified the implementation and added kerneldoc ]

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Tawfik Bayouk <tawfik@marvell.com>
Cc: Nadav Haklai <nadavh@marvell.com>
Cc: Lior Amsalem <alior@marvell.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1445347435-2333-2-git-send-email-thomas.petazzoni@free-electrons.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-08 12:53:29 +01:00
Paul E. McKenney 648c630c64 Merge branches 'doc.2015.12.05a', 'exp.2015.12.07a', 'fixes.2015.12.07a', 'list.2015.12.04b' and 'torture.2015.12.05a' into HEAD
doc.2015.12.05a:  Documentation updates
exp.2015.12.07a:  Expedited grace-period updates
fixes.2015.12.07a:  Miscellaneous fixes
list.2015.12.04b:  Linked-list updates
torture.2015.12.05a:  Torture-test updates
2015-12-07 17:02:54 -08:00
Paul E. McKenney 45fed3e7cf rcu: Make rcu_gp_init() be bool rather than int
The return value from rcu_gp_init() is always used as a bool, so
this commit makes it be a bool.

Reported-by: Iftekhar Ahmed <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:33 -08:00
Peter Zijlstra e11f13355b rcu: Move wakeup out from under rnp->lock
This patch removes a potential deadlock hazard by moving the
wake_up_process() in rcu_spawn_gp_kthread() out from under rnp->lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:32 -08:00
Paul E. McKenney 7c9906ca5e rcu: Don't redundantly disable irqs in rcu_irq_{enter,exit}()
This commit replaces a local_irq_save()/local_irq_restore() pair with
a lockdep assertion that interrupts are already disabled.  This should
remove the corresponding overhead from the interrupt entry/exit fastpaths.

This change was inspired by the fact that Iftekhar Ahmed's mutation
testing showed that removing rcu_irq_enter()'s call to local_ird_restore()
had no effect, which might indicate that interrupts were always enabled
anyway.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:31 -08:00
Paul E. McKenney d117c8aa1d rcu: Make cpu_needs_another_gp() be bool
The cpu_needs_another_gp() function is currently of type int, but only
returns zero or one.  Bow to reality and make it be of type bool.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:31 -08:00
Paul E. McKenney a87f203e27 rcu: Eliminate unused rcu_init_one() argument
Now that the rcu_state structure's ->rda field is compile-time initialized,
there is no need to pass the per-CPU rcu_data structure into rcu_init_one().
This commit therefore eliminates this now-unused parameter.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-07 17:01:19 -08:00
Paul E. McKenney 79cfea0273 rcu: Remove TINY_RCU bloat from pointless boot parameters
The rcu_expedited, rcu_normal, and rcu_normal_after_boot kernel boot
parameters are pointless in the case of TINY_RCU because in that case
synchronous grace periods, both expedited and normal, are no-ops.
However, these three symbols contribute several hundred bytes of bloat.
This commit therefore uses CPP directives to avoid compiling this code
in TINY_RCU kernels.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-12-07 16:59:37 -08:00
Tejun Heo 177493987c Merge branch 'for-4.5-ancestor-test' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.5
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-07 17:24:10 -05:00
Seiichi Ikarashi 390dd67c47 clocksource: Add CPU info to clocksource watchdog reporting
The clocksource watchdog reporting was improved by 0b046b217a.
I want to add the info of CPU where the watchdog detects a
deviation because it is necessary to identify the trouble spot
if the clocksource is TSC.

Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-07 12:15:28 -08:00
David Gibson 35a4933a89 time: Avoid signed overflow in timekeeping_get_ns()
1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
of the same logic to avoid keeping a semi-redundant struct timespec
in struct timekeeper.

However, the commit also introduced a subtle semantic change - where
clocksource_cyc2ns() uses purely unsigned math, the new version introduces
a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
overflow the following shift will still give a negative result.  The
choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
generally happen if there's a ~10 minute pause in examining the
clocksource.

This can be triggered on a powerpc KVM guest by stopping it from qemu for
a bit over 10 minutes.  After resuming time has jumped backwards several
minutes causing numerous problems (jiffies does not advance, msleep()s can
be extended by minutes..).  It doesn't happen on x86 KVM guests, because
the guest TSC is effectively frozen while the guest is stopped, which is
not the case for the powerpc timebase.

Obviously an unsigned (64 bit) overflow will only take twice as long as a
signed, 63-bit overflow.  I don't know the time code well enough to know
if that will still cause incorrect calculations, or if a 64-bit overflow
is avoided elsewhere.

Still, an incorrect forwards clock adjustment will cause less trouble than
time going backwards.  So, this patch removes the potential for
intermediate signed overflow.

Cc: stable@vger.kernel.org  (3.7+)
Suggested-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-07 11:43:22 -08:00
Tejun Heo 0b98f0c042 Merge branch 'master' into for-4.4-fixes
The following commit which went into mainline through networking tree

  3b13758f51 ("cgroups: Allow dynamically changing net_classid")

conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.

  1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths.  The latter
drops @css from cgrp_attach().

Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task.  We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Nina Schiff <ninasc@fb.com>
2015-12-07 10:09:03 -05:00
Linus Torvalds fb7b26e47e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "This updates contains the following changes:

   - Fix a signal handling regression in the bit wait functions.

   - Avoid false positive warnings in the wakeup path.

   - Initialize the scheduler root domain properly.

   - Handle gtime calculations in proc/$PID/stat proper.

   - Add more documentation for the barriers in try_to_wake_up().

   - Fix a subtle race in try_to_wake_up() which might cause a task to
     be scheduled on two cpus

   - Compile static helper function only when it is used"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
  sched/core: Better document the try_to_wake_up() barriers
  sched/cputime: Fix invalid gtime in proc
  sched/core: Clear the root_domain cpumasks in init_rootdomain()
  sched/core: Remove false-positive warning from wake_up_process()
  sched/wait: Fix signal handling in bit wait helpers
  sched/rt: Hide the push_irq_work_func() declaration
2015-12-06 08:35:05 -08:00
Peter Zijlstra 0017960f38 perf/core: Collapse common IPI pattern
Various functions implement the same pattern to send IPIs to an
event's CPU. Collapse the easy ones in a common helper function to
reduce duplication.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06 12:55:48 +01:00
Jiri Olsa 4e93ad601a perf: Do not send exit event twice
In case we monitor events system wide, we get EXIT event
(when configured) twice for each task that exited.

Note doubled lines with same pid/tid in following example:

  $ sudo ./perf record -a
  ^C[ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
  $ sudo ./perf report -D | grep EXIT

  0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
  0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
  0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
  0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
  1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
  1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
  2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
  2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)

The reason is that the cpu contexts are processes each time
we call perf_event_task. I'm changing the perf_event_aux logic
to serve task_ctx and cpu contexts separately, which ensure we
don't get EXIT event generated twice on same cpu context.

This does not affect other auxiliary events, as they don't
use task_ctx at all.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06 12:54:49 +01:00
Paul E. McKenney 6b50e119c4 rcutorture: Print symbolic name for ->gp_state
Currently, ->gp_state is printed as an integer, which slows debugging.
This commit therefore prints a symbolic name in addition to the integer.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-12-05 17:58:26 -08:00
Paul E. McKenney 18aff33e73 rcutorture: Print symbolic name for rcu_torture_writer_state
Currently, rcu_torture_writer_state is printed as an integer, which slows
debugging.  This commit therefore prints a symbolic name in addition to
the integer.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-12-05 17:58:22 -08:00
Paul E. McKenney b1adb3e273 rcutorture: Dump stack when GP kthread stalls
This commit increases debug information in the case where the grace-period
kthread is being prevented from running by dumping that kthread's stack.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Split into prior commit and this commit, as suggested by
  Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-12-05 17:58:05 -08:00
Paul E. McKenney a0e3a3aa28 rcutorture: Flag nonexistent RCU GP kthread
Currently, if the RCU grace-period kthread has not yet been created,
in which case the starvation-check code will print zero for the state,
which maps to TASK_RUNNING.  This could clearly be quite confusing, so
this commit prints ~0, which does not map to any legal ->state value.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-12-05 17:58:00 -08:00
Paul E. McKenney 46a5d164db rcu: Stop disabling interrupts in scheduler fastpaths
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal.  Especially given that there are regions of the scheduler that
already have interrupts disabled.

This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
  by Peter Zijlstra. ]
2015-12-04 12:27:31 -08:00
Paul E. McKenney f0f2e7d307 rcu: Avoid tick_nohz_active checks on NOCBs CPUs
Currently, rcu_prepare_for_idle() checks for tick_nohz_active, even on
individual NOCBs CPUs, unless all CPUs are marked as NOCBs CPUs at build
time.  This check is pointless on NOCBs CPUs because they never have any
callbacks posted, given that all of their callbacks are handed off to the
corresponding rcuo kthread.  There is a check for individually designated
NOCBs CPUs, but it pointelessly follows the check for tick_nohz_active.

This commit therefore moves the check for individually designated NOCBs
CPUs up with the check for CONFIG_RCU_NOCB_CPU_ALL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:31 -08:00
Paul E. McKenney 699d403520 rcu: Fix obsolete rcu_bootup_announce_oddness() comment
This function no longer has #ifdefs, so this commit removes the
header comment calling them out.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:30 -08:00
Paul E. McKenney 8ba9153b2c rcu: Remove lock-acquisition loop from rcu_read_unlock_special()
Several releases have come and gone without the warning triggering,
so remove the lock-acquisition loop.  Retain the WARN_ON_ONCE()
out of sheer paranoia.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:30 -08:00
Paul E. McKenney fecbf6f01f rcu: Simplify rcu_sched_qs() control flow
This commit applies an early-exit approach to rcu_sched_qs(), reducing
the nesting level and saving a line of code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:29 -08:00
Paul Gortmaker 47dbc90663 kernel: Make rcu/tree_trace.c explicitly non-modular
The Kconfig currently controlling compilation of this code is:

init/Kconfig:config TREE_RCU_TRACE
init/Kconfig:   def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU )

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the file there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  We could
consider moving this to an earlier initcall if desired.

We don't replace module.h with init.h since the file already has that.
We also delete the moduleparam.h include that is left over from
commit 64db4cfff9 (""Tree RCU": scalable
classic RCU implementation") since it is not needed here either.

We morph some tags like MODULE_AUTHOR into the comments at the top of
the file for documentation purposes.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:29 -08:00
Paul E. McKenney 3dc5dbe9a1 rcu: Move lock_class_key to local scope
Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[]
arrays needlessly pollute the global namespace within tree.c.  This
commit therefore converts them to static local variables within
rcu_init_one().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:27:29 -08:00
Paul E. McKenney 3e42ec1aa7 rcu: Allow expedited grace periods to be disabled at init
Expedited grace periods can speed up boot, but are undesirable in
aggressive real-time systems.  This commit therefore introduces a
kernel parameter rcupdate.rcu_normal_after_boot that disables
expedited grace periods just before init is spawned.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:54 -08:00
Paul E. McKenney 5a9be7c628 rcu: Add rcu_normal kernel parameter to suppress expediting
Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads.  This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:53 -08:00
Paul E. McKenney 72611ab9f5 rcu: Add more diagnostics to expedited stall warning messages.
This commit adds print statements that check the rcu_node structure to
find which ->expmask bits and which ->exp_tasks structures are blocking
the current expedited grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:53 -08:00
Paul E. McKenney 73f36f9de8 rcu: Make expedited grace periods resolve stall-warning ties
Currently, if a grace period ends just as the stall-warning timeout
fires, an empty stall warning will be printed.  This is not helpful,
so this commit avoids these useless warnings by rechecking completion
after awakening in synchronize_sched_expedited_wait().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:52 -08:00
Paul E. McKenney df5bd5144a rcu: Reduce expedited GP memory contention via per-CPU variables
Currently, the piggybacked-work checks carried out by sync_exp_work_done()
atomically increment a small set of variables (the ->expedited_workdone0,
->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
fields in the rcu_state structure), which will form a memory-contention
bottleneck given a sufficiently large number of CPUs concurrently invoking
either synchronize_rcu_expedited() or synchronize_sched_expedited().

This commit therefore moves these for fields to the per-CPU rcu_data
structure, eliminating the memory contention.  The show_rcuexp() function
also changes to sum up each field in the rcu_data structures.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:52 -08:00
Paul E. McKenney 1307f21487 rcu: Invert sync_rcu_exp_select_cpus() "if" statement
This commit saves a couple lines of code and reduces indentation
by inverting the sense of an "if" statement in the function
sync_rcu_exp_select_cpus().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:51 -08:00
Paul E. McKenney 886ef5a18a rcu: Move smp_mb() from rcu_seq_snap() to rcu_exp_gp_seq_snap()
The memory barrier in rcu_seq_snap() is needed only for grace periods,
so this commit moves it to the grace-period-oriented wrapper
rcu_exp_gp_seq_snap().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:51 -08:00
Paul E. McKenney 1de6e56ddc rcu: Clarify role of ->expmaskinitnext
Analogy with the ->qsmaskinitnext field might lead one to believe that
->expmaskinitnext tracks online CPUs.  This belief is incorrect: Any CPU
that has ever been online will have its bit set in the ->expmaskinitnext
field.  This commit therefore adds a comment to make this clear, at
least to people who read comments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:50 -08:00
Paul E. McKenney 06f60de19d rcu: Short-circuit synchronize_sched_expedited() if only one CPU
If there is only one CPU, then invoking synchronize_sched_expedited()
is by definition a grace period.  This commit checks for this condition
and does a short-circuit return in that case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-12-04 12:26:50 -08:00
Waiman Long cd0272fab7 locking/pvqspinlock: Queue node adaptive spinning
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon as the earlier ones have to be woken up and take
their turn to get the lock.

This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

		    Westmere			Haswell
  Patch		32 vCPUs    48 vCPUs	32 vCPUs    48 vCPUs
  -----		--------    --------    --------    --------
  Before patch   3m02.3s     5m00.2s     1m43.7s     3m03.5s
  After patch    3m03.0s     4m37.5s	 1m43.0s     2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 11:39:51 +01:00
Waiman Long 1c4941fd53 locking/pvqspinlock: Allow limited lock stealing
This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt.  This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.

The pv_wait_head() function was renamed as pv_wait_head_or_lock()
as it was modified to acquire the lock before returning. This is
necessary because of possible lock stealing attempts from other tasks.

Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                    Westmere                    Haswell
  Patch         32 vCPUs    48 vCPUs    32 vCPUs    48 vCPUs
  -----         --------    --------    --------    --------
  Before patch   3m15.6s    10m56.1s     1m44.1s     5m29.1s
  After patch    3m02.3s     5m00.2s     1m43.7s     3m03.5s

For the overcommited case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 11:39:51 +01:00
Waiman Long 45e898b735 locking/pvqspinlock: Collect slowpath lock statistics
This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data which
enable us to calculate the kicking and wakeup latencies which have
a heavy dependency on the CPUs being used.

The statistical counters are per-cpu variables to minimize the
performance overhead in their updates. These counters are exported
via the debugfs filesystem under the qlockstat directory.  When the
corresponding debugfs files are read, summation and computing of the
required data are then performed.

The measured latencies for different CPUs are:

	CPU		Wakeup		Kicking
	---		------		-------
	Haswell-EX	63.6us		 7.4us
	Westmere-EX	67.6us		 9.3us

The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.

A sample of statistical counters after system bootup (with vCPU
overcommit) was:

	pv_hash_hops=1.00
	pv_kick_unlock=1148
	pv_kick_wake=1146
	pv_latency_kick=11040
	pv_latency_wake=194840
	pv_spurious_wakeup=7
	pv_wait_again=4
	pv_wait_head=23
	pv_wait_node=1129

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 11:39:50 +01:00
Waiman Long aa0b7ae063 sched/fair: Disable the task group load_avg update for the root_task_group
Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:49 +01:00
Waiman Long b0367629ac sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline
If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contains variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch was as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:48 +01:00
Waiman Long a426f99c91 sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:47 +01:00
Andi Kleen ed82b8a1ff sched/core: Move the sched_to_prio[] arrays out of line
When building a kernel with a gcc 6 snapshot the compiler complains
about unused const static variables for prio_to_weight and prio_to_mult
for multiple scheduler files (all but core.c and autogroup.c)

The way the array is currently declared it will be duplicated in
every scheduler file that includes sched.h, which seems rather wasteful.

Move the array out of line into core.c. I also added a sched_ prefix
to avoid any potential name space collisions.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:46 +01:00
Frederic Weisbecker b7ce2277f0 sched/cputime: Convert vtime_seqlock to seqcount
The cputime can only be updated by the current task itself, even in
vtime case. So we can safely use seqcount instead of seqlock as there
is no writer concurrency involved.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:46 +01:00
Frederic Weisbecker e592539466 sched/cputime: Introduce vtime accounting check for readers
Readers need to know if vtime runs at all on some CPU somewhere, this
is a fast-path check to determine if we need to check further the need
to add up any tickless cputime delta.

This fast path check uses context tracking state because vtime is tied
to context tracking as of now. This check appears to be confusing though
so lets use a vtime function that deals with context tracking details
in vtime implementation instead.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:45 +01:00
Frederic Weisbecker 55dbdcfa05 sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled()
vtime_accounting_enabled() checks if vtime is running on the current CPU
and is as such a misnomer. Lets rename it to a function that reflect its
locality. We are going to need the current name for a function that tells
if vtime runs at all on some CPU.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:45 +01:00
Frederic Weisbecker cab245d68c sched/cputime: Correctly handle task guest time on housekeepers
When a task runs on a housekeeper (a CPU running with the periodic tick
with neighbours running tickless), it doesn't account cputime using vtime
but relies on the tick. Such a task has its vtime_snap_whence value set
to VTIME_INACTIVE.

Readers won't handle that correctly though. As long as vtime is running
on some CPU, readers incorretly assume that vtime runs on all CPUs and
always compute the tickless cputime delta, which is only junk on
housekeepers.

So lets fix this with checking that the target runs on a vtime CPU through
the appropriate state check before computing the tickless delta.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:44 +01:00
Frederic Weisbecker 7098c1eac7 sched/cputime: Clarify vtime symbols and document them
VTIME_SLEEPING state happens either when:

1) The task is sleeping and no tickless delta is to be added on the task
   cputime stats.
2) The CPU isn't running vtime at all, so the same properties of 1) applies.

Lets rename the vtime symbol to reflect both states.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:44 +01:00
Hiroshi Shimamoto 7877a0ba5e sched/cputime: Remove extra cost in task_cputime()
There is an extra cost in task_cputime() and task_cputime_scaled() when
nohz_full is not activated. When vtime accounting is not enabled, we
don't need to get deltas of utime and stime under vtime seqlock.

This patch removes that cost with adding a shortcut route if vtime
accounting is not enabled.

Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless cputime delta.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:43 +01:00
Byungchul Park ad936d8658 sched/fair: Make it possible to account fair load avg consistently
The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.

This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.

These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:34:42 +01:00
Peter Zijlstra 8643cda549 sched/core, locking: Document Program-Order guarantees
These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.

( This commit is in the locking tree, because the new documentation
  refers to a newly introduced locking primitive. )

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:33:41 +01:00
Peter Zijlstra b3e0b1b6d8 locking, sched: Introduce smp_cond_acquire() and use it
Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:

 - it documents control dependencies,
 - its typically cheaper than using smp_load_acquire() in a loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:33:41 +01:00
Ingo Molnar 829cf31751 Merge branch 'sched/urgent' into locking/core, to pick up scheduler fix we rely on
So we want to change a locking API, but the scheduler uses it, and a conflict
is generated by a recent scheduler fix.

Pick up the pending scheduler fixes to make life easier.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:30:35 +01:00
Ingo Molnar 467386fbbf Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:27:36 +01:00
Peter Zijlstra ecf7d01c22 sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.

        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:26:43 +01:00
Peter Zijlstra b75a225315 sched/core: Better document the try_to_wake_up() barriers
Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:26:42 +01:00
Hiroshi Shimamoto 2541117b0c sched/cputime: Fix invalid gtime in proc
/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.

This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.

Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.

The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.

I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:

	 # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
	S 4348
	R 7064566
	R 7064766
	R 7064967
	R 7065168
	S 4759
	S 4759

During running busy loop, it returns large value.

After applying this patch, we can see right gtime.

	 # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
	S 5338
	R 5365
	R 5465
	R 5566
	R 5666
	S 5726
	S 5726

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:18:49 +01:00
Xunlei Pang 8295c69925 sched/core: Clear the root_domain cpumasks in init_rootdomain()
root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data, this may cause problems. For instance,
When doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated domain rule for clustered scheduling
using cpuset, because the tasks(with all the cpus allowed)
belongs to one root domain can be pulled away into another
root domain.

The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.

Do the same thing for root_domain's other cpumask memembers:
dlo_mask, span, and online.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:16:21 +01:00
Sasha Levin 119d6f6a3b sched/core: Remove false-positive warning from wake_up_process()
Because wakeups can (fundamentally) be late, a task might not be in
the expected state. Therefore testing against a task's state is racy,
and can yield false positives.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: oleg@redhat.com
Fixes: 9067ac85d5 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:10:16 +01:00
Peter Zijlstra 68985633bc sched/wait: Fix signal handling in bit wait helpers
Vladimir reported getting RCU stall warnings and bisected it back to
commit:

  743162013d ("sched: Remove proliferation of wait_on_bit() action functions")

That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal receives while we sleep.

Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce82209 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:10:15 +01:00
Peter Zijlstra 642c2d671c perf: Fix PERF_EVENT_IOC_PERIOD deadlock
Dmitry reported a fairly silly recursive lock deadlock for
PERF_EVENT_IOC_PERIOD, fix this by explicitly doing the inactive part of
__perf_event_period() instead of calling that function.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: c7999c6f3f ("perf: Fix PERF_EVENT_IOC_PERIOD migration race")
Link: http://lkml.kernel.org/r/20151130115615.GJ17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-04 10:08:03 +01:00
zhuo-hao a0e3213f83 alarmtimer: Avoid unexpected rtc interrupt when system resume from S3
Before the system go to suspend (S3), if user create a timer
with clockid CLOCK_REALTIME_ALARM/CLOCK_BOOTTIME_ALARM and set a
"large" timeout value to this timer. The function
alarmtimer_suspend will be called to setup a timeout value to
RTC timer to avoid the system sleep over time. However, if the
system wakeup early than RTC timeout, the RTC timer will not be
cleared. And this will cause the hpet_rtc_interrupt come
unexpectedly until the RTC timeout. To fix this problem, just
adding alarmtimer_resume to cancel the RTC timer.

This was noticed because the HPET RTC emulation fires an
interrupt every 16ms(=1/2^DEFAULT_RTC_SHIFT) up to the point
where the alarm time is reached.

This program always hits this situation
(https://lkml.org/lkml/2015/11/8/326), if system wake up earlier
than alarm time.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com>
[jstultz: Tweak commit subject & formatting slightly]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-12-03 22:31:42 -08:00
David S. Miller f188b951f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/renesas/ravb_main.c
	kernel/bpf/syscall.c
	net/ipv4/ipmr.c

All three conflicts were cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 21:09:12 -05:00
Linus Torvalds 071f5d105a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A lot of Thanksgiving turkey leftovers accumulated, here goes:

   1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.

   2) IDs for some new iwlwifi chips, from Oren Givon.

   3) Fix rtlwifi lockups on boot, from Larry Finger.

   4) Fix memory leak in fm10k, from Stephen Hemminger.

   5) We have a route leak in the ipv6 tunnel infrastructure, fix from
      Paolo Abeni.

   6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.

   7) Wrong lockdep annotations in tcp md5 support, fix from Eric
      Dumazet.

   8) Work around some middle boxes which prevent proper handling of TCP
      Fast Open, from Yuchung Cheng.

   9) TCP repair can do huge kmalloc() requests, build paged SKBs
      instead.  From Eric Dumazet.

  10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
      Borkmann.

  11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
      Nikolay Aleksandrov.

  12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
      Weikusat.

  13) Fix double free in VRF code, from Nikolay Aleksandrov.

  14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.

  15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.

  16) Fix clearing of persistent array maps in bpf, from Daniel
      Borkmann.

  17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
      early enough.  From Eric Dumazet.

  18) Fix out of bounds accesses in bpf array implementation when
      updating elements, from Daniel Borkmann.

  19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
      Dumazet.

  20) When dumping proxy neigh entries, we have to accomodate NULL
      device pointers properly, from Konstantin Khlebnikov.

  21) SCTP doesn't release all ipv6 socket resources properly, fix from
      Eric Dumazet.

  22) Prevent underflows of sch->q.qlen for multiqueue packet
      schedulers, also from Eric Dumazet.

  23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
      Huang and Michael Chan.

  24) Don't actively scan radar channels, from Antonio Quartulli"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
  net: phy: reset only targeted phy
  bnxt_en: Setup uc_list mac filters after resetting the chip.
  bnxt_en: enforce proper storing of MAC address
  bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
  net: lpc_eth: remove irq > NR_IRQS check from probe()
  net_sched: fix qdisc_tree_decrease_qlen() races
  openvswitch: fix hangup on vxlan/gre/geneve device deletion
  ipv4: igmp: Allow removing groups from a removed interface
  ipv6: sctp: implement sctp_v6_destroy_sock()
  arm64: bpf: add 'store immediate' instruction
  ipv6: kill sk_dst_lock
  ipv6: sctp: add rcu protection around np->opt
  net/neighbour: fix crash at dumping device-agnostic proxy entries
  sctp: use GFP_USER for user-controlled kmalloc
  sctp: convert sack_needed and sack_generation to bits
  ipv6: add complete rcu protection around np->opt
  bpf: fix allocation warnings in bpf maps and integer overflow
  mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
  net: mvneta: enable setting custom TX IP checksum limit
  net: mvneta: fix error path for building skb
  ...
2015-12-03 16:02:46 -08:00
Linus Torvalds c041f08738 During the merge window I added a new file that is used to filter trace
events on pids. It filters all events where only tasks with their pid in that
 file exists. It also handles the sched_switch and sched_wakeup trace events
 where the current task does not have its pid in the file, but the task
 either being switched to or awaken does.
 
 Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both of
 these tracepoints use the same class as the sched_wakeup tracepoint, and
 they too should be included in what gets filtered by the set_event_pid file.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWX7XpAAoJEKKk/i67LK/8G6gH/0W9QxCt/iS+J+x3gAPPs+/9
 jtwZgAOWjq+118ZpWORtgRcoH2r/sJUNwlXqhMojAHsPlwZsr6TXkWJkgyNdZZ7B
 QdUtZrr+egGYvd7TE0ONi/XrLTe9VLtBQsh5pN7l9fF9TjxYUmu5V9LplH9z0RxW
 Hw8EzqGzG2iZnXYCnErtu5jRLmr18f2u9aUptPAc4bYPLVUUw9M9MqRV/ZwQxsaX
 1mfIoR5SVC5IWW/R07qjULlbFpvNXkVJ56HwXMVBN44mYz3eUGYBKzjyAJ0Ugymf
 CNDPzh4HgVFsEqDedr0D8T5WZNJSUErdbHVSWze+CCUNfYikvU7gNmxNJ89Q3P4=
 =LsnU
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "During the merge window I added a new file that is used to filter
  trace events on pids.  It filters all events where only tasks with
  their pid in that file exists.  It also handles the sched_switch and
  sched_wakeup trace events where the current task does not have its pid
  in the file, but the task either being switched to or awaken does.

  Unfortunately, I forgot about sched_wakeup_new and sched_waking.  Both
  of these tracepoints use the same class as the sched_wakeup
  tracepoint, and they too should be included in what gets filtered by
  the set_event_pid file"

* tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter
2015-12-03 15:23:17 -08:00
Oleg Nesterov b53202e630 cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-03 10:24:08 -05:00
Tejun Heo 8075b542cf Merge branch 'for-4.4-fixes' into for-4.5 2015-12-03 10:22:52 -05:00
Tejun Heo 67cde9c493 cgroup_pids: don't account for the root cgroup
Because accounting resources for the root cgroup sometimes incurs
measureable overhead for workloads which don't care about cgroup and
often ends up calculating a number which is available elsewhere in a
slightly different form, cgroup is not in the business of providing
system-wide statistics.  The pids controller which was introduced
recently was exposing "pids.current" at the root.  This patch disable
accounting for root cgroup and removes the file from the root
directory.

While this is a userland visible behavior change, pids has been
available only in one version and was badly broken there, so I don't
think this will be noticeable.  If it turns out to be a problem, we
can reinstate it for v1 hierarchies.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
Tejun Heo 1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
Tejun Heo 599c963a0f cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
If one or more tasks get moved into a frozen css, the frozen state is
cleared up from the destination css so that it can be reasserted once
the migrated tasks are frozen.  freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.

This patch merges the two steps.  Propagation now takes place inside
the task loop.  This simplifies the code and prepares it for the fix
of multi-destination migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2015-12-03 10:18:21 -05:00
Alexei Starovoitov 01b3f52157 bpf: fix allocation warnings in bpf maps and integer overflow
For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
 [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
 [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
 [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
 [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
 [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
 [<     inline     >] alloc_pages include/linux/gfp.h:451
 [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
 [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
 [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
 [<     inline     >] kmalloc_large include/linux/slab.h:390
 [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
 [<     inline     >] kmalloc include/linux/slab.h:463
 [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
 [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744

To avoid never succeeding kmalloc with order >= MAX_ORDER check that
elem->value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
so keep those kmalloc-s as-is.

Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.

Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02 23:36:00 -05:00
Daniel Borkmann fbca9d2d35 bpf, array: fix heap out-of-bounds access when updating elements
During own review but also reported by Dmitry's syzkaller [1] it has been
noticed that we trigger a heap out-of-bounds access on eBPF array maps
when updating elements. This happens with each map whose map->value_size
(specified during map creation time) is not multiple of 8 bytes.

In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
used to align array map slots for faster access. However, in function
array_map_update_elem(), we update the element as ...

memcpy(array->value + array->elem_size * index, value, array->elem_size);

... where we access 'value' out-of-bounds, since it was allocated from
map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
and later on copied through copy_from_user(value, uvalue, map->value_size).
Thus, up to 7 bytes, we can access out-of-bounds.

Same could happen from within an eBPF program, where in worst case we
access beyond an eBPF program's designated stack.

Since 1be7f75d16 ("bpf: enable non-root eBPF programs") didn't hit an
official release yet, it only affects priviledged users.

In case of array_map_lookup_elem(), the verifier prevents eBPF programs
from accessing beyond map->value_size through check_map_access(). Also
from syscall side map_lookup_elem() only copies map->value_size back to
user, so nothing could leak.

  [1] http://github.com/google/syzkaller

Fixes: 28fbcfa08d ("bpf: add array type of eBPF maps")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 21:56:17 -05:00
Steven Rostedt (Red Hat) 0f72e37e42 tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter
The set_event_pid filter relies on attaching to the sched_switch and
sched_wakeup tracepoints to see if it should filter the tracing on schedule
tracepoints. By adding the callbacks to sched_wakeup, pids in the
set_event_pid file will trace the wakeups of those tasks with those pids.

But sched_wakeup_new and sched_waking were missed. These two should also be
traced. Luckily, these tracepoints share the same class as sched_wakeup
which means they can use the same pre and post callbacks as sched_wakeup
does.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-12-01 16:08:05 -05:00
Zach Brown 29732938a6 vfs: add copy_file_range syscall and vfs helper
Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <zab@redhat.com>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                 Change flags parameter from int to unsigned int,
                 Add function to include/linux/syscalls.h,
                 Check copy len after file open mode,
                 Don't forbid ranges inside the same file,
                 Use rw_verify_area() to veriy ranges,
                 Use file_out rather than file_in,
                 Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-01 14:00:53 -05:00
Linus Torvalds 9e5d25e829 Found two minor bugs while doing development on the ring buffer code.
The first is something that's been there since its creation. If a reader
 reads a page out of the ring buffer before there's any events on it, it
 can get an out of date timestamp for that event. It may be off by a few
 microseconds, more if the first event gets discarded. The fix was to
 only update the reader time stamp when it actually sees an event on
 the page, instead of just reading the timestamp from the page even if
 it has no events on it. That timestamp is still volatile until an event
 is present.
 
 The second bug is more recent. Instead of passing around parameters
 a descriptor was made and the parameters are passed via a single
 descriptor. This simplified the code a bit. But there was one place that
 expected the parameter to be passed by value not reference (which a
 descriptor now does). And it added to the length of the event, which
 may be ignored later, but the length should not have been increased.
 The only real problem with this bug is that it may allocate more than
 was needed for the event.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWVdKDAAoJEKKk/i67LK/8dYwH/15HYLvVMbljmXyatXOp+zO9
 VTJAfXuLowMNLWMEg57NjDkcmqKodQfVWEY1gW8dgY3VnUa6KbbWgcX++3ncnXtH
 RwRQ8YhgTCpz9rFmIKesSBuLpu0uE1naqGn9QSF0AaU48ljykDStXBJJs7QGBCB8
 ZDFru17sFT5/BpeE/zcTrDKmUH79YSWbmnQvcp4UxdP3Eq7THvd41bOEoUrvm/zU
 pA2I/+Yz8wUJMA9I4UHEkltoMSV46/QldXbSh+zzWoTdZp92ZAiyuNNO/EtamGa+
 QVS6cU3b9N9C6EGom21PaSheNUpZDrZf5Tj0S0wrg2f/+lWublplRJNzwhSu85E=
 =K6x5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "I found two minor bugs while doing development on the ring buffer
  code.

  The first is something that's been there since its creation.  If a
  reader reads a page out of the ring buffer before there's any events
  on it, it can get an out of date timestamp for that event.  It may be
  off by a few microseconds, more if the first event gets discarded.
  The fix was to only update the reader time stamp when it actually sees
  an event on the page, instead of just reading the timestamp from the
  page even if it has no events on it.  That timestamp is still volatile
  until an event is present.

  The second bug is more recent.  Instead of passing around parameters a
  descriptor was made and the parameters are passed via a single
  descriptor.  This simplified the code a bit.  But there was one place
  that expected the parameter to be passed by value not reference (which
  a descriptor now does).  And it added to the length of the event,
  which may be ignored later, but the length should not have been
  increased.  The only real problem with this bug is that it may
  allocate more than was needed for the event"

* tag 'trace-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Put back the length if crossed page with add_timestamp
  ring-buffer: Update read stamp with first real commit on page
2015-11-30 15:38:23 -08:00
Oleg Nesterov afbcb364be cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
Now that we know that the forking task can't migrate amd the child is always
moved to the same cgroup by cgroup_post_fork()->css_set_move_task() we can
change pids_can_fork() and pids_cancel_fork() to just use task_css(current).
And since we no longer need to pin this css, we can remove pid_fork().

Note: the patch uses task_css_check(true), perhaps it makes sense to add a
helper or change task_css_set_check() to take cgroup_threadgroup_rwsem into
account.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30 09:48:18 -05:00
Oleg Nesterov c9e75f0492 cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
If the new child migrates to another cgroup before cgroup_post_fork() calls
subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
pids_uncharge(old_pids) + pids_charge(pids) sequence twice.

Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
unconditionally. percpu_down_read() is cheap and this allows other cleanups,
see the next changes.

Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-30 09:48:18 -05:00
Tejun Heo 53254f900b cgroup: make css_set pin its css's to avoid use-afer-free
A css_set represents the relationship between a set of tasks and
css's.  css_set never pinned the associated css's.  This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.

Unfortunately, afcf6c8b75 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release.  Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.

 general protection fault: 0000 [#1] PREEMPT SMP
 task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
 RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
 ...
 Call Trace:
  <IRQ>
  [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
  [<ffffffff810f8893>] cgroup_free+0x53/0xe0
  [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
  [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
  [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
  [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
  [<ffffffff81056e54>] __do_softirq+0xd4/0x460
  [<ffffffff81057369>] irq_exit+0x89/0xa0
  [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
  [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
  <EOI>
 ...
 Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
 RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
  RSP <ffff88001fc03e20>
 ---[ end trace 89a4a4b916b90c49 ]---
 Kernel panic - not syncing: Fatal exception in interrupt
 Kernel Offset: disabled
 ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fix it by making css_set pin the associate css's until its release.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b75 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
2015-11-30 09:46:21 -05:00
Peter Zijlstra 82bbe34b3d nohz: Clarify magic in tick_nohz_stop_sched_tick()
While going through the nohz code I got stumped by some of it.

This patch adds a few comments clarifying the code; based on discussion
with Thomas.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20151119162106.GO3816@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-25 22:37:27 +01:00
Steven Rostedt (Red Hat) 39daa7b9e8 ftrace: Show all tramps registered to a record on ftrace_bug()
When an anomaly is detected in the function call modification code,
ftrace_bug() is called to disable function tracing as well as give any
information that may help debug the problem. Currently, only the first found
trampoline that is attached to the failed record is reported. Instead, show
all trampolines that are hooked to it.

Also, not only show the ops pointer but also report the function it calls.

While at it, add this info to the enabled_functions debug file too.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 16:04:59 -05:00
Steven Rostedt (Red Hat) b05086c77a ftrace: Add variable ftrace_expected for archs to show expected code
When an anomaly is found while modifying function code, ftrace_bug() is
called which disables the function tracing infrastructure and reports
information about what failed. If the code that is to be replaced does not
match what is expected, then actual code is shown. Currently there is no
arch generic way to show what was expected.

Add a new variable pointer calld ftrace_expected that the arch code can set
to point to what it expected so that ftrace_bug() can report the actual text
as well as the text that was expected to be there.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:16 -05:00
Steven Rostedt (Red Hat) 02a392a043 ftrace: Add new type to distinguish what kind of ftrace_bug()
The ftrace function hook utility has several internal checks to make sure
that whatever it modifies is exactly what it expects to be modifying. This
is essential as modifying running code can be extremely dangerous to the
system.

When an anomaly is detected, ftrace_bug() is called which sends a splat to
the console and disables function tracing. There's some extra information
that is printed to help diagnose the issue.

One thing that is missing though is output of what ftrace was doing at the
time of the crash. Was it updating a call site or perhaps converting a call
site to a nop? A new global enum variable is created to state what ftrace
was doing at the time of the anomaly, and this is reported in ftrace_bug().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:15 -05:00
Tom Zanussi 4e4a4d7570 tracing: Update cond flag when enabling or disabling a trigger
When a trigger is enabled, the cond flag should be set beforehand,
otherwise a trigger that's expecting to process a trace record
(e.g. one with post_trigger set) could be invoked without one.

Likewise a trigger's cond flag should be reset after it's disabled,
not before.

Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:14 -05:00
Steven Rostedt (Red Hat) 4239c38fe0 ring-buffer: Process commits whenever moving to a new page.
When crossing over to a new page, commit the current work. This will allow
readers to get data with less latency, and also simplifies the work to get
timestamps working for interrupted events.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25 15:24:05 -05:00
Arnd Bergmann d2b4365809 cpuset: Replace all instances of time_t with time64_t
The following patch replaces all instances of time_t with time64_t i.e.
change the type used for representing time from 32-bit to 64-bit. All
32-bit kernels to date use a signed 32-bit time_t type, which can only
represent time until January 2038. Since embedded systems running 32-bit
Linux are going to survive beyond that date, we have to change all
current uses, in a backwards compatible way.

The patch also changes the function get_seconds() that returns a 32-bit
integer to ktime_get_seconds() that returns seconds as 64-bit integer.

The patch changes the type of ticks from time_t to u32. We keep ticks as
32-bits as the function uses 32-bit arithmetic which would prove less
expensive than 64-bit arithmetic and the function is expected to be
called atleast once every 32 seconds.

Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-25 14:00:05 -05:00
Daniel Borkmann c9da161c65 bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-25 12:14:09 -05:00
Eric Dumazet 81b1a832d7 pidns: fix NULL dereference in __task_pid_nr_ns()
I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from same fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-24 12:03:55 -08:00
Steven Rostedt (Red Hat) 70004986ff ring-buffer: Remove redundant update of page timestamp
The first commit of a buffer page updates the timestamp of that page. No
need to have the update to the next page add the timestamp too. It will only
be replaced by the first commit on that page anyway.

Only update to a page if it contains an event.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:29:16 -05:00
Steven Rostedt (Red Hat) 8573636ea7 ring-buffer: Use READ_ONCE() for most tail_page access
As cpu_buffer->tail_page may be modified by interrupts at almost any time,
the flow of logic is very important. Do not let gcc get smart with
re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
accesses.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:29:15 -05:00
Steven Rostedt (Red Hat) bd1b7cd360 ring-buffer: Put back the length if crossed page with add_timestamp
Commit fcc742eaad "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function and the code was designed not to have
an effect on that modified  parameter. Now that the parameter is a
descriptor and any modifications to it are non-volatile, the size of the
data could be unnecessarily expanded.

Remove the extra space added if a timestamp was added and the event went
across the page.

Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742eaad "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:27:25 -05:00
Steven Rostedt (Red Hat) b81f472a20 ring-buffer: Update read stamp with first real commit on page
Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.

rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.

Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24 09:23:17 -05:00
Andy Lutomirski ed11a7f1b3 context_tracking: Switch to new static_branch API
This is much less error-prone than the old code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/812df7e64f120c5c7c08481f36a8caa9f53b2199.1447361906.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-24 09:56:43 +01:00
Paul E. McKenney 6cf1008122 rcu: Add transitivity to remaining rcu_node ->lock acquisitions
The rule is that all acquisitions of the rcu_node structure's ->lock
must provide transitivity:  The lock is not acquired that frequently,
and sorting out exactly which required it and which did not would be
a maintenance nightmare.  This commit therefore supplies the needed
transitivity to the remaining ->lock acquisitions.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
Peter Zijlstra 2a67e741bb rcu: Create transitive rnp->lock acquisition functions
Providing RCU's memory-ordering guarantees requires that the rcu_node
tree's locking provide transitive memory ordering, which the Linux kernel's
spinlocks currently do not provide unless smp_mb__after_unlock_lock()
is used.  Having a separate smp_mb__after_unlock_lock() after each and
every lock acquisition is error-prone, hard to read, and a bit annoying,
so this commit provides wrapper functions that pull in the
smp_mb__after_unlock_lock() invocations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-11-23 10:37:35 -08:00
Waiman Long d78045306c locking/pvqspinlock, x86: Optimize the PV unlock code path
The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.

For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().

This patch optimizes the PV unlock code path by:

 1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath
    as simple as possible..

 2) For x86-64, hand-coded an assembly function to combine the register
    saving thunk code with the fastpath code. Only registers that
    are used in the fastpath will be saved and restored. If the
    fastpath fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores
    only one 32-bit register.

With a microbenchmark of 5M lock-unlock loop, the table below shows
the execution times before and after the patch with different number
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:

  Threads	Before patch	After patch	% Change
  -------	------------	-----------	--------
     1		   134.1 ms	  119.3 ms	  -11%
     2		   1286  ms	   953  ms	  -26%
     3		   3715  ms	  3480  ms	  -6.3%
     4		   4092  ms	  3764  ms	  -8.0%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:02:02 +01:00
Waiman Long aa68744f80 locking/qspinlock: Avoid redundant read of next pointer
With optimistic prefetch of the next node cacheline, the next pointer
may have been properly inititalized. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:02:00 +01:00
Waiman Long 81b5598665 locking/qspinlock: Prefetch the next node cacheline
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has became the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and add to the latency of
locking operation.

This patch addes code to optmistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.

The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On really
short critical section, there may not be performance gain at all. With
longer critical section, however, it was found to have a performance
boost of 5-10% over a range of different queue depths with a spinlock
loop microbenchmark.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:01:59 +01:00
Waiman Long 64d816cba0 locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()
This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 10:01:58 +01:00
Byungchul Park 525628c73b sched/fair: Modify the comment about lock assumptions in migrate_task_rq_fair()
The comment describing migrate_task_rq_fair() says that the caller
should hold p->pi_lock. But in some cases the caller can hold
task_rq(p)->lock instead of p->pi_lock. So the comment is broken and
this patch fixes it.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447806899-20303-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:20 +01:00
Oleg Nesterov accaf6ea3d stop_machine: Clean up the usage of the preemption counter in cpu_stopper_thread()
1. Change this code to use preempt_count_inc/preempt_count_dec; this way
   it works even if CONFIG_PREEMPT_COUNT=n, and we avoid the unnecessary
   __preempt_schedule() check (stop_sched_class is not preemptible).

   And this makes clear that we only want to make preempt_count() != 0
   for __might_sleep() / schedule_debug().

2. Change WARN_ONCE() to use %pf to print the function name and remove
   kallsyms_lookup/ksym_buf.

3. Move "int ret" into the "if (work)" block, this looks more consistent.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193332.GA8281@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:20 +01:00
Oleg Nesterov dd2e3121e3 stop_machine: Shift the 'done != NULL' check from cpu_stop_signal_done() to callers
Change cpu_stop_queue_work() and cpu_stopper_thread() to check done != NULL
before cpu_stop_signal_done(done). This makes the code more clean imo, note
that cpu_stopper_thread() has to do this check anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193329.GA8274@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov 6fa3b826bc stop_machine: Kill cpu_stop_done->executed
Now that cpu_stop_done->executed becomes write-only (ignoring WARN_ON()
checks) we can remove it.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193326.GA8269@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov 4aff1ca697 stop_machine: Change __stop_cpus() to rely on cpu_stop_queue_work()
Change queue_stop_cpus_work() to return true if it queues at least one
work, this means that the caller should wait.

__stop_cpus() can check the value returned by queue_stop_cpus_work() and
avoid done.executed, just like stop_one_cpu() does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193323.GA8262@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:19 +01:00
Oleg Nesterov 958c5f848e stop_machine: Change stop_one_cpu() to rely on cpu_stop_queue_work()
Change stop_one_cpu() to return -ENOENT if cpu_stop_queue_work() fails.
Otherwise we know that ->executed must be true after wait_for_completion()
so we can just return done.ret.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193320.GA8259@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov 1b034bd989 stop_machine: Make cpu_stop_queue_work() and stop_one_cpu_nowait() return bool
Change cpu_stop_queue_work() to return true if the work was queued and
change stop_one_cpu_nowait() to return the result of cpu_stop_queue_work().
This makes it more useful, for example now you can alloc cpu_stop_work for
stop_one_cpu_nowait() and free it in the callback or if stop_one_cpu_nowait()
fails, currently this is impossible because you can't know if @fn will be
called or not.

Also, this allows to kill cpu_stop_done->executed, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151117170523.GA13955@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov 6a19005157 stop_machine: Don't disable preemption in stop_two_cpus()
Now that stop_two_cpus() path does not check cpu_active() we can remove
preempt_disable(), it was only needed to ensure that stop_machine() can
not be called after we observe cpu_active() == T and before we queue the
new work.

Also, turn the pointless and confusing ->executed check into WARN_ON().
We know that both works must be executed, otherwise we have a bug. And
in fact I think that done->executed should die, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193314.GA8249@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:18 +01:00
Oleg Nesterov 64038f292a stop_machine: Fix possible cpu_stopper_thread() crash
stop_one_cpu_nowait(fn) will crash the kernel if the callback returns
nonzero, work->done == NULL in this case.

This needs more cleanups, cpu_stop_signal_done() is called right after
we check done != NULL and it does the same check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Milos Vyletel <milos@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151115193311.GA8242@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Geliang Tang 01783e0d45 sched/core: Use list_is_singular() in sched_can_stop_tick()
Use list_is_singular() to check if run_list has only one entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a5453fafd735affcf28e53a1d0a3d6965cb5dbb5.1447582547.git.geliangtang@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Joonwoo Park 3ea94de15c sched/core: Fix incorrect wait time and wait count statistics
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ohaugan@codeaurora.org
Link: http://lkml.kernel.org/r/20151113033854.GA4247@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:48:17 +01:00
Peter Zijlstra 90eec103b9 treewide: Remove old email address
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:44:58 +01:00
Rik van Riel 51170840fe sched/numa: Cap PTE scanning overhead to 3% of run time
There is a fundamental mismatch between the runtime based NUMA scanning
at the task level, and the wall clock time NUMA scanning at the mm level.
On a severely overloaded system, with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of run
time scanning PTEs. It does that by ensuring that in-between
task_numa_work() runs, the task spends at least 32x as much time on
other things than it did on task_numa_work().

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x the amount of
CPU time doing something else, as it spent inside task_numa_work().
The longer task_numa_work() takes, the more likely it is this happens.

If task_numa_work() takes very little time, chances are low that that
code will do anything, but we will not care.

Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:54 +01:00
Byungchul Park 525705d15e sched/fair: Consider missed ticks in NOHZ_FULL in update_cpu_load_nohz()
Usually the tick can be stopped for an idle CPU in NOHZ. However in NOHZ_FULL
mode, a non-idle CPU's tick can also be stopped. However, update_cpu_load_nohz()
does not consider the case a non-idle CPU's tick has been stopped at all.

This patch makes the update_cpu_load_nohz() know if the calling path comes
from NOHZ_FULL or idle NOHZ.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447115762-19734-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Byungchul Park 5954327548 sched/fair: Prepare __update_cpu_load() to handle active tickless
There are some cases where distance between ticks is more than one tick
while the CPU is not idle, e.g. full NOHZ.

However __update_cpu_load() assumes it is the idle tickless case if the
distance between ticks is more than 1, even though it can be the active
tickless case as well. Thus in the active tickless case, updating the CPU
load will not be performed correctly.

Where the current code assumes the load for each tick is zero, this is
(obviously) not true in non-idle tickless case. We can approximately
consider the load ~= this_rq->cpu_load[0] during tickless in non-idle
tickless case.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1444816056-11886-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:53 +01:00
Peter Zijlstra d937cdc59e sched/fair: Clean up the explanation around decaying load update misses
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:52 +01:00
Dietmar Eggemann 38c6ade2dd sched/fair: Remove empty idle enter and exit functions
Commit cd126afe83 ("sched/fair: Remove rq's runnable avg") got rid of
rq->avg and so there is no need to update it any more when entering or
exiting idle.

Remove the now empty functions idle_{enter|exit}_fair().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:37:51 +01:00
Arnd Bergmann 89b411081d sched/rt: Hide the push_irq_work_func() declaration
The push_irq_work_func() function is conditionally defined only
when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
forward declaration remains visibile without HAVE_RT_PUSH_IPI,
causing a gcc warning in ARM64 allnoconfig:

  kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]

This changes the code to use the same condition for both the
declaration and the function definition, which gets rid of the
warning.

As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
thing after:

  8053871d0f ("smp: Fix smp_call_function_single_async() locking")

Until that is done, this patch can be used to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:25:08 +01:00
Stephane Eranian 614e4c4ebc perf/core: Robustify the perf_cgroup_from_task() RCU checks
This patch reinforces the lockdep checks performed by
perf_cgroup_from_tsk() by passing the perf_event_context
whenever possible. It is okay to not hold the RCU read lock
when we know we hold the ctx->lock. This patch makes sure this
property holds.

In some functions, such as perf_cgroup_sched_in(), we do not
pass the context because we are sure we are holding the RCU
read lock.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:21:03 +01:00
Stephane Eranian ddaaf4e291 perf/core: Fix RCU problem with cgroup context switching code
The RCU checker detected RCU violation in the cgroup switching routines
perf_cgroup_sched_in() and perf_cgroup_sched_out(). We were dereferencing
cgroup from task without holding the RCU lock.

Fix this by holding the RCU read lock. We move the locking from
perf_cgroup_switch() to avoid double locking.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: edumazet@google.com
Link: http://lkml.kernel.org/r/1447322404-10920-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:21:03 +01:00
Vitaly Kuznetsov 7625b3a000 kernel/panic.c: turn off locks debug before releasing console lock
Commit 08d78658f3 ("panic: release stale console lock to always get the
logbuf printed out") introduced an unwanted bad unlock balance report when
panic() is called directly and not from OOPS (e.g.  from out_of_memory()).
The difference is that in case of OOPS we disable locks debug in
oops_enter() and on direct panic call nobody does that.

Fixes: 08d78658f3 ("panic: release stale console lock to always get the logbuf printed out")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
Richard Weinberger 9d8a765211 kernel/signal.c: unexport sigsuspend()
sigsuspend() is nowhere used except in signal.c itself, so we can mark it
static do not pollute the global namespace.

But this patch is more than a boring cleanup patch, it fixes a real issue
on UserModeLinux.  UML has a special console driver to display ttys using
xterm, or other terminal emulators, on the host side.  Vegard reported
that sometimes UML is unable to spawn a xterm and he's facing the
following warning:

  WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()

It turned out that this warning makes absolutely no sense as the UML
xterm code calls sigsuspend() on the host side, at least it tries.  But
as the kernel itself offers a sigsuspend() symbol the linker choose this
one instead of the glibc wrapper.  Interestingly this code used to work
since ever but always blocked signals on the wrong side.  Some recent
kernel change made the WARN_ON() trigger and uncovered the bug.

It is a wonderful example of how much works by chance on computers. :-)

Fixes: 68f3f16d9a ("new helper: sigsuspend()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>	[3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-20 16:17:32 -08:00
Tejun Heo 16af439645 cgroup: implement cgroup_get_from_path() and expose cgroup_put()
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path.  This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.

v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).

Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20 15:55:52 -05:00
Tejun Heo b11cfb5807 cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question.  While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.

This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level.  cgroup_is_descendant() is accordingly reimplmented
and made inline.

Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-20 15:55:52 -05:00
Daniel Borkmann f99bf205da bpf: add show_fdinfo handler for maps
Add a handler for show_fdinfo() to be used by the anon-inodes
backend for eBPF maps, and dump the map specification there. Not
only useful for admins, but also it provides a minimal way to
compare specs from ELF vs pinned object.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 11:04:15 -05:00
Linus Torvalds a3d66b5a17 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
 "A fix for module handling in case kASLR has been enabled, from Zhou
  Chengming"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: x86: fix relocation computation with kASLR
2015-11-19 12:16:12 -08:00
Sudeep Holla a946e8c717 genirq: Delay incrementing interrupt count if it's disabled/pending
In case of a wakeup interrupt, irq_pm_check_wakeup disables the interrupt
and marks it pending and suspended, disables it and notifies the pm core
about the wake event. The interrupt gets handled later once the system
is resumed.

However the irq stats is updated twice: once when it's disabled waiting
for the system to resume and later when it's handled, resulting in wrong
counting of the wakeup interrupt when waking up the system.

This patch updates the interrupt count so that it's updated only when
the interrupt gets handled. It's already handled correctly in
handle_edge_irq and handle_edge_eoi_irq.

Reported-by: Manoil Claudiu <claudiu.manoil@freescale.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1446661957-1019-1-git-send-email-sudeep.holla@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-16 17:55:55 +01:00
Tejun Heo 67e9c74b8a cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild.  Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.

  mount -t cgroup2 none $MOUNT_POINT

The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").

v2: Assign a different magic number to cgroup2 fs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-11-16 11:13:34 -05:00
Tejun Heo 34c06254ff cgroup: fix cftype->file_offset handling
6f60eade24 ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded.  These handles
then can be used, for example, to generate file modified
notifications.

Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction.  Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again.  This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time.  This incorrect assumption led
to the corruption of css->files lists.

Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: James Sedgwick <jsedgwick@fb.com>
Fixes: 6f60eade24 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2015-11-16 10:58:26 -05:00
Linus Torvalds 0ca9b67606 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Mostly updates to the perf tool plus two fixes to the kernel core code:

   - Handle tracepoint filters correctly for inherited events (Peter
     Zijlstra)

   - Prevent a deadlock in perf_lock_task_context (Paul McKenney)

   - Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
     Melo)

   - Print full source file paths when using 'perf annotate --print-line
     --full-paths' (Michael Petlan)

   - Fix 'perf probe -d' when just one out of uprobes and kprobes is
     enabled (Wang Nan)

   - Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
     tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)

   - Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
     the 'perf test' LLVM entries, when running it in-tree, to
     .gitignore (Yunlong Song)

   - libbpf error reporting improvements, using a strerror interface to
     more precisely tell the user about problems with the provided
     scriptlet, be it in C or as a ready made object file (Wang Nan)

   - Do not be case sensitive when searching for matching 'perf test'
     entries (Arnaldo Carvalho de Melo)

   - Inform the user about objdump failures in 'perf annotate' (Andi
     Kleen)

   - Improve the LLVM 'perf test' entry, introduce a new ones for BPF
     and kbuild tests to check the environment used by clang to compile
     .c scriptlets (Wang Nan)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
  tools include: Add compiler.h to list.h
  perf probe: Verify parameters in two functions
  perf session: Add missing newlines to some pr_err() calls
  perf annotate: Support full source file paths for srcline fix
  perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
  perf: Fix inherited events vs. tracepoint filters
  perf: Disable IRQs across RCU RS CS that acquires scheduler lock
  perf test: Do not be case sensitive when searching for matching tests
  perf test: Add 'perf test BPF'
  perf test: Enhance the LLVM tests: add kbuild test
  perf test: Enhance the LLVM test: update basic BPF test program
  perf bpf: Improve BPF related error messages
  perf tools: Make fetch_kernel_version() publicly available
  bpf tools: Add new API bpf_object__get_kversion()
  bpf tools: Improve libbpf error reporting
  perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
  perf annotate: Inform the user about objdump failures in --stdio
  perf stat: Make stat options global
  perf sched latency: Fix thread pid reuse issue
  ...
2015-11-15 09:36:24 -08:00
Linus Torvalds 051b29f279 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single fix to prevent math underflow in the numa balancing code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix math underflow in task_tick_numa()
2015-11-15 09:35:33 -08:00
Linus Torvalds 511601bdbc Merge branches 'irq-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and timer fixes from Thomas Gleixner:

 - An irq regression fix to restore the wakeup behaviour of chained
   interrupts.

 - A timer fix for a long standing race versus timers scheduled on a
   target cpu which got exposed by recent changes in the workqueue
   implementation.

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/PM: Restore system wake up from chained interrupts

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Use proper base migration in add_timer_on()
2015-11-15 09:30:48 -08:00
Linus Torvalds 0e97606425 This contains three more clean up patches.
One patch is needed to make tracing work without debugfs now that tracing
 uses its own tracefs.
 
 The second is removing an unused variable.
 
 The third is fixing a warning about unused variables when MAX_TRACER is
 not configured. Note, this warning shows up in gcc 6.0, but does not show
 up in gcc 4.9, as it seems that gcc does not complain about constants
 not being used.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWRKA+AAoJEKKk/i67LK/8BZ4H/0UgA8+OpbUmBHQk/RdMRZW2
 BSwO/GuErq3nSKJUjP4f3b4BGdMoFgThIOxR9dvHKwJvGfwpbTivpgR0OqUdFKAZ
 1FyUT/oNM0ySNGpZFGuMQ9hvR+deeUvvyPhosjPWlxZLSnRSsEzTlePUqikX1VY6
 hEQj+ulK8J0iLlHMDYLJ5FkQ10+owrI2vmT8BU1O8KZN8EBvhoUx4EfJewD63/pa
 Gzjm6C9RpVb6ks555NmwFzLeN1Yd5X4n5m2CQFrIpzBgRlibWIGCPMVe7OgNawMu
 wONNvhq9zLfo3pftS092lek4iPgEw7ypl6iEgBxDzqTIAjSxUqf+N+1xIbOCcOE=
 =GikR
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull trace cleanups from Steven Rostedt:
 "This contains three more clean up patches.

  One patch is needed to make tracing work without debugfs now that
  tracing uses its own tracefs.

  The second is removing an unused variable.

  The third is fixing a warning about unused variables when MAX_TRACER
  is not configured.  Note, this warning shows up in gcc 6.0, but does
  not show up in gcc 4.9, as it seems that gcc does not complain about
  constants not being used"

* tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: #ifdef out uses of max trace when CONFIG_TRACER_MAX_TRACE is not set
  tracing: Remove unused ftrace_cpu_disabled per cpu variable
  tracing: Make tracing work when debugfs is not configured in
2015-11-12 16:22:54 -08:00
Linus Torvalds 3370b69eb0 Four changes:
- x86: work around two nasty cases where a benign exception occurs while
 another is being delivered.  The endless stream of exceptions causes an
 infinite loop in the processor, which not even NMIs or SMIs can interrupt;
 in the virt case, there is no possibility to exit to the host either.
 
 - x86: support for Skylake per-guest TSC rate.  Long supported by AMD,
 the patches mostly move things from there to common arch/x86/kvm/ code.
 
 - generic: remove local_irq_save/restore from the guest entry and exit
 paths when context tracking is enabled.  The patches are a few months
 old, but we discussed them again at kernel summit.  Andy will pick up
 from here and, in 4.5, try to remove it from the user entry/exit paths.
 
 - PPC: Two bug fixes, see merge commit 370289756b for details.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
 6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
 Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
 jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
 23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
 o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
 =Um9M
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second batch of kvm updates from Paolo Bonzini:
 "Four changes:

   - x86: work around two nasty cases where a benign exception occurs
     while another is being delivered.  The endless stream of exceptions
     causes an infinite loop in the processor, which not even NMIs or
     SMIs can interrupt; in the virt case, there is no possibility to
     exit to the host either.

   - x86: support for Skylake per-guest TSC rate.  Long supported by
     AMD, the patches mostly move things from there to common
     arch/x86/kvm/ code.

   - generic: remove local_irq_save/restore from the guest entry and
     exit paths when context tracking is enabled.  The patches are a few
     months old, but we discussed them again at kernel summit.  Andy
     will pick up from here and, in 4.5, try to remove it from the user
     entry/exit paths.

   - PPC: Two bug fixes, see merge commit 370289756b for details"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: x86: rename update_db_bp_intercept to update_bp_intercept
  KVM: svm: unconditionally intercept #DB
  KVM: x86: work around infinite loop in microcode when #AC is delivered
  context_tracking: avoid irq_save/irq_restore on guest entry and exit
  context_tracking: remove duplicate enabled check
  KVM: VMX: Dump TSC multiplier in dump_vmcs()
  KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
  KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
  KVM: VMX: Enable and initialize VMX TSC scaling
  KVM: x86: Use the correct vcpu's TSC rate to compute time scale
  KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
  KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
  KVM: x86: Replace call-back compute_tsc_offset() with a common function
  KVM: x86: Replace call-back set_tsc_khz() with a common function
  KVM: x86: Add a common TSC scaling function
  KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
  KVM: x86: Collect information for setting TSC scaling ratio
  KVM: x86: declare a few variables as __read_mostly
  KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
  KVM: PPC: Book3S HV: Don't dynamically split core when already split
  ...
2015-11-12 14:34:06 -08:00
Zhou Chengming e41b104c7d livepatch: x86: fix relocation computation with kASLR
With kASLR enabled, old_addr provided by patch module is being shifted
accrodingly so that the symbol lookups work. To have module relocations
handled properly as well, the same transformation needs to be perfomed
on relocation address information.

[jkosina@suse.cz: extended / reworded changelog a bit]
Reported-by: Cyril B. <cbay@alwaysdata.com>
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-11-11 17:36:04 +01:00