Commit graph

13260 commits

Author SHA1 Message Date
Linus Torvalds cb60e3e65c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "New notable features:
   - The seccomp work from Will Drewry
   - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
   - Longer security labels for Smack from Casey Schaufler
   - Additional ptrace restriction modes for Yama by Kees Cook"

Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  apparmor: fix long path failure due to disconnected path
  apparmor: fix profile lookup for unconfined
  ima: fix filename hint to reflect script interpreter name
  KEYS: Don't check for NULL key pointer in key_validate()
  Smack: allow for significantly longer Smack labels v4
  gfp flags for security_inode_alloc()?
  Smack: recursive tramsmute
  Yama: replace capable() with ns_capable()
  TOMOYO: Accept manager programs which do not start with / .
  KEYS: Add invalidation support
  KEYS: Do LRU discard in full keyrings
  KEYS: Permit in-place link replacement in keyring list
  KEYS: Perform RCU synchronisation on keys prior to key destruction
  KEYS: Announce key type (un)registration
  KEYS: Reorganise keys Makefile
  KEYS: Move the key config into security/keys/Kconfig
  KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
  Yama: remove an unused variable
  samples/seccomp: fix dependencies on arch macros
  Yama: add additional ptrace scopes
  ...
2012-05-21 20:27:36 -07:00
Linus Torvalds bf67f3a5c4 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug cleanups from Thomas Gleixner:
 "This series is merily a cleanup of code copied around in arch/* and
  not changing any of the real cpu hotplug horrors yet.  I wish I'd had
  something more substantial for 3.5, but I underestimated the lurking
  horror..."

Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  um: Remove leftover declaration of alloc_task_struct_node()
  task_allocator: Use config switches instead of magic defines
  sparc: Use common threadinfo allocator
  score: Use common threadinfo allocator
  sh-use-common-threadinfo-allocator
  mn10300: Use common threadinfo allocator
  powerpc: Use common threadinfo allocator
  mips: Use common threadinfo allocator
  hexagon: Use common threadinfo allocator
  m32r: Use common threadinfo allocator
  frv: Use common threadinfo allocator
  cris: Use common threadinfo allocator
  x86: Use common threadinfo allocator
  c6x: Use common threadinfo allocator
  fork: Provide kmemcache based thread_info allocator
  tile: Use common threadinfo allocator
  fork: Provide weak arch_release_[task_struct|thread_info] functions
  fork: Move thread info gfp flags to header
  fork: Remove the weak insanity
  sh: Remove cpu_idle_wait()
  ...
2012-05-21 19:43:57 -07:00
Linus Torvalds 226da0dbc8 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
 "This is the v3.5 RCU tree from Paul E.  McKenney:

 1) A set of improvements and fixes to the RCU_FAST_NO_HZ feature (with
    more on the way for 3.6).  Posted to LKML:
       https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
       https://lkml.org/lkml/2012/4/16/611 (commit 4),
       https://lkml.org/lkml/2012/4/30/390 (commit 6), and
       https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
       the other commits for the convenience of the tester).

 2) Changes to make rcu_barrier() avoid disrupting execution of CPUs
    that have no RCU callbacks.  Posted to LKML:
       https://lkml.org/lkml/2012/4/23/322.

 3) A couple of commits that improve the efficiency of the interaction
    between preemptible RCU and the scheduler, these two being all that
    survived an abortive attempt to allow preemptible RCU's
    __rcu_read_lock() to be inlined.  The full set was posted to LKML at
    https://lkml.org/lkml/2012/4/14/143, and the first and third patches
    of that set remain.

 4) Lai Jiangshan's algorithmic implementation of SRCU, which includes
    call_srcu() and srcu_barrier().  A major feature of this new
    implementation is that synchronize_srcu() no longer disturbs the
    execution of other CPUs.  This work is based on earlier
    implementations by Peter Zijlstra and Paul E.  McKenney.  Posted to
    LKML: https://lkml.org/lkml/2012/2/22/82.

 5) A number of miscellaneous bug fixes and improvements which were
    posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
    subsequent updates posted to LKML."

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rcu: Make rcu_barrier() less disruptive
  rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
  rcu: Make RCU_FAST_NO_HZ handle timer migration
  rcu: Update RCU maintainership
  rcu: Make exit_rcu() more precise and consolidate
  rcu: Move PREEMPT_RCU preemption to switch_to() invocation
  rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
  rcu: Add rcutorture test for call_srcu()
  rcu: Implement per-domain single-threaded call_srcu() state machine
  rcu: Use single value to handle expedited SRCU grace periods
  rcu: Improve srcu_readers_active_idx()'s cache locality
  rcu: Remove unused srcu_barrier()
  rcu: Implement a variant of Peter's SRCU algorithm
  rcu: Improve SRCU's wait_idx() comments
  rcu: Flip ->completed only once per SRCU grace period
  rcu: Increment upper bit only for srcu_read_lock()
  rcu: Remove fast check path from __synchronize_srcu()
  rcu: Direct algorithmic SRCU implementation
  rcu: Introduce rcutorture testing for rcu_barrier()
  timer: Fix mod_timer_pinned() header comment
  ...
2012-05-21 19:26:51 -07:00
Linus Torvalds 5ec29e3149 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Ingo Molnar:
 "This update:

   - extends and simplifies x86 NMI callback handling code to enhance
     and fix the HP hw-watchdog driver

   - simplifies the x86 NMI callback handling code to fix a kmemcheck
     bug.

   - enhances the hung-task debugger"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Fix the type of the nmiaction.flags field
  x86/nmi: Fix page faults by nmiaction if kmemcheck is enabled
  x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
  watchdog, hpwdt: Remove priority option for NMI callback
  hung task debugging: Inject NMI when hung and going to panic
2012-05-21 19:25:14 -07:00
Linus Torvalds 31ae98359d Merge branches 'perf-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf, x86 and scheduler updates from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing: Do not enable function event with enable
  perf stat: handle ENXIO error for perf_event_open
  perf: Turn off compiler warnings for flex and bison generated files
  perf stat: Fix case where guest/host monitoring is not supported by kernel
  perf build-id: Fix filename size calculation

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable
  x86: Fix section annotation of acpi_map_cpu2node()
  x86/microcode: Ensure that module is only loaded on supported Intel CPUs

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
2012-05-17 09:35:17 -07:00
Jiri Kosina 3911ff30f5 genirq: export handle_edge_irq() and irq_to_desc()
Export handle_edge_irq() and irq_to_desc() to modules to allow them to
do things such as

	__irq_set_handler_locked(...., handle_edge_irq);

This fixes

	ERROR: "handle_edge_irq" [drivers/gpio/gpio-pch.ko] undefined!
	ERROR: "irq_to_desc" [drivers/gpio/gpio-pch.ko] undefined!

when gpio-pch is being built as a module.

This was introduced by commit df9541a60a ("gpio: pch9: Use proper flow
type handlers") that added

	__irq_set_handler_locked(d->irq, handle_edge_irq);

but handle_edge_irq() was not exported for modules (and inlined
__irq_set_handler_locked() requires irq_to_desc() exported as well)

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-15 08:10:07 -07:00
Ingo Molnar 2d84e023cb Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.5 RCU tree from Paul E. McKenney:

 1)	A set of improvements and fixes to the RCU_FAST_NO_HZ feature
	(with more on the way for 3.6).  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/324 (commits 1-3 and 5),
	https://lkml.org/lkml/2012/4/16/611 (commit 4),
	https://lkml.org/lkml/2012/4/30/390 (commit 6), and
	https://lkml.org/lkml/2012/5/4/410 (commit 7, combined with
	the other commits for the convenience of the tester).

 2)	Changes to make rcu_barrier() avoid disrupting execution of CPUs
	that have no RCU callbacks.  Posted to LKML:
	https://lkml.org/lkml/2012/4/23/322.

 3)	A couple of commits that improve the efficiency of the interaction
	between preemptible RCU and the scheduler, these two being all
	that survived an abortive attempt to allow preemptible RCU's
	__rcu_read_lock() to be inlined.  The full set was posted to
	LKML at https://lkml.org/lkml/2012/4/14/143, and the first and
	third patches of that set remain.

 4)	Lai Jiangshan's algorithmic implementation of SRCU, which includes
	call_srcu() and srcu_barrier().  A major feature of this new
	implementation is that synchronize_srcu() no longer disturbs
	the execution of other CPUs.  This work is based on earlier
	implementations by Peter Zijlstra and Paul E. McKenney.  Posted to
	LKML: https://lkml.org/lkml/2012/2/22/82.

 5)	A number of miscellaneous bug fixes and improvements which were
	posted to LKML at: https://lkml.org/lkml/2012/4/23/353 with
	subsequent updates posted to LKML.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-14 08:41:46 +02:00
Paul E. McKenney dc36be4419 Merge branches 'barrier.2012.05.09a', 'fixes.2012.04.26a', 'inline.2012.05.02b' and 'srcu.2012.05.07b' into HEAD
barrier:  Reduce the amount of disturbance by rcu_barrier() to the rest of
    	the system.  This branch also includes improvements to
    	RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes:  Miscellaneous fixes.
inline:  Remaining changes from an abortive attempt to inline
    	preemptible RCU's __rcu_read_lock().  These are (1) making
    	exit_rcu() avoid unnecessary work and (2) avoiding having
    	preemptible RCU record a blocked thread when the scheduler
    	declines to do a context switch.
srcu:	Lai Jiangshan's algorithmic implementation of SRCU, including
    	call_srcu().
2012-05-11 10:14:21 -07:00
Mike Galbraith 5e2bf01422 namespaces, pid_ns: fix leakage on fork() failure
Fork() failure post namespace creation for a child cloned with
CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted
during creation, but not unmounted during cleanup.  Call
pid_ns_release_proc() during cleanup.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Louis Rilling <louis.rilling@kerlabs.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Steven Rostedt 9b63776fa3 tracing: Do not enable function event with enable
With the adding of function tracing event to perf, it caused a
side effect that produces the following warning when enabling all
events in ftrace:

 # echo 1 > /sys/kernel/debug/tracing/events/enable

[console]
event trace: Could not enable event function

This is because when enabling all events via the debugfs system
it ignores events that do not have a ->reg() function assigned.
This was to skip over the ftrace internal events (as they are
not TRACE_EVENTs). But as the ftrace function event now has
a ->reg() function attached to it for use with perf, it is no
longer ignored.

Worse yet, this ->reg() function is being called when it should
not be. It returns an error and causes the above warning to
be printed.

By adding a new event_call flag (TRACE_EVENT_FL_IGNORE_ENABLE)
and have all ftrace internel event structures have it set,
setting the events/enable will no longe try to incorrectly enable
the function event and does not warn.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-10 15:55:43 -04:00
Jan Kiszka b7dafa0ef3 compat: Fix RT signal mask corruption via sigprocmask
compat_sys_sigprocmask reads a smaller signal mask from userspace than
sigprogmask accepts for setting.  So the high word of blocked.sig[0]
will be cleared, releasing any potentially blocked RT signal.

This was discovered via userspace code that relies on get/setcontext.
glibc's i386 versions of those functions use sigprogmask instead of
rt_sigprogmask to save/restore signal mask and caused RT signal
unblocking this way.

As suggested by Linus, this replaces the sys_sigprocmask based compat
version with one that open-codes the required logic, including the merge
of the existing blocked set with the new one provided on SIG_SETMASK.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 08:58:33 -07:00
Paul E. McKenney b1420f1c8b rcu: Make rcu_barrier() less disruptive
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU.  Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.

However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state.  This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks.  Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:27:54 -07:00
Paul E. McKenney 98248a0e24 rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes
needless and fragile assumptions about the initial value of things like
the jiffies counter.  This commit therefore explicitly initializes all of
them that are better started with a non-zero value.  It also adds some
comments describing the per-CPU state variables.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:26:57 -07:00
Paul E. McKenney 21e52e1566 rcu: Make RCU_FAST_NO_HZ handle timer migration
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU.  This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.

However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases.  This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked.  This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142).  See also the bugzilla:

	https://bugzilla.redhat.com/show_bug.cgi?id=806548

This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.

Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-09 14:26:56 -07:00
Igor Mammedov 30b4e9eb78 sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
If we have one cpu that failed to boot and boot cpu gave up on
waiting for it and then another cpu is being booted, kernel
might crash with following OOPS:

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
   IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
   Call Trace:
       [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50

The crash happens in init_sched_groups_power() that expects
sched_groups to be circular linked list. However it is not
always true, since sched_groups preallocated in __sdt_alloc are
initialized in build_sched_groups and it may exit early

        if (cpu != cpumask_first(sched_domain_span(sd)))
                return 0;

without initializing sd->groups->next field.

Fix bug by initializing next field right after sched_group was
allocated.

Also-Reported-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: pjt@google.com
Cc: seto.hidetoshi@jp.fujitsu.com
Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 12:27:35 +02:00
Thomas Gleixner f5e1028736 task_allocator: Use config switches instead of magic defines
Replace __HAVE_ARCH_TASK_ALLOCATOR and __HAVE_ARCH_THREAD_ALLOCATOR
with proper config switches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20120505150142.371309416@linutronix.de
2012-05-08 14:08:46 +02:00
Thomas Gleixner 0d15d74a1e fork: Provide kmemcache based thread_info allocator
Several architectures have their own kmemcache based thread allocator
because THREAD_SIZE is smaller than PAGE_SIZE. Add it to the core code
conditionally on THREAD_SIZE < PAGE_SIZE so the private copies can go.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.491002124@linutronix.de
2012-05-08 14:08:44 +02:00
Thomas Gleixner 67ba5293f7 Merge branch 'smp/threadalloc' into smp/hotplug
Reason: Pull in the separate branch which was created so arch/tile can
base further work on it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-08 14:07:48 +02:00
Thomas Gleixner 41101809a8 fork: Provide weak arch_release_[task_struct|thread_info] functions
These functions allow us to move most of the duplicated thread_info
allocators to the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.366461660@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner 2889f60814 fork: Move thread info gfp flags to header
These flags can be useful for extra allocations outside of the core
code.

Add __GFP_NOTRACK to them, so the archs which have kmemcheck do
not have to provide extra allocators just for that reason.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.428211694@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner 6c0a9fa62f fork: Remove the weak insanity
We error out when compiling with gcc4.1.[01] as it miscompiles
__weak. The workaround with magic defines is not longer
necessary. Make it __weak again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.306358267@linutronix.de
2012-05-08 13:55:20 +02:00
Thomas Gleixner f37f435f33 smp: Implement kick_all_cpus_sync()
Will replace the misnomed cpu_idle_wait() function which is copied a
gazillion times all over arch/*

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de
2012-05-08 12:35:06 +02:00
Thomas Gleixner a4a2eb490e init_task: Create generic init_task instance
All archs define init_task in the same way (except ia64, but there is
no particular reason why ia64 cannot use the common version). Create a
generic instance so all archs can be converted over.

The config switch is temporary and will be removed when all archs are
converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de
2012-05-05 13:00:21 +02:00
Thomas Gleixner 43a18b1e58 smp: Fix idle_thread_init() inline stub
idle_thread_init() does not have arguments.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-04 12:52:25 +02:00
James Morris 898bfc1d46 Linux 3.4-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
 oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
 MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
 U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
 JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
 xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
 =4d4w
 -----END PGP SIGNATURE-----

Merge tag 'v3.4-rc5' into next

Linux 3.4-rc5

Merge to pull in prerequisite change for Smack:
86812bb0de

Requested by Casey.
2012-05-04 12:46:40 +10:00
Suresh Siddha 3bb5d2ee39 smp, idle: Allocate idle thread for each possible cpu during boot
percpu areas are already allocated during boot for each possible cpu.
percpu idle threads can be considered as an extension of the percpu areas,
and allocate them for each possible cpu during boot.

This will eliminate the need for workqueue based idle thread allocation.
In future we can move the idle thread area into the percpu area too.

[ tglx: Moved the loop into smpboot.c and added an error check when
  the init code failed to allocate an idle thread for a cpu which
  should be onlined ]

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: venki@google.com
Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-03 19:32:34 +02:00
Paul E. McKenney 9dd8fb16c3 rcu: Make exit_rcu() more precise and consolidate
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing).  The exit() path invokes exit_rcu() to
do this cleanup.

However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space.  Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.

While we are at it, consolidate the two identical exit_rcu() functions
into a single function.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:

	kernel/rcupdate.c
2012-05-02 14:48:27 -07:00
Paul E. McKenney 616c310e83 rcu: Move PREEMPT_RCU preemption to switch_to() invocation
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler.
This is inefficient because enqueuing is required only if there is a
context switch, and entry to the scheduler does not guarantee a context
switch.

The commit therefore moves the enqueuing to immediately precede the
call to switch_to() from the scheduler.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-02 14:43:23 -07:00
Paul E. McKenney f511fc6246 rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPU
Timers are subject to migration, which can lead to the following
system-hang scenario when CONFIG_RCU_FAST_NO_HZ=y:

1.	CPU 0 executes synchronize_rcu(), which posts an RCU callback.

2.	CPU 0 then goes idle.  It cannot immediately invoke the callback,
	but there is nothing RCU needs from ti, so it enters dyntick-idle
	mode after posting a timer.

3.	The timer gets migrated to CPU 1.

4.	CPU 0 never wakes up, so the synchronize_rcu() never returns, so
	the system hangs.

This commit fixes this problem by using mod_timer_pinned(), as suggested
by Peter Zijlstra, to ensure that the timer is actually posted on the
running CPU.

Reported-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-01 08:22:50 -07:00
Lai Jiangshan 9059c94017 rcu: Add rcutorture test for call_srcu()
Add srcu_torture_deferred_free() for srcu_ops so as to test the new
call_srcu().  Rename the original srcu_ops to srcu_sync_ops.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:26 -07:00
Lai Jiangshan 931ea9d1a6 rcu: Implement per-domain single-threaded call_srcu() state machine
This commit implements an SRCU state machine in support of call_srcu().
The state machine is preemptible, light-weight, and single-threaded,
minimizing synchronization overhead.  In particular, there is no longer
any need for synchronize_srcu() to be guarded by a mutex.

Expedited processing is handled, at least in the absence of concurrent
grace-period operations on that same srcu_struct structure, by having
the synchronize_srcu_expedited() thread take on the role of the
workqueue thread for one iteration.

There is a reasonable probability that a given SRCU callback will
be invoked on the same CPU that registered it, however, there is no
guarantee.  Concurrent SRCU grace-period primitives can cause callbacks
to be executed elsewhere, even in absence of CPU-hotplug operations.

Callbacks execute in process context, but under the influence of
local_bh_disable(), so it is illegal to sleep in an SRCU callback
function.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:25 -07:00
Lai Jiangshan d9792edd7a rcu: Use single value to handle expedited SRCU grace periods
The earlier algorithm used an "expedited" flag combined with a "trycount"
counter to differentiate between normal and expedited SRCU grace periods.
However, the difference can be encoded into a single counter with a cutoff
value and different initial values for expedited and normal SRCU grace
periods.  This commit makes that change.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Conflicts:

	kernel/srcu.c
2012-04-30 10:48:24 -07:00
Lai Jiangshan dc87917501 rcu: Improve srcu_readers_active_idx()'s cache locality
Expand the calls to srcu_readers_active_idx() from srcu_readers_active()
inline.  This change improves cache locality by interating over the CPUs
once rather than twice.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:24 -07:00
Lai Jiangshan b52ce066c5 rcu: Implement a variant of Peter's SRCU algorithm
This commit implements a variant of Peter's algorithm, which may be found
at https://lkml.org/lkml/2012/2/1/119.

o	Make the checking lock-free to enable parallel checking.
	Parallel checking is required when (1) the original checking
	task is preempted for a long time, (2) sychronize_srcu_expedited()
	starts during an ongoing SRCU grace period, or (3) we wish to
	avoid acquiring a lock.

o	Since the checking is lock-free, we avoid a mutex in state machine
	for call_srcu().

o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
	This might allow us to remove the preempt_disable() in future
	versions, though such removal will need great care because it
	rescinds the one-old-reader-per-CPU guarantee.

o	Remove a smp_mb(), simplify the comments and make the smp_mb() pairs
	more intuitive.

Inspired-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:22 -07:00
Lai Jiangshan 18108ebfeb rcu: Improve SRCU's wait_idx() comments
The safety of SRCU is provided byy wait_idx() rather than flipping.
The flipping actually prevents starvation.

This commit therefore updates the comments to more accurately and
precisely describe what is going on.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:22 -07:00
Lai Jiangshan 944ce9af47 rcu: Flip ->completed only once per SRCU grace period
This is an optimization of the SRCU grace period.  To guard against
preempted readers with old values of the counter, it suffices to scan the
old counters once more, then flip ->completed only one time.  The reason
this works is that the old readers must have incremented the old set of
counters (if they have not yet incremented, then their critical section
starts after this grace period, so they may be safely ignored).

This commit therefore optimizes the second flip out in favor of a simple
rescan.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:21 -07:00
Lai Jiangshan 440253c17f rcu: Increment upper bit only for srcu_read_lock()
The purpose of the upper bit of SRCU's per-CPU counters is to guarantee
that no reasonable series of srcu_read_lock() and srcu_read_unlock()
operations can return the value of the counter to its original value.
This guarantee is require only after the index has been switched to
the other set of counters, so at most one srcu_read_lock() can affect
a given CPU's counter.  The number of srcu_read_unlock() operations
on a given counter is limited to the number of tasks in the system,
which given the Linux kernel's current structure is limited to far less
than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems.
(Something about a limited number of bytes in the kernel's address space.)

Therefore, if srcu_read_lock() increments the upper bits, then
srcu_read_unlock() need not do so.  In this case, an srcu_read_lock() and
an srcu_read_unlock() will flip the lower bit of the upper field of the
counter.  An unreasonably large additional number of srcu_read_unlock()
operations would be required to return the counter to its initial value,
thus preserving the guarantee.

This commit takes this approach, which further allows it to shrink
the size of the upper field to one bit, making the number of
srcu_read_unlock() operations required to return the counter to its
initial value even more unreasonable than before.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:20 -07:00
Lai Jiangshan 4b7a3e9e32 rcu: Remove fast check path from __synchronize_srcu()
The fastpath in __synchronize_srcu() is designed to handle cases where
there are a large number of concurrent calls for the same srcu_struct
structure.  However, the Linux kernel currently does not use SRCU in
this manner, so remove the fastpath checks for simplicity.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:20 -07:00
Paul E. McKenney cef50120b6 rcu: Direct algorithmic SRCU implementation
The current implementation of synchronize_srcu_expedited() can cause
severe OS jitter due to its use of synchronize_sched(), which in turn
invokes try_stop_cpus(), which causes each CPU to be sent an IPI.
This can result in severe performance degradation for real-time workloads
and especially for short-interation-length HPC workloads.  Furthermore,
because only one instance of try_stop_cpus() can be making forward progress
at a given time, only one instance of synchronize_srcu_expedited() can
make forward progress at a time, even if they are all operating on
distinct srcu_struct structures.

This commit, inspired by an earlier implementation by Peter Zijlstra
(https://lkml.org/lkml/2012/1/31/211) and by further offline discussions,
takes a strictly algorithmic bits-in-memory approach.  This has the
disadvantage of requiring one explicit memory-barrier instruction in
each of srcu_read_lock() and srcu_read_unlock(), but on the other hand
completely dispenses with OS jitter and furthermore allows SRCU to be
used freely by CPUs that RCU believes to be idle or offline.

The update-side implementation handles the single read-side memory
barrier by rechecking the per-CPU counters after summing them and
by running through the update-side state machine twice.

This implementation has passed moderate rcutorture testing on both
x86 and Power.  Also updated to use this_cpu_ptr() instead of per_cpu_ptr(),
as suggested by Peter Zijlstra.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-04-30 10:48:19 -07:00
Paul E. McKenney fae4b54f28 rcu: Introduce rcutorture testing for rcu_barrier()
Although rcutorture does invoke rcu_barrier() and friends, it cannot
really be called a torture test given that it invokes them only once
at the end of the test.  This commit therefore introduces heavy-duty
rcutorture testing for rcu_barrier(), which may be carried out
concurrently with normal rcutorture testing.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30 10:48:18 -07:00
Linus Torvalds 6cfdd02b88 Power management fixes for 3.4
Fix for an issue causing hibernation to hang on systems with highmem (that
 practically means i386) due to broken memory management (bug introduced in 3.2,
 so -stable material) and PM documentation update making the freezer
 documentation follow the code again after some recent updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJPnbeLAAoJEKhOf7ml8uNs4nkQAKwhfWWfbM7ZepPfT56A5NW1
 9vlfgO+1ibUgdjkO0hi1biCAbARTNVS5eCLyRJW0W/msGgL51nYleBeFmwwx5E6h
 m5Vwr/53cGeAeF0AOkrQkD45YKJaAlTmWF4/T2YWKLWNMgaVuLmGf7eyYZ6rP1NO
 rJxzUOMC6UrRjIA+S2anDU0CdMyqDHvV3OmY+InZBikFCk0YAtDYUYfNDNqQpEBG
 bzkG3SyaJeqnbQDkhme7U/uAPJCThSz2Z4gvvOxiXdB+I+yp6FhluhLSGxqMh/kj
 OUAJe9s6AAdKz+K62/OgowwucxvmeJRCyYWkN2ZEpsZLoqTEOqLNS4+eaUO6xS/2
 tq89LnfSIVFwRx23XeVr/oMfxUJZ8VKZENo5Pm6NjTAYykTeyD4ug/GAHqgXR0TT
 B+fvx8QmQ68R843aJsjR9h0AKsSeXfgCAROJt+x0ONYAvmJNV62nzs81broDEl4I
 BmWHpOWI7wlzMPt7bNWn4ev4K+WhbVsioFDS61he0Y47Rqt3yUJ8G2OfBq6JYndw
 As4ImoPOVGl0+TKcHJ9Y3bVPnsY7fJyF0GeG50NHxsVFsnTv+rYZ9K3GM9KExhO/
 5mCfoHNgkOJnhGHfZppbnQBHbmjH8EA3QUx57Abo+Q4wiPNNAVG9P1JpZeyGj8KF
 3YML5FjjGQHtYBWeH5WR
 =HaVL
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael J. Wysocki:
 "Fix for an issue causing hibernation to hang on systems with highmem
  (that practically means i386) due to broken memory management (bug
  introduced in 3.2, so -stable material) and PM documentation update
  making the freezer documentation follow the code again after some
  recent updates."

* tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Freezer / Docs: Update documentation about freezing of tasks
  PM / Hibernate: fix the number of pages used for hibernate/thaw buffering
2012-04-29 15:00:44 -07:00
Linus Torvalds 78e97a4788 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU fix from Ingo Molnar.

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Permit call_rcu() from CPU_DYING notifiers
2012-04-27 19:40:56 -07:00
Linus Torvalds daae677f56 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix OOPS when build_sched_domains() percpu allocation fails
  sched: Fix more load-balancing fallout
2012-04-27 19:37:00 -07:00
Linus Torvalds 06fc5d3d24 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar.

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix perf_event_for_each() to use sibling
  perf symbols: Read plt symbols from proper symtab_type binary
  tracing: Fix stacktrace of latency tracers (irqsoff and friends)
  perf tools: Add 'G' and 'H' modifiers to event parsing
  tracing: Fix regression with tracing_on
  perf tools: Drop CROSS_COMPILE from flex and bison calls
  perf report: Fix crash showing warning related to kernel maps
  tracing: Fix build breakage without CONFIG_PERF_EVENTS (again)
2012-04-27 19:35:50 -07:00
Linus Torvalds f6072452c9 Merge branch 'for-v3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull build fixes for less mainstream architectures from Paul Gortmaker:
 "These are fixes for frv(1), blackfin(2), powerpc(1) and xtensa(4).

  Fortunately the touches are nearly all specific to files just used by
  the arch in question.  The two touches to shared/common files
  [kernel/irq/debug.h and drivers/pci/Makefile] are trivial to assess as
  no risk to anyone.

  Half of them relate to xtensa directly.  It was only when I fixed the
  last xtensa issue that I realized that the arch has been broken for a
  significant time, and isn't a specific v3.4 regression.  So if you
  wanted, we could leave xtensa lying bleeding in the street for a
  couple more weeks and queue those for 3.5.  But given they are no risk
  to anyone outside of xtensa, I figured to just leave them in.

  If you are OK with taking the xtensa fixes, then please pull to get:

   - one last implicit include uncovered by system.h that is in a file
     specific to just one powerpc defconfig.  (I'd sync'd with BenH).

   - fix an oversight in the PCI makefile where shared code wasn't being
     compiled for ARCH=frv

   - fix a missing include for GPIO in blackfin framebuffer.

   - audit and tag endif in blackfin ezkit board file, in order to find
     and fix the misplaced endif masking a block of code.

   - fix irq/debug.h choice of temporary macro names to be more internal
     so they don't conflict with names used by xtensa.

   - fix a reference to an undeclared local var in xtensa's signal.c

   - fix an implicit bug.h usage in xtensa's asm/io.h uncovered by my
     removing bug.h from kernel.h

   - fix xtensa to properly indicate it is using asm-generic/hardirq.h
     in order to resolve the link error - undefined ack_bad_irq

  The xtensa still fails final link as my latest binutils does something
  evil when ld forward-relocates unlikely() blocks, but in theory people
  who have older/valid toolchains could now use the thing."

* 'for-v3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
  xtensa: fix build fail on undefined ack_bad_irq
  blackfin: fix ifdef fustercluck in mach-bf538/boards/ezkit.c
  blackfin: fix compile error in bfin-lq035q1-fb.c
  pci: frv architecture needs generic setup-bus infrastructure
  irq: hide debug macros so they don't collide with others.
  xtensa: fix build error in xtensa/include/asm/io.h
  xtensa: fix build failure in xtensa/kernel/signal.c
  powerpc: fix system.h fallout in sysdev/scom.c [chroma_defconfig]
2012-04-27 19:32:37 -07:00
Paul E. McKenney 048a0e8f5e timer: Fix mod_timer_pinned() header comment
The mod_timer_pinned() header comment states that it prevents timers
from being migrated to a different CPU.  This is not the case, instead,
it ensures that the timer is posted to the current CPU, but does nothing
to prevent CPU-hotplug operations from migrating the timer.

This commit therefore brings the comment header into alignment with
reality.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
2012-04-26 12:28:03 -07:00
Paul E. McKenney 79b9a75fb7 rcu: Add warning for RCU_FAST_NO_HZ timer firing
RCU_FAST_NO_HZ uses a timer to limit the time that a CPU with callbacks
can remain in dyntick-idle mode.  This timer is cancelled when the CPU
exits idle, and therefore should never fire.  However, if the timer
were migrated to some other CPU for whatever reason (1) the timer could
actually fire and (2) firing on some other CPU would fail to wake up the
CPU with callbacks, possibly resulting in sluggishness or a system hang.

This commit therfore adds a WARN_ON_ONCE() to the timer handler in order
to detect this condition.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-26 08:49:05 -07:00
Michael Ellerman 724b6daa13 perf: Fix perf_event_for_each() to use sibling
In perf_event_for_each() we call a function on an event, and then
iterate over the siblings of the event.

However we don't call the function on the siblings, we call it
repeatedly on the original event - it seems "obvious" that we should
be calling it with sibling as the argument.

It looks like this broke in commit 75f937f24b ("Fix ctx->mutex
vs counter->mutex inversion").

The only effect of the bug is that the PERF_IOC_FLAG_GROUP parameter
to the ioctls doesn't work.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334109253-31329-1-git-send-email-michael@ellerman.id.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26 13:51:31 +02:00
he, bo fb2cf2c660 sched: Fix OOPS when build_sched_domains() percpu allocation fails
Under extreme memory used up situations, percpu allocation
might fail. We hit it when system goes to suspend-to-ram,
causing a kworker panic:

 EIP: [<c124411a>] build_sched_domains+0x23a/0xad0
 Kernel panic - not syncing: Fatal exception
 Pid: 3026, comm: kworker/u:3
 3.0.8-137473-gf42fbef #1

 Call Trace:
  [<c18cc4f2>] panic+0x66/0x16c
  [...]
  [<c1244c37>] partition_sched_domains+0x287/0x4b0
  [<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210
  [<c123712d>] cpuset_cpu_inactive+0x1d/0x30
  [...]

With this fix applied build_sched_domains() will return -ENOMEM and
the suspend attempt fails.

Signed-off-by: he, bo <bo.he@intel.com>
Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
[ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
  I don't like that kind of sad behavior but nevertheless it should
  not crash under high memory load. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26 12:54:53 +02:00
Peter Zijlstra eb95308ee2 sched: Fix more load-balancing fallout
Commits 367456c756 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.

By setting loop_max unconditionally to ->nr_running load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as max limit of the amount of tasks we'll iterate.

Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/

Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.

Combined these fixes solve the huge hackbench regression reported by
Tim when hackbench is ran in a cgroup.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
[ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26 12:54:52 +02:00