Discussion:
INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=10002 jiffies)
Fengguang Wu
2012-09-25 11:19:38 UTC
Permalink
Hi Paul,

I've just bisected down one RCU stall problem:

[ 12.035785] pktgen: Packet Generator for packet performance testing. Version: 2.74
[ 12.435439] atkbd: probe of serio0 rejects match -19
[ 111.700160] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=10002 jiffies)
[ 111.700171] Pid: 0, comm: swapper/0 Not tainted 3.6.0-rc5-00004-gda10491 #1
[ 111.700178] Call Trace:
[ 111.700475] [<c10c3c84>] rcu_check_callbacks+0x544/0x570
[ 111.700538] [<c1075e86>] update_process_times+0x36/0x70
[ 111.700547] [<c10a6267>] tick_sched_timer+0x57/0xc0
[ 111.700552] [<c108758a>] __run_hrtimer.isra.31+0x4a/0xc0
[ 111.700557] [<c10a6210>] ? tick_nohz_handler+0xf0/0xf0
[ 111.700559] [<c1088155>] hrtimer_interrupt+0xf5/0x290
[ 111.700562] [<c1091cb8>] ? sched_clock_idle_wakeup_event+0x18/0x20
[ 111.700565] [<c10a6399>] ? tick_nohz_stop_idle+0x39/0x40
[ 111.700572] [<c104f56f>] smp_apic_timer_interrupt+0x4f/0x80
[ 111.700587] [<c1753636>] apic_timer_interrupt+0x2a/0x30
[ 111.700593] [<c10565b5>] ? native_safe_halt+0x5/0x10
[ 111.700599] [<c1039f89>] default_idle+0x29/0x50
[ 111.700601] [<c103a958>] cpu_idle+0x68/0xb0
[ 111.700609] [<c16f2ff7>] rest_init+0x67/0x70
[ 111.700627] [<c1af7929>] start_kernel+0x2ea/0x2f0
[ 111.700629] [<c1af7474>] ? repair_env_string+0x51/0x51
[ 111.700631] [<c1af72a2>] i386_start_kernel+0x78/0x7d
[ 127.040302] bus: 'serio': driver_probe_device: matched device serio0 with driver atkbd
[ 127.041308] CPA self-test:

to this commit:

commit 06ae115a1d551cd952d80df06eaf8b5153351875
Author: Paul E. McKenney <***@linux.vnet.ibm.com>
Date: Sun Aug 14 15:56:54 2011 -0700

rcu: Avoid having just-onlined CPU resched itself when RCU is idle

Thanks,
Fengguang
Paul E. McKenney
2012-09-25 15:07:01 UTC
Permalink
Post by Fengguang Wu
Hi Paul,
[ 12.035785] pktgen: Packet Generator for packet performance testing. Version: 2.74
[ 12.435439] atkbd: probe of serio0 rejects match -19
[ 111.700160] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=10002 jiffies)
[ 111.700171] Pid: 0, comm: swapper/0 Not tainted 3.6.0-rc5-00004-gda10491 #1
[ . . . ]
[ 127.040302] bus: 'serio': driver_probe_device: matched device serio0 with driver atkbd
commit 06ae115a1d551cd952d80df06eaf8b5153351875
Date: Sun Aug 14 15:56:54 2011 -0700
rcu: Avoid having just-onlined CPU resched itself when RCU is idle
Interesting. Of course the stack is from the CPU that detected the
problem rather than the problematic CPU. ;-)

Could you please try the following patch?

Thanx, Paul
------------------------------------------------------------------------

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting
for that grace period to complete. However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally. In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period. Given that nothing is waiting on this grace
period, this is also not a problem.

That is, unless RCU CPU stall warnings are also enabled, as they are
in recent kernels. In this case, if a CPU wakes up after at least one
minute of inactivity, an RCU CPU stall warning will result. The reason
that no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that consistently get into this situation.

All this raises the question of exactly how a callback-free grace period
gets started in the first place. This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU believes that the grace period that just completed is still
ongoing, it will think that its callbacks need to wait for another
grace period, never mind the fact that the grace period they were
waiting for has just completed. This CPU can therefore erroneously
decide to start a new grace period. Note that this can happen in
TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system: Deadlock
considerations mean that the CPU that detected the end of the grace
period is not necessarily officially informed of this fact for some time.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks. It then won't have any callbacks left. If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period. This new check relies on an array of tail pointers
into each CPU's list of callbacks. If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment. The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU is officially notified that the old grace period
has ended.

This change is to cpu_needs_another_gp(), which is called in a number
of places. The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.

Reported-by: Becky Bruce <***@gmail.com>
Reported-by: Subodh Nijsure <***@grid-net.com>
Reported-by: Paul Walmsley <***@pwsan.com>
Signed-off-by: Paul E. McKenney <***@linaro.org>
Signed-off-by: Paul E. McKenney <***@linux.vnet.ibm.com>
Tested-by: Paul Walmsley <***@pwsan.com> # OMAP3730, OMAP4430
Cc: ***@vger.kernel.org

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..f7bcd9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -305,7 +305,9 @@ cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 static int
 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 {
-	return *rdp->nxttail[RCU_DONE_TAIL] && !rcu_gp_in_progress(rsp);
+	return *rdp->nxttail[RCU_DONE_TAIL +
+			     ACCESS_ONCE(rsp->completed) != rdp->completed] &&
+	       !rcu_gp_in_progress(rsp);
 }

 /*
Avi Kivity
2012-09-30 11:10:55 UTC
Permalink
[ . . . ]
But could you also please send your .config file and a description of
the workload you are running?
.config attached.
It's basically the below commands. The exact initrd is not relevant in
this case because it's a boot-time warning, before user space is
started. The stalls happen roughly once in every 10 boots.
Yow!!!
You have severe cross-CPU time-synchronization problems. See for
example the first dmesg, with the relevant part extracted right here.
One CPU believes that it is about 37 seconds past boot, and the other
CPU believes that it is about 137 seconds past boot. Given that large
a time difference, an RCU CPU stall warning is expected behavior.
Good spot! Yeah, I noticed that huge timestamp gap, but didn't take
it seriously enough.
Get your two CPUs in agreement about what time it is, and I bet that
the CPU stall warnings will go away.
Possibly KVM related? Because the warnings show up in many test boxes
running KVM and so is not likely some hardware specific issue.
I vaguely recall seeing something recently. But let's ask the KVM and
timekeeping guys.
From the logs it looks like hpet (why not kvmclock?) is used for the
clock, it should not generate such drifts since it is a global clock.
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?
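For reference, the clocksource actually in use can be read at runtime via sysfs; a small sketch (standard Linux sysfs path assumed, with a fallback note if it is not mounted):

```shell
# Print the clocksource in use and the alternatives (hpet, tsc,
# kvm-clock, ...) for the first clocksource device.
d=/sys/devices/system/clocksource/clocksource0
for f in current_clocksource available_clocksource; do
    if [ -r "$d/$f" ]; then
        printf '%s: %s\n' "$f" "$(cat "$d/$f")"
    else
        printf '%s: <not available>\n' "$f"
    fi
done
```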
--
error compiling committee.c: too many arguments to function
Fengguang Wu
2012-09-30 11:23:08 UTC
Permalink
Post by Avi Kivity
[ . . . ]
From the logs it looks like hpet (why not kvmclock?) is used for the
clock, it should not generate such drifts since it is a global clock.
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?
I've checked the dmesg cited by Paul (attached). Yes, it contains the
lines

[ 4.970051] Switching to clocksource hpet

and then

[ 7.250353] Switching to clocksource tsc

And there are no kvm-clock lines. Oh well, for this particular kernel:

# CONFIG_KVM_CLOCK is not set

I'm not sure how this happened; maybe some kconfig option that
CONFIG_KVM_CLOCK depends on was randconfig'ed to off.

Thanks,
Fengguang
Avi Kivity
2012-09-30 11:32:46 UTC
Permalink
Post by Fengguang Wu
Post by Avi Kivity
[ . . . ]
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?
I've checked out the dmesg that's cited by Paul, attached. Yes it
contains lines
[ 4.970051] Switching to clocksource hpet
and then
[ 7.250353] Switching to clocksource tsc
Ah, tsc will certainly break on kvm if the hardware doesn't provide a
constant tsc source. I'm surprised the guest kernel didn't detect it
and switch back to hpet though.
--
error compiling committee.c: too many arguments to function
Fengguang Wu
2012-09-30 11:59:37 UTC
Permalink
Post by Avi Kivity
Post by Fengguang Wu
Post by Avi Kivity
[ . . . ]
I've checked out the dmesg that's cited by Paul, attached. Yes it
contains lines
[ 4.970051] Switching to clocksource hpet
and then
[ 7.250353] Switching to clocksource tsc
Ah, tsc will certainly break on kvm if the hardware doesn't provide a
constant tsc source. I'm surprised the guest kernel didn't detect it
and switch back to hpet though.
Thanks, it's good to know the root cause. All the dmesgs show the same
hpet+tsc switching pattern (and never switching back):

$ grep Switching dmesg-kvm_bisect2-inn-*21
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[ 4.111415] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[ 6.550098] Switching to clocksource tsc
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-41-48-3.6.0-rc7-bisect2-00078-g593d100-21:[ 3.927716] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-41-48-3.6.0-rc7-bisect2-00078-g593d100-21:[ 6.030117] Switching to clocksource tsc
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-35-17-3.6.0-rc7-bisect2-00078-g593d100-21:[ 3.587408] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-35-17-3.6.0-rc7-bisect2-00078-g593d100-21:[ 5.812400] Switching to clocksource tsc
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-43-56-3.6.0-rc7-bisect2-00078-g593d100-21:[ 4.294842] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-43-56-3.6.0-rc7-bisect2-00078-g593d100-21:[ 6.491696] Switching to clocksource tsc
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-47-03-3.6.0-rc7-bisect2-00078-g593d100-21:[ 4.745749] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-42322-2012-09-27-10-47-03-3.6.0-rc7-bisect2-00078-g593d100-21:[ 7.193121] Switching to clocksource tsc
dmesg-kvm_bisect2-inn-42527-2012-09-27-10-38-38-3.6.0-rc7-bisect2-00078-g593d100-21:[ 4.970051] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-42527-2012-09-27-10-38-38-3.6.0-rc7-bisect2-00078-g593d100-21:[ 7.250353] Switching to clocksource tsc

And these are the stall times:

$ grep -hC1 stalls dmesg-kvm_bisect2-inn-*21
[ 36.122624] bus: 'platform': really_probe: probing driver i8042 with device i8042
[ 36.106893] INFO: rcu_preempt self-detected stall on CPU[ 210.200388] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=17413 jiffies)
[ 210.200417] Pid: 0, comm: swapper/0 Not tainted 3.6.0-rc7-bisect2-00078-g593d100 #21
--
[ 35.403888] bus: 'platform': really_probe: probing driver i8042 with device i8042
[ 212.130131] INFO: rcu_preempt detected stalls on CPUs/tasks:[ 212.131029] rcu-torture: rtc: c1a5e988 ver: 228 tfle: 0 rta: 228 rtaf: 162 rtf: 206 rtmbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 988 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=100) barrier: 0/0:0
[ 212.131029] rcu-torture: Reader Pipe: 16284405 36 0 0 0 0 0 0 0 0 0
--
[ 35.458755] bus: 'platform': really_probe: probing driver i8042 with device i8042
[ 35.448369] INFO: rcu_preempt self-detected stall on CPU { 1} [ 210.260157] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=17485 jiffies)
[ 210.260179] Pid: 0, comm: swapper/0 Not tainted 3.6.0-rc7-bisect2-00078-g593d100 #21
--
[ 36.079480] INFO: rcu_preempt self-detected stall on CPU { 1
[ 223.710182] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=18767 jiffies)
[ 223.710201] Pid: 0, comm: swapper/0 Not tainted 3.6.0-rc7-bisect2-00078-g593d100 #21
--
[ 37.831174] kobject: ':t-0000168' (c793a24c): fill_kobj_path: path = '/kernel/slab/:t-0000168'
[ 136.910092] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 1, t=10002 jiffies)
[ 136.910092] INFO: Stall ended before state dump start
--
[ 37.672159] kobject: 'iscsi_session' (cc2c0c0c): fill_kobj_path: path = '/class/iscsi_session'
[ 37.672791] INFO: rcu_preempt self-detected stall on CPU { 1[ 137.900041] INFO: rcu_preempt detected stalls on CPUs/tasks: { 1} (detected by 0, t=10028 jiffies)
[ 137.900041] Pid: 41, comm: rcu_torture_rea Not tainted 3.6.0-rc7-bisect2-00078-g593d100 #21
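The timestamp jumps above can also be flagged mechanically; a hypothetical helper (not something used in the thread) that reports consecutive printk timestamps more than 60 seconds apart, with demo input inline (point the here-doc at a real dmesg instead):

```shell
# Split each line on '[' and ']' so $2 is the printk timestamp,
# then report jumps larger than 60 s between consecutive lines.
awk -F'[][]' '/^\[/ {
    t = $2 + 0
    if (prev != "" && t - prev > 60)
        printf "gap of %.0f s before: %s\n", t - prev, $0
    prev = t
}' <<'EOF'
[   36.122624] bus: probing driver i8042
[  210.200388] INFO: rcu_preempt detected stalls on CPUs/tasks
EOF
```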

Thanks,
Fengguang
John Stultz
2012-10-08 23:40:06 UTC
Permalink
Post by Fengguang Wu
Post by Avi Kivity
Post by Fengguang Wu
Post by Avi Kivity
[ . . . ]
I've checked out the dmesg that's cited by Paul, attached. Yes it
contains lines
[ 4.970051] Switching to clocksource hpet
and then
[ 7.250353] Switching to clocksource tsc
Ah, tsc will certainly break on kvm if the hardware doesn't provide a
constant tsc source. I'm surprised the guest kernel didn't detect it
and switch back to hpet though.
Thanks, it's good to know the root cause. All the dmesgs show the same hpet+tsc
$ grep Switching dmesg-kvm_bisect2-inn-*21
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[ 4.111415] Switching to clocksource hpet
dmesg-kvm_bisect2-inn-41931-2012-09-27-10-37-51-3.6.0-rc7-bisect2-00078-g593d100-21:[ 6.550098] Switching to clocksource tsc
Is this still an open issue? Fengguang's mail sounds like it's resolved,
but I'm not sure it is.

The switching from HPET -> TSC I believe is expected, as the refined
calibration will delay the TSC from being registered for a few seconds.
However, it's unclear why the TSC, if it is faulty, isn't being caught
and demoted by the clocksource watchdog.
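When the watchdog does demote a clocksource, the kernel logs a "Clocksource ... unstable" message (from kernel/time/clocksource.c), so the saved boot logs can be grepped for it next to the switch lines; the file glob below is illustrative, matching the dmesg names quoted earlier in the thread:

```shell
# Look for watchdog demotions alongside the clocksource switches in
# the saved boot logs; no 'unstable' hit would confirm the watchdog
# never fired before the stall.  (Glob is illustrative.)
grep -EHi 'clocksource.*unstable|switching to clocksource' \
    dmesg-kvm_bisect2-inn-*21 2>/dev/null || echo 'no matches'
```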

I'm also curious why this originally bisected down to
06ae115a1d551cd952d8 (when using the kvm clock) if it was more of a
hardware issue. And in those logs, I don't see the printk time-stamp
inconsistencies that were alluded to in this thread.

Fengguang: Is this still reproducible? Do you have any details (dmesg)
about the host system as well?

thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fengguang Wu
2012-09-30 11:18:12 UTC
Permalink
Post by Avi Kivity
[ . . . ]
From the logs it looks like hpet (why not kvmclock?) is used for the
Hi Avi! Thanks for looking into this. It seems you already have the
full logs, attached in my previous email?

FYI, I've enabled CONFIG_KVM_CLOCK/CONFIG_KVM_GUEST for all bootable
kernels and here is the related boot message:

[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:1b7ec81, boot clock
Post by Avi Kivity
clock, it should not generate such drifts since it is a global clock.
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?
I see a line

[ 2.011710] Switching to clocksource kvm-clock

w/o any indication of errors.

Thanks,
Fengguang
Avi Kivity
2012-09-30 11:31:42 UTC
Permalink
Post by Fengguang Wu
Post by Avi Kivity
[ . . . ]
From the logs it looks like hpet (why not kvmclock?) is used for the
Hi Avi! Thanks for looking into this. It seems you have the full logs
attached in my previous email?
FYI, I've enabled CONFIG_KVM_CLOCK/CONFIG_KVM_GUEST for all bootable
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:1b7ec81, boot clock
Post by Avi Kivity
clock, it should not generate such drifts since it is a global clock.
Can you verify current_clocksource on a boot that actually failed (in
case the clocksource is switched during runtime)?
I see a line
[ 2.011710] Switching to clocksource kvm-clock
w/o any indication of errors.
So, with kvmclock it works and with hpet it fails? Strange; hpet is
simple while kvmclock is more complicated, so I'd have expected the
opposite behaviour.
--
error compiling committee.c: too many arguments to function