Mailing List Archive

[PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes
As widely reported on the internet today, some Linux systems after
the leapsecond was inserted are experiencing futex related load
spikes (usually connected to MySQL, Firefox, Thunderbird, Java, etc).

An apparent for this issue workaround is running:
$ date -s "`date`"

Credit: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix

I believe this issue is due to the leapsecond being added without
calling clock_was_set() to notify the hrtimer subsystem of the
change. (Although I've not yet chased all the way down to the
hrtimer code to validate exactly what's going on there).

The workaround functions as it forces a clock_was_set()
call from settimeofday().

This fix adds some extra logic to track when a leapsecond
is added from update_wall_time() and calls clock_was_set()
once the timekeeper.lock is released.

I've been able to reproduce the load spike using Thunderbird
when triggering a leap second and with this patch the issue
did not crop up.

NOTE: Some reports have been of a hard hang right at or before
the leapsecond. I've not been able to reproduce or diagnose
this, so this fix does not likely address the reported hard
hangs (unless they end up being connected to the futex/hrtimer
issue).

It had been a long day before I heard about this issue, so
my brain is a little mushy right now. Reviews and extra
testing would be greatly appreciated.

CC: stable@vger.kernel.org
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
---
kernel/time/timekeeping.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6f46a00..e5da44f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -942,7 +942,7 @@ static void timekeeping_adjust(s64 offset)
*
* Returns the unconsumed cycles.
*/
-static cycle_t logarithmic_accumulation(cycle_t offset, int shift)
+static cycle_t logarithmic_accumulation(cycle_t offset, int shift, int* clockset)
{
u64 nsecps = (u64)NSEC_PER_SEC << timekeeper.shift;
u64 raw_nsecs;
@@ -963,6 +963,8 @@ static cycle_t logarithmic_accumulation(cycle_t offset, int shift)
leap = second_overflow(timekeeper.xtime.tv_sec);
timekeeper.xtime.tv_sec += leap;
timekeeper.wall_to_monotonic.tv_sec -= leap;
+ if (leap)
+ *clockset = 1;
}

/* Accumulate raw time */
@@ -994,6 +996,7 @@ static void update_wall_time(void)
struct clocksource *clock;
cycle_t offset;
int shift = 0, maxshift;
+ int clockset = 0;
unsigned long flags;

write_seqlock_irqsave(&timekeeper.lock, flags);
@@ -1026,7 +1029,7 @@ static void update_wall_time(void)
maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
shift = min(shift, maxshift);
while (offset >= timekeeper.cycle_interval) {
- offset = logarithmic_accumulation(offset, shift);
+ offset = logarithmic_accumulation(offset, shift, &clockset);
if(offset < timekeeper.cycle_interval<<shift)
shift--;
}
@@ -1079,6 +1082,8 @@ static void update_wall_time(void)
leap = second_overflow(timekeeper.xtime.tv_sec);
timekeeper.xtime.tv_sec += leap;
timekeeper.wall_to_monotonic.tv_sec -= leap;
+ if (leap)
+ clockset = 1;
}

timekeeping_update(false);
@@ -1086,6 +1091,8 @@ static void update_wall_time(void)
out:
write_sequnlock_irqrestore(&timekeeper.lock, flags);

+ if (clockset)
+ clock_was_set();
}

/**
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On 07/01/2012 02:36 AM, John Stultz wrote:
> As widely reported on the internet today, some Linux systems after
> the leapsecond was inserted are experiencing futex related load
> spikes (usually connected to MySQL, Firefox, Thunderbird, Java, etc).
>
> An apparent for this issue workaround is running:
> $ date -s "`date`"
>
> Credit: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix
>
> I believe this issue is due to the leapsecond being added without
> calling clock_was_set() to notify the hrtimer subsystem of the
> change. (Although I've not yet chased all the way down to the
> hrtimer code to validate exactly what's going on there).
>
> The workaround functions as it forces a clock_was_set()
> call from settimeofday().
>
> This fix adds some extra logic to track when a leapsecond
> is added from update_wall_time() and calls clock_was_set()
> once the timekeeper.lock is released.
>
> I've been able to reproduce the load spike using Thunderbird
> when triggering a leap second and with this patch the issue
> did not crop up.

Also, attached is the test case I've been using to trigger leapseconds,
in case anyone else is interested in trying to either test this fix or
help reproduce the reported hard hangs.

To build:
gcc leaptest.c -o leaptest -lrt

thanks
-john
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On 07/01/2012 11:36 AM, John Stultz wrote:
> I believe this issue is due to the leapsecond being added without
> calling clock_was_set() to notify the hrtimer subsystem of the
> change. (Although I've not yet chased all the way down to the
> hrtimer code to validate exactly what's going on there).

For the benefit of -stable:

Am I right in thinking that, if the analysis is confirmed, this was
caused by the following commit:

commit 746976a301ac9c9aa10d7d42454f8d6cdad8ff2b
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Tue Jul 3 20:05:20 2007 +0200

NTP: remove clock_was_set() call to prevent deadlock

The clock_was_set() call in seconds_overflow() which happens only when
leap seconds are inserted / deleted is wrong in two aspects:

1. it results in a call to on_each_cpu() with interrupts disabled
2. it is potential deadlock source vs. call_lock in smp_call_function()

The only possible side effect of the removal might be, that an absolute
CLOCK_REALTIME timer fires 1 second too late, in the rare case of leap
second deletion and an absolute CLOCK_REALTIME timer which expires in
the affected time frame. It will never fire too early.

This was probably observed by the reporter of a June 30th -> July 1st
hang: http://lkml.org/lkml/2007/7/3/103

A similar problem was observed by Dave Jones, who provided a screen shot
with a lockdep back trace, which allowed to analyse the problem.

Sob: Thomas Gleixner <tglx@linutronix.de>
Ab: Ingo Molnar <mingo@elte.hu>
Sob: Linus Torvalds <torvalds@linux-foundation.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
John,

I was hit by the futex issue as well. I saw your patch and quickly did a test
with top-of-tree + your patch using your reproducer. I end up with warnings
from the smp_call_function code followed by all sorts of deadlocks, etc.

I haven't had a chance to debug and will start doing so shortly ...

intel-canoepass-02 login: [ 108.479555] Clock: inserting leap second 23:59:60 UTC
[ 108.485199] ------------[ cut here ]------------
[ 108.490368] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[ 108.498236] Hardware name: S2600CP
[ 108.502060] Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc
kvm_intel igb coretemp kvm ixgbe ptp pps_core ioatdma mdio tpm_tis crc32c_intel
wmi joydev dca tpm lpc_ich ghash_clmulni_intel sb_edac mfd_core edac_core
i2c_i801 microcode pcspkr tpm_bios hid_generic isci libsas scsi_transport_sas
mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
[ 108.540561] Pid: 1328, comm: leaptest Not tainted 3.5.0-rc4+ #4
[ 108.547169] Hypervisor: no hypervisor
[ 108.551273] Call Trace:
[ 108.554019] <IRQ> [<ffffffff8105814f>] warn_slowpath_common+0x7f/0xc0
[ 108.561398] [<ffffffff810581aa>] warn_slowpath_null+0x1a/0x20
[ 108.567911] [<ffffffff810b39bd>] smp_call_function_many+0xbd/0x260
[ 108.574931] [<ffffffff8107e960>] ? hrtimer_wakeup+0x30/0x30
[ 108.581242] [<ffffffff8107e960>] ? hrtimer_wakeup+0x30/0x30
[ 108.587560] [<ffffffff810b3cb2>] smp_call_function+0x22/0x30
[ 108.593982] [<ffffffff810b3d18>] on_each_cpu+0x28/0x70
[ 108.599825] [<ffffffff8107ef7c>] clock_was_set+0x1c/0x30
[ 108.605847] [<ffffffff810a71d5>] do_timer+0x315/0x570
[ 108.611592] [<ffffffff810adb18>] tick_do_update_jiffies64+0x78/0xc0
[ 108.618680] [<ffffffff810add28>] tick_sched_timer+0xb8/0xc0
[ 108.624991] [<ffffffff8107ed03>] __run_hrtimer+0x73/0x1d0
[ 108.631111] [<ffffffff810adc70>] ? tick_nohz_handler+0x110/0x110
[ 108.637908] [<ffffffff8107f5d7>] hrtimer_interrupt+0xd7/0x1f0
[ 108.644447] [<ffffffff81610c19>] smp_apic_timer_interrupt+0x69/0x99
[ 108.651550] [<ffffffff8160f98a>] apic_timer_interrupt+0x6a/0x70
[ 108.658255] <EOI>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On 07/01/2012 11:28 AM, Prarit Bhargava wrote:
> John,
>
> I was hit by the futex issue as well. I saw your patch and quickly did a test
> with top-of-tree + your patch using your reproducer. I end up with warnings
> from the smp_call_function code followed by all sorts of deadlocks, etc.
>
> I haven't had a chance to debug and will start doing so shortly ...
>
> intel-canoepass-02 login: [ 108.479555] Clock: inserting leap second 23:59:60 UTC
> [ 108.485199] ------------[ cut here ]------------
> [ 108.490368] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
> [ 108.498236] Hardware name: S2600CP
> [ 108.502060] Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc
> kvm_intel igb coretemp kvm ixgbe ptp pps_core ioatdma mdio tpm_tis crc32c_intel
> wmi joydev dca tpm lpc_ich ghash_clmulni_intel sb_edac mfd_core edac_core
> i2c_i801 microcode pcspkr tpm_bios hid_generic isci libsas scsi_transport_sas
> mgag200 i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
> [ 108.540561] Pid: 1328, comm: leaptest Not tainted 3.5.0-rc4+ #4
> [ 108.547169] Hypervisor: no hypervisor
> [ 108.551273] Call Trace:
> [ 108.554019] <IRQ> [<ffffffff8105814f>] warn_slowpath_common+0x7f/0xc0
> [ 108.561398] [<ffffffff810581aa>] warn_slowpath_null+0x1a/0x20
> [ 108.567911] [<ffffffff810b39bd>] smp_call_function_many+0xbd/0x260
> [ 108.574931] [<ffffffff8107e960>] ? hrtimer_wakeup+0x30/0x30
> [ 108.581242] [<ffffffff8107e960>] ? hrtimer_wakeup+0x30/0x30
> [ 108.587560] [<ffffffff810b3cb2>] smp_call_function+0x22/0x30
> [ 108.593982] [<ffffffff810b3d18>] on_each_cpu+0x28/0x70
> [ 108.599825] [<ffffffff8107ef7c>] clock_was_set+0x1c/0x30

John, the issue is that clock_was_set calls on_each_cpu() -- which cannot be
called from interrupt context as it calls smp_call_function_many().

I don't think you can call call_was_set() from update_wall_time() as
update_wall_time() is called in interrupt context.

Looking into it more ...

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On 07/01/2012 09:56 AM, Prarit Bhargava wrote:
>
> John, the issue is that clock_was_set calls on_each_cpu() -- which cannot be
> called from interrupt context as it calls smp_call_function_many().
>
> I don't think you can call call_was_set() from update_wall_time() as
> update_wall_time() is called in interrupt context.
>

Thanks for pointing this out. Ben Blum also mentioned this last night,
but I missed the atomic bit instead of the xtime_lock issue.

Reworking the patch now.

thanks again
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On Sun, Jul 01, 2012 at 10:28:25AM -0700, John Stultz wrote:
>
> Reworking the patch now.

John,

I know you didn't like my (originally Michael Hack's) idea of keeping
time in TAI, but wouldn't changing to an internal, continuous time
scale (not necessary TAI) solve these sorts of timer issues?

There have been a number of clock/timer/leap bugs over the last
years. Some of these might have been avoided by using a continuous
scale, since no special timer actions would be needed during a leap
second.

The run time cost is low, just one additional test and addition when
reading the time. It might be worth it for the peace of mind when
the next leap second rolls around.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On 07/02/2012 03:16 AM, Richard Cochran wrote:
> On Sun, Jul 01, 2012 at 10:28:25AM -0700, John Stultz wrote:
>> Reworking the patch now.
> John,
>
> I know you didn't like my (originally Michael Hack's) idea of keeping
> time in TAI, but wouldn't changing to an internal, continuous time
> scale (not necessary TAI) solve these sorts of timer issues?
So first, I don't think keeping a different time base would have avoided
this particular issue.
Its really an issue where the hrtimer code has in-effect a cache of
timekeeping state that, since clock_was_set() wasn't called, didn't get
updated when we applied the leapsecond.

Second, I'm not opposed to reworking how the internal system keeps track
of time. I just wasn't fond of specifics in your implementation (mostly
around mixing cleanups with behavioural changes).

I wouldn't be opposed to something like:
CLOCK_TAI = CLOCK_MONOTONIC + monotonic_to_tai
CLOCK_REALTIME = CLOCK_TAI + tai_to_utc

Also, some of your suggested changes to move some of the NTP state into
the timekeeper struct made sense as well, but just needed some slight
tweaks.

> There have been a number of clock/timer/leap bugs over the last
> years. Some of these might have been avoided by using a continuous
> scale, since no special timer actions would be needed during a leap
> second.
Unfortunately the other issues have been locking related, so I don't
think changing the internal time scale would have helped.

Regardless, I do hope you rework and resend your proposed changes.
Clearly we could use more eyes in this area.

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
Hi Richard,

On Mon, Jul 02, 2012 at 12:16:08PM +0200, Richard Cochran wrote:
> I know you didn't like my (originally Michael Hack's) idea of keeping
> time in TAI, but wouldn't changing to an internal, continuous time
> scale (not necessary TAI) solve these sorts of timer issues?

Doesn't that actually make the problem of leap seconds worse, as you'd have to
start tabulating past leap seconds in the kernel?

Even worse, *future* leap seconds would need to be tracked and after they've
happened stored on disk, and loaded back into the kernel after booting, which
seems like a mess. The trouble here is that leap seconds are only announced a
short while before they happen, so there's no way to bake leap seconds into
the software; they need to be dynamically added by ntpd.

Or is there somehow some way to avoid that?

> There have been a number of clock/timer/leap bugs over the last
> years. Some of these might have been avoided by using a continuous
> scale, since no special timer actions would be needed during a leap
> second.
>
> The run time cost is low, just one additional test and addition when
> reading the time. It might be worth it for the peace of mind when
> the next leap second rolls around.

I don't know if reworking the system that's been in place for ages is a good
way to give us 'peace of mind'. Then again, I love to be enlightened :-)

Sytse Wielinga
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On Mon, Jul 02, 2012 at 10:08:21PM +0200, Sytse Wielinga wrote:
> Hi Richard,
>
> On Mon, Jul 02, 2012 at 12:16:08PM +0200, Richard Cochran wrote:
> > I know you didn't like my (originally Michael Hack's) idea of keeping
> > time in TAI, but wouldn't changing to an internal, continuous time
> > scale (not necessary TAI) solve these sorts of timer issues?
>
> Doesn't that actually make the problem of leap seconds worse, as you'd have to
> start tabulating past leap seconds in the kernel?

No, I am not suggesting to do that.

> Even worse, *future* leap seconds would need to be tracked and after they've
> happened stored on disk, and loaded back into the kernel after booting, which
> seems like a mess. The trouble here is that leap seconds are only announced a
> short while before they happen, so there's no way to bake leap seconds into
> the software; they need to be dynamically added by ntpd.
>
> Or is there somehow some way to avoid that?

I think the established practice of announcing the event by network is
the only sane way of handling this issue. The list of TAI-UTC offsets
belongs to what David Mills has called our "institutional memory", and
this is a user space issue. The kernel's job is to just live in the
moment and provide the right time for *now*.

> > There have been a number of clock/timer/leap bugs over the last
> > years. Some of these might have been avoided by using a continuous
> > scale, since no special timer actions would be needed during a leap
> > second.
> >
> > The run time cost is low, just one additional test and addition when
> > reading the time. It might be worth it for the peace of mind when
> > the next leap second rolls around.
>
> I don't know if reworking the system that's been in place for ages is a good
> way to give us 'peace of mind'. Then again, I love to be enlightened :-)

There have been lockups and other kernel issues due to leap second
bugs. That is a fact. Does that give you peace of mind?

My own computers were off for the last leap second. But some people
cannot afford to do this. I suggest that changing the code so that no
special actions occur at a leap second would be more reliable than
having rarely tested code paths just for leap second handling.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On Tue, Jul 03, 2012 at 11:23:25AM +0200, Richard Cochran wrote:
> I think the established practice of announcing the event by network is
> the only sane way of handling this issue. The list of TAI-UTC offsets
> belongs to what David Mills has called our "institutional memory", and
> this is a user space issue. The kernel's job is to just live in the
> moment and provide the right time for *now*.

I do suppose hardware clock and file system times will have to be UTC (or
UTC-based local time) though? Or do you think the 35 seconds difference
simply will be so small of a problem that it's not worth fussing over?

Doing this translation in libc and keeping fs times in TAI+tz offset would
seem to necessitate at least modifications to every single program accessing
file system data directly, and would still cause minor problems with
multiboot; doing it in the kernel would mean adding a step to the boot
sequence (before mounting root r/w) for loading the current TAI-UTC difference
into the kernel; also, it'd mean splitting 'kernel time' into multiple times.
Which solution did you have in mind?

> > > There have been a number of clock/timer/leap bugs over the last
> > > years. Some of these might have been avoided by using a continuous
> > > scale, since no special timer actions would be needed during a leap
> > > second.
> > >
> > > The run time cost is low, just one additional test and addition when
> > > reading the time. It might be worth it for the peace of mind when
> > > the next leap second rolls around.
> >
> > I don't know if reworking the system that's been in place for ages is a good
> > way to give us 'peace of mind'. Then again, I love to be enlightened :-)
>
> There have been lockups and other kernel issues due to leap second
> bugs. That is a fact. Does that give you peace of mind?
>
> My own computers were off for the last leap second. But some people
> cannot afford to do this. I suggest that changing the code so that no
> special actions occur at a leap second would be more reliable than
> having rarely tested code paths just for leap second handling.

I suppose you're right; the new code might be buggy, but at least it'd get
year-round testing instead of just once every few years or so.

Then again, you and John have come up with a good regression test.

Sytse
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] Potential fix for leapsecond caused futex related load spikes [ In reply to ]
On Tue, Jul 03, 2012 at 02:05:36PM +0200, Sytse Wielinga wrote:
> On Tue, Jul 03, 2012 at 11:23:25AM +0200, Richard Cochran wrote:
> > I think the established practice of announcing the event by network is
> > the only sane way of handling this issue. The list of TAI-UTC offsets
> > belongs to what David Mills has called our "institutional memory", and
> > this is a user space issue. The kernel's job is to just live in the
> > moment and provide the right time for *now*.
>
> I do suppose hardware clock and file system times will have to be UTC (or
> UTC-based local time) though?

Yes, you are right. Those things can never change. What I have in mind
is purely internal to the kernel and won't be visible in any way.

Thanks,
Richard


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/