Mailing List Archive

Kernel 4.1 hang, apparently in __inet_lookup_established
Dear kernel developers,

I recently started to upgrade my production hosts and VMs from the 3.14 series
to 4.1 kernels, starting with 4.1.6. Yesterday, for the second time after I
started these upgrades, I experienced one of our webserver VMs hanging.

The first time this happened, the VM hung completely, all 5 virtual cores
spinning at 100%, ping still worked, but nothing else, including no virsh
console reaction - I had to destroy and restart that VM. No messages were to
be found.

Yesterday, when it happened the second time, I found the VM spinning on a
single core only, and could still connect to it via ssh - but it stopped
accepting apache connections. The core it spun on showed 100% time used in
"si", with top, and it produced the messages appended below. The VM did not
shutdown properly when told to, and had to be destroyed again.

If I read that dmesg output correctly it spins in __inet_lookup_established,
which indeed reads like it has infinite spin potential. But that code itself
did not change relative to the 3.14 series we've been running for a long time
without the issues - so the root cause would be something else.

For our production systems I'll revert to the 3.14 series, but maybe this
report may help somebody understand what's going on.

best regards
Patrick

dmesg of the hang:

[449302.540017] INFO: rcu_sched self-detected stall on CPU { 4} (t=6000
jiffies g=22900108 c=22900107 q=22617)
[449302.540017] Task dump for CPU 4:
[449302.540017] swapper/4 R running task 0 0 1
0x00000008
[449302.540017] ffffffff81831140 ffff88081f403950 ffffffff810ead0e
0000000000000004
[449302.540017] ffffffff81831140 ffff88081f403970 ffffffff810ed288
0000000000000083
[449302.540017] 0000000000000005 ffff88081f4039a0 ffffffff81105cc0
ffff88081f414d00
[449302.540017] Call Trace:
[449302.540017] <IRQ> [<ffffffff810ead0e>] sched_show_task+0xae/0x120
[449302.540017] [<ffffffff810ed288>] dump_cpu_task+0x38/0x40
[449302.540017] [<ffffffff81105cc0>] rcu_dump_cpu_stacks+0x90/0xd0
[449302.540017] [<ffffffff81108cfb>] rcu_check_callbacks+0x3eb/0x6e0
[449302.540017] [<ffffffff810ee22c>] ? account_process_tick+0x5c/0x180
[449302.540017] [<ffffffff81119170>] ? tick_sched_handle.isra.18+0x40/0x40
[449302.540017] [<ffffffff8110ae04>] update_process_times+0x34/0x60
[449302.540017] [<ffffffff81119161>] tick_sched_handle.isra.18+0x31/0x40
[449302.540017] [<ffffffff811191ac>] tick_sched_timer+0x3c/0x70
[449302.540017] [<ffffffff8110b27a>] __run_hrtimer.isra.34+0x4a/0xf0
[449302.540017] [<ffffffff8110bbfd>] hrtimer_interrupt+0xcd/0x1f0
[449302.540017] [<ffffffff810296b4>] local_apic_timer_interrupt+0x34/0x60
[449302.540017] [<ffffffff810299ec>] smp_apic_timer_interrupt+0x3c/0x60
[449302.540017] [<ffffffff8149e98b>] apic_timer_interrupt+0x6b/0x70
[449302.540017] [<ffffffff8141eee8>] ? __inet_lookup_established+0x68/0x130
[449302.540017] [<ffffffff8141eec1>] ? __inet_lookup_established+0x41/0x130
[449302.540017] [<ffffffff81438466>] tcp_v4_early_demux+0x96/0x150
[449302.540017] [<ffffffff814150c8>] ip_rcv_finish+0xb8/0x360
[449302.540017] [<ffffffff814157e4>] ip_rcv+0x294/0x3f0
[449302.540017] [<ffffffff81415010>] ? ip_local_deliver_finish+0x140/0x140
[449302.540017] [<ffffffff81398edb>] __netif_receive_skb_core+0x52b/0x760
[449302.540017] [<ffffffff81399123>] __netif_receive_skb+0x13/0x60
[449302.540017] [<ffffffff8139918e>] netif_receive_skb_internal+0x1e/0x90
[449302.540017] [<ffffffff8139c8ec>] netif_receive_skb_sk+0xc/0x10
[449302.540017] [<ffffffff81370e51>] virtnet_receive+0x221/0x7a0
[449302.540017] [<ffffffff813714cc>] virtnet_poll+0x1c/0x80
[449302.540017] [<ffffffff8139cfca>] net_rx_action+0xea/0x2b0
[449302.540017] [<ffffffff810ccd8a>] __do_softirq+0xda/0x1f0
[449302.540017] [<ffffffff810cd0dd>] irq_exit+0x9d/0xb0
[449302.540017] [<ffffffff81003f95>] do_IRQ+0x55/0xf0
[449302.540017] [<ffffffff8149e6eb>] common_interrupt+0x6b/0x6b
[449302.540017] <EOI> [<ffffffff810edb88>] ? sched_clock_cpu+0x98/0xc0
[449302.540017] [<ffffffff81030b06>] ? native_safe_halt+0x6/0x10
[449302.540017] [<ffffffff8100a9a9>] default_idle+0x9/0x10
[449302.540017] [<ffffffff8100b3da>] arch_cpu_idle+0xa/0x10
[449302.540017] [<ffffffff810f8d08>] cpu_startup_entry+0x258/0x310
[449302.540017] [<ffffffff81028063>] start_secondary+0x123/0x130
[449482.570137] INFO: rcu_sched self-detected stall on CPU { 4} (t=24004
jiffies g=22900108 c=22900107 q=97787)
[449482.570148] Task dump for CPU 4:
[449482.570151] swapper/4 R running task 0 0 1
0x00000008
[449482.570156] ffffffff81831140 ffff88081f403950 ffffffff810ead0e
0000000000000004
[449482.570165] ffffffff81831140 ffff88081f403970 ffffffff810ed288
0000000000000083
[449482.570167] 0000000000000005 ffff88081f4039a0 ffffffff81105cc0
ffff88081f414d00
[449482.570169] Call Trace:
[449482.570171] <IRQ> [<ffffffff810ead0e>] sched_show_task+0xae/0x120
[449482.570183] [<ffffffff810ed288>] dump_cpu_task+0x38/0x40
[449482.570188] [<ffffffff81105cc0>] rcu_dump_cpu_stacks+0x90/0xd0
[449482.570191] [<ffffffff81108cfb>] rcu_check_callbacks+0x3eb/0x6e0
[449482.570194] [<ffffffff810ee22c>] ? account_process_tick+0x5c/0x180
[449482.570199] [<ffffffff81119170>] ? tick_sched_handle.isra.18+0x40/0x40
[449482.570202] [<ffffffff8110ae04>] update_process_times+0x34/0x60
[449482.570203] [<ffffffff81119161>] tick_sched_handle.isra.18+0x31/0x40
[449482.570205] [<ffffffff811191ac>] tick_sched_timer+0x3c/0x70
[449482.570207] [<ffffffff8110b27a>] __run_hrtimer.isra.34+0x4a/0xf0
[449482.570209] [<ffffffff8110bbfd>] hrtimer_interrupt+0xcd/0x1f0
[449482.570220] [<ffffffff810296b4>] local_apic_timer_interrupt+0x34/0x60
[449482.570222] [<ffffffff810299ec>] smp_apic_timer_interrupt+0x3c/0x60
[449482.570226] [<ffffffff8149e98b>] apic_timer_interrupt+0x6b/0x70
[449482.570230] [<ffffffff8141eee0>] ? __inet_lookup_established+0x60/0x130
[449482.570232] [<ffffffff8141eec1>] ? __inet_lookup_established+0x41/0x130
[449482.570235] [<ffffffff81438466>] tcp_v4_early_demux+0x96/0x150
[449482.570238] [<ffffffff814150c8>] ip_rcv_finish+0xb8/0x360
[449482.570239] [<ffffffff814157e4>] ip_rcv+0x294/0x3f0
[449482.570241] [<ffffffff81415010>] ? ip_local_deliver_finish+0x140/0x140
[449482.570246] [<ffffffff81398edb>] __netif_receive_skb_core+0x52b/0x760
[449482.570247] [<ffffffff81399123>] __netif_receive_skb+0x13/0x60
[449482.570249] [<ffffffff8139918e>] netif_receive_skb_internal+0x1e/0x90
[449482.570252] [<ffffffff8139c8ec>] netif_receive_skb_sk+0xc/0x10
[449482.570255] [<ffffffff81370e51>] virtnet_receive+0x221/0x7a0
[449482.570257] [<ffffffff813714cc>] virtnet_poll+0x1c/0x80
[449482.570259] [<ffffffff8139cfca>] net_rx_action+0xea/0x2b0
[449482.570260] [<ffffffff810ccd8a>] __do_softirq+0xda/0x1f0
[449482.570260] [<ffffffff810cd0dd>] irq_exit+0x9d/0xb0
[449482.570260] [<ffffffff81003f95>] do_IRQ+0x55/0xf0
[449482.570260] [<ffffffff8149e6eb>] common_interrupt+0x6b/0x6b
[449482.570260] <EOI> [<ffffffff810edb88>] ? sched_clock_cpu+0x98/0xc0
[449482.570260] [<ffffffff81030b06>] ? native_safe_halt+0x6/0x10
[449482.570260] [<ffffffff8100a9a9>] default_idle+0x9/0x10
[449482.570260] [<ffffffff8100b3da>] arch_cpu_idle+0xa/0x10
[449482.570260] [<ffffffff810f8d08>] cpu_startup_entry+0x258/0x310
[449482.570260] [<ffffffff81028063>] start_secondary+0x123/0x130
[449662.610077] INFO: rcu_sched self-detected stall on CPU { 4} (t=42008
jiffies g=22900108 c=22900107 q=140789)
[449662.610077] Task dump for CPU 4:
[449662.610077] swapper/4 R running task 0 0 1
0x00000008
[449662.610077] ffffffff81831140 ffff88081f403950 ffffffff810ead0e
0000000000000004
[449662.610077] ffffffff81831140 ffff88081f403970 ffffffff810ed288
0000000000000083
[449662.610077] 0000000000000005 ffff88081f4039a0 ffffffff81105cc0
ffff88081f414d00
[449662.610077] Call Trace:
[449662.610077] <IRQ> [<ffffffff810ead0e>] sched_show_task+0xae/0x120
[449662.610077] [<ffffffff810ed288>] dump_cpu_task+0x38/0x40
[449662.610077] [<ffffffff81105cc0>] rcu_dump_cpu_stacks+0x90/0xd0
[449662.610077] [<ffffffff81108cfb>] rcu_check_callbacks+0x3eb/0x6e0
[449662.610077] [<ffffffff810ee22c>] ? account_process_tick+0x5c/0x180
[449662.610077] [<ffffffff81119170>] ? tick_sched_handle.isra.18+0x40/0x40
[449662.610077] [<ffffffff8110ae04>] update_process_times+0x34/0x60
[449662.610077] [<ffffffff81119161>] tick_sched_handle.isra.18+0x31/0x40
[449662.610077] [<ffffffff811191ac>] tick_sched_timer+0x3c/0x70
[449662.610077] [<ffffffff8110b27a>] __run_hrtimer.isra.34+0x4a/0xf0
[449662.610077] [<ffffffff8110bbfd>] hrtimer_interrupt+0xcd/0x1f0
[449662.610077] [<ffffffff810296b4>] local_apic_timer_interrupt+0x34/0x60
[449662.610077] [<ffffffff810299ec>] smp_apic_timer_interrupt+0x3c/0x60
[449662.610077] [<ffffffff8149e98b>] apic_timer_interrupt+0x6b/0x70
[449662.610077] [<ffffffff8141eee8>] ? __inet_lookup_established+0x68/0x130
[449662.610077] [<ffffffff8141eec1>] ? __inet_lookup_established+0x41/0x130
[449662.610077] [<ffffffff81438466>] tcp_v4_early_demux+0x96/0x150
[449662.610077] [<ffffffff814150c8>] ip_rcv_finish+0xb8/0x360
[449662.610077] [<ffffffff814157e4>] ip_rcv+0x294/0x3f0
[449662.610077] [<ffffffff81415010>] ? ip_local_deliver_finish+0x140/0x140
[449662.610077] [<ffffffff81398edb>] __netif_receive_skb_core+0x52b/0x760
[449662.610077] [<ffffffff81399123>] __netif_receive_skb+0x13/0x60
[449662.610077] [<ffffffff8139918e>] netif_receive_skb_internal+0x1e/0x90
[449662.610077] [<ffffffff8139c8ec>] netif_receive_skb_sk+0xc/0x10
[449662.610077] [<ffffffff81370e51>] virtnet_receive+0x221/0x7a0
[449662.610077] [<ffffffff813714cc>] virtnet_poll+0x1c/0x80
[449662.610077] [<ffffffff8139cfca>] net_rx_action+0xea/0x2b0
[449662.610077] [<ffffffff810ccd8a>] __do_softirq+0xda/0x1f0
[449662.610077] [<ffffffff810cd0dd>] irq_exit+0x9d/0xb0
[449662.610077] [<ffffffff81003f95>] do_IRQ+0x55/0xf0
[449662.610077] [<ffffffff8149e6eb>] common_interrupt+0x6b/0x6b
[449662.610077] <EOI> [<ffffffff810edb88>] ? sched_clock_cpu+0x98/0xc0
[449662.610077] [<ffffffff81030b06>] ? native_safe_halt+0x6/0x10
[449662.610077] [<ffffffff8100a9a9>] default_idle+0x9/0x10
[449662.610077] [<ffffffff8100b3da>] arch_cpu_idle+0xa/0x10
[449662.610077] [<ffffffff810f8d08>] cpu_startup_entry+0x258/0x310
[449662.610077] [<ffffffff81028063>] start_secondary+0x123/0x130
[449842.650014] INFO: rcu_sched self-detected stall on CPU { 4} (t=60011
jiffies g=22900108 c=22900107 q=206408)
[449842.650014] Task dump for CPU 4:
[449842.650014] swapper/4 R running task 0 0 1
0x00000008
[449842.650014] ffffffff81831140 ffff88081f403950 ffffffff810ead0e
0000000000000004
[449842.650014] ffffffff81831140 ffff88081f403970 ffffffff810ed288
0000000000000083
[449842.650014] 0000000000000005 ffff88081f4039a0 ffffffff81105cc0
ffff88081f414d00
[449842.650014] Call Trace:
[449842.650014] <IRQ> [<ffffffff810ead0e>] sched_show_task+0xae/0x120
[449842.650014] [<ffffffff810ed288>] dump_cpu_task+0x38/0x40
[449842.650014] [<ffffffff81105cc0>] rcu_dump_cpu_stacks+0x90/0xd0
[449842.650014] [<ffffffff81108cfb>] rcu_check_callbacks+0x3eb/0x6e0
[449842.650014] [<ffffffff810ee22c>] ? account_process_tick+0x5c/0x180
[449842.650014] [<ffffffff81119170>] ? tick_sched_handle.isra.18+0x40/0x40
[449842.650014] [<ffffffff8110ae04>] update_process_times+0x34/0x60
[449842.650014] [<ffffffff81119161>] tick_sched_handle.isra.18+0x31/0x40
[449842.650014] [<ffffffff811191ac>] tick_sched_timer+0x3c/0x70
[449842.650014] [<ffffffff8110b27a>] __run_hrtimer.isra.34+0x4a/0xf0
[449842.650014] [<ffffffff8110bbfd>] hrtimer_interrupt+0xcd/0x1f0
[449842.650014] [<ffffffff810296b4>] local_apic_timer_interrupt+0x34/0x60
[449842.650014] [<ffffffff810299ec>] smp_apic_timer_interrupt+0x3c/0x60
[449842.650014] [<ffffffff8149e98b>] apic_timer_interrupt+0x6b/0x70
[449842.650014] [<ffffffff8141ef58>] ? __inet_lookup_established+0xd8/0x130
[449842.650014] [<ffffffff8141eec1>] ? __inet_lookup_established+0x41/0x130
[449842.650014] [<ffffffff81438466>] tcp_v4_early_demux+0x96/0x150
[449842.650014] [<ffffffff814150c8>] ip_rcv_finish+0xb8/0x360
[449842.650014] [<ffffffff814157e4>] ip_rcv+0x294/0x3f0
[449842.650014] [<ffffffff81415010>] ? ip_local_deliver_finish+0x140/0x140
[449842.650014] [<ffffffff81398edb>] __netif_receive_skb_core+0x52b/0x760
[449842.650014] [<ffffffff81399123>] __netif_receive_skb+0x13/0x60
[449842.650014] [<ffffffff8139918e>] netif_receive_skb_internal+0x1e/0x90
[449842.650014] [<ffffffff8139c8ec>] netif_receive_skb_sk+0xc/0x10
[449842.650014] [<ffffffff81370e51>] virtnet_receive+0x221/0x7a0
[449842.650014] [<ffffffff813714cc>] virtnet_poll+0x1c/0x80
[449842.650014] [<ffffffff8139cfca>] net_rx_action+0xea/0x2b0
[449842.650014] [<ffffffff810ccd8a>] __do_softirq+0xda/0x1f0
[449842.650014] [<ffffffff810cd0dd>] irq_exit+0x9d/0xb0
[449842.650014] [<ffffffff81003f95>] do_IRQ+0x55/0xf0
[449842.650014] [<ffffffff8149e6eb>] common_interrupt+0x6b/0x6b
[449842.650014] <EOI> [<ffffffff810edb88>] ? sched_clock_cpu+0x98/0xc0
[449842.650014] [<ffffffff81030b06>] ? native_safe_halt+0x6/0x10
[449842.650014] [<ffffffff8100a9a9>] default_idle+0x9/0x10
[449842.650014] [<ffffffff8100b3da>] arch_cpu_idle+0xa/0x10
[449842.650014] [<ffffffff810f8d08>] cpu_startup_entry+0x258/0x310
[449842.650014] [<ffffffff81028063>] start_secondary+0x123/0x130

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Wed, 2015-09-23 at 10:25 +0200, Patrick Schaaf wrote:
> Dear kernel developers,
>
> I recently started to upgrade my production hosts and VMs from the 3.14 series
> to 4.1 kernels, starting with 4.1.6. Yesterday, for the second time after I
> started these upgrades, I experienced one of our webserver VMs hanging.
>
> The first time this happened, the VM hung completely, all 5 virtual cores
> spinning at 100%, ping still worked, but nothing else, including no virsh
> console reaction - I had to destroy and restart that VM. No messages were to
> be found.
>
> Yesterday, when it happened the second time, I found the VM spinning on a
> single core only, and could still connect to it via ssh - but it stopped
> accepting apache connections. The core it spun on showed 100% time used in
> "si", with top, and it produced the messages appended below. The VM did not
> shutdown properly when told to, and had to be destroyed again.
>
> If I read that dmesg output correctly it spins in __inet_lookup_established,
> which indeed reads like it has infinite spin potential. But that code itself
> did not change relative to the 3.14 series we've been running for a long time
> without the issues - so the root cause would be something else.
>
> For our production systems I'll revert to the 3.14 series, but maybe this
> report may help somebody understand what's going on.
>
> best regards
> Patrick


You could try following commits :

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=ed2e923945892a8372ab70d2f61d364b0b6d9054

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=29c6852602e259d2c1882f320b29d5c3fec0de04



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
Hi Patrick,

Have you tried the two patches Eric mentioned? One of my 4.1.11 server
just hanged with very similar stack trace and I am wondering whether the
aforementioned patches would help.

Thanks,

Grant

On 23/09/2015 09:31, Eric Dumazet wrote:
> On Wed, 2015-09-23 at 10:25 +0200, Patrick Schaaf wrote:
>> Dear kernel developers,
>>
>> I recently started to upgrade my production hosts and VMs from the 3.14 series
>> to 4.1 kernels, starting with 4.1.6. Yesterday, for the second time after I
>> started these upgrades, I experienced one of our webserver VMs hanging.
>>
>> The first time this happened, the VM hung completely, all 5 virtual cores
>> spinning at 100%, ping still worked, but nothing else, including no virsh
>> console reaction - I had to destroy and restart that VM. No messages were to
>> be found.
>>
>> Yesterday, when it happened the second time, I found the VM spinning on a
>> single core only, and could still connect to it via ssh - but it stopped
>> accepting apache connections. The core it spun on showed 100% time used in
>> "si", with top, and it produced the messages appended below. The VM did not
>> shutdown properly when told to, and had to be destroyed again.
>>
>> If I read that dmesg output correctly it spins in __inet_lookup_established,
>> which indeed reads like it has infinite spin potential. But that code itself
>> did not change relative to the 3.14 series we've been running for a long time
>> without the issues - so the root cause would be something else.
>>
>> For our production systems I'll revert to the 3.14 series, but maybe this
>> report may help somebody understand what's going on.
>>
>> best regards
>> Patrick
>
>
> You could try following commits :
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=ed2e923945892a8372ab70d2f61d364b0b6d9054
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=29c6852602e259d2c1882f320b29d5c3fec0de04
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Sunday 15 November 2015 16:58:33 Grant Zhang wrote:
>
> Have you tried the two patches Eric mentioned? One of my 4.1.11 server
> just hanged with very similar stack trace and I am wondering whether the
> aforementioned patches would help.

Sorry, Grant - I'm sticking to 3.14.xx for now.

best regards
Patrick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Sun, 2015-11-15 at 16:58 -0800, Grant Zhang wrote:
> Hi Patrick,
>
> Have you tried the two patches Eric mentioned? One of my 4.1.11 server
> just hanged with very similar stack trace and I am wondering whether the
> aforementioned patches would help.
>
> Thanks,

linux-4.1.12 definitely contains the fixes.

8ae3dfacdd82 inet: fix race in reqsk_queue_unlink()
31b8abd140ad inet: fix races in reqsk_queue_hash_req()

Please upgrade to 4.1.13 and you should be fine.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On 16/11/2015 07:07, Eric Dumazet wrote:
> On Sun, 2015-11-15 at 16:58 -0800, Grant Zhang wrote:
>> Hi Patrick,
>>
>> Have you tried the two patches Eric mentioned? One of my 4.1.11 server
>> just hanged with very similar stack trace and I am wondering whether the
>> aforementioned patches would help.
>>
>> Thanks,
>
> linux-4.1.12 definitely contains the fixes.
>
> 8ae3dfacdd82 inet: fix race in reqsk_queue_unlink()
> 31b8abd140ad inet: fix races in reqsk_queue_hash_req()
>
> Please upgrade to 4.1.13 and you should be fine.
>
> Thanks.
>
>

Thank you Eric and Patrick. I will upgrade to 4.1.13.

Grant
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On 16/11/2015 09:30, Grant Zhang wrote:
> On 16/11/2015 07:07, Eric Dumazet wrote:
>> On Sun, 2015-11-15 at 16:58 -0800, Grant Zhang wrote:
>>> Hi Patrick,
>>>
>>> Have you tried the two patches Eric mentioned? One of my 4.1.11 server
>>> just hanged with very similar stack trace and I am wondering whether the
>>> aforementioned patches would help.
>>>
>>> Thanks,
>>
>> linux-4.1.12 definitely contains the fixes.
>>
>> 8ae3dfacdd82 inet: fix race in reqsk_queue_unlink()
>> 31b8abd140ad inet: fix races in reqsk_queue_hash_req()
>>
>> Please upgrade to 4.1.13 and you should be fine.
>>
>> Thanks.
>>
>>
>
> Thank you Eric and Patrick. I will upgrade to 4.1.13.
>
> Grant

Hi Eric,

One of my 4.1.13 server(have been up 50+ days) under testing got into a
similar kernel hang (stack trace attached). Looking back at the initial
conversation on this issue you also mentioned the following patch in
https://lkml.org/lkml/2015/9/23/433

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=ed2e923945892a8372ab70d2f61d364b0b6d9054
tcp/dccp: fix timewait races in timer handling

Which does not seem to be part of stable 4.1 tree. Would the above patch
fix the kernel hang issue?

Thanks,

Grant

----stacktrace----
Jan 9 19:12:42 kernel:[4544972.126385] INFO: rcu_sched self-detected
stall on CPU { 13} (t=15001 jiffies g=422586407 c=422586406 q=3730083)
Jan 9 19:12:42 kernel:[4544972.134383] INFO: rcu_sched detected stalls
on CPUs/tasks: { 13} (detected by 5, t=15002 jiffies, g=422586407,
c=422586406, q=3730200)
Jan 9 19:12:42 kernel:[4544972.134384] Task dump for CPU 13:
Jan 9 19:12:42 kernel:[4544972.134387] swapper/13 R running task
0 0 1 0x00000008
Jan 9 19:12:42 kernel:[4544972.134389] 0000000000000010
0000000000000246 ffff885ecde0be68 0000000000000018
Jan 9 19:12:42 kernel:[4544972.134390] ffffffff8164045d
ffffffff00000007 00102982b9fa1875 ffffffff81c7fc80
Jan 9 19:12:42 kernel:[4544972.134391] 0000000d00000000
ffff88beff0e0300 ffffffff81ce5448 ffff885ecde08000
Jan 9 19:12:42 kernel:[4544972.134391] Call Trace:
Jan 9 19:12:42 kernel:[4544972.134397] [<ffffffff8164045d>] ?
cpuidle_enter_state+0x7d/0x1f0
Jan 9 19:12:42 kernel:[4544972.134398] [<ffffffff81640607>] ?
cpuidle_enter+0x17/0x20
Jan 9 19:12:42 kernel:[4544972.134401] [<ffffffff810b6c41>] ?
cpu_startup_entry+0x2d1/0x350
Jan 9 19:12:42 kernel:[4544972.134403] [<ffffffff810e0aac>] ?
clockevents_config_and_register+0x2c/0x40
Jan 9 19:12:42 kernel:[4544972.134406] [<ffffffff81034aa3>] ?
start_secondary+0x123/0x130
Jan 9 19:12:42 kernel:[4544972.149242] Task dump for CPU 13:
Jan 9 19:12:42 kernel:[4544972.149244] swapper/13 R running task
0 0 1 0x00000008
Jan 9 19:12:42 kernel:[4544972.149248] ffffffff81c3f300
ffff88beff0c3820 ffffffff810a5791 000000000000000d
Jan 9 19:12:42 kernel:[4544972.149250] ffffffff81c3f300
ffff88beff0c3840 ffffffff810a8c4f ffff88beff0c3880
Jan 9 19:12:42 kernel:[4544972.149251] ffffffff81c3f3c0
ffff88beff0c3870 ffffffff810ca763 ffff88beff0d6c80
Jan 9 19:12:42 kernel:[4544972.149253] Call Trace:
Jan 9 19:12:42 kernel:[4544972.149255] <IRQ> [<ffffffff810a5791>]
sched_show_task+0xb1/0x120
Jan 9 19:12:42 kernel:[4544972.149267] [<ffffffff810a8c4f>]
dump_cpu_task+0x3f/0x50
Jan 9 19:12:42 kernel:[4544972.149270] [<ffffffff810ca763>]
rcu_dump_cpu_stacks+0x93/0xc0
Jan 9 19:12:42 kernel:[4544972.149272] [<ffffffff810cdcaa>]
rcu_check_callbacks+0x4aa/0x760
Jan 9 19:12:42 kernel:[4544972.149277] [<ffffffff8110e27c>] ?
acct_account_cputime+0x1c/0x20
Jan 9 19:12:42 kernel:[4544972.149279] [<ffffffff810d3868>]
update_process_times+0x38/0x70
Jan 9 19:12:42 kernel:[4544972.149283] [<ffffffff810e2a38>]
tick_sched_timer+0x58/0x190
Jan 9 19:12:42 kernel:[4544972.149284] [<ffffffff810d41ad>]
__run_hrtimer+0x6d/0x220
Jan 9 19:12:42 kernel:[4544972.149285] [<ffffffff810e29e0>] ?
tick_init_highres+0x20/0x20
Jan 9 19:12:42 kernel:[4544972.149287] [<ffffffff810d4893>]
hrtimer_interrupt+0x103/0x240
Jan 9 19:12:42 kernel:[4544972.149292] [<ffffffff810363f9>]
local_apic_timer_interrupt+0x39/0x60
Jan 9 19:12:42 kernel:[4544972.149296] [<ffffffff817b5275>]
smp_apic_timer_interrupt+0x45/0x60
Jan 9 19:12:42 kernel:[4544972.149299] [<ffffffff817b39bb>]
apic_timer_interrupt+0x6b/0x70
Jan 9 19:12:42 kernel:[4544972.149304] [<ffffffff816f36f0>] ?
__inet_lookup_established+0x70/0x170
Jan 9 19:12:42 kernel:[4544972.149306] [<ffffffff816f36c6>] ?
__inet_lookup_established+0x46/0x170
Jan 9 19:12:42 kernel:[4544972.149309] [<ffffffff8170eb7d>]
tcp_v4_early_demux+0xad/0x160
Jan 9 19:12:42 kernel:[4544972.149311] [<ffffffff816e9428>]
ip_rcv_finish+0x158/0x380
Jan 9 19:12:42 kernel:[4544972.149312] [<ffffffff816e9cf2>]
ip_rcv+0x292/0x3b0
Jan 9 19:12:42 kernel:[4544972.149318] [<ffffffffa04f0102>] ?
macvlan_handle_frame+0x1f2/0x310 [macvlan]
Jan 9 19:12:42 kernel:[4544972.149320] [<ffffffff816e92d0>] ?
inet_add_protocol+0x50/0x50
Jan 9 19:12:42 kernel:[4544972.149324] [<ffffffff81693bf0>]
__netif_receive_skb_core+0x300/0x7b0
Jan 9 19:12:42 kernel:[4544972.149325] [<ffffffff817b5185>] ?
do_IRQ+0x65/0x110
Jan 9 19:12:42 kernel:[4544972.149327] [<ffffffff816940c1>]
__netif_receive_skb+0x21/0x70
Jan 9 19:12:42 kernel:[4544972.149329] [<ffffffff81694271>]
netif_receive_skb_internal+0x31/0xa0
Jan 9 19:12:42 kernel:[4544972.149331] [<ffffffff81694ff0>]
napi_gro_receive+0x130/0x1b0
Jan 9 19:12:42 kernel:[4544972.149341] [<ffffffffa005f389>]
ixgbe_clean_rx_irq+0x7b9/0xa20 [ixgbe]
Jan 9 19:12:42 kernel:[4544972.149344] [<ffffffffa006024b>]
ixgbe_poll+0x42b/0x7e0 [ixgbe]
Jan 9 19:12:42 kernel:[4544972.149346] [<ffffffff816949ce>]
net_rx_action+0x13e/0x320
Jan 9 19:12:42 kernel:[4544972.149350] [<ffffffff8107fcce>]
__do_softirq+0xde/0x2d0
Jan 9 19:12:42 kernel:[4544972.149352] [<ffffffff8108009d>]
irq_exit+0x4d/0x60
Jan 9 19:12:42 kernel:[4544972.149353] [<ffffffff817b5185>]
do_IRQ+0x65/0x110
Jan 9 19:12:42 kernel:[4544972.149356] [<ffffffff817b36eb>]
common_interrupt+0x6b/0x6b
Jan 9 19:12:42 kernel:[4544972.149356] <EOI> [<ffffffff8164048e>] ?
cpuidle_enter_state+0xae/0x1f0
Jan 9 19:12:42 kernel:[4544972.149360] [<ffffffff8164045d>] ?
cpuidle_enter_state+0x7d/0x1f0
Jan 9 19:12:42 kernel:[4544972.149361] [<ffffffff81640607>]
cpuidle_enter+0x17/0x20
Jan 9 19:12:42 kernel:[4544972.149364] [<ffffffff810b6c41>]
cpu_startup_entry+0x2d1/0x350
Jan 9 19:12:42 kernel:[4544972.149366] [<ffffffff810e0aac>] ?
clockevents_config_and_register+0x2c/0x40
Jan 9 19:12:42 kernel:[4544972.149368] [<ffffffff81034aa3>]
start_secondary+0x123/0x130
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Mon, 2016-01-11 at 12:27 -0800, Grant Zhang wrote:

> Hi Eric,
>
> One of my 4.1.13 server(have been up 50+ days) under testing got into a
> similar kernel hang (stack trace attached). Looking back at the initial
> conversation on this issue you also mentioned the following patch in
> https://lkml.org/lkml/2015/9/23/433
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=ed2e923945892a8372ab70d2f61d364b0b6d9054
> tcp/dccp: fix timewait races in timer handling
>
> Which does not seem to be part of stable 4.1 tree. Would the above patch
> fix the kernel hang issue?
>
> Thanks,
>
> Grant
>
> ----stacktrace----
> Jan 9 19:12:42 kernel:[4544972.126385] INFO: rcu_sched self-detected
> stall on CPU { 13} (t=15001 jiffies g=422586407 c=422586406 q=3730083)
> Jan 9 19:12:42 kernel:[4544972.134383] INFO: rcu_sched detected stalls
> on CPUs/tasks: { 13} (detected by 5, t=15002 jiffies, g=422586407,
> c=422586406, q=3730200)
> Jan 9 19:12:42 kernel:[4544972.134384] Task dump for CPU 13:
> Jan 9 19:12:42 kernel:[4544972.134387] swapper/13 R running task
> 0 0 1 0x00000008
> Jan 9 19:12:42 kernel:[4544972.134389] 0000000000000010
> 0000000000000246 ffff885ecde0be68 0000000000000018
> Jan 9 19:12:42 kernel:[4544972.134390] ffffffff8164045d
> ffffffff00000007 00102982b9fa1875 ffffffff81c7fc80
> Jan 9 19:12:42 kernel:[4544972.134391] 0000000d00000000
> ffff88beff0e0300 ffffffff81ce5448 ffff885ecde08000
> Jan 9 19:12:42 kernel:[4544972.134391] Call Trace:
> Jan 9 19:12:42 kernel:[4544972.134397] [<ffffffff8164045d>] ?
> cpuidle_enter_state+0x7d/0x1f0
> Jan 9 19:12:42 kernel:[4544972.134398] [<ffffffff81640607>] ?
> cpuidle_enter+0x17/0x20
> Jan 9 19:12:42 kernel:[4544972.134401] [<ffffffff810b6c41>] ?
> cpu_startup_entry+0x2d1/0x350
> Jan 9 19:12:42 kernel:[4544972.134403] [<ffffffff810e0aac>] ?
> clockevents_config_and_register+0x2c/0x40
> Jan 9 19:12:42 kernel:[4544972.134406] [<ffffffff81034aa3>] ?
> start_secondary+0x123/0x130
> Jan 9 19:12:42 kernel:[4544972.149242] Task dump for CPU 13:
> Jan 9 19:12:42 kernel:[4544972.149244] swapper/13 R running task
> 0 0 1 0x00000008
> Jan 9 19:12:42 kernel:[4544972.149248] ffffffff81c3f300
> ffff88beff0c3820 ffffffff810a5791 000000000000000d
> Jan 9 19:12:42 kernel:[4544972.149250] ffffffff81c3f300
> ffff88beff0c3840 ffffffff810a8c4f ffff88beff0c3880
> Jan 9 19:12:42 kernel:[4544972.149251] ffffffff81c3f3c0
> ffff88beff0c3870 ffffffff810ca763 ffff88beff0d6c80
> Jan 9 19:12:42 kernel:[4544972.149253] Call Trace:
> Jan 9 19:12:42 kernel:[4544972.149255] <IRQ> [<ffffffff810a5791>]
> sched_show_task+0xb1/0x120
> Jan 9 19:12:42 kernel:[4544972.149267] [<ffffffff810a8c4f>]
> dump_cpu_task+0x3f/0x50
> Jan 9 19:12:42 kernel:[4544972.149270] [<ffffffff810ca763>]
> rcu_dump_cpu_stacks+0x93/0xc0
> Jan 9 19:12:42 kernel:[4544972.149272] [<ffffffff810cdcaa>]
> rcu_check_callbacks+0x4aa/0x760
> Jan 9 19:12:42 kernel:[4544972.149277] [<ffffffff8110e27c>] ?
> acct_account_cputime+0x1c/0x20
> Jan 9 19:12:42 kernel:[4544972.149279] [<ffffffff810d3868>]
> update_process_times+0x38/0x70
> Jan 9 19:12:42 kernel:[4544972.149283] [<ffffffff810e2a38>]
> tick_sched_timer+0x58/0x190
> Jan 9 19:12:42 kernel:[4544972.149284] [<ffffffff810d41ad>]
> __run_hrtimer+0x6d/0x220
> Jan 9 19:12:42 kernel:[4544972.149285] [<ffffffff810e29e0>] ?
> tick_init_highres+0x20/0x20
> Jan 9 19:12:42 kernel:[4544972.149287] [<ffffffff810d4893>]
> hrtimer_interrupt+0x103/0x240
> Jan 9 19:12:42 kernel:[4544972.149292] [<ffffffff810363f9>]
> local_apic_timer_interrupt+0x39/0x60
> Jan 9 19:12:42 kernel:[4544972.149296] [<ffffffff817b5275>]
> smp_apic_timer_interrupt+0x45/0x60
> Jan 9 19:12:42 kernel:[4544972.149299] [<ffffffff817b39bb>]
> apic_timer_interrupt+0x6b/0x70
> Jan 9 19:12:42 kernel:[4544972.149304] [<ffffffff816f36f0>] ?
> __inet_lookup_established+0x70/0x170
> Jan 9 19:12:42 kernel:[4544972.149306] [<ffffffff816f36c6>] ?
> __inet_lookup_established+0x46/0x170
> Jan 9 19:12:42 kernel:[4544972.149309] [<ffffffff8170eb7d>]
> tcp_v4_early_demux+0xad/0x160
> Jan 9 19:12:42 kernel:[4544972.149311] [<ffffffff816e9428>]
> ip_rcv_finish+0x158/0x380
> Jan 9 19:12:42 kernel:[4544972.149312] [<ffffffff816e9cf2>]
> ip_rcv+0x292/0x3b0
> Jan 9 19:12:42 kernel:[4544972.149318] [<ffffffffa04f0102>] ?
> macvlan_handle_frame+0x1f2/0x310 [macvlan]
> Jan 9 19:12:42 kernel:[4544972.149320] [<ffffffff816e92d0>] ?
> inet_add_protocol+0x50/0x50
> Jan 9 19:12:42 kernel:[4544972.149324] [<ffffffff81693bf0>]
> __netif_receive_skb_core+0x300/0x7b0
> Jan 9 19:12:42 kernel:[4544972.149325] [<ffffffff817b5185>] ?
> do_IRQ+0x65/0x110
> Jan 9 19:12:42 kernel:[4544972.149327] [<ffffffff816940c1>]
> __netif_receive_skb+0x21/0x70
> Jan 9 19:12:42 kernel:[4544972.149329] [<ffffffff81694271>]
> netif_receive_skb_internal+0x31/0xa0
> Jan 9 19:12:42 kernel:[4544972.149331] [<ffffffff81694ff0>]
> napi_gro_receive+0x130/0x1b0
> Jan 9 19:12:42 kernel:[4544972.149341] [<ffffffffa005f389>]
> ixgbe_clean_rx_irq+0x7b9/0xa20 [ixgbe]
> Jan 9 19:12:42 kernel:[4544972.149344] [<ffffffffa006024b>]
> ixgbe_poll+0x42b/0x7e0 [ixgbe]
> Jan 9 19:12:42 kernel:[4544972.149346] [<ffffffff816949ce>]
> net_rx_action+0x13e/0x320
> Jan 9 19:12:42 kernel:[4544972.149350] [<ffffffff8107fcce>]
> __do_softirq+0xde/0x2d0
> Jan 9 19:12:42 kernel:[4544972.149352] [<ffffffff8108009d>]
> irq_exit+0x4d/0x60
> Jan 9 19:12:42 kernel:[4544972.149353] [<ffffffff817b5185>]
> do_IRQ+0x65/0x110
> Jan 9 19:12:42 kernel:[4544972.149356] [<ffffffff817b36eb>]
> common_interrupt+0x6b/0x6b
> Jan 9 19:12:42 kernel:[4544972.149356] <EOI> [<ffffffff8164048e>] ?
> cpuidle_enter_state+0xae/0x1f0
> Jan 9 19:12:42 kernel:[4544972.149360] [<ffffffff8164045d>] ?
> cpuidle_enter_state+0x7d/0x1f0
> Jan 9 19:12:42 kernel:[4544972.149361] [<ffffffff81640607>]
> cpuidle_enter+0x17/0x20
> Jan 9 19:12:42 kernel:[4544972.149364] [<ffffffff810b6c41>]
> cpu_startup_entry+0x2d1/0x350
> Jan 9 19:12:42 kernel:[4544972.149366] [<ffffffff810e0aac>] ?
> clockevents_config_and_register+0x2c/0x40
> Jan 9 19:12:42 kernel:[4544972.149368] [<ffffffff81034aa3>]
> start_secondary+0x123/0x130
>

Hi Grant

Most probably the commit would help ;)

David, can you make sure ed2e923945892a8372ab70d2f61d364b0b6d9054 is
pushed to stable trees (4.1 & 4.2)

Thanks.
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Mon, 2016-01-11 at 13:11 -0800, Eric Dumazet wrote:

>
> Hi Grant
>
> Most probably the commit would help ;)
>
> David, can you make sure ed2e923945892a8372ab70d2f61d364b0b6d9054 is
> pushed to stable trees (4.1 & 4.2)

We also want this fix:

commit 8695a144da9e500a5a60fa34c06694346ec1048f
tcp/dccp: fix old style declarations
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Jan 2016 13:11:28 -0800

> David, can you make sure ed2e923945892a8372ab70d2f61d364b0b6d9054 is
> pushed to stable trees (4.1 & 4.2)

It should definitely be there:

[davem@localhost linux-stable]$ git shortlog v4.1.. -- include/net/inet_timewait_sock.h
David S. Miller (1):
Merge git://git.kernel.org/.../davem/net

Eric Dumazet (4):
inet: simplify timewait refcounting
inet: inet_twsk_deschedule factorization
tcp/dccp: fix timewait races in timer handling
net: align sk_refcnt on 128 bytes boundary

Raanan Avargil (1):
tcp/dccp: fix old style declarations
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Wed, 2016-01-13 at 11:01 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 11 Jan 2016 13:11:28 -0800
>
> > David, can you make sure ed2e923945892a8372ab70d2f61d364b0b6d9054 is
> > pushed to stable trees (4.1 & 4.2)
>
> It should definitely be there:
>
> [davem@localhost linux-stable]$ git shortlog v4.1.. -- include/net/inet_timewait_sock.h
> David S. Miller (1):
> Merge git://git.kernel.org/.../davem/net
>
> Eric Dumazet (4):
> inet: simplify timewait refcounting
> inet: inet_twsk_deschedule factorization
> tcp/dccp: fix timewait races in timer handling
> net: align sk_refcnt on 128 bytes boundary
>
> Raanan Avargil (1):
> tcp/dccp: fix old style declarations
>

Nothing yet on 4.1.15 and 4.2.8


$ git log --oneline v4.1..v4.1.15 -- include/net/inet_timewait_sock.h
$ git log --oneline v4.1..v4.2.8 -- include/net/inet_timewait_sock.h
$ git log --oneline v4.1..v4.4 -- include/net/inet_timewait_sock.h
26440c835f8b Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
8e5eb54d303b net: align sk_refcnt on 128 bytes boundary
8695a144da9e tcp/dccp: fix old style declarations
ed2e92394589 tcp/dccp: fix timewait races in timer handling
dbe7faa4045e inet: inet_twsk_deschedule factorization
fc01538f9fb7 inet: simplify timewait refcounting
$
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Jan 2016 08:19:50 -0800

> On Wed, 2016-01-13 at 11:01 -0500, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 11 Jan 2016 13:11:28 -0800
>>
>> > David, can you make sure ed2e923945892a8372ab70d2f61d364b0b6d9054 is
>> > pushed to stable trees (4.1 & 4.2)
>>
>> It should definitely be there:
>>
>> [davem@localhost linux-stable]$ git shortlog v4.1.. -- include/net/inet_timewait_sock.h
>> David S. Miller (1):
>> Merge git://git.kernel.org/.../davem/net
>>
>> Eric Dumazet (4):
>> inet: simplify timewait refcounting
>> inet: inet_twsk_deschedule factorization
>> tcp/dccp: fix timewait races in timer handling
>> net: align sk_refcnt on 128 bytes boundary
>>
>> Raanan Avargil (1):
>> tcp/dccp: fix old style declarations
>>
>
> Nothing yet on 4.1.15 and 4.2.8

I should really learn how to use GIT properly, thanks Eric I'll queue
the two patches up. :-)
Re: Kernel 4.1 hang, apparently in __inet_lookup_established [ In reply to ]
On Wed, 2016-01-13 at 14:38 -0500, David Miller wrote:

> I should really learn how to use GIT properly, thanks Eric I'll queue
> the two patches up. :-)

Hmm, you know git much better than me ;)

Thanks !