Mailing List Archive

Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
Hi,

We run two Proxmox 4 nodes with KVM in a dual-primary setup with
protocol C on DRBD 9.

The hardware is a Dell PowerEdge R730 with a tg3 NIC and an H730P RAID
controller (megaraid_sas driver), running the latest firmware for iDRAC,
BIOS and RAID. Storage is SSD.

Under heavy I/O in a VM, we get a kernel panic in the drbd module on
the node running the VM.

The panic occurs with the latest Proxmox kernel (DRBD 9 commit
360c65a035fc2dec2b93e839b5c7fae1201fa7d9) as well as with DRBD 9 git
master (a48a43a73ebc01e398ca1b755a7006b96ccdfb28).

We have a kdump crash dump if that can be of any help.
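If useful, we can open the dump with the crash utility and pull the
trace and log out of it, roughly like this (the debug vmlinux and vmcore
paths below are only examples, not our actual paths):

# crash /usr/lib/debug/boot/vmlinux-4.2.6-1-pve /var/crash/vmcore
crash> log
crash> bt
crash> mod -t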

Virtualization: KVM guest with virtio for network and disk, using a
writethrough caching strategy for the guest. The VM's backing storage
is LVM on top of DRBD.
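
For completeness, the backing store is assembled roughly like this (the
volume group and disk names here are only illustrative, not our real
ones):

# pvcreate /dev/drbd0
# vgcreate drbdvg /dev/drbd0
# lvcreate -L 100G -n vm-100-disk-1 drbdvg

The guest then gets the logical volume as a virtio disk with
cache=writethrough, i.e. a VM config line along the lines of
"virtio0: drbdlvm:vm-100-disk-1,cache=writethrough".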

Tried both versions:

# cat /proc/drbd
version: 9.0.0 (api:2/proto:86-110)
GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa,
2016-01-10 15:26:34
Transports (api:10): tcp (1.0.0)

# cat /proc/drbd
version: 9.0.0 (api:2/proto:86-110)
GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
root@sd-84686, 2016-01-17 16:31:20
Transports (api:13): tcp (1.0.0)

In the VM we run: dd if=/dev/zero of=dd1 bs=65536 count=1M (about 64 GiB
of sequential writes).

Node:

Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016

[ 861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243
[ 862.065397] ------------[ cut here ]------------
[ 862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
[ 862.065484] invalid opcode: 0000 [#1] SMP
[ 862.065529] Modules linked in: drbd_transport_tcp(O) drbd(O)
netconsole configfs ip_set ip6table_filter ip6_tables iptable_filter
ip_tables softdog x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
ip_gre ip_tunnel vport_gre gre openvswitch libcrc32c nfnetlink_log
nfnetlink ipmi_ssif ipmi_devintf intel_rapl iosf_mbi
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr
joydev input_leds sb_edac edac_core mei_me ioatdma mei shpchp lpc_ich
dca wmi ipmi_si 8250_fintek ipmi_msghandler mac_hid acpi_power_meter
vhost_net vhost macvtap macvlan autofs4 hid_generic usbkbd usbmouse
usbhid hid ahci libahci tg3 ptp pps_core megaraid_sas [last unloaded: drbd]
[ 862.066319] CPU: 0 PID: 2343 Comm: drbd_a_r0 Tainted: G O
4.2.6-1-pve #1
[ 862.066386] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS
1.5.4 10/002/2015
[ 862.066451] task: ffff881fee1d0000 ti: ffff881fecd78000 task.ti:
ffff881fecd78000
[ 862.066517] RIP: 0010:[<ffffffffc0556e30>] [<ffffffffc0556e30>]
lc_put+0x90/0xa0 [drbd]
[ 862.066594] RSP: 0000:ffff881fecd7bb08 EFLAGS: 00010046
[ 862.066633] RAX: 0000000000000000 RBX: 000000000000faf3 RCX: ffff881fe7b8cab0
[ 862.066677] RDX: ffff881fe65dc000 RSI: ffff881fe7b8cab0 RDI: ffff881fdc2eca80
[ 862.066721] RBP: ffff881fecd7bb08 R08: 0000000000000484 R09: 0000000000000000
[ 862.066765] R10: ffff883cf8428870 R11: 0000000000000000 R12: ffff881fec93ec00
[ 862.066808] R13: 0000000000000000 R14: 000000000000faf3 R15: 0000000000000001
[ 862.066852] FS: 0000000000000000(0000) GS:ffff881ffec00000(0000)
knlGS:0000000000000000
[ 862.066919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 862.066959] CR2: 00007f1f0a44f028 CR3: 0000000001e0d000 CR4: 00000000001426f0
[ 862.067003] Stack:
[ 862.067034] ffff881fecd7bb58 ffffffffc0553b5a 0000000000000046
ffff881fec93eeb0
[ 862.067115] ffff881fecd7bb68 ffff883cf8428438 ffff881fec93ec00
ffff883cf8428448
[ 862.067196] 0000000000000800 0000000000004000 ffff881fecd7bb68
ffffffffc0554060
[ 862.067277] Call Trace:
[ 862.067316] [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
[ 862.067360] [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
[ 862.067406] [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
[ 862.067451] [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
[ 862.067493] [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
[ 862.067537] [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
[ 862.067582] [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
[ 862.067626] [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]
[ 862.067670] [<ffffffffc054cdc0>] drbd_ack_receiver+0x160/0x5c0 [drbd]
[ 862.067716] [<ffffffffc05571d0>] ? w_complete+0x20/0x20 [drbd]
[ 862.067760] [<ffffffffc0557234>] drbd_thread_setup+0x64/0x120 [drbd]
[ 862.067804] [<ffffffffc05571d0>] ? w_complete+0x20/0x20 [drbd]
[ 862.067847] [<ffffffff8109acaa>] kthread+0xea/0x100
[ 862.067886] [<ffffffff8109abc0>] ? kthread_create_on_node+0x1f0/0x1f0
[ 862.067930] [<ffffffff8180875f>] ret_from_fork+0x3f/0x70
[ 862.067970] [<ffffffff8109abc0>] ? kthread_create_on_node+0x1f0/0x1f0
[ 862.068012] Code: 89 42 08 48 89 56 10 48 89 7e 18 48 89 07 83 6f
64 01 f0 80 a7 90 00 00 00 f7 f0 80 a7 90 00 00 00 fe 8b 46 20 5d c3
0f 0b 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
00 00
[ 862.068414] RIP [<ffffffffc0556e30>] lc_put+0x90/0xa0 [drbd]
[ 862.068459] RSP <ffff881fecd7bb08>
[ 862.069000] ---[ end trace b005772103543ee2 ]---
[ 872.163694] ------------[ cut here ]------------

# drbdsetup show
resource r0 {
    _this_host {
        node-id 0;
        volume 0 {
            device minor 0;
            disk "/dev/sda4";
            meta-disk internal;
            disk {
                disk-flushes no;
            }
        }
    }
    connection {
        _peer_node_id 1;
        path {
            _this_host ipv4 10.0.0.197:7788;
            _remote_host ipv4 10.0.0.140:7788;
        }
        net {
            allow-two-primaries yes;
            cram-hmac-alg "sha1";
            shared-secret "xxxxxxxx";
            after-sb-0pri discard-zero-changes;
            after-sb-1pri discard-secondary;
            verify-alg "md5";
            _name "proxmox1";
        }
        volume 0 {
            disk {
                resync-rate 40960k; # bytes/second
            }
        }
    }
}

Shortly afterwards the tg3 watchdog triggers; this is probably a
consequence of the drbd kernel panic, but perhaps not?

See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg

Is this a known problem for this kind of configuration?
(kvm->virtio->lvm->drbd->h730p+tg3)

Best regards,
Francois
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
On Sun, Jan 17, 2016 at 05:59:20PM +0100, Francois Baligant wrote:
> Hi,
>
> We run 2 Proxmox 4 nodes with KVM in a dual-primary scenario with
> protocol C on DRBD9.
>
> Hardware is PowerEdge R730 with tg3 NIC and H730P RAID card with
> megaraid_sas driver with latest firmwares for IDRAC, BIOS and RAID.
> Storage is SSD.
>
> When doing heavy I/O in a VM, we have a kernel panic in drbd module on
> the node running the VM.
>
> We get the kernel panic using the latest proxmox kernel (drbd9
> 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 ) and using drbd9 git master
> also (a48a43a73ebc01e398ca1b755a7006b96ccdfb28)
>
> We have a kdump crash dump if that can be of any help.
>
> Virtualization: KVM guest with virtio for net and disk. Using
> writethrough caching strategy for guest VM. Backing storage for VM is
> LVM on top of DRBD.
>
> Tried both versions:
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa,
> 2016-01-10 15:26:34
> Transports (api:10): tcp (1.0.0)
>
> # cat /proc/drbd
> version: 9.0.0 (api:2/proto:86-110)
> GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
> root@sd-84686, 2016-01-17 16:31:20
> Transports (api:13): tcp (1.0.0)
>
> Doing in VM: dd if=/dev/zero of=dd1 bs=65536 count=1M
>
> Node:
>
> Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
> 4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
>
> [ 861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243

This is the real problem ^^

I will add a fix to the "LOGIC BUG" path there so that it at least no
longer returns "Success" for a failed operation, and therefore no longer
triggers the BUG_ON() below.

That BUG_ON() is only a follow-up failure.

The interesting part will be figuring out where the logic is wrong: if,
within a protected critical region, I first check that at least N
"slots" are available, and then a few lines later, still within the same
protected region, some of them are suddenly not available...
As they say, this "cannot happen" ;-)

> [ 862.065397] ------------[ cut here ]------------
> [ 862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!

> [ 862.067277] Call Trace:
> [ 862.067316] [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
> [ 862.067360] [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
> [ 862.067406] [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
> [ 862.067451] [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
> [ 862.067493] [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
> [ 862.067537] [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
> [ 862.067582] [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
> [ 862.067626] [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]

...

> # drbdsetup show
> resource r0 {
> _this_host {
> node-id 0;
> volume 0 {
> device minor 0;
> disk "/dev/sda4";
> meta-disk internal;
> disk {
> disk-flushes no;
> }
> }
> }
> connection {
> _peer_node_id 1;
> path {
> _this_host ipv4 10.0.0.197:7788;
> _remote_host ipv4 10.0.0.140:7788;
> }
> net {
> allow-two-primaries yes;
> cram-hmac-alg "sha1";
> shared-secret "xxxxxxxx";
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> verify-alg "md5";
> _name "proxmox1";
> }
> volume 0 {
> disk {
> resync-rate 40960k; # bytes/second
> }
> }
> }
> }
>
> Shortly after the tg3 watchdog trigger, it's probably a consequence of
> the drbd kernel panic but maybe not ?
>
> See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
>
> Is this a known problem for this kind of configuration?
> (kvm->virtio->lvm->drbd->h730p+tg3)
>
> Best regards,
> Francois

--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
Hi Lars,

Thanks for your analysis!

In case it helps: we downgraded this cluster to DRBD 8.4 (module and
metadata downgrade only; nothing else was touched, not even the drbd
configuration), and after some heavy stress tests we have yet to make it
crash, whereas on DRBD 9.0 it took only around 10 minutes.

version: 8.4.7-1 (api:1/proto:86-101)
GIT-hash: aff41b8a77838faac8f4e8f8ee843e182d4e4bcc build by
root@sd-84686, 2016-01-17 20:32:22
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:798510236 nr:54885336 dw:853395572 dr:54193620 al:630522 bm:0
lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

Best regards,
Francois

2016-01-19 11:46 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> On Sun, Jan 17, 2016 at 05:59:20PM +0100, Francois Baligant wrote:
>> Hi,
>>
>> We run 2 Proxmox 4 nodes with KVM in a dual-primary scenario with
>> protocol C on DRBD9.
>>
>> Hardware is PowerEdge R730 with tg3 NIC and H730P RAID card with
>> megaraid_sas driver with latest firmwares for IDRAC, BIOS and RAID.
>> Storage is SSD.
>>
>> When doing heavy I/O in a VM, we have a kernel panic in drbd module on
>> the node running the VM.
>>
>> We get the kernel panic using the latest proxmox kernel (drbd9
>> 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 ) and using drbd9 git master
>> also (a48a43a73ebc01e398ca1b755a7006b96ccdfb28)
>>
>> We have a kdump crash dump if that can be of any help.
>>
>> Virtualization: KVM guest with virtio for net and disk. Using
>> writethrough caching strategy for guest VM. Backing storage for VM is
>> LVM on top of DRBD.
>>
>> Tried both versions:
>>
>> # cat /proc/drbd
>> version: 9.0.0 (api:2/proto:86-110)
>> GIT-hash: 360c65a035fc2dec2b93e839b5c7fae1201fa7d9 build by root@elsa,
>> 2016-01-10 15:26:34
>> Transports (api:10): tcp (1.0.0)
>>
>> # cat /proc/drbd
>> version: 9.0.0 (api:2/proto:86-110)
>> GIT-hash: a48a43a73ebc01e398ca1b755a7006b96ccdfb28 build by
>> root@sd-84686, 2016-01-17 16:31:20
>> Transports (api:13): tcp (1.0.0)
>>
>> Doing in VM: dd if=/dev/zero of=dd1 bs=65536 count=1M
>>
>> Node:
>>
>> Linux version 4.2.6-1-pve (root@sd-84686) (gcc version 4.9.2 (Debian
>> 4.9.2-10) ) #1 SMP Sun Jan 17 13:39:16 CET 2016
>>
>> [ 861.968976] drbd r0/0 drbd0: LOGIC BUG for enr=64243
>
> This is the real problem ^^
>
> I will add a fix to the "LOGIC BUG" path there
> that at least will not return "Success" for a failed operation,
> so it won't later trigger the BUG_ON() below.
>
> This BUG_ON() is only a followup failure.
>
> But the interesting thing will be to figure out
> where the logic is wrong: if, within a protected critical region,
> I first check that at least N "slots" are available,
> and then a few lines later, still within the same protected region,
> suddenly some of them are not available...
> As they say, this "can not happen" ;-)
>
>> [ 862.065397] ------------[ cut here ]------------
>> [ 862.065442] kernel BUG at /usr/src/drbd-9.0/drbd/lru_cache.c:571!
>
>> [ 862.067277] Call Trace:
>> [ 862.067316] [<ffffffffc0553b5a>] put_actlog+0x6a/0x120 [drbd]
>> [ 862.067360] [<ffffffffc0554060>] drbd_al_complete_io+0x30/0x40 [drbd]
>> [ 862.067406] [<ffffffffc054e192>] drbd_req_destroy+0x442/0x880 [drbd]
>> [ 862.067451] [<ffffffff81734640>] ? tcp_recvmsg+0x390/0xb90
>> [ 862.067493] [<ffffffffc054ead8>] mod_rq_state+0x508/0x7c0 [drbd]
>> [ 862.067537] [<ffffffffc054f084>] __req_mod+0x214/0x8d0 [drbd]
>> [ 862.067582] [<ffffffffc0558c4b>] tl_release+0x1db/0x320 [drbd]
>> [ 862.067626] [<ffffffffc053c3c2>] got_BarrierAck+0x32/0xc0 [drbd]
>
> ...
>
>> # drbdsetup show
>> resource r0 {
>> _this_host {
>> node-id 0;
>> volume 0 {
>> device minor 0;
>> disk "/dev/sda4";
>> meta-disk internal;
>> disk {
>> disk-flushes no;
>> }
>> }
>> }
>> connection {
>> _peer_node_id 1;
>> path {
>> _this_host ipv4 10.0.0.197:7788;
>> _remote_host ipv4 10.0.0.140:7788;
>> }
>> net {
>> allow-two-primaries yes;
>> cram-hmac-alg "sha1";
>> shared-secret "xxxxxxxx";
>> after-sb-0pri discard-zero-changes;
>> after-sb-1pri discard-secondary;
>> verify-alg "md5";
>> _name "proxmox1";
>> }
>> volume 0 {
>> disk {
>> resync-rate 40960k; # bytes/second
>> }
>> }
>> }
>> }
>>
>> Shortly after the tg3 watchdog trigger, it's probably a consequence of
>> the drbd kernel panic but maybe not ?
>>
>> See here: https://pastebin.synalabs.hosting/#cI5nWLuuD37_yN6ii8RLtg
>>
>> Is this a known problem for this kind of configuration?
>> (kvm->virtio->lvm->drbd->h730p+tg3)
>>
>> Best regards,
>> Francois
>
> --
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA and Pacemaker support and consulting
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user



--
François BALIGANT
Managing Director
+33 (0) 811 69 65 60
http://www.synalabs.com
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
Is there any update on this?

I'm experiencing the same issue on Proxmox 4.1, kernel 4.2.8-1-pve, DRBD
9.0.0.
My kernel panic log is identical to the one posted above, same error in
lru_cache.c:571.

What's the best option for me now? Upgrade to 9.0.1 or downgrade to 8.4?

Upgrade to 9.0.1: @Lars, was this fixed in DRBD 9.0.1, so that I could
ask the Proxmox guys to build a kernel with that DRBD version (or try to
build it myself)?

Downgrade to 8.4: @Francois: I suppose you built DRBD 8.4 yourself.
How did you downgrade the disk metadata from 9.0 to 8.4? Did you discard
everything (I mean the metadata) and start from scratch?

Thanks to both of you for your help.

> Hi Lars,
>
> Thanks for your analysis !
>
> If it can be of any help, we downgraded (module downgrade and metadata
> downgrade, nothing else touched, even drbd configuration) this cluster
> to DRBD 8.4 and after some heavy stress tests we have yet to make it
> crash while it would take around only 10 minutes to crash it on DRBD
> 9.0.
>
> version: 8.4.7-1 (api:1/proto:86-101)
> GIT-hash: aff41b8a77838faac8f4e8f8ee843e182d4e4bcc build by
> root at sd-84686, 2016-01-17 20:32:22
> 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
> ns:798510236 nr:54885336 dw:853395572 dr:54193620 al:630522 bm:0
> lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
>
> Best regards,
> Francois
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
That's great, will test it immediately and report back...

Thanks

On 24/02/2016 10:01, Dietmar Maurer wrote:
>> Upgrade to 9.0.1: @Lars, was this fixed in DRBD 9.0.1, so I could ask
>> Proxmox guys to build a kernel with this DRBD version (or trying to
>> build it by myself)?
> I just build a new proxmox kernel with 9.0.1 - will upload today to pvetest ...
>
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
Just tested DRBD 9.0.1 and it still crashes with the same kernel panic
at the same line:

---------------------------
[ 1892.949041] drbd r0/0 drbd0: LOGIC BUG for enr=107636
[ 1892.954170] drbd r0/0 drbd0: LOGIC BUG for enr=107636
[ 1893.141512] ------------[ cut here ]------------
[ 1893.146192] kernel BUG at /home/dietmar/pve4-devel/pve-kernel/drbd-9.0.1-1/drbd/lru_cache.c:571!
[ 1893.155075] invalid opcode: 0000 [#1] SMP
[ 1893.159244] Modules linked in: ip_set ip6table_filter ip6_tables
drbd_transport_tcp(O) drbd(O) libcrc32c softdog nfsd auth_rpcgss nfs_acl
nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment xt_conntrack xt_multiport
iptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables
nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO)
spl(O) zavl(PO) ipmi_ssif amdkfd amd_iommu_v2 radeon ttm gpio_ich
drm_kms_helper drm psmouse coretemp snd_pcm i2c_algo_bit kvm_intel
snd_timer snd kvm soundcore input_leds hpilo shpchp serio_raw
i7core_edac pcspkr acpi_power_meter ipmi_si lpc_ich ipmi_msghandler
8250_fintek mac_hid edac_core vhost_net vhost macvtap macvlan autofs4
hid_generic usbkbd usbmouse usbhid hid pata_acpi tg3 e1000e(O)
ptppps_core hpsa
[ 1893.245546] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P IO 4.2.8-1-pve #1
[ 1893.253218] Hardware name: HP ProLiant ML350 G6, BIOS D22 08/16/2015
[ 1893.259682] task: ffff88020e29be80 ti: ffff88020e2b0000 task.ti:
ffff88020e2b0000
[ 1893.267274] RIP: 0010:[<ffffffffc0ab0fe0>] [<ffffffffc0ab0fe0>]
lc_put+0x90/0xa0 [drbd]
[ 1893.275483] RSP: 0018:ffff880217503ac8 EFLAGS: 00010046
[ 1893.280853] RAX: 0000000000000000 RBX: 000000000001a474 RCX:
ffff8800357d9900
[ 1893.288066] RDX: ffff8800dec48000 RSI: ffff8800357d9900 RDI:
ffff88020b2a6b40
[ 1893.295306] RBP: ffff880217503ac8 R08: 0000000000000011 R09:
0000000000000000
[ 1893.302520] R10: ffff8801a5e3edc0 R11: 0000000000000166 R12:
ffff88020c478c00
[ 1893.309733] R13: 0000000000000000 R14: 000000000001a474 R15:
0000000000000001
[ 1893.316981] FS: 0000000000000000(0000) GS:ffff880217500000(0000)
knlGS:0000000000000000
[ 1893.325160] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1893.330996] CR2: 00007f47508cbf70 CR3: 0000000001e0d000 CR4:
00000000000026e0
[ 1893.338207] Stack:
[ 1893.340241] ffff880217503b18 ffffffffc0aadd0a 0000000000000046
ffff88020c478eb0
[ 1893.347776] ffff88020c478c08 ffff8801a5e3e978 ffff88020c478c00
ffff8801a5e3e988
[ 1893.355326] 0000000000000800 0000000000004000 ffff880217503b28
ffffffffc0aae210
[ 1893.362876] Call Trace:
[ 1893.365348] <IRQ>
[ 1893.367302] [<ffffffffc0aadd0a>] put_actlog+0x6a/0x120 [drbd]
[ 1893.373395] [<ffffffffc0aae210>] drbd_al_complete_io+0x30/0x40 [drbd]
[ 1893.380000] [<ffffffffc0aa8342>] drbd_req_destroy+0x442/0x880 [drbd]
[ 1893.386518] [<ffffffffc0aa7996>] ?
drbd_req_put_completion_ref+0x116/0x350 [drbd]
[ 1893.394177] [<ffffffffc0aa8c88>] mod_rq_state+0x508/0x7c0 [drbd]
[ 1893.404919] [<ffffffff811852bf>] ? mempool_free+0x2f/0x90
[ 1893.415114] [<ffffffffc0aa90f7>] __req_mod+0xd7/0x8d0 [drbd]
[ 1893.425501] [<ffffffffc0a8ff81>] drbd_request_endio+0x81/0x230 [drbd]
[ 1893.436651] [<ffffffff813954c7>] bio_endio+0x57/0x90
[ 1893.446272] [<ffffffff8139c31f>] blk_update_request+0x8f/0x340
[ 1893.456751] [<ffffffff81583f23>] scsi_end_request+0x33/0x1c0
[ 1893.467069] [<ffffffff815864d4>] scsi_io_completion+0xc4/0x650
[ 1893.477558] [<ffffffff8157d50f>] scsi_finish_command+0xcf/0x120
[ 1893.488152] [<ffffffff81585d26>] scsi_softirq_done+0x126/0x150
[ 1893.498614] [<ffffffff813a2f47>] blk_done_softirq+0x87/0xb0
[ 1893.508796] [<ffffffff81080095>] __do_softirq+0x105/0x260
[ 1893.518755] [<ffffffff8108034e>] irq_exit+0x8e/0x90
[ 1893.528139] [<ffffffff8180d6f8>] do_IRQ+0x58/0xe0
[ 1893.537325] [<ffffffff8180b66b>] common_interrupt+0x6b/0x6b
[ 1893.547299] <EOI>
[ 1893.549247] [<ffffffff8168d011>] ? cpuidle_enter_state+0xf1/0x220
[ 1893.564052] [<ffffffff8168cff0>] ? cpuidle_enter_state+0xd0/0x220
[ 1893.574285] [<ffffffff8168d177>] cpuidle_enter+0x17/0x20
[ 1893.583642] [<ffffffff810be18b>] call_cpuidle+0x3b/0x70
[ 1893.592753] [<ffffffff8168d153>] ? cpuidle_select+0x13/0x20
[ 1893.602118] [<ffffffff810be45c>] cpu_startup_entry+0x29c/0x360
[ 1893.611711] [<ffffffff8104d983>] start_secondary+0x183/0x1c0
[ 1893.620980] Code: 89 42 08 48 89 56 10 48 89 7e 18 48 89 07 83 6f 64
01 f0 80 a7 90 00 00 00 f7 f0 80 a7 90 00 00 00 fe 8b 46 20 5d c3 0f 0b
0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
[ 1893.647996] RIP [<ffffffffc0ab0fe0>] lc_put+0x90/0xa0 [drbd]
[ 1893.657350] RSP <ffff880217503ac8>
[ 1893.664377] ---[ end trace 00eeba9098fc3948 ]---
[ 1893.672498] Kernel panic - not syncing: Fatal exception in interrupt
[ 1894.745252] Shutting down cpus with NMI
[ 1894.752650] Kernel Offset: disabled
[ 1894.759570] drm_kms_helper: panic occurred, switching back to text
console
[ 1894.769935] ---[. end Kernel panic - not syncing: Fatal exception in
interrupt
[ 1894.780616] ------------[ cut here ]------------
[ 1894.788757] WARNING: CPU: 4 PID: 0 at arch/x86/kernel/smp.c:124
native_smp_send_reschedule+0x60/0x70()
[ 1894.801701] Modules linked in: ip_set ip6table_filter ip6_tables
drbd_transport_tcp(O) drbd(O) libcrc32c softdog nfsd auth_rpcgss nfs_acl
nfs lockd grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_comment xt_conntrack xt_multiport
iptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables
nfnetlink_log nfnetlink zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO)
spl(O) zavl(PO) ipmi_ssif amdkfd amd_iommu_v2 radeon ttm gpio_ich
drm_kms_helper drm psmouse coretemp snd_pcm i2c_algo_bit kvm_intel
snd_timer snd kvm soundcore input_leds hpilo shpchp serio_raw
i7core_edac pcspkr acpi_power_meter ipmi_si lpc_ich ipmi_msghandler
8250_fintek mac_hid edac_core vhost_net vhost macvtap macvlan autofs4
hid_generic usbkbd usbmouse usbhid hid pata_acpi tg3 e1000e(O)
ptppps_core hpsa
[ 1894.918913] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P D IO
4.2.8-1-pve #1
[ 1894.930775] Hardware name: HP ProLiant ML350 G6, BIOS D22 08/16/2015
[ 1894.941441] 0000000000000000 cb864877fc32c408 ffff880217503530
ffffffff81803a9b
[ 1894.953278] 0000000000000000 0000000000000000 ffff880217503570
ffffffff8107bbfa
[ 1894.965055] ffff880217503560 0000000000000000 ffff880217416a00
0000000000000004
[ 1894.976768] Call Trace:
[ 1894.983495] <IRQ> [<ffffffff81803a9b>] dump_stack+0x45/0x57
[ 1894.993593] [<ffffffff8107bbfa>] warn_slowpath_common+0x8a/0xc0
[ 1895.003920] [<ffffffff8107bd2a>] warn_slowpath_null+0x1a/0x20
[ 1895.014099] [<ffffffff8104cc50>] native_smp_send_reschedule+0x60/0x70
[ 1895.024995] [<ffffffff810b897b>] trigger_load_balance+0x13b/0x230
[ 1895.035518] [<ffffffff810a7ab6>] scheduler_tick+0xa6/0xd0
[ 1895.045349] [<ffffffff810f7ac0>] ? tick_sched_do_timer+0x30/0x30
[ 1895.055802] [<ffffffff810e81b1>] update_process_times+0x51/0x60
[ 1895.066195] [<ffffffff810f74b5>] tick_sched_handle.isra.15+0x25/0x60
[ 1895.077027] [<ffffffff810f7b04>] tick_sched_timer+0x44/0x80
[ 1895.087031] [<ffffffff810e8d83>] __hrtimer_run_queues+0xf3/0x220
[ 1895.097489] [<ffffffff810e91e8>] hrtimer_interrupt+0xa8/0x1a0
[ 1895.107695] [<ffffffff8104f57c>] local_apic_timer_interrupt+0x3c/0x70
[ 1895.118653] [<ffffffff8180d7c1>] smp_apic_timer_interrupt+0x41/0x60
[ 1895.129425] [<ffffffff8180b95b>] apic_timer_interrupt+0x6b/0x70
[ 1895.139812] [<ffffffff818018a0>] ? panic+0x1d3/0x217
[ 1895.149264] [<ffffffff8180189c>] ? panic+0x1cf/0x217
[ 1895.158676] [<ffffffff810180a6>] oops_end+0xd6/0xe0
[ 1895.167997] [<ffffffff810185cb>] die+0x4b/0x70
[ 1895.176869] [<ffffffff810154bd>] do_trap+0x13d/0x150
[ 1895.186265] [<ffffffff81015a99>] do_error_trap+0x89/0x110
[ 1895.196100] [<ffffffffc0ab0fe0>] ? lc_put+0x90/0xa0 [drbd]
[ 1895.205911] [<ffffffff8118a069>] ? __free_pages+0x19/0x30
[ 1895.215521] [<ffffffff811dbf6a>] ? __free_slab+0xda/0x1e0
[ 1895.225002] [<ffffffff81015dc0>] do_invalid_op+0x20/0x30
[ 1895.234412] [<ffffffff8180c41e>] invalid_op+0x1e/0x30
[ 1895.243541] [<ffffffffc0ab0fe0>] ? lc_put+0x90/0xa0 [drbd]
[ 1895.253039] [<ffffffffc0ab0d50>] ? lc_find+0x10/0x20 [drbd]
[ 1895.262582] [<ffffffffc0aadd0a>] put_actlog+0x6a/0x120 [drbd]
[ 1895.272310] [<ffffffffc0aae210>] drbd_al_complete_io+0x30/0x40 [drbd]
[ 1895.282784] [<ffffffffc0aa8342>] drbd_req_destroy+0x442/0x880 [drbd]
[ 1895.293104] [<ffffffffc0aa7996>] ?
drbd_req_put_completion_ref+0x116/0x350 [drbd]
[ 1895.304576] [<ffffffffc0aa8c88>] mod_rq_state+0x508/0x7c0 [drbd]
[ 1895.314461] [<ffffffff811852bf>] ? mempool_free+0x2f/0x90
[ 1895.323681] [<ffffffffc0aa90f7>] __req_mod+0xd7/0x8d0 [drbd]
[ 1895.333029] [<ffffffffc0a8ff81>] drbd_request_endio+0x81/0x230 [drbd]
[ 1895.343097] [<ffffffff813954c7>] bio_endio+0x57/0x90
[ 1895.351572] [<ffffffff8139c31f>] blk_update_request+0x8f/0x340
[ 1895.360826] [<ffffffff81583f23>] scsi_end_request+0x33/0x1c0
[ 1895.369798] [<ffffffff815864d4>] scsi_io_completion+0xc4/0x650
[ 1895.378798] [<ffffffff8157d50f>] scsi_finish_command+0xcf/0x120
[ 1895.387812] [<ffffffff81585d26>] scsi_softirq_done+0x126/0x150
[ 1895.396678] [<ffffffff813a2f47>] blk_done_softirq+0x87/0xb0
[ 1895.405243] [<ffffffff81080095>] __do_softirq+0x105/0x260
[ 1895.413575] [<ffffffff8108034e>] irq_exit+0x8e/0x90
[ 1895.421405] [<ffffffff8180d6f8>] do_IRQ+0x58/0xe0
[ 1895.429037] [<ffffffff8180b66b>] common_interrupt+0x6b/0x6b
[ 1895.437471] <EOI> [<ffffffff8168d011>] ? cpuidle_enter_state+0xf1/0x220
[ 1895.447063] [<ffffffff8168cff0>] ? cpuidle_enter_state+0xd0/0x220
[ 1895.456071] [<ffffffff8168d177>] cpuidle_enter+0x17/0x20
[ 1895.464262] [<ffffffff810be18b>] call_cpuidle+0x3b/0x70
[ 1895.472391] [<ffffffff8168d153>] ? cpuidle_select+0x13/0x20
[ 1895.480850] [<ffffffff810be45c>] cpu_startup_entry+0x29c/0x360
[ 1895.489579] [<ffffffff8104d983>] start_secondary+0x183/0x1c0
[ 1895.498098] ---[ end trace 00eeba9098fc3949 ]---
-------------------

I was watching "drbdadm status" every 2 seconds.
This is its last output before the panic:
-------------------
r0 node-id:0 role:Primary suspended:no
    write-ordering:drain
  volume:0 minor:0 disk:UpToDate
      size:488336928 read:829508 written:5750835 al-writes:2689 bm-writes:0
      upper-pending:320 lower-pending:320 al-suspended:no blocked:no
  srvvmhost2 node-id:1 connection:Connected role:Primary congested:no
    volume:0 replication:Established peer-disk:UpToDate resync-suspended:no
        received:1034427 sent:4717688 out-of-sync:0 pending:0 unacked:0
-------------------

I suppose version 9.0.1 does not target this bug.
@Lars: can you confirm that?

@Dietmar: what's my best option now?
I'd like to stay on DRBD 9, but I urgently need to fix this kernel panic
because the hosts are already in production.
Self-compiling 8.4 could be an option, but I suppose Proxmox will keep
using 9.x in the future and never go back to 8.4.
Am I right, or is there a special kernel version with 8.4?

@Lars: in case of a downgrade (if I decide to build 8.4 myself and enter
versioning hell), is this the right path?
1) move all of the VMs to node B
2) downgrade the node A module 9.0 --> 8.4
3) ... resource metadata? ...
4) reboot A (now 8.4) and reconnect to node B (still at 9.0)
5) repeat 2) and 3) on node B

Could you please help me with points 3) and 4)? My rough guess is
sketched just below.
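
To make 3) and 4) concrete, this is what I imagine for node A, assuming
the drbdadm/drbdmeta from an 8.4-capable drbd-utils detects the existing
v09 metadata when creating v08 metadata and offers to convert it in
place (please correct me if that assumption is wrong):

# drbdadm down r0
# rmmod drbd_transport_tcp drbd
... install the 8.4 module and matching drbd-utils, or reboot into them ...
# drbdadm create-md r0
# drbdadm up r0
# drbdadm primary r0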

Thank you all for helping

Claudio

On 24/02/2016 10:05, Claudio wrote:
> That's great, will test it immediately and report back...
>
> Thanks
>
> Il 24/02/2016 10:01, Dietmar Maurer wrote:
>>> Upgrade to 9.0.1: @Lars, was this fixed in DRBD 9.0.1, so I could ask
>>> Proxmox guys to build a kernel with this DRBD version (or trying to
>>> build it by myself)?
>> I just build a new proxmox kernel with 9.0.1 - will upload today to pvetest ...
>>
>
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
I have built the 8.4 drbd module for kernel 4.2.8-1-pve, together with
the latest drbd-utils (which still supports the 8.4 metadata format), and
then downgraded the resources to 8.4.

I've run the same I/O-intensive tests as before (on 9.0) for more than
24 hours: no issues. 9.0 crashed after no more than 3-4 minutes.

My two-node Proxmox cluster has now been running for 3 days without any issue.

It would be great, for people like me with a simple two-node setup, to
have a kernel with drbd 8.4 instead of 9.0; is that possible?
Otherwise I'll create a script to automate the build process each time I
update the kernel in the future, something along the lines of the sketch
below.
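
The script would be something along these lines, assuming the 8.4
kernel-module tarball still builds out of tree with a plain make against
KDIR (untested sketch; package and version names are just the ones I
used):

# apt-get install build-essential pve-headers-$(uname -r)
# tar xf drbd-8.4.7-1.tar.gz && cd drbd-8.4.7-1
# make KDIR=/lib/modules/$(uname -r)/build
# make install KDIR=/lib/modules/$(uname -r)/build
# depmod -a && modprobe drbd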

PS: since there is next to no documentation available on DRBD
downgrading, I will write a post with detailed instructions (I took
detailed notes while building and reconfiguring my nodes).
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Kernel panic with DRBD 9.0 on Kernel 4.2.6 "LOGIC BUG for enr=x"
Since I've had next to no feedback from either the LINBIT or the Proxmox
guys, I've downgraded my two-node setup to DRBD 8.4.
It now works without issues.

I've put my downgrading notes in a blog post; I hope it will be useful
for other users too.

http://coolsoft.altervista.org/blog/2016/02/proxmox-41-kernel-panic-downgrade-drbd-resources-drbd-9-84


_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user