Mailing List Archive

Regression: Disk corruption with dm-crypt and kernels >= 4.0
I made sure to run a completely vanilla kernel when testing why I was suddenly
seeing some nasty libata errors with all kernels >= v4.0. Here's a snippet:

-------------------->8--------------------
[ 165.592136] ata5.00: exception Emask 0x60 SAct 0x7000 SErr 0x800 action 0x6 frozen
[ 165.592140] ata5.00: irq_stat 0x20000000, host bus error
[ 165.592143] ata5: SError: { HostInt }
[ 165.592145] ata5.00: failed command: READ FPDMA QUEUED
[ 165.592149] ata5.00: cmd 60/08:60:a0:0d:89/00:00:07:00:00/40 tag 12 ncq 4096 in
         res 40/00:74:40:58:5d/00:00:00:00:00/40 Emask 0x60 (host bus error)
[ 165.592151] ata5.00: status: { DRDY }
-------------------->8--------------------

After a few dozen of these errors, I'd suddenly find my system in read-only
mode with corrupted files throughout my encrypted filesystems (seemed like
either a read or a write would corrupt a file, though I could be mistaken). I
decided to do a git bisect with a random read-write-sync test to narrow down
the culprit, which turned out to be this commit (part of a series):

# first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't allocate pages for a partial request

Just to be sure, I created a patch to revert the entire nine-patch series that
commit belonged to... and the bad behavior disappeared. I've now been running
kernel 4.0 for a few days without issue, and went so far as to stress test my
poor SSD for a few hours to be 100% positive.
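
The read-write-sync test itself is not included in this thread. The following is only a minimal sketch of what such a test might look like, assuming a POSIX shell with `head`, `tee`, `sync` and `sha256sum` available. The function name `rw_sync_check` and the `rwtest.N` file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical write/sync/verify loop; NOT the reporter's actual script.
# Usage: rw_sync_check DIR ROUNDS
# Returns 0 if every file reads back intact, 1 on the first mismatch.
rw_sync_check() {
    dir=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        f="$dir/rwtest.$i"
        # Write 1 MiB of random data, checksumming the stream as it is written.
        want=$(head -c 1048576 /dev/urandom | tee "$f" | sha256sum)
        # Flush dirty pages so the data is pushed down to the block device.
        sync
        # Read the file back and compare against what was written.
        got=$(sha256sum < "$f")
        if [ "$want" != "$got" ]; then
            echo "MISMATCH in $f" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "ok: $rounds files verified"
}
```

Running something like `rw_sync_check /home/test 1000` on the encrypted filesystem would exercise the write path. Note that the read-back may be served from the page cache; to force real reads from the disk, the caches would have to be dropped between write and read (as root, `echo 3 > /proc/sys/vm/drop_caches`) or the data set made larger than RAM.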

Here's some more info on my setup.

-------------------->8--------------------
$ lsblk -f
NAME           FSTYPE      LABEL MOUNTPOINT
sda
├─sda1         vfat              /boot/EFI
├─sda2         ext4              /boot
└─sda3         LVM2_member
  ├─SSD-root   crypto_LUKS
  │ └─root     f2fs              /
  └─SSD-home   crypto_LUKS
    └─home     f2fs              /home

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow-discards root=/dev/mapper/root acpi_osi=Linux security=tomoyo TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on modprobe.blacklist=nouveau rw quiet

$ cat /etc/lvm/lvm.conf | grep "issue_discards"
issue_discards = 1
-------------------->8--------------------

If there's anything else I can do to help diagnose the underlying problem, I'm
more than willing.

Thanks,

Abelardo Ricart.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, May 01 2015 at 12:37am -0400,
Abelardo Ricart III <aricart@memnix.com> wrote:

> $ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-linux-memnix cryptdevice=/dev/SSD/root:root:allow-discards
> root=/dev/mapper/root acpi_osi=Linux security=tomoyo
> TOMOYO_trigger=/usr/lib/systemd/systemd intel_iommu=on
> modprobe.blacklist=nouveau rw quiet
>
> $ cat /etc/lvm/lvm.conf | grep "issue_discards"
> issue_discards = 1

The patchset in question was tested quite heavily so this is a
surprising report. I'm noticing you are opting in to dm-crypt discard
support. Have you tested without discards enabled?
Re: [dm-devel] Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, May 01, 2015 at 12:37:07AM -0400, Abelardo Ricart III wrote:
> # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't
> allocate pages for a partial request

That's not a particularly good commit to identify.

If you didn't already, can you confirm whether or not the code works at the
patch immediately following?

7145c241a1bf2841952c3e297c4080b357b3e52d

Alasdair

Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, 2015-05-01 at 17:17 -0400, Mike Snitzer wrote:
> The patchset in question was tested quite heavily so this is a
> surprising report. I'm noticing you are opting in to dm-crypt discard
> support. Have you tested without discards enabled?

I've disabled discards universally and rebuilt a vanilla kernel. After running
my heavy read-write-sync scripts, everything seems to be working fine now. I
suppose this could be something that used to fail silently before, but now
produces bad behavior? I seem to remember having something in my message log
about "discards not supported on this device" when running with it enabled
before.
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, 2015-05-01 at 18:24 -0400, Abelardo Ricart III wrote:
> > The patchset in question was tested quite heavily so this is a
> > surprising report. I'm noticing you are opting in to dm-crypt discard
> > support. Have you tested without discards enabled?
>
> I've disabled discards universally and rebuilt a vanilla kernel. After running
> my heavy read-write-sync scripts, everything seems to be working fine now. I
> suppose this could be something that used to fail silently before, but now
> produces bad behavior? I seem to remember having something in my message log
> about "discards not supported on this device" when running with it enabled
> before.

Forgive me, but I spoke too soon. The corruption and libata errors are still
there, as was evidenced when I went to reboot and got treated to an eyeful of
"read-only filesystem" and ata errors.

So no, disabling discards unfortunately did nothing to help.

Re: [dm-devel] Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, 2015-05-01 at 22:47 +0100, Alasdair G Kergon wrote:
> On Fri, May 01, 2015 at 12:37:07AM -0400, Abelardo Ricart III wrote:
> > # first bad commit: [cf2f1abfbd0dba701f7f16ef619e4d2485de3366] dm crypt: don't
> > allocate pages for a partial request
>
> That's not a particularly good commit to identify.
>
> If you didn't already, can you confirm whether or not the code works at the
> patch immediately following?
>
> 7145c241a1bf2841952c3e297c4080b357b3e52d
>
> Alasdair
>
Just built that revision and it failed almost immediately with more ata errors. It also corrupted my testing log.

As an aside, here's my fstab in case it's of any use:

-------------------->8--------------------
/dev/mapper/root / f2fs rw,relatime,flush_merge,background_gc=on,user_xattr,acl,active_logs=6 0 0
/dev/mapper/home /home f2fs rw,relatime,flush_merge,background_gc=on,user_xattr,acl,active_logs=6 0 2
/dev/sda2 /boot ext4 rw,relatime,data=ordered 0 2
/dev/sda1 /boot/EFI vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 2
tmpfs /scratch tmpfs nodev,nosuid,size=12G 0 0
-------------------->8--------------------
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On 2015-05-01 (Fri) at 19:42:15 -0400, Abelardo Ricart III wrote:
> > > The patchset in question was tested quite heavily so this is a
> > > surprising report. I'm noticing you are opting in to dm-crypt discard
> > > support. Have you tested without discards enabled?
> >
> > I've disabled discards universally and rebuilt a vanilla kernel. After running
> > my heavy read-write-sync scripts, everything seems to be working fine now. I
> > suppose this could be something that used to fail silently before, but now
> > produces bad behavior? I seem to remember having something in my message log
> > about "discards not supported on this device" when running with it enabled
> > before.
>
> Forgive me, but I spoke too soon. The corruption and libata errors are still
> there, as was evidenced when I went to reboot and got treated to an eye full of
> "read-only filesystem" and ata errors.
>
> So no, disabling discards unfortunately did nothing to help.

I've been experiencing the same problem: vanilla 4.0 series kernels,
dm-crypt, with or without discards, on a ThinkPad X1 Carbon with a
LiteOn LGT-256M6G SSD.

After some googling around, I found some chatter relating to changes
in NCQ on SSDs in 4.0. I've been running w/o NCQ for a full kernel build so
far without issue. Perhaps there's been some change in the interaction
between dm-crypt and NCQ?

Abelardo, can you try w/o NCQ and see if that helps your situation?

Best,

--Brandon
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Fri, 2015-05-15 at 08:04 -0700, Brandon Smith wrote:
> I've been experiencing the same problem. Vanilla 4.0 series kernels,
> dm-crypt, with/or without discards, on a ThinkPad X1 Carbon with a
> LiteOn LGT-256M6G SSD.
>
> After some of googling around, I found some chatter relating to changes
> in NCQ on SSDs in 4.0. Been running w/o NCQ for a full kernel build so
> far without issue. Perhaps there's been some change in the interaction
> between dm-crypt and NCQ?
>
> Abelardo, can you try w/o NCQ and see if that helps your situation?
>
> Best,
>
> --Brandon

I've been running with NCQ disabled and stress testing for a while, and the
issue is indeed gone. Thanks for the workaround!

So it seems the issue is somehow related to the combination of NCQ, dm-crypt,
and possibly (some?) SSDs.
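
The thread does not say how NCQ was disabled. As a hedged sketch, two common ways are limiting the device queue depth to 1 through sysfs at runtime, or booting with `libata.force=noncq`. The helper names `ncq_depth` and `ncq_disable` and the `SYSFS_ROOT` override below are hypothetical, added only so the helpers can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch only: the thread does not state which method was actually used.
# SYSFS_ROOT is a hypothetical override; leave it unset on a real system.

# Print the current queue depth of a disk (1 means NCQ is effectively off).
ncq_depth() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/device/queue_depth"
}

# Limit the disk to one outstanding command (needs root on a real system).
ncq_disable() {
    echo 1 > "${SYSFS_ROOT:-/sys}/block/$1/device/queue_depth"
}

# Example (as root):  ncq_disable sda && ncq_depth sda
# Boot-time alternative: add libata.force=noncq to the kernel command line.
```

The sysfs setting does not survive a reboot, so the kernel command line option is probably the better fit for a root filesystem that is affected from early boot.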
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Mon, 18 May 2015, Abelardo Ricart III wrote:

> I've been running with NCQ disabled and been stress testing for awhile and the
> issue is indeed gone. Thanks for the workaround!
>
> So it seems the issue is somehow related to the combination of NCQ, dm-crypt,
> and possibly (some?) SSDs.

Hi

I suspect that this is a bug in kernel NCQ processing or in SSD firmware
and recent dm-crypt changes made the bug show up.

I suggest this:

If you have some test that reliably reproduces the bug, please do this:
take kernel 3.19 or 3.18 and apply dm-crypt parallelization patches
(commits f3396c58fd8442850e759843457d78b6ec3a9589,
cf2f1abfbd0dba701f7f16ef619e4d2485de3366,
7145c241a1bf2841952c3e297c4080b357b3e52d,
94f5e0243c48aa01441c987743dc468e2d6eaca2,
dc2676210c425ee8e5cb1bec5bc84d004ddf4179,
0f5d8e6ee758f7023e4353cca75d785b2d4f6abe,
b3c5fd3052492f1b8d060799d4f18be5a5438add) on it. If the bug doesn't show
up with the older kernel and dm-crypt parallelization patches, use git
bisect to find out which patch broke NCQ. When you test a kernel with
bisect, apply the above-mentioned patches to it.

Mikulas
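
The procedure suggested above could be sketched roughly as follows, assuming a git checkout of the kernel tree and that the commits apply in the order listed. The function name `apply_dmcrypt_series` and the `DRY_RUN` knob are hypothetical; `git cherry-pick -n` is used so that HEAD stays on the revision git bisect selected.

```shell
#!/bin/sh
# dm-crypt parallelization commits, in the order listed in the message above.
DM_CRYPT_COMMITS="
f3396c58fd8442850e759843457d78b6ec3a9589
cf2f1abfbd0dba701f7f16ef619e4d2485de3366
7145c241a1bf2841952c3e297c4080b357b3e52d
94f5e0243c48aa01441c987743dc468e2d6eaca2
dc2676210c425ee8e5cb1bec5bc84d004ddf4179
0f5d8e6ee758f7023e4353cca75d785b2d4f6abe
b3c5fd3052492f1b8d060799d4f18be5a5438add
"

# Apply the series to the working tree without committing, so HEAD is
# unchanged. With DRY_RUN set (hypothetical knob), only print the commands.
apply_dmcrypt_series() {
    for c in $DM_CRYPT_COMMITS; do
        ${DRY_RUN:+echo} git cherry-pick -n "$c" || return 1
    done
}

# One manual bisect step might then look like this (assumption: revisions
# that already contain the series are tested unpatched):
#   apply_dmcrypt_series      # patch the revision under test
#   ... build, boot, run the reproducer ...
#   git reset --hard          # drop the applied patches again
#   git bisect good           # or: git bisect bad
```

Setting `DRY_RUN=1` prints the seven `git cherry-pick -n` commands instead of running them, which is a cheap way to check the list before touching a tree.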
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
On Tue, 2015-06-02 at 13:51 -0400, Mikulas Patocka wrote:
> Hi
>
> I suspect that this is a bug in kernel NCQ processing or in SSD firmware
> and recent dm-crypt changes made the bug show up.
>
> I suggest this:
>
> If you have some test that reliably reproduces the bug, please do this:
> take kernel 3.19 or 3.18 and apply dm-crypt parallelization patches
> (commits f3396c58fd8442850e759843457d78b6ec3a9589,
> cf2f1abfbd0dba701f7f16ef619e4d2485de3366,
> 7145c241a1bf2841952c3e297c4080b357b3e52d,
> 94f5e0243c48aa01441c987743dc468e2d6eaca2,
> dc2676210c425ee8e5cb1bec5bc84d004ddf4179,
> 0f5d8e6ee758f7023e4353cca75d785b2d4f6abe,
> b3c5fd3052492f1b8d060799d4f18be5a5438add) on it. If the bug doesn't show
> up with the older kernel and dm-crypt parallelization patches, use git
> bisect to find out which patch broken NCQ. When you test a kernel with
> bisect, apply the above mentioned patches to it.
>
> Mikulas

Alright, I'll try this next and report back soon.
Re: Regression: Disk corruption with dm-crypt and kernels >= 4.0 [ In reply to ]
Hi,

Could you please try the following patch (against any of the kernels you
saw the corruption with, be it 4.0, 4.1, or 4.2) to see if the regression
you reported goes away?

Thanks,
Mike

From: Mike Snitzer <snitzer@redhat.com>
Date: Wed, 9 Sep 2015 21:34:51 -0400
Subject: [PATCH] dm crypt: constrain crypt device's max_segment_size to PAGE_SIZE

Unfortunate constraint that is required to avoid the potential for
exceeding underlying device's max_segments limits -- due to
crypt_alloc_buffer() possibly allocating pages for the encryption bio
that are not as physically contiguous as the original bio.

Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-crypt.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 76f1d6e..f717762 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -973,7 +973,8 @@ static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone);
 
 /*
  * Generate a new unfragmented bio with the given size
- * This should never violate the device limitations
+ * This should never violate the device limitations (but only because
+ * max_segment_size is being constrained to PAGE_SIZE).
  *
  * This function may be called concurrently. If we allocate from the mempool
  * concurrently, there is a possibility of deadlock. For example, if we have
@@ -2057,9 +2058,20 @@ static int crypt_iterate_devices(struct dm_target *ti,
 	return fn(ti, cc->dev, cc->start, ti->len, data);
 }
 
+static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+	/*
+	 * Unfortunate constraint that is required to avoid the potential
+	 * for exceeding underlying device's max_segments limits -- due to
+	 * crypt_alloc_buffer() possibly allocating pages for the encryption
+	 * bio that are not as physically contiguous as the original bio.
+	 */
+	limits->max_segment_size = PAGE_SIZE;
+}
+
 static struct target_type crypt_target = {
 	.name = "crypt",
-	.version = {1, 14, 0},
+	.version = {1, 14, 1},
 	.module = THIS_MODULE,
 	.ctr = crypt_ctr,
 	.dtr = crypt_dtr,
@@ -2071,6 +2083,7 @@ static struct target_type crypt_target = {
 	.message = crypt_message,
 	.merge = crypt_merge,
 	.iterate_devices = crypt_iterate_devices,
+	.io_hints = crypt_io_hints,
 };
 
 static int __init dm_crypt_init(void)
--
1.7.4.4
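
After booting a kernel with this patch, one way to check that the limit took effect might be to compare the dm-crypt device's `queue/max_segment_size` sysfs attribute with the page size. This is a sketch under the assumption that the value set in `crypt_io_hints()` is what ends up exported for the mapped device. The function name `segment_size_is_page` and the `SYSFS_ROOT` override are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a block device's max_segment_size equals the page size.
# SYSFS_ROOT is a hypothetical override for trying this on a scratch directory.
# Returns 0 on a match, 1 on a mismatch, 2 if the attribute cannot be read.
segment_size_is_page() {
    seg=$(cat "${SYSFS_ROOT:-/sys}/block/$1/queue/max_segment_size") || return 2
    page=$(getconf PAGESIZE)
    if [ "$seg" -eq "$page" ]; then
        echo "$1: max_segment_size=$seg (== PAGE_SIZE)"
    else
        echo "$1: max_segment_size=$seg (PAGE_SIZE is $page)"
        return 1
    fi
}
```

The kernel name of the mapped device (for example `dm-0`) can be found by resolving the symlink under `/dev/mapper`, e.g. `segment_size_is_page "$(basename "$(readlink -f /dev/mapper/root)")"`.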
