Mailing List Archive

Nature of ext4 corruption fixed by recent patch?
Hi,

I recently had my server's filesystem implode, and I'm currently in the
process of cleaning it up. It had widespread corruption in files and
directories scattered across the filesystem, though all vaguely recently
changed. Directories appeared corrupted or truncated, various files
showed up as piles of NULs, and 5000+ files and directories ended up in
lost+found. I observed this corruption shortly after a reboot into
4.0.2 (from a previous kernel of 3.16), with ext4 noticing an
inconsistency and mounting the filesystem read-only.  The underlying
disks had no errors.

Reading about the corruption issue fixed by
d2dc317d564a46dfc683978a2e5a4f91434e9711 ("ext4: fix data corruption
caused by unwritten and delayed extents"), it sounds plausible. Can
that strike both file data and directory data, assuming all of that data
ended up grouped with a delayed extent? Would that bug manifest as
corrupted directories and files filled with NULs? The system is a
72-way server on which I was doing piles of parallel git pulls and
builds, so hitting a race seems plausible.

I'm trying to track down potential causes of this so that I can feel
comfortable trusting that system again.

Thanks,
Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Nature of ext4 corruption fixed by recent patch?
On Mon, May 18, 2015 at 03:58:24PM -0700, josh@joshtriplett.org wrote:
>
> I recently had my server's filesystem implode, and I'm currently in the
> process of cleaning it up. It had widespread corruption in files and
> directories scattered across the filesystem, though all vaguely recently
> changed. Directories appeared corrupted or truncated, various files
> showed up as piles of NULs, and 5000+ files and directories ended up in
> lost+found. I observed this corruption shortly after a reboot into
> 4.0.2 (from a previous kernel of 3.16), with ext4 noticing an
> inconsistency and mounting the filesystem read-only.  The underlying
> disks had no errors.
>
> Reading about the corruption issue fixed by
> d2dc317d564a46dfc683978a2e5a4f91434e9711 ("ext4: fix data corruption
> caused by unwritten and delayed extents"), it sounds plausible. Can
> that strike both file data and directory data, assuming all of that data
> ended up grouped with a delayed extent? Would that bug manifest as
> corrupted directories and files filled with NULs? The system is a
> 72-way server on which I was doing piles of parallel git pulls and
> builds, so hitting a race seems plausible.

Unfortunately, I don't think you can blame all of your problems on the
bug fixed by this particular patch.  First of all, it doesn't apply to
directories at all; secondly, it's been around for a long time. I'd
have to check and see whether or not 3.16 had the problem, but it
wouldn't surprise me at all. Finally, git pulls and builds are not
at all likely to hit the problem.

It requires the combination of (a) a buffered write to a portion of a
file that was not previously allocated, (b) an fallocate of a region of
the file which is a superset of the region written in (a), before that
data has a chance to be written to disk, (c) waiting for the file data
in (a) to be written out to disk (either via fsync or via the writeback
daemons), and then (d) before the extent status cache gets pushed out
of memory, another random write to a portion of the file covered by
(a) -- in which case that specific portion of (a) could be replaced by
all zeros.
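
For illustration, the ordering of calls in (a) through (d) can be
sketched in Python as below.  This is only a sketch of the sequence,
not a reproducer: the 4 KiB block size, the file name, and the helper
names run_sequence and expected_contents are assumptions made for the
example, and since the read-back goes through the page cache it would
not necessarily reveal on-disk damage even on an affected kernel.

```python
import os
import tempfile

BLOCK = 4096  # assumed block size, chosen only for illustration


def run_sequence(path):
    """Issue steps (a)-(d) against `path` and return the final contents."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # (a) buffered write into a region that has no blocks allocated yet
        os.pwrite(fd, b"A" * BLOCK, BLOCK)
        # (b) fallocate a superset of that region before writeback happens
        os.posix_fallocate(fd, 0, 4 * BLOCK)
        # (c) force the data from (a) out to disk
        os.fsync(fd)
        # (d) another write to part of the region written in (a)
        os.pwrite(fd, b"B" * 512, BLOCK)
        os.fsync(fd)
        return os.pread(fd, 4 * BLOCK, 0)
    finally:
        os.close(fd)


def expected_contents():
    """What the file should contain if nothing was zeroed out."""
    buf = bytearray(4 * BLOCK)
    buf[BLOCK:2 * BLOCK] = b"A" * BLOCK
    buf[BLOCK:BLOCK + 512] = b"B" * 512
    return bytes(buf)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        data = run_sequence(os.path.join(d, "victim"))
        print("intact" if data == expected_contents() else "corrupted")
```

The point of the sketch is how narrow the window is: the fallocate in
(b) has to land between the buffered write and its writeback, and (d)
has to arrive while the extent status cache entry is still in memory.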

Even most database workloads or torrent downloads are not likely to hit
this pattern, since it requires an fallocate of a region of a file that
was previously (and very recently) allocated using a buffered write.
Torrent downloads will tend to fallocate the whole file in advance,
and while Oracle or DB2 might intermix writes and fallocates, they
don't fallocate previously written regions of the file, and they use
direct I/O in any case.

So it's pretty hard to hit this bug by accident, unless you happen to
be using fsx, and even then, the only files that would get corrupted
would be the files being written using fsx. So I'm afraid you'll have
to look farther afield, and consider other bugs as well as potential
hardware problems before trusting the system again.

Cheers,

- Ted

P.S.  Bugs like these are why I'm always amused by people
who think that just because a file system is safely being used by
its developers, that it's safe to throw production workloads on
it.  These sorts of subtle data corruptors tend to be highly timing
dependent, and very hard to find.  Sometimes these bugs can hang around
for years before they are found and fixed. The flip side is that
fortunately, they tend to strike very rarely. It's also why I'm very
grateful for developers like Jan and Lukas. :-)
Re: Nature of ext4 corruption fixed by recent patch?
On Tue, May 19, 2015 at 09:40:05AM -0400, Theodore Ts'o wrote:
> On Mon, May 18, 2015 at 03:58:24PM -0700, josh@joshtriplett.org wrote:
> >
> > I recently had my server's filesystem implode, and I'm currently in the
> > process of cleaning it up. It had widespread corruption in files and
> > directories scattered across the filesystem, though all vaguely recently
> > changed. Directories appeared corrupted or truncated, various files
> > showed up as piles of NULs, and 5000+ files and directories ended up in
> > lost+found. I observed this corruption shortly after a reboot into
> > 4.0.2 (from a previous kernel of 3.16), with ext4 noticing an
> > inconsistency and mounting the filesystem read-only.  The underlying
> > disks had no errors.
> >
> > Reading about the corruption issue fixed by
> > d2dc317d564a46dfc683978a2e5a4f91434e9711 ("ext4: fix data corruption
> > caused by unwritten and delayed extents"), it sounds plausible. Can
> > that strike both file data and directory data, assuming all of that data
> > ended up grouped with a delayed extent? Would that bug manifest as
> > corrupted directories and files filled with NULs? The system is a
> > 72-way server on which I was doing piles of parallel git pulls and
> > builds, so hitting a race seems plausible.
>
> Unfortunately, I don't think you can blame all of your problems on the
> bug fixed by this particular patch.  First of all, it doesn't apply to
> directories at all; secondly, it's been around for a long time. I'd
> have to check and see whether or not 3.16 had the problem, but it
> wouldn't surprise me at all. Finally, git pulls and builds are not
> at all likely to hit the problem.
>
> It requires the combination of (a) a buffered write to a portion of a
> file that was not previously allocated, (b) an fallocate of a region of
> the file which is a superset of the region written in (a), before that
> data has a chance to be written to disk, (c) waiting for the file data
> in (a) to be written out to disk (either via fsync or via the writeback
> daemons), and then (d) before the extent status cache gets pushed out
> of memory, another random write to a portion of the file covered by
> (a) -- in which case that specific portion of (a) could be replaced by
> all zeros.
>
> Even most database workloads or torrent downloads are not likely to hit
> this pattern, since it requires an fallocate of a region of a file that
> was previously (and very recently) allocated using a buffered write.
> Torrent downloads will tend to fallocate the whole file in advance,
> and while Oracle or DB2 might intermix writes and fallocates, they
> don't fallocate previously written regions of the file, and they use
> direct I/O in any case.

Ah, thanks for the clarification. :(

In particular, I didn't realize this was *only* the data of the
delayed-extent-based files. The bug here seems to have struck various
recently-written files and directories. (Recent in days, not seconds,
as far as I can tell; and it isn't universal based on age.) The initial
symptom was ext4 noticing that a directory was corrupt (truncated, IIRC)
and immediately marking the whole filesystem read-only.

> So it's pretty hard to hit this bug by accident, unless you happen to
> be using fsx, and even then, the only files that would get corrupted
> would be the files being written using fsx. So I'm afraid you'll have
> to look farther afield, and consider other bugs as well as potential
> hardware problems before trusting the system again.

I'm quite skeptical of hardware problems. The system is a few months
old, well past infant-mortality and too young for burnout. And I've
tested the disks carefully.

Are there any other known bugs that seem likely to fit the symptoms and
circumstances?

Note that since I saw this after rebooting from 3.16 into 4.0.2, I don't
know whether the corruption was more likely caused by 3.16 or 4.0.2.

> P.S.  Bugs like these are why I'm always amused by people
> who think that just because a file system is safely being used by
> its developers, that it's safe to throw production workloads on
> it.

Heh. Yeah, I like exciting new software in most areas, but not in
filesystems. In filesystems I prefer boring. :)

> These sorts of subtle data corruptors tend to be highly timing
> dependent, and very hard to find.  Sometimes these bugs can hang around
> for years before they are found and fixed. The flip side is that
> fortunately, they tend to strike very rarely.

...lucky me.

> It's also why I'm very
> grateful for developers like Jan and Lukas. :-)

Indeed.

- Josh Triplett
Re: Nature of ext4 corruption fixed by recent patch?
On Tue, May 19, 2015 at 09:37:40AM -0700, Josh Triplett wrote:
> In particular, I didn't realize this was *only* the data of the
> delayed-extent-based files. The bug here seems to have struck various
> recently-written files and directories. (Recent in days, not seconds,
> as far as I can tell; and it isn't universal based on age.) The initial
> symptom was ext4 noticing that a directory was corrupt (truncated, IIRC)
> and immediately marking the whole filesystem read-only.

Do you have the transcript of fsck run on the file system? Either
with -n, or as you were trying to fix it? I'd need to know a lot more
about the pattern of corruptions to hazard a guess.

The sorts of corruption that turn into a large number of file system
errors are (a) corruptions in the block allocation bitmap, so blocks
get used for more than one purpose, or (b) garbage (or the wrong
portion of an inode table) getting written into the inode table. But
these all have their own distinctive signatures in terms of the file
system problems reported by e2fsck.

In general, though, this doesn't cause a large number of files to
contain NULs.  So it doesn't smell like a file system problem, but
I'd want to see a detailed listing of the problems reported by e2fsck
before making a definitive statement.

Were you using LVM, raid, or anything else between the file system and
the storage device(s)?

- Ted
Re: Nature of ext4 corruption fixed by recent patch?
On Tue, May 19, 2015 at 01:50:24PM -0400, Theodore Ts'o wrote:
> On Tue, May 19, 2015 at 09:37:40AM -0700, Josh Triplett wrote:
> > In particular, I didn't realize this was *only* the data of the
> > delayed-extent-based files. The bug here seems to have struck various
> > recently-written files and directories. (Recent in days, not seconds,
> > as far as I can tell; and it isn't universal based on age.) The initial
> > symptom was ext4 noticing that a directory was corrupt (truncated, IIRC)
> > and immediately marking the whole filesystem read-only.
>
> Do you have the transcript of fsck run on the file system? Either
> with -n, or as you were trying to fix it? I'd need to know a lot more
> about the pattern of corruptions to hazard a guess.

Well, I *was* going to say that I didn't have the logs or transcripts of
fsck because the filesystem got remounted read-only and I didn't log
fsck. However, it looks like *after* the fsck, the problem occurred
again:

[173581.359925] EXT4-fs error (device md0): ext4_ext_remove_space:2976: inode #8395881: comm rm: pblk 33595732 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
[173581.360054] Aborting journal on device md0-8.
[173581.360100] EXT4-fs (md0): Remounting filesystem read-only
[173581.360117] EXT4-fs error (device md0) in ext4_ext_remove_space:3048: IO failure
[173581.360189] EXT4-fs error (device md0) in ext4_ext_truncate:4669: IO failure
[173581.360262] EXT4-fs error (device md0) in ext4_reserve_inode_write:4837: Journal has aborted
[173581.360337] EXT4-fs error (device md0) in ext4_truncate:3668: Journal has aborted
[173581.360411] EXT4-fs error (device md0) in ext4_reserve_inode_write:4837: Journal has aborted
[173581.360485] EXT4-fs error (device md0) in ext4_orphan_del:2694: Journal has aborted
[173581.360559] EXT4-fs error (device md0) in ext4_reserve_inode_write:4837: Journal has aborted

And since this is after the fsck, and I'm assuming fsck would have
noticed that kind of corruption, then this would have to be new
filesystem corruption.

Here's the result of "fsck -n" from the still-running system, since the
filesystem is read-only anyway:

fsck from util-linux 2.26.2
e2fsck 1.42.13 (17-May-2015)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/md0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 2097220 has zero dtime. Fix? no

Inodes that were part of a corrupted orphan linked list found. Fix? no

Inode 4725949 was part of the orphaned inode list. IGNORED.
Inode 6172273 has an invalid extent node (blk 24679449, lblk 0)
Clear? no

Inode 6172273, i_blocks is 152, should be 0. Fix? no

Inode 8395881 was part of the orphaned inode list. IGNORED.
Inode 9970463 has an invalid extent node (blk 39892628, lblk 0)
Clear? no

Inode 9970463, i_blocks is 21072, should be 0. Fix? no

Inode 18488219 has an invalid extent node (blk 73989773, lblk 0)
Clear? no

Inode 18488219, i_blocks is 2459648, should be 2185832. Fix? no

Pass 2: Checking directory structure
Directory inode 4470367, block #0, offset 0: directory corrupted
Salvage? no

e2fsck: aborted

/dev/md0: ********** WARNING: Filesystem still has errors **********


These look roughly similar to those that came up during the previous
issue, though at that point there were far more of them. Other messages
in the previous round of fsck not shown above include the usual repair
procedures (adding in . and .., correcting link counts); it's possible
that there were other messages as well, but I don't recall which ones
and I don't have a transcript.

Is there some additional data I can collect to help determine what the
issue might be here? I can leave this system running for a bit in this
read-only state, before I start trying to recover it.

> The sorts of corruption that turn into a large number of file system
> errors are (a) corruptions in the block allocation bitmap, so blocks
> get used for more than one purpose, or (b) garbage (or the wrong
> portion of an inode table) getting written into the inode table. But
> these all have their own distinctive signatures in terms of the file
> system problems reported by e2fsck.
>
> In general, though, this doesn't cause a large number of files to
> contain NULs.  So it doesn't smell like a file system problem, but
> I'd want to see a detailed listing of the problems reported by e2fsck
> before making a definitive statement.
>
> Were you using LVM, raid, or anything else between the file system and
> the storage device(s)?

md-based RAID0, on top of a pair of SSDs.

- Josh Triplett
Re: Nature of ext4 corruption fixed by recent patch?
On Wed, 20 May 2015, josh@joshtriplett.org wrote:
> md-based RAID0, on top of a pair of SSDs.

Might this be it? It was mentioned in another, more recent thread about
data loss involving md-raid0 and ssds...

https://bugzilla.kernel.org/show_bug.cgi?id=98501

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
Re: Nature of ext4 corruption fixed by recent patch?
On Wed, May 20, 2015 at 10:23:22PM -0300, Henrique de Moraes Holschuh wrote:
> On Wed, 20 May 2015, josh@joshtriplett.org wrote:
> > md-based RAID0, on top of a pair of SSDs.
>
> Might this be it? It was mentioned in another, more recent thread about
> data loss involving md-raid0 and ssds...
>
> https://bugzilla.kernel.org/show_bug.cgi?id=98501

That looks extremely likely, particularly since I do have the "discard"
option enabled. And discarding the wrong blocks seems entirely
consistent with the symptoms I've observed.

Thanks! I'll disable discard as a short-term workaround until that
change goes into the Debian kernel.
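
For anyone wanting to check their own machines, one way to see which
mounted filesystems currently have online discard enabled is to look
for the "discard" option in /proc/mounts.  A minimal sketch in Python,
assuming the usual whitespace-separated layout (device, mount point,
type, options); the function name mounts_with_discard is made up for
this example:

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs whose options include 'discard'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Match the exact option so that e.g. "nodiscard" is not counted.
        if "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for device, mountpoint in mounts_with_discard(f.read()):
            print(device, mountpoint)
```

This only reports the mount option; it says nothing about discards
issued some other way, such as a periodic fstrim run.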

- Josh Triplett