Mailing List Archive

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Nate Diller wrote:

> On 7/31/06, Jeff V. Merkey <jmerkey@wolfmountaingroup.com> wrote:
>
>> Nate Diller wrote:
>>
>> > On 7/31/06, Jeff V. Merkey <jmerkey@wolfmountaingroup.com> wrote:
>> >
>> >> Gregory Maxwell wrote:
>> >>
>> >> > On 7/31/06, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> >> >
>> >> >> It's well accepted that reiserfs3 has some robustness problems
>> >> >> in the face of physical media errors. The structure of the file
>> >> >> system and the tree basis make it very hard to avoid such
>> >> >> problems. XFS appears to have managed to achieve both robustness
>> >> >> and better data structures.
>> >> >>
>> >> >> How reiser4 compares I've no idea.
>> >> >
>> >> >
>> >> > Citation?
>> >> >
>> >> > I ask because your claim differs from the only detailed research
>> >> > that I'm aware of on the subject[1]. In figure 2 of the iron
>> >> > filesystems paper, Ext3 is shown to ignore a great number of
>> >> > data-loss inducing failure conditions that Reiser3 detects and
>> >> > panics under.
>> >> >
>> >> > Are you sure that you aren't commenting on cases where Reiser3
>> >> > alerts the user to a critical data condition (via a panic) which
>> >> > leads to a trouble report, while ext3 ignores the problem, which
>> >> > suppresses the trouble report from the user?
>> >> >
>> >> > *1) http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf
>> >>
>> >> Hi Gregory, Wikimedia Foundation and LKML?
>> >>
>> >> How's Wikimania going? :-)
>> >>
>> >> What he says is correct. I have seen some serious issues with
>> >> reiserfs in terms of stability and data corruption. Reiser is,
>> >> however, FASTER, but the statement that it has robustness issues is
>> >> accurate. I was using reiserfs, but we opted to make EXT3 the
>> >> default for Solera appliances, even when using Suse 10, due to
>> >> issues I have seen with data corruption and hard hangs on RAID 0
>> >> read/write sector errors. I have stopped using it for local drives
>> >> and based everything on EXT3. Not to say it won't get there
>> >> eventually, but file systems have to endure a lot of time in the
>> >> field and deployment before they are ready for prime time.
>> >>
>> >> The Wikimedia appliances use Wolf Mountain, and I've tested it for
>> >> about 4 months with few problems, but I only use it for hosting the
>> >> Cherokee Language Wikipedia. Its performance is several magnitudes
>> >> better than either EXT3 or ReiserFS. Despite this, for vertical
>> >> wiki servers, it's OK to go out with; folks can specify whether
>> >> they want appliances with EXT3, Reiser, or WMFS, but it's a long
>> >> way from being "cooked" completely, though it does scale to 1
>> >> exabyte FS images.
>> >
>> >
>> > i've seen you mention the Wolf Mountain FS in other emails, but google
>> > isn't telling me a lot about it. Do you have a whitepaper? are there
>> > any published benchmark results? what sort of workloads do you
>> > benchmark?
>> >
>> > NATE
>> >
>> Wikipedia is the app for now. I have not done any benchmarks on the
>> FS side, just the capture side, and it's been transferred to another
>> entity. I have no idea what they are renaming it to, but I expect you
>> may hear about it soon. One of the incarnations of it is Solera's
>> DSFS, which can be reviewed here:
>>
>> www.soleranetworks.com
>
>
> so this is a single stream, write only? ...
>
>> I can sustain 850 MB/s throughput from user space with it -- about 5x
>> any other FS. On some hardware, I've broken the 1.25 GB/s
>> (gigabyte/second) barrier with it.
>
>
> and you're saying it scales to much higher multi-spindle
> single-machine throughput. cool.
>
> i'd love to see a whitepaper, or failing that, have an off-list
> discussion of your approach and the various kernel limitations you ran
> up against in testing. i don't suppose they invited you to the Kernel
> Summit to talk about it, heh.
>
> NATE
>
The patents have been filed for over a year, and will publish in several
weeks at uspto.gov -- that's the only acclaim I care for --
one that results in value for the industry and more patent protection
for Linux and profits for folks. No, I have not been invited
to the summit, probably because of the lawsuit I filed against some
folks who were threatening my family -- Peter Anvin booted
me off Kernel.org after allowing folks to pinch my code and copy my bash
history files all over the internet, and several folks
have stiffed me. I could care less. I keep creating cool technology,
make tons of money off of it, and I have cultivated an
excellent relationship with the Wikimedia Foundation, and I am now the
principal contributor on the Cherokee Wikipedia. Wales
even deleted the article folks had used to smear me and made folks
rewrite it. Wales is a very nice man and good dude.

I am content to contribute to Linux from a business viewpoint, and if
the treatment I received from Anvin is par for kernel.org accounts,
I don't care for one -- IP addresses are rather cheap on the
internet. I was and have remained loyal to Linux through it all.

I am appreciative of your interest. Check uspto.gov in the next few
weeks for published applications; it's all described there, distributed
architecture and all.

All my Wikilove.

Jeff

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Matthias Andree wrote:

>
>>Have you ever seen VxFS or WAFL in action?
>>
>>
>
>No I haven't. As long as they are commercial, it's not likely that I
>will.
>
>
WAFL was well done. It has several innovations that I admire,
including quota trees, non-support of fragments for performance reasons,
and the basic WAFL notion applied to an NFS RAID special (though
important) case.
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
>
>> A filesystem with a fixed number of inodes (= not readjustable while
>> mounted) is, ehr.. somewhat unusable for a lot of people with
>> big and *flexible* storage needs (talking about NetApp/EMC owners)
>
>Which is untrue at least for Solaris, which allows resizing a live
>file system. FreeBSD and Linux require an unmount.

Only for shrinking.


Jan Engelhardt
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Mon, Jul 31, 2006 at 05:59:58PM +0200, Adrian Ulrich wrote:
> Hello Matthias,
>
> > This looks like an education issue rather than a technical limit.
>
> We aren't talking about the same issue: I was asking to do it
> on-the-fly. Umounting the filesystem, running e2fsck and resize2fs
> is something different ;-)
>
> > Which is untrue at least for Solaris, which allows resizing a live
> > file system. FreeBSD and Linux require an unmount.
>
> Correct: You can add more inodes to a Solaris UFS on-the-fly if you are
> lucky enough to have some free space available.
>
> A colleague of mine happened to create a ~300 GB filesystem and started
> to migrate mailboxes (Maildir-style format = many small files (1-3 KB))
> to the new LUN. At about 70% the filesystem ran out of inodes; not a
> big deal with VxFS, because such a problem is fixable within seconds.
> What would have happened if he had used UFS? mkfs -G wouldn't work
> because he had no additional disk space left... *ouch*..
>
This case is solvable by planning. When you know that the new fs
must be created with all inodes from the start, simply count
how many you need before migration. (And add a decent safety margin.)
That's what I do with my home machine, as disks wear out every third
year or so. The tools for ext2/3 tell how many inodes are in use,
and the new fs can be made accordingly. The approach works for bigger
machines too, of course.
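
In sketch form (a hypothetical helper; os.statvfs and mke2fs -N are
real, the 1.5x margin is an arbitrary choice):

import os

def inodes_needed(mountpoint, margin=1.5):
    # inodes currently in use on the source filesystem
    st = os.statvfs(mountpoint)
    used = st.f_files - st.f_ffree
    return int(used * margin)

# size the new fs accordingly, e.g. via mke2fs -N:
print(f"mke2fs -N {inodes_needed('/')} /dev/sdb1")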

Helge Hafting

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Planning sometimes is not possible, especially in certain highly
stressed environments.

Just think: three years ago I had a database that was 2 TB. We supposed
it could grow to 6 TB in three years, but now it is 40 TB, because the
market situation has changed, and with it the number of users and their
needs.

You have to accept that when you deal with filesystems used for some
kinds of services, it is impossible to predict the growth rate, and
this is also true of the number of inodes used.



On Tue, 1 Aug 2006, Helge Hafting wrote:

> On Mon, Jul 31, 2006 at 05:59:58PM +0200, Adrian Ulrich wrote:
>> Hello Matthias,
>>
>>> This looks like an education issue rather than a technical limit.
>>
>> We aren't talking about the same issue: I was asking to do it
>> on-the-fly. Umounting the filesystem, running e2fsck and resize2fs
>> is something different ;-)
>>
>>> Which is untrue at least for Solaris, which allows resizing a live
>>> file system. FreeBSD and Linux require an unmount.
>>
>> Correct: You can add more inodes to a Solaris UFS on-the-fly if you are
>> lucky enough to have some free space available.
>>
>> A colleague of mine happened to create a ~300 GB filesystem and started
>> to migrate mailboxes (Maildir-style format = many small files (1-3 KB))
>> to the new LUN. At about 70% the filesystem ran out of inodes; not a
>> big deal with VxFS, because such a problem is fixable within seconds.
>> What would have happened if he had used UFS? mkfs -G wouldn't work
>> because he had no additional disk space left... *ouch*..
>>
> This case is solvable by planning. When you know that the new fs
> must be created with all inodes from the start, simply count
> how many you need before migration. (And add a decent safety margin.)
> That's what I do with my home machine, as disks wear out every third
> year or so. The tools for ext2/3 tell how many inodes are in use,
> and the new fs can be made accordingly. The approach works for bigger
> machines too, of course.
>
> Helge Hafting
>
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Alan, I have seen only anecdotal evidence against reiserfsck, and I have
seen formal tests from Vitaly (which it seems a user has replicated)
where our fsck did better than ext3's. Note that these tests are of the
latest fsck from us: I am sure everyone understands that it takes time
for an fsck to mature, and that our early fscks were poor. I will also
say that V4's fsck is more robust than V3's, because we made disk format
changes specifically to help fsck.

Now I am not dismissing your anecdotes, as I will never dismiss data I
have not seen, and it sounds like you have seen more data than most
people, but I must dismiss your explanation of them.

Being able to throw away all of the tree but the leaves and twigs with
extent pointers, and rebuild all of it, makes V4 very robust, more so
than ext3. As for this business of inodes not moving, I don't see what
the advantage is: we can lose the directory entry and rebuild just as
well as ext3, probably better, because we can at least figure out what
directory it was in.
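
In cartoon form, the bottom-up rebuild looks like this (a toy sketch,
nothing like reiser4's actual fsck; the fanout and key layout are
invented):

FANOUT = 4

def rebuild(leaves):
    """leaves: list of (first_key, payload) recovered from disk."""
    level = sorted(leaves)                  # leaves carry their own keys
    while len(level) > 1:
        # each parent stores the minimum key of its children
        level = [(group[0][0], group)
                 for group in (level[i:i + FANOUT]
                               for i in range(0, len(level), FANOUT))]
    return level[0]                         # freshly rebuilt root

root = rebuild([(k, f"data{k}") for k in range(10)])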

Vitaly can say all of this more expertly than I....

Hans
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Ric Wheeler wrote:

> Alan Cox wrote:
>
>>
>>
>> You do, it turns out. It's becoming an issue more and more that the
>> sheer amount of storage means that the undetected error rate from
>> disks, hosts, memory, cables and everything else is rising.
>
>
>
> I agree with Alan

You will want to try our compression plugin, it has an ECC for every 64k....

Hans
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Gregory Maxwell wrote:

> This is why ZFS offers block checksums... it can then try all the
> permutations of raid regens to find a solution which gives the right
> checksum.
>
ZFS performance is pretty bad in the only benchmark I have seen of it.
Does anyone have serious benchmarks of it? I suspect that our
compression plugin (with ECC) will outperform it.
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Bernd Schubert <bernd-schubert@gmx.de> wrote:
> On Monday 31 July 2006 21:29, Jan-Benedict Glaw wrote:
> > The point is that it's quite hard to really fuck up ext{2,3} with only
> > some KB being written while it seems (due to the
> > fragile^Wsophisticated on-disk data structures) that it's just easy to
> > kill a reiser3 filesystem.

> Well, I was once very 'lucky' and after a system crash (*) e2fsck put
> all files into lost+found. Sure, I never experienced this again, but I
> also never experienced something like this with reiserfs. So please, stop
> this kind of FUD against reiser3.6.

It isn't FUD. One data point doesn't allow you to draw conclusions.

Yes, I've seen/heard of ext2/ext3 failures and data loss too. But at least
the same number for ReiserFS. And I know it is outnumbered 10 to 1 or so in
my sample, so that would indicate a 10-fold higher probability of
catastrophic data loss, other factors being mostly the same.

> While filesystem speed is nice, it also would be great if reiser4.x would be
> very robust against any kind of hardware failures.

Can't have both.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
> > While filesystem speed is nice, it also would be great if reiser4.x would be
> > very robust against any kind of hardware failures.
>
> Can't have both.

..and some people simply don't care about this:

If you are running a 'big' Storage-System with battery protected
WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
you don't need your filesystem to be super-robust against bad sectors
and such stuff because:

a) You've paid enough money to let the storage care about
hardware issues.
b) If your storage is on fire you can do a failover using the mirror.
c) And if someone ran dd if=/dev/urandom of=/dev/sda you could
even roll back your snapshot.
(Btw: I did this once to a Reiser4 filesystem (overwrote about
1.2 GB). fsck.reiser4 --rebuild-sb was able to fix it.)


..but what you really need is a flexible and **fast** filesystem: like
Reiser4.

(Yeah.. yeah.. I know: ext3 is also flexible and fast.. but Reiser4
simply is *MUCH* faster than ext3 for 'my' workload/application).

Regards,
Adrian

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote:
> WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
> you don't need your filesystem to be super-robust against bad sectors
> and such stuff because:

You do, it turns out. It's becoming an issue more and more that the sheer
amount of storage means that the undetected error rate from disks,
hosts, memory, cables and everything else is rising.

There has been a great deal of discussion about this at the filesystem
and kernel summits - and data is getting kicked the way of networking -
end to end, not reliability in the middle.

The sort of changes this needs hit the block layer and every fs.

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote:
> On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote:
>> WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>> you don't need your filesystem to be super-robust against bad sectors
>> and such stuff because:
>
> You do, it turns out. It's becoming an issue more and more that the
> sheer amount of storage means that the undetected error rate from
> disks, hosts, memory, cables and everything else is rising.

Yikes. Undetected.

Wait, what? Disks, at least, would be protected by RAID. Are you
telling me RAID won't detect such an error?

It just seems wholly alien to me that errors would go undetected, and
we're OK with that, so long as our filesystems are robust enough. If
it's an _undetected_ error, doesn't that cause way more problems
(impossible problems) than FS corruption? Ok, your FS is fine -- but
now your bank database shows $1k less on random accounts -- is that ok?

> There has been a great deal of discussion about this at the filesystem
> and kernel summits - and data is getting kicked the way of networking -
> end to end, not reliability in the middle.

Sounds good, but I've never let discussions by people smarter than me
prevent me from asking the stupid questions.

> The sort of changes this needs hit the block layer and every fs.

Seems it would need to hit every application also...
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Horst H. von Brand wrote:
> Bernd Schubert <bernd-schubert@gmx.de> wrote:

>> While filesystem speed is nice, it also would be great if reiser4.x would be
>> very robust against any kind of hardware failures.
>
> Can't have both.

Why not? I mean, other than TANSTAAFL, is there a technical reason for
them being mutually exclusive? I suspect it's more "we haven't found a
way yet..."
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On 8/1/06, David Masover <ninja@slaphack.com> wrote:
> Yikes. Undetected.
>
> Wait, what? Disks, at least, would be protected by RAID. Are you
> telling me RAID won't detect such an error?

Unless the disk ECC catches it, RAID won't know anything is wrong.

This is why ZFS offers block checksums... it can then try all the
permutations of raid regens to find a solution which gives the right
checksum.

Every level of the system must be paranoid and take measures to avoid
corruption if the system is to avoid it... it's a tough problem. It
seems that the ZFS folks have addressed this challenge by building what
are classically separate layers into one part.
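
A toy sketch of that search, assuming XOR parity and CRC32 block
checksums (an illustration, not ZFS's actual algorithm):

import zlib

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def repair_stripe(data_blocks, parity, checksums):
    # Try regenerating each data block from the others plus parity;
    # accept the permutation whose result matches the recorded checksum.
    for bad in range(len(data_blocks)):
        others = [b for i, b in enumerate(data_blocks) if i != bad]
        candidate = xor_blocks(others + [parity])
        if zlib.crc32(candidate) == checksums[bad]:
            fixed = list(data_blocks)
            fixed[bad] = candidate
            return bad, fixed
    return None, data_blocks  # no single-block regeneration fits

blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(blocks)
crcs = [zlib.crc32(b) for b in blocks]
blocks[1] = b"BxBB"                      # silent corruption
bad, repaired = repair_stripe(blocks, parity, crcs)
assert bad == 1 and repaired[1] == b"BBBB"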
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote:
> Yikes. Undetected.
>
> Wait, what? Disks, at least, would be protected by RAID. Are you
> telling me RAID won't detect such an error?

Yes.

RAID deals with the case where a device fails. RAID 1 with 2 disks can
in theory detect an internal inconsistency but cannot fix it.

> we're OK with that, so long as our filesystems are robust enough. If
> it's an _undetected_ error, doesn't that cause way more problems
> (impossible problems) than FS corruption? Ok, your FS is fine -- but
> now your bank database shows $1k less on random accounts -- is that ok?

Not really, no. Your bank is probably using a machine (hopefully using a
machine) with ECC memory, ECC cache and the like. The UDMA and SATA
storage subsystems use CRC checksums between the controller and the
device. SCSI uses various similar systems - some older ones just use a
parity bit so have only a 50/50 chance of noticing a bit error.

Similarly the media itself is recorded with a lot of FEC (forward error
correction) so will spot most changes.
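
As a toy illustration of what FEC buys, here is a Hamming(7,4)
round-trip; the codes on real media are far stronger, this just shows
single-bit correction:

def encode(d):                 # d: four data bits
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]    # codeword positions 1..7

def correct(c):                # c: codeword with at most one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]         # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]         # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]         # parity over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3             # syndrome = error position
    if pos:
        c[pos - 1] ^= 1                    # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]        # recover the data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                               # one bit flipped "on the media"
assert correct(word) == [1, 0, 1, 1]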

Unfortunately when you throw this lot together with astronomical amounts
of data you get burned now and then, especially as most systems are not
using ECC ram, do not have ECC on the CPU registers and may not even
have ECC on the caches in the disks.

> > The sort of changes this needs hit the block layer and every fs.
>
> Seems it would need to hit every application also...

Depending how far you propagate it. Some people working with huge
data sets already write and check user-level CRC values for this reason
(in fact bitkeeper does it, for one example). It should be relatively
cheap to get much of that benefit without doing application to
application, just as TCP gets most of its benefit without going app to
app.
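
A minimal sketch of the user-level version, with a made-up .crc sidecar
convention:

import zlib, pathlib

def write_with_crc(path, data):
    p = pathlib.Path(path)
    p.write_bytes(data)
    # record the checksum next to the file
    p.with_name(p.name + ".crc").write_text(f"{zlib.crc32(data):08x}")

def read_with_crc(path):
    p = pathlib.Path(path)
    data = p.read_bytes()
    stored = int(p.with_name(p.name + ".crc").read_text(), 16)
    if zlib.crc32(data) != stored:
        raise IOError(f"CRC mismatch on {path}: silent corruption?")
    return data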

Alan

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote:
> On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote:
>> Yikes. Undetected.
>>
>> Wait, what? Disks, at least, would be protected by RAID. Are you
>> telling me RAID won't detect such an error?
>
> Yes.
>
> RAID deals with the case where a device fails. RAID 1 with 2 disks can
> in theory detect an internal inconsistency but cannot fix it.

Still, if it does that, that should be enough. The scary part wasn't
that there's an internal inconsistency, but that you wouldn't know.

And it can fix it if you can figure out which disk went. Or give it 3
disks and it should be entirely automatic -- admin gets paged, admin
hotswaps in a new disk, done.
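
In sketch form, the difference between two and three mirrors (a toy
vote over the copies read back, not what md actually does):

from collections import Counter

def read_mirrored(copies):
    # copies: the same block as read from each mirror
    value, n = Counter(copies).most_common(1)[0]
    if n == len(copies):
        return value                # all mirrors agree
    if n > len(copies) // 2:
        return value                # majority wins; page the admin
    raise IOError("mirrors disagree with no majority: detect, can't fix")

assert read_mirrored([b"x", b"x", b"bad"]) == b"x"   # 3 disks: self-heals
try:
    read_mirrored([b"x", b"bad"])                    # 2 disks: detect only
except IOError:
    pass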

>> we're OK with that, so long as our filesystems are robust enough. If
>> it's an _undetected_ error, doesn't that cause way more problems
>> (impossible problems) than FS corruption? Ok, your FS is fine -- but
>> now your bank database shows $1k less on random accounts -- is that ok?
>
> Not really, no. Your bank is probably using a machine (hopefully using a
> machine) with ECC memory, ECC cache and the like. The UDMA and SATA
> storage subsystems use CRC checksums between the controller and the
> device. SCSI uses various similar systems - some older ones just use a
> parity bit so have only a 50/50 chance of noticing a bit error.
>
> Similarly the media itself is recorded with a lot of FEC (forward error
> correction) so will spot most changes.
>
> Unfortunately when you throw this lot together with astronomical amounts
> of data you get burned now and then, especially as most systems are not
> using ECC ram, do not have ECC on the CPU registers and may not even
> have ECC on the caches in the disks.

It seems like this is the place to fix it, not the software. If the
software can fix it easily, great. But I'd much rather rely on the
hardware looking after itself, because when hardware goes bad, all bets
are off.

Specifically, it seems like you do mention lots of hardware solutions,
that just aren't always used. It seems like storage itself is getting
cheap enough that it's time to step back a year or two in Moore's Law to
get the reliability.

>>> The sort of changes this needs hit the block layer and every fs.
>> Seems it would need to hit every application also...
>
> Depending how far you propagate it. Some people working with huge
> data sets already write and check user-level CRC values for this reason
> (in fact bitkeeper does it for one example). It should be relatively
> cheap to get much of that benefit without doing application to
> application just as TCP gets most of its benefit without going app to
> app.

And yet, if you can do that, I'd suspect you can, should, must do it at
a lower level than the FS. Again, FS robustness is good, but if the
disk itself is going, what good is having your directory (mostly) intact
if the files themselves have random corruptions?

If you can't trust the disk, you need more than just an FS which can
mostly survive hardware failure. You also need the FS itself (or maybe
the block layer) to support bad block relocation and all that good
stuff, or you need your apps designed to do that job by themselves.

It just doesn't make sense to me to do this at the FS level. You
mention TCP -- ok, but if TCP is doing its job, I shouldn't also need to
implement checksums and other robustness at the protocol layer (http,
ftp, ssh), should I? Because in this analogy, it looks like TCP is the
"block layer" and a protocol is the "fs".

As I understand it, TCP only lets the protocol/application know when
something's seriously FUBARed and it has to drop the connection.
Similarly, the FS (and the apps) shouldn't have to know about hardware
problems until it really can't do anything about it anymore, at which
point the right thing to do is for the FS and apps to go "oh shit" and
drop what they're doing, and the admin replaces hardware and restores
from backup. Or brings a backup server online, or...



I guess my main point was that _undetected_ problems are serious, but if
you can detect them, and you have at least a bit of redundancy, you
should be good. For instance, if your RAID reports errors that it can't
fix, you bring that server down and let the backup server run.
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Gregory Maxwell wrote:
> On 8/1/06, David Masover <ninja@slaphack.com> wrote:
>> Yikes. Undetected.
>>
>> Wait, what? Disks, at least, would be protected by RAID. Are you
>> telling me RAID won't detect such an error?
>
> Unless the disk ECC catches it, RAID won't know anything is wrong.
>
> This is why ZFS offers block checksums... it can then try all the
> permutations of raid regens to find a solution which gives the right
> checksum.

Isn't there a way to do this at the block layer? Something in
device-mapper?

> Every level of the system must be paranoid and take measures to avoid
> corruption if the system is to avoid it... it's a tough problem. It
> seems that the ZFS folks have addressed this challenge by building what
> are classically separate layers into one part.

Sounds like bad design to me, and I can point to the antipattern, but
what do I know?
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
> You do, it turns out. It's becoming an issue more and more that the
> sheer amount of storage means that the undetected error rate from
> disks, hosts, memory, cables and everything else is rising.

IMHO the possibility of hitting such a random, so-far-undetected
corruption is very low with one of the big/expensive RAID systems, as
they do fancy stuff like 'disk scrubbing' and usually fail disks
at very early stages..

* I've seen storage systems from a BIG vendor die due to
firmware bugs
* I've seen FC cards die.. SAN switches rebooted.. People used
my cables to do rope skipping
* We had fire, a non-working UPS and faulty diesel generators..

but so far the FSes (and applications) on the storage never
complained about corrupted data.

..YMMV..

Btw: I don't think that Reiserfs really behaves this badly
with broken hardware. So far, Reiser3 has survived 2 broken hard drives
without problems, while I've seen ext2/3 die 4 times so far...
(= everything inside /lost+found). Reiser4 survived
# mkisofs . > /dev/sda

Lucky me.. maybe..


To get back on-topic:

Some people try very hard to claim that the world doesn't need
Reiser4 and that you can do everything with ext3.

Ext3 may be fine for them, but some people (like me) really need Reiser4
because they have applications/workloads that won't work well (fast) on ext3.

Why is it such a big thing to include a filesystem?
Even if it's unstable: does anyone care? E.g. the HFS+ driver
is buggy (it corrupted the FS of my OSX installation 3 times so far), but
does this bugginess affect people *not* using it? No.

Regards,
Adrian
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
> > This is why ZFS offers block checksums... it can then try all the
> > permutations of raid regens to find a solution which gives the right
> > checksum.
>
> Isn't there a way to do this at the block layer? Something in
> device-mapper?

Remember: Sun's new filesystem + Sun's new volume manager = ZFS

Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote:
> On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote:
>
>>WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>>you don't need your filesystem to be super-robust against bad sectors
>>and such stuff because:
>
>
> You do, it turns out. It's becoming an issue more and more that the
> sheer amount of storage means that the undetected error rate from
> disks, hosts, memory, cables and everything else is rising.


I agree with Alan despite being an enthusiastic supporter of neat array
based technologies.

Most people use absolutely giant disks in laptops and desktop systems
(300GB & 500GB are common, 750GB on the way). File systems need to be as
robust as possible for users of these systems as people are commonly
storing personal "critical" data like photos mostly on these unprotected
drives.

Even for the high end users, array based mirroring and so on can only do
so much to protect you.

Mirroring a corrupt file system to a remote data center will mirror your
corruption.

Rolling back to a snapshot typically only happens when you notice a
corruption which can go undetected for quite a while, so even that will
benefit from having "reliability" baked into the file system (i.e., it
should grumble about corruption to let you know that you need to roll
back or fsck or whatever).

An even larger issue is that our tools, like fsck, which are used to
uncover these silent corruptions need to scale up to the point that they
can uncover issues in minutes instead of days. A lot of the focus at
the file system workshop was around how to dramatically reduce the
repair time of file systems.

In a way, having super reliable storage hardware is only as good as the
file system layer on top of it - reliability needs to be baked into the
entire IO system stack...

ric


Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
Ric Wheeler wrote:
> Alan Cox wrote:
>> On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote:
>>
>>> WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>>> you don't need your filesystem to be super-robust against bad sectors
>>> and such stuff because:
>>
>>
>> You do, it turns out. It's becoming an issue more and more that the
>> sheer amount of storage means that the undetected error rate from
>> disks, hosts, memory, cables and everything else is rising.

> Most people use absolutely giant disks in laptops and desktop systems
> (300GB & 500GB are common, 750GB on the way). File systems need to be as
> robust as possible for users of these systems as people are commonly
> storing personal "critical" data like photos mostly on these unprotected
> drives.

Their loss. Robust FS is good, but really, if you aren't doing backup,
you are going to lose data. End of story.

> Even for the high end users, array based mirroring and so on can only do
> so much to protect you.
>
> Mirroring a corrupt file system to a remote data center will mirror your
> corruption.

Assuming it's undetected. Why would it be undetected?

> Rolling back to a snapshot typically only happens when you notice a
> corruption which can go undetected for quite a while, so even that will
> benefit from having "reliability" baked into the file system (i.e., it
> should grumble about corruption to let you know that you need to roll
> back or fsck or whatever).

Yes, the filesystem should complain about corruption. So should the
block layer -- if you don't trust the FS, use a checksum at the block
layer. So should...

There are just so many other, better places to do this than the FS. The
FS should complain, yes, but if the disk is bad, there's going to be
corruption.

> An even larger issue is that our tools, like fsck, which are used to
> uncover these silent corruptions need to scale up to the point that they
> can uncover issues in minutes instead of days. A lot of the focus at
> the file system workshop was around how to dramatically reduce the
> repair time of file systems.

That would be interesting. I know from experience that fsck.reiser4 is
amazing. Blew away my data with something akin to an rm -rf, and fsck
fixed it. Tons of crashing/instability in the early days, but only once
-- before they even had a version instead of a date, I think -- did I
ever have a case where fsck couldn't fix it.

So I guess the next step would be to make fsck faster. Someone
mentioned a fsck that repairs the FS in the background?

> In a way, having super reliable storage hardware is only as good as the
> file system layer on top of it - reliability needs to be baked into the
> entire IO system stack...

That bit makes no sense. If you have super reliable storage (never
dies), and your FS is also reliable (never dies unless hardware does,
but may go bat-shit insane when hardware dies), then you've got a super
reliable system.

You're right, running Linux's HFS+ or NTFS write support is generally a
bad idea, no matter how reliable your hardware is. But this discussion
was not about whether an FS is stable, but how well an FS survives
hardware corruption.
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
David Masover <ninja@slaphack.com> writes:

>> RAID deals with the case where a device fails. RAID 1 with 2 disks
>> can
>> in theory detect an internal inconsistency but cannot fix it.
>
> Still, if it does that, that should be enough. The scary part wasn't
> that there's an internal inconsistency, but that you wouldn't know.

RAID1 can do that in theory, but in practice there is no verification,
so the other disk can perform another read simultaneously (thus
increasing performance).

Some high-end systems, maybe.

That would be hardly economical. Per-block checksums (like those used by
ZFS) are a different story; they add only a little additional load.

> And it can fix it if you can figure out which disk went. Or give it 3
> disks and it should be entirely automatic -- admin gets paged, admin
> hotswaps in a new disk, done.

Yep, that could be done. Or with 2 disks with block checksums.
Actually, while I don't exactly buy their ads, I think ZFS employs
some useful ideas.

> And yet, if you can do that, I'd suspect you can, should, must do it
> at a lower level than the FS. Again, FS robustness is good, but if
> the disk itself is going, what good is having your directory (mostly)
> intact if the files themselves have random corruptions?

With per-block checksums you will know. Of course, that's still not an
end-to-end checksum.

> If you can't trust the disk, you need more than just an FS which can
> mostly survive hardware failure. You also need the FS itself (or
> maybe the block layer) to support bad block relocation and all that
> good stuff, or you need your apps designed to do that job by
> themselves.

Drives have internal relocation mechanisms; I don't think the
filesystem needs to duplicate them (though it should try to work
with bad blocks - relocations are possible on write).

> It just doesn't make sense to me to do this at the FS level. You
> mention TCP -- ok, but if TCP is doing its job, I shouldn't also need
> to implement checksums and other robustness at the protocol layer
> (http, ftp, ssh), should I?

Sure you have to, if you value your data.

> Similarly, the FS (and the apps) shouldn't have to know
> about hardware problems until it really can't do anything about it
> anymore, at which point the right thing to do is for the FS and apps
> to go "oh shit" and drop what they're doing, and the admin replaces
> hardware and restores from backup. Or brings a backup server online,
> or...

I don't think so. Going read-only if the disk returns a write error,
OK. But taking the fs offline? Why?

Continuous backups (or rather transaction logs) are possible, but
who has them? Do you have them? Would you throw away several hours
of work just because some file (or, say, an unused area) contained an
unreadable block (which could well be a transient problem, and/or
could be corrected by a write)?
--
Krzysztof Halasa
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 01 Aug 2006, David Masover wrote:

> >RAID deals with the case where a device fails. RAID 1 with 2 disks can
> >in theory detect an internal inconsistency but cannot fix it.
>
> Still, if it does that, that should be enough. The scary part wasn't
> that there's an internal inconsistency, but that you wouldn't know.

You won't usually know, unless you run a consistency check: RAID-1 will
only read from one of the two drives for speed - except if you make the
system check consistency as it goes, which would imply waiting for both
disks at the same time. And in that case, you'd better look for drives
that allow synchronizing their platter stacks (spindle sync) in order to
avoid the read access penalty that waiting for two drives entails.

> And it can fix it if you can figure out which disk went.

If it's decent and detects a bad block, it'll log it and rewrite it with
data from the mirror and let the drive do the remapping through ARWE.

> >Depending how far you propagate it. Some people working with huge
> >data sets already write and check user-level CRC values for this reason
> >(in fact bitkeeper does it, for one example). It should be relatively
> >cheap to get much of that benefit without doing application to
> >application, just as TCP gets most of its benefit without going app to
> >app.
>
> And yet, if you can do that, I'd suspect you can, should, must do it at
> a lower level than the FS. Again, FS robustness is good, but if the
> disk itself is going, what good is having your directory (mostly) intact
> if the files themselves have random corruptions?

Berkeley DB can, since version 4.1 (IIRC), write checksums (newer
versions document this as SHA1) on its database pages, to detect
corruptions and writes that were supposed to be atomic but failed
(because you cannot write 4K or 16K atomically on a disk drive).
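
A toy illustration of why that catches torn writes (not Berkeley DB's
actual page format; the 4K page with a SHA-1 tail is an assumption):

import hashlib

PAGE = 4096
BODY = PAGE - 20            # last 20 bytes hold a SHA-1 of the body

def make_page(body):
    body = body.ljust(BODY, b"\0")[:BODY]
    return body + hashlib.sha1(body).digest()

def page_ok(page):
    return hashlib.sha1(page[:BODY]).digest() == page[BODY:]

page = make_page(b"row data...")
assert page_ok(page)
torn = page[:2048] + b"\0" * (PAGE - 2048)   # power fails mid-write
assert not page_ok(torn)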

--
Matthias Andree
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 01 Aug 2006, Ric Wheeler wrote:

> Mirroring a corrupt file system to a remote data center will mirror your
> corruption.
>
> Rolling back to a snapshot typically only happens when you notice a
> corruption which can go undetected for quite a while, so even that will
> benefit from having "reliability" baked into the file system (i.e., it
> should grumble about corruption to let you know that you need to roll
> back or fsck or whatever).
>
> An even larger issue is that our tools, like fsck, which are used to
> uncover these silent corruptions need to scale up to the point that they
> can uncover issues in minutes instead of days. A lot of the focus at
> the file system workshop was around how to dramatically reduce the
> repair time of file systems.

Which makes me wonder if backup systems shouldn't help with this. If
they are reading the whole file anyway, they can easily compute strong
checksums as they go, record them for later use, and check some
percentage of unchanged files every day to complain about corruptions.
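
A sketch of that scrub-during-backup idea (the JSON index file and the
check-everything-each-run policy are illustrative choices):

import hashlib, json, os, pathlib

INDEX = "backup-index.json"

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(tree):
    old = json.load(open(INDEX)) if os.path.exists(INDEX) else {}
    new = {}
    for p in pathlib.Path(tree).rglob("*"):
        if not p.is_file():
            continue
        st = p.stat()
        rec = {"mtime": st.st_mtime, "size": st.st_size, "sha": digest(p)}
        new[str(p)] = rec
        prev = old.get(str(p))
        # unchanged metadata but changed content = suspected silent corruption
        if prev and prev["mtime"] == rec["mtime"] \
                and prev["size"] == rec["size"] \
                and prev["sha"] != rec["sha"]:
            print(f"silent corruption suspected: {p}")
    with open(INDEX, "w") as f:
        json.dump(new, f)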

--
Matthias Andree
Re: the "'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 01 Aug 2006, Hans Reiser wrote:

> You will want to try our compression plugin, it has an ECC for every 64k....

What kind of forward error correction would that be, and how much and
what failure patterns can it correct? URL suffices.

--
Matthias Andree