Mailing List Archive

DRBD fsync() seems to return before writing to disk
I'm attempting to set up an NFS cluster, with a block device stack that
looks like this:

ext4
LVM
DRBD
LVM
MD RAID
4 SATA drives on each node

I want to guarantee that fsync() doesn't return until writes have made
it to physical storage. In particular, I care about PostgreSQL database
integrity.

Postgres includes a simple test program which performs timing of fsync()
operations [1]. I don't have any difficult performance requirements
here, so I'm not using any battery-backed caches. Consequently, if
fsync() is working, I shouldn't be able to complete more than one fsync()
per revolution of my platters. For the 7200 RPM drives I'm using, that's
about 8.3 ms.

I'm running Debian Squeeze with Linux 3.2.0-0.bpo.2-amd64 from
squeeze-backports, which is necessary to get write barrier support on MD
RAID devices. drbdadm status indicates DRBD version 8.3.11.

Now, I've verified that with an ext4 on the first three layers of my
storage stack (LVM on MD RAID on SATA drives) I get a working fsync(). I
know this because running test_fsync[1] gives latencies of just a bit
over 8 ms. However, performing a similar test with the full stack,
including DRBD, gives much lower latency numbers (1.2 - 2.0 ms),
indicating that fsync() must not be working, because it's physically
impossible for my 7200 RPM SATA drives to perform synchronous writes
that fast. At least, that's how I understand it.
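
For anyone who wants to reproduce this comparison without the Postgres
tool, a rough equivalent with plain dd should behave the same way (the
mount points are hypothetical and the numbers will vary):

# on the ext4-on-LVM-on-MD mount: expect roughly 8 ms per 8 kB write,
# i.e. on the order of 2 seconds for 256 writes
dd if=/dev/zero of=/mnt/md-test/syncfile bs=8192 count=256 oflag=sync

# on the ext4-on-DRBD mount: if sync requests were being dropped
# somewhere, this would finish several times faster
dd if=/dev/zero of=/mnt/drbd-test/syncfile bs=8192 count=256 oflag=sync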

Looking for previous answers, I found [2], which says that DRBD doesn't
offer write barriers to upper layers. I also found [3], which suggests
to me that fsync() should work as I'm expecting it to work. Both of
these posts are pretty old. Lastly, I found [4], which suggests that
Linux is horribly broken and fsync doesn't really do anything, anyhow.

To be honest, I'm a little unclear on the relationship between fsync()
and write barriers. Regardless, DRBD is clearly having some effect,
since it makes fsync() about 7 times faster, which can't be good. So I'm
wondering, is this expected behavior? Is my data integrity at risk? How
can an application writing to a DRBD device be sure data has been
written to nonvolatile storage, fsync() or something else? Is this
something that would change if I moved to DRBD 8.4?

[1] http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
[2] http://lists.linbit.com/pipermail/drbd-user/2008-September/010306.html
[3] http://lists.linbit.com/pipermail/drbd-user/2006-December/006105.html
[4] http://milek.blogspot.com/2010/12/linux-osync-and-write-barriers.html

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: DRBD fsync() seems to return before writing to disk
Maybe I was too verbose, so I'll rephrase more briefly:

My tests indicate that fsync() on a DRBD device is returning before data
has been written to non-volatile storage. This is unacceptable for any
application that makes ACID guarantees. Is this something that should be
working? Is there something I must do to enable the behavior I desire
from fsync()?

Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 03:58 PM, Phil Frost wrote:
> Maybe I was too verbose, so I'll rephrase more briefly:

Oh, your verbosity was just fine, and I'm eagerly waiting for responses
as well.

I assume that few dare comment on such low-level tech issues. I further
assume that the guys over at Linbit will speak up before long.

Cheers,
Felix

Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 08:58 AM, Phil Frost wrote:

> Maybe I was too verbose, so I'll rephrase more briefly:

I think the problem here is that you didn't provide your DRBD config.
For all we know, you did this:

disk {
    no-disk-barrier;
    no-disk-flushes;
    no-disk-drain;
    no-md-flushes;
}

Which is an explicit no-no and would definitely break fsync. But I also
give you this from the manual:

http://www.drbd.org/users-guide/re-drbdconf.html

"Unfortunately device mapper (LVM) might not support barriers."

That leaves you with either flushes or drain as the fsync method. Since
you don't have a BBWC, you must have at least one of those enabled. Then
we can talk.
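
A quick sanity check on your side (assuming the resource is attached and
you haven't overridden anything): DRBD logs which write-ordering method
it settled on, so the kernel log should show something like this:

dmesg | grep -i "write ordering"
# expect something like: block drbd0: Method to ensure write ordering: flush
# ("barrier" or "drain" may appear instead, depending on what the backing
# device supports)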

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 10:43 AM, Shaun Thomas wrote:
> I think the problem here is that you didn't provide your DRBD config.
> For all we know, you did this:
>
> disk {
>     no-disk-barrier;
>     no-disk-flushes;
>     no-disk-drain;
>     no-md-flushes;
> }
> [...]
> "Unfortunately device mapper (LVM) might not support barriers."

Ah sorry, I forgot to include that with the original post. Full
configuration copied below, but I haven't disabled any of the methods
you mention.

I have read that part of the manual, but I think in my case, LVM does
support barriers. It looks to me like they began to be supported in
2.6.29 [1] and all the kinks were ironed out by 2.6.31-rc1 [2]. I'm
running a 3.2 kernel (though even Debian squeeze runs 2.6.32) so these
issues should be history.

Further, the fsync() latency I observed running ext4 on MD and LVM (but
not DRBD devices) empirically suggests that barriers are working for
DRBD's underlying block device, at least given my understanding that
working barriers are required for an fsync() that really does flush to
non-volatile storage.

[1] http://lwn.net/Articles/326597/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=9554

DRBD config follows:

global { usage-count no; }
common {
    protocol C;
    handlers { }
    startup { }
    disk { on-io-error detach; }
    net { cram-hmac-alg sha1; }
    syncer {
        verify-alg crc32c;
        rate 5M;
    }
}
resource nfsexports {
    device minor 0;
    net { shared-secret secret; }
    on storage01 {
        disk /dev/storage01/nfsexports;
        flexible-meta-disk /dev/storage01/nfsexports-drbdmd;
        address 10.0.0.7:7789;
    }
    on storage02 {
        disk /dev/storage02/nfsexports;
        flexible-meta-disk /dev/storage02/nfsexports-drbdmd;
        address 10.0.0.9:7789;
    }
}


Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 10:59 AM, Phil Frost wrote:
> On 06/20/2012 10:43 AM, Shaun Thomas wrote:
>> "Unfortunately device mapper (LVM) might not support barriers."
>
> Further, the fsync() latency I observed running ext4 on MD and LVM
> (but not DRBD devices) empirically suggests that barriers are working
> for DRBD's underlying block device, at least given my understanding
> that working barriers are required for an fsync() that really does
> flush to non-volatile storage.


Just to check my sanity further, I just did a test of a DRBD device
directly on a partition of a SATA drive. No LVM, no MD; just a plain
SATA drive and DRBD. I reached the same conclusion, DRBD is not
observing fsync(), O_SYNC, etc., and it's not for lack of support on the
underlying device. Can anyone reproduce?





pfrost@storage02:/mnt/synctest$ cat /etc/drbd.d/test.res
resource test {
    device minor 1;
    net {
        shared-secret BTawM41lfw8L0RTKBXhOiGK4lWb6dZTqJGNaGJJwF0pCfNhasdfhB5qjBjgZL4O;
    }

    on storage01 {
        disk /dev/sda2;
        meta-disk internal;
        address 10.0.0.7:7790;
    }
    on storage02 {
        disk /dev/sda2;
        meta-disk internal;
        address 10.0.0.9:7790;
    }
}
pfrost@storage02:~$ sudo drbdadm -- --overwrite-data-of-peer primary test
pfrost@storage02:~$ sudo mkfs -t ext4 /dev/drbd/by-res/test
pfrost@storage02:~$ sudo mount /dev/drbd/by-res/test -o barrier=1 -t ext4 /mnt/synctest/
pfrost@storage02:~$ cd /mnt/synctest/
pfrost@storage02:/mnt/synctest$ sudo ~/test_fsync -f ./test_fsync
Simple write timing:
write 0.002742

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
write, fsync, close 0.196210
write, close, fsync 0.175143

Compare one o_sync write to two:
one 16k o_sync write 0.175284
two 8k o_sync writes 0.300499

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.157953
write, fdatasync 0.167023
write, fsync 0.169357

Compare file sync methods with two 8k writes:
(o_dsync unavailable)
open o_sync, write 0.300676
write, fdatasync 0.182892
write, fsync 0.184106

pfrost@storage02:/mnt/synctest$ cd ..
pfrost@storage02:/mnt$ sudo umount synctest/
pfrost@storage02:/mnt$ sudo drbdadm down test
pfrost@storage02:/mnt$ sudo mkfs -t ext4 /dev/sda2
pfrost@storage02:/mnt$ cd /mnt/synctest/
pfrost@storage02:/mnt/synctest$ sudo ~/test_fsync -f ./test_fsync
Simple write timing:
write 0.002765

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
write, fsync, close 8.561847
write, close, fsync 8.490528

Compare one o_sync write to two:
one 16k o_sync write 8.365526
two 8k o_sync writes 20.089010

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 8.390429
write, fdatasync 8.407142
write, fsync 8.540648

Compare file sync methods with two 8k writes:
(o_dsync unavailable)
open o_sync, write 16.739327
write, fdatasync 8.423901
write, fsync 8.515552


Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 11:51 AM, Phil Frost wrote:

> Just to check my sanity further, I just did a test of a DRBD device
> directly on a partition of a SATA drive. No LVM, no MD; just a plain
> SATA drive and DRBD. I reached the same conclusion, DRBD is not
> observing fsync(), O_SYNC, etc., and it's not for lack of support on
> the underlying device. Can anyone reproduce?

Interesting. I don't have time to test that right now, but now I'm
curious how this will turn out. I wonder if this is at all related to
the problems we had with 8.4.1 causing read errors with the new
read-balancing settings when the OS cache is dropped.

I guess it would be pretty embarrassing if DRBD wasn't honoring fsync
under certain kernel combinations. For what it's worth, we have 8.3.10
on our existing cluster pending upgrade, and fsync times are definitely
longer when both nodes are present. I wouldn't be surprised if something
in kernel 3.2.0 causes a subtle break here.

That said, most of the complaints on this mailing list tend to
concentrate on how slow DRBD has made an existing setup. Now you come
along and start claiming the opposite. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 01:43 PM, Shaun Thomas wrote:
> I guess it would be pretty embarrassing if DRBD wasn't honoring fsync
> under certain kernel combinations. For what it's worth, we have 8.3.10
> on our existing cluster pending upgrade, and fsync times are
> definitely longer when both nodes are present. I wouldn't be surprised
> if something in kernel 3.2.0 causes a subtle break here.

The 2nd test I did, the one directly on a SATA disk, was with just one
node present. I just never bothered to configure the 2nd node for the
quick test.

What OS and kernel version are you running? Also, when you say slower
with both nodes present, is it slower enough to indicate that fsync is
working? Or just slower because of network latency? In my case I expect
latency would go up at least by the network RTT with both nodes present
as well, but the RTT of 180us is almost two orders of magnitude smaller
than the mean random access latency of my drives. In the case of small
synchronous writes, I'd expect any difference in latency to be
indistinguishable from measurement error.

For that matter, is anyone absolutely sure that fsync is working on
their DRBD devices? Any reproducible tests demonstrating such? At this
point, I'm not sure if the behavior I'm expecting is supported in DRBD
at all, or if it's something about my configuration or environment
that's causing the issue. Before I go on a witch hunt installing
different versions of things and pulling levers here and there I'd like
to know that success is at least possible.

Re: DRBD fsync() seems to return before writing to disk
On 06/20/2012 02:34 PM, Phil Frost wrote:

> What OS and kernel version are you running? Also, when you say slower
> with both nodes present, is it slower enough to indicate that fsync
> is working?

They're old CentOS boxes running 2.6.18. As far as the latency goes, I'm
not sure, but I doubt it's responsible; we have a direct 10G link between
the two machines. For what it's worth, we are running XFS, so it's not a
very good comparison. I don't use ext4 in production environments.

> In the case of small synchronous writes, I'd expect any difference in
> latency to be indistinguishable from measurement error.

FusionIO drives here. Our PostgreSQL sync calls, based on checkpoint
logging, are almost 2x slower with a connected DRBD resource than when
it's standalone. About what you'd expect, actually.

We also have a test cluster on a simple RAID0 with a much newer OS
running a 3.2 kernel. That's where I noticed that, during a manual
failover, Pacemaker would attempt to start the LVM service before DRBD
was done promoting. I had to add a 2-second start delay, which tells me
the DRBD RA is not waiting long enough before returning control to
Pacemaker. I suppose that could be sync-related, but I haven't run any
specific tests.

> For that matter, is anyone absolutely sure that fsync is working on
> their DRBD devices?

I can confirm it works in our existing setup. We run a 10ktps OLTP
database on Postgres and have experienced a few unplanned outages where
we failed over to the other node. If DRBD was having trouble, we'd have
ripped it out and bought a SAN long ago.

I haven't done enough tests in our dev environment with the updated
kernel/DRBD to know for sure. If I have time later, I'll run a quick dd
test.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

Re: DRBD fsync() seems to return before writing to disk
Ok, Phil.

I ran some tests on our (crappy) RAID0, comprised of two 300GB SAS
drives. Here's what I got for varying block sizes:

DRBD connected:

8K - 26MB/s
16K - 39MB/s
32K - 65MB/s
64K - 92MB/s
128K - 121MB/s
256K - 137MB/s
512K - 152MB/s
1M - 177MB/s
2M - 196MB/s
4M - 197MB/s

DRBD disconnected:

8K - 57MB/s
16K - 95MB/s
32K - 144MB/s
64K - 196MB/s
128K - 242MB/s
256K - 272MB/s
512K - 285MB/s
1M - 292MB/s
2M - 297MB/s
4M - 287MB/s

Those disconnected numbers look not too far off from raw disk
performance in a simple 2-disk RAID0. To be fair, this is with DRBD
8.4.1, and with the following disk options:

disk {
    disk-barrier no;
    disk-flushes no;
    md-flushes no;
}

And we could do all that because we have capacitor-backed RAID controllers.

We're seeing pretty much *exactly* what you should expect. This is with
the 3.2.0 kernel as well. To make sure this wasn't fake, I monitored
iostat and watched Dirty and Writeback in /proc/meminfo. These are
legit numbers obtained directly from dd with oflag=sync for all tests.
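
In case anyone wants to replicate, each run amounts to something like
this, with bs varied per row (the file name and count are placeholders):

dd if=/dev/zero of=testfile bs=8192 count=1024 oflag=sync
# in a second terminal, to confirm nothing piles up in the page cache:
grep -E '^(Dirty|Writeback):' /proc/meminfo
iostat -xm 1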

So, I'm not sure about your own setup, but I can confirm that DRBD does
honor sync in our case.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

Re: DRBD fsync() seems to return before writing to disk
On Jun 20, 2012, at 6:19 PM, Shaun Thomas wrote:

> I ran some tests on our (crappy) RAID0, comprised of two 300GB SAS drives. Here's what I got for varying block sizes:
>
> DRBD connected:
> 8K - 26MB/s
> [...]
> DRBD disconnected:
> 8K - 57MB/s
> [...]
> Those disconnected numbers look not too far off from raw disk performance in a simple 2-disk RAID0.
> [...]
> And we could do all that because we have capacitor-backed RAID controllers.
>
> We're seeing pretty much *exactly* what you should expect. This is with the 3.2.0 kernel as well. To make sure this wasn't fake, I monitored iostat and watched Dirty and Writeback in /proc/meminfo. These are legit numbers obtained directly from dd with oflag=sync for all tests.
>
> So, I'm not sure about your own setup, but I can confirm that DRBD does honor sync in our case.

I think this demonstrates that O_SYNC causes writes to happen immediately rather than accumulating in the pagecache (I assume you observed that Dirty stayed very low), but I don't think it demonstrates anything about DRBD issuing a sync to the underlying device when it receives a sync itself. We already know O_SYNC or fsync will flush the pagecache; this is a function of the Linux VFS and is not visible from DRBD's perspective. What matters is whether DRBD, when it receives a sync operation, passes it through to the lower layer; but as long as the BBU remains enabled, your RAID controller will treat all sync operations as no-ops, because there's no volatile cache on the RAID device to be synced, so there's no change in behavior that we could observe.

To test my hypothesis, you'd have to disable the write-back cache. What you should see is a drop in performance of one or two orders of magnitude, going from your measured 7296 IOPS (57MB/s*1024/8k), an impossibility for any spinning media on this planet, to something limited by the rotation speed. If you had, for example, a 10000 RPM drive, anything faster than 10000/60, i.e. about 166 IOPS or 6 ms per IO, means the IO syscalls must not be blocking until the data has reached nonvolatile storage (as requested by O_SYNC). You might also get as much as double the IOPS with RAID 0 if you are performing small sequential writes rather than re-writing the same block between each sync, but even so that is well over an order of magnitude slower than what you just measured.

If you don't see a huge drop in performance after disabling the battery-backed write-back cache, then we can conclude that DRBD or something else is eating the sync operations between userland (fsync(), open(O_SYNC), etc.) and the underlying device. Not a problem if you have money to spend on battery-backed cache, and can tolerate the added risk of power loss when the battery has failed, or is reconditioning, or is being replaced, or power loss longer than battery hold time, but for everyone else, it's a big problem.

Re: DRBD fsync() seems to return before writing to disk
On 06/21/2012 01:10 PM, Phillip Frost wrote:
> Not a problem if you have money to spend on battery-backed cache, and can tolerate the added risk of power loss when the battery has failed, or is reconditioning, or is being replaced, or power loss longer than battery hold time, but for everyone else, it's a big problem.

I'd argue that it's problematic even then. If your assumptions are
correct, I don't see anything here that asserts that the sync call has
even reached the RAID Controller cache.

It all seems quite suspicious though - a bug like that should have been
noticed earlier, or so I'd believe.

Re: DRBD fsync() seems to return before writing to disk
> To test my hypothesis, you'd have to disable the write-back cache.

You're right of course. I've done a couple more tests:

uname26 hpacucli ctrl=0 ld 2 modify aa=disable

dd if=/dev/zero of=foo bs=8192 count=1024 oflag=sync
1024+0 records in
1024+0 records out
8388608 bytes (8.4 MB) copied, 11.9341 s, 703 kB/s

703 kB/s? Ouch!

uname26 hpacucli ctrl=0 ld 2 modify aa=disable

dd if=/dev/zero of=foo bs=8192 count=8192 oflag=sync
8192+0 records in
8192+0 records out
67108864 bytes (67 MB) copied, 1.21189 s, 55.4 MB/s

Seems that at least in our case, disabling the write cache is a very reliable way to obliterate performance. This is with md flushing, disk flushing, and disk barriers re-enabled in DRBD, just to be safe with the inconsistent cache state.

Now, for these particular controllers, disabling the BBU automatically disables the write cache. I can't say for sure whether or not they actually do that without going to the colo and yanking the capacitor, but presumably we'd see similarly terrible performance without the BBU.

I can't confirm any of this with 8.3.11 unfortunately. But 8.4.1 at least, works as expected on our hardware.

Re: DRBD fsync() seems to return before writing to disk
> uname26 hpacucli ctrl=0 ld 2 modify aa=disable
>
> dd if=/dev/zero of=foo bs=8192 count=8192 oflag=sync
> 8192+0 records in
> 8192+0 records out
> 67108864 bytes (67 MB) copied, 1.21189 s, 55.4 MB/s

Oops, cut-and-paste bug: it was actually aa=enable that gave us the stats that followed, in case that was confusing.

Re: DRBD fsync() seems to return before writing to disk
On 06/21/2012 10:09 AM, Shaun Thomas wrote:
> 703 kB/s? Ouch!

Interesting. For 8 kB blocks, 703 kB/s means 87 IO/s, or 5272 IO/min.
Sounds like a 7200 RPM drive, with some overhead.

But now that I'm thinking about this more, there's a flaw in this
testing methodology. We don't know if the decrease in performance is
because DRBD is asking your RAID controller to flush cache, or if your
RAID controller is simply not caching at all.

I could, in my environment with a simple SATA drive, make DRBD "safe" by
disabling all write caching entirely with something like "hdparm -W 0
/dev/sda". However, there's a big difference between a cache that gets
flushed when it matters and no cache at all. No caching at all also means
I can't take advantage of NCQ.

I'm having a hard time thinking of a way to test your RAID controller's
behavior that isn't harder than just testing on a regular SATA drive,
which has caching semantics that are well known. I don't think there are
any means of doing IO from userspace that bypass the pagecache and also
don't flush the drive cache (unless they are broken).
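
For what it's worth, the kind of baseline I have in mind on a plain SATA
drive would be roughly this (device name and file are placeholders; run
it against a scratch filesystem):

# write cache on: if flushes reach the platters, synchronous 8 kB writes
# should still be limited to roughly one per revolution
sudo hdparm -W 1 /dev/sdX
dd if=/dev/zero of=foo bs=8192 count=256 oflag=sync

# write cache off entirely: sync writes are no faster, and ordinary
# buffered writes lose the benefit of the cache and NCQ as well
sudo hdparm -W 0 /dev/sdX
dd if=/dev/zero of=foo bs=8192 count=256 oflag=sync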

I'd still really like to hear from a DRBD developer about what the
desired behavior of DRBD is. I keep thinking about [1], and it seems
pretty clear to me that the position there, that fsync() causes no
change in behavior as far as DRBD is concerned, is horribly wrong, and
demonstrably not true of MD, LVM, or SATA devices. But that was in 2006;
has anything changed since then?

[1] http://lists.linbit.com/pipermail/drbd-user/2006-December/006105.html

Re: DRBD fsync() seems to return before writing to disk
> Interesting. For 8 kB blocks, 703 kB/s means 87 IO/s, or
> 5272 IO/min. Sounds like a 7200 RPM drive, with some overhead.

Nope. 10k RPM drives in this case. The 8k writes are small enough that they created a lot more journal traffic than you'd normally like. Watching iostat showed about 165 IOPS, which is about what you'd expect there.

> I'm having a hard time thinking of a way to test your RAID
> controller's behavior that isn't harder than just testing on
> a regular SATA drive, which has caching semantics that are well
> known.

Unfortunately no machine I have access to has plain drives to confirm the results you're seeing. Every cache/writethrough/flushing combo I've thrown at it reacted as expected in our case. I tried though. :)

Maybe someone with a regular old desktop and 7200 RPM drives will chime in eventually, but I'm out of ideas. I'm willing to wager it's just your setup though, for one reason or another.



Re: DRBD fsync() seems to return before writing to disk
On 06/19/2012 11:55 AM, Phil Frost wrote:
> I want to guarantee that fsync() doesn't return until writes have made
> it to physical storage. In particular, I care about PostgreSQL
> database integrity.

Well, this is proving very frustrating. I still don't know if I'm
chasing behavior that simply isn't implemented, or behavior that isn't
working in my environment. However, I'm very sure something is wrong
here. I tried digging around in the source code (3.2.0 kernel from
Debian squeeze-backports) a bit, and I'm CCing drbd-dev since I don't
imagine too many users read the code. I have essentially no experience
with block device programming, but I did find some good documentation in
the kernel [1] that provided a few grep victims, specifically REQ_FLUSH
and REQ_FUA. I found evidence that these are supported by DRBD, in
drbd_main.c:

static u32 bio_flags_to_wire(struct drbd_conf *mdev, unsigned long bi_rw)
{
    if (mdev->agreed_pro_version >= 95)
        return (bi_rw & REQ_SYNC ? DP_RW_SYNC : 0) |
               (bi_rw & REQ_FUA ? DP_FUA : 0) |
               (bi_rw & REQ_FLUSH ? DP_FLUSH : 0) |
               (bi_rw & REQ_DISCARD ? DP_DISCARD : 0);
    else
        return bi_rw & REQ_SYNC ? DP_RW_SYNC : 0;
}

This appears to be responsible for encoding the block request flags into
a network format for the peer, and there is an inverse function in
drbd_receiver.c. However, [1] also says block device drivers (well,
"request_fn based" drivers; I don't know exactly what that means, but I
think it applies here) must call blk_queue_flush to advertise support for
REQ_FUA and REQ_FLUSH. grep tells me DRBD doesn't do this anywhere, but
I do see it in other drivers I recognize: MD, loop, xen-blkfront, etc.

So, my hypothesis is that DRBD has the code to pass REQ_FUA and
REQ_FLUSH through to the underlying device, but it never sees those
flags because it doesn't claim to support them. They get stripped off
by the block IO system, which figures the best it can do is drain the
queue, which is clearly the Wrong Thing.

Unfortunately, I don't feel very qualified in this area, so can anyone
tell me if I'm totally off base here? Any suggestions on how I might
test this?

[1]
http://www.mjmwired.net/kernel/Documentation/block/writeback_cache_control.txt
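
For reference, the greps I mean are roughly these, run from the top of
the kernel source tree:

# DRBD never advertises FLUSH/FUA support for its queue:
grep -rn blk_queue_flush drivers/block/drbd/
# ...while drivers I recognize (MD, loop, xen-blkfront) do:
grep -rl blk_queue_flush drivers/md/ drivers/block/loop.c drivers/block/xen-blkfront.c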

Re: DRBD fsync() seems to return before writing to disk
Phil Frost writes:

> Unfortunately, I don't feel very qualified in this area, so can anyone
> tell me if I'm totally off base here? Any suggestions on how I might
> test this?

I don't know much about the kernel myself, but your post suggested a
fix. I applied the patch below to linux-3.4.1 and this patch appears to
fix the problem. Specifically, fsync() is too fast before the patch,
and it runs at non-drbd speed after the patch.

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 211fc44..96e400b 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -3413,6 +3413,7 @@ struct drbd_conf *drbd_new_device(unsigned int minor)
set_disk_ro(disk, true);

disk->queue = q;
+ blk_queue_flush(disk->queue, REQ_FLUSH | REQ_FUA);
disk->major = DRBD_MAJOR;
disk->first_minor = minor;
disk->fops = &drbd_ops;

Re: DRBD fsync() seems to return before writing to disk
On Fri, Jun 22, 2012 at 02:01:13PM -0400, Matteo Frigo wrote:
> Phil Frost writes:


Sorry, we have been busy this last week, and have not been able to react
properly to the more interesting threads.

>
> > Unfortunately, I don't feel very qualified in this area, so can anyone
> > tell me if I'm totally off base here? Any suggestions on how I might
> > test this?
>
> I don't know much about the kernel myself, but your post suggested a
> fix. I applied the patch below to linux-3.4.1 and this patch appears to
> fix the problem. Specifically, fsync() is too fast before the patch,
> and it runs at non-drbd speed after the patch.
>
> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> index 211fc44..96e400b 100644
> --- a/drivers/block/drbd/drbd_main.c
> +++ b/drivers/block/drbd/drbd_main.c
> @@ -3413,6 +3413,7 @@ struct drbd_conf *drbd_new_device(unsigned int minor)
> set_disk_ro(disk, true);
>
> disk->queue = q;
> + blk_queue_flush(disk->queue, REQ_FLUSH | REQ_FUA);
> disk->major = DRBD_MAJOR;
> disk->first_minor = minor;
> disk->fops = &drbd_ops;

Basically you are correct, yes,
this is part of what is needed to expose FLUSH/FUA
to upper layers/file systems.

But. This alone...

... it will crash the kernel ...



DRBD, as of now, does not expose FLUSH/FUA to upper layers, so whatever
upper layers do, generic_make_request() will strip away these flags,
and the full stack from DRBD on down to the hardware will no
longer even see them.

I'm preparing a patch that will enable the use and pass-through of
FLUSH/FUA in DRBD; it should hit the public git this week.


Now, a short history of fsync, barrier, flush, fua and DRBD.


Some time ago, fsync would only make sure that all relevant
"dirty" data in the Linux page cache and buffer cache
had been submitted to the hardware, and maybe wait for the
completion of those requests.

It did not care about potentially volatile device caches.

Then devices gained "native" or "tagged" command
queueing (NCQ/TCQ), and drivers started to support these operations.

This was supported as "barrier" requests in the Linux block and file
system layers.

Still, some drivers did not support this, and some file systems did not
use barriers, or not by default.
There were implementation flaws, like fsync "sometimes" causing
a barrier and sometimes not.
Some stacked drivers did not support them either.

Then, with increased awareness of the "fsync problem",
file system implementations were audited and improved,
and defaults were revised.

With mainline 2.6.29, non-fragmented single-segment dm-linear
learned to support (pass through) barrier requests.
With 2.6.30, most dm targets (LVM) had support for barriers, and with
2.6.33 even dm-raid1, the last one to receive barrier support.
I don't recall which mainline version of MD started to support barriers,
but I think it dates back to 2005.


"Back then", DRBD did neither use barriers, nor expose, pass-through, or
support them in other ways. Since most other drivers did not either,
file systems (ok, ext3) did not enable them by default, and we would
have had to implement a lot of "try, then fall back and disable, but
re-issue the request anyway" code, we did not.


Also because people who cared used battery-backed write cache in their
controllers anyway, and would explicitly disable barriers.


Then DRBD started to care about barriers for DRBD's internal meta data
writes: for each meta data transaction, and before every "DRBD barrier
ack" (these separate reorder domains, or "epochs", in the DRBD
replication stream), we issue a barrier or flush request.

We still did not expose the use of barriers to upper layers. So the
recommendation remained:

* either you disable your volatile write caches,
* or you make them non-volatile by adding a BBU

But even with volatile caches, for the "simple" node failure + failover,
DRBD would be ok, because of the frequent flush requests for the "DRBD
barrier ack" issued while the nodes were still replicating.

If you drive DRBD without a peer, though, you'd only get barriers
down to the hardware if you happened to leave the current hot working
set, causing DRBD to write an activity log transaction, and only
if the DRBD meta-data is located on the same physical drive, or behind
the same lower-level Linux block device queue, or both.

So if you crash a primary DRBD without a peer, with volatile caches,
you may experience data loss that, with proper use of barriers
in your file system and lower-level devices, might have been avoided
without DRBD in the mix. Lame excuse: "multiple failures" and such ...



Then there was the re-wiring of the Linux block layer with respect to
barriers and flushes. BIO_RW_BARRIER and REQ_HARDBARRIER vanished, and
all semantics are now expressed as a combination of REQ_FLUSH | REQ_FUA.
Users of the interface are no longer required to implement
try-and-fallback; it is supposed to be transparent now.


That was almost two years ago already (2.6.36 and following).
But DRBD still does not expose or pass through these request types.
Mea culpa.


As these are now supposed to be transparent (no try-then-fallback
implementation needed on our part anymore), I have finally implemented
support for this.

Will do some more testing, and add "#ifdef's" as necessary for older
kernel versions. Then you should have it in git, and with all
further releases (that will be 8.3.14, 8.4.2, and newer).

Still, my recommendation for best performance is to have battery-backed
write cache on your controllers, disable any volatile caches, and
disable/forget about barriers/flushes/FUA.


For those interested, below is my current, not yet final, patch.
As is, it should work with kernels >= 2.6.36,
and those that have backported FLUSH/FUA and blk_queue_flush(),
which include RHEL 6.1 and later (note: NOT 6.0).



diff --git a/drbd/drbd_actlog.c b/drbd/drbd_actlog.c
index 0f03a4c..b856c95 100644
--- a/drbd/drbd_actlog.c
+++ b/drbd/drbd_actlog.c
@@ -926,7 +926,11 @@ int __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector, int size,
unsigned int enr, count = 0;
struct lc_element *e;

- if (size <= 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_BIO_SIZE) {
+ /* this should be an empty REQ_FLUSH */
+ if (size == 0)
+ return 0;
+
+ if (size < 0 || (size & 0x1ff) != 0 || size > DRBD_MAX_BIO_SIZE) {
dev_err(DEV, "sector: %llus, size: %d\n",
(unsigned long long)sector, size);
return 0;
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index c3cad43..f2e9c39 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -3916,6 +3916,7 @@ struct drbd_conf *drbd_new_device(unsigned int minor)
q->backing_dev_info.congested_data = mdev;

blk_queue_make_request(q, drbd_make_request);
+ blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
/* Setting the max_hw_sectors to an odd value of 8kibyte here
This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 6f8de0c..993a09c 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -316,6 +316,9 @@ STATIC void drbd_pp_free(struct drbd_conf *mdev, struct page *page, int is_net)
atomic_t *a = is_net ? &mdev->pp_in_use_by_net : &mdev->pp_in_use;
int i;

+ if (page == NULL)
+ return;
+
if (drbd_pp_vacant > (DRBD_MAX_BIO_SIZE/PAGE_SIZE)*minor_count)
i = page_chain_free(page);
else {
@@ -355,7 +358,7 @@ struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
gfp_t gfp_mask) __must_hold(local)
{
struct drbd_epoch_entry *e;
- struct page *page;
+ struct page *page = NULL;
unsigned nr_pages = (data_size + PAGE_SIZE -1) >> PAGE_SHIFT;

if (drbd_insert_fault(mdev, DRBD_FAULT_AL_EE))
@@ -368,9 +371,11 @@ struct drbd_epoch_entry *drbd_alloc_ee(struct drbd_conf *mdev,
return NULL;
}

- page = drbd_pp_alloc(mdev, nr_pages, (gfp_mask & __GFP_WAIT));
- if (!page)
- goto fail;
+ if (data_size) {
+ page = drbd_pp_alloc(mdev, nr_pages, (gfp_mask & __GFP_WAIT));
+ if (!page)
+ goto fail;
+ }

INIT_HLIST_NODE(&e->collision);
e->epoch = NULL;
@@ -1476,7 +1481,6 @@ read_in_block(struct drbd_conf *mdev, u64 id, sector_t sector, int data_size) __

data_size -= dgs;

- ERR_IF(data_size == 0) return NULL;
ERR_IF(data_size & 0x1ff) return NULL;
ERR_IF(data_size > DRBD_MAX_BIO_SIZE) return NULL;

@@ -1497,6 +1501,9 @@ read_in_block(struct drbd_conf *mdev, u64 id, sector_t sector, int data_size) __
if (!e)
return NULL;

+ if (!data_size)
+ return e;
+
ds = data_size;
page = e->pages;
page_chain_for_each(page) {
@@ -1933,6 +1940,10 @@ STATIC int receive_Data(struct drbd_conf *mdev, enum drbd_packets cmd, unsigned

dp_flags = be32_to_cpu(p->dp_flags);
rw |= wire_flags_to_bio(mdev, dp_flags);
+ if (e->pages == NULL) {
+ D_ASSERT(e->size == 0);
+ D_ASSERT(dp_flags & DP_FLUSH);
+ }

if (dp_flags & DP_MAY_SET_IN_SYNC)
e->flags |= EE_MAY_SET_IN_SYNC;
diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c
index 5593fc8..a104daf 100644
--- a/drbd/drbd_req.c
+++ b/drbd/drbd_req.c
@@ -1168,13 +1168,12 @@ MAKE_REQUEST_TYPE drbd_make_request(struct request_queue *q, struct bio *bio)
/*
* what we "blindly" assume:
*/
- D_ASSERT(bio->bi_size > 0);
D_ASSERT((bio->bi_size & 0x1ff) == 0);

/* to make some things easier, force alignment of requests within the
* granularity of our hash tables */
s_enr = bio->bi_sector >> HT_SHIFT;
- e_enr = (bio->bi_sector+(bio->bi_size>>9)-1) >> HT_SHIFT;
+ e_enr = bio->bi_size ? (bio->bi_sector+(bio->bi_size>>9)-1) >> HT_SHIFT : s_enr;

if (likely(s_enr == e_enr)) {
do {




--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

Re: DRBD fsync() seems to return before writing to disk
Lars Ellenberg <lars.ellenberg@linbit.com>
writes:

> Now, a short history of fsync, barrier, flush, fua and DRBD.

Thanks for the insightful post. Much appreciated.
