Mailing List Archive

[PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
There are a number of cases where pages get cleaned but two of concern
to this patch are;
o When dirtying pages, processes may be throttled to clean pages if
dirty_ratio is not met.
o Pages belonging to inodes dirtied longer than
dirty_writeback_centisecs get cleaned.

The problem for reclaim is that dirty pages can reach the end of the LRU
if pages are being dirtied slowly enough that neither the throttling nor a
periodically waking flusher thread cleans them.

Background flush is already cleaning old or expired inodes first but the
expire time is too far in the future at the time of page reclaim. To mitigate
future problems, this patch wakes flusher threads to clean 1.5 times the
number of dirty pages encountered by reclaimers. The reasoning is that pages
were being dirtied at a roughly constant rate recently so if N dirty pages
were encountered in this scan block, we are likely to see roughly N dirty
pages again soon so try to keep the flusher threads ahead of reclaim.

This is unfortunately very hand-wavy but there is not really a good way of
quantifying how bad it is when reclaim encounters dirty pages other than
"down with that sort of thing". Similarly, there is not an obvious way of
figuring out what percentage of dirty pages are old in terms of LRU-age and
should be cleaned. Ideally, the background flushers would only be cleaning
pages belonging to the zone being scanned but it's not clear if this would
be of benefit (less IO) or not (potentially less efficient IO if an inode
is scattered across multiple zones).

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 18 +++++++++++-------
1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc50937..5763719 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -806,6 +806,8 @@ restart_dirty:
 		}

 		if (PageDirty(page)) {
+			nr_dirty++;
+
 			/*
 			 * If the caller cannot writeback pages, dirty pages
 			 * are put on a separate list for cleaning by either
@@ -814,7 +816,6 @@ restart_dirty:
 			if (!reclaim_can_writeback(sc, page)) {
 				list_add(&page->lru, &dirty_pages);
 				unlock_page(page);
-				nr_dirty++;
 				goto keep_dirty;
 			}

@@ -933,13 +934,16 @@ keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}

+	/*
+	 * If reclaim is encountering dirty pages, it may be because
+	 * dirty pages are reaching the end of the LRU even though
+	 * the dirty_ratio may be satisfied. In this case, wake
+	 * flusher threads to pro-actively clean some pages
+	 */
+	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
+
 	if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
-		/*
-		 * Wakeup a flusher thread to clean at least as many dirty
-		 * pages as encountered by direct reclaim. Wait on congestion
-		 * to throttle processes cleaning dirty pages
-		 */
-		wakeup_flusher_threads(nr_dirty);
+		/* Throttle direct reclaimers cleaning pages */
 		congestion_wait(BLK_RW_ASYNC, HZ/10);

 		/*
--
1.7.1

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though
> + * the dirty_ratio may be satisified. In this case, wake
> + * flusher threads to pro-actively clean some pages
> + */
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> +

Where is the laptop-mode magic coming from?

And btw, at least currently wakeup_flusher_threads writes back nr_pages
for each BDI, which might not be what you want. Then again probably
no caller wants it, but I don't see an easy way to fix it.
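
For reference, the behaviour Christoph describes lives in
wakeup_flusher_threads() itself. In kernels of roughly this vintage it looks
something like the sketch below; helper names and exact signatures are from
memory and changed between releases, so treat it as illustrative rather than
verbatim:

	void wakeup_flusher_threads(long nr_pages)
	{
		struct backing_dev_info *bdi;

		/* nr_pages == 0 is expanded to "every dirty page in the system" */
		if (!nr_pages)
			nr_pages = global_page_state(NR_FILE_DIRTY) +
				   global_page_state(NR_UNSTABLE_NFS);

		/* ... and the same nr_pages budget is handed to *each* BDI */
		rcu_read_lock();
		list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
			if (!bdi_has_dirty_io(bdi))
				continue;
			/* queue a WB_SYNC_NONE work item of nr_pages for this bdi */
			bdi_start_writeback(bdi, NULL, nr_pages);
		}
		rcu_read_unlock();
	}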

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though
> > + * the dirty_ratio may be satisified. In this case, wake
> > + * flusher threads to pro-actively clean some pages
> > + */
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > +
>
> Where is the laptop-mode magic coming from?
>

It comes from other parts of page reclaim, where writing pages from reclaim
is avoided where possible. Things like this:

wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);

and

.may_writepage = !laptop_mode

although the latter can get disabled too. Deleting the magic is an
option which would trade IO efficiency for power efficiency but my
current thinking is that laptop mode prefers reduced power.

> And btw, at least currently wakeup_flusher_threads writes back nr_pages
> for each BDI, which might not be what you want.

I saw you pointing that out in another thread all right although I can't
remember the context. It's not exactly what I want but then again what we
really want is writeback of pages from a particular zone, which we don't
get either. There did not seem to be an ideal option here and this appeared
to be "less bad" than the alternatives.

> Then again probably
> no caller wants it, but I don't see an easy way to fix it.
>

I didn't either but my writeback-foo is weak (getting better but still weak). I
hoped to bring it up at MM Summit and maybe at the Filesystem Summit too to
see what ideas exist to improve this.

When this idea was first floated, you called it a band-aid and I
prioritised writing back old inodes over this. How do you feel about
this approach now?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On 07/19/2010 09:11 AM, Mel Gorman wrote:
> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.
> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU
> if pages are being dirtied slowly so that neither the throttling cleans
> them or a flusher thread waking periodically.

I can't see a better way to do this without creating
a way-too-big-to-merge patch series, and this patch
should result in the right behaviour, so ...

Acked-by: Rik van Riel <riel@redhat.com>

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> @@ -933,13 +934,16 @@ keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
>
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though
> + * the dirty_ratio may be satisified. In this case, wake
> + * flusher threads to pro-actively clean some pages
> + */
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);

An argument of 0 means 'every dirty page in the system'. I assume this
is not what you wanted, right? Something like this?

	if (nr_dirty && !laptop_mode)
		wakeup_flusher_threads(nr_dirty + nr_dirty / 2);

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though
> > > + * the dirty_ratio may be satisified. In this case, wake
> > > + * flusher threads to pro-actively clean some pages
> > > + */
> > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > +
> >
> > Where is the laptop-mode magic coming from?
> >
>
> It comes from other parts of page reclaim where writing pages is avoided
> by page reclaim where possible. Things like this
>
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);

Actually, it's not avoiding writing pages in laptop mode; instead it
is lumping writeouts together aggressively (as I wrote in my other mail,
.nr_pages=0 means 'write everything') to keep disk spinups rare and
make maximum use of them.

> although the latter can get disabled too. Deleting the magic is an
> option which would trade IO efficiency for power efficiency but my
> current thinking is laptop mode preferred reduced power.

Maybe couple your wakeup with sc->may_writepage? It is usually false
for laptop_mode but direct reclaimers enable it at one point in
do_try_to_free_pages(), once more than 150% of the reclaim target has been
scanned, so you could use existing disk spin-up points instead of
introducing new ones or disabling the heuristics in laptop mode.
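
For context, the existing spin-up point mentioned above is in
do_try_to_free_pages(). Paraphrased from kernels of this era (not an exact
quote), the relevant logic reads roughly:

	/*
	 * Once direct reclaim has scanned more than 150% of its target
	 * without reclaiming enough, kick the flushers and allow direct
	 * reclaim to write pages itself.
	 */
	writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
	if (total_scanned > writeback_threshold) {
		wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
		sc->may_writepage = 1;
	}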

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote:
> On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> > On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote:
> > > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote:
> > > > + /*
> > > > + * If reclaim is encountering dirty pages, it may be because
> > > > + * dirty pages are reaching the end of the LRU even though
> > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > + * flusher threads to pro-actively clean some pages
> > > > + */
> > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > +
> > >
> > > Where is the laptop-mode magic coming from?
> > >
> >
> > It comes from other parts of page reclaim where writing pages is avoided
> > by page reclaim where possible. Things like this
> >
> > wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
>
> Actually, it's not avoiding writing pages in laptop mode, instead it
> is lumping writeouts aggressively (as I wrote in my other mail,
> .nr_pages=0 means 'write everything') to keep disk spinups rare and
> make maximum use of them.
>

You're right, 0 does mean flush everything - /me slaps self. It was introduced
in 2.6.6 with the patch "[PATCH] laptop mode". Quoting from it

Algorithm: the idea is to hold dirty data in memory for a long time,
but to flush everything which has been accumulated if the disk happens
to spin up for other reasons.

So, the reason for the magic is half right - avoid excessive disk spin-ups -
but my reasoning for it was wrong. I thought it was avoiding cleaning to
save power. What it is actually intended to do is "if we are spinning up the
disk anyway, do as much work as possible so it can spin down for longer later".

Where it's wrong is that it should only wake up the flusher threads if dirty
pages were encountered. What it's doing right now is potentially
cleaning everything. It means I need to rerun all the tests and see whether
the number of dirty pages encountered by page reclaim is really reduced or
whether it only dropped because I was calling wakeup_flusher_threads(0) when
no dirty pages were encountered.

> > although the latter can get disabled too. Deleting the magic is an
> > option which would trade IO efficiency for power efficiency but my
> > current thinking is laptop mode preferred reduced power.
>
> Maybe couple your wakeup with sc->may_writepage? It is usually false
> for laptop_mode but direct reclaimers enable it at one point in
> do_try_to_free_pages() when it scanned more than 150% of the reclaim
> target, so you could use existing disk spin-up points instead of
> introducing new ones or disabling the heuristics in laptop mode.
>

How about the following?

	if (nr_dirty && sc->may_writepage)
		wakeup_flusher_threads(laptop_mode ? 0 :
					nr_dirty + nr_dirty / 2);


1. Wake up flusher threads if dirty pages are encountered
2. For direct reclaim, only wake them up if may_writepage is set
indicating that the system is ready to spin up disks and start
reclaiming
3. In laptop_mode, flush everything to reduce future spin-ups

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 20, 2010 at 03:10:49PM +0100, Mel Gorman wrote:
> On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote:
> > On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote:
> > > although the latter can get disabled too. Deleting the magic is an
> > > option which would trade IO efficiency for power efficiency but my
> > > current thinking is laptop mode preferred reduced power.
> >
> > Maybe couple your wakeup with sc->may_writepage? It is usually false
> > for laptop_mode but direct reclaimers enable it at one point in
> > do_try_to_free_pages() when it scanned more than 150% of the reclaim
> > target, so you could use existing disk spin-up points instead of
> > introducing new ones or disabling the heuristics in laptop mode.
> >
>
> How about the following?
>
> if (nr_dirty && sc->may_writepage)
> wakeup_flusher_threads(laptop_mode ? 0 :
> nr_dirty + nr_dirty / 2);
>
>
> 1. Wakup flusher threads if dirty pages are encountered
> 2. For direct reclaim, only wake them up if may_writepage is set
> indicating that the system is ready to spin up disks and start
> reclaiming
> 3. In laptop_mode, flush everything to reduce future spin-ups

Sounds like the sanest approach to me. Thanks.

Hannes

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote:
> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.
> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU
> if pages are being dirtied slowly so that neither the throttling cleans
> them or a flusher thread waking periodically.
>
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 1.5 times the
> number of dirty pages encountered by reclaimers. The reasoning is that pages
> were being dirtied at a roughly constant rate recently so if N dirty pages
> were encountered in this scan block, we are likely to see roughly N dirty
> pages again soon so try keep the flusher threads ahead of reclaim.
>
> This is unfortunately very hand-wavy but there is not really a good way of
> quantifying how bad it is when reclaim encounters dirty pages other than
> "down with that sort of thing". Similarly, there is not an obvious way of
> figuring how what percentage of dirty pages are old in terms of LRU-age and
> should be cleaned. Ideally, the background flushers would only be cleaning
> pages belonging to the zone being scanned but it's not clear if this would
> be of benefit (less IO) or not (potentially less efficient IO if an inode
> is scattered across multiple zones).
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/vmscan.c | 18 +++++++++++-------
> 1 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc50937..5763719 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -806,6 +806,8 @@ restart_dirty:
> }
>
> if (PageDirty(page)) {
> + nr_dirty++;
> +
> /*
> * If the caller cannot writeback pages, dirty pages
> * are put on a separate list for cleaning by either
> @@ -814,7 +816,6 @@ restart_dirty:
> if (!reclaim_can_writeback(sc, page)) {
> list_add(&page->lru, &dirty_pages);
> unlock_page(page);
> - nr_dirty++;
> goto keep_dirty;
> }
>
> @@ -933,13 +934,16 @@ keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
>
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though
> + * the dirty_ratio may be satisified. In this case, wake
> + * flusher threads to pro-actively clean some pages
> + */
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);

Ah it's very possible that nr_dirty==0 here! Then you are asking the flusher
to push the number of dirty pages all the way down to 0, whether or not
pageout() is called.

Another minor issue is that the passed (nr_dirty + nr_dirty / 2) is
normally a small number, much smaller than MAX_WRITEBACK_PAGES.
The flusher will sync at least MAX_WRITEBACK_PAGES pages, which is good
for efficiency. And it seems good to let the flusher write much more
than nr_dirty pages to safeguard a reasonably large
vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
update the comments.
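
For readers unfamiliar with the constant: MAX_WRITEBACK_PAGES is the
flusher's per-invocation batch size, defined in fs/fs-writeback.c of this
era roughly as follows (value quoted from memory, treat as illustrative):

	#define MAX_WRITEBACK_PAGES	1024	/* about 4MB worth of 4KB pages */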

Thanks,
Fengguang

> if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
> - /*
> - * Wakeup a flusher thread to clean at least as many dirty
> - * pages as encountered by direct reclaim. Wait on congestion
> - * to throttle processes cleaning dirty pages
> - */
> - wakeup_flusher_threads(nr_dirty);
> + /* Throttle direct reclaimers cleaning pages */
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> --
> 1.7.1

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 26, 2010 at 03:28:32PM +0800, Wu Fengguang wrote:
> On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote:
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> > o When dirtying pages, processes may be throttled to clean pages if
> > dirty_ratio is not met.
> > o Pages belonging to inodes dirtied longer than
> > dirty_writeback_centisecs get cleaned.
> >
> > The problem for reclaim is that dirty pages can reach the end of the LRU
> > if pages are being dirtied slowly so that neither the throttling cleans
> > them or a flusher thread waking periodically.
> >
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 1.5 times the
> > number of dirty pages encountered by reclaimers. The reasoning is that pages
> > were being dirtied at a roughly constant rate recently so if N dirty pages
> > were encountered in this scan block, we are likely to see roughly N dirty
> > pages again soon so try keep the flusher threads ahead of reclaim.
> >
> > This is unfortunately very hand-wavy but there is not really a good way of
> > quantifying how bad it is when reclaim encounters dirty pages other than
> > "down with that sort of thing". Similarly, there is not an obvious way of
> > figuring how what percentage of dirty pages are old in terms of LRU-age and
> > should be cleaned. Ideally, the background flushers would only be cleaning
> > pages belonging to the zone being scanned but it's not clear if this would
> > be of benefit (less IO) or not (potentially less efficient IO if an inode
> > is scattered across multiple zones).
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 18 +++++++++++-------
> > 1 files changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bc50937..5763719 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -806,6 +806,8 @@ restart_dirty:
> > }
> >
> > if (PageDirty(page)) {
> > + nr_dirty++;
> > +
> > /*
> > * If the caller cannot writeback pages, dirty pages
> > * are put on a separate list for cleaning by either
> > @@ -814,7 +816,6 @@ restart_dirty:
> > if (!reclaim_can_writeback(sc, page)) {
> > list_add(&page->lru, &dirty_pages);
> > unlock_page(page);
> > - nr_dirty++;
> > goto keep_dirty;
> > }
> >
> > @@ -933,13 +934,16 @@ keep_dirty:
> > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > }
> >
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though
> > + * the dirty_ratio may be satisified. In this case, wake
> > + * flusher threads to pro-actively clean some pages
> > + */
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
>
> Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> number of dirty pages down to 0 whether or not pageout() is called.
>

True, this has been fixed to only wake up the flusher threads when this is
the file LRU, dirty pages have been encountered and the caller has
sc->may_writepage.
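
The reworked check presumably ends up looking something like this (a sketch
of the condition described above, not the hunk that was actually posted):

	if (file && nr_dirty && sc->may_writepage)
		wakeup_flusher_threads(laptop_mode ? 0 :
					nr_dirty + nr_dirty / 2);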

> Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> for efficiency.
> And it seems good to let the flusher write much more
> than nr_dirty pages to safeguard a reasonable large
> vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> update the comments.
>

Ok, the reasoning had been to flush a number of pages that was related
to the scanning rate but if that is inefficient for the flusher, I'll
use MAX_WRITEBACK_PAGES.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
> > > @@ -933,13 +934,16 @@ keep_dirty:
> > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > }
> > >
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though
> > > + * the dirty_ratio may be satisified. In this case, wake
> > > + * flusher threads to pro-actively clean some pages
> > > + */
> > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> >
> > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > number of dirty pages down to 0 whether or not pageout() is called.
> >
>
> True, this has been fixed to only wakeup flusher threads when this is
> the file LRU, dirty pages have been encountered and the caller has
> sc->may_writepage.

OK.

> > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > for efficiency.
> > And it seems good to let the flusher write much more
> > than nr_dirty pages to safeguard a reasonable large
> > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > update the comments.
> >
>
> Ok, the reasoning had been to flush a number of pages that was related
> to the scanning rate but if that is inefficient for the flusher, I'll
> use MAX_WRITEBACK_PAGES.

It would be better to pass something like (nr_dirty * N).
MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
obviously too large as a parameter. When the batch size is increased
to 128MB, the writeback code may be improved somehow to not exceed the
nr_pages limit too much.

Thanks,
Fengguang

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > }
> > > >
> > > > + /*
> > > > + * If reclaim is encountering dirty pages, it may be because
> > > > + * dirty pages are reaching the end of the LRU even though
> > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > + * flusher threads to pro-actively clean some pages
> > > > + */
> > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > >
> > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > number of dirty pages down to 0 whether or not pageout() is called.
> > >
> >
> > True, this has been fixed to only wakeup flusher threads when this is
> > the file LRU, dirty pages have been encountered and the caller has
> > sc->may_writepage.
>
> OK.
>
> > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > for efficiency.
> > > And it seems good to let the flusher write much more
> > > than nr_dirty pages to safeguard a reasonable large
> > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > update the comments.
> > >
> >
> > Ok, the reasoning had been to flush a number of pages that was related
> > to the scanning rate but if that is inefficient for the flusher, I'll
> > use MAX_WRITEBACK_PAGES.
>
> It would be better to pass something like (nr_dirty * N).
> MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> obviously too large as a parameter. When the batch size is increased
> to 128MB, the writeback code may be improved somehow to not exceed the
> nr_pages limit too much.
>

What might be a useful value for N? 1.5 appears to work reasonably well
to create a window of writeback ahead of the scanner but it's a bit
arbitrary.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > + * flusher threads to pro-actively clean some pages
> > > > > + */
> > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > >
> > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > >
> > >
> > > True, this has been fixed to only wakeup flusher threads when this is
> > > the file LRU, dirty pages have been encountered and the caller has
> > > sc->may_writepage.
> >
> > OK.
> >
> > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > for efficiency.
> > > > And it seems good to let the flusher write much more
> > > > than nr_dirty pages to safeguard a reasonable large
> > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > update the comments.
> > > >
> > >
> > > Ok, the reasoning had been to flush a number of pages that was related
> > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > use MAX_WRITEBACK_PAGES.
> >
> > It would be better to pass something like (nr_dirty * N).
> > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > obviously too large as a parameter. When the batch size is increased
> > to 128MB, the writeback code may be improved somehow to not exceed the
> > nr_pages limit too much.
> >
>
> What might be a useful value for N? 1.5 appears to work reasonably well
> to create a window of writeback ahead of the scanner but it's a bit
> arbitrary.

I'd recommend N to be a large value. It's no longer relevant now since
we'll call the flusher to sync some range containing the target page.
The flusher will then choose an N large enough (e.g. 4MB) for efficient
IO. It needs to be a large value, otherwise the vmscan code will
quickly run into dirty pages again.

Thanks,
Fengguang

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > }
> > > > > >
> > > > > > + /*
> > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > + */
> > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > >
> > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > >
> > > >
> > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > the file LRU, dirty pages have been encountered and the caller has
> > > > sc->may_writepage.
> > >
> > > OK.
> > >
> > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > for efficiency.
> > > > > And it seems good to let the flusher write much more
> > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > update the comments.
> > > > >
> > > >
> > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > use MAX_WRITEBACK_PAGES.
> > >
> > > It would be better to pass something like (nr_dirty * N).
> > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > obviously too large as a parameter. When the batch size is increased
> > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > nr_pages limit too much.
> > >
> >
> > What might be a useful value for N? 1.5 appears to work reasonably well
> > to create a window of writeback ahead of the scanner but it's a bit
> > arbitrary.
>
> I'd recommend N to be a large value. It's no longer relevant now since
> we'll call the flusher to sync some range containing the target page.
> The flusher will then choose an N large enough (eg. 4MB) for efficient
> IO. It needs to be a large value, otherwise the vmscan code will
> quickly run into dirty pages again..
>

Ok, I took the 4MB at face value to be a "reasonable amount that should
not cause congestion". The end result is

#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
static inline long nr_writeback_pages(unsigned long nr_dirty)
{
	return laptop_mode ? 0 :
		min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
}

nr_writeback_pages(nr_dirty) is what gets passed to
wakeup_flusher_threads(). Does that seem sensible?
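
Working through the arithmetic with 4KB pages and the usual SWAP_CLUSTER_MAX
of 32 (back-of-the-envelope numbers, not from the original mail):

	/*
	 * MAX_WRITEBACK    = 4194304 >> 12 = 1024 pages (4MB)
	 * WRITEBACK_FACTOR = 1024 / 32     = 32
	 * nr_writeback_pages(1)  == min(1024,   32) ==   32 pages
	 * nr_writeback_pages(32) == min(1024, 1024) == 1024 pages (the full 4MB)
	 */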


--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > }
> > > > > > >
> > > > > > > + /*
> > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > + */
> > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > >
> > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > >
> > > > >
> > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > sc->may_writepage.
> > > >
> > > > OK.
> > > >
> > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > for efficiency.
> > > > > > And it seems good to let the flusher write much more
> > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > update the comments.
> > > > > >
> > > > >
> > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > use MAX_WRITEBACK_PAGES.
> > > >
> > > > It would be better to pass something like (nr_dirty * N).
> > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > obviously too large as a parameter. When the batch size is increased
> > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > nr_pages limit too much.
> > > >
> > >
> > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > to create a window of writeback ahead of the scanner but it's a bit
> > > arbitrary.
> >
> > I'd recommend N to be a large value. It's no longer relevant now since
> > we'll call the flusher to sync some range containing the target page.
> > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > IO. It needs to be a large value, otherwise the vmscan code will
> > quickly run into dirty pages again..
> >
>
> Ok, I took the 4MB at face value to be a "reasonable amount that should
> not cause congestion".

Under memory pressure, the disk should be busy/congested anyway.
The big 4MB adds much work; however, many of the pages may need to be
synced in the near future anyway. It also requires more time to do
the bigger IO, hence adding some latency; however, the latency should
be a small factor compared to the IO queue time (which will be long
for a busy disk).

The overall expectation is: the more efficient the IO, the more progress :)

> The end result is
>
> #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> static inline long nr_writeback_pages(unsigned long nr_dirty)
> {
> return laptop_mode ? 0 :
> min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> }
>
> nr_writeback_pages(nr_dirty) is what gets passed to
> wakeup_flusher_threads(). Does that seem sensible?

If you plan to keep wakeup_flusher_threads(), a simpler form may be
sufficient, eg.

laptop_mode ? 0 : (nr_dirty * 16)

On top of this, we may write another patch to convert the
wakeup_flusher_threads(bdi, nr_pages) call to some
bdi_start_inode_writeback(inode, offset) call, to start more oriented
writeback.

When talking the 4MB optimization, I was referring to the internal
implementation of bdi_start_inode_writeback(). Sorry for the missing
context in the previous email.

It may need a big patch to implement bdi_start_inode_writeback().
Would you like to try it, or leave the task to me?

Thanks,
Fengguang

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
> If you plan to keep wakeup_flusher_threads(), a simpler form may be
> sufficient, eg.
>
> laptop_mode ? 0 : (nr_dirty * 16)

This number is not sensitive because the writeback code may well round
it up to some more IO-efficient value (currently 4MB). AFAIK the
nr_pages parameters passed by all existing flusher callers are
rule-of-thumb values, far from being exact numbers.

Thanks,
Fengguang

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > }
> > > > > > > >
> > > > > > > > + /*
> > > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > > + */
> > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > >
> > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > >
> > > > > >
> > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > sc->may_writepage.
> > > > >
> > > > > OK.
> > > > >
> > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > for efficiency.
> > > > > > > And it seems good to let the flusher write much more
> > > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > update the comments.
> > > > > > >
> > > > > >
> > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > use MAX_WRITEBACK_PAGES.
> > > > >
> > > > > It would be better to pass something like (nr_dirty * N).
> > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > obviously too large as a parameter. When the batch size is increased
> > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > nr_pages limit too much.
> > > > >
> > > >
> > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > arbitrary.
> > >
> > > I'd recommend N to be a large value. It's no longer relevant now since
> > > we'll call the flusher to sync some range containing the target page.
> > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > IO. It needs to be a large value, otherwise the vmscan code will
> > > quickly run into dirty pages again..
> > >
> >
> > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > not cause congestion".
>
> Under memory pressure, the disk should be busy/congested anyway.

Not necessarily. It could be streaming reads where pages are being added
to the LRU quickly but not necessarily dominated by dirty pages. Due to the
scanning rate, a dirty page may be encountered but it could be rare.

> The big 4MB adds much work, however many of the pages may need to be
> synced in the near future anyway. It also requires more time to do
> the bigger IO, hence adding some latency, however the latency should
> be a small factor comparing to the IO queue time (which will be long
> for a busy disk).
>
> Overall expectation is, the more efficient IO, the more progress :)
>

Ok.

> > The end result is
> >
> > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > {
> > return laptop_mode ? 0 :
> > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > }
> >
> > nr_writeback_pages(nr_dirty) is what gets passed to
> > wakeup_flusher_threads(). Does that seem sensible?
>
> If you plan to keep wakeup_flusher_threads(), a simpler form may be
> sufficient, eg.
>
> laptop_mode ? 0 : (nr_dirty * 16)
>

I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because
while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to lumpy
reclaim. I wanted to firmly bound how much writeback was being requested -
hence the mild complexity.

> On top of this, we may write another patch to convert the
> wakeup_flusher_threads(bdi, nr_pages) call to some
> bdi_start_inode_writeback(inode, offset) call, to start more oriented
> writeback.
>

I did a first pass at optimising based on prioritising inodes related to
dirty pages. It's incredibly primitive and I have to sit down and see
how the entirety of writeback is put together to improve on it. Maybe
you'll spot something simple or see that it's totally the wrong direction.
Patch is below.

> When talking the 4MB optimization, I was referring to the internal
> implementation of bdi_start_inode_writeback(). Sorry for the missing
> context in the previous email.
>

No worries, I was assuming it was something in mainline I didn't know
yet :)

> It may need a big patch to implement bdi_start_inode_writeback().
> Would you like to try it, or leave the task to me?
>

If you send me a patch, I can try it out but it's not my highest
priority right now. I'm still looking to get writeback-from-reclaim down
to a reasonable level without causing a large amount of churn.

Here is the first pass anyway at kicking wakeup_flusher_threads() for
inodes belonging to a list of pages. You'll note that I do nothing with
page offset because I didn't spot a simple way of taking that
information into account. It's also horrible from a locking perspective.
So far, its testing has been "it didn't crash".

==== CUT HERE ====
writeback: Prioritise dirty inodes encountered by reclaim for background flushing

It is preferable that as few dirty pages as possible are dispatched for
cleaning from the page reclaim path. When dirty pages are encountered by
page reclaim, this patch marks their inodes so that they are dispatched
immediately. When the background flusher runs, it moves such inodes
immediately to the dispatch queue regardless of inode age.

This is an early prototype. It could be optimised to not regularly take
the inode lock repeatedly and ideally the page offset would also be
taken into account.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h        |    5 ++-
 include/linux/writeback.h |    1 +
 mm/vmscan.c               |    6 +++-
 4 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5a3c764..27a8b75 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
-	struct inode *inode;
+	struct inode *inode, *tinode;
 	int do_sb_sort = 0;

 	if (wbc->for_kupdate || wbc->for_background) {
@@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 		older_than_this = jiffies - expire_interval;
 	}

+	/* Move inodes reclaim found at end of LRU to dispatch queue */
+	list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
+		if (inode->i_state & I_DIRTY_RECLAIM) {
+			inode->i_state &= ~I_DIRTY_RECLAIM;
+			list_move(&inode->i_list, &tmp);
+		}
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
@@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
 	rcu_read_unlock();
 }

+/*
+ * Similar to wakeup_flusher_threads except prioritise inodes contained
+ * in the page_list regardless of age
+ */
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
+{
+	struct page *page;
+	struct address_space *mapping;
+	struct inode *inode;
+
+	list_for_each_entry(page, page_list, lru) {
+		if (!PageDirty(page))
+			continue;
+
+		lock_page(page);
+		mapping = page_mapping(page);
+		if (!mapping || mapping == &swapper_space)
+			goto unlock;
+
+		/*
+		 * Test outside the lock to see as if it is already set, taking
+		 * the inode lock is a waste and the inode should be pinned by
+		 * the lock_page
+		 */
+		inode = page->mapping->host;
+		if (inode->i_state & I_DIRTY_RECLAIM)
+			goto unlock;
+
+		/*
+		 * XXX: Yuck, has to be a way of batching this by not requiring
+		 * the page lock to pin the inode
+		 */
+		spin_lock(&inode_lock);
+		inode->i_state |= I_DIRTY_RECLAIM;
+		spin_unlock(&inode_lock);
+unlock:
+		unlock_page(page);
+	}
+
+	wakeup_flusher_threads(nr_pages);
+}
+
 static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 {
 	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e29f0ed..8836698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1585,8 +1585,8 @@ struct super_operations {
 /*
  * Inode state bits. Protected by inode_lock.
  *
- * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
+ * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
  *
  * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
  * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
@@ -1633,6 +1633,7 @@ struct super_operations {
 #define I_DIRTY_SYNC		1
 #define I_DIRTY_DATASYNC	2
 #define I_DIRTY_PAGES		4
+#define I_DIRTY_RECLAIM		256
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
 #define I_WILL_FREE		16
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 494edd6..73a4df2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 			struct writeback_control *wbc);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);

 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b66d1f5..bad1abf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,7 +901,8 @@ keep:
 	 * laptop mode avoiding disk spin-ups
 	 */
 	if (file && nr_dirty_seen && sc->may_writepage)
-		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
+		wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
+						page_list);

 	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			list_add(&page->lru, &putback_list);
 		}

-		wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+		wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
+						&page_list);
 		congestion_wait(BLK_RW_ASYNC, HZ/10);

 		/*

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> >
> > laptop_mode ? 0 : (nr_dirty * 16)
>
> This number is not sensitive because the writeback code may well round
> it up to some more IO efficient value (currently 4MB). AFAIK the
> nr_pages parameters passed by all existing flusher callers are some
> rule-of-thumb value, and far from being an exact number.
>

I get that it's a rule of thumb but decided I would still pass in some value
related to nr_dirty that was bounded in some manner. Currently, that bound
is 4MB but maybe it should have been bound to MAX_WRITEBACK_PAGES (which is
4MB for x86, but could be anything depending on the base page size).

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 10:40:26PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote:
> > > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > > sufficient, eg.
> > >
> > > laptop_mode ? 0 : (nr_dirty * 16)
> >
> > This number is not sensitive because the writeback code may well round
> > it up to some more IO efficient value (currently 4MB). AFAIK the
> > nr_pages parameters passed by all existing flusher callers are some
> > rule-of-thumb value, and far from being an exact number.
> >
>
> I get that it's a rule of thumb but decided I would still pass in some value
> related to nr_dirty that was bounded in some manner.
> Currently, that bound is 4MB but maybe it should have been bound to
> MAX_WRITEBACK_PAGES (which is 4MB for x86, but could be anything
> depending on the base page size).

I see your worry about much bigger page size making

vmscan batch size > writeback batch size

and it's a legitimate worry.

Thanks,
Fengguang

Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > + /*
> > > > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > > > + */
> > > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > >
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > >
> > > > > > >
> > > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > > for efficiency.
> > > > > > > > And it seems good to let the flusher write much more
> > > > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > > update the comments.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > >
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > >
> > > > >
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > >
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again.
> > > >
> > >
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> >
> > Under memory pressure, the disk should be busy/congested anyway.
>
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.

Right.

> > The big 4MB batch adds a lot of work; however, many of the pages may need
> > to be synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency; however, the latency should
> > be a small factor compared to the IO queue time (which will be long
> > for a busy disk).
> >
> > The overall expectation is: the more efficient the IO, the more progress :)
> >
>
> Ok.
>
> > > The end result is
> > >
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > > 	return laptop_mode ? 0 :
> > > 		min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > >
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> >
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> >
> > laptop_mode ? 0 : (nr_dirty * 16)
> >
>
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because,
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be, due to lumpy
> reclaim. I wanted to firmly bound how much writeback was being requested;
> hence the mild complexity.

OK.
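
For a concrete feel of the difference (illustrative arithmetic only: it
assumes 4K pages so MAX_WRITEBACK is 1024 pages and WRITEBACK_FACTOR is 32,
the nr_dirty values are made up with 512 standing in for a big lumpy-reclaim
batch, and the laptop_mode case is left out):

#include <stdio.h>

#define SWAP_CLUSTER_MAX	32UL
#define MAX_WRITEBACK		(4194304UL >> 12)	/* 1024 pages with 4K pages */
#define WRITEBACK_FACTOR	(MAX_WRITEBACK / SWAP_CLUSTER_MAX)

/* The simpler form suggested above */
static unsigned long simple_form(unsigned long nr_dirty)
{
	return nr_dirty * 16;
}

/* The bounded form from nr_writeback_pages() */
static unsigned long bounded_form(unsigned long nr_dirty)
{
	unsigned long pages = nr_dirty * WRITEBACK_FACTOR;

	return pages < MAX_WRITEBACK ? pages : MAX_WRITEBACK;
}

int main(void)
{
	unsigned long samples[] = { 1, 8, 32, 512 };
	int i;

	for (i = 0; i < 4; i++)
		printf("nr_dirty %4lu: nr_dirty*16 = %5lu pages, bounded = %4lu pages\n",
		       samples[i], simple_form(samples[i]), bounded_form(samples[i]));
	return 0;
}

The unbounded form asks for 8192 pages (32MB) when nr_dirty is 512, while the
bounded form never requests more than one 4MB writeback batch.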

> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more oriented
> > writeback.
> >
>
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the whole of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.

The simplest style may be

	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= MAX_WRITEBACK_PAGES,
	};

	mapping->writeback_index = offset;
	return do_writepages(mapping, &wbc);

But sure, there will be many details to handle.
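
One hedged way some of those details might be filled in is to target the
offending offset directly with range_start/range_end rather than nudging
writeback_index. The helper name and the 4MB window below are illustrative
assumptions rather than an existing interface, and the sketch presumes it
lives near the flusher code (so MAX_WRITEBACK_PAGES is visible) and that the
caller keeps the mapping alive:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Hypothetical: write back one 4MB chunk of a single inode from @offset */
static int reclaim_writeback_inode_chunk(struct address_space *mapping,
					 pgoff_t offset)
{
	loff_t start = (loff_t)offset << PAGE_CACHE_SHIFT;
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= MAX_WRITEBACK_PAGES,
		.range_start	= start,
		.range_end	= start + 4 * 1024 * 1024 - 1,
	};

	return do_writepages(mapping, &wbc);
}

Whether to key this off the inode alone (as in the prototype below) or off
the (inode, offset) pair is exactly the open question.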

> > When talking about the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> >
>
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
>
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> >
>
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.

OK. That's already great work.

> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with the
> page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, its testing has been "it didn't crash".

It seems a neat way to prioritize the inodes with a new I_DIRTY_RECLAIM
flag. However, it may require a vastly different implementation once the
offset is taken into account. I'll try to work up a prototype tomorrow.

Thanks,
Fengguang


> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
>
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks their inodes so that they are dispatched
> immediately. When the background flusher runs, it moves such inodes straight
> to the dispatch queue regardless of inode age.
>
> This is an early prototype. It could be optimised to not take the inode
> lock repeatedly, and ideally the page offset would also be taken into
> account.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 5 ++-
> include/linux/writeback.h | 1 +
> mm/vmscan.c | 6 +++-
> 4 files changed, 59 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> - struct inode *inode;
> + struct inode *inode, *tinode;
> int do_sb_sort = 0;
>
> if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> older_than_this = jiffies - expire_interval;
> }
>
> + /* Move inodes reclaim found at end of LRU to dispatch queue */
> + list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> + if (inode->i_state & I_DIRTY_RECLAIM) {
> + inode->i_state &= ~I_DIRTY_RECLAIM;
> + list_move(&inode->i_list, &tmp);
> + }
> + }
> +
> while (!list_empty(delaying_queue)) {
> inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
> rcu_read_unlock();
> }
>
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> + struct page *page;
> + struct address_space *mapping;
> + struct inode *inode;
> +
> + list_for_each_entry(page, page_list, lru) {
> + if (!PageDirty(page))
> + continue;
> +
> + lock_page(page);
> + mapping = page_mapping(page);
> + if (!mapping || mapping == &swapper_space)
> + goto unlock;
> +
> + /*
> + * Test outside the lock to see if it is already set; taking
> + * the inode lock would be a waste and the inode should be pinned
> + * by the lock_page
> + */
> + inode = page->mapping->host;
> + if (inode->i_state & I_DIRTY_RECLAIM)
> + goto unlock;
> +
> + /*
> + * XXX: Yuck, there has to be a way of batching this by not requiring
> + * the page lock to pin the inode
> + */
> + spin_lock(&inode_lock);
> + inode->i_state |= I_DIRTY_RECLAIM;
> + spin_unlock(&inode_lock);
> +unlock:
> + unlock_page(page);
> + }
> +
> + wakeup_flusher_threads(nr_pages);
> +}
> +
> static noinline void block_dump___mark_inode_dirty(struct inode *inode)
> {
> if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
> /*
> * Inode state bits. Protected by inode_lock.
> *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
> *
> * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
> #define I_DIRTY_SYNC 1
> #define I_DIRTY_DATASYNC 2
> #define I_DIRTY_PAGES 4
> +#define I_DIRTY_RECLAIM 256
> #define __I_NEW 3
> #define I_NEW (1 << __I_NEW)
> #define I_WILL_FREE 16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
> struct writeback_control *wbc);
> long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
> void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>
> /* writeback.h requires fs.h; it, too, is not included from here. */
> static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
> * laptop mode avoiding disk spin-ups
> */
> if (file && nr_dirty_seen && sc->may_writepage)
> - wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> + wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> + page_list);
>
> *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> list_add(&page->lru, &putback_list);
> }
>
> - wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> + wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> + &page_list);
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*