Mailing List Archive

[PATCH 2.0.5 futex] Fix to Children stuck on futex problem
(Patch and system details at bottom)

Hi all. I've root-caused and written a patch for the children stuck on
futex problem described by both Sean Thorne in 2009 and Max Barry (who I
work with) in 2011.

The core of the problem is that modperl_tipool_putback_base only
broadcasts that there are more interpreters available when there were no
available interpreters prior to this putback. While this makes sense, it
can create a problem.

Notation:
A: Acquire an interpreter
P: Putback an interpreter
B: Broadcast a free intepreter (really a signal)
W: Wait on condition tipool->available (for free interpreter)
(x,y): x is number of free interpreters at this point. y is the number
in use.
The number at the beginning of a line is the thread number
Each line occurs within a single critical section (on mutex tipool->tiplock)

Expected behavior:
4 threads, 2 free interpreters
1: A (1,1)
2: A (2,0)
3: W
4: W
1: P (1,1) B
3: A (2,0)
2: P (1,1) B
4: A (2,0)
3: P (1,1) B
4: P (0,2) <-- No broadcast because there was an available interpreter
prior to this putback.

Broken behavior:
4 threads, 2 free interpreters
1: A (1,1)
2: A (2,0)
3: W
4: W
1: P (1,1) B
2: P (0,2) <-- No broadcast because there was an available interpreter
prior to this putback.
3: A (1,1)
3: P (0,2) <-- No broadcast because there was an available interpreter
prior to this putback.
(Broken)

Thread 4 will never be signaled to pick up an interpreter. This results
in the thread getting stuck on futex because sooner or later, apache
will tell this worker to die (due to MaxRequestsPerChild). So, the
parent thread will wait on the child threads joining, but one or more
child threads will never wake up due to this problem.

My proposed fix is to always broadcast the availability of an
interpreter, regardless of whether there were already any free. This
change passes all tests that I have found to throw at it as well as no
longer deadlocking when reproducing the problem according to Max's
instructions (http://pastebin.com/YDbmq84w).

My System Details:
uname -a: Linux modperl 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11
03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
Apache: Custom build of 2.2.20 with ubuntu patches
(http://packages.ubuntu.com/source/oneiric/apache2)
modperl: Custom build of 2.0.5 with ubuntu patches
(http://packages.ubuntu.com/source/oneiric/libapache2-mod-perl2)
Build process: Standard ubuntu build process with following flags set:
DEB_BUILD_OPTIONS="nostrip parallel=2 debug"
CFLAGS="-g -O2 -DMP_TRACE=1 -DPERL_DESTRUCT_LEVEL=2 -DMP_DEBUG=1
-UMP_USE_GTOP -I/usr/include/libgtop-2.0/ -I/usr/include/glib-2.0/
-I/usr/lib/x86_64-linux-gnu/glib-2.0/include/"

Patch:
--- src/modules/perl/modperl_tipool.c.old 2012-03-03
19:43:57.112152297 -0800
+++ src/modules/perl/modperl_tipool.c 2012-03-03 04:28:31.000000000 -0800
@@ -328,9 +328,9 @@
MP_TRACE_i(MP_FUNC, "0x%lx now available (%d in use, %d running)",
(unsigned long)listp->data, tipool->in_use, tipool->size);

+ modperl_tipool_broadcast(tipool);
if (tipool->in_use == (tipool->cfg->max - 1)) {
/* hurry up, another thread may be blocking */
- modperl_tipool_broadcast(tipool);
modperl_tipool_unlock(tipool);
return;
}


Please let me know how best to get this checked in and out. As you might
imagine, this futex problem has been causing us quite a few headaches :-)

Greg Rubin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
Seems reasonable to me. It seems like it might be a good idea to fold
this into 2.0.6 RC3 to me. Thoughts?

Adam


On 12-03-04 01:29 PM, SalusaSecondus wrote:
> (Patch and system details at bottom)
>
<snip>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
Hi Salusa,

Would you or Max be able to construct a unit test that demonstrates
this failure condition, and then success once the patch is applied?
There should be some example tests in the t/ directory which you can
draw on for inspiration.

On Sun, Mar 4, 2012 at 10:29 AM, SalusaSecondus <salusa@nationstates.net> wrote:
> (Patch and system details at bottom)
>
> Hi all. I've root-caused and written a patch for the children stuck on
> futex problem described by both Sean Thorne in 2009 and Max Barry (who I
> work with) in 2011.
>
> The core of the problem is that modperl_tipool_putback_base only
> broadcasts that there are more interpreters available when there were no
> available interpreters prior to this putback. While this makes sense, it
> can create a problem.
>
> Notation:
> A: Acquire an interpreter
> P: Putback an interpreter
> B: Broadcast a free intepreter (really a signal)
> W: Wait on condition tipool->available (for free interpreter)
> (x,y): x is number of free interpreters at this point. y is the number
> in use.
> The number at the beginning of a line is the thread number
> Each line occurs within a single critical section (on mutex tipool->tiplock)
>
> Expected behavior:
> 4 threads, 2 free interpreters
> 1: A (1,1)
> 2: A (2,0)
> 3: W
> 4: W
> 1: P (1,1) B
> 3: A (2,0)
> 2: P (1,1) B
> 4: A (2,0)
> 3: P (1,1) B
> 4: P (0,2) <-- No broadcast because there was an available interpreter
> prior to this putback.
>
> Broken behavior:
> 4 threads, 2 free interpreters
> 1: A (1,1)
> 2: A (2,0)
> 3: W
> 4: W
> 1: P (1,1) B
> 2: P (0,2) <-- No broadcast because there was an available interpreter
> prior to this putback.
> 3: A (1,1)
> 3: P (0,2) <-- No broadcast because there was an available interpreter
> prior to this putback.
> (Broken)
>
> Thread 4 will never be signaled to pick up an interpreter. This results
> in the thread getting stuck on futex because sooner or later, apache
> will tell this worker to die (due to MaxRequestsPerChild). So, the
> parent thread will wait on the child threads joining, but one or more
> child threads will never wake up due to this problem.
>
> My proposed fix is to always broadcast the availability of an
> interpreter, regardless of whether there were already any free. This
> change passes all tests that I have found to throw at it as well as no
> longer deadlocking when reproducing the problem according to Max's
> instructions (http://pastebin.com/YDbmq84w).
>
> My System Details:
> uname -a: Linux modperl 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11
> 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
> Apache: Custom build of 2.2.20 with ubuntu patches
> (http://packages.ubuntu.com/source/oneiric/apache2)
> modperl: Custom build of 2.0.5 with ubuntu patches
> (http://packages.ubuntu.com/source/oneiric/libapache2-mod-perl2)
> Build process: Standard ubuntu build process with following flags set:
> DEB_BUILD_OPTIONS="nostrip parallel=2 debug"
> CFLAGS="-g -O2 -DMP_TRACE=1 -DPERL_DESTRUCT_LEVEL=2 -DMP_DEBUG=1
> -UMP_USE_GTOP -I/usr/include/libgtop-2.0/ -I/usr/include/glib-2.0/
> -I/usr/lib/x86_64-linux-gnu/glib-2.0/include/"
>
> Patch:
> --- src/modules/perl/modperl_tipool.c.old       2012-03-03
> 19:43:57.112152297 -0800
> +++ src/modules/perl/modperl_tipool.c   2012-03-03 04:28:31.000000000 -0800
> @@ -328,9 +328,9 @@
>     MP_TRACE_i(MP_FUNC, "0x%lx now available (%d in use, %d running)",
>                (unsigned long)listp->data, tipool->in_use, tipool->size);
>
> +    modperl_tipool_broadcast(tipool);
>     if (tipool->in_use == (tipool->cfg->max - 1)) {
>         /* hurry up, another thread may be blocking */
> -        modperl_tipool_broadcast(tipool);
>         modperl_tipool_unlock(tipool);
>         return;
>     }
>
>
> Please let me know how best to get this checked in and out. As you might
> imagine, this futex problem has been causing us quite a few headaches :-)
>
> Greg Rubin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
> For additional commands, e-mail: dev-help@perl.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
Adam,

We'd love to see this in 2.0.6. This has caused us lots of server problems.

Fred,

I'll take a look at the tests when I get home to see what I can find.
Do you know of any specific tests which deal with multi-threading or
potential deadlocks?

Greg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
On Tue, Mar 6, 2012 at 11:12 AM, Salusa Secondus
<salusa@nationstates.net> wrote:
> Fred,
>
> I'll take a look at the tests when I get home to see what I can find.
> Do you know of any specific tests which deal with multi-threading or
> potential deadlocks?

Torsten, Steve H., and Gozer know that part of the code a lot better
than I do, but here's what I found which might be useful:

$ find t -name '*thread*' | grep -v svn
t/apr/threadmutex.t
t/apr/threadrwlock.t
t/apr-ext/threadmutex.t
t/apr-ext/threadrwlock.t
t/lib/TestAPRlib/threadmutex.pm
t/lib/TestAPRlib/threadrwlock.pm
t/perl/ithreads.t
t/perl/ithreads2.t
t/perl/ithreads_args.t
t/perl/ithreads_eval.t
t/response/TestAPR/threadmutex.pm
t/response/TestAPR/threadrwlock.pm
t/response/TestPerl/ithreads.pm
t/response/TestPerl/ithreads_args.pm
t/response/TestPerl/ithreads_eval.pm

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
I've looked through those tests, and none seem to apply and I'm not sure
how one would even go about writing a unit test for this. We haven't
even been able to reproduce the problem on all systems.

I suspect the best we can do may simply be manual load tests (such as
Max and I have done) showing that this patch works.

If someone have a good strategy for how to approach this, I'd love to
see it since I'd much prefer to have a test. I just don't know how here.

Greg

On 3/6/2012 4:29 PM, Fred Moyer wrote:
> On Tue, Mar 6, 2012 at 11:12 AM, Salusa Secondus
> <salusa@nationstates.net> wrote:
>> Fred,
>>
>> I'll take a look at the tests when I get home to see what I can find.
>> Do you know of any specific tests which deal with multi-threading or
>> potential deadlocks?
> Torsten, Steve H., and Gozer know that part of the code a lot better
> than I do, but here's what I found which might be useful:
>
> $ find t -name '*thread*' | grep -v svn
> t/apr/threadmutex.t
> t/apr/threadrwlock.t
> t/apr-ext/threadmutex.t
> t/apr-ext/threadrwlock.t
> t/lib/TestAPRlib/threadmutex.pm
> t/lib/TestAPRlib/threadrwlock.pm
> t/perl/ithreads.t
> t/perl/ithreads2.t
> t/perl/ithreads_args.t
> t/perl/ithreads_eval.t
> t/response/TestAPR/threadmutex.pm
> t/response/TestAPR/threadrwlock.pm
> t/response/TestPerl/ithreads.pm
> t/response/TestPerl/ithreads_args.pm
> t/response/TestPerl/ithreads_eval.pm


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
The bug is a race condition, so as you say, I don't really think it's
definitively testable.

Adam

On 3/6/2012 10:19 PM, SalusaSecondus wrote:
> I've looked through those tests, and none seem to apply and I'm not sure
> how one would even go about writing a unit test for this. We haven't
> even been able to reproduce the problem on all systems.
>
> I suspect the best we can do may simply be manual load tests (such as
> Max and I have done) showing that this patch works.
>
> If someone have a good strategy for how to approach this, I'd love to
> see it since I'd much prefer to have a test. I just don't know how here.
>
> Greg
>
> On 3/6/2012 4:29 PM, Fred Moyer wrote:
>> On Tue, Mar 6, 2012 at 11:12 AM, Salusa Secondus
>> <salusa@nationstates.net> wrote:
>>> Fred,
>>>
>>> I'll take a look at the tests when I get home to see what I can find.
>>> Do you know of any specific tests which deal with multi-threading or
>>> potential deadlocks?
>> Torsten, Steve H., and Gozer know that part of the code a lot better
>> than I do, but here's what I found which might be useful:
>>
>> $ find t -name '*thread*' | grep -v svn
>> t/apr/threadmutex.t
>> t/apr/threadrwlock.t
>> t/apr-ext/threadmutex.t
>> t/apr-ext/threadrwlock.t
>> t/lib/TestAPRlib/threadmutex.pm
>> t/lib/TestAPRlib/threadrwlock.pm
>> t/perl/ithreads.t
>> t/perl/ithreads2.t
>> t/perl/ithreads_args.t
>> t/perl/ithreads_eval.t
>> t/response/TestAPR/threadmutex.pm
>> t/response/TestAPR/threadrwlock.pm
>> t/response/TestPerl/ithreads.pm
>> t/response/TestPerl/ithreads_args.pm
>> t/response/TestPerl/ithreads_eval.pm
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
On Sunday, 04 March 2012 10:29:45 SalusaSecondus wrote:
> Broken behavior:
> 4 threads, 2 free interpreters
> 1: A (1,1)
> 2: A (2,0)
> 3: W
> 4: W
> 1: P (1,1) B
> 2: P (0,2) <-- No broadcast because there was an available interpreter
> prior to this putback.
> 3: A (1,1)
> 3: P (0,2) <-- No broadcast because there was an available interpreter
> prior to this putback.
> (Broken)

I am not quite back to full working condition yet. So, I can't test the patch.

Your analysis seems correct, good catch, thanks! I was wondering about that
piece of code, too, when I saw it first. I think the original author of the
function forgot about the signaling before the unlock operation at the end of
the function. Your patch is better because it does not matter where the signal
is sent in a critical section. So, is saves one line of code.

To conceive a test case is quite complex but feasible and worth the effort.
The way I'd try to go is:

1) introduce a new vhost with a separate interp pool with exactly 2
interpreters

2) implement a handler that simply sleeps (select undef, undef, undef,
$timeout) for a short period of time (~0.5 sec)

3) write a test case that issues 4 parallel requests as described above by
SalusaSecondus. You won't be able to use Apache::TestRequest here or even LWP.
If you put a small delay between the requests, say 0.1 sec, then you should
even be able to predict which request will hang.

4) see what happens. With the broken system one of the requests should hang
forever. Ideally, the test client would abort the request and continue after a
few seconds.

Thanks again,
Torsten Förtsch

--
Need professional modperl support? Hire me! (http://foertsch.name)

Like fantasy? http://kabatinte.net


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org
Re: [PATCH 2.0.5 futex] Fix to Children stuck on futex problem [ In reply to ]
On Wednesday, 07 March 2012 11:15:36 Torsten Förtsch wrote:
> On Sunday, 04 March 2012 10:29:45 SalusaSecondus wrote:
> > Broken behavior:
> > 4 threads, 2 free interpreters
> > 1: A (1,1)
> > 2: A (2,0)
> > 3: W
> > 4: W
> > 1: P (1,1) B
> > 2: P (0,2) <-- No broadcast because there was an available interpreter
> > prior to this putback.
> > 3: A (1,1)
> > 3: P (0,2) <-- No broadcast because there was an available interpreter
> > prior to this putback.
> > (Broken)
>
> Your analysis seems correct, good catch, thanks!

I committed your change as revision 1299429, thanks.

I decided not to construct a test case because the implementation seemed the
buggier the more I thought about it. Further, even if I could conceive a test
case what would it check? Only the fact that nobody reverts the change because
it could only be quite artificial. That would be ridiculous.

Torsten Förtsch

--
Need professional modperl support? Hire me! (http://foertsch.name)

Like fantasy? http://kabatinte.net


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@perl.apache.org
For additional commands, e-mail: dev-help@perl.apache.org