Mailing List Archive

ocf:heartbeat:apache resource agent and timeouts
Hi list,

I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
(Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
with the apache resource agent.

But first things first, my test setup:

root@node0:~# crm configure show
node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
primitive apache ocf:heartbeat:apache \
params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
op monitor interval="30" timeout="120" \
meta is-managed="false"
primitive site0ip ocf:heartbeat:IPaddr \
params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
primitive site1ip ocf:heartbeat:IPaddr \
params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
clone apacheClone apache
colocation bothips -100: site0ip site1ip
colocation site0 inf: site0ip apacheClone
colocation site1 inf: site1ip apacheClone
property $id="cib-bootstrap-options" \
dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="Heartbeat" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
last-lrm-refresh="1333391544" \
cluster-recheck-interval="15min"



One of the tests I did was to simulate a messed up apache (e.g. connection
limit reached):
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

Of course, this should produce a monitor timeout, which should mark the
apache as failed, and that's what happened.

However, recovery didn't work after I did
$ iptables -F

The problem, according to what I could figure out:
The apache resource agent
/usr/lib/ocf/resource.d/heartbeat/apache
does not have a timeout set for curl/wget. Curl has a default timeout of
about 3 minutes, wget may even retry up to 20 times and thus may
potentially take ages to time out.

Thus, the monitor operation did time out instead of wget (thus,
pacemaker thinks that the monitor itself has failed instead of the
service it is monitoring, which is semantically just plain wrong, IMHO).
Since the resource agent let the (still waiting) wget process hang
around practically forever, it also didn't notice when apache had
recovered (after iptables -F).


Bottom line:
I think the apache resource agent badly needs a timeout parameter which
is supplied to wget/curl and the documentation should make clear that
the current monitor timeout provided by pacemaker is not a substitute
for that (it cannot really be used to detect non-responsive web
servers). I only figured that out after extensive testing and finally
looking at the source, which took an awful lot of time.

After implementing a workaround:
WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
(added -T 5 -t 1) pacemaker and the apache resource behaved as expected
even when doing the iptables test above, and apache quickly recovers
after I do iptables -F.
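
(For completeness: if the agent uses curl instead of wget, a hedged
equivalent of this workaround would be curl's own timeout flags, e.g.

$ curl -o /dev/null -sS -L --connect-timeout 5 --max-time 5 \
      http://localhost/server-status

--connect-timeout and --max-time are standard curl options; where exactly
the agent picks them up depends on the agent version, so treat this as a
sketch rather than a documented interface.)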



On a side note:
The apache resource agent allows one to supply a config file, where one can
override the parameters for curl/wget. But the implementation here is
bogus, because even if you supply this file, it always does a default
test with default parameters first, so this is useless in this case...
(I consider this behavior to be a bug).

Side note II:
I did play a lot with on-fail=..., failure-timeout=,
cluster-recheck-interval=... Changing these values did not help, but
sometimes produced new weird behavior; in some cases pacemaker
didn't even notice that apache was unreachable...

Best regards,

David

--
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg@doodle.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On 04/03/2012 06:53 AM, David Gubler wrote:
> Hi list,
>
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> with the apache resource agent.
...
> Thus, the monitor operation did time out instead of wget (thus,
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).

If it's the same resource agent I saw back when: the one that fetches
/server-status, s/semantically// -- it's just plain wrong.

Basically, if it's running on the same node as apache, the kernel should
be smart enough to route it via lo even if you explicitly ask wget to
hit the cluster ip. So it's not checking if httpd is answering where you
actually care.

FWIW
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: ocf:heartbeat:apache resource agent and timeouts
Hi,

On Tue, Apr 03, 2012 at 01:53:41PM +0200, David Gubler wrote:
> Hi list,
>
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> with the apache resource agent.
>
> But first things first, my test setup:
>
> root@node0:~# crm configure show
> node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
> node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
> primitive apache ocf:heartbeat:apache \
> params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
> op monitor interval="30" timeout="120" \
> meta is-managed="false"
> primitive site0ip ocf:heartbeat:IPaddr \
> params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
> primitive site1ip ocf:heartbeat:IPaddr \
> params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
> clone apacheClone apache
> colocation bothips -100: site0ip site1ip
> colocation site0 inf: site0ip apacheClone
> colocation site1 inf: site1ip apacheClone
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="Heartbeat" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> last-lrm-refresh="1333391544" \
> cluster-recheck-interval="15min"
>
>
>
> One of the tests I did was to simulate a messed up apache (e.g. connection
> limit reached):
> $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
>
> Of course, this should produce a monitor timeout, which should mark the
> apache as failed, and that's what happened.
>
> However, recovery didn't work after I did
> $ iptables -F
>
> The problem, according to what I could figure out:
> The apache resource agent
> /usr/lib/ocf/resource.d/heartbeat/apache
> does not have a timeout set for curl/wget. Curl has a default timeout of
> about 3 minutes, wget may even retry up to 20 times and thus may
> potentially take ages to time out.
>
> Thus, the monitor operation did time out instead of wget (thus,
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).

The timeout is a timeout, wherever it happens.

> Since the resource agent let the (still waiting) wget process hang
> around practically forever, it also didn't notice when apache had
> recovered (after iptables -F).

So, you want the resource agent to notice while running monitor
that it can now talk to the server?

>
> Bottom line:
> I think the apache resource agent badly needs a timeout parameter which
> is supplied to wget/curl and the documentation should make clear that
> the current monitor timeout provided by pacemaker is not a substitute
> for that (it cannot really be used to detect non-responsive web
> servers). I only figured that out after extensive testing and finally
> looking at the source, which took an awful lot of time.
>
> After implementing a workaround:
> WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
> (added -T 5 -t 1) pacemaker and the apache resource behaved as expected
> even when doing the iptables test above, and apache quickly recovers
> after I do iptables -F.

Indeed in this case specifying a short timeout for the client
would speed things up. It should loop indefinitely in the
monitor op. We may accept a patch :)
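
Something along these lines, perhaps (just an untested sketch; the
function name and $STATUSURL are illustrative, not the agent's actual
code, and lrmd's op timeout would still bound the whole loop):

monitor_loop() {
    # retry the status check with a short per-attempt client timeout;
    # lrmd kills the whole operation when the op timeout expires
    while :; do
        if wget -O- -q -T 3 -t 1 "$STATUSURL" >/dev/null 2>&1; then
            return 0    # server answered: monitor succeeds
        fi
        sleep 1         # brief pause before the next attempt
    done
}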

> On a side note:
> The apache resource agent allows one to supply a config file, where one can
> override the parameters for curl/wget. But the implementation here is
> bogus, because even if you supply this file, it always does a default
> test with default parameters first, so this is useless in this case...
> (I consider this behavior to be a bug).

If you use a config test file, you'd need to define a monitor
with depth 10. The depth 0 monitor (default) is always testing
the statusurl.
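
For example (a sketch only; "depth" maps to the op's OCF_CHECK_LEVEL
attribute from the OCF spec, and the exact crm syntax for setting it may
vary between crmsh versions):

primitive apache ocf:heartbeat:apache \
params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
op monitor interval="30" timeout="120" OCF_CHECK_LEVEL="10"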

> Side note II:
> I did play a lot with on-fail=..., failure-timeout=,
> cluster-recheck-interval=... Changing these values did not help, but
> sometimes produced new weird behavior; in some cases pacemaker
> didn't even notice that apache was unreachable...

Cheers,

Dejan

> Best regards,
>
> David
>
> --
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: dg@doodle.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
Hi,

On Tue, Apr 03, 2012 at 01:18:39PM -0500, Dimitri Maziuk wrote:
> On 04/03/2012 06:53 AM, David Gubler wrote:
> > Hi list,
> >
> > I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> > (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> > with the apache resource agent.
> ...
> > Thus, the monitor operation did time out instead of wget (thus,
> > pacemaker thinks that the monitor itself has failed instead of the
> > service it is monitoring, which is semantically just plain wrong, IMHO).
>
> If it's the same resource agent I saw back when: the one that fetches
> /server-status, s/semantically// -- it's just plain wrong.

It may be wrong or not, that depends on what you need. If you
need deeper testing, use depth 10 and testurl or testconffile.

> Basically, if it's running on the same node as apache, the kernel should
> be smart enough to route it via lo even if you explicitly ask wget to
> hit the cluster ip. So it's not checking if httpd is answering where you
> actually care.

At any rate, tests are always running from the same host where
the server is running. If you want to test "where you actually
care", then you'd need to be a bit closer to your clients and
use another kind of monitoring.

Thanks,

Dejan

> FWIW
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On 04/04/2012 10:59 AM, Dejan Muhamedagic wrote:
> Hi,
... httpd monitor ...
> It may be wrong or not, that depends on what you need.

(Another questionable choice as I recall was to consider 4xx an error.
Dep. on what you need, you may want to treat only some 5xx codes as
errors b/c the others mean apache is up and answering properly.)

> At any rate, tests are always running from the same host where
> the server is running. If you want to test "where you actually
> care", then you'd need to be a bit closer to your clients and
> > use another kind of monitoring.

Which is precisely what I do: I monitor on the host with (more or less)
"lsof -i | grep httpd.+\*:http" and I've nagios in another subnet
checking the actual webpages. (Obviously, there's then gateways and
switches and all that...)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: ocf:heartbeat:apache resource agent and timeouts
Hi Dejan,

On 04.04.2012 17:56, Dejan Muhamedagic wrote:
> The timeout is a timeout, wherever it happens.

Unfortunately not! If the monitor operation times out and Pacemaker
moves on, the wget process (and thus the whole monitor process) will
keep running. In fact, it may still be running many minutes after the
timeout happened. And since the monitor (at least in the case of the apache
resource agent) can't be run twice in parallel, this effectively
prevents further monitor operations until wget has timed out. And that's
exactly where we get a problem.


> So, you want the resource agent to notice while running monitor
> that it can now talk to the server?
Yes, I want automatic recovery. The resource agent should notice when
apache is back and working again. And that works fine with a patched
apache resource agent.


>> On a side note:
>> The apache resource agent allows one to supply a config file, where one can
>> override the parameters for curl/wget. But the implementation here is
>> bogus, because even if you supply this file, it always does a default
>> test with default parameters first, so this is useless in this case...
>> (I consider this behavior to be a bug).
> If you use a config test file, you'd need to define a monitor
> with depth 10. The depth 0 monitor (default) is always testing
> the statusurl.
Yes, I figured that, but it's beside the point. If I use depth 10, it
will first do the simple (depth 0) test anyway (!), and after that the
advanced (depth 10) test. And since the simple test doesn't have a
useful timeout for wget, it will still stall for a long time if apache
doesn't respond, and it is irrelevant what the advanced test does.

@simple tests: Even though we run a complex web application behind
apache (apache acts as a load balancer using mod_jk), I don't want a
more complex test than fetching /server-status on localhost. This simple
test already shows that apache is working and has threads available for
clients to connect. Failover for our application servers is done by
mod_jk, I don't need Heartbeat/Pacemaker for that. Think of it as
independent failover at each layer: Virtual IPs with Heartbeat/Pacemaker
for failover between Apaches, mod_jk for failover between Tomcats,
mmm_monitor for failover between MySQL servers.


Best regards,

David
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Wed, Apr 04, 2012 at 11:21:15AM -0500, Dimitri Maziuk wrote:
> On 04/04/2012 10:59 AM, Dejan Muhamedagic wrote:
> > Hi,
> ... httpd monitor ...
> > It may be wrong or not, that depends on what you need.
>
> (Another questionable choice as I recall was to consider 4xx an error.
> Dep. on what you need, you may want to treat only some 5xx codes as
> errors b/c the others mean apache is up and answering properly.)

Isn't the status page always supposed to return, well, the
status?

For the depth 10 test, we may change the test (btw, it seems
like not many people are using that). Patches welcome :)

> > At any rate, tests are always running from the same host where
> > the server is running. If you want to test "where you actually
> > care", then you'd need to be a bit closer to your clients and
> > use another kind of monitoring.
>
> Which is precisely what I do: I monitor on the host with (more or less)
> "lsof -i | grep httpd.+\*:http"

How is that better than fetching the status page?

> and I have Nagios in another subnet
> checking the actual webpages. (Obviously, there are then gateways and
> switches and all that...)

Good!

Cheers,

Dejan

> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
Hi,

On Wed, Apr 04, 2012 at 07:42:12PM +0200, David Gubler wrote:
> Hi Dejan,
>
> On 04.04.2012 17:56, Dejan Muhamedagic wrote:
> > The timeout is a timeout, wherever it happens.
>
> Unfortunately not! If the monitor operation times out and Pacemaker
> moves on, the wget process (and thus the whole monitor process) will
> keep running. In fact, it may still be running many minutes after the
> timeout happened. And since the monitor (at least in the case of the apache
> resource agent) can't be run twice in parallel, this effectively
> prevents further monitor operations until wget has timed out. And that's
> exactly where we get a problem.

Hmm, the process running the monitor operation should be removed
(killed) by lrmd on timeout. If that doesn't happen, then you
just hit a jackpot bug!

> > So, you want the resource agent to notice while running monitor
> > that it can now talk to the server?
> Yes, I want automatic recovery. The resource agent should notice when
> apache is back and working again. And that works fine with a patched
> apache resource agent.

Hmm, I thought we were past this... and I still don't see the
patch :)

Cheers,

Dejan

> >> On a side note:
> >> The apache resource agent allows one to supply a config file, where one can
> >> override the parameters for curl/wget. But the implementation here is
> >> bogus, because even if you supply this file, it always does a default
> >> test with default parameters first, so this is useless in this case...
> >> (I consider this behavior to be a bug).
> > If you use a config test file, you'd need to define a monitor
> > with depth 10. The depth 0 monitor (default) is always testing
> > the statusurl.
> Yes, I figured that, but it's beside the point. If I use depth 10, it
> will first do the simple (depth 0) test anyway (!), and after that the
> advanced (depth 10) test. And since the simple test doesn't have a
> useful timeout for wget, it will still stall for a long time if apache
> doesn't respond, and it is irrelevant what the advanced test does.
>
> @simple tests: Even though we run a complex web application behind
> apache (apache acts as a load balancer using mod_jk), I don't want a
> more complex test than fetching /server-status on localhost. This simple
> test already shows that apache is working and has threads available for
> clients to connect. Failover for our application servers is done by
> mod_jk, I don't need Heartbeat/Pacemaker for that. Think of it as
> independent failover at each layer: Virtual IPs with Heartbeat/Pacemaker
> for failover between Apaches, mod_jk for failover between Tomcats,
> mmm_monitor for failover between MySQL servers.
>
>
> Best regards,
>
> David
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Tue, Apr 03, 2012 at 01:53:41PM +0200, David Gubler wrote:
> Hi list,
>
> I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
> (Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
> with the apache resource agent.
>
> But first things first, my test setup:
>
> root@node0:~# crm configure show
> node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
> node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
> primitive apache ocf:heartbeat:apache \
> params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
> op monitor interval="30" timeout="120" \
> meta is-managed="false"
> primitive site0ip ocf:heartbeat:IPaddr \
> params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
> primitive site1ip ocf:heartbeat:IPaddr \
> params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
> clone apacheClone apache
> colocation bothips -100: site0ip site1ip
> colocation site0 inf: site0ip apacheClone
> colocation site1 inf: site1ip apacheClone
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="Heartbeat" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> last-lrm-refresh="1333391544" \
> cluster-recheck-interval="15min"
>
>
>
> One of the tests I did was to simulate a messed up apache (e.g. connection
> limit reached):
> $ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

Uhm, "invalid test case".

rather try:
iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
or even
iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset

> Of course, this should produce a monitor timeout, which should mark the
> apache as failed, and that's what happened.
>
> However, recovery didn't work after I did
> $ iptables -F
>
> The problem, according to what I could figure out:
> The apache resource agent
> /usr/lib/ocf/resource.d/heartbeat/apache
> does not have a timeout set for curl/wget. Curl has a default timeout of
> about 3 minutes, wget may even retry up to 20 times and thus may
> potentially take ages to time out.
>
> Thus, the monitor operation did time out instead of wget (thus,
> pacemaker thinks that the monitor itself has failed instead of the
> service it is monitoring, which is semantically just plain wrong, IMHO).

But it does not make a difference, practically.
The monitor timed out. That's a fact.
So why not have it show up in the logs?

Pacemaker behaviour is just the same,
whether a monitor action "timed out", or "failed".

> Since the resource agent let the (still waiting) wget process hang
> around practically forever, it also didn't notice when apache had
> recovered (after iptables -F).

After the monitor action timed out or failed,
the recovery action by pacemaker would be to stop the service,
and restart it (there or elsewhere).

Did that not happen?

The start operation of the apache RA internally does monitor as well,
so it likely times out as well.

I'd expect the cluster to move the unresponsive apache to some other
node, after monitor and restart timed out. Which I think is the right
thing to do.

> Bottom line:
> I think the apache resource agent badly needs a timeout parameter which
> is supplied to wget/curl and the documentation should make clear that
> the current monitor timeout provided by pacemaker is not a substitute
> for that (it cannot really be used to detect non-responsive web
> servers).

Why not? I think it can.
Again: a timeout is a timeout, regardless of the level.

If you want shorter timeouts,
configure shorter timeouts on the monitor action.

But I'm not opposed to adding "--connection-timeout=..."
and equivalent to the command line of the test clients.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On 4/5/2012 10:09 AM, Dejan Muhamedagic wrote:
> On Wed, Apr 04, 2012 at 11:21:15AM -0500, Dimitri Maziuk wrote:
>> On 04/04/2012 10:59 AM, Dejan Muhamedagic wrote:
>>> Hi,
>> ... httpd monitor ...
>>> It may be wrong or not, that depends on what you need.
>>
>> (Another questionable choice as I recall was to consider 4xx an error.
>> Dep. on what you need, you may want to treat only some 5xx codes as
>> errors b/c the others mean apache is up and answering properly.)
>
> Isn't the status page always supposed to return, well, the
> status?

No. From http://httpd.apache.org/docs/2.2/mod/mod_status.html:
"The Status module allows a server administrator to find out how well
their server is performing. A HTML page is presented that gives the
current server statistics in an easily readable form."

What the cluster should monitor for is that httpd is running and answering
requests. For that purpose a "404 Not Found" is a success (while a "500
Internal Server Error" probably isn't).

"How well the server is performing" implies that it's up and running, so
it's a valid test -- in the same sense that counting ice cubes in the
freezer compartment is a valid test to see if your fridge is working.

>> Which is precisely what I do: I monitor on the host with (more or less)
>> "lsof -i | grep httpd.+\*:http"
>
> How is that better than fetching the status page?

It's not. It tells me httpd is running and is bound to ip/port. It does
not tell me if the port is reachable on a particular ip (but neither
does the resource agent) or whether it serves what it's supposed to
serve (but again neither does the agent since in my config
/server-status is *supposed* to return 404).
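
(Spelled out, the check amounts to something like this; the pattern is
approximate, as I said:

if lsof -i -P -n | grep -Eq 'httpd.*LISTEN'; then
    exit 0  # httpd holds a listening socket
else
    exit 1  # nothing bound: treat as failed
fi

With -P lsof prints numeric ports, so this matches on the LISTEN state
rather than the "http" service name.)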

The point is that if you tested your apache & iptables setup and httpd's
up and bound to the port, the reasons it wouldn't answer requests are
pretty much limited to DOS on the server or on the connection. And since
the resource agent isn't really monitoring those either, the only
practical difference is that my check has fewer breakable pieces.

Dimitri
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Thu, Apr 05, 2012 at 11:54:44AM -0500, Dimitri Maziuk wrote:
> On 4/5/2012 10:09 AM, Dejan Muhamedagic wrote:
> > On Wed, Apr 04, 2012 at 11:21:15AM -0500, Dimitri Maziuk wrote:
> >> On 04/04/2012 10:59 AM, Dejan Muhamedagic wrote:
> >>> Hi,
> >> ... httpd monitor ...
> >>> It may be wrong or not, that depends on what you need.
> >>
> >> (Another questionable choice as I recall was to consider 4xx an error.
> >> Dep. on what you need, you may want to treat only some 5xx codes as
> >> errors b/c the others mean apache is up and answering properly.)
> >
> > Isn't the status page always supposed to return, well, the
> > status?
>
> No. From http://httpd.apache.org/docs/2.2/mod/mod_status.html:
> "The Status module allows a server administrator to find out how well
> their server is performing. A HTML page is presented that gives the
> current server statistics in an easily readable form."

Oh, darn, I did mean that when I said "status." And the point was
that it should _always_ yield some valid html.

> What the cluster should monitor for is that httpd is running and answering
> requests. For that purpose a "404 Not Found" is a success (while a "500
> Internal Server Error" probably isn't).

OK, so getting /server-status doesn't mean that httpd is running
and answering requests?

> "How well the server is performing" implies that it's up and running, so
> it's a valid test -- in the same sense that counting ice cubes in the
> freezer compartment is a valid test to see if your fridge is working.

Oh, well... and isn't it?

> >> Which is precisely what I do: I monitor on the host with (more or less)
> >> "lsof -i | grep httpd.+\*:http"
> >
> > How is that better than fetching the status page?
>
> It's not. It tells me httpd is running and is bound to ip/port. It does
> not tell me if the port is reachable on a particular ip (but neither
> does the resource agent) or whether it serves what it's supposed to
> serve (but again neither does the agent since in my config
> /server-status is *supposed* to return 404).
>
> The point is that if you tested your apache & iptables setup and http's
> up and bound to the port, the reasons it wouldn't answer requests are
> pretty much limited to DOS on the server or on the connection. And since
> the resource agent isn't really monitoring those either, the only
> practical difference is that my check has fewer breakable pieces.

And that you have yet another piece of software to install and
take care of :) But I'm sure you know best.

Cheers,

Dejan


> Dimitri
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On 04/06/2012 06:15 AM, Dejan Muhamedagic wrote:
> On Thu, Apr 05, 2012 at 11:54:44AM -0500, Dimitri Maziuk wrote:

>> "How well the server is performing" implies that it's up and running, so
>> it's a valid test -- in the same sense that counting ice cubes in the
>> freezer compartment is a valid test to see if your fridge is working.
>
> Oh, well... and isn't it?

The converse isn't true: having zero ice cubes in the freezer does *not*
mean the fridge *isn't* working.

Shooting the node because it doesn't get to
http://127.0.0.1/server-status within some number of seconds may be
exactly what the user wants. Just as long as the user understands how
that relates to his actual webpages served on his cluster ip.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: ocf:heartbeat:apache resource agent and timeouts
But sometimes it should be changed to query the cluster IP, as
sometimes there is no listener on localhost.

best regards,
thomas

Sent from my tiriPhone.


On 06.04.2012 at 19:18, Dimitri Maziuk <dmaziuk@bmrb.wisc.edu> wrote:

> On 04/06/2012 06:15 AM, Dejan Muhamedagic wrote:
>> On Thu, Apr 05, 2012 at 11:54:44AM -0500, Dimitri Maziuk wrote:
>
>>> "How well the server is performing" implies that it's up and running, so
>>> it's a valid test -- in the same sense that counting ice cubes in the
>>> freezer compartment is a valid test to see if your fridge is working.
>>
>> Oh, well... and isn't it?
>
> The converse isn't true: having zero ice cubes in the freezer does *not*
> mean the fridge *isn't* working.
>
> Shooting the node because it doesn't get to
> http://127.0.0.1/server-status within some number of seconds may be
> exactly what the user wants. Just as long as the user understands how
> that relates to his actual webpages served on his cluster ip.
>
> --
> Dimitri Maziuk
> Programmer/sysadmin
> BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> Hmm, the process running the monitor operation should be removed
> (killed) by lrmd on timeout. If that doesn't happen, then you
> just hit a jackpot bug!

Ok, that's crucial information I've been missing, and thus I
misinterpreted my test results. Back to square one...

TEST 1: *Unpatched* Apache resource agent with this configuration:

root@node2:/etc/ha.d# crm configure show
node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
primitive apache ocf:heartbeat:apache \
params statusurl="http://localhost/server-status" \
op monitor interval="15s" timeout="5s" \
meta is-managed="false"
clone apacheClone apache
property $id="cib-bootstrap-options" \
dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1333886776"


crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged)
Thus all is well.

Now I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

After a few seconds, crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1
(unmanaged) FAILED
Failed actions:
apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed
Out): unknown exec error
Using ps aux, I can see that the monitor and wget are started every 15s,
run up to the timeout, and are then killed, just as you said. So far
so good.

Now I remove the iptables rule:
$ iptables -F

But no matter how long I wait, Pacemaker *doesn't* notice that Apache is
back! Even though the monitor is definitely executed (I can see the
request in Apache's log file). Also, crm_mon keeps saying
Failed actions:
apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed
Out): unknown exec error
The counters don't change (!)

If I manually do
$ crm resource cleanup apacheClone
then everything is fine again.



TEST 2: *Patched* Apache resource agent with the same configuration.
root@node1:/usr/lib/ocf/resource.d/heartbeat# diff apache apache.orig
66c66
< WGETOPTS="-O- -q -L --no-proxy -T 3 -t 1 --bind-address=127.0.0.1"
---
> WGETOPTS="-O- -q -L --no-proxy --bind-address=127.0.0.1"
So all I did was add two options to wget's command line.

Again, crm_mon shows that all is well.
Again I do
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP

Now crm_mon shows
Clone Set: apacheClone [apache]
apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
apache:1 (ocf::heartbeat:apache): Started node1
(unmanaged) FAILED
Failed actions:
apache:0_monitor_15000 (node=node1, call=13, rc=1,
status=complete): unknown error
NOTE: The "Failed actions" are different from the test before!

Now I remove the iptables rule:
$ iptables -F

After a few seconds, the clone set is back to working state.



Thus, what I'm seeing here:

It does make a difference to Pacemaker whether the monitor operation
returns failure or times out.

Monitor times out:
* apache:0_monitor_15000 (node=node1, call=9, rc=-2, status=Timed Out):
unknown exec error
* Monitor operation and wget both get killed when the timeout happens
(just as they should)
* Monitor operation keeps getting executed (and presumably returns
success), but this is ignored (!) by Pacemaker

Monitor returns failure (due to wget's timeout):
* apache:0_monitor_15000 (node=node1, call=13, rc=1, status=complete):
unknown error
* Monitor operation and wget don't need to be killed, because they time
out and complete before the whole monitor operation times out
* Monitor operation keeps getting executed, and on first success
Pacemaker notices and puts apache back into working state


The big question here is: Is this a bug in Pacemaker or by design?


> Hmm, I thought we were past this... and I still don't see the
> patch :)
I'm still not sure what the actual problem is. Currently I feel like
it's a bug in Pacemaker, and my "fix" for the apache resource agent is
just fighting symptoms.

Sorry for the confusion - This Heartbeat/Pacemaker thing is very hard to
understand.

Best regards,

David
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
Hi Lars,

On 05.04.2012 18:53, Lars Ellenberg wrote:
> Uhm, "invalid test case".
>
> rather try:
> iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
> or even
> iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset
Yes, then it works, but that's not surprising, because in this case the
operations return immediately and never time out. But why should a
non-responsive apache be an invalid test case? We've reached apache's
connection limit more than once, and from the client's point of view
this produces a very similar effect to '-j DROP'.


> Pacemaker behaviour is just the same,
> whether a monitor action "timed out", or "failed".

I've come to the conclusion that this just isn't true, please see my
other mail, I've listed all the steps I did in detail.


>
> After the monitor action timed out or failed,
> the recovery action by pacemaker would be to stop the service,
> and restart it (there or elsewhere).
>
> Did that not happen?
>
> The start operation of the apache RA internally does monitor as well,
> so it likely times out as well.
>
> I'd expect the cluster to move the unresponsive apache to some other
> node, after monitor and restart timed out. Which I think is the right
> thing to do.

I'm using unmanaged resources, because for our application there's no
point in having Pacemaker shut down apache (apache can be used on all
hosts in parallel and without restrictions). So no stop/start for us.

Best regards,

David

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Sun, Apr 08, 2012 at 03:03:58PM +0200, David Gubler wrote:
> On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> > Hmm, the process running the monitor operation should be removed
> > (killed) by lrmd on timeout. If that doesn't happen, then you
> > just hit a jackpot bug!
>
> Ok, that's crucial information I've been missing, and thus I
> misinterpreted my test results. Back to square one...
>
> TEST 1: *Unpatched* Apache resource agent with this configuration:
>
> root@node2:/etc/ha.d# crm configure show
> node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
> node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
> primitive apache ocf:heartbeat:apache \
> params statusurl="http://localhost/server-status" \
> op monitor interval="15s" timeout="5s" \
> meta is-managed="false"
> clone apacheClone apache
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1333886776"
>
>
> crm_mon shows
> Clone Set: apacheClone [apache]
> apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
> apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged)
> Thus all is well.

Nothing is well.
They are "unmanaged" already ...
Which means the cluster will still attempt to monitor for changes,
but will not take action.


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Sun, Apr 08, 2012 at 03:16:17PM +0200, David Gubler wrote:
> Hi Lars,
>
> On 05.04.2012 18:53, Lars Ellenberg wrote:
> > Uhm, "invalid test case".
> >
> > rather try:
> > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
> > or even
> > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset
> Yes, then it works, but that's not surprising, because in this case the
> operations return immediately and never time out. But why should a
> non-responsive apache be an invalid test case? We've reached apache's
> connection limit more than once, and from the client's point of view
> this produces a very similar effect to '-j DROP'.
>
>
> > Pacemaker behaviour is just the same,
> > whether a monitor action "timed out", or "failed".
>
> I've come to the conclusion that this just isn't true, please see my
> other mail, I've listed all the steps I did in detail.
>
>
> >
> > After the monitor action timed out or failed,
> > the recovery action by pacemaker would be to stop the service,
> > and restart it (there or elsewhere).
> >
> > Did that not happen?
> >
> > The start operation of the apache RA internally does monitor as well,
> > so it likely times out as well.
> >
> > I'd expect the cluster to move the unresponsive apache to some other
> > node, after monitor and restart timed out. Which I think is the right
> > thing to do.
>
> I'm using unmanaged resources, because for our application there's no
> point in having Pacemaker shut down apache (apache can be used on all
> hosts in parallel and without restrictions). So no stop/start for us.

Right. So the resources are not managed.
Did you mention that before?

I won't argue with that; if you think that is how it should be, so be it.

Pacemaker does not monitor resources that are supposed to
be stopped to see whether they "revive on their own".
Not by default, at least.

I suggest you add a "monitor" action for "role=Stopped"
(with a different interval!)

So the better subject would have been
How to configure Pacemaker to monitor (unmanaged) stopped resources
in case they resurrect on their own?

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: ocf:heartbeat:apache resource agent and timeouts
On Thu, Apr 12, 2012 at 12:06:54PM +0200, Lars Ellenberg wrote:
> On Sun, Apr 08, 2012 at 03:16:17PM +0200, David Gubler wrote:
> > Hi Lars,
> >
> > On 05.04.2012 18:53, Lars Ellenberg wrote:
> > > Uhm, "invalid test case".
> > >
> > > rather try:
> > > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
> > > or even
> > > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset
> > Yes, then it works, but that's not surprising, because in this case the
> > operations return immediately and never time out. But why should a
> > non-responsive apache be an invalid test case? We've reached apache's
> > connection limit more than once, and from the client's point of view
> > this produces a very similar effect to '-j DROP'.
> >
> >
> > > Pacemaker behaviour is just the same,
> > > whether a monitor action "timed out", or "failed".
> >
> > I've come to the conclusion that this just isn't true, please see my
> > other mail, I've listed all the steps I did in detail.
> >
> >
> > >
> > > After the monitor action timed out or failed,
> > > the recovery action by pacemaker would be to stop the service,
> > > and restart it (there or elsewhere).
> > >
> > > Did that not happen?
> > >
> > > The start operation of the apache RA internally does monitor as well,
> > > so it likely times out as well.
> > >
> > > I'd expect the cluster to move the unresponsive apache to some other
> > > node, after monitor and restart timed out. Which I think is the right
> > > thing to do.
> >
> > I'm using unmanaged resources, because for our application there's no
> > point in having Pacemaker shut down apache (apache can be used on all
> > hosts in parallel and without restrictions). So no stop/start for us.
>
> Right. So the resources are not managed.
> Did you mention that before?

Hm. So you did. Guess my auto-correction while reading dropped that line...

primitive apache ocf:heartbeat:apache \
params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
op monitor interval="30" timeout="20" \
op monitor interval="31" timeout="20" role=Stopped \
meta is-managed="false"


I think that "monitor role=Stopped" thing works for primitives.
It may work for clones, I'd have to double check that.
iirc, it does not work for ms resources.
At least not last time I checked.

> I won't argue with that; if you think that is how it should be, so be it.
>
> Pacemaker does not monitor resources that are supposed to
> be stopped to see whether they "revive on their own".
> Not by default, at least.
>
> I suggest you add a "monitor" action for "role=Stopped"
> (with a different interval!)
>
> So the better subject would have been
> How to configure Pacemaker to monitor (unmanaged) stopped resources

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems