Hi list,
I've been experimenting with Heartbeat/Pacemaker on Ubuntu 11.10
(Pacemaker 1.1.5 and Heartbeat 3.0.5) and I have hit a very nasty issue
with the apache resource agent.
But first things first, my test setup:
root@node0:~# crm configure show
node $id="5a46c3c9-1f1e-45ad-9eb4-ebf216734d97" node1
node $id="9270b333-9056-4560-8ca2-9f878b1f8966" node0
primitive apache ocf:heartbeat:apache \
params testconffile="/etc/ha.d/doodletest.pm" testname="doodle"\
op monitor interval="30" timeout="120" \
meta is-managed="false"
primitive site0ip ocf:heartbeat:IPaddr \
params ip="192.168.88.90" cidr_netmask="255.255.255.0" nic="eth0"
primitive site1ip ocf:heartbeat:IPaddr \
params ip="192.168.88.91" cidr_netmask="255.255.255.0" nic="eth0"
clone apacheClone apache
colocation bothips -100: site0ip site1ip
colocation site0 inf: site0ip apacheClone
colocation site1 inf: site1ip apacheClone
property $id="cib-bootstrap-options" \
dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="Heartbeat" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
last-lrm-refresh="1333391544" \
cluster-recheck-interval="15min"
One of the tests I did was to simulate a messed-up apache (e.g. connection
limit reached):
$ iptables -I INPUT -p tcp --dport 80 -i lo -j DROP
Of course, this should produce a monitor timeout, which should mark the
apache as failed, and that's what happened.
However, recovery didn't work after I did
$ iptables -F
The problem, according to what I could figure out:
The apache resource agent
/usr/lib/ocf/resource.d/heartbeat/apache
does not set a timeout for curl/wget. Curl has a default timeout of
about 3 minutes; wget may even retry up to 20 times and can thus
take ages to time out.
As a result, the monitor operation itself timed out before wget did, so
pacemaker thinks that the monitor has failed rather than the service it
is monitoring, which is semantically just plain wrong, IMHO.
Since the resource agent let the (still waiting) wget process hang
around practically forever, it also didn't notice when apache had
recovered (after iptables -F).
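To illustrate the distinction, here is a minimal sketch (not the agent's
actual code) of a probe that carries its own hard deadline, so a hung
client is killed and reported as a failed service instead of outliving
the monitor operation. The function name bounded_probe is hypothetical;
it assumes timeout(1) from GNU coreutils is installed, and it uses the
standard OCF return codes 0 (success) and 1 (generic error).

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical wrapper: run a probe command with a hard deadline.
# Usage: bounded_probe SECONDS COMMAND [ARGS...]
bounded_probe() {
    probe_deadline="$1"
    shift
    # timeout(1) kills the command after the deadline and exits with 124.
    timeout "$probe_deadline" "$@" >/dev/null 2>&1
    probe_rc=$?
    if [ "$probe_rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    # Either the probe itself failed or it was killed at the deadline;
    # in both cases the service did not answer in time.
    return $OCF_ERR_GENERIC
}
```

Something like "bounded_probe 5 wget -O- -q http://localhost/" would then
return within about 5 seconds, well inside a 120 second monitor timeout.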
Bottom line:
I think the apache resource agent badly needs a timeout parameter which
is supplied to wget/curl and the documentation should make clear that
the current monitor timeout provided by pacemaker is not a substitute
for that (it cannot really be used to detect non-responsive web
servers). I only figured that out after extensive testing and finally
looking at the source, which took an awful lot of time.
After implementing a workaround:
WGETOPTS="-O- -q -L -T 5 -t 1 --no-proxy --bind-address=127.0.0.1"
(added -T 5 -t 1) pacemaker and the apache resource behaved as expected
even when doing the iptables test above, and apache quickly recovers
after I do iptables -F.
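As a rough sketch of how the hard-coded values could become a resource
parameter: OCF agents receive their parameters as OCF_RESKEY_<name>
environment variables, so the options could be built like this. The
parameter names probe_timeout and probe_tries and the function name
wget_opts are hypothetical; nothing like them exists in the shipped agent.

```shell
#!/bin/sh
# Hypothetical helper: build the wget options from two made-up resource
# parameters, falling back to the values used in the workaround.
wget_opts() {
    # -T: timeout in seconds, -t: number of tries (1 = no retries)
    opt_timeout="${OCF_RESKEY_probe_timeout:-5}"
    opt_tries="${OCF_RESKEY_probe_tries:-1}"
    echo "-O- -q -L -T $opt_timeout -t $opt_tries --no-proxy --bind-address=127.0.0.1"
}

# Possible use inside the agent (an assumption about how it would be wired in):
#   WGETOPTS="$(wget_opts)"
```

With no parameters set this yields exactly the WGETOPTS line above, and
the timeout should be kept well below the monitor operation's timeout.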
On a side note:
The apache resource agent lets you supply a config file in which you can
override the parameters for curl/wget. But the implementation here is
bogus: even if you supply this file, the agent always runs a default
test with default parameters first, so the file is useless in this case...
(I consider this behavior to be a bug).
Side note II:
I did play a lot with on-fail=..., failure-timeout=... and
cluster-recheck-interval=... Changing these values did not help, and in
some cases produced new weird behavior, e.g. pacemaker
didn't even notice that apache was unreachable...
Best regards,
David
--
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg@doodle.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems