23.02.2012 10:43, Ante Karamatic wrote:
> On 23.02.2012 07:57, Vladislav Bogdanov wrote:
>> Thanks for the clarification, that wasn't clear at the moment I looked at
>> it. If I had known that, I wouldn't have written that RA. One remark: my RA
>> can check service liveness on the monitor operation and repair
>> the service if it hangs.
> Well... Upstart actually does notice if the job failed and respawns it -
> depending on job's configuration. Monitoring cluster resource, in this
> case, should just return 'running' or 'not running'. It's up to the lrmd
> to restart the resource if it's not running. Restarting the resource
> within the 'monitor' doesn't look like the best way to do it? It somehow
> doesn't fit into the 'monitor' function and you lose some of the
> functionality when you don't report the problem to the lrmd (allowed
> number of restarts; what to do if monitor fails, etc...).
Well, monitor failure will cause all dependent resources to be restarted
by pacemaker, which is not always desired.
As some resources (like libvirtd or iscsid or ietd) support restarts
without affecting functionality at all, I prefer them to be restarted
automatically by upstart, not by pacemaker. That's why I use 'respawn'
there. Of course not all resources support that.
What I said above is not about a NOT_RUNNING resource failure, but about
a HANG failure. Imagine a daemon which still runs (has a process) but does
not answer requests. Upstart does not notice that. But in the case
of libvirtd it will be noticed by the VirtualDomain RA and will cause
a monitor ERR_GENERIC (if I recall correctly) failure. The VM will then be
scheduled for restart. Then it fails on stop because libvirtd still
doesn't answer, and then the node is fenced.
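To make the NOT_RUNNING vs. HANG distinction concrete, here is a minimal Python sketch of how a monitor could classify the two cases. The numeric values are the standard OCF exit codes; `classify` and the `probe` callable are hypothetical names used only for illustration, and this is not the code of any real RA.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def classify(process_alive, probe):
    """Map daemon state to an OCF exit code.

    process_alive: True if the daemon still has a process.
    probe: hypothetical callable that sends one request to the daemon
           and raises TimeoutError if no answer arrives in time.
    """
    if not process_alive:
        # No process at all: this is the case upstart can see.
        return OCF_NOT_RUNNING
    try:
        probe()
    except TimeoutError:
        # Process exists but does not answer: the HANG case.
        return OCF_ERR_GENERIC
    return OCF_SUCCESS
```

Only the first branch is visible to a supervisor that merely tracks the process; the second needs an actual request to the daemon.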
I was hit by this once, and that was a simple growth problem - libvirtd
has a limit on the number of connections. The more resources (VMs) you have,
the bigger the chance that you consume all connection slots for monitor
And I think that having libvirtd killed -9 by its RA on monitor (and
respawned by upstart) is far less evil than having the whole cluster
forcibly restarted. Yes, this is a hack. But it works and allows me to
Of course that does not replace the need for a proper configuration; it is
just one more safety layer...
>
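The hack described above could look roughly like the following Python sketch. It assumes the monitor reports success after killing a hung daemon, leaving the restart to upstart's 'respawn'. That return value is an assumption, and `monitor_with_repair`, `probe` and the injectable `kill` parameter are hypothetical names, not taken from the actual RA.

```python
import os
import signal

OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def monitor_with_repair(pid, probe, kill=os.kill):
    """Monitor that repairs a hung daemon instead of reporting a failure.

    pid: PID of the daemon, or None if it has no process.
    probe: hypothetical callable raising TimeoutError when the daemon
           does not answer.
    kill: injectable for testing; defaults to os.kill.
    """
    if pid is None:
        return OCF_NOT_RUNNING
    try:
        probe()
    except TimeoutError:
        # Daemon hangs: kill -9 it and rely on upstart 'respawn'
        # to bring up a fresh instance (assumed job configuration).
        kill(pid, signal.SIGKILL)
    # Assumption: the hang is hidden from the cluster manager,
    # so dependent resources are not restarted.
    return OCF_SUCCESS
```

The trade-off is the one raised in the quoted reply: the cluster manager never sees the failure, so its restart counters and failure policies do not apply.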
>> I use it for libvirtd, which sometimes becomes
>> unresponsive so I need to restart it before all other libvirt-related
>> resources begin to fail. Fortunately, modern libvirtd can be restarted
>> without affecting guests. Of course, that is just a hack, and that
>> should be fixed in libvirtd, but we live in a real world...
> You can prevent other resources from restarting by adjusting
> constraints. But this really depends on your setup. For some time now,
> a running libvirtd has not been a requirement for running a VM. I don't recall
> VMs ever failing if libvirt restarted.
But libvirtd is required to start/stop a libvirt-managed VM. That's why
one needs constraints to colocate a VM with the libvirtd instance.
It is currently impossible to specify that something is needed to
start/stop a resource but is not needed while it runs (btw, in the case of
libvirtd it *is* needed to obtain the resource status).
So constraints must be there.
But then, if pacemaker notices that a resource (libvirtd) is not running,
it will stop all dependent resources (VMs) and then restart the failed one.
And it will fail to stop those resources (because libvirtd is still not
running) and the node will be fenced.
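The sequence above can be traced with a toy model in Python. This is only a sketch of the recovery order described here, not Pacemaker's actual policy engine; `recover` and `can_stop` are hypothetical names.

```python
def recover(failed, dependents, can_stop):
    """Return the list of actions taken after `failed` is seen as not running.

    failed: name of the failed resource (e.g. "libvirtd").
    dependents: resources tied to it by constraints (e.g. the VMs).
    can_stop: hypothetical callable; returns True if stopping the
              given resource succeeds.
    """
    actions = []
    # Dependent resources are stopped first.
    for res in dependents:
        if can_stop(res):
            actions.append(("stop", res))
        else:
            # A failed stop cannot be recovered locally: fence the node.
            actions.append(("stop-failed", res))
            actions.append(("fence-node", None))
            return actions
    # Only then is the failed resource restarted and dependents started.
    actions.append(("restart", failed))
    for res in dependents:
        actions.append(("start", res))
    return actions
```

With libvirtd down, stopping a VM needs libvirtd itself, so `can_stop` returns False for the first VM and the trace ends in fencing before libvirtd is ever restarted.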
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf