Mailing List Archive

pacemaker+drbd promotion delay
The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker
on both nodes at the same time, the drbd ms resource starts, but both nodes stay
in slave mode. They'll both stay in slave mode until one of the following occurs:

- I manually type "crm resource cleanup <ms-resource-name>"

- 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
resources are promoted.

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
params drbd_resource="admin" \
op monitor interval="59s" role="Master" timeout="30s" \
op monitor interval="60s" role="Slave" timeout="30s" \
op stop interval="0" timeout="100" \
op start interval="0" timeout="240" \
meta target-role="Master"
ms AdminClone AdminDrbd \
meta master-max="2" master-node-max="1" clone-max="2" \
clone-node-max="1" notify="true" interleave="true"
# The lengthy definition of "FilesystemGroup" is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
meta interleave="true" target-role="Started"
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I stuck in "target-role" options to try to solve the problem; no effect.

When I look in /var/log/messages, I see no error messages or indications why the
promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
on both nodes. There are no error messages when I force the issue with:

crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of "kick" after the drbd
resource is ready to be promoted.

This is not just an abstract case for me. At my site, it's not uncommon for
there to be lengthy power outages that will bring down the cluster. Both systems
will come up when power is restored, and I need for cluster services to be
available shortly afterward, not 15 minutes later.
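
(A stopgap I've considered, assuming the 15-minute figure really is Pacemaker's
cluster-recheck-interval timer, which defaults to 15 minutes: shorten that
interval so the recheck fires sooner. This only bounds the delay; it doesn't
explain why the promotion is skipped in the first place.)

# Sketch only: re-evaluate cluster state every 2 minutes instead of every 15
crm configure property cluster-recheck-interval="2min"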

Any ideas?

Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart:
<http://pastebin.com/jzrpCk3i>
cluster.conf: <http://pastebin.com/sJw4KBws>
"crm configure show": <http://pastebin.com/MgYCQ2JH>
"drbdadm dump all": <http://pastebin.com/NrY6bskk>
--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Re: pacemaker+drbd promotion delay
On 3/27/12 6:12 PM, William Seligman wrote:
> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
> files and versions below.
>
> Problem: If I restart both nodes at the same time, or even just start pacemaker
> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
> in slave mode. They'll both stay in slave mode until one of the following occurs:
>
> - I manually type "crm resource cleanup <ms-resource-name>"
>
> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
> resources are promoted.
>
> The key resource definitions:
>
> primitive AdminDrbd ocf:linbit:drbd \
> params drbd_resource="admin" \
> op monitor interval="59s" role="Master" timeout="30s" \
> op monitor interval="60s" role="Slave" timeout="30s" \
> op stop interval="0" timeout="100" \
> op start interval="0" timeout="240" \
> meta target-role="Master"
> ms AdminClone AdminDrbd \
> meta master-max="2" master-node-max="1" clone-max="2" \
> clone-node-max="1" notify="true" interleave="true"
> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
> clone FilesystemClone FilesystemGroup \
> meta interleave="true" target-role="Started"
> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>
> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>
> When I look in /var/log/messages, I see no error messages or indications why the
> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
> on both nodes. There are no error messages when I force the issue with:
>
> crm resource cleanup AdminClone
>
> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> resource is ready to be promoted.
>
> This is not just an abstract case for me. At my site, it's not uncommon for
> there to be lengthy power outages that will bring down the cluster. Both systems
> will come up when power is restored, and I need for cluster services to be
> available shortly afterward, not 15 minutes later.
>
> Any ideas?
>
> Details:
>
> # rpm -q kernel cman pacemaker drbd
> kernel-2.6.32-220.4.1.el6.x86_64
> cman-3.0.12.1-23.el6.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> drbd-8.4.1-1.el6.x86_64
>
> Output of crm_mon after two-node reboot or pacemaker restart:
> <http://pastebin.com/jzrpCk3i>
> cluster.conf: <http://pastebin.com/sJw4KBws>
> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

Well, I can't say that I've "solved" this one, but I have a partial work-around:
If I turn on both machines at once, there's a 15-minute delay. But if I turn on
one machine, wait a couple of minutes, then turn on the other, at least the
resources start promptly on the first machine. The second machine joins the
cluster, but there's still a 15-minute delay until its DRBD partition is
promoted by pacemaker.

The reason why DRBD is promoted on the first machine has to do with the previous
issue I posted to this list:

<http://www.gossamer-threads.com/lists/linuxha/users/78691?do=post_view_threaded>

When doing the initial resource probe of the AdminLvm resource, it times out due
to the one-node LVM issue I discuss in that thread. This error causes the
pengine on the node to re-probe resources and promote the DRBD partition, which
in turn leads to all the other resources starting on that node.

So I have a work-around, but not a solution. I'll take what I can get!
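
In case it helps anyone else, the crudest way I can think of to automate the
"kick" is something like this (purely a sketch; the sleep length, and running it
from rc.local on one node, are guesses I haven't tested):

# e.g. appended to /etc/rc.d/rc.local on one node
sleep 180                        # give cman/pacemaker/drbd time to come up
crm resource cleanup AdminClone  # force a re-probe, which triggers the promotion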
--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Re: pacemaker+drbd promotion delay
On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
<seligman@nevis.columbia.edu> wrote:
> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
> files and versions below.
>
> Problem: If I restart both nodes at the same time, or even just start pacemaker
> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
> in slave mode. They'll both stay in slave mode until one of the following occurs:
>
> - I manually type "crm resource cleanup <ms-resource-name>"
>
> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
> resources are promoted.
>
> The key resource definitions:
>
> primitive AdminDrbd ocf:linbit:drbd \
>        params drbd_resource="admin" \
>        op monitor interval="59s" role="Master" timeout="30s" \
>        op monitor interval="60s" role="Slave" timeout="30s" \
>        op stop interval="0" timeout="100" \
>        op start interval="0" timeout="240" \
>        meta target-role="Master"
> ms AdminClone AdminDrbd \
>        meta master-max="2" master-node-max="1" clone-max="2" \
>        clone-node-max="1" notify="true" interleave="true"
> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
> clone FilesystemClone FilesystemGroup \
>        meta interleave="true" target-role="Started"
> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>
> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>
> When I look in /var/log/messages, I see no error messages or indications why the
> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
> on both nodes. There are no error messages when I force the issue with:
>
> crm resource cleanup AdminClone
>
> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> resource is ready to be promoted.
>
> This is not just an abstract case for me. At my site, it's not uncommon for
> there to be lengthy power outages that will bring down the cluster. Both systems
> will come up when power is restored, and I need for cluster services to be
> available shortly afterward, not 15 minutes later.
>
> Any ideas?

Not without any logs

>
> Details:
>
> # rpm -q kernel cman pacemaker drbd
> kernel-2.6.32-220.4.1.el6.x86_64
> cman-3.0.12.1-23.el6.x86_64
> pacemaker-1.1.6-3.el6.x86_64
> drbd-8.4.1-1.el6.x86_64
>
> Output of crm_mon after two-node reboot or pacemaker restart:
> <http://pastebin.com/jzrpCk3i>
> cluster.conf: <http://pastebin.com/sJw4KBws>
> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: pacemaker+drbd promotion delay
On 3/29/12 3:19 AM, Andrew Beekhof wrote:
> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
> <seligman@nevis.columbia.edu> wrote:
>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>> files and versions below.
>>
>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>>
>> - I manually type "crm resource cleanup <ms-resource-name>"
>>
>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>> resources are promoted.
>>
>> The key resource definitions:
>>
>> primitive AdminDrbd ocf:linbit:drbd \
>>        params drbd_resource="admin" \
>>        op monitor interval="59s" role="Master" timeout="30s" \
>>        op monitor interval="60s" role="Slave" timeout="30s" \
>>        op stop interval="0" timeout="100" \
>>        op start interval="0" timeout="240" \
>>        meta target-role="Master"
>> ms AdminClone AdminDrbd \
>>        meta master-max="2" master-node-max="1" clone-max="2" \
>>        clone-node-max="1" notify="true" interleave="true"
>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>> clone FilesystemClone FilesystemGroup \
>>        meta interleave="true" target-role="Started"
>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>
>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>>
>> When I look in /var/log/messages, I see no error messages or indications why the
>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>> on both nodes. There are no error messages when I force the issue with:
>>
>> crm resource cleanup AdminClone
>>
>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>> resource is ready to be promoted.
>>
>> This is not just an abstract case for me. At my site, it's not uncommon for
>> there to be lengthy power outages that will bring down the cluster. Both systems
>> will come up when power is restored, and I need for cluster services to be
>> available shortly afterward, not 15 minutes later.
>>
>> Any ideas?
>
> Not without any logs

Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>

Before you click on the link (it's a big wall of text), here are what I think
are the landmarks:

- The extract starts just after the node boots, at the start of syslog at time
10:49:21.
- I've highlighted when pacemakerd starts, at 10:49:46.
- I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
10:50:10.
- One last highlight: When pacemaker finally promotes the drbd resource to
Primary on both nodes, at 11:05:11.

> Details:
>>
>> # rpm -q kernel cman pacemaker drbd
>> kernel-2.6.32-220.4.1.el6.x86_64
>> cman-3.0.12.1-23.el6.x86_64
>> pacemaker-1.1.6-3.el6.x86_64
>> drbd-8.4.1-1.el6.x86_64
>>
>> Output of crm_mon after two-node reboot or pacemaker restart:
>> <http://pastebin.com/jzrpCk3i>
>> cluster.conf: <http://pastebin.com/sJw4KBws>
>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Re: pacemaker+drbd promotion delay
On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
<seligman@nevis.columbia.edu> wrote:
> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>> <seligman@nevis.columbia.edu> wrote:
>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>>> files and versions below.
>>>
>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>>>
>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>
>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>>> resources are promoted.
>>>
>>> The key resource definitions:
>>>
>>> primitive AdminDrbd ocf:linbit:drbd \
>>>        params drbd_resource="admin" \
>>>        op monitor interval="59s" role="Master" timeout="30s" \
>>>        op monitor interval="60s" role="Slave" timeout="30s" \
>>>        op stop interval="0" timeout="100" \
>>>        op start interval="0" timeout="240" \
>>>        meta target-role="Master"
>>> ms AdminClone AdminDrbd \
>>>        meta master-max="2" master-node-max="1" clone-max="2" \
>>>        clone-node-max="1" notify="true" interleave="true"
>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>> clone FilesystemClone FilesystemGroup \
>>>        meta interleave="true" target-role="Started"
>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>
>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>>>
>>> When I look in /var/log/messages, I see no error messages or indications why the
>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>>> on both nodes. There are no error messages when I force the issue with:
>>>
>>> crm resource cleanup AdminClone
>>>
>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>> resource is ready to be promoted.
>>>
>>> This is not just an abstract case for me. At my site, it's not uncommon for
>>> there to be lengthy power outages that will bring down the cluster. Both systems
>>> will come up when power is restored, and I need for cluster services to be
>>> available shortly afterward, not 15 minutes later.
>>>
>>> Any ideas?
>>
>> Not without any logs
>
> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>
> Before you click on the link (it's a big wall of text),

I'm used to trawling the logs. Grep is a wonderful thing :-)

At this stage it is apparent that I need to see
/var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
Do you have this file still?

> here are what I think
> are the landmarks:
>
> - The extract starts just after the node boots, at the start of syslog at time
> 10:49:21.
> - I've highlighted when pacemakerd starts, at 10:49:46.
> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
> 10:50:10.
> - One last highlight: When pacemaker finally promotes the drbd resource to
> Primary on both nodes, at 11:05:11.
>
>> Details:
>>>
>>> # rpm -q kernel cman pacemaker drbd
>>> kernel-2.6.32-220.4.1.el6.x86_64
>>> cman-3.0.12.1-23.el6.x86_64
>>> pacemaker-1.1.6-3.el6.x86_64
>>> drbd-8.4.1-1.el6.x86_64
>>>
>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>> <http://pastebin.com/jzrpCk3i>
>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: pacemaker+drbd promotion delay
On 3/30/12 1:13 AM, Andrew Beekhof wrote:
> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
> <seligman@nevis.columbia.edu> wrote:
>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>>> <seligman@nevis.columbia.edu> wrote:
>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>>>> files and versions below.
>>>>
>>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>>>>
>>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>>
>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>>>> resources are promoted.
>>>>
>>>> The key resource definitions:
>>>>
>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>        params drbd_resource="admin" \
>>>>        op monitor interval="59s" role="Master" timeout="30s" \
>>>>        op monitor interval="60s" role="Slave" timeout="30s" \
>>>>        op stop interval="0" timeout="100" \
>>>>        op start interval="0" timeout="240" \
>>>>        meta target-role="Master"
>>>> ms AdminClone AdminDrbd \
>>>>        meta master-max="2" master-node-max="1" clone-max="2" \
>>>>        clone-node-max="1" notify="true" interleave="true"
>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>>> clone FilesystemClone FilesystemGroup \
>>>>        meta interleave="true" target-role="Started"
>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>>
>>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>>>>
>>>> When I look in /var/log/messages, I see no error messages or indications why the
>>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>>>> on both nodes. There are no error messages when I force the issue with:
>>>>
>>>> crm resource cleanup AdminClone
>>>>
>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>>> resource is ready to be promoted.
>>>>
>>>> This is not just an abstract case for me. At my site, it's not uncommon for
>>>> there to be lengthy power outages that will bring down the cluster. Both systems
>>>> will come up when power is restored, and I need for cluster services to be
>>>> available shortly afterward, not 15 minutes later.
>>>>
>>>> Any ideas?
>>>
>>> Not without any logs
>>
>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>>
>> Before you click on the link (it's a big wall of text),
>
> I'm used to trawling the logs. Grep is a wonderful thing :-)
>
> At this stage it is apparent that I need to see
> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
> Do you have this file still?

No, so I re-ran the test. Here's the log extract from the test I did today
<http://pastebin.com/6QYH2jkf>.

Based on what you asked for from the previous extract, I think what you want
from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
three pe-input files mentioned in the log messages:

pe-input-4: <http://pastebin.com/Txx50BJp>
pe-input-5: <http://pastebin.com/zzppL6DF>
pe-input-6: <http://pastebin.com/1dRgURK5>

I pray to the gods of Grep that you find a clue in all of that!
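
(For the archives, this is roughly how I pulled them; the crm_simulate line is
just how I understand these files can be replayed locally, so treat those flags
as a sketch:)

cd /var/lib/pengine
bunzip2 -k pe-input-5.bz2          # -k keeps the original .bz2
crm_simulate -x pe-input-5 -s -S   # replay the transition and show allocation scores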

>> here are what I think
>> are the landmarks:
>>
>> - The extract starts just after the node boots, at the start of syslog at time
>> 10:49:21.
>> - I've highlighted when pacemakerd starts, at 10:49:46.
>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
>> 10:50:10.
>> - One last highlight: When pacemaker finally promotes the drbd resource to
>> Primary on both nodes, at 11:05:11.
>>
>>> Details:
>>>>
>>>> # rpm -q kernel cman pacemaker drbd
>>>> kernel-2.6.32-220.4.1.el6.x86_64
>>>> cman-3.0.12.1-23.el6.x86_64
>>>> pacemaker-1.1.6-3.el6.x86_64
>>>> drbd-8.4.1-1.el6.x86_64
>>>>
>>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>>> <http://pastebin.com/jzrpCk3i>
>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>

--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/
Re: pacemaker+drbd promotion delay
It looks like the drbd RA is calling crm_master during the monitor action.
That wouldn't seem like a good idea as the value isn't counted until
the resource is started and if the transition is interrupted (as it is
here) then the PE won't try to promote it (because the value didn't
change).

Has the drbd RA always done this?

On Sat, Mar 31, 2012 at 2:56 AM, William Seligman
<seligman@nevis.columbia.edu> wrote:
> On 3/30/12 1:13 AM, Andrew Beekhof wrote:
>> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
>> <seligman@nevis.columbia.edu> wrote:
>>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>>>> <seligman@nevis.columbia.edu> wrote:
>>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>>>>> files and versions below.
>>>>>
>>>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>>>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>>>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>>>>>
>>>>> - I manually type "crm resource cleanup <ms-resource-name>"
>>>>>
>>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>>>>> resources are promoted.
>>>>>
>>>>> The key resource definitions:
>>>>>
>>>>> primitive AdminDrbd ocf:linbit:drbd \
>>>>>        params drbd_resource="admin" \
>>>>>        op monitor interval="59s" role="Master" timeout="30s" \
>>>>>        op monitor interval="60s" role="Slave" timeout="30s" \
>>>>>        op stop interval="0" timeout="100" \
>>>>>        op start interval="0" timeout="240" \
>>>>>        meta target-role="Master"
>>>>> ms AdminClone AdminDrbd \
>>>>>        meta master-max="2" master-node-max="1" clone-max="2" \
>>>>>        clone-node-max="1" notify="true" interleave="true"
>>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>>>>> clone FilesystemClone FilesystemGroup \
>>>>>        meta interleave="true" target-role="Started"
>>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>>>>>
>>>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>>>>>
>>>>> When I look in /var/log/messages, I see no error messages or indications why the
>>>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>>>>> on both nodes. There are no error messages when I force the issue with:
>>>>>
>>>>> crm resource cleanup AdminClone
>>>>>
>>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>>>>> resource is ready to be promoted.
>>>>>
>>>>> This is not just an abstract case for me. At my site, it's not uncommon for
>>>>> there to be lengthy power outages that will bring down the cluster. Both systems
>>>>> will come up when power is restored, and I need for cluster services to be
>>>>> available shortly afterward, not 15 minutes later.
>>>>>
>>>>> Any ideas?
>>>>
>>>> Not without any logs
>>>
>>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>>>
>>> Before you click on the link (it's a big wall of text),
>>
>> I'm used to trawling the logs.  Grep is a wonderful thing :-)
>>
>> At this stage it is apparent that I need to see
>> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
>> Do you have this file still?
>
> No, so I re-ran the test. Here's the log extract from the test I did today
> <http://pastebin.com/6QYH2jkf>.
>
> Based on what you asked for from the previous extract, I think what you want
> from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
> three pe-input files mentioned in the log messages:
>
> pe-input-4: <http://pastebin.com/Txx50BJp>
> pe-input-5: <http://pastebin.com/zzppL6DF>
> pe-input-6: <http://pastebin.com/1dRgURK5>
>
> I pray to the gods of Grep that you find a clue in all of that!
>
>>> here are what I think
>>> are the landmarks:
>>>
>>> - The extract starts just after the node boots, at the start of syslog at time
>>> 10:49:21.
>>> - I've highlighted when pacemakerd starts, at 10:49:46.
>>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
>>> 10:50:10.
>>> - One last highlight: When pacemaker finally promotes the drbd resource to
>>> Primary on both nodes, at 11:05:11.
>>>
>>>> Details:
>>>>>
>>>>> # rpm -q kernel cman pacemaker drbd
>>>>> kernel-2.6.32-220.4.1.el6.x86_64
>>>>> cman-3.0.12.1-23.el6.x86_64
>>>>> pacemaker-1.1.6-3.el6.x86_64
>>>>> drbd-8.4.1-1.el6.x86_64
>>>>>
>>>>> Output of crm_mon after two-node reboot or pacemaker restart:
>>>>> <http://pastebin.com/jzrpCk3i>
>>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
>
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: pacemaker+drbd promotion delay
On Wed, Apr 11, 2012 at 08:22:59AM +1000, Andrew Beekhof wrote:
> It looks like the drbd RA is calling crm_master during the monitor action.
> That wouldn't seem like a good idea as the value isn't counted until
> the resource is started and if the transition is interrupted (as it is
> here) then the PE won't try to promote it (because the value didn't
> change).

I did not get the last part.
Why would it not be promoted,
even though it has positive master score?

> Has the drbd RA always done this?

Yes.

When else should we call crm_master?

Preference changes: we may lose a local disk,
we may have been outdated or inconsistent,
then sync up, etc.
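
For concreteness, the kind of calls we make look roughly like this (an
illustrative sketch, not the actual RA code; the score values are made up):

# local disk UpToDate: advertise a strong promotion preference
crm_master -Q -l reboot -v 10000

# local disk lost, or Outdated/Inconsistent: withdraw the preference
crm_master -Q -l reboot -D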

> On Sat, Mar 31, 2012 at 2:56 AM, William Seligman
> <seligman@nevis.columbia.edu> wrote:
> > On 3/30/12 1:13 AM, Andrew Beekhof wrote:
> >> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
> >> <seligman@nevis.columbia.edu> wrote:
> >>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
> >>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
> >>>> <seligman@nevis.columbia.edu> wrote:
> >>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
> >>>>> files and versions below.
> >>>>>
> >>>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
> >>>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
> >>>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
> >>>>>
> >>>>> - I manually type "crm resource cleanup <ms-resource-name>"
> >>>>>
> >>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
> >>>>> resources are promoted.
> >>>>>
> >>>>> The key resource definitions:
> >>>>>
> >>>>> primitive AdminDrbd ocf:linbit:drbd \
> >>>>>        params drbd_resource="admin" \
> >>>>>        op monitor interval="59s" role="Master" timeout="30s" \
> >>>>>        op monitor interval="60s" role="Slave" timeout="30s" \
> >>>>>        op stop interval="0" timeout="100" \
> >>>>>        op start interval="0" timeout="240" \
> >>>>>        meta target-role="Master"
> >>>>> ms AdminClone AdminDrbd \
> >>>>>        meta master-max="2" master-node-max="1" clone-max="2" \
> >>>>>        clone-node-max="1" notify="true" interleave="true"
> >>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
> >>>>> clone FilesystemClone FilesystemGroup \
> >>>>>        meta interleave="true" target-role="Started"
> >>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> >>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
> >>>>>
> >>>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
> >>>>>
> >>>>> When I look in /var/log/messages, I see no error messages or indications why the
> >>>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
> >>>>> on both nodes. There are no error messages when I force the issue with:
> >>>>>
> >>>>> crm resource cleanup AdminClone
> >>>>>
> >>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> >>>>> resource is ready to be promoted.
> >>>>>
> >>>>> This is not just an abstract case for me. At my site, it's not uncommon for
> >>>>> there to be lengthy power outages that will bring down the cluster. Both systems
> >>>>> will come up when power is restored, and I need for cluster services to be
> >>>>> available shortly afterward, not 15 minutes later.
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> Not without any logs
> >>>
> >>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
> >>>
> >>> Before you click on the link (it's a big wall of text),
> >>
> >> I'm used to trawling the logs.  Grep is a wonderful thing :-)
> >>
> >> At this stage it is apparent that I need to see
> >> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
> >> Do you have this file still?
> >
> > No, so I re-ran the test. Here's the log extract from the test I did today
> > <http://pastebin.com/6QYH2jkf>.
> >
> > Based on what you asked for from the previous extract, I think what you want
> > from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
> > three pe-input files mentioned in the log messages:
> >
> > pe-input-4: <http://pastebin.com/Txx50BJp>
> > pe-input-5: <http://pastebin.com/zzppL6DF>
> > pe-input-6: <http://pastebin.com/1dRgURK5>
> >
> > I pray to the gods of Grep that you find a clue in all of that!
> >
> >>> here are what I think
> >>> are the landmarks:
> >>>
> >>> - The extract starts just after the node boots, at the start of syslog at time
> >>> 10:49:21.
> >>> - I've highlighted when pacemakerd starts, at 10:49:46.
> >>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
> >>> 10:50:10.
> >>> - One last highlight: When pacemaker finally promotes the drbd resource to
> >>> Primary on both nodes, at 11:05:11.
> >>>
> >>>> Details:
> >>>>>
> >>>>> # rpm -q kernel cman pacemaker drbd
> >>>>> kernel-2.6.32-220.4.1.el6.x86_64
> >>>>> cman-3.0.12.1-23.el6.x86_64
> >>>>> pacemaker-1.1.6-3.el6.x86_64
> >>>>> drbd-8.4.1-1.el6.x86_64
> >>>>>
> >>>>> Output of crm_mon after two-node reboot or pacemaker restart:
> >>>>> <http://pastebin.com/jzrpCk3i>
> >>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
> >>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> >>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
> >
> > --
> > Bill Seligman             | Phone: (914) 591-2823
> > Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
> > PO Box 137                |
> > Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: pacemaker+drbd promotion delay
On Thu, Apr 12, 2012 at 5:26 PM, Lars Ellenberg
<lars.ellenberg@linbit.com> wrote:
> On Wed, Apr 11, 2012 at 08:22:59AM +1000, Andrew Beekhof wrote:
>> It looks like the drbd RA is calling crm_master during the monitor action.
>> That wouldn't seem like a good idea as the value isn't counted until
>> the resource is started and if the transition is interrupted (as it is
>> here) then the PE won't try to promote it (because the value didn't
>> change).
>
> I did not get the last part.
> Why would it not be promoted,
> even though it has positive master score?

Because we don't know that we need to run the PE again - because the
only changes in the PE were things we expected.

See:
https://github.com/beekhof/pacemaker/commit/65f1a22a4b66581159d8b747dbd49fa5e2ef34e1

This "only" becomes an issue when the transition is interrupted
between the non-recurring monitor and the start, which I guess was
rare enough that we hadn't noticed it for 4 years :-(

>
>> Has the drbd RA always done this?
>
> Yes.
>
> When else should we call crm_master?

I guess the only situation you shouldn't is during a non-recurring
monitor if you're about to return 7.
Which I'll concede isn't exactly obvious.
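
In RA terms I imagine that guard looking something like this (a hypothetical
shell sketch, not the actual drbd agent; drbd_probe_rc is an invented helper):

monitor() {
    rc=$(drbd_probe_rc)   # invented helper: echoes the OCF monitor result code
    if [ "$rc" -eq 7 ]; then
        if [ "${OCF_RESKEY_CRM_meta_interval:-0}" -eq 0 ]; then
            # A probe (interval=0) about to report "not running" (rc 7):
            # don't touch the master score here, so the probe result alone
            # isn't something the PE already "expected".
            return 7
        fi
        crm_master -Q -l reboot -D         # recurring monitor: withdraw the score
        return 7
    fi
    crm_master -Q -l reboot -v 10000       # running and promotable: made-up score
    return "$rc"
}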

>
> Preference changes: we may lose a local disk,
> we may have been outdated or inconsistent,
> then sync up, etc.
>
>> On Sat, Mar 31, 2012 at 2:56 AM, William Seligman
>> <seligman@nevis.columbia.edu> wrote:
>> > On 3/30/12 1:13 AM, Andrew Beekhof wrote:
>> >> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
>> >> <seligman@nevis.columbia.edu> wrote:
>> >>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
>> >>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
>> >>>> <seligman@nevis.columbia.edu> wrote:
>> >>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
>> >>>>> files and versions below.
>> >>>>>
>> >>>>> Problem: If I restart both nodes at the same time, or even just start pacemaker
>> >>>>> on both nodes at the same time, the drbd ms resource starts, but both nodes stay
>> >>>>> in slave mode. They'll both stay in slave mode until one of the following occurs:
>> >>>>>
>> >>>>> - I manually type "crm resource cleanup <ms-resource-name>"
>> >>>>>
>> >>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the ms
>> >>>>> resources are promoted.
>> >>>>>
>> >>>>> The key resource definitions:
>> >>>>>
>> >>>>> primitive AdminDrbd ocf:linbit:drbd \
>> >>>>>        params drbd_resource="admin" \
>> >>>>>        op monitor interval="59s" role="Master" timeout="30s" \
>> >>>>>        op monitor interval="60s" role="Slave" timeout="30s" \
>> >>>>>        op stop interval="0" timeout="100" \
>> >>>>>        op start interval="0" timeout="240" \
>> >>>>>        meta target-role="Master"
>> >>>>> ms AdminClone AdminDrbd \
>> >>>>>        meta master-max="2" master-node-max="1" clone-max="2" \
>> >>>>>        clone-node-max="1" notify="true" interleave="true"
>> >>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin below
>> >>>>> clone FilesystemClone FilesystemGroup \
>> >>>>>        meta interleave="true" target-role="Started"
>> >>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
>> >>>>> order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start
>> >>>>>
>> >>>>> Note that I stuck in "target-role" options to try to solve the problem; no effect.
>> >>>>>
>> >>>>> When I look in /var/log/messages, I see no error messages or indications why the
>> >>>>> promotion should be delayed. The 'admin' drbd resource is reported as UpToDate
>> >>>>> on both nodes. There are no error messages when I force the issue with:
>> >>>>>
>> >>>>> crm resource cleanup AdminClone
>> >>>>>
>> >>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
>> >>>>> resource is ready to be promoted.
>> >>>>>
>> >>>>> This is not just an abstract case for me. At my site, it's not uncommon for
>> >>>>> there to be lengthy power outages that will bring down the cluster. Both systems
>> >>>>> will come up when power is restored, and I need for cluster services to be
>> >>>>> available shortly afterward, not 15 minutes later.
>> >>>>>
>> >>>>> Any ideas?
>> >>>>
>> >>>> Not without any logs
>> >>>
>> >>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
>> >>>
>> >>> Before you click on the link (it's a big wall of text),
>> >>
>> >> I'm used to trawling the logs.  Grep is a wonderful thing :-)
>> >>
>> >> At this stage it is apparent that I need to see
>> >> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
>> >> Do you have this file still?
>> >
>> > No, so I re-ran the test. Here's the log extract from the test I did today
>> > <http://pastebin.com/6QYH2jkf>.
>> >
>> > Based on what you asked for from the previous extract, I think what you want
>> > from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed all
>> > three pe-input files mentioned in the log messages:
>> >
>> > pe-input-4: <http://pastebin.com/Txx50BJp>
>> > pe-input-5: <http://pastebin.com/zzppL6DF>
>> > pe-input-6: <http://pastebin.com/1dRgURK5>
>> >
>> > I pray to the gods of Grep that you find a clue in all of that!
>> >
>> >>> here are what I think
>> >>> are the landmarks:
>> >>>
>> >>> - The extract starts just after the node boots, at the start of syslog at time
>> >>> 10:49:21.
>> >>> - I've highlighted when pacemakerd starts, at 10:49:46.
>> >>> - I've highlighted when drbd reports that the 'admin' resource is UpToDate, at
>> >>> 10:50:10.
>> >>> - One last highlight: When pacemaker finally promotes the drbd resource to
>> >>> Primary on both nodes, at 11:05:11.
>> >>>
>> >>>> Details:
>> >>>>>
>> >>>>> # rpm -q kernel cman pacemaker drbd
>> >>>>> kernel-2.6.32-220.4.1.el6.x86_64
>> >>>>> cman-3.0.12.1-23.el6.x86_64
>> >>>>> pacemaker-1.1.6-3.el6.x86_64
>> >>>>> drbd-8.4.1-1.el6.x86_64
>> >>>>>
>> >>>>> Output of crm_mon after two-node reboot or pacemaker restart:
>> >>>>> <http://pastebin.com/jzrpCk3i>
>> >>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
>> >>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
>> >>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
>> >
>> > --
>> > Bill Seligman             | Phone: (914) 591-2823
>> > Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
>> > PO Box 137                |
>> > Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>> >
>> >
>> > _______________________________________________
>> > Linux-HA mailing list
>> > Linux-HA@lists.linux-ha.org
>> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> > See also: http://linux-ha.org/ReportingProblems
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: pacemaker+drbd promotion delay
On Fri, Apr 13, 2012 at 11:47 AM, Andrew Beekhof <andrew@beekhof.net> wrote:
> On Thu, Apr 12, 2012 at 5:26 PM, Lars Ellenberg
> <lars.ellenberg@linbit.com> wrote:
>> On Wed, Apr 11, 2012 at 08:22:59AM +1000, Andrew Beekhof wrote:
>>> It looks like the drbd RA is calling crm_master during the monitor action.
>>> That wouldn't seem like a good idea as the value isn't counted until
>>> the resource is started and if the transition is interrupted (as it is
>>> here) then the PE won't try to promote it (because the value didn't
>>> change).
>>
>> I did not get the last part.
>> Why would it not be promoted,
>> even though it has positive master score?
>
> Because we don't know that we need to run the PE again - because the
> only changes in the PE were things we expected.
>
> See:
>  https://github.com/beekhof/pacemaker/commit/65f1a22a4b66581159d8b747dbd49fa5e2ef34e1
>
> This "only" becomes an issue when the transition is interrupted
> between the non-recurring monitor and the start, which I guess was
> rare enough that we hadn't noticed it for 4 years :-(
>
>>
>>> Has the drbd RA always done this?
>>
>> Yes.
>>
>> When else should we call crm_master?
>
> I guess the only situation you shouldn't is during a non-recurring
> monitor if you're about to return 7.
> Which I'll concede isn't exactly obvious.

I'm thinking about applying this, which restricts the previous patch
to cases when the state of the resource is unknown.
Existing regression tests appear to pass while enabling the expected
behavior here.

Can anyone see something wrong with it?


diff --git a/pengine/master.c b/pengine/master.c
index 7af1936..77a82e6 100644
--- a/pengine/master.c
+++ b/pengine/master.c
@@ -410,14 +410,18 @@ master_score(resource_t * rsc, node_t * node, int not_set_value)
         return score;
     }

-    if (rsc->fns->state(rsc, TRUE) < RSC_ROLE_STARTED) {
-        return score;
-    }
+    if (node == NULL) {
+        if(rsc->fns->state(rsc, TRUE) < RSC_ROLE_STARTED) {
+            crm_trace("Ingoring master score for %s: unknown state on %s",
+                      rsc->id, node->details->uname);
+            return score;
+        }

-    if (node != NULL) {
+    } else {
         node_t *match = pe_find_node_id(rsc->running_on, node->details->id);
+        node_t *known = pe_hash_table_lookup(rsc->known_on, node->details->id);

-        if (match == NULL) {
+        if (match == NULL && known == NULL) {
             crm_trace("%s is not active on %s - ignoring", rsc->id, node->details->uname);
             return score;
         }
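
(If anyone wants to try it locally, roughly the following; the paths assume an
already-built pacemaker 1.1 source tree, so treat this as a sketch:)

cd pacemaker                    # source checkout, already configured and built
patch -p1 < master-score.diff   # the diff above, saved to a file
make                            # rebuild the pengine
pengine/regression.sh           # re-run the PE regression tests mentioned above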
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems