Mailing List Archive

ocfs2_controld.pcmk process issue
Hi!

I ran into the issue of ocfs2_controld.pcmk consuming vast amounts of CPU
again - twice, actually. The most recent occurrence was after a multi-node
failure. One node stayed alive; two nodes had to be rebooted. After
the reboots, one of the two came back without issue and was able to
mount the OCFS2 stores. The second node showed high CPU usage in the
ocfs2_controld.pcmk process and could not mount the OCFS2 stores. The
logs were rapidly filling with the following message:

ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist

This message was being output so frequently that syslogd was starting to
rate-limit it. I suspect this accounts for the high CPU usage. After
restarting the troubled node several times, I found the solution was to
order the OCFS2/DLM resource group to stop, cluster-wide, and then
restart it. Normal behavior followed. (In a prior post to the list, I
referenced hard-killing the ocfs2_controld.pcmk process. This was a
more graceful shutdown.)

Attached are two strace outputs. I'm sorry I'm not very familiar with
strace, so the value of these files may be questionable. If there is
anything else I can provide the next time this happens, I'd be happy to
do so! The log-f.txt file was generated with the -f option, and the
log-fc.txt file was generated with -f -c.

Here also is a snippet from the syslog, during the cluster-wide shutdown
of the OCFS2/DLM group:

May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:14 ocfs2_controld: last message repeated 199 times
May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice: terminate_ais_connection: Disconnecting from AIS
May 14 15:22:16 gw05 lrmd: [2993]: info: RA output: (p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:20 ocfs2_controld: last message repeated 199 times
May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:26 ocfs2_controld: last message repeated 199 times
May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:32 ocfs2_controld: last message repeated 199 times
May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:38 ocfs2_controld: last message repeated 199 times
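
When sizing up a flood like the one in this excerpt, it can help to total the messages, including the ones syslogd folded into "last message repeated N times" lines. The following is a minimal Python sketch, not part of the original report. It assumes each log entry sits on a single line, that ordinary entries start with "Mon DD HH:MM:SS host", and that a "repeated" line refers to the entry immediately before it. The names tally_messages and SAMPLE_LOG are hypothetical.

```python
import re
from collections import Counter

# Matches the syslogd summary line that stands in for suppressed duplicates.
REPEAT_RE = re.compile(r"last message repeated (\d+) times")

# Sample entries taken from the excerpt above, one entry per line.
SAMPLE_LOG = [
    'May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist',
    "May 14 15:22:14 ocfs2_controld: last message repeated 199 times",
    "May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk",
    'May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist',
    "May 14 15:22:20 ocfs2_controld: last message repeated 199 times",
]


def tally_messages(lines):
    """Count messages, crediting 'repeated N times' to the preceding entry."""
    counts = Counter()
    previous = None  # text of the most recent ordinary entry
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Assumption: the summary refers to the entry just before it.
            if previous is not None:
                counts[previous] += int(repeat.group(1))
            continue
        # Assumption: month, day, time and host come first; the rest is the message.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue
        previous = fields[4]
        counts[previous] += 1
    return counts


if __name__ == "__main__":
    for message, count in tally_messages(SAMPLE_LOG).most_common():
        print(f"{count:6d}  {message}")
```

Running it against a real log would mean passing the open file object instead of SAMPLE_LOG; the timestamps could then be used to turn the totals into a rate.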

One other bit of log that I found interesting appeared when I tried to
manually mount the OCFS2 store on the afflicted server:

mount.ocfs2: Unable to access cluster service while trying to join the group

One other note - I discovered I had not specified a monitor operation for
either the ocf:pacemaker:o2cb or the ocf:pacemaker:controld RA. Could that
have triggered this issue?

--

Sincerely,
Matthew O'Connor

-----------------------------------------------------------------
Sr. Software Engineer
PGP/GPG Key: 0x55F981C4
Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4

Engineering and Computer Simulations, Inc.
11825 High Tech Ave Suite 250
Orlando, FL 32817

Tel: 407-823-9991 x315
Fax: 407-823-8299
Email: matt@ecsorl.com
Web: www.ecsorl.com
-----------------------------------------------------------------

CONFIDENTIAL NOTICE: The information contained in this electronic
message is legally privileged, confidential and exempt from disclosure
under applicable law. It is intended only for the use of the individual
or entity named above. If the reader of this message is not the intended
recipient, you are hereby notified that any dissemination, distribution
or copying of this message is strictly prohibited. If you have received
this communication in error, please notify the sender immediately by
return e-mail and delete the original message and any copies of it from
your computer system. Thank you.
Re: ocfs2_controld.pcmk process issue
Is this on SLES by any chance?
SUSE are about the only ones with knowledge in this area I'm afraid.

On Tue, May 15, 2012 at 6:01 AM, Matthew O'Connor <matt@ecsorl.com> wrote:
> [...]

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: ocfs2_controld.pcmk process issue
I'm sorry, no. It's on Ubuntu 11.10... I was looking into grabbing a
copy of the SUSE community DVD ISO the other night - would this come
with all the necessary packages for setting up Pacemaker/Corosync +
OCFS2? If nothing else, I'd be happy to see if I could replicate the
issue consistently, and on at least two distributions.


On 5/15/2012 8:34 PM, Andrew Beekhof wrote:
> Is this on SLES by any chance?
> SUSE are about the only ones with knowledge in this area I'm afraid.

--

Sincerely,
Matthew O'Connor

Re: ocfs2_controld.pcmk process issue
On 05/16/2012 12:42 PM, Matthew O'Connor wrote:
> I'm sorry, no. It's on Ubuntu 11.10... I was looking into grabbing a
> copy of the SUSE community DVD ISO the other night - would this come
> with all the necessary packages for setting up Pacemaker/Corosync +
> OCFS2? If nothing else, I'd be happy to see if I could replicate the
> issue consistently, and on at least two distributions.

You can either:

* Download the latest SLES DVD + SLE HA DVD
(http://www.suse.com/products/server/ and
http://www.suse.com/products/highavailability/), which are free to try for
60 days, after which you need a subscription to get maintenance etc.

-- or --

* Use the latest openSUSE release (http://www.opensuse.org/en/) which
also includes pacemaker, corosync, ocfs2 etc.

Either way, please refer to the SLE HA docs
(http://www.suse.com/documentation/sle_ha/) for configuration/setup -
these are pretty much equally applicable for both SLES and openSUSE.

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tserong@suse.com

Re: ocfs2_controld.pcmk process issue
Great! Thanks for the info+links!!

On 5/16/2012 12:20 AM, Tim Serong wrote:
> [...]

--

Sincerely,
Matthew O'Connor
