Hi!
I ran into the issue of ocfs2_controld.pcmk consuming vast amounts of
CPU again - twice, actually. The most recent occurrence was after a
multi-node failure. One node stayed alive; two nodes had to be rebooted.
After the reboots, one of the two came back without issue and was able
to mount the OCFS2 stores. The second node exhibited high CPU usage in
the ocfs2_controld.pcmk process and could not mount the OCFS2 stores.
Its logs were filling rapidly with the following message:
ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object
does not exist
This message was being output so frequently that syslogd was starting
to rate-limit it, which I suspect accounts for the high CPU usage.
After restarting the troubled node several times, I found the solution
was to order the OCFS2/DLM resource group to stop cluster-wide and then
restart it, after which normal behavior followed. (In a prior post to
the list, I referenced hard-killing the ocfs2_controld.pcmk process;
this time the shutdown was more graceful.)
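For reference, the cluster-wide stop/restart was done along these lines
from the crm shell - the group name g_ocfs2 below is only a stand-in
for whatever the OCFS2/DLM group is actually named in the CIB:

    crm resource stop g_ocfs2      # stops the group on all nodes
    # watch crm_mon until every instance reports Stopped
    crm resource start g_ocfs2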
Attached are two strace outputs. I'm sorry I'm not very familiar with
strace, so the value of these files may be questionable. If there is
anything else I can provide the next time this happens, I'd be happy to
do so! The log-f.txt file was generated with the -f option, and the
log-fc.txt file was generated with -f -c.
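In case it helps interpret them, the traces were captured roughly like
this, attached to the PID that ocfs2_controld.pcmk was running as:

    strace -f -p <pid> -o log-f.txt    # follow forks, full trace
    strace -f -c -p <pid>              # follow forks, syscall count/time summary (log-fc.txt)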
Here also is a snippet from the syslog, during the cluster-wide shutdown
of the OCFS2/DLM group:
May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint
"ocfs2:controld": Object does not exist
May 14 15:22:14 ocfs2_controld: last message repeated 199 times
May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice:
terminate_ais_connection: Disconnecting from AIS
May 14 15:22:16 gw05 lrmd: [2993]: info: RA output:
(p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint
"ocfs2:controld": Object does not exist
May 14 15:22:20 ocfs2_controld: last message repeated 199 times
May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint
"ocfs2:controld": Object does not exist
May 14 15:22:26 ocfs2_controld: last message repeated 199 times
May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint
"ocfs2:controld": Object does not exist
May 14 15:22:32 ocfs2_controld: last message repeated 199 times
May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint
"ocfs2:controld": Object does not exist
May 14 15:22:38 ocfs2_controld: last message repeated 199 times
One other log entry that seemed interesting (well, to me) appeared when
I tried to manually mount the OCFS2 store on the afflicted server:
mount.ocfs2: Unable to access cluster service while trying to join
the group
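The mount attempt itself was nothing unusual - something along these
lines, with the device and mountpoint below being examples rather than
the actual paths:

    mount -t ocfs2 /dev/sdb1 /mnt/store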
One other note - I discovered I had not specified a monitor operation
for either the pacemaker:o2cb or the pacemaker:controld RA. Could that
possibly have triggered this issue?
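Concretely, the primitives had no op monitor clause at all - i.e.,
nothing like the following crm configure snippet (p_dlm matches the
resource name in the log above; p_o2cb and the intervals are just
examples):

    primitive p_dlm ocf:pacemaker:controld \
        op monitor interval="10s" timeout="20s"
    primitive p_o2cb ocf:pacemaker:o2cb \
        op monitor interval="10s" timeout="20s"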
--
Sincerely,
Matthew O'Connor
-----------------------------------------------------------------
Sr. Software Engineer
PGP/GPG Key: 0x55F981C4
Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4
Engineering and Computer Simulations, Inc.
11825 High Tech Ave Suite 250
Orlando, FL 32817
Tel: 407-823-9991 x315
Fax: 407-823-8299
Email: matt@ecsorl.com
Web: www.ecsorl.com
-----------------------------------------------------------------