Mailing List Archive

Drbd : PingAsk timeout, about 10 mins.
Hi all,



I am using DRBD 8.3.7 under an HA cluster. When the Master host dies and HA
switches from Master to Slave, DRBD cannot complete the failover because it
takes 10 minutes to mount its partition, and that exceeds the HA timeout
(the HA default timeout is 2 minutes).



Why does DRBD take that long?



The log is:

Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739458] block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739468] block drbd1: asender terminated
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739470] block drbd1: Terminating asender thread
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739526] block drbd1: short read expecting header on sock: r=-512
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739666] block drbd1: Connection closed
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739672] block drbd1: conn( NetworkFailure -> Unconnected )
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739678] block drbd1: receiver terminated
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739680] block drbd1: Restarting receiver thread
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739683] block drbd1: receiver (re)started
Jul 22 21:06:34 QD-CS-MDC-B kernel: [325560.739687] block drbd1: conn( Unconnected -> WFConnection )
Jul 22 21:06:39 QD-CS-MDC-B pengine: [17776]: info: crm_log_init: Changed active directory to /usr/var/lib/heartbeat/cores/root
Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.727331] NET: Registered protocol family 17
Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.768912] block drbd0: role( Secondary -> Primary )
Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.772742] block drbd1: role( Secondary -> Primary )
Jul 22 21:06:47 QD-CS-MDC-B kernel: [325573.772997] block drbd1: Creating new current UUID
Jul 22 21:08:47 QD-CS-MDC-B su: (to hitv) root on none
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032485] block drbd0: PingAck did not arrive in time.
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032493] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032503] block drbd0: asender terminated
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032506] block drbd0: Terminating asender thread
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032514] block drbd0: Creating new current UUID
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032567] block drbd0: short read expecting header on sock: r=-512
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032868] block drbd0: Connection closed
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032875] block drbd0: conn( NetworkFailure -> Unconnected )
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032879] block drbd0: receiver terminated
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032881] block drbd0: Restarting receiver thread
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032884] block drbd0: receiver (re)started
Jul 22 21:16:47 QD-CS-MDC-B kernel: [326174.032888] block drbd0: conn( Unconnected -> WFConnection )
Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.600888] kjournald starting. Commit interval 15 seconds
Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.600956] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601330] EXT3 FS on drbd0, internal journal
Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601334] EXT3-fs: recovery complete.
Jul 22 21:16:48 QD-CS-MDC-B kernel: [326174.601392] EXT3-fs: mounted filesystem with ordered data mode.



According to the log, the timeout comes from the PingAck operation.





Thanks for your help.




simon
Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Hi Simon.



AFAIK, the PingAck error means your replication network links are either
down or subject to enough errors to prevent the two nodes from reaching each
other in a timely manner. I have experienced this behavior myself, caused
for instance by bad optical fibers generating huge numbers of network
errors. You also have "NetworkFailure" messages in your logs, and the
connection state ends up in WFConnection ("Waiting for connection"). In your
case, the first thing to do is to test this network: can both nodes ping
each other's address on this network? Does an ifconfig of each interface
report errors? Etc. I bet that once your replication network is up again,
your cluster will run fine.
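The checks Pascal suggests can be scripted; a minimal sketch, assuming the peer address shown in Simon's later drbdsetup output (substitute your own replication-network IP and interface):

```shell
# Sketch of the replication-link checks suggested above.
# The peer IP comes from the drbdsetup output later in this thread;
# substitute your own.

check_peer() {
    # returns 0 if the given address answers a single ping within 2 s
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

if check_peer 172.17.5.151; then
    echo "peer reachable"
else
    echo "peer unreachable"
fi

# Interface error counters; non-zero RX/TX error counts point at bad
# cabling or optics, as in the fiber example above.
command -v ip >/dev/null 2>&1 && ip -s link
```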



Pascal.



From: drbd-user-bounces@lists.linbit.com
[mailto:drbd-user-bounces@lists.linbit.com] On behalf of simon
Sent: Saturday, 18 August 2012 03:37
To: drbd-user@lists.linbit.com
Subject: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.



Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Hi Pascal,

Thanks for your reply.

Yes, the network was bad. The Master host was dead, so the Slave host took over its work and mounted the DRBD partition. The timeout occurred during mounting. But the default DRBD network timeout is 6 seconds (it can be set in drbd.conf), and it seems not to have taken effect. Why?

Do you have a good idea to make it switch immediately under this condition?
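For reference, the timeout knobs Simon mentions live in the net section of drbd.conf in DRBD 8.3; a sketch using the default values that appear later in Simon's drbdsetup output (illustrative, not a tuning recommendation):

```
resource r0 {
  net {
    timeout      60;  # in 1/10 s, i.e. 6 seconds: the value referred to above
    ping-int     10;  # seconds of idle time between keep-alive pings
    ping-timeout  5;  # in 1/10 s: how long to wait for a PingAck
    ko-count      0;  # 0 disables "declare peer dead after N missed timeouts"
  }
}
```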

Thanks.

Simon

-----Original Message-----
From: "Pascal BERTON" <pascal.berton3@free.fr>
Sent: Saturday, 18 August 2012
To: 'simon' <litao5@hisense.com>, drbd-user@lists.linbit.com
Cc:
Subject: RE: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.



Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Dear friends,

I would like to suggest that you edit your messages before sending them to the list.

Nowadays emails are often read on mobile devices, such as smartphones.

The editing should focus on removing kilometric log text, as in this thread: what is the sense of keeping multiple repetitions of it in ALL the replies?

Thank you for understanding my criticism, which is meant to be as constructive as possible.

Kind regards, and thank you very much for sharing your experiences.

Robert


Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Hi Pascal,

Thanks for your reply.

Yes, the network was bad. The Master host was dead, so the Slave host took
over its work and mounted the DRBD partition. The timeout occurred during
mounting. But the default DRBD network timeout is 6 seconds (it can be set
in drbd.conf), and it seems not to have taken effect. Why?

Do you have a good idea to make it switch from Master to Slave immediately
during this kind of network anomaly?



Thanks.



Simon
Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Hi Simon!

Sorry for the delay: first day back at work, busy day...

Well, yes, if the former master is down, that is indeed a fairly good
reason for the replication network being down! :-) Just to better
understand your specific context, could you please let me know what
"cat /proc/drbd" reports, especially during this 10-minute blackout period
if you can reproduce it, and also the output of "drbdsetup 0 show" and
"crm configure show", just to get the big picture of your configuration.



Regards,



Pascal.



From: simon [mailto:litao5@hisense.com]
Sent: Monday, 20 August 2012 03:05
To: 'Pascal BERTON'; drbd-user@lists.linbit.com
Subject: RE: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.



Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
Hi Pascal,



I can’t reproduce the error, because the condition that triggers it is very unusual. The Master host was in a “not really dead” state (I suspect a Linux kernel panic), and the TCP stack on the Master host may have been misbehaving. For now I am not trying to prevent the condition, since I cannot reproduce it; I only want the switch from Master to Slave to succeed so that my service remains available. But the switch cannot happen promptly because of the 10-minute delay in DRBD.



I ran “drbdsetup 0 show” on my host; it shows the following:



disk {
    size 0s _is_default; # bytes
    on-io-error detach;
    fencing dont-care _is_default;
    max-bio-bvecs 0 _is_default;
}
net {
    timeout 60 _is_default; # 1/10 seconds
    max-epoch-size 2048 _is_default;
    max-buffers 2048 _is_default;
    unplug-watermark 128 _is_default;
    connect-int 10 _is_default; # seconds
    ping-int 10 _is_default; # seconds
    sndbuf-size 0 _is_default; # bytes
    rcvbuf-size 0 _is_default; # bytes
    ko-count 0 _is_default;
    allow-two-primaries;
    after-sb-0pri discard-least-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect _is_default;
    rr-conflict disconnect _is_default;
    ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
    rate 102400k; # bytes/second
    after -1 _is_default;
    al-extents 257;
}
protocol C;
_this_host {
    device minor 0;
    disk "/dev/cciss/c0d0p7";
    meta-disk internal;
    address ipv4 172.17.5.152:7900;
}
_remote_host {
    address ipv4 172.17.5.151:7900;
}





In the listing, there is “timeout 60 _is_default; # 1/10 seconds”.
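Note the units: in the 8.3 drbdsetup output, timeout and ping-timeout are given in tenths of a second, while ping-int is in whole seconds. Converting the values above:

```shell
# DRBD 8.3 prints 'timeout' and 'ping-timeout' in 1/10 s units
# (per the "# 1/10 seconds" comments in the drbdsetup output above).
timeout_ds=60
ping_timeout_ds=5
awk -v t="$timeout_ds" -v p="$ping_timeout_ds" \
    'BEGIN { printf "timeout=%.1fs ping-timeout=%.1fs\n", t/10, p/10 }'
# prints: timeout=6.0s ping-timeout=0.5s
```

So the configured network timeout is indeed the 6 seconds Simon refers to.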





Thanks.



Simon



From: Pascal BERTON [mailto:pascal.berton3@free.fr]
Sent: Tuesday, 21 August 2012 4:58
To: 'simon'; drbd-user@lists.linbit.com
Subject: RE: [DRBD-user] Drbd : PingAsk timeout, about 10 mins.



Re: Drbd : PingAsk timeout, about 10 mins. [ In reply to ]
On Tue, Aug 21, 2012 at 03:40:34PM +0800, simon wrote:
> Hi Pascal,
>
>
>
> I can’t reproduce the error because the condition that it issues is
> very especially. The Master host is in the “not real dead” status.
> ( I doubt it is Linux’s panic). The TCP stack maybe is bad in Master
> host. Now I don’t want to avoid it because I can’t reproduce it. I
> only want to succeed to switch form Master to Slave so that my
> service can be supplied normally. But I can’t right to switch because
> of the 10 minutes delay of Drbd.

Well. If it was "not real dead", then I'd suspect that the DRBD
connection was still "sort of up", and thus DRBD saw the other node as
Primary still, and correctly refused to be promoted locally.


To have your cluster recover from an "almost but not quite dead node"
scenario, you need to add STONITH, aka node-level fencing, to your
cluster stack.


> I run “drbdsetup 0 show” on my host, it shows as following,
>
> disk {
> size 0s _is_default; # bytes
> on-io-error detach;
> fencing dont-care _is_default;
> max-bio-bvecs 0 _is_default;
> }
>
> net {
> timeout 60 _is_default; # 1/10 seconds
> max-epoch-size 2048 _is_default;
> max-buffers 2048 _is_default;
> unplug-watermark 128 _is_default;
> connect-int 10 _is_default; # seconds
> ping-int 10 _is_default; # seconds
> sndbuf-size 0 _is_default; # bytes
> rcvbuf-size 0 _is_default; # bytes
> ko-count 0 _is_default;
> allow-two-primaries;


Uh. You are sure about that?

Two primaries, and dont-care for fencing?

You are aware that you just subscribed to data corruption, right?

If you want two primaries, you MUST have proper fencing,
on both the cluster level (stonith) and the drbd level (fencing
resource-and-stonith; fence-peer handler: e.g. crm-fence-peer.sh).
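For illustration, the combination Lars recommends would look roughly like the following drbd.conf fragment (a non-authoritative sketch: the resource name r0 is made up, and the handler paths are the usual locations for the Pacemaker scripts shipped with DRBD; adjust to your installation):

```
resource r0 {
  disk {
    fencing resource-and-stonith;   # freeze I/O and call the fence-peer handler
  }
  handlers {
    # Pacemaker integration scripts distributed with DRBD
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

This only covers the DRBD level; STONITH must additionally be enabled and a working fence device configured in the cluster manager, since without it the handler cannot guarantee that the peer is really down.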

> after-sb-0pri discard-least-changes;
> after-sb-1pri discard-secondary;

And here you configure automatic data loss.
Which is ok, as long as you are aware of that and actually mean it...


>
> after-sb-2pri disconnect _is_default;
> rr-conflict disconnect _is_default;
> ping-timeout 5 _is_default; # 1/10 seconds
> }
>
> syncer {
> rate 102400k; # bytes/second
> after -1 _is_default;
> al-extents 257;
> }
>
> protocol C;
> _this_host {
> device minor 0;
> disk "/dev/cciss/c0d0p7";
> meta-disk internal;
> address ipv4 172.17.5.152:7900;
> }
>
> _remote_host {
> address ipv4 172.17.5.151:7900;
> }
>
>
>
>
>
> In the list, there is "timeout 60 _is_default; # 1/10 seconds".

Then guess what, maybe the timeout did not trigger,
because the peer was still "sort of" responsive?
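To make those numbers concrete, here is a small worked conversion of the timeouts above (a sketch based only on the "# 1/10 seconds" comments in the drbdsetup output, not on the DRBD source):

```python
# Several drbdsetup net options are printed in tenths of a second.
DECISECOND = 0.1

timeout = 60 * DECISECOND        # data-socket timeout: 6.0 s, as simon expected
ping_int = 10                    # seconds of idleness before DRBD sends a ping
ping_timeout = 5 * DECISECOND    # 0.5 s allowed for the PingAck to arrive

# On an otherwise idle link, a truly dead peer should therefore be
# noticed after roughly ping_int + ping_timeout seconds -- nowhere near
# 10 minutes, which supports the "peer still sort-of responsive" theory.
idle_detection = ping_int + ping_timeout
```

So with these defaults a hard network cut should be detected in seconds; a 10-minute stall points at a connection that never looked dead to DRBD.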


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
Hi Lars Ellenberg,

The Master host has two network cards, eth0 and eth1. DRBD uses eth0. "Not
really dead" means eth0 is dead (this can be seen in the HA log). Eth1 still
answers pings, but ssh login is impossible.
So I think maybe Linux has panicked.

Eth0 is dead, but DRBD can't detect it and return immediately. Why?

Thanks.



_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The Master Host has two network cards, eth0 and eth1. Drbd uses eth0. "not
> real dead" means eth0 is dead. ( it can get by ha log). Eth1 can ping good
> but can't login by ssh.
> So I think maybe the linux is panic.
>
> Eth0 is dead, but drbd can't detect it and return immediately. Why?

As I said, most likely because eth0 still was not that dead as you think it was.

And read again what I said about fencing and stonith.

> Thanks.

Cheers.



--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
On Thu, Aug 23, 2012 at 09:45:21AM +0800, simon wrote:
> Hi Lars Ellenberg,
>
> The Master Host has two network cards, eth0 and eth1. Drbd uses eth0. "not
> real dead" means eth0 is dead. ( it can get by ha log). Eth1 can ping good
> but can't login by ssh.
> So I think maybe the linux is panic.
>
> Eth0 is dead, but drbd can't detect it and return immediately. Why?

As I said, most likely because eth0 still was not that dead as you think it
was.

And read again what I said about fencing and stonith.

> Thanks.

Cheers.

Hi Cheers,

I can confirm eth0 is disconnected, because the two hosts can't ping each other over it.

Now I only want to know how to switch over immediately, and why the
timeout option doesn't take effect.

Can you tell me some implementation details of the DRBD code?

Thanks

simon


_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
Simon,

You have only sent me the results of "drbdsetup 0 show"; you forgot to send the result of "crm configure show" that I had also asked for. Could you please send it too? I would also like to see what "ifconfig" reports for your two replication interfaces (the ones on subnet 172.17). Please send it along.
And finally, what type of application do you host on this cluster? What kind of filesystem do you have on your DRBD resources?

Apart from that, Lars has spotted a couple of issues in your DRBD configuration. Have you addressed them? Namely, the dual-primary configuration and the rest.

Please send the above information to help us understand your whole setup more clearly.

Regards,

Pascal.



_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
Hi Pascal,

Sorry for my late reply; there has been too much work to do recently.

I'm sending you the results of 'drbdsetup 0 show' again:

disk {
size 0s _is_default; # bytes
on-io-error detach;
fencing dont-care _is_default;
max-bio-bvecs 0 _is_default;
}
net {
timeout 60 _is_default; # 1/10 seconds
max-epoch-size 2048 _is_default;
max-buffers 2048 _is_default;
unplug-watermark 128 _is_default;
connect-int 10 _is_default; # seconds
ping-int 10 _is_default; # seconds
sndbuf-size 0 _is_default; # bytes
rcvbuf-size 0 _is_default; # bytes
ko-count 0 _is_default;
allow-two-primaries;
after-sb-0pri discard-least-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect _is_default;
rr-conflict disconnect _is_default;
ping-timeout 5 _is_default; # 1/10 seconds
}
syncer {
rate 102400k; # bytes/second
after -1 _is_default;
al-extents 257;
}
protocol C;
_this_host {
device minor 0;
disk "/dev/cciss/c0d0p7";
meta-disk internal;
address ipv4 192.168.1.2:7900;
}
_remote_host {
address ipv4 192.168.1.1:7900;
}

"crm configure show" can't be executed on my computer because I didn't install Pacemaker.

"ifconfig" shows:

eth0 Link encap:Ethernet HWaddr 3C:D9:2B:07:8A:42
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:329137 errors:0 dropped:0 overruns:0 frame:0
TX packets:115697 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:463432396 (441.9 Mb) TX bytes:13923644 (13.2 Mb)
Interrupt:16 Memory:f4000000-f4012800

eth1 Link encap:Ethernet HWaddr 3C:D9:2B:07:8A:44
inet addr:172.17.5.152 Bcast:172.17.5.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5530 errors:0 dropped:0 overruns:0 frame:0
TX packets:3375 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:677856 (661.9 Kb) TX bytes:645750 (630.6 Kb)
Interrupt:17 Memory:f2000000-f2012800

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:716 errors:0 dropped:0 overruns:0 frame:0
TX packets:716 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:142793 (139.4 Kb) TX bytes:142793 (139.4 Kb)


My file system on DRBD partition is EXT3.

Thanks.

simon






_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
Re: Drbd : PingAsk timeout, about 10 mins.
Hi,

On 08/19/2012 09:01 AM, roberto.fastec@gmail.com wrote:
> The editing phase should focus on removing (for example, in this thread) such kilometric log text:
> what is the sense of keeping multiple repetitions of it in ALL the replies?

while I more or less sympathize, it makes me giggle that on my PC
(rather large screen even), your mail is 12 pages long including all
those quotes you're criticizing.
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user