NetApp SDK for cDOT: any API call fails if a cluster node is not available
Hi All,

We extensively use NetApp API calls to monitor 7Mode filers, and took
the same approach for cDOT monitoring.

Here is a very unpleasant discovery:

1. Take one node (or more nodes) *offline*, e.g. power it off for maintenance.
2. Try to run *any* API call against the cluster interface and get the
following error:

OUTPUT:
<results reason="RPC: Port mapper failure - RPC: Timed out"
status="failed" errno="13001"></results>

It effectively makes your cluster wide monitoring useless.
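
For a monitoring script, it may help to tell this specific failure apart from other errors. A minimal sketch, assuming the response is available as a plain XML string shaped like the output above (no XML namespace on the element); the function names are hypothetical and not part of the NetApp SDK:

```python
import xml.etree.ElementTree as ET

# errno taken from the failed output quoted above
RPC_TIMEOUT_ERRNO = "13001"


def parse_zapi_result(xml_text):
    """Return (status, errno, reason) from a <results> element."""
    root = ET.fromstring(xml_text)
    # <results> may be the root itself or nested inside a wrapper element.
    results = root if root.tag == "results" else root.find(".//results")
    if results is None:
        raise ValueError("no <results> element found")
    return results.get("status"), results.get("errno"), results.get("reason")


def is_rpc_timeout(xml_text):
    """True when the call failed with the RPC timeout errno."""
    status, errno, _reason = parse_zapi_result(xml_text)
    return status == "failed" and errno == RPC_TIMEOUT_ERRNO


sample = ('<results reason="RPC: Port mapper failure - RPC: Timed out" '
          'status="failed" errno="13001"></results>')
print(is_rpc_timeout(sample))
```

A check like this lets the monitoring side report "a node is unreachable" instead of a generic API failure.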

Any ideas? Is it a feature or a bug?

Cheers,
Vladimir
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Did the cluster interface fail over correctly? Can the host doing the API
calls "ssh" into the cluster address?

--tmac

Tim McCarthy, Principal Consultant
RE: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Try narrowing your API call to a specific node. It’s possible it’s trying to query the node that’s down, and that is causing the timeout.

The API might not be smart enough to know to ignore a node that is not up.

Also be sure to check that it did fail over properly, as tmac mentioned, and that the cluster is in quorum (set diag; cluster show; cluster ring show).

Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Yes, the cluster interface is reachable over SSH; I can log in with no issues.

Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
A correction to my initial description:

1. The whole HA pair (i.e., two nodes) is powered off.

On Wed, Mar 30, 2016 at 3:30 PM, Parisi, Justin
<Justin.Parisi@netapp.com> wrote:
> Try narrowing your API call to a specific node. It’s possible it’s trying to
> query the node that’s down and causing the timeout.
>
>

I initially noticed this behavior with the "diagnosis-alert-get-iter"
call, which doesn't require a node parameter.
But even a simple call like "version" fails.

>
> API might not be smart enough to know to ignore a node that is not up.
>
>

The reality proves otherwise =) I'm on 8.3.1.

>
> Also be sure to check that it did fail over properly as tmac mentioned. And
> that the cluster is in quorum. (set diag; cluster show; cluster ring show)
>

Since both nodes are down, there was actually no failover taking place.

Here is what I get:

cdot::*> cluster ring show
..
<output of healthy nodes here>
..

Warning: Unable to list entries on node na101node-4a. RPC: Port mapper
failure - RPC: Timed out
Unable to list entries on node na101node-4b. RPC: Port mapper
failure - RPC: Timed out
30 entries were displayed.
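
The warning lines above name the nodes that could not be reached, so a script that captures this CLI output could pull them out. A minimal sketch, assuming the warning text keeps the wording shown above; the function name is hypothetical:

```python
import re

# Matches the node name in "Unable to list entries on node <name>."
NODE_ERROR = re.compile(r"Unable to list entries on node ([\w-]+)\.")


def unreachable_nodes(cli_output):
    """Return node names reported as unreachable in the CLI output."""
    # Collapse line wraps so a phrase split across lines still matches.
    flattened = " ".join(cli_output.split())
    return NODE_ERROR.findall(flattened)


sample = """Warning: Unable to list entries on node na101node-4a. RPC: Port mapper
failure - RPC: Timed out
Unable to list entries on node na101node-4b. RPC: Port mapper
failure - RPC: Timed out
30 entries were displayed."""

print(unreachable_nodes(sample))
```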

RE: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Oh, well that's different entirely. :)

The cluster may be out of quorum, which is causing this issue.

Did you capture the aforementioned commands?

"RPC timeout" here means that the API is being sent across the cluster to other nodes via RPC. Since the nodes are down, the commands are failing.

Keep in mind that a scenario where two nodes in a cluster are powered off is not a normal scenario. If you are doing maintenance, you would want to mark those nodes as "eligibility false" to ensure they don't participate in the cluster during maintenance. You also want to ensure epsilon is not on the nodes and to move epsilon if it is.

Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
On Wed, Mar 30, 2016 at 4:31 PM, Parisi, Justin
<Justin.Parisi@netapp.com> wrote:
> Oh, well that's different entirely. :)
>

true =)

> The cluster may be out of quorum, which is causing this issue.
>

I don't think so: 6 nodes are up, 2 nodes are powered off, and one of
the running nodes holds "epsilon", so IMO the cluster is still in
"quorum".

> Did you capture the aforementioned commands?
>
> "RPC timeout" here means that the API is being sent across the cluster to other nodes via RPC. Since the nodes are down, the commands are failing.

Yes, I did; the healthy nodes reply just fine.

Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
na101node-1a         true    true         false
na101node-1b         true    true         false
na101node-2a         true    true         true
na101node-2b         true    true         false
na101node-3a         true    true         false
na101node-3b         true    true         false
na101node-4a         false   true         false
na101node-4b         false   true         false
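
The quorum reasoning behind this table can be sketched as a simple majority check. This is a simplified model, assuming quorum means that more than half of the eligible nodes are healthy and that epsilon only breaks an exact tie; the real cluster tracks quorum per replicated database ring, so treat it as an illustration. The function names are hypothetical:

```python
def parse_cluster_show(text):
    """Parse 'cluster show' rows into dicts; skips header and rule lines."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[1] in ("true", "false"):
            name, health, eligibility, epsilon = fields
            nodes.append({
                "name": name,
                "healthy": health == "true",
                "eligible": eligibility == "true",
                "epsilon": epsilon == "true",
            })
    return nodes


def in_quorum(nodes):
    """Majority of eligible nodes healthy; epsilon breaks an exact tie."""
    voters = [n for n in nodes if n["eligible"]]
    up = [n for n in voters if n["healthy"]]
    if 2 * len(up) > len(voters):
        return True
    if 2 * len(up) == len(voters):
        return any(n["epsilon"] for n in up)
    return False


sample = """Node Health Eligibility Epsilon
-------------------- ------- ------------ ------------
na101node-1a true true false
na101node-1b true true false
na101node-2a true true true
na101node-2b true true false
na101node-3a true true false
na101node-3b true true false
na101node-4a false true false
na101node-4b false true false"""

print(in_quorum(parse_cluster_show(sample)))
```

With six of eight eligible nodes healthy, the majority check passes without needing epsilon at all.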

>
> Keep in mind that a scenario where two nodes in a cluster are powered off is not a normal scenario. If you are doing maintenance, you would want to mark those nodes as "eligibility false" to ensure they don't participate in the cluster during maintenance. You also want to ensure epsilon is not on the nodes and to move epsilon if it is.
>

Well, I just tried to set "eligibility" to false, but it didn't fix
the API calls issue:

::*> node modify -node na101node-4* -eligibility false

Warning: When a node's eligibility is set to "false," it cannot serve
SAN data, and NAS access might also be affected. This setting should
be used only for unusual maintenance operations. To restore the node's
data-serving capabilities,
set the eligibility to "true" and reboot the node. Continue? {y|n}: y
2 entries were modified.

RE: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Well I'd suggest opening a support case and getting a bug filed.

While this scenario is odd, the APIs should be smart enough to ignore nodes that are unhealthy/ineligible.

Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
Yes, I agree it's worth a ticket.

I think the following CLI command relies on the same API call "internally":

::*> system health alert show
This table is currently empty.

Warning: Unable to list entries on node na101node-4a. RPC: Port mapper
failure - RPC: Timed out
Unable to list entries on node na101node-4b. RPC: Port mapper
failure - RPC: Timed out


Re: NetApp SDK for cDOT: any API call fails if a cluster node is not available [ In reply to ]
I dug into this a bit more, and it seems that not every API
call fails, just some of them, e.g.:

system-get-version - works
version - fails
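
Since only some calls fail, a monitoring probe could try several calls in order and use the first one that passes. A minimal sketch, assuming a hypothetical `invoke(api_name)` callable that returns the raw `<results>` XML as a string; `fake_invoke` below is only a stand-in that mimics the behavior reported here and is not an SDK function:

```python
import xml.etree.ElementTree as ET


def call_succeeded(xml_text):
    """True when the <results> element reports status="passed"."""
    return ET.fromstring(xml_text).get("status") == "passed"


def first_working_call(invoke, api_names):
    """Return the first API name whose result passed, or None."""
    for name in api_names:
        try:
            if call_succeeded(invoke(name)):
                return name
        except (ET.ParseError, OSError):
            # Unparseable reply or transport error: try the next call.
            continue
    return None


def fake_invoke(name):
    # Stand-in for a real API client (assumption for illustration only).
    if name == "system-get-version":
        return '<results status="passed"></results>'
    return ('<results reason="RPC: Port mapper failure - RPC: Timed out" '
            'status="failed" errno="13001"></results>')


print(first_working_call(fake_invoke, ["version", "system-get-version"]))
```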

Cheers,
Vladimir
