Mailing List Archive

ssh timing out (filer high CPU load)
On the off chance - I'm having trouble with a filer. I can't ssh to it
reliably (mostly not at all).

I'm pretty sure that's correlated with some high CPU load - my system
console has it 'spiked' at >95% for the last 24h, and that's much higher
than 'normal'.

What I'm not sure of is quite what's causing it - the filer is busy, but
not abnormally so.

The only thing I can think of that _might_ have changed is the API calls
(qtree-list, get-file-info) - I've recently started doing quota SNMP trap
enrichment (but that's 'every few minutes' at most).
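For context, the enrichment is doing roughly this sort of thing - a sketch
using the NetApp Manageability SDK Python bindings, where the hostname,
credentials and the path are placeholders rather than my actual script:

from NaServer import NaServer
from NaElement import NaElement

# Connect to the filer over ONTAPI (ZAPI). The version numbers are the
# ONTAPI major/minor; any 1.x is fine for 7-Mode 8.2.
filer = NaServer("filer01", 1, 15)
filer.set_transport_type("HTTPS")
filer.set_style("LOGIN")
filer.set_admin_user("monitor", "secret")

# qtree-list: a single call that returns every qtree on the system.
out = filer.invoke("qtree-list")
if out.results_status() != "passed":
    raise RuntimeError(out.results_reason())
for qt in out.child_get("qtrees").children_get():
    print(qt.child_get_string("volume"), qt.child_get_string("qtree"))

# get-file-info (the ZAPI name is file-get-file-info): stat a single path.
req = NaElement("file-get-file-info")
req.child_add_string("path", "/vol/vol0/etc/quotas")   # placeholder path
info = filer.invoke_elem(req)
if info.results_status() == "passed":
    print(info.child_get("file-info").child_get_string("file-size"))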

But otherwise - I'm not sure what might be causing sshd to stall, or
whether there's a way to 'kick' it?

This is a 7-Mode filer, on 8.2.1.

I've got a case open, but I'd appreciate any further insight on how to
track down a 'high CPU causing ssh to not respond' type of issue.
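For reference, the numbers above are what I'm watching from the console
(ssh being dead), roughly like this - from memory, so the exact flags may
differ:

filer> sysstat -m 1          # per-CPU utilisation, 1-second samples
filer> priv set advanced
filer*> sysstat -M 1         # per-domain CPU breakdown (advanced mode)
filer*> ps                   # the process/thread listing further down
filer*> priv set admin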

I'm pretty sure a failover/failback will do the trick, but that'll have to
wait until the weekend - I'd like not to if I can manage it.
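(By failover/failback I mean the usual takeover/giveback from the partner
head, i.e. roughly:

partner> cf takeover         # partner serves this head's data while it restarts
partner> cf giveback         # hand everything back once it's healthy again

...which is why it has to wait for a window.)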

My current ps list looks like:

Process statistics over 67.328 seconds...

  ID  State  Domain  %CPU  StackUsed  %StackUsed  Name
 195  RR     N        47%       6928         10%  NwkThd_00
 196  RR     N        47%       7880         12%  NwkThd_01
 197  RR     0        47%       6928         10%  NwkThd_02
 223  BR     s         7%       7648         46%  pmcsas_intrd_1
 259  BR     e         5%       2440         19%  fal_io_thread2
 502  BR     R         7%       7448         45%  raidio_thread
 503  BR     R         7%       7448         45%  raidio_thread
 635  BG     k         6%      15184         11%  snmpd
1614  BR     0         5%       3464         10%  ntm_main
1711  RR     w        35%      14256         21%  wafl_exempt00
1712  BR     w        35%      14136         21%  wafl_exempt01
1713  BR     w        35%      14136         21%  wafl_exempt02
2599  BR     k         5%       2752          8%  gr_scheduler


That seems pretty busy for a 4-CPU system...


Thanks and regards,
Ed.
Re: ssh timing out (filer high CPU load)
The first thing that caught my eye was snmpd - any chance you've set up new
SNMP polling from a monitoring station that's querying the disks over and
over? If you can, turn off SNMP for a short while and see if the load goes away.

Re: ssh timing out (filer high CPU load)
Thanks for the response. Yes, we're polling with Zabbix (and generating
SNMP traps).

So I'll shut those down for a while, and see if that helps.

Re: ssh timing out (filer high CPU load)
Yep, it was Zabbix for me as well - it killed the CPU on all my filers.
You'll have to go through and remove a bunch of checks.


Re: ssh timing out (filer high CPU load)
With Zabbix off all night, we've got as far as picking up a possible bug
with 'sshd' - the login is actually 'going', in that it's connecting and
doing key exchange; it's just not getting as far as the 'shell' login.

(And on the filer, I get 'connection timed out' messages).

I am still unsure quite why - rshstat/rshkill cleared out some stale
processes, but I think they were more a symptom than the cause.

Our next step is 'reboot it', which'll have to wait for an outage window.

Don't suppose anyone has any handy tricks for 'force kill' on sshd on a
filer? (I've gone as far as firing up systemshell, but 'sshd' doesn't seem
to respond to kill signals).
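For reference, this is roughly what I tried - from memory, so the exact
steps into systemshell may differ by release:

filer> priv set advanced
filer*> useradmin diaguser unlock
filer*> useradmin diaguser password     # set a password for the 'diag' account
filer*> systemshell                     # log in as 'diag', drops to a BSD shell
% ps aux | grep sshd                    # find the stuck sshd processes
% sudo kill -9 <pid>                    # which had no visible effect for me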


Re: ssh timing out (filer high CPU load)
Again, similar issue here. I'm pretty sure I did a kill -9, but I'm not
positive. I believe my issue was similar to this one:

http://community.netapp.com/t5/Data-ONTAP-Discussions/Systemshell-cant-ssh-to-filer-after-systemshell-session-left-open/td-p/101445

