Mailing List Archive

Raid disk layout - the ability to lose a shelf.
Hello,

I am deploying a new 8040 and it was requested that the aggregates / raid
groups are laid out in such a way that no more than 2 disks in any raid
group are within the same shelf.

At first this sounds like it reduces single points of failure and could
protect availability from the failure of a full disk shelf.

I argue against this strategy and was wondering if anyone in this list had
any feedback.

My thought is that this configuration marginally increases availability at the cost of
additional risk to data integrity. With this strategy, each time a disk fails we would
endure not only the initial rebuild to a spare, but a second rebuild when a disk replace
is executed to put the original shelf/slot/disk back into the active raid group.

Additionally, if a shelf failure were encountered, I question whether it would even be
possible to limp along. In an example configuration we would be down 24 disks; 4 or 5
would rebuild to the remaining available spares. Those rebuilds alone would require
significant CPU to run concurrently, and I expect they would impact data services
significantly. On top of that, at least 10 other raid groups would be either singly or
doubly degraded. I expect the performance degradation at this point would be so great
that the most practical course of action would be to shut down the system until the
failed shelf could be replaced.
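
To make that concrete, here is a rough back-of-the-envelope model (the shelf count, shelf size, raid group size and spare count below are illustrative assumptions, not our actual layout):

    # Python sketch: single-shelf failure under a "max 2 disks per raid
    # group per shelf" layout. All counts are illustrative assumptions.
    SHELVES = 12
    DISKS_PER_SHELF = 24
    RG_SIZE = 12              # e.g. 10 data + 2 parity (RAID-DP)
    SPARES = 4                # spares surviving on the other shelves

    total_disks = SHELVES * DISKS_PER_SHELF          # 288
    raid_groups = total_disks // RG_SIZE             # 24

    # Losing one shelf fails DISKS_PER_SHELF disks at once. With at most
    # 2 members per raid group on that shelf, the failure touches
    # somewhere between these bounds:
    min_rgs_hit = -(-DISKS_PER_SHELF // 2)           # 12 (2 members each)
    max_rgs_hit = min(DISKS_PER_SHELF, raid_groups)  # 24 (1 member each)

    print(f"{DISKS_PER_SHELF} disks failed, {min_rgs_hit}-{max_rgs_hit} raid "
          f"groups degraded, only {SPARES} rebuilds can start to spares")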


Thanks for any input. I would like to know if anyone has any experience
thinking through this type of scenario. Is considering this configuration
interesting or perhaps silly? Are any best practice recommendations being
violated?

Thanks in advance.

--Jordan
Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I know a medical pharma customer does it. It just raises complexity, and you have to make sure the rebuilds happen on the proper drives. I think unless you have had a shelf failure (or questionable power) it is maybe justified, but otherwise let “NetApp do what NetApp does”. It might also be a challenge evacuating a shelf, say for decommissioning an older, smaller disk type (migrating away from 1TB SATA shelves, for example).


Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
This can be crazy! You will end up with small raid groups eating up parity, and for
what? Sacrificing space for what?

Now, given a large environment (like 10 shelves or more)...maybe you can
start with this. I did this for a customer once.
...ONCE

We ended up with 16- or 18-disk raid groups, and there were no more than 2 per
raid group per shelf.
We took this one a bit farther too: all even-numbered disks
(*0, *2, *4, *6, *8) were assigned to node 2, the rest to node 1.
When a disk fails, assign even to node 2, odd to node 1.
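
(Purely as illustration, that ownership rule amounts to something like the Python below; actual assignment is of course done with the usual disk assign commands, and the slot numbering here is assumed.)

    def owner_for_slot(slot: int) -> str:
        # Even-numbered slots/bays go to node 2, odd-numbered to node 1.
        return "node2" if slot % 2 == 0 else "node1"

    # e.g. bay 0 -> node2, bay 13 -> node1
    print(owner_for_slot(0), owner_for_slot(13))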

This made the aggregates a bit trickier to place, but it happened.

Now, when a disk fails, I cannot control where it rebuilds other than to a spare. I
tried to keep the spares on one shelf, thinking that in the event of failures they
will likely land in different raid groups.

However, one could script some monitoring software to watch where the spares are and
flag more than 2 disks of a raid group showing up in the same shelf, then possibly
force a "disk copy start" command to nondisruptively move a disk. THIS TAKES LONGER
than a reconstruction!!! Why? The process is nice'd to use limited resources because
it is not critical yet.
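
A minimal sketch of that kind of watcher (the inventory below is a made-up list of disk/raid-group/shelf tuples; a real script would have to parse sysconfig -r or equivalent output, and the exact "disk copy start" syntax should be verified for your ONTAP release before wiring it in):

    from collections import Counter

    # Assumed inventory: (disk_name, raid_group, shelf_id), normally parsed
    # from the storage system rather than hard-coded like this.
    inventory = [
        ("1a.10.0", "rg0", 10), ("1a.10.1", "rg0", 10), ("1a.10.2", "rg0", 10),
        ("1a.11.0", "rg0", 11), ("1a.11.1", "rg1", 11),
    ]

    per_rg_shelf = Counter((rg, shelf) for _, rg, shelf in inventory)
    for (rg, shelf), count in per_rg_shelf.items():
        if count > 2:
            offenders = [d for d, r, s in inventory if r == rg and s == shelf]
            # An admin (or the script) could then move one offender to a
            # spare on another shelf, e.g. "disk copy start <disk> <spare>".
            print(f"{rg} has {count} disks on shelf {shelf}: {offenders}")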


--tmac

Tim McCarthy, Principal Consultant



Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
Your customer obviously has (ir)rational reasons for their lack of confidence in disk shelf HA.

After mentioning all the objections you're likely to get in this forum, you might ask your customer if maybe there's a subset of data that could be protected with this approach, or whether SyncMirror might be an alternative?


Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
Agreed... for shelf resiliency, SyncMirror to a different stack is a no-charge license... doubling the disks isn't, but for stack and shelf resiliency that might be a better option.
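
To put rough numbers on "doubling the disks isn't free", a quick comparison (disk count and raid group size are assumptions; real usable capacity also depends on right-sizing, spares, WAFL reserve, etc.):

    # Compare usable data disks: plain RAID-DP vs. SyncMirror'd RAID-DP,
    # given the same physical disk count. Illustrative numbers only.
    DISKS = 288
    RG_SIZE = 16
    PARITY_PER_RG = 2

    def data_disks(disks: int) -> int:
        # Data disks left after RAID-DP parity, ignoring spares.
        rgs = disks // RG_SIZE
        return disks - rgs * PARITY_PER_RG

    plain = data_disks(DISKS)            # one plex, all 288 disks
    mirrored = data_disks(DISKS // 2)    # SyncMirror: half the disks per plex
    print(f"unmirrored: {plain} data disks, mirrored: {mirrored} data disks")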


Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I find the whole exercise a fine idea. However, I've never seen whole-shelf failures;
usually it's one of the modules (that's why there are two). Has anyone ever experienced
a whole shelf go offline because of a failure other than power?

I've only had a single slot failure on one of my 4243s.



RE: Raid disk layout - the ability to lose a shelf. [ In reply to ]
For the most part, the shelf is just sheet metal. The controllers, power supplies, and all that are redundant. I'm not in hardware engineering, but I imagine that's the reason there isn't a native ability to isolate data within certain shelves. Catastrophic sheet metal failure is unlikely.

I was a NetApp customer for about 10 years before joining the company, and the only shelf "failure" I encountered was a mildly annoying temperature sensor failure. There was only the one sensor, so I had to replace the shelf. It didn't cause downtime or anything.


RE: Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I have seen whole shelf failures due to Bug ID 902420. (Don’t bother trying to look up the bug, it’s one of those wonderful, completely blank ones.)

Doug Clendening

(c) 713-516-4671

Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
Another issue to think about besides resiliency… what happens in this “no more than two RAID group disks per shelf” scheme when they want to add another shelf because they’re running out of capacity?

Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
Thanks for all the replies so far. That is a valid point, but I believe in that
situation each raid group could be extended by 1 or 2 disks. The initial configuration
will be 12 shelves (@tmac) and 12-disk raid groups, so though the raid groups will end
up being smaller than I would typically recommend, they are not tiny.
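
As a sanity check on that, a small calculation (shelf size assumed to be 24 disks per the earlier example; the even spread is idealized):

    SHELVES = 12
    DISKS_PER_SHELF = 24      # assumed, per the 24-disk shelf example above
    RG_SIZE = 12

    total = SHELVES * DISKS_PER_SHELF        # 288 disks
    raid_groups = total // RG_SIZE           # 24 raid groups of 12
    per_rg_per_shelf = RG_SIZE / SHELVES     # 1.0 disk per raid group per shelf
    print(raid_groups, "raid groups,", per_rg_per_shelf, "disk(s) per RG per shelf")

    # Adding a 13th shelf later: if its disks are spread one per raid group,
    # each group grows to 13 disks and the 2-per-shelf limit still holds.
    new_rg_size = RG_SIZE + DISKS_PER_SHELF // raid_groups   # 12 + 1 = 13
    print("after adding a shelf:", new_rg_size, "disks per raid group")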

Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I experienced a disk shelf failure once: some internal electronics failed, smoke and
all that fun... no data availability.

Another possible option is to enable data mirroring across two separate SAS domains;
it used to need the snapmirror_local license on 7-Mode ONTAP.
Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I have seen (more than once) NetApp support recommending power-cycling a shelf. And I did a full shelf replacement less than a month ago.

Sent from iPhone

RE: Raid disk layout - the ability to lose a shelf. [ In reply to ]
A shelf can certainly have problems, like any electronic device, but I just question whether isolating aggregates to particular shelves solves a real problem. If a temperature sensor went bad, or a particular drive connector in one of the bays was bent, you still have to deal with maintenance work and potential downtime to address it, irrespective of which drives from which aggregates are where.

Re: Raid disk layout - the ability to lose a shelf. [ In reply to ]
"A shelf can certainly have problems like any electronic device, I just question whether isolating aggregates to particular shelves solves a real problem."---
That's been my thinking the whole thread.

All great thoughts, but it's really deep in the weeds to waste time over thinking the problem on enterprise HW, using retail JBOD thinking.

We're way down into the .00x afr percentages here even where specific targeted/found bugs actually exist.  Add more decimal points for the much more rare smoke events that can kill a shelf.

ONTAP does a good job at layout given a lot of practical experience/data based on best placement for highest reliability.
 _________________________________Jeff MohlerTech Yahoo, Storage Architect, Principal(831)454-6712
YPAC Gold Member
Twitter: @PrincipalYahoo
CorpIM:  Hipchat & Iris
RE: Raid disk layout - the ability to lose a shelf. [ In reply to ]
I don't know that I would try to do it deliberately, but ONTAP does it automatically when you have enough shelves.

And I have seen this save a panic when a whole shelf failed. Facilities were testing a power rail without checking that the power supplies in all devices were operational. The engineer was opening the rack with a power supply in hand when it went down.

It took a while to rebuild, but it didn't go down.




Duncan Cummings
NetApp Specialist
Interactive Pty Ltd
Telephone +61 7 3323 0800
Facsimile +61 7 3323 0899
Mobile +61 403 383 050
www.interactive.com.au
