Mailing List Archive

error lasted more than 10 minutes....
https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history

Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:04:23 GMT

Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:08:30 GMT

Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:09:55 GMT

Request: GET http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history, from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Sun, 27 Nov 2011 22:14:53 GMT

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
I am also getting this from en.m.wikipedia.org:

Error 503 Service Unavailable

Service Unavailable

Guru Meditation:

XID: 1592365530

Varnish cache server

On android, verizon 4g, default browser, 10+ minutes
On Nov 27, 2011 5:18 PM, "William Allen Simpson" <william.allen.simpson@gmail.com> wrote:

> https://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history
>
> Request: GET
> http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history,
> from 208.80.152.70 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
> Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at
> Sun, 27 Nov 2011 22:04:23 GMT
>
> Request: GET
> http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history,
> from 208.80.152.85 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
> Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at
> Sun, 27 Nov 2011 22:08:30 GMT
>
> Request: GET
> http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history,
> from 208.80.152.75 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
> Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at
> Sun, 27 Nov 2011 22:09:55 GMT
>
> Request: GET
> http://en.wikipedia.org/w/index.php?title=Sharon_Aguilar&action=history,
> from 208.80.152.50 via sq66.wikimedia.org (squid/2.7.STABLE9) to ()
> Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at
> Sun, 27 Nov 2011 22:14:53 GMT
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
We had a site outage of about 30 mins, caused by a major issue,
potentially hardware-related, with a database server, which blocked
all MediaWiki application servers (and thereby rendered most of our
sites unusable). Should be fixed now; we'll prepare a more
comprehensive incident analysis soon.

Thanks to the ops team for their speedy response.

All best,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
It's back up.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
It appears that we were actually taken down by the reddit community, after
a link to the fundraising stats page was posted under Brandon's IAMA there.

sq71.wikimedia.org 943326197 2011-11-27T22:51:09.075 62032 109.125.42.71
TCP_MISS/200 1035 GET
http://wikimediafoundation.org/wiki/Special:FundraiserStatistics ANY_PARENT/
208.80.152.47 text/html *
http://www.reddit.com/r/IAmA/comments/mr4pf/i_am_wikipedia_programmer_brandon_harris_ama/
* -
Mozilla/5.0%20(Windows%20NT%206.1;%20WOW64)%20AppleWebKit/535.2%20(KHTML,%20like%20Gecko)%20Chrome/15.0.874.121%20Safari/535.2

That page wasn't suitable for high volume public consumption (very
expensive db query + not properly cached), so the site problem persisted
even after the db initially suspected as bad was rotated out.

On Sun, Nov 27, 2011 at 2:39 PM, Erik Moeller <erik@wikimedia.org> wrote:

> We had a site outage of about 30 mins, caused by a major issue,
> potentially hardware-related, with a database server, which blocked
> all MediaWiki application servers (and thereby rendered most of our
> sites unusable). Should be fixed now; we'll prepare a more
> comprehensive incident analysis soon.
>
> Thanks to the ops team for their speedy response.
>
> All best,
> Erik
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman <afeldman@wikimedia.org> wrote:
> That page wasn't suitable for high volume public consumption (very
> expensive db query + not properly cached), so the site problem persisted
> even after the db initially suspected as bad was rotated out.
>
What happened to it? When this page was introduced, it did have proper
caching in memcached. Was that removed? Or did we get a cache
stampede?

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
Roan Kattouw wrote:
> On Mon, Nov 28, 2011 at 12:21 AM, Asher Feldman <afeldman@wikimedia.org>
> wrote:
>> That page wasn't suitable for high volume public consumption (very
>> expensive db query + not properly cached), so the site problem persisted
>> even after the db initially suspected as bad was rotated out.
>>
> What happened to it? When this page was introduced, it did have proper
> caching in memcached. Was that removed? Or did we get a cache
> stampede?

I asked roughly the same thing yesterday (more along the lines of "shouldn't
it take someone ten minutes to add memcache support to the extension?").
Reedy said it was long-running queries that never timed out that apparently
caused the issue.

The ContributionReporting extension being disabled is being tracked here:
<https://bugzilla.wikimedia.org/show_bug.cgi?id=32679>.

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride <z@mzmcbride.com> wrote:
> I asked roughly the same thing yesterday (more along the lines of "shouldn't
> it take someone ten minutes to add memcache support to the extension?").
> Reedy said it was long-running queries that never timed out that apparently
> caused the issue.
>
I read the code and sent some unsolicited advice to the fundraising
team. Essentially, the recaching operation could be doing ~35 times
fewer queries; that should help.
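
For illustration, here is a minimal sketch of the kind of query
consolidation meant here, using MediaWiki's database abstraction; the
table name, columns, and variables are invented stand-ins, not the
extension's real schema:

  // Hypothetical "before": one aggregate query per day of the fundraiser,
  // i.e. ~35 round trips to the database on every recache.
  $dbr = wfGetDB( DB_SLAVE );
  $totals = array();
  foreach ( $days as $day ) {
      $totals[$day] = $dbr->selectField(
          'contributions',              // invented table name
          'SUM(amount)',
          array( 'contribution_date' => $day )
      );
  }

  // Hypothetical "after": one grouped query fetches every day at once.
  $res = $dbr->select(
      'contributions',
      array( 'contribution_date', 'total' => 'SUM(amount)' ),
      array( 'contribution_date >= ' . $dbr->addQuotes( $firstDay ) ),
      __METHOD__,
      array( 'GROUP BY' => 'contribution_date' )
  );
  foreach ( $res as $row ) {
      $totals[$row->contribution_date] = $row->total;
  }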

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On 28/11/11 14:57, Roan Kattouw wrote:
> On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride<z@mzmcbride.com> wrote:
>> I asked roughly the same thing yesterday (more along the lines of "shouldn't
>> it take someone ten minutes to add memcache support to the extension?").
>> Reedy said it was long-running queries that never timed out that apparently
>> caused the issue.
>>
> I read the code and sent some unsolicited advice to the fundraising
> team. Essentially, the recaching operation could be doing ~35 times
> fewer queries; that should help.
>
> Roan
>

And adding memcached caching with even, say, as little as a 1 minute
cache entry timeout should dilute that reduced load even more, and put
an upper bound on the load generated, just in case it gets
slashdot/reddited again.

-- Neil


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris <neil@tonal.clara.co.uk> wrote:
> And adding memcached caching with even, say, as little as a 1 minute cache
> entry timeout should dilute that reduced load even more, and put an
> upper bound on the load generated, just in case it gets slashdot/reddited
> again.
>
It was already in memcached, cached for 15 minutes. However, if
recaching takes a long time and your page gets a lot of traffic, you
can get a cache stampede (just like when Michael Jackson died): while
the recache is in progress, there are more hits for your page and a
zillion Apache workers all race to rebuild the cache, unaware of each
other. I have no evidence that that's what happened, but that's my
theory. Making the recache faster and/or upping the cache timeout
reduces the size and the frequency, respectively, of the window in
which this can happen.
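
To make the failure mode concrete, here is a minimal sketch of the
stampede-prone pattern (generic MediaWiki-style memcached calls;
computeFundraiserStats() is an invented placeholder, not the extension's
actual code):

  // Naive cache-or-recompute: fine at low traffic, stampede-prone under
  // heavy traffic, because every request that sees the miss recomputes.
  $key = wfMemcKey( 'fundraiser', 'stats' );
  $stats = $wgMemc->get( $key );
  if ( $stats === false ) {
      // Between the expiry and the set() below, every concurrent request
      // lands here and hits the database at the same time.
      $stats = computeFundraiserStats(); // invented placeholder
      $wgMemc->set( $key, $stats, 15 * 60 ); // 15-minute TTL
  }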

The cache stampede problem was solved for the particular case of the
parser cache using PoolCounter, but I don't think it's necessary for
other types of caching. Computing fundraiser statistics simply
shouldn't be that slow.

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On 28/11/11 19:36, Roan Kattouw wrote:
> On Mon, Nov 28, 2011 at 8:28 PM, Neil Harris<neil@tonal.clara.co.uk> wrote:
>> And adding memcached caching with even, say, as little as a 1 minute cache
>> entry timeout should dilute that reduced load even more, and put an
>> upper bound on the load generated, just in case it gets slashdot/reddited
>> again.
>>
> It was already in memcached, cached for 15 minutes. However, if
> recaching takes a long time and your page gets a lot of traffic, you
> can get a cache stampede (just like when Michael Jackson died): while
> the recache is in progress, there are more hits for your page and a
> zillion Apache workers all race to rebuild the cache, unaware of each
> other. I have no evidence that that's what happened, but that's my
> theory. Making the recache faster and/or upping the cache timeout
> reduces the size and the frequency, respectively, of the window in
> which this can happen.
>
> The cache stampede problem was solved for the particular case of the
> parser cache using PoolCounter, but I don't think it's necessary for
> other types of caching. Computing fundraiser statistics simply
> shouldn't be that slow.
>
> Roan
>

I hadn't thought properly about cache stampedes: since the parser cache
is only part of page rendering, this might also explain some of the
other occasional slowdowns I've seen on Wikipedia.

It would be really cool if there could be some sort of general mechanism
to enable this to be prevented for all page URLs protected by
memcaching, throughout the system.

-- N.


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris <neil@tonal.clara.co.uk> wrote:
> I hadn't thought properly about cache stampedes: since the parser cache is
> only part of page rendering, this might also explain some of the other
> occasional slowdowns I've seen on Wikipedia.
>
> It would be really cool if there could be some sort of general mechanism to
> enable this to be prevented for all page URLs protected by memcaching,
> throughout the system.
>
I'm not very familiar with PoolCounter but I suspect it's a fairly
generic system for handling this sort of thing. However, stampedes
have never been a practical problem for anything except massive
traffic combined with slow recaching, and that's a fairly rare case.
So I don't think we want to add that sort of concurrency protection
everywhere.

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 12:06 PM, Roan Kattouw <roan.kattouw@gmail.com> wrote:

> On Mon, Nov 28, 2011 at 8:59 PM, Neil Harris <neil@tonal.clara.co.uk>
> wrote:
> > I hadn't thought properly about cache stampedes: since the parser cache is
> > only part of page rendering, this might also explain some of the other
> > occasional slowdowns I've seen on Wikipedia.
> >
> > It would be really cool if there could be some sort of general mechanism
> > to enable this to be prevented for all page URLs protected by memcaching,
> > throughout the system.
> >
> I'm not very familiar with PoolCounter but I suspect it's a fairly
> generic system for handling this sort of thing. However, stampedes
> have never been a practical problem for anything except massive
> traffic combined with slow recaching, and that's a fairly rare case.
> So I don't think we want to add that sort of concurrency protection
> everywhere.
>

For memcache objects that can be grouped together into an "ok to use if a
bit stale" bucket (such as all kinds of stats), there is also the
possibility of lazy async regeneration.

Data is stored in memcache with a fuzzy expire time, i.e. { data:foo,
stale:$now+15min } and a cache ttl of forever. When getting the key, if
the timestamp inside marks the data as stale, you can 1) attempt to obtain
an exclusive (acq4me) lock from poolcounter. If immediately successful,
launch an async job to regenerate the cache (while holding the lock) but
continue the request with stale data. In all other cases, just use the
stale data. Mainly useful if the regeneration work is hideously expensive,
such that you wouldn't want clients blocking on even a single cache regen
(as is the behavior with poolcounter as deployed for the parser cache).
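
Roughly, as a sketch (with $wgMemc->add() standing in for the poolcounter
acq4me lock, and deferRegeneration() / computeFundraiserStats() as invented
placeholders for the async job and the expensive queries):

  function getFundraiserStats() {
      global $wgMemc;
      $key = wfMemcKey( 'fundraiser', 'stats' );
      $wrapped = $wgMemc->get( $key );

      if ( $wrapped === false ) {
          // Cold cache: nothing stale to serve, so compute once, synchronously.
          return regenerateFundraiserStats( $key );
      }

      if ( time() >= $wrapped['stale'] && $wgMemc->add( "$key:lock", 1, 60 ) ) {
          // Exactly one request wins the lock and schedules the refresh;
          // it still returns the stale copy below instead of blocking on it.
          deferRegeneration( $key ); // invented placeholder for the async job
      }

      return $wrapped['data']; // possibly a bit stale, by design
  }

  function regenerateFundraiserStats( $key ) {
      global $wgMemc;
      $data = computeFundraiserStats(); // invented placeholder: the slow part
      $wgMemc->set(
          $key,
          array( 'data' => $data, 'stale' => time() + 15 * 60 ),
          0 // no hard expiry; only the embedded timestamp goes stale
      );
      return $data;
  }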
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Tue, Nov 29, 2011 at 12:57 AM, Roan Kattouw <roan.kattouw@gmail.com> wrote:
> On Mon, Nov 28, 2011 at 3:50 PM, MZMcBride <z@mzmcbride.com> wrote:
>> I asked roughly the same thing yesterday (more along the lines of "shouldn't
>> it take someone ten minutes to add memcache support to the extension?").
>> Reedy said it was long-running queries that never timed out that apparently
>> caused the issue.
>>
> I read the code and sent some unsolicited advice to the fundraising
> team. Essentially, the recaching operation could be doing ~35 times
> fewer queries; that should help.
>
> Roan
Wasn't the extension ever reviewed before being enabled? Shouldn't the
review have caught this?

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On Mon, Nov 28, 2011 at 9:32 PM, K. Peachey <p858snake@gmail.com> wrote:
> Wasn't the extension ever reviewed before being enabled? Shouldn't the
> review have caught this?
>
The relevant code wasn't present in the extension when it was
originally enabled (in 2009, I think); it was introduced this year. I
don't know who reviewed those changes; all I know is it wasn't me.

Roan

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: error lasted more than 10 minutes.... [ In reply to ]
On 28/11/11 21:26, Asher Feldman wrote:
> For memcache objects that can be grouped together into an "ok to use if a
> bit stale" bucket (such as all kinds of stats), there is also the
> possibility of lazy async regeneration.
>
> Data is stored in memcache with a fuzzy expire time, i.e. { data:foo,
> stale:$now+15min } and a cache ttl of forever. When getting the key, if
> the timestamp inside marks the data as stale, you can 1) attempt to obtain
> an exclusive (acq4me) lock from poolcounter. If immediately successful,
> launch an async job to regenerate the cache (while holding the lock) but
> continue the request with stale data. In all other cases, just use the
> stale data. Mainly useful if the regeneration work is hideously expensive,
> such that you wouldn't want clients blocking on even a single cache regen
> (as is the behavior with poolcounter as deployed for the parser cache).

I see you looked at the poolcounter code :)
Yes, it would be useful to have a class handling that kind of thing,
which would store data valid for a known time (usually a guess) with a
slightly longer expiry, packed with a timestamp. And if it was overdue,
launch an update protected with an acq4any with 0 queue.
I hadn't considered showing "stale" data for that first hit, but it could
easily be done through DeferredUpdates. You only need to be careful not
to be reentrant to the same key, since you might deadlock (although with
a 0-queue that's unlikely).
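
Something like this shape, perhaps, as a sketch only; the acq4any /
DeferredUpdates wiring is indicated in comments rather than implemented,
and all the names here are invented:

  class LazyCachedValue {
      private $key;
      private $callback;
      private $softTtl;

      public function __construct( $key, $callback, $softTtl = 900 ) {
          $this->key = $key;
          $this->callback = $callback; // regenerates the value (the expensive bit)
          $this->softTtl = $softTtl;   // how long the value counts as fresh
      }

      public function get() {
          global $wgMemc;
          $wrapped = $wgMemc->get( $this->key );
          if ( $wrapped === false ) {
              // Very first hit: no stale copy to fall back on.
              return $this->regenerate();
          }
          if ( time() >= $wrapped['stale'] ) {
              // The real class would take the acq4any/0-queue lock here and
              // push regenerate() into DeferredUpdates, returning the stale
              // copy immediately; done inline only to keep the sketch short.
              return $this->regenerate();
          }
          return $wrapped['data'];
      }

      private function regenerate() {
          global $wgMemc;
          $data = call_user_func( $this->callback );
          // Stored with no hard expiry; only the embedded timestamp goes stale.
          $wgMemc->set(
              $this->key,
              array( 'data' => $data, 'stale' => time() + $this->softTtl ),
              0
          );
          return $data;
      }
  }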


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l