Mailing List Archive

Christmas server failure report
Earlier today, /a filled up with binlogs on db27, which was the s3 & s7
master. Nagios had warned, but either too early or without anyone
noticing. Slaves lagged, locks piled up, and the wikis ground to a halt.
Revisions between 6:50 and 8:20 pm UTC were lost (although they can be
manually reimported from db27).
The new s3 and s7 master is db17, with only one slave: db25.
After the master switch, we started having problems with cached
revision text in memcached, caused by duplicated old_id values,
so we made the affected wikis read-only until midnight UTC.

We decided not to disable $wgRevisionCacheExpiry but to remove the
faulty entries, so I quickly wrote the script
maintenance/purgeStaleMemcachedText.php to clean them.
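
A minimal sketch of that kind of purge, written in Python and not taken from purgeStaleMemcachedText.php itself. It assumes revision text is cached under a key built from the wiki name and the old_id; the key layout, the function names and the plain dict standing in for memcached are all illustrative assumptions.

```python
from typing import Dict


def revision_text_key(wiki: str, text_id: int) -> str:
    # Assumed key layout, loosely modelled on MediaWiki-style cache keys.
    return f"{wiki}:revisiontext:textid:{text_id}"


def purge_text_range(cache: Dict[str, str], wiki: str,
                     first_id: int, last_id: int) -> int:
    """Delete cached revision text for old_id values in [first_id, last_id].

    Returns the number of entries that were present and removed.
    """
    removed = 0
    for text_id in range(first_id, last_id + 1):
        key = revision_text_key(wiki, text_id)
        if key in cache:
            del cache[key]
            removed += 1
    return removed
```

It would be called once per affected wiki, with the range of old_id values that may have been duplicated.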

There were problems on hewiki, where the data did not get cleaned. In
one instance, a $wgMemc->get still returned the value even after a
$wgMemc->delete on that same key (???).
Other than the hewiki issues, it seemed to run fine. There will be lots
of wrong entries in the diff and parser caches; a manual action=purge
will clean them.
FlaggedRevs caches were not touched. Wikis using it may show the wrong
content (with the additional fun of some users viewing the right one).
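
The delete-then-get check that misbehaved on hewiki can be sketched as below. This is an illustrative Python sketch, not MediaWiki code: FlakyCache is a hypothetical in-memory stand-in whose first few deletes are ignored, used only to reproduce the symptom, and delete_and_verify is an assumed helper name.

```python
from typing import Dict, Optional


class FlakyCache:
    """Hypothetical stand-in for a cache whose first `ignored` deletes do nothing."""

    def __init__(self, data: Dict[str, str], ignored: int = 0) -> None:
        self.data = dict(data)
        self.ignored = ignored

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def delete(self, key: str) -> bool:
        if self.ignored > 0:
            self.ignored -= 1
            return False
        return self.data.pop(key, None) is not None


def delete_and_verify(cache: FlakyCache, key: str, max_attempts: int = 3) -> bool:
    """Delete `key`, then read it back; retry a bounded number of times.

    Returns True once the key is confirmed gone, False if it still
    returns data after max_attempts deletes.
    """
    for _ in range(max_attempts):
        cache.delete(key)
        if cache.get(key) is None:
            return True
    return False
```

Bounding the retries means a key that never goes away gets reported instead of looping forever.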

There are also PPFrame_DOM->expand errors that started around the same
time, even on wikis on a different cluster. They usually happen only
once, and reloading the page succeeds.
https://bugzilla.wikimedia.org/show_bug.cgi?id=26429


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Christmas server failure report
On Sat, Dec 25, 2010 at 7:49 PM, Platonides <Platonides@gmail.com> wrote:
> Revisions between 6:50 and 8:20 pm UTC were lost (although they can be
> manually reimported from db27).
Will they actually be reimported? That sounds like it would be messy.

Re: Christmas server failure report
Ryan Lane wrote a script to purge some of the FlaggedRevs memcached
entries; that ran last night as well.

The DOM-related errors all seem to have come from srv227; Apache on that
host was restarted about half an hour ago and the results look good.

Ariel

Re: Christmas server failure report
On 26/12/10 01:49, Platonides wrote:
> Earlier today, /a filled up with binlogs on db27, which was the s3 & s7
> master. Nagios had warned, but either too early or without anyone
> noticing. Slaves lagged, locks piled up, and the wikis ground to a halt.

Would it be possible to automatically move the binlogs from the master
DB to a dedicated storage space? That way the master's disk should
always have free space, and the binlogs would still be available.

--
Ashar Voultoiz


Re: Christmas server failure report
Ashar Voultoiz wrote:
> On 26/12/10 01:49, Platonides wrote:
>> Earlier today, /a filled up with binlogs on db27, which was the s3 & s7
>> master. Nagios had warned, but either too early or without anyone
>> noticing. Slaves lagged, locks piled up, and the wikis ground to a halt.
>
> Would it be possible to automatically move the binlogs from the master
> DB to a dedicated storage space? That way the master's disk should
> always have free space, and the binlogs would still be available.

You could only move a binlog once it's no longer in use. So this is
equivalent to automatically purging binlogs when they are no longer
needed, preceded by a file copy.
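
The copy-then-purge sequence described above might look like the following Python sketch. The mysql-bin.NNNNNN file naming, the directory arguments and the function names are assumptions; the oldest binlog still needed would have to come from the slaves (for example Master_Log_File in SHOW SLAVE STATUS). The sketch only copies files and returns the PURGE BINARY LOGS statement to run, so that MySQL itself removes the files and updates its index.

```python
import shutil
from pathlib import Path
from typing import List


def archivable_binlogs(binlogs: List[str], oldest_needed: str) -> List[str]:
    """Return the binlogs strictly older than the oldest one still in use."""
    # Assumes zero-padded numeric suffixes of equal width, so that
    # lexicographic order matches creation order.
    ordered = sorted(binlogs)
    if oldest_needed not in ordered:
        raise ValueError(f"{oldest_needed} is not among the known binlogs")
    return ordered[: ordered.index(oldest_needed)]


def archive_then_purge_statement(binlog_dir: Path, archive_dir: Path,
                                 oldest_needed: str) -> str:
    """Copy unused binlogs to archive_dir and return the purge statement."""
    # The character class skips non-log files such as mysql-bin.index.
    names = [p.name for p in binlog_dir.glob("mysql-bin.[0-9]*")]
    for name in archivable_binlogs(names, oldest_needed):
        shutil.copy2(binlog_dir / name, archive_dir / name)
    # PURGE BINARY LOGS TO removes every log older than the named one.
    return f"PURGE BINARY LOGS TO '{oldest_needed}';"
```

Nothing is deleted from the master's disk until the returned statement is executed, so a failed copy leaves the binlogs in place.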

