Mailing List Archive

the skin change in 1.21wmf5, display breakage, & fix retrospective
Basics: We rolled out 1.21wmf5 to the non-Wikipedia sites today, after a
brief reversion and re-deployment to fix breakage in how we were
displaying some styling. We are on track to deploy 1.21wmf5 to English
Wikipedia on Monday, December 3 per .

Below: why this happened and how it got fixed, and what we should change
to prevent problems like this in the future.

What happened: A change altered the headings in the Vector skin. The
new code didn't take the WMF config into account, as the author wasn't
expecting styles and HTML to be cached in such different ways.

The headings were changed from "h4"/"h5", but the CSS used those tags to
identify them (instead of using CSS classes). That means, as expected,
that cached pages served with the new CSS can have a broken layout for
up to 30 days.
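
To illustrate why tag-based selectors were fragile here, the following
is a minimal Python sketch, not the actual Vector CSS. It models a
stylesheet as a list of selectors and checks whether a heading from a
stale cached page still gets styled. The new tag name "h3" and the class
name "portal-heading" are assumptions made up for the example; the text
above does not name them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heading:
    tag: str                            # element name, e.g. "h5"
    classes: frozenset = frozenset()    # CSS classes on the element


def matches(selector: str, el: Heading) -> bool:
    """Tiny matcher: supports only 'tag' and '.class' selectors."""
    if selector.startswith("."):
        return selector[1:] in el.classes
    return selector == el.tag


def is_styled(stylesheet, el: Heading) -> bool:
    """True if any selector in the stylesheet matches the element."""
    return any(matches(sel, el) for sel in stylesheet)


# Old HTML still sitting in the page cache vs. freshly rendered HTML.
# "h3" is a hypothetical new tag, chosen only for illustration.
cached_heading = Heading("h5")
fresh_heading = Heading("h3")

# New CSS keyed only on the new tag: stale cached pages lose styling.
css_new_tag_only = ["h3"]

# Backwards-compatible CSS: keeps the old tag until caches expire.
css_compatible = ["h3", "h5"]

# Class-keyed CSS: the tag can change freely if both old and new HTML
# carry the same (hypothetical) class.
css_by_class = [".portal-heading"]
cached_with_class = Heading("h5", frozenset({"portal-heading"}))
fresh_with_class = Heading("h3", frozenset({"portal-heading"}))
```

In this model, css_new_tag_only styles fresh_heading but not
cached_heading, while css_compatible and css_by_class cover both old and
new markup.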

Page cache is controlled by the wiki page content: unless the page is
modified, the cached HTML is kept for up to 30 days for anonymous users.
Resource modules, however, are served by ResourceLoader, which has its
own much more efficient and deployable cache mechanism. This means that
the resources for the skin are deployed globally and site-wide within
5 minutes, whereas the HTML isn't updated for another 2 weeks.
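
The timing mismatch can be made concrete with a small Python sketch. It
is only a model under stated assumptions: the 30-day and 5-minute
figures come from the paragraph above, while the function name and the
dates in the usage note are hypothetical, and it ignores edits or purges
that would refresh a page early.

```python
from datetime import datetime, timedelta

HTML_MAX_AGE = timedelta(days=30)        # page cache lifetime for anonymous users
RESOURCE_ROLLOUT = timedelta(minutes=5)  # time for new skin resources to reach everyone


def mismatch_window(page_cached_at: datetime, deployed_at: datetime):
    """Return (start, end) of the period in which an anonymous reader can
    receive old cached HTML together with new CSS, or None if there is no
    such period."""
    if page_cached_at >= deployed_at:
        # Page was rendered after the deploy, so its HTML is already new.
        return None
    start = deployed_at + RESOURCE_ROLLOUT   # new resources are everywhere by now
    end = page_cached_at + HTML_MAX_AGE      # cached HTML expires here at the latest
    if end <= start:
        return None
    return (start, end)
```

For example, a page cached on a hypothetical 2000-01-01 with a deploy on
2000-01-02 gives a window from 2000-01-02 00:05 to 2000-01-31 00:00
unless the page is edited or purged first.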

The issues this caused were visible in beta labs for the last three
days, but none of us realized they were significant because we thought
they were caused by a misconfigured memcache; see .

We knew that this particular change and the related change might be
problematic and sent out a note about it on Monday, but it looks like we
didn't test thoroughly enough on Monday and Tuesday to catch it before
the Wednesday deploy. Only anonymous users would have been affected: we
don't cache pages for logged-in users in Squid, so logged-in users
didn't notice problems on and after the first deploy.

Problems popped up after the Phase 2 deployment to non-Wikipedia sites,
so we reverted the 1.21wmf5 deployment and then redeployed while fixing,
purging, etc.

Gerrit changes: , ,

What we should fix for the future:

This is why client resources must always be backwards compatible.

"Don't change the HTML in incompatible ways" is probably a good
general rule to live by--but having an easy way to say "start purging
all pages on $theseWikis from Squid/Varnish" would also be nice.

Get more manual testing on and immediately after Phase 1 deployment,
including as an anonymous reader and editor, to ensure we catch Squid
caching issues.

Train more people to review code well, to reduce backlog and catch
these kinds of problems?

Get more people to +2 in core and in important extensions.

Beta labs needs to be trustworthy enough to make this sort of thing
a blocker immediately.

Chris McMahon's take: (for what it's worth, this seems to me to be
a sign that beta labs is becoming more and more trustworthy all the
time. The more we actually use it, the more we'll understand what does
and does not work there. We fixed the memcache problem, which fixed the
ability to log in, but didn't investigate the display problems because
we're used to beta not being very reliable. In this case, beta was
reliable, and we didn't understand that. Even with a bug report in
bugzilla with 9 subscribers, no one recognized a real issue.)

Chris McMahon said: I think this could be framed as an issue of signal,
noise, and bandwidth. Beta labs being broken a lot, review backlog in
Gerrit, and false failures in tests are all noise. Given the constraints of
ongoing projects, it is difficult to pick out the signal from the noise.
We can take steps to reduce the noise so that the signal stands out
more by reducing technical debt: make the tests green, make the test
environment robust, keep up with code review.

(I assembled this just now from IRC & mailing list chatter from several
people, and errors are mine -- sorry for missing attributions here.
Drafting was on )

Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Wikitech-l mailing list
Re: the skin change in 1.21wmf5, display breakage, & fix retrospective
I still experience the problem on Wikidata main page in Monobook skin.

Re: the skin change in 1.21wmf5, display breakage, & fix retrospective
> "Don't change the HTML in incompatible ways" is probably a good
>general rule to live by

Not necessarily - I've only read your summary and not looked into what
happened in depth, but the issue seems to have come from changing the
CSS [and possibly JS] in a non-backwards-compatible way. Changing the
HTML didn't really matter.

>but having an easy way to say "start purging
>all pages on $theseWikis from Squid/Varnish" would also be nice.

That sounds like something that could hurt the server kitties unless
done rather slowly [at least for enwiki]...
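
A throttled purge along these lines could look roughly like the Python
sketch below. It is an illustration under assumptions, not an existing
tool: send_purge is a hypothetical callback that would issue one purge
request to the cache layer for a URL, and the batch size and pause are
arbitrary placeholders that would need tuning per wiki.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(urls: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` URLs."""
    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def purge_slowly(
    urls: Iterable[str],
    send_purge: Callable[[str], None],   # hypothetical: purges one URL
    batch_size: int = 100,               # placeholder value
    pause_seconds: float = 1.0,          # placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Purge URLs in small batches, pausing between batches so the
    backend is not hit with every re-render at once. Returns the number
    of URLs purged."""
    purged = 0
    for index, batch in enumerate(batched(urls, batch_size)):
        if index > 0:
            sleep(pause_seconds)   # pause before every batch except the first
        for url in batch:
            send_purge(url)
            purged += 1
    return purged
```

Passing sleep as a parameter keeps the pacing testable without real
delays; a larger pause or smaller batch would be the knob for a big wiki
such as enwiki.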

