Mailing List Archive

Convention for logged vs not-logged page requests
Hi all,

In diving into a problem with logging[1], we discovered that we were
unintentionally treating several special page accesses (in this case,
containing included Javascript) as normal pageviews, thus throwing our
pageview statistics way off. The proposed solution involves changing
the way we access those Javascript requests from this form:
http://en.wikipedia.org/wiki/Special:BannerController

...to this form:
http://en.wikipedia.org/w/index.php?title=Special:BannerController

I'm assuming this convention isn't documented anywhere (other than
earlier today on the wikitech wiki[2]). Before we run off and
document this as something code reviewers need to look out for, I'd
like to make sure this is really how we'd like to make the
distinction.

Is this a sensible convention, or is there a different convention we
should implement? Note that any changes to the convention would need
to be implemented here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?view=markup

...so futzing with the convention isn't free, but *may* be worth it if
we arrive at a vastly superior convention.
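
For readers following along, the convention above can be sketched roughly
as follows. This is an illustration in Python, not the logic in filter.c;
the function name is_counted_pageview and the "/wiki/" prefix check are
assumptions based only on the two URL forms quoted in this message.

```python
from urllib.parse import urlsplit

# Assumed prefix for "pretty" URLs, taken from the example URLs above.
PRETTY_PREFIX = "/wiki/"


def is_counted_pageview(url: str) -> bool:
    """Return True if the URL path uses the pretty /wiki/<title> form.

    Under the convention discussed here, only pretty URLs are counted;
    requests that go through /w/index.php are not.
    """
    path = urlsplit(url).path
    return path.startswith(PRETTY_PREFIX) and len(path) > len(PRETTY_PREFIX)


if __name__ == "__main__":
    for url in (
        "http://en.wikipedia.org/wiki/Special:BannerController",
        "http://en.wikipedia.org/w/index.php?title=Special:BannerController",
    ):
        print(url, is_counted_pageview(url))
```

With this rule, moving the banner request to the index.php form is enough
to keep it out of the pageview counts, without the filter knowing anything
about the page title.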

Rob
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=25564
[2] http://wikitech.wikimedia.org/view/Squid_logging#Inflated_Stats

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Convention for logged vs not-logged page requests
2010/10/19 Rob Lanphier <robla@wikimedia.org>:
> Is this a sensible convention, or is there a different convention we
> should implement?  Note that any changes to the convention would need
> to be implemented here:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?view=markup
>
Never before did we load JS through a special page like that, and with
the resource loader coming up it will never be needed ever again,
cause we can and will run everything through load.php . It's a
one-time anomaly, so no need for any convention.

Roan Kattouw (Catrope)

Re: Convention for logged vs not-logged page requests
If I've understood you correctly, your suggestion is that, to make
logging easier, we should adopt a convention of how we call certain web
resources.

I can't imagine that you will ever be able to get all the programmers to
agree not to use URLs that way. It's not like we can mark the URL as
being dangerous somehow. As long as the URL works, they'll want to use
it... and really, why shouldn't they?

Is there some other way we could achieve those objectives? Are there
other patterns that already exist that we could use to notice when it's
not a full page request?


On 10/19/10 1:15 PM, Rob Lanphier wrote:
> Hi all,
>
> In diving into a problem with logging[1], we discovered that we were
> unintentionally treating several special page accesses (in this case,
> containing included Javascript) as normal pageviews, thus throwing our
> pageview statistics way off. The proposed solution involves changing
> the way we access those Javascript requests from this form:
> http://en.wikipedia.org/wiki/Special:BannerController
>
> ...to this form:
> http://en.wikipedia.org/w/index.php?title=Special:BannerController
>
> I'm assuming this convention isn't documented anywhere (other than
> earlier today on the wikitech wiki[2]). Before we run off and
> document this as something code reviewers need to look out for, I'd
> like to make sure this is really how we'd like to make the
> distinction.
>
> Is this a sensible convention, or is there a different convention we
> should implement? Note that any changes to the convention would need
> to be implemented here:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?view=markup
>
> ...so futzing with the convention isn't free, but *may* be worth it if
> we arrive at a vastly superior convention.
>
> Rob
> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=25564
> [2] http://wikitech.wikimedia.org/view/Squid_logging#Inflated_Stats

--
Neil Kandalgaonkar |) <neilk@wikimedia.org>

Re: Convention for logged vs not-logged page requests
On 10/19/10 1:29 PM, Roan Kattouw wrote:
> 2010/10/19 Rob Lanphier<robla@wikimedia.org>:
>> Is this a sensible convention, or is there a different convention we
>> should implement? Note that any changes to the convention would need
>> to be implemented here:
>> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?view=markup
>>
> Never before did we load JS through a special page like that, and with
> the resource loader coming up it will never be needed ever again,
> cause we can and will run everything through load.php . It's a
> one-time anomaly, so no need for any convention.

Isn't it fairly common to load data or other such page fragments in this
way, though? Or does it only seem common to me because I commonly work
with Commons?

--
Neil Kandalgaonkar |) <neilk@wikimedia.org>

Re: Convention for logged vs not-logged page requests
Assuming that both

http://en.wikipedia.org/wiki/Special:BannerController
http://en.wikipedia.org/w/index.php?title=Special:BannerController

will still return the same results, wouldn't it make more sense to
teach the stats logger to ignore both? Or is there a reason that we
actually want to track one and not the other?

It seems like an awful lot of trouble to teach every software author
that they need to follow a particular convention just so the stats
engine will work as intended. It would seem like it would be much
simpler to teach the stats engine to simply detect and ignore this
special case. Or is there a reason that doing so is not possible?
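
A minimal sketch of what "detect and ignore" could look like, assuming the
logger can reduce both URL forms to a page title first. The helper names
(extract_title, should_count) and the one-entry ignore list are
hypothetical; this is not how webstatscollector works today.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote, urlsplit

# Hypothetical ignore list; only the title from this thread is included.
IGNORED_TITLES = {"Special:BannerController"}


def extract_title(url: str) -> Optional[str]:
    """Reduce either URL form to the requested page title, if any."""
    parts = urlsplit(url)
    if parts.path.startswith("/wiki/"):
        # Pretty form: the title is the rest of the path.
        return unquote(parts.path[len("/wiki/"):]) or None
    if parts.path == "/w/index.php":
        # Script form: the title is in the query string (already decoded).
        titles = parse_qs(parts.query).get("title")
        return titles[0] if titles else None
    return None


def should_count(url: str) -> bool:
    """Count a request unless its title is on the ignore list."""
    title = extract_title(url)
    return title is not None and title not in IGNORED_TITLES
```

Note that this treats both forms alike, so it would also start counting
ordinary index.php?title=... requests, which the current filter does not.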

-Robert Rohde

On Tue, Oct 19, 2010 at 1:15 PM, Rob Lanphier <robla@wikimedia.org> wrote:
> Hi all,
>
> In diving into a problem with logging[1], we discovered that we were
> unintentionally treating several special page accesses (in this case,
> containing included Javascript) as normal pageviews, thus throwing our
> pageview statistics way off.  The proposed solution involves changing
> the way we access those Javascript requests from this form:
> http://en.wikipedia.org/wiki/Special:BannerController
>
> ...to this form:
> http://en.wikipedia.org/w/index.php?title=Special:BannerController
>
> I'm assuming this convention isn't documented anywhere (other than
> earlier today on the wikitech wiki[2]).  Before we run off and
> document this as something code reviewers need to look out for, I'd
> like to make sure this is really how we'd like to make the
> distinction.
>
> Is this a sensible convention, or is there a different convention we
> should implement?  Note that any changes to the convention would need
> to be implemented here:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/filter.c?view=markup
>
> ...so futzing with the convention isn't free, but *may* be worth it if
> we arrive at a vastly superior convention.
>
> Rob
> [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=25564
> [2] http://wikitech.wikimedia.org/view/Squid_logging#Inflated_Stats

Re: Convention for logged vs not-logged page requests
Rob Lanphier wrote:
> Hi all,
>
> In diving into a problem with logging[1], we discovered that we were
> unintentionally treating several special page accesses (in this case,
> containing included Javascript) as normal pageviews, thus throwing our
> pageview statistics way off. The proposed solution involves changing
> the way we access those Javascript requests from this form:
> http://en.wikipedia.org/wiki/Special:BannerController
>
> ...to this form:
> http://en.wikipedia.org/w/index.php?title=Special:BannerController
>
> I'm assuming this convention isn't documented anywhere (other than
> earlier today on the wikitech wiki[2]). Before we run off and
> document this as something code reviewers need to look out for, I'd
> like to make sure this is really how we'd like to make the
> distinction.

I think the anomaly is to have a Special page that is javascript.

A special page should look like a wiki page.

In your case, I would append ctype=text/javascript to the query string,
so it
a) Looks more like something that will give out javascript.
b) Forces it to use the long style.


Re: Convention for logged vs not-logged page requests
On Tue, Oct 19, 2010 at 11:41 PM, Platonides <Platonides@gmail.com> wrote:
> Rob Lanphier wrote:
>> Hi all,
>>
>> In diving into a problem with logging[1], we discovered that we were
>> unintentionally treating several special page accesses (in this case,
>> containing included Javascript) as normal pageviews, thus throwing our
>> pageview statistics way off.  The proposed solution involves changing
>> the way we access those Javascript requests from this form:
>> http://en.wikipedia.org/wiki/Special:BannerController
>>
>> ...to this form:
>> http://en.wikipedia.org/w/index.php?title=Special:BannerController
>>
>> I'm assuming this convention isn't documented anywhere (other than
>> earlier today on the wikitech wiki[2]).  Before we run off and
>> document this as something code reviewers need to look out for, I'd
>> like to make sure this is really how we'd like to make the
>> distinction.
>
> I think the anomaly is to have a Special page that is javascript.
>
> A special page should look like a wiki page.
>
> In your case, I would append ctype=text/javascript to the query string,
> so it
> a) Looks more like something that will give out javascript.
> b) Forces it to use the long style.
Nope, appending parameters works also in the short form:
http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript

Works also for ?action=edit etc.

Marco

--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de

Re: Convention for logged vs not-logged page requests
On 20 Oct 2010, at 00:09, Marco Schuster wrote:

> On Tue, Oct 19, 2010 at 11:41 PM, Platonides <Platonides@gmail.com>
> wrote:
>> Rob Lanphier wrote:
>>> Hi all,
>>>
>>> In diving into a problem with logging[1], we discovered that we were
>>> unintentionally treating several special page accesses (in this
>>> case,
>>> containing included Javascript) as normal pageviews, thus throwing
>>> our
>>> pageview statistics way off. The proposed solution involves
>>> changing
>>> the way we access those Javascript requests from this form:
>>> http://en.wikipedia.org/wiki/Special:BannerController
>>>
>>> ...to this form:
>>> http://en.wikipedia.org/w/index.php?title=Special:BannerController
>>>
>>> I'm assuming this convention isn't documented anywhere (other than
>>> earlier today on the wikitech wiki[2]). Before we run off and
>>> document this as something code reviewers need to look out for, I'd
>>> like to make sure this is really how we'd like to make the
>>> distinction.
>>
>> I think the anomaly is to have a Special page that is javascript.
>>
>> A special page should look like a wiki page.
>>
>> In your case, I would append ctype=text/javascript to the query
>> string,
>> so it
>> a) Looks more like something that will give out javascript.
>> b) Forces it to use the long style.
> Nope, appending parameters works also in the short form:
> http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript
>
> Works also for ?action=edit etc.
>
> Marco
>
> --
> VMSoft GbR
> Nabburger Str. 15
> 81737 München
> Geschäftsführer: Marco Schuster, Volker Hemmert
> http://vmsoft-gbr.de


But the short version without /w/index.php but with direct ?parameters
doesn't work for action=raw (&ctype=text/javascript)

See the error on: http://meta.wikimedia.org/wiki/User:Krinkle/global.js?action=raw

Nor does (or at least did) the software ever point to a non-viewing page
in the short form.


--
Krinkle
Re: Convention for logged vs not-logged page requests
On Wed, Oct 20, 2010 at 12:49 AM, Krinkle <krinklemail@gmail.com> wrote:
> But the short version without /w/index.php but with direct ?parameters
> doesn't work for action=raw (&ctype=text/javascript)
>
> See the error on: http://meta.wikimedia.org/wiki/User:Krinkle/global.js?action=raw

Strange. I'm sure this is to prevent users from using Wikipedia as
spy-javascript-hoster, but why does
http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&action=raw
work then?

Marco


--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de

Re: Convention for logged vs not-logged page requests
On Tue, Oct 19, 2010 at 4:15 PM, Marco Schuster <
marco@harddisk.is-a-geek.org> wrote:

> On Wed, Oct 20, 2010 at 12:49 AM, Krinkle <krinklemail@gmail.com> wrote:
> > But the short version without /w/index.php but with direct ?parameters
> > doesn't work for action=raw (&ctype=text/javascript)
> >
> > See the error on:
> http://meta.wikimedia.org/wiki/User:Krinkle/global.js?action=raw
>
> Strange. I'm sure this is to prevent users from using Wikipedia as
> spy-javascript-hoster, but why does
>
> http://meta.wikimedia.org/w/index.php?title=User:Krinkle/global.js&action=raw
> work then?
>

Internet Explorer, at least until recently (might finally be fixed?), would
sometimes interpret "file extensions" on the end of a URL's path component
as if they were meaningful file type information, especially when combined
with actual content-type headers it considered "ambiguous".

A pretty URL such as "
http://meta.wikimedia.org/wiki/Something.html?action=raw" would thus be
dangerous, as the ".html" on the end of the wiki page -- a completely
meaningless piece of an opaque URL path -- could trigger interpretation of
the file's content as actual HTML, etc, thus become a vector for JavaScript
injection into the wiki's same-origin security context.

To keep that nailed down, we forbade access to action=raw unless the URL's
path portion matched the wiki's core entry point exactly. There may be nicer
ways to do this now. :)
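
The rule described above can be sketched like this. It is a Python
illustration under stated assumptions, not the actual MediaWiki code: the
entry point path "/w/index.php" is taken from the URLs in this thread, and
the function name raw_allowed is made up for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed script entry point, as seen in the URLs quoted in this thread.
ENTRY_POINT = "/w/index.php"


def raw_allowed(url: str) -> bool:
    """Allow action=raw only when the path is exactly the entry point.

    A path such as /wiki/Something.html could be sniffed as HTML by a
    browser that trusts "file extensions", so raw output is refused there.
    """
    parts = urlsplit(url)
    action = parse_qs(parts.query).get("action", ["view"])[0]
    if action != "raw":
        return True  # the restriction applies to raw output only
    return parts.path == ENTRY_POINT
```

This matches the behaviour Krinkle and Marco observed: the short form with
?action=raw is refused, while the index.php form of the same request works.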


Back to the original issue -- I agree with Roan that the best way to go is
to make sure most such things as the BannerLoader get converted to use the
ResourceLoader interface, which eliminates the need to create and manage as
many JS/CSS special-page points like this.

I think BannerLoader is part of CentralNotice, which is Scary Code and may
or may not fit in nicely though. *shudder* If making short-term tweaks to it
without redoing it, be very careful about caching!

-- brion
Re: Convention for logged vs not-logged page requests
On Tue, Oct 19, 2010 at 1:57 PM, Neil Kandalgaonkar <neilk@wikimedia.org> wrote:
> If I've understood you correctly, your suggestion is that, to make
> logging easier, we should adopt a convention of how we call certain web
> resources.

I'm not so much suggesting it as I am stating the status quo, and
asking whether we should document it well or change the code.

> I can't imagine that you will ever be able to get all the programmers to
> agree not to use URLs that way. It's not like we can mark the URL as
> being dangerous somehow. As long as the URL works, they'll want to use
> it... and really, why shouldn't they?

And they did. See https://bugzilla.wikimedia.org/show_bug.cgi?id=25564

On 10/19/10 1:29 PM, Roan Kattouw wrote:
> Never before did we load JS through a special page like that, and with
> the resource loader coming up it will never be needed ever again,
> cause we can and will run everything through load.php . It's a
> one-time anomaly, so no need for any convention.

I guess I'm not quite so confident this problem won't rear its head
again, but since it's a theoretical problem at this point, and we have
enough actual problems to deal with, I'm happy to drop it for now.

Rob

Re: Convention for logged vs not-logged page requests
Rob Lanphier wrote:
> I guess I'm not quite so confident this problem won't rear its head
> again, but since it's a theoretical problem at this point, and we have
> enough actual problems to deal with, I'm happy to drop it for now.

This thread reminded me of the old Webalizer hack of including
"&dontcountme=s" in URLs to avoid things like JavaScript loads inflating the
stats it collected. (Or at least I think that's the issue the URL parameter
was trying to solve.)
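
On the stats side, honouring such a marker might look roughly like the
sketch below. It only illustrates the idea of an opt-out query parameter;
how Webalizer itself was configured is not shown here, and the function
name and example URL are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def marked_dont_count(url: str) -> bool:
    """Return True if the request carries the dontcountme=s marker."""
    query = parse_qs(urlsplit(url).query)
    return "s" in query.get("dontcountme", [])


if __name__ == "__main__":
    # Hypothetical script request that opts out of the statistics.
    url = "http://example.org/w/index.php?title=MediaWiki:Common.js&action=raw&dontcountme=s"
    print(marked_dont_count(url))
```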

The URL trick crept into all sorts of places, including site-wide JavaScript
pages, article text, and even core code.[1]

I think setting a standard is a good idea in general, though I worry about
making it prefix-based (like all URLs starting with
"http://foo.wikiproject.org/wiki/"). I can see potential problems with
counting hits to the secure server and I can see potential problems if the
URL structure changes in the future (possibly to something more sensible
like "http://foo.wikiproject.org/view/"). These problems might be
non-existent or unavoidable, I'm not completely sure.

MZMcBride

[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/35103



Re: Convention for logged vs not-logged page requests
2010/10/19 Robert Rohde <rarohde@gmail.com>:
> It seems like an awful lot of trouble to teach every software author
> that they need to follow a particular convention just so the stats
> engine will work as intended.  It would seem like it would be much
> simpler to teach the stats engine to simply detect and ignore this
> special case.  Or is there a reason that doing so is not possible?
>
As Domas pointed out to RobLa and myself, special page names are
internationalized, so you'd have to teach the stats logger about every
translation of it, which is impractical compared to using a different
URL.
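
To illustrate the problem: a filter that knows only the English title
misses localized spellings, so it would need a per-wiki alias table. The
table below is a hand-written illustration (an assumption, not data pulled
from any wiki), and is_ignored is a hypothetical helper.

```python
# The only name a naive filter would know about.
ENGLISH_ONLY = {"Special:BannerController"}

# Illustrative per-wiki aliases. A real table would have to be generated
# from each wiki's configuration and kept up to date.
ALIASES_BY_WIKI = {
    "en.wikipedia.org": {"Special:BannerController"},
    "de.wikipedia.org": {"Special:BannerController", "Spezial:BannerController"},
}


def is_ignored(host: str, title: str) -> bool:
    """Check a title against the alias set for that wiki.

    Falls back to the English-only set for wikis missing from the table.
    """
    return title in ALIASES_BY_WIKI.get(host, ENGLISH_ONLY)
```

Every wiki and every ignored page adds rows to that table, which is the
maintenance cost being weighed against simply using a different URL.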

Roan Kattouw (Catrope)

Re: Convention for logged vs not-logged page requests
Hi!

> will still return the same results, wouldn't it make more sense to
> teach the stats logger to ignore both?  Or is there a reason that we
> actually want to track one and not the other?

Pretty URLs are for being pretty URLs (e.g. in your address bar). That
leads to a very easy assumption: if there's a pretty URL, it
probably indicates a pageview :-) We quite like other pretty URLs for
Special pages, e.g. Watchlist or Recentchanges, as we track their
accesses.

> It seems like an awful lot of trouble to teach every software author
> that they need to follow a particular convention just so the stats
> engine will work as intended.  It would seem like it would be much
> simpler to teach the stats engine to simply detect and ignore this
> special case.  Or is there a reason that doing so is not possible?

Heh, apparently stats became a big deal lately, so one with powers to
change that can feel important! ;-)

Anyway, there are a few choices to resolve it on the stats side:

1) Implement pulling of a namespace map for each project, build out
an efficient rules engine (in C) for dealing with this (do note, every
project will have different namespace for this URL). Also, make it
extensible, so each developer tells about which names will be
not-a-pageview ;-) There's nothing as fun as writing that kind of
code, and do note, it won't be just five (or fifty) lines.

2) Add an additional internal header (X-Pageview: true!) that would be
logged by squids inside the stream :) That probably calls for a large
review inside MediaWiki, as well as squid code changes (and of course,
rollout of a new binary). Would be a nice inter-group effort.

3) Not care about inflated per-project numbers, or have people adjust
the numbers, as the source data is there (They can filter out banner
loader themselves!)
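
As a rough sketch of how option 2 would look from the collector's side,
assume each log line carried the URL and the header value separated by a
tab. That line format is invented for this example; the real squid log
stream is different, and no such header exists today.

```python
from collections import Counter
from typing import Iterable


def count_pageviews(log_lines: Iterable[str]) -> Counter:
    """Count only requests that were explicitly flagged as pageviews.

    Each line is assumed to be "<url>\t<x-pageview value>" (hypothetical).
    """
    counts: Counter = Counter()
    for line in log_lines:
        url, _, flag = line.rstrip("\n").partition("\t")
        if flag == "true":
            counts[url] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "/wiki/Main_Page\ttrue",
        "/wiki/Special:BannerController\tfalse",
        "/wiki/Main_Page\ttrue",
    ]
    print(count_pageviews(sample))
```

The collector then needs no knowledge of titles or URL forms at all; the
decision moves into MediaWiki, which is where the review effort would go.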

You can pick any of these, make sure it gets into strategy plan, as we
don't decide things on wikitech-l anymore :)
I prefer, hehehe, not doing anything, and just having pretty URLs just
for pageviews ;-)

Domas

Re: Convention for logged vs not-logged page requests
On Wed, Oct 20, 2010 at 5:51 AM, Domas Mituzas <midom.lists@gmail.com> wrote:
>> It seems like an awful lot of trouble to teach every software author
>> that they need to follow a particular convention just so the stats
>> engine will work as intended.  It would seem like it would be much
>> simpler to teach the stats engine to simply detect and ignore this
>> special case.  Or is there a reason that doing so is not possible?
>
> Heh, apparently stats became a big deal lately, so one with powers to
> change that can feel important! ;-)
>
> Anyway, there are a few choices to resolve it on the stats side:
>
> 1) Implement pulling of a namespace  map for each project, build out
> an efficient rules engine (in C) for dealing with this (do note, every
> project will have different namespace for this URL). Also, make it
> extensible, so each developer tells about which names will be
> not-a-pageview ;-) There's nothing as fun as writing that kind of
> code, and do note, it won't be just five (or fifty) lines.

<snip>

> 3) Not care about inflated per-project numbers, or have people adjust
> the numbers, as the source data is there (They can filter out banner
> loader themselves!)

I think my comment about "stats engine" may have been confusing. I
tend to think of the entire process chain as part of the stats engine,
even though it is implemented as distinct collection and
interpretation bits.

There is no reason that the filtering has to be done in the stats
collector. It could be done there, but given the language variants
that is likely to be hard to code and slow, as you rightly point out.
I think I had more in mind that it be filtered at the interpretation
side of the stats process. In other words, that Zachte (or whoever)
generate a list of pages that are ignored for the purposes of counting
stats. That would seem to be an easier place to deal with an
exclusion list and to pull all language versions of those page names,
and such. Having such an exclusion list for interpretation will be
necessary anyway if we plan to reprocess the existing logs that don't
follow the suggested convention. (I'm assuming we don't want to
simply throw out three weeks of logs.)
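
A minimal sketch of that interpretation-side filter, assuming the
collected data can be read as (project, title, count) rows. The row
format, the EXCLUDED set, and the function name are assumptions made for
this example, not the format of the existing stats files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical exclusion list of (project, title) pairs.
EXCLUDED: Set[Tuple[str, str]] = {("en", "Special:BannerController")}


def project_totals(
    rows: Iterable[Tuple[str, str, int]],
    excluded: Set[Tuple[str, str]] = EXCLUDED,
) -> Dict[str, int]:
    """Sum per-project counts, skipping titles on the exclusion list."""
    totals: Dict[str, int] = defaultdict(int)
    for project, title, count in rows:
        if (project, title) in excluded:
            continue  # not a real pageview, leave it out of the total
        totals[project] += count
    return dict(totals)
```

Because the raw rows are left untouched, the same function could be rerun
over the existing logs once the exclusion list is agreed on.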

-Robert Rohde

Re: Convention for logged vs not-logged page requests
Marco Schuster wrote:
>> In your case, I would append ctype=text/javascript to the query string,
>> so it
>> a) Looks more like something that will give out javascript.
>> b) Forces it to use the long style.
> Nope, appending parameters works also in the short form:
> http://en.wikipedia.org/wiki/Special:BannerController?ctype=text/javascript
>
> Works also for ?action=edit etc.
>
> Marco

You could do that. But using the appropriate functions for creating the
link, you will be given the "ugly url". That's what I referred to.




Re: Convention for logged vs not-logged page requests
Rob Lanphier <robla <at> wikimedia.org> writes:

> In diving into a problem with logging[1], we discovered that we were
> unintentionally treating several special page accesses (in this case,
> containing included Javascript) as normal pageviews, thus throwing our
> pageview statistics way off. The proposed solution involves changing
> the way we access those Javascript requests from this form:
> http://en.wikipedia.org/wiki/Special:BannerController
>
> ...to this form:
> http://en.wikipedia.org/w/index.php?title=Special:BannerController

The problem with that is that most of the time, URLs like that *should* be
logged - they are simply the result of someone using a special page. For
example, search page loads (about 3% of all page loads!) go completely under the
radar this way, and while some wikipedias use hacks like [1] to avoid that, it
really isn't an ideal situation. Also, page edits and other actions are not
logged, nor page loads for old versions of pages, or for pages linked from
recentchanges, or unstable versions where FlaggedRevs are enabled.
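
To make the gap concrete, the sketch below labels a few request shapes
according to what a "/wiki/ prefix only" rule does with them. The labels
and the function name are made up for illustration; the query parameters
(search, action, oldid) are the ordinary index.php ones.

```python
from urllib.parse import parse_qs, urlsplit


def describe_request(url: str) -> str:
    """Label a request by how a prefix-only pageview rule treats it."""
    parts = urlsplit(url)
    if parts.path.startswith("/wiki/"):
        return "counted: pretty URL"
    query = parse_qs(parts.query)
    action = query.get("action", ["view"])[0]
    if action == "raw":
        return "ignored: raw content"
    if "search" in query:
        return "missed: search"
    if action in ("edit", "submit"):
        return "missed: edit"
    if "oldid" in query:
        return "missed: old revision"
    return "missed: other index.php request"
```

Only the "ignored: raw content" case is one that arguably should stay out
of the counts; the "missed" cases are human page loads that go unrecorded.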


[1]
http://de.wiktionary.org/w/index.php?title=MediaWiki:If-search.js&action=raw&ctype=text/css

