Mailing List Archive

Page views
I'm telling people that the Swedish Wikipedia has 90-100
million page views per month, or on average ten per month
per Swedish citizen. This is based on stats.wikimedia.org
(Wikistats), but is it really true? It would be really
embarrassing if it were off by an order of magnitude.

There is of course a difference between the language and
the country. Another measure says Internet users in Sweden
(some 90 percent of all citizens) make 16 page views to
Wikipedia per month, including all languages. Both numbers,
10 and 16, make sense. But are they correct?

Wikistats also says Swedish Wikisource has 300-400 thousand
page views per month, which would be 10-13 thousand per day
on average. Knowing how small the Swedish Wikisource is (only
16,000 wiki pages + 37,000 facsimile pages), and comparing
to other Swedish language websites, I'm surprised that
Swedish Wikisource could attract even this much traffic.
At such a small scale, reading through a day's logfile of
13,000 lines is realistic for a human.

Is there a chance WMF could publish the logfile for Swedish
Wikisource for a typical day, with just the IP addresses
anonymized? Could WMF also publish the source code that
counts the page views by filtering out accesses from robot
crawlers and accesses to non-pages (such as images and style
sheets)?

The per-page view counts on stats.grok.se show that the
Main page of Swedish Wikisource is viewed 120 times/day,
while Recent changes is viewed 160 times/day. In my
experience, contributors are the only ones who look at
Recent changes, and they almost never look at the Main
page. If IP addresses are scrambled but not removed, the
log file should be able to show this pattern. Is it possible
to tell apart the IP addresses for contributors and
non-contributors, and present page views from each
category?
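One way to scramble the addresses without removing them would be a keyed
hash, so that the same IP always maps to the same opaque token within one
published log but cannot easily be reversed. A minimal sketch of what I
mean (FNV-1a with a per-dump secret salt is only an illustration; a real
release would want a proper keyed hash such as an HMAC):

#include <stdio.h>
#include <stdint.h>

/* Hash the secret salt followed by the IP string with FNV-1a.
   The same IP gives the same token within one dump; a different
   salt per dump prevents linking tokens across dumps.
   Illustrative only; not a recommendation of FNV-1a for privacy. */
static uint64_t pseudonymize_ip(const char *ip, const char *salt)
{
    uint64_t h = 1469598103934665603ULL;       /* FNV-1a offset basis */
    const char *parts[2] = { salt, ip };
    for (int p = 0; p < 2; p++) {
        for (const char *s = parts[p]; *s; s++) {
            h ^= (unsigned char)*s;
            h *= 1099511628211ULL;             /* FNV-1a prime */
        }
    }
    return h;
}

int main(void)
{
    const char *salt = "svws-2012-04-04";      /* hypothetical per-dump secret */
    printf("%016llx\n", (unsigned long long)pseudonymize_ip("192.0.2.17", salt));
    printf("%016llx\n", (unsigned long long)pseudonymize_ip("192.0.2.17", salt));
    printf("%016llx\n", (unsigned long long)pseudonymize_ip("192.0.2.42", salt));
    return 0;
}

With tokens like that in the published log, it would still be visible
whether the handful of addresses that hit Recent changes are also the
ones viewing the Main page.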


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Page views [ In reply to ]
Hi Lars,

You have a point here, especially for smaller projects:

For Swedish Wikisource:

zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' |
awk '{print $9, $11,$14}'

returns 20 lines (URL, mime type, user agent) from this 1:1000 sampled
squid log file. After removing javascript/json/robots.txt requests, 13
lines remain, which fits perfectly with 10,000 to 13,000 per day
(13 x 1000 = 13,000).

However, 9 of these 13 are bots!

http://sv.wikisource.org/wiki/Snabbt_jagar_stormen_v%C3%A5ra_%C3%A5r,text/ht
ml,Mozilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_2)%20AppleWebKit/
535.19%20(KHTML,%20like%20Gecko)%20Chrome/18.0.1025.142%20Safari/535.19
http://sv.wikisource.org/wiki/Special:Log?page=User:Sarahmaethomas,text/html
,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.ht
ml)
http://sv.wikisource.org/wiki/Underbar-k%C3%A4rlek-s%C3%A5-stor,text/html,Mo
zilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/w/index.php?title=Diskussion%3aBer%c3%a4ttelser+ur+
svenska+historien%2fHedniska+tiden%2f2&redirect=no&action=raw&ctype=text/pla
in&dontcountme=s,text/x-wiki,DotNetWikiBot/2.97%20(Microsoft%20Windows%20NT%
206.1.7601%20Service%20Pack%201;%20.NET%20CLR%202.0.50727.5448)
http://sv.wikisource.org/wiki/Sida:SOU_1962_36.djvu/36,-,Mozilla/5.0%20(comp
atible;%20Googlebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Till_Polen,-,Mozilla/5.0%20(compatible;%20Goog
lebot/2.1;%20+http://www.google.com/bot.html)
http://sv.wikisource.org/wiki/Bibeln_1917/F%C3%B6rsta_Moseboken,text/html,Mo
zilla/5.0%20(Macintosh;%20Intel%20Mac%20OS%20X%2010_7_3)%20AppleWebKit/534.5
3.11%20(KHTML,%20like%20Gecko)%20Version/5.1.3%20Safari/534.53.10
http://sv.wikisource.org/wiki/Arbetare,text/html,Mozilla/5.0%20(compatible;%
20YandexBot/3.0;%20+http://yandex.com/bots)
http://sv.wikisource.org/wiki/Industrin_och_kvinnofr,text/html,Mozilla/5.0%2
0(compatible;%20Baiduspider/2.0;%20+http://www.baidu.com/search/spider.html)
http://sv.wikisource.org/wiki/Sida:Berzelius_Reseanteckningar_1903.djvu/120,
text/html,Mozilla/5.0%20(compatible;%20Googlebot/2.1;%20+http://www.google.c
om/bot.html)
http://sv.wikisource.org/wiki/Kategori:Ordspr%C3%A5ksboken,text/html,Mozilla
/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20Win64;%20x64;%20Tr
ident/5.0)
http://sv.wikisource.org/wiki/Special:L%C3%A4nkar_hit/Kategori:Karin_Boye,te
xt/html,Mozilla/5.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bot
s)
http://sv.wikisource.org/wiki/Sida:Om_arternas_uppkomst.djvu/235,-,Mozilla/5
.0%20(compatible;%20YandexBot/3.0;%20+http://yandex.com/bots)

The page view report
http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthlyOriginal.htm
is based on
http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-04/
collected by
http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
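For anyone who wants to check the numbers against those raw files: each
line in the pagecounts files is space delimited (project code, page
title, hourly view count, bytes). A rough sketch of summing one hour for
one project; the file name is just an example, and "sv.s" should be the
code for Swedish Wikisource:

#include <stdio.h>
#include <string.h>

/* Sum the hourly page views for one project from a pagecounts file.
   Assumed line format: "<project> <page_title> <count> <bytes>",
   e.g. "sv.s Huvudsida 12 345678". File name below is an example. */
int main(void)
{
    FILE *f = fopen("pagecounts-20120404-120000", "r");
    if (!f) { perror("fopen"); return 1; }

    char project[64], title[2048];
    long count, bytes, total = 0;
    while (fscanf(f, "%63s %2047s %ld %ld", project, title, &count, &bytes) == 4) {
        if (strcmp(project, "sv.s") == 0)      /* sv.s = Swedish Wikisource */
            total += count;
    }
    fclose(f);

    printf("sv.s page views in this hour: %ld\n", total);
    return 0;
}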

By sheer coincidence, we were discussing filtering bots from the
projectcounts files (at the source: webstatscollector) just last week.
It is not a new discussion, but with current resources it may be feasible
now, whereas it wasn't years ago.

Erik




Re: Page views [ In reply to ]
2012/4/8 Erik Zachte <ezachte@wikimedia.org>

> Hi Lars,
>
> You have a point here, especially for smaller projects:
>
> For Swedish Wikisource:
>
> zcat sampled-1000.log-20120404.gz | grep 'GET http://sv.wikisource.org' |
> awk '{print $9, $11,$14}'
>
> returns 20 lines from this 1:1000 sampled squid log file
> after removing javascript/json/robots.txt there are 13 left,
> which fits perfectly with 10,000 to 13,000 per day
>
> however 9 of these are bots!!
>
>
How many requests in that 1:1000 sampled log were from robots (including all languages)?

--
Emilio J. Rodríguez-Posada. E-mail: emijrp AT gmail DOT com
Pre-doctoral student at the University of Cádiz (Spain)
Projects: AVBOT <http://code.google.com/p/avbot/> |
StatMediaWiki <http://statmediawiki.forja.rediris.es> |
WikiEvidens <http://code.google.com/p/wikievidens/> |
WikiPapers <http://wikipapers.referata.com> |
WikiTeam <http://code.google.com/p/wikiteam/>
Personal website: https://sites.google.com/site/emijrp/
Re: Page views [ In reply to ]
On Mon, Apr 9, 2012 at 00:46, Erik Zachte <ezachte@wikimedia.org> wrote:

> returns 20 lines from this 1:1000 sampled squid log file
> after removing javascript/json/robots.txt there are 13 left,
> which fits perfectly with 10,000 to 13,000 per day
>
> however 9 of these are bots!!
>

Is this the same case for the mobile stats as well? I don't think there could
be sudden 100% growth for 2 months running across wikis[1] without some reason
like this.

[1] http://stats.wikimedia.org/EN_India/TablesPageViewsMonthlyMobile.htm

--
Regards
Srikanth.L
Re: Page views [ In reply to ]
Hi Srikanth,

Yes, we are looking into the growth percentages as they seem
unrealistically high.
Best,
Diederik



Re: Page views [ In reply to ]
For a few weeks now, Googlebot has been crawling the mobile site (on
purpose: they want to know what mobile-friendly content is on the web,
even though it is similar to our main site).

This makes all the difference.

Here is a random example of how our traffic on smaller Wikipedias changed.
http://stats.wikimedia.org/wikimedia/misc/MobileTrafficCA.png
(based on Domas' hourly projectcounts files)

The peak occurred on 8 March at 2 AM, with 46,957 page views in that one
hour.
In the 1:1000 sampled squid log we should therefore find ~47 of those.

"zcat sampled-1000.log-20120308.gz | grep ca.m.wikipedia.org"
yields 339 records for the whole day, most of them from Googlebot.

(Some more numbers on the bot share of total traffic tomorrow.)

Erik Zachte

Re: Page views [ In reply to ]
Here are some numbers on total bot burden:

1)
http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states
for March 2012:

In total 69.5 M page requests (mime type text/html only!) per day are
considered crawler requests, out of 696 M page requests (10.0%) or 469 M
external page requests (14.8%). About half (35.1 M) of crawler requests come
from Google.

2)
Here are counts from one day's log, as a sanity check:

zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P
'/wiki/|index.php' | grep -cP ' - |text/html' => 678325

zcat sampled-1000.log-20120404.gz | awk '{print $9, $11, $14}' | grep -P
'/wiki/|index.php' | grep -P ' - |text/html' | grep -ciP
'bot|crawler|spider' => 68027

68027 / 678325 = 10.0%, which matches really well with the numbers from
SquidReportCrawlers.htm

---

My suggestion for how to filter these bots efficiently in the C program (no
costly, nuanced regexps) before sending data to webstatscollector:

a) Find the 14th field in the space-delimited log line = user agent (but
beware of false delimiters in logs from varnish, if still applicable)
b) Search this field, case-insensitively, for bot/crawler/spider/http (by
convention only bots have a URL in the agent string)

That will filter out most bot pollution (a rough sketch follows below). We
still want those records in the sampled log, though.
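Roughly what I have in mind, as a sketch only (this is not the
webstatscollector code; strtok-based field splitting and a hand-rolled
case-insensitive search keep it in plain C):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Case-insensitive substring search (strcasestr is not standard C). */
static int contains_ci(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    for (; *haystack; haystack++) {
        size_t i;
        for (i = 0; i < nlen; i++)
            if (tolower((unsigned char)haystack[i]) !=
                tolower((unsigned char)needle[i]))
                break;
        if (i == nlen)
            return 1;
    }
    return 0;
}

/* Return 1 if the space-delimited squid log line looks like a bot:
   field 14 (user agent) contains bot/crawler/spider/http. */
static int is_bot_request(const char *line)
{
    static const char *markers[] = { "bot", "crawler", "spider", "http" };
    char buf[8192];
    int field = 0;
    char *agent = NULL;

    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
        if (++field == 14) { agent = tok; break; }
    if (!agent)
        return 0;                              /* malformed line: keep it */

    for (size_t m = 0; m < sizeof markers / sizeof markers[0]; m++)
        if (contains_ci(agent, markers[m]))
            return 1;
    return 0;
}

int main(void)
{
    char line[8192];
    while (fgets(line, sizeof line, stdin))
        if (!is_bot_request(line))
            fputs(line, stdout);               /* pass non-bot lines on */
    return 0;
}

Note that the "http" marker only makes sense on the isolated agent
field; applied to the whole line it would also match the request URL.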

Any thoughts?

Erik Zachte

Re: Page views [ In reply to ]
On 04/11/2012 01:45 AM, Erik Zachte wrote:
> Here are some numbers on total bot burden:
>
> 1)
> http://stats.wikimedia.org/wikimedia/squids/SquidReportCrawlers.htm states
> for March 2012:
>
> In total 69.5 M page requests (mime type text/html only!) per day are
> considered crawler requests, out of 696 M page requests (10.0%) or 469 M
> external page requests (14.8%). About half (35.1 M) of crawler requests come
> from Google.

The fraction will be larger than average (larger than 10%) for
a) sites with many small pages (Wiktionary) and
b) sites in languages with a smaller audience (Swedish sites).
Bots will index these pages as they are found, but each
of these pages can expect fewer search hits and less human
traffic than long articles (Wikipedia) in languages with many
speakers (English). The bot traffic is like a constant
background noise, and the human traffic is the signal on top.
Sites with many small pages and a small audience will have
a lower signal-to-noise ratio. The long tail of
seldom-visited pages drowns in that noise.

I should disclose that I "work for the competition". I tried
to add books to Wikisource, but its complexity slows me down,
so I'm now focusing on my own Scandinavian book-scanning
website, Project Runeberg, http://runeberg.org/

It has 700,000 scanned book pages, the same size as the
English Wikisource, which is a large number of pages for
a small language audience (mostly Swedish). Yesterday,
April 10, its Apache access log had 291,000 hits, of which
116,000 were HTML pages, but 71,000 of those matched
bot/spider/crawler, leaving only 45,000 human page views.
If Swedish Wikisource, which is 1/20 that size, got 10-13
thousand human page views per day, or 1/4 of that traffic,
I'd be surprised. It is more likely that a similar share,
71/116 = 61%, is bot traffic.

(Are we competitors? Not really. We're both liberating
content. Swedish Wikipedia has more external links
to runeberg.org than to any other website.)


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se

Project Runeberg - free Nordic literature - http://runeberg.org/


Re: Page views [ In reply to ]
> My suggestion for how to filter these bots efficiently in c program (no
> costly nuanced regexps) before sending data to webstatscollector:
>
> a) Find 14th field in space delimited log line = user agent (but beware of
> false delimiters in logs from varnish, if still applicable)
> b) Search this field case insensitive for bot/crawler/spider/http (by
> convention only bots have url in agent string)
>
> That will filter out most bot pollution. We still want those records in
> sampled log though.
>
> Any thoughts?
I did some research on fast string matching, and it seems that the
recently developed algorithm by Leonid Volnitsky
is very fast (http://volnitsky.com/project/str_search/index.html). I
will do some benchmarks against the ordinary C strstr function, but
the author claims it is 20x faster.

So instead of hard-coding where the bot information should be, we could
just search the entire log line for the bot markers and, if any are
present, discard the line; otherwise process it as-is.
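As a baseline for that benchmark, something like this could time plain
strstr over whole log lines read from stdin (the marker list is a
placeholder, and the "url in agent" heuristic has to be dropped once the
whole line is searched, since every line already contains the request
URL):

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <time.h>

/* strstr baseline: lowercase each log line once, then scan the whole
   line for bot markers. "http" is omitted on purpose: the request URL
   field would make it match every single line. */
int main(void)
{
    static const char *markers[] = { "bot", "crawler", "spider" };
    char line[8192], lower[8192];
    long total = 0, flagged = 0;

    clock_t start = clock();
    while (fgets(line, sizeof line, stdin)) {
        size_t i;
        for (i = 0; line[i]; i++)
            lower[i] = (char)tolower((unsigned char)line[i]);
        lower[i] = '\0';

        total++;
        for (size_t m = 0; m < sizeof markers / sizeof markers[0]; m++)
            if (strstr(lower, markers[m])) { flagged++; break; }
        /* a line that is not flagged would be processed as-is here */
    }

    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    fprintf(stderr, "%ld lines, %ld flagged as bots, %.2f s CPU\n",
            total, flagged, secs);
    return 0;
}

Swapping the inner strstr loop for the Volnitsky search should then show
how much of the 20x claim holds on our log data.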

Best,
Diederik
