Mailing List Archive

[Wikimedia-l] Wikipedia article per speaker
When you get data, at some point you start thinking about quite
fringe comparisons. But those can actually lead to some useful
conclusions, as they did this time [1].

We did the following:
* Used the number of primary speakers from Ethnologue. (Erik Zachte
uses the approximate number of primary + secondary speakers; that could
be useful for correcting this data.)
* Categorized languages by the order of magnitude of their number of
speakers: >=1k, >=10k, >=100k, >=1M, >=10M, >=100M.
* Took the number of articles in the Wikipedia of a particular language
and computed the ratio (number of articles / number of speakers).
* The list consists only of languages with Ethnologue status 1
(national), 2 (provincial) or 3 (wider communication). In fact, we
have a lot of projects (more than 100) with a weaker language status; a
number of those languages are actually threatened or even on the edge
of extinction.
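
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the spreadsheet's actual formulas: the function names `speaker_bucket` and `articles_per_speaker` are hypothetical, and the sample figures are made-up placeholders, not Ethnologue or Wikipedia data.

```python
def speaker_bucket(speakers: int) -> int:
    """Lower bound of the order-of-magnitude category, e.g. 250_000 -> 100_000.

    Everything from 100M speakers up shares the top category (>=100M).
    """
    if speakers < 1:
        raise ValueError("speakers must be a positive integer")
    lower_bound = 10 ** (len(str(speakers)) - 1)  # exact for ints, no float log needed
    return min(lower_bound, 100_000_000)


def articles_per_speaker(articles: int, speakers: int) -> float:
    """Ratio described in the post: number of articles / number of speakers."""
    return articles / speakers


# Placeholder figures for illustration only (NOT real Ethnologue or Wikipedia data).
sample = {
    "Language A": (2_500, 40_000),
    "Language B": (300_000, 12_000_000),
}

for name, (articles, speakers) in sample.items():
    print(name, speaker_bucket(speakers), round(articles_per_speaker(articles, speakers), 5))
```

Using the digit count instead of a floating-point logarithm keeps the category boundaries exact for integer speaker counts.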

These are preliminary results and I will definitely have to go
through all the numbers. I manually fixed some serious errors, like
the English Wikipedia itself missing from the data :D

Putting the languages into the logarithmic categories proved to be
useful, as we are now able to compare Wikipedias according to
their gross capacity (number of speakers). I suppose somebody well
versed in statistics could even create a function that could be used
to check how well a project is doing, regardless of those strict
categories.
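
One possible form of such a function, offered only as an assumption and not as anything taken from the spreadsheet: fit a straight line to log10(articles) against log10(speakers) over all projects, then score each project by how far it sits above or below that line. The names `fit_loglog` and `score` are hypothetical, and any data fed to them here would be placeholders.

```python
import math


def fit_loglog(points):
    """Ordinary least squares for log10(articles) = intercept + slope * log10(speakers).

    `points` is a list of (speakers, articles) pairs, all strictly positive.
    """
    xs = [math.log10(speakers) for speakers, _ in points]
    ys = [math.log10(articles) for _, articles in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def score(speakers, articles, intercept, slope):
    """Residual in log10 units; positive means more articles than the trend predicts."""
    expected = intercept + slope * math.log10(speakers)
    return math.log10(articles) - expected
```

A score of 1.0 would mean roughly ten times as many articles as the fitted trend predicts for a language of that size, with no dependence on category boundaries.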

It's obvious that the more speakers a language has, the harder it is
for the community to keep up the ratio.

So, the winners per category are:
1) >= 1k: Hawaiian, ratio 0.96900
2) >= 10k: Mirandese, ratio 0.18073
3) >= 100k: Basque, ratio 0.38061
4) >= 1M: Swedish, ratio 0.21381
5) >= 10M: Dutch, ratio 0.08305
6) >= 100M: English, ratio 0.01447

However, keep in mind that we removed languages whose status is not
1, 2 or 3. That affected the >=10k languages: for example, Upper
Sorbian does much better than Mirandese (0.67). (I will fix this while
creating the full report. Obviously, in this case the logarithmic
category of the number of speakers is much more important than the
status of the language.)
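
How the status filter changes the winner of a category can be sketched as below. The ratios are the ones quoted in this message; the `winners` function, the category labels and the boolean flag for "status 1, 2 or 3" are hypothetical names chosen for illustration, not part of the spreadsheet.

```python
# (language, category, ratio, has Ethnologue status 1-3)
rows = [
    ("Hawaiian", ">=1k", 0.96900, True),
    ("Mirandese", ">=10k", 0.18073, True),
    ("Upper Sorbian", ">=10k", 0.67, False),  # removed by the status filter
    ("Basque", ">=100k", 0.38061, True),
    ("Swedish", ">=1M", 0.21381, True),
    ("Dutch", ">=10M", 0.08305, True),
    ("English", ">=100M", 0.01447, True),
]


def winners(rows, only_status_1_to_3=True):
    """Return {category: (language, ratio)} with the highest ratio per category."""
    best = {}
    for language, category, ratio, in_status in rows:
        if only_status_1_to_3 and not in_status:
            continue
        if category not in best or ratio > best[category][1]:
            best[category] = (language, ratio)
    return best


print(winners(rows)[">=10k"])
print(winners(rows, only_status_1_to_3=False)[">=10k"])
```

With the filter on, the >=10k winner is Mirandese; with it off, Upper Sorbian takes the category.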

It's obvious that we could draw a line from 1:1 for 1-10k
speakers to 10:1 for >=100M speakers. But, again, I would like to get
input from somebody more competent.

One very important factor is missing here: the level of development
of the speakers. That could be added: GDP (PPP) per capita for the
country or countries where the language is spoken would be a useful
measure. And I suppose somebody with statistical knowledge would be
able to give us a number that would mean "ability to create a
Wikipedia article".
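
For a language spoken in several countries, one conceivable way to get a single development figure is a speaker-weighted average of GDP (PPP) per capita. This is only an assumption about how such a measure might be built, not a tested indicator; the function name `weighted_gdp_ppp` and the numbers in the usage line are hypothetical placeholders.

```python
def weighted_gdp_ppp(entries):
    """Speaker-weighted average of GDP (PPP) per capita.

    `entries` is a list of (speakers_in_country, gdp_ppp_per_capita) pairs.
    """
    total_speakers = sum(speakers for speakers, _ in entries)
    if total_speakers <= 0:
        raise ValueError("total number of speakers must be positive")
    weighted_sum = sum(speakers * gdp for speakers, gdp in entries)
    return weighted_sum / total_speakers


# Placeholder values for illustration only, not real statistics.
print(weighted_gdp_ppp([(3_000_000, 10_000), (1_000_000, 20_000)]))
```

Weighting by speakers keeps a small diaspora in a rich country from dominating the figure for a language mostly spoken elsewhere.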

Completed in this way, it would let us measure the success of
particular Wikimedia groups and organizations. OK, articles per
speaker is not the only way to do so; we could use other parameters
as well: number of new/active/very active editors, etc. And we could
track it over time.

I'll produce some more results. And as a reminder: I'd like to have a
formula to compute "ability to create a Wikipedia article" and then to
derive the "level of a particular community's success in creating
Wikipedia articles". And, of course, to apply the same to editors.

[1] https://docs.google.com/spreadsheets/d/1TYyhETevEJ5MhfRheRn-aGc4cs_6k45Gwk_ic14TXY4/edit?usp=sharing

_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
Re: [Wikimedia-l] Wikipedia article per speaker
Interesting, but you are missing Latin, which is the official language
of a country (even if English Wikipedia says otherwise).

regards

On Mon, Jun 8, 2015 at 12:23 AM, Milos Rancic <millosh@gmail.com> wrote:

--
Ilario Valdelli
Wikimedia CH
Verein zur Förderung Freien Wissens
Association pour l’avancement des connaissances libre
Associazione per il sostegno alla conoscenza libera
Switzerland - 8008 Zürich
Re: [Wikimedia-l] Wikipedia article per speaker
(adding Analytics, as a relevant group for this discussion.)

I think this is next to meaningless, because the differing bot policies and
practices on different wikis skew the data into incoherence.

The (already existing) metric of active-editors-per-million-speakers is, it
seems to me, far more robust. Erik Z.'s stats.wikimedia.org offers that
metric.
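
As a rough illustration of that metric (a sketch only; the exact definition of "active editor" and the speaker counts used on stats.wikimedia.org are not reproduced here, and the function name is hypothetical):

```python
def active_editors_per_million(active_editors: int, speakers: int) -> float:
    """Active editors normalized to one million speakers of the language."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return active_editors * 1_000_000 / speakers


# Placeholder values for illustration only.
print(active_editors_per_million(50, 2_000_000))
```

Unlike article counts, this figure is not inflated by bot-created articles, which is the robustness argument made above.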

A.

On Sun, Jun 7, 2015 at 3:23 PM, Milos Rancic <millosh@gmail.com> wrote:

--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>

Re: [Wikimedia-l] Wikipedia article per speaker
Read the rest :P
On Jun 13, 2015 02:43, "Asaf Bartov" <abartov@wikimedia.org> wrote:
