Mailing List Archive

Flattening a wikimedia category
Seems like there is no easy way to display all the media files under a
wikimedia category -- for example if someone wants a picture of a
library, he or she will need to go into each sub-category under
"Libraries":

http://commons.wikimedia.org/wiki/Category:Libraries

While Wikimedia is not yet the most popular stock photo source, IMO
having this flattening functionality would be useful to those who are
looking for stock photos.

Rayson

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Flattening a wikimedia category [ In reply to ]
> While Wikimedia is not yet the most popular stock photo source, IMO
> having this flattening functionality would be useful to those who are
> looking for stock photos.

Just because I love this recurring debate sooo much, I'll drop in two more bits:

* atomic categorization would solve this
* category intersection would be useful (imagine a user searching for
a picture of a library in asia)

open fire!

Re: Flattening a wikimedia category [ In reply to ]
On Wed, Feb 3, 2010 at 10:10 PM, Rayson Ho <raysonlogin@gmail.com> wrote:
> Seems like there is no easy way to display all the media files under a
> wikimedia category -- for example if someone wants a picture of a
> library, he or she will need to go into each sub-category under
> "Libraries":
>
> http://commons.wikimedia.org/wiki/Category:Libraries
>
> While Wikimedia is not yet the most popular stock photo source, IMO
> having this flattening functionality would be useful to those who are
> looking for stock photos.

This is a regular request. There are two major problems:

1) Our database schema is not set up to handle this efficiently for
large result sets. At least I don't think so, off the top of my head.

2) In practice, collapsing categories like this can often lead to
crazy stuff being included, because subcategory relations aren't used
strictly in a "everything in category A is also in category B" sense.
It's easy to come up with examples. For instance:
[[Category:Punishments in religion]] -> [[Category:Religion and
capital punishment]] -> [[Category:People executed for heresy]] ->
[[Category:Joan of Arc]] -> [[English claims to the French throne]].
Thus, if you try to get all articles in [[Category:Punishments in
religion]] or subcategories, you'll get results like [[English claims
to the French throne]].
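A minimal sketch of the second problem, assuming a toy in-memory graph rather than the real database schema. The dictionaries and the `flatten` helper are hypothetical; only the category and page names come from the chain above.

```python
from collections import deque

# Toy data modelled on the chain quoted above; a real wiki has far more links.
SUBCATEGORIES = {
    "Punishments in religion": ["Religion and capital punishment"],
    "Religion and capital punishment": ["People executed for heresy"],
    "People executed for heresy": ["Joan of Arc"],
}
PAGES = {
    "Joan of Arc": ["English claims to the French throne"],
}


def flatten(root):
    """Return every page in root or in any of its subcategories."""
    seen = {root}
    queue = deque([root])
    pages = set()
    while queue:
        cat = queue.popleft()
        pages.update(PAGES.get(cat, []))
        for sub in SUBCATEGORIES.get(cat, []):
            if sub not in seen:  # guards against category cycles
                seen.add(sub)
                queue.append(sub)
    return pages


print(flatten("Punishments in religion"))
```

Each step looks reasonable on its own, yet the flattened result for "Punishments in religion" contains "English claims to the French throne".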

However, this is definitely on the long-term "it would be nice if
someone did this someday" list.

Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 9:27 AM, Aryeh Gregor
<Simetrical+wikilist@gmail.com> wrote:
> 1) Our database schema is not set up to handle this efficiently for
> large result sets.  At least I don't think so, off the top of my head.

I've never been able to come up with an acceptable data-structure for
flattening on the fly.
(I think acceptable is something like O(1) or O(log something) on
insert, delete, and no worse than something like O(results log
something) on query).

But if you do atomic categories explicitly enumerated on the pages
then you get the right properties, and fast search with intersections
is the same problem as full text search. I.e. solved.
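A rough sketch of why atomic categories reduce intersection to the full-text-search problem: each category becomes a "term" with a posting set of pages, and an intersection is a set intersection over those postings. The file names and the helpers `build_index` and `intersect` are made up for illustration; a real search engine would use sorted posting lists rather than Python sets.

```python
from collections import defaultdict


def build_index(page_categories):
    """Map each atomic category to the set of pages that list it explicitly."""
    index = defaultdict(set)
    for page, categories in page_categories.items():
        for category in categories:
            index[category].add(page)
    return index


def intersect(index, *categories):
    """Return the pages that carry every one of the given categories."""
    postings = sorted((index.get(c, set()) for c in categories), key=len)
    if not postings:
        return set()
    result = set(postings[0])  # start from the smallest posting set
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical files, each tagged with atomic categories.
files = {
    "File:Reading_room_Tokyo.jpg": {"Libraries", "Asia"},
    "File:Reading_room_Boston.jpg": {"Libraries", "North America"},
    "File:Temple_Kyoto.jpg": {"Temples", "Asia"},
}
index = build_index(files)
print(intersect(index, "Libraries", "Asia"))
```

Adding or removing a category on one page touches only that page's entries, which is where the cheap insert and delete come from.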

> 2) In practice, collapsing categories like this can often lead to
> crazy stuff being included, because subcategory relations aren't used
> strictly in a "everything in category A is also in category B" sense.

Yea, automatic collapsing is mostly good for hilarious results...
manual collapsing OTOH.

Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 10:02 AM, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> But if you do atomic categories explicitly enumerated on the pages
> then you get the right properties, and fast search with intersections
> is the same problem as full text search. I.e. solved.

Right. Supporting category intersection and search in category with
better UI (we already sort of support it if you know the right magic
terms) is what we should be aiming for here.

Re: Flattening a wikimedia category [ In reply to ]
Aryeh Gregor wrote:
> Right. Supporting category intersection and search in category with
> better UI (we already sort of support it if you know the right magic
> terms) is what we should be aiming for here.
>

Last year, just around this time, we came to exactly the same
conclusion. And just like then, there is no shortage of good
opinions on how to do it, only of people to actually do the programming.

r.


Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 11:03 AM, Robert Stojnic <rainmansr@gmail.com> wrote:
> Last year, just around this time, we came to exactly the same
> conclusion. And just like then, there is no shortage of good
> opinions on how to do it, only of people to actually do the programming.

Yup. Any volunteers? My understanding is that right now, the backend
supports category searches as long as the categories are spelled out
literally in the wikitext (not via template). That's not a big
restriction, so what we could really use right now is UI, which
shouldn't require such specialized skills.

So, does anyone want to:

1) Mock up basic UI for category intersections/search in category?

2) Implement it?

After that we can talk about fancy things like automatically
suggesting categories to intersect with or whatever . . . we don't
even have the most basic UI right now.

Re: Flattening a wikimedia category [ In reply to ]
On 04/02/10 16:03, Robert Stojnic wrote:
> Aryeh Gregor wrote:
>
>> Right. Supporting category intersection and search in category with
>> better UI (we already sort of support it if you know the right magic
>> terms) is what we should be aiming for here.
>>
>>
> Last year, just around this time, we came to exactly the same
> conclusion. And just like then, there is no shortage of good
> opinions on how to do it, only of people to actually do the programming.
>
> r.
>
>
I'm working on it.

-- Neil


Re: Flattening a wikimedia category [ In reply to ]
On 02/04/2010 04:10 PM, Aryeh Gregor wrote:
>
> Yup. Any volunteers? My understanding is that right now, the backend
> supports category searches as long as the categories are spelled out
> literally in the wikitext (not via template).

Presumably it would not be too hard to append the full category list to
the blob that gets sent to the search engine, (perhaps as part of
fixing: https://bugzilla.wikimedia.org/show_bug.cgi?id=18861 -nudge-nudge)

Whether this is a big restriction or not depends a lot on your wiki, I
estimate that 90% or more of categories on en.wiktionary are added by
templates (but then so's most of our output anyway).

Conrad

Re: Flattening a wikimedia category [ In reply to ]
This is putting the cart in front of the ox yet again. A few mails up
Aryeh and Gregory both come to the conclusion that automatic
flattening is useless.
Yet category flattening would be a prerequisite to intersections.
The only way to get proper intersection is manual flattening i.e.
atomic categorization. As long as nobody is pushing commons _hard_ to
change their categorization system _nothing_ will happen and we'll
meet on this list again in about one year repeating the same
discussion.

Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 11:28 AM, Conrad Irwin
<conrad.irwin@googlemail.com> wrote:
> Presumably it would not be too hard to append the full category list to
> the blob that gets sent to the search engine

No, probably not, but it would be even easier to not worry about it
yet (unless someone wants to!).

On Thu, Feb 4, 2010 at 11:37 AM, Daniel Schwen <lists@schwen.de> wrote:
> Yet category flattening would be a prerequisite to intersections.
> The only way to get proper intersection is manual flattening i.e.
> atomic categorization.

Correct. Automatic flattening is not good enough -- manual flattening
is necessary. Maybe if we had a better category intersect feature,
more wikis would do manual flattening. If they don't, I guess they
won't get the feature. Automatic flattening is not a substitute.

Re: Flattening a wikimedia category [ In reply to ]
Robert Stojnic wrote:
> Aryeh Gregor wrote:
>> Right. Supporting category intersection and search in category with
>> better UI (we already sort of support it if you know the right magic
>> terms) is what we should be aiming for here.
>>
>
> Last year, just around this time, we came to exactly the same
> conclusion. And just like then, there is no shortage of good
> opinions on how to do it, only of people to actually do the programming.
>
> r.

Wikimedia Germany has contracted Neil Harris to work on implementing deep
category intersection. The goal is basically a rewrite of my sucky CatScan tool.
The result is hopefully fast & generic enough so it can be used as a service
that integrates with the current search infrastructure.

The project has started, there is funding and a project plan. I expect to see
usable results soon. In fact, I hope to present this at the developer meeting in
April (Neil, contact me about attending) and discuss the integration into Lucene
search.

I agree that full recursive flattening of the current category structure sometimes
leads to bad results (especially on the English Wikipedia; Commons is quite
bad too); a depth of 5, however, is generally useful. One common use case is
intersecting a content category with a maintenance category, for organizing
editorial work in a wiki project. In that case, at least one category comes from
a template.
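As an illustration only (not the implementation being contracted), depth-limited intersection could be sketched as below, assuming the category graph fits in plain dictionaries. The category names, page names and both helper functions are hypothetical; only the default depth of 5 comes from the message.

```python
def categories_within(subcats, root, max_depth):
    """Return root plus all subcategories reachable in at most max_depth steps."""
    reached = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for cat in frontier:
            for sub in subcats.get(cat, []):
                if sub not in reached:
                    reached.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return reached


def deep_intersection(subcats, members, root_a, root_b, max_depth=5):
    """Pages found under both roots, descending at most max_depth levels."""
    def pages_under(root):
        pages = set()
        for cat in categories_within(subcats, root, max_depth):
            pages |= members.get(cat, set())
        return pages

    return pages_under(root_a) & pages_under(root_b)


# Hypothetical content category tree and a maintenance category.
subcats = {
    "Libraries": ["Libraries by country"],
    "Libraries by country": ["Libraries in Japan"],
}
members = {
    "Libraries in Japan": {"Page A", "Page B"},
    "Needs cleanup": {"Page B", "Page C"},
}
print(deep_intersection(subcats, members, "Libraries", "Needs cleanup"))
```

Lowering `max_depth` trades recall for fewer surprising results; with `max_depth=1` the example finds nothing, because the pages sit two levels below "Libraries".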

Atomic categorization aka tagging however also sucks: the tags are either too
generic (so it's hard to find stuff) or too specific (you never know what to
search for). Tags implying/including other tags is very useful, which is exactly
what categories with deep intersection will provide.


-- daniel

Re: Flattening a wikimedia category [ In reply to ]
On 4 February 2010 16:37, Daniel Schwen <lists@schwen.de> wrote:

> The only way to get proper intersection is manual flattening i.e.
> atomic categorization. As long as nobody is pushing commons _hard_ to
> change their categorization system _nothing_ will happen and we'll
> meet on this list again in about one year repeating the same
> discussion.


Commons really wants this. LOTS AND LOTS.

But we need the functionality there first, so we can *then* flatten.


- d.

Re: Flattening a wikimedia category [ In reply to ]
> But we need the functionality there first, so we can *then* flatten.

Ahh, the good old chicken and egg ;-)
I don't let that count. We have plenty of working category
intersection tools already. Their usefulness is limited however
because the category system is so screwed up.
The ball is definitely in the categorization-court!

Re: Flattening a wikimedia category [ In reply to ]
On 4 February 2010 17:38, Daniel Schwen <lists@schwen.de> wrote:

>> But we need the functionality there first, so we can *then* flatten.

> Ahh, the good old chicken and egg ;-)
> I don't let that count. We have plenty of working category
> intersection tools already.


Yes, but they're not part of the interface.

The technology needs to work with the data - the six million files and
their categories, carefully added by hand by humans.

If category intersections worked, the categories could then be broken down to
work better with category intersections.

Demanding that all six million files be de-categorised before you'll
even allow a category intersection tool to *possibly* be deployed is
backward.

People need to be able to go gradually.


- d.

Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 5:02 PM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> Robert Stojnic wrote:
>> Aryeh Gregor wrote:
>>> Right.  Supporting category intersection and search in category with
>>> better UI (we already sort of support it if you know the right magic
>>> terms) is what we should be aiming for here.
>>>
>>
>> Last year, just around this time, we came to exactly the same
>> conclusion. And just like then, there is no shortage of good
>> opinions on how to do it, only of people to actually do the programming.
>>
>> r.
>
> Wikimedia Germany has contracted Neil Harris to work on implementing deep
> category intersection. The goal is basically a rewrite of my sucky CatScan tool.

In the meantime:
http://toolserver.org/~magnus/catscan_rewrite.php

(toolserver seems to have a problem ATM, though...)

Magnus

Re: Flattening a wikimedia category [ In reply to ]
Magnus Manske wrote:
> In the meantime:
> http://toolserver.org/~magnus/catscan_rewrite.php
>
> (toolserver seems to have a problem ATM, though...)

Yes, lots more options than my old thingy, thanks Magnus :) But it is still bound to
recursive calls to the database, which is what I really want to get rid of. The
lookup needs to be snappy.

-- daniel

Re: Flattening a wikimedia category [ In reply to ]
Daniel Kinzler <daniel@brightbyte.de> wrote:

>> In the meantime:
>> http://toolserver.org/~magnus/catscan_rewrite.php

>> (toolserver seems to have a problem ATM, though...)

> Yes, lots more options than my old thingy, thanks magnus :) but still bound to
> recursive calls to the database, which is what i really want to get rid of. the
> lookup needs to be snappy.

Is there any reason not to have a flattened structure somewhere
on the toolserver (or, in the long run, in MediaWiki)?
A quick look at recentchanges for dewp shows about
22000 changes per month, about one every two minutes. With
about 80000 categories in all, it should be feasible to update
the structure incrementally, with daily/weekly/monthly
clean new full "dumps" (or even dispense with up-to-the-second
data and just dump the flat structure hourly).

Tim


Re: Flattening a wikimedia category [ In reply to ]
On Thu, Feb 4, 2010 at 6:40 PM, Tim Landscheidt <tim@tim-landscheidt.de> wrote:
> Is there any reason not to have a flattened structure somewhere
> on the toolserver (or, in the long run, in MediaWiki)?
> A quick look at recentchanges for dewp shows about
> 22000 changes per month, about one every two minutes. With
> about 80000 categories in all, it should be feasible to update
> the structure incrementally, with daily/weekly/monthly
> clean new full "dumps" (or even dispense with up-to-the-second
> data and just dump the flat structure hourly).

Incremental updates for a 'flattened copy' aren't especially
realistic... as one user operation can produce millions of operations
on the server.
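To make the fan-out concrete, here is a small sketch assuming a materialised table of (category, page) rows for the flattened view. All names and sizes are invented for illustration; the point is only that a single new subcategory link adds rows for every ancestor of the parent times every page under the child.

```python
def ancestors(parents, cat):
    """All categories above cat, following parent links upward."""
    found, stack = set(), [cat]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def rows_to_insert(parents, flat_pages, new_parent, new_child):
    """Rows a flattened table gains when new_child is placed under new_parent."""
    targets = ancestors(parents, new_parent) | {new_parent}
    rows = set()
    for cat in targets:
        for page in flat_pages[new_child]:
            if page not in flat_pages.get(cat, set()):
                rows.add((cat, page))
    return rows


# Hypothetical chain A > B > C, plus a category D holding 1000 pages.
parents = {"C": ["B"], "B": ["A"]}
flat_pages = {
    "A": set(),
    "B": set(),
    "C": set(),
    "D": {f"page{i}" for i in range(1000)},
}
# One edit (putting D under C) touches three ancestors times 1000 pages.
print(len(rows_to_insert(parents, flat_pages, "C", "D")))
```

With deep hierarchies and large categories the same multiplication quickly reaches millions of rows for one edit.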

I won't bother saying much more, Daniel Schwen pretty much speaks for my view.

Re: Flattening a wikimedia category [ In reply to ]
Tim Landscheidt wrote:
> Daniel Kinzler <daniel@brightbyte.de> wrote:
>
>>> In the meantime:
>>> http://toolserver.org/~magnus/catscan_rewrite.php
>
>>> (toolserver seems to have a problem ATM, though...)
>
>> Yes, lots more options than my old thingy, thanks magnus :) but still bound to
>> recursive calls to the database, which is what i really want to get rid of. the
>> lookup needs to be snappy.
>
> Is there any reason not to have a flattened structure somewhere
> on the toolserver (or, in the long run, in MediaWiki)?
> A quick look at recentchanges for dewp shows about
> 22000 changes per month, about one every two minutes. With
> about 80000 categories in all, it should be feasible to update
> the structure incrementally, with daily/weekly/monthly
> clean new full "dumps" (or even dispense with up-to-the-second
> data and just dump the flat structure hourly).

Basically: yes, this is the idea, but detecting categorization changes isn't
trivial. Also, really keeping a copy of the flat content of each category would
be redundant in the extreme. It would result in hundreds of millions of entries,
and would be hard to handle. A data structure for fast recursive lookup makes
more sense. Neil is working on this.

As to the general approach: I hope that by providing a way to intersect
categories, we can get rid of most of the "Foo in Bar" cross-section categories.
I still believe hierarchical structuring/inclusion of categories is useful. Or,
to put it differently: let people use "flat tagging", but let's keep the notion
of one tag implying another, e.g. math implying science and Texas implying America.
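The "one tag implying another" idea could be sketched roughly as follows, assuming implications are stored as a simple mapping. The `IMPLIES` table and the `expand` helper are hypothetical; the two example implications are the ones named above.

```python
# Hypothetical implication table: each tag maps to the tags it implies.
IMPLIES = {
    "math": {"science"},
    "texas": {"america"},
}


def expand(tags, implies):
    """Return the given tags plus every tag they imply, directly or indirectly."""
    expanded = set(tags)
    stack = list(tags)
    while stack:
        for implied in implies.get(stack.pop(), ()):
            if implied not in expanded:
                expanded.add(implied)
                stack.append(implied)
    return expanded


print(sorted(expand({"math", "texas"}, IMPLIES)))
```

A page tagged only "math" would then also match a search for "science", without anyone having to add the broader tag by hand.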

-- daniel

Re: Flattening a wikimedia category [ In reply to ]
On Fri, Feb 5, 2010 at 3:57 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> Or,
> to put it differently: let people use "flat tagging", but let's keep the notion
> of one tag implying another, i.e. math implying science and texas implying america.

And as for [[Category:People executed for heresy]] -> [[Category:Joan
of Arc]] -> [[English claims to the French throne]]? That's only two
steps, and it already doesn't make sense. You could argue that
[[Category:Joan of Arc]] really means [[Category:Stuff related to Joan
of Arc]] and shouldn't be in [[Category:People executed for heresy]],
but that sounds like it would take as much recategorization work as
just using atomic categories -- and much subtler.

Re: Flattening a wikimedia category [ In reply to ]
On 5 February 2010 20:17, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
> On Fri, Feb 5, 2010 at 3:57 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
>> Or,
>> to put it differently: let people use "flat tagging", but let's keep the notion
>> of one tag implying another, i.e. math implying science and texas implying america.
>
> And as for [[Category:People executed for heresy]] -> [[Category:Joan
> of Arc]] -> [[English claims to the French throne]]?  That's only two
> steps, and it already doesn't make sense.  You could argue that
> [[Category:Joan of Arc]] really means [[Category:Stuff related to Joan
> of Arc]] and shouldn't be in [[Category:People executed for heresy]],
> but that sounds like it would take as much recategorization work as
> just using atomic categories -- and much subtler.
>


off-topic

all these "->" make me salivate for a good plot graph
(http://www.graphviz.org/?)


--
--
ℱin del ℳensaje.

Re: Flattening a wikimedia category [ In reply to ]
On 6/02/10 6:44 AM, Tei wrote:
> On 5 February 2010 20:17, Aryeh Gregor<Simetrical+wikilist@gmail.com> wrote:
>
>> On Fri, Feb 5, 2010 at 3:57 AM, Daniel Kinzler<daniel@brightbyte.de> wrote:
>>
>>> Or,
>>> to put it differently: let people use "flat tagging", but let's keep the notion
>>> of one tag implying another, i.e. math implying science and texas implying america.
>>>
>> And as for [[Category:People executed for heresy]] -> [[Category:Joan
>> of Arc]] -> [[English claims to the French throne]]? That's only two
>> steps, and it already doesn't make sense. You could argue that
>> [[Category:Joan of Arc]] really means [[Category:Stuff related to Joan
>> of Arc]] and shouldn't be in [[Category:People executed for heresy]],
>> but that sounds like it would take as much recategorization work as
>> just using atomic categories -- and much subtler.
>>
>
> off-topic
>

Not at all, it's entirely reasonable to discuss the problems associated
with the current categorisation system, and what methods we'd like to
use to improve it.

--
Andrew Garrett
agarrett@wikimedia.org
http://werdn.us


Re: Flattening a wikimedia category [ In reply to ]
On 7 February 2010 08:45, Andrew Garrett <agarrett@wikimedia.org> wrote:

> Not at all, it's entirely reasonable to discuss the problems associated
> with the current categorisation system, and what methods we'd like to
> use to improve it.


The current categorization system is per-wiki-specific. It's done
differently in different places. So it's not clear that you won't
require 750 different discussions.

To get back to the topic of category intersections on Commons:

Could the developers please outline, point by point, the precise hoops
we need to jump through to get category intersections on Commons? New
hoops seem to have been introduced during the current discussion.

Please make an unambiguous list of the hoops Commons will be required
to jump through before this feature can happen, so it's actually clear
to all and we're all working from the same page, rather than trying to
guess what shrubbery you'll be demanding next.

Thanks!


- d.

Re: Flattening a wikimedia category [ In reply to ]
On Sun, Feb 7, 2010 at 7:01 AM, David Gerard <dgerard@gmail.com> wrote:
> Could the developers please outline, point by point, the precise hoops
> we need to jump through to get category intersections on Commons? New
> hoops seem to have been introduced during the current discussion.

Right now, I'd try just waiting. As Daniel pointed out in this
thread, Neil Harris is already being paid to work on it.
