Mailing List Archive

The never-dying topic: category intersection
(feel free to bash me if we had this variant already, I couldn't find
it in the list archives)

Task: On German Wikipedia (yay atomic categories!), find women who
were born in 1901 and died in 1986.
Runtime : Toolserver, <2 sec
Query:
SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
"Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
tbl1.cnt = 3 ;

Trying to "poison" the query by also looking in all GFDL images
("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
so not that bad.


I've implemented this as a tool now:
http://toolserver.org/~magnus/category_intersection.php

Queries seem to take a little longer there (2-4 sec) compared to the
command line.

Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec.
OTOH, looking for images on Commons in "GFDL" and "Buildings in
Berlin" took ~2min. Might be the giant GFDL category, or the
toolserver, or both. I'll try to fiddle with it some more utilising
cat_pages/cat_files.

Magnus

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: The never-dying topic: category intersection
> OTOH, looking for images on Commons in "GFDL" and "Buildings in
> Berlin" took ~2min. Might be the giant GFDL category, or the
> toolserver, or both. I'll try to fiddle with it some more utilising
> cat_pages/cat_files.


Hah! By using small categories first, then restricting possible
page_ids in the query for the larger categories, I got it down to 3
sec!

Testing "Buildings in Berlin" and "PD Old" (to avoid false timings
from cache) : < 0.6 sec.

This way, adding more intersections with small categories (where
currently "small" is < 20,000 pages) will actually make the query run
faster.
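
A minimal sketch of the "small categories first" idea described above, using Python's sqlite3 in place of the toolserver's MySQL. The function name, the demo data and the `sizes` argument are assumptions for illustration; on a MediaWiki database the sizes could come from cat_pages. The 20,000-page threshold is simplified here to plain ascending order by size.

```python
import sqlite3


def intersect_small_first(conn, sizes):
    """Intersect categories, smallest first.

    `sizes` maps category name -> member count (hypothetical input).
    """
    page_ids = None
    for cat in sorted(sizes, key=sizes.get):
        if page_ids is None:
            # Smallest category: fetch all of its members once.
            rows = conn.execute(
                "SELECT cl_from FROM categorylinks WHERE cl_to = ?", (cat,)
            )
        elif not page_ids:
            break  # already empty; larger categories cannot add anything
        else:
            # Larger category: only probe the surviving candidate ids.
            # Note: SQLite caps the number of bound parameters per query.
            marks = ",".join("?" * len(page_ids))
            rows = conn.execute(
                "SELECT cl_from FROM categorylinks "
                f"WHERE cl_to = ? AND cl_from IN ({marks})",
                (cat, *page_ids),
            )
        page_ids = {row[0] for row in rows}
    return page_ids or set()


def demo_db():
    """Toy data: one large category and one small one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    links = [(i, "GFDL") for i in range(1, 101)]
    links += [(7, "Buildings_in_Berlin"), (200, "Buildings_in_Berlin")]
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", links)
    return conn
```

With this ordering, every extra small category can only shrink the candidate set that the large category is probed with, which matches the observation that more small categories make the query faster.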

I think I'm onto something here. Then again, I thought that before :-)

Magnus

Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 2:01 PM, Magnus Manske
<magnusmanske@googlemail.com> wrote:

> (feel free to bash me if we had this variant already, I couldn't find
> it in the list archives)
>
> Task: On German Wikipedia (yay atomic categories!), find women who
> were born in 1901 and died in 1986.
> Runtime : Toolserver, <2 sec
> Query:
> SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
> page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
> "Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
> tbl1.cnt = 3 ;
>
> Trying to "poison" the query by also looking in all GFDL images
> ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
> so not that bad.
>
>
> I've implemented this as a tool now:
> http://toolserver.org/~magnus/category_intersection.php
>
> Queries seem to take a little longer there (2-4 sec) compared to the
> command line.
>
> Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4
> sec.
> OTOH, looking for images on Commons in "GFDL" and "Buildings in
> Berlin" took ~2min. Might be the giant GFDL category, or the
> toolserver, or both. I'll try to fiddle with it some more utilising
> cat_pages/cat_files.
>
> Magnus
>

Very nice. Danke.

You should mention that categories are entered each at a separate line (or
an example), as it took me some trials to figure it out.

--
--alnokta
Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
<magnusmanske@googlemail.com> wrote:
> (feel free to bash me if we had this variant already, I couldn't find
> it in the list archives)
>
> Task: On German Wikipedia (yay atomic categories!), find women who
> were born in 1901 and died in 1986.
> Runtime : Toolserver, <2 sec
> Query:
> SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
> page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
> "Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
> tbl1.cnt = 3 ;

This will fail with a syntax error on the main servers, because
subqueries aren't supported in MySQL 4.0. You don't really need the
subquery, though; you should be able to just use HAVING:

SELECT page_title FROM page, categorylinks WHERE page_id=cl_from AND
cl_to in ( 'Frau', 'Geboren_1901' , 'Gestorben_1986' ) GROUP BY
cl_from HAVING COUNT(cl_to) = 3;
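
A minimal runnable sketch of the HAVING variant, assuming a toy schema in Python's sqlite3 rather than the real MySQL servers. The function name and sample rows are made up for illustration; COUNT(DISTINCT ...) and ORDER BY are additions here, harmless as long as each (cl_from, cl_to) pair is unique.

```python
import sqlite3


def intersect_by_count(conn, categories):
    """Titles of pages that are members of every category in `categories`."""
    marks = ",".join("?" for _ in categories)
    sql = (
        "SELECT page_title FROM page JOIN categorylinks ON page_id = cl_from "
        f"WHERE cl_to IN ({marks}) "
        "GROUP BY page_id, page_title "
        "HAVING COUNT(DISTINCT cl_to) = ? "
        "ORDER BY page_title"
    )
    # The last parameter is the number of categories every hit must match.
    params = (*categories, len(categories))
    return [row[0] for row in conn.execute(sql, params)]


def demo_db():
    """Toy page/categorylinks tables with three invented pages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)", [(1, "Anna"), (2, "Berta"), (3, "Carl")]
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Frau"), (1, "Geboren_1901"), (1, "Gestorben_1986"),
            (2, "Frau"), (2, "Geboren_1901"),
            (3, "Geboren_1901"), (3, "Gestorben_1986"),
        ],
    )
    return conn
```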

Your solution requires filesorting the union of the categories, as far
as I can tell. I would expect it, offhand, to be significantly slower
than a solution using joins:

SELECT page_title FROM page JOIN categorylinks AS cl1 ON
page_id=cl1.cl_from JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
JOIN categorylinks AS cl3 ON page_id=cl3.cl_from WHERE
cl1.cl_to='Frau' AND cl2.cl_to='Geboren_1901' AND cl3.cl_to =
'Gestorben_1986';

But I haven't benchmarked it, and who knows what kind of execution
quirks are happening here.
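
For comparison, a sketch that builds the self-join for any number of categories, again on a toy sqlite3 schema instead of MySQL. The helper names and sample rows are assumptions; the ORDER BY is added only to make the output deterministic. It says nothing about relative speed, which would need benchmarking on real data.

```python
import sqlite3


def build_join_query(n):
    """Build an n-way self-join on categorylinks with ? placeholders."""
    if n < 1:
        raise ValueError("need at least one category")
    joins = " ".join(
        f"JOIN categorylinks AS cl{i} ON page_id = cl{i}.cl_from"
        for i in range(1, n + 1)
    )
    where = " AND ".join(f"cl{i}.cl_to = ?" for i in range(1, n + 1))
    return f"SELECT page_title FROM page {joins} WHERE {where} ORDER BY page_title"


def intersect_by_join(conn, categories):
    """Run the generated join, one alias per requested category."""
    sql = build_join_query(len(categories))
    return [row[0] for row in conn.execute(sql, tuple(categories))]


def demo_db():
    """Toy page/categorylinks tables with two invented pages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        """
    )
    conn.executemany("INSERT INTO page VALUES (?, ?)", [(1, "Anna"), (2, "Berta")])
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Frau"), (1, "Geboren_1901"), (1, "Gestorben_1986"),
            (2, "Frau"), (2, "Geboren_1901"),
        ],
    )
    return conn
```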

> Trying to "poison" the query by also looking in all GFDL images
> ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
> so not that bad.

3 seconds is a very long time for a query to run. Typical queries
should take more like, say, 10 ms. Occasional selects taking three
seconds might or might not kill the servers, but they're far from
optimal. Also, did you try in a really worst-case scenario, like
intersecting "Unprintworthy redirects" with "Stub-Class biography
articles" on enwiki? Obviously users aren't likely to legitimately
run an intersection of those exact categories (since they're logically
disjoint), but you should test this kind of thing to ensure
scalability. The query appears to take 16s on your tool.

Again, the only really scalable solution looks to be fulltext search
of some kind. We've known for a long time that category intersections
can easily be done well enough, for a modest standard of "well
enough", but that hasn't been considered good enough to run on
Wikipedia.

Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 2:57 PM, Aryeh Gregor
<Simetrical+wikilist@gmail.com> wrote:
> On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
> <magnusmanske@googlemail.com> wrote:
>> (feel free to bash me if we had this variant already, I couldn't find
>> it in the list archives)
>>
>> Task: On German Wikipedia (yay atomic categories!), find women who
>> were born in 1901 and died in 1986.
>> Runtime : Toolserver, <2 sec
>> Query:
>> SELECT * FROM ( SELECT page_title,count(cl_to) AS cnt FROM
>> page,categorylinks WHERE page_id=cl_from AND cl_to in ( "Frau" ,
>> "Geboren_1901" , "Gestorben_1986" ) GROUP BY cl_from ) AS tbl1 WHERE
>> tbl1.cnt = 3 ;
>
> This will fail with a syntax error on the main servers, because
> subqueries aren't supported in MySQL 4.0. You don't really need the
> subquery, though; you should be able to just use HAVING:
>
> SELECT page_title FROM page, categorylinks WHERE page_id=cl_from AND
> cl_to in ( 'Frau', 'Geboren_1901' , 'Gestorben_1986' ) GROUP BY
> cl_from HAVING COUNT(cl_to) = 3;

You're right. Fixed in the tool.

> Your solution requires filesorting the union of the categories, as far
> as I can tell. I would expect it, offhand, to be significantly slower
> than a solution using joins:
>
> SELECT page_title FROM page JOIN categorylinks AS cl1 ON
> page_id=cl1.cl_from JOIN categorylinks AS cl2 ON page_id=cl2.cl_from
> JOIN categorylinks AS cl3 ON page_id=cl3.cl_from WHERE
> cl1.cl_to='Frau' AND cl2.cl_to='Geboren_1901' AND cl3.cl_to =
> 'Gestorben_1986';
>
> But I haven't benchmarked it, and who knows what kind of execution
> quirks are happening here.

It seems the JOIN query is significantly faster when all categories are large.

However, with one or more small categories, I can do a pre-selection
of pages (get page_ids for the intersection of the small categories,
then look only for these in the larger ones), which in turn is
significantly faster than the JOIN.
My tool now uses the algorithm appropriate for the respective query.

>> Trying to "poison" the query by also looking in all GFDL images
>> ("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
>> so not that bad.
>
> 3 seconds is a very long time for a query to run. Typical queries
> should take more like, say, 10 ms. Occasional selects taking three
> seconds might or might not kill the servers, but they're far from
> optimal.

I am uncertain how much the toolserver factors in here. The poor thing
is under a lot of stress ;-)

> Also, did you try in a really worst-case scenario, like
> intersecting "Unprintworthy redirects" with "Stub-Class biography
> articles" on enwiki? Obviously users aren't likely to legitimately
> run an intersection of those exact categories (since they're logically
> disjoint), but you should test this kind of thing to ensure
> scalability. The query appears to take 16s on your tool.

I ran it again now, and it falls back to the JOIN solution, taking ~10
sec. As a worst-case scenario, I call that acceptable for the tool.

It might not be acceptable for Wikipedia ATM. We could experiment how
this performs on the "real" servers, though.

Also, we could restrict certain queries. We know the category size,
and in my approach, we know how many articles are in the "small
category" intersection. From there, we could guesstimate the
worst-case time, and kill the query, or run it in MySQL slow mode
(forgot the correct name) to not stress the servers too much.
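
A rough sketch of how such a restriction could look. Everything here is hypothetical: the function name, the thresholds, and the cost model (worst case taken as the sum of the category sizes). It only illustrates choosing between pre-selection, the JOIN, and refusing the query based on known category sizes.

```python
def plan_query(sizes, small_limit=20_000, max_rows=1_000_000):
    """Pick a strategy from category sizes alone.

    sizes       -- dict of category name -> member count
    small_limit -- below this a category counts as "small" (assumed value)
    max_rows    -- hypothetical budget for rows scanned in the worst case
    """
    if not sizes:
        raise ValueError("no categories given")
    if min(sizes.values()) < small_limit:
        # At least one small category: pre-select page ids from it.
        return "preselect"
    if sum(sizes.values()) <= max_rows:
        # All categories are large, but the worst case is within budget.
        return "join"
    # Too expensive: refuse, or hand over to a low-priority queue.
    return "reject"
```

For example, `plan_query({"A": 500, "B": 900_000})` picks pre-selection because of the small category A, while two categories of 900,000 pages each would be rejected under the default budget.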

> Again, the only really scalable solution looks to be fulltext search
> of some kind. We've known for a long time that category intersections
> can easily be done well enough, for a modest standard of "well
> enough", but that hasn't been considered good enough to run on
> Wikipedia.

No matter what method, I think the problem should get high priority. I
currently see a case on Commons, where there's now "Category:Paintings
by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
(or is already there).

Magnus

P.S.: Just got a message on my Commons talk page about the
"+incategory:" search function.
* This is based on the Lucene index, right? How often is that updated?
* Is there a decent interface/special page for that? It's a pain to
enter this manually, and I doubt many people know about it
* Is there a machine-readable interface for this? One that will return
5K hits without screenscraping?

Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 11:40 AM, Magnus Manske
<magnusmanske@googlemail.com> wrote:
> I am uncertain how much the toolserver factors in here. The poor thing
> is under a lot of stress ;-)

The query has to scan all of the categorylinks rows for all of the
categories you specify, at least in the worst case. That could be a
few hundred thousand rows, maybe a million or more if you combine
several very large categories. That will take a few seconds even on
the real servers, probably (from experience with SELECT COUNT(*) FROM
categorylinks WHERE cl_to='Foo' in Special:Category).

> I ran it again now, and it falls back to the JOIN solution, taking ~10
> sec. As a worst-case scenario, I call that acceptable for the tool.
>
> It might not be acceptable for Wikipedia ATM. We could experiment how
> this performs on the "real" servers, though.

It might be acceptable if it's not run too often, it just wouldn't be
ideal. We're not talking about running such queries on every page
view, I assume, so it shouldn't be the end of the world. It would be
good to get a more efficient way, but the important thing is for
someone to actually get something in the core software, period, IMO.
We have any number of toolserver tools to do this, probably at least
five, but that's not going to get us progress.

> Also, we could restrict certain queries. We know the category size,
> and in my approach, we know how many articles are in the "small
> category" intersection. From there, we could guesstimate the
> worst-case time, and kill the query, or run it in MySQL slow mode
> (forgot the correct name) to not stress the servers too much.

Read uncommitted?

> No matter what method, I think the problem should get high priority. I
> currently see a case on Commons, where there's now "Category:Paintings
> by Vincent van Gogh in this-and-that-museum". It's getting ridiculous
> (or is already there).

Lots of things should get high priority and don't. Look at how
sorting on category pages is completely broken for a lot of languages,
for instance, due to sorting in code point order. Someone with commit
access has to spend the time to fix it, is all.

> P.S.: Just got a message on my Commons talk page about the
> "+incategory:" search function.

That doesn't include transcluded categories, so it's not a proper
solution. I don't know much about it. Lucene would likely be a good
choice for a "real" solution, but we'd need to make up a separate
table for it, not just use the page text table.

Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 5:40 PM, Magnus Manske
<magnusmanske@googlemail.com> wrote:

> * Is there a machine-readable interface for this? One that will return
> 5K hits without screenscraping?
>
api.php?list=search?
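
A small sketch of what such a request URL could look like. The helper name is made up, the quoted incategory:"..." form is an assumption about the search syntax, and paging through thousands of hits is not shown. It only builds the URL and makes no network call.

```python
from urllib.parse import urlencode


def incategory_search_url(api_base, categories, limit=50):
    """Build an api.php URL that searches for pages in all given categories."""
    # One incategory: term per category; terms are combined by the search backend.
    query = " ".join(f'incategory:"{cat}"' for cat in categories)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


if __name__ == "__main__":
    print(
        incategory_search_url(
            "https://commons.wikimedia.org/w/api.php",
            ["GFDL", "Buildings in Berlin"],
        )
    )
```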

Re: The never-dying topic: category intersection
> P.S.: Just got a message on my Commons talk page about the
> "+incategory:" search function.
> * This is based on the Lucene index, right? How often is that updated?
>
It is updated daily. As already pointed out, it doesn't do transcluded
categories, but just looks at Category: links within raw article wikitext.

> * Is there a decent interface/special page for that? It's a pain to
> enter this manually, and I doubt many people know about it
>
Nope, no interface. I've pretty much made it just because it was easy to
do, and doesn't really take up any significant space in the index. If
one dared to make a category intersection frontend it could possibly be
useful for testing.

However, as discussed before, making an efficient and
easily-integrable-into-WMF-type-setup backend is not exactly
straightforward.

Cheers, Robert



Re: The never-dying topic: category intersection
On Tue, Dec 2, 2008 at 7:01 AM, Magnus Manske
<magnusmanske@googlemail.com> wrote:
[snip]
> Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec.
> OTOH, looking for images on Commons in "GFDL" and "Buildings in
> Berlin" took ~2min. Might be the giant GFDL category, or the
> toolserver, or both. I'll try to fiddle with it some more utilising
> cat_pages/cat_files.

No. Bleh. The horrible slowness in your results is a result of broken
methodology. (2 seconds is unacceptably slow by a factor of 10x, as
far as I'm concerned)

Please see: https://lists.wikimedia.org/mailman/htdig/wikitech-l/2006-September/026715.html

If you go around blaming big categories I will be forced to hunt you down
and kill you. The constant mindset of "big categories = slow" results
in people building pre-made intersections to reduce category sizes
rather than using atomic categories. We can make big categories
blindingly fast, but we simply cannot make the recursion needed to get
sensible outcomes on pre-made intersections fast.

I had a tool on toolserver that gave HTML and JSON interfaces for
doing queries against your choice of enwp or commons, ... the worst
case results I could get out of it were on the order of ~30ms when
using up to 10 categories. I didn't bother to maintain it because I
mostly got complaints that it was not useful because it didn't find
most things because it couldn't walk the category tree.

Re: The never-dying topic: category intersection
On Wed, 03 Dec 2008 17:05:39 +0100, Roan Kattouw <roan.kattouw@home.nl>
wrote:
>
>
> Daniel Schwen schreef:
> > So how does this take care of deep indexing non-atomic categories?
> >
> Err.. what? Please explain what you mean by that.


I think he means finding stuff that's already buried in sub-sub categories,
when you query on a parent category. Like querying for an intersection
of [[Category:Deceased people]] and [[Category:Presidents of the
United States]] won't find the guys listed in [[Category:Deceased
Presidents of the United States]] without recategorizing those entries.
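
To make the problem concrete, a toy sketch with in-memory dictionaries (the data layout, page names and function name are invented for illustration): a flat intersection of the two parent categories finds nothing, while one that first walks the subcategory tree does.

```python
from collections import deque

# Hypothetical toy data: category -> direct subcategories / direct member pages.
SUBCATS = {
    "Deceased people": ["Deceased Presidents of the United States"],
    "Presidents of the United States": ["Deceased Presidents of the United States"],
}
MEMBERS = {
    "Deceased people": {"Page B"},
    "Presidents of the United States": {"Page C"},
    "Deceased Presidents of the United States": {"Page A"},
}


def members_deep(category, subcats, members):
    """Collect pages in `category` and all of its descendants (breadth-first)."""
    seen = {category}
    queue = deque([category])
    pages = set()
    while queue:
        cat = queue.popleft()
        pages |= members.get(cat, set())
        for sub in subcats.get(cat, ()):
            if sub not in seen:  # guards against category loops
                seen.add(sub)
                queue.append(sub)
    return pages


def flat_intersection(a, b):
    """Intersection of direct members only."""
    return MEMBERS.get(a, set()) & MEMBERS.get(b, set())


def deep_intersection(a, b):
    """Intersection after walking both category trees."""
    return members_deep(a, SUBCATS, MEMBERS) & members_deep(b, SUBCATS, MEMBERS)
```

The tree walk is the part that does not come for free: it has to be repeated, or cached, for every parent category in the query.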


>
> > =>How will this extension be even remotely useful for let's say commons?
> >
> Without addressing Commons in particular, having an efficient way to get
> pages in the intersection of multiple categories would allow wikis to
> delete a category such as [[Category:Deceased Presidents of the United
> States]] and replace it by, say, [[Intersection:Deceased Presidents of
> the United States]], which would list all articles in
> [[Category:Deceased people]] and [[Category:Presidents of the United
> States]]. My extension alone doesn't make that possible, but it makes
> implementing such a feature considerably easier.
> > This discussion is far from over. The basic problems are _not_ solved.
> >
> Would you care to elaborate on what those unsolved problems are?


I thought we were 90% of the way there when you wrote this extension, having
reasonably solved the efficiency (speed) issues with the fulltext and lucene
based approaches, and the view of the atomic categories problem was that it
would be solved by people, not tech. In other words, I thought we all
assumed that once people were empowered with category intersections, they'd
make categories that make use of them. If not, then that's a problem to
solve, but not an obstacle to implementing category intersection. My input
would be to implement intersections, see what happens, and look at other
functionality for intersections v.2.


>
> > I'm sure this thread will die out soon.
> > Half of the participants will again be soothed by the promise of some
> > easy solution just barely beyond the horizon, while the half that
> > realizes that said solution _cannot possibly work_ without a radical
> > reform of the category system will again be too annoyed (I'm getting
> > there already) to continue discussing.
> It would be nice if you didn't judge people as naive right away.
>

Seconded.

But it sounds like maybe those of us who'd like to see this happen should
discuss a UI (or several) for it. I was thinking the most intuitive
interface was a sort of "browse" type function, where for any given group
of categories (could just be one category), you have two result sets:
related categories (other categories of pages in the starting category),
and articles at the intersection of the group. The articles are what we
generally think of, but the related categories gives us an intuitive way to
navigate through category intersections.

The articles in the group of categories are the problem we've already solved
(mostly): they are the result from the fulltext or lucene search. The
related categories problem is harder, I think, as the most obvious way to
get to that is to get all the categories belonging to those articles, and
then collapse them and rank them. For large result sets, this can get time
consuming again, and we would not want to (I think) build the related
categories only with the first page of results. OTOH... if we took the
first 100 results of a given category intersection, then queried the
categorylinks table for all the categories belonging to those articles, and
collapsed that... that would be a pretty good estimate at related
categories. It wouldn't give all of them, but it would be a nice set of
sample data.
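
A sketch of that estimate in plain Python, assuming the result list and a page-to-categories mapping are already in hand (both are hypothetical inputs, as are the function and parameter names): take the first 100 hits, collect their categories, drop the ones already in the query, and rank by frequency.

```python
from collections import Counter


def related_categories(result_pages, page_categories, query_categories,
                       sample_size=100, top=10):
    """Estimate related categories from the first `sample_size` results."""
    skip = set(query_categories)
    counts = Counter()
    for page in result_pages[:sample_size]:
        # Count each category once per page, ignoring the ones searched for.
        counts.update(set(page_categories.get(page, ())) - skip)
    return counts.most_common(top)
```

Because only a sample is inspected, rare categories further down the result list will be missed, which is the trade-off described above.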

What do you think?

Onto a soap box for a minute: the fact that this topic won't die, in 4
years, to me means that it's a really needed feature. Once implemented it
will give people a great tool to more efficiently find information. Looking
at things that are happening around the web with tags, Google adopting ideas
from Wikia search, semantic web stuff, I'm thinking that we are really at
the beginning of a movement to add structured metadata to information on the
net. In concert with all the wonderful algorithms that try to guess what a
given web page is about, we are doing things to explicitly state what a web
page is about, providing users a much better chance at being able to find
it. Developing category intersections for Wikipedia would be a milestone in
that movement.

Aerik


--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!
Re: The never-dying topic: category intersection
On Wed, Dec 3, 2008 at 12:37 PM, Aerik Sylvan <aerik@thesylvans.com> wrote:
[snip]
> But it sounds like maybe those of us who'd like to see this happen should
> discuss a UI (or several) for it. I was thinking the most intuitive
> interface was a sort of "browse" type function, where for any given group
> of categories (could just be one category), you have two result sets:
> related categories (other categories of pages in the starting category),
> and articles at the intersection of the group. The articles are what we
> generally think of, but the related categories gives us an intuitive way to
> navigate through category intersections.
>
> The articles in the group of categories are the problem we've already solved
> (mostly): they are the result from the fulltext or lucene search. The
> related categories problem is harder,
[snip]

So an interface I had that was really pleasing was that I asked the
database to find a random subset of the results, which it could do
quickly, (or I used the whole results if the initial query contained
them) and I found the set of categories which maximally bisected the
result and presented the list with a set of +/- buttons.

I.e. you search for Animal and you'd get:
Mammal[+/-] Reptile[+/-] Kittens[+/-] Taken with Canon Camera[+/-] Human[+/-]

based on how close to 50% of the results have the suggested category.

It's not exactly a 'related category', but I thought it was very useful.
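
A minimal sketch of the ranking step only, not a reconstruction of the original tool: given a (possibly sampled) result set and each page's categories, sort candidate categories by how close their share of the results is to 50%. Names and the tie-break by category name are assumptions.

```python
from collections import Counter


def bisecting_categories(result_pages, page_categories, top=5):
    """Rank categories by how evenly they split the result set."""
    total = len(result_pages)
    if total == 0:
        return []
    counts = Counter(
        cat for page in result_pages for cat in set(page_categories.get(page, ()))
    )
    # Distance of each category's share from one half; ties broken by name.
    ranked = sorted(counts, key=lambda cat: (abs(counts[cat] / total - 0.5), cat))
    return ranked[:top]
```

A category present on every result lands at the end of the list, since adding or removing it would not narrow anything down.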

I also did a fuzzy text matching search on the category names using a
trigram index, so it was always sure to suggest Category:Cats when you
searched for Cat, or whatever. (I did this with an ajaxy
search-while-you-type, it was handy)
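
A small illustration of trigram matching on category names. This is a plain in-memory sketch, not the index used in the tool described above; the padding scheme and the Jaccard similarity are assumptions chosen for simplicity.

```python
def trigrams(text):
    """Set of 3-character windows over the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(query, names, top=3):
    """Category names ordered by descending similarity to `query`."""
    return sorted(names, key=lambda name: (-similarity(query, name), name))[:top]
```

For instance, `suggest("Cat", ["Dogs", "Catalonia", "Cats"])` puts "Cats" first, since it shares three of its five trigrams with the query.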

Re: The never-dying topic: category intersection
Gregory Maxwell wrote:
>
> So an interface I had that was really pleasing was that I asked the
> database to find a random subset of the results, which it could do
> quickly, (or I used the whole results if the initial query contained
> them) and I found the set of categories which maximally bisected the
> result and presented the list with a set of +/- buttons.
>
> I.e. you search for Animal and you'd get:
> Mammal[+/-] Reptile[+/-] Kittens[+/-] Taken with Canon Camera[+/-] Human[+/-]
>
> based on how close to 50% of the results have the suggested category.
>
> It's not exactly a 'related category', but I thought it was very useful.

Wow! And this was at some point live, directly on the Commons category
pages?!

Has the whole thing been scrapped since, or is there some way to still
try it out, e.g. by installing some custom JavaScript?

--
Ilmari Karonen

Re: The never-dying topic: category intersection
Aerik Sylvan wrote:
>
> But it sounds like maybe those of us who'd like to see this happen should
> discuss a UI (or several) for it. I was thinking the most intuitive
> interface was a sort of "browse" type function, where for any given group
> of categories (could just be one category), you have two result sets:
> related categories (other categories of pages in the starting category),
> and articles at the intersection of the group. The articles are what we
> generally think of, but the related categories gives us an intuitive way to
> navigate through category intersections.

Another useful feature, which would probably make the system much more
likely to be adopted in practice, would be an easy interface to get from
articles (or images, etc.) to various relevant intersections.

For example, if I'm looking at an image which is in the categories
"Maple", "Leaves" and "Green", I should be able to easily get to pages
where I can browse other pictures of either maple leaves or green
leaves, not to mention other pictures of green maple leaves.

A _minimal_ solution would be simply to present a link to the
intersection of _all_ the categories (which might well have only one
page on it) and let the user broaden the intersection from there. Even
better if this can be done in an AJAXish way directly on the image page
itself, though obviously some fallback interface would still be needed
for users without JavaScript.
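
As a sketch of which links such an interface could offer, the following lists every intersection of two or more of a page's categories, starting with the narrowest (all categories). The function name is hypothetical; note that the number of combinations grows exponentially with the number of categories, so a real interface would need to cap it.

```python
from itertools import combinations


def intersections_for(categories, min_size=2):
    """All category combinations of at least `min_size`, narrowest first."""
    cats = sorted(categories)
    result = []
    for size in range(len(cats), min_size - 1, -1):
        result.extend(combinations(cats, size))
    return result
```

For the example page in "Maple", "Leaves" and "Green", this yields the full three-way intersection followed by the three pairs.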

--
Ilmari Karonen

Re: The never-dying topic: category intersection
On Thu, Dec 4, 2008 at 7:39 AM, Ilmari Karonen <nospam@vyznev.net> wrote:
> A _minimal_ solution would be simply to present a link to the
> intersection of _all_ the categories (which might well have only one
> page on it) and let the user broaden the intersection from there. Even
> better if this can be done in an AJAXish way directly on the image page
> itself, though obviously some fallback interface would still be needed
> for users without JavaScript.

As for the JavaScript, add
importScript('User:Magnus_Manske/category_intersection.js');
to your monobook.js

Currently, this links to my tool on toolserver. It could support other
tools as well. If you like it, someone make a gadget from it ;-)

Cheers,
Magnus

Re: The never-dying topic: category intersection
Magnus - I checked out your tool, but it looks like you're using a query
against the categorylinks table? Have you played with setting up a new
table for categories and fulltext indexing it? Use group_concat to get all
of a page's categories into one field, then create a fulltext index on that
field. You get much better performance than using the categorylinks table
(kind of weird, eh?)
Are you pinging a live database, or a copy made from a dump? (please excuse
my ignorance if this is common knowledge)
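
A sketch of the denormalisation step only, using sqlite3 as a stand-in. The table and column names (categorysearch, cs_page, cs_categories) and the sample rows are assumptions. SQLite's group_concat takes the separator as a second argument, whereas MySQL uses a SEPARATOR clause, and the MySQL FULLTEXT index itself is not reproduced: the lookup below simply splits the field in Python.

```python
import sqlite3


def build_categorysearch(conn):
    """Collapse categorylinks into one row per page with all its categories."""
    conn.execute("DROP TABLE IF EXISTS categorysearch")
    conn.execute(
        "CREATE TABLE categorysearch AS "
        "SELECT cl_from AS cs_page, group_concat(cl_to, ' ') AS cs_categories "
        "FROM categorylinks GROUP BY cl_from"
    )


def pages_with_all(conn, categories):
    """Page ids whose category field contains every requested category.

    Stand-in for a fulltext match; names use underscores, so splitting on
    spaces recovers the individual categories.
    """
    wanted = set(categories)
    rows = conn.execute(
        "SELECT cs_page, cs_categories FROM categorysearch ORDER BY cs_page"
    )
    return [page for page, cats in rows if wanted <= set(cats.split(" "))]


def demo_db():
    """Toy categorylinks table with two invented pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Frau"), (1, "Geboren_1901"), (1, "Gestorben_1986"),
            (2, "Frau"), (2, "Geboren_1901"),
        ],
    )
    build_categorysearch(conn)
    return conn
```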

I'm working on dummying up a UI using the same approach (fulltext index of
categories) on wikidweb and will write back when I've got something worth
looking at...

Best Regards,
Aerik

Re: The never-dying topic: category intersection
On Wed, Dec 3, 2008 at 12:37 PM, Aerik Sylvan <aerik@thesylvans.com> wrote:
> But it sounds like maybe those of us who'd like to see this happen should
> discuss a UI (or several) for it.

No, someone should *write* a UI. It should be written and added to
the software. If it's subpar, fine, it can be improved later. Better
that a mediocre UI should be written and committed now than that yet
another category intersection discussion should die away as they
always do.

On Thu, Dec 4, 2008 at 11:43 AM, Aerik Sylvan <aerik@thesylvans.com> wrote:
> Are you pinging a live database, or a copy made from a dump? (please excuse
> my ignorance if this is common knowledge)

It's a toolserver tool, so he's most likely using the toolserver
database. This is a read-only copy of the real database, replicated
in real time and used for toolserver tools only (so if someone runs a
query that causes it to lag by two hours, it won't affect the real
site).

Re: The never-dying topic: category intersection
Aryeh Gregor wrote:
> On Wed, Dec 3, 2008 at 12:37 PM, Aerik Sylvan <aerik@thesylvans.com> wrote:
>> But it sounds like maybe those of us who'd like to see this happen should
>> discuss a UI (or several) for it.
>
> No, someone should *write* a UI. It should be written and added to
> the software. If it's subpar, fine, it can be improved later. Better
> that a mediocre UI should be written and committed now than that yet
> another category intersection discussion should die away as they
> always do.

I'm Brion Vibber, and I approve this message.

(Note that we can be open to alternative, more efficient backends such
as the Postgres system Greg's experimented with, or a Lucene backend, or
whatever, but to be something people can actively develop and test with
we need to at least have _something_ that works on MySQL, in the core
software, available by default.)

-- brion

Re: The never-dying topic: category intersection
On Fri, 05 Dec 2008 13:39:15 -0800, Brion Vibber <brion@wikimedia.org>
wrote:

>
> Aryeh Gregor wrote:
> > No, someone should *write* a UI. It should be written and added to
> > the software. If it's subpar, fine, it can be improved later. Better
> > that a mediocre UI should be written and committed now than that yet
> > another category intersection discussion should die away as they
> > always do.
>
> I'm Brion Vibber, and I approve this message.
>
> (Note that we can be open to alternative, more efficient backends such
> as the Postgres system Greg's experimented with, or a Lucene backend, or
> whatever, but to be something people can actively develop and test with
> we need to at least have _something_ that works on MySQL, in the core
> software, available by default.)
>
>
Okay, that's a green light if I ever saw one, awesome. So let's
create a "categorysearch" MyISAM table, stick all the categories in it,
set up hooks to maintain it, and implement the "fulltext index solution".
We'll use a special page to show the results (?). I'm working on an
interface that would primarily depend on two links at the bottom of each
article, "find similar articles" and "find related categories". These
bring up articles having the same categories, and a list of top
categories belonging to those categories, respectively.
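
One possible reading of the two links, sketched in plain Python with
made-up sample data. The function names, the "rank by number of shared
categories" rule, and the choice to leave out the page's own categories
from the related list are all assumptions for illustration, not the
interface as built.

```python
from collections import Counter

# Hypothetical sample data: page title -> set of category names.
PAGES = {
    "Anna": {"Frau", "Geboren_1901", "Gestorben_1986"},
    "Berta": {"Frau", "Geboren_1901"},
    "Carl": {"Mann", "Geboren_1901", "Gestorben_1986"},
    "Dora": {"Frau", "Physiker"},
    "Emil": {"Mann", "Maler"},
}


def similar_articles(title, pages=PAGES):
    """Other pages sharing at least one category, most shared categories first."""
    own = pages[title]
    scored = [(len(own & cats), other)
              for other, cats in pages.items() if other != title]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [other for shared, other in ranked if shared]


def related_categories(title, pages=PAGES, limit=3):
    """Most frequent categories among the similar pages, minus the page's own."""
    own = pages[title]
    counts = Counter(cat
                     for other in similar_articles(title, pages)
                     for cat in pages[other] - own)
    ranked = sorted(counts.items(), key=lambda c: (-c[1], c[0]))
    return [cat for cat, _ in ranked[:limit]]


print(similar_articles("Anna"))    # ['Berta', 'Carl', 'Dora']
print(related_categories("Anna"))  # ['Mann', 'Physiker']
```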

Sound good?

Aerik

--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!

Re: The never-dying topic: category intersection
On Sat, Dec 6, 2008 at 4:49 PM, <aerik@thesylvans.com> wrote:
> Okay, that's a green light if I ever saw one, awesome. So let's
> create a "categorysearch" MyISAM table, stick all
> the categories in it, set up hooks to maintain it, and implement the
> "fulltext index solution". We'll use a special page to show the
> results (?). I'm working on an interface that primarily would depend
> on two links at the bottom of each article, "find similar articles"
> and "find related categories" - these
> bring up articles having the same categories, and a list of top
> categories belonging to those categories, respectively.
>
> Sound good?

Sounds vastly better than the absolutely nothing we presently have.

Re: The never-dying topic: category intersection
Okay, it's not quite done, and it's still really crude, but it's starting to
take shape - I've got some basic intersection functionality running on
Wikidweb. I hacked skin.php and added links to the special intersections
page. The intersections use a MyISAM fulltext index. I'm not using 'boolean
mode' queries, as non-boolean matching seems to give more interesting results.
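
The difference between the two styles of matching can be sketched in plain
Python. This is an illustration only: the sample pages are made up, and the
inverse-document-frequency score below is a simplified stand-in for
relevance ranking, not MySQL's actual formula.

```python
import math

# Hypothetical sample data: page title -> set of category names.
PAGES = {
    "Anna": {"Frau", "Geboren_1901", "Gestorben_1986"},
    "Berta": {"Frau", "Geboren_1901"},
    "Carl": {"Mann", "Geboren_1901", "Gestorben_1986"},
    "Dora": {"Frau", "Physiker"},
}


def boolean_match(wanted, pages=PAGES):
    """Strict intersection: only pages that carry every wanted category."""
    wanted = set(wanted)
    return sorted(title for title, cats in pages.items() if wanted <= cats)


def ranked_match(wanted, pages=PAGES):
    """Fuzzy match: pages with any wanted category, rarer categories weigh more."""
    wanted = set(wanted)
    total = len(pages)
    # Number of pages carrying each wanted category.
    freq = {c: sum(1 for cats in pages.values() if c in cats) for c in wanted}
    scored = []
    for title, cats in pages.items():
        hits = sorted(wanted & cats)
        if hits:
            score = sum(math.log(total / freq[c]) for c in hits)
            scored.append((score, title))
    return [title for score, title in sorted(scored, key=lambda s: (-s[0], s[1]))]


query = ["Frau", "Geboren_1901", "Gestorben_1986"]
print(boolean_match(query))  # ['Anna']
print(ranked_match(query))   # ['Anna', 'Carl', 'Berta', 'Dora']
```

Carl and Berta each match two of the three categories, but Carl ranks
higher because "Gestorben_1986" is the rarer one in this sample.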

(look at any article page at http://wikidweb.com)

The UI on the special page itself is really ugly and needs lots of work, and
once this is all done it will have to be ported to an up-to-date version of
MediaWiki (I'm way down rev), but *it's a start*.

Comments, suggestions, criticisms, and offers to help all welcome.

Best Regards,
Aerik

--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!

Re: The never-dying topic: category intersection
On Mon, Dec 8, 2008 at 5:38 PM, Aerik Sylvan <aerik@thesylvans.com> wrote:
> Okay, it's not quite done, and it's still really crude, but starting to take
> shape - I've got some basic intersection functionality running on Wikidweb
> - I hacked skin.php and added links to the special intersections page. The
> intersections are using a MyISAM fulltext index. I'm not using 'boolean
> mode' queries, as this seems to give more interesting results.

Is there a patch available somewhere? Have you asked for commit
access to work on this in the Wikimedia repo?

Re: The never-dying topic: category intersection
Aerik Sylvan wrote:
> Okay, it's not quite done, and it's still really crude, but starting to take
> shape - I've got some basic intersection functionality running on Wikidweb
> - I hacked skin.php and added links to the special intersections page. The
> intersections are using a MyISAM fulltext index. I'm not using 'boolean
> mode' queries, as this seems to give more interesting results.
>
> (look at any article page at http://wikidweb.com)
>
> The UI on the special page itself is really ugly and needs lots of work, and
> once this is all done it will have to be ported to an up-to-date version of
> MediaWiki (I'm way down rev), but *it's a start*.

Neat! :)

It seems to be more of a "fuzzy" category search, looking for articles
with a similar set of categories, ordered by match rankings.

Not necessarily the same as a strict category intersection tool, but
perhaps more interesting in some ways.

-- brion

Re: The never-dying topic: category intersection
On Wed, 03 Dec 2008 16:48:39 +0100, Roan Kattouw <roan.kattouw@home.nl> wrote:

>
> We had a pretty lengthy discussion about this before the summer, and the
> consensus seemed to be that a fulltext-based approach looked most
> viable. I actually wrote an extension that does that, and promised to
> release it soon; that was quite a few months ago, and I never got around
> to it. I'll release it properly when I have time, which will hopefully
> be before Christmas :D
>
> The code needs some tweaking and refactoring, though. It's pretty
> tightly integrated with the article text search (both functions in one
> form) and has all kinds of weird features, because the guy who paid me
> to write it wanted them. It also doesn't support three-letter word
> searching (which core does these days, using a prefix hack), which is
> pretty bad since categories with short titles (or stopword titles) won't
> be found either.
>
> Roan Kattouw (Catrope)
>
>
Hey Roan, does your code use a new table for the category search (with a
fulltext index), and do you have the hooks for maintaining that table? Do
you display the results on a new search results page, or did you hack
the existing one? Basically, I'm thinking that even if your stuff isn't
ready for prime time, you may have already done a lot of the heavy
lifting... can we get our hands on it?

Thanks!
Aerik

--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!

Re: The never-dying topic: category intersection
Aerik Sylvan wrote:
> Hey Roan, does your code use a new table for the category search (with
> fulltext index)
Yes
> and do you have the hooks for maintaining that table?
Yes. I even have a maintenance script to populate it.
> Do
> you display the results on a new search results page, or did you hack
> the existing one?
I use a new search interface, which has all kinds of weird features.
> Basically, I'm thinking that even if your stuff isn't
> ready for prime time, you may have already done a lot of the heavy
> lifting... can we get our hands on it?
Sure. You'll have to strip out most of the crazy features and you'll
probably want to write a new UI and refactor some stuff here and there,
but it's quite usable. I'll upload it tomorrow, gotta go get some sleep now.
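
For orientation, the "hooks plus maintenance script" arrangement mentioned
above could be sketched roughly as follows. This is not the extension's
code: SQLite stands in for MySQL, and the table, column, and function names
(categorysearch, cs_page, cs_categories, refresh_page, populate,
categories_of) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE TABLE categorysearch (cs_page INTEGER PRIMARY KEY, cs_categories TEXT);
""")


def refresh_page(page_id):
    """Hook-style update: rebuild one page's search row after its links change."""
    row = conn.execute(
        "SELECT group_concat(cl_to, ' ') FROM categorylinks WHERE cl_from = ?",
        (page_id,),
    ).fetchone()
    if row[0] is None:
        # The page no longer has any categories, so drop its search row.
        conn.execute("DELETE FROM categorysearch WHERE cs_page = ?", (page_id,))
    else:
        conn.execute(
            "INSERT OR REPLACE INTO categorysearch (cs_page, cs_categories) "
            "VALUES (?, ?)",
            (page_id, row[0]),
        )


def populate():
    """Maintenance-script-style rebuild of the whole table from categorylinks."""
    conn.execute("DELETE FROM categorysearch")
    conn.execute(
        "INSERT INTO categorysearch (cs_page, cs_categories) "
        "SELECT cl_from, group_concat(cl_to, ' ') FROM categorylinks GROUP BY cl_from"
    )


def categories_of(page_id):
    """Sorted category tokens stored for a page, or an empty list if none."""
    row = conn.execute(
        "SELECT cs_categories FROM categorysearch WHERE cs_page = ?", (page_id,)
    ).fetchone()
    return sorted(row[0].split()) if row else []


conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [(1, "Frau"), (1, "Geboren_1901"), (2, "Mann")],
)
populate()
conn.execute("INSERT INTO categorylinks VALUES (1, 'Gestorben_1986')")
refresh_page(1)
print(categories_of(1))  # ['Frau', 'Geboren_1901', 'Gestorben_1986']
```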

Roan Kattouw (Catrope)
