Mailing List Archive

Blame maps aka authorship detection
Dear All,

Michael Shavlovsky and I have been working on blame maps (authorship
detection) for the various Wikipedias.
We have code in the Wikimedia repository, written with the goal of
building a production system capable of attributing all content (not
just a research demo). Here are some pointers:

- Code <https://gerrit.wikimedia.org/r/#/q/blamemaps,n,z>
- Description of the blame maps MediaWiki extension
<https://docs.google.com/document/d/15MEyu5tDZ3mhj_i1fDNFqNxWexK-B3BtbYKJlYEKdiQ/edit>
- Detailed description of the underlying algorithm, with a performance
evaluation <https://www.soe.ucsc.edu/research/technical-reports/ucsc-soe-12-21/download>
- Demo <http://blamemaps.wmflabs.org/mw/index.php/Main_Page>

These are also all available from
https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project
In brief, for each page we store metadata that summarizes the entire
text evolution of the page; this metadata, compressed, is about three
times the size of a typical revision. Each time a new revision is made,
we read this metadata, attribute every word of the revision, store the
updated metadata, and store authorship data for the revision. The
process takes 1-2 seconds depending on the average revision size (most
of the time is actually devoted to deserializing and reserializing the
metadata). Comparing with all previous revisions takes care of things
like content that is deleted and then later re-inserted, and various
other attacks that might happen once authorship is displayed. I should
also add that these algorithms are independent of the ones in
WikiTrust, and should be much better.
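
To make the flow concrete, here is a deliberately simplified sketch of
the per-revision pipeline; the storage interface and helper names are
invented for illustration, and the real algorithm matches word sequences
against all earlier revisions rather than single words:

    import json
    import zlib

    def process_revision(page_id, rev_id, rev_text, store):
        # Read and decompress the page's evolution metadata (empty on first run).
        blob = store.get_metadata(page_id)
        known = json.loads(zlib.decompress(blob).decode()) if blob else {}

        # Attribute every word: a word seen in any earlier revision keeps its
        # original revision, which handles delete-then-reinsert cases.
        authorship = [(w, known.setdefault(w, rev_id)) for w in rev_text.split()]

        # Persist the grow-only metadata and this revision's authorship map.
        store.put_metadata(page_id, zlib.compress(json.dumps(known).encode()))
        store.put_authorship(page_id, rev_id, authorship)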

We have NOT developed a GUI for this: our plan was just to provide a
data API that gives information on the authorship of each word. There
are many ways to display the information, from page-level summaries of
authorship to detailed word-by-word views, and we thought that surely
others would want to play with the visualization aspect.
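
For instance, a word-level response from such an API could look roughly
like the following (all field names here are made up; this is exactly
the sort of thing we would like to discuss):

    # Hypothetical response shape for word-level authorship data.
    example_response = {
        "page_id": 12345,
        "rev_id": 543210987,
        "tokens": [
            {"word": "Alan",   "origin_rev": 1001, "author": "ExampleUser"},
            {"word": "Turing", "origin_rev": 1001, "author": "ExampleUser"},
            {"word": "was",    "origin_rev": 2042, "author": "AnotherUser"},
        ],
    }

A page-level summary view could then simply aggregate the tokens by
author.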

I am writing this message as we hope this might be of interest, and as we
would be quite happy to find people willing to collaborate. Is anybody
interested in developing a GUI for it and talking to us about what API we
should have for retrieving this authorship information? Is there anybody
interested in helping to move the code to a production-ready stage?

I also would like to mention that Fabian Floeck has developed another very
interesting algorithm for attributing the content, reported in
http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_Andriy_Rodchenko.pdf
Fabian and I are now starting to collaborate: we want to compare the
algorithms, and work together to obtain something we are happy with, and
that can run in production.

Indeed, I think a reasonable first goal would be to:

- Define a data API
- Define some coarse requirements for the system
- Have a look at the above results / algorithms / implementation and
advise us.

I am sure that the algorithm details can be fine-tuned and changed to no
end in a collaborative effort, once the first version is up and running.
The problem is putting together a bit of effort to get to that first
running version.

Luca
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Blame maps aka authorship detection
On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
> I am writing this message as we hope this might be of interest, and as we
> would be quite happy to find people willing to collaborate. Is anybody
> interested in developing a GUI for it and talking to us about what API we
> should have for retrieving this authorship information? Is there anybody
> interested in helping to move the code to a production-ready stage?

Are you planning to run this live in production (i.e. 1-2 seconds on
every save)?

I think people would be reluctant to slow writes down further. You
could potentially do it deferred, or in the job queue, but I think it
might make more sense on something like Wikimedia Labs
(https://www.mediawiki.org/wiki/Wikimedia_Labs).

Did you try doing it with no caching (similar to git blame, though I
know it's a different algorithm)? I'm wondering how much benefit you
get from the cached info.

Matt Flaschen

Re: Blame maps aka authorship detection
I agree: in fact we don't do it in the write pipeline. The code we wrote
implements a simple queue, where page_ids are queued for processing. The
processing job then takes a page_id from that table and processes all the
missing revisions for that page_id. This is also useful if (say) there
is a page merge or something similar: we can just erase all authorship
information for that page, and at the next edit it will be rebuilt.
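
In rough pseudocode (the db object, table, and helper names are invented
for illustration), the processing job looks like this:

    def worker(db):
        # Pull page_ids off the queue table and process their missing revisions.
        while True:
            page_id = db.pop_queued_page()
            if page_id is None:
                break  # queue is empty
            # If authorship was erased (e.g. after a page merge), last_done is
            # None and the whole history is rebuilt at the next edit.
            last_done = db.last_attributed_rev(page_id)
            for rev in db.revisions_after(page_id, last_done):
                process_revision(page_id, rev.id, rev.text, db)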

What we wrote can also work on Labs, but:

- We need a way to poll the database for things like all the
revision_ids of a given page. We could use the API instead (see the
sketch below), but it's less efficient.
- We need a way to read the text of revisions. Again, the API can work,
but direct access would be more efficient.
- We need a place to store the authorship information. This is
several terabytes for enwiki. Basically, we need access to some text
store. Is this available on Labs?
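
For reference, here is one way to get all the revision_ids of a page
through the public API (a minimal sketch; continuation handling for
pages with long histories is omitted):

    import json
    import urllib.parse
    import urllib.request

    def revision_ids(title, api="https://en.wikipedia.org/w/api.php"):
        params = urllib.parse.urlencode({
            "action": "query", "prop": "revisions", "titles": title,
            "rvprop": "ids", "rvlimit": "max", "format": "json",
        })
        with urllib.request.urlopen(api + "?" + params) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return [r["revid"] for r in page.get("revisions", [])]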

We would welcome more information on how much of the above is feasible on
Labs.

Luca

On Mon, Feb 25, 2013 at 7:27 PM, Matthew Flaschen
<mflaschen@wikimedia.org> wrote:

> On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
> > I am writing this message as we hope this might be of interest, and as we
> > would be quite happy to find people willing to collaborate. Is anybody
> > interested in developing a GUI for it and talking to us about what API we
> > should have for retrieving this authorship information? Is there anybody
> > interested in helping to move the code to a production-ready stage?
>
> Are you planning to run this live in production (i.e. 1-2 seconds on
> every save)?
>
> I think people would be reluctant to slow writes down further. You
> could potentially do it deferred, or in the job queue, but I think it
> might make more sense on something like Wikimedia Labs
> (https://www.mediawiki.org/wiki/Wikimedia_Labs).
>
> Did you try doing it with no caching (similar to git blame, though I
> know it's a different algorithm)? I'm wondering how much benefit you
> get from the cached info.
>
> Matt Flaschen

Re: Blame maps aka authorship detection
On 02/25/2013 06:21 PM, Luca de Alfaro wrote:
> I am writing this message as we hope this might be of interest, and as we
> would be quite happy to find people willing to collaborate. Is anybody
> interested in developing a GUI for it and talking to us about what API we
> should have for retrieving this authorship information? Is there anybody
> interested in helping to move the code to a production-ready stage?

I'm emphasizing this message. Thanks for the roundup, Luca!

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Re: Blame maps aka authorship detection
It sounds like some of those things should be working in Labs soon with
DB replication. I doubt they'll let you store terabytes, though.

Alex Monk

On 26/02/13 07:29, Luca de Alfaro wrote:
> What we wrote can also work on Labs, but:
>
> - We need a way to poll the database for things like all the
> revision_ids of a given page. We could use the API instead, but it's less
> efficient.
> - We need a way to read the text of revisions. Again, the API can work,
> but direct access would be more efficient.
> - We need a place to store the authorship information. This is
> several terabytes for enwiki. Basically, we need access to some text
> store. Is this available on Labs?

Re: Blame maps aka authorship detection
On 02/25/2013 09:21 PM, Luca de Alfaro wrote:
> The problem is putting together a bit of effort to get to that first
> running version.

How big are the wikis that you've tried this on? Would smaller academic
wikis be able to use this code?

I may have a use for your code, since one of the wikis I'm working on is
targeted at academics, where getting a citation really improves wiki
participation.

--
http://hexmode.com/

There is no path to peace. Peace is the path.
-- Mahatma Gandhi, "Non-Violence in Peace and War"

Re: Blame maps aka authorship detection
I have briefly toyed with something similar. Unlike yours, it has a (very simple and rudimentary) interface, but no sophisticated algorithms inside :) – just a standard LCS diff library. It also works in real time (but is awfully slow).

It can be seen at http://wikiblame.heroku.com/ (source at https://github.com/MatmaRex/wikiblame) – there's some weird bug right now that makes it fail for titles with non-ASCII characters that I haven't had time to investigate, and due to free-platform limitations it'll fail if generating the blame map takes over 30 seconds (that would be the case for most articles over 3 kB or with more than 50 revisions); I was intending to move it to the Toolserver or Labs or something, but haven't had time for this either.
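
For the curious, the gist of the approach is roughly this (a simplified
sketch in Python; the real thing has to deal with wikitext tokenization
and is messier):

    import difflib

    def blame(revisions):
        """revisions: list of (author, text) pairs, oldest first.
        Returns (word, author) pairs for the latest text."""
        words, owners = [], []
        for author, text in revisions:
            new_words = text.split()
            sm = difflib.SequenceMatcher(a=words, b=new_words, autojunk=False)
            new_owners = [None] * len(new_words)
            for tag, i1, i2, j1, j2 in sm.get_opcodes():
                if tag == "equal":  # unchanged words keep their author
                    new_owners[j1:j2] = owners[i1:i2]
            words = new_words
            owners = [o if o is not None else author for o in new_owners]
        return list(zip(words, owners))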

I've also seen some gadget on en.wiki that did something similar, but I don't remember the name and can't find it right now.

--
Matma Rex

Re: Blame maps aka authorship detection
Your site doesn't work:

http://blamemaps.wmflabs.org/mw/index.php/Main_Page -> the connection timed out

On Tue, Feb 26, 2013 at 5:52 PM, Bartosz Dziewoński <matma.rex@gmail.com> wrote:
> I have briefly toyed with something similar. Unlike yours, it has a (very
> simple and rudimentary) interface, but no sophisticated algorithms inside :)
> – just a standard LCS diff library. It also works in real time (but is
> awfully slow).
>
> It can be seen at http://wikiblame.heroku.com/ (source at
> https://github.com/MatmaRex/wikiblame) – there's some weird bug right now
> that makes it fail for titles with non-ASCII characters that I haven't had
> time to investigate, and due to free-platform limitations it'll fail if
> generating the blame map takes over 30 seconds (that would be the case
> for most articles over 3 kB or with more than 50 revisions); I was
> intending to move it to the Toolserver or Labs or something, but haven't
> had time for this either.
>
> I've also seen some gadget on en.wiki that did something similar, but I
> don't remember the name and can't find it right now.
>
> --
> Matma Rex

Re: Blame maps aka authorship detection
Hi Luca,

we are working on somewhat related issues in Parsoid [1][2]. The
modified HTML DOM is diffed against the original DOM on the way in. Each
modified node is annotated with the base revision. We don't store this
information yet; right now we use it to selectively serialize modified
parts of the page back to wikitext. We will however soon store the HTML
along with the wikitext for each revision, which should make it possible
to display a coarse blame map.

There are several limitations:

* We don't preserve blame information on wikitext edits yet. This should
become possible with the incremental re-parsing optimization which is on
our roadmap for this summer.

* Our DOM diff algorithm is extremely simplistic. We are considering
porting XyDiff for better move detection.

* The information is pretty coarse, at the node level. Refining this to
the word level would require an efficient encoding for that information,
possibly as length/revision pairs associated with the wrapping element
(see the sketch after this list).

* We have not moved metadata from attributes to a metadata section with
efficient encoding yet.
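
For what it's worth, one possible length/revision-pair encoding could
look like this (illustrative only, not an actual Parsoid format):

    def encode(rev_per_word):
        # Collapse a per-word revision list into (run_length, revision) pairs,
        # e.g. encode([100, 100, 100, 205]) == [(3, 100), (1, 205)].
        runs = []
        for rev in rev_per_word:
            if runs and runs[-1][1] == rev:
                runs[-1][0] += 1
            else:
                runs.append([1, rev])
        return [tuple(run) for run in runs]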

We don't currently plan to work on blame maps ourselves. Maybe there are
opportunities for collaboration?

Gabriel

[1]: http://www.mediawiki.org/wiki/Parsoid
[2]: http://www.mediawiki.org/wiki/Parsoid/Roadmap

--
Gabriel Wicke
Senior Software Engineer
Wikimedia Foundation

Re: Blame maps aka authorship detection
On 02/26/2013 02:29 AM, Luca de Alfaro wrote:
> - We need a way to poll the database for things like all the
> revision_ids of a given page. We could use the API instead, but it's less
> efficient.

Yes, as others have said, Labs should allow that either now or shortly.
You should sign up for
https://lists.wikimedia.org/mailman/listinfo/labs-l and feel free to ask
Labs questions there.

> - We need a place to store the authorship information. This is
> several terabytes for enwiki. Basically, we need access to some text
> store. Is this available on Labs?

I don't know if you'll be able to get that or not. You'll have to make
a special request.

Matt Flaschen

Re: Blame maps aka authorship detection
Hi,
as Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann and I) are also working on an algorithm for authorship detection. Our approach is somewhat different from Luca and Michael's in that we rebuild authorship information for words in paragraphs and sentences via MD5 hashes (i.e., we check whether they have existed before at any point in the article's history) and use a diff algorithm to detect the changes in the parts of the article that haven't been seen before.
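
As a very rough sketch of the hash-based reuse check (our actual
implementation differs in many details):

    import hashlib

    def md5(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def attribute(paragraphs, seen, rev_id):
        # seen: hash -> revision in which that paragraph first appeared.
        for p in paragraphs:
            h = md5(p)
            if h in seen:
                yield p, seen[h]   # existed before: keep the original revision
            else:
                seen[h] = rev_id   # genuinely new: record it, and hand it to
                yield p, rev_id    # the diff step for word-level attribution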

We build on an older, more basic model of ours, as described in the paper Luca already included in his mail [1]. Currently we are at 0.04 seconds per revision for the pure calculation, without writing/reading the hashes to/from a database. This is the step we are working on now, to make the method incremental. We will make the code publicly available soon. We would like to contribute as much as we can to the Wikipedia authorship project with our solution and are open to any collaboration.

Another issue is of course the accuracy of the attributed words, for which we will ask the community for input. We have set up a small gold-standard set of 184 words and their origin (who wrote them in which revision), which can be found here: [2]. The words were randomly selected and their origin determined manually. I invite everyone to look at this set, comment on whether the postulated revisions of origin seem right, and maybe extend it. Although we will run an evaluation with a bigger user base, this serves as a useful starting point for preliminary testing. Right now we reach an accuracy of ~85% on this set (compared to ~50% for the old WikiTrust algorithm, see [1]), although there are still a lot of tuning possibilities in our algorithm.

Best,

Fabian

[1] http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_Andriy_Rodchenko.pdf
[2] https://docs.google.com/spreadsheet/ccc?key=0An7RIRiLIXD5dENITFpmU0c1RVZaU1NYeXZ0UEVVaEE#gid=0

--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Dipl.-Medwiss. Fabian Flöck
Research Associate

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu<mailto:fabian.floeck@kit.edu>
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
