Mailing List Archive

Can we drop revision hashes (rev_sha1)?
Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR)
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
expensive with MCR. With multiple content objects per revision, we need to track
the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive especially in terms of bytes-per-database-row, which impacts
query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core
uses it, and I'm not aware of any extension using it either. It seems to be used
primarily in offline analysis for detecting (manual) reverts by looking for
revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with every
revision update? Or can we just compute the hashes on the fly for the offline
analysis? Computing hashes is slow since the content needs to be loaded first,
but it would only have to be done for pairs of revisions of the same page with
the same size, which should be a pretty good optimization.

Also, I believe Roan is currently looking for a better mechanism for tracking
all kinds of reverts directly.

So, can we drop rev_sha1?

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Computing the hashes on the fly for the offline analysis doesn’t work for Wikistats 1.0, as it only parses the stub dumps, which contain just metadata, without article content.
Parsing the full archive dumps is quite expensive, time-wise.

This may change with Wikistats 2.0, which has a totally different process flow; I can't tell yet.

Erik Zachte

-----Original Message-----
From: Wikitech-l [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Daniel Kinzler
Sent: Friday, September 15, 2017 12:52
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR) <https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR. With multiple content objects per revision, we need to track the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive especially in terms of bytes-per-database-row, which impacts query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core uses it, and I'm not aware of any extension using it either. It seems to be used primarily in offline analysis for detecting (manual) reverts by looking for revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with every revision update? Or can we just compute the hashes on the fly for the offline analysis? Computing hashes is slow since the content needs to be loaded first, but it would only have to be done for pairs of revisions of the same page with the same size, which should be a pretty good optimization.

Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly.

So, can we drop rev_sha1?

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
from the little I know:

Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.

See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
(particularly the *revert* fields).



On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezachte@wikimedia.org> wrote:

> Compute the hashes on the fly for the offline analysis doesn’t work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata.
> Parsing the full archive dumps is a quite expensive, time-wise.
>
> This may change with Wikistats 2.0 with has a totally different process
> flow. That I can't tell.
>
> Erik Zachte
>
> -----Original Message-----
> From: Wikitech-l [mailto:wikitech-l-bounces@lists.wikimedia.org] On
> Behalf Of Daniel Kinzler
> Sent: Friday, September 15, 2017 12:52
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
>
> Hi all!
>
> I'm working on the database schema for Multi-Content-Revisions (MCR) <
> https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> and I'd like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> more expensive with MCR. With multiple content objects per revision, we
> need to track the hash for each slot, and then re-calculate the sha1 for
> each revision.
>
> That's expensive especially in terms of bytes-per-database-row, which
> impacts query performance.
>
> So, what do we need the rev_sha1 field for? As far as I know, nothing in
> core uses it, and I'm not aware of any extension using it either. It seems
> to be used primarily in offline analysis for detecting (manual) reverts by
> looking for revisions with the same hash.
>
> Is that reason enough for dragging all the hashes around the database with
> every revision update? Or can we just compute the hashes on the fly for the
> offline analysis? Computing hashes is slow since the content needs to be
> loaded first, but it would only have to be done for pairs of revisions of
> the same page with the same size, which should be a pretty good
> optimization.
>
> Also, I believe Roan is currently looking for a better mechanism for
> tracking all kinds of reverts directly.
>
> So, can we drop rev_sha1?
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
What I wonder is – does this *need* to be a part of the database table, or
can it be a dataset generated from each revision and then published
separately? This way each user wouldn’t have to individually compute the
hashes while we also get the (ostensible) benefit of getting them out of
the table.

On September 15, 2017 at 12:41:03 PM, Andrew Otto (otto@wikimedia.org)
wrote:

We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
from the little I know:

Most analytical computations (for things like reverts, as you say) don’t
have easy access to content, so computing SHAs on the fly is pretty hard.
MediaWiki history reconstruction relies on the SHA to figure out what
revisions revert other revisions, as there is no reliable way to know if
something is a revert other than by comparing SHAs.

See
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
(particularly the *revert* fields).



On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezachte@wikimedia.org> wrote:

> Compute the hashes on the fly for the offline analysis doesn’t work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata.
> Parsing the full archive dumps is a quite expensive, time-wise.
>
> This may change with Wikistats 2.0 with has a totally different process
> flow. That I can't tell.
>
> Erik Zachte
>
> -----Original Message-----
> From: Wikitech-l [mailto:wikitech-l-bounces@lists.wikimedia.org] On
> Behalf Of Daniel Kinzler
> Sent: Friday, September 15, 2017 12:52
> To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
>
> Hi all!
>
> I'm working on the database schema for Multi-Content-Revisions (MCR) <
> https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> and I'd like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes
> more expensive with MCR. With multiple content objects per revision, we
> need to track the hash for each slot, and then re-calculate the sha1 for
> each revision.
>
> That's expensive especially in terms of bytes-per-database-row, which
> impacts query performance.
>
> So, what do we need the rev_sha1 field for? As far as I know, nothing in
> core uses it, and I'm not aware of any extension using it either. It seems
> to be used primarily in offline analysis for detecting (manual) reverts by
> looking for revisions with the same hash.
>
> Is that reason enough for dragging all the hashes around the database with
> every revision update? Or can we just compute the hashes on the fly for
the
> offline analysis? Computing hashes is slow since the content needs to be
> loaded first, but it would only have to be done for pairs of revisions of
> the same page with the same size, which should be a pretty good
> optimization.
>
> Also, I believe Roan is currently looking for a better mechanism for
> tracking all kinds of reverts directly.
>
> So, can we drop rev_sha1?
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Hi!

> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.

As a random idea - would it be possible to calculate the hashes when
data is transitioned from SQL to Hadoop storage? I imagine that would
slow down the transition, but I'm not sure whether the slowdown would be
substantial. If we're using the hash just to compare revisions, we could
also use a different hash (maybe a non-cryptographic hash?), which may be
faster.
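To make the "faster hash" idea concrete, here is a toy comparison (purely illustrative; the numbers depend on the machine, and crc32 is far too small to be a real candidate for globally unique hashes):

    import hashlib
    import timeit
    import zlib

    payload = b"x" * 50_000  # roughly article-sized blob, just for illustration

    candidates = {
        "sha1 (current, 160-bit)": lambda: hashlib.sha1(payload).digest(),
        "blake2b (truncated to 80-bit)": lambda: hashlib.blake2b(payload, digest_size=10).digest(),
        "crc32 (non-crypto, 32-bit)": lambda: zlib.crc32(payload),
    }

    for name, fn in candidates.items():
        seconds = timeit.timeit(fn, number=2000)
        print(f"{name:32s} {seconds:.3f}s for 2000 hashes")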

--
Stas Malyshev
smalyshev@wikimedia.org

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
> As a random idea - would it be possible to calculate the hashes when data
is transitioned from SQL to Hadoop storage?

We take monthly snapshots of the entire history, so every month we’d have
to pull the content of every revision ever made :o


On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev <smalyshev@wikimedia.org>
wrote:

> Hi!
>
> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> > from the little I know:
> >
> > Most analytical computations (for things like reverts, as you say) don’t
> > have easy access to content, so computing SHAs on the fly is pretty hard.
> > MediaWiki history reconstruction relies on the SHA to figure out what
> > revisions revert other revisions, as there is no reliable way to know if
> > something is a revert other than by comparing SHAs.
>
> As a random idea - would it be possible to calculate the hashes when
> data is transitioned from SQL to Hadoop storage? I imagine that would
> slow down the transition, but not sure if it'd be substantial or not. If
> we're using the hash just to compare revisions, we could also use
> different hash (maybe non-crypto hash?) which may be faster.
>
> --
> Stas Malyshev
> smalyshev@wikimedia.org
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
> can it be a dataset generated from each revision and then published
separately?

Perhaps it could be generated asynchronously via a job? Either stored in
the revision table or in a separate table.

On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto <otto@wikimedia.org> wrote:

> > As a random idea - would it be possible to calculate the hashes when data
> is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d have
> to pull the content of every revision ever made :o
>
>
> On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev <smalyshev@wikimedia.org>
> wrote:
>
>> Hi!
>>
>> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think,
>> but
>> > from the little I know:
>> >
>> > Most analytical computations (for things like reverts, as you say) don’t
>> > have easy access to content, so computing SHAs on the fly is pretty
>> hard.
>> > MediaWiki history reconstruction relies on the SHA to figure out what
>> > revisions revert other revisions, as there is no reliable way to know if
>> > something is a revert other than by comparing SHAs.
>>
>> As a random idea - would it be possible to calculate the hashes when
>> data is transitioned from SQL to Hadoop storage? I imagine that would
>> slow down the transition, but not sure if it'd be substantial or not. If
>> we're using the hash just to compare revisions, we could also use
>> different hash (maybe non-crypto hash?) which may be faster.
>>
>> --
>> Stas Malyshev
>> smalyshev@wikimedia.org
>>
>
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Hi!

On 9/15/17 1:06 PM, Andrew Otto wrote:
>> As a random idea - would it be possible to calculate the hashes
> when data is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d
> have to pull the content of every revision ever made :o

Why? If you have already seen that revision in a previous snapshot, you'd
already have its hash? Admittedly, I have no idea how the process works,
so I am just talking from general knowledge and may be missing some things.
Also, of course, you already have hashes for every revision up to the day
we decide to turn the hash off. Starting that day, hashes would have to
be generated, but I see no reason to generate one more than once?
--
Stas Malyshev
smalyshev@wikimedia.org

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
We could keep it for the wikitext, but drop the hash for the metadata, and
drop any support for a "combined" hash over wikitext + all-other-pieces.

...which raises the question of how reverts work in MCR. Is it just the
wikitext which is reverted, or do categories and other metadata revert as
well? And perhaps we can just mark these at revert time instead of trying
to reconstruct it after the fact?
--scott

On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev <smalyshev@wikimedia.org>
wrote:

> Hi!
>
> On 9/15/17 1:06 PM, Andrew Otto wrote:
> >> As a random idea - would it be possible to calculate the hashes
> > when data is transitioned from SQL to Hadoop storage?
> >
> > We take monthly snapshots of the entire history, so every month we’d
> > have to pull the content of every revision ever made :o
>
> Why? If you already seen that revision in previous snapshot, you'd
> already have its hash? Admittedly, I have no idea how the process works,
> so I am just talking out of general knowledge and may miss some things.
> Also of course you already have hashes from revs till this day and up to
> the day we decide to turn the hash off. Starting that day, it'd have to
> be generated, but I see no reason to generate one more than once?
> --
> Stas Malyshev
> smalyshev@wikimedia.org
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



--
(http://cscott.net)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
We could keep it in the XML dumps (it's part of the XSD after all)... just
compute it at export time. Not terribly hard, I don't think; we should have
the parsed content already on hand....

-Chad

On Fri, Sep 15, 2017 at 12:51 PM James Hare <jamesmhare@gmail.com> wrote:

> What I wonder is – does this *need* to be a part of the database table, or
> can it be a dataset generated from each revision and then published
> separately? This way each user wouldn’t have to individually compute the
> hashes while we also get the (ostensible) benefit of getting them out of
> the table.
>
> On September 15, 2017 at 12:41:03 PM, Andrew Otto (otto@wikimedia.org)
> wrote:
>
> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
> from the little I know:
>
> Most analytical computations (for things like reverts, as you say) don’t
> have easy access to content, so computing SHAs on the fly is pretty hard.
> MediaWiki history reconstruction relies on the SHA to figure out what
> revisions revert other revisions, as there is no reliable way to know if
> something is a revert other than by comparing SHAs.
>
> See
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
> (particularly the *revert* fields).
>
>
>
> On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <ezachte@wikimedia.org>
> wrote:
>
> > Compute the hashes on the fly for the offline analysis doesn’t work for
> > Wikistats 1.0, as it only parses the stub dumps, without article content,
> > just metadata.
> > Parsing the full archive dumps is a quite expensive, time-wise.
> >
> > This may change with Wikistats 2.0 with has a totally different process
> > flow. That I can't tell.
> >
> > Erik Zachte
> >
> > -----Original Message-----
> > From: Wikitech-l [mailto:wikitech-l-bounces@lists.wikimedia.org] On
> > Behalf Of Daniel Kinzler
> > Sent: Friday, September 15, 2017 12:52
> > To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> > Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> >
> > Hi all!
> >
> > I'm working on the database schema for Multi-Content-Revisions (MCR) <
> > https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema>
> > and I'd like to get rid of the rev_sha1 field:
> >
> > Maintaining revision hashes (the rev_sha1 field) is expensive, and
> becomes
> > more expensive with MCR. With multiple content objects per revision, we
> > need to track the hash for each slot, and then re-calculate the sha1 for
> > each revision.
> >
> > That's expensive especially in terms of bytes-per-database-row, which
> > impacts query performance.
> >
> > So, what do we need the rev_sha1 field for? As far as I know, nothing in
> > core uses it, and I'm not aware of any extension using it either. It
> seems
> > to be used primarily in offline analysis for detecting (manual) reverts
> by
> > looking for revisions with the same hash.
> >
> > Is that reason enough for dragging all the hashes around the database
> with
> > every revision update? Or can we just compute the hashes on the fly for
> the
> > offline analysis? Computing hashes is slow since the content needs to be
> > loaded first, but it would only have to be done for pairs of revisions of
> > the same page with the same size, which should be a pretty good
> > optimization.
> >
> > Also, I believe Roan is currently looking for a better mechanism for
> > tracking all kinds of reverts directly.
> >
> > So, can we drop rev_sha1?
> >
> > --
> > Daniel Kinzler
> > Principal Platform Engineer
> >
> > Wikimedia Deutschland
> > Gesellschaft zur Förderung Freien Wissens e.V.
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On 15.09.2017 at 19:49, Erik Zachte wrote:
> Compute the hashes on the fly for the offline analysis doesn’t work for Wikistats 1.0, as it only parses the stub dumps, without article content, just metadata.
> Parsing the full archive dumps is a quite expensive, time-wise.

We can always compute the hash when outputting XML dumps that contain the full
content (it's already loaded, so no big deal), and then generate the XML dump
with only meta-data from the full dump.
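A rough sketch of that flow (not the real dumps code; the XML is heavily simplified and the hex digest just stands in for whatever encoding the dumps use): hash each revision while writing the full dump, and emit the metadata-only stub in the same pass.

    import hashlib

    def write_dumps(revisions, full_out, stub_out):
        # `revisions` yields (rev_id, text) pairs; both outputs are
        # writable text streams.
        for rev_id, text in revisions:
            sha1 = hashlib.sha1(text.encode("utf-8")).hexdigest()
            full_out.write(
                f"<revision><id>{rev_id}</id><sha1>{sha1}</sha1>"
                f"<text>{text}</text></revision>\n"
            )
            # The stub dump reuses the hash computed above, so nothing
            # has to be re-read or re-hashed in a second pass.
            stub_out.write(
                f"<revision><id>{rev_id}</id><sha1>{sha1}</sha1></revision>\n"
            )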


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
A revert restores a previous revision. It covers all slots.

The fact that reverts, watching, protecting, etc. still work per page, while you
can have multiple kinds of different content on the page, is indeed the point of
MCR.

On 15.09.2017 at 22:23, C. Scott Ananian wrote:
> Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
> We could keep it for the wikitext, but drop the hash for the metadata, and
> drop any support for a "combined" hash over wikitext + all-other-pieces.
>
> ...which begs the question about how reverts work in MCR. Is it just the
> wikitext which is reverted, or do categories and other metadata revert as
> well? And perhaps we can just mark these at revert time instead of trying
> to reconstruct it after the fact?
> --scott
>
> On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev <smalyshev@wikimedia.org>
> wrote:
>
>> Hi!
>>
>> On 9/15/17 1:06 PM, Andrew Otto wrote:
>>>> As a random idea - would it be possible to calculate the hashes
>>> when data is transitioned from SQL to Hadoop storage?
>>>
>>> We take monthly snapshots of the entire history, so every month we’d
>>> have to pull the content of every revision ever made :o
>>
>> Why? If you already seen that revision in previous snapshot, you'd
>> already have its hash? Admittedly, I have no idea how the process works,
>> so I am just talking out of general knowledge and may miss some things.
>> Also of course you already have hashes from revs till this day and up to
>> the day we decide to turn the hash off. Starting that day, it'd have to
>> be generated, but I see no reason to generate one more than once?
>> --
>> Stas Malyshev
>> smalyshev@wikimedia.org
>>
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>
>
>


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Ok, a little more detail here:

For MCR, we would have to keep around the hash of each content object ("slot")
AND of each revision. This makes the revision and content tables "wider", which
is a problem because they grow quite "tall", too. It also means we have to
compute a hash of hashes for each revision, but that's not horrible.
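That "hash of hashes" part is indeed cheap; something like this sketch (how slots would actually be ordered and combined is an open design detail, not something decided here):

    import hashlib

    def revision_sha1(slot_sha1s):
        # slot_sha1s: dict mapping slot role ("main", "metadata", ...) to
        # that slot's content hash. Sorting by role keeps the result stable.
        combined = "".join(f"{role}:{sha1}" for role, sha1 in sorted(slot_sha1s.items()))
        return hashlib.sha1(combined.encode("utf-8")).hexdigest()

    print(revision_sha1({"main": "aa11...", "metadata": "bb22..."}))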

I'm hoping we can remove the hash from both tables. Keeping the hash of each
content object and/or each revision somewhere else is fine with me. Perhaps it's
sufficient to generate it when generating XML dumps. Maybe we want it in Hadoop.
Maybe we want to have it in a separate SQL database. But perhaps we don't
actually need it.

Can someone explain *why* they want the hash at all?

On 15.09.2017 at 22:01, Stas Malyshev wrote:
> Hi!
>
>> We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but
>> from the little I know:
>>
>> Most analytical computations (for things like reverts, as you say) don’t
>> have easy access to content, so computing SHAs on the fly is pretty hard.
>> MediaWiki history reconstruction relies on the SHA to figure out what
>> revisions revert other revisions, as there is no reliable way to know if
>> something is a revert other than by comparing SHAs.
>
> As a random idea - would it be possible to calculate the hashes when
> data is transitioned from SQL to Hadoop storage? I imagine that would
> slow down the transition, but not sure if it'd be substantial or not. If
> we're using the hash just to compare revisions, we could also use
> different hash (maybe non-crypto hash?) which may be faster.
>


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
> Also, I believe Roan is currently looking for a better mechanism for tracking
> all kinds of reverts directly.

Let's see if we want to use rev_sha1 for that better solution (a way to
track reverts within MW itself) before we drop it.

I know Roan is planning to write an RFC on reverts.

Matt

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
At a quick glance, EventBus and FlaggedRevs are the two extensions using
the hashes. EventBus just puts them into the emitted data; FlaggedRevs
detects reverts to the latest stable revision that way (so there is no
rev_sha1-based lookup in either case, although in the case of FlaggedRevs I
could imagine a use case for something like that).

Files on the other hand use hash lookups a lot, and AIUI they are planned
to become MCR slots eventually.

For a quick win, you could just reduce the hash size. We have around a
billion revisions, and probably won't ever have more than a trillion;
square that for birthday effect and add a couple extra zeros just to be
sure, and it still fits comfortably into 80 bits. If hashes only need to be
unique within the same page, then maybe 30-40 bits would do.
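A quick way to sanity-check hash-width arguments like this (approximate birthday-bound arithmetic; the revision counts are just the rough figures from the paragraph above):

    import math

    def collision_probability(n_items, bits):
        # Approximate probability of at least one collision when n_items
        # values are drawn uniformly from a space of 2**bits.
        return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * 2.0 ** bits))

    for n in (1e9, 1e12):
        for bits in (40, 64, 80, 160):
            p = collision_probability(n, bits)
            print(f"n={n:.0e} bits={bits:3d} collision probability ~ {p:.2e}")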
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On 15/09/2017 12:51, Daniel Kinzler wrote:
>
> I'm working on the database schema for Multi-Content-Revisions (MCR)
> <https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and I'd
> like to get rid of the rev_sha1 field:
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more
> expensive with MCR. With multiple content objects per revision, we need to track
> the hash for each slot, and then re-calculate the sha1 for each revision.
<snip>

Hello,

That was introduced by Aaron Schulz. The purpose is to have them pre-computed,
since it is quite expensive to compute them on millions of rows.

A use case was to easily detect reverts.

See for reference:
https://phabricator.wikimedia.org/T23860
https://phabricator.wikimedia.org/T27312

I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some
insights about it.



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
Antoine Musso wrote:
>I guess Aaron Halfaker, Brion Vibber, Aaron Schulz would have some
>insights about it.

Yes. Brion started a thread about the use of SHA-1 in February 2017:

https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087664.html
https://lists.wikimedia.org/pipermail/wikitech-l/2017-February/087666.html

Of note, we have <https://www.mediawiki.org/wiki/Manual:Hashing>.

The use of base-36 SHA-1 instead of base-16 SHA-1 for revision.rev_sha1
has always perplexed me. It'd be nice to better(?) document that design
decision. It's referenced here:
https://lists.wikimedia.org/pipermail/wikitech-l/2012-September/063445.html
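For what it's worth, base 36 packs the same 160 bits into 31 characters instead of hex's 40, which is presumably the motivation; a small sketch (the 31-character zero-padding is my reading of the core helper, so treat it as an assumption):

    import hashlib

    def sha1_base36(data, width=31):
        # 36**31 > 2**160, so 31 base-36 digits always suffice for SHA-1.
        n = int(hashlib.sha1(data).hexdigest(), 16)
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out.rjust(width, "0")

    print(hashlib.sha1(b"example wikitext").hexdigest())  # 40 hex characters
    print(sha1_base36(b"example wikitext"))               # 31 base-36 characters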

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On 16.09.2017 at 01:22, Matthew Flaschen wrote:
> On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
>> Also, I believe Roan is currently looking for a better mechanism for tracking
>> all kinds of reverts directly.
>
> Let's see if we want to use rev_sha1 for that better solution (a way to track
> reverts within MW itself) before we drop it.


The problem is that if we don't drop it, we have to *introduce* it for the new
content table for MCR. I'd like to avoid that.

I guess we can define the field and just null it, but... well. I'd like to avoid
that.


--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
So, as things stand, rev_sha1 in the database is used for:

1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service

If you think about this list above as a flow of data, you'll see that
rev_sha1 is replicated to XML, labs databases, Hadoop, ML models, etc. So
removing it and adding it back downstream from the main mediawiki database
somewhere, like in XML, cuts off the other places that need it. That means
it must be available either in the mediawiki database or in some other
central database which all those other consumers can pull from.

I defer to your expertise when you say it's expensive to keep in the db,
and I can see how that would get much worse with MCR. I'm sure we can
figure something out, though. Right now it seems like our options are, as
others have pointed out:

* compute async and store in DB or somewhere else that's central and easy
to access from all the branches I mentioned
* update how we detect reverts and keep a revert database with good
references to wiki_db, rev_id so it can be brought back in context.

Personally, I would love to get better revert detection; using sha1 exact
matches doesn't really get to the heart of the issue. Important phenomena
like revert wars, bullying, and stalking are hiding behind bad revert
detection. I'm happy to brainstorm ways we can use Analytics
infrastructure to do this. We definitely have the tools necessary, but not
so much the man-power. That said, please don't strip out rev_sha1 until
we've accounted for all its "data customers".

So, put another way, I think it's totally fine if we say ok everyone, from
date XYZ, you will no longer have rev_sha1 in the database, but if you want
to know whether an edit reverts a previous edit or a series of edits, go
*HERE*. That's fine. And just for context, here's how we do our revert
detection in Hadoop (it's pretty fancy) [2].


[1] https://github.com/mediawiki-utilities/python-mwreverts
[2]
https://github.com/wikimedia/analytics-refinery-source/blob/1d38b8e4acfd10dc811279826ffdff236e8b0f2d/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/denormalized/DenormalizedRevisionsBuilder.scala#L174-L317
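For concreteness, here is roughly what SHA-based identity-revert detection boils down to (a minimal sketch, not the mwreverts or refinery-source logic linked above; the 15-revision window is just an illustrative choice):

    def find_identity_reverts(revisions, radius=15):
        # `revisions`: list of (rev_id, sha1) for one page, in timestamp order.
        # Returns (reverting_rev, restored_rev, [reverted_revs]) tuples.
        reverts = []
        for i, (rev_id, sha1) in enumerate(revisions):
            window = revisions[max(0, i - radius):i]
            for back, (old_id, old_sha1) in enumerate(reversed(window)):
                if old_sha1 == sha1 and back > 0:  # back == 0 would be a null edit
                    reverted = [r for r, _ in window[len(window) - back:]]
                    reverts.append((rev_id, old_id, reverted))
                    break
        return reverts

    history = [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "aaa")]
    print(find_identity_reverts(history))  # [(4, 1, [2, 3])]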

On Mon, Sep 18, 2017 at 9:19 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de
> wrote:

> Am 16.09.2017 um 01:22 schrieb Matthew Flaschen:
> > On 09/15/2017 06:51 AM, Daniel Kinzler wrote:
> >> Also, I believe Roan is currently looking for a better mechanism for
> tracking
> >> all kinds of reverts directly.
> >
> > Let's see if we want to use rev_sha1 for that better solution (a way to
> track
> > reverts within MW itself) before we drop it.
>
>
> The problem is that if we don't drop is, we have to *introduce* it for the
> new
> content table for MCR. I'd like to avoid that.
>
> I guess we can define the field and just null it, but... well. I'd like to
> avoid
> that.
>
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
---------- Original e-mail ----------
From: Dan Andreescu <dandreescu@wikimedia.org>
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Date: 18. 9. 2017 16:26:18
Subject: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
"So, as things stand, rev_sha1 in the database is used for:

1. the XML dumps process and all the researchers depending on the XML dumps
(probably just for revert detection)
2. revert detection for libraries like python-mwreverts [1]
3. revert detection in mediawiki history reconstruction processes in Hadoop
(Wikistats 2.0)
4. revert detection in Wikistats 1.0
5. revert detection for tools that run on labs, like Wikimetrics
?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
latest code for that service

If you think about this list above as a flow of data, you'll see that
rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So
removing it and adding it back downstream from the main mediawiki database
somewhere, like in XML, cuts off the other places that need it. That means
it must be available either in the mediawiki database or in some other
central database which all those other consumers can pull from.
"



I use rev_sha1 on the replicas to check the consistency of modules, templates or
other pages (typically help pages) which should be the same between projects
(either within one language, or even cross-language if the page is not
language-dependent). In other words, to detect possible changes in them and
sync them.

Also, I haven't noticed it mentioned in the thread: Flow also notifies users
on reverts, but I don't know whether it uses rev_sha1 or not, so I'm mentioning
it just in case.

Kind regards

Danny B.


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
I am not a MediaWiki developer, but shouldn't sha1 be moved, rather than
deleted/not deleted? Moved to the content table, so it is kept unaltered.

That way it can be used for all the goals that have been discussed
(detecting reversions, XML dumps, etc.), and the hashes are not altered, just
moved away (which is more compatible). And it is not as if structural
compatibility is going to be kept anyway: many fields are going to be
"moved" there, so code using the tables directly has to change in any case;
but if the actual content is not altered, the sha field can keep the same
value as before. It would also allow detecting a "partial reversion",
meaning the wikitext is set to the same as a previous version, which is
what I assume it is used for now. However, there will now be other content
that can be reverted individually.

I do not know exactly what MCR is going to be used for, but if (silly
idea) the main article text and the categories are two different contents of
an article, and user A edits both while user B reverts the text only, that
would produce a different revision sha1 value; however, most use cases here
would want to detect the reversion by checking the sha of the text only
(i.e. the content). Equally, for backwards compatibility, storing it on
content would mean not having to recalculate it for all already existing
values, literally reducing this to a "trivial" code change while keeping all
old data valid. Keeping the field as-is, on revision, would mean all
historical data and old dumps become invalid. Full revision reversions, if
needed, can be checked by comparing each individual content sha or the
linked content ids.

If, on the other hand, revision has to be kept completely backwards
compatible, some helper views can be created on the cloud
wikireplicas, but other than that, MCR would not be possible.

If at a later time, text with the same hash is detected (and content
double checked), content could be normalized by assigning the same id
to the same content?

On Mon, Sep 18, 2017 at 8:25 PM, Danny B. <Wikipedia.Danny.B@email.cz> wrote:
>
> ---------- P?vodní e-mail ----------
> Od: Dan Andreescu <dandreescu@wikimedia.org>
> Komu: Wikimedia developers <wikitech-l@lists.wikimedia.org>
> Datum: 18. 9. 2017 16:26:18
> P?edm?t: Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
> "So, as things stand, rev_sha1 in the database is used for:
>
> 1. the XML dumps process and all the researchers depending on the XML dumps
> (probably just for revert detection)
> 2. revert detection for libraries like python-mwreverts [1]
> 3. revert detection in mediawiki history reconstruction processes in Hadoop
> (Wikistats 2.0)
> 4. revert detection in Wikistats 1.0
> 5. revert detection for tools that run on labs, like Wikimetrics
> ?. I think Aaron also uses rev_sha1 in ORES, but I can't seem to find the
> latest code for that service
>
> If you think about this list above as a flow of data, you'll see that
> rev_sha1 is replicated to xml, labs databases, hadoop, ML models, etc. So
> removing it and adding it back downstream from the main mediawiki database
> somewhere, like in XML, cuts off the other places that need it. That means
> it must be available either in the mediawiki database or in some other
> central database which all those other consumers can pull from.
> "
>
>
>
> I use rev_sha1 on replicas to check the consistency of modules, templates or
> other pages (typically help) which should be same between projects (either
> within one language or even crosslanguage, if the page is not language
> dependent). In other words to detect possible changes in them and syncing
> them.
>
>
>
>
> Also, I haven't noticed it mentioned in the thread: Flow also notices users
> on reverts, but IDK whether it uses rev_sha1 or not. So I'm rather
> mentioning it.
>
>
>
>
>
>
>
> Kind regards
>
>
>
>
>
>
>
> Danny B.
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



--
Jaime Crespo
<http://wikimedia.org>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On 19.09.2017 at 10:15, Jaime Crespo wrote:
> I am not a mediawiki developer, but shouldn't sha1 be moved instead of
> deleted/not deleted? Moved to the content table- so it is kept
> unaltered.
The background of my original mail is indeed the question whether we need the
sha1 field in the content table. The current draft of the DB schema includes it.

That table will be tall, and the sha1 is the (on average) largest field. If we
are going to use a different mechanism for tracking reverts soon, my hope was
that we can do without it.

In any case, my impression is that if we want to keep using hashes to detect
reverts, we need to keep rev_sha1 - and to maintain it, we ALSO need content_sha1.

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
There are two important use cases: one where you want to identify previous
reverts, and one where you want to identify close matches. There are other
ways to do the first than using a digest, but the digest opens up alternate
client-side algorithms. The latter would typically be done with some
locality-sensitive hashing. In both cases you don't want to download the
content of each revision; that is exactly why you want to use some kind of
hash. If the hashes could be requested somehow, perhaps as part of the
API, then that should be sufficient. Those hashes could be part of the XML
dump too, but if you have the XML dump and know the algorithm, then you
don't need the digest.
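As a concrete example of the "close matches" side, simhash is one common locality-sensitive scheme; a minimal sketch (the token-level hashing and the 64-bit width are arbitrary choices here):

    import hashlib

    def simhash(text, bits=64):
        # Each token votes on every bit of the fingerprint; similar texts
        # end up with fingerprints at a small Hamming distance.
        votes = [0] * bits
        for token in text.lower().split():
            h = int.from_bytes(
                hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if votes[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    a = simhash("The quick brown fox jumps over the lazy dog")
    b = simhash("The quick brown fox jumped over the lazy dog")
    print(hamming(a, b))  # small distance for near-identical texts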

There is a specific use case where someone wants to verify the content. In
that case you don't want to identify a previous revert; you want to check
whether someone has tampered with the downloaded content. As you don't know
who might have tampered with the content, you should also question the
digest delivered by WMF, so the digest in the database isn't good enough
as it is right now. Instead of a sha digest, each revision should be
properly signed, but then if you can't trust WMF, can you trust their
signature? Signatures for revisions should probably be delivered by some
external entity and not by WMF itself.

On Fri, Sep 15, 2017 at 11:44 PM, Daniel Kinzler <
daniel.kinzler@wikimedia.de> wrote:

> A revert restores a previous revision. It covers all slots.
>
> The fact that reverts, watching, protecting, etc still works per page,
> while you
> can have multiple kinds of different content on the page, is indeed the
> point of
> MCR.
>
> Am 15.09.2017 um 22:23 schrieb C. Scott Ananian:
> > Alternatively, perhaps "hash" could be an optional part of an MCR chunk?
> > We could keep it for the wikitext, but drop the hash for the metadata,
> and
> > drop any support for a "combined" hash over wikitext + all-other-pieces.
> >
> > ...which begs the question about how reverts work in MCR. Is it just the
> > wikitext which is reverted, or do categories and other metadata revert as
> > well? And perhaps we can just mark these at revert time instead of
> trying
> > to reconstruct it after the fact?
> > --scott
> >
> > On Fri, Sep 15, 2017 at 4:13 PM, Stas Malyshev <smalyshev@wikimedia.org>
> > wrote:
> >
> >> Hi!
> >>
> >> On 9/15/17 1:06 PM, Andrew Otto wrote:
> >>>> As a random idea - would it be possible to calculate the hashes
> >>> when data is transitioned from SQL to Hadoop storage?
> >>>
> >>> We take monthly snapshots of the entire history, so every month we’d
> >>> have to pull the content of every revision ever made :o
> >>
> >> Why? If you already seen that revision in previous snapshot, you'd
> >> already have its hash? Admittedly, I have no idea how the process works,
> >> so I am just talking out of general knowledge and may miss some things.
> >> Also of course you already have hashes from revs till this day and up to
> >> the day we decide to turn the hash off. Starting that day, it'd have to
> >> be generated, but I see no reason to generate one more than once?
> >> --
> >> Stas Malyshev
> >> smalyshev@wikimedia.org
> >>
> >> _______________________________________________
> >> Wikitech-l mailing list
> >> Wikitech-l@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >>
> >
> >
> >
>
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
On Tue, Sep 19, 2017 at 6:42 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de
> wrote:

> That table will be tall, and the sha1 is the (on average) largest field.
> If we
> are going to use a different mechanism for tracking reverts soon, my hope
> was
> that we can do without it.
>

Can't you just split it into a separate table? Core would only need to
touch it on insert/update, so that should resolve the performance concerns.

Also, since content is supposed to be deduplicated (so two revisions with
the exact same content will have the same content_address), cannot that
replace content_sha1 for revert detection purposes? That wouldn't work over
large periods of time (when the original revision and the revert live in
different kinds of stores) but maybe that's an acceptable compromise.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Can we drop revision hashes (rev_sha1)?
The revision hashes are also supposed to be used by at least some of the
import tools for XML dumps. The dumps would be less valuable without
some way to check their content. Generating hashes on the fly is surely
not an option, given that exports may also need to happen within the time of a
PHP request (Special:Export, for instance).

Nemo

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
