Mailing List Archive

Spam filters for wikidata.org
Hi!

Once wikidata.org allows for entry of arbitrary properties, we will need some
protection against spam. However, there is a nasty little problem with making
SpamBlacklist, AntiBot, AbuseFilter etc work with Wikidata content:

Wikibase implements editing directly via the API, bypassing EditPage. But the
spam filters usually hook into EditPage, typically using the EditFilter or
EditFilterMerged hooks, or the latter's counterpart EditFilterMergedContent.

Wikibase has a utility class called EditEntity which implements many things
otherwise done by the EditPage: token checks, conflict detection and resolution,
permission checks, etc. We could just trigger EditFilterMergedContent there,
and also EditFilterMerged and EditFilter, though we would have to fake the
"text" for these.

There is one problem with this, though: these hooks take as their first parameter
an EditPage object, and the handler functions defined in the various extensions
make use of this. Often just to get the context, like the page title, but
often enough also for non-trivial things, like calling EditPage::spamPage() or
even EditPage::spamPageWithContent().

How can we handle this? I see several possibilities:

1) change the definition of the hook so it just has a ContextSource as its
first parameter, and fix all extensions that use the hook. However, it is
unclear how functionality like EditPage::spamPageWithContent() can then be
implemented. EditPage::spamPage() could be moved to a utility class, or into
OutputPage.

2) emulate an EditPage object, using a proxy/stub/dummy object. This would need
a bit of coding, and it's prone to get out of sync with the real EditPage. But
things like spamPageWithContent() could be implemented nicely, in a content
model specific manner.

3) we could instantiate a dummy EditPage, and pass that to the hooks. But
EditPage doesn't support non-text content, and even if we force it, we are
likely to end up with an edit field full of json, if we are not very careful.

4) just add another hook, similar to EditFilterMergedContent, but more generic,
and call it in EditEntity (and perhaps also in EditPage!). If we want a spam
filter extension to work with non-text content, it will have to implement that
new hook.
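
A minimal sketch of what option 4 could look like, assuming a hypothetical hook
name (EditFilterGenericContent) and a hypothetical FilterStatus class; neither
exists in MediaWiki or Wikibase. runHooks() is only a stand-in for wfRunHooks()
so that the snippet runs on its own, and the entity data is a made-up shape,
not the real Wikibase serialization. The point is that the handler receives a
title, a content model name and plain data, and never an EditPage:

```php
<?php
// Hypothetical result holder; handlers append message keys to reject an edit.
final class FilterStatus {
	public array $errors = [];

	public function isOK(): bool {
		return $this->errors === [];
	}
}

// Stand-in for MediaWiki's hook registry (not the real $wgHooks).
$hooks = [];

// Simplified stand-in for wfRunHooks(): calls each handler in turn and
// stops as soon as one of them returns false.
function runHooks( array $hooks, string $name, array $args ): bool {
	foreach ( $hooks[$name] ?? [] as $handler ) {
		if ( $handler( ...$args ) === false ) {
			return false;
		}
	}
	return true;
}

// A spam filter registers for the hypothetical generic hook.
$hooks['EditFilterGenericContent'][] = function (
	string $title, string $model, array $data, FilterStatus $status
): bool {
	// Inspect every string value, whatever the structure looks like.
	array_walk_recursive( $data, function ( $value ) use ( $status ) {
		if ( is_string( $value ) && stripos( $value, 'cheap-pills.example' ) !== false ) {
			$status->errors[] = 'spam-blacklisted-link';
		}
	} );
	return $status->isOK();
};

// What an EditEntity-like caller might do before saving.
function attemptSave( array $hooks, string $title, string $model, array $data ): FilterStatus {
	$status = new FilterStatus();
	runHooks( $hooks, 'EditFilterGenericContent', [ $title, $model, $data, $status ] );
	return $status;
}

$clean = attemptSave( $hooks, 'Q42', 'wikibase-item',
	[ 'label' => [ 'en' => 'Douglas Adams' ] ] );
$spam = attemptSave( $hooks, 'Q43', 'wikibase-item',
	[ 'description' => [ 'en' => 'buy at http://cheap-pills.example' ] ] );
```

EditPage could fire the same hook with wikitext wrapped in a one-element array,
which is what would let a single handler cover both kinds of content.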

What's the best option, do you think?

There's another closely related problem, btw: showing captchas. How can that be
implemented at all for API based, atomic edits? Would the API return a special
error, which includes a link to the captcha image as a challenge? And then
require the captcha's solution via some special arguments to the module call?
How can an extension control this? How is this done for the API's action=edit
at present?

thanks,
daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Spam filters for wikidata.org
On Tue, Dec 4, 2012 at 4:52 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> There's another closely related problem, btw: showing captchas. How can that be
> implemented at all for API based, atomic edits? Would the API return a special
> error, which includes a link to the captcha image as a challenge? And then
> require the captcha's solution via some special arguments to the module call?
> How can an extension control this? How is this done for the API's action=edit
> at present?

The ConfirmEdit extension hooks APIGetAllowedParams and
APIGetParamDescription to add its info to the help output, and
APIEditBeforeSave to check the captcha and/or add the captcha items to
the response.
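
The resulting round trip can be sketched roughly as follows. This is a
self-contained simulation, not ConfirmEdit's code: apiEdit() and the in-memory
$challenges store are hypothetical, and the captchaid / captchaword parameter
names and the shape of the "captcha" response item are assumptions modelled on
what ConfirmEdit appears to use:

```php
<?php
// Hypothetical in-memory store of open challenges: id => expected answer.
$challenges = [];

// Sketch of an API edit module with a captcha check in front of the save.
function apiEdit( array $params, array &$challenges ): array {
	$id = $params['captchaid'] ?? null;
	$word = $params['captchaword'] ?? null;

	if ( $id !== null && isset( $challenges[$id] ) ) {
		$expected = $challenges[$id];
		unset( $challenges[$id] ); // each challenge is single-use
		if ( $word === $expected ) {
			// The real module would save the edit here.
			return [ 'edit' => [ 'result' => 'Success' ] ];
		}
	}

	// No valid solution supplied: issue a fresh challenge instead of saving.
	$a = random_int( 1, 9 );
	$b = random_int( 1, 9 );
	$newId = bin2hex( random_bytes( 8 ) );
	$challenges[$newId] = (string)( $a + $b );

	return [ 'edit' => [
		'result' => 'Failure',
		'captcha' => [ 'type' => 'simple', 'id' => $newId, 'question' => "$a + $b" ],
	] ];
}

// First attempt: the client gets a challenge back instead of a saved edit.
$first = apiEdit( [ 'title' => 'Q42', 'text' => '...' ], $challenges );
$captcha = $first['edit']['captcha'];

// The client "solves" it and repeats the request with two extra parameters.
[ $x, $y ] = array_map( 'intval', explode( ' + ', $captcha['question'] ) );
$second = apiEdit( [
	'title' => 'Q42',
	'text' => '...',
	'captchaid' => $captcha['id'],
	'captchaword' => (string)( $x + $y ),
], $challenges );
```

The edit itself stays atomic: the first request is rejected as a whole, and the
second one carries everything needed to save.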

Re: Spam filters for wikidata.org
On 12/04/2012 04:52 AM, Daniel Kinzler wrote:
> 4) just add another hook, similar to EditFilterMergedContent, but more generic,
> and call it in EditEntity (and perhaps also in EditPage!). If we want a spam
> filter extension to work with non-text content, it will have to implement that
> new hook.

I think that makes sense. The spam filters will work best if they are
aware of how wikidata works, and have access to the full JSON
information of the change.

Matt Flaschen

Re: Spam filters for wikidata.org
On 04.12.2012 18:20, Matthew Flaschen wrote:
> On 12/04/2012 04:52 AM, Daniel Kinzler wrote:
>> 4) just add another hook, similar to EditFilterMergedContent, but more generic,
>> and call it in EditEntity (and perhaps also in EditPage!). If we want a spam
>> filter extension to work with non-text content, it will have to implement that
>> new hook.
>
> I think that makes sense. The spam filters will work best if they are
> aware of how wikidata works, and have access to the full JSON
> information of the change.

You really want the spam filter extensions to have internal knowledge of
Wikibase? That seems like a nasty cross-dependency, and goes directly against
the idea of modularization and separation of concerns...

We are running into the "glue code problem" here. We need code that knows about
the spam filters and about wikibase. Should it be in the spam filter, in
Wikibase, or in a separate, third extension? That would be cleanest, but a
hassle to maintain... Which way would you prefer?

-- daniel


Re: Spam filters for wikidata.org
On Wed, Dec 5, 2012 at 3:34 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> You really want the spam filter extensions to have internal knowledge of
> Wikibase? That seems like a nasty cross-dependency, and goes directly against
> the idea of modularization and separation of concerns...
>
> We are running into the "glue code problem" here. We need code that knows about
> the spam filters and about wikibase. Should it be in the spam filter, in
> Wikibase, or in a separate, third extension? That would be cleanest, but a
> hassle to maintain... Which way would you prefer?

I think Daniel has correctly stated the problem.

My perspective:

One of the directions of the Admin Tools project is to combine some of
the various tools into AbuseFilter, so I think it's safe to assume
that AbuseFilter will be around and maintained for some time, and
Wikidata could easily use the hooks it provides to do a lot of the
work providing the interface. That being said, expanding AbuseFilter
to work on non-article data has already been requested a few times, so
I think we can make AbuseFilter much easier for Wikidata and AFT to
plug into.

Maybe to start with, we can find out what AbuseFilter functionality
is common to AFT and Wikibase, and try to build
in most of the overlapping pieces into AbuseFilter. Then each can also
use the AbuseFilter hooks to complete the functionality?

Re: Spam filters for wikidata.org
On 12/05/2012 06:34 AM, Daniel Kinzler wrote:
>> I think that makes sense. The spam filters will work best if they are
>> aware of how wikidata works, and have access to the full JSON
>> information of the change.
>
> You really want the spam filter extensions to have internal knowledge of
> Wikibase? That seems like a nasty cross-dependency, and goes directly against
> the idea of modularization and separation of concerns...

I agree it should not have internal implementation knowledge. I meant
how it works in a different sense.

More specifically, what if Wikidata exposed a JSON object representing
an external version of each change (essentially a data API)?

It could allow hooks to register for this (I think this is similar to the
EditEntity idea).

Matt Flaschen

Re: Spam filters for wikidata.org
On 12/05/2012 12:28 PM, Chris Steipp wrote:
> On Wed, Dec 5, 2012 at 3:34 AM, Daniel Kinzler <daniel@brightbyte.de> wrote:
>> You really want the spam filter extensions to have internal knowledge of
>> Wikibase? That seems like a nasty cross-dependency, and goes directly against
>> the idea of modularization and separation of concerns...
>>
>> We are running into the "glue code problem" here. We need code that knows about
>> the spam filters and about wikibase. Should it be in the spam filter, in
>> Wikibase, or in a separate, third extension? That would be cleanest, but a
>> hassle to maintain... Which way would you prefer?
>
> I think Daniel has correctly stated the problem.
>
> My perspective:
>
> One of the directions of the Admin Tools project is to combine some of
> the various tools into AbuseFilter, so I think it's safe to assume
> that AbuseFilter will be around and maintained for some time, and
> Wikidata could easily use the hooks it provides to do a lot of the
> work providing the interface.

It makes sense for AbuseFilter and Wikidata to work in conjunction. But
it seems Wikidata should provide a hook that AbuseFilter calls.

What if someone wants to make a spam filter that works differently than
AbuseFilter? For example, it uses its own programmatic rules rather
than ones that can be expressed in the Special:AbuseFilter language.

If Wikidata exposes an API, AbuseFilter and other extensions can use it.

Matt Flaschen

Re: Spam filters for wikidata.org
On Wed, Dec 5, 2012 at 1:11 PM, Matthew Flaschen
<mflaschen@wikimedia.org> wrote:
> It makes sense for AbuseFilter and Wikidata to work in conjunction. But
> it seems Wikidata should provide a hook that AbuseFilter calls.

I think we agree on this point, although I'll clarify and say I think
AbuseFilter should be calling wfRunHooks, and Wikibase should provide
the functions. I think more 3rd-party wikis will run AbuseFilter than
Wikibase, but that could be my prejudice based on what I work on.

> What if someone wants to make a spam filter that works differently than
> AbuseFilter? For example, it uses its own programmatic rules rather
> than ones that can be expressed in the Special:AbuseFilter language.

You are correct, AbuseFilter doesn't currently have hooks to let an
extension run its own logic, but that wouldn't be too difficult to
implement. Maybe run a new hook from AbuseFilter::checkConditions?
Although I would be interested to know what kind of rules you have in
mind, since it's certainly possible that we would want to implement it
as an AbuseFilter operation.

Re: Spam filters for wikidata.org
On 12/05/2012 05:54 PM, Chris Steipp wrote:
> On Wed, Dec 5, 2012 at 1:11 PM, Matthew Flaschen
> <mflaschen@wikimedia.org> wrote:
>> It makes sense for AbuseFilter and Wikidata to work in conjunction. But
>> it seems Wikidata should provide a hook that AbuseFilter calls.
>
> I think we agree on this point, although I'll clarify and say I think
> AbuseFilter should be calling wfRunHooks, and Wikibase should provide
> the functions.

No, we disagree on this.

Wikibase should call wfRunHooks. This is analogous to the way it is now
for regular wikitext.

For example, AbuseFilter has:

$wgHooks['EditFilterMerged'][] = 'AbuseFilterHooks::onEditFilterMerged';

Then, core MediaWiki calls:

if ( !wfRunHooks( 'EditFilterMerged', array( $this, $this->textbox1,
&$this->hookError, $this->summary ) ) ) {

The same general idea should apply for Wikibase. The only difference is
that the core functionality of data editing is in Wikibase.

Thus, Wikibase should call wfRunHooks for this.

>> What if someone wants to make a spam filter that works differently than
>> AbuseFilter? For example, it uses its own programmatic rules rather
>> than ones that can be expressed in the Special:AbuseFilter language.
>
> You are correct, AbuseFilter doesn't currently have hooks to let an
> extension run its own logic, but that wouldn't be too difficult to
> implement.

I don't think it necessarily needs one. A spam filter with a different
approach (which may not have a rule UI at all) can register its own
hooks, just as AbuseFilter does.

> Although I would be interested to know what kind of rules you have in
> mind, since it's certainly possible that we would want to implement it
> as an AbuseFilter operation.

I don't have an immediate practical suggestion. But I do know that
modern spam filters use a variety of approaches, including Bayesian
filtering.

Matt Flaschen

Re: Spam filters for wikidata.org
On Wed, Dec 5, 2012 at 3:53 PM, Matthew Flaschen
<mflaschen@wikimedia.org> wrote:
> No, we disagree on this.

I was afraid that might be the case, so I'm glad we clarified.

> The same general idea should apply for Wikibase. The only difference is
> that the core functionality of data editing is in Wikibase.

Correct, and I would say that Wikibase should be calling the same
hooks that core does, so that AbuseFilter can be used to filter all
incoming data. If Wikibase wants to define another hook, and can
present the data in a generic way (like Daniel did for content
handler) we can probably add it into AbuseFilter. But if the
processing is specific to Wikibase (you pass an Entity into the hook,
for example), then AbuseFilter shouldn't be hooking into something
like that, since it would basically make Wikibase a dependency, and I
do think that more independent wikis are likely to have AbuseFilter
installed without Wikibase than with it.

> I don't think it necessarily needs one. A spam filter with a different
> approach (which may not have a rule UI at all) can register its own
> hooks, just as AbuseFilter does.

I can definitely appreciate that, but that is also why we currently
have so many extensions for spam / bot handling, using the existing
hooks. I would hate to see yet another spam extension that does really
great spam detection, but has a dependency on Wikibase.

But that's just my preference.

Re: Spam filters for wikidata.org
On 12/05/2012 07:55 PM, Chris Steipp wrote:
> If Wikibase wants to define another hook, and can
> present the data in a generic way (like Daniel did for content
> handler) we can probably add it into AbuseFilter.

It should be presented in a suitable way (not obscure Wikibase internal
structures), that still includes the necessary information.

> But if the processing is specific to Wikibase (you pass an Entity into the hook,
> for example), then AbuseFilter shouldn't be hooking into something
> like that, since it would basically make Wikibase a dependency, and I
> do think that more independent wikis are likely to have AbuseFilter
> installed without Wikibase than with it.

AbuseFilter would not depend on Wikibase if AbuseFilter only hooks into it.

It's fine for you to register a hook that is never called:

$wgHooks[ 'WikibaseEditFilterMerged' ][] =
'AbuseFilter::onWikibaseEditFilterMerged';

will not cause an error if Wikibase is not installed.
onWikibaseEditFilterMerged would then transform the data and call
internal AbuseFilter functions/methods.
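
To illustrate why this is harmless, here is a small self-contained sketch.
$registry and fireHook() are stand-ins for $wgHooks and wfRunHooks() (assumed,
simplified behaviour), and the handler body is a placeholder rather than real
AbuseFilter code:

```php
<?php
// Stand-in for $wgHooks, so this runs without MediaWiki.
$registry = [];

// Simplified stand-in for wfRunHooks(): a hook without handlers is a no-op.
function fireHook( array $registry, string $name, array $args = [] ): bool {
	foreach ( $registry[$name] ?? [] as $handler ) {
		if ( $handler( ...$args ) === false ) {
			return false;
		}
	}
	return true;
}

// "AbuseFilter" side: register for the Wikibase hook, whether or not
// Wikibase is installed. Registration only stores a callable.
$calls = 0;
$registry['WikibaseEditFilterMerged'][] = function ( array $change ) use ( &$calls ): bool {
	$calls++;
	// Transform $change and hand it to internal filter logic here.
	return true;
};

// Wiki without Wikibase: only core hooks are ever fired, so the handler
// above is simply never invoked.
$okCore = fireHook( $registry, 'EditFilterMerged', [ 'some wikitext' ] );
$callsWithoutWikibase = $calls;

// Wiki with Wikibase: the extension fires its own hook on each edit.
$okWikibase = fireHook( $registry, 'WikibaseEditFilterMerged', [ [ 'label' => 'foo' ] ] );
$callsWithWikibase = $calls;
```

The only coupling is the hook name and the shape of the data passed to it.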

>> I don't think it necessarily needs one. A spam filter with a different
>> approach (which may not have a rule UI at all) can register its own
>> hooks, just as AbuseFilter does.
>
> I can definitely appreciate that, but that is also why we currently
> have so many extensions for spam / bot handling, using the existing
> hooks. I would hate to see yet another spam extension that does really
> great spam detection, but has a dependency on Wikibase.

I think inevitably different people are going to address the spam
challenge differently. By using hooks, though, that great extension
does not need a hard dependency on Wikibase.

Matt Flaschen

Re: Spam filters for wikidata.org
On 06.12.2012 01:55, Chris Steipp wrote:
>> The same general idea should apply for Wikibase. The only difference is
>> that the core functionality of data editing is in Wikibase.
>
> Correct, and I would say that Wikibase should be calling the same
> hooks that core does, so that AbuseFilter can be used to filter all
> incoming data.

That would be great, but as I pointed out in my original mail, not really
possible: the existing hooks guarantee an EditPage as a parameter. There is no
EditPage when editing Wikibase content, and I can see no sensible way to create
one for this purpose.

> If Wikibase wants to define another hook, and can
> present the data in a generic way (like Daniel did for content
> handler) we can probably add it into AbuseFilter.

We can present (some of) the data as plain text, but that removes a lot of
information that could be used for spam detection. Maybe AbuseFilter is flexible
enough to be able to handle more aspects using "variables". But that would
require Wikibase to know about AbuseFilter, and specifically cater to it (or the
other way around).
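
To show the difference between the two representations, here is a rough sketch.
flattenToVariables() is a hypothetical helper, the dotted variable names are
made up, and the $change array is an invented shape rather than the real
Wikibase serialization:

```php
<?php
// Flatten nested entity-like data into dotted variable names, so a rule
// engine can address individual fields instead of one blob of text.
function flattenToVariables( array $data, string $prefix = '' ): array {
	$vars = [];
	foreach ( $data as $key => $value ) {
		$name = $prefix === '' ? (string)$key : "$prefix.$key";
		if ( is_array( $value ) ) {
			$vars += flattenToVariables( $value, $name );
		} else {
			$vars[$name] = (string)$value;
		}
	}
	return $vars;
}

// Invented example data; property and site IDs are placeholders.
$change = [
	'labels' => [ 'en' => 'Douglas Adams' ],
	'sitelinks' => [ 'enwiki' => 'Douglas Adams' ],
	'claims' => [ 'P1' => 'http://example.org/' ],
];

$vars = flattenToVariables( $change );

// The plain-text rendering keeps the values but loses where each came from.
$plainText = implode( "\n", $vars );
```

A filter working on $plainText cannot tell a label from a sitelink, while one
working on $vars can, at the price of having to know the variable names.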

> But if the
> processing is specific to Wikibase (you pass an Entity into the hook,
> for example), then AbuseFilter shouldn't be hooking into something
> like that, since it would basically make Wikibase a dependency, and I
> do think that more independent wikis are likely to have AbuseFilter
> installed without Wikibase than with it.

No, that is not a dependency in the strong sense; you could easily run one
without the other. But it does imply knowledge. So, should Wikibase have
knowledge of, and contain code specific to, AbuseFilter, or the other way around?

Honestly, I don't like either very much.

>> I don't think it necessarily needs one. A spam filter with a different
>> approach (which may not have a rule UI at all) can register its own
>> hooks, just as AbuseFilter does.

But then Wikibase needs to know about each of them, and implement hook handlers
for each. Or am I misunderstanding you?


So... we are still facing the Glue Code Dilemma.

-- daniel


Re: Spam filters for wikidata.org
On 05.12.2012 22:06, Matthew Flaschen wrote:
> More specifically, what if Wikidata exposed a JSON object representing
> an external version of each change (essentially a data API)?

This already exists, that's more or less how changes get pushed to client wikis.

> It could allow hooks to register for this (I think this is similar to the
> EditEntity idea).

Pretty much the same, actually, yes. Wikibase defines a hook and provides the
data structure. Then, AbuseFilter would need knowledge about Wikibase's data
model(s).

-- daniel

Re: Spam filters for wikidata.org
On 12/06/2012 05:22 AM, Daniel Kinzler wrote:
> On 05.12.2012 22:06, Matthew Flaschen wrote:
>> More specifically, what if Wikidata exposed a JSON object representing
>> an external version of each change (essentially a data API)?
>
> This already exists, that's more or less how changes get pushed to client wikis.
>
>> It could allow hooks to register for this (I think this is similar to the
>> EditEntity idea).
>
> Pretty much the same, actually, yes. Wikibase defines a hook and provides the
> data structure. Then, AbuseFilter would need knowledge about Wikibase's data
> model(s).

Right, but as you said that doesn't introduce a strong/hard dependency.
I think this is the best solution.

Matt Flaschen
