Mailing List Archive

Project Idea for GSoC 2013 - Bayesian Spam Filter
Hi,

I am Anubhav Agarwal, a B.Tech 4th Year student at IIT Roorkee. I wish to
apply for GSoC 2013 and I am thinking about Bayesian Spam Filter as a
project for the same. I have drafted the idea on my
talk page <http://www.mediawiki.org/wiki/User:Anubhav_iitr>.

I request you to go through this and give your suggestions on it.

Hoping for good feedback.

Regards,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
Hi Anubhav,

On 04/07/2013 06:05 PM, anubhav agarwal wrote:
> Hi,
>
> I am Anubhav Agarwal, a B.Tech 4th Year student at IIT Roorkee. I wish to
> apply for GSoC 2013 and I am thinking about Bayesian Spam Filter as a
> project for the same. I have drafted the idea on my
> talk page <http://www.mediawiki.org/wiki/User:Anubhav_iitr>.

I have done a first reality check with Chris Steipp, who oversees the
area of security and also spam prevention. Your idea is interesting and
it seems to be feasible. This is a very good first step!

It would require adding a hook to MediaWiki core, but this could be a
small, acceptable change. The rest could be developed as an extension of
the ConfirmEdit extension.

It might have a performance penalty in a site like English Wikipedia
with plenty of concurrent edits, but for starters it could be
potentially useful to the 99% of MediaWiki instances that have a
significantly smaller number of daily edits and especially a very small
number of editors and tools able / happy to deal with spam.

As a next step, please

1. Create a subpage for your proposal e.g.
http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter

2. File an enhancement request at
https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions
under "Extensions requests" explaining your proposal and linking to the
related wiki page.

3. Reply to this thread sharing the link to the bug report so anybody
interested can watch it.


> I request you to go through this and give your suggestions on it.

Yes, but you will get more feedback if you are diligent in answering the
feedback received:

http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr :)


--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On 09/04/13 18:20, Quim Gil wrote:
> Hi Anubhav,
>
> I have done a first reality check with Chris Steipp, who oversees the
> area of security and also spam prevention. Your idea is interesting and
> it seems to be feasible. This is a very good first step!
>
> It would require adding a hook to MediaWiki core, but this could be a
> small, acceptable change.
I agree. Adding a hook is no problem.

> The rest could be developed as an extension of
> the ConfirmEdit extension.

I'm not sure on adding it to ConfirmEdit. I would develop it as an
independent extension, which could then hook into ConfirmEdit or
AbuseFilter.

Anubhav wrote:
> Tasks
>
> Create a tool for wiki users to report spam. A simple way to
> train the Bayesian DB. This should be accessible for any user
> with the permissions to "undo" or "rollback" those changes or to
> delete the new page/file. Understanding the metadata (IP, links,
> user) I can extract from the data (perhaps harnessing other
> services like blacklists).

I think it would be more interesting if it could be trained
automatically. Perhaps by automatically learning rollbacks as "wrong".
Maybe there could be a checkbox to "train as spam" when doing a revert,
but I would avoid anything complex like "Go to Special:TrainSpam and
enter the revision number to mark as spam".
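
A minimal sketch of the automatic-training idea, written in Python for brevity rather than as MediaWiki extension code. The class NaiveBayesSpamDB, the callback on_revert and its train_as_spam flag are hypothetical names made up for this illustration, and the token model (lowercased words, Laplace smoothing) is an assumption, not something taken from the proposal:

```python
import math
import re
from collections import Counter


class NaiveBayesSpamDB:
    """Toy stand-in for the 'Bayesian DB': per-class token and document counts."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, label):
        self.tokens[label].update(self.tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        words = self.tokenize(text)
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) or 1
        total_docs = self.docs["spam"] + self.docs["ham"]
        score = {}
        for label in ("spam", "ham"):
            total = sum(self.tokens[label].values())
            # Laplace-smoothed prior and likelihoods, summed in log space.
            s = math.log((self.docs[label] + 1) / (total_docs + 2))
            for w in words:
                s += math.log((self.tokens[label][w] + 1) / (total + vocab))
            score[label] = s
        diff = score["ham"] - score["spam"]
        if diff > 700:  # avoid overflow in exp(); ham is overwhelmingly likely
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))


def on_revert(db, reverted_text, train_as_spam):
    """Hypothetical revert callback: learn only when the checkbox was ticked."""
    if train_as_spam:
        db.train(reverted_text, "spam")
```

Here the person doing the revert never leaves the normal workflow: ticking the box is the only extra step, and unticked reverts leave the counts untouched.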

Good luck!


Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On 2013-04-12 7:33 PM, "Platonides" <Platonides@gmail.com> wrote:
>
> On 09/04/13 18:20, Quim Gil wrote:
> > Hi Anubhav,
> >
> > I have done a first reality check with Chris Steipp, who oversees the
> > area of security and also spam prevention. Your idea is interesting and
> > it seems to be feasible. This is a very good first step!
> >
> > It would require adding a hook to MediaWiki core, but this could be a
> > small, acceptable change.
> I agree. Adding a hook is no problem.
>

Well, a hook is obviously no problem, but I'm not sure why a new one would be
needed. Surely if AbuseFilter has all the hooks it needs, so would
this.

Qgil wrote:
> It might have a performance penalty in a site like English Wikipedia with
> plenty of concurrent edits, but for starters it could be potentially useful
> to the 99% of MediaWiki instances that have a significantly smaller number
> of daily edits and especially a very small number of editors and tools able
> / happy to deal with spam.

Hmm. I was playing with NLP-ish automated newpage patrol recently. One
thing that crossed my mind was that if it becomes too expensive, one could run
the classifier in the job queue (and hence on dedicated server(s)) and
tag changes shortly after the fact.
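
A rough sketch of that deferred approach, assuming Python's standard queue module as a stand-in for MediaWiki's job queue. The names on_save, worker, looks_like_spam and the "possible-spam" tag are hypothetical, and the placeholder classifier is only there to make the example runnable:

```python
import queue
import threading

edit_queue = queue.Queue()
tags = {}  # revision id -> tag applied after the fact


def looks_like_spam(text):
    # Placeholder classifier; a real one would be the Bayesian filter.
    return "cheap pills" in text.lower()


def on_save(rev_id, text):
    """Runs on the save path: enqueue only, so saving stays fast."""
    edit_queue.put((rev_id, text))


def worker():
    """Runs elsewhere (e.g. on a dedicated job runner) and tags revisions."""
    while True:
        item = edit_queue.get()
        if item is None:  # sentinel: stop the worker
            edit_queue.task_done()
            break
        rev_id, text = item
        if looks_like_spam(text):
            tags[rev_id] = "possible-spam"
        edit_queue.task_done()
```

The edit is saved immediately and the classification cost is paid later by the worker, which only tags the revision rather than blocking or reverting it.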

Last of all, I would suggest you also read up on other people who have taken
machine learning approaches to vandalism detection, in particular
User:Cluebot_NG - http://en.wikipedia.org/wiki/User:Cluebot_NG . There is
also a list of academic papers on the subject at
http://en.wikipedia.org/w/index.php?title=User:Emijrp/Anti-vandalism_bot_census
(that said, an extension like the one you are proposing does not have to be as
good as the rather complex state of the art in order to be useful. Any
effective system would probably be quite useful).

-bawolff

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On Sat, Apr 13, 2013 at 2:42 AM, Brian Wolff <bawolff@gmail.com> wrote:
>
> Qgil wrote:
>> It might have a performance penalty in a site like English Wikipedia with
>> plenty of concurrent edits, but for starters it could be potentially useful
>> to the 99% of MediaWiki instances that have a significantly smaller number
>> of daily edits and especially a very small number of editors and tools able
>> / happy to deal with spam.
>
> Hmm. I was playing with NLP-ish automated newpage patrol recently. One
> thing that crossed my mind was that if it becomes too expensive, one could run
> the classifier in the job queue (and hence on dedicated server(s)) and
> tag changes shortly after the fact.

We have Parsoid running separately, don't we? Perhaps the same
approach could work here as well.

>
> -bawolff



--
Best regards,
Павел Селіцкас/Pavel Selitskas
Wizardist @ Wikimedia projects

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
Hey Quim,

Thanks for such a detailed response. Sorry for being inactive these past few
days; I was undergoing some coursework evaluations.

On Tue, Apr 9, 2013 at 9:50 PM, Quim Gil <qgil@wikimedia.org> wrote:

> Hi Anubhav,
>
>
> On 04/07/2013 06:05 PM, anubhav agarwal wrote:
>
>> Hi,
>>
>> I am Anubhav Agarwal, a B.Tech 4th Year student at IIT Roorkee. I wish to
>> apply for GSoC 2013 and I am thinking about Bayesian Spam Filter as a
>> project for the same. I have drafted the idea on my
>> talk page <http://www.mediawiki.org/wiki/User:Anubhav_iitr>.
>>
>
> I have done a first reality check with Chris Steipp, who oversees the area
> of security and also spam prevention. Your idea is interesting and it seems
> to be feasible. This is a very good first step!
>
> It would require adding a hook to MediaWiki core, but this could be a
> small, acceptable change. The rest could be developed as an extension of
> the ConfirmEdit extension.
>
> It might have a performance penalty in a site like English Wikipedia with
> plenty of concurrent edits, but for starters it could be potentially useful
> to the 99% of MediaWiki instances that have a significantly smaller number
> of daily edits and especially a very small number of editors and tools able
> / happy to deal with spam.
>

I was thinking of using a job queue for big websites like Wikipedia:
each edit would go into a queue that is processed offline, and the edit would
later be rolled back to the original content if it triggers the alarm.


>
> As a next step, please
>
> 1. Create a subpage for your proposal e.g.
> http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter
>
> 2. File an enhancement request at
> https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions
> under "Extensions requests" explaining your proposal and linking to the
> related wiki page.
>
> 3. Reply to this thread sharing the link to the bug report so anybody
> interested can watch it.
>
>
Here is the link to the bug, as you asked:
<https://bugzilla.wikimedia.org/show_bug.cgi?id=47207>


>
>
>> I request you to go through this and give your suggestions on it.
>
> Yes, but you will get more feedback if you are diligent in answering the
> feedback received:
>
> http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr :)
>
>
> --
> Quim Gil
> Technical Contributor Coordinator @ Wikimedia Foundation
> http://www.mediawiki.org/wiki/User:Qgil
>




--
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
Hi Platonides,

On Sat, Apr 13, 2013 at 4:04 AM, Platonides <Platonides@gmail.com> wrote:

> On 09/04/13 18:20, Quim Gil wrote:
> > Hi Anubhav,
> >
> > I have done a first reality check with Chris Steipp, who oversees the
> > area of security and also spam prevention. Your idea is interesting and
> > it seems to be feasible. This is a very good first step!
> >
> > It would require adding a hook to MediaWiki core, but this could be a
> > small, acceptable change.
> I agree. Adding a hook is no problem.
>
> > The rest could be developed as an extension of
> > the ConfirmEdit extension.
>
> I'm not sure on adding it to ConfirmEdit. I would develop it as an
> independent extension, which could then hook into ConfirmEdit or
> AbuseFilter.
>
> Anubhav wrote:
> > Tasks
> >
> > Create a tool for wiki users to report spam. A simple way to
> > train the Bayesian DB. This should be accessible for any user
> > with the permissions to "undo" or "rollback" those changes or to
> > delete the new page/file. Understanding the metadata (IP, links,
> > user) I can extract from the data (perhaps harnessing other
> > services like blacklists).
>
> I think it would be more interesting if it could be trained
> automatically. Perhaps by automatically learning rollbacks as "wrong".
> Maybe there could be a checkbox to "train as spam" when doing a revert,
> but I would avoid anything complex like "Go to Special:TrainSpam and
> enter the revision number to mark as spam".
>

I don't think we can take rollbacks into account for automated learning.
Someone who edited the document and then rolled it back did not necessarily
do so because it was spam.

That said, a "Train as spam" checkbox is a good idea. I was thinking of a
"report spam" button next to the "edit" button at the top-right corner
of a section.


>
> Good luck!
>
>
>



--
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On 14/04/13 15:41, anubhav agarwal wrote:
> I don't think we can take rollbacks into account for automated learning.
> Someone who edited the document and then rolled it back did not necessarily
> do so because it was spam.

Getting the right data to train from is hard, since a wiki is so flexible.
The good thing about rollback is that a) it's easy to detect, b) it's
restricted (a random user can't use it) and c) on some wikis, policy
restricts its use to “clearly bad edits”.

So you _should_ be training with "unwanted edits". But there will be
false positives.
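
A small sketch of how training examples might be selected along those lines, in Python. The RevertEvent record, its field names and the "unwanted" label are hypothetical; a real extension would read this information from MediaWiki's own revision and log data:

```python
from dataclasses import dataclass


@dataclass
class RevertEvent:
    rev_id: int
    method: str                  # "rollback", "undo" or "manual" (illustrative)
    reverter_can_rollback: bool  # the reverting user holds the rollback right
    reverted_text: str


def training_examples(events):
    """Keep only reverts made with the restricted, easy-to-detect rollback tool."""
    return [
        (e.reverted_text, "unwanted")
        for e in events
        if e.method == "rollback" and e.reverter_can_rollback
    ]
```

The resulting labels are still noisy: a rollback marks an edit as unwanted, not necessarily as spam, so some of these examples will be false positives.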



> That said, a "Train as spam" checkbox is a good idea. I was thinking of a
> "report spam" button next to the "edit" button at the top-right corner
> of a section.

However, that only tells you that "somewhere in the page there is spam",
not what the spam is (the last revision? an edit from 2 months ago?), nor
does it encourage anyone to fix it.


> I was thinking of using a job queue for big websites like Wikipedia:
> each edit would go into a queue that is processed offline, and the edit would
> later be rolled back to the original content if it triggers the alarm.

I'm not a big fan of this. You will have edit conflicts to handle, and
it looks messy to have reverts made by an extension. I recommend that you
work on the Bayesian detection of spam, and leave for later any refactoring
needed to make it run through the job queue.
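
To illustrate the edit-conflict concern, here is a toy Python sketch. The Page class, handle_flagged and the "possible-spam" tag are hypothetical, and the revision id allocation is deliberately simplistic. A delayed job can only revert safely if the flagged revision is still the latest one; otherwise it would discard someone else's later edit, so it falls back to tagging:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    revisions: list = field(default_factory=list)  # oldest first: (rev_id, text)
    tags: dict = field(default_factory=dict)

    def latest_id(self):
        return self.revisions[-1][0]


def handle_flagged(page, flagged_rev_id):
    """Revert only when nobody has edited since; otherwise just tag."""
    if len(page.revisions) >= 2 and page.latest_id() == flagged_rev_id:
        previous_text = page.revisions[-2][1]
        new_id = flagged_rev_id + 1  # toy id allocation
        page.revisions.append((new_id, previous_text))
        return "reverted"
    page.tags[flagged_rev_id] = "possible-spam"
    return "tagged"
```

Even this simple check shows why an automatic revert is the harder part: the classifier itself does not depend on it and can be developed first.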

I think I could look in the archives of deleted pages from the WM-ES
wiki for spam data for you.

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On 04/14/2013 06:34 AM, anubhav agarwal wrote:
> Hey Quim,
>
> Thanks for such a detailed response. Sorry for being inactive these past few
> days; I was undergoing some coursework evaluations.

I hope they went well. First things first!

You have some homework to do here as well. It is time to start drafting
your application, open a related feature request in Bugzilla and find a
mentor. See

https://www.mediawiki.org/wiki/Mentorship_programs/Application_template

--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
Hey Quim,

I have drafted my proposal on my user page
<https://www.mediawiki.org/wiki/User:Anubhav_iitr>.
I have already filed the extension request in Bugzilla:
<https://bugzilla.wikimedia.org/show_bug.cgi?id=47207>.


I would be glad to have your feedback.
Can you suggest whom I should ask to mentor me?


On Mon, Apr 15, 2013 at 10:50 PM, Quim Gil <qgil@wikimedia.org> wrote:

> On 04/14/2013 06:34 AM, anubhav agarwal wrote:
>
>> Hey Quim,
>>
>> Thanks for such a detailed response. Sorry for being inactive these past
>> few days; I was undergoing some coursework evaluations.
>>
>
> I hope they went well. First things first!
>
> You have some homework to do here as well. It is time to start drafting
> your application, open a related feature request in Bugzilla and find a
> mentor. See
>
> https://www.mediawiki.org/wiki/Mentorship_programs/Application_template
>
>
> --
> Quim Gil
> Technical Contributor Coordinator @ Wikimedia Foundation
> http://www.mediawiki.org/wiki/User:Qgil
>
>



--
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee

Re: Project Idea for GSoC 2013 - Bayesian Spam Filter
On 04/23/2013 05:42 AM, anubhav agarwal wrote:
> Hey Quim,
>
> I have drafted my proposal on my user page
> <https://www.mediawiki.org/wiki/User:Anubhav_iitr>.
> I have already filed the extension request in Bugzilla:
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=47207>.
>
>
> I would be glad to have your feedback.
> Can you suggest whom I should ask to mentor me?

Chris is willing to co-mentor, but not alone. I asked another potential
co-mentor but we are still waiting for his answer. Anybody interested?
MediaWiki extension development skills required.

In any case, please apply to GSoC formally. You don't need to have the
mentors assigned to do this and you can keep improving your proposal
until the deadline.

--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
