Mailing List Archive

[Wikimedia-l] Copy and Paste Detection Bot
The new and improved version of the copy and paste detection bot that we at
[[WP:MED]] have been using for nearly a year
[https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready
to be expanded to other topic areas.

It can be found here
[https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc]. If you install
the common.js code, it will give you buttons to click to indicate follow-up
of concerns. Additionally, one can sort the edits in question by
WikiProject. We are working to set up auto-archiving so that once
concerns are dealt with they are removed from the main list.

We also want to have automatic compilation of data such as the frequency of
true positives and false positives generated by the bot. A blacklist of
sites that are known mirrors of Wikipedia is here
[https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this
list is improved / expanded, the accuracy of the bot will improve. Many
thanks to [[User:ערן]] for his amazing work.
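
As a rough illustration of why a mirror blacklist helps accuracy, here is a
minimal Python sketch. It is not EranBot's code: the function names, the
example domains, and the idea of keeping the blacklist as a plain set of
domains are all assumptions made for illustration. It drops any matched
source whose host is a blacklisted domain or a subdomain of one, so that
text copied FROM Wikipedia by a mirror is not reported as copied INTO it.

```python
from urllib.parse import urlparse


def normalize_host(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    # urlparse only fills in the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blacklisted(url, blacklist):
    """True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = normalize_host(url)
    return any(host == d or host.endswith("." + d) for d in blacklist)


def filter_matches(match_urls, blacklist):
    """Keep only matched sources that are not known Wikipedia mirrors."""
    return [u for u in match_urls if not is_blacklisted(u, blacklist)]


# Hypothetical data: 'mirror.example' stands in for a known mirror site.
blacklist = {"mirror.example"}
matches = [
    "https://www.mirror.example/wiki/Foo",
    "http://en.mirror.example/a",
    "https://journal.example/paper",
]
print(filter_matches(matches, blacklist))
```

Matching on whole domain labels (rather than a plain substring test) keeps
an unrelated site such as notmirror.example from being filtered by accident.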

The bot also has the potential to work in other languages.

--
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
Re: [Wikimedia-l] Copy and Paste Detection Bot
Hi, James.

Is the source code available anywhere?
If you want to try your bot in other languages, I could help you with
testing on Russian Wikipedia :)

Best regards.
rubin16

2015-04-03 12:07 GMT+03:00 James Heilman <jmh649@gmail.com>:

Re: [Wikimedia-l] Copy and Paste Detection Bot
Hi James

I often suspect copy-paste and find exact matches of the text
elsewhere. However, whereas one can painstakingly (unless there is a
trick that I am not aware of) ascertain when text was entered into
an article, it is not always possible to know when the other text
first appeared on the internet, and so to know for sure who copied whom.
From my limited knowledge, I believe that some trace of the date of upload
must be retained somewhere in the code. Will this bot be able to pick
up on that and provide a date?

Thanks and congratulations to all involved and for sharing.

Regards,

Rui

2015-04-03 11:07 GMT+02:00 James Heilman <jmh649@gmail.com>:



--
_________________________
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant

Mobile Number in South Africa +27 74 425 4186
Número de Telemóvel na África do Sul +27 74 425 4186
_______________

Re: [Wikimedia-l] Copy and Paste Detection Bot
1) Yes, the source code is available. User:Eran has posted it here:
https://github.com/valhallasw/plagiabot

2) This bot ONLY works on new edits, within a couple of hours of them
occurring. This reduces the number of false positives. It DOES NOT look at
old edits.

3) This requires human follow-up and common sense. One needs to make sure
that a) the source is not PD/CC BY-SA, b) it is not wiki text that has
been moved around, c) the authors of both are not the same, etc.

4) The true positive rate is around 50%, which from my perspective is good /
useful. This bot has flagged a lot of copyright issues that would have been
missed otherwise.
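
To make points 2 and 4 concrete, here is a small Python sketch. It is not
taken from plagiabot: the two-hour window, the "TP" / "FP" / "open" labels
and the function names are assumptions for illustration only. "True
positive rate" is read here as the share of reviewed reports that a human
confirmed as real problems; reports still awaiting review are left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed value standing in for "a couple of hours".
RECENT_WINDOW = timedelta(hours=2)


def is_recent(edit_time, now, window=RECENT_WINDOW):
    """True if the edit happened no more than `window` before `now`."""
    age = now - edit_time
    return timedelta(0) <= age <= window


def true_positive_rate(verdicts):
    """Share of reviewed reports marked 'TP'; None if nothing was reviewed."""
    confirmed = sum(1 for v in verdicts if v == "TP")
    rejected = sum(1 for v in verdicts if v == "FP")
    reviewed = confirmed + rejected
    return confirmed / reviewed if reviewed else None


# Hypothetical reviewer verdicts; "open" means not yet followed up.
verdicts = ["TP", "FP", "TP", "FP", "open"]
print(true_positive_rate(verdicts))

now = datetime(2015, 4, 4, 12, 0, tzinfo=timezone.utc)
print(is_recent(now - timedelta(minutes=90), now))
print(is_recent(now - timedelta(days=3), now))
```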

--
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
Re: [Wikimedia-l] Copy and Paste Detection Bot
Thanks James

Just out of curiosity, the other day I found two articles with a long
section with identical wording, only names and numbers had been
changed. Example:
The town of ....... has a population of ...... . The town is known for
its challenges in fighting poverty. According to local authorities,
they have undertaken housing and sanitation projects bla bla bla.

When I queried it, the author of the earlier article responded to say
that 'it was acceptable' so that beginners could find it easier to
start writing articles. From that I dug deeper and discovered that he
had tutored the writer of the derived article.

Regards, and a great weekend,

Rui

2015-04-04 3:49 GMT+02:00 James Heilman <jmh649@gmail.com>: