Mailing List Archive

Is fuzzyocr i.e. Image scanning
Good day Guys

I am getting quite a bit of image spam, and googling put me in the
direction of a tool called FuzzyOCR.

What I did was configure Vagrant to install SpamAssassin and FuzzyOCR,
but FuzzyOCR does not appear to be catching my spam (although the
provided example works).

Before I go down the road of installing and configuring fuzzyocr on my
MTA, I thought I would double check with the spamassassin community and
ask is there still a place for image scanning in 2018?

The documentation is fairly old, so it got me wondering whether image
scanning is an old technology and method.

Thanks in advance.

Regards
Brent
P.s. Here is a pastebin link of what I am seeing.
https://pastebin.com/raw/gurvFrZw
Re: Is fuzzyocr i.e. Image scanning
Apologies for the subject.

It was meant to read "Is fuzzyocr i.e. Image scanning, warranted in 2018"

Regards
Brent

On 2018/10/12 15:11, Brent Clark wrote:
> Good day Guys
>
> I am getting quite a bit of image spam, and googling put me in the
> direction of a tool called FuzzyOCR.
>
> What I did was configure Vagrant to install SpamAssassin and FuzzyOCR,
> but FuzzyOCR does not appear to be catching my spam (although the
> provided example works).
>
> Before I go down the road of installing and configuring fuzzyocr on my
> MTA, I thought I would double check with the spamassassin community and
> ask is there still a place for image scanning in 2018?
>
> The documentation is fairly old, so it got me wondering whether image
> scanning is an old technology and method.
>
> Thanks in advance.
>
> Regards
> Brent
> P.s. Here is a pastebin link of what I am seeing.
> https://pastebin.com/raw/gurvFrZw
>
>
Re: Is fuzzyocr i.e. Image scanning
Good day Guys

I was fortunate that someone emailed me privately, but is there no one
else who has anything they can share? (It is not only for me, but for the
community as a whole.) I'm sure there are others out there whose users
are dealing with this nonsense.

Please share.

Regards
Brent

Re: Is fuzzyocr i.e. Image scanning
On Mon, 15 Oct 2018, Brent Clark wrote:

> Good day Guys
>
> I was fortunate that someone emailed me privately, but is there no one else
> who has anything they can share? (It is not only for me, but for the community
> as a whole.) I'm sure there are others out there whose users are dealing with
> this nonsense.
>
> Please share.

Text obfuscation via images comes and goes. I've noticed for a while that
it seems to be in the "coming" phase again. I have been getting 419 frauds
where the pitch is in an image.

It might be reasonable to review and freshen the fuzzyOCR code.


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
But if there is no such inalienable right [to self defense], the
entire nature of the social contract is changed. Each man’s worth
is measured solely by his utility to the state, and as such the
value of his life rides a roller coaster not unlike the stock
market: dependent not only upon the preferences of the party in
power but upon the whims of its political leaders and the
permanent bureaucratic class. -- Mike McDaniel
-----------------------------------------------------------------------
564 days since the first commercial re-flight of an orbital booster (SpaceX)
Re: Is fuzzyocr i.e. Image scanning
Brent,

I have FuzzyOCR installed and running, but the only rule that was
triggered (22 times during the past 40 days) was FUZZY_OCR_WRONG_CTYPE,
meaning that the image type does not match the Content-Type declared in
the MIME headers.

That is still a valid catch, but not based on the OCR'ed text.
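
To illustrate what a content-type mismatch check of this kind looks at, here is a minimal Python sketch. It is not FuzzyOCR's code: the function names and the short table of magic bytes are assumptions made for the example, and it only uses the standard library `email` package.

```python
from email.message import Message
from typing import List, Optional, Tuple

# Leading "magic" bytes for a few common image formats (not exhaustive).
MAGIC = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/gif": b"GIF8",
    "image/jpeg": b"\xff\xd8\xff",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess the image MIME type from the first bytes, or None if unknown."""
    for mime, magic in MAGIC.items():
        if data.startswith(magic):
            return mime
    return None


def mismatched_image_parts(msg: Message) -> List[Tuple[str, Optional[str]]]:
    """Return (declared, sniffed) pairs for image parts whose bytes disagree
    with the declared Content-Type."""
    mismatches = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        declared = part.get_content_type()
        payload = part.get_payload(decode=True) or b""
        actual = sniff_image_type(payload)
        if actual != declared:
            mismatches.append((declared, actual))
    return mismatches
```

A part declared as image/png whose payload starts with `GIF8` would be reported as `("image/png", "image/gif")`. Formats outside the small table are reported with `None`, so a real check would need a fuller table.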

One of my reservations about FuzzyOCR is that you have to provide an
independent word list, while we have a very good tool to analyze text
contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
the OCR'ed text back to SA for further analysis (the way pdfAssassin
works). But then we need a way to detect that the OCR process has
worked, i.e. that more or less valid text in a valid language has been
extracted.

Another approach I like is that of Image Cerberus (dig in
http://prag.diee.unica.it/amilab), which uses metadata of the image
(size, histogram of colours, etc.) to classify the image as probable
spam or probable ham and then applies a Bayes classifier.

As for your question about the place for image scanning, if your MTA has
the resources to do so, why not? And even if FuzzyOCR is not yet the
ultimate OCR solution, it is still improving, so why give up a tool that
can help?

Regards,

Olivier
--
Re: Is fuzzyocr i.e. Image scanning
Olivier,

Thank you *ever* so much for replying.
Regards
Brent

Re: Is fuzzyocr i.e. Image scanning
On Tue, 16 Oct 2018 11:49:54 +0700
Olivier wrote:


> One of my reservations about FuzzyOCR is that you have to provide an
> independent word list, while we have a very good tool to analyze text
> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
> the OCR'ed text back to SA for further analysis (the way pdfAssassin
> works).

That works as long as the OCR remains very accurate. What happened
before was that the deployment of OCR led spammers to make their text
much less readable.


> As for your question about the place for image scanning, if your MTA
> has the resources to do so, why not?

Because it's better if it's combined with other information.
Re: Is fuzzyocr i.e. Image scanning
>On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
>> One of my reservations about FuzzyOCR is that you have to provide an
>> independent word list, while we have a very good tool to analyze text
>> contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
>> the OCR'ed text back to SA for further analysis (the way pdfAssassin
>> works).

On 16.10.18 13:34, RW wrote:
>That works as long as the OCR remains very accurate. What happened
>before was that the deployment of OCR led spammers to make their text
>much less readable.

I think the original reason was that the available OCR programs were not
reliable enough.

I tested gocr, ocrad and tesseract more than 10 years ago, with not very
satisfying results, gocr being the best at that time.

Since then, Google took tesseract and made it much better.

I believe that it would now be viable to push OCR output to
spamassassin for processing with Bayes and other rules.


>> As for your question about the place for image scanning, if your MTA
>> has the resources to do so, why not?
>
>Because it's better if it's combined with other information.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
A day without sunshine is like, night.
Re: Is fuzzyocr i.e. Image scanning
On Tue, 16 Oct 2018 15:48:34 +0200
Matus UHLAR - fantomas wrote:

> >On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
> >> One of my reservations about FuzzyOCR is that you have to provide an
> >> independent word list, while we have a very good tool to analyze
> >> text contents: SpamAssassin itself. So I would much prefer
> >> FuzzyOCR to feed the OCR'ed text back to SA for further analysis
> >> (the way pdfAssassin works).
>
> On 16.10.18 13:34, RW wrote:
> >That works as long as the OCR remains very accurate. What happened
> >before was that the deployment of OCR led spammers to make their
> >text much less readable.
>
> I think the original reason was that the available OCR programs were not
> reliable enough.
>
> I tested gocr, ocrad and tesseract more than 10 years ago, with not
> very satisfying results, gocr being the best at that time.
>
> Since then, Google took tesseract and made it much better.
>
> I believe that it would now be viable to push OCR output to
> spamassassin for processing with Bayes and other rules.


Bayes might work, but I wouldn't like to see it added to body text
because corrupted text could look like obfuscation.
Re: Is fuzzyocr i.e. Image scanning
Hi,
>
> > One of my reservations about FuzzyOCR is that you have to provide an
> > independent word list, while we have a very good tool to analyze text
> > contents: SpamAssassin itself. So I would much prefer FuzzyOCR to feed
> > the OCR'ed text back to SA for further analysis (the way pdfAssassin
> > works).
>
> That works as long as the OCR remains very accurate. What happened
> before was that the deployment of OCR led spammers to make their text
> much less readable.

Agreed 100%, we need a way to test the accuracy of the OCRed text before
feeding it to SA. Or have part of the obfuscation tests disabled.
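
As a rough illustration of such an accuracy test, the Python sketch below scores how "word-like" the OCR output is before it would be handed on. Everything in it is an assumption made for the example: the function names, the vowel-based word test (which is biased toward English and similar languages) and the thresholds. A real check would more likely use a dictionary or a language detector.

```python
import re

VOWELS = set("aeiouy")
PUNCTUATION = ".,;:!?\"'()"


def looks_like_word(token: str) -> bool:
    """Crude test: alphabetic, 2 to 20 letters, at least one vowel."""
    return (
        token.isalpha()
        and 2 <= len(token) <= 20
        and any(c in VOWELS for c in token.lower())
    )


def ocr_plausibility(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like real words."""
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return 0.0
    good = sum(1 for t in tokens if looks_like_word(t.strip(PUNCTUATION)))
    return good / len(tokens)


def accept_ocr_text(text: str, threshold: float = 0.7, min_tokens: int = 5) -> bool:
    """Accept OCR output only if it is long enough and mostly word-like.
    Both thresholds are arbitrary starting points, not tuned values."""
    return len(text.split()) >= min_tokens and ocr_plausibility(text) >= threshold
```

Text that fails the check could be dropped, or passed on with a flag so that the obfuscation rules are skipped for it.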

> > As for your question about the place for image scanning, if your MTA
> > has the resources to do so, why not?
>
> Because it's better if it's combined with other information.

That is the way I meant it, it's an AND, not an OR. I see FuzzyOCR as
just one more tool that can be added to SA.

Regards,

Olivier
Re: Is fuzzyocr i.e. Image scanning
My comments on

http://pralab.diee.unica.it/en/ImageCerberus

IC is an effort to dig a hole in the water, because the problem of image spam with obfuscated text cannot be solved by ocr.

My approach is a "better safe than sorry" best practice that anyone can implement with existing software:

1. do not display inline the content of attachments and linked resources;
2. give high spam score (>=5) to any email with very low text to image ratio.

On pdf and similar attachments, reject anything with built in macros or scripts.
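
For point 2 above, a text-to-image ratio could be computed along these lines. This is a minimal Python sketch using only the standard library; the function names, the choice to count only text/plain parts, and the 0.01 threshold are assumptions for illustration, not an existing SpamAssassin rule.

```python
from email.message import Message
from typing import Optional


def text_to_image_ratio(msg: Message) -> Optional[float]:
    """Bytes of text/plain content divided by bytes of image content.
    Returns None when the message carries no image data at all."""
    text_bytes = 0
    image_bytes = 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        if part.get_content_type() == "text/plain":
            # Ignore leading and trailing whitespace in the text part.
            text_bytes += len(payload.strip())
        elif part.get_content_maintype() == "image":
            image_bytes += len(payload)
    if image_bytes == 0:
        return None
    return text_bytes / image_bytes


def is_image_heavy(msg: Message, threshold: float = 0.01) -> bool:
    """True when there is image content and very little text next to it.
    The threshold is a hypothetical value that would need local tuning."""
    ratio = text_to_image_ratio(msg)
    return ratio is not None and ratio < threshold
```

The result is only a signal to be scored. A message with one line of text and a large photo would also be flagged, so the weight given to it has to match the local mail flow.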

R

Re: Is fuzzyocr i.e. Image scanning
I see a vps and an ".expert" tld sender domain. My servers handle those with a REJECT rule.

Re: Is fuzzyocr i.e. Image scanning
On Wed, Oct 17, 2018 at 09:21:33AM +0700, Olivier wrote:
>
> That is the way I meant it, it's an AND, not an OR. I see FuzzyOCR as
> just one more tool that can be added to SA.

The problem is that it's so inefficient. I've never seen image spam as a
problem; mostly it hits other rules and MTA blocks, if you know what you are
doing. My current spam corpus contains only 7% images. For ham it's over
60%, so that's a horrible amount of image transformation tools and
analyzers being executed for nothing. Also consider how many vulnerabilities
ImageMagick and similar image tools have had. At minimum, FuzzyOCR etc.
should maintain a hash database of good images to skip. All of these
10-year-old plugins are pretty horrid code.
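
A hash database of that kind could be as simple as the following Python sketch. The class and method names are hypothetical, and the in-memory set stands in for whatever persistent store a real plugin would use.

```python
import hashlib


class KnownImageCache:
    """Remembers digests of images already judged harmless, so that
    expensive OCR or image analysis can be skipped for them."""

    def __init__(self) -> None:
        self._good = set()

    @staticmethod
    def digest(data: bytes) -> str:
        """SHA-256 of the decoded image bytes, as a hex string."""
        return hashlib.sha256(data).hexdigest()

    def mark_good(self, data: bytes) -> None:
        """Record an image that was scanned and found to be harmless."""
        self._good.add(self.digest(data))

    def should_scan(self, data: bytes) -> bool:
        """True unless this exact image has been marked good before."""
        return self.digest(data) not in self._good
```

An exact hash only helps with byte-identical images, such as company logos in signatures. That is probably the common case in ham, but it will not match an image that has been re-encoded or resized.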
Re: Is fuzzyocr i.e. Image scanning
>> >On Tue, 16 Oct 2018 11:49:54 +0700 Olivier wrote:
>> >> One of my reservations about FuzzyOCR is that you have to provide an
>> >> independent word list, while we have a very good tool to analyze
>> >> text contents: SpamAssassin itself. So I would much prefer
>> >> FuzzyOCR to feed the OCR'ed text back to SA for further analysis
>> >> (the way pdfAssassin works).
>>
>> On 16.10.18 13:34, RW wrote:
>> >That works as long as the OCR remains very accurate. What happened
>> >before was that the deployment of OCR led spammers to make their
>> >text much less readable.

>On Tue, 16 Oct 2018 15:48:34 +0200 Matus UHLAR - fantomas wrote:
>> I think the original reason was that the available OCR programs were not
>> reliable enough.
>>
>> I tested gocr, ocrad and tesseract more than 10 years ago, with not
>> very satisfying results, gocr being the best at that time.
>>
>> Since then, Google took tesseract and made it much better.
>>
>> I believe that it would now be viable to push OCR output to
>> spamassassin for processing with Bayes and other rules.

On 16.10.18 18:42, RW wrote:
>Bayes might work, but I wouldn't like to see it added to body text
>because corrupted text could look like obfuscation.

It should be pushed back to body text just for filters like Bayes.
The same could/should be done for attached .doc, .pdf files etc.
--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
42.7 percent of all statistics are made up on the spot.
Re: Is fuzzyocr i.e. Image scanning
On Wed, 17 Oct 2018, Matus UHLAR - fantomas wrote:

> On 16.10.18 18:42, RW wrote:
>> Bayes might work, but I wouldn't like to see it added to body text
>> because corrupted text could look like obfuscation.
>
> It should be pushed back to body text just for filters like Bayes.
> The same could/should be done for attached .doc, .pdf files etc.

...which would be much more reliable than OCR.

If it was a resource-allocation decision for pulling text from doc/pdf vs.
updating OCR, I'd push for the former.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The problem is when people look at Yahoo, slashdot, or groklaw and
jump from obvious and correct observations like "Oh my God, this
place is teeming with utter morons" to incorrect conclusions like
"there's nothing of value here". -- Al Petrofsky, in Y! SCOX
-----------------------------------------------------------------------
566 days since the first commercial re-flight of an orbital booster (SpaceX)
Re: Is fuzzyocr i.e. Image scanning
On Wed, 17 Oct 2018, Rupert Gallagher wrote:

> IC is an effort to dig a hole in the water, because the problem of image spam with obfuscated text cannot be solved by ocr.
>
> My approach is a "better safe than sorry" best practice that anyone can implement with existing software:
>
> 1. do not display inline the content of attachments and linked resources;
> 2. give high spam score (>=5) to any email with very low text to image ratio.

Your system, your rules, but it won't work for everybody.

We routinely receive messages from users needing help which contain 1~2 lines of
text describing the issue (like: 'my computer crashed' ) and then a screen-shot
taken with a cellphone camera (10~20 megapixel) which is 4~8 MB in size.
Sometimes the text is only in the subject and the screen-shot is the only thing
in the body.

I agree about not displaying inline attachments by default but that is a client
configuration issue and we cannot control our users' clients.


--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Is fuzzyocr i.e. Image scanning
>>On 16.10.18 18:42, RW wrote:
>>>Bayes might work, but I wouldn't like to see it added to body text
>>>because corrupted text could look like obfuscation.

>On Wed, 17 Oct 2018, Matus UHLAR - fantomas wrote:
>>It should be pushed back to body text just for filters like Bayes.
>>The same could/should be done for attached .doc, .pdf files etc.

On 17.10.18 07:56, John Hardin wrote:
>...which would be much more reliable than OCR.
>
>If it was a resource-allocation decision for pulling text from doc/pdf
>vs. updating OCR, I'd push for the former.

this could be easily configured by installing modules or loading them.

btw, both PDF and word documents can contain images too ...


--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
99 percent of lawyers give the rest a bad name.