Mailing List Archive

Lots of comment in mail, how to score
I seem to remember we discussed a way to figure out how much HTML comment is
in a message, but I am not able to find a decent ruleset that is trying to
count the amount of comment.

Let me elaborate with an example: http://pastebin.com/AS6kvLH2

I do realize the spamvertized site (way way down the message) is at the
moment in blacklists. But it was not at the time the message was received.
And I reckon a fresh domain will be spammed in the next batch. But they
typically all have _pages_ of comment, and behind that scattering of words,
a small block with the payload.

What would be the best way to score such an unusual amount of HTML comment in
a message?
--
View this message in context: http://old.nabble.com/Lots-of-comment-in-mail%2C-how-to-score-tp33272106p33272106.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: Lots of comment in mail, how to score
> Let me elaborate with an example: http://pastebin.com/AS6kvLH2

1.0 RCVD_IN_CSS            RBL: Received via a relay in Spamhaus CSS
                           [64.120.212.26 listed in zen.spamhaus.org]
1.3 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                           [Blocked - see <http://www.spamcop.net/bl.shtml?64.120.212.26>]
1.3 RCVD_IN_RP_RNBL        RBL: Relay in RNBL, https://senderscore.org/blacklistlookup/
                           [64.120.212.26 listed in bl.score.senderscore.com]
1.4 RCVD_IN_BRBL_LASTEXT   RBL: RCVD_IN_BRBL_LASTEXT
                           [64.120.212.26 listed in bb.barracudacentral.org]
1.7 URIBL_DBL_SPAM         Contains an URL listed in the DBL blocklist
                           [URIs: universmallmail.com]
1.6 URIBL_WS_SURBL         Contains an URL listed in the WS SURBL blocklist
                           [URIs: universmallmail.com]
1.7 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                           [URIs: universmallmail.com]
3.5 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                           [score: 0.9997]
0.0 RELAY_US               Relayed through United States
1.7 RCVD_IN_HOSTKARMA_BL   RBL: HostKarma: relay in black list
                           [64.120.212.26 listed in hostkarma.junkemailfilter.com]
0.8 SPF_NEUTRAL            SPF: sender does not match SPF record (neutral)
0.1 SPF_HELO_NEUTRAL       SPF: HELO does not match SPF record (neutral)
0.0 HTML_MESSAGE           BODY: HTML included in message
0.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
0.1 KHOP_DNSBL_BUMP        Hits a trusted non-overlapping DNSBL
0.4 MAY_BE_FORGED          Relay IP's reverse DNS does not resolve to IP
1.0 KHOP_DYNAMIC2          Relay looks like a dynamic address

seems wasted :)
Re: Lots of comment in mail, how to score
Benny Pedersen wrote:
>
> 1.0 RCVD_IN_CSS RBL: Received via a relay in Spamhaus CSS
> 1.6 URIBL_WS_SURBL Contains an URL listed in the WS SURBL
> blocklist
> [URIs: universmallmail.com]
>
> seems wasted :)
>

As I said, sure they are in RBL now. They were not when this message was
delivered. That's the whole point of coming up with a different approach here,
the amount of comment in the message.
Re: Lots of comment in mail, how to score
> As I said, sure they are in RBL now. They were not when this message was
> delivered. That's the whole point of coming up with a different approach here,
> the amount of comment in the message.

i got bayes_99 on this unknown spam

meta SPF_SPAM_AS_NEUTRAL (SPF_NEUTRAL && SPF_HELO_NEUTRAL)

and set score on this

if you want to write rules on html comments you need rawbody, and i try to
stay away from needing that
Re: Lots of comment in mail, how to score
On 2/6/2012 12:57 PM, Mynabbler wrote:
> As I said, sure they are in RBL now. They were not when this message was
> delivered.

Looking at the date/time stamps, I'm almost positive that this URI was
blacklisted in BOTH uribl-BLACK and ivmURI *hours* before your sample
message arrived.

But, of course, your question is still valid! Having rules in place in SA
to deal with this kind of attempt at getting around bayes-filtering is a
good idea!

--
Rob McEwen
http://dnsbl.invaluement.com/
rob@invaluement.com
+1 (478) 475-9032
Re: Lots of comment in mail, how to score
On Mon, 6 Feb 2012, Benny Pedersen wrote:

>
>> As I said, sure they are in RBL now. They were not when this message was
>> delivered. That's the whole point of coming up with a different approach
>> here,
>> the amount of comment in the message.
>
> i got bayes_99 on this unknown spam
>
> meta SPF_SPAM_AS_NEUTRAL (SPF_NEUTRAL && SPF_HELO_NEUTRAL)
>
> and set score on this
>
> if you like to make rules on html comments you need rawbody, and i try keep
> away from this needs

As currently implemented, true. However SA already has some kind of HTML
rendering engine so it knows the size of the raw & rendered message.
If there was some easy way to extract those numbers, calculate the
ratio, and make it available to the rules processor, then a score could be
generated at very little cost.
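
A minimal sketch of that measurement, written in Python outside SpamAssassin purely for illustration. It assumes the whole raw HTML part is available as one string; the names SizeProfile, size_profile and comment_ratio are made up here and are not part of SA. It tallies the raw size, the text outside tags and comments, and the comment text, so a ratio can be derived from them.

```python
from collections import namedtuple
from html.parser import HTMLParser

# raw: length of the HTML source; text: characters outside tags and comments
# (this includes <style>/<script> contents); comment: characters inside <!-- -->.
SizeProfile = namedtuple("SizeProfile", "raw text comment")


class _Meter(HTMLParser):
    """Tally comment characters and non-tag text while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = 0
        self.comment = 0

    def handle_data(self, data):
        # Leading/trailing whitespace of each text run is not counted.
        self.text += len(data.strip())

    def handle_comment(self, data):
        self.comment += len(data)


def size_profile(raw_html):
    """Return the three sizes for one HTML string."""
    meter = _Meter()
    meter.feed(raw_html)
    meter.close()
    return SizeProfile(len(raw_html), meter.text, meter.comment)


def comment_ratio(raw_html):
    """Fraction of the raw HTML taken up by comment text (0.0 for empty input)."""
    profile = size_profile(raw_html)
    return profile.comment / profile.raw if profile.raw else 0.0


if __name__ == "__main__":
    sample = ("<html><!--" + "axe (elsewhere) zoo " * 50
              + "--><body><p>Buy now</p></body></html>")
    print(size_profile(sample), round(comment_ratio(sample), 2))
```

Any cut-off applied to the ratio would have to be tuned against real ham, since legitimate mail can carry large comments too.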


--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Lots of comment in mail, how to score
> But, of course, your question is still valid! Having rules in place in SA
> to deal with this kind of attempt at getting around bayes-filtering is a
> good idea!

imho bayes does not see html comments, but still here it got bayes_99
what did i miss ?
Re: Lots of comment in mail, how to score
On Mon, 2012-02-06 at 09:57 -0800, Mynabbler wrote:
> As I said, sure they are in RBL now. They were not when this message was
> delivered. That's the whole point of coming up with a different approach here,
> the amount of comment in the message.
>
Something like this might work:

body __SR1 /<html>\s{0,2}<!--/
body __SR2 /-->\s{0,2}<body>/
meta RULE (__SR1 && __SR2)
score RULE 3.5

on the grounds that I've never seen a comment in valid HTML that
immediately follows an <html> tag or immediately precedes a <body> tag.

CAUTION: this has been neither syntax checked nor tested.
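
One way to sanity-check the two patterns outside SpamAssassin is a small Python harness. This is only a sketch: it applies the same regular expressions to a whole raw HTML string, the function name comment_before_body and the sample strings are invented for illustration, and how SpamAssassin itself prepares or splits the message text before matching is not modelled.

```python
import re

# Python equivalents of the two sub-rules. Like the originals they are
# case-sensitive and allow at most two whitespace characters in the gap.
SR1 = re.compile(r"<html>\s{0,2}<!--")
SR2 = re.compile(r"-->\s{0,2}<body>")


def comment_before_body(raw_html):
    """Mimic the meta rule: true only when both sub-patterns match."""
    return bool(SR1.search(raw_html) and SR2.search(raw_html))


if __name__ == "__main__":
    spam_like = "<html>\n<!-- axe (elsewhere) zoo -->\n<body>payload</body></html>"
    ham_like = "<html><head></head><body><!-- note --><p>hi</p></body></html>"
    print(comment_before_body(spam_like), comment_before_body(ham_like))
```

Requiring both ends keeps a comment that merely sits somewhere inside the body from triggering the check.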

It would also be quite reasonable to point a rule at the in-body URL:
somebody has gone to the trouble of setting up MX records for the
domain, so it may feature in more spam in the future. The URL
references a single, zero length main page called index.html - not a
normal feature of a legitimate site. If many of the spams have this URL
in common, it is definitely worth a few points.

Martin
Re: Lots of comment in mail, how to score
> body __SR1 /<html>\s{0,2}<!--/
> body __SR2 /-->\s{0,2}<body>/

does not work since body rules strip html comments

with rawbody it ignores the limits but hits on both
Re: Lots of comment in mail, how to score
>> body __SR1 /<html>\s{0,2}<!--/
>> body __SR2 /-->\s{0,2}<body>/
>
> does not work since body rules strip html comments
>
> with rawbody it ignore limits but hits on both
>

And don't score too high.

Example: Confirmations from Travelocity contain a 28 KB comment.

Joseph Brennan
Columbia University Information Technology
Re: Lots of comment in mail, how to score
Joseph Brennan wrote:
>
>
>>> body __SR1 /<html>\s{0,2}<!--/
>>> body __SR2 /-->\s{0,2}<body>/
>>
>> does not work since body rules strip html comments
>>
>> with rawbody it ignore limits but hits on both
>>
>
> And don't score too high.
>
> Example: Confirmations from Travelocity contain a 28 KB comment.

Eugh.

Any idea what's in that comment?

-kgd
Re: Lots of comment in mail, how to score
On Tue, 2012-02-07 at 11:04 -0500, Kris Deugau wrote:
> Joseph Brennan wrote:
> >
> >
> >>> body __SR1 /<html>\s{0,2}<!--/
> >>> body __SR2 /-->\s{0,2}<body>/
> >>
> >> does not work since body rules strip html comments
> >>
> >> with rawbody it ignore limits but hits on both
> >>
> >
> > And don't score too high.
> >
> > Example: Confirmations from Travelocity contain a 28 KB comment.
>
BUT is that comment between <html> and <body> tags in a Travelocity
confirmation? It is in the example mail and, since I've never seen a
comment there in mail or on a web page, this seemed like a fairly
safe thing to trigger on.

> Eugh.
>
Kindly note that my suggestion has been misquoted, probably by Joe
Brennan. As he quoted it, it's missing the meta, which is somewhat
important in this case. With the correction to do a rawbody scan it
should be:

rawbody __SR1 /<html>\s{0,2}<!--/
rawbody __SR2 /-->\s{0,2}<body>/
meta RULE (__SR1 && __SR2)

which is actually quite specific since it won't fire unless the comment
is between just those tags and separated from them by at most two
whitespace characters.

> Any idea what's in that comment?
>
a huge amount of garbage consisting of English words grouped by matched
parens, something like this: "axe (elsewhere) zoo this (whenever
numeric) ......." with nothing showing an obvious pattern except the
paired parens with text between them. I suppose you could use something
like:

rawbody RULE2 /\([\s\w]{1,30}\)/
tflags RULE2 multiple

which would be specific to this garbage, but would you really want to
run that across more than 80kb of comment? I suggested the approach of
matching each end of the comment and using a meta to ensure both are
present because that should run a lot faster than anything I could dream
up that matched against the guts of the comment.
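
To illustrate the cost concern, here is a rough Python sketch that counts the parenthesised groups but stops after a fixed number of hits, so the work is bounded however long the comment is. The function name count_paren_groups and the max_hits parameter are hypothetical and chosen only for this example; nothing here is SpamAssassin code.

```python
import re
from itertools import islice

# Same pattern as the rule above: 1 to 30 word or space characters in parens.
PAREN_GROUP = re.compile(r"\([\s\w]{1,30}\)")


def count_paren_groups(text, max_hits=20):
    """Count parenthesised word groups, giving up once max_hits is reached."""
    # finditer is lazy, so islice stops the scan after max_hits matches.
    return sum(1 for _ in islice(PAREN_GROUP.finditer(text), max_hits))


if __name__ == "__main__":
    filler = "axe (elsewhere) zoo this (whenever numeric) " * 1000
    print(count_paren_groups(filler))
```

Capping the count trades precision for speed: once enough groups have been seen to call the comment filler, the rest of it never needs scanning.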

Martin
Re: Lots of comment in mail, how to score
Martin Gregorie wrote:
> BUT is that comment between <html> and <body> tags in a Travelocity
> confirmation? It is in the example mail and, since I've never seen a
> comment there in mail or on a web page, this seemed like a fairly
> safe thing to trigger on.

*nod* I should have just trimmed the quote down; I wasn't referring
specifically to those potential rules.

> Kindly note that my suggestion has been misquoted, probably by Joe
> Brennan. As he quoted it, it's missing the meta, which is somewhat
> important in this case. With the correction to do a rawbody scan it
> should be:
>
> rawbody __SR1 /<html>\s{0,2}<!--/
> rawbody __SR2 /-->\s{0,2}<body>/
> meta RULE (__SR1 && __SR2)

*nod* I can't say I recall if I've seen comments arranged like that;
I've paid more attention to the length and lack of useful content in the
spamples I've come across.

>> Any idea what's in that comment?
>>
> a huge amount of garbage consisting of English words grouped by matched
> parens, something like this: "axe (elsewhere) zoo this (whenever
> numeric) ......." with nothing showing an obvious pattern except the
> paired parens with text between them.

*nod* Yeah, I've been seeing those.

I've got a number of rules targeting strange things in HTML comments
generally:

rawbody LONG_COMMENT m|<!--[^>{};]{200,}-->|
rawbody DUMB_COMMENT_1 m|<!--\n?\s*\d+\s*\n?-->|
rawbody DUMB_COMMENT_2 m|<!--\n?\s*(?:-{72}\n){2,}-+\n?\s*-->|
rawbody BACK2BACK_COMMENT m|--!><!--[\n\s\w]{0,200}--!><!--|
rawbody FILLER_COMMENT m|<!--\n?\s*(?:\(?[\w.]{2,14}\)?\s{0,2}/\s{0,2}){8}|

Note the first one started at ~60 chars, then I kept having to bump it
up due to Outlook's bizarre HTML generation.
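
The effect of moving that threshold can be tried out with a short Python sketch of the LONG_COMMENT pattern. The helper name long_comment_re and its min_len parameter are invented for this example, and the sample comments are synthetic rather than taken from real mail.

```python
import re


def long_comment_re(min_len=200):
    """Build the LONG_COMMENT pattern with an adjustable minimum length."""
    # The class [^>{};] excludes '>', braces and ';', so comments that
    # contain markup or CSS-like text break the run and do not match.
    return re.compile(r"<!--[^>{};]{%d,}-->" % min_len)


if __name__ == "__main__":
    filler = "<!--" + "word " * 60 + "-->"        # 300 filler characters
    medium = "<!--" + "x" * 100 + "-->"           # 100 filler characters
    for threshold in (60, 200):
        pattern = long_comment_re(threshold)
        print(threshold, bool(pattern.search(filler)), bool(pattern.search(medium)))
```

A mid-sized comment is caught at the lower threshold and passes at the higher one, which is the trade-off being tuned.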

The other oddity I've tripped over is excessively long <style></style>
tags; legit email seems to use as much as ~3K, but I've seen spams put
all kinds of non-CSS garbage in there up to 20-30K in length.
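
Measuring this is straightforward in a sketch. The Python below sums the characters found inside all <style> elements of one HTML string; the names StyleMeter and style_size are hypothetical, and any cut-off between the ~3K and 20-30K figures mentioned above would be an assumption to tune locally.

```python
from html.parser import HTMLParser


class StyleMeter(HTMLParser):
    """Add up the number of characters inside <style> elements."""

    def __init__(self):
        super().__init__()
        self._in_style = False
        self.style_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Style content may arrive in several pieces, so accumulate.
        if self._in_style:
            self.style_chars += len(data)


def style_size(raw_html):
    """Return the total length of all <style> contents in raw_html."""
    meter = StyleMeter()
    meter.feed(raw_html)
    meter.close()
    return meter.style_chars


if __name__ == "__main__":
    page = ("<html><head><style>" + "junk words " * 2500
            + "</style></head><body>hi</body></html>")
    print(style_size(page))
```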

-kgd
Re: Lots of comment in mail, how to score
Martin Gregorie <martin@gregorie.org> wrote:

>> > Example: Confirmations from Travelocity contain a 28 KB comment.
>>
> BUT is that comment between <html> and <body> tags in a Travelocity
> confirmation? It is in the example mail and, since I've never seen a
> comment there in mail or on a web page, this seemed like a fairly
> safe thing to trigger on.

No, it was inside <body> .. </body> at least. We noticed it a couple
of years ago, and I have only a note on file about it being 28 KB,
without an example. I don't remember exactly what was in it, but it
was some kind of content that seemed to be about the reservation.

Most likely a comment before the body begins is unique to spam, but... you
never know. It sounds like valid HTML, so some web programmer might
find a reason to put it in mail output.


Now <style> ... </style> with garbage in it is interesting. That
would never be in real mail. Or so you'd think!


Joseph Brennan
Columbia University Information Technology
Re: Lots of comment in mail, how to score
On Tue, 7 Feb 2012, Joseph Brennan wrote:

> Now <style> ... </style> with garbage in it is interesting. That would
> never be in real mail. Or so you'd think!

I do have a rule for garbage styles that is doing fairly well in
masschecks:

http://ruleqa.spamassassin.org/rule=STYLE_GIBBERISH

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Your mouse has moved. Your Windows Operating System must be
relicensed due to this hardware change. Please contact Microsoft
to obtain a new activation key. If this hardware change results in
added functionality you may be subject to additional license fees.
Your system will now shut down. Thank you for choosing Microsoft.
-----------------------------------------------------------------------
5 days until Abraham Lincoln's and Charles Darwin's 203rd Birthdays
Re: Lots of comment in mail, how to score
On Tue, 2012-02-07 at 20:13 -0500, Joseph Brennan wrote:
> Now <style> ... </style> with garbage in it is interesting. That
> would never be in real mail. Or so you'd think!
>
Maybe, maybe not. I think spammers have found that you can put any old
junk between <style></style> tags. I base this on screwing up styles
when I was learning to use them and noticing that anything the browser
can't parse in there is silently ignored.

For fun I kicked this together:
=================================================================================
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">

<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org">

<title>Big red test</title>
<style type="text/css">
Maybe, maybe not. As a pure guess, I think spammers may have found that
you can put any old junk between [style] and [/style] tags. I base
this on
screwing up styles when I was learning to use them and noticing that
anything the browser can't parse in there is silently ignored.
</style>
<style type="text/css">
p.c1 {color: red; font-size: xx-large; font-weight: bold}
</style>
<style type="text/css">
Maybe, maybe not. As a pure guess, I think spammers may have found that
you can put any old junk between [style] and [/style] tags. I base
this on
screwing up styles when I was learning to use them and noticing that
anything the browser can't parse in there is silently ignored.
p.c1 {color: red; font-size: xx-large; font-weight: bold}
</style>
</head>

<body>
<p class="c1">Big red test</p>

<p>Heading should be red</p>
</body>
</html>
=================================================================================

I used three <style> sections because, when I put the junk text into one
style section in front of the actual style definition, the definition got
ignored.

If you cut and paste this example as a file and feed it to your browser,
you should see the first body line in bold red letters. I've tested this
with FireFox and Lynx, which work as I expected. As you can see, the
file has been passed through HTML Tidy, which says it is valid HTML.
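
A crude way to spot this kind of filler is to strip everything that looks like a CSS rule and see what is left. The Python sketch below does that with a deliberately rough pattern; it is not a CSS parser, it is not how any existing SpamAssassin rule works, and the names CSS_RULE and non_css_residue are invented for this example.

```python
import re

# Rough "selector { declarations }" matcher. The selector part may not cross
# a line break, so junk on earlier lines is not swallowed as a selector.
CSS_RULE = re.compile(r"[^{}\n]*\{[^{}]*\}")


def non_css_residue(style_text):
    """Return what remains after removing everything shaped like a CSS rule."""
    return CSS_RULE.sub("", style_text).strip()


if __name__ == "__main__":
    junk = ("Maybe, maybe not. I think spammers have found that\n"
            "you can put any old junk between style tags.\n")
    rule = "p.c1 {color: red; font-size: xx-large; font-weight: bold}\n"
    for block in (junk, rule, junk + rule):
        print(len(non_css_residue(block)))
```

A large residue relative to the size of the style block would suggest filler, though real stylesheets with at-rules or comments would need more careful handling than this.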


Martin
Re: Lots of comment in mail, how to score
On Wed, 2012-02-08 at 03:04 +0000, Martin Gregorie wrote:
> If you cut and paste this example as a file and feed it to your browser,
> you should see the first body line in bold red letters. I've tested this
> with FireFox and Lynx, which work as I expected.
>
Correction: FireFox and Opera. Lynx ignores style specs and shows plain
text.

Martin