Mailing List Archive

recognizing pats within text bodies?
For funsies, I decided to play around with adding Eicar to my .sig.

I was unsurprised that clamscan nailed it. I was surprised to find
that Trend didn't, it allowed it through; apparently it doesn't flag
Eicar within a normal text body, only as a separate file or
attachment.

Is this business of flagging on Eicar within a text body intrinsic
to clamav, or is it a defect of the way I'm currently playing with
it?

My current setup ends up using clamscan; it does it from this
wrapper, which I've nicknamed clamit:

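#!/bin/sh
# clamit: read one mail message on stdin, unpack it, scan it with clamscan.

die(){ echo "$0: $*" >&2; exit 1; }
tmp=/tmp/`basename "$0"`.$$
# Remove the scratch directory when the script exits or is signalled.
trap 'rm -rf "$tmp"' 0 1 2 3
mkdir "$tmp" || die "mkdir $tmp failed"
cd "$tmp" || die "cd $tmp failed"
cat >full-message.mbox
mkdir unpack || die "mkdir unpack failed"
cd unpack || die "cd unpack failed"
# Decode any encoded parts of the message into the unpack directory.
uudeview -i -a -m -f -t -d -s -q -n - <../full-message.mbox
cd ..
# The exit status of clamscan becomes the exit status of clamit.
clamscan --quiet -r .
exit $?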

which in turn is called using this clause in my .procmailrc:

:0HB
* ! ? clamit
clamav/

One might reasonably ask, why am I bothering with A/V, since I run
entirely on Unix and don't run susceptible MUAs; I added clamav to
my screening to help assist bogofilter, in this age of email worms.

-Bennett
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
Re: recognizing pats within text bodies?
On Thu, 14 Aug 2003 13:12:20 -0400
Bennett Todd <bet@rahul.net> wrote:

> For funsies, I decided to play around with adding Eicar to my .sig.
>
> I was unsurprised that clamscan nailed it. I was surprised to find
> that Trend didn't, it allowed it through; apparently it doesn't flag
> Eicar within a normal text body, only as a separate file or
> attachment.
>
> Is this business of flagging on Eicar within a text body intrinsic
> to clamav, or is it a defect of the way I'm currently playing with
> it?

ClamAV doesn't use position indicators in signatures and always scans
all data - generally it's a useful feature but the scanner might be
a little slower in comparison to other scanners when a file is
_huge_. The lack of position indicators sometimes causes false positive
alerts, that's why they will be implemented in the next database
format.
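
To illustrate the difference between scanning all data and anchoring a signature to a position, here is a minimal Python sketch. It is not ClamAV code: the marker string and both function names are made up for illustration, and a harmless stand-in is used in place of the real Eicar string.

```python
# Hypothetical stand-in for a signature; deliberately not the real Eicar string.
SIGNATURE = b"HARMLESS-TEST-MARKER"

def match_anywhere(data: bytes, sig: bytes) -> bool:
    """Unanchored scan: the signature may start at any offset."""
    return sig in data

def match_at_offset(data: bytes, sig: bytes, offset: int = 0) -> bool:
    """Anchored scan: the signature must start exactly at `offset`."""
    return data.startswith(sig, offset)

mail_body = b"Some ordinary text.\n-- \nsignature block\n" + SIGNATURE + b"\n"
standalone_file = SIGNATURE

print(match_anywhere(mail_body, SIGNATURE))         # marker buried in the text
print(match_at_offset(mail_body, SIGNATURE))        # anchored at offset 0
print(match_at_offset(standalone_file, SIGNATURE))  # file that starts with the marker
```

An unanchored matcher reports the marker wherever it sits in a message body, which is consistent with what clamscan did here. A matcher that only looks at a fixed offset would stay silent on the same body, and that is also how position indicators could cut down false positives.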

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies?
This is great news!
Some samples I have cannot be detected at the moment without generating
a bunch of false positives. Some hints that could make life easier and
drop scan times too are, IMHO:
- (even basic) file type detection: as a minimum MZ exes, PE exes,
maybe also scripts (vbs/bat/whatever) and innocent types as well (say
gif89, jfif, riff, mpeg, etc.)
- smarter handling of the '*' wildcard: being able to limit the range
between a min and a max would be great. I mean something like
".{min,max}" in POSIX regex
- entry point detection for MZs and PEs: this would solve the hassle
with encrypting viruses that have very small or common decryption routines
- also detecting the read/write/exec attributes of the PE section
containing the EP can prove very useful: in fact many viruses and packed
worms rely on these attributes instead of using VirtualProtect and
similar APIs; I know this can be very tricky and painful to implement
but it would still be appreciated
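
The bounded-wildcard request in the list above can be sketched with Python's standard `re` module on byte strings. This is only an illustration of the ".{min,max}" idea; the helper name and the sample bytes are hypothetical and have nothing to do with the ClamAV signature format.

```python
import re

def bounded_gap(head: bytes, tail: bytes, lo: int, hi: int):
    """Compile a pattern: `head`, then between lo and hi arbitrary bytes, then `tail`."""
    gap = b".{%d,%d}" % (lo, hi)
    # DOTALL lets "." match every byte value, including newline.
    return re.compile(re.escape(head) + gap + re.escape(tail), re.DOTALL)

pattern = bounded_gap(b"\x40\x00", b"\x75\xf2", 2, 4)

close_enough = b"\x40\x00" + b"AAA" + b"\x75\xf2"      # gap of 3 bytes
too_far = b"\x40\x00" + b"A" * 10 + b"\x75\xf2"        # gap of 10 bytes
too_near = b"\x40\x00" + b"A" + b"\x75\xf2"            # gap of 1 byte

print(bool(pattern.search(close_enough)))
print(bool(pattern.search(too_far)))
print(bool(pattern.search(too_near)))
```

Compared with an unbounded '*', the bounded gap rejects matches whose two halves are too far apart, which is where many accidental matches come from.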
Thanks,
acab

Tomasz Kojm wrote:
> On Thu, 14 Aug 2003 13:12:20 -0400
> Bennett Todd <bet@rahul.net> wrote:
>
>
>>For funsies, I decided to play around with adding Eicar to my .sig.
>>
>>I was unsurprised that clamscan nailed it. I was surprised to find
>>that Trend didn't, it allowed it through; apparently it doesn't flag
>>Eicar within a normal text body, only as a separate file or
>>attachment.
>>
>>Is this business of flagging on Eicar within a text body intrinsic
>>to clamav, or is it a defect of the way I'm currently playing with
>>it?
>
>
> ClamAV doesn't use position indicators in signatures and always scans
> all data - generally it's a useful feature but the scanner might be
> a little slower in comparison to other scanners when a file is
> _huge_. The lack of position indicators sometimes causes false positive
> alerts, that's why they will be implemented in the next database
> format.
>
> Best regards,
> Tomasz Kojm
Re: recognizing pats within text bodies?
On Sat, 16 Aug 2003 16:25:15 +0200
aCaB <acabng@digitalfuture.it> wrote:

> maybe also scripts (vbs/bat/whatever) and innocent types as well (say
> gif89, jfif, riff, mpeg, etc.)

Well, I have seen an exploit against an image viewer (zgv?) embedded in
a GIF file! Each signature should be marked with the file type(s) it
was written against.
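
A rough sketch of that idea, assuming a simple magic-byte check and a made-up signature table (none of these names exist in ClamAV): the file type is sniffed once, and only the signatures written against that type, plus type-independent ones, are selected.

```python
# Leading "magic" bytes for a few of the formats named in this thread.
MAGIC = [
    (b"MZ", "mz-exe"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"RIFF", "riff"),
]

def sniff_type(data: bytes) -> str:
    """Return a coarse file type based on the first bytes of the data."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return "unknown"

# Hypothetical table: signature name -> file types it was written against.
SIGNATURES = {
    "Example.Worm": {"mz-exe"},
    "Example.ImageViewerExploit": {"gif"},
    "Example.Generic": {"any"},
}

def signatures_for(data: bytes) -> list:
    """Select the signatures that apply to this data's file type."""
    ftype = sniff_type(data)
    return sorted(name for name, types in SIGNATURES.items()
                  if "any" in types or ftype in types)

print(signatures_for(b"GIF89a\x01\x00"))
print(signatures_for(b"MZ\x90\x00"))
print(signatures_for(b"plain text"))
```

Tagging signatures, rather than skipping "innocent" formats outright, keeps a GIF-borne exploit detectable while still sparing image files from the executable-only signatures.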

> - smarter handling of the '*' wildcard: being able to limit the range
> between a min and a max would be great. I mean something like
> ".{min,max}" in posix regex

Perl regular expressions will be implemented.

> - Entrypoint detection for MZ's and PE's: this would solve the hassle
> with encrypting virii having very small or common decryption routines
> - Also detecting the read/write/exec attribute of the PE section
> containing the EP can prove very useful: in fact many virii and
> packed worms rely on these attributes instead of using VirtualProtect
> and similar api's; i know this can be very tricky and painful to
> implement but would still be appreciated

I agree - a PE analyser should be implemented, but this will take some
time. Fortunately the current signature format (derived from OAV, with
.db2 extensions) is able to detect almost every virus. It isn't
false-positive proof, though.

Best regards,
Tomasz Kojm
Re: recognizing pats within text bodies?
On [08/16/03 20:27], Tomasz Kojm wrote:
> > - smarter handling of the '*' wildcard: being able to limit the range
> > between a min and a max would be great. I mean something like
> > ".{min,max}" in posix regex
>
> The Perl regular expressions will be implemented.

Are you going to use the pcre library, or implement a certain subset of
regexes yourself?


--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies?
On Mon, 18 Aug 2003 07:46:57 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> On [08/16/03 20:27], Tomasz Kojm wrote:
> > > - smarter handling of the '*' wildcard: being able to limit the
> > > range between a min and a max would be great. I mean something
> > > like ".{min,max}" in posix regex
> >
> > The Perl regular expressions will be implemented.
>
> Are you going to use pcre library, or implement certain subset of
> regexes yourself?

Yes, we plan to use the pcre library.

Best regards,
Tomasz Kojm
Re: recognizing pats within text bodies?
>>Are you going to use pcre library, or implement certain subset of
>>regexes yourself?
>
>
> Yes, we plan to use the pcre library.
>
> Best regards,
> Tomasz Kojm

Cool!!!
Re: recognizing pats within text bodies?
On [08/15/03 04:57], Tomasz Kojm wrote:
> On Thu, 14 Aug 2003 13:12:20 -0400
>
> ClamAV doesn't use position indicators in signatures and always scans
> all data - generally it's a useful feature but the scanner might be
> a little slower in comparison to other scanners when a file is
> _huge_. The lack of position indicators sometimes causes false positive
> alerts, that's why they will be implemented in the next database
> format.

Is there an alpha (or even pre alpha) version of this implementation?
Re: recognizing pats within text bodies?
On Tue, 19 Aug 2003 15:10:53 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> On [08/15/03 04:57], Tomasz Kojm wrote:
> > On Thu, 14 Aug 2003 13:12:20 -0400
> >
> > ClamAV doesn't use position indicators in signatures and always
> > scans all data - generally it's a useful feature but the scanner
> > might be a little slower in comparison to other scanners when a file
> > is _huge_. The lack of position indicators sometimes causes false
> > positive alerts, that's why they will be implemented in the next
> > database format.
>
> Is there an alpha (or even pre alpha) version of this implementation?

No, there isn't :(

Best regards,
Tomasz Kojm
Re: recognizing pats within text bodies?
Please ignore clamav-0.60.patch from the previous email.
Correct patch attached.

--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies?
On [08/19/03 23:14], Tomasz Kojm wrote:
> On Tue, 19 Aug 2003 15:10:53 -0400
> Yevgeniy Miretskiy <eugene@invision.net> wrote:
>
> > Is there an alpha (or even pre alpha) version of this implementation?
>
> No, there isn't :(
>

That's unfortunate -- I really thought I could get my hands on this code :(

Anyway, a couple of other people and I are currently working on a research
project (antivirus file systems for Linux) where we are using clamav to
provide kernel-based virus detection.

While experimenting with clamav, we found that clamav performance can be
significantly improved by increasing the number of levels in the search trie.

Below are the results of timing clamscan, scanning a 2 gigabyte
VMware virtual disk which does not contain any viruses.
During our benchmarking, we ran similar tests many times with roughly
the same results. The table below shows speed improvements (over 1 run).
Currently, clamav uses a 2 level trie.

Time | Level 2   | Level 3   | Level 4   | Level 5
-----+-----------+-----------+-----------+----------
real | 3m56.477s | 1m47.712s | 1m40.420s | 1m31.998s
user | 3m19.270s | 1m18.230s | 1m7.070s  | 1m0.020s
sys  | 0m8.770s  | 0m6.400s  | 0m8.710s  | 0m7.090s

Memory usage increases by roughly 5-7 MB per level.
Level 5 memory usage is around 25 MB.
Considering that most people use clamd, I think 25 MB
for one process in exchange for 3X performance is a fair tradeoff.
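
As a quick check of the "3X" figure, the timings in the table can be turned into speedup ratios. The sketch below only re-computes numbers already given in the table; the helper name is made up.

```python
def seconds(stamp: str) -> float:
    """Convert a time(1)-style stamp such as '3m56.477s' to seconds."""
    minutes, rest = stamp.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))

real = {2: "3m56.477s", 3: "1m47.712s", 4: "1m40.420s", 5: "1m31.998s"}
user = {2: "3m19.270s", 3: "1m18.230s", 4: "1m7.070s", 5: "1m0.020s"}

def speedups(row: dict) -> dict:
    """Speedup of each level relative to the default 2-level trie."""
    base = seconds(row[2])
    return {level: base / seconds(stamp) for level, stamp in row.items()}

real_speedup = speedups(real)
user_speedup = speedups(user)

for level in (3, 4, 5):
    print(level, round(real_speedup[level], 2), round(user_speedup[level], 2))
```

At level 5 the wall-clock ("real") speedup works out to roughly 2.6x and the user CPU speedup to roughly 3.3x, so "3X" is a fair round number for CPU time. Most of the gain already appears at level 3.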

We found that going beyond L5 does not buy you anything
in terms of speed -- it only increases memory usage.
You can use the attached dbstats.pl program to get an idea
of what the optimal level might be (run the program with
a list of virus database files on the command line).

The speed of clamav improves with additional trie levels
because the average and the maximum pattern linked list
lengths decrease.
We are looking for a level which has an avg linked list length close
to 1 and a relatively small max linked list length.
For example, the output below shows the difference between
the trie with 2 and 3 levels:

Level 2:
Unique Prefixes: 3529
Avg Linked List Length: 2.23
Avg Linked List Length Descrease: 92.82%
Min Linked List Length: 1.00
Max Linked List Length: 240.00
Memory Usage: 948.29 KB
Memory Usage Increase: 27.55%
Number of cl_node structures: 3945
Level 3:
Unique Prefixes: 5080
Avg Linked List Length: 1.55
Avg Linked List Length Descrease: 30.49%
Min Linked List Length: 1.00
Max Linked List Length: 238.00
Memory Usage: 4728.38 KB
Memory Usage Increase: 79.94%
Number of cl_node structures: 9225

Percentages above indicate improvements from previous level.
dbstats.pl program terminates when avg linked list length becomes 1.
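
The kind of statistic shown above can be approximated in a few lines of Python. This is not dbstats.pl (its source is not part of this thread); it is a minimal sketch that groups patterns by their first `depth` bytes and reports how long the per-prefix lists would be. The function name and the sample patterns are made up.

```python
from collections import Counter

def prefix_stats(patterns, depth):
    """Group patterns by their first `depth` bytes and summarise the bucket sizes."""
    buckets = Counter(p[:depth] for p in patterns if len(p) >= depth)
    sizes = list(buckets.values())
    if not sizes:
        return {"unique_prefixes": 0, "avg": 0.0, "min": 0, "max": 0}
    return {
        "unique_prefixes": len(sizes),
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }

# Tiny made-up pattern set: three patterns share "ab", two of those share "abc".
sample = [b"abcd", b"abce", b"abxx", b"zzzz"]

for depth in (2, 3):
    print(depth, prefix_stats(sample, depth))
```

Each extra level splits crowded buckets further, so the average and maximum list lengths can only stay the same or shrink. The cost is one more trie node per newly distinct prefix.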

We also ran benchmarks with a much larger virus database
(we made up about 80000 additional "signatures") and, not surprisingly,
found that the speed improvements with more signatures are even more significant.

Also attached you will find a patch against clamav-0.60 which adds
a configuration option to change the trie depth.

Let me know how it goes -- hope you find this info helpful.

--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies?
On Wed, 20 Aug 2003 18:10:14 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> While experimenting with clamav, we found that clamav performance can
> be significantly improved by increasing number of levels in the search
> trie.
>
> Below are the results of timing clamscan, scanning a 2 gigabyte
> VMware virtual disk which does not contain any viruses.
> During our benchmarking, we ran similar tests many times with roughly
> the same results. The table below shows speed improvements (over 1
> run). Currently, clamav uses a 2 level trie.
>
> Time | Level 2 | Level 3 | Level 4 | Level 5
> -----------------------------------------------------
> real | 3m56.477s | 1m47.712s | 1m40.420s | 1m31.998s
> user | 3m19.270s | 1m18.230s | 1m7.070s | 1m0.020s
> sys | 0m8.770s | 0m6.400s | 0m8.710s | 0m7.090
>
> Memory usage increases by roughly 5-7 MB per each level.
> Level 5 memory usage is around 25 MB.
> Considering that most people use clamd, I think 25MB
> usage for 1 process with 3X performance is a fair tradeoff.

No!
1) under BSD the memory usage will be about 50 MB (_now_)
2) with a higher level, new signatures will cause _BRUTAL_ memory
usage because new nodes will be created for most signatures
(only a few signatures share the same first 5
characters)
3) higher levels break some polymorphic signatures (e.g. W32/Magistr.B,
W32/Hybris.C), because we don't implement regular expressions in the
trie

> We also ran benchmarks with much larger virus database
> (we made up about 80000 addition "signature"), and, not surprisingly,

Please create _random_ signatures (using /dev/urandom or so) - I
wonder whether your virtual memory will hold a few clamscan processes...

Best regards,
Tomasz Kojm
Re: recognizing pats within text bodies?
Sorry for rather long response.

On [08/21/03 20:46], Tomasz Kojm wrote:
> On Wed, 20 Aug 2003 18:10:14 -0400
> No ! :
> 1) under BSD the memory usage will be about 50 MB (_now_)

I don't know about BSD in general, but on FreeBSD my memory usage was 12 MB
with the default 2 levels and 48 MB with 5 levels.
While this memory usage is about 2x the usage on Linux, it's still
acceptable to me.

> 2) under higher level new signatures will cause a _BRUTAL_ memory
> usage because new nodes will be created for most signatures
> (there are only few signatures that have the same first 5
> characters)

I think _BRUTAL_ is a pretty rough estimate. I also think that if I have lots of
RAM, it should be left up to me to decide whether I want to use 500 MB for
my database to gain an XXX speed improvement.

The memory usage relates directly to the number of unique prefixes of length
CL_MIN_LENGTH (i.e. the height of the trie) in the database.
If you think that the memory usage will increase exponentially, you are mistaken.
It will only increase exponentially in a database that exercises every possible
prefix (256^prefixlength, since each byte can take 256 values).
As you increase the height, the number of unique prefixes does NOT increase
nearly as fast as you would think (this, of course, depends on the database).

For example, here is a table with number of unique prefixes as you increase
levels from 1 - 5 (default database):

Level 1: Unique Prefixes: 254
Level 2: Unique Prefixes: 3529
Level 3: Unique Prefixes: 5080
Level 4: Unique Prefixes: 5894
Level 5: Unique Prefixes: 6347

As you can see, the increase is not nearly exponential. As a matter of fact,
as you add more levels, each level increases your memory usage by a smaller and
smaller percentage compared to the previous level's increase.
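
The growth pattern in the table can be made explicit with a little arithmetic. The sketch below uses only the five counts listed above and compares them with the theoretical ceiling of 256^level possible prefixes; the variable names are arbitrary.

```python
# Unique prefix counts per level, copied from the table above (default database).
unique_prefixes = {1: 254, 2: 3529, 3: 5080, 4: 5894, 5: 6347}

# Growth factor of each level relative to the level before it.
growth = {level: unique_prefixes[level] / unique_prefixes[level - 1]
          for level in range(2, 6)}

# Theoretical maximum if every byte combination occurred as a prefix.
ceiling = {level: 256 ** level for level in unique_prefixes}

for level in range(2, 6):
    print(level, round(growth[level], 3), unique_prefixes[level], ceiling[level])
```

The factor drops from about 13.9 (level 1 to 2) to under 1.1 (level 4 to 5), while the ceiling grows by a factor of 256 per level. For this particular database the number of prefixes is therefore bounded by the number of signatures, not by the exponential ceiling. Whether that holds for future databases is the point under dispute in the thread.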

> 3) higher levels brake some polymorphic signatures (eg. W32/Magistr.B,
> W32/Hybris.C), because we don't realize regular expressions in the
> trie

Please elaborate. I think poly viruses (even with short subpatterns) will
be detected just fine.

>
> Please create a _random_ signatures (using /dev/urandom or so) - I
> wonder if your virtual memory will hold a few clamscan process...
>

Now this is where you are not entirely correct. Running scanners with
a random database is a nice benchmark of the WORST case scenario. However,
the virus database is anything but random -- some pattern prefixes
occur more often than others.

Having said that, I ran my benchmarks with a database generated
from /dev/urandom, containing 100000 simple signatures with lengths
from 2 to 2048 bytes, and 10000 poly signatures, each containing a random
number of subpatterns (2-10), with each subpattern 2-2048 bytes
long. The size of the database was 214 MB. I scanned
a 240 MB file without viruses.

Here are the results for default clamav-0.60:

Database   | Mem    | Real Time | Sys Time  | User Time
-----------+--------+-----------+-----------+----------
Default DB | 6 MB   | 0m26.537s | 0m26.080s | 0m0.460s
Random DB  | 275 MB | 6m1.162s  | 5m54.050s | 0m0.870s

To figure out how deep a trie I want for a huge database,
I ran the dbstats.pl program (attached to an earlier email) and got the following results:
Loading virus databes ... done
cl_node_size=1033 totalPatternMemusage=114588591

Level 1:
Unique Prefixes: 256
Avg Linked List Length: 468.75
Avg Linked List Length Descrease: 100.00%
Min Linked List Length: 419.00
Max Linked List Length: 531.00
Memory Usage: 111902.92 KB
Memory Usage Increase: 100.00%
Number of cl_node structures: 257
Level 2:
Unique Prefixes: 54996
Avg Linked List Length: 2.18
Avg Linked List Length Descrease: 99.53%
Min Linked List Length: 1.00
Max Linked List Length: 10.00
Memory Usage: 112162.18 KB
Memory Usage Increase: 0.23%
Number of cl_node structures: 55445
Level 3:
Unique Prefixes: 119561
Avg Linked List Length: 1.00
Avg Linked List Length Descrease: 54.13%
Min Linked List Length: 1.00
Max Linked List Length: 3.00
Memory Usage: 167835.23 KB
Memory Usage Increase: 33.17%
Number of cl_node structures: 175198

This told me that there was NO need to go beyond level 3, because the avg linked list
length was 1.00 after 3 levels (optimal). Also notice that the improvement from Level 2
to Level 3 is not that great in terms of linked list length reduction.
This makes perfect sense: since the database was random and contained 110000 virus
definitions, the first 2 characters would uniquely identify 65K patterns -- more than
half. So, for this database, I would have configured clamav with 2 levels, but just
for the sake of argument, I ran it with 3 levels:

Here are the results:

Database   | Mem    | Real Time | Sys Time  | User Time
-----------+--------+-----------+-----------+----------
Default DB | 11 MB  | 0m7.095s  | 0m6.430s  | 0m0.440s
Random DB  | 275 MB | 6m12.710s | 6m10.350s | 0m1.070s

So, with 3 levels on the random db, clamav performed marginally slower than default clamav,
which makes sense for the reasons described above -- i.e. there was no need to go
to 3 levels.
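
The Level 2 numbers for the random database can be sanity-checked with a standard occupancy estimate. This sketch assumes the patterns' first two bytes are uniformly random, and it infers the total pattern count (120000) from the Level 1 output above (256 prefixes times an average list length of 468.75). Both points are assumptions of the sketch, not statements from the benchmark itself.

```python
def expected_occupied(buckets: int, items: int) -> float:
    """Expected number of non-empty buckets after throwing `items` uniformly at random."""
    return buckets * (1.0 - (1.0 - 1.0 / buckets) ** items)

patterns = round(256 * 468.75)   # inferred from the Level 1 statistics
two_byte_buckets = 256 ** 2      # 65536 possible 2-byte prefixes

estimate = expected_occupied(two_byte_buckets, patterns)
avg_list_length = patterns / 54996   # 54996 unique prefixes reported at Level 2

print(patterns, round(estimate), round(avg_list_length, 2))
```

Under these assumptions the estimate comes out at roughly 55 thousand occupied prefixes, close to the 54996 that dbstats.pl reported, and the implied average list length agrees with the reported 2.18. So two levels already spread a uniformly random database almost as thinly as it can be spread, which is why the third level gained nothing here.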

I will however generate a much larger database of about 500K virus definitions, and
rerun the stats. I have a feeling that 3 levels will run quite a bit nicer.

BTW, the stats were collected from a server running Linux 2.4.19:
Dual 2.7 GHz Xeon with 2 GB ram.

I'm attaching a fixed clamav-0.60 patch if anybody's interested.

--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies?
On Fri, 22 Aug 2003 13:22:10 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> While this memory usage is about 2x the usage on Linux, it's still
> acceptable to me.

But it should not be acceptable to miss the polymorphic viruses I
mentioned.

> The memory usage directly relates to the number of unique prefixes in
> the database of length CL_MIN_LENGTH (i.e. the height of the trie).
> If you think that the memory usage will increase exponentially, you
> are mistaken.

Yevgeniy, please don't take offense, but your observations are obvious to me.
I analyzed this algorithm very deeply and even had a few lectures on it. I
can send you some papers/slides (unfortunately only in Polish) if you
want. Because the algorithm is data-dependent (and the data are very random in
our case) there's no good estimator for the rate of memory growth. So the
word _BRUTAL_ should rather be swapped with _UNPREDICTABLE_ (of course
the upper limit of memory usage can be calculated). This is not
acceptable, because some software, e.g. qmail-scanner, uses a hardcoded
softlimit value which stops clamscan if it exceeds the limit. Please
check the qmail-scanner archives and I hope you will understand the situation.

ClamAV 0.2x had CL_MIN_LENGTH hardcoded to 5 and there were critical
problems (hangups, etc.) after a big database update.

> As you can see, the increase is not nearly exponential. As a matter of
> fact, as you add more levels, you're increase your memory usage by
> smaller, and smaller percentage as compared to previous level
> increase.

Do you really want anti-virus software which consumes 50 MB of your
system's memory?

> > 3) higher levels brake some polymorphic signatures (eg.
> > W32/Magistr.B, W32/Hybris.C), because we don't realize regular
> > expressions in the trie
>
> Please elaborate. I think poly viruses (even with short subpatterns)
> will be detected just fine.

No, they won't. For reasons of speed (and simplicity) we do not implement
regular expression matching in the trie. Consider the signature for
W32/Hybris.C:

4000??????????????????????????83??????75f2e9????ffff00000000

Actually (with CL_MIN_LENGTH = 2) only the first two bytes (4000) are
inside the trie (the rest is kept in a linked list in the
corresponding node). You can guess what increasing the
height will do.
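
To make the concern concrete, here is a simplified Python model, not ClamAV's matcher. A signature is indexed by its first `depth` literal bytes and the remainder is verified byte by byte, with "??" matching any byte. The signature used is a shortened, made-up one in the same hex-and-?? notation, and all function names are hypothetical.

```python
def parse_signature(hex_sig: str) -> list:
    """Turn 'aa??bb' into [0xaa, None, 0xbb]; None marks a wildcard byte."""
    return [None if hex_sig[i:i + 2] == "??" else int(hex_sig[i:i + 2], 16)
            for i in range(0, len(hex_sig), 2)]

def literal_prefix_len(sig: list) -> int:
    """Number of fixed bytes before the first wildcard."""
    count = 0
    for byte in sig:
        if byte is None:
            break
        count += 1
    return count

def matches_at(data: bytes, pos: int, sig: list) -> bool:
    """Verify the whole signature at `pos`, letting wildcards match anything."""
    if pos + len(sig) > len(data):
        return False
    return all(b is None or data[pos + i] == b for i, b in enumerate(sig))

def scan(data: bytes, sig: list, depth: int) -> bool:
    """Locate candidates by the first `depth` bytes, then verify the rest."""
    if literal_prefix_len(sig) < depth:
        raise ValueError("wildcard falls inside the indexed prefix")
    key = bytes(sig[:depth])
    pos = data.find(key)
    while pos != -1:
        if matches_at(data, pos, sig):
            return True
        pos = data.find(key, pos + 1)
    return False

# Shortened, made-up signature: two literal bytes, then wildcards.
sig = parse_signature("4000??????83??75f2e9")
sample = b"junk" + bytes(0x41 if b is None else b for b in sig) + b"tail"

print(literal_prefix_len(sig))
print(scan(sample, sig, depth=2))
```

In this toy model a depth of 2 works, while a depth of 3 cannot index the signature at all, because its third byte is a wildcard. How the real scanner behaves in that situation depends on its implementation and is not shown by this sketch.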

> Now this is where you are not entire correct. Running scanners with
> random database is a nice benchmark of WORST case scenario. However,
> virus database is all but random -- you do have pattern prefixes
> that occur more often the others.

We really _must_ consider the worst scenario. ClamAV is a "mission
critical" application and we cannot depend on our expectations.

Best regards,
Tomasz Kojm
Re: recognizing pats within text bodies?
On [08/23/03 02:00], Tomasz Kojm wrote:
> On Fri, 22 Aug 2003 13:22:10 -0400
> Yevgeniy Miretskiy <eugene@invision.net> wrote:
>
> > While this memory usage is about 2x of usage on Linux, it's still
> > acceptible to me.
>
> But it should not be acceptable to miss the polymorphic viruses I
> mentioned.

See below.

>
> Yevgeniy, please don't offend but your observations are obvious to me.
> I analyzed this algorithm very deeply and even had few lectures on it. I
> can send you some papers/slides (unfortunately only in polish) if you
> want.

Thanks for the offer. I found quite a lot of material on the web, and
studied & discussed the algorithm in detail with other people.

> Because the algorithm is data-dependend (which are very random in
> out case)

Let's clarify something: by data-dependent do you mean that it depends
on the input, or do you mean that it depends on the database?

If you're saying that the algorithm's memory usage/speed
is input-dependent, then I think you're wrong. In the case of a
direct Aho-Corasick implementation, it's always linear. In clamav's case,
it's a bit worse. How much worse depends on the number of patterns
hanging in the linked list off each trie node.

If you're saying that the algorithm's memory usage/speed depends on
the database, then I would agree with you.
But, by saying that the data (i.e. the database) is very random,
are you stating something that's a fact, or something that you
think is a fact? Define random please.
The way I define it, if the data were indeed random,
then every possible prefix of a pattern would be equally likely to appear.
This would mean that a 2-character prefix should yield 65K different patterns.
Is your database like this?

Your input is random, but the database is not. The database is finite in
size, and is constant.

> there's no good estimator for a speed of memory growth.

I think there are pretty good estimators for both. You can estimate
the speed of the algorithm as linear in the input size (not counting loading/unloading
of the trie). You can also estimate memory usage _very_ accurately given
the input pattern database.

> So the word _BRUTAL_ should rather be swapped with _UNPREDICTABLE_ (of course
> the upper limit of memory usage may be calculated). This is not
> acceptable, because some software eg. qmail-scanner uses a hardcoded
> softlimit value which stops clamscan if it exceeds the limit. Please
> check qmail-scanner archives and I hope you will realize the situation.

How exactly is your memory usage unpredictable? I really don't follow
your logic. Not only is it predictable, it can be calculated _EXACTLY_
even before running clamav. Or, if you don't feel like calculating, just
run it and watch top. Clamav memory usage should _never_ go up after
the db is loaded and the first buffer is read.

>
> ClamAV 0.2x had CL_MIN_LENGTH hardcoded to 5 and there were critical
> problems(hangups, etc.) after a big database update.

Well, I guess 0.6 improved quite a bit compared to 0.2.
We had absolutely _no_ problems reloading a 5 level trie with 80K
patterns while running in kernel mode, which is probably a bit
more unforgiving than a userlevel program. We went as far
as 8 levels, just to test things, again without any problems.

>
> > As you can see, the increase is not nearly exponential. As a matter of
> > fact, as you add more levels, you're increase your memory usage by
> > smaller, and smaller percentage as compared to previous level
> > increase.
>
> Do you really want an anti-virys software which consumes 50 MB of your
> system's memory ?

Why NOT? I have 1 process that consumes 50MB. Every modern OS supports
copy on write. I don't have to fork off 50MB for each scanner instance.
Since the root trie is read-only, this 50 MB is loaded ONCE.
I'm sorry, but as I stated before, I think it should be left up to the
user to decide how much RAM can be allocated to clamav.
Besides, if I run dedicated scanner servers with 2 GB of RAM, why the hell shouldn't
I use that RAM for scanning?

>
> > > 3) higher levels brake some polymorphic signatures (eg.
> > > W32/Magistr.B, W32/Hybris.C), because we don't realize regular
> > > expressions in the trie
> >
> > Please elaborate. I think poly viruses (even with short subpatterns)
> > will be detected just fine.
>
> No, they won't. For speed (and simplicity) reason we do not realize
> regular expression matching in the trie. Consider the signature for
> W32/Hybris.C:
>
> 4000??????????????????????????83??????75f2e9????ffff00000000
>
> Actually (with CL_MIN_LENGTH = 2) only the first two bytes (4000) are
> inside of the trie (the rest is being kept in a linked list in a
> corresponding node). You should guess what will the increase of the
> height do.

I'm sorry, but this makes no sense to me.
The first 2 characters (4000) will be used to locate some node on the second
level of the trie. Then the entire pattern will be added to that node's linked
list. The matching will continue the same way whether it's a 2 or 5 level
trie. Very simply, the nodes that contain pattern linked lists are marked
with is_last=1 (the name should probably change).

Why don't you try running the patched clamav with 5 (or however many) levels
on the Hybris.C virus and see if it detects it? I just did -- it detected it
just fine.

>
> > Now this is where you are not entire correct. Running scanners with
> > random database is a nice benchmark of WORST case scenario. However,
> > virus database is all but random -- you do have pattern prefixes
> > that occur more often the others.
>
> We really _must_ consider the worst scenario. ClamAV is a "mission
> critical" application and we cannot depend on our expectations.
>

I'm not sure if I made it clear: I never assumed that I cannot observe
a certain pattern in the input. All I said was that the database is NOT
random (it might seem like it is, but it's really not, because
it is _finite_). So, having a really large, really random database
is a good test case for WORST case memory usage/performance. This has
nothing to do with the input.


--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies?
On Fri, 22 Aug 2003 22:19:47 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> > Because the algorithm is data-dependend (which are very random in
> > out case)
>
> Let's clarify something: by data-dependent do you mean that it depends
> on input? or do you mean that it depends on the database?

Don't make a fool of me, please.

> But, by saying that the data (i.e. the database) is very random,
> are you stating something that's a fact, or is something that you
> think is a fact? Define random please.

Do you think new (not yet existing) virus signatures are predictable?

> Your input is random, but the database is not. The database is finite
> in size, and is constant.

What does "constant" mean here ?

> How exactly is your memory usage is unpredictable? I really don't
> follow your logic. Not only is it predictable, it can be calculated
> _EXACTLY_ even before running clamav. Or, if you don't feel like
> calculating, just run it, and watch top. Clamav memory usage should
> _never_ go up after db is loaded and first buffer is read.

Yevgeniy, you're still writing obvious things. To make you aware of the
problem, imagine the following _real_ situation: we have just received
about 1100 virus samples. Imagine we have just created the signatures.
Now please tell me the exact clamav memory usage with those new
signatures!?

> > Do you really want an anti-virys software which consumes 50 MB of
> > your system's memory ?
>
> Why NOT? I have 1 process that consumes 50MB. Every modern OS
> supports copy on write. I don't have to fork off 50MB for each
> scanner instance.

Every modern OS supports threads. clamd is a multithreaded application
and shares the database between all threads without all that copy on
write trickery, which is de facto non-standard (it derives from System V)
and we cannot depend on it.

> I'm sorry, but this makes not sense to me.
> First 2 characters (4000) will be used to locate some node on the
> second level of the trie. Then entire pattern will be added to that
> nodes linked list. The matching will continue the same way whether
> it's a 2, or 5 level trie. Very simply, the nodes that contain
> pattern linked lists are marked with is_last=1 (the name should
> probably change).
>
> Why don't you try running the patched clamav with 5 (or however many)
> levels on Hybris.C virus and see if it detects it. I just did --
> detected it just fine.

Bullshit !!! Sorry, it seems you don't understand the problem. Please
download the file http://www.mat.uni.torun.pl/~tk/magistr.zip (password:
virus). First thing - I've just realized clamav WILL NOT run with the
level value higher than 2:

clamscan$ ./clamscan
LibClamAV Error: readdb(): Malformed pattern line 10 (file
/usr/local/share/clamav/viruses.db2). ERROR: Too short pattern detected.

You must remove the W32/BadTrans from viruses.db2. Now scan the
oriente.com file from the zip archive with level 2:

zolw@Wierszokleta:/tmp$ clamscan oriente.com
oriente.com: W32/Magistr.B FOUND

and with level 3:

zolw@Wierszokleta:~/tests/Clam/clamscan$ ./clamscan oriente.com
oriente.com: OK

The virus will be available on the website for a week so everyone can
verify I'm right.

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies? [ In reply to ]
On [08/23/03 21:25], Tomasz Kojm wrote:
> On Fri, 22 Aug 2003 22:19:47 -0400

Sorry, forgot to attach script file...
Re: recognizing pats within text bodies? [ In reply to ]
On [08/23/03 21:25], Tomasz Kojm wrote:
> On Fri, 22 Aug 2003 22:19:47 -0400
>
> Don't make a fool of me, please.
>

I wasn't, but I had a feeling you did (by assuming that
nobody else could find details on the algorithm).

>
> Yevgeniy, you're still writing obvious things. To make you aware of the
> problem, imagine the following _real_ problem: we have just received
> about 1100 virus samples. Imagine we have just created the signatures.
> Now please tell me the exact clamav memory usage with those new
> signatures!?
>

No, you are writing obvious things. You obviously do not want
to listen or to try anything. Well, here is a step-by-step
algorithm for computing the exact memory usage of the trie
(just in case you did not want to take a look at dbstats.pl):

total_size = sizeof(struct cl_node);   /* root node usage */
level = 2;
while (level <= CL_MIN_LENGTH) {
    total_size = total_size +
        (total number of unique prefixes of size (level - 1)) * sizeof(struct cl_node);
    level = level + 1;
}

Of course I'm not accounting for the memory used by the pattern linked lists --
this is a cost you pay regardless of tree depth.
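
The loop above can be sketched in Python as a rough illustration. This is not dbstats.pl and not libclamav code: the function name, the sample signatures, and the NODE_SIZE value are all assumptions made up for the example. It counts one root node plus one node per unique prefix of length 1 .. depth-1, exactly as the pseudocode does, and ignores the pattern linked lists.

```python
def estimate_trie_bytes(patterns, min_length, node_size):
    """Follow the loop from the email: one root node plus one node for
    every unique prefix of length 1 .. min_length - 1."""
    total = node_size  # root node
    for level in range(2, min_length + 1):
        prefix_len = level - 1
        # Only patterns long enough to have a prefix of this length count.
        prefixes = {p[:prefix_len] for p in patterns if len(p) >= prefix_len}
        total += len(prefixes) * node_size
    return total


# Hypothetical signatures (raw bytes), for illustration only.
signatures = [
    bytes.fromhex("4000aabbcc"),
    bytes.fromhex("4000aabbdd"),
    bytes.fromhex("4001ffee00"),
    bytes.fromhex("e804720000"),
]
NODE_SIZE = 1040  # assumed bytes per node; NOT taken from libclamav

if __name__ == "__main__":
    for depth in (2, 3, 5):
        print(depth, estimate_trie_bytes(signatures, depth, NODE_SIZE))
```

Running the same function over a real signature database before and after an update would give the before/after figures the thread is arguing about, for whatever node size the build actually uses.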

>
> Every modern OS supports threads. clamd is a multithreaded application
> and shares the database between all threads without all that copy on
> write trickery, which is defacto non standard (derives from System V)
> and we cannot depend on it.

And will a thread duplicate the 50 MB database or reuse it?
My point was that even in a forking server, your database would take
50 MB on a modern OS regardless of the number of forked instances running.
What's your point here?

> Bullshit !!! Sorry, it seems you don't understand the problem. Please
> download the file http://www.mat.uni.torun.pl/~tk/magistr.zip (password:
> virus). First thing - I've just realized clamav WILL NOT run with the
> level value higher than 2:

You are imagining things -- and not proving them.

>
> clamscan$ ./clamscan
> LibClamAV Error: readdb(): Malformed pattern line 10 (file
> /usr/local/share/clamav/viruses.db2). ERROR: Too short pattern detected.
>
> You must remove the W32/BadTrans from viruses.db2. Now scan the
> oriente.com file from the zip archive with level 2:

I did not get any such error.

>
> zolw@Wierszokleta:/tmp$ clamscan oriente.com
> oriente.com: W32/Magistr.B FOUND
>
> and with level 3:
>
> zolw@Wierszokleta:~/tests/Clam/clamscan$ ./clamscan oriente.com
> oriente.com: OK

No, THIS IS BULLSHIT. See the attached script file, which CLEARLY
shows that magistr was detected FINE with 3 levels.

I'm curious, Tomasz: did you even BOTHER applying the patch, or
are you just in a bullshitting mood?

>
> The virus will be available on the website for a week so everyone can
> verify I'm right.
>

You can keep it as long as you want.

--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies? [ In reply to ]
On Sun, 24 Aug 2003 09:08:45 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> > Yevgeniy, you're still writing obvious things. To make you aware of
> > the problem, imagine the following _real_ problem: we have just
> > received about 1100 virus samples. Imagine we have just created the
> > signatures. Now please tell me the exact clamav memory usage with
> > those new signatures!?
> >
>
> No, you are writing obvious things. You obviously do not want
> to listen or to try anything. Well, here is a step-by-step
> algorithm for computing the exact memory usage of the trie
> (just in case you did not want to take a look at dbstats.pl):

YOU DON'T UNDERSTAND THE PROBLEM, COMPLETELY. I want you to answer
the question above. I want the exact number of bytes it will use
after that "virtual update".

> You are imagining things -- and not proving them.
>
> >
> > clamscan$ ./clamscan
> > LibClamAV Error: readdb(): Malformed pattern line 10 (file
> > /usr/local/share/clamav/viruses.db2). ERROR: Too short pattern
> > detected.
> >
> > You must remove the W32/BadTrans from viruses.db2. Now scan the
> > oriente.com file from the zip archive with level 2:
>
> I did not get any such error.

???!!!!! Do you really use clamav from clamav.elektrapro.com ???

> No, THIS IS BULLSHIT. See the attached script file, which CLEARLY
> shows that magistr was detected FINE with 3 levels.

I'm not going to check that script. Please read the libclamav/matcher.c
file very carefully and maybe you will understand why it _must_ fail.
In our implementation the regular expressions in the trie CAN'T WORK.

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies? [ In reply to ]
On [08/24/03 15:40], Tomasz Kojm wrote:
> On Sun, 24 Aug 2003 09:08:45 -0400
>
> YOU DON'T UNDERSTAND THE PROBLEM, COMPLETELY. I want you to answer
> the question above. I want the exact number of bytes it will use
> after that "virtual update".

This is getting pointless. Give me the database, and I'll tell
you the memory usage.

>
> ???!!!!! Do you really use clamav from clamav.elektrapro.com ???
>

Yes, I'm using clamav-0.60 from clamav.elektrapro.com with the
patch from my previous email applied. Did you apply the patch
before reporting the error? I think you just changed CL_MIN_LENGTH
and expected things to work.

> > No, THIS IS BULLSHIT. See the attached script file, which CLEARLY
> > shows that magistr was detected FINE with 3 levels.
>
> I'm not going to check that script. Please read the libclamav/matcher.c
> file very carefully and maybe you will understand why it _must_ fail.
> In our implementation the regular expressions in the trie CAN'T WORK.
>

Just like I thought: in essence, you're saying: I know it CAN'T WORK, so
I'm not going to bother applying the patch, and I'm simply going to make
baseless, unverified, and untrue statements, and pretend they are true.
Great logic -- way to go. It's always nice to talk to somebody who is
open to ideas and suggestions.
BTW, I did read matcher.c very carefully, unlike you, who did not even
bother to look at the patch (this is pretty obvious to me at this point).
You asked me not to make a fool of you. I'd appreciate it if you did not
take me for an idiot who did not do his research before posting.
Pointing out my mistakes is fine, making baseless statements is not.

--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies? [ In reply to ]
On Sun, 24 Aug 2003 18:20:36 -0400
Yevgeniy Miretskiy <eugene@invision.net> wrote:

> On [08/24/03 15:40], Tomasz Kojm wrote:
> > On Sun, 24 Aug 2003 09:08:45 -0400
> >
> > YOU DON'T UNDERSTAND THE PROBLEM, COMPLETELY. I want you to answer
> > the question above. I want the exact number of bytes it will use
> > after that "virtual update".
>
> This is getting pointless. Give me the database, and I'll tell
> you the memory usage.

No, that's the problem I tried to explain to you: you can't calculate
the exact memory usage "a priori".

> >
> > ???!!!!! Do you really use clamav from clamav.elektrapro.com ???
> >
>
> Yes, I'm using clamav-0.60 from clamav.elektrapro.com with the
> patch from my previous email applied. Did you apply the patch
> before reporting the error? I think you just changed CL_MIN_LENGTH
> and expected things to work.

Sorry, but your patch only breaks things. It removes the important
condition on minimal signature length.

> Just like I thought: in essence, you're saying: I know it CAN'T WORK,
> so I'm not going to bother applying the patch, and I'm simply going to
> make baseless, unverified, and untrue statements, and pretend they
> are true. Great logic -- way to go. It's always nice to talk to
> somebody who is open to ideas and suggestions.
> BTW, I did read matcher.c very carefully, unlike you, who did not even
> bother to look at the patch (this is pretty obvious to me at this point).

No, you're not right. I've applied your patch and the Magistr virus in
that file was not detected. I hope someone following this thread can
check it (at least I will ask other ClamAV developers) and confirm my
test.

> Pointing out my mistakes is fine, making baseless statements is not.

OK, I will describe the problem to you:

The Magistr signature is:

W32/Magistr.B=0000??2e??????????0000ed????0000????0000????0000????00000
000000000??0000*e804720000

In cl_hex2str() [libclamav/str.c, lines 63-64] the question marks ?? are
translated into CLI_IGN. Now with a level _higher_ than 2, in
cli_addpatt() CLI_IGN goes into the trie "as a node". Because it's not
possible to match CLI_IGN while travelling the trie, the Magistr pattern
will never be matched... That's why I haven't read your last attachment.
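
The failure described above can be reproduced with a small Python sketch. It is a toy model, not libclamav: IGN stands in for CLI_IGN, the helper names are invented, and the signature "0000??2e" is a shortened, hypothetical pattern (the real Magistr signature also contains a '*' that this sketch does not handle). Patterns are indexed by their first `depth` symbols, and the scanner walks the trie using literal data bytes only.

```python
IGN = -1  # stand-in for CLI_IGN (a "??" wildcard byte); the value is an assumption


def parse_sig(hexsig):
    """Turn a hex signature such as '0000??2e' into a list of ints,
    mapping each '??' pair to IGN."""
    pairs = [hexsig[i:i + 2] for i in range(0, len(hexsig), 2)]
    return [IGN if p == "??" else int(p, 16) for p in pairs]


def build_trie(patterns, depth):
    """Index every pattern under its first `depth` symbols. A wildcard in
    the prefix becomes an ordinary edge labelled IGN."""
    root = {}
    for name, pat in patterns.items():
        node = root
        for sym in pat[:depth]:
            node = node.setdefault(sym, {})
        node.setdefault("patterns", []).append((name, pat))
    return root


def matches_at(pat, data, pos):
    """Linear comparison that does honour wildcards."""
    if pos + len(pat) > len(data):
        return False
    return all(s == IGN or s == data[pos + i] for i, s in enumerate(pat))


def scan(data, trie, depth):
    """Walk the trie with literal data bytes, then verify the candidates."""
    found = set()
    for pos in range(len(data)):
        node = trie
        for byte in data[pos:pos + depth]:
            node = node.get(byte)
            if node is None:
                break  # no edge for this byte: an IGN edge is never taken
        else:
            for name, pat in node.get("patterns", []):
                if matches_at(pat, data, pos):
                    found.add(name)
    return found


sigs = {"Demo.Sig": parse_sig("0000??2e")}
data = bytes.fromhex("ff00007f2eff")

if __name__ == "__main__":
    for depth in (2, 3):
        print(depth, scan(data, build_trie(sigs, depth), depth))
```

With depth 2 the prefix 00 00 holds no wildcard, so the pattern is reached and verified. With depth 3 the third edge is labelled IGN, no data byte ever equals it, and the pattern is silently skipped.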

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies? [ In reply to ]
Sorry it took so long to answer, I was on vacation.

On [08/25/03 03:21], Tomasz Kojm wrote:
> On Sun, 24 Aug 2003 18:20:36 -0400
> > This is getting pointless. Give me the database, and I'll tell
> > you the memory usage.
>
> No, that's the problem I tried to explain to you: you can't calculate
> the exact memory usage "a priori".
>

Why would you need to know the memory usage a priori? You can reevaluate
your memory requirements once in a while. If you decided on a level 5 trie,
adding 1100 signatures might not even make a difference. But as a rule
of thumb, when you upgrade, you should examine your database to see
if you need to adjust the number of levels to get the best performance...

>
> Sorry, but your patch only breaks things. It removes the important
> condition on minimal signature length.
>

Yup, it does remove this condition. But I would rather call it a limitation.

>
> No, you're not right. I've applied your patch and the Magistr virus in
> that file was not detected. I hope someone following this thread can
> check it (at least I will ask other ClamAV developers) and confirm my
> test.
>
> > Pointing out my mistakes is fine, making baseless statements is not.
>
> OK, I will describe the problem to you:
>
> The Magistr signature is:
>
> W32/Magistr.B=0000??2e??????????0000ed????0000????0000????0000????00000
> 000000000??0000*e804720000
>
> In cl_hex2str() [libclamav/str.c, lines 63-64] the question marks ?? are
> translated into CLI_IGN. Now with a level _higher_ than 2, in
> cli_addpatt() CLI_IGN goes into the trie "as a node". Because it's not
> possible to match CLI_IGN while travelling the trie, the Magistr pattern
> will never be matched... That's why I haven't read your last attachment.
>

Point taken -- it's a bug in my patch.
But, you know what they say in the movies: "Never say never" :)
So, a new patch is attached which correctly identifies magistr.b without
the requirement to have at most 2 levels.
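
The attached patch is not reproduced in this archive, so the following Python sketch only shows one plausible way a trie walk could handle wildcards; it should not be read as what the patch actually does. IGN stands in for CLI_IGN, and all names and the short demo signature are assumptions. At each level the walk follows both the edge for the literal data byte and, if present, the IGN edge, which accepts any byte.

```python
IGN = -1  # assumed stand-in for CLI_IGN


def build_trie(patterns, depth):
    """Index patterns by their first `depth` symbols. Assumes every
    pattern is at least `depth` symbols long."""
    root = {}
    for name, pat in patterns.items():
        node = root
        for sym in pat[:depth]:
            node = node.setdefault(sym, {})
        node.setdefault("patterns", []).append((name, pat))
    return root


def candidates(node, data, pos, remaining):
    """Yield (name, pattern) pairs reachable from `node`, following both the
    literal-byte edge and the IGN edge at every level."""
    if remaining == 0:
        yield from node.get("patterns", [])
        return
    if pos >= len(data):
        return
    for key in (data[pos], IGN):
        child = node.get(key)
        if child is not None:
            yield from candidates(child, data, pos + 1, remaining - 1)


def scan(data, trie, depth):
    """Collect candidates via the wildcard-aware walk, then verify each one
    with a full comparison that also honours IGN."""
    found = set()
    for pos in range(len(data)):
        for name, pat in candidates(trie, data, pos, depth):
            end = pos + len(pat)
            if end <= len(data) and all(
                s == IGN or s == b for s, b in zip(pat, data[pos:end])
            ):
                found.add(name)
    return found


sigs = {"Demo.Sig": [0x00, 0x00, IGN, 0x2E]}  # hypothetical short signature
data = bytes.fromhex("ff00007f2eff")

if __name__ == "__main__":
    for depth in (2, 3, 4):
        print(depth, scan(data, build_trie(sigs, depth), depth))
```

The trade-off in this approach is that each wildcard edge can double the number of paths explored at that level, so deep tries over wildcard-heavy prefixes cost more per scanned byte.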



--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: recognizing pats within text bodies? [ In reply to ]
> Point taken -- it's a bug in my patch.
> But, you know what they say in the movies: "Never say never" :)
> So, a new patch is attached which correctly identifies magistr.b without
> the requirement to have at most 2 levels.

This patch looks better (at least more complicated ;)) than the last one.
I'm very busy now but will check it soon. Thanks.

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies? [ In reply to ]
> Point taken -- it's a bug in my patch.
> But, you know what they say in the movies: "Never say never" :)
> So, a new patch is attached which correctly identifies magistr.b without
> the requirement to have at most 2 levels.

This patch is OK. You have added the support (missing in the previous patch)
for proper handling of CLI_IGN. I didn't test it but I'm sure it will work
just fine - good work. Your patch will be applied just after the 0.61
release. Thanks.

Best regards,
Tomasz Kojm
--
oo ..... zolw@konarski.edu.pl
(\/)\......... http://www.konarski.edu.pl/~zolw
\..........._ I nie zapomnij kliknac w brzuszek...
//\ /\\ <- C. Amboinensis www.pajacyk.pl
Re: recognizing pats within text bodies? [ In reply to ]
On [09/04/03 19:15], Tomasz Kojm wrote:
>
> This patch is OK. You have added the support (missing in the previous patch)
> for proper handling of CLI_IGN. I didn't test it but I'm sure it will work
> just fine - good work. Your patch will be applied just after the 0.61
> release. Thanks.
>

Thanks -- glad it worked out in the end.

Since the last patch I made some additions:
1. enabled the trie depth to be specified on the command line (default depth = 2)
2. modified struct cl_node so that the level member is unsigned char instead of
unsigned short int.
Are you interested in these changes?



--
Eugene Miretskiy <eugene@invision.net>
INVISION.COM, INC. (631) 543-1000
www.invision.net / www.longisland.com
