[Kolab-devel] Documentation Review: Combating Spam

Soliva Andrea comcept.ch User soliva at comcept.ch
Fri Dec 16 09:56:16 CET 2011



Hi  

One hint that will probably help: as you correctly wrote, the Bayes database only takes effect once it holds at least 200 entries (by default SpamAssassin wants at least 200 spam and 200 ham messages learned). To get a good starting corpus, you can use the following example emails:

http://spamassassin.apache.org/publiccorpus/ 

Extract the archives, deliver the messages to the ham or spam mail folder, and run sa-learn. I did this for every customer and it works really well; it gives you a good corpus to start with. This is only a hint on how to fill the Bayes database for the first time. The README for these files is quoted below.
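The training step described above could be sketched roughly as follows. This is only an illustration, not the exact procedure used on the customer systems: it assumes the archives were extracted into one directory whose subdirectories keep the corpus names from the README (spam, spam_2, easy_ham, easy_ham_2, hard_ham), and the helper names `sa_learn_commands` and `bayes_counts` are made up for this example. The only sa-learn options used are `--spam`, `--ham` and `--dump magic`; the column layout assumed for the `--dump magic` output should be checked against your own installation.

```python
from pathlib import Path


def sa_learn_commands(corpus_root):
    """Return one sa-learn command (as an argv list) per extracted corpus directory.

    Assumption: directory names follow the README's corpus names, so anything
    starting with "spam" is spam and anything containing "ham" is ham.
    """
    commands = []
    for directory in sorted(Path(corpus_root).iterdir()):
        if not directory.is_dir():
            continue
        name = directory.name
        if name.startswith("spam"):
            flag = "--spam"
        elif "ham" in name:
            flag = "--ham"
        else:
            continue  # unknown directory: skip it rather than guess
        commands.append(["sa-learn", flag, str(directory)])
    return commands


def bayes_counts(dump_magic_output):
    """Pick the nspam/nham counters out of the text of `sa-learn --dump magic`.

    Assumption: each line ends with the counter name and carries its value
    in the third whitespace-separated column.
    """
    counts = {}
    for line in dump_magic_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[-1] in ("nspam", "nham"):
            counts[fields[-1]] = int(fields[2])
    return counts
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` as the user that owns the Bayes database. Afterwards, feeding the output of `sa-learn --dump magic` to `bayes_counts` shows whether both counters have passed 200.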

Not that I am worried about it, but all of this was documented in the Kolab wiki for about two years, including your material and much more. It seems that my page in the wiki is gone (I do not know why). If you search Google for "kolab ; solaris" you will still find the link, but as mentioned the page is no longer there (http://wiki.kolab.org/Solaris). The page was actually written for Solaris, but I also used it on CentOS without any problems, because OpenPKG is OpenPKG and it does not really matter which Unix you run it on: within OpenPKG the commands are always the same.

 --------------- http://spamassassin.org/publiccorpus/README.txt ---------------  

 Welcome to the SpamAssassin public mail corpus.  This is a selection of mail
 messages, suitable for use in testing spam filtering systems.  Pertinent
 points:

   - All headers are reproduced in full.  Some address obfuscation has taken
     place, and hostnames in some cases have been replaced with
     "spamassassin.taint.org" (which has a valid MX record).  In most cases
     though, the headers appear as they were received.

   - All of these messages were posted to public fora, were sent to me in the
     knowledge that they may be made public, were sent by me, or originated as
     newsletters from public news web sites.

   - relying on data from public networked blacklists like DNSBLs, Razor, DCC
     or Pyzor for identification of these messages is not recommended, as a
     previous downloader of this corpus might have reported them!

   - Copyright for the text in the messages remains with the original senders.

 OK, now onto the corpus description.  It's split into three parts, as follows:

   - spam: 500 spam messages, all received from non-spam-trap sources.

   - easy_ham: 2500 non-spam messages.  These are typically quite easy to
     differentiate from spam, since they frequently do not contain any spammish
     signatures (like HTML etc).

   - hard_ham: 250 non-spam messages which are closer in many respects to
     typical spam: use of HTML, unusual HTML markup, coloured text,
     "spammish-sounding" phrases etc.

   - easy_ham_2: 1400 non-spam messages.  A more recent addition to the set.

   - spam_2: 1397 spam messages.  Again, more recent.

 Total count: 6047 messages, with about a 31% spam ratio.

 The corpora are prefixed with the date they were assembled.  They are
 compressed using "bzip2".  The messages are named by a message number and
 their MD5 checksum.

 The "obsolete" dir contains old versions of the corpus, for reference,
 in case you need to correlate test results using these older versions
 against the source messages.  The messages in those corpora are generally
 included in the fresher corpora.

 This corpus lives at http://spamassassin.org/publiccorpus/ .  Mail
 jm - public - corpus AT jmason dot org if you have questions, or to donate
 mail.

 (Apr 23 2003 jm)

 --------------- http://spamassassin.org/publiccorpus/README.txt --------------- 
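
As a quick check of the figures in the README, the per-directory counts can be added up and compared with the 200-entry threshold mentioned at the top of this message. This is plain arithmetic on the numbers quoted above; the variable names are chosen only for this example.

```python
# Message counts copied from the README quoted above.
corpus = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}

BAYES_MINIMUM = 200  # the threshold mentioned at the top of this message

total = sum(corpus.values())
spam = sum(count for name, count in corpus.items() if name.startswith("spam"))
ham = total - spam
spam_ratio = spam / total

# Both classes must clear the threshold for the corpus to be sufficient alone.
enough_to_start = spam >= BAYES_MINIMUM and ham >= BAYES_MINIMUM

print(total, spam, ham, round(spam_ratio * 100, 1), enough_to_start)
```

The sums agree with the README (6047 messages, roughly 31% spam), and both the spam and the ham side are far above 200, so the corpus alone is enough to get the Bayes database started.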

kind regards 

Andrea 

Quoting "Jeroen van Meeuwen (Kolab Systems)" <vanmeeuwen at kolabsys.com>:

> Largely inspired by the existing article on the wiki[1], and some new
> requirements that I learned of, I've created some documentation on the
> subject of combating spam[2].
>
> It's not done yet, in that no fishing-for-spam and no safety-net for
> discarded messages is documented, but I would appreciate your feedback
> on what's in there so far.
>
> [1] http://wiki.kolab.org/Fighting_spam
> [2] http://hosted.kolabsys.com/~vanmeeuwen/kolab-docs/en-US/Kolab_Groupware/2.4/html/Administrator_Guide/chap-Administrator_Guide-Combating_Spam.html
>
> Kind regards,
>
> Jeroen van Meeuwen
>
> --
> Senior Engineer, Kolab Systems AG
>
> e: vanmeeuwen at kolabsys.com
> t: +44 144 340 9500
> m: +44 74 2516 3817
> w: http://www.kolabsys.com
>
> pgp: 9342 BF08

Kind regards

Andrea Soliva

soliva at comcept.ch 