global sa-learn

Adam Katz kolab at khopis.com
Thu Feb 7 01:48:33 CET 2008


Do people actively use the sa-learn directions from this page:
http://wiki.kolab.org/index.php/Fighting_spam ?

"sa-learn ...shared^spam/[1-9]*" looks extremely unsafe.  I just took
a manual look at my shared junk folder and saw that the numbered files
sa-learn is told to operate on include messages that have been
deleted: somebody wrongly filed them there and later took them out
again, but either the folder was never expunged or the cyrus spool
hasn't finished clearing them yet.

I've also found from experience that users tend to never use ham
folders.  I see relying on a ham folder as the wrong approach; since
all mail is either ham or spam, anything NOT filed into a folder like
"trash" or "spam" should be learned as ham.

In this proposal, false negatives would get moved into the spam folder
and re-learned as spam (this overrides previous learning of that
message).  The only issue is that some users delete false negatives
rather than moving them to the learning folder.  Those messages get
learned as ham and never corrected, but that only creates more false
negatives, not more false positives.  It's okay to get a few pieces
of spam in your inbox (false negatives); it is not okay to get ham
deleted automatically (false positives).
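
A minimal sketch of that folder rule in shell.  The function name
classify_folder and the name patterns are assumptions made for
illustration, not something Kolab or SpamAssassin provides:

```shell
# Hypothetical helper: map a folder name to a learning action.
# The patterns below are assumptions; adjust them to your folder names.
classify_folder() {
  # Lowercase the name so "Spam", "SPAM" and "spam" are treated alike.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *spam*|*junk*) echo spam ;;  # explicitly filed as spam
    *trash*)       echo skip ;;  # discarded mail is not learned at all
    *)             echo ham  ;;  # everything else counts as ham
  esac
}

classify_folder "INBOX"   # prints: ham
classify_folder "Junk"    # prints: spam
classify_folder "Trash"   # prints: skip
```

Keeping the rule in one function means the folder names only have to
be maintained in a single place, whatever ends up walking the spool.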

How do I implement this?  I don't want to learn messages marked for
deletion, and I want to select folders by name.  I need to be able to
read my users' mail, too.  This was a lot easier on our old mail
server: it used mbox folders, so I could just have root's cron run
a script like this (warning, this is a hypothetical example):

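  # Learn every mbox as ham unless its path suggests spam or trash.
  # (GNU grep: -z reads NUL-separated names, matching find -print0.)
  find /home/*/mail -type f -print0 \
    | grep -Eziv 'spam|junk|trash|delet' \
    | xargs -0 -r sa-learn --ham --mbox
  # Learn each user's teach-spam mbox as spam, then empty it.
  for mbox in /home/*/mail/teach-spam; do
    [ -f "$mbox" ] || continue
    sa-learn --spam --mbox "$mbox" && : >"$mbox"
  done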
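
Since the name filter decides what gets learned as ham, it may be worth
listing what it lets through before handing anything to sa-learn.  A
dry-run sketch, assuming GNU grep and tr; list_ham_candidates is a
hypothetical helper and the directory argument is a placeholder:

```shell
# Dry run: print the files that would be learned as ham, one per line.
# list_ham_candidates is a hypothetical helper; assumes GNU grep (-z).
list_ham_candidates() {
  # Work on relative paths so the filter only sees names below the
  # given directory, not the directory's own name.
  ( cd "$1" && find . -type f -print0 ) \
    | grep -Eziv 'spam|junk|trash|delet' \
    | tr '\0' '\n' \
    | sort
}

# Example (placeholder path):
#   list_ham_candidates /home
```

Note that the pattern is matched against the whole relative path, so a
user directory whose name happens to contain "spam" or "junk" would be
skipped entirely.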

Maildir is proving far more challenging here.  I'm beginning to think
it will be easiest to use imapsync to push it to an mbox-powered
server and run the above script...

IMAP Spam Begone and sa-learn-cyrus both look like good starts, but
they seem rather hackish and they don't have mechanisms for learning
somebody's entire mail history.

One more note:  I run SpamAssassin on a non-kolab server.  The kolab
box has an installation for failover purposes, but the ideal solution
would be to train the spam relay and then copy over the new db.
Unfortunately, sa-learn doesn't support the -d flag found in spamc.
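
One possible workaround, as a sketch only: spamc has a -L (learn type)
option that sends a message to spamd for learning, provided the remote
spamd was started with --allow-tell.  The wrapper name learn_remote,
the SPAMC override and the host name below are hypothetical, and I have
not tried this against a Kolab setup:

```shell
# learn_remote HOST KIND FILE
# Hypothetical wrapper: feed one message to a remote spamd for learning.
# KIND is spam, ham or forget.  Assumes spamd runs with --allow-tell.
learn_remote() {
  host=$1 kind=$2 msg=$3
  # SPAMC can be overridden for a dry run, e.g. SPAMC=echo.
  # spamc reports the learning result through its exit code, so check
  # its man page before treating a non-zero status as a failure.
  ${SPAMC:-spamc} -d "$host" -L "$kind" < "$msg"
}

# Example (placeholder host and file):
#   learn_remote relay.example.com spam /path/to/message
```

This learns one message per call, so it is slower than sa-learn on a
whole folder, but it trains the relay's database directly.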

PS:  Yes, I use global bayes db, global AWL, etc.



