Spam/Ham

Mon Dec 31 12:23:02 CET 2012

On 2012-12-30 01:42, Stefan Froehlich wrote:
> Personally I don't like the way to handle spam and ham described in the
> documentation. No user is willing to send real spam to the spam folder
> and real ham to the ham folder.

Actually, a "Mark as Junk" button is available in Roundcube (enable the 
"markasjunk" plugin). This button moves a message to a previously 
defined, user-configurable junk folder ("Spam" by default).

That said, the documentation so far has been written from an 
administrator's perspective. The required user interaction has not been 
thoroughly examined, and as such neither the flow for the user nor the 
user's expectations have necessarily been subjected to the right amount 
of scrutiny.

> I'd like to implement another approach which is easier for the user. I'd
> like to learn all mails in the user's spam folder as spam and all mails
> not in the spam folder as ham. I think this is the natural intuitive way
> for a user.

I would love to consider what is a more intuitive work-flow for a 
regular user, but what I've found to be possible with the pieces in 
place tends to require a compromise somewhere:

- One could set delete_mode and expunge_mode back to 'immediate', but 
this compromises the ability to ensure all message(-file)s ever in any 
mailbox are also included in at least one backup.

- One could maintain delete_mode and expunge_mode set to 'delayed' and 
only learn spam and ham after:

   1) a ((virtual?) full?) backup has completed,
   2) message files for expunged messages are deleted from the 
filesystem (by running cyr_expire -D 0s -X 0s -E 3?)

     2a) I've found that it is not possible to run, for example, 
cyr_expire -E 3 -X 0s user/john.doe/Spam@example.org, which may be a bug 
in the software but is the status quo nonetheless.

     2b) I've found that it is particularly hard, in real life, to 
recognize which folder a user believes contains (or is to contain) the 
messages that user considers real spam, unless semi-strict defaults are 
offered and only a limited number of options are offered to change the 
spam folder. My point is localization: recognizing the different names 
a user may give the folder (in any locale), capitalization and 
case-sensitivity. "The problem is choice", if you will, though we do 
have the annotation '/vendor/kolab/folder-type = mail.junkemail'. Also, 
users will press "Delete" for spam messages and press "Mark as Junk" for 
newsletters they have subscribed to themselves.
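
To illustrate point 2b, here is a minimal Python sketch of how a learning 
script might decide whether a folder is the user's junk folder. It 
prefers the folder-type annotation and only falls back to a name match. 
The function name, the list of folder names and the way the annotation 
value reaches the function are all assumptions made for illustration; 
they are not part of Kolab or Cyrus IMAP.

```python
# Value of the /vendor/kolab/folder-type annotation for a junk folder.
JUNK_FOLDER_TYPE = "mail.junkemail"

# Assumed, deliberately incomplete list of names a junk folder may carry.
KNOWN_JUNK_NAMES = {"spam", "junk", "junk e-mail", "junk email", "bulk mail"}


def is_junk_folder(name, folder_type=None):
    """Return True if the folder looks like the user's junk folder.

    The annotation wins when it is known; the folder name is only a
    fallback, compared case-insensitively and without the domain part.
    """
    if folder_type is not None:
        return folder_type.strip().lower() == JUNK_FOLDER_TYPE
    leaf = name.rsplit("/", 1)[-1].split("@", 1)[0]
    return leaf.strip().casefold() in KNOWN_JUNK_NAMES
```

A name list like this can never cover every locale, which is exactly why 
the annotation is the more reliable signal whenever it is set.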

> I created a sieve filter which moves all mails tagged as X-Spam-Flag:
> YES to the user's spam folder. Also if a user sees a message in the spam
> folder which he thinks is not spam he simply moves it out from there to
> the inbox (or a subfolder).

With later versions of Kolab, we'll have a feature that is called 
"Sieve Script Management" - an administrator can then specify a set of 
MUST-HAVE rules for a user, under KEP #14[1].

This will allow an administrator to make sure the user's sieve scripts 
are preceded by a "managed" segment, which may contain, for example, a 
'fileinto "Spam";' action.

> All these messages should be learned as ham.
> I started writing a bash script handle all these things. The idea is (to
> increase speed) to insert another X- Tag into the mail, let's say
> X-<ServerName>-Learned-As: and the possible values are spam or ham. I
> introduced this for performance reasons.

I would recommend *not* changing the contents of the email (on the 
filesystem or otherwise, in fact):

- Learning spam / ham will remember the tokens learned, and as such an 
email will not be "learned" twice.

- Using the filesystem ctime/mtime for a message is, I think, a more 
appropriate approach (i.e. "only learn messages that are 'new' since $x 
days").

- After Spam / Ham is learned, the folder can be pruned of its contents 
using /usr/lib/cyrus-imapd/ipurge.
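
The mtime-based selection could be sketched in Python roughly as follows. 
This is an illustration under assumptions rather than a tested 
procedure: it assumes message files in the spool directory are named 
"<UID>." (digits followed by a dot), and the function name and 
parameters are made up for this example.

```python
import os
import re
import time

# Assumption: message files in a spool directory are named "<UID>.".
MESSAGE_FILE = re.compile(r"^\d+\.$")


def recent_message_files(folder_path, max_age_days, now=None):
    """Return message files in folder_path modified in the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    selected = []
    for entry in os.scandir(folder_path):
        # Skip index, header and cache files; only "<UID>." files count.
        if not entry.is_file() or not MESSAGE_FILE.match(entry.name):
            continue
        if entry.stat().st_mtime >= cutoff:
            selected.append(entry.path)
    return sorted(selected)
```

Because nothing is written back to the message files, the spool stays 
untouched and no reconstruct is needed afterwards.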

> Now I ran into several problems (All is on Debian Wheezy):
> 
> 1) A folder does not necessarily contains only valid undeleted mail
> files. Let's say a user moves some mails out of the spam folder the mail
> files are still in the spam folder. I can't see a way how to distinguish
> between real mail files and those that have been deleted already but not
> deleted from the filesystem yet.
> 

This problem is two-fold:

1) a client application only needs to flag the messages as \Deleted, 
but does not need to issue an EXPUNGE to the folder, and

2) Individual message files (that correspond to messages previously 
flagged as \Deleted and in folders on which an EXPUNGE has indeed been 
issued) are not immediately deleted from the filesystem.
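
One way to sidestep both points, offered here only as a sketch, is to 
ask the IMAP server which messages are still live instead of reading the 
spool directory. The function below uses Python's imaplib calling 
conventions (select and a UID SEARCH for UNDELETED). The function name 
is hypothetical, and the connection is expected to be authenticated 
already with sufficient rights on the user's folder.

```python
def live_message_uids(conn, folder):
    """Return the UIDs of messages in folder that are not flagged as deleted.

    conn is expected to behave like an authenticated imaplib.IMAP4 object.
    Expunged messages are not listed by the server at all, so they never
    show up here even if their files are still on the filesystem.
    """
    status, _ = conn.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError("cannot select %r" % folder)
    status, data = conn.uid("SEARCH", None, "UNDELETED")
    if status != "OK":
        raise RuntimeError("search failed in %r" % folder)
    return [int(uid) for uid in (data[0] or b"").split()]
```

If message files are indeed named after their UID (an assumption), the 
returned list could be used to filter the files found in the spool.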

The way that I myself (therefore) learn Spam (and Ham) is to first 
learn Spam, and then learn Ham on top of that. My understanding is that 
when a message (or rather, its tokens) has first been learned as Spam 
but is then learned as Ham, the learner should forget what it had 
learned and learn the message as Ham.

The resulting instruction to the user is then to move or copy messages 
in the Spam folder that are not actually spam to the Ham folder (aside 
from, perhaps, also copying the message back to an INBOX or any other 
folder).
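
The "Spam first, Ham on top" order can be sketched as below. The learner 
is assumed to be SpamAssassin's sa-learn with its --spam and --ham 
options, which this thread does not state explicitly; the function names 
and file lists are hypothetical.

```python
import subprocess


def learn_commands(spam_files, ham_files, learner="sa-learn"):
    """Build the learning commands, always spam first and ham last."""
    commands = []
    if spam_files:
        commands.append([learner, "--spam", *spam_files])
    if ham_files:
        # Ham runs last so that it overrides an earlier spam verdict.
        commands.append([learner, "--ham", *ham_files])
    return commands


def run_learning(spam_files, ham_files):
    """Run the commands in order and stop at the first failure."""
    for command in learn_commands(spam_files, ham_files):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the execution makes the 
ordering easy to check without touching the learner's database.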

> 2) If I change a file in on filesystem level how can I let cyrus know so
> that it is aware of this change?
> 

I would recommend against changing the files on the filesystem, as the 
only way to let Cyrus IMAP pick them up is to reconstruct the folder - 
this would also re-activate (or re-insert, if you will) the messages in 
the spool that have previously been expunged.

> 3) I somehow corrupted my spam folder. The standard installation of
> kolab doesn't install the reconstruct binary so I was unable to recover
> this folder. Where do I find the reconstruct binary?
> 

This utility should normally be shipped (on Debian Wheezy) as 
/usr/lib/cyrus-imapd/reconstruct. If it's not installed as part of the 
package(s), I would love to see a ticket on issues.kolab.org for it.

> 4) To increase performance I'd like to react on user's move request
> instead of scanning all mail folders. Is there a way to run a script if
> a user is about to move a message or if a user just has moved a message?
> 

There's a notify socket one could listen on in order to pick up 
notifications on events (such as a new delivery, or any other change).
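
As a rough sketch of such a listener in Python: the code below binds a 
Unix datagram socket and hands each notification to a callback. The 
socket path and the payload layout (a series of NUL-terminated text 
fields) are assumptions to verify against the installed Cyrus IMAP 
version, and the function names are made up for this example.

```python
import os
import socket


def parse_notification(datagram):
    """Split a datagram into its NUL-terminated text fields."""
    fields = datagram.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final terminator
    return [field.decode("utf-8", "replace") for field in fields]


def listen(socket_path, handle, max_events=None):
    """Bind socket_path and pass every parsed notification to handle."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(socket_path)  # fails if the path already exists
    seen = 0
    try:
        while max_events is None or seen < max_events:
            handle(parse_notification(sock.recv(8192)))
            seen += 1
    finally:
        sock.close()
        os.unlink(socket_path)
```

A handler could then look only at events for a user's Spam folder and 
queue that folder for learning, rather than scanning every mailbox.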

Kind regards,

Jeroen van Meeuwen

[1] https://wiki.kolab.org/KEP:14

-- 
Systems Architect, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
m: +44 74 2516 3817
w: http://www.kolabsys.com

pgp: 9342 BF08