Spam/Ham

Mon Dec 31 14:39:42 CET 2012

On 31/12/2012 9:23 PM, Jeroen van Meeuwen (Kolab Systems) wrote:
> On 2012-12-30 01:42, Stefan Froehlich wrote:
>> Personally I don't like the way of handling spam and ham described in
>> the documentation. No user is willing to send real spam to the spam
>> folder and real ham to the ham folder.
> Actually, a "Mark as Junk" button is available in Roundcube (enable the
> "markasjunk" plugin), and this button will move a message to a
> (previously defined, user configurable, "Spam" folder by default) junk
> folder.
>
> That said, the documentation so far has been written from an
> administrator's perspective - the required user interaction has not been
> thoroughly examined, and as such neither the flow for nor the
> expectations of the user have necessarily been subjected to the right
> amount of scrutiny.
>
>> I'd like to implement another approach which is easier for the user.
>> I'd like to learn all mails in the user's spam folder as spam and all
>> mails not in the spam folder as ham. I think this is the natural,
>> intuitive way for a user.
> I would love to consider what is a more intuitive work-flow for a
> regular user, but what I've found to be possible with the pieces in
> place tends to lead to needing to compromise something somewhere;
>
> - One could set delete_mode and expunge_mode back to 'immediate', but
> this compromises the ability to ensure all message(-file)s ever in any
> mailbox are also included in at least one backup.
>
> - One could maintain delete_mode and expunge_mode set to 'delayed' and
> only learn spam and ham after;
>
>     1) a ((virtual?) full?) backup has completed,
>     2) message files for expunged messages are deleted from the
> filesystem (by running cyr_expire -D 0s -X 0s -E 3?)
>
>       2a) I've found that it is not possible to run, for example,
> cyr_expire -E 3 -X 0s user/john.doe/Spam@example.org, which may be a bug
> in the software but is the status quo nonetheless.
>
>       2b) I've found that it is actually particularly hard to, in real
> life, recognize which folder a user believes contains or is to contain
> the messages that that user believes are indeed real spam, unless
> semi-strict defaults are offered and only a limited number of options
> are offered to change the spam folder; My point is localization,
> recognizing the different names a user may give (in any locale),
> capitalization and case-sensitivity. "The problem is choice", if you
> will - though we have an annotation '/vendor/kolab/folder-type =
> mail.junkemail'. Users will press "Delete" for spam messages and press
> "Mark as Junk" for newsletters they have themselves subscribed to.
>
>> I created a sieve filter which moves all mails tagged as X-Spam-Flag:
>> YES to the user's spam folder. Also, if a user sees a message in the
>> spam folder which he thinks is not spam, he simply moves it out from
>> there to the inbox (or a subfolder).
> With later versions of Kolab, we'll have a feature that is called
> "Sieve Script Management" - an administrator can then specify a set of
> MUST-HAVE rules for a user, under KEP #14[1].
>
> This will allow an administrator to make sure the user's sieve scripts
> are preceded by a "managed" segment; that may contain, for example, a
> 'fileinto "Spam";' action.
>
>> All these messages should be learned as ham.
>> I started writing a bash script to handle all these things. The idea
>> is (to increase speed) to insert another X- tag into the mail, let's
>> say X-<ServerName>-Learned-As:, with the possible values spam or ham.
>> I introduced this for performance reasons.
> I would recommend *not* changing the contents of the email (on the
> filesystem or otherwise, in fact);
>
> - Learning spam / ham will remember the tokens learned and as such an
> email will not be "learned" twice.
>
> - Using the filesystem ctime/mtime for a message is, I think, a more
> appropriate approach (i.e. "only learn messages that are 'new' since $x
> days"),
>
> - After Spam / Ham is learned, the folder can be pruned from contents
> using /usr/lib/cyrus-imapd/ipurge
>
>> Now I ran into several problems (all on Debian Wheezy):
>>
>> 1) A folder does not necessarily contain only valid undeleted mail
>> files. If a user moves some mails out of the spam folder, the mail
>> files are still in the spam folder. I can't see a way to distinguish
>> between real mail files and those that have been deleted already but
>> not yet removed from the filesystem.
>>
> This problem is two-fold;
>
> 1) a client application only needs to flag the messages as \Deleted,
> but does not need to issue an EXPUNGE to the folder, and
>
> 2) Individual message files (that correspond to messages previously
> flagged as \Deleted and in folders on which an EXPUNGE has indeed been
> issued) are not immediately deleted from the filesystem.
>
> The way that I myself (therefore) learn Spam (and Ham) is to first
> learn Spam, and on top of that learn Ham; My understanding is that when
> the same message(s, -tokens) have first been learned as Spam, but are
> then learned as Ham, it should forget what it had learned and learn the
> message as Ham.
>
> A resulting instruction to the user is then also, to move/copy messages
> from the Spam folder that are not actually spam to the Ham folder (aside
> from, perhaps, also copying the message back to an INBOX or any other
> folder).
>
>> 2) If I change a file at the filesystem level, how can I let cyrus
>> know so that it is aware of this change?
>>
> I would recommend against changing the files on the filesystem, as the
> only way to let Cyrus IMAP pick them up is to reconstruct the folder -
> this would also re-activate (or re-insert, if you will) the messages in
> the spool that have previously been expunged.
>
>> 3) I somehow corrupted my spam folder. The standard installation of
>> kolab doesn't install the reconstruct binary, so I was unable to
>> recover this folder. Where do I find the reconstruct binary?
>>
> This utility should normally be shipped (on Debian Wheezy) as
> /usr/lib/cyrus-imapd/reconstruct. If it's not installed as part of the
> package(s), I would love to see a ticket on issues.kolab.org for it.
>
>> 4) To increase performance I'd like to react to a user's move request
>> instead of scanning all mail folders. Is there a way to run a script
>> if a user is about to move a message or has just moved a message?
>>
> There's a notify socket one could listen on in order to pick up
> notifications on events (such as a new delivery, or any other change).
>
> Kind regards,
>
> Jeroen van Meeuwen
>
> [1] https://wiki.kolab.org/KEP:14
>
Hi Jeroen,

after some fiddling with a bash script I'm now writing a Python script
instead. I decided against changing mails at the file system level and in
favour of adding a new flag to each mail. I've installed the IMAPClient
library (see http://imapclient.readthedocs.org/en/latest/) and learned a
little bit of Python (never used it before). The logic is as explained
above: all messages in folder /Spam are treated as spam, all other
messages are ham. I'm not going to change or delete them on the
filesystem, but as soon as I have trained them as ham or spam I set a flag
on the message, LearnedAsHam or LearnedAsSpam. This way I don't have to
learn them again. With the help of the above library I'm also able to
distinguish between deleted and normal messages. The only drawback is that
I have to give cyrus-admin full permission to the mailbox. The script is
able to do that automatically.
It still needs a little bit of fine tuning, but if you are interested in
the script I can post it here (or somewhere else).
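
To give an idea of the logic, here is a rough sketch and not the actual
script. It assumes an IMAPClient version that returns flags as bytes, a
junk folder called "Spam", and an account that already has access to the
mailbox. The learn() callback is a hypothetical placeholder for whatever
feeds the spam filter, and the names classify() and train_folder() are
only for illustration.

```python
SPAM_FOLDER = "Spam"                 # assumption: name of the user's junk folder
LEARNED_AS_SPAM = b"LearnedAsSpam"   # custom IMAP keywords set after training
LEARNED_AS_HAM = b"LearnedAsHam"
DELETED = b"\\Deleted"


def classify(folder, flags):
    """Return (kind, marker) for a message, or None if it should be skipped."""
    flags = set(flags)
    if DELETED in flags:
        return None                  # flagged as deleted but not yet expunged
    if folder == SPAM_FOLDER:
        kind, marker = "spam", LEARNED_AS_SPAM
    else:
        kind, marker = "ham", LEARNED_AS_HAM
    if marker in flags:
        return None                  # already learned as this class
    return kind, marker


def train_folder(client, folder, learn):
    """Learn every undeleted, not yet learned message in one folder.

    client: a logged-in imapclient.IMAPClient (or any object offering the
            same select_folder/search/fetch/add_flags methods).
    learn:  hypothetical callback learn(kind, raw_message_bytes).
    Returns the number of messages learned.
    """
    client.select_folder(folder)
    uids = client.search(["ALL"])
    if not uids:
        return 0
    learned = 0
    for uid, data in client.fetch(uids, ["FLAGS"]).items():
        result = classify(folder, data[b"FLAGS"])
        if result is None:
            continue
        kind, marker = result
        # BODY.PEEK[] retrieves the message without setting \Seen.
        body = client.fetch([uid], ["BODY.PEEK[]"])[uid][b"BODY[]"]
        learn(kind, body)
        client.add_flags([uid], [marker])
        learned += 1
    return learned


# Example use (needs a reachable server; host and password are placeholders):
#   from imapclient import IMAPClient
#   client = IMAPClient("imap.example.org", ssl=True)
#   client.login("cyrus-admin", "secret")
#   train_folder(client, "user/john.doe/Spam@example.org", my_learn)
```

Note that in this sketch a message carrying LearnedAsSpam that is later
found outside the spam folder is learned again as ham. That is a choice
made for the sketch; skipping every message that carries either flag
would be the simpler alternative.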

And yes, there is no binary for reconstruct. I've set up a VM, installed
cyrus-imap there, copied the binary over from it and was able to
reconstruct the mailbox.

Kind regards, Stefan Fröhlich
42 ;-)