Request for Input: Storing Searches

Wed Aug 31 15:25:40 CEST 2011

Hi,

Christian already pointed out some implications on Kontact/Akonadi, but I have 
some additional remarks on that as well.

On Saturday 27 August 2011 20:39:40 Georg C. F. Greve wrote:
> Some of us have started tossing around thoughts about how to save searches
> in one Kolab client in a way that they are re-usable in all others,
> ideally.
> 
> When giving it some brainspace, it turns out this is not a trivial issue,
> for a variety of reasons, starting with there being a tradeoff decision
> between being expensive for the CPU or storage, for instance. But it is a
> little bit more complex than that, actually.
> 
> Allow me to highlight a couple of scenarios with advantages and
> disadvantages:
> 
>  - Scenario 1: Storage with a new KEP 9 based XML object
> 
> 	One could attempt to model this as a "search" XML object that would
> 	incorporate the fields of the object type searched, plus some special
> 	fields, e.g. folders to search, as well as searches across multiple fields
> 	and search logic (AND/OR etc),

Here you have the first challenge already (we ran into the same issue while
working on Akonadi, and it's still not fully solved there): You need to
define the exact semantics of the query language and make sure all clients can
actually implement it. This seems easy at first ("person's name equals
'Georg'"), but you quickly end up with ugly details:
- what kind of Boolean logic do you want to support, and in what nesting 
depth?
- what kind of comparison operators do you want to support for strings? equal,
contains, matches regexp, case sensitive vs. insensitive, etc.
- what comparison operators do you want to support for other types (numbers,
date/time)? equal, greater/less than, occurs at, etc. (keep in mind the extra
fun with DST/time zones and recurrences). Also note the complexity
implications of recurrence-related queries with negations (e.g. "event never
occurs on a Friday the 13th").
- on what fields are you operating? Just the ones defined in the Kolab format
spec, or is that too low-level for users and you want higher-level composite
fields? E.g. a person's name maps to a bunch of internal fields.
- next to the low-level search operators, are you considering higher-level
ones that depend on external context (e.g. something along the lines of KMail's
"mail from someone in my address book" or "mail from someone tagged as
'Friend'" operators, or your example of "events in Germany")?

If you want client-side search with consistent results, this needs to be
specified in detail. Unlike the format specification, where we mostly just
display the content without interpreting it (except for most date/time
fields), this actually requires agreement on exact semantics.
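
To make the questions above a bit more concrete, here is a minimal sketch of
what a tiny query language with nested Boolean logic and two string operators
could look like. Everything in it is an assumption for illustration only: the
class names, the operator names ("equals", "contains"), the case-insensitive
default and the field names are made up and are not taken from the Kolab
format or from Akonadi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Leaf node: compare one string field against a value."""
    field: str
    op: str                      # hypothetical operators: "equals", "contains"
    value: str
    case_sensitive: bool = False  # assumed default, a spec would have to fix this


@dataclass(frozen=True)
class And:
    operands: tuple


@dataclass(frozen=True)
class Or:
    operands: tuple


@dataclass(frozen=True)
class Not:
    operand: object


def matches(query, obj: dict) -> bool:
    """Evaluate a query tree against an object given as a flat dict of strings."""
    if isinstance(query, Term):
        actual = obj.get(query.field, "")
        expected = query.value
        if not query.case_sensitive:
            actual, expected = actual.casefold(), expected.casefold()
        if query.op == "equals":
            return actual == expected
        if query.op == "contains":
            return expected in actual
        raise ValueError(f"unsupported operator: {query.op}")
    if isinstance(query, And):
        return all(matches(q, obj) for q in query.operands)
    if isinstance(query, Or):
        return any(matches(q, obj) for q in query.operands)
    if isinstance(query, Not):
        return not matches(query.operand, obj)
    raise TypeError(f"unknown query node: {query!r}")


# Illustrative object and query; the field names are hypothetical.
contact = {"given-name": "Georg", "organization": "Example Org"}
query = And((
    Term("given-name", "equals", "georg"),
    Not(Term("organization", "contains", "KDE")),
))
print(matches(query, contact))
```

Even in this toy version, every client would have to agree on details such as
the case-folding rule and the treatment of missing fields to get the same
results.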

> 	These objects would live in the regular folders for resources, and would
> 	potentially even replace the list object in functionality, as they would
> 	then model a list of recipients as list of address book entries, which is
> 	something that Alain once suggested. [1]
> 
> 	Advantages: Fairly close to existing functionality, and likely not too
> 	hard to implement for most clients (in comparison, at least), no data
> 	duplication anywhere.

Implementation complexity for us would mainly depend on the complexity of the
query language and how well it maps to our existing search support.

> 	Disadvantages: Expensive on the CPU, Does not work on all resources
> 	because we cannot store these XML elements in email type mailboxes.

What resources are you referring to here?

>  - Scenario 2: Creation of new folder type w/KEP #9 annotation for metadata,
> create one folder per saved search
> 
> 	In this approach we'd create a new folder of the corresponding resource
> 	folder type for each search which would be identified as a stored search
> 	folder by existence of the /vendor/kolab/saved-search annotation which
> 	carries the metadata for the search in an array, e.g.
> 
> 	{ 'saved_search':
>     		{ 'search_locations': 'blabla',
>       		'params': 'blabla',
>       		'filter': 'blabla',
>       		'fuzzyness': 'blabla',
> 		'async': '0'
> 		....
>     		}
> 	}
> 
> 	and the folder would be populated with the results of the search.
> 
> 	This DOES mean data duplication on the client, but Cyrus does allow to
> 	deduplicate entries on the server side, so it would not affect storage
> 	there. I am sure something similar would be possible with Dovecot, so
> 	we can for the moment assume data gets duplicated on the client only.

This is actually very close to the Akonadi approach for searching. We also
have dedicated folders for search results ("virtual folders"), which contain
only references to the actual objects (sort of like symlinks). This avoids
duplication, de-synchronization issues, etc. If the Kolab server allowed us to
retrieve information about the original location of an object in a
pre-populated search folder, we could symlink that on the client as well.
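
As a rough illustration of the "references instead of copies" idea, here is a
minimal sketch. It is not Akonadi code; the names ObjectRef, Store and
VirtualFolder are hypothetical, and a real implementation would of course
deal with persistence and change notification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectRef:
    """Points at an object in its original folder (hypothetical structure)."""
    folder: str
    uid: str


@dataclass
class Store:
    """Maps folder name -> {uid: object}; stands in for the real storage."""
    folders: dict = field(default_factory=dict)

    def resolve(self, ref: ObjectRef) -> dict:
        return self.folders[ref.folder][ref.uid]


@dataclass
class VirtualFolder:
    """A search result folder that holds references, never copies."""
    name: str
    refs: list = field(default_factory=list)

    def items(self, store: Store) -> list:
        return [store.resolve(ref) for ref in self.refs]


store = Store({"Contacts": {"uid-1": {"name": "Georg"}}})
results = VirtualFolder("saved-search", [ObjectRef("Contacts", "uid-1")])

# Editing the original is immediately visible through the virtual folder,
# because there is only one copy of the object.
store.folders["Contacts"]["uid-1"]["name"] = "Georg G."
print(results.items(store))
```

This only works if the client knows the original folder and UID of each
result, which is exactly the information we would need from the server.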

There are some pros and cons of the server-side search approach though:
+ there's a single implementation, so same semantics and expressive power 
everywhere
+ minimal client changes required
- no search possible when offline
- increasing de-synchronization between results and reality while being 
offline

> 	Advantages: Allows clients without search functionality to use results,
> 	can be automatically regenerated on the server if needs be, least CPU
> 	usage, works on email.
> 
> 	Disadvantages: Data duplication on the client, possible data set de-
> 	synchronization (e.g. contact gets edited in search results, same contact
> 	in main box and other search results boxes must be updated, this may be
> 	hard to ensure), increases folder clutter, some folder sharing questions.
> 
> 
>  - Scenario 3: Map searches with tags
> 
> 	As a Kolab object, each search will carry an ID. If we were to introduce
> 	a new email header flag in storage that can carry an arbitrary number
> 	of tags, we could tag each object with the ID of every search that it
> 	matches.
> 
> 	IMAP searching for header fields should make it comparatively easy
> 	and fast to find all objects of type X that match a certain tag Y,
>  	especially if we ask the server to cache this header field.
> 
> 	This would be complemented by a KEP 9 compatible object to describe
> 	the search, which could then be automatically applied to new objects on
> 	the server, or performed by the client, based on the scenario.

This assumes there is agreement on the exact semantics, otherwise you could 
end up with nasty tag ping-pong games between multiple clients and/or the 
server.
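
A small simulation of what I mean by ping-pong, as a sketch under made-up
assumptions: two clients evaluate the same saved search, but one compares
strings case-insensitively and the other case-sensitively. The function
names, the object layout and the search ID "search-1" are all hypothetical.

```python
def client_a_matches(contact: dict) -> bool:
    # Client A assumes "contains" is case-insensitive.
    return "georg" in contact["name"].lower()


def client_b_matches(contact: dict) -> bool:
    # Client B assumes "contains" is case-sensitive.
    return "georg" in contact["name"]


def sync(contact: dict, search_id: str, predicate) -> bool:
    """Make the tag agree with the predicate; return True if the object was written."""
    should_have = predicate(contact)
    has = search_id in contact["tags"]
    if should_have and not has:
        contact["tags"].add(search_id)
        return True
    if has and not should_have:
        contact["tags"].discard(search_id)
        return True
    return False


def simulate(rounds: int, first, second) -> int:
    """Let two clients sync alternately and count how many writes happen."""
    contact = {"name": "GEORG Example", "tags": set()}
    writes = 0
    for _ in range(rounds):
        writes += int(sync(contact, "search-1", first))
        writes += int(sync(contact, "search-1", second))
    return writes


# With differing semantics, every sync by either client rewrites the object.
print(simulate(3, client_a_matches, client_b_matches))
# With identical semantics, only the very first sync writes.
print(simulate(3, client_a_matches, client_a_matches))
```

With more clients, or with the server also applying the search, the number of
useless writes only goes up.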

> 	Advantages: Low CPU & storage requirements, allows Kolab clients to apply
> 	a tag concept over all object types including email with potential server
> 	side tagging of incoming email
> 
> 	Disadvantages: New concept, some questions around shared folders, e.g.
> 	what if a client sees a shared object tagged with an ID for a search it
> 	does not know because it does not have access to the folder where that
> 	search is defined?

Unlike the other two options, this doesn't really fit well into the Akonadi way
of dealing with searches (searching by tags is easy, but we are not prepared
to tag and write back based on queries yet), so I think this would be the most
expensive one for us to implement.

> There may be other advantages and disadvantages that I did not list.
> 
> Please help us identify them all, so we can come to a good decision.
> 
> Likewise, if you can think of a scenario that should be considered in
> addition to the ones listed here, please let me know. As for the scenarios
> listed, there are two questions in particular that I wonder about:
> 
>  (a) Compatibility with clients, in particular: How will this integrate (or
> 	not) with the new Nepomuk/Akonadi KDE Kontact basis

The biggest challenge for Kontact would be the fact that these searches are
limited to just one Kolab account (and would most likely be context-free).
Both are limitations we have been working on getting rid of. I.e. we would like
the user to see his data as one set, without having to worry where it's coming
from and what capabilities follow from that. Also, we are moving to
higher-level, context-aware query/search concepts (e.g. "persons near my
current location"). We already have a similar problem with KMail local filters
and server-side Sieve filters, the latter basically having the same
limitations as a per-account cross-client saved search would have.

Of course, if you look at this from the web client POV, you don't really have
these kinds of problems.

regards
Volker