Request for Input: Storing Searches

Jeroen van Meeuwen (Kolab Systems) vanmeeuwen at kolabsys.com
Mon Sep 5 16:37:57 CEST 2011


Christian Mollekopf wrote:
> On Wednesday 31 August 2011 13.13:24 Jeroen van Meeuwen wrote:
> > Christian Mollekopf wrote:
> > > > > - The action would then pick the results from the search which
> > > > > are in
> > > > > this resource, and tag them via ANNOTATE and create an xml
> > > > > object with the search info.
> > > > 
> > > > Where would this XML object be put?
> > > 
> > > I'd imagine that we put those objects in the rootfolder or in a "Saved
> > > Searches" folder. In the "Saved Searches" folder we could then also
> > > create the optional server-side populated search directories.
> > 
> > There's no such thing as a 'root folder'.
> 
> Indeed, let's take the INBOX folder as root folder then.
> 
> I just realized that we are probably thinking of slightly different
> usecases.
> 
> I was mainly thinking about the usecase where a user saves a search for
> himself for use on another client.
> So the searchfolder would be a subfolder of the users INBOX and the search
> is mainly meaningful for that user only. I.e. I search for all emails
> belonging to a project I'm working on (and then I might want to share that
> search with one or two fellow workers).
> 
> Sharing the search would then be more along the lines of copying the query
> to another users INBOX folder so he can execute the same query.
> 
> That's also why I thought doing the search on the client side isn't a huge
> performance hog. If we have a shared search for 300'000 users, it is a bit
> of a different story.
> 
> I assume you were more thinking of something like a shared search for, e.g.,
> all representatives of a company living in Switzerland, which could be used
> as a dynamically updated distribution list for the whole company. Such
> searches do not belong to a particular user and here what you say makes a
> lot of sense.
> 

A shared search can exist on a personal folder, but on a shared folder, too.

I for one have a saved search called 'mailing list password reminders', 
running over all of our Kolab Systems' shared/lists/ folders.

I would push that saved search out to something like 
shared/administrativia/mailing list password reminders/ so that any of my 
fellow admins can pick it up.

For this particular folder type, I would want the search results in the client 
to be the actual objects, and the search to be executed real-time, so that I 
can delete the mails once I've acted upon them (we want no password reminders 
in the shared folders). I only open this folder every second day of the month.

For a different type of folder, for example, all mails on topic 'foo' sent by 
any of the Contacts in my 'Customer/Company A' address book, whether shared or 
personal, one can imagine searching is expensive and sharing the search would 
only increase the number of times said search is executed.

In the latter case, I'd rather have the opportunity to periodically
pre-populate the folder with the search results, as well as perhaps a
right-click context menu 'Update Now...' item.
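
A rough sketch of that behaviour, in Python purely for illustration. The class
and its names (SavedSearchFolder, run_query, max_age) are hypothetical and not
part of any Kolab, Cyrus or client API; it only shows cached results being
refreshed when stale, with an explicit 'Update Now...' entry point:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SavedSearchFolder:
    """Hypothetical model of a pre-populated saved search folder."""

    run_query: Callable[[], List[str]]   # executes the (expensive) search
    max_age: float                       # seconds before cached results are stale
    results: List[str] = field(default_factory=list)
    populated_at: float = float("-inf")  # never populated yet

    def update_now(self, now: float) -> None:
        """What an 'Update Now...' context menu item would trigger."""
        self.results = self.run_query()
        self.populated_at = now

    def contents(self, now: float) -> List[str]:
        """Return the cached results, re-running the search only when stale."""
        if now - self.populated_at > self.max_age:
            self.update_now(now)
        return self.results
```

Whether the periodic refresh runs on the server or in the client is left open
here; the sketch only separates "use what is cached" from "run the search".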

> I guess it's desirable to cover both usecases?
> I think they are really quite different because in the first case the
> search really belongs to a user and is optionally shared while in the
> latter usecase the search doesn't belong to anyone and is global.
> 

For me, the question is not so much whether use case #1 is so much different 
from other use cases, but more whether or not we can confidently allow for the 
option value to exist.

> Also in the first usecase read/write rights are probably more important
> than in the second one. Imagine I have notes in this search, so write
> rights are mandatory.

Let's note that the 'write' right does not include the ability to 'post', 
'insert' or 'delete' messages, but -in Cyrus IMAP- does allow shared 
annotations to be set (METADATA).

> In the second case editing is more a "nice to have" because the search
> serves more as a reference/directory.

Playing with these rights allows one to enforce that only the annotations can
be written, but not the folder contents.
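
To illustrate, a minimal sketch in Python. The rights letters are those of the
IMAP ACL extension (RFC 4314); the mapping of 'w' to shared annotations follows
the Cyrus IMAP behaviour described above. The action names and the "lrsw"
rights string are assumptions made up for this example:

```python
# Rights letters as defined by the IMAP ACL extension (RFC 4314).
# Mapping 'w' to shared annotations follows the Cyrus IMAP behaviour
# described in the text; the action names are hypothetical.
ACTION_TO_RIGHT = {
    "read": "r",
    "set_shared_annotation": "w",
    "insert": "i",
    "post": "p",
    "delete_message": "t",  # older servers use the legacy 'd' right instead
}

# Assumed ACL for a saved search folder: lookup, read, seen, write.
SAVED_SEARCH_RIGHTS = "lrsw"


def allowed(rights: str, action: str) -> bool:
    """Return True if the rights string grants the given action."""
    return ACTION_TO_RIGHT[action] in rights
```

With "lrsw", allowed(SAVED_SEARCH_RIGHTS, "set_shared_annotation") is true,
while "insert", "post" and "delete_message" are all refused.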

> So it's IMO mainly the first usecase where it makes sense to replace the
> prepopulated results with something like the akonadi virtual folders giving
> full read/write rights.
> 
> > If a 'Saved Searches' folder were to be used, all 'saved search' Kolab
> > XML objects would go into that one folder.
> > 
> > Sharing any particular saved search now becomes a problem,
> > 
> > Clients not compatible with KEP #9 are now helpless, since a top-level
> > folder of unknown type is encountered, but they have to descend in order
> > to get to any sub-folder,
> > 
> > The saved searches folder(s) cannot be pre-populated unless the Kolab XML
> > object for saved searches also states where any pre-populating should go
> > out to, which naturally is subject to too much change,
> > 
> > Keeping 'reference' objects in a 'saved search' folder creates the same
> > 'subject to change' problem... if a reference object were to say,
> > user/john.doe/Contacts@example.org?uid=blabla, renaming the Contacts
> > folder to Kontacten would create a reference issue; any reference should
> > be completely referential (OLAP),
> > 
> > etc.
> 
> Ok, then it would probably make sense to forget about the xml objects and
> have a 'saved search' folder containing all the public shared searches,
> each being a folder which is populated with the results. For "personal"
> saved searches (first usecase above) the client can ignore the
> prepopulated items and execute the query which is stored in an annotation
> to provide a virtual folder with read/write access.
> 

I would not make allowing a client to ignore the pre-populated contents of a 
saved search folder subject to the namespace.

I would make it subject to the client's ability to timely execute a real-time 
search should it choose to ignore the pre-populated contents.

Kontact with akonadi would now be eligible to ignore the 'cached results' and 
instead execute a real-time search.

> > > > > - The resource populates the virtual folder with virtual items
> > > > > based on the tags.
> > > > > 
> > > > > =>  	- No data duplication
> > > > 
> > > > This has always been "optional"; a saved search folder *could* be
> > > > pre-populated.
> > > > 
> > > > Imagine a saved search across 5 contact folders with 10.000 contacts
> > > > on
> > > > average.
> > > > 
> > > > When NOT pre-populating the saved search folder with the search
> > > > results, you pay the cost every time the folder is opened. Maybe
> > > > this cost is not so great for a fat client with a local cache to
> > > > query
> > > > (Kontact/akonadi/nepomuk/Disconnected IMAP), but for a
> > > > web-interface...
> > > > well...
> > > 
> > > Yes, since akonadi can create virtual folders it only has to populate
> > > them once AFAIK, and the result is then cached.
> > > For a webclient I guess you're right. But if we have the
> > > dataduplication
> > 
> > Again, the data duplication is *at one's option*. For web-interface, I
> > say one should pre-populate the search folder. For clients like Kontact
> > (with client- side, local caches), perhaps it's feasible to allow them
> > to ignore the cached results and go with a real-time search.
> > 
> > > and it should also be writable it looks somewhat error prone to me.
> > 
> > A clause in the KEP for these types of folders can be, that the content
> > of the folder SHOULD NOT be made writeable for any event other than
> > 'update saved search'.
> 
> Indeed.
> 
> > > Especially if we have to implement that for every client.
> > 
> > We have to implement everything and anything for every client, in case
> > you haven't noticed.
> 
> If we have server-side populated search folders that should work with any
> IMAP client and we don't have to implement anything on the client.
> 

We would still need to implement the handling of annotations set on the folder 
the client doesn't know of, and the prohibiting of editing, right-click 
context menu entry for 'Update now...' as well as 'Forget search...', and
de-duplication of search results for searching all folders including saved
search folders.

> > > I reckon using akonadi as a cache for the webinterface would solve that
> > > problem?
> > 
> > No, the caching layer is moot unless you also consider all clients use
> > akonadi. One cache (in one location) to rule them all being akonadi is
> > not necessarily the best way to go with this. This, however, is a
> > different topic, and we should talk about caching separately from the
> > saved searches topic.
> > 
> > > > When pre-populating the saved search folder with the search results,
> > > > you pay the cost in "duplicate" storage (as explained, not on the
> > > > server side, perhaps on the client side if it's not intelligent
> > > > enough to de-duplicate).
> > > 
> > > Well, the argument for doing it server side would be that it is
> > > available
> > > on any client (i.e. smartphone), but then read-only is the only option
> > > i see. If it is on the client side, this ends up to be essentially the
> > > same
> > > as the akonadi virtual folders.
> > 
> > Note that the "problem" or "difficulty" is not the editing of an object
> > from within the saved search, not for a client and not for a user.
> > 
> > It is the occurrence of said object twice or more times in or across all
> > readable folders that is the first problem.
> 
> Yes, that's why I say that server-side populated search folders should be
> read only.
> 

Even with read-only, searching over the original folder as well as the (saved) 
search folder would give a duplicate search result, unless the client is made 
to ignore the saved search folder's contents when searching.
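
A minimal sketch of such a client-side rule, in Python for illustration only.
The folder-type value "saved-search" and the data layout are assumptions made
up for this example, not an existing annotation or client structure:

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical folder-type value marking a saved search folder.
SAVED_SEARCH_TYPE = "saved-search"


def search_folders(
    folders: Dict[str, dict],
    matches: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Search all folders, skipping saved search folders to avoid duplicates."""
    hits = []
    for name, folder in folders.items():
        if folder["type"] == SAVED_SEARCH_TYPE:
            continue  # contents are copies of results found in other folders
        for message_id in folder["messages"]:
            if matches(message_id):
                hits.append((name, message_id))
    return hits
```

A message present in both INBOX and a pre-populated saved search folder is then
reported once, from INBOX only.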

> > > > Another penalty in pre-populating the saved search folder could,
> > > > arguably, be that perhaps there are results in said saved search
> > > > folder
> > > > that the person using the folder would otherwise not have access to.
> > > > However, this can also be considered a feature; "Share all contacts
> > > > from Vendors folder tagged with 'ict' with helpdesk personnel"
> > > 
> > > If it is being populated on the client side I don't see how you could
> > > get
> > > access to items you shouldn't have access to.
> > 
> > You want to avoid your 350.000 clients from each having to iterate and
> > re-iterate against most of your infrastructure components themselves,
> > just to pre-populate and update their saved (contact) searches cache(s),
> > if you can do so periodically on the server-side, under your own
> > control.
> 
> I agree for globally shared searches (second usecase). For personal saved
> searches I'm not sure if it makes sense to run the queries of 350'000
> clients on the server if most of the queries are only relevant for this
> particular user.
> 

Let's please consider 'personal saved searches not shared with anyone' to be 
'saved searches shared with one individual'. Sharing it globally is one click 
away.

Compare a Kolab deployment with a single domain name space: a second domain is
right around the corner and is likely to eventually make the assumed single
domain deployment a multi-domain deployment.

> > Please note that the folders of type 'contact' that the user has access
> > to are not the only things that are subject to change. Please also note
> > that these folders and other resources can be *huge*.
> > 
> > Saved searches are often not the smallest set of search results. They
> > often include a type of query that is a little less specific than
> > (mail=john.doe@example.org), and inherently can include a lot of
> > attributes that are not (cannot be?) indexed anywhere.
> > 
> > As such, saved searches are *hugely* expensive to execute.
> > 
> > When they happen on the client side, and on one particular client only,
> > configurable per client, then we're all fine. Kontact for instance can do
> > saved searches in a reasonable fashion because of Akonadi. A web
> > interface such as Horde or Roundcube however cannot. Please note a saved
> > search for these interfaces will have to;
> > 
> > 1) execute within the PHP execution timeout (30 seconds),
> > 
> > 2) stay within the memory_limit (64/128/192 MB),
> > 
> > 3) not use the user's credentials / privileges to query resources
> > other than IMAP
> 
> I thought if we had a cache like akonadi below the webinterface we could