Kolab XML Format: Proposal for an XSD friendly update

Christian Mollekopf mollekopf at kolabsys.com
Wed Oct 19 12:19:51 CEST 2011


On 18.10.2011 21:35, Gunnar Wrobel wrote:
> Hi Christian,
>
Hi Gunnar,

> Quoting Christian Mollekopf <mollekopf at kolabsys.com>:
>
>> Hi,
>>
>> Because the various implementations of the Kolab XML Format are
>> difficult to maintain and very error prone, the idea of a library to
>> read/write the XML objects came up. Till and Volker from KDAB pointed
>> out that using databindings based on an XML Schema (XSD) would be the
>> ideal tool to develop such a library. The process of writing this
>> schema brought up several problems with the format which I'm going to
>> outline here.
>>
>> == Why do we need a schema ==
>>
>> The current format specification is not very explicit about some
>> details and is open to interpretation in those parts. A schema would
>> give us a much stricter specification which also allows XML files to
>> be validated against the spec.
>>
>> Further, it is a tedious job to actually implement a specification
>> and make sure that the implementation really does the correct thing
>> in all corner cases. Inevitably, most implementations will behave
>> slightly differently, possibly ending up with conflicts.
>>
>> Fortunately there is a tool to write such a specification in a more
>> useful way for XML: an XML Schema. There are various schema
>> languages, but the most promising for our purpose is XSD.
>>
>> Using XSD, we can write a schema which can be used to validate the
>> XML file. This means the XSD actually holds the promise: if a client
>> can read and write a file accepted by the schema, any other client
>> will also be able to read those files.
>>
>> Even better, using code generators we don't have to write the code
>> that parses the XML and maps it to an in-memory representation, but
>> can generate it directly from the XSD. This completely removes the
>> need to implement/test/maintain this fairly error prone part of the
>> code.
>>
>> == What will we gain from the schema ==
>>
>> As said, primarily a well-defined format and reduced development
>> effort. But we will also get implementations of the format in all
>> clients which actually follow the spec, which is very different from
>> the situation we have now.
>>
>> If you try to set the GPG settings with KAddressBook and modify the
>> same contact afterwards in Horde, your GPG settings will be gone,
>> because Kontact makes use of the "unknown tags", which are not 
>> preserved
>> by Horde.
>>
>> With an XSD based databinding such surprises are much less likely to
>> happen, because all clients which make use of the XSD databindings
>> will adhere to the spec.
>>
>> You'll realise that one misbehaving client can effectively destroy
>> the "Kolab experience": that no matter which client you're using,
>> you get at least the features defined by the format everywhere.
>>
>> Also, the validation of the actual values, which is now up to every
>> implementation, will be defined centrally in the schema. With the
>> databindings we even get type safety, which gives us compile-time
>> errors instead of runtime errors.
>>
>> Databindings based on XSD are available for various languages. We're
>> targeting C++/PHP/Python for now, so we will make sure that a 
>> solution
>> for these languages exists.
>>
>> == Problems with the current Format ==
>>
>> The current format allows some things, or is at least not explicit
>> enough about them, which are hard to implement using XSD or lead to
>> other problems.
>> The key points are:
>>
>> - Preserving of undefined tags
>> - Undefined order of elements
>> - No defined namespace
>>
>> === Preserving of undefined tags ===
>
> I agree to a large extent with everything else you wrote in your 
> mail.
> I think the only difficult point is "Preserving of undefined tags" -
> so I'll add my comments just here.
>
>>
>> Preserving unknown tags is far from trivial and a rather big
>> development effort. I understand the use of an extensible format as 
>> it
>> makes it very easy for vendors to implement their own special 
>> features
>> using extensions (aka unknown tags). Also the idea that old clients 
>> can
>> still make use of a subset of the data of
>> newer versions of the format is intriguing. However I think there 
>> are a
>> couple of severe drawbacks which make me think unknown tags are not 
>> a
>> good idea after all.
>>
>> - If vendors can implement their features with unknown tags, no one
>> else can make use of them. This effectively works against the idea
>> that all clients support the same features
>
> I do not think it is very likely that all Kolab clients will ever
> support the same feature set. I do not consider this to be a central
> idea of the format specification. The Kolab format forces the clients
> to adhere to it for the Kolab features supported by the clients. It
> does not force the clients to support all features though.
>
> Why shouldn't there be a client that only knows how to use the
> "summary" and "body" field of the "note" object? Yes, not very 
> useful.
> But what would force the vendor of this imaginary client to support
> the full Kolab format feature set if he has customers that are happy
> with this extremely reduced set of capabilities?
>

As Jeroen already pointed out, adding the tags to the XSD does not imply 
that those elements are mandatory.
We would likely define most of those elements as optional, so the client 
does not have to implement those features.
However, a client cannot implement a feature which is not in the XSD.

>> and even encourages vendors to
>> implement their features in their client only to have a market
>> advantage.
>
> Most of the time vendor specific tags have been used because a client
> already had a certain feature that the Kolab format just didn't
> specify. I think most often this was just done because the vendors
> would like to avoid telling people: "oh, but this feature doesn't 
> work
> if you use Kolab as a backend"
>
> And even if it were used as a competing factor between Kolab
> clients - would it be that dramatic? It is not like getting a patent
> on that feature and restricting the other clients from implementing
> something similar. I would assume that a cool new feature that one
> client might offer would draw some attention and finally give birth
> to another KEP so that all clients can implement the feature.
>

Right, but we will likely end up doing all the work. I'm not saying that 
vendors are purposefully implementing all features with their own 
extensions, but at the moment we're not really motivating them to 
contribute to upstream Kolab, which is what we should do as an 
open source based project.

>> This obviously hurts the Kolab platform.
>
> I don't see how the extensibility itself hurts Kolab. What you
> described above - Horde overwriting Kontact extensions - is what 
> hurts
> users. But that is not a problem caused by extensibility. This is a
> client - Horde - not being careful enough and ignoring that
> extensibility feature.
>

First, I consider this feature to be rather difficult to implement, and 
it is likely to be a problem in many implementations.
Second, it hurts the Kolab platform in at least two ways (apart from 
the getting stuff upstream argument):
  - application interoperability (If we don't run after each vendor 
we'll get a gpg field for contacts from every single client)
  - no upgrade path
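
To make the first point concrete, here is a minimal sketch of how an
extension tag gets lost. It is not taken from any existing client: the
element names, the "x-gpg-key" tag and the naive_roundtrip() helper are
assumptions made up for illustration. A client that maps only the
fields it knows into its own data structure and then writes a fresh
document drops everything else without noticing.

```python
import xml.etree.ElementTree as ET

# Hypothetical contact object; "x-gpg-key" stands in for a vendor tag.
SOURCE = """<contact>
  <name>Jane Doe</name>
  <email>jane@example.org</email>
  <x-gpg-key>0xDEADBEEF</x-gpg-key>
</contact>"""

KNOWN_TAGS = ("name", "email")  # the only fields this client understands


def naive_roundtrip(xml_text):
    """Read the known fields into a dict and write a new document from it."""
    root = ET.fromstring(xml_text)
    fields = {tag: root.findtext(tag) for tag in KNOWN_TAGS}
    out = ET.Element(root.tag)
    for tag, value in fields.items():
        if value is not None:
            ET.SubElement(out, tag).text = value
    return ET.tostring(out, encoding="unicode")


rewritten = naive_roundtrip(SOURCE)
print(rewritten)
```

To keep the tag, the client would have to carry the unknown subtree
along with its own object and merge it back in on every write, which is
the part I consider difficult.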

We still might add a controlled extensibility to the format if we see a 
need for it in the future.
I.e. we could add a "custom" element which can appear a number of times 
where such custom values could be stored.
We shouldn't do that lightly though, as that works against us for the 
reasons already outlined.
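
A rough sketch of what such a "custom" element could look like from
the client side. Nothing here is decided: the "name" attribute, the
key naming and both helper functions are assumptions for illustration
only. The point is that the element is known to the format, so every
client can read it and write it back even if it does not interpret the
values.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: repeatable "custom" elements keyed by a "name"
# attribute. Neither the attribute nor the key naming is part of any spec.
SOURCE = """<contact>
  <name>Jane Doe</name>
  <custom name="vendor-a.gpg-key">0xDEADBEEF</custom>
  <custom name="vendor-b.colour">blue</custom>
</contact>"""


def read_custom_values(xml_text):
    """Collect all custom values of the object into a dict."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): (el.text or "") for el in root.findall("custom")}


def write_contact(name, custom_values):
    """Write a contact, passing the custom values through unchanged."""
    root = ET.Element("contact")
    ET.SubElement(root, "name").text = name
    for key, value in custom_values.items():
        ET.SubElement(root, "custom", {"name": key}).text = value
    return ET.tostring(root, encoding="unicode")


values = read_custom_values(SOURCE)
roundtrip = read_custom_values(write_contact("Jane Doe", values))
print(values)
```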


>> If we disallowed
>> unknown tags instead, we would force vendors to go through a KEP 
>> process
>> to improve the Kolab format for everyone.
>
> And until the KEP has been approved the vendor is unable to enable 
> use
> of a feature that the client might already have?
>

In this scenario, yes. However, we can actually speed things up if we 
want.
Extending the format is a matter of updating the XSD, recompiling the 
bindings, and distributing both,
which could be done on a day's notice.
That said, it is also possible for a vendor who is in control of the 
deployment to extend the XSD himself
and update it on his clients. That process shouldn't be difficult, as it 
takes the same steps as above.

>> Of course there are values which are by definition not useful for
>> others, but even those can be added to the format as a vendor 
>> extension.
>
> Would that take the full KEP process?
>

How we handle such extension requests is up for discussion.

> Don't misunderstand me: I think the KEP process is great, but taking
> a full KEP to get a tag into the format so that a client can
> support a specific feature when using Kolab as a backend seems a bit
> much.
>

Agreed.

>>
>> This way we have a much stricter definition which is much easier to
>> implement.
>
> To me the fact that it is easier to implement seems to be the main
> driving factor behind the request.
>

It is one of the driving factors.

> I'm not saying that the extensibility feature should be retained at
> all costs. But dropping it because the implementation based on an XSD
> is hard does not seem to be a good reason.
>

As I said, it is not the only reason, but nevertheless in my opinion a 
very good one.
I consider a schema based specification a big step forward for the XML 
format, and databindings the only way to get as many clients as 
possible to adhere to the spec. I believe that without schema based 
databindings there will always be client implementations conflicting 
with other clients.

>
>> Using XSD it is actually not feasible to allow unknown tags
>> anywhere in the format, but in my opinion not allowing unknown tags 
>> is
>> the only way to get a well defined format. Without the use of 
>> namespaces
>> it is even impossible to implement unknown elements with XSD.
>
> So you are saying that the use of a namespace (can't see a problem
> with that) would allow using unknown elements? Why not go in that
> direction then?
>

If the undefined elements only occur in one specific place (as 
subelements of a special element or at the end of the file), it would be 
possible.
That aside, I still believe that the format is much better defined 
without unknown tags. Note that it will be a lot easier to extend the 
format in a controlled way if we have XSD based databindings, so the 
implications are not the same as they would be with the current way of 
handling the XML files.
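
For illustration, a minimal sketch of the "only at the end of the
file" variant, assuming the format has a namespace. The namespace URIs
and the split_children() helper are invented for this example. Elements
in the format's own namespace are treated as known, anything from
another namespace is only accepted after them. That is roughly what an
xs:any wildcard restricted to "##other" namespaces at the end of a
sequence would express in the XSD.

```python
import xml.etree.ElementTree as ET

KOLAB_NS = "http://example.org/kolab"  # hypothetical namespace URI

SOURCE = """<note xmlns="http://example.org/kolab"
      xmlns:v="http://example.org/vendor">
  <summary>Shopping</summary>
  <body>Milk, eggs</body>
  <v:colour>yellow</v:colour>
</note>"""


def split_children(xml_text):
    """Return (known, foreign) children; foreign ones must come last."""
    root = ET.fromstring(xml_text)
    known, foreign = [], []
    for child in root:
        # ElementTree spells qualified names as "{namespace-uri}localname".
        if child.tag.startswith("{" + KOLAB_NS + "}"):
            if foreign:
                raise ValueError("format element after a foreign element")
            known.append(child)
        else:
            foreign.append(child)
    return known, foreign


known, foreign = split_children(SOURCE)
known_names = [el.tag.split("}", 1)[1] for el in known]
foreign_tags = [el.tag for el in foreign]
print(known_names, foreign_tags)
```

Without a namespace there is no such criterion to tell a misspelled or
misplaced format element from a vendor extension.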

Cheers,
Christian

> Cheers,
>
> Gunnar
>
>>
>> Of course this change in the format would imply that we include the
>> extensions currently in use in the format.
>
>>
>> === Undefined order of elements ===
>>
>> This is mainly a technical problem for XSD. It is not feasible to
>> implement an XSD Schema with an undefined order of the elements. I
>> would therefore like to make the order a requirement. I also don't
>> see any drawback to this approach, especially once the
>> implementations are based on the XSD, which will do that job for the
>> developer anyway.
>>
>> === No defined namespace ===
>>
>> Because the current format lacks a namespace it is not suitable to
>> be used together with other XML technologies such as XSLT. The lack
>> of a namespace increases the chance of name conflicts with other
>> formats such as ICAL. The design with a namespace is also more robust
>> should we ever want to extend the format. After all, it is just good
>> practice with no drawbacks, so I'd suggest adding a namespace if we
>> change the format anyway.
>>
>> == Conclusion ==
>>
>> For these reasons I propose to change the Kolab XML Format in the
>> following ways:
>>
>> - disallow unknown tags
>> - include the unknown tags currently in use into the format
>> - make the order of the elements a requirement
>>