Over the last several weeks I’ve been developing XML schemas for a client to
support information exchanges between several different organizations. It was
important to make the schemas very explicit and “tight” so that each party can
validate XML before or after sending it. The XML documents could be used in
conjunction with Web services or as part of an “old-fashioned” file-based
exchange. In short, this was a pretty typical system integration task.

The client had already decided to standardize on XML Schema, so using RELAX NG
or Schematron was not an option.

XML Schema provides a lot of different capabilities, but based on my experience I
think that it could benefit from some improvements. Here are my random thoughts
on this. Now, I don’t claim to be the ultimate XML Schema expert, so take it for
what it’s worth.

  • Schema’s verbosity and unwieldy syntax make it a poor candidate for
    communicating and publishing XML structure and rules to a wide audience of
    technical people from different organizations who may or may not know XML
    Schema. For example, “minOccurs="0"” means “optional field”, which is probably
    not very intuitive to anyone unfamiliar with the Schema specification. Even
    after formatting the schema for publishing (e.g., by using xsddoc), schemas
    are still hard to understand. Of course, one can use annotations and try to
    explain each type in plain English, but then the documentation always tends to
    get out of sync.

    The obvious counter-argument here is that XML Schema is designed to be a data
    modeling/validation tool and as such is not suitable for capturing business
    requirements. I just think it would be nice if it could really be used for
    both, essentially becoming the “system of record” for integrating different
    systems and organizations.

  • Error messages thrown by XML parsers are far from being the most intuitive (this
    obviously depends on the parser and I have not done any comparative analysis).
    For example, a missing required element results in “Element ‘element name’ is
    not valid for content model” where ‘element name’ is the name of the element
    following the missing required element. Why can’t the parser simply say
    “Required element is missing”? Again, this problem is exacerbated when you’re
    dealing with people with only cursory XML Schema knowledge. I’m not aware of a
    standard way to customize error messages, so in my case developers will have to
    do error translation in the code.
  • XML Schema users are forced to use regular expressions for defining any more or
    less complex template for simple types (phone number, SSN, etc.). This poses a
    problem in an environment where you can’t expect all users to be familiar with
    regexp syntax. A message such as “Value does not match regular expression
    facet ‘\+?[0-9\-\(\)\s]{1,25}’” could very easily befuddle the uninitiated. I
    wish there was a simplified templating mechanism, maybe something similar to
    the java.text.DecimalFormat pattern “##.##”.
  • Reuse capabilities in XML Schema are not perfect. Extension (xs:extension) only
    allows appending elements to the end of the sequence. Restriction
    (xs:restriction) requires repeating the content model of the parent type. This
    creates very brittle schemas and violates the DRY principle. There is no way to
    parameterize an XML type. Let’s say there is a “name” type with “first” and
    “last” elements. When a new person is added, I want the “last” element to be
    mandatory. In an “update” situation all fields could be optional. I wish I
    could make “minOccurs” a parameter here.
  • XML Schema may seem very OO-like at first glance, but in fact it is missing
    some important OO-like capabilities. For instance, there is no element-level
    polymorphism. In the example above, I wanted to change the “behavior” of “last”
    (some aspect of this type definition) in a subtype and I can’t do that.
    Inheritance by restriction for complex types (I don’t have a problem with using
    it for simple types) is IMO counter-intuitive. So now I can have a child which
    does not have all properties of its parent, and so there is no way to enforce
    optional elements for all children.
  • Element and type scoping could’ve been more flexible. All declarations are
    either parent element-scoped (local) or global. This does not allow me to define
    a reusable type or a group scoped to a certain parent element or to a file (any
    more or less complex schema would have to be broken down into multiple files for
    manageability’s sake). So say I have a name type for a person’s name (first,
    middle, last) and a business name type with a different structure. If I want to
    use the person’s name type for different elements within the Person type, I
    will have to define it as global and name it PersonNameType, essentially
    embedding the parent’s name into the child’s name. I wish I could simply define
    NameType and specify what parent type or element it is scoped to.
  • XML Schema is a declarative language and so it lacks programming constructs,
    which is fine. But there is still a need for a Schematron-like facility or a
    scripting language for expressing complex rules (such as cross-field
    validation). Schematron works fine when embedded inside annotations, but it
    requires a separate validation step and the Schematron XSLT. So it would be
    great if this capability was supported by the Schema standard and natively
    understood by XML parsers. This would make schemas truly self-contained.
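
The “error translation in the code” mentioned above could look something like
the following Python sketch. The two raw message patterns are taken from the
examples quoted in this post; the exact wording differs between parsers and
versions, so the rule table and the friendly texts are assumptions for
illustration, not any parser’s actual API.

```python
import re

# Assumed raw-message patterns; real wording varies by parser and version,
# so each rule would have to be adjusted to the parser actually in use.
_RULES = [
    (
        re.compile(r"Element '(?P<name>[^']+)' is not valid for content model"),
        "Unexpected element '{name}': a required element before it "
        "may be missing or out of order.",
    ),
    (
        re.compile(r"Value does not match regular expression facet"),
        "The value is not in the expected format.",
    ),
]


def translate(raw_message: str) -> str:
    """Return a friendlier message, or the raw one if no rule matches."""
    for pattern, template in _RULES:
        match = pattern.search(raw_message)
        if match:
            # Named groups from the pattern fill the template placeholders.
            return template.format(**match.groupdict())
    return raw_message


print(translate("Element 'last' is not valid for content model"))
```

Falling back to the raw message keeps unknown errors visible instead of hiding
them behind a generic text.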

So my wish list is actually quite simple:

  • Make XML schemas easier to understand for non-technical users or people without
    schema knowledge, perhaps via some intelligent translation mechanism.
  • Make the Schema more powerful by allowing programming constructs, variables and
    more flexible scoping.
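
As a rough idea of what a very modest “translation mechanism” might do, here is
a Python sketch that reads a schema and prints one plain-English line per
element. The NameType schema is a hypothetical example modeled on the “name”
type discussed above (the “alias” element is invented for illustration), and
the sketch only handles flat, named complex types.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema for illustration only.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="NameType">
    <xs:sequence>
      <xs:element name="first" type="xs:string" minOccurs="0"/>
      <xs:element name="last" type="xs:string"/>
      <xs:element name="alias" type="xs:string"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def describe(schema_text):
    """Return one plain-English line per element in each complex type."""
    root = ET.fromstring(schema_text)
    lines = []
    for ctype in root.iter(XS + "complexType"):
        type_name = ctype.get("name", "(anonymous)")
        for elem in ctype.iter(XS + "element"):
            # minOccurs and maxOccurs both default to 1 in XML Schema.
            min_occurs = elem.get("minOccurs", "1")
            max_occurs = elem.get("maxOccurs", "1")
            need = "optional" if min_occurs == "0" else "required"
            count = "at most once" if max_occurs == "1" else "may repeat"
            lines.append(f"{type_name}.{elem.get('name')}: {need}, {count}")
    return lines


for line in describe(SCHEMA):
    print(line)
# NameType.first: optional, at most once
# NameType.last: required, at most once
# NameType.alias: optional, may repeat
```

A real tool would also need to cover nested types, groups, references and
facets; this only shows that the occurrence rules can be spelled out in words.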
