Aug
17

XML, UML and the Zen of Data Modeling

Editor Note: This was originally from May 28, 2010

I've been conspicuously absent from my own site of late. I can cite moving cross country, starting a new job with a contractor working with a major US archival agency, house hunting, and the stresses that come with all of these as reasons for that absence, but as things slowly begin to settle down here, I'm hoping to get back to writing regularly for XMLToday.org and elsewhere.

Shortly after starting (within an hour of arriving at my new workplace, in fact), I was handed a dozen or so schemas, generated from UML documents, that were being used to drive an application for the client, and told that my job was to manage them. Over the course of the last two months, this task has turned into a fairly deep, soul searching experience about what exactly we mean by data modeling, and has left scars that will likely take years to heal.

There are several facets to the process of creating broad data models that seemed worth sharing with the XML community through this site. These are, admittedly, subjective: your mileage may vary, and no doubt there are people who will feel strongly that the exact opposite is true. I put this list out as a challenge to others to explore the principles of XML data modeling, and how it differs from more traditional modeling.

Rule #1: The Lord of the Rings Principle

One of the slogans that the US National Archives and Records Administration (known in these parts as NARA) uses in much of their public facing sites is that their mission of archiving is intended "for the Life of the Republic". The underlying principle here is simple - both the archives and the archival metadata need to be designed in such a way that they can be used ten years, fifty years, even a hundred years down the road. One wag at the office made the observation that the acronym for the expression (LotR) happened to be the same as that for Lord of the Rings, so this concept has come to be called the Lord of the Rings Principle.

One consequence of this principle is that everything that you do needs to be thought out with the understanding that the end consumers of data modeling schemas will not necessarily be your organization. This is especially true of governmental schemas, which may very well end up getting adopted by organizations well outside of your immediate scope, both at the governmental level and in both business and education.

There's an interesting flip-side to this. I've worked with clients in the past who see developing schemas as something that will be a feather in their cap - being the one to design a specific business or archival schema that gets heavily adopted can be a real career boost. The danger with this is that this attitude tends to obscure the more relevant question of whether the schema itself is designed for longevity. The LotR test should not be seen as dictating the importance of the schema, but rather the structure of the schema.

 

Rule #2: Distinguish Between Archival Objects and System Objects

 

Schemas in this day and age are seldom single documents. Instead, they often range across dozens (or in some extreme cases hundreds) of individual schema documents, dealing with hundreds or thousands of distinct objects. One of the major challenges when dealing with this many objects is understanding which of these are archival objects and which are system objects.

An archival object is one which will be persisted, will be made available to outside users beyond the scope of the application, and will typically need to be designed for clarity, ease of use and intuitiveness. These objects need to satisfy the LotR Principle. System objects, on the other hand, are typically used as part of the workflow for creating, updating or monitoring the archival objects. While they may still need to be validated, system objects generally can be designed for performance or processing optimizations rather than the transparency of archival objects.

Understanding which is which is important, and not always easy to determine. This is also one of those situations where using multiple namespaces can be important - by differentiating between system and archival objects in namespaces, the distinction between the two can be easier to maintain.

It's worth noting that system objects typically don't require the same level of scrutiny from all players that archival objects do. Archival objects are effectively contracts - their definitions need to be clear and consistent and need to be agreed upon by all the participants in that data modeling process. This means that designing such schemas can take months (or even years) and can often be a rancorous political problem rather than an engineering one. By agreeing ahead of time which objects fall within the archival domain, you can shorten this political process and can move more quickly to application development.

 

Rule #3: Establish Naming and Design Conventions Early ... and Stay Consistent

 

Data models precede schema development. In essence you are designing a language. The easiest languages to learn are those that have regular, consistent, clear rules for each term in that language. The more you deviate from these rules, the more exceptions you create, and the more complex the coding that needs to be done to make use of these objects.

While a full list of such conventions is well beyond the scope of this article, there are some obvious hot spots that deserve special attention:

 

  • Each object should have a consistent mechanism for identifying that object uniquely - an ID of some sort - and there should additionally be a consistent mechanism for referencing that object from some other document. Almost all large schemas will have internal references of some sort - how these objects are referenced will determine a significant portion of your architecture.
  • As an added rule of thumb, even if your objects are not necessarily web accessible, consider using URLs, rather than just numbers, to identify and reference resources in archival schemas. Local numeric indices will fail the moment that you have to differentiate between multiple data repositories.
  • Use prefixes or suffixes on your data model terms to identify common patterns. For instance, in the NARA case the term Set indicated a pattern for a containing element that occurs zero or more times, and that holds a collection of zero or more "records" that are either the same type or the same abstract type. The W3C has two excellent reference notes on design patterns in schema, Basic XML Schema Patterns for Databinding and Advanced XML Schema Patterns for Databinding, that can help to identify such patterns.
  • The specific preference for case usage in terms is very subjective - we used Title case (such as RecordScheduleSet) to indicate element names, but could have chosen to use camel case, hyphens or underscores. Whatever you do, use a visibly different convention for simple and complex type names - e.g., the complex type for RecordScheduleSet in our XSDs was named "RecordScheduleSet_Type". This makes it immediately evident what's an element name versus an element's type definition.
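
Conventions like these are easiest to keep when they are checked mechanically. The following is a minimal sketch in Python (standard library only), not a tool we used on the project: the sample schema, the two regular expressions and the function name naming_violations are assumptions made for illustration, and they encode only the Title-case element and "_Type" suffix conventions described above.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment; the last complex type deliberately breaks the convention.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="RecordScheduleSet" type="RecordScheduleSet_Type"/>
  <xs:complexType name="RecordScheduleSet_Type"/>
  <xs:complexType name="recordSchedule"/>
</xs:schema>
"""

# Assumed conventions: Title-case element names, Title-case type names ending in _Type.
ELEMENT_NAME = re.compile(r"^(?:[A-Z][a-z0-9]*)+$")
TYPE_NAME = re.compile(r"^(?:[A-Z][a-z0-9]*)+_Type$")


def naming_violations(xsd_text):
    """Return (kind, name) pairs for named declarations that break the conventions."""
    root = ET.fromstring(xsd_text)
    problems = []
    for node in root.iter():
        name = node.get("name")
        if name is None:
            continue
        if node.tag == XS + "element" and not ELEMENT_NAME.match(name):
            problems.append(("element", name))
        elif node.tag in (XS + "complexType", XS + "simpleType") and not TYPE_NAME.match(name):
            problems.append(("type", name))
    return problems


violations = naming_violations(SCHEMA)
print(violations)
```

A check of this kind could run whenever schemas are regenerated from the UML, so that drift from the agreed conventions is caught before it reaches reviewers.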

 

 

Rule #4: Enumerations Can Be Quicksand

 

Enumerations - a list of tokens indicating potential choices for a string - seem fairly innocuous, but many of the biggest issues that we faced seemed to inevitably be signaled by the use of enumerations.

Frequently we encountered a pattern that looked something like this:

 

MyObjectContainer:MyObjectContainer_Type
MyObject:MyObject_Type
    ObjectId:Id_Type
    ObjectTitle:Title_Type
    ObjectCategory:Enumeration_Type {ObjCat1,ObjCat2,ObjCat3}
    ObjectProperty1:Property1_Type [0..1]
    ObjectProperty2:Property2_Type [0..1]
    ObjectProperty3:Property3_Type [0..1]

 

where the ObjectCategory would be taken from the enumerated set, and where the specific properties being displayed were dependent upon which enumeration value was chosen. To me, this became a signal that we had an incompletely modeled section. Typically, the resolution to this was to create a base type from MyObject_Type, then inherit this into three distinct objects:

 

MyObject:MyObject_Type
    ObjectId:Id_Type
    ObjectTitle:Title_Type
MyObjectContainer:MyObjectContainer_Type
    {
MyCat1Object:MyCat1Object_Type::MyObject_Type
    ObjectProperty1:Property1_Type,
MyCat2Object:MyCat2Object_Type::MyObject_Type
    ObjectProperty2:Property2_Type,
MyCat3Object:MyCat3Object_Type::MyObject_Type
    ObjectProperty3:Property3_Type
    }

 

where the notation A:B::C reads as - A is of type B, which inherits type C, and the sequence in the brackets {} indicates a choice between the three objects (a form of inheritance, discussed shortly).
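
One way to look for this signal in existing data is to tabulate which optional properties actually appear under each category value. The following is a minimal Python sketch using only the standard library; the sample instance, the COMMON set and the function name properties_by_category are made up for illustration and assume the enumeration-driven layout of the first listing.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical instance data in the enumeration-driven form.
INSTANCE = """\
<MyObjectContainer>
  <MyObject>
    <ObjectId>1</ObjectId><ObjectTitle>First</ObjectTitle>
    <ObjectCategory>ObjCat1</ObjectCategory>
    <ObjectProperty1>a</ObjectProperty1>
  </MyObject>
  <MyObject>
    <ObjectId>2</ObjectId><ObjectTitle>Second</ObjectTitle>
    <ObjectCategory>ObjCat2</ObjectCategory>
    <ObjectProperty2>b</ObjectProperty2>
  </MyObject>
  <MyObject>
    <ObjectId>3</ObjectId><ObjectTitle>Third</ObjectTitle>
    <ObjectCategory>ObjCat1</ObjectCategory>
    <ObjectProperty1>c</ObjectProperty1>
  </MyObject>
</MyObjectContainer>
"""

# Properties every MyObject carries regardless of category.
COMMON = {"ObjectId", "ObjectTitle", "ObjectCategory"}


def properties_by_category(xml_text):
    """Map each ObjectCategory value to the set of optional properties seen with it."""
    used = defaultdict(set)
    for obj in ET.fromstring(xml_text).findall("MyObject"):
        category = obj.findtext("ObjectCategory")
        used[category].update(child.tag for child in obj if child.tag not in COMMON)
    return dict(used)


usage = properties_by_category(INSTANCE)
print(usage)
```

If each category turns out to use its own disjoint set of properties, as in this sample, the enumeration is probably standing in for a set of subtypes and the choice-based model is worth considering.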

This isn't always consistent - sometimes there were no child properties involved, so that while the above form would be valid, it's not necessary. This highlights one of the more critical aspects of such schema development - there's often a judgement call that needs to be made when designing schemas as to the point at which you do things like decomposing a type into a base type and associated inherited types. That's why schema design is still more art than science.

A second problematic area for enumerations comes when those enumerations are externally controlled and potentially dynamic. This is both a data modeling and an XSD problem, as XSD doesn't have any formal mechanism for dealing with such external enumerations (i.e., it's usually application defined). In general, this implies that the schemas in question need to identify early on which enumerations are stable and static, which are unstable and dynamic, and which are hiding type declarations and treat these separately, then deal with these three cases in a clearly consistent manner.

One alternative for the dynamic case is to annotate the schema with a pointer to a web service (i.e., create a <Source> element in the <xs:appinfo> element within a schema that points to a web service that retrieves a consistent structure for each code list). Once you have this, identify the type for the element itself as a distinct CodeList_Type or ControlledVocabulary_Type that in turn maps to a string. This can prove very useful if you're using your schemas for code autogeneration. (This technique also hints at the idea of using simple type maps for properties that are scalar but still have significance beyond their base xsi:type.)
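
To make the annotation idea concrete, here is a minimal Python sketch (standard library only) of how a code generator might discover those pointers. The schema fragment, the example.org URL and the function name code_list_sources are assumptions for illustration; the only things taken from the text are the <Source> element inside xs:appinfo and the CodeList_Type name.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical schema: ObjectCategory is a dynamic code list whose values
# live behind a service rather than in an xs:enumeration.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="CodeList_Type">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:element name="ObjectCategory" type="CodeList_Type">
    <xs:annotation>
      <xs:appinfo>
        <Source>http://example.org/codelists/object-category</Source>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
"""


def code_list_sources(xsd_text):
    """Map each top-level element name to the code list service named in its appinfo."""
    root = ET.fromstring(xsd_text)
    sources = {}
    for element in root.findall("xs:element", NS):
        source = element.find("xs:annotation/xs:appinfo/Source", NS)
        if source is not None and source.text:
            sources[element.get("name")] = source.text.strip()
    return sources


sources = code_list_sources(SCHEMA)
print(sources)
```

The schema validator itself ignores the content of xs:appinfo, so checking a value against the live code list remains an application responsibility in this arrangement.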

 

Rule #5: Introverting Inheritance

 

Inheritance can be both your friend and your enemy when designing complex schemas, and should be approached with a great deal of deliberation. When discussing inheritance, it is worth considering that while many languages, especially object oriented languages like Java or C++, incorporate the concept of inheritance, such inheritance is typically focused on the methods rather than the external properties of a given object. Because XML documents are nothing but properties and collections of properties, the benefits of inheritance are not as strong as they are with Java objects.

Given that UML tools (we're using the Rational toolset) are designed to model OOP-centered languages, such tools are inclined to spit out inheritance diagrams that describe business object relationships without considering the distinction between XML and Java, especially since the XML Schema Definition Language (XSD) also incorporates both concrete and abstract inheritance capabilities.

A little bit of history, though, can prove revealing here. The earliest drafts of XSD did not include a mechanism for abstract inheritance, because it went counter to the way that you build descriptive document object models. While the language was still in Working Draft, SOAP and WSDL were also evolving, and the SOAP community in particular (represented by heavyweights IBM, Microsoft and BEA) pushed heavily for the incorporation of abstract types so that XSD could better be used for marshaling objects into and out of Java or C++. There was some opposition to this from the more traditional document camp, but in the end abstract inheritance made its way into XSD. It's noteworthy that James Clark and Murata Makoto created the RELAX NG language in great part because they felt so strongly that inheritance was neither required by XML schemas nor desirable.

Mapping abstract inheritance in particular to XML can introduce some seriously odd artifacts into the language. For instance, suppose that you wanted to create a base abstract class called Vehicle which could then be instantiated as an Automobile class, a Train class and an Airplane class. In a purely declarative mode, it may be that all three subclasses share the following properties:

 

Vehicle:
    Id:Id_Type
    Location:
        Latitude:Latitude_Type
        Longitude:Longitude_Type
    Velocity:
        LatitudeDelta:Arc_Type
        LongitudeDelta:Arc_Type
    NumberOfPassengers:Integer

 

A traditional inheritance model here would work by extension:

 

Automobile::Vehicle
    Make:String
    Model:String
    Year:Date
Train::Vehicle
    TrackLine:String
    TransportOwner:String
Airplane::Vehicle
    FlightNum:String
    Airline:String
    DepartingAirport:String
    ArrivingAirport:String

 

If you used XSD, then each class would inherit the base class via an extension class:

 

<xs:complexType name="Vehicle_Type" abstract="true">
    <xs:sequence>
        <xs:element name="Id" type="xs:ID"/>
        <xs:element name="Location" type="Location_Type"/>
        <xs:element name="Velocity" type="Velocity_Type"/>
        <xs:element name="NumberOfPassengers" type="xs:integer"/>
    </xs:sequence>
</xs:complexType>

<xs:complexType name="Automobile_Type">
    <xs:complexContent>
        <xs:extension base="Vehicle_Type">
            <xs:sequence>
                 <xs:element name="Make" type="xs:string"/>
                 <xs:element name="Model" type="xs:string"/>
                 <xs:element name="Year" type="xs:gYear"/>
            </xs:sequence>
        </xs:extension>
    </xs:complexContent>
</xs:complexType>

 

While this may look obvious when modeling it, the rendering of the XML can look odd indeed:

 

<Vehicle xsi:type="Automobile_Type">
    <Id>A12345</Id>
    <Location>
        <Latitude>45.582</Latitude>
        <Longitude>-165.123</Longitude>
    </Location>
    <Velocity>
        <LatitudeDelta>0.192</LatitudeDelta>
        <LongitudeDelta>-0.052</LongitudeDelta>
    </Velocity>
    <NumberOfPassengers>2</NumberOfPassengers>
    <Make>Ford</Make>
    <Model>Focus Sedan</Model>
    <Year>2000</Year>
</Vehicle>

 

The expression <Vehicle xsi:type="Automobile_Type"> may in fact be easier for Java or C++ to use, but it makes XPath expressions more complex, causes some real issues with XML databases, and can often hide a lot of the underlying logical model in differentiating between what is just a vehicle and what is an automobile.

One alternative that I proposed for NARA (and that ultimately ended up becoming a part of the specification) was to use a technique which I termed Introversion, though it's more generally a form of composition. In this case, I modeled the inheritance by making the base class explicit and including it within the subclass. The introverted model then looked something like:

 

<Automobile>
    <Vehicle>
        <Id>A12345</Id>
        <Location>
            <Latitude>45.582</Latitude>
            <Longitude>-165.123</Longitude>
        </Location>
        <Velocity>
            <LatitudeDelta>0.192</LatitudeDelta>
            <LongitudeDelta>-0.052</LongitudeDelta>
        </Velocity>
        <NumberOfPassengers>2</NumberOfPassengers>
    </Vehicle>
    <Make>Ford</Make>
    <Model>Focus Sedan</Model>
    <Year>2000</Year>
</Automobile>

 

The introversion model works by "instantiating" the inheritance base class as an element that wraps the content of just the inherited properties of the base class (for instance, Vehicle ends up becoming a child of Automobile). This tends to make for deeper (i.e., bushier) XML models, but it also makes it possible to set up XPath expressions that look like VehicleSet/Automobile//NumberOfPassengers in order to retrieve common properties, instead of VehicleSet/Vehicle[@xsi:type='Automobile_Type']/NumberOfPassengers.
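
The difference in query shape can be seen in a small Python sketch using the standard library's ElementTree, which supports only a subset of XPath. Both sample documents are trimmed-down, hypothetical versions of the instances above (the Train records and the function names are assumptions), and the xsi:type case is filtered in Python rather than with an XPath predicate because the attribute is namespace-qualified.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xsi:type under its fully qualified attribute name.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

XSI_STYLE = """\
<VehicleSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Vehicle xsi:type="Automobile_Type">
    <NumberOfPassengers>2</NumberOfPassengers>
    <Make>Ford</Make>
  </Vehicle>
  <Vehicle xsi:type="Train_Type">
    <NumberOfPassengers>180</NumberOfPassengers>
    <TrackLine>Coast</TrackLine>
  </Vehicle>
</VehicleSet>
"""

INTROVERTED = """\
<VehicleSet>
  <Automobile>
    <Vehicle><NumberOfPassengers>2</NumberOfPassengers></Vehicle>
    <Make>Ford</Make>
  </Automobile>
  <Train>
    <Vehicle><NumberOfPassengers>180</NumberOfPassengers></Vehicle>
    <TrackLine>Coast</TrackLine>
  </Train>
</VehicleSet>
"""


def automobile_passengers_xsi(xml_text):
    """Select automobiles by inspecting the schema-dependent xsi:type attribute."""
    root = ET.fromstring(xml_text)
    return [int(v.findtext("NumberOfPassengers"))
            for v in root.findall("Vehicle")
            if v.get(XSI_TYPE) == "Automobile_Type"]


def automobile_passengers_introverted(xml_text):
    """Select automobiles by element name alone; no schema knowledge needed."""
    root = ET.fromstring(xml_text)
    return [int(n.text) for n in root.findall("Automobile//NumberOfPassengers")]


print(automobile_passengers_xsi(XSI_STYLE))
print(automobile_passengers_introverted(INTROVERTED))
```

Both functions return the same passenger counts, but only the first needs to know the XML Schema instance namespace and the type names declared in the XSD.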

In this approach, there's no reason to bring the schema into the XML instance (which is an important benefit, since XML structures in general should have no preferred schema structure - so long as the instance can be modeled consistently, it should in theory be unnecessary to require one particular schema layout over another). It's also more intuitive as to what the XML is supposed to model. It is also more verbose, though verbosity by itself is not always a bad thing when it comes to public schemas, if the verbosity serves to make the structure of the XML clearer.

Additionally, abstraction should be viewed with an eye toward utility. It's not uncommon in languages such as Java to create empty (or very nearly empty) abstract classes for organizational purposes. This can make it easier to do things such as iterations across multiple objects of the same base class type in those languages. However, these do nothing for XML processing languages, which can iterate over any node without regard to its semantic content and so gain no real benefit from such empty abstractions. This means that simply because an object model works for Java doesn't guarantee that such a model is in fact even remotely useful for XML. Put another way, when designing archival sequences in particular, design the XML first, then adapt any other language processes to use system objects that better encapsulate the local needs for processing.

These kinds of decisions should usually be made early in the design process, and here as elsewhere, there's a distinction between Archival and System objects. System objects may very well be marshaled via SOAP to other languages and as such use of abstract inheritance may be more efficient, but archival objects should stress clarity over efficiency.

 

Rule #6: To Wrap or Not To Wrap

 

The inheritance issue is part of a broader issue that every schema developer should be cognizant of - to wrap or not to wrap. A wrapper element serves to provide some form of structural organization without necessarily being strictly required in terms of the property sets involved. For instance, it is possible in UML to create a sequence of items Foo with minOccurs="0" and maxOccurs="unbounded" directly in the model. A language such as Java would interpret this as a list of items of that particular type. However, an XML interpretation would simply indicate that it's possible to have zero or more items that are siblings to all other properties.

There's nothing wrong with this - it is possible to create XPath expressions that can differentiate such siblings simply by name. However, it may not be obvious to a reader of the XML that such items are in fact "records" rather than simply properties. That is to say, the following model:

 

TransportationGrid
    TransportationGridId:Id_Type
    TransportationGridTitle:String
    TransportationGridIncrement:Long
    Vehicle:Vehicle_Type 0..*

 

is valid in that it validates a workable instance:

 

<TransportationGrid>
    <TransportationGridId>C81225F</TransportationGridId>
    <TransportationGridTitle>My Grid</TransportationGridTitle>
    <TransportationGridIncrement>72</TransportationGridIncrement>
    <Vehicle><!-- First vehicle--></Vehicle>
    <Vehicle><!-- Second vehicle --></Vehicle>
    <Vehicle><!-- Third vehicle --></Vehicle>
</TransportationGrid>

 

However, from an organizational standpoint, creating a formal containment wrapper can make the relationships far more obvious:

 

<TransportationGrid>
    <TransportationGridId>C81225F</TransportationGridId>
    <TransportationGridTitle>My Grid</TransportationGridTitle>
    <TransportationGridIncrement>72</TransportationGridIncrement>
    <VehicleSet>
        <Vehicle><!-- First vehicle--></Vehicle>
        <Vehicle><!-- Second vehicle --></Vehicle>
        <Vehicle><!-- Third vehicle --></Vehicle>
    </VehicleSet>
</TransportationGrid>

 

Thus, one of the challenges to the data modeler is deciding at what stage such wrappers make sense. My own personal sense is that for archival schemas, wrappers should be used liberally, even if it makes for more verbose code. Part of the reason for this is that such wrappers make the underlying design patterns used within the schemas clearer.
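
One practical payoff of the wrapper is that generic code can tell properties from record collections without knowing the vocabulary. The following minimal Python sketch (standard library only) assumes the "Set" suffix convention described under Rule #3; the function name summarize and the emptied-out Vehicle elements are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

WRAPPED = """\
<TransportationGrid>
  <TransportationGridId>C81225F</TransportationGridId>
  <TransportationGridTitle>My Grid</TransportationGridTitle>
  <TransportationGridIncrement>72</TransportationGridIncrement>
  <VehicleSet>
    <Vehicle/>
    <Vehicle/>
    <Vehicle/>
  </VehicleSet>
</TransportationGrid>
"""


def summarize(xml_text):
    """Split a root's children into scalar properties and wrapped record sets."""
    root = ET.fromstring(xml_text)
    properties = {}
    record_counts = {}
    for child in root:
        if child.tag.endswith("Set"):
            # A wrapper: its children are the records.
            record_counts[child.tag] = len(child)
        else:
            properties[child.tag] = child.text
    return properties, record_counts


properties, record_counts = summarize(WRAPPED)
print(properties)
print(record_counts)
```

Without the VehicleSet wrapper, the same code would need a hard-coded list of which sibling names are records, which is exactly the knowledge a future reader of an archival document may not have.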

Indeed, one exercise that every data modeler should do is attempt to build a model that emphasizes these design patterns. While there are a number of such patterns, the biggies that I've come across include the following:

 

  • Records and RecordSets. A record is a core level business object (an item) that is stored in a sequence of other such records. Records typically are unique (have a unique id relative to the recordset).
  • Feed and FeedItems. A feed is a recordset with an associated metadata block, and each item in the set is usually a pointer to a specific record rather than a record itself. The term derives from syndication feeds such as RSS or Atom.
  • PropertyBag. A property bag is a collection of properties contained within a wrapper that nonetheless have no internal uniqueness (such as the metadata header in a record or recordset).
  • Alternatives. The precursor of inheritance, alternatives are generally objects constrained by a choice block. These are often displayed as categories or tabs in UIs.
  • Root Structures. These are usually but not always the root node of an XML document, but in general these elements represent the root on which any display is set.
  • Metadata blocks. Metadata blocks are attached to objects with unique identities and serve to provide systemic level information about the entity that doesn't necessarily reflect the data associated with that given record or container. Identifiers, titles, descriptions and abstracts are usually found within such blocks.

 

Not surprisingly, these structures usually have associated search and user interface implications; indeed, a good exercise for the data modeler is to put together an XForms document and see how readily such structures can decompose into the associated design patterns. The easier the decomposition, the more "natural" the schema is likely to be in terms of writing XPath queries, transformations, XQuery scripts and so forth. When you have to spend a lot of time writing "one-offs", it may be a good indication that the model isn't as well optimized for XML processing as it could be.

 

Designing for XML

 

What this means for using UML tools to develop schemas is simple - remember at all times that at least some of your objects may very well have to satisfy the LotR test, and as such should be designed with XML processing in mind, not some other language used as a downstream consumer of the XML. This means that abstraction models and inheritance need to be examined with respect to what you're attempting to do, that you should be cognizant of the underlying design patterns and where they best work, and that you should attempt to be consistent in your modeling, especially when dealing with archival schemas. Time spent achieving those goals will be easily recouped when it comes time to start developing applications against these schemas later.
