Cloud computing has largely passed through its hype phase - that period where all kinds of magical characteristics are attributed to it because no one really knows that much about it - and is shifting to real-world scenarios where people are actually building clouds and finding the limitations therein (as the cycle moves through the Trough of Despair).
Despite this incipient skeptical phase, there are some significant reasons to think hard about the combination of cloud computing and XML Databases, especially those such as MarkLogic or eXist-db that combine the database with a web app server. There are many ways of using such servers, though I believe one of the most powerful is to use a RESTful Services approach in conjunction with URL rewriting in order to create intelligent file systems.
What is an intelligent file system? It's a file system that generates a proxy layer between the internal files that it holds and the outside world, one that can shape both inbound and outbound content so that what is presented or accepted in the file system can change depending upon who's using it, why they're using it, and what they need.
It's a file system that can let you search for files by specific terms, categories, dates, or far more sophisticated queries ("Give me all files that I wrote last month where I talked about ducks and their relationship to XQuery, in descending order by date ..."), and it's a file system that can generate its own editors and reports for those documents, either singly or in the aggregate ("... and combine them all as a slideshow, with the section titles being the slide bullets.").
Such a file system doesn't need to make use of folders, which as an organizing principle fail once you get more than a couple of levels deep. It sees file formats not as hard and fast characteristics of the file but as preferences imposed by the user ("I'd like it as an ODP presentation instead of a PowerPoint presentation."). It respects its users ("If a member of my class wants to access each file, they can only see it as a static web page, but I want the file open with a toggleable editor."), and the files are accessible from any Internet-aware machine ("I also want a quick summary page for iPads.").
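To make the proxy idea concrete, here is a minimal sketch in Python of how URL rewriting might separate the public address of a document from what is stored and what is shown. The URL shape (`/collection/document.format`), the role names, and the functions `rewrite` and `resolve` are all assumptions made up for illustration; they are not taken from MarkLogic, eXist-db, or any other product.

```python
import re
from typing import NamedTuple


class Target(NamedTuple):
    collection: str
    doc_id: str
    fmt: str


# Hypothetical public URL shape: /<collection>/<document-id>.<format>
URL_PATTERN = re.compile(
    r"^/(?P<collection>[a-z]+)/(?P<doc_id>[\w-]+)\.(?P<fmt>\w+)$"
)

# Hypothetical policy: which view of a document each role receives.
VIEW_BY_ROLE = {"teacher": "editor", "student": "static"}


def rewrite(url: str) -> Target:
    """Map a public URL onto the internal document and the requested format."""
    match = URL_PATTERN.match(url)
    if match is None:
        raise ValueError(f"unrecognized URL: {url}")
    return Target(match["collection"], match["doc_id"], match["fmt"])


def resolve(url: str, role: str) -> dict:
    """Decide what to serve: same stored document, different presentation."""
    target = rewrite(url)
    # Roles we do not know about fall back to the read-only view.
    view = VIEW_BY_ROLE.get(role, "static")
    return {
        "collection": target.collection,
        "doc": target.doc_id,
        "format": target.fmt,
        "view": view,
    }
```

The extension in the URL is treated as a preference rather than a property of the stored file, so `/lessons/ducks.odp` and `/lessons/ducks.html` would both resolve to the same underlying document.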
This approach can be done for simple documents using a SQL relational database. But most documents, almost by definition, are not simple, and the contortions necessary to get more complex documents to behave in this way can become significant. This is one of the areas where XML Databases, on the other hand, shine bright. By taking a RESTful approach - by creating filters for both ingest and delivery of content for each underlying document and their representations while at the same time maintaining the appearance of simple RESTful publishing - you can gain this flexibility relatively easily, and in a highly scalable manner. You can also use this approach with "virtual" documents that are composed from other documents dynamically, even to the extent of decomposing the virtual documents back to their constituent parts during a virtual "publish".
This approach, one I call Dynamic RESTful Services though it is also subsumed under XRX, can make documents act intelligently, configuring themselves to the needs of each user as appropriate. It reduces (and in some cases completely eliminates) the need for specialized, stand-alone applications, though it can certainly generate content to be used by them - the same presentation as described above could be mapped to a PowerPoint, a Word Doc, a PDF document, a Flash SWF, or even an HTML 5 "page".
This doesn't come free, of course - each core document type, stored within the system, needs to have transformations developed that will map to the appropriate output, or that will accept and modify the appropriate input, but by providing a consistent mechanism for how such transformations work this approach makes it far easier to develop such transformations consistently.
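The "consistent mechanism" for transformations can be pictured as a registry keyed by document type and output format. The sketch below is one possible shape for it, written in Python with the standard `xml.etree.ElementTree` module standing in for the XML database; the `delivers` decorator, the `deliver` function, and the `<lesson>` markup are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

Transform = Callable[[ET.Element], str]

# Registry of delivery transformations, keyed by (document type, format).
DELIVERY: Dict[Tuple[str, str], Transform] = {}


def delivers(doc_type: str, fmt: str):
    """Register a function as the transformation for one type/format pair."""
    def register(func: Transform) -> Transform:
        DELIVERY[(doc_type, fmt)] = func
        return func
    return register


def section_titles(lesson: ET.Element) -> List[str]:
    return [s.findtext("title", default="") for s in lesson.findall("section")]


@delivers("lesson", "html")
def lesson_as_html(lesson: ET.Element) -> str:
    # Escaping is omitted to keep the sketch short.
    bullets = "".join(f"<li>{t}</li>" for t in section_titles(lesson))
    return f"<h1>{lesson.findtext('title')}</h1><ul>{bullets}</ul>"


@delivers("lesson", "text")
def lesson_as_text(lesson: ET.Element) -> str:
    lines = [lesson.findtext("title", default="")]
    lines += [f"- {t}" for t in section_titles(lesson)]
    return "\n".join(lines)


def deliver(doc_type: str, doc: ET.Element, fmt: str) -> str:
    """Look up and apply the transformation, or fail with a clear error."""
    try:
        transform = DELIVERY[(doc_type, fmt)]
    except KeyError:
        raise LookupError(f"no {fmt} transformation for {doc_type}") from None
    return transform(doc)


# A made-up stored document, used only to exercise the sketch.
LESSON = ET.fromstring(
    "<lesson><title>Ducks and XQuery</title>"
    "<section><title>Migration</title></section>"
    "<section><title>Queries</title></section></lesson>"
)
```

Adding a new output format then means registering one more function rather than touching the stored document. An ingest registry could mirror this one in the other direction.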
In a single machine environment, the XML database would effectively house each of the "core" document types as collections - one collection for "lessons", for example, another for student grade books, a third for homework, another for student entries and so forth. There are obvious interdependencies at work here, which is one of the secondary reasons for ensuring that you have intelligent pipelines for requests and submissions, as these can expose hooks that let other collections' records react when a given record (such as a graded homework assignment) changes.
Yet the true power of such databases comes about via virtualization. A school district could set up a core XML database server configuration as a virtualization snapshot. Every year, new database collections would come online, while older ones could be effectively backed up and packed away as snapshots, with an index indicating which snapshot a given primary record is located on. If a search were made of older records, then the snapshot could be brought back online at that time.
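The index described here is essentially a lookup from record to snapshot, plus a note of which snapshots are currently running. A rough Python sketch follows. The class `SnapshotIndex`, its method names, and the record and snapshot identifiers are assumptions for illustration, and "bringing a snapshot online" is reduced to adding a name to a set.

```python
from typing import Dict, List, Set


class SnapshotIndex:
    """Tracks which archived snapshot holds each primary record."""

    def __init__(self) -> None:
        self._location: Dict[str, str] = {}  # record id -> snapshot name
        self._online: Set[str] = set()       # snapshots currently running

    def archive(self, record_id: str, snapshot: str) -> None:
        """Note that a record has been packed away in the given snapshot."""
        self._location[record_id] = snapshot

    def fetch(self, record_id: str) -> str:
        """Return the snapshot holding the record, restoring it if needed."""
        snapshot = self._location[record_id]
        if snapshot not in self._online:
            # Stand-in for actually restoring the virtual machine image.
            self._online.add(snapshot)
        return snapshot

    def online(self) -> List[str]:
        return sorted(self._online)
```

Nothing is restored until a search actually touches an older record, which is the point of keeping the index separate from the archived data.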
Additionally, such databases can also function as temporary buffers - storing such records locally after invocation until such time as they are no longer needed, while at the same time making use of the current presenters and ingest mechanisms on the existing systems (with some care for forward compatibility).
Moreover, virtualized servers and RESTful URLs make performing distributed queries considerably easier. If I were a researcher and wanted to run research queries on all of the students in a given state, creating a federated query would be relatively simple, and could be handled either synchronously (slow but comprehensive) or asynchronously (preferable, in that record results could come back as feeds that could be qualified as Atom content).
Indeed, this raises an interesting change to our concept of file access. Currently, most local file systems (and even network systems) are deterministic - you know exactly which files you are going to get back, barring file corruption or network interruption. However, distributed systems have comparatively longer latencies, sometimes significantly so, which means that such a file system becomes search oriented and stochastic - you may not necessarily find THE file you are looking for immediately, but if you are willing to wait a sufficient period of time, it will eventually show up. The question is one of whether the relevant file appears within the time envelope, which is where the probabilistic aspect of file access comes about.
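The "time envelope" can be illustrated with a small asynchronous sketch in Python. Each remote server is simulated by a delay rather than a real network call, and the names `query_server` and `federated_query`, along with the server names used to call them, are hypothetical. The only library behavior relied on is the standard `asyncio` module.

```python
import asyncio
from typing import Dict, List


async def query_server(name: str, delay: float) -> str:
    """Simulate one remote query; the delay stands in for network latency."""
    await asyncio.sleep(delay)
    return f"results from {name}"


async def federated_query(servers: Dict[str, float], envelope: float) -> List[str]:
    """Query every server at once and keep whatever arrives within the envelope."""
    tasks = [
        asyncio.create_task(query_server(name, delay))
        for name, delay in servers.items()
    ]
    done, pending = await asyncio.wait(tasks, timeout=envelope)
    # Servers that missed the envelope are abandoned, not awaited.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return sorted(task.result() for task in done)
```

Calling `asyncio.run(federated_query({"district-a": 0.01, "district-c": 1.0}, envelope=0.3))` would return only the answer from the fast server. Widening the envelope raises the chance that the slow one is included, which is the probabilistic trade-off described above.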
In addition, such queries also benefit from the pipeline, in that it provides a mechanism to perform identity sterilization in such a way that only the primary "record of storage" database has access to the full record.
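One plausible reading of sterilization in the delivery pipeline is sketched below in Python: identifying fields are dropped, and the student identifier is replaced by a keyed pseudonym that only the database of record could reproduce. The field names, the key, and the functions `pseudonym` and `sterilize` are assumptions for illustration; the article does not specify a mechanism. The sketch uses the standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac

# Hypothetical secret, held only by the primary "record of storage" database.
SECRET_KEY = b"held-only-by-the-record-of-storage"

# Hypothetical set of fields that identify a person directly.
IDENTIFYING_FIELDS = {"student_id", "name", "address"}


def pseudonym(student_id: str) -> str:
    """Derive a stable token that cannot be reversed without the key."""
    digest = hmac.new(SECRET_KEY, student_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def sterilize(record: dict) -> dict:
    """Return a copy of the record that is safe to send to a researcher."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["subject"] = pseudonym(record["student_id"])
    return cleaned
```

Because the pseudonym is stable, a researcher could still group results belonging to the same student across queries without ever seeing who that student is.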
The discussion here of asynchronous queries also points to another aspect of collections - there is comparatively little difference between a collection and a queue. A research collection built in such a way could be stored locally on the system doing the initial querying, and then as different systems check back in (typically by placing query requests on a queue on remote machines then querying feeds) the calling machine would add these items to the queue. This data becomes cached, and once the data is no longer needed, it could be flushed.
The example I've given here is for a school system, but there are endless alternative scenarios that could be supplied - sales accounts, invoicing systems, intelligent resume systems, blogging, government data, library systems and so on. Indeed, this last example shows how powerful such a system could be - creating clouds of library servers that utilize a common interchange and query mechanism opens up the possibility that you could make a request for a book that would go out to every library in the country, and the first library able to fulfill the request within given parameters (likely the lowest distribution costs) would then fulfill the request. Because each library would retain its own catalog cluster in the cloud, you'd always have a record of record while there would be no need to maintain a centralized catalog for the entire system.
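As a rough illustration of the selection step only, the Python sketch below picks a library from a set of offers that have already come back from the federated request. The `Offer` fields, the `choose_library` function, and the rule of "cheapest offer that meets the delivery deadline" are assumptions chosen for the example. The article suggests lowest distribution cost as the likely parameter without fixing the details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    library: str
    cost: float  # distribution cost quoted by the library
    days: int    # estimated days until the book arrives


def choose_library(offers: List[Offer], max_days: int) -> Optional[Offer]:
    """Pick the cheapest offer that can deliver within the allowed time."""
    eligible = [offer for offer in offers if offer.days <= max_days]
    return min(eligible, key=lambda offer: offer.cost, default=None)
```

No central catalog appears anywhere in this step: each offer comes from a library answering out of its own catalog, and the requester only compares the answers.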
Dave Kellogg, CEO of MarkLogic, began exploring the concept of Data as a Service (DAAS) fairly early on, and recently set up an agreement with Amazon to make a MarkLogic application server available as an Amazon Machine Image (AMI) in their Elastic Compute Cloud (EC2) service. This makes it possible to develop commodity XML computation at varying price points depending upon how extensively you use EC2 services (as well as making the more limited MarkLogic community server available in the EC2 environment). More information can be found at the MarkLogic Cloud Computing Center.
On the eXist-db side, there have been a number of test deployments of eXist on both Amazon Web Services and Google App Engine, the latter possible with the Java support that Google App Engine added a year ago. Both MarkLogic and eXist provide the internal web services infrastructure necessary to handle distributed data systems well, and eXist might be a suitable alternative to MarkLogic for organizations that can't afford MarkLogic licenses but have requirements larger than the MarkLogic community edition supports.
Perhaps one of the more intriguing players in this space launched very recently - 28msec debuted their distributed XQuery database called Sausalito, which is specifically developed with cloud use in mind. Their systems provide both unified app servers and XML database stores integrated with XQuery, working on top of EC2 virtualization instances. The approach they take is very much in the Dynamic RESTful Service model, with URL rewriting and XQuery based handlers mapping the underlying systems, supported by XQuery library modules.
Sausalito has also added an incentive to their offering - a contest for XQuery developers to utilize their Sausalito system in the most creative ways, with a prize of US$7500 to the winner.
Contests aside, this is a significant time - XML Databases are becoming more mature and are, more properly, XML-based application platforms: platforms that work remarkably well with distributed systems and cloud computing, and that can provide a rich system for working with large, complex data in a wide variety of applications. Pay close attention to this space.
3 comments
Dave Pawson says:
August 28, 2010 at 11:40 pm (UTC -7)
“the XML database would effectively house each of the “core” document types as collections – one collection for “lessons”, for example, another for student grade books, a third for homework,”
brings it back to reality Kurt? We still need to dream up categories (where we used to name directories)
If the above (AI class of app) is the dream, what is the current state of play? What class of query is realistic with today’s tech?
I’m sure some part of this is doable today?
Very thought provoking.
Kurt Cagle says:
August 29, 2010 at 8:30 am (UTC -7)
Categorization is both the basis for power and the source of a lot of potential problems moving forward. I’m finding with my own XRX work that the decision points have switched to “What is the best decomposition of document classes for a given domain?” That’s why I see data modeling becoming a much bigger part of the overall development process over time – we are moving to a mode where what becomes important are the collections of objects, while the interactions between them, which up until now have been the driving impetus in enterprise bus architectures, become more of an afterthought – but only if you have decomposed your model properly.
I’ve built prototype systems that can do everything I’ve talked about in the article, and I think that others are moving toward that as well. There’s still work to be done – RESTful services do not magically make everything easier, and they make some things more challenging, but overall the approach seems to solve more than a few of the problems that plague more intentional systems.
James Fuller says:
August 29, 2010 at 3:39 am (UTC -7)
Great article, it’s spawned a number of ruminations of my own (apologies for the length).
Widespread XML encoding of documents enabling open access to the data contained within has allowed us to think past documents demarcated merely as a physical file towards that resembling a classic database ‘view’.
I think we can attribute the ‘web page’ as changing the notion of what a document is over the past decade(s) and now users are much more comfortable with a document being more like a ‘database view’, but going from a static document to a ‘live’ one can have compelling knock on effects (think archiving for example).
So why do we not have a database as a filesystem today?
Microsoft certainly tried with ‘Yukon’ some years ago but maybe at the last moment decided against potentially cannibalizing and competing against their own SQL server product. We would obviously argue today that an XML database is a natural candidate to work as a repository of document orientated data versus shredding things up into a RDBMS.
From the reverse angle, there are interesting approaches using FUSE that reformulate data, allowing for a familiar ‘file explorer’ browsing experience. In Linux/Unix, we can also browse the system’s internal state via the /proc directory. I think these kinds of things confirm that users like browsing directory structures with files. People do seem to like browsing & managing files and folders, but maybe this is just due to long enforced convention.
Let’s not forget that the power of REST is somewhat indirectly related to these earlier notions of file systems, e.g. where would we be without the venerated URI efficiently addressing such hierarchical structures?
Back to the main points of your article …
I) I agree that XML Databases are the natural choice for managing document orientated data … documents can live in their creation format and we have much less marshaling between interim formats and fewer interfaces to program and thusly quicker development/fewer bugs. RESTful interfaces on an XML Database complete the ‘virtuous circle’ and make a compelling alternative to webservers serving up files from a file system; especially for data driven web application development.
II) REST (and things like URI’s) are the foundation for how things work on the web and building applications with these principles in mind ensures you are going ‘along the grain’ of the web. What is great news is that it seems like cloud computing and the slew of ‘new’ database technologies have embraced RESTful interfaces ‘enmasse’, in turn we have a critical mass of understanding of these principles, though I think more work is needed in communicating the correct practice of REST, there is a lot of imperfect understanding of even the basics of REST.
III) I am less sure about the whole cloud computing thing … like many developers I too rushed to deploy to Amazon a few years ago with an application and paid a lot more than expected and was less than happy with other things that matter (latency is a biggie).
I think what is more important and indirectly related to cloud computing is virtualization (and appliance approach to deployment) that gets imposed throughout the development lifecycle when building apps for the cloud. I may be wrong, but to me the ‘real infrastructure story’ over the past few years is virtualization and its impact is changing *everything*. Some people conflate ‘the cloud’ and ‘virtualisation’; I don’t, and I see them as two different (but potentially related) things.
I think what cloud computing does today is allow something that was impossible before, e.g. with cloud computing I can scale up simply using basic techniques. I don’t have to be so careful about how I design my application for scalability and from a developer point of view this is akin to the advance in available memory over the years … now we can use megabytes or gigabytes of memory when just a few years before we had far less (kilobytes). This kind of largesse breeds inefficiency which naturally chafes me as a developer but admittedly progress is sometimes measured more in terms of doing something ‘faster and bigger’ rather than ‘smarter, better and more efficient’.
I think the marketing behind the cloud is that there is a *lot* of money to be made in replacing systems in the enterprise .. e.g. will we see enterprise applications naturally get deployed in the cloud (as the ultimate mainframe replacement) ? I don’t know … but I do know that few of my applications ever need to be deployed at the ‘planet scale’ and whilst I relish the challenges of working on such projects I need to take the pragmatic view that it’s wasteful to use such technologies just because ‘they exist’ and if somehow I all of a sudden need to scale up … virtualisation gives us a quick (and paradoxically inefficient) route to scalability.
Most importantly and as with any technology adoption, choosing to use the cloud in your infrastructure can have a significant impact somewhere else ‘upstream’ or ‘downstream’ in your current development/maintenance process.
I have been involved in some very large build and deployment systems and I have found that the difference between success and failure tends to focus on software that few developers have experience with e.g. configuration management (I am a deep die hard fan of Puppet and cfengine). Cloud computing can quickly expose deficiencies in existing development processes, especially if the principles of continuous integration and deployment are not in place.
Thanks once again for the interesting article, gave me a lot of food for thought.