Exploring the world of data and documents.

MarkLogic as Platform

By Kurt Cagle

One of the joys of being a consultant is that you go through stretches where there's not much to do, and then you get thrown into a project where you're burning 50+ hours a week, your kids have trouble recognizing you, and you wake up with nightmares about code that simply ... won't ... work! I'm in the latter phase at the moment, though I look to be wrapping up soon, at which point I'll finally get around to finishing the SVG book that I've been working on since March (sigh). Thus, to those who've wondered if this site has been abandoned, it hasn't, and I'm hoping to add some more intriguing articles to it very soon now.

One question that's been in the back of my mind as I've worked on MarkLogic projects lately was first raised in a Google Plus post by Google engineer Steve Yegge. He inadvertently sent a post meant for internal consumption to the global Plus user community, and it instantly became one of those seminal documents that will likely be cited in dissertations about the growth of the web twenty years from now. The question he posed was "What exactly constitutes a platform?", along with the corollary "Why do platforms matter?" As I've been deep in the bowels of a MarkLogic project for the last couple of months now, this tipped over in my mind to the logical set of questions: "Is MarkLogic a platform? Should it be?"

In programming circles there seems to be rough agreement about the scope of various software projects. An application in general is a piece of software that fulfills a specific function or set of functions, is usually of finite scope, and is generally self-contained. The bulk of all software written prior to about 2000 clearly consisted of applications, even though occasionally there were bundles of applications that were intended to broaden the domain of applicability. Microsoft Word, Adobe Photoshop, many games and most databases clearly fell into that particular category.

A framework, on the other hand, consisted of sets of libraries that enabled the construction of applications. Most such frameworks tended to reside on top of core languages, to the extent that it was frequently difficult to tell where the language itself ended and the framework began, but in general frameworks provided the building blocks to accomplish a specific set of tasks more readily with the code at hand.

Most frameworks were essentially helper functions that made it possible to build certain types of applications quickly, but by dint of this they often tended to be used by harried developers to limit the potential of the language rather than expand upon it. This meant that frameworks have tended to be seen in a somewhat negative light by many developers and project managers: while they automate certain things, they also make developing applications outside of these frameworks more complex, because there is an integration cost involved between the framework-enhanced language and the unenhanced version. Such applications tend to become a lot like older British mansions - courtyards from the 13th century, turrets from the 16th, one Georgian Wing from the 18th century, a Victorian Wing from the 19th, with 21st century solar panels on the 18th century gabled roofs.

A platform, on the other hand, takes a more global approach, one that is more difficult to write in the beginning but pays bigger dividends down the road. The idea behind a platform is simple - design the project from the outset to be extensible, and for that extensibility to then feed into the development of common libraries that work cohesively together. Most operating systems are platforms, because they have to provide the underlying capabilities necessary to do everything else. Facebook is trending towards being a platform because within its ecosystem there are very definite hooks for integration and because it is possible to build more complex pieces on simpler pieces that nonetheless fall outside of Facebook's own development purview. Yegge's argument was that Google itself should have been a platform but wasn't; it was in fact a number of separate APIs that were not cohesively integrated into a more common framework, and as a consequence there was an unfortunately high degree of impedance between Google libraries (Android gets around this somewhat with the incorporation of generic components, but the point holds even there). Microsoft, for all its faults otherwise, is a master of the platform - there is a high degree of uniformity regarding naming conventions, interface structures, expectations of shared library use and so forth. This has arguably led to the observation that developers outside of Microsoft itself are not as skilled in programming, but the flip side to this is that they don't need to be, which can provide a huge boost when it comes to finding developers for your projects.

So, given that, is it fair to say that MarkLogic is a platform? I personally do not believe that it is, but I also believe that with a comparatively small amount of work and investment it could be. Now, I personally love working with MarkLogic - it's an incredibly flexible application and when deployed properly can run rings around Java, Ruby and other more traditional web languages. However, as a platform it is missing a few key pieces, and without those pieces it is limiting its potential severely.

Forms Interfaces

First and foremost, MarkLogic absolutely needs a forms interface. The importance of forms tends to be overlooked by people who see MarkLogic as being a database, because the underlying assumption is that in general it's the client's responsibility to generate the forms that they need, and they can always write such forms in XQuery scripts. This is true, but it also misses a critical point. Forms are the windows into your database. Especially given the presence of MarkLogic AppServers, forms provide a way of not only seeing, but also creating and modifying the content in that database, whether that content be legal documents, books or similar composite media, insurance data, financial information, blogs, car parts or anything else.

As it is right now, most developers see MarkLogic as being an analytics tool, because the path to getting content in always goes through a pre-existing ingestion pipeline rather than through direct editing over the web by a user. Even though forms are possible, most ML applications that I've seen have tended towards dashboards, simply because the cost to the developer of adding that forms layer is too high.

The exact mechanism for that forms API could be any number of things - XForms immediately comes to mind (and it's a technology that is well known to many within MarkLogic) but to make XForms happen MarkLogic either has to incorporate an existing engine such as XSLTForms or Orbeon or has to roll its own. My personal recommendation would be that it might be well worth it to MarkLogic to spend some money, get rights to modify the base code of one of these engines, then use it as the foundation for an internally supported version that can be tested and tuned to MarkLogic's specific needs. Writing libraries to make the generation of XForms easier may also help, as would the ability for third party suppliers to build on the core libraries in order to build custom modules - from treeview controls to visualization tools to rich editors.
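To make the idea of a form-generation library a little more concrete, here is a minimal sketch in Python (not MarkLogic code, and not the API of any existing engine). It assumes a sample instance document with no namespaces and no repeated elements, and it emits one XForms input control per leaf element. The function name xforms_inputs and the label heuristic are my own inventions for illustration.

```python
import xml.etree.ElementTree as ET

# The XForms namespace; registering the prefix keeps the output readable.
XF = "http://www.w3.org/2002/xforms"
ET.register_namespace("xf", XF)


def xforms_inputs(instance_xml):
    """Build one xf:input per leaf element of a sample instance document.

    Hypothetical helper: assumes no namespaces and no repeated siblings.
    """
    root = ET.fromstring(instance_xml)
    group = ET.Element(f"{{{XF}}}group")

    def walk(element, path):
        children = list(element)
        if not children:
            # Leaf element: bind an input control to its relative path.
            control = ET.SubElement(group, f"{{{XF}}}input", {"ref": path})
            label = ET.SubElement(control, f"{{{XF}}}label")
            label.text = element.tag.replace("-", " ").title()
            return
        for child in children:
            walk(child, f"{path}/{child.tag}" if path else child.tag)

    walk(root, "")
    return ET.tostring(group, encoding="unicode")


# Usage: the refs are relative to the root element of the instance.
sample = "<post><title>Hello</title><author><name>Kurt</name></author></post>"
print(xforms_inputs(sample))
```

A real library would also have to deal with repeats, data types and validation, which is exactly the harder conceptual work that the XForms engines have already ironed out.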

It need not be XForms either, though that's probably a good place to start - a lot of the harder conceptual work has already been ironed out. The real issue is that MarkLogic needs to have a core platform on which to build, modularize and package user interfaces to make it easier for users to interact with the database itself. Given the recent work done on the JSON side, creating AJAX enabled applications that bind to these libraries opens up similar possibilities for JavaScript components. The key here, though, is that MarkLogic needs an interface framework that lets users and developers produce their own editors. Perhaps one solution is to license now and build out from that (perhaps hiring the licensees in the process), or license now while building an internal forms solution that's more integrated with MarkLogic capabilities.

With these, MarkLogic becomes a fully round-tripped engine with a powerful web enabled data system on the back end, making it much more attractive to customers who are looking to manage their data, not just analyse and search it. Without UI editor capabilities, MarkLogic will remain JUST a database.

Packaging

Along these same lines, packaging is going to become increasingly important. Right now, deploying applications within MarkLogic is too hard. Not writing them - deploying them. It should be possible to create for MarkLogic an installer capability that will let a developer package up an application, library or similar component into a single zip file, then for someone else to open up a package manager within MarkLogic, click a button to load that zipped file, see a configuration screen, and then have the application fully functional with no other work involved.
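As a rough illustration of that workflow, the sketch below (in Python, purely for exposition) bundles a set of modules and a manifest into a single zip, then "installs" it by validating the manifest and copying the modules into place. Everything here is an assumption on my part: the manifest.json layout, the /apps/ prefix, and the plain dictionary standing in for a modules database. None of it is an existing MarkLogic API.

```python
import io
import json
import zipfile


def build_package(name, version, modules):
    """Bundle module sources and a manifest into one in-memory zip archive.

    modules maps a path inside the package to the module's source text.
    """
    manifest = {"name": name, "version": version, "modules": sorted(modules)}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.json", json.dumps(manifest))
        for path, source in modules.items():
            archive.writestr(path, source)
    return buffer.getvalue()


def install_package(package_bytes, modules_db):
    """Validate the manifest, then copy every listed module into modules_db.

    modules_db is a plain dict standing in for a server's modules database.
    Nothing is written if the archive lacks a module the manifest lists.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        names = set(archive.namelist())
        missing = [m for m in manifest["modules"] if m not in names]
        if missing:
            raise ValueError(f"package is missing modules: {missing}")
        root = f"/apps/{manifest['name']}/"
        for module in manifest["modules"]:
            modules_db[root + module] = archive.read(module).decode("utf-8")
    return manifest


# Usage: one call to package, one call to install.
package = build_package("blog", "1.0", {"lib/posts.xqy": "xquery version '1.0';"})
database = {}
installed = install_package(package, database)
print(installed["name"], sorted(database))
```

The point of the manifest is that the installer, not the person installing, knows what belongs where. A configuration screen would simply be a user interface over whatever settings the manifest declares.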

Information Studio is a good start on this, but it is still too data centric rather than application centric. For MarkLogic to really take off as a platform, applications need to be Apps - download the application from a zipped file on the web directly into the server, where it then expands into the application in the environment. This componentization of functionality is absolutely essential, because it is becoming the dominant model that people are expecting in their own interactions with software, and each step that's involved in manually installing an application becomes just one more bullet point that a competitor can use to sell their own products.

So what of security? Isn't the ability to willy nilly allow such applications to be downloaded from the web an open invitation to data corruption? Security is ultimately a matter of trust. MarkLogic has a superb trust environment, one of the best in the industry, and there is no reason why the importation of packaged applications should not be done by trusted individuals (those with the proper security credentials). Additionally, the fact that MarkLogic can effectively partition different forests and databases means that applications can (and in general should) run within their own data environments, possibly with restrictions concerning what API commands are available to an application running within the server.

However, a bigger part of such a trust network is in creating a trusted community. App stores abound, and part of the attraction of such app stores is that an application can generally be included only once the code itself has been vetted and countersigned by the community manager. This also means that it's possible to segregate safe code from experimental code, so that people can work with the experimental code only after agreeing to waive indemnity rights.

Does such an app-model packaging mode make a difference? In my experience, absolutely. One of the biggest gating factors in working with an application is the possibility, for most people, that they will in fact actually need to get under the hood to configure things - and this fear is perhaps even stronger for something like MarkLogic, because even for established developers XQuery is a novelty, and the sheer size and complexity of the existing APIs can be daunting. I also watched recently when Drupal made the transition from a manual component integration model (which required about eight steps and root level access to the server) to an app model for its modules (with Drupal 7), and use of components skyrocketed, because people no longer had to be Drupal "experts" in order to add functionality. Curiously, security has not been seriously affected either, except possibly for the better - without the need to get into the server to update things, the chances of an accidental error wiping out other modules or similar components dropped pretty dramatically.

Reflective Documentation

Another area that would provide a major return on investment is the development of a standardized documentation format along the lines of javadoc. Right now there's no clean way of documenting user defined XQuery functions short of doing it externally in some other medium. It seems like introducing a couple of new XQuery command extensions of the form

declare documentation myNS:foo(){
    <documentation>
         <title>Foo Function</title>
         <para>This is documentation for the foo function in the myNS namespace. It creates foos.</para> 
    </documentation>
};

and

declare module-documentation myNS (){
    <documentation>
         <title>The myNS Module</title>
         <para>This is documentation for the myNS module</para>  
    </documentation>
};

would go a long way towards ameliorating that, as it would allow for an "xdmp:generate-documentation()" function to sweep through modules in a system and create documentation for that module based upon reflection of the parameters, incorporating rich text descriptions. This way, when new functionality IS added in by a third party tool, the API for that becomes automatically exposed. This will not only encourage the use of XQuery as a primary way of extending the library of available XQuery capabilities, but it also keeps such documentation up to date without the (often tedious) process of rebuilding such documentation by hand.
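To show what such a sweep might look like, here is a small sketch in Python rather than XQuery, since the declare documentation syntax above is only a proposal and no engine parses it today. It scans a module's source text for the proposed blocks and pulls out the title and paragraphs. The function name generate_documentation and the regular expression are my own assumptions, and the sketch omits the reflection over parameters that a real implementation would add.

```python
import re
import xml.etree.ElementTree as ET

# Matches the proposed (hypothetical) syntax:
#   declare documentation myNS:foo(){ <documentation>...</documentation> };
DOC_BLOCK = re.compile(
    r"declare\s+documentation\s+([\w.-]+:[\w.-]+)\s*\(\s*\)\s*\{"
    r"\s*(<documentation>.*?</documentation>)\s*\}\s*;",
    re.DOTALL,
)


def generate_documentation(module_source):
    """Return {function name: {"title": ..., "paras": [...]}} for one module."""
    docs = {}
    for name, xml_text in DOC_BLOCK.findall(module_source):
        root = ET.fromstring(xml_text)
        docs[name] = {
            "title": (root.findtext("title") or "").strip(),
            "paras": [(p.text or "").strip() for p in root.findall("para")],
        }
    return docs


# Usage: feed it the text of a module that uses the proposed blocks.
module = """
declare documentation myNS:foo(){
    <documentation>
         <title>Foo Function</title>
         <para>It creates foos.</para>
    </documentation>
};
"""
print(generate_documentation(module))
```

Because the documentation lives in the module itself as XML, it could just as easily be indexed and searched as any other content in the database.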

Community Model and Vertical Integration

MarkLogic took a BIG step forward recently with the announcement that MarkLogic 5.0 (which has just recently been released) would be available in a free Express Edition. This license makes available a single (non-clustered) MarkLogic instance with a 40 GB upper limit on database size, the full CPF libraries for doing PDF and Microsoft Office document processing, support for Hadoop and other languages and so forth, for use by an individual or organization. This means that for organizations whose data needs are comparatively modest (and 40GB of space makes for a very generous interpretation of modest) this provides precisely the kind of database that makes community development possible.

Some of the terms of the license (such as the one-developer only restriction) tend to be in the wink-wink nudge-nudge category - MarkLogic's intent is to get the database in front of people who would not otherwise use it because of cost, and any sales that this may cannibalize now will likely be more than made up by future sales as organizations grow into needs for large data stores. A tiered licensing model is the next likely stage of evolution in this model, a strategy that makes sense as MarkLogic evolves from a specialty database into a full-blown platform. However, this stage needs to be coordinated with their community development. Packaging will make a big difference there, but so too will the ability to create a common community app-store, with some apps (or libraries) available for free and others available for a fee, but in all cases drawn jointly from MarkLogic and third party developers.

Does this cannibalize professional consulting services and similar channels? To a limited extent it may, but in general this will be made up by the need for customization services. For instance, consider the open source Moodle package, which is used extensively in the development and deployment of education services, and which would be ideal for integration with MarkLogic. A core ML-Moodle package could be created that is freely available (perhaps under an Apache license), which would provide the same functionality as the version that currently uses MySQL, perhaps with some low level additions such as reporting and faceted search. This could readily be used as is by school districts and universities, but also opens up the possibility for customization that could be accomplished by MarkLogic CPS or one of its channel partners, such as Avalon Consulting. Such consulting might not be as large scale as deploying ML-Moodle in the first place, but there would almost certainly be more than one such follow-on engagement by different educational institutions. The same thing holds for government contracts, where deploying NIEM management tools, XBRL libraries and the like will lead into more sophisticated and specialized consulting opportunities. It also enables incremental upgrades - you buy the processing power that you need when you need it, rather than front-loading the contracts and gating your clientele only to very large and deep-pocketed customers.

Integrated Strategies

I've been very encouraged by what I've seen of MarkLogic CEO Ken Bado in his goal of getting MarkLogic competitive in the Big Data space, but I would warn against becoming too fixated upon that goal at the expense of reducing the overall efficacy of MarkLogic Server as a platform. Both need to happen. Big Data (and I'm thinking specifically Hadoop here) is a somewhat restrictive play at the moment because in general those areas where Hadoop shines are either those areas that MarkLogic can also do without Hadoop support (entity enrichment of raw text content, processing of XML or JSON feeds from external "firehose" sources such as Twitter, or similar extractive works) or are areas that MarkLogic doesn't do well (image processing, signal processing, really any case where the need to index doesn't come into play). Meanwhile, it is very likely that over time other languages (Java, Ruby, etc.) will either incorporate Hadoop-like processing into their own respective frameworks or will build out better distributed parallel processing mechanisms, which is what Hadoop ultimately is about. Thus, while I think that Hadoop or its successors are areas that MarkLogic should continue to invest in, I wouldn't bet the (server) farm on it.

With respect to other areas of Big Data, let MarkLogic be MarkLogic. My suspicion is that those customers who are getting into the "deep Big Data" space - exabyte and zettabyte range - probably don't give a damn about Hadoop except as an ingestion process, but they do want the ingested data to be fully indexable and accessible, something that MarkLogic is competitive with Oracle on and probably outperforms IBM on. MarkLogic doesn't need to be a Hadoop me-too player, but it does need to be able to ingest data more effectively. This is an imperative - ingestion is increasingly becoming the bottleneck facing most large scale data projects, and to me MarkLogic should be looking at Hadoop as a wake-up call to that broader need - the need to reliably and quickly process incoming (non-structured non-XML) data from external repositories, from data streams, and even from images. Adding an OCR package to MarkLogic would do wonders, for instance.

Finally, don't lose sight of the Semantic Web. Semantics are becoming increasingly important, ironically because of the rise of mobile computing devices. Smartphones and tablets are different from laptops in that they have a much more complete understanding of context - what they are, where they are, who they belong to, when they are. They provide a vehicle for storing user preferences and patterns, making them ideal loci in a semantic network. The information contained in a MarkLogic database (and more specifically the relationships between key pieces of information) can readily be stored as RDF triples within MarkLogic, a project that your teams have been working on for some time. That means that, with the right bridge libraries, it becomes possible to marry such factors as triggers, app-server services, RDF and mobile networks into a cohesive environment for establishing event driven customizations - targeted news content, environment aware notifications, diagnostic systems for yourself, your family or your car, proximity driven alerts and so forth. RDF is a remarkable technology for connecting services without the need for formal APIs (or programmers in the middle) and as such it could play a fundamental role in establishing a cohesive platform play for MarkLogic without the need of revamping the existing API structure, which at this stage would probably be counterproductive.
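For readers who haven't worked with triples, the following toy sketch in Python shows the basic shape of the idea: relationships are stored as subject, predicate, object statements, and a "targeted news" lookup is just two pattern matches chained together. The ex: vocabulary, the sample data and both function names are made up for illustration, and this says nothing about how MarkLogic stores or indexes triples internally.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def articles_for_device(triples, device):
    """Find articles about the place where the device currently is."""
    articles = []
    for _, _, place in match(triples, subject=device, predicate="ex:locatedIn"):
        for article, _, _ in match(triples, predicate="ex:about", obj=place):
            articles.append(article)
    return sorted(articles)


# Hypothetical data: a phone, its owner, its location, and two articles.
TRIPLES = [
    ("ex:phone1", "ex:ownedBy", "ex:alice"),
    ("ex:phone1", "ex:locatedIn", "ex:seattle"),
    ("ex:article7", "ex:about", "ex:seattle"),
    ("ex:article9", "ex:about", "ex:boston"),
]

print(articles_for_device(TRIPLES, "ex:phone1"))
```

Notice that neither the device nor the article had to know about the other ahead of time. The connection falls out of the shared reference to a place, which is what makes this approach attractive for linking services without a formal API in the middle.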

A MarkLogic Platform

I realize that resources are always at a premium, especially in this day and age (the Great Recession has now become ingrained in popular vernacular), but at least from my perspective looking in from the outside, the almost-but-not-quite platform status of MarkLogic is hampering its adoption in places where it would otherwise be a great fit. The existing framework is sound and reasonably comprehensive beyond those areas that I mentioned here, and the move towards making the system more integrated with other data formats gives it a real edge over other, more specialized systems. Moreover, the culture that MarkLogic has, that balances out a drive towards innovation with real attention to detail (and genuine fun), makes it a very attractive and sympathetic company - I wouldn't mind working there, and I don't say that of most companies - and also makes the final step towards a cohesive platform one that is very much in accord with the company culture.

However, in the end, a technology company lives or dies by the adoption rate of its core technologies in the broader community, and to make MarkLogic more appealing to the typical developer it needs to lower the barrier to producing content further than its competitors in the space do. To do that, MarkLogic has to start looking at its server as a platform play, with user-definable editors (and core components to support same), with packaging and installations, with an existing body of easy to install community resources that can be deployed in an app fashion, and with a way of dealing with a context aware mode, all of which reduce the reliance on specialized programmers for mundane functions. Without such a platform, it becomes much harder to attract the wide pool of developers necessary to push adoption, which in turn reduces the number of success stories that the company overall can use to sell its higher end offerings to customers.

MarkLogic is great software, but it can be better. MarkLogic as platform is one way I think it can be improved, and ultimately I think it will reap benefits for MarkLogic the company as well.
