
DaaS, XQuery and the NoSQL Movement

Kurt Cagle

It's a lazy afternoon a couple of days before Christmas, I'm copying several tens of thousands of files from one (xml) database to another, the sun's shining, and it seems like a good time to set to print (or at least electrons) a few thoughts I've been having lately on data. People's outlook about data to a very great extent is colored by the tools that they use to collect, process and display that data, to the extent that, for many years now, the data landscape has been described almost invariably in terms of SQL-related technologies.

This of course makes perfect sense - SQL was one of the most successful standards ever to emerge, about on par with HTTP, and in the 1980s it transformed an industry that had until then relied on highly proprietary and individualistic data query and processing tools. In theory, someone who knew how to write SQL could jump from an Oracle database to an IBM database to a Sybase DB and so forth. In practice, there were usually enough proprietary extensions and variations from the SQL specifications, due to fairly significant holes in those same specs, that in general you still had to be a specialist in a certain company's flavor of SQL. But as most transactions ultimately were written around this database as the core, from the programmer's standpoint this really didn't make that much difference. It also meant that the Database Administrator, the DBA, was considered only one step below the IT manager, and in many cases had more clout in just about any organization.

This in turn had a peculiar effect upon the way that we built software. Everything originated from the database. When you built applications, those applications ultimately had to reflect less the needs of the customer or even the skills of the programmer than they did the underlying structure of the database and the possible permutations that structure could take. This ultimately meant that data models had to be set up in the database, and once set up, were effectively frozen, regardless of how far off the mark those models ultimately ended up being.

What's more, that database created what amounted to a gravitational center for applications - anytime you needed to add new functionality that had even minor dependency upon some facet of the data within the database, the whole application would need to be tethered to that database. Multiple connections were certainly possible in later generations of these database-centric applications, but the more connections you maintained, the more complex the application, and the more tightly coupled the systems became to one another. The net effect of this was siloization, in which databases and their corresponding applications became islands that were effectively isolated from one another.

The Trials and Tribulations of EAI

In the late 1990s, a new IT discipline emerged - Enterprise Application Integration, or EAI. The idea behind this seemed logical at the time - you could create specialized interfaces that would be exposed by various applications over the Internet (or at least controlled intranets) that would make it possible to "plug in" to existing applications - in essence, such services would effectively emulate the normal component interfaces (or a subset of same), passing specialized messages in an XML dialect called SOAP between producers and consumers of these objects. The role of SOAP was to act as a proxy, marshaling both the request from an existing object in an XML representation of the data types and the response from the server's object. Both client and server saw the XML only as a temporary intermediary, which, once each end point received this message, would be reconverted back into a binary object.

This discipline was initially known as web services; then, as web services began to gain a bad reputation for fragility, bottlenecks, and increasing complexity, it was rebranded alternatively as EAI or Services Oriented Architecture, or SOA. The original proponents of SOA, as they began to see the domain where this architecture did work (the division to division space within a protected firewall) and where it didn't (the world outside of this domain), began arguing that the better solution was to shift application development to messaging queues and processing, taking away the immediate request/response latency, but at a cost of requiring multiple channels to keep track of the interactions. What's worse, SOA generally worked best with always connected networks - but the unreliable disconnected Internet was far less friendly towards SOA.

While there are even now case studies that showcase where SOA-type architectures have been very effective, almost all of them have worked exclusively in that always-connected environment within controlled firewalls; the complexities involved in extending them out into the wilder world of the Internet frequently made them both performance and security nightmares in the latter realm.

A second problem that occurred with SOA was that the mindset itself hadn't changed much; such SOA enabled integration points generally worked to scale applications up so that they could shift from being applicable to a few dozen people to perhaps a few thousand people within an organization, but seemed to hit the ceiling when it came to integrating outside of the immediate domain. The principal issue was that if every database constellated system created its own set of web services with its own APIs, then it behooved partners and users to write code that integrated with your API. Of course, once this was done, it created a sticky link between the two organizations. If company A had their API set A.foo(), A.bar() and A.bat(), and company B had their API B.alpha(), B.beta() and B.gamma(), then if company C wanted to do business with both of them, it had to write code that would conditionally connect to each API in a unique fashion. In effect, you get combinatorial explosion, where operations as simple as requesting information can require a database of its own just to store how each of these access forms is used.

As both applications and businesses become increasingly interconnected, this is forcing a major change in thinking about our relationship to the data, one that is likely to only become more prominent over the next few years.

Data Integration and RESTful Services

Thinking about the role of data began to shift around 2000, with Roy Fielding's PhD dissertation on Representational State Transfer, or REST. In that paper he was the first person to cohesively lay out a radical concept - one way of thinking about the web was as a database, and the HTTP verbs that were associated with the Internet corresponded closely with the fundamental operations of a database - the Create/Read/Update/Delete verbs collectively known as CRUD. At the time, however, the very document-centric nature of the web generally made it difficult for people to see how this observation applied to the more traditional role of databases.

That changed somewhat with the rising importance of syndicated feeds, especially as blogging began to gain prominence in the 2003-2005 time frame. Blogs represented an interesting transitional stage - while they were clearly documents, these documents also required the use of metadata content that was originally stored within databases as additional properties. The fascinating aspect about blogs is that they thus typify a different approach to thinking about data - one where the primary operations are publishing oriented (get, post, put, delete) but that also effectively represent the transmission of sets of records - recordsets - as an analog for collections of related document content.

Put another way, a blog is simultaneously a set of documents and a set of records - it spans the border between data and document quite nicely. Most blog APIs, no matter how convoluted they start out, have this rather intriguing tendency over time to simplify to RESTful forms in which the fundamental operations on the dataset are the ones that stress this CRUD layer - independently of the semantics of the document format itself. This is an important, perhaps critical point. CRUD-based XML data apps have very different semantic expectations between the publishing mechanism and the content stored in each "row" or "record". They are fundamentally decoupled.

This means that regardless of whether the content of each member of that collection is a web page, an SVG graphic, an RSS feed entry, or even a PDF document, the publishing mechanism is the same. This semantically neutral "API" becomes very simple in principle, though in practice it does require more than straightforward CRUD operations ... though not much more. By making the API neutral, the contract that exists between producer and consumer of that data also becomes simple, regardless of whether that data is a blog post talking about data as a service or several 100MB XBRL documents providing the financial statistics for a multi-national company. And this, ultimately, is what Data as a Service (DaaS) is all about.

Put in simple terms, Data as a Service can be thought of as a way to uniformly provide access to data in a datastore - regardless of the underlying storage mechanism of that data. With DaaS, the data can be stored as XML content, as SQL tables, as name/value pairs, as LDAP assertions, as syndicated JSON feeds, or as files on a file system - or in any other form imaginable. The data may be explicitly defined in a 1 to 1 model, or may be derived through some transformation of that initial data (that is to say, different representations of the same underlying data model), but the important thing to keep in mind here is that in either case the data is accessed in a CRUD type manner via some form of URI point request.
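
As a rough illustration of that uniformity, here is a minimal Python sketch in which the same four verbs are served by two very different storage mechanisms. The class names `MemoryStore` and `FileStore` and the `handle` function are hypothetical, invented for this example only; a real service would sit behind an HTTP server rather than a function call.

```python
import os
import tempfile


class MemoryStore:
    """Hypothetical name/value backend held in a dict."""

    def __init__(self):
        self._items = {}

    def create(self, key, data):
        self._items[key] = data

    def read(self, key):
        return self._items[key]

    def update(self, key, data):
        self._items[key] = data

    def delete(self, key):
        del self._items[key]


class FileStore:
    """Hypothetical backend that keeps each resource as a file."""

    def __init__(self, directory):
        self._dir = directory

    def _path(self, key):
        return os.path.join(self._dir, key)

    def create(self, key, data):
        with open(self._path(key), "w", encoding="utf-8") as f:
            f.write(data)

    update = create  # overwriting the file is the update

    def read(self, key):
        with open(self._path(key), encoding="utf-8") as f:
            return f.read()

    def delete(self, key):
        os.remove(self._path(key))


def handle(store, method, key, data=None):
    """Map an HTTP-style verb onto the store's CRUD operation."""
    if method == "POST":
        store.create(key, data)
        return data
    if method == "GET":
        return store.read(key)
    if method == "PUT":
        store.update(key, data)
        return data
    if method == "DELETE":
        store.delete(key)
        return None
    raise ValueError(f"unsupported method: {method}")
```

The caller of `handle` never learns which backend is in use, which is the point of the paragraph above.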

So how does this differ from Services Oriented Architectures and EAIs? The difference is simultaneously subtle and profound. The reality is that such DaaS systems are in fact forms of services; those services have specific endpoints (URLs), have a set of minimal APIs (the CRUD operations described in the RESTful verbs) and involve the transmission of content via some form of standardized messaging architecture (XML or JSON, typically). However, those services do not have any formal semantics beyond the CRUD ones; the semantics are solely associated with the URL endpoints themselves, and are effectively variations of serving either individual entries or collections consisting of zero or more links to entries of a given semantic type.

This is not a new concept. Every time you perform a Google search you are creating a collection of links, one to an entry, which, when followed (clicked), takes you to the corresponding resource. Go to Flickr and the same paradigm presents itself - a collection of links, each with a metadata "view" that shows the image in miniature and that pulls up the corresponding entry when you click on the link. YouTube is the same - search for a collection of links with embedded metadata (in this case a thumbnail of the video) to bring up the full entry. Of course, those entries can themselves contain related links presented in the same manner, which is why this particular paradigm works so well. RSS works the same way of course, but so too does Twitter, save that there is no necessary requirement that the metadata entry has an associated resource link (the structure is a little looser, but it's there). Resources (which are objects of a similar conceptual or semantic type) and collections of links to those resources form the basis of the web; the only difference between these and Data as a Service is that the resource in question is much more likely to be an XML or JSON entity where the responsibility for the presentation of that resource falls to the consumer, rather than the producer, of the content.

This novel idea is actually the basis for AJAX. Most AJAX calls are RESTful calls to services to retrieve raw data - the responsibility (and privilege) of rendering that data then devolves upon components (either explicitly declared via frameworks or implicitly designed in web pages) on the client. This means ultimately that the server no longer needs to know the intent of the client when delivering the content. If the server no longer needs to be aware of intent, then its interfaces can be simplified dramatically to those involved in providing the requested data. There are still some operations that are critical that need to be on the server; with XML databases and XQuery, for instance, the retrieval of collections of entity links is usually shaped by a sequence that runs:

  1. determine the collection to be queried
  2. run a query against that collection, taking a query string that provides the requisite parameters for that query.
  3. sort the query results based upon a given sort criterion, either user or server provided
  4. partition the resulting collection into discrete sets and return a page of that partition as appropriate (necessary once collections get too big to deliver all at once)
  5. pre-process the resulting entries in order to hide content inappropriate for a given user, expand internal representations of specific information (e.g., macro expansion) or perform other forms of processing that don't ultimately change the schema
  6. create a presentation on the resulting collection, such as an XML document, a JSON feed, an RSS or Atom feed, a "search page", a table of entries, a "gallery" of image links, or whatever else best identifies that data to the resulting user.

Retrieving individual entries is usually a matter of following steps #1, #5 and #6, since the query, sort and partitioning are irrelevant once you have a resource identified, but you still may need to expand-process it and style it. As a good example of #5, by the way, consider the preprocessing that a wiki does with primary body content, expanding out shortcut notation into HTML notation prior to performing the final presentation layer. Note also that there is usually a qualitative difference between an entry and a collection that contains only one item - simply because there is only one answer to a given query does not imply that it is the answer I'm looking for.
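
The six steps, and the shorter path for a single entry, can be sketched in Python roughly as follows. This is an illustration only: the article's own context is XQuery against an XML database, whereas this sketch uses an in-memory list, and every name in it (`STORE`, `Entry`, `get_collection`, `get_entry`, the `[[...]]` shortcut notation) is a hypothetical stand-in.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    id: str
    title: str
    body: str
    restricted: bool = False


# Hypothetical in-memory "database": collection name -> entries.
STORE = {
    "posts": [
        Entry("p1", "DaaS basics", "Intro to [[REST]]"),
        Entry("p2", "Atom feeds", "Feeds and [[CRUD]]"),
        Entry("p3", "Internal notes", "Secret", restricted=True),
        Entry("p4", "CRUD on the web", "More on [[CRUD]]"),
    ]
}


def preprocess(entry, user_is_staff):
    """Step 5: hide restricted entries, expand [[x]] shortcut notation."""
    if entry.restricted and not user_is_staff:
        return None
    body = entry.body.replace("[[", "<a>").replace("]]", "</a>")
    return Entry(entry.id, entry.title, body, entry.restricted)


def present(entries):
    """Step 6: one possible presentation, a list of metadata links."""
    return [{"href": f"/posts/{e.id}", "title": e.title} for e in entries]


def get_collection(name, q="", sort_key="title", page=1, page_size=2,
                   user_is_staff=False):
    collection = STORE[name]                                   # step 1
    hits = [e for e in collection if q.lower() in e.title.lower()]  # step 2
    hits.sort(key=lambda e: getattr(e, sort_key))              # step 3
    start = (page - 1) * page_size
    page_items = hits[start:start + page_size]                 # step 4
    processed = (preprocess(e, user_is_staff) for e in page_items)
    visible = [e for e in processed if e is not None]          # step 5
    return present(visible)                                    # step 6


def get_entry(name, entry_id, user_is_staff=False):
    """Single entry: only steps 1, 5 and 6 apply."""
    for entry in STORE[name]:
        if entry.id == entry_id:
            processed = preprocess(entry, user_is_staff)
            if processed is None:
                return None
            return {"id": processed.id, "title": processed.title,
                    "body": processed.body}
    return None
```

Because the sketch follows the order given above (partition first, then pre-process), a page can come back shorter than `page_size` when entries are hidden from the requesting user.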

Most web applications, when you get right down to it, could be recast relatively easily into this paradigm - an editor for a given resource is simply another presentation of that data, this time in a format that allows for the creation of that same data on the client as a JSON or XML model that can then be POSTed or PUT to the server as appropriate. Moreover, if you look at these operations in that light, then ultimately whether this content is delivered as XML or as JSON is a matter of what kind of input presentation interfaces exist on the server - there's nothing that says that a given URL can't have a PUT or POST service that accepts JSON and translates it back into XML in the back-end data abstraction, so long as these operations are kept relatively context free.
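
A minimal sketch of that JSON-in, XML-behind translation, using only the Python standard library. It assumes a flat JSON object whose keys are valid XML element names; the function name `json_to_xml` and the default `entry` root tag are choices made for this example, not part of any particular product.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(payload: str, root_tag: str = "entry") -> str:
    """Translate a flat JSON object into an XML fragment.

    Assumption: keys are valid XML names and values are scalars.
    """
    data = json.loads(payload)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Nested objects, arrays and attributes would need a mapping convention agreed between producer and consumer, which is exactly the kind of decision that stays on the server side of the URL.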

There is another caveat here that's worth mentioning. Just as GET operations typically have very predictable pipelines, so do POST and PUT operations, and there is also nothing that says that a service can't include appropriate hooks that other processes on the server can subscribe to in order to perform some operation. For instance, if a typical POST pipeline looks like:

  1. Retrieve the input data
  2. Convert the data into an appropriate internal format
  3. Verify the integrity of that data
  4. If no ID exists for that resource, create one.
  5. Save the resource into the data store

then there is no reason, other than potential system performance issues, that a service can't have a hook between step #4 and step #5, or after step #5, that can then invoke subscribing handlers (this is the basis for webhooks, in fact). Thus, once a resource is saved, a hook could fire that sends an email to a given recipient containing notification of the addition, adds a tweet, updates a syndication cache and so forth. This is why I usually see the PUT/POST pipeline as being a little more complex:

  1. Retrieve the input data
  2. Convert the data into an appropriate internal format
  3. If #2 fails then call an error hook
  4. Verify the integrity of that data
  5. If #4 fails then call a validation hook
  6. If no ID exists for that resource, create one
  7. Call a pre-storage hook
  8. Save the resource into the data store
  9. Call a post-storage hook
  10. Return a status message to the client
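
The ten steps above can be sketched in Python along these lines. The hook names, the in-memory `DATA_STORE`, the rule that a resource must carry a `title`, and the status codes chosen are all assumptions made for the sake of a runnable example.

```python
import json
import uuid
from collections import defaultdict

HOOKS = defaultdict(list)   # hook name -> subscribed handlers
DATA_STORE = {}             # hypothetical in-memory data store


def subscribe(hook, handler):
    """Register a handler to be called when the named hook fires."""
    HOOKS[hook].append(handler)


def fire(hook, payload):
    for handler in HOOKS[hook]:
        handler(payload)


def post_resource(raw_body):
    # Steps 1-3: retrieve and convert the input, error hook on failure.
    try:
        resource = json.loads(raw_body)
        if not isinstance(resource, dict):
            raise ValueError("expected a JSON object")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        fire("error", {"reason": str(exc)})
        return {"status": 400, "id": None}

    # Steps 4-5: verify integrity (assumed rule: a title is required).
    if "title" not in resource:
        fire("validation", resource)
        return {"status": 422, "id": resource.get("id")}

    # Step 6: create an ID if the resource does not have one.
    resource.setdefault("id", uuid.uuid4().hex)

    fire("pre-storage", resource)            # step 7
    DATA_STORE[resource["id"]] = resource    # step 8
    fire("post-storage", resource)           # step 9
    return {"status": 201, "id": resource["id"]}  # step 10
```

An email notifier, a tweet, or a syndication cache refresh would each be one more `subscribe("post-storage", ...)` call, with no change to the pipeline itself.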

Those who argue for static REST when dealing with data stores fail to understand the need for all of those hook calls, but the reality is that most web applications cannot be purely static in terms of their data processing - they have to deal with the possibility of failure. However, the POST/PUT pipeline as described above allows for a surprisingly broad range of actions, while still keeping true to the spirit of RESTful interactions.

This is where the differences between traditional SOA and RESTful Services/DaaS become important - from the standpoint of the client, regardless of what actually happens in the data store, the operations appear to be CRUD-like. That is to say, the client should not in fact see any differences between a static and dynamic RESTful interface - the illusion of the static interface remains. This does require some modifications to the conventions of older REST, especially with non-idempotent operations. When you POST content to a RESTful interface, the client usually needs to have some mechanism to determine what the ID is for the newly created entity. Similarly, when an error occurs, there needs to be some way of identifying what object that error applies to.

In most cases, what this means is that for database changing operations, what gets returned is usually a set of metadata link entries to either the resulting object or to a message log object of some sort. This particular result operation is perhaps the one last remaining hurdle to change in terms of how operations are performed over the Internet, but there are indications that some standardization is occurring on even this.
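
One way such a result might look is sketched below in Python. The dictionary layout, the `rel` values and both function names are hypothetical; the only parts borrowed from HTTP itself are the 201 Created status and the Location header that conventionally accompany a newly created resource.

```python
def created_response(collection_url, resource_id, title):
    """Build a reply that points the client at the new resource."""
    href = f"{collection_url.rstrip('/')}/{resource_id}"
    return {
        "status": 201,
        "headers": {"Location": href},
        "body": {"links": [{"rel": "self", "href": href, "title": title}]},
    }


def error_response(log_url, log_id, applies_to):
    """Build a reply that links to a message log entry for the failure."""
    return {
        "status": 422,
        "headers": {},
        "body": {"links": [{
            "rel": "log",  # hypothetical relation name
            "href": f"{log_url.rstrip('/')}/{log_id}",
            "applies-to": applies_to,
        }]},
    }
```

Either way the client receives links rather than bare status text, so it can follow up with an ordinary GET.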

DaaS and the NoSQL Movement

The examples given here discussed XQuery as the query mechanism, but, for all its power, XQuery should be seen as simply one such mechanism for performing queries. Another such mechanism is SPARQL, which works on RDF Linked Data stores. Again, in the case of RDF stores the underlying collections may be quite varied, but the resulting operations of the query usually come back to retrieving a linear collection of resources that satisfy a given query property, to be then handed off to some other process. Of course, in this case the data store is no longer located in a single database per se (and indeed this holds true for XML Databases as well, which have the potential to retrieve content from external data repositories) but ultimately the collection identification, query, sort and partitioning still create the abstraction (or illusion) of a unified data collection.

XQuery and SPARQL should be seen as complementary technologies - they work with different types of data (not all RDF is encoded in XML) but they are also ultimately designed to bridge data domains to both create and potentially update data resources within their respective repositories. Indeed, there is some discussion about a next generation language that would make it possible for XQuery to invoke SPARQL queries within its context.

Yet these should also be seen as simply two (albeit important) types of a burgeoning set of post-SQL query mechanisms that emerged after the establishment of the Internet and as such are much more aware of the collection/resource paradigm than SQL was. CouchDB and other name/value data stores represent another type of application, which essentially encodes the concept of a hash table into the very foundation of a data repository. Indeed, RDF triples can be thought of as simply a modified form of hash table that uses two parameters - a subject key and a relationship key - in order to assign a connection to an associated value (the object key or link).
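
That view of a triple as a two-key hash table entry can be made concrete with a few lines of Python. This is a toy model of the idea, not how any real RDF store is implemented; the function names and the sample identifiers are invented for the example.

```python
# (subject, relationship) -> set of objects
triples = {}


def assert_triple(subject, relationship, obj):
    """Record one assertion under its two-part key."""
    triples.setdefault((subject, relationship), set()).add(obj)


def objects(subject, relationship):
    """Look up every object linked from the two-part key."""
    return triples.get((subject, relationship), set())


assert_triple("post:1", "dc:creator", "person:kurt")
assert_triple("post:1", "dc:subject", "topic:daas")
assert_triple("post:1", "dc:subject", "topic:xquery")
```

A set is used for the value because the same subject and relationship may point to several objects, as with the two subjects of `post:1` here.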

Such name/value datasets also make use of the concept of map/reduce in order to make it possible to query distributed datasets. Map/reduce is a significant concept - it essentially provides a mechanism for performing queries in parallel against different data stores, then aggregating the results into a reduced set. This is also something that XML database vendors in particular should be looking at in far greater detail, as the clustering of XML data repositories becomes more and more commonplace.
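
A minimal sketch of the pattern in Python: the same query function is mapped over several data stores in parallel, and the partial results are then reduced to one. The `SHARDS` data and the word-count query are placeholders chosen for the example, and a thread pool stands in for what would really be separate servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each standing in for a separate data store.
SHARDS = [
    ["xml xquery rest", "rest atom"],
    ["sparql rdf", "rest xml"],
]


def map_shard(docs):
    """Map step: count words within a single shard."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-shard counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_counts(shards):
    # Each shard is queried on its own worker thread.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce_counts(partials)
```

The reduce step never needs to see the raw documents, only the already-summarized partial results, which is what lets the approach scale across stores.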

We're moving to a time where you have petabytes of data content (with exabytes beginning to show up) that need to be stored and queried, and there comes a point where indexed searches, regardless of how efficient the indexing process is, become negatively impacted by single large data stores. If a mechanism existed to perform XQuery and SPARQL type queries against distributed data stores (such as employed by map/reduce) then you get the best of both worlds - efficient indexing of content along with the ability to query against large datasets in parallel, with the results then syndicated via common storage mechanisms and aggregated first to the subordinate calling data servers then to the initiating server. Such a formalized mechanism doesn't yet exist, but I feel confident that it will, soon - the demand is there, and the technology to do it in an ad hoc manner already exists - it's just a matter of formalizing and standardizing it.

What all of these technologies share is the fact that they are moving beyond the relational database model for specific types of information. This is not a blanket condemnation of relational databases - they have been around since the late 1970s, they are highly efficient for many types of data, and they can provide a sufficient level of flexibility for many traditional applications. However, because of the effective lack of a consistent serialization and parsing mechanism, relational databases in and of themselves tend to promote siloization. What I expect to see out of all of this is the emergence of precisely that layer within the relational data core, such that it too can interoperate both with other heterogeneous relational databases and with the growing "NoSQL" movement of non-SQL-based data repositories. This will happen as RESTful services begin to become one of the accepted mechanisms for communicating with "traditional CRUD" databases as well as the NoSQL ones.

Implications of DaaS

Given that Data as a Service seems to be sitting on the horizon, it's worth considering some of the ramifications that widespread adoption will end up having. One of the most immediately obvious is that it will continue to shape web application development into a mode where process-oriented development becomes replaced by resource-oriented development. In an e-commerce site, for instance, the process oriented approach is often to accept an incoming invoice request from a client, process the transaction within the same process, and then return a message indicating that the transaction either succeeded or failed. The resource oriented approach is more complex - a request gets POSTed to a services URL as a message.

The message (if validated) gets added to a message queue of unprocessed messages (or the model element contains the requisite "unprocessed" flag as a field), and an asynchronous processing hook gets called. The hook in turn performs the credit card validation, the message is posted either to a "processed" or "failed" queue, and the appropriate asynchronous processing hook gets called to either send the invoice to the respective fulfillment agent or to send an error message to a message queue. The client can then in turn monitor both the processed and the error queues, and can then set up the appropriate notifications as necessary.

Such message queues have become an integral part of most larger scale application architectures. Even though they may be conceptually a little more difficult to understand, their central advantage is that they make asynchronous processing possible. However, in a purely resource based environment those message queues are typically the results of queries on existing supersets (the sets of all invoices), filtering in only those invoices that are in a specific state, ordered usually by time. In other words, they can be manifested as query based parametric feeds.
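
Seen this way, a "queue" is simply a parameterized query over the set of all invoices. The short Python sketch below illustrates the idea; the `Invoice` fields, the state names and the functions `feed` and `settle` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: str
    state: str      # "unprocessed", "processed" or "failed"
    received: int   # arrival time, e.g. seconds since some epoch


def feed(invoices, state):
    """A queue as a query: filter the superset by state, order by time."""
    return sorted((i for i in invoices if i.state == state),
                  key=lambda i: i.received)


def settle(invoice, card_is_valid):
    """Stand-in for the asynchronous hook: change state, nothing moves."""
    invoice.state = "processed" if card_is_valid else "failed"
```

Note that `settle` never moves an invoice between containers. It only changes the invoice's state, and the next call to `feed` reflects that change.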

Additionally, many XML databases in particular are now supporting versioning. In process based distributed systems, versioning can be a real nightmare, because you effectively have to reverse the transaction (and hence keep a specific record of the transaction) across multiple data repositories. With versioning in place and through the use of a resource-centric architecture, it becomes far easier to "pop the stack" on a transaction - and it also makes it easier to hold any given resource in an appropriate pending state until the transaction is committed. This will become even more of an issue as transactions themselves become more impulse like - where one transaction causes cascades with other transactions.
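
The "pop the stack" idea can be illustrated with a toy versioned resource in Python. This is a sketch of the concept only and does not reflect the versioning API of any particular XML database; the class and method names are hypothetical.

```python
class VersionedResource:
    """Toy resource that keeps every committed version on a stack."""

    def __init__(self, initial):
        self._versions = [initial]
        self._pending = None

    @property
    def current(self):
        return self._versions[-1]

    def stage(self, value):
        """Hold a new value in a pending state until commit."""
        self._pending = value

    def commit(self):
        if self._pending is not None:
            self._versions.append(self._pending)
            self._pending = None

    def discard(self):
        """Abandon the pending value without touching history."""
        self._pending = None

    def rollback(self):
        """Pop the stack: return to the previously committed version."""
        if len(self._versions) > 1:
            self._versions.pop()
```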

Resource-centric DaaS also will continue shaping applications on the client to be more component oriented and decentralized. It also has a second impact, one that in the long run will have a far bigger effect on the way that we think about data. Until recently, data was seen as comparatively static, something that effectively either existed or didn't, and when you made a request for data, then either the information was there or it wasn't. However, now we're moving to a mode where data is both distributed and dynamic. If you make a SPARQL query, for instance, that query will be talking to multiple distributed servers, and the servers will be passing information back with a noticeable and variable latency. This means that at any given stage you won't necessarily know whether you have all of the data, and in some situations the data itself could be potentially infinite in scope or total latency. This means that the information that you have effectively becomes both more stochastic (probabilistic) and also changes dynamically over time. For the most part such changes are cumulative - your data increases monotonically with time, but it's possible that data could also be transient or be filtered out by other operations (this is especially true of data redundancy).

A second issue, especially when dealing with things like RDF Linked Data stores, is the integrity of the data. This isn't necessarily schematic integrity - RDF makes it possible to create ad-hoc schemas based upon the relationships imposed by the data assertions themselves. Rather, the issue becomes one of the degree to which a given set of assertions is true. If there is a question about the reliability of the data, if the data is arrived at via assertion chains each link of which is somewhat fuzzy (it has a "truth" value somewhere between 0 (false) and 1 (true)), if the data is incomplete, all of these will factor into the trust that one can place upon the conclusions based upon the initial assertions.
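
To give a feel for how quickly trust can erode along such a chain, the sketch below multiplies the truth values of the links. The multiplication rule is purely an assumption made for illustration (it treats the links as independent); other combination rules, such as taking the minimum, are equally defensible, and the function name is invented.

```python
import math


def chain_trust(link_truths):
    """Combine per-link truth values (each in 0..1) by multiplying them.

    Assumption: links are independent, so their truth values multiply.
    """
    for truth in link_truths:
        if not 0.0 <= truth <= 1.0:
            raise ValueError("truth values must lie between 0 and 1")
    return math.prod(link_truths)
```

Under this assumption, a conclusion that rests on five links, each trusted at 0.9, is itself trusted at only about 0.59.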

Yet another implication is that such data can be effectively tailored to different access levels, based upon authentication privileges. A good example of this is the notion of a set of 3D renderings of an office park. A general user may only have access to the general physical renderings of the outside and perhaps the lobby until they were upgraded to guest, at which point they may be given access to those corridors and hallways necessary for them to get to their meeting. A person who worked at the office would have a general access view of the facilities, but may not have access to specific security features - the location of video cameras, fire alarms, and so forth, which a security manager might. The same facility might also highlight the electrical system or plumbing conduits for an electrician or plumber to locate the relevant pipe information. All of this information would come from the same underlying data model, but the queries and similar processing would ensure that only people with a "need to know" had access to potentially dangerous content within the building.
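
The office park example might be sketched in Python as a single data model filtered at query time by the requester's role. The layer names, the roles and the `visible_features` function are hypothetical and chosen only to mirror the scenario above.

```python
# One underlying model: every feature is tagged with the layer it belongs to.
FEATURES = [
    {"name": "exterior shell", "layer": "exterior"},
    {"name": "lobby", "layer": "lobby"},
    {"name": "east corridor", "layer": "corridors"},
    {"name": "office 301", "layer": "offices"},
    {"name": "camera 7", "layer": "security"},
    {"name": "riser pipe B", "layer": "plumbing"},
]

# Hypothetical mapping from role to the layers that role may see.
ROLE_LAYERS = {
    "public": {"exterior", "lobby"},
    "guest": {"exterior", "lobby", "corridors"},
    "employee": {"exterior", "lobby", "corridors", "offices"},
    "security": {"exterior", "lobby", "corridors", "offices", "security"},
    "plumber": {"exterior", "lobby", "corridors", "plumbing"},
}


def visible_features(role):
    """Return the names of the features a given role is allowed to see."""
    allowed = ROLE_LAYERS[role]
    return [f["name"] for f in FEATURES if f["layer"] in allowed]
```

Sets of layers are used rather than a single numeric clearance because the roles do not form a strict hierarchy: the plumber sees pipes that an employee does not, and vice versa for offices.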

In a DaaS-centric architecture, the data to produce the relevant 3D data files would be a resource, and the filtering that took place would be provided in both the query and the presentation layers. Because of the focus of this information as declarative content, adding new levels of detail is a relatively easy operation. This points to another aspect of DaaS - the notion of Level of Detail becomes an intrinsic one in data architectures. The combination of nodes of content (resources) and linked collections is a recursive one, making it possible for such information to become both fractal and richly dimensional. To go back to the building analogy, it probably does a plumber no good to have a highly detailed schematic of all of the pipes in the building when he knows the problem is in the third floor men's room. He could get the general schematics for the pipe layouts, but when he has actually found the critical area, he can also go in and ask for more detailed information specific to that particular bathroom. This - level of detail - will increasingly be recognized not as an application specific property, but rather a property that is intrinsic to distributed asynchronous data systems.

One final effect of the rise of DaaS is that you'll see the consolidation around specific syndication formats - Atom and XMPP are the two I'd personally wager on as mechanisms to provide metadata links, and the metadata entries in turn will become responsible for identifying format type and services availability. This is a shift away from the concept of schematic standardization at the topical level. If the server invoice format is different from yours, with a RESTful services approach you have several possibilities: you can transform the incoming (known) invoice into your own internal invoice format, you can ask your partner to provide a transformation "face" that will generate the invoice to your internal format, or you can agree upon a known standard and both adopt it. While there are benefits to standardizing, business realities don't necessarily make that feasible. I believe DaaS can more readily accommodate that than existing SOA/EAI approaches.
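
The first of those possibilities, transforming a partner's known invoice format into your own, can be as small as a field mapping. The Python sketch below assumes flat invoices represented as dictionaries; the field names on both sides and the function name are invented for the example.

```python
# Hypothetical mapping from a partner's field names to internal ones.
PARTNER_TO_INTERNAL = {
    "InvNo": "invoice_id",
    "Cust": "customer",
    "Amt": "total",
}


def to_internal(partner_invoice):
    """Rename known fields and drop anything the mapping doesn't cover."""
    return {
        PARTNER_TO_INTERNAL[key]: value
        for key, value in partner_invoice.items()
        if key in PARTNER_TO_INTERNAL
    }
```

A transformation "face" offered by the partner would be the same mapping run on their side of the URL instead of yours.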

Overall, I believe that in the long run DaaS based development will likely subsume the existing Software as a Service way of thinking. Software itself in a DaaS regime becomes somewhat diffuse and fractal, but realistically DaaS-centered technologies such as XML databases, Map/Reduce, files on filesystems and so forth will likely simplify the way that such applications are built anyway. Overall it's likely to become one of the major technology thrusts in the Twenty Teens, playing out over the next five to six years or so. That XML developers in particular should be thinking about this now is also important, because it is likely that DaaS will be to this coming decade what SQL was for the last couple of decades - a powerful tool for the creation, dissemination and manipulation of information.

Re: DaaS, XQuery and the NoSQL Movement

Fantastic Kurt!

As usual, much grist for the mill. It has seemed to me for a while now (since working more with SPARQL and Jeni Tennison's RDFQuery in particular) that, as a RESTful NoSQL approach shortens the impedance between resource and consumer, the relationship and object modelling which has been so important in the last decade has started to move to a new relational layer, as the relevance and relationships between "things" are determined more and more for the consumer by the client and presentation layer.

This article also puts me in mind of the approach used in Playdar, a music content resolution system that uses metadata to find audio files on your network, which are addressed using local URLs rather than distributed file handles. This is possible because Playdar runs as a really lightweight webserver on your local machine. I am surprised that the approach of leveraging a local server in the browser has not become more popular; it seems like a natural counterpart to DaaS.
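To make the local-server idea concrete, here is a minimal sketch using only Python's standard library. It is not how Playdar itself works; it simply shows a directory of local files becoming addressable through `http://127.0.0.1:<port>/...` URLs. The `start_local_server` helper and the directory it serves are assumptions for the example.

```python
import functools
import http.server
import threading


def start_local_server(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP on the loopback interface only.

    port=0 asks the OS for any free port; read the one actually chosen
    from server.server_address after the call returns.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run the request loop in a background thread so the caller keeps control.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    srv = start_local_server(".")
    host, port = srv.server_address[:2]
    print(f"Local files are now reachable at http://{host}:{port}/")
    input("Press Enter to stop the server...")
    srv.shutdown()
    srv.server_close()
```

Binding to 127.0.0.1 rather than all interfaces keeps the files visible only to programs on the same machine, which speaks to at least part of the perceived-security concern.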

The idea of content resolution is becoming more important too, because, as you mentioned, data is more and more dynamic - what do you do when your data comes from multiple resources? Source of truth has always been at issue politically in government; it seems like issues of provenance and relevance are both starting to come closer to home.

>>> ...a few thoughts

Fantastic essay!

Re: DaaS, XQuery and the NoSQL Movement

Kurt Cagle's picture

Anonymous wrote:
As usual, much grist for the mill. It has seemed to me for a while now (since working more with SPARQL and Jeni Tennison's RDFQuery in particular) that, as a RESTful NoSQL approach shortens the impedance between resource and consumer, the relationship and object modelling which has been so important in the last decade has started to move to a new relational layer, as the relevance of and relationships between "things" are determined more and more for the consumer by the client and presentation layer.

I've had that same sense as well. Put another way: the client is increasingly determining the relationships between things ex post facto, rather than being handed these relationships already pre-hardened by the provider. This is a radical change when you think about it; it places an enormous amount of power into the hands of the consumer of this information, albeit at the cost to the provider of losing the role of gatekeeper of such relationships. Of course, for that to happen, client developers have to be aware that they have this power in the first place, and I think building that awareness is going to be an ongoing process as well.

Anonymous wrote:
This article also puts me in mind of the approach used in Playdar, a music content resolution system that uses metadata to find audio files on your network, which are addressed using local URLs rather than distributed file handles. This is possible because Playdar runs as a really lightweight webserver on your local machine. I am surprised that the approach of leveraging a local server in the browser has not become more popular; it seems like a natural counterpart to DaaS.

There are still factors that militate against this, as useful as it is for a developer of Internet apps. Perceived security is one, though I have to admit that this is more a function of WHICH server is localhost. Resource usage is a second - a server does require memory and CPU cycles, though I think that this is also a weak argument. Mostly it's just a question of changing user perceptions, and simplifying the interfaces necessary to work with such servers. I set up localhosts on any computer I work with and have for so long I don't even think twice about it, but I also do things like run Drupal as an internal CMS. That we'll see this approach taken more often is, I think, inevitable - it provides an alternative approach to offline applications which makes more sense than trying to put a database into a browser.

Anonymous wrote:
The idea of content resolution is becoming more important too, because, as you mentioned, data is more and more dynamic - what do you do when your data comes from multiple resources? Source of truth has always been at issue politically in government; it seems like issues of provenance and relevance are both starting to come closer to home.

Yup. It will take time to change the way that we think about data, but that change is taking place even now, and will only continue.

Re: DaaS, XQuery and the NoSQL Movement

Agreed. At the same time, I am very excited by the possibilities opened up by Sugarlabs' "Sugar on a Stick" project, which comes indirectly from the OLPC foundation, bundling Fedora and the Sugar OS onto a USB device so a child can live-boot his or her own "computer" on any box which supports live booting - i.e. at home, at school, at the library etc.

This has the advantage of moving children away from the MS/Apple silos, but also underscores the way that data is accessed, by essentially bundling an access layer with whatever activities the child is involved with. It is one more level of indirection between the user and the world at large, but one which is personalizable and would permit a graduated approach to access, possibly a way of authenticating the user by age and offering age-appropriate protection. I'm not actually sure where I stand on the issue of this sort of protection, but it does seem that if information providers start to remove limitations and filters at the application level on the server side, these functions will need to be negotiated more locally.

I am really curious to see how these shifts play out as the "cloud" hype either expands or evaporates and the firehoses are unleashed. It's a good time for developers in any case.

 

Cheers

-- Piers

Re: DaaS, XQuery and the NoSQL Movement

Kurt Cagle's picture

Anonymous wrote:
Agreed. At the same time, I am very excited by the possibilities opened up by Sugarlabs' "Sugar on a Stick" project, which comes indirectly from the OLPC foundation, bundling Fedora and the Sugar OS onto a USB device so a child can live-boot his or her own "computer" on any box which supports live booting - i.e. at home, at school, at the library etc. This has the advantage of moving children away from the MS/Apple silos, but also underscores the way that data is accessed, by essentially bundling an access layer with whatever activities the child is involved with. It is one more level of indirection between the user and the world at large, but one which is personalizable and would permit a graduated approach to access, possibly a way of authenticating the user by age and offering age-appropriate protection. I'm not actually sure where I stand on the issue of this sort of protection, but it does seem that if information providers start to remove limitations and filters at the application level on the server side, these functions will need to be negotiated more locally.

I find it significant that USB sticks are now getting to a point where you can store up to 16GB of applications on one. I think that's the break-even point - at 16GB, you can have a solid Linux core and support infrastructure, a decently sized /home/ directory, and enough spare space to store at least basic resources. Up until that point, they were storage, but little more - now they're effectively working operating systems - especially if you're working with the presumption of cloud availability for your resources as well.

And yes, I think that's a key tipping point in broader computer usage, especially with corporate and institutional usage. I actually think a good place to start in that regard is with game boxes. Put a USB port on a PS3 or Xbox (it may have one already), add a keyboard/mouse peripheral bundle through the appropriate hardware port (either USB or IR/Bluetooth) and you have the core hardware system to handle the relevant OS.

Anonymous wrote:
I am really curious to see how these shifts play out as the "cloud" hype either expands or evaporates and the firehoses are unleashed. It's a good time for developers in any case.

Cloud is marketing hype - anyone in the industry knows that - it's essentially taking the trends that already have been in place for years and making them sound sexy. That doesn't mean, however, that the trends themselves don't continue to have relevance. Local storage and control of one's OS is a key part of that equation, and as Netbooks continue to cannibalize sales of lower to mid-sized laptops, I think that cloud will only become more relevant. Here's a slightly different scenario from the one above, but it's just as relevant. Suppose that what gets stored on your data stick is not the OS itself but a virtualization snapshot. The netbooks or whatever exist primarily as virtual OS servers, you put in your stick, and the stick bundle then gets booted back up. This means that in essence you not only always have the programs you want, but you maintain the precise state of where you were when you were last on. The stick contains the relevant virtualization software (on Linux something like VirtualBox, which is a pretty impressive piece of software) which can be loaded onto the system in question if it doesn't have the software already present.

This would be a boon in the commercial sector - you're not spending the five to ten minutes waiting for everything to configure, you can essentially jump right into a presentation without having to set up beforehand, and if you're interrupted you can save your state and pick up immediately when you get back within range of a computer. In this context, I also see your portable data as being effectively the social equivalent of RAM - it can be readily persisted back into the cloud as necessary, can be stored locally when you know you're going to be out of range of the Internet, and, especially if you go back to the idea of running a local server, can actually let you perform work while disconnected from the Internet, which is one of the biggest problems that you have with most web apps. The trick though is not to try to create protocols for your local file system that are separate from those you use with your web FS, but instead to create a virtualization layer over your LFS so that it simply looks like another web service provider. Once you're at that stage, then synching becomes an automated process - you persist to a local process which looks like a web process, then when a wireless connection becomes available, the messages are sent and the cache of local messages is cleared.
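The persist-locally-then-flush pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LocalOutbox` class, the `send` callable that stands in for the remote web service, and the `online` flag that stands in for connection detection are all hypothetical names, and a real implementation would persist the queue to disk rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LocalOutbox:
    """Queue writes locally and flush them to a remote service when online."""

    # Hypothetical remote call; it should return True when delivery succeeds.
    send: Callable[[dict], bool]
    pending: List[dict] = field(default_factory=list)

    def persist(self, message: dict) -> None:
        # Always write to the local cache first, whether or not we are online.
        self.pending.append(message)

    def sync(self, online: bool) -> int:
        """Send queued messages in order and return how many were delivered.

        Stops at the first failed send so undelivered messages stay cached.
        """
        if not online:
            return 0
        sent = 0
        while self.pending:
            if not self.send(self.pending[0]):
                break
            self.pending.pop(0)  # clear a message only once it is delivered
            sent += 1
        return sent


if __name__ == "__main__":
    delivered: List[dict] = []

    def fake_send(message: dict) -> bool:
        delivered.append(message)
        return True

    outbox = LocalOutbox(send=fake_send)
    outbox.persist({"op": "save", "doc": "notes.xml"})
    print(outbox.sync(online=False), len(outbox.pending))
    print(outbox.sync(online=True), len(outbox.pending))
```

Because the application only ever talks to `persist`, it does not need to know whether it is connected, which is the point of making the local layer look like just another web service provider.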

Kurt