Best practices to access schema-less data?
I am toying with RDF, and in particular with how to access information stored in an RDF store. The huge difference from a traditional relational database is the lack of a predefined schema: in a relational database you know that a table has certain columns, and you can technically map each row to an instance of a class. The class has well-defined methods and well-defined attributes.

In a schema-less system, you don't know in advance what data is associated with a given resource. It's like having a database table with an arbitrary, not predefined, number of columns, where every row can have data in any subset of those columns.

Similar to Object-Relational Mappers, there are Object-RDF mappers. RDFAlchemy and SuRF are the two I am playing with right now. Basically, they provide you with a Resource object whose methods and attributes are generated dynamically. It kind of makes sense... however, it's not that easy. In many cases you would prefer to have a well-defined interface, and to have more control over what happens when you set and get data on your model object. Such generic access makes things difficult, in some sense.
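
To make this concrete, here is a toy sketch (not the actual SuRF or RDFAlchemy API, just the general idea) of how such a mapper can turn RDF predicates into dynamically generated attributes, using rdflib:

# Toy illustration of a dynamic RDF-to-object mapper (an assumption-laden sketch,
# not the real SuRF/RDFAlchemy API). Requires rdflib.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, FOAF.mbox, URIRef("mailto:alice@example.org")))

class DynamicResource:
    """Resolve attribute access by looking up a predicate in the graph."""
    def __init__(self, graph, subject, ns):
        self._graph, self._subject, self._ns = graph, subject, ns

    def __getattr__(self, name):
        values = list(self._graph.objects(self._subject, self._ns[name]))
        if not values:
            raise AttributeError(name)
        return values[0] if len(values) == 1 else values

alice = DynamicResource(g, EX.alice, FOAF)
print(alice.name)   # "Alice", although no 'name' attribute was ever declared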

Another (and more important) thing I noted is that, even if in general schema-less data are expected to provide arbitrary information about a resource, in practice you more or less know the "classes of information" that tend to occur together. Of course, you cannot exclude the presence of additional info, but in some cases this is the exception rather than the norm, even though the exceptions are significant enough that a strict schema would be too disruptive. In an RDF representation of an article (e.g. as in RSS/Atom feeds) you know the terms of the described resources, and you can map them to a well-defined object. If additional information is provided, you can define an extended object (inherited from the base one) that provides accessors to the extra information. So in a sense, you deal with schema-less data by means of "schema-oriented objects" that you extend when you want to see specific additional information you are interested in.
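
A minimal sketch of what I mean by "schema-oriented objects" (all term, class and field names below are only illustrative):

# Base class for the agreed-upon terms of an article, plus a subclass that adds
# accessors for extra information when it happens to be present.
class Article:
    def __init__(self, properties):
        self._props = dict(properties)   # predicate -> value, whatever came from storage

    @property
    def title(self):
        return self._props.get("dc:title")

    @property
    def date(self):
        return self._props.get("dc:date")

class GeoArticle(Article):
    """Extended interface for resources that also carry geographic terms."""
    @property
    def latitude(self):
        return self._props.get("geo:lat")

data = {"dc:title": "Schema-less data", "dc:date": "2009-12-18", "geo:lat": "52.52"}
print(GeoArticle(data).latitude)   # extra info, reachable through a specialized class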

My question relates to your experience with real-world usage of schema-less data storage. How do you map it to the object-oriented world so that you can use it proficiently, without getting too close to the "bare metal" of the schema-less storage? (In relational-database terms: without using too much SQL and directly messing with the table structure.)

Is the access doomed to be very generic (e.g. is something like SuRF's "plugged-in attributes" the highest, most specialized level at which you can access your data), or is having specialized classes for specific, agreed-upon, convenient schemas also a good approach, even though it introduces the risk of a proliferation of classes to access new and unexpected associated data?

Misrule asked 18/12, 2009 at 19:15 Comment(1)
For length or complexity? :P – Misrule
4

I guess my short answer would be "don't". I'm a bit of a greybeard, and have done a lot of mapping of XML data into relational databases. If you do decide to use such a database, you're going to have to validate your data constantly. You'll also need very strict discipline to avoid ending up with databases that have little commonality. Using a schema helps here, as most XML schemas are object-oriented and thus extensible, which reduces the analysis needed to keep from creating similar data under dissimilar names, which will cause anyone who has to access your database to think evil thoughts about you.

In my personal experience, if you're doing the sort of thing for which a networked database makes sense, go for it. If not, you lose all the other things relational databases can do, like integrity checking, transactions and set-based selection. However, since most people use a relational database as an object store anyway, I guess the point is moot.

As for how to access that data, just put it in a Hashtable. Seriously. If there is no schema anywhere, then you'll never know what is in there. If you have a schema, you can use it to generate accessor objects, but you gain little: you lose all the flexibility of the underlying store while simultaneously gaining the inflexibility of a DAO (Data Access Object).
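
As a rough sketch of what I mean (shown in Python with a plain dict and the standard ElementTree parser purely for brevity; the element names are made up, and any map type in any language does the same job):

# Walk a small XML fragment and drop every child element into one flat map.
import xml.etree.ElementTree as ET

xml_doc = """
<entry>
  <Name>Alice</Name>
  <Mail>alice@example.org</Mail>
  <Anything>unexpected fields are not a problem</Anything>
</entry>
"""

root = ET.fromstring(xml_doc)
record = {child.tag: child.text for child in root}

print(record.get("Name"))       # "Alice"
print(record.get("Missing"))    # None: absent data is just an absent key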

For instance, if you have a Hashtable, getting the values out of an XML parser is often fairly easy. You define the storage types you're going to use, then you walk the XML tree and put the values into those storage types, storing them in either a Hashtable or a List as appropriate. If, however, you use a DAO, you end up unable to trivially extend the data object (one of the strengths of XML), and you have to create getters and setters for the object that do something like:

// One hand-written setter per element: pull <Name> out of the parsed XML
// (JDOM-style API) and copy it into the data object.
public void setName(Element e) throws NoSuchElementException {
    try {
        this.name = e.getChild("Name").getValue();
    } catch (Exception ex) {
        throw new NoSuchElementException("Element not found for Name: " + ex.getMessage());
    }
}

Except, of course, you have to do it for every single value in that schema layer, including loaders and definitions for sublayers. And, of course, you end up with a much bigger mess if you use the faster parsers that rely on callbacks, since you now have to track which object you're in as you produce the resulting tree.

I've done all this, although I normally construct a validator, then an adapter that maps between the XML and the data class, then a reconciliation process to reconcile it with the database. Almost all of that code ends up being generated, though. If you have the DTD, you can generate most of the Java code to access it, and do so with reasonable performance.

In the end, I'd simply keep freeform, networked or hierarchical data as freeform, networked or hierarchical data.

Northwestward answered 31/12, 2009 at 20:55 Comment(0)
2

I would say the best practice for a schema-less XML file is to create a schema for it!

Having no schema is not particularly nice. It means you cannot validate the file in any way, other than to detect if it is well-formed XML or not.

Having no semantics for the file whatsoever seems fishy, because it would mean that you do not know what you should, did, or will put into it. If that is the case, it sounds suspiciously like a solution in search of a problem.

If you have no schema because you do not yet know a schema language, take a look at DTDs. The DTD language is very simple; you can learn and master it in an hour or two, provided you have a validation utility or a validating parser in your application.

If what is preventing you from having a schema is that your rules do not seem to fit the schema definition file types you have looked at so far, fear not.

While DTD and even XSD (XML Schema) files are somewhat inflexible, there are other more flexible schema file types. They are much simpler than XSD too, trust me.

Take a look at the RNC (RELAX NG, compact) schema file spec. The RNC files are very easy for humans to read and write. There are some XML editors out there that understand them. There are utilities that will convert back and forth between RELAX NG format (RNG or RNC) and other formats like DTD and XSD.

Last time I checked, the XHTML TR included a non-normative RNC file to help validate it, not to mention to document it unambiguously. RELAX NG has the flexibility to do that, and you can actually read it without being part of the Borg collective. In this case, Borg is not a euphemism for Microsoft.

If you need something even more flexible than RELAX NG, take a glance at Schematron. It is a very nice rule-based schema validation language. It is not very complex. Like these other schema languages, it too has been around a long time, is mature, and is a recognized standard.

Even some senior engineers at Microsoft had grave misgivings about XSD. Its complexity is high, it turns out to be unable to express certain not-so-odd data arrangements, it is very verbose, it mixes concerns such as validation and default values, and so on. Whatever you are doing, XSD does not sound very well suited to supporting it directly.

RDF mappers, like XSD binding tools, are well suited to persisting objects, given classes in some supported programming language such as Java (e.g. with JAXB). It is not clear that you have classes you want to persist in the first place, though.

There are some semantic web technologies out there like OWL and RDF which are flexible, and very dynamic.

One tool you might want to look at is Stanford's Protégé. It is quite powerful and very flexible: basically a semantic web IDE and framework. Protégé itself is written in Java, but the semantic web schema and data files it creates and edits can be used by programs written in any language; there is no bias towards Java in such files.

Also, you can find lots of semantic web schemas by using Swoogle. There might be a schema already that fits whatever your application is.

Basically, coming up with a schema file in one of these many schema validation languages is not very hard once you know what you want to put in your XML data file. If you have no idea then it is unlikely a program or a person is going to know what to do with it when they read it. If that is the case, XML might not be the best storage representation. I am not sure anything would be.

Instead, you might simply want to do whatever you are doing in a general-purpose, dynamically typed scripting language like Python or Ruby. Lisp could also be used, if you want your programs not only to handle unlimited data formats but also to be able to modify themselves.
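
For example, a minimal sketch of that approach in Python, persisting a free-form record with the standard json module (the file name and keys are purely illustrative):

# Store and reload an arbitrarily shaped record; new keys can appear at any time.
import json

article = {
    "title": "Schema-less storage",
    "authors": ["Alice", "Bob"],
    "extra": {"seen_on": "2009-12-30"},
}

with open("article.json", "w") as f:
    json.dump(article, f, indent=2)

with open("article.json") as f:
    restored = json.load(f)

assert restored["title"] == "Schema-less storage"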

Another option for schema-less data storage is a logic programming language. These usually do not have any schema. They have an ontology instead.

Two programming languages I have worked with a lot that use ontologies are CLIPS and Prolog. Free, open-source, cross-platform implementations of both are available.

Take a look at SWI-Prolog: fast, simple, and powerful. You can define facts in it, plus rules that synthesize further facts when necessary, and you pull the data out with queries. Prolog was actually an inspiration for RDF when RDF was created back in the 1990s, as I recall; the original RDF documentation used to make frequent references to Prolog. If you want to "discover", "analyze" or "find" things about the facts in your ontology, Prolog is a very good language for writing such applications. It is also handy for natural language parsing.

CLIPS is nice too, if you are looking to do problem-solving upon the facts in your ontology. It is well-suited towards organizing, troubleshooting, and configuration related applications.

If schemas are not your thing, perhaps ontologies are. If not, maybe you should just use a dynamically typed scripting language and persist data stored in complex objects using maps and lists into files using their standard persistence mechanisms.

Applause answered 30/12, 2009 at 6:29 Comment(0)
1

I have no experience with a schema-less DB combined with OOP, but I have years of experience with a schema-less DB and scripting. From my experience, it can be quite useful. The DB I used was also untyped (all values were arbitrary strings). This leads to the following advantages:

  • You don't have to take care of your DB structure. If you need to store something, you just store it, and you don't have to worry about which data types fit the scripting language.
  • You can easily add debug information to "objects" when needed, without having empty columns for most of the table rows. This even allows you to store huge chunks of data where needed.
  • You don't have to care about updates to the DB structure. You just write the new data that comes with your new software version to the database. You don't need an admin to update the table structure and convert your old data; it just happens on the fly.
  • If the keys of your key-value pairs have meaningful names, you don't need much documentation for your data.

So in my case, the schema-less DB together with the scripting was very useful and a huge success.

If you think about using objects with a schema-less DB, I would try to keep the freedom by storing the objects in a hashtable. This gives you the freedom to access all the key-value pairs, no matter which "object" you selected, and also the freedom to add new key-value pairs as needed.

If your objects (like the entries of an RSS feed) have a well-defined base, it makes sense to come up with a base object which encapsulates that well-defined base but also carries some kind of hash map to preserve the freedom.

As soon as you discover that more and more key-value pairs turn out to be "standard", just update your object model to encapsulate them; your software will evolve towards the right data structure. It may even make sense to move some of the data to a traditional RDBMS at a later time.
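
A minimal sketch of that combination, a well-defined base plus a free-form map, might look like this (the field names are just illustrative assumptions):

# Known, agreed-upon attributes are real fields; everything else lands in 'extras'.
class FeedEntry:
    def __init__(self, title, link, **extras):
        self.title = title
        self.link = link
        self.extras = dict(extras)

    def get(self, key, default=None):
        """Uniform access: known attributes first, then the free-form extras."""
        return getattr(self, key, self.extras.get(key, default))

entry = FeedEntry("Schema-less data", "http://example.org/post/1",
                  geo_lat="52.52", debug_trace="parser-v2")
print(entry.get("title"))     # well-defined attribute
print(entry.get("geo_lat"))   # still reachable, no schema change required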

Don't over engineer - implement the features when needed...

Unrestraint answered 29/12, 2009 at 8:25 Comment(0)
0

Use MongoDB or another NoSQL database. Also see this blog post: "Why I think Mongo is to databases what Rails was to frameworks".
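
For example, a minimal sketch with pymongo (it assumes a MongoDB instance is running locally and that the database and collection names "demo" and "articles" are free to use):

# Documents in the same collection need not share a schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["demo"]["articles"]

articles.insert_one({"title": "Schema-less data", "tags": ["rdf", "nosql"]})
articles.insert_one({"title": "Another post", "geo": {"lat": 52.52, "lon": 13.40}})

print(articles.find_one({"title": "Schema-less data"}))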

Skillless answered 30/12, 2009 at 6:34 Comment(0)
