An alternative to hierarchical data model

Asked 20/8, 2012 at 12:4 Answered 29/8, 2012 at 5:32

Problem domain

I'm working on a rather big application, which uses a hierarchical data model. It takes images, extracts images' features and creates analysis objects on top of these. So the basic model is like Object-(1:N)-Image_features-(1:1)-Image. But the same set of images may be used to create multiple analysis objects (with different options).

An analysis object or an image can also have many other connected objects: for example, an analysis object can be refined with additional data, or complex conclusions (solutions) can be based on the analysis object and other data.

Current solution

This is a sketch of the solution. Stacks represent sets of objects, arrows represent pointers (i.e. image features link to their images, but not vice versa). Some parts (images, image features, additional data) may be included in multiple analysis objects, because the user wants to run analyses on different sets of objects, combined differently.

[Figure: simplified sketch of the current solution]

Images, features, additional data and analysis objects are stored in global storage (god-object). Solutions are stored inside analysis objects by means of composition (and contain solution features in turn).

All the entities (images, image features, analysis objects, solutions, additional data) are instances of corresponding classes (like IImage, ...). Almost all the parts are optional (i.e., we may want to discard images after we have a solution).

Current solution drawbacks

1. Navigating this structure is painful when you need connections like the dotted one in the sketch. To display an image with a couple of solution features on top, you first have to iterate through the analysis objects to find which of them are based on this image, and then iterate through their solutions to display them.
2. If, to solve 1, you choose to store the dotted links explicitly (i.e. the image class holds pointers to the solution features related to it), you'll spend a lot of effort maintaining the consistency of these pointers and constantly updating the links when something changes.

My idea

I'd like to build a more extensible (2) and flexible (1) data model. The first idea was to use a relational model, separating objects from their relations. And why not use an RDBMS here? SQLite seems an appropriate engine to me. Complex relations would then be accessible through simple (LEFT) JOINs on the database (pseudocode: "images JOIN images_to_image_features JOIN image_features JOIN image_features_to_objects JOIN objects JOIN solutions JOIN solution_features"), followed by fetching the actual C++ objects for the solution features from global storage by ID.
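To make the idea concrete without committing to a database engine, here is a minimal in-memory sketch of "objects separate from relations". Every name in it (`Relations`, `follow`, `solution_features_for_image`, the four link tables) is hypothetical and chosen only to mirror the pseudocode join above; entities are reduced to integer IDs, and the real objects would still be fetched from global storage by ID.

```cpp
#include <map>
#include <set>

using Id = int;

// Hypothetical relation tables: each entry is one link between two entity IDs.
// The entities themselves live elsewhere (e.g. in the global storage).
struct Relations {
    std::multimap<Id, Id> image_to_feature;      // image -> image feature
    std::multimap<Id, Id> feature_to_object;     // image feature -> analysis object
    std::multimap<Id, Id> object_to_solution;    // analysis object -> solution
    std::multimap<Id, Id> solution_to_sfeature;  // solution -> solution feature
};

// One "JOIN" step: follow every link in `table` that starts at an ID in `from`.
std::set<Id> follow(const std::multimap<Id, Id>& table, const std::set<Id>& from) {
    std::set<Id> out;
    for (Id id : from) {
        auto range = table.equal_range(id);
        for (auto it = range.first; it != range.second; ++it) {
            out.insert(it->second);
        }
    }
    return out;
}

// Chains the four steps: image -> features -> objects -> solutions -> solution features.
std::set<Id> solution_features_for_image(const Relations& r, Id image) {
    std::set<Id> features  = follow(r.image_to_feature, {image});
    std::set<Id> objects   = follow(r.feature_to_object, features);
    std::set<Id> solutions = follow(r.object_to_solution, objects);
    return follow(r.solution_to_sfeature, solutions);
}
```

The returned IDs are what the query in the pseudocode would yield. Whether the link tables live in SQLite or in containers like these, deleting an entity still means removing its rows from every table that mentions it.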

The question

So my primary question is

Is using an RDBMS an appropriate solution for the problems I described, or is it not worth it and there are better ways to organize information in my app?

If an RDBMS is OK, I'd appreciate any advice on using an RDBMS and the relational approach to store relationships between C++ objects.

Umbra asked 20/8, 2012 at 12:4 Comment(9)

Hi Steed. What you are asking is a really difficult question. You are also asking many questions, not one. What do you refer to as a data model? Do you intend to use the data model over the network, write it to a file, or keep it in memory? Without more details and a specific question, answering becomes even more difficult – Cornelie 20/8, 2012 at 12:15

I open a file, create a data structure, work with it, and save it back to a file. By "data model" I mean storing information about real-world objects and the relations between them in memory. I'll try to edit the question to focus on a single question. – Umbra 20/8, 2012 at 12:28

If I need to improve the question further (how?), please let me know. – Umbra 23/8, 2012 at 9:22

You seem to be combining a description of what you are trying to do, a description of a solution you propose, and a question about what solution to use. These can all be useful parts of a good question, but I think you need to separate them a bit more and clarify exactly what you are asking. – Chieftain 23/8, 2012 at 20:31

I'm just trying to understand the structure of your current solution. When you say a 'tree-like structure' do you mean it is done in a single class? Or is it a collection of related classes? "Data is duplicated" => Why so? Why don't you maintain a link to the relevant data rather than duplicating it? "A lot of work should be done, if you have a leaf " => Does this mean more implementation work or more time to run? Basically are you looking for a time optimization or more maintainable/easily codable solution? – Ghana 24/8, 2012 at 10:57

@PermanentGuest, tletnes, I've rewritten the question once more to try to answer your requests. – Umbra 24/8, 2012 at 13:32

@Umbra : This question looks now much better. I would try to answer in one or two days, but definitely now you would get some good answers from others. – Ghana 24/8, 2012 at 16:33

Have a look at OODBMSs too: en.wikipedia.org/wiki/Object_database – Profess 29/8, 2012 at 6:31

@wingman, thanks. At first glance, GigaBASE looks promising. – Umbra 10/9, 2012 at 11:29

I don't recommend RDBMS based on your requirement for an extensible and flexible model.

Whenever you change your data model, you will have to change the DB schema, and that can involve more work than a change in code.
Any problems with DB queries are discovered only at runtime. This can make a lot of difference to the cost of maintenance.

I strongly recommend using standard C++ OO programming with STL.

You can make use of encapsulation to ensure any data change is done properly, with updates to related objects and indexes.
You can use the STL to build highly efficient indexes on the data.
You can create facades that get you the information easily, rather than having to go to multiple objects/collections. This will be one-time work.
You can write unit test cases to ensure correctness (much less complicated than unit testing with databases).
You can make use of polymorphism to build different kinds of objects, different types of analysis, etc.

All very basic points, but I reckon your effort would be best utilized if you improved the current solution rather than looking for a DB-based solution.
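As a rough illustration of the encapsulation, index and facade points above, here is a minimal sketch in which one class owns both the forward links and a reverse index, so callers never update the index by hand. The class `AnalysisStore` and its member names are hypothetical, entities are reduced to integer IDs, and each image list is assumed to contain no duplicates.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

using Id = int;

// Hypothetical store: owns the analysis -> images links and keeps the
// reverse index (image -> analyses) consistent on every mutation.
class AnalysisStore {
public:
    // Registers an analysis object and the images it is based on.
    void add_analysis(Id analysis, const std::vector<Id>& images) {
        remove_analysis(analysis);  // replacing an existing entry stays consistent
        images_of_[analysis] = images;
        for (Id img : images) {
            analyses_of_[img].push_back(analysis);
        }
    }

    // Removes an analysis object and every reverse-index entry that mentions it.
    void remove_analysis(Id analysis) {
        auto it = images_of_.find(analysis);
        if (it == images_of_.end()) return;
        for (Id img : it->second) {
            auto& v = analyses_of_[img];
            v.erase(std::remove(v.begin(), v.end(), analysis), v.end());
            if (v.empty()) analyses_of_.erase(img);
        }
        images_of_.erase(it);
    }

    // Facade-style query: which analysis objects are based on this image?
    std::vector<Id> analyses_using(Id image) const {
        auto it = analyses_of_.find(image);
        return it == analyses_of_.end() ? std::vector<Id>{} : it->second;
    }

private:
    std::unordered_map<Id, std::vector<Id>> images_of_;    // forward links
    std::unordered_map<Id, std::vector<Id>> analyses_of_;  // reverse index
};
```

Because the only way to change the links is through `add_analysis` and `remove_analysis`, the "dotted" lookup becomes a single call, and its consistency can be covered by ordinary unit tests.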

Finned answered 29/8, 2012 at 5:32 Comment(1)

I actually ended up doing it all in C++ without DBs. Just more abstraction and more generic code. Thanks for your answer. – Umbra 28/11, 2012 at 5:36

You may want to look at Semantic Web technologies, such as RDF, RDFS and OWL that provide an alternative, extensible way of modeling the world. There are some open-source triple stores available, and some of the mainstream RDBMS also have triple store capabilities.

In particular, take a look at Manchester University's Protege/OWL tutorial: http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/

And if you decide this direction is worth looking at further, I can recommend "SEMANTIC WEB for the WORKING ONTOLOGIST"

Cohl answered 27/8, 2012 at 11:4 Comment(1)

OWL tutorial is exciting! Thank you for the answer. It will take time for me to read and understand, as well as shipr's solution. Maybe I should create two bounties..;) – Umbra 28/8, 2012 at 11:21

Just based on the diagram, I would suggest that an RDBMS solution would indeed work. It has been years since I was a developer on an RDBMS (called RDM, of course!), but I was able to renew my knowledge and gain many valuable insights into data structures and layouts very similar to what you describe by reading the fabulous book "The Art of SQL" by Stephane Faroult. His book will go a long way toward answering your questions.

I've included a link to it on Amazon, to ensure accuracy: http://www.amazon.com/The-Art-SQL-Stephane-Faroult/dp/0596008945

You will not go wrong by reading it, even if in the end it does not solve your problem fully, because the author does such a great job of breaking down a relation in clear terms and presenting elegant solutions. The book is not a manual for SQL, but an in-depth analysis of how to think about data and how it interrelates. Check it out!

Using an RDBMS to track the links between data can be an efficient way to store and think about the analysis you are seeking, and the links are "soft": they go away when the hard objects they link are deleted. This ensures data integrity, and Mr. Faroult can tell you what to do to ensure that remains true.

Escurial answered 24/8, 2012 at 16:26 Comment(2)

Thanks for the answer! I'll check the book as soon as I get it. Can you think of any disadvantages or tricky points of implementing an RDBMS solution (not covered by the book)? – Umbra 24/8, 2012 at 18:17

I cannot think of specific disadvantages other than that the data is stored to disk using the RDBMS engine, and not fully contained in memory -- but of course that may be an advantage instead. The trickiest part will be to properly establish the relations, and to maintain them when data is deleted; but the book does a good job of describing those things. – Escurial 28/8, 2012 at 17:53

http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html

"you'll put very much effort maintaining consistency of these pointers and constantly updating the links when something changes."

With the help of Boost.MultiIndex you can create almost every kind of index on a "table". I think the quoted problem is not so serious, so the original solution is manageable.

Counterclockwise answered 25/8, 2012 at 6:26 Comment(1)

Thank you for the answer, but I can't see right away how I could use multi_index for my problem. Could you clarify a bit, please? – Umbra 27/8, 2012 at 9:45
