Many-to-many relationship with NoSQL database

F

2

22

I want to implement a taxonomy structure (geo terms) for my node.js application with NoSQL database. I had a similar taxonomy structure with MySQL but it's time to move forward and learn something new so I decided to try a different approach and use NoSQL (document-oriented) for my test app. The taxonomy structure is simple - there're five different levels: country (i.e. United Kingdom) → region (England) → county (Merseyside) → city/town/village (Liverpool) → part of the city (Toxteth).

The obvious choice is to use a tree structure but the devil is in the detail - historically some cities and towns belonged to other counties. The idea was to tag persons who were born in certain cities or towns with those terms and to filter them later by geo tags so I have to respect the fact Liverpool or Manchester (among others) were part of Lancashire at the time some people were born. Otherwise the result any user gets with my geo filter will be incorrect.

Example: John Doe was born in Blackburn (Lancashire) back in 1957. Paul Brown was born in 1960 in Liverpool (Lancashire, now Merseyside). Georgia Doe (nee Jones) was born in Wirral (Cheshire, now Merseyside) 5 years later. Their son Ringo was born in Liverpool (Merseyside by that time) in 1982.

John is Lancastrian by birth, Paul is Lancastrian and Merseysider, Georgia is from Cheshire and Merseyside at the same time, Ringo is from Merseyside. So they should be categorized accordingly when I search by county. But with simple one-to-many structure that follows modern structure of the country they'll never be filtered as they should be.

How to implement the collection respecting the complexity of its structure with NoSQL (first of all document-oriented) solutions? I googled it and did some research over stack* but still had no clue what to do next with it. There's a few possible ways to solve it in my opinion:

Use SQL-like data structure:

{
    {'name': 'United Kingdom', 'unique_id': 1},
    {'name': 'England', 'unique_id': 2, 'parents': [1]},
    {'name': 'Merseyside', 'unique_id': 3, 'parents': [2]},
    {'name': 'Lancashire', 'unique_id': 4, 'parents': [2]},
    {'name': 'Liverpool', 'unique_id': 5, 'parents': [3, 4]},
}

Use tree structure with some references:

{    
    {'name': 'United Kingdom', 'unique_id': 1
        {'name': 'England', 'unique_id': 2]
            {'name': 'Merseyside', 'unique_id': 3]
                {'name': 'Liverpool', 'unique_id': 5, 'alternate_parents': [4]},
            },
            {'name': 'Lancashire', 'unique_id': 4},
        },
    },
}

Use tree structure with no references (one-to-many) and add "alternate parent" tag to a document manually:

{    
    {'name': 'United Kingdom', 'unique_id': 1
        {'name': 'England', 'unique_id': 2]
            {'name': 'Merseyside', 'unique_id': 3]
                {'name': 'Liverpool', 'unique_id': 5},
            },
            {'name': 'Lancashire', 'unique_id': 4},
        },
    },
}

Stick with SQL.
Try to implement database-less taxonomy.

Give me advice on that matter please. I'm a newby with any NoSQL (currently I've designed no such databases) so there's a real design issue for me.

And I'm new to stack* so feel free to correct me if I did anything wrong with this post :) Thank you!

EDIT I've chosen @Jonathan answer as a solution. I think it suits better for my needs (there'll be other documents to store in my database and tag them with those terms) especially with mapReduce functionality suggested by @Valentyn.

But if there's no document collections needed for your app a graph database (based on relationships not documents) suggested by @Philipp is probably the best solution possible.

Fresco answered 12/1, 2013 at 1:51 Comment(4)

This explains how the ruby ORM Mongoid does it: "When defining a relation of this nature, each document is stored in its respective collection, and each document contains a "foreign key" reference to the other in the form of an array." mongoid.org/en/mongoid/docs/… – Repair 12/1, 2013 at 1:54

What NoSQL database are you using? There are a lot of database solutions which are herded under the general term "NoSQL" and they don't have much in common. – Gable 12/1, 2013 at 11:15

@AlexWayne Thanks mate. It looks promising and I'll give it a try later. – Fresco 12/1, 2013 at 12:39

@Gable Yeah. You're right mate, I should be more specific with the terms. There'll be a document-oriented db, most probably MongoDB instance. – Fresco 12/1, 2013 at 12:46

C

6

Firstly, picking between NoSQL and a SQL database is hard if you're not familiar with the basic principles. If this is the only data you are storing, go with a relational (SQL). If there is more data (which I assume) and it requires more of a interwoven schema, stick with NoSQL hands down.

I would take the relational route on this to keep it from getting too complex... start several collections; one for countries, region and so on. Don't get discouraged from doing relational (SQL) type schemas in a NoSQL database; most of the time they are the best solution.

Then, in each of the sub-groups, have a field which names the parent.

For example:

{
    {'name': 'United Kingdom'},
    {'name': 'United States'}
}

{
    {'name': 'England', 'parent': 'United Kingdom'},
    {'name': 'California', 'parent': 'United States'}
}

That way, your data-set doesn't get so nested that the returned data is unmanageable. Then you can grab the countries and the corresponding regions... etc with ease.

Best of luck!

EDIT: Answering OP's questions:

(Firstly, I'd recommend MongoDB - it's a great solution all around.)

Because when you start working with MongoDB, you'll realize that it stores data side by side on the hard drive. If you edit a huge record like that, it will most likely be pushed to the back of the disk, making your hard drive similar to Swiss cheese. Once you get to that point, you'll have to do a repair to condense it once more. Also, this way the data is more easily separated in your application, that way, if you need to do something with the data, you won't have to apply it to the entire object. I am assuming that you will have a large dataset since there are many different locations in the world.
Don't worry too much about that kind of thing. You can use ID's for the parent and match the children with the ID if you plan on changing names a lot. I just did it this way because I assumed you wouldn't need to change a location database.
Rather than an array, I would use a nested document to store multiple parents. That way, it can be more easily queried and indexed. I would use the following method:
```
{
    {
        'name': 'England,
        'parent': {
            1: 1,
            568: 1
        }
     }
 }
```

So that way you can employ your idea of indexes and find where db.region.$.568 = 1

Congratulate answered 12/1, 2013 at 2:7 Comment(6)

Also I would like to add that in NoSQL sultuions instead of making queries with SELECT ... WHERE x IN or SELECT .. GROUP BY "Map-Reduce" method is been used. For example to get all items that are in "United Kingdom" you can use Map that will mark all items that have needed parent and then Reduce resulting set by filtering marked items. So I will +1 to @Congratulate solution - this way you'll get less coupled entities and also entities would have much more meaning - you can get nessesary data without issuing additional queries. – Relate 12/1, 2013 at 2:35

Depending on the database the OP is using (unfortunately he doesn't tell) this might require a lot of JOIN operation between children and parents. Some NoSQL databases have no or bad support for JOINs, so it would be a bad solution for them. – Gable 12/1, 2013 at 11:21

Thank you @Congratulate for your response. There's definitely more data to store (and lots of different kinds of documents) so there was an obvious choice for me to try MongoDB or one of its cousins… That's an interesting solution but being a novice there's a few questions from me on its design: – Fresco 12/1, 2013 at 13:4

1. Why did you choose different collection for similar data over one collection for all of them? There's five granularity levels so there'll be five collections with the same structure - one for each each level. 2. The name of the parent is a string in your example so when I rename parent term (i.e. United Kingdom to Great Britain) I should find all its children and rename parent's name there. I'd prefer references to plain names. 3. How to store multiple parents in this setup? Should I use an array for it or anything else? – Fresco 12/1, 2013 at 13:15

@IvanPotapov Edited my answer with your questions. – Congratulate 12/1, 2013 at 20:5

I'm sorry for the late reply @Jonathan. Thank you for your clarification. Funnily enough but yesterday I noticed a note here: [link]docs.mongodb.org/manual/tutorial/… that reads: This pattern may also provide a suitable solution for storing graphs where a node may have multiple parents. So it looks like it suits my needs. I choose your answer as a solution. An important note on mapReduce from @ValentynShybanov makes it even better. – Fresco 14/1, 2013 at 19:49

G

9

Because of a comment you made, I assume that you mean "MongoDB" when you say "NoSQL". There are a lot of other database technologies commonly referred to as NoSQL which are completely different, but this one seems to be the one you mean.

is not a good idea, because to get the whole taxonomy chain you will need to do multiple database queries, which should generally be avoided.
and 3. A single document which is a huge tree is not a good idea either, because MongoDB has a limit of 16MB per document. When you create huge, monolithic documents, you might hit that limit.

I think that MongoDB might not be the best solution for your use-case. Did you consider using a graph database? MongoDB is optimized for self-contained documents which stand on their own. But the focus of graph databases is on datasets where you have a lot of entities which are defined by their relations to other entities. This looks a lot like your use-case.

Gable answered 12/1, 2013 at 20:20 Comment(1)

Thank you @Philipp. Sorry for late reply. I think your solution is really interesting I had never had a chance to try graph databases (i didn't even know about them) so it's probably worth a try but I don't know if it really suits me. I need to tag persons (and places) with those terms, and it will probably be better to store their profiles in collections not graphs (I don't need to store any relationships for people and places). I don't know if it's possible to combine both methods (document-oriented db and graph-db) in one app but IMO that'll be an unnecessary overhead for my app anyway. – Fresco 14/1, 2013 at 19:27

C

6

Firstly, picking between NoSQL and a SQL database is hard if you're not familiar with the basic principles. If this is the only data you are storing, go with a relational (SQL). If there is more data (which I assume) and it requires more of a interwoven schema, stick with NoSQL hands down.

I would take the relational route on this to keep it from getting too complex... start several collections; one for countries, region and so on. Don't get discouraged from doing relational (SQL) type schemas in a NoSQL database; most of the time they are the best solution.

Then, in each of the sub-groups, have a field which names the parent.

For example:

{
    {'name': 'United Kingdom'},
    {'name': 'United States'}
}

{
    {'name': 'England', 'parent': 'United Kingdom'},
    {'name': 'California', 'parent': 'United States'}
}

That way, your data-set doesn't get so nested that the returned data is unmanageable. Then you can grab the countries and the corresponding regions... etc with ease.

Best of luck!

EDIT: Answering OP's questions:

(Firstly, I'd recommend MongoDB - it's a great solution all around.)

Because when you start working with MongoDB, you'll realize that it stores data side by side on the hard drive. If you edit a huge record like that, it will most likely be pushed to the back of the disk, making your hard drive similar to Swiss cheese. Once you get to that point, you'll have to do a repair to condense it once more. Also, this way the data is more easily separated in your application, that way, if you need to do something with the data, you won't have to apply it to the entire object. I am assuming that you will have a large dataset since there are many different locations in the world.
Don't worry too much about that kind of thing. You can use ID's for the parent and match the children with the ID if you plan on changing names a lot. I just did it this way because I assumed you wouldn't need to change a location database.
Rather than an array, I would use a nested document to store multiple parents. That way, it can be more easily queried and indexed. I would use the following method:
```
{
    {
        'name': 'England,
        'parent': {
            1: 1,
            568: 1
        }
     }
 }
```

So that way you can employ your idea of indexes and find where db.region.$.568 = 1

Congratulate answered 12/1, 2013 at 2:7 Comment(6)

Also I would like to add that in NoSQL sultuions instead of making queries with SELECT ... WHERE x IN or SELECT .. GROUP BY "Map-Reduce" method is been used. For example to get all items that are in "United Kingdom" you can use Map that will mark all items that have needed parent and then Reduce resulting set by filtering marked items. So I will +1 to @Congratulate solution - this way you'll get less coupled entities and also entities would have much more meaning - you can get nessesary data without issuing additional queries. – Relate 12/1, 2013 at 2:35

Depending on the database the OP is using (unfortunately he doesn't tell) this might require a lot of JOIN operation between children and parents. Some NoSQL databases have no or bad support for JOINs, so it would be a bad solution for them. – Gable 12/1, 2013 at 11:21

Thank you @Congratulate for your response. There's definitely more data to store (and lots of different kinds of documents) so there was an obvious choice for me to try MongoDB or one of its cousins… That's an interesting solution but being a novice there's a few questions from me on its design: – Fresco 12/1, 2013 at 13:4

1. Why did you choose different collection for similar data over one collection for all of them? There's five granularity levels so there'll be five collections with the same structure - one for each each level. 2. The name of the parent is a string in your example so when I rename parent term (i.e. United Kingdom to Great Britain) I should find all its children and rename parent's name there. I'd prefer references to plain names. 3. How to store multiple parents in this setup? Should I use an array for it or anything else? – Fresco 12/1, 2013 at 13:15

@IvanPotapov Edited my answer with your questions. – Congratulate 12/1, 2013 at 20:5

I'm sorry for the late reply @Jonathan. Thank you for your clarification. Funnily enough but yesterday I noticed a note here: [link]docs.mongodb.org/manual/tutorial/… that reads: This pattern may also provide a suitable solution for storing graphs where a node may have multiple parents. So it looks like it suits my needs. I choose your answer as a solution. An important note on mapReduce from @ValentynShybanov makes it even better. – Fresco 14/1, 2013 at 19:49

Recommended topics

Hot tags