Storing tags in a graph database

I've found some advice for setting up tagging systems in relational and document databases, but nothing for graph/multi-model databases.

I am trying to set up a tagging system for documents (let's call them "articles") in ArangoDB. I can think of two obvious ways to store tags in a multi-model (graph+document) database like Arango:

  • as an array within each article document (document database-style)
  • as a separate document class with each tag as a unique document and edges connecting tag documents to the article documents (something closer to relational database-style)

Are these in fact the two main ways to do this? Neither seems ideal. For example:

  • If I'm storing tags within each article document, I can index the tags and presumably ArangoDB is optimizing the space they use. However, I can't use graph features to link or traverse tags (or I have to do it separately).
  • If I'm storing tags as separate tag documents, it seems like extra overhead (an extra query) when I just want to get a list of tags on a document.

Which leads me to an explicit question: with regard to the latter option, is there any simple way to automatically make connected 'tag' documents show up within the article documents? E.g. have an array property that somehow 'mirrored' the tag.name properties of the connected tag documents?

General advice is also welcome.

Telestich answered 2/3, 2016 at 19:2 Comment(2)
Did the answers work for you? If yes, can you mark the best of them as 'accepted'? If not, what's missing? Veii
See this example: neo4j.com/blog/soundcloud-recommendations-neo4j Nadler
N
4

@Joachim Bøggild linked to Mike Williamson: https://mikewilliamson.wordpress.com/2015/07/16/data-modeling-with-arangodb/

I would agree with Williamson that "Compact by default" is generally the way to go. You can then extract vertices (aka. nodes) from properties if/when the actual need emerges. It also avoids creating an overly interconnected graph structure which would be slow for all kinds of traversal queries.

However, in this case, I think Tag vertices (i.e. "documents", in your terminology) are worth having, because you can then store metadata on the tag (like a count), and connect it to other tags and sub-tags. That seems very useful and immediately foreseeable in the particular case of tags. A vertex that you can add more relationships to if/when you need them is also very extensible, so you keep your future options more open (more easily, at least).

It seems Williamson agrees that Tags warrant special consideration:

"But not everything belongs together. Any attribute that contains a complex data structure (like the “comments” array or the “tags” array) deserves a little scrutiny as it might make sense as a vertex (or vertices) of its own."

The original question by @ropeladder poses the main objection that it would require extra overhead (an extra query). I think it might be premature optimization to think too much about performance at this stage. After all, the extra query might be fast, or it might actually be joined with and included in the original query. In any case, I would quote this:

“In general, it’s bad practice to try to conflate nodes to preserve query-time efficiency. If we model in accordance with the questions we want to ask of our data, an accurate representation of the domain will emerge. Graph databases maintain fast query times even when storing vast amounts of data. Learning to trust our graph database is important when learning to structure our graphs without denormalizing them.” --- from page 64, chapter 'Avoiding Anti-patterns', in the book 'Graph Databases', a book co-written by Eifrem, the founder of Neo4j, another very popular native graph database. It's free and available online here: https://neo4j.com/graph-databases-book/
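To illustrate the idea that the extra tag lookup can be folded into the same read instead of being mirrored into the article, here is a minimal in-memory Python sketch. It is not ArangoDB code: the dictionaries stand in for an articles collection, a tags collection and an edge collection, and all names (`articles`, `tags`, `has_tag`, `article_with_tags`) are hypothetical.

```python
# In-memory stand-ins for two document collections and one edge collection.
# All names and sample data here are assumptions made for illustration.
articles = {"a1": {"title": "Intro to graphs"}, "a2": {"title": "Indexing"}}
tags = {"t1": {"name": "graph"}, "t2": {"name": "database"}}
has_tag = [("a1", "t1"), ("a1", "t2"), ("a2", "t2")]  # (article key, tag key)


def article_with_tags(article_key):
    """Return a copy of the article with its tag names merged in at read time."""
    names = [tags[t]["name"] for a, t in has_tag if a == article_key]
    # The stored article is not modified; the "tags" list exists only in the result.
    return {**articles[article_key], "tags": names}


print(article_with_tags("a1"))
```

The tag names are stored once, on the tag documents, and the article only appears to carry them when it is read. In a real database the same shape would presumably be a single query with a sub-lookup rather than two round trips.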

See also this article on some anti-patterns (dense vs sparse graphs), to supplement Williamson's points: https://neo4j.com/blog/dark-side-neo4j-worst-practices/


Extra section included for completeness, to those who want to dive a little bit deeper into this question:

Answering Williamson's own criteria for deciding whether something should be a vertex/node on its own, instead of leaving it as a property on the document vertex:

Will it be accessed on it’s own? (ie: showing tags without the document)

Yes. Browsing tags available in the system could be useful.

Will you be running a graph measurement (like GRAPH_BETWEENNESS) on it?

Unsure. Likely not.

Will it be edited on it’s own?

Yes, probably. A user could edit it separately. Maybe an admin/moderator wants to clean up the tag names (correct spelling errors), or clean up their structure (if you have sub-tags).

Does/could the tags have relationships of it’s own? (assuming you care)

Yes. They could. Sub-tags, or other kinds of content than merely documents. Actually, it's very useful to be able to click a tag and immediately see all documents with that tag. That would presumably be sub-optimal with tags stored as a property array on each document. Whereas a graph database is fundamentally optimized for the case of querying vertices adjacent to other vertices (aka. nodes).

Would/should this attribute exist without it’s parent vertex?

Yes. A tag could/should exist even if the last tagged document was deleted. Someone might want to use that tag later on, and it represents domain information you might want to preserve.
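The answers to these criteria can be sketched in a few lines of Python. This is only an in-memory illustration under assumed names (`tags`, `edges`, `articles_by_tag`, `delete_article`, `rename_tag`), not ArangoDB code: tags are browsable and editable on their own, the articles for a tag are one lookup away, and a tag survives the deletion of its last article.

```python
from collections import defaultdict

# Hypothetical tag documents and (article key, tag key) edges.
tags = {"t1": {"name": "graph"}, "t2": {"name": "database"}}
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t2")]

# Adjacency from each tag to the articles that carry it.
articles_by_tag = defaultdict(set)
for article_key, tag_key in edges:
    articles_by_tag[tag_key].add(article_key)


def delete_article(article_key):
    """Drop the article's edges; the tag documents themselves are left alone."""
    for keys in articles_by_tag.values():
        keys.discard(article_key)


def rename_tag(tag_key, new_name):
    """Edit a tag in one place, without touching any article."""
    tags[tag_key]["name"] = new_name
```

For example, `sorted(articles_by_tag["t2"])` gives the articles for one tag, and after `delete_article("a1")` the tag `t1` has no articles left but still exists in `tags`.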


Nadler answered 8/4, 2019 at 13:8 Comment(0)
V
3

You already mention most of the available decision criteria. Maybe I can add some more:

Tags stored inside the documents can use array indexes for filtering, which can make queries on them fast. However, if you want to attach a rating or an explanation to each item of that tag array, there is no way to do so. Counting the documents that carry a tag may also be more expensive than counting all edges that originate from a specific tag, as may finding all tags that match a search criterion.

One of the strengths of multi-model is that you don't need to decide between the two approaches. You can have an edge collection connecting tags with attributes to your documents, and have an indexed array with the same (flat) tags inside of the document. If you find that all (or most) of your queries use just one method, try to convert the rest and remove the other solution. If that doesn't work, your application simply needs both of them.
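As a rough sketch of keeping both representations side by side, the following in-memory Python example writes every tag assignment twice: once as an edge that can carry attributes such as a rating, and once as a flat name in the article's own array. The names (`articles`, `tagged`, `tag_article`, `count_tagged`) and the `rating` attribute are assumptions made for illustration, not ArangoDB API.

```python
# Hypothetical article documents with a flat, index-friendly "tags" array.
articles = {
    "a1": {"title": "Intro to graphs", "tags": []},
    "a2": {"title": "Indexing", "tags": []},
}
# Hypothetical edge records pointing from a tag to an article.
tagged = []


def tag_article(article_key, tag_name, rating=None):
    """Record the edge (with its attributes) and mirror the name into the article."""
    if tag_name in articles[article_key]["tags"]:
        return  # already tagged; keep both representations free of duplicates
    articles[article_key]["tags"].append(tag_name)
    tagged.append({"tag": tag_name, "article": article_key, "rating": rating})


def count_tagged(tag_name):
    """Count articles for a tag from the edges alone, without reading any article."""
    return sum(1 for edge in tagged if edge["tag"] == tag_name)


tag_article("a1", "graph", rating=5)
tag_article("a2", "graph")
tag_article("a1", "graph")  # repeated call is ignored
print(count_tagged("graph"), articles["a1"]["tags"])
```

The cost of this design is that every write has to touch both places, which is why it only pays off if the application really queries both ways.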

In both cases, finding other documents with the same tags could be done in a subquery:

// Articles matching the fulltext search (needs a fulltext index on articles.text)
LET docs = (FOR ftDoc IN FULLTEXT(articles, 'text', 'search') RETURN ftDoc)
// All distinct tags used by those articles
LET tags = UNIQUE(FLATTEN(docs[*].tags))
// Keys of all articles that carry at least one of those tags
LET otherArticles = (FOR oneTag IN tags
    FOR oneD IN articles FILTER oneTag IN oneD.tags RETURN DISTINCT oneD._key)
RETURN {articles: docs, tags: tags, otherArticles: otherArticles}
Veii answered 8/3, 2016 at 8:43 Comment(0)
I
1

The answer to your explicit question, whether a connected document can automatically show up inside your document, is unfortunately no. I have made an ArangoDB graph with separate tag documents, but I am seriously considering just turning them into properties on the individual items, since the tags seem to meet the criteria for being properties, not related items.

Mike Williamson has done a nice blog post about this: https://mikewilliamson.wordpress.com/2015/07/16/data-modeling-with-arangodb/

He argues that having a lot of edges from a single vertex is slow, and that would be the case with the number of edges from a popular Tag vertex.

Incubus answered 8/3, 2016 at 13:11 Comment(1)
I was going to comment here, as I generally agree with Williamson in the article which you shared. But I think you might have drawn your conclusion from it a bit fast. Please see my answer: https://mcmap.net/q/1500289/-storing-tags-in-a-graph-database Nadler
