Recommended SQL database design for tags or tagging [closed]
Asked Answered
B

6

343

I've heard of a few ways to implement tagging; using a mapping table between TagID and ItemID (makes sense to me, but does it scale?), adding a fixed number of possible TagID columns to ItemID (seems like a bad idea), Keeping tags in a text column that's comma separated (sounds crazy but could work). I've even heard someone recommend a sparse matrix, but then how do the tag names grow gracefully?

Am I missing a best practice for tags?

Belisle answered 21/8, 2008 at 19:18 Comment(7)
Okay this is question #20856, the (almost) same question is #48475 asked at least two weeks after this question was asked.Belisle
Another interesting question is "How SO implements tags?"Waverley
Another interesting question is "Would you internationalize them, and if so, how?"Caprification
Interesting comparison (Postgres specific): databasesoup.com/2015/01/tag-all-things.htmlCircle
see also https://mcmap.net/q/94352/-database-design-for-tagging-closedKellene
if you use RDBMS you have to use a minor-major tag technique else it is not scalable when users will actually use tags.Accountant
What do you mean by "minor-major tag technique"?Spire
V
476

Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set running on a proper database, should work well and scale properly.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID
Vigorous answered 21/8, 2008 at 19:22 Comment(8)
This is known as the “Toxi” solution, you can find additional information about it here : howto.philippkeller.com/2005/04/24/Tags-Database-schemasAcrobat
One thing not shown here is hierarchal "tags" or categories in the Tag table. This is commonly needed on sites that have categories and subcategories but need the flexibility of tagging. For example, recipe sites, auto parts sites, business directories, etc. These types of data don't usually fit into only one single category so tagging is the answer but you need to use something like the Nested Set Model or the Adjacency List Model in your Tag table.Sporophore
I agrree with HK1 is it possible with above structure + Table : TagGroup Columns : TagGropuId, Title Table: Tag Columns: TagID, Title , TagGroupIdAutoroute
I thought that the table ItemTag needs a Column ItemTagID. Am I wrong? do I miss anything?Insolation
I think the table Tag should only have the title as PK, having a separate ID as primary key especially an auto incrementing one would result in the table Tag getting filled with duplicate values upon insertion/update (in a ManyToMany relationship).Musetta
@ftvs: link again broken, the new link is howto.philippkeller.com/2005/04/24/Tags-Database-schemasAborn
this is standard and the best RDBMS approach, if you use 2 ItemTag tables splitting tags to minor and major you are also top scalableAccountant
Could you elaborate on "2 ItemTag tables splitting tags to minor and major"?Spire
O
110

Normally I would agree with Yaakov Ellis but in this special case there is another viable solution:

Use two tables:

Table: Item
Columns: ItemID, Title, Content
Indexes: ItemID

Table: Tag
Columns: ItemID, Title
Indexes: ItemId, Title

This has some major advantages:

First it makes development much simpler: in the three-table solution for insert and update of item you have to lookup the Tag table to see if there are already entries. Then you have to join them with new ones. This is no trivial task.

Then it makes queries simpler (and perhaps faster). There are three major database queries which you will do: Output all Tags for one Item, draw a Tag-Cloud and select all items for one Tag Title.

All Tags for one Item:

3-Table:

SELECT Tag.Title 
  FROM Tag 
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 WHERE ItemTag.ItemID = :id

2-Table:

SELECT Tag.Title
FROM Tag
WHERE Tag.ItemID = :id

Tag-Cloud:

3-Table:

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

2-Table:

SELECT Tag.Title, count(*)
  FROM Tag
 GROUP BY Tag.Title

Items for one Tag:

3-Table:

SELECT Item.*
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

2-Table:

SELECT Item.*
  FROM Item
  JOIN Tag ON Item.ItemID = Tag.ItemID
 WHERE Tag.Title = :title

But there are some drawbacks, too: It could take more space in the database (which could lead to more disk operations which is slower) and it's not normalized which could lead to inconsistencies.

The size argument is not that strong because the very nature of tags is that they are normally pretty small so the size increase is not a large one. One could argue that the query for the tag title is much faster in a small table which contains each tag only once and this certainly is true. But taking in regard the savings for not having to join and the fact that you can build a good index on them could easily compensate for this. This of course depends heavily on the size of the database you are using.

The inconsistency argument is a little moot too. Tags are free text fields and there is no expected operation like 'rename all tags "foo" to "bar"'.

So tldr: I would go for the two-table solution. (In fact I'm going to. I found this article to see if there are valid arguments against it.)

Olden answered 20/9, 2013 at 18:37 Comment(14)
Does "Index: ItemId, Title" mean an index for each or one index containing both?Caprification
Normally two indexes. Could depend on the database you're using, though.Olden
In the tag table is ItemId and Tag a composite key? or do you have a PK as well?Paregoric
I think this depends on the kind of framework you are using / how you are used to write code. Sometime its easier to have a primary key and sometimes it is not. In my case I do have a PK because I like it more to use something like DELETE?id=5 than DELETE?idemid=3&tag=Foo in the web client. From a data consistency standpoint on the other hand I wouldn't use an additional PK. And for teaching purposes I wouldn't recommend it :)Olden
Regarding Scheintod's two table solution, it should work, but there are a couple limitations. 1. If your tags have any data associated with it (For example, if your tag has a description or some other data) multiply tagged objects could not have their metadata edited consistently without doing a lot of work to keep them in sync. 2. There is some extra storage involved when there are long titles and tags are used many times. In a 2 table solution, the title would be re-entered many times. Perhaps some DBs would optimize the data as part of their storage/indexing, but I am not sure. In my case,Solfa
Hi, you said - "taking in regard the savings for not having to join and the fact that you can build a good index" - the 'small' table will also be indexed, the index size will be considerably less. join on FK which is int must be much faster than querying a huge index of nvarchar(n). With that said, I like this idea and will probably use a derivation of it to join many different item types (tables) with common tags.Selves
HI h.alex. The point is, that you have to do less joins which can turn out to be faster than having them joined with an int FK. But considering that normally Tag tables tend to be small the speed difference (in which direction whatsoever) might not be noticeable at all. Here is another article which compares roughly the both techniques: dba.stackexchange.com/questions/15897/…Olden
correct me if i'm wrong, but when i delete an item wont there be a null value in the row, and if i delete all the items wont there be a whole lot of repeating set of rows all with null values for itemid?Entertain
@deez: I'm not sure, what you are talking about (but it's early in the morning). If you delete an item you have to delete the tags with it. Depending on the actual database implementation (foreign key constraint) the database won't let you delete it until you have deleted the tags.Olden
in my application i have some tags that come pre loaded i cant dare delete those tags, but i can do cascade nullify which leaves a lot of duplicate useless rows, i do like the idea of eliminating the mapping table thoughEntertain
where is the original tag list stored ? web page ?Lallygag
this way you cannot create "unused" tags so an "add tag" feature has to be performed on an Item. On the other method, the "add tag" feature can be performed independentlyTheologize
Hi I wonder if you still believe 2-table solution is better 3-table solution after all these years ?Inexpiable
@Quilang. I still believe it depends on what you are up to :) I implemented it in both ways in different projects. In my last one I ended up with a 3 table solution because I needed a "tag-type" (or some other meta information on the tag) and could reuse some code from a close cousin of tags: parameters. But in the very same project I used exactly this method for an even closer cousin: flags (eg. 'sold', 'new', 'hot')Olden
P
42

If you are using a database that supports map-reduce, like couchdb, storing tags in a plain text field or list field is indeed the best way. Example:

tagcloud: {
  map: function(doc){ 
    for(tag in doc.tags){ 
      emit(doc.tags[tag],1) 
    }
  }
  reduce: function(keys,values){
    return values.length
  }
}

Running this with group=true will group the results by tag name, and even return a count of the number of times that tag was encountered. It's very similar to counting the occurrences of a word in text.

Pancake answered 7/9, 2008 at 19:41 Comment(3)
+1 Nice to see some NoSQL implementations also.Fagoting
@NickRetallack The link is not working. If you could, please update this answer.Univalve
Ok I replaced the link with one to archive.orgPancake
P
15

Use a single formatted text column[1] for storing the tags and use a capable full text search engine to index this. Else you will run into scaling problems when trying to implement boolean queries.

If you need details about the tags you have, you can either keep track of it in a incrementally maintained table or run a batch job to extract the information.

[1] Some RDBMS even provide a native array type which might be even better suited for storage by not needing a parsing step, but might cause problems with the full text search.

Polyneuritis answered 7/9, 2008 at 11:47 Comment(5)
Are you aware of any full-text search engine that doesn't find variations on a word? For example, searching for book returns books? Also, what do you do about tags like "c++"? SQL Server, for example, would strip the plus signs in the index. Thanks.Slapbang
Try Sphinx - sphinxsearch.comVandyke
This 3-parts tutorial maybe useful for those who are going this route (full text search). It's using PostgreSQL native facilities: shisaa.jp/postset/postgresql-full-text-search-part-1.htmlTimework
is this better than the selected answer in terms of performance ?Thadeus
how about storing in using varchar 255, comma separeted tags and adding kfull text index on it ?Thadeus
L
11

I've always kept the tags in a separate table and then had a mapping table. Of course I've never done anything on a really large scale either.

Having a "tags" table and a map table makes it pretty trivial to generate tag clouds & such since you can easily put together SQL to get a list of tags with counts of how often each tag is used.

Lewan answered 21/8, 2008 at 19:23 Comment(1)
This is even more easy if you don't use an mapping table :)Olden
T
-2

I would suggest following design : Item Table: Itemid, taglist1, taglist2
this will be fast and make easy saving and retrieving the data at item level.

In parallel build another table: Tags tag do not make tag unique identifier and if you run out of space in 2nd column which contains lets say 100 items create another row.

Now while searching for items for a tag it will be super fast.

Truck answered 28/11, 2015 at 9:51 Comment(1)
en.wikipedia.org/wiki/First_normal_form although there are exceptions to the this, you can denormalize, but not hereEntertain

© 2022 - 2024 — McMap. All rights reserved.