Database Design for Tagging [closed]

P

11

181

How would you design a database to support the following tagging features:

items can have a large number of tags
searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
creating/writing items may be slower to enable quick lookup/reading

Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.

Any ideas?

Thanks for all the answers so far.

If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)

Parthena answered 7/9, 2008 at 14:31 Comment(1)

from scaling perspective you have to use a minor-major tag technique else performance will degrade when lots of major tags come forward (and they will do). Especially if you use a de-normalized schema this will happen much faster - for RDBMS solutions always – Strawflower 18/6, 2021 at 9:51

K

24

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

Keane answered 7/9, 2008 at 18:22 Comment(5)

I have to say that answer is a bit short sighted, because using a bit field type of the database limits you to a specific number of bits. This does not mean each item is limited to a certain number of tags, but that there can only be a certain number of unique tags in the whole system (usually up to 32 or 64). – Buckle 29/6, 2009 at 21:2

Assuming a 3nf implementation (Question, Tag, Question_has_Tag), and a bitmap index on the Tag_id in Question_has_Tag, the bitmap index has to rebuilt every time a question has a tag added or removed. A query like

select * from question q inner join question_has_tag qt where tag_id in (select tag_id from tags where (what we want) minus select tag_id from tags where (what we don't)

should be fine and scale out assuming the right b-tree indexes exist on the middle table – Jacinto 24/2, 2010 at 16:41

"This article" link is dead. I would have liked to read that :( – Festoonery 21/10, 2010 at 0:19

Mark: This one looks good: simple-talk.com/sql/t-sql-programming/… It's probably a re-published version of the one I referred to. – Keane 31/10, 2010 at 13:34

the URL of the article isn't valid anymore – Lanky 22/5, 2014 at 14:17

O

82

Here's a good article on tagging Database schemas:

http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/

along with performance tests:

http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.

Otway answered 7/9, 2008 at 19:17 Comment(2)

This question on Meta has some info on the SO schema: meta.stackexchange.com/questions/1863/so-database-schema – Tropology 9/9, 2009 at 16:9

Despite being written by @Jeff, this is still essentially a link only answer. – Gibberish 31/10, 2015 at 23:24

K

24

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

Keane answered 7/9, 2008 at 18:22 Comment(5)

I have to say that answer is a bit short sighted, because using a bit field type of the database limits you to a specific number of bits. This does not mean each item is limited to a certain number of tags, but that there can only be a certain number of unique tags in the whole system (usually up to 32 or 64). – Buckle 29/6, 2009 at 21:2

Assuming a 3nf implementation (Question, Tag, Question_has_Tag), and a bitmap index on the Tag_id in Question_has_Tag, the bitmap index has to rebuilt every time a question has a tag added or removed. A query like

select * from question q inner join question_has_tag qt where tag_id in (select tag_id from tags where (what we want) minus select tag_id from tags where (what we don't)

should be fine and scale out assuming the right b-tree indexes exist on the middle table – Jacinto 24/2, 2010 at 16:41

"This article" link is dead. I would have liked to read that :( – Festoonery 21/10, 2010 at 0:19

Mark: This one looks good: simple-talk.com/sql/t-sql-programming/… It's probably a re-published version of the one I referred to. – Keane 31/10, 2010 at 13:34

the URL of the article isn't valid anymore – Lanky 22/5, 2014 at 14:17

S

15

I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (It discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.

ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.

Scifi answered 5/11, 2008 at 0:40 Comment(0)

I

13

I don't see a problem with a straightforward solution: Table for items, table for tags, crosstable for "tagging"

Indices on cross table should be enough optimisation. Selecting appropriate items would be

SELECT * FROM items WHERE id IN  
    (SELECT DISTINCT item_id FROM item_tag WHERE  
    tag_id = tag1 OR tag_id = tag2 OR ...)

AND tagging would be

SELECT * FROM items WHERE  
    EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)  
    AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)  
    AND ...

which is admittedly, not so efficient for large number of comparing tags. If you are to maintain tag count in memory, you could make query to start with tags that are not often, so AND sequence would be evaluated quicker. Depending on expected number of tags to be matched against and expectancy of matching any single of them this could be OK solution, if you are to match 20 tags, and expect that some random item will match 15 of them, then this would still be heavy on a database.

Incorruptible answered 7/9, 2008 at 14:39 Comment(0)

P

7

You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.

This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.

However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.

EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.

Posada answered 7/9, 2008 at 14:52 Comment(0)

T

5

The easiest method is to create a tags table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag

Querying the data would be something like:

Select distinct target from tags   
where tag in ([your list of tags to search for here])  
and target_type = [the table you're searching]

UPDATE
Based on your requirement to AND the conditions, the query above would turn into something like this

select target
from (
  select target, count(*) cnt 
  from tags   
  where tag in ([your list of tags to search for here])
    and target_type = [the table you're searching]
)
where cnt = [number of tags being searched]

Thibaud answered 7/9, 2008 at 14:39 Comment(0)

D

1

I'd second @Zizzencs suggestion that you might want something that's not totally (R)DB-centric

Somehow, I believe that using plain nvarchar fields to store that tags with some proper caching/indexing might yield faster results. But that's just me.

I've implemented tagging systems using 3 tables to represent a Many-to-Many relationship before (Item Tags ItemTags), but I suppose you will be dealing with tags in a lot of places, I can tell you that with 3 tables having to be manipulated/queried simultaneously all the time will definitely make your code more complex.

You might want to consider if the added complexity is worth it.

Dagoba answered 7/9, 2008 at 17:38 Comment(0)

C

0

You won't be able to avoid joins and still be somewhat normalized.

My approach is to have a Tag Table.

 TagId (PK)| TagName (Indexed)

Then, you have a TagXREFID column in your items table.

This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:

 TagXrefID | ItemID | TagId

So, to get all tags for an item would be something like:

SELECT Tags.TagId,Tags.TagName 
     FROM Tags,TagXref 
     WHERE TagXref.TagId = Tags.TagId 
         AND TagXref.ItemID = @ItemID

And to get all items for a tag, I'd use something like this:

SELECT * FROM Items, TagXref
     WHERE TagXref.TagId IN 
          ( SELECT Tags.TagId FROM Tags
                WHERE Tags.TagName = @TagName; )
     AND Items.ItemId = TagXref.ItemId;

To AND a bunch of tags together, You would to modify the above statement slightly to add AND Tags.TagName = @TagName1 AND Tags.TagName = @TagName2 etc...and dynamically build the query.

Clockmaker answered 7/9, 2008 at 14:43 Comment(0)

S

0

What I like to do is have a number of tables that represent the raw data, so in this case you'd have

Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)

This works fast for the write times, and keeps everything normalized, but you may also note that for each tag, you'll need to join tables twice for every further tag you want to AND, so it's got slow read.

A solution to improve read is to create a caching table on command by setting up a stored procedure that essentially creates new table that represents the data in a flattened format...

CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)

Then you can consider how often the Tagged Item table needs to be kept up to date, if it's on every insert, then call the stored procedure in a cursor insert event. If it's an hourly task, then set up an hourly job to run it.

Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.

In binary format, it is easy to explain. Let's say there are four tags to be assigned to an item, in binary we could represent that

If all four tags are assigned to an object, the object would look like this...

If just the first two...

Then it's just a case of finding the binary values with the 1s and zeros in the column you want. Using SQL Server's Bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.

Check this link to find out more.

Substantialism answered 7/9, 2008 at 16:25 Comment(0)

C

0

To paraphrase what others have said: the trick isn't in the schema, it's in the query.

The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.

The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".

I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.

Charlena answered 7/9, 2008 at 17:12 Comment(1)

You probably shouldn't refrain from giving tidbits about a certain technology because other people trying to work in this problem domain may actually be using that technology and would benefit. – Evenhanded 23/12, 2008 at 23:11

H

0

A variation to the above answer is take the tag ids, sort them, combine as a ^ separated string and hash them. Then simply associate the hash to the item. Each combination of tags produces a new key. To do an AND search simply re-create the hash with the given tag ids and search. Changing tags on an item will cause the hash to be recreated. Items with the same set of tags share the same hash key.

Harrod answered 14/1, 2011 at 5:17 Comment(1)

With this approach you can only search for entries with the exact same set of tags - that's always trivial. In my original question I want to find entries that have all the tags I query for, and possibly more. – Parthena 27/1, 2011 at 13:31

Recommended topics

Hot tags