What is the most efficient way to store tags in a database?
Asked Answered
K

8

165

I am implementing a tagging system on my website similar to one stackoverflow uses, my question is - what is the most effective way to store tags so that they may be searched and filtered?

My idea is this:

Table: Items
Columns: Item_ID, Title, Content

Table: Tags
Columns: Title, Item_ID

Is this too slow? Is there a better way?

Knawel answered 2/12, 2008 at 14:51 Comment(2)
Previously asked: #21356Shote
As of 2016, use Solr or ElasticsearchSalve
P
227

One item is going to have many tags. And one tag will belong to many items. This implies to me that you'll quite possibly need an intermediary table to overcome the many-to-many obstacle.

Something like:

Table: Items
Columns: Item_ID, Item_Title, Content

Table: Tags
Columns: Tag_ID, Tag_Title

Table: Items_Tags
Columns: Item_ID, Tag_ID

It might be that your web app is very very popular and need de-normalizing down the road, but it's pointless muddying the waters too early.

Preview answered 2/12, 2008 at 15:5 Comment(3)
Related: #21356Tientiena
if there is some thing like tagGroup how to handle it eg the tags are grouped into categories eg : Programming languages : c#,vb,pearl. OS : windows7,dos ,linux etcSemiconductor
@Thunder: assuming that one tag may only belong to one category, I would create a TagCategory table consisting of category_id and category_name. From there, I would append a category_id field to the Tags table and perform a join on that.Preview
A
17

Actually I believe de-normalising the tags table might be a better way forward, depending on scale.

This way, the tags table simply has tagid, itemid, tagname.

You'll get duplicate tagnames, but it makes adding/removing/editing tags for specific items MUCH more simple. You don't have to create a new tag, remove the allocation of the old one and re-allocate a new one, you just edit the tagname.

For displaying a list of tags, you simply use DISTINCT or GROUP BY, and of course you can count how many times a tag is used easily, too.

Antenatal answered 3/12, 2008 at 11:38 Comment(6)
changing the tag_name you have to iterate over all the tagsUnlike
I wouldn't say so. It'd be something like update tags set value = 'new' where value = 'old'; And really, are you sure you'd even want to allow changing a tag? If something was tagged as "software" but then someone renamed the "software" tag to "hardware", that would be wrong. You don't rename tags unless you're very careful the new value still really means the same thing.Antenatal
that is exactly what I meant. update tags set value = 'new' where value = 'old'; is iterating over all the tags. let's say you have 10 million tags, that would be a very expensive query that needs to run for hours. With a normalized schema it's only 1 row you need to update.Unlike
there is also too many duplication of data. DB storage is expensive. by normalizing you only store the name 1 time.Unlike
It depends what you are optimising for, and whether you actually need a single transaction for the update. Who cares if it takes hours? And depending how you do it, storage is cheaper than compute. Think of a highly collaborative domain like stackoverflow, where thousands or transactions per second have to check whether a tag exists and might need to lock and create it for example.Antenatal
This answer seems better. Normalization only helps till certain level. Every application grows in size and then performance matters.Aube
L
6

If you don't mind using a bit of non-standard stuff, Postgres version 9.4 and up has an option of storing a record of type JSON text array.

Your schema would be:

Table: Items
Columns: Item_ID:int, Title:text, Content:text

Table: Tags
Columns: Item_ID:int, Tag_Title:text[]

For more info, see this excellent post by Josh Berkus: http://www.databasesoup.com/2015/01/tag-all-things.html

There are more various options compared thoroughly for performance and the one suggested above is the best overall.

Lictor answered 17/8, 2015 at 23:51 Comment(0)
W
3

You can't really talk about slowness based on the data you provided in a question. And I don't think you should even worry too much about performance at this stage of developement. It's called premature optimization.

However, I'd suggest that you'd include Tag_ID column in the Tags table. It's usually a good practice that every table has an ID column.

Wertheimer answered 2/12, 2008 at 14:58 Comment(0)
N
2

I'd suggest using intermediary third table for storing tags<=>items associations, since we have many-to-many relations between tags and items, i.e. one item can be associated with multiple tags and one tag can be associated with multiple items. HTH, Valve.

Nauplius answered 2/12, 2008 at 15:3 Comment(0)
P
1

If space is going to be an issue, have a 3rd table Tags(Tag_Id, Title) to store the text for the tag and then change your Tags table to be (Tag_Id, Item_Id). Those two values should provide a unique composite primary key as well.

Patric answered 2/12, 2008 at 15:4 Comment(0)
T
0

Items should have an "ID" field, and Tags should have an "ID" field (Primary Key, Clustered).

Then make an intermediate table of ItemID/TagID and put the "Perfect Index" on there.

Tsuda answered 2/12, 2008 at 15:6 Comment(0)
L
0

Tag Schema: Tag tables and attributes:

Tables:

tags (each row only keeps information about a particular tag)
taggings (each row keeps information about trigger and who will receive the trigger )
products_tags (each row keeps information about tag with particular product)
tag_status (each row keeps track of a tag status)

Table: tags Attributes of tags table:

id(PK)
userId(FK users)(not null)(A tag only belongs to one user, but  a user can create multiple tags. So it is one to many relationships.)
genreId(FK products_geners)(not null)
name (string) (not null)
description (string)
status (int) (0=inactive, 1=pending, 2=active, there could be more flag)
rank(int)  (rank is the popularity of a particular tag), this field can be use for sorting among similar tags.)
type (int) (0=type1, 1=type2, 2=type3)
photo(string)
visibility (int) (0=public, 2=protected, 3 = private)(private means the tag only  visible to assigned users of a product, protected means a tag only visible to all friends and followers of the creator of the tag, public means search by public, such as all admin created tag)
createdAt(timestamp for the tag was created at)
updatedAt (timestamp for the tag last time updated)
deletedAt (default value null) (timestamp when tag was deleted, we need this field because we will delete tag permanently from audit table). 

Note: Keeping field no 10 will come handy later.

Table: taggings :

This table will be used for triggering such as broadcasting other users' feed or sending them notification. After a row inserted in this table, there will be a service who will read a row take associated action to remove the row.

Attributes of taggings table:

Id(PK)
tagId(a tagging row only belongs to a tag, but a tag can have multiple row).
taggableId (id of a user who will receive notification)
taggableType(int) (0=notification, 1=feed message)
taggerId(the person who triggered the broadcast)
taggerType(ad, product, news)
createdAt(timestamp for the tag was created at)

Table: products_tags

From user perspective a user able to create a tag after instantiating an product, so bellow table will keep information about which products has which tags.

Attributes of Attributes of taggings table:

Id (PK)
productId(FK)
tagId(FK)

Table: tag_status

When user will create a tag, a row will be created in this table with tagId and default status inactive/pending, admin will pull all tags from tags table where status=pending/inactive, after reviewing a tag if admin approved the tag then value of status in tag table will be approved and the row of tag_status will be removed. If admin is rejected then the value of the status field of tag_status table will be rejected and a trigger will be broadcasted and the receiver will send a notification to the associated user of that tag with a message that his tag is rejected .

id(PK)
senderId(Id of the user)
receiverId(Id of admin user)
createdAt(timestamp of created at)
updatedAt(timestamp of updated at)
deletedAt(timestamp of deletedAt) default value null
expiredAt (if a tag never gets approved it will expire after a certain time for removing its information from the database. If a rejected tag gets updated by user then expiredAt will reset to new future time)
status 
Message (string varchar(256)) (message for user)
Lucinalucinda answered 13/5, 2021 at 8:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.