There is... another.
Another way to store detailed read/unread data for a hierarchical forum structure ( board > section > thread, etc.). It does so without a) having to pre-populate read/unread information, and b) without having to ever store more than U*(M/2) rows in its worst case, where U is the number of users, and M is the total number of posts in the database (and usually much, much less than this)
I researched this topic a while ago. I found that SMF/phpBB "cheat" a little bit in how they store user reading history. Their schema supports storage of either the last timestamps or message ID that was marked as read in a given board, forum, subforum, topic (or viewed directly by the browser), like so:
[ user_id, board, last_msg_id, last_timestamp ]
[ user_id, board, forum, last_msg_id, last_timestamp ]
[ user_id, board, forum, subforum, last_msg_id, last_timestamp ]
[ user_id, board, forum, subforum, topic, last_msg_id, last_timestamp ]
This lets users mark specific boards, forums, topics, etc., as "read". It requires, however, either action on the part of the user (either by reading, or actively clicking "mark as read"), and in the case of phpBB, doesn't give you the granularity to say "I have seen this specific message, but not that specific message." You also get the situation where you read the last message in a topic first (viewing the latest activity in a thread), and you're immediately assumed to have read the rest of the thread.
It works for SMF and phpBB to store things like this because it's rare that you're ever viewing just one post (default views are set up for 20+ posts in the last page of a topic). However, for more threaded forums (particularly forums where you're viewing messages one at a time), this is less than ideal. Users of this system will likely care a whole lot if they have read one message but not another, and might consider it cumbersome to only be able to mark an entire section as read, when really they just wanted a few marked as read.
You store messages in tuples like this: [ user_id, lower_msg_id, upper_msg_id ]
The user history log is maintained as the following:
Upon page view, a function looks to see if user_id has a record where current_msg_id is between lower_msg_id and upper_msg_id. If it has, then this page is read, and no action needs taken. If it hasn't, then another query has to be issued, this time determining if current_msg_id is either one less than lower_msg_id (current_msg_id == lower_msg_id-1), or one more than upper_msg_id (current_msg_id == upper_msg_id +1). This is the case where we grow our "read" or "seen" boundary by 1. If we're one away from lower_msg_id or uppper_msg_id, then we grow the tuple by 1 in that direction. If we aren't growing our tuple range, then we insert a new tuple, [ user_id, current_msg_id, current_msg_id ].
Corner case is when two tuple ranges approach each other. In this case, upon searching between the lower tuple boundary and the upper tuple boundary, merge the two boundaries by setting the upper boundary of the lower tuple to the upper boundary of the upper tuple, and delete the upper tuple.
Code example in PHP:
function seen_bounds( $usr_id, $msg_id ) {
# mysql escape
$usr_id = mres( $usr_id );
$msg_id = mres( $msg_id );
$seen_query = "
SELECT
msb.id,
msb.lower_msg_id,
msb.upper_msg_id
FROM
msgs_seen_bounds msb
WHERE
$msg_id BETWEEN msb.lower_msg_id AND msb.upper_msg_id AND
msb.usr_id = $usr_id
LIMIT 1;
";
# See if this post already exists within a given
# seen bound.
$seen_row = query($seen_query, ROW);
if($seen_row == 0) {
# Has not been seen, try to detect if we're "near"
# another bound (and we can grow that bound to include
# this post).
$lower_query = "
SELECT
msb.id,
msb.lower_msg_id,
msb.upper_msg_id
FROM
msgs_seen_bounds msb
WHERE
msb.upper_msg_id = ($msg_id - 1) AND
msb.usr_id = $usr_id
LIMIT 1;
";
$upper_query = "
SELECT
msb.id,
msb.lower_msg_id,
msb.upper_msg_id
FROM
msgs_seen_bounds msb
WHERE
msb.lower_msg_id = ($msg_id + 1) AND
msb.usr_id = $usr_id
LIMIT 1;
";
$lower = query($lower_query, ROW);
$upper = query($upper_query, ROW);
if( $lower == 0 && $upper == 0 ) {
# No bounds exist for or near this. We'll insert a single-ID
# bound
$saw_query = "
INSERT INTO
msgs_seen_bounds
(usr_id, lower_msg_id, upper_msg_id)
VALUES
($usr_id, $msg_id, $msg_id)
;
";
query($saw_query, NONE);
} else {
if( $lower != 0 && $upper != 0 ) {
# Found "near" bounds both on the upper
# and lower bounds.
$update_query = '
UPDATE msgs_seen_bounds
SET
upper_msg_id = ' . $upper['upper_msg_id'] . '
WHERE
msgs_seen_bounds.id = ' . $lower['id'] . '
;
';
$delete_query = '
DELETE FROM msgs_seen_bounds
WHERE
msgs_seen_bounds.id = ' . $upper['id'] . '
;
';
query($update_query, NONE);
query($delete_query, NONE);
} else {
if( $lower != 0 ) {
# Only found lower bound, update accordingly.
$update_query = '
UPDATE msgs_seen_bounds
SET
upper_msg_id = ' . $msg_id . '
WHERE
msgs_seen_bounds.id = ' . $lower['id'] . '
;
';
query($update_query, NONE);
}
if( $upper != 0 ) {
# Only found upper bound, update accordingly.
$update_query = '
UPDATE msgs_seen_bounds
SET
lower_msg_id = ' . $msg_id . '
WHERE
msgs_seen_bounds.id = ' . $upper['id'] . '
;
';
query($update_query, NONE);
}
}
}
} else {
# Do nothing, already seen.
}
}
Searching for unread posts is finding where current_msg_id does not exist between any lower_msg_id and upper_msg_id for a given user (a NOT EXISTS query in SQL terms). It's not the most efficient of queries when implementing in a relational database, but can be solved by aggressive indexing. For example, the following is a SQL query for counting unread posts for a given user, grouping by the discussion area ("item") that the posts are in:
$count_unseen_query = "
SELECT
msgs.item as id,
count(1) as the_count
FROM msgs
WHERE
msgs.usr != " . $usr_id . " AND
msgs.state != 'deleted' AND
NOT EXISTS (
SELECT 1
FROM
msgs_seen_bounds msb
WHERE
msgs.id BETWEEN msb.lower_msg_id AND msb.upper_msg_id
AND msb.usr_id = " . $usr_id . "
)
GROUP BY msgs.item
;
The more users read on the forum, the wider the bounds marked as read by each tuple, and the less tuples have to be stored. Users can get an accurate count of read vs. unread, and can pretty easily be aggregated to see read vs. unread in each forum, subforum, topic, etc.
Given a small forum of about 2000+ posts, the following are the usage statistics regarding the number of tuples stored, sorted by number of times users have logged in (approximating user activity). The column "num_bounds" is the number of tuples required to store the user's "num_posts_read" viewing history.
id num_log_entries num_bounds num_posts_read num_posts
479 584 11 2161 228
118 461 6 2167 724
487 119 34 2093 199
499 97 6 2090 309
476 71 139 481 82
480 33 92 167 26
486 33 256 757 154
496 31 108 193 51
490 31 80 179 61
475 28 129 226 47
491 22 22 1207 24
502 20 100 232 65
493 14 73 141 5
489 14 12 1517 22
498 10 72 132 17
I haven't seen this particular implementation in any forum but my own custom one, and it's a small one at that. I'd be interested if anybody else has implemented, or seen this implemented elsewhere, particularly in a large and/or active forum.
Regards,
Kaiden