PHP forums - how to cope with unread discussions / topics / posts
Asked Answered
C

7

28

I know this question has been asked here a couple of times, but none of the answers had pleased me. This is because almost all of them involve a huge read / write process related with the database, which I'd like to avoid at all cost.

About unread discussions / topics / posts, there's a lot to think of. I don't know how do forum systems like MyBB, vBulletin, Invision Power Board, Vanilla, phpBB, etc., cope with that issue, so I'd like to read from you guys your experience with that. I know that using a database table just for that is the simplest way, but that would involve a huge read / write when the community has over 10,000 members and 1000 new topics every month. It's hard, but there should be a way to avoid the server's overloading.

So, what do you find as the best practices for this issue, as well as how other forum systems cope with it?

Cinchonism answered 18/2, 2010 at 13:15 Comment(0)
M
16

There isn't a lot of choices.

  1. mark every reader thread by each user.

    • Disadvantages: a lot of rows in very active forums
    • Advantages: Every user knows with post has read or not.
  2. mark every unread thread by each user.

    • Disadvantages: a lot of space with "unreaded" rows if there is inactivity of a lot of users
    • Solutions: add a lifetime timestamp and delete old records with a cron
    • Advantages: Every user knows with post has read or not.
  3. use timestamps to determine if show it as unread or not.

    • Disadvantages: The users don't know with are the real unread threads, the marks only show the "new trheads" since the last login
    • Advantage: Save space

The other alternative is mixing solutions, that is,

1 and 3) show thread as "unread" if they aren't older than X days and there isn't a row marked as readed for the user. The "read" rows can be deleted when they are X day older without affect anything.

Advantages

  • less spaced used to determine unread threads

Disadvantages

  • create a cron that keeps the system clean
  • Users don't know if they read threads olders than x days.

Advantages

  • Every user knows which "new posts" has read or not.
Musicale answered 22/2, 2010 at 2:38 Comment(1)
I still think there must be a easier way to do it. I thought about using MemCache, but it relies on memory, and I'm still thinking about APC. If I could have some sort of cache file to work with, would probably help.Cinchonism
C
9

There is... another.

Another way to store detailed read/unread data for a hierarchical forum structure ( board > section > thread, etc.). It does so without a) having to pre-populate read/unread information, and b) without having to ever store more than U*(M/2) rows in its worst case, where U is the number of users, and M is the total number of posts in the database (and usually much, much less than this)

I researched this topic a while ago. I found that SMF/phpBB "cheat" a little bit in how they store user reading history. Their schema supports storage of either the last timestamps or message ID that was marked as read in a given board, forum, subforum, topic (or viewed directly by the browser), like so:

[ user_id, board, last_msg_id, last_timestamp ]

[ user_id, board, forum, last_msg_id, last_timestamp ]

[ user_id, board, forum, subforum, last_msg_id, last_timestamp ]

[ user_id, board, forum, subforum, topic, last_msg_id, last_timestamp ]

This lets users mark specific boards, forums, topics, etc., as "read". It requires, however, either action on the part of the user (either by reading, or actively clicking "mark as read"), and in the case of phpBB, doesn't give you the granularity to say "I have seen this specific message, but not that specific message." You also get the situation where you read the last message in a topic first (viewing the latest activity in a thread), and you're immediately assumed to have read the rest of the thread.

It works for SMF and phpBB to store things like this because it's rare that you're ever viewing just one post (default views are set up for 20+ posts in the last page of a topic). However, for more threaded forums (particularly forums where you're viewing messages one at a time), this is less than ideal. Users of this system will likely care a whole lot if they have read one message but not another, and might consider it cumbersome to only be able to mark an entire section as read, when really they just wanted a few marked as read.

You store messages in tuples like this: [ user_id, lower_msg_id, upper_msg_id ]

The user history log is maintained as the following:

Upon page view, a function looks to see if user_id has a record where current_msg_id is between lower_msg_id and upper_msg_id. If it has, then this page is read, and no action needs taken. If it hasn't, then another query has to be issued, this time determining if current_msg_id is either one less than lower_msg_id (current_msg_id == lower_msg_id-1), or one more than upper_msg_id (current_msg_id == upper_msg_id +1). This is the case where we grow our "read" or "seen" boundary by 1. If we're one away from lower_msg_id or uppper_msg_id, then we grow the tuple by 1 in that direction. If we aren't growing our tuple range, then we insert a new tuple, [ user_id, current_msg_id, current_msg_id ].

Corner case is when two tuple ranges approach each other. In this case, upon searching between the lower tuple boundary and the upper tuple boundary, merge the two boundaries by setting the upper boundary of the lower tuple to the upper boundary of the upper tuple, and delete the upper tuple.

Code example in PHP:

function seen_bounds( $usr_id, $msg_id ) {

    # mysql escape
    $usr_id = mres( $usr_id );
    $msg_id = mres( $msg_id );

    $seen_query = "
        SELECT
            msb.id,
            msb.lower_msg_id,
            msb.upper_msg_id
        FROM
            msgs_seen_bounds msb
        WHERE
            $msg_id BETWEEN msb.lower_msg_id AND msb.upper_msg_id AND
            msb.usr_id = $usr_id
        LIMIT 1;
    ";

    # See if this post already exists within a given
    # seen bound.
    $seen_row = query($seen_query, ROW);

    if($seen_row == 0) {
        # Has not been seen, try to detect if we're "near"
        # another bound (and we can grow that bound to include
        # this post).
        $lower_query = "
            SELECT
                msb.id,
                msb.lower_msg_id,
                msb.upper_msg_id
            FROM
                msgs_seen_bounds msb
            WHERE
                msb.upper_msg_id = ($msg_id - 1) AND
                msb.usr_id = $usr_id
            LIMIT 1;
        ";

        $upper_query = "
            SELECT
                msb.id,
                msb.lower_msg_id,
                msb.upper_msg_id
            FROM
                msgs_seen_bounds msb
            WHERE
                msb.lower_msg_id = ($msg_id + 1) AND
                msb.usr_id = $usr_id
            LIMIT 1;
        ";

        $lower = query($lower_query, ROW);
        $upper = query($upper_query, ROW);

        if( $lower == 0 && $upper == 0 ) {
            # No bounds exist for or near this. We'll insert a single-ID
            # bound

            $saw_query = "
                INSERT INTO
                    msgs_seen_bounds
                (usr_id, lower_msg_id, upper_msg_id)
                VALUES
                ($usr_id, $msg_id, $msg_id)
                ;
            ";

            query($saw_query, NONE);
        } else {
            if( $lower != 0 && $upper != 0 ) {
                # Found "near" bounds both on the upper
                # and lower bounds.

                $update_query = '
                    UPDATE msgs_seen_bounds
                    SET
                        upper_msg_id = ' . $upper['upper_msg_id'] . '
                    WHERE
                        msgs_seen_bounds.id = ' . $lower['id'] . '
                    ;
                ';

                $delete_query = '
                    DELETE FROM msgs_seen_bounds
                    WHERE
                        msgs_seen_bounds.id = ' . $upper['id'] . '
                    ;
                ';

                query($update_query, NONE);
                query($delete_query, NONE);
            } else {
                if( $lower != 0 ) {
                    # Only found lower bound, update accordingly.
                    $update_query = '
                        UPDATE msgs_seen_bounds
                        SET
                            upper_msg_id = ' . $msg_id . '
                        WHERE
                            msgs_seen_bounds.id = ' . $lower['id'] . '
                        ;
                    ';

                    query($update_query, NONE);
                }

                if( $upper != 0 ) {
                    # Only found upper bound, update accordingly.
                    $update_query = '
                        UPDATE msgs_seen_bounds
                        SET
                            lower_msg_id = ' . $msg_id . '
                        WHERE
                            msgs_seen_bounds.id = ' . $upper['id'] . '
                        ;
                    ';

                    query($update_query, NONE);
                }
            }
        }
    } else {
        # Do nothing, already seen.
    }

}

Searching for unread posts is finding where current_msg_id does not exist between any lower_msg_id and upper_msg_id for a given user (a NOT EXISTS query in SQL terms). It's not the most efficient of queries when implementing in a relational database, but can be solved by aggressive indexing. For example, the following is a SQL query for counting unread posts for a given user, grouping by the discussion area ("item") that the posts are in:

$count_unseen_query = "
    SELECT 
        msgs.item as id,
        count(1) as the_count
    FROM msgs
    WHERE
    msgs.usr != " . $usr_id . " AND
    msgs.state != 'deleted' AND
    NOT EXISTS (
       SELECT 1 
       FROM 
          msgs_seen_bounds msb
       WHERE 
          msgs.id BETWEEN msb.lower_msg_id AND msb.upper_msg_id
          AND msb.usr_id = " . $usr_id . "
    )
    GROUP BY msgs.item
    ;

The more users read on the forum, the wider the bounds marked as read by each tuple, and the less tuples have to be stored. Users can get an accurate count of read vs. unread, and can pretty easily be aggregated to see read vs. unread in each forum, subforum, topic, etc.

Given a small forum of about 2000+ posts, the following are the usage statistics regarding the number of tuples stored, sorted by number of times users have logged in (approximating user activity). The column "num_bounds" is the number of tuples required to store the user's "num_posts_read" viewing history.

id  num_log_entries num_bounds num_posts_read num_posts
479             584         11           2161       228
118             461          6           2167       724
487             119         34           2093       199
499              97          6           2090       309
476              71        139            481        82
480              33         92            167        26
486              33        256            757       154
496              31        108            193        51
490              31         80            179        61
475              28        129            226        47
491              22         22           1207        24
502              20        100            232        65
493              14         73            141         5
489              14         12           1517        22
498              10         72            132        17

I haven't seen this particular implementation in any forum but my own custom one, and it's a small one at that. I'd be interested if anybody else has implemented, or seen this implemented elsewhere, particularly in a large and/or active forum.

Regards,

Kaiden

Clapper answered 18/2, 2011 at 19:46 Comment(5)
I'd like to look at a pratical implementation of that, if possible :)Cinchonism
@yoda, not sure what you're asking for. I've got working code, but whether it's SQL-injection proof or not, not sure I'd want to expose the running instance to the horribleness of the Internet. What's the precedent set on StackOverflow for including code? Is it expected to be part of the answer, or in the comments (which are woefully too short), or off-site in another service like paste.ie? Lemme know.Clapper
you can show it to me only, if you agree with that and don't want to expose your code to everybody. You can use pastebin.com and send the code to aeon dot yoda at gmail dot com. I'd appreciate!Cinchonism
@yoda, Well, can't see harm in putting up code snippets. I also included some usage stats on how people's usage of the system ends up looking in the database as regards the viewing history.Clapper
Great answer, I love that you've shown the varying depths of "read" as distinguished by the timestamps on the forums and threads.Swadeshi
C
3

Non exactly a PHP-answer, but here's how we do it in our asp.net-based forum (I'm affiliated with this product, disclosing that due to the rules)

  1. We use cookies, not the database.
    • Disadvantage of cookies - not "cross-device" (visiting from another computer shows everything as unread)
    • Advantage - no huge DB reads/writes. And tracking works for "guest" users also! This is awesome.
  2. We store a cookie with { topicID, lastReadMessageID } pairs for every topic the user visits.
  3. If the data for a particular topic is not found in the cookie we assume the topic is either:
    • fully unread (if the topic's last message is greater than MAX lastReadMessageID from (2)
    • fully read (if otherwise)

This has some minor flaws, but it does the job.

PS. Also, some might say that using cookies leaves garbage on the user's computer (I personally hate this), but we found out that an average user tracks about 20 topics tops, so it takes about 10 bytes per topic so it takes less than 200 bytes on the user's harddrive.

Cursorial answered 15/12, 2013 at 11:56 Comment(0)
B
1

Why are you concerned?

I don't see an issue with any I/O for getting the unread threads. It doesn't have to be live. A 15 minute delay based on a cache value would work.

So for unread threads you just

Pseudo code..

$result = SELECT id,viewcount from my_forum_threads

$cache->setThreads($result['id'],$result['viewcount']);

Then on a page load you just get the cache values rather than querying the database again. Its really not a big issue at all.

The average page on my website takes 20 mysql queries. When I cache it is only two to four queries.

Baynebridge answered 18/2, 2010 at 13:29 Comment(2)
What about each user's thread viewed / opened? What you suggest seems to imply that I cache every user information...Cinchonism
You load it once when the session starts or after a certain amount of time has gone by. The cache is in the session, not a big permanent cache.Kinney
H
1

Almost any forum I know of will use some sort of reference timestamp to determine whether or not a thread/message should be regarded as "unread" or not. This timestamp usually is the date/time of the last action you performed on your previous visit to the forum.

So you keep ie. a previous_last_action & last_action timestamp in your user table, last_action is updated on every user action, the previous_last_action column is set once to last_action when logging in (or upon creation of a new session - if you have "remember me" functionality). To determine if a thread/message is unread you would compare that thread/message creation (or update) timestamp with the value in previous_last_action for the user currently logged in.

Holo answered 18/2, 2010 at 16:16 Comment(2)
Better forum software will give you last read data PER forum at a minimum. Extreme forum software actually keeps track of what you've read.Kinney
Nothing new, sorry. Thing is, the last user activity redards the hole forum, not each thread, wich means that for that to work properly I'd still need to track the timestamp regarding each thread.Cinchonism
P
1

A quick answer on how (I think) IPB does it:

All posts older than the config amount (default 30 days) are automatically marked as read. A cronjob prunes these from each user to keep the size manageable.

All posts less than 30 days old are tracked as a JSON entry for each user ID + category. Ex: 12 categories with 1000 active users = maximum of 12,000 rows.

There is an "unread count" field for quick lookups for, say, the Forum Home, or anywhere else just a number is needed.

I could be completely off on the actual MySQL storage. I couldn't find documentation on this, but I dug through the database and saw a table that /looked/ like read/unread threads (table: core_item_markers, for reference). But I am positive on the hybrid age/mysql model.

Pontus answered 10/7, 2013 at 17:19 Comment(0)
H
0

I have read all the answers and I came with an idea that may be the best mix for this subject (no code though).
This idea is a mix of all of your ideas and the little experience I have in programming
Aprox 95% of the users (statistics got from a forum admin and his forum logs) read the forum's topics straight to the last post (or page) and don't go back, read the 1st pages' posts (or just the 1st post) and then go to the last page, or they read the whole thread from the beginning 'til the end and if they turn back they had already read that part. So a good solution would work like this:
I think If we make a store, for each user, for each thread, the timestamp of the last post the user viewed (and, if applicable, the first post the user viewed even if that may not get to be useful) we could get somewhere with this. The system is pretty simple and is almost like phpbb's. It would also be useful to mark the last post we have seen to continue in that one later (instead of being forced to consider all that page as read). And, as each thread has its own id. There's no need to organize like phpbb does.

Hyposensitize answered 24/2, 2011 at 13:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.