Terribly slow SQL query with COUNT and GROUP BY on two columns

Asked 20/2, 2010 at 17:21 Answered 20/2, 2010 at 17:59

I'm archiving this web forum, which normally gets purged about once a week. So I'm screen scraping it, and storing it into my database (PostgreSQL).

I also do a little analysis on the data, with some graphs for users to enjoy, like what time of day is the forum most active, and so forth.

So I have a posts table, like so:

   Column   |            Type
------------+------------------------------
 id         | integer
 body       | text
 created_at | timestamp without time zone
 topic_id   | integer
 user_name  | text
 user_id    | integer

And I now want to have a post count for each user, for my little top 10 posters table.

I came up with this:

SELECT user_id, user_name, count(*)
FROM posts
GROUP BY user_id, user_name
ORDER BY count DESC LIMIT 10

Which turns out to be very slow. 9 seconds, with just about 300 000 rows in the posts table at the moment.

It takes only half a second, if I group on just one column, but I need both.

I'm rather new to relational databases, and SQL, so I'm not quite sure if this is right, or just how am I doing it wrong?

Outwardly answered 20/2, 2010 at 17:21 Comment(0)

There's probably only one user with a particular ID, so max(user_name) should equal user_name. Then you can group on a single column, which your post indicates works faster:

SELECT user_id, max(user_name), count(*)
FROM posts
GROUP BY user_id

Prima answered 20/2, 2010 at 17:22 Comment(1)

+1 yup, that is the way to do it :). In the sql spec 1999 and up, it is actually allowed to list un-aggregated columns in the SELECT list that do not also appear in the GROUP BY list, as long as those columns are functionally dependent upon the GROUP BY list. Shameless plug: rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html – Cotto 20/2, 2010 at 17:26

also could use having count > 0 so you only return true

Daren answered 20/2, 2010 at 17:59 Comment(0)

Recommended topics

Hot tags