SQL classification

Asked 16/2, 2012 at 9:1 Answered 16/2, 2012 at 18:46

Solved mysql sql algorithm classification document-classification

I have a system that tracks what documents users view. Each document has its ID and a cluster that it belongs to. My system tracks the session ID and the number of views. I would now like to construct an SQL query which would give me two columns - the session ID and the classified cluster. The algorithm for classification is simple:

1. select all sessions
2. for each session S
   I. prepare an accumulator ACC for clusters
   II. select the clusters of viewed documents for this session
   III. for each cluster C accumulate the cluster count ( ACC[C]++ )
   IV. find the maximum in the ACC. That is the cluster that the session was classified to

The table structures are as follows, I'm using MySQL 5.5.16:

Session

+-------+-----------+--------------------+
| ID    | sessionID | classified_cluster |
+-------+-----------+--------------------+

SessionDocument

+-------+-----------+------------+
| ID    | sessionID | documentID |
+-------+-----------+------------+

Cluster

+-------+-------+
| ID    | label |
+-------+-------+

ClusterDocument

+-------+-----------+------------+
| ID    | clusterID | documentID |
+-------+-----------+------------+

So basically, I want to select the clusters for each session, count the occurrence of each cluster for viewed documents and find the maximum occurrence. Then the ID of the cluster that occurred the most, is the result for the session therefore the final result set holds the session ID and the most occurred cluster:

Result

+-----------+-----------------------+
| sessionID | classifiedIntoCluster |
+-----------+-----------------------+

I managed to get the clusters of viewed documents for each session (step 2/II.) with this query:

SELECT SD.session_id, CD.cluster_id 
FROM cluster_document AS CD 
INNER JOIN session_document AS SD 
ON CD.document_id = SD.document_id
WHERE session_id IN (SELECT session_id FROM session)

I'm having trouble figuring out the rest. Is this even possible with nested SELECT queries? Should I use a cursor, and if yes, could someone show an example with a cursor? Any help will be much appreciated.

EDIT #1: added a C# implementation, MySQL dump and expected result

C# implementation

    private void ClassifyUsers() {
        int nClusters = Database.SelectClusterCount(); //get number of clusters
        DataSet sessions = Database.SelectSessions(); //get all sessions
        foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session
           int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster
           string s_id = session["session_id"].ToString();
           DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session

           foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster
               int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1;
               acc[c]++; //accumulate the cluster count
           }

           //find the maximum in the accumulator -> that is the most relevant cluster
           int max = 0;
           for (int j = 0; j < acc.Length; j++) {
               if (acc[j] >= acc[max]) max = j;
           }
           max++;
           Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster
       }
    }

Table structure, test data and expected result

Table structure and test data

Expected result

EDIT #2: added a smaller data set and further algorithm walkthrough

Here is a smaller data set:

SESSION

session id    |  cluster
abc                 0
def                 0
ghi                 0
jkl                 0       
mno                 0

CLUSTER

cluster_id  | label
1               A
2               B
3               C
4               D
5               E

SESSION_DOCUMENT

id      | session_id    |   document_id
1           abc             1
2           def             5
3           jkl             3
4           ghi             4
5           mno             2
6           def             2
7           abc             5
8           ghi             3

CLUSTER_DOCUMENT

id      | cluster_id    |   document_id
1           1                  2
2           1                  3
3           2                  5
4           3                  5
5           3                  1
6           4                  3
7           5                  2
8           5                  4

Algorithm in detail

Step 1: get clusters for documents viewed by the session

session_id  |  cluster_id   | label     | document_id   
abc             3               C           1
abc             2               B           5
abc             3               C           5
-----
def             2               B           5
def             3               C           5   
def             1               A           2
def             5               E           2   
----
ghi             5               E           4   
ghi             1               A           3   
ghi             4               D           3   
----
jkl             1               A           3   
jkl             4               D           3   
----
mno             1               A           2
mno             5               E           2

Step 2: count occurrence of clusters

session_id |    cluster_id  | label |   occurrence
abc             3               C           2   <--- MAX
abc             2               B           1
----
def             2               B           1
def             3               C           1   
def             1               A           1
def             5               E           1   <--- MAX
----
ghi             5               E           1   
ghi             1               A           1   
ghi             4               D           1   <--- MAX
----
jkl             1               A           1   
jkl             4               D           1   <--- MAX
----
mno             1               A           1   
mno             5               E           1   <--- MAX

Step 3 (end result): find maximum occurred cluster for each session (see above) and construct the final result set (session_id, cluster_id):

session_id |    cluster_id  
abc                 3           
def                 5
ghi                 4
jkl                 4
mno                 5

EDIT #3: Accepted answer clarification

Both given answers are correct. They both provide a solution for the problem. I gave Mosty Mostacho the accepted answer because he delivered the solution first and provided another version of the solution with a VIEW. The solution from mankuTimma is of the same quality as Mosty Mostacho's solution. Therefore, we have two equally good solutions, I just picked Mosty Mostacho because he was first.

Thanks to both of them for their contributions. .

Infusible answered 16/2, 2012 at 9:1 Comment(9)

If you could provide sample data for your current tables and your expected result from your sample data it will be easier for us – Edmundedmunda 16/2, 2012 at 10:45

@MostyMostacho: I added some sample data and expected results. – Infusible 16/2, 2012 at 11:22

+1 for a well written question – Adelladella 16/2, 2012 at 11:27

Although having the database dump is very useful, I was expecting a smaller set of data, easier to analyze (there are over 40K inserts!). The example data should be minimal, for example – Edmundedmunda 16/2, 2012 at 11:31

@MostyMostacho: I added an illustrational set of sample data and a further algorithm explanation using that sample data. I hope it is much clearer now, what I want to achieve. – Infusible 16/2, 2012 at 12:44

@brozo: 1) Isn't a mno missing in step 1 2) What criteria are you using to select a max when there are many equal maximum values? EG: ghi It looks like you pick a random cluster, right? – Edmundedmunda 16/2, 2012 at 17:46

@MostyMostacho: for equal maximum values I just take the most recent one (the one with a larger ID, timestamp etc.). And yes, there is one "mno" missing. I'll fix that in the question. – Infusible 16/2, 2012 at 20:35

Ok, then I happened to go in the right direction in my answer below. Bear in mind that ghi has not been chosen under the max(id) criteria in your example as it should be the number 5, as you can see in my answer. Give it a try – Edmundedmunda 16/2, 2012 at 20:44

OK, I'll try it tomorrow on my work machine, because I don't have the test environment available right now. I'll also try it out on a much larger data set, and see if it suits my performance needs. But seeing that your solution gives the correct result, I'm staying positive. If it works, I'll give you an accepted answer and if the performance is low, maybe we can discuss how to optimize that query. – Infusible 16/2, 2012 at 20:48

Well I have some doubts about how to choose the an occurrence when there are many equal, but looking at the C# code it seems this selection is non-deterministic.

Now, given the sampla data Step 2 actually results in:

+------------+------------+-------+------------+
| SESSION_ID | CLUSTER_ID | LABEL | OCCURRENCE |
+------------+------------+-------+------------+
| abc        |          3 | C     |          2 |
| def        |          1 | A     |          1 |
| def        |          2 | B     |          1 |
| def        |          3 | C     |          1 |
| def        |          5 | E     |          1 |
| ghi        |          1 | A     |          1 |
| ghi        |          4 | D     |          1 |
| ghi        |          5 | E     |          1 |
| jkl        |          1 | A     |          1 |
| jkl        |          4 | D     |          1 |
| mno        |          1 | A     |          1 |
| mno        |          5 | E     |          1 |
+------------+------------+-------+------------+

So, continuing with this data, I get the session_id and the max(cluster_id) for that session id, resulting in:

+------------+------------+
| SESSION_ID | CLUSTER_ID |
+------------+------------+
| abc        |          3 |
| def        |          5 |
| ghi        |          5 |
| jkl        |          4 |
| mno        |          5 |
+------------+------------+

The max(cluster_id) is just there to perform that non-deterministic selection. This is the query:

select s1.session_id, max(s1.cluster_id) as cluster_id from (
  select sd.session_id, cd.cluster_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
) as s1
left join (
  select sd.session_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
) as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id

Maybe adding a view would improve performance (replacement of above query):

create view MaxOccurrences as (
  select sd.session_id, cd.cluster_id, count(*) as Occurrence
  from session_document sd
  join cluster_document cd
  on sd.document_id = cd.document_id
  join cluster c
  on c.cluster_id = cd.cluster_id
  group by sd.session_id, cd.cluster_id, c.label
);

select s1.session_id, max(s1.cluster_id) as cluster_id
from MaxOccurrences as s1
left join MaxOccurrences as s2
on s1.session_id = s2.session_id and s1.occurrence < s2.occurrence
where s2.occurrence is null
group by s1.session_id

Let me know if it works.

Edmundedmunda answered 16/2, 2012 at 18:46 Comment(1)

This solution now works. Both of them - the nested SELECT version and the VIEW version. The time for both on a large data set is 0.686s, which is quite impressive, so the VIEW solution at this level doesn't bring any improvement in performance. I will mark this as the accepted answer because you were first and offered an optimized version of the query. – Infusible 17/2, 2012 at 6:43

If I understand your problem correctly, for every session you want the cluster with the most number of document views. So, here you go The below query returns the maximum count or the number of occurrences of a particular cluster id for every session id.

SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID

Then join the above result with the sub-query(where I am calculating the count) again to get the cluster id of the maximum occurrence. In case there are two cluster id's with same number of occurrences I am using the cluster id with maximum value. I have tested on your data and it works. Also, now this query should work on all databases.

SELECT B.SESSION_ID, MAX(CNT2.CLUSTER_ID) FROM 
(SELECT SESSION_ID,MAX(CNT) MAX_CNT
FROM (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT1
GROUP BY SESSION_ID) B
JOIN (SELECT SD.SESSION_ID, CD.CLUSTER_ID,COUNT(*) AS CNT
FROM CLUSTER_DOCUMENT AS CD 
INNER JOIN SESSION_DOCUMENT AS SD 
ON CD.DOCUMENT_ID = SD.DOCUMENT_ID
GROUP BY SD.SESSION_ID,CD.CLUSTER_ID) CNT2
ON B.SESSION_ID = CNT2.SESSION_ID
AND B.MAX_CNT = CNT2.CNT
GROUP BY B.SESSION_ID

Vermiform answered 16/2, 2012 at 10:31 Comment(5)

The result of this query doesn't match the expected result that my C# implementation returns. But I think it's quite close. – Infusible 16/2, 2012 at 11:27

Looks like my query used to give you the minimum count instead of the maximum. Just got a little lazy and did not check my code well. I have changed it. Hopefully it should help. – Vermiform 17/2, 2012 at 5:48

Your edited query works very well - 0.686s on the large data set. Quite impressive. – Infusible 17/2, 2012 at 6:36

However, Mosty Mostacho delivered the correct solution to the problem before you, so I am giving the accepted answer to him. I still upvoted your solution though, because it works and its performance is the same as Mosty Mostacho's. – Infusible 17/2, 2012 at 6:43

Thanks for the upvote. But seriously I answer here because I like working with queries. – Vermiform 18/2, 2012 at 6:15