I have a system that tracks what documents users view. Each document has its ID and a cluster that it belongs to. My system tracks the session ID and the number of views. I would now like to construct an SQL query which would give me two columns - the session ID and the classified cluster. The algorithm for classification is simple:
1. select all sessions
2. for each session S
I. prepare an accumulator ACC for clusters
II. select the clusters of viewed documents for this session
III. for each cluster C accumulate the cluster count ( ACC[C]++ )
IV. find the maximum in the ACC. That is the cluster that the session was classified to
The table structures are as follows, I'm using MySQL 5.5.16:
Session
+-------+-----------+--------------------+
| ID | sessionID | classified_cluster |
+-------+-----------+--------------------+
SessionDocument
+-------+-----------+------------+
| ID | sessionID | documentID |
+-------+-----------+------------+
Cluster
+-------+-------+
| ID | label |
+-------+-------+
ClusterDocument
+-------+-----------+------------+
| ID | clusterID | documentID |
+-------+-----------+------------+
So basically, I want to select the clusters for each session, count the occurrence of each cluster for viewed documents and find the maximum occurrence. Then the ID of the cluster that occurred the most, is the result for the session therefore the final result set holds the session ID and the most occurred cluster:
Result
+-----------+-----------------------+
| sessionID | classifiedIntoCluster |
+-----------+-----------------------+
I managed to get the clusters of viewed documents for each session (step 2/II.) with this query:
SELECT SD.session_id, CD.cluster_id
FROM cluster_document AS CD
INNER JOIN session_document AS SD
ON CD.document_id = SD.document_id
WHERE session_id IN (SELECT session_id FROM session)
I'm having trouble figuring out the rest. Is this even possible with nested SELECT queries? Should I use a cursor, and if yes, could someone show an example with a cursor? Any help will be much appreciated.
EDIT #1: added a C# implementation, MySQL dump and expected result
C# implementation
private void ClassifyUsers() {
int nClusters = Database.SelectClusterCount(); //get number of clusters
DataSet sessions = Database.SelectSessions(); //get all sessions
foreach (DataRow session in sessions.Tables[0].Rows) { //foreach session
int[] acc = new int[nClusters]; //prepare an accumulator for each known cluster
string s_id = session["session_id"].ToString();
DataSet sessionClusters = Database.SelectSessionClusters(s_id); //get clusters for this session
foreach (DataRow cluster in sessionClusters.Tables[0].Rows) { //for each cluster
int c = Convert.ToInt32(cluster["cluster_id"].ToString()) - 1;
acc[c]++; //accumulate the cluster count
}
//find the maximum in the accumulator -> that is the most relevant cluster
int max = 0;
for (int j = 0; j < acc.Length; j++) {
if (acc[j] >= acc[max]) max = j;
}
max++;
Database.UpdateSessionCluster(s_id, max); //update the session with its new assigned cluster
}
}
Table structure, test data and expected result
EDIT #2: added a smaller data set and further algorithm walkthrough
Here is a smaller data set:
SESSION
session id | cluster
abc 0
def 0
ghi 0
jkl 0
mno 0
CLUSTER
cluster_id | label
1 A
2 B
3 C
4 D
5 E
SESSION_DOCUMENT
id | session_id | document_id
1 abc 1
2 def 5
3 jkl 3
4 ghi 4
5 mno 2
6 def 2
7 abc 5
8 ghi 3
CLUSTER_DOCUMENT
id | cluster_id | document_id
1 1 2
2 1 3
3 2 5
4 3 5
5 3 1
6 4 3
7 5 2
8 5 4
Algorithm in detail
Step 1: get clusters for documents viewed by the session
session_id | cluster_id | label | document_id
abc 3 C 1
abc 2 B 5
abc 3 C 5
-----
def 2 B 5
def 3 C 5
def 1 A 2
def 5 E 2
----
ghi 5 E 4
ghi 1 A 3
ghi 4 D 3
----
jkl 1 A 3
jkl 4 D 3
----
mno 1 A 2
mno 5 E 2
Step 2: count occurrence of clusters
session_id | cluster_id | label | occurrence
abc 3 C 2 <--- MAX
abc 2 B 1
----
def 2 B 1
def 3 C 1
def 1 A 1
def 5 E 1 <--- MAX
----
ghi 5 E 1
ghi 1 A 1
ghi 4 D 1 <--- MAX
----
jkl 1 A 1
jkl 4 D 1 <--- MAX
----
mno 1 A 1
mno 5 E 1 <--- MAX
Step 3 (end result): find maximum occurred cluster for each session (see above) and construct the final result set (session_id, cluster_id):
session_id | cluster_id
abc 3
def 5
ghi 4
jkl 4
mno 5
EDIT #3: Accepted answer clarification
Both given answers are correct. They both provide a solution for the problem. I gave Mosty Mostacho the accepted answer because he delivered the solution first and provided another version of the solution with a VIEW
. The solution from mankuTimma is of the same quality as Mosty Mostacho's solution. Therefore, we have two equally good solutions, I just picked Mosty Mostacho because he was first.
Thanks to both of them for their contributions. .
mno
missing in step 1 2) What criteria are you using to select a max when there are many equal maximum values? EG:ghi
It looks like you pick a random cluster, right? – Edmundedmundaghi
has not been chosen under the max(id) criteria in your example as it should be the number5
, as you can see in my answer. Give it a try – Edmundedmunda