Combine multiple sets of rows in SPARQL

I find it hard to describe my problem formally, so let me explain it with an example. The table below is already grouped by 'subject' and 'predicate'.

Rows that share the same 'subject' form a set. I want to combine any two sets that contain exactly the same 'predicate's, sum the 'count' of each matching 'predicate', and count the number of distinct subjects that have that same set.

subject    predicate    count
-----------------------------
s1           p1           1
s1           p2           2
s2           p1           3
s3           p1           2
s3           p2           2

Therefore, what I want from this table is two sets:

{2, (p1, 3), (p2, 4)}, 
{1, (p1,3)} 

where in the first set, 2 indicates there are two subjects (s1 and s3) having this set; (p1,3) is the sum from (s1, p1, 1) and (s3, p1, 2).

So how can I retrieve these sets and store them in Java?

  • How can I do it using SPARQL?

  • Or, if I first load these triples into Java, how can I compute the sets there?


One solution might be to concatenate the predicates and the counts:

SELECT (COUNT(?s) AS ?distinct)
       ?propset
       (group_concat(?count; separator = "\t") AS ?counts)
{
    SELECT ?s
           (group_concat(?p; separator = " ") AS ?propset)
           (group_concat(?c; separator = " ") AS ?count)
    {
        ?s ?p ?c
    } GROUP BY ?s ORDER BY ?s
} GROUP BY ?propset ORDER BY ?propset

The concatenated counts can then be split apart and summed. It works fine on a small dataset, but is very time-consuming on larger ones.
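The splitting step could be sketched in Java roughly as follows. This is only an illustration, not code from the original post: the class and method names (CountDecoupler, sumCounts) are made up, and it assumes that one ?counts cell separates subjects with tabs and each subject's counts with spaces, as in the query above. It also assumes every subject lists its counts in the same predicate order, which SPARQL does not guarantee for group_concat.

```java
import java.util.Arrays;

final class CountDecoupler {

    /**
     * Sums, position by position, the per-subject count lists packed into one ?counts cell.
     * Assumption: subjects are tab-separated, counts within a subject are space-separated.
     */
    static long[] sumCounts(String counts) {
        if (counts.isEmpty()) {
            return new long[0];
        }
        long[] totals = null;
        for (String perSubject : counts.split("\t")) {
            String[] parts = perSubject.trim().split(" ");
            if (totals == null) {
                totals = new long[parts.length];
            }
            if (parts.length != totals.length) {
                throw new IllegalArgumentException("subjects have different numbers of counts");
            }
            for (int i = 0; i < parts.length; i++) {
                totals[i] += Long.parseLong(parts[i]);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // s1 contributes "1 2" and s3 contributes "2 2" for the set {p1, p2}.
        System.out.println(Arrays.toString(sumCounts("1 2\t2 2"))); // prints [3, 4]
    }
}
```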

I think I will give up this weird problem. Thank you very much for answering.

Case answered 14/6, 2012 at 4:44 Comment(0)

Let's start with

select ?predicate (sum(?count) as ?totalcount) 
{
    ?subject ?predicate ?count
}
group by ?predicate

That's the basic bit, but the grouping isn't right (now clarified).

The grouping variable should be like this (hope this is the right syntax):

select ?subject (group_concat(distinct ?p ; separator = ",") AS ?propset)
{
    ?subject ?p ?c
}
group by ?subject

I hope that gives:

subject    propset
------------------
s1          "p1,p2" 
s2          "p1"  
s3          "p1,p2"  

So the final query should be:

select ?predicate (sum(?count) as ?totalcount) 
{
    ?subject ?predicate ?count .
    {
        select ?subject (group_concat(distinct ?p ; separator = ",") AS ?propset)
        {
            ?subject ?p ?c
        }
        group by ?subject
    }
}
group by ?propset ?predicate

Does that work?
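For the Java side of the question, the same two-step grouping can be sketched in memory. This is a rough illustration under assumptions, not part of the original answer: the names PredicateSetGrouper, Row and Group are hypothetical, and the rows are assumed to be already loaded into a list. The first pass works out each subject's predicate set; the second pass sums the counts per set and predicate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class PredicateSetGrouper {

    /** One input row (subject, predicate, count); a hypothetical holder type. */
    static final class Row {
        final String subject;
        final String predicate;
        final long count;

        Row(String subject, String predicate, long count) {
            this.subject = subject;
            this.predicate = predicate;
            this.count = count;
        }
    }

    /** Result for one distinct predicate set. */
    static final class Group {
        int subjects;                                      // subjects sharing this set
        final Map<String, Long> totals = new TreeMap<>();  // summed count per predicate
    }

    static Map<Set<String>, Group> group(List<Row> rows) {
        // Pass 1: collect the predicate set of every subject.
        Map<String, Set<String>> setOfSubject = new TreeMap<>();
        for (Row r : rows) {
            setOfSubject.computeIfAbsent(r.subject, k -> new TreeSet<>()).add(r.predicate);
        }

        // One Group per distinct set; count the subjects that share it.
        Map<Set<String>, Group> groups = new LinkedHashMap<>();
        for (Set<String> set : setOfSubject.values()) {
            groups.computeIfAbsent(set, k -> new Group()).subjects++;
        }

        // Pass 2: add each row's count to the group of its subject's set.
        for (Row r : rows) {
            Group g = groups.get(setOfSubject.get(r.subject));
            g.totals.merge(r.predicate, r.count, Long::sum);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("s1", "p1", 1));
        rows.add(new Row("s1", "p2", 2));
        rows.add(new Row("s2", "p1", 3));
        rows.add(new Row("s3", "p1", 2));
        rows.add(new Row("s3", "p2", 2));
        // Prints {2, {p1=3, p2=4}} and then {1, {p1=3}}
        group(rows).forEach((set, g) ->
                System.out.println("{" + g.subjects + ", " + g.totals + "}"));
    }
}
```

Using the sorted predicate set itself as the map key avoids depending on the order in which a concatenated string happens to list the predicates.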

Albaalbacete answered 14/6, 2012 at 9:0 Comment(7)
Yes, I mean 's1 and s3 have the same set'. Sorry for the typo; I have fixed it (and changed the predicate value of s2 to make it clearer). However, the result I want is 'the set of sets'. Two sets, say {p1,p2} and {p1}, cannot be combined since they are different, so we cannot simply sum the values of every matching predicate. Thank you for answering :) - Case
Ah, got it. That might be hard, but I've added a second pass. - Albaalbacete
It's very close to what I want :) but it keeps running and seems like it won't give any result. Besides, if this query succeeds, how can I retrieve those sets from the resulting table? I can see that the resulting table contains two columns, but it gives no information about the sets. The results will be stored in Java, so actually I just want these results. Thank you again. - Case
If you want the sets, just add ?propset to the select. - Albaalbacete
But totalcount does not work. Following your hint, I also used concat on the 'count' column and counted 'subject' when grouping by propset. Afterwards I can split the concatenated counts. It works fine on a small dataset, but is really a disaster on large ones. - Case
Oh yes, I imagine this will be seriously slow on large datasets. You're doing two complete scans of the data: one to work out how the data is partitioned, the second to do the count. I don't see a way to avoid that. However, could you precalculate the partitioning bit? - Albaalbacete
The first table is already a calculated one; I tried to precalculate this table, use Jena ARQ to construct a new RDF model, and then store it using TDB. But constructing it is still intolerably slow, possibly because the ResultSet in Jena retrieves results dynamically. I have decided to give up. Thank you very much... - Case
