Analysis of different Sets and optimizations. Best approach?

For the last few days, I've tried to accomplish the following task regarding the analysis of a Set of Objects, and the solutions I've come up with rely heavily on Memory (obtaining OutOfMemory exceptions in some cases) or take an incredible long time to process. I now think is a good idea to post it here, as I'm out of ideas. I will explain the problem in detail, and provide the logic I've followed so far.

Scenario:

First, we have an object, which we'll name Individual, that contains the following properties:

A date
A Longitude - Latitude pair

Second, we have another object, which we'll name Group, which definition is: A set of Individuals that, together, match the following conditions:

All individuals in the set have a date which, within each other, is not superior to 10 days. This means that all of the Individuals, if compared within each other, don´t differ in 10 days between each other.
The distance between each object is less than Y meters.

A group can have N>1 individuals, as long as each of the Individuals match the conditions within each other. All individuals are stored in a database. All groups would also be stored in a database.

The task:

Now, consider a new individual. The system has to check if the new individual:

Belongs to an existing Group or Groups
The Individual now forms one or multiple new Groups with other Individuals.

Notes:

The new individual could be in multiple existing groups, or could create multiple new groups.
SubGroups of Individuals are not allowed, for example if we have a Group that contains Individuals {A,B,C}, there cannot exist a group that contains {A,B}, {A,C} or {B,C}.

Solution (limited in processing time and Memory)

First, we filter the database with all the Individuals that match the initial conditions. This will output a FilteredIndividuals enumerable, containing all the Individuals that we know will form a Group (of 2) with the new one.

Briefly, a Powerset is a set that contains all the possible subsets of a particular set. For example, a powerset of {A,B,C} would be: {[empty], A, B, C, AB, AC, BC, ABC}

Note: A powerset will output a new set with 2^N combinations, where N is the length of the originating set.

The idea with using powersets is the following:

First, we create a powerset of the FilteredIndividuals list. This will give all possible combinations of Groups within the FilteredIndividuals list. For analysis purposes and by definition, we can omit all the combinations that have less than 2 Individuals in them.
We Check if each of the Individuals in a combination of the powerset, match the conditions within each other. If they match, that means that all of the Individuals in that combination form a Group with the new Individual. Then, to avoid SubGroups, we can eliminate all of the subsets that contain combinations of the Checked combination. I do this by creating a powerset of the Checked combination, and then eliminating the new powerset from the original one.
At this point, we have a list of sets that match the conditions to form a Group.

Before formally creating a Group, I compare the DB with other existing Groups that contain the same elements as the new sets: If I find a match, I eliminate the newly created set, and add the new Individual to the old Group. If I don't find a match, it means they are new Groups. So I add the new Individual to the sets and finally create the new Groups.

This solution works well when the FilteredIndividuals enumerable has less than 52 Individuals. After that, Memory exceptions are thrown (I know this is because of the maximum size allowed for data types, but incrementing such size is not of help with very big sets. For your consideration, the top amount of Individuals that match the conditions I've found is 345).

Note: I have access to the definition of both entities. If there's a new property that would reduce the processing time, we can add it.

I'm using the .NET framework with C#, but if the language is something that requires changing, we can accept this as long as we can later convert the results to object understandable by our main system.

All individuals in the set have a date which, within each other, is not superior to 10 days. This means that all of the Individuals, if compared within each other, don´t differ in 10 days between each other. The distance between each object is less than Y meters.

So your problem becomes how to cluster these points in 3-space, a partitioning where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is an appropriately scaled variant of the Manhattan distance. Specifically you scale Z so that 10*Z days equals your maximum distance of Y meters.

One possible shortcut would be to use the divide et impera and classify your points (Individuals) in buckets, Y meters wide and 10 days high. You do so by dividing their coordinates by Y and by 10 days (you can use Julian dates for that). If an individual is in bucket H { X=5, Y=3, Z=71 }, then it cannot be than any individual in buckets with X < (5-1) or X > (5+1), Y < (3-1) or Y > (3+1), or Z < (71-1) or Z > (71+1), is in his same group, because their distance would certainly be above the threshold. This means that you can quickly select a subset of 27 "buckets" and worry with only those individuals in there.

At this point you can enumerate the possible groups your new individual can be in (if you use a database back end, they would be SELECT groups.* FROM groups JOIN iig USING (gid) JOIN individuals USING (uid) WHERE individuals.bucketId IN ( @bucketsId )), and compare those with the group your individual may form from other individuals (SELECT individuals.id WHERE bucketId IN ( @bucketsId ) AND ((x-@newX)*(x-@newX)+(y-@newY)*(y-@newY)) < @YSquared AND ABS(z - @newZ) < 10)).

This approach is not very performant (it depends on the database, and you'll want an index on bucketId at a minimum), but it has the advantage of using as little memory as possible.

On some database backends with geographical extensions, you might want to use the native latitude and longitude functions instead of implicitly converting to meters.

Recommended topics

Hot tags