Analysis of different Sets and optimizations. Best approach?
Asked Answered
R

1

6

For the last few days, I've tried to accomplish the following task regarding the analysis of a Set of Objects, and the solutions I've come up with rely heavily on Memory (obtaining OutOfMemory exceptions in some cases) or take an incredible long time to process. I now think is a good idea to post it here, as I'm out of ideas. I will explain the problem in detail, and provide the logic I've followed so far.

Scenario:

First, we have an object, which we'll name Individual, that contains the following properties:

  1. A date
  2. A Longitude - Latitude pair

Second, we have another object, which we'll name Group, which definition is: A set of Individuals that, together, match the following conditions:

  • All individuals in the set have a date which, within each other, is not superior to 10 days. This means that all of the Individuals, if compared within each other, don´t differ in 10 days between each other.

  • The distance between each object is less than Y meters.

A group can have N>1 individuals, as long as each of the Individuals match the conditions within each other. All individuals are stored in a database. All groups would also be stored in a database.

The task:

Now, consider a new individual. The system has to check if the new individual:

  1. Belongs to an existing Group or Groups

  2. The Individual now forms one or multiple new Groups with other Individuals.

Notes:

  • The new individual could be in multiple existing groups, or could create multiple new groups.

  • SubGroups of Individuals are not allowed, for example if we have a Group that contains Individuals {A,B,C}, there cannot exist a group that contains {A,B}, {A,C} or {B,C}.

Solution (limited in processing time and Memory)

First, we filter the database with all the Individuals that match the initial conditions. This will output a FilteredIndividuals enumerable, containing all the Individuals that we know will form a Group (of 2) with the new one.

Briefly, a Powerset is a set that contains all the possible subsets of a particular set. For example, a powerset of {A,B,C} would be: {[empty], A, B, C, AB, AC, BC, ABC}

Note: A powerset will output a new set with 2^N combinations, where N is the length of the originating set.

The idea with using powersets is the following:

  • First, we create a powerset of the FilteredIndividuals list. This will give all possible combinations of Groups within the FilteredIndividuals list. For analysis purposes and by definition, we can omit all the combinations that have less than 2 Individuals in them.

  • We Check if each of the Individuals in a combination of the powerset, match the conditions within each other. If they match, that means that all of the Individuals in that combination form a Group with the new Individual. Then, to avoid SubGroups, we can eliminate all of the subsets that contain combinations of the Checked combination. I do this by creating a powerset of the Checked combination, and then eliminating the new powerset from the original one.

  • At this point, we have a list of sets that match the conditions to form a Group.

Before formally creating a Group, I compare the DB with other existing Groups that contain the same elements as the new sets: If I find a match, I eliminate the newly created set, and add the new Individual to the old Group. If I don't find a match, it means they are new Groups. So I add the new Individual to the sets and finally create the new Groups.

This solution works well when the FilteredIndividuals enumerable has less than 52 Individuals. After that, Memory exceptions are thrown (I know this is because of the maximum size allowed for data types, but incrementing such size is not of help with very big sets. For your consideration, the top amount of Individuals that match the conditions I've found is 345).

Note: I have access to the definition of both entities. If there's a new property that would reduce the processing time, we can add it.

I'm using the .NET framework with C#, but if the language is something that requires changing, we can accept this as long as we can later convert the results to object understandable by our main system.

Russianize answered 27/7, 2016 at 16:16 Comment(15)
"All individuals in the set have a date which, within each other, is not superior to X days" - what are you talking about? "Within each other"? "Not superior to X days"? These words do not make sense put together this way in this context.Last
I went into more detail @Last Thanks for the heads up!Russianize
I guess I don't get the goal: holding the powerset gets (too) expensive - but what is the goal in the first place? Powerset "without proper subsets"?Overdone
@Overdone It's to create "Groups" of "Individual", without being replicates or subgroups, and assuming the Individuals would be analyzed in a FIFO manner.Russianize
[goal is] to create "Groups" of "Individual", with [constraints met] - exactly what I don't get/accept: what use are Groups? What "operations" do the have to support? FIFO 1st in, 1st out? Fixed upper/lower limits on size? Or FCFS - 1st come, 1st "served"/processed?Overdone
Groups are requirements of the Business Logic. They are table in a database and after their creation, they server another purpose. FCFS is an acronym more suited to this problem.Russianize
If you have A, B, C, and D with {A, B}, {B, C}, {C, D} and {D, A} spaced close enough to be in groups like shown, but not in groups containing more that two of A, B, C, D (think square with |side|=Y), does any combination of groups that contains all members do, or are all groups required? (Still trying to figure out whether the difference can be more than factor 4.)Overdone
As long as ALL of the Individuals in a group match the criteria, the groups should be created or modified. AB BC CD DA In this scenario, four groups would be created. If, for example, DB would also have matched, that would mean that when D is inserted, a modification to group AB would had to be done, transforming it to ABD. We can't have subgroups, meaning that if ABD exists, AB and DB should not.Russianize
What's the scale of this task? That is, how many individuals must it handle, and how many Groups?Atlantis
Also, is the Date a date-only or a date-time?Atlantis
It must handle individuals as they enter the system. No fixed scale. The datatype we are using is Datetime. But all of our calculations are and must be based on the amount of days.Russianize
Thanks @OverdoneRussianize
Adding your name so you could see this @AtlantisRussianize
What is the upper limit? If you want a orblem to fit into memory, we have to know how big it is and how much memory you have.Atlantis
32GB RAM, multiple terabytes of disk space. @AtlantisRussianize
C
3

All individuals in the set have a date which, within each other, is not superior to 10 days. This means that all of the Individuals, if compared within each other, don´t differ in 10 days between each other. The distance between each object is less than Y meters.

So your problem becomes how to cluster these points in 3-space, a partitioning where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is an appropriately scaled variant of the Manhattan distance. Specifically you scale Z so that 10*Z days equals your maximum distance of Y meters.

One possible shortcut would be to use the divide et impera and classify your points (Individuals) in buckets, Y meters wide and 10 days high. You do so by dividing their coordinates by Y and by 10 days (you can use Julian dates for that). If an individual is in bucket H { X=5, Y=3, Z=71 }, then it cannot be than any individual in buckets with X < (5-1) or X > (5+1), Y < (3-1) or Y > (3+1), or Z < (71-1) or Z > (71+1), is in his same group, because their distance would certainly be above the threshold. This means that you can quickly select a subset of 27 "buckets" and worry with only those individuals in there.

At this point you can enumerate the possible groups your new individual can be in (if you use a database back end, they would be SELECT groups.* FROM groups JOIN iig USING (gid) JOIN individuals USING (uid) WHERE individuals.bucketId IN ( @bucketsId )), and compare those with the group your individual may form from other individuals (SELECT individuals.id WHERE bucketId IN ( @bucketsId ) AND ((x-@newX)*(x-@newX)+(y-@newY)*(y-@newY)) < @YSquared AND ABS(z - @newZ) < 10)).

This approach is not very performant (it depends on the database, and you'll want an index on bucketId at a minimum), but it has the advantage of using as little memory as possible.

On some database backends with geographical extensions, you might want to use the native latitude and longitude functions instead of implicitly converting to meters.

Cece answered 5/9, 2016 at 21:12 Comment(2)
Hi @Iserni, your answer brings something we have thought of but discarded, due to limited mathematical and geometry knowledge. I'm going to be more honest than what I should here, but I admit I have some trouble understanding your answer at first glance, and I will have to read it more than once and also read the links you've added in the answer. One more question, would you be willing to accept a payment to "properly" solve this? I'd of course give you more details on the issue and on everything, before accepting. Contact me at rainiermallol[at]gmail[dot]com if interested.Russianize
Say n is the order of magnitude of points you're considering. First, partition your space into floor(sqrt(n)) pieces of equal length. Each of your points goes in exactly one of these squares, and on average there will be one point per square. If your points aren't randomly distributed in space, you might want to modify this to account for your distribution. Now, each time you add a point P, you only need to check points in the squares which could possibly be within Y meters of P.Carlyn

© 2022 - 2024 — McMap. All rights reserved.