I have a large list of GUIDs and other data that I need to pull a subset from, what is the quickest way to do this?

I have a list of audit data from Dynamics CRM 2013 that I have deserialised and stuck into a HashSet, defined as:

private class AuditCache
{
    public Guid ObjectId;
    public int HistoryId;
    public DateTime? DateFrom;
    public DateTime? DateTo;
    public string Value;
};
private HashSet<AuditCache> _ac = new HashSet<AuditCache>();

I add the data like this (from a SQL Server recordset):

_ac.Add(new AuditCache{ 
    ObjectId = currentObjectId,
    HistoryId = Convert.ToInt32(dr["HistoryId"]),
    DateTo = Convert.ToDateTime(dr["CreatedOn"]),
    Value = value});

I end up with roughly half a million records.

Next I need to iterate through every Guid and pull out the subset of data from my audit data that matches. I have a list of the Guids that I generate elsewhere and there are around 300,000 to process. I store them in this:

var workList = new Dictionary<Guid, DateTime>();

...and iterate through them like this:

foreach (var g in workList)

Then I need to do this to pull out the subset for each Guid:

List<AuditCache> currentSet = _ac.Where(v => v.ObjectId == g.Key).ToList();

But it's slow.

It takes around 1 minute to populate my initial audit data list, but it will take hours (I never ran it to completion, so this is based on the time to process 1% of the data) to pull out each set, process it and squirt it back into a database table.

Stepping through the code I can see the bottleneck seems to be pulling out the subset from my list for each Guid. So my question is, is there a better/ more efficient way (architecture?) to store/ retrieve my data set?

One thing to note, I know Guids are inherently slow to index/ search but I am pretty much constrained to using them due to the way Dynamics CRM works. I guess I could create a Dictionary to look up Guids and "convert" them to integer values, or something along those lines, but I am not convinced that would help much?
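If I did try that, the mapping would presumably look something like this rough sketch (the field and method names are just placeholders; it only speeds up the comparison itself, not the scan over the whole set):

// Hypothetical sketch: assign each Guid a small sequential int once,
// so later comparisons and lookups use int keys instead of Guid keys.
private Dictionary<Guid, int> _guidToInt = new Dictionary<Guid, int>();

private int GetOrAddId(Guid g)
{
    int id;
    if (!_guidToInt.TryGetValue(g, out id))
    {
        id = _guidToInt.Count;   // next free integer id
        _guidToInt.Add(g, id);
    }
    return id;
}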

Edit

Okay, I tested the three solutions using my live data (371,901 Guids) and these are the results as an average time per 1,000 Guids. Note that this includes the processing/ INSERT to SQL Server so it isn't a proper benchmark.

Method #0 - List with Lambda    ~30.00s per 1,000 rows (I never benchmarked this precisely)
Method #1 - IntersectWith        40.24s per 1,000 rows (cloning my Hashset spoilt this)
Method #2 - BinarySearch          3.20s per 1,000 rows
Method #3 - Generic Dictionary    2.19s per 1,000 rows

On the basis of this I am probably going to rewrite my code from scratch, as I think the whole approach I was taking was incorrect.

However, this has been a very useful learning exercise and many thanks to everyone who contributed. I am going to accept the BinarySearch as the correct answer as it does what I wanted and is much faster than my original code.

Just to be clear here, the IntersectWith is indeed "smoking" fast, but it doesn't work for my particular problem as I need to constantly go back to my original hashset.

Populate answered 7/8, 2014 at 13:34 Comment(7)
Is it an option to take the workList (all 300k of them) to the database (in a temp or staging table) and make the join there? How many times a day do you have to make this lookup?Koy
The problem here is that the processing is not set-based and doesn't naturally fit into SQL. Originally I had a pure SQL process to achieve the same end result. It works fine but it takes around 2-3 hours per attribute to process. This is obviously going to be a problem as I have around a dozen attributes to process every night. I thought that if I had everything in memory then it would be much quicker. It is quicker, but not nearly quick enough.Populate
Have you tried using Parallel.Foreach to process the workList?Dmitri
Adding multithreading is certainly something I was planning to do, but I wanted to get the base performance down from hours to minutes (or ideally seconds) first.Populate
See my answer. HashSet.IntersectWith will be seconds.Septilateral
@Blam - Cool, I am going to try setting this up a couple of different ways (so far) and test it. I will come back and accept an answer when I have something that works better than my current solution :DPopulate
@RichardHansell: You will have to accept Blam's, since it is indeed faster (if you can model your implementation for two hash sets).Koy
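For reference, the Parallel.ForEach suggestion from the comments would look roughly like the sketch below (ProcessGuid is a hypothetical stand-in for the real per-Guid processing, which would have to be thread-safe; it spreads the work across cores but does not remove the per-Guid scan):

// Hypothetical sketch of parallelising the per-Guid work, as suggested in the comments.
// Requires using System.Linq and System.Threading.Tasks.
// Assumes _ac is only read here and ProcessGuid is safe to call from multiple threads.
Parallel.ForEach(workList, kvp =>
{
    List<AuditCache> currentSet = _ac.Where(v => v.ObjectId == kvp.Key).ToList();
    ProcessGuid(kvp.Key, currentSet);   // hypothetical per-Guid processing
});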

How about if you sort the AuditCache list by GUID (which is a big integer after all) and then use List<T>.BinarySearch on it?

I got pretty good results for it (under 15 seconds on an i3-3110M @2.4Ghz). Cumulative times below:

  1. Collection created: 892 ms
  2. Collection sorted: 9285 ms
  3. Lookup: 12055 ms

Below I'm using BigInteger from System.Numerics to interpret Guids as 128 bit integers.

If I'm not missing something, then this should work. Note that the lookup for loop is actually the worst-case scenario, because it is very unlikely that there are collisions (so index will almost always be negative, i.e. not found). In your case it could be even faster:

class AuditCache
{
    public Guid ObjectId;
    public int HistoryId;
    public DateTime? DateFrom;
    public DateTime? DateTo;
    public string Value;
};

class AuditCacheComparer : IComparer<AuditCache>
{
    public int Compare(AuditCache x, AuditCache y)
    {
        BigInteger intx = new BigInteger(x.ObjectId.ToByteArray());
        BigInteger inty = new BigInteger(y.ObjectId.ToByteArray());
        if (intx < inty)
        {
            return -1;
        }
        else if (intx > inty)
        {
            return 1;
        }

        return 0;
    }
}

class Program
{

    static void Main(string[] args)
    {
        List<AuditCache> testCollection = new List<AuditCache>();
        Stopwatch sw = Stopwatch.StartNew();

        for (int i = 0; i != 1000000; ++i)
        {
            testCollection.Add(new AuditCache() { ObjectId = Guid.NewGuid(), HistoryId = i });
        }

        Console.WriteLine("Collection created: {0} ms", sw.ElapsedMilliseconds);

        AuditCacheComparer comparer = new AuditCacheComparer();
        testCollection.Sort(comparer);

        Console.WriteLine("Collection sorted: {0} ms", sw.ElapsedMilliseconds);

        for(int i = 0; i != 300000; ++ i)
        {
            var index = testCollection.BinarySearch(new AuditCache() {ObjectId = Guid.NewGuid()}, comparer);
            if (index >= 0) // a non-negative index means a match was found
            {
                Console.WriteLine("Found: {0} ms", sw.ElapsedMilliseconds);
            }
        }

        Console.WriteLine("Lookup: {0} ms", sw.ElapsedMilliseconds);
        Console.ReadLine();
    }
}
Koy answered 7/8, 2014 at 14:6 Comment(3)
Okay cool, I got this hooked up and it seems to work. However, it only gives me the first index in my List for a matching Guid. I then need to iterate over the list to form a true subset from it. I will do this then perform some timings to see which method comes out fastest.Populate
@RichardHansell: Yes, you do have to call BinarySearch multiple times (just as I did).Koy
This looks like it will process the entire job in under 20 minutes so many thanks :DPopulate
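As the comments above note, BinarySearch returns only one matching index, so the caller still has to walk outwards from that hit to collect the full subset. A minimal sketch of that step, assuming the list is sorted with the same comparer (this helper is hypothetical, not part of the answer):

// Hypothetical helper: after BinarySearch finds *some* matching index,
// scan left and right to collect every AuditCache with the same ObjectId.
static List<AuditCache> GetSubset(List<AuditCache> sorted, Guid id, IComparer<AuditCache> comparer)
{
    var result = new List<AuditCache>();
    int index = sorted.BinarySearch(new AuditCache { ObjectId = id }, comparer);
    if (index < 0) return result;   // no match at all

    int first = index;
    while (first > 0 && sorted[first - 1].ObjectId == id) first--;   // back up to the first match

    for (int i = first; i < sorted.Count && sorted[i].ObjectId == id; i++)
        result.Add(sorted[i]);      // collect the contiguous run of matches

    return result;
}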

1 million in 1.2 seconds on a P4

HashSet has IntersectWith, and it is smoking fast.

If the collection represented by the other parameter is a HashSet collection with the same equality comparer as the current HashSet object, this method is an O(n) operation. Otherwise, this method is an O(n + m) operation, where n is Count and m is the number of elements in other.

But to make it work you need AuditCache to override both Object.GetHashCode and Object.Equals (see the Object.GetHashCode method documentation).

GUIDs hash nicely (minimal collisions), so this will be very fast.

WorkList will also need to hold AuditCache entries (even if they are not really audit records), or you could have both types derive from a common base class that uses Guid ObjectId as the key (Equals and GetHashCode).

A GUID is NOT inherently slow to index if you are using it as a hashed key (Dictionary or HashSet) - it is a great key because it has few (or no) collisions. Even if you made both collections a Dictionary keyed by Guid it would be a lot faster, but Dictionary does not have IntersectWith.

1 million in 1.2 seconds on a P4: 0.5 seconds to clone and 0.7 seconds to intersect.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Windows;

namespace HashSetIntersect
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            HashSet<AuditCache> TestHashKeys1 = new HashSet<AuditCache>();
            HashSet<AuditCache> TestHashKeys2 = new HashSet<AuditCache>();
            for (UInt32 i = 0; i < 1000000; i++)
            {
                Guid g = Guid.NewGuid();
                TestHashKeys1.Add(new AuditCache(g, 1, (DateTime?)null, (DateTime?)null, "value1"));
                if (i % 2 == 0) TestHashKeys2.Add(new AuditCache(g, 0, (DateTime?)null, (DateTime?)null, "value2"));
            }            
            Debug.WriteLine(TestHashKeys1.Count.ToString() + " " + TestHashKeys2.Count.ToString());
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            sw.Restart();
            HashSet<AuditCache> TestHashKeys3 = new HashSet<AuditCache>(TestHashKeys1);
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            sw.Restart();
            TestHashKeys3.IntersectWith(TestHashKeys2);
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString());
            foreach (AuditCache ac in TestHashKeys3)
            {
                Debug.WriteLine(ac.Value);
            }
        }
    }
    public abstract class HashKey : Object
    {
        public Guid ObjectId { get; private set; }
        public override bool Equals(object obj)
        {
            if (!(obj is HashKey)) return false;
            HashKey comp = (HashKey)obj;
            return this.ObjectId == comp.ObjectId;
        }

        public override int GetHashCode()
        {
            return ObjectId.GetHashCode();
        }
        public HashKey(Guid objectId)
        {
            ObjectId = objectId;
        }
    }
    public class TestHashKey : HashKey
    {
        public TestHashKey(Guid ObjectId)
            : base(ObjectId)
        { }
    }
    public class AuditCache : HashKey
    {
        public int HistoryId { get; private set; }
        public DateTime? DateFrom { get; private set; }
        public DateTime? DateTo { get; private set; }
        public string Value { get; private set; }
        public AuditCache(Guid objectId, int historyId, DateTime? dateFrom, DateTime? dateTo, string value)
            : base(objectId)
        {
            HistoryId = historyId;
            DateFrom = dateFrom;
            DateTo = dateTo;
            Value = value;
        }
    }
}
Septilateral answered 7/8, 2014 at 14:8 Comment(7)
+1. It is indeed faster. Will have to take a look at their implementation. Their asymptotic complexity is O(n+m), whereas mine is O(m*log(n)).Koy
Yar, good point about Guids, I keep forgetting they are just large numbers!Populate
@MarcelN. If they both have the same equality comparer then it is O(n).Septilateral
Okay, I see you modified your code, took me a while to figure out that the GetHashCode override was slightly wrong :D The only concern I have with this is that you need to clone the whole hashset every time you want to take a new IntersectWith. In testing this ended up being even slower than the original method! (Although I may well have messed something up?)Populate
You don't HAVE to clone the HashSet - that is only if you need to save the original. I was able to clone 10 million in 5 seconds. What is wrong with that override int GetHashCode()? You did something wrong cause I got 10 million in 8 seconds. 13 seconds if you add the clone.Septilateral
The problem is I am pulling roughly 300,000 different subsets out of my original list, so I need to be able to repeat the IntersectWith multiple times. Nothing wrong with your override GetHashCode() code now, it was slightly wrong before but I figured it out and now it looks the same as my version. In summary the IntersectWith is incredi-fast but the clone makes it slower. I ran it for fifteen minutes and it had only processed 5,000 of 371,000 Guids.Populate
I was able to clone 1 million AuditCache in 0.5 seconds.Septilateral

This won't be the accepted answer, but one of my colleagues pointed out that I should try this architecture:

private Dictionary<Guid, List<AuditCache>> _crmAudit = new Dictionary<Guid, List<AuditCache>>();

To get the data in I "fill up" the _ac list/ hashset for a single Guid and then write it like this:

_crmAudit.Add(lastObjectId, _ac);
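
Roughly, that fill-and-flush looks like this sketch (it assumes the recordset arrives ordered by ObjectId; the column names and the List<AuditCache> buffer are assumptions, not taken from the real code):

// Hypothetical sketch: rows arrive grouped by ObjectId, so flush the
// per-Guid buffer into the dictionary whenever the Guid changes.
Guid lastObjectId = Guid.Empty;
var buffer = new List<AuditCache>();

while (dr.Read())
{
    Guid currentObjectId = (Guid)dr["ObjectId"];     // assumed column name
    if (currentObjectId != lastObjectId && buffer.Count > 0)
    {
        _crmAudit.Add(lastObjectId, buffer);
        buffer = new List<AuditCache>();
    }
    buffer.Add(new AuditCache
    {
        ObjectId = currentObjectId,
        HistoryId = Convert.ToInt32(dr["HistoryId"]),
        DateTo = Convert.ToDateTime(dr["CreatedOn"]),
        Value = Convert.ToString(dr["Value"])        // assumed column name
    });
    lastObjectId = currentObjectId;
}
if (buffer.Count > 0) _crmAudit.Add(lastObjectId, buffer);   // flush the final group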

When I want the data back out again I can do this:

List<AuditCache> currentSet;
if (_crmAudit.TryGetValue(g.Key, out currentSet))
{
    List<AuditCache> sortedSet = currentSet.OrderBy(o => o.HistoryId).ToList();
    // ... process sortedSet here ...
}

to get the records back out.

I tried this and the job is now a lot faster already; after 3 minutes it had almost finished - but then my entire office had a network outage (nothing to do with me though!).

One other thought I had was that I am really making this more complicated than it needs to be. My problem can be broken down to this:

  • I have a list of Guids;
  • I have 0-n audit records for each Guid in my first list (no other Guids will ever be in my list of audit records).

My current solution is:

  • iterate over the first list of Guids;
  • for each Guid extract the 0-n audit records from the second list (which can be slow);
  • process each set.

Really my solution should be:

  • ignore the first list;
  • sort the second list into Guid order;
  • keep popping off rows, sequentially, from the second list until the Guid changes;
  • process the data for that Guid and then move to the next Guid until I run out of audit records.

I will probably give this a go as well; a rough sketch of that sequential pass follows.
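
It assumes the audit records are sorted by ObjectId first; ProcessGroup is a hypothetical stand-in for whatever per-Guid processing happens:

// Hypothetical sketch of the "sort once, then sweep" approach described above.
var sorted = _ac.OrderBy(a => a.ObjectId).ToList();

int start = 0;
for (int i = 1; i <= sorted.Count; i++)
{
    // A group ends when we run off the end or the ObjectId changes.
    if (i == sorted.Count || sorted[i].ObjectId != sorted[start].ObjectId)
    {
        List<AuditCache> group = sorted.GetRange(start, i - start);
        ProcessGroup(sorted[start].ObjectId, group);   // hypothetical per-Guid processing
        start = i;
    }
}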

Now I will start trying the other suggestions (many thanks for those).

Populate answered 7/8, 2014 at 14:45 Comment(0)
