I have a list of audit data from Dynamics CRM 2013 that I have deserialised and stuck into a HashSet, defined as:
private class AuditCache
{
    public Guid ObjectId;
    public int HistoryId;
    public DateTime? DateFrom;
    public DateTime? DateTo;
    public string Value;
}
private HashSet<AuditCache> _ac = new HashSet<AuditCache>();
I add the data like this (from a SQL Server recordset):
_ac.Add(new AuditCache
{
    ObjectId = currentObjectId,
    HistoryId = Convert.ToInt32(dr["HistoryId"]),
    DateTo = Convert.ToDateTime(dr["CreatedOn"]),
    Value = value
});
I end up with roughly half a million records.
Next I need to iterate through every Guid and pull out the subset of data from my audit data that matches. I have a list of the Guids that I generate elsewhere and there are around 300,000 to process. I store them in this:
var workList = new Dictionary<Guid, DateTime>();
...and iterate through them like this:
foreach (var g in workList)
Then I need to do this to pull out the subset for each Guid:
List<AuditCache> currentSet = _ac.Where(v => v.ObjectId == g.Key).ToList();
But it's slow.
It takes around a minute to populate my initial audit data list, but it would take hours to pull out each set, process it and squirt it back into a database table (I never ran it to completion, so that estimate is based on the time to process 1% of the data).
Stepping through the code, the bottleneck seems to be pulling the subset out of my list for each Guid: every Where call scans all ~500,000 records, and I need to do that ~300,000 times. So my question is: is there a better/more efficient way (architecture?) to store/retrieve my data set?
One thing to note: I know Guids are inherently slow to index/search, but I am pretty much constrained to using them because of the way Dynamics CRM works. I guess I could create a Dictionary to look up Guids and "convert" them to integer values, or something along those lines, but I am not convinced that would help much.
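For what it's worth, this is roughly the kind of restructuring I had in mind, and (if I understand it right) it is more or less what the "Generic Dictionary" method in the edit below amounts to: group the audit rows by Guid once, so each lookup becomes a single hash probe instead of a scan of ~500,000 records. This is only a minimal sketch using the AuditCache class and workList from above; ProcessSet is a hypothetical stand-in for my real processing/INSERT step:

// Build a Guid-keyed index once (a single O(n) pass over the ~500,000 audit rows)...
var index = new Dictionary<Guid, List<AuditCache>>();
foreach (var a in _ac)
{
    List<AuditCache> bucket;
    if (!index.TryGetValue(a.ObjectId, out bucket))
    {
        bucket = new List<AuditCache>();
        index.Add(a.ObjectId, bucket);
    }
    bucket.Add(a);
}

// ...then each of the ~300,000 lookups is a single TryGetValue.
foreach (var g in workList)
{
    List<AuditCache> currentSet;
    if (index.TryGetValue(g.Key, out currentSet))
    {
        ProcessSet(g.Key, g.Value, currentSet);   // hypothetical processing/INSERT step
    }
}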
Edit
Okay, I tested the three solutions against my live data (371,901 Guids); the results below are average times per 1,000 Guids. Note that this includes the processing/INSERT to SQL Server, so it isn't a proper benchmark.
Method #0 - List with Lambda: ~30.00s per 1,000 rows (I never benchmarked this precisely)
Method #1 - IntersectWith: 40.24s per 1,000 rows (cloning my HashSet spoilt this)
Method #2 - BinarySearch: 3.20s per 1,000 rows
Method #3 - Generic Dictionary: 2.19s per 1,000 rows
On the basis of this I am probably going to rewrite my code from scratch, as I think the whole approach I was taking was wrong.
However, this has been a very useful learning exercise, and many thanks to everyone who contributed. I am going to accept the BinarySearch answer as correct, since it does what I wanted and is much faster than my original code.
Just to be clear, IntersectWith is indeed "smoking" fast, but it doesn't work for my particular problem: it modifies the set in place, and I need to keep going back to my original HashSet (hence the cloning that spoilt Method #1).
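In case it helps anyone else, this is roughly how I understand the BinarySearch approach. It's a minimal sketch rather than the answer's exact code, and it assumes copying the audit rows into a List<AuditCache> sorted by ObjectId (the ByObjectId comparer is my own naming):

private class ByObjectId : IComparer<AuditCache>
{
    public int Compare(AuditCache a, AuditCache b)
    {
        return a.ObjectId.CompareTo(b.ObjectId);
    }
}

// Sort once up front (O(n log n))...
var byId = new ByObjectId();
var sorted = _ac.ToList();
sorted.Sort(byId);

// ...then, inside the foreach (var g in workList) loop, each lookup is O(log n):
var probe = new AuditCache { ObjectId = g.Key };
int hit = sorted.BinarySearch(probe, byId);
var currentSet = new List<AuditCache>();
if (hit >= 0)
{
    // BinarySearch returns *a* matching index, so walk outwards to collect the whole block.
    int first = hit;
    while (first > 0 && sorted[first - 1].ObjectId == g.Key) first--;
    int last = hit;
    while (last < sorted.Count - 1 && sorted[last + 1].ObjectId == g.Key) last++;
    currentSet = sorted.GetRange(first, last - first + 1);
}

The sorted copy also leaves the original _ac untouched, which is exactly what I couldn't get with IntersectWith.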
Parallel.ForEach to process the workList? – Dmitri