Retrieving a large number of records under multiple constraints, without causing an out-of-memory exception
I've got the following situation:

  1. There are two related types. For this question, I'll use the following simple types:

    public class Person
    {
        public Guid Id { get; set; }
        public int Status { get; set; }
    }
    
    public class Account
    {
        public Guid AccountId { get; set; }
        public decimal Amount { get; set; }
        public Guid PersonId { get; set; }
    }
    

    So that one Person might have multiple Accounts (i.e., multiple Accounts would reference the same PersonId).

  2. In our database, there are tens of thousands of persons, and each has 5-10 accounts on average.

  3. I need to retrieve each person's accounts, assuming they fulfill certain requirements. Afterwards, I need to see if all of this person's accounts, together, fulfill another condition.

    In this example, let's say I need every account with amount < 100, and that after retrieving one person's accounts, I need to check if their sum is larger than 1000.

  4. Using a LINQ query is desirable, but it can't be done with the group/by/into keywords, because the LINQ provider (LINQ-to-CRM) doesn't support them.

  5. In addition, the following simple LINQ query, which would implement the requirements of point 3, is also not possible (please read the inline comments):

    var query = from p in personList
                join a in accountList on p.Id equals a.PersonId
                where a.Amount < 100
                select a;
    var groups = query.GroupBy(a => a.PersonId);
    // and now, process the groups in batches of x
    // (where x is the largest number of groups that won't cause an out-of-memory exception)
    

    It is not possible for two reasons:

    a. The LINQ provider forces a call to ToList() before using GroupBy().

    b. Actually calling ToList() before GroupBy() results in an out-of-memory exception - since there are tens of thousands of accounts.

  6. For efficiency reasons, I don't want to do the following, since it means tens of thousands of retrievals:

    a. Retrieve all persons.

    b. Loop through them and retrieve each person's accounts on each iteration.

I'd be glad for any efficient ideas.

Mekka answered 3/7, 2017 at 18:55 Comment(1)
Just as an FYI: points 4 and 5.A are the same, and 5.B is kind of implied. You'd probably also get an OOM with a GroupBy anyway. – Drayton
I would suggest ordering the query by PersonId, switching to LINQ to Objects via AsEnumerable() (thus executing the query, but without materializing the whole result set in memory as a ToList() call would), and then using the GroupAdjacent method from the MoreLINQ package:

This method is implemented by using deferred execution and streams the groupings. The grouping elements, however, are buffered. Each grouping is therefore yielded as soon as it is complete and before the next grouping occurs.

var query = from p in personList
            join a in accountList on p.Id equals a.PersonId
            where a.Amount < 100
            orderby a.PersonId
            select a;
var groups = query.AsEnumerable()
    .GroupAdjacent(a => a.PersonId)
    .Where(g => g.Sum(a => a.Amount) > 1000);

The AsEnumerable() trick definitely works well with the EF query provider. Whether it works with the LINQ to CRM provider really depends on how that provider implements the GetEnumerator() method - if it tries to buffer the whole query result anyway, then you are out of luck.
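To illustrate the streaming behavior, here is a minimal sketch of what an adjacent grouping does internally: it buffers only the current person's accounts and yields each group as soon as the PersonId changes. GroupByPersonStreaming is a hypothetical helper written for this example (it assumes the input is already ordered by PersonId, as the query above guarantees); it is not the actual MoreLINQ implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Account
{
    public Guid AccountId { get; set; }
    public decimal Amount { get; set; }
    public Guid PersonId { get; set; }
}

public static class Program
{
    // Streams accounts (assumed ordered by PersonId) and yields one person's
    // accounts at a time; only the current group is buffered in memory.
    public static IEnumerable<List<Account>> GroupByPersonStreaming(IEnumerable<Account> orderedAccounts)
    {
        List<Account> current = null;
        Guid currentPersonId = Guid.Empty;
        foreach (var account in orderedAccounts)
        {
            if (current == null || account.PersonId != currentPersonId)
            {
                if (current != null)
                    yield return current;   // previous person's group is complete
                current = new List<Account>();
                currentPersonId = account.PersonId;
            }
            current.Add(account);
        }
        if (current != null)
            yield return current;           // yield the last group
    }

    public static void Main()
    {
        var p1 = Guid.NewGuid();
        var p2 = Guid.NewGuid();
        // Sample data shaped like the question: ordered by PersonId.
        var accounts = new[]
        {
            new Account { AccountId = Guid.NewGuid(), Amount = 600m, PersonId = p1 },
            new Account { AccountId = Guid.NewGuid(), Amount = 500m, PersonId = p1 },
            new Account { AccountId = Guid.NewGuid(), Amount = 50m,  PersonId = p2 },
        };

        foreach (var group in GroupByPersonStreaming(accounts))
        {
            if (group.Sum(a => a.Amount) > 1000m)
                Console.WriteLine($"Person {group[0].PersonId} qualifies");
        }
    }
}
```

Peak memory here is bounded by the largest single group (5-10 accounts per the question), not by the total number of accounts.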

Mastrianni answered 3/7, 2017 at 19:35 Comment(6)
How is it that AsEnumerable executes the query but does not materialize the objects in memory? Does that work with other LINQ providers like EF? – Drayton
@CamiloTerevinto Absolutely - that's the preferred way to switch from LINQ to Entities to LINQ to Objects as well. By "executing" I mean it will be executed when the final L2O query is executed, but it will be processed sequentially in memory (one by one) like a usual IEnumerable<T>. It's the typical difference between IEnumerable<T> (streaming) and List<T> (buffering) returning methods. – Mastrianni
Interesting, I'm amazed I've never read about that in so much time using EF and LINQ. I'll have to give that a SQL Server Profiler try :) thanks! – Drayton
Two questions: 1. Can I take the result of GroupAdjacent and loop through each group (while avoiding OOM)? 2. This got me thinking about the basics: if foreach doesn't materialize, maybe I can loop through each account, add it into a local List, stop when getting to a new PersonId, and so forth. What do you think? – Mekka
Note that not materializing objects isn't the same as not querying the data from SQL - AsEnumerable will still pull all the data from SQL in a single query. – Drummond
@HeyJude This is exactly what GroupAdjacent does (you can take a look at the implementation), but in a generic, reusable way. Of course you can do the same with foreach. – Mastrianni
I would order by the GUID and then process in blocks:

var basep = (from p in personList select p.Id).OrderBy(id => id);

int basepCount = personList.Count();
int blocksize = 1000;
// Number of blocks, rounded up so a final partial block is included.
int numblocks = (basepCount / blocksize) + (basepCount % blocksize == 0 ? 0 : 1);
for (var block = 0; block < numblocks; ++block) {
    // Inclusive range of person ids covered by this block.
    var firstPersonId = basep.Skip(block * blocksize).First();
    var lastPersonId = basep.Skip(Math.Min(basepCount - 1, block * blocksize + blocksize - 1)).First();

    var query = from p in personList.Where(ps => firstPersonId.CompareTo(ps.Id) <= 0
                                              && ps.Id.CompareTo(lastPersonId) <= 0)
                join a in accountList on p.Id equals a.PersonId
                where a.Amount < 100
                select a;
    var groups = query.GroupBy(a => a.PersonId);
    // work on groups
}
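The block arithmetic (rounded-up block count, inclusive first/last id per block) can be checked in isolation with plain LINQ to Objects. BlockBounds below is a hypothetical helper written for this sketch that mirrors the Skip/First logic above against an in-memory array:

```csharp
using System;
using System.Linq;

public static class BlockDemo
{
    // Computes the inclusive [first, last] id bounds for each block of an
    // ordered id list, mirroring the Skip/First logic of the answer above.
    public static (Guid first, Guid last)[] BlockBounds(Guid[] orderedIds, int blocksize)
    {
        int count = orderedIds.Length;
        // Round up so a final partial block is included.
        int numblocks = (count / blocksize) + (count % blocksize == 0 ? 0 : 1);
        var bounds = new (Guid, Guid)[numblocks];
        for (var block = 0; block < numblocks; ++block)
        {
            var first = orderedIds[block * blocksize];
            var last = orderedIds[Math.Min(count - 1, block * blocksize + blocksize - 1)];
            bounds[block] = (first, last);
        }
        return bounds;
    }

    public static void Main()
    {
        // 25 sorted ids with blocksize 10 should give 3 blocks (10 + 10 + 5).
        var ids = Enumerable.Range(0, 25).Select(_ => Guid.NewGuid()).OrderBy(g => g).ToArray();
        var bounds = BlockBounds(ids, blocksize: 10);
        Console.WriteLine(bounds.Length);   // 3
    }
}
```

Because the ids are sorted and distinct, consecutive inclusive ranges never overlap, so every person is processed in exactly one block.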
Drummond answered 3/7, 2017 at 21:28 Comment(0)
