Read huge table with LINQ to SQL: Running out of memory vs slow paging

I have a huge table which I need to read through in a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order, so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around, yet trying to page them results in horrible performance problems.

Here's the breakdown. Original attempt was this:

var logs = 
    (from record in dataContext.someTable 
     where [index is appropriate]
     select record);

foreach (var l in logs)
{
    // Do stuff with data from l
}

This is pretty fast, and streams at a good rate, but the problem is that the application's memory use keeps climbing and never comes back down. My guess is that the LINQ to SQL entities are being kept around in memory and not disposed properly. So, after reading Out of memory when creating a lot of objects C#, I tried the following approach. This seems to be the common Skip/Take paradigm that many people use, with the added feature of saving memory.

Note that _conn is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.

int skipAmount = 0;
bool finished = false;

while (!finished)
{
    // Trick to allow for automatic garbage collection while iterating through the DB
    using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
    {               
        var query =
            (from record in tempDataContext.someTable
             where [index is appropriate]
             select record);

        List<workerLog> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (var l in logs)
        {
            // Do stuff with data from l
        }

        skipAmount += logs.Count;
    }
}

Now I have the desired behavior: memory usage doesn't increase at all as I stream through the data. Yet I have a far worse problem: each Skip makes the data load more and more slowly, because the underlying query apparently forces the server to re-read all the data for every previous page. Each page takes longer to load than the last, and the whole traversal turns into a quadratic operation. This problem has come up in other posts as well.

I can't seem to find a way to do this with LINQ that lets me bound memory use by paging the data, yet still load each page in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the objects in the first approach above, but I can't figure out how to do that.
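
For reference, one alternative I've been considering is paging on the clustered index key itself instead of Skip, so each batch becomes an index seek rather than a scan over all previous rows. This is only a rough sketch; RecordId is a stand-in for whatever the actual clustered key column is, not my real schema:

long lastKey = 0;   // assumes the clustered key is a single increasing numeric column
bool finished = false;

while (!finished)
{
    using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
    {
        // Seek past the last key we saw rather than Skip()ing over prior pages,
        // so the server can use the clustered index to jump straight to this batch.
        List<workerLog> logs =
            (from record in tempDataContext.someTable
             where record.RecordId > lastKey   // RecordId is hypothetical
             orderby record.RecordId
             select record)
            .Take(BatchSize)
            .ToList();

        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach (var l in logs)
        {
            // Do stuff with data from l
        }

        lastKey = logs[logs.Count - 1].RecordId;
    }
}

I haven't verified whether this keeps per-page cost flat on my data, but it should avoid the re-scan that Skip triggers.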

Claustral answered 17/9, 2012 at 23:00 Comment(4)
"I have a huge table which I need to read through on a certain order and compute some aggregate statistics. " - Do it at the server in TSQL....That's what it is good at!Smoodge
No, the statistics are more complicated than that, and aren't computable with SQL queries. The data needs to be iterated through in a certain order and things computed that are temporally correct, etc.Claustral
"No, the statistics are more complicated than that, and aren't computable with SQL queries" - really? Is it possible to give a full example?Smoodge
I've greatly simplified the code for posting on here. I am iterating through a log of user interactions in temporal order and computing some features for a machine learning algorithm at each point in time. I'm computing things like Shannon entropy, running aggregates of min/max/avg/stdev, aggregate values for items seen by users, etc. involving about 50 total values per input. There is a lot of state involved for each computation, and also non-SQL functions, and I'm highly certain it would be easier to compute (and make changes) by streaming the data rather than trying to write it in SQL.Claustral

After madly grasping at straws, I found that setting the DataContext's ObjectTrackingEnabled property to false could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.

using (var readOnlyDataContext = 
    new MyDataContext(_conn) {CommandTimeout = really_long, ObjectTrackingEnabled = false})
{                                                 
    var logs =
        (from record in readOnlyDataContext.someTable
         where [index is appropriate]
         select record);

    foreach (var l in logs)
    {
        // Do stuff with data from l   
    }                
}

The above approach does not accumulate any memory while streaming through the objects. When writing data, I can use a different DataContext that has object tracking enabled, and that seems to work okay. However, this approach leaves a single SQL query that can take an hour or more to stream to completion, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.
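
To be concrete about that read/write split, here is roughly what I mean. This is a sketch only: someResultsTable and resultEntity are placeholder names, not anything from my real schema, and _conn needs to give each context its own connection (e.g. a connection string) so the open reader and the writes don't collide.

// Untracked context for streaming reads; a second, tracked context for writes.
using (var readContext = new MyDataContext(_conn) {ObjectTrackingEnabled = false})
using (var writeContext = new MyDataContext(_conn))
{
    var logs =
        (from record in readContext.someTable
         where [index is appropriate]
         select record);

    foreach (var l in logs)
    {
        // Compute statistics from l, then queue a result row on the tracked context.
        writeContext.someResultsTable.InsertOnSubmit(new resultEntity { /* computed fields */ });

        // For very long runs it may be worth calling SubmitChanges() periodically
        // (and recreating writeContext) so its tracked set stays small.
    }

    writeContext.SubmitChanges();   // only the tracked context flushes changes
}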

A warning about turning object tracking off: I found that when you try to do multiple concurrent reads with the same DataContext, you don't get the usual error "There is already an open DataReader associated with this Command which must be closed first." Instead, the application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a bug or a feature.
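
If you genuinely need concurrent reads, the safe pattern seems to be one untracked context (and therefore one connection and one DataReader) per reader. A sketch, where keyRanges is a hypothetical pre-computed partition of the key space and RecordId is again a stand-in for the real index column:

// One DataContext (and hence one connection/DataReader) per concurrent reader;
// sharing a single DataContext across readers is what triggered the hang above.
Parallel.ForEach(keyRanges, range =>
{
    using (var ctx = new MyDataContext(_conn) {ObjectTrackingEnabled = false})
    {
        var rows =
            (from record in ctx.someTable
             where record.RecordId >= range.Start && record.RecordId < range.End
             select record);

        foreach (var l in rows)
        {
            // Do stuff with data from l
        }
    }
});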

Claustral answered 17/9, 2012 at 23:20 Comment(1)
I had exactly the same problem, reading in a million rows and running out of memory. With ObjectTrackingEnabled set to false, garbage collection freed up the memory this function had used; without it, about 300 MB wasn't being released. So this little setting can make a huge impact. – Ansilme
