Entity Framework Core: traversing big blob data without memory overflow, best practice
I'm writing code that traverses large amounts of picture data, preparing a big delta block that contains all of it, compressed, for sending.

Here's a sample of what this data could look like:

[MessagePackObject]
public class Blob : VersionEntity
{
    [Key(2)]
    public Guid Id { get; set; }
    [Key(3)]
    public DateTime CreatedAt { get; set; }
    [Key(4)]
    public string Mediatype { get; set; }
    [Key(5)]
    public string Filename { get; set; }
    [Key(6)]
    public string Comment { get; set; }
    [Key(7)]
    public byte[] Data { get; set; }
    [Key(8)]
    public bool IsTemporarySmall { get; set; }
}

public class BlobDbContext : DbContext
{
    public DbSet<Blob> Blob { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<Blob>().HasKey(o => o.Id);
    }
}

When working with this I process everything into a file stream, and I want to keep as little as possible in memory at any given time.

Is it enough to do it like this?

foreach (var b in context.Blob.Where(o => somefilters).AsNoTracking())
    MessagePackSerializer.Serialize(stream, b);

Will this still fill up memory with all the blob records, or will they be processed one by one as I iterate over the enumerator? It doesn't use any ToList, only the enumerator, so Entity Framework should be able to process it as it goes, but I'm not sure that's what it actually does.
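
The same loop can also be written with EF Core's async streaming API, which makes the one-row-at-a-time intent explicit. This is only a sketch under a few assumptions (EF Core 3+ for AsAsyncEnumerable, MessagePack v2 for SerializeAsync, somefilters standing in for the real predicate, and the whole thing running inside an async method):

// Minimal sketch, not a confirmed answer: the async variant of the loop above.
await using var stream = File.Create("delta.bin");

await foreach (var b in context.Blob
                   .Where(o => somefilters)
                   .AsNoTracking()
                   .AsAsyncEnumerable())
{
    // The intent: materialize one Blob per iteration rather than
    // buffering the whole result set before serializing.
    await MessagePackSerializer.SerializeAsync(stream, b);
}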

Are there any Entity Framework experts here who can give some guidance on how this is handled properly?

Gen answered 13/11, 2019 at 8:41 Comment(3)
I'm not 100% sure, but I think this will result in a single query being sent to the database; it is, however, processed on the C# side one by one (you can check this with SQL Profiler). You could change your loop and use Skip and Take to make sure you're getting a single item, but this isn't what EF is made for, so I'm not sure you're going to find a best practice.Chaunce
If I understand correctly, SqlDataReader will make a connection to the database and fetch parts while you are iterating with Read(). If the enumerator works the same way here, it should be fine. But if it buffers everything and then iterates, we have a problem. Can anyone here confirm how this works? I want it to execute one single query, but have a streaming connection to the database and work with the data as you go, processing and freeing one entity at a time.Gen
Why don't you memory-profile your code? We can't do that for you. Also, the question is broad/unclear (and would be put on hold as such if it weren't for the bounty) because of unknown components and surrounding code (like, where does stream come from?). Finally, handling SQL Server FILESTREAM data fast and streaming requires a different approach that is beyond EF.Diction
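
For reference, here is a rough sketch of the non-EF approach the last two comments hint at: a raw SqlDataReader with CommandBehavior.SequentialAccess, so the Data column is streamed per row instead of being loaded as one byte[]. The table and column names are assumed from the model above, and connectionString and outputStream are placeholders:

// Illustrative sketch only; assumes the table/column names match the model above.
using System.Data;                  // CommandBehavior
using System.IO;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;     // or System.Data.SqlClient

async Task CopyBlobsAsync(string connectionString, Stream outputStream)
{
    using var conn = new SqlConnection(connectionString);
    await conn.OpenAsync();

    using var cmd = new SqlCommand("SELECT Id, Data FROM Blob", conn);

    // SequentialAccess lets GetStream() read the varbinary column incrementally
    // instead of materializing the whole byte[] for every row.
    using var reader = await cmd.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
    while (await reader.ReadAsync())
    {
        var id = reader.GetGuid(0);  // columns must be read in ordinal order with SequentialAccess
        using var blobStream = reader.GetStream(1);
        await blobStream.CopyToAsync(outputStream);
    }
}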

In general, when you create a LINQ filter on an entity, it is like writing a SQL statement in code form. It returns an IQueryable that has not actually been executed against the database. When you iterate over the IQueryable with a foreach or call ToList(), the SQL is executed, and all the results are returned and stored in memory.

https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/ef/language-reference/query-execution
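
As a small illustration of that deferred execution, using the question's model (the filter here is arbitrary):

// Nothing is sent to the database on this line; it only builds up the query.
IQueryable<Blob> query = context.Blob.Where(b => b.Mediatype == "image/png");

// The SQL runs only when the query is enumerated, for example here:
var results = query.ToList();   // executes now and buffers the returned rows in the list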

While EF is maybe not the best choice for pure performance, there is a relatively simple way to handle this without worrying too much about memory usage:

Consider the following:

var filteredIds = context.Blob
                      .Where(b => b.SomeProperty == "SomeValue")
                      .Select(x => x.Id)
                      .ToList();

Now you have filtered the Blobs according to your requirements, and executed this against the database, but only returned the Id values in memory.

Then

foreach (var id in filteredIds)
{
    var blob = context.Blob.AsNoTracking().Single(x => x.Id == id);
    // Do your work here against a single in-memory blob
}

The large blob should be available for garbage collection once you are finished with it, and you shouldn't run out of memory.

If you want to refine the idea, you can obviously sense-check the number of records in the id list, or you could add metadata to the first query to help you decide how to process it.
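
As one illustrative way the "metadata first" idea could look (SomeProperty is the same placeholder filter as above, and the threshold is arbitrary):

var manifest = context.Blob
    .Where(b => b.SomeProperty == "SomeValue")
    .Select(b => new { b.Id, b.Filename, b.CreatedAt })
    .ToList();

// Sense-check before starting the heavy per-row loop above.
if (manifest.Count > 10_000)
{
    // e.g. split the ids into batches, or warn the caller before proceeding
}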

Asiaasian answered 25/11, 2019 at 18:57 Comment(3)
This does not answer my question. I wanted to know whether EF handles fetching from the query in a sequential manner when traversing the enumerator, the way SqlDataReader does with Read(). It should be possible, and it's also the preferred way instead of fetching one by one. The closest I've been to an answer here is what Smit Patel says in an answer here: github.com/aspnet/EntityFrameworkCore/issues/14640 He says "What that means is, we wouldn't need buffering internally. Hence in your case, a no-tracking query would not get/store more data than what the current result row is.".Gen
If you can confirm 100% that EF fetches everything before enumerating, that would be part of an answer, provided you also show a way of using SqlDataReader to do it the proper way. Or, if EF actually does this properly, a confirmation of that would be an answer. Anyway, this is starting to take more time than it would take me to debug EF for a confirmation ;)Gen
Sorry - I did a little digging but didn't get to the bottom of it. I would suggest that if you are worried about pure performance, EF is not the way to go; if you want to keep the EF paradigm, then my answer ensures that you will not run out of memory. Assuming that the Id has a clustered index, the performance hit of lots of sequential queries might not be as bad as you think.Asiaasian

As long as you use it directly as illustrated, with AsNoTracking, it only reads the part you fetch and releases it as you iterate, so my first assumption was correct.
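
Putting that together with the original goal (one compressed delta block), the loop from the question could be wrapped in a compression stream roughly like this. GZipStream is only an assumption here, since the post doesn't say which compressor is used, and somefilters is still a placeholder:

// Rough sketch; GZipStream is an assumed compressor, not the one from the original code.
using System.IO;
using System.IO.Compression;

using var fileStream = File.Create("delta.bin.gz");
using var gzip = new GZipStream(fileStream, CompressionLevel.Optimal);

foreach (var b in context.Blob.Where(o => somefilters).AsNoTracking())
{
    // One Blob is materialized per iteration; the previous one becomes
    // eligible for garbage collection once the loop moves on.
    MessagePackSerializer.Serialize(gzip, b);
}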

Gen answered 28/5, 2021 at 9:30 Comment(0)
