Reading large files containing files in binary format and extracting those files with minimum heap allocation

Sorry for the title; it might be a bit confusing, but I don't know how I could have explained it better.

There are two files, with .cat (catalog file) and .dat extensions. A .cat file contains information about the binary files packed into a .dat file: the name of each file, its size, its offset in the .dat file, and its MD5 hash.

Example .cat file:

assets/textures/environments/asteroids/ast_crystal_blue_diff-small.gz 22387 1546955265 85a67a982194e4141e08fac4bf062c8f
assets/textures/environments/asteroids/ast_crystal_blue_diff.gz 83859 1546955265 86c7e940de82c2c2573a822c9efc9b6b
assets/textures/environments/asteroids/ast_crystal_diff-small.gz 22693 1546955265 cff6956c94b59e946b78419d9c90f972
assets/textures/environments/asteroids/ast_crystal_diff.gz 85531 1546955265 57d5a24dd4da673a42cbf0a3e8e08398
assets/textures/environments/asteroids/ast_crystal_green_diff-small.gz 22312 1546955265 857fea639e1af42282b015e8decb02db
assets/textures/environments/asteroids/ast_crystal_green_diff.gz 115569 1546955265 ee6f60b0a8211ec048172caa762d8a1a
assets/textures/environments/asteroids/ast_crystal_purple_diff-small.gz 14179 1546955265 632317951273252d516d36b80de7dfcd
assets/textures/environments/asteroids/ast_crystal_purple_diff.gz 53781 1546955265 c057acc06a4953ce6ea3c6588bbad743
assets/textures/environments/asteroids/ast_crystal_yellow_diff-small.gz 21966 1546955265 a893c12e696f9e5fb188409630b8d10b
assets/textures/environments/asteroids/ast_crystal_yellow_diff.gz 82471 1546955265 c50a5e59093fe9c6abb64f0f47a26e57
assets/textures/environments/asteroids/xen_crystal_diff-small.gz 14161 1546955265 23b34bdd1900a7e61a94751ae798e934
assets/textures/environments/asteroids/xen_crystal_diff.gz 53748 1546955265 dcb7c8294ef72137e7bca8dd8ea2525f
assets/textures/lensflares/lens_rays3_small_diff.gz 14107 1546955265 a656d1fad4198b0662a783919feb91a5

I parsed those files with relative ease using Span<T>, and after some benchmarks with BenchmarkDotNet I believe I have optimized the reading of those files as much as I could.
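
For illustration, this is roughly the kind of span-based parsing I mean; the CatalogEntry shape and the field order here are assumptions based on the description above, not my exact code:

public readonly record struct CatalogEntry(string AssetPath, int AssetSize, long ByteOffset, string Md5Hash);

private static CatalogEntry ParseCatalogLine(ReadOnlySpan<char> line)
{
    // Fields are separated by single spaces: <path> <size> <offset> <md5>.
    int i1 = line.IndexOf(' ');
    int i2 = i1 + 1 + line[(i1 + 1)..].IndexOf(' ');
    int i3 = i2 + 1 + line[(i2 + 1)..].IndexOf(' ');

    // Only the path and hash are materialized as strings; the numbers are
    // parsed directly from the spans without intermediate string allocations.
    return new CatalogEntry(
        line[..i1].ToString(),
        int.Parse(line[(i1 + 1)..i2]),
        long.Parse(line[(i2 + 1)..i3]),
        line[(i3 + 1)..].ToString());
}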

But .dat files are another story. A typical .dat file is GBs in size.

I tried the most straightforward method I could think of first.

(I removed the null checks and validation code to make the code more readable.)

public async Task ExportAssetsAsync(CatalogFile catalogFile, string destDirectory, CancellationToken ct = default)
{
    IFileInfo catalogFileInfo = _fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = _fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = _fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = _fs.FileInfo.FromFileName(datFilePath);

    await using Stream stream = datFileInfo.OpenRead();
    
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = _fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = _fs.FileInfo.FromFileName(destFilePath);
        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }
        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        var newFileData = new byte[catalogEntry.AssetSize];
        int read = await stream.ReadAsync(newFileData, 0, catalogEntry.AssetSize, ct);
        if (read != catalogEntry.AssetSize)
        {
            _logger?.LogError("Could not read asset data from dat file: {DatFile}", datFilePath);
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }
        await using Stream destStream = _fs.File.Open(destFile.FullName, FileMode.Create);
        destStream.Write(newFileData);
        destStream.Close();
    }
}

As you can guess, this method is both slow and allocates a lot on the heap, which keeps the GC busy.

I made some modifications to the method above and tried reading with a buffer, then using stackalloc and Span instead of allocating with new byte[catalogEntry.AssetSize]. I didn't gain much from the buffered reading, and naturally I got a StackOverflow exception with stackalloc since some files are larger than the stack size.
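
For reference, a rough sketch of the usual guard against that exception: stackalloc only below a small threshold and rent from a pool otherwise. This is illustrative only (the threshold is arbitrary), not the approach I ended up with, and it only works in synchronous code since Span<T> cannot live across an await:

// stackalloc for small assets, pooled array for everything else (illustrative threshold).
const int StackAllocThreshold = 256;

byte[] rented = null;
Span<byte> buffer = catalogEntry.AssetSize <= StackAllocThreshold
    ? stackalloc byte[StackAllocThreshold]
    : (rented = ArrayPool<byte>.Shared.Rent(catalogEntry.AssetSize));
buffer = buffer[..catalogEntry.AssetSize];

try
{
    // ... fill buffer from the .dat stream and write it out ...
}
finally
{
    if (rented != null)
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}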

Then, after some research, I decided I could use System.IO.Pipelines, introduced with .NET Core 2.1, and changed the method as below.

public async Task ExportAssetsPipe(CatalogFile catalogFile, string destDirectory, CancellationToken ct = default)
{
    IFileInfo catalogFileInfo = _fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = _fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = _fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");

    IFileInfo datFileInfo = _fs.FileInfo.FromFileName(datFilePath);
    
    await using Stream stream = datFileInfo.OpenRead();

    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = _fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = _fs.FileInfo.FromFileName(destFilePath);
        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }
        stream.Position = catalogEntry.ByteOffset;
        var reader = PipeReader.Create(stream);
        while (true)
        {
            ReadResult readResult = await reader.ReadAsync(ct);
            ReadOnlySequence<byte> buffer = readResult.Buffer;
            if (buffer.Length >= catalogEntry.AssetSize)
            {
                ReadOnlySequence<byte> entry = buffer.Slice(0, catalogEntry.AssetSize);
                await using Stream destStream = File.Open(destFile.FullName, FileMode.Create);
                foreach (ReadOnlyMemory<byte> mem in entry)
                {
                   await destStream.WriteAsync(mem, ct);
                }
                destStream.Close();
                break;
            }
            reader.AdvanceTo(buffer.Start, buffer.End);
        }
    }
}

Well, according to BenchmarkDotNet the results are worse than the first method, both in performance and in memory allocations. This is probably because I am using System.IO.Pipelines incorrectly or for something it wasn't designed for.

I don't have much experience with this, as I haven't done I/O operations on such large files before. How can I do what I want with minimum memory allocation and maximum performance? Thank you very much in advance for your help and guidance.

Babara asked 26/4, 2022 at 17:5 Comment(12)
Use the new ArrayPool<T>; it's in System.Buffers. Research first how to use it to avoid memory leaks. You need to always rent from and return to the pool; this will help a lot with memory allocation. – Spermaceti
Try this link: adamsitnik.com/Array-Pool – Spermaceti
Thank you @MauricioAtanache, I started reading the article. – Venessavenetia
#6949941 + Stream.CopyTo should be much simpler and require just one buffer allocation... (assuming you simply want to split large files into small ones). Reading ~16-64k blocks at a time is likely what you need - there are plenty of questions discussing IO optimizations if you want to roll your own (allocating memory on the LOH as you do now is not the way to optimize :) ) – Mcmichael
@MauricioAtanache I just tried ArrayPool and the results were very surprising. There is a 2x performance difference, but most importantly memory allocation is 8x less for a 5 GB .dat file. Thank you so much – Venessavenetia
Glad to hear it; should I make this an answer? – Spermaceti
@Deniz you should post the code you ended up with as an answer - bring the benefit to future readers of your question, and you'll get upvoted for it. You're allowed to (and should) answer your own questions – Darrick
Hi @CaiusJard, I'm working on it. I'm refactoring my code now and will post the latest version. But I already marked Mauricio Atanache's comment as the answer below. Should I still answer my own question too, or will just editing my question be enough? – Venessavenetia
And I would like to try @AlexeiLevenkov's answer (using a SubStream) first too. – Venessavenetia
"Should I still answer my own question too or will just editing my question be enough?" - answers should be posted as answers, never as "edit, update: here is how I solved this" edits to the question – Darrick
If you write up Alexei's approach, you can post two answers – Darrick
Wow, @AlexeiLevenkov's SubStream and CopyTo approach gave better results. I don't know, maybe I implemented Mauricio's approach wrong. Anyway, I'll post a detailed answer including benchmark results for large and small .dat files. – Venessavenetia

First of all, I thank Mauricio Atanache and Alexei Levenkov for their advice. I learned quite a bit while trying the methods they both suggested. After running my benchmarks, I decided to proceed with the SubStream and Stream.CopyTo method suggested by Alexei Levenkov.

First I want to share the solution. Afterward, those who are curious can examine the benchmarks and results.

Solution

Alexei directed me to an old question; I reviewed the solution there and adapted it to my own code.

How to expose a sub section of my stream to a user

First of all, I need a SubStream implementation; basically, what I want to do is extract small files from a big .dat file. Using SubStream, I can wrap the section of the FileStream that starts at the offset I want. Then, using the Stream.CopyTo method, I can copy the content of the SubStream to another FileStream and write it to the file system. With this method, I only make one buffer allocation.

public class SubStream : Stream
{
    private readonly Stream _baseStream;
    private readonly long _length;
    private long _position;

    public SubStream(Stream baseStream, long offset, long length)
    {
        if (baseStream == null)
        {
            throw new ArgumentNullException(nameof(baseStream), "Base stream cannot be null");
        }

        if (!baseStream.CanRead)
        {
            throw new ArgumentException("Base stream must be readable.", nameof(baseStream));
        }

        if (offset < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(offset));
        }

        _baseStream = baseStream;
        _length = length;

        if (baseStream.CanSeek)
        {
            baseStream.Seek(offset, SeekOrigin.Current);
        }
        else
        {
            // read it manually...
            const int bufferSize = 512;
            var buffer = new byte[bufferSize];
            while (offset > 0)
            {
                int read = baseStream.Read(buffer, 0, offset < bufferSize ? (int)offset : bufferSize);
                if (read == 0)
                {
                    // Guard against an infinite loop if the stream ends before the requested offset.
                    throw new EndOfStreamException("Reached end of stream before the requested offset.");
                }
                offset -= read;
            }
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        CheckDisposed();
        long remaining = _length - _position;
        if (remaining <= 0)
        {
            return 0;
        }

        if (remaining < count)
        {
            count = (int)remaining;
        }
        
        int read = _baseStream.Read(buffer, offset, count);
        _position += read;
        
        return read;
    }

    private void CheckDisposed()
    {
        if (_baseStream == null)
        {
            throw new ObjectDisposedException(GetType().Name);
        }
    }

    public override long Length
    {
        get
        {
            CheckDisposed();
            return _length;
        }
    }

    public override bool CanRead
    {
        get
        {
            CheckDisposed();
            return true;
        }
    }

    public override bool CanWrite
    {
        get
        {
            CheckDisposed();
            return false;
        }
    }

    public override bool CanSeek
    {
        get
        {
            CheckDisposed();
            return false;
        }
    }

    public override long Position
    {
        get
        {
            CheckDisposed();
            return _position;
        }
        set => throw new NotSupportedException();
    }

    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();

    public override void SetLength(long value) => throw new NotSupportedException();

    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    public override void Flush()
    {
        CheckDisposed();
        _baseStream.Flush();
    }
}

The final version of the method is as follows.

private static void ExportAssets(CatalogFile catalogFile, string destDirectory)
{
    FileInfo catalogFileInfo = new FileInfo(catalogFile.FilePath);
    string catalogFileName = Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    FileInfo datFileInfo = new FileInfo(datFilePath);

    using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Path.Combine(destDirectory, catalogEntry.AssetPath);
        FileInfo destFile = new FileInfo(destFilePath);

        if (!destFile.Directory.Exists)
        {
            destFile.Directory.Create();
        }

        using var subStream = new SubStream(stream, catalogEntry.ByteOffset, catalogEntry.AssetSize);
        using Stream destStream = File.Open(destFile.FullName, FileMode.Create);
        subStream.CopyTo(destStream);
        destStream.Close();
    }
}
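
A side note: Stream.CopyTo copies through a single internal buffer (81,920 bytes by default), and there is an overload that lets you pick the buffer size if you want to tune it, e.g.:

// Optional: choose the copy buffer size explicitly instead of the default.
subStream.CopyTo(destStream, bufferSize: 81920);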

Benchmark Setup

The setup I used when running the benchmarks:

  • I used two separate .dat files, one 600KB and the other 550MB.
  • In benchmarks, writing operations to the file system caused fluctuations in the results. Instead, I used MemoryStream to simulate the write operation.
  • I included both sync and async versions of the methods in benchmarks.
  • I'm using the System.IO.Abstractions library to mock File IO operations for unit tests. Don't get confused by the method calls that begin with Fs. (for example Fs.FileInfo.FromFileName(catalogFile.FilePath)).

Three different versions of the method were used for benchmarking.

The first one is the unoptimized version, which allocates a new byte[] for every subfile in the .dat file.

private static void ExportAssetsUnoptimized(CatalogFile catalogFile, string destDirectory)
{
    IFileInfo catalogFileInfo = Fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = Fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = Fs.FileInfo.FromFileName(datFilePath);

    using Stream stream = datFileInfo.OpenRead();

    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = Fs.FileInfo.FromFileName(destFilePath);

        if (!destFile.Directory.Exists)
        {
            // destFile.Directory.Create();
        }

        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        var newFileData = new byte[catalogEntry.AssetSize];
        int read = stream.Read(newFileData, 0, catalogEntry.AssetSize);

        if (read != catalogEntry.AssetSize)
        {
            throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
        }

        // using Stream destStream = Fs.File.Open(destFile.FullName, FileMode.Create);
        using var destStream = new MemoryStream();
        destStream.Write(newFileData);
        destStream.Close();
    }
}

The second one uses ArrayPool from System.Buffers (which was suggested by Mauricio Atanache). ArrayPool<T> is a high-performance pool of managed arrays. You can find it in the System.Buffers package, and its source code is available on GitHub. It's mature and ready to use in production.

There is a nice article that explains the subject in detail.

Pooling large arrays with ArrayPool

I still have doubts about whether I am using it properly and for its intended purpose. But when I used it as below, I observed that it runs faster and allocates half as much memory compared to the unoptimized version above.

private static void ExportAssetsWithArrayPool(CatalogFile catalogFile, string destDirectory)
{
    IFileInfo catalogFileInfo = Fs.FileInfo.FromFileName(catalogFile.FilePath);
    string catalogFileName = Fs.Path.GetFileNameWithoutExtension(catalogFileInfo.Name);
    string datFilePath = Fs.Path.Combine(catalogFileInfo.DirectoryName, $"{catalogFileName}.dat");
    IFileInfo datFileInfo = Fs.FileInfo.FromFileName(datFilePath);

    ArrayPool<byte> bufferPool = ArrayPool<byte>.Shared;

    using Stream stream = datFileInfo.OpenRead();
    foreach (CatalogEntry catalogEntry in catalogFile.CatalogEntries)
    {
        string destFilePath = Fs.Path.Combine(destDirectory, catalogEntry.AssetPath);
        IFileInfo destFile = Fs.FileInfo.FromFileName(destFilePath);

        if (!destFile.Directory.Exists)
        {
            //destFile.Directory.Create();
        }

        stream.Seek(catalogEntry.ByteOffset, SeekOrigin.Begin);
        byte[] newFileData = bufferPool.Rent(catalogEntry.AssetSize); // Rent may hand back a larger array
        try
        {
            int read = stream.Read(newFileData, 0, catalogEntry.AssetSize);

            if (read != catalogEntry.AssetSize)
            {
                throw new DatFileReadException("Could not read asset data from dat file", datFilePath);
            }

            // using Stream destStream = Fs.File.Open(destFile.FullName, FileMode.Create);
            using Stream destStream = new MemoryStream();
            destStream.Write(newFileData, 0, catalogEntry.AssetSize);
            destStream.Close();
        }
        finally
        {
            // Always return the rented buffer to the pool, even if reading fails.
            bufferPool.Return(newFileData);
        }
    }
}

The third one is the fastest and allocates the least memory. By least memory I mean ~75x less allocated, and it is also significantly faster.

I already gave the code sample for this method at the beginning of the answer and explained it, so I'm skipping straight to the benchmark results.

You can access the full BenchmarkDotNet setup from the gist link below.

https://gist.github.com/Blind-Striker/8f7e8ff56de6d9c2a4ab7a47ae423eba

Benchmark Results

| Method | FileSize | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| ExportAssetsUnoptimized_Benchmark | Large_5GB | 563,034.4 us | 13,290.13 us | 38,977.64 us | 140000.0000 | 140000.0000 | 140000.0000 | 1,110,966 KB |
| ExportAssetsWithArrayPool_Benchmark | Large_5GB | 270,394.1 us | 5,308.29 us | 6,319.15 us | 5500.0000 | 4000.0000 | 4000.0000 | 555,960 KB |
| ExportAssetsSubStream_Benchmark | Large_5GB | 17,525.8 us | 183.55 us | 171.69 us | 3468.7500 | 3468.7500 | 3468.7500 | 14,494 KB |
| ExportAssetsUnoptimizedAsync_Benchmark | Large_5GB | 574,430.4 us | 20,442.46 us | 59,954.20 us | 133000.0000 | 133000.0000 | 133000.0000 | 1,111,298 KB |
| ExportAssetsWithArrayPoolAsync_Benchmark | Large_5GB | 237,256.6 us | 5,673.63 us | 16,728.82 us | 1500.0000 | - | - | 556,088 KB |
| ExportAssetsSubStreamAsync_Benchmark | Large_5GB | 32,766.5 us | 636.08 us | 732.51 us | 3187.5000 | 2562.5000 | 2562.5000 | 15,186 KB |
| ExportAssetsUnoptimized_Benchmark | Small_600KB | 680.4 us | 13.24 us | 23.20 us | 166.0156 | 124.0234 | 124.0234 | 1,198 KB |
| ExportAssetsWithArrayPool_Benchmark | Small_600KB | 497.9 us | 7.54 us | 7.06 us | 124.5117 | 62.0117 | 62.0117 | 605 KB |
| ExportAssetsSubStream_Benchmark | Small_600KB | 332.0 us | 4.87 us | 4.32 us | 26.8555 | 26.8555 | 26.8555 | 223 KB |
| ExportAssetsUnoptimizedAsync_Benchmark | Small_600KB | 739.2 us | 5.98 us | 5.30 us | 186.5234 | 124.0234 | 124.0234 | 1,200 KB |
| ExportAssetsWithArrayPoolAsync_Benchmark | Small_600KB | 604.9 us | 6.99 us | 6.54 us | 124.0234 | 61.5234 | 61.5234 | 607 KB |
| ExportAssetsSubStreamAsync_Benchmark | Small_600KB | 496.6 us | 8.02 us | 6.70 us | 26.8555 | 26.8555 | 26.8555 | 228 KB |

Conclusion & Disclaimer

I have come to the conclusion that the SubStream and Stream.CopyTo approach allocates much less memory and runs much faster. Some of the remaining allocations were probably due to Path.Combine.

I'd like to remind you, though, that I had not used ArrayPool before posting this question on Stack Overflow. There is a possibility that I am not using it correctly or for its intended purpose. I'm also not sure how accurate it is to use a MemoryStream instead of a FileStream as the write destination to keep the benchmarks consistent.

Babara answered 27/4, 2022 at 23:31 Comment(0)

Use the new ArrayPool<T>; it's in System.Buffers. Research first how to use it to avoid memory leaks.

You need to always rent from and return to the pool; this will help a lot with memory allocation.

Try this link for your research: adamsitnik.com/Array-Pool
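
A minimal sketch of the rent/return pattern applied to the question's loop (the variable names follow the question's code and are only illustrative):

ArrayPool<byte> pool = ArrayPool<byte>.Shared;

byte[] buffer = pool.Rent(catalogEntry.AssetSize); // Rent may hand back a larger array than requested
try
{
    int read = stream.Read(buffer, 0, catalogEntry.AssetSize);
    destStream.Write(buffer, 0, read);
}
finally
{
    pool.Return(buffer); // always return the buffer, even when an exception is thrown
}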

Spermaceti answered 26/4, 2022 at 19:20 Comment(1)
Come on; you can put a bit more effort in than that. "Look at ArrayPool; here's a link to a blog that might die in the future" is fine as a comment but low quality as an answer; this answer should contain a snippet of code derived from the blog but made relevant to the OP's situation – Darrick
