How to serialize a large collection

I'm working on a system that has lists and dictionaries with over five million items, where each item is typically a flat DTO with up to 90 primitive properties. The collections are persisted to disk using protobuf-net for resilience and subsequent processing.

Unsurprisingly, we're hitting the LOH during processing and serialization.

We can avoid the LOH during processing by using ConcurrentBag etc., but we still hit the problem when serializing.

Currently, the items in a collection are batched into groups of 1000 and serialized to memory streams in parallel. Each byte array is placed in a concurrent queue to be later written to a file stream.
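Roughly, the current pipeline looks like the sketch below (simplified; `ItemDto`, the batch size handling and the file path are placeholders standing in for the real 90-property DTO and production code):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using ProtoBuf;

[ProtoContract]
public class ItemDto          // placeholder; the real DTO has up to 90 primitive properties
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class BatchSerializer
{
    public static void Save(IEnumerable<ItemDto> items, string path)
    {
        // 1. Split the collection into batches of 1000.
        var batches = new List<List<ItemDto>>();
        var current = new List<ItemDto>(1000);
        foreach (var item in items)
        {
            current.Add(item);
            if (current.Count == 1000)
            {
                batches.Add(current);
                current = new List<ItemDto>(1000);
            }
        }
        if (current.Count > 0) batches.Add(current);

        // 2. Serialize each batch to a MemoryStream in parallel and queue the byte arrays.
        var queue = new ConcurrentQueue<byte[]>();
        Parallel.ForEach(batches, batch =>
        {
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, batch);
                queue.Enqueue(ms.ToArray()); // these buffers easily exceed 85 KB, so they land on the LOH
            }
        });

        // 3. Drain the queue into a single file stream.
        using (var file = File.Create(path))
        {
            byte[] buffer;
            while (queue.TryDequeue(out buffer))
                file.Write(buffer, 0, buffer.Length);
        }
    }
}
```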

While I understand what this is trying to do, it seems overly complicated. It feels like there should be something within protobuf itself that deals with huge collections without using the LOH.
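The closest thing I've found to built-in streaming is the length-prefix API - writing each item individually to the file stream and reading them back lazily, rather than serializing the whole collection as one object. A minimal sketch, reusing the placeholder `ItemDto` from above:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public static class StreamingSerializer
{
    // Write items one at a time, each with a length prefix, straight to the file stream,
    // so no single buffer ever has to hold the whole collection.
    public static void Save(IEnumerable<ItemDto> items, string path)
    {
        using (var file = File.Create(path))
        {
            foreach (var item in items)
                Serializer.SerializeWithLengthPrefix(file, item, PrefixStyle.Base128, 1);
        }
    }

    // Read the items back lazily, without materialising the whole collection in memory.
    public static IEnumerable<ItemDto> Load(string path)
    {
        using (var file = File.OpenRead(path))
        {
            foreach (var item in Serializer.DeserializeItems<ItemDto>(file, PrefixStyle.Base128, 1))
                yield return item;
        }
    }
}
```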

I'm hoping I've made a schoolboy error - that there are some settings I've overlooked. Otherwise, I'll be looking to write a custom binary reader/writer.

I should point out that we're using .NET 4.0 and looking to move to 4.5 soon, but I realise we won't overcome this issue despite the GC improvements.

Any help appreciated.

Bellied answered 13/9, 2013 at 14:4 Comment(10)
What is your problem with having many short-lived LOH objects?Pipkin
The size of the objects exceeds 85,000 bytes, so the memory isn't compacted - eventually leading to an OOM exception. This is a known problem with .NET which, although improved, is still present in 4.5. I want to avoid the LOH as much as possible, or at least keep well within the hard limitBellied
@Bellied this is almost certainly sub-object output buffering. In most cases this can be fixed by using the "group" data format (see the sketch after these comments). Do you have a concrete model I can look at?Sc
Unfortunately I can't show the exact model, but I've put a small test app on GitHub. There isn't much to see - a simple flat model (without using the "group" format, though) - and I serialize the whole bag.Bellied
@Bellied k - will take a peekSc
@Bellied Some info, please: 32-bit or 64-bit, and what steps should I take in the example program to trigger the OOM?Pipkin
Sorry, in the example program you'd need to repeatedly choose options BDX (on 64-bit) to get an OOM. But even before the exception, you can see the issue if you open up perfmon and monitor the LOH: the heap grows rapidly as soon as the bag is serialized, and clearing the bag doesn't compact the memory.Bellied
I've just noticed that while ConcurrentBag avoids the issue during writes, if you enumerate the whole of the bag then the heap grows dramatically. This suggests that this isn't a protobuf-net issue at allBellied
I don't think you have the correct designAucoin
Have you considered using a document database or object database to persist, so that the data you are actively processing in memory (at any given time) is only a subset of the full set? If there were no obvious counter-arguments, I would consider that before rolling my own binary serialization.Hans
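For reference, the "group" data format mentioned in the comments is applied per member; a minimal sketch of what that change might look like on a wrapper type (placeholder names, reusing the `ItemDto` stand-in rather than the asker's actual model):

```csharp
using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class ItemBatch
{
    // DataFormat.Group writes each sub-object as a start/end group pair instead of a
    // length-prefixed block, so protobuf-net does not have to buffer a whole sub-object
    // in memory just to compute its length before writing it out.
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<ItemDto> Items { get; set; }
}
```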

Write the data to disk, and do not use a memory stream.

Read using a StreamReader so you will not have to keep a large amount of that data in memory. If you need to load all the data at the same time to do processing, then do it in SQL Server by storing it in a temporary table.

Memory is not the place to store large data.
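If you take the SQL Server route, SqlBulkCopy into a temporary table is the usual way to stage the rows; a rough sketch (the connection string, table and column names are made up for illustration, and the DTO is the two-property placeholder from the question):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class SqlStaging
{
    // Bulk-copy the items into a temporary table in batches, so the full
    // five million rows never have to sit in a single managed buffer.
    public static void Stage(IEnumerable<ItemDto> items, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Create the temp table (only two columns here; the real DTO has ~90).
            using (var create = new SqlCommand(
                "CREATE TABLE #Items (Id INT NOT NULL, Value FLOAT NOT NULL)", connection))
            {
                create.ExecuteNonQuery();
            }

            var table = new DataTable();
            table.Columns.Add("Id", typeof(int));
            table.Columns.Add("Value", typeof(double));

            using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "#Items" })
            {
                foreach (var item in items)
                {
                    table.Rows.Add(item.Id, item.Value);
                    if (table.Rows.Count == 10000)   // flush in batches to keep memory flat
                    {
                        bulk.WriteToServer(table);
                        table.Clear();
                    }
                }
                if (table.Rows.Count > 0) bulk.WriteToServer(table);
            }

            // ...run the analysis with ordinary T-SQL against #Items here,
            // while the connection (and therefore the temp table) is still open...
        }
    }
}
```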

Aucoin answered 30/9, 2013 at 20:26 Comment(7)
So you're suggesting that to work with their numerical analysis dataset (which is what it sounds like from the description), they should do it in SQL Server and have cross-network issues, or store it on disk and add I/O time?Occultism
The best solution is to store your data in SQL Server, do all the analysis there, and just get the results. But if your data is not originally stored in SQL, it is better to have one process to store the data in SQL Server and another to process it and return the results. Can you tell us where you are getting your data from?Aucoin
If the data is not relational, I don't see how SQL Server is a good solution to propose for how to store and process it.Occultism
SQL Server works for all kinds of data, even if it is only stringsAucoin
"A flat DTO with 90 properties" is not an object that seems well-suited to SQL server. Relational databases are generally great for things they're great for. Claiming that the design is wrong because it doesn't have an SQL Server component is pretty aggressive given the amount of information we don't have about the design space Joe is working in.Occultism
"A flat DTO with 90 properties" that suitable for SQL server , he can create a table for them , keeping 5 million record in RAM is design issue in most casesAucoin
It depends entirely on the nature of the system being designed. Claiming the addition of a relational database is an intrinsically better design doesn't carry a lot of weight on its face - if the computations being performed aren't well-suited to the relational calculus it would in fact be a bad design decision to move into an RDBMS. A 4 to 10 GB working set in-memory approach does not seem unreasonable at all for numerical analyses where an SQL Server design would involve repeated table scans to perform the analysis.Occultism
