How to serialize a large collection

I'm working on a system that has lists and dictionaries with over five million items, where each item is typically a flat DTO with up to 90 primitive properties. The collections are persisted to disk using protobuf-net for resilience and subsequent processing.

Unsurprisingly, we're hitting the LOH during processing and serialization.

We can avoid the LOH during processing by using ConcurrentBag etc., but we still hit the problem when serializing.

Currently, the items in a collection are batched into groups of 1000 and serialized to memory streams in parallel. Each byte array is placed in a concurrent queue to be later written to a file stream.
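Roughly, the current pipeline looks like the sketch below (simplified; `ItemDto`, the batch size handling and the file path are placeholders standing in for the real 90-property DTO and production code):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using ProtoBuf;

[ProtoContract]
public class ItemDto          // placeholder; the real DTO has up to 90 primitive properties
{
    [ProtoMember(1)] public int Id { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class BatchSerializer
{
    public static void Save(IEnumerable<ItemDto> items, string path)
    {
        // 1. Split the collection into batches of 1000.
        var batches = new List<List<ItemDto>>();
        var current = new List<ItemDto>(1000);
        foreach (var item in items)
        {
            current.Add(item);
            if (current.Count == 1000)
            {
                batches.Add(current);
                current = new List<ItemDto>(1000);
            }
        }
        if (current.Count > 0) batches.Add(current);

        // 2. Serialize each batch to a MemoryStream in parallel and queue the byte arrays.
        var queue = new ConcurrentQueue<byte[]>();
        Parallel.ForEach(batches, batch =>
        {
            using (var ms = new MemoryStream())
            {
                Serializer.Serialize(ms, batch);
                queue.Enqueue(ms.ToArray()); // these buffers easily exceed 85 KB, so they land on the LOH
            }
        });

        // 3. Drain the queue into a single file stream.
        using (var file = File.Create(path))
        {
            byte[] buffer;
            while (queue.TryDequeue(out buffer))
                file.Write(buffer, 0, buffer.Length);
        }
    }
}
```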

While I understand what this is trying to do, it seems overly complicated. It feels like there should be something within protobuf itself that deals with huge collections without using the LOH.
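The closest thing I've found to built-in streaming is the length-prefix API - writing each item individually to the file stream and reading them back lazily, rather than serializing the whole collection as one object. A minimal sketch, reusing the placeholder `ItemDto` from above:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

public static class StreamingSerializer
{
    // Write items one at a time, each with a length prefix, straight to the file stream,
    // so no single buffer ever has to hold the whole collection.
    public static void Save(IEnumerable<ItemDto> items, string path)
    {
        using (var file = File.Create(path))
        {
            foreach (var item in items)
                Serializer.SerializeWithLengthPrefix(file, item, PrefixStyle.Base128, 1);
        }
    }

    // Read the items back lazily, without materialising the whole collection in memory.
    public static IEnumerable<ItemDto> Load(string path)
    {
        using (var file = File.OpenRead(path))
        {
            foreach (var item in Serializer.DeserializeItems<ItemDto>(file, PrefixStyle.Base128, 1))
                yield return item;
        }
    }
}
```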

I'm hoping I've made a schoolboy error - that there are some settings I've overlooked. Otherwise, I'll be looking to write a custom binary reader/writer.

I should point out that we're using .NET 4.0 and looking to move to 4.5 soon, but I realise we won't overcome this issue despite the GC improvements.

Any help appreciated.

Bellied answered 13/9, 2013 at 14:4 Comment(10)
What is your problem with having many short-lived LOH objects?Pipkin
The size of the objects exceeds 85,000 bytes, so the memory isn't compacted - eventually leading to an OOM exception. This is a known problem with .NET which, although improved, is still present in 4.5. I want to avoid the LOH as much as possible, or at least keep well within the hard limitBellied
@Bellied this is almost certainly sub-object output buffering. In most cases this can be fixed by using the "group" data format (see the sketch after these comments). Do you have a concrete model I can look at?Sc
Unfortunately I can't show the exact model, but I've put a small test app on GitHub. There isn't much to see - a simple flat model (without using the "group" format, though) - and I serialize the whole bag.Bellied
@Bellied k - will take a peekSc
@Bellied Some info, please: 32-bit or 64-bit, and what steps should I take in the example program to trigger the OOM?Pipkin
Sorry, in the example program you'd need to repeatedly choose options BDX (on 64-bit) to get an OOM. But even before the exception, you can see the issue if you open up perfmon and monitor the LOH: the heap grows rapidly as soon as the bag is serialized, and clearing the bag doesn't compact the memory.Bellied
I've just noticed that while ConcurrentBag avoids the issue during writes, if you enumerate the whole of the bag then the heap grows dramatically. This suggests that this isn't a protobuf-net issue at allBellied
I don't think you have the correct designAucoin
Have you considered using a document database or object database to persist, so that the data you are actively processing in memory (at any given time) is only a subset of the full set? If there were no obvious counter-arguments, I would consider that before rolling my own binary serialization.Hans
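For reference, the "group" data format mentioned in the comments is applied per member; a minimal sketch of what that change might look like on a wrapper type (placeholder names, reusing the `ItemDto` stand-in rather than the asker's actual model):

```csharp
using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class ItemBatch
{
    // DataFormat.Group writes each sub-object as a start/end group pair instead of a
    // length-prefixed block, so protobuf-net does not have to buffer a whole sub-object
    // in memory just to compute its length before writing it out.
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<ItemDto> Items { get; set; }
}
```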

Write the data to disk, and do not use a memory stream.

Read using a StreamReader so you will not have to keep a large amount of that data in memory. If you need to load all the data at the same time to do processing, then do it in SQL Server by storing it in a temporary table.

Memory is not the place to store large data.
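If you take the SQL Server route, SqlBulkCopy into a temporary table is the usual way to stage the rows; a rough sketch (the connection string, table and column names are made up for illustration, and the DTO is the two-property placeholder from the question):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class SqlStaging
{
    // Bulk-copy the items into a temporary table in batches, so the full
    // five million rows never have to sit in a single managed buffer.
    public static void Stage(IEnumerable<ItemDto> items, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Create the temp table (only two columns here; the real DTO has ~90).
            using (var create = new SqlCommand(
                "CREATE TABLE #Items (Id INT NOT NULL, Value FLOAT NOT NULL)", connection))
            {
                create.ExecuteNonQuery();
            }

            var table = new DataTable();
            table.Columns.Add("Id", typeof(int));
            table.Columns.Add("Value", typeof(double));

            using (var bulk = new SqlBulkCopy(connection) { DestinationTableName = "#Items" })
            {
                foreach (var item in items)
                {
                    table.Rows.Add(item.Id, item.Value);
                    if (table.Rows.Count == 10000)   // flush in batches to keep memory flat
                    {
                        bulk.WriteToServer(table);
                        table.Clear();
                    }
                }
                if (table.Rows.Count > 0) bulk.WriteToServer(table);
            }

            // ...run the analysis with ordinary T-SQL against #Items here,
            // while the connection (and therefore the temp table) is still open...
        }
    }
}
```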

Aucoin answered 30/9, 2013 at 20:26 Comment(7)
So you're suggesting that to work with their numerical analysis dataset (which is what it sounds like from the description), they should do it in SQL Server and have cross-network issues, or store it on disk and add I/O time?Occultism
The best solution is to store your data in SQL Server, do all the analysis there, and just get the results. But if your data is not originally stored in SQL, it is better to have one process to store the data in SQL Server and another to process it and return the results. Can you tell us where you are getting your data from?Aucoin
If the data is not relational, I don't see how SQL Server is a good solution to propose for how to store and process it.Occultism
SQL Server works for all kinds of data, even if it is only stringsAucoin
"A flat DTO with 90 properties" is not an object that seems well-suited to SQL server. Relational databases are generally great for things they're great for. Claiming that the design is wrong because it doesn't have an SQL Server component is pretty aggressive given the amount of information we don't have about the design space Joe is working in.Occultism
"A flat DTO with 90 properties" that suitable for SQL server , he can create a table for them , keeping 5 million record in RAM is design issue in most casesAucoin
It depends entirely on the nature of the system being designed. Claiming the addition of a relational database is an intrinsically better design doesn't carry a lot of weight on its face - if the computations being performed aren't well-suited to the relational calculus it would in fact be a bad design decision to move into an RDBMS. A 4 to 10 GB working set in-memory approach does not seem unreasonable at all for numerical analyses where an SQL Server design would involve repeated table scans to perform the analysis.Occultism
