RegEx, StringBuilder and Large Object Heap Fragmentation

How can I run lots of RegExes (to find matches) in big strings without causing LOH fragmentation?

It's .NET Framework 4.0, so I'm using StringBuilder, whose chunked storage keeps the text out of the LOH. However, as soon as I need to run a RegEx on it I have to call StringBuilder.ToString(), which means the resulting string ends up on the LOH.

Is there any solution to this problem? It's virtually impossible to have a long-running application that deals with big strings and RegExes like this.
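
A minimal sketch of the pattern I'm describing (the reader and regex here are just placeholders, not my actual code): the StringBuilder stays off the LOH thanks to its chunked storage, but the ToString() call needed for RegEx produces one contiguous string that lands on the LOH.

using System.IO;
using System.Text;
using System.Text.RegularExpressions;

internal static class LohRepro
{
    // Illustrative only: the builder's chunks each stay under the 85,000 byte
    // threshold, but Regex needs a System.String, so ToString() allocates one
    // big contiguous string directly on the Large Object Heap.
    public static MatchCollection FindMatches(TextReader reader, Regex regex)
    {
        var sb = new StringBuilder();

        var buffer = new char[8192];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            sb.Append(buffer, 0, read);   // no single allocation exceeds 85 KB
        }

        string text = sb.ToString();      // > 85,000 bytes => LOH allocation
        return regex.Matches(text);
    }
}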

An idea to solve this problem:

While thinking about this problem, I think I found a dirty solution.

At any given time I only have 5 strings, and these 5 strings (each bigger than 85 KB) will be passed to RegEx.Match.

Since the fragmentation occurs because new objects won't fit into the empty spaces in the LOH, this should solve the problem:

  1. PadRight all strings to a maximum accepted size, let's say 1024 KB (I might need to do this with StringBuilder) - a sketch follows this list
  2. By doing so, all new strings will fit into already freed memory, since the previous string is already out of scope
  3. There won't be any fragmentation because the object size is always the same; I'll only allocate 1024 KB * 5 at any given time, and that space in the LOH will be shared between these strings.
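
A minimal sketch of point 1, assuming a fixed slot of 1024 KB (512K characters at 2 bytes each); the SlotSizeChars constant and MatchPadded helper are made-up names:

using System;
using System.Text.RegularExpressions;

internal static class PaddedMatcher
{
    // 1024 KB slot = 524,288 chars (2 bytes per char), so every allocation has
    // exactly the same size and can drop back into a previously freed LOH gap.
    private const int SlotSizeChars = 1024 * 1024 / 2;

    public static Match MatchPadded(string bigText, Regex regex)
    {
        if (bigText.Length > SlotSizeChars)
            throw new ArgumentException("Text exceeds the maximum accepted size.");

        // PadRight always produces a new string of exactly SlotSizeChars characters.
        string padded = bigText.PadRight(SlotSizeChars);
        return regex.Match(padded);
    }
}

One caveat: the trailing spaces become part of the searched text, so patterns anchored at the end or matching whitespace would need extra care.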

I suppose the biggest problem with this design is what happens if other big objects claim these locations in the LOH, which would force the application to allocate lots of new 1024 KB strings, maybe with even worse fragmentation. The fixed statement might help, but how can I send a fixed string to RegEx without actually creating a new string that isn't located at a fixed memory address?

Any ideas about this theory? (Unfortunately I can't reproduce the problem easily; I'm generally using a memory profiler to observe the changes, and I'm not sure what kind of isolated test case I could write for this.)

Insufficiency answered 5/11, 2011 at 13:45 Comment(6)
Are you certain that the large object heap is becoming fragmented? I do a lot of work with large (several hundred kilobytes) strings, and I've never run into an LOH fragmentation problem.Spume
Yes, I'm sure. The application needs to be memory-hungry and long-running to see the actual impact. If you do memory profiling you might see it's affecting you, just not enough to crash your app.Insufficiency
Yeah, it's easy. A fat hundred bucks buys you a 64-bit operating system. No kind of programming effort can match that.Fervent
@Hans I guess you are a server-side programmer :) It's a tool that's being shipped to computers that I don't have control over. Do you want me to tell my users that they have to use x64? Because it's just $100? Or maybe just restrict the customer base to x64 Win 7 users? That's easier, isn't it?Insufficiency
Any software vendor publishes prerequisites. Need Windows, can't use an Apple, must be able to run .NET, yada disk space, yada RAM, etcetera. This is no different, there's nothing special about x64. Default choice for Dell for the past 3 years, doesn't cost extra.Fervent
@HansPassant I wish satisfying business requirements were that easy :)Insufficiency

OK, here is my attempt to solve this problem in a fairly generic way, but with some obvious limitations. Since I haven't seen this advice anywhere and everyone is whining about LOH fragmentation, I wanted to share the code to confirm that my design and assumptions are correct.

Theory:

  1. Create a shared massive StringBuilder (this is to store the big strings that we read from streams) - new StringBuilder(ChunkSize * 5);
  2. Create a massive string (it has to be bigger than the maximum accepted size), initialized with empty space - new string(' ', ChunkSize * 10);
  3. Pin the string object in memory so the GC will not move it: GCHandle.Alloc(pinnedText, GCHandleType.Pinned). Even though LOH objects are effectively never moved anyway, this seems to improve performance, maybe because of the unsafe code
  4. Read the stream into the shared StringBuilder and then unsafe-copy it into pinnedText using pointer indexing
  5. Pass the pinnedText to RegEx

With this implementation the code below works just as if there were no LOH fragmentation. If I switch to allocating a fresh new string(' ', ...) each time instead of reusing the static StringBuilder, or use StringBuilder.ToString(), the code can allocate only about a third as much memory before crashing with an OutOfMemoryException.

I also confirmed with a memory profiler that there is no LOH fragmentation in this implementation. I still don't understand why RegEx doesn't cause any unexpected problems. I also tested with different and expensive RegEx patterns and the results are the same: no fragmentation.

Code:

http://pastebin.com/ZuuBUXk3

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.RegularExpressions;

namespace LOH_RegEx
{
    internal class Program
    {
        private static List<string> storage = new List<string>();
        private const int ChunkSize = 100000;
        private static StringBuilder _sb = new StringBuilder(ChunkSize * 5);


        private static void Main(string[] args)
        {
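            // Shared, oversized buffer: pinned once and reused for every RegEx call.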
            var pinnedText = new string(' ', ChunkSize * 10);
            var sourceCodePin = GCHandle.Alloc(pinnedText, GCHandleType.Pinned);

            var rgx = new Regex("A", RegexOptions.CultureInvariant | RegexOptions.Compiled);

            try
            {

                for (var i = 0; i < 30000; i++)
                {                   
                    //Simulate that we read data from stream to SB
                    UpdateSB(i);
                    CopyInto(pinnedText);                   
                    var rgxMatch = rgx.Match(pinnedText);

                    if (!rgxMatch.Success)
                    {
                        Console.WriteLine("RegEx failed!");
                        Console.ReadLine();
                    }

                    //Extra buffer to fragment LoH
                    storage.Add(new string('z', 50000));
                    if ((i%100) == 0)
                    {
                        Console.Write(i + ",");
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                Console.WriteLine("OOM Crash!");
                Console.ReadLine();
            }
            finally
            {
                // Release the pin so the GC can treat the string normally again.
                sourceCodePin.Free();
            }
        }


        private static unsafe void CopyInto(string text)
        {
            fixed (char* pChar = text)
            {
                int i;
                for (i = 0; i < _sb.Length; i++)
                {
                    pChar[i] = _sb[i];
                }

                // Terminate right after the copied data (writing to pChar[i + 1]
                // would leave a stale character at pChar[i]). Note that RegEx still
                // scans the pinned string's full Length; the '\0' only marks where
                // the fresh data ends.
                pChar[i] = '\0';
            }
        }

        private static void UpdateSB(int extraSize)
        {
            _sb.Remove(0,_sb.Length);

            var rnd = new Random();
            for (var i = 0; i < ChunkSize + extraSize; i++)
            {
                _sb.Append((char)rnd.Next(60, 80));
            }
        }
    }
}
Insufficiency answered 6/11, 2011 at 11:21 Comment(0)

Could you do your work in a separate AppDomain that is unloaded at certain points in time?
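
A rough sketch of what that could look like, assuming the worker loads the big text itself (e.g. from a file) so only small values cross the domain boundary; RegexWorker and CountMatches are made-up names:

using System;
using System.IO;
using System.Text.RegularExpressions;

// Runs inside the worker AppDomain; the large string is allocated there.
public class RegexWorker : MarshalByRefObject
{
    public int CountMatches(string path, string pattern)
    {
        string bigText = File.ReadAllText(path);   // LOH allocation lives in the worker domain
        return Regex.Matches(bigText, pattern).Count;
    }
}

public static class DomainRecycler
{
    public static int CountMatchesInRecycledDomain(string path, string pattern)
    {
        AppDomain worker = AppDomain.CreateDomain("RegexWorker");
        try
        {
            var proxy = (RegexWorker)worker.CreateInstanceAndUnwrap(
                typeof(RegexWorker).Assembly.FullName,
                typeof(RegexWorker).FullName);

            return proxy.CountMatches(path, pattern);
        }
        finally
        {
            // Unloading throws away the worker's heaps, fragmentation included.
            AppDomain.Unload(worker);
        }
    }
}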

Outsmart answered 5/11, 2011 at 19:19 Comment(1)
The thing is you still need to share the results unless you use a shared storage and read the data as stream directly from memory or a file, you still have the very same problem. Because if you use remoting any way then you'll again create big arrays or strings that will go to LOH and now in both of the appdomains. Sharing a memory, memory mapped file etc. is a solution indeed but it's really really complicated in a big application and got quite a bit performance hit.Insufficiency

One alternative would be to find some way of performing regex matches on a non-array-based data structure. Unfortunately, a quick Google search didn't bring up much in terms of stream-based regex libraries. I would guess that the regex algorithm needs to do a lot of backtracking, which isn't supported by streams.

Do you absolutely require the full power of regular expressions? Could you perhaps implement your own simpler search functions that could work on linked lists of strings, each under 85 KB?
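
For a literal search (no regex features), something along these lines could work on a list of sub-85 KB chunks; the ChunkedSearch class and the overlap handling are only a sketch:

using System;
using System.Collections.Generic;

internal static class ChunkedSearch
{
    // Finds a literal needle in text stored as small chunks. An overlap of
    // (needle.Length - 1) characters is carried between chunks so matches that
    // span a chunk boundary are not missed.
    public static bool Contains(IEnumerable<string> chunks, string needle)
    {
        string carry = string.Empty;

        foreach (string chunk in chunks)
        {
            string window = carry + chunk;
            if (window.IndexOf(needle, StringComparison.Ordinal) >= 0)
                return true;

            int overlap = Math.Min(needle.Length - 1, window.Length);
            carry = window.Substring(window.Length - overlap);
        }

        return false;
    }
}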

Also, LOH fragmentation only really causes issues if you hold on to the large object references for long periods. If you're constantly creating and destroying them, the LOH shouldn't grow.

FWIW, I've found the RedGate ANTS memory profiler very good at tracking down objects in the LOH and levels of fragmentation.

Dichloride answered 6/11, 2011 at 14:25 Comment(7)
"LOH fragmentation only really causes issues if you hold on to the large object references for long periods" AFAIK this is not correct, anything bigger than 85KB will be located in LOH regardless of time they held. I'm using ANTS Profiler, indeed it's pretty good.Insufficiency
Yes, I need the full power of RegEx.Insufficiency
Sorry, I meant that I've only seen problems with the LOH when references are being held on to for long periods. You are correct that anything over 85k goes into the LOH. Just so that I understand, is your problem that you have other longer-term objects being allocated in between your strings in the LOH, causing your string allocations to be pushed further up the LOH until you run out of memory?Dichloride
If so, could you stop these other objects going into the LOH? The LOH is generally quite large, so if the only things being allocated there were your strings, and they all get de-allocated at the same rate then I wouldn't expect issues.Dichloride
If you read about LOH fragmentation, you'll see it's a known problem. That's what my problem is.Insufficiency
Yes, I understand what LOH fragmentation means. I'm just trying to get an idea of what's causing it for you. If nothing but your 5 strings are getting allocated on the LOH, they are of vaguely similar sizes, and the references go out of scope at the same rate that they are created, then you should not experience fragmentation. However, if there are other objects getting allocated between your strings then it's a different issue. So, to reiterate, do you know whether there are other objects (aside from your 5 strings) getting allocated on the LOH?Dichloride
I guess I wasn't clear. Yes, I do have 5 strings, but the function they are assigned in is called 10,000+ times, which means they are collected by the GC about 10,000 * 5 times, hence the fragmentation problem.Insufficiency
