Handling very large text files using TStreamReader and TStringList

I am using Embarcadero's RAD Studio Delphi 10.2.3 and have encountered a memory issue while reading in very large text files (7 million+ lines, every line is different, lines can be 1 to ~200 characters long, etc.). I am fairly new to Delphi programming, so I have scoured SO and Google for help before posting.

I originally used a TStringList and read the file with its LoadFromFile method, but this failed spectacularly once the text files being processed became large enough. I then switched to a TStreamReader and used ReadLine to populate the TStringList, following the basic code found here:

TStringList.LoadFromFile - Exceptions with Large Text Files

Code Example:

//MyStringList.LoadFromFile(filename);
Reader := TStreamReader.Create(filename, true);
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;

This worked great until the files I needed to process became huge (~7 million lines or more). It appears that the TStringList grows so large that the process runs out of memory. I say "appears" because I don't actually have access to the file that is being run, and all error information is provided by my customer through email, which makes this problem even more difficult as I can't simply debug it in the IDE.

The code is compiled 32-bit and I am unable to use the 64-bit compiler. I can't include a database system or the like, either. Unfortunately, I have some tight restrictions. I need to load in every line to look for patterns and compare those lines to other lines to look for "patterns within patterns." I apologize for being very vague here.

The bottom line is this: is there a way to access every line in the text file without using a TStringList, or perhaps a better way to manage the TStringList's memory?

Maybe there is a way to load a specific block of lines from the StreamReader into the TStringList (e.g., read in and process the first 100,000 lines, then the next 100,000 lines, etc.) instead of everything at once? I think I could then write something to handle the possible "inter-block" patterns.

Thanks in advance for any and all help and suggestions!

***** EDITED WITH UPDATE *****

Ok, here is the basic solution that I need to implement:

var
  filename: string;
  sr: TStreamReader;
  sl: TStringList;
  total, blocksize: integer;
begin
  filename := 'thefilenamegoeshere';
  total := 0;         // Total number of lines read from the file so far
  blocksize := 10000; // The number of lines per "block"
  sl := TStringList.Create;
  try
    sr := TStreamReader.Create(filename, true);
    try
      sl.BeginUpdate;
      try
        while not sr.EndOfStream do
          begin
            sl.Clear;
            sl.Capacity := blocksize; // Clear resets Capacity, so reserve room for one block
            while (sl.Count < blocksize) and not sr.EndOfStream do
              begin
                sl.Add(sr.ReadLine);
                total := total + 1;
              end;
            // Handle the current block of lines here
          end;
      finally
        sl.EndUpdate;
      end;
    finally
      sr.Free;
    end;
  finally
    sl.Free;
  end;
end;

I have some test code that I will use to refine my routines, but this seems to be relatively fast, efficient, and sufficient. I want to thank everyone for their responses that got my gray matter firing!

Exist asked 17/10/2018 at 14:04. Comments (12):
Pah: Perhaps try IMAGE_FILE_LARGE_ADDRESS_AWARE? docwiki.embarcadero.com/RADStudio/Tokyo/en/…
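For reference, that flag is enabled with a compiler directive in the project (.dpr) file. A minimal sketch, with a hypothetical project name:

program MyLargeFileTool;  // hypothetical project name; the directive belongs in the .dpr

uses
  Winapi.Windows;  // declares the IMAGE_FILE_LARGE_ADDRESS_AWARE constant

// Mark the 32-bit EXE as large-address-aware so it can use up to 4 GB of
// address space when run on 64-bit Windows.
{$SETPEFLAGS IMAGE_FILE_LARGE_ADDRESS_AWARE}

begin
  // ... application startup code ...
end.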
Polyphyletic: If you don't want to make use of a 64-bit address space, then you'll need to redesign your code to avoid having to load the entire file into memory. It's that simple. Exactly how you do that will depend on the details that you have, but we do not. But it's hard to see past the fact that if the data can't fit in memory at once, then you need to avoid trying to fit it into memory at once.
Exist: @VilleKrumlinde - Thanks for the suggestion. I have set the LARGE_ADDRESS flag in my software, though I don't know if the customer has done everything on their end. They are running Windows 10 Pro 64-bit, though even that is sometimes through an emulator on a Linux box (depending on the particular user). And the users don't have Admin privileges, making all this even more complicated.
Exist: @DavidHeffernan - Yes, that's the true, common-sense answer! Since I can't avoid the memory limit at the moment, I have to come up with a workaround, thus my question(s) above. Is there a way to load blocks of the text file from the StreamReader (say, lines 1 - 100,000, then lines 100,001 - 200,000, etc.) directly? I didn't see any way to specify which line to start reading from. Thanks, David! I also used some code you had posted about executing external processes from the command line (WaitUntilSignaled and ExecuteProcess). Invaluable to me...thank you!!!
Polyphyletic: You just read the first N lines, deal with them. Then read the next N lines, deal with them, and so on. You can use a variable to count how many lines you have read.
Exist: @DavidHeffernan - Ha, I had my epiphany while you were replying. The stream knows where it left off after each ReadLine. I just need to come up with a modified loop routine that reads the file in blocks (outer loop) while ensuring that it stops at the EOF (inner loop), I think. After each block of lines is read in, I need to process it and store anything that might relate to inter-block patterns. Then I can clear the TStringList and start over. Thank you!!!
Lumper: @RicCrooks What processing do you do on these text lines? Perhaps you don't even need to use a string list at all.
Dilatory: @RicCrooks As a small point, your solution screams for an anonymous method to be inserted where you wrote "// Handle the current block of lines here". Keep your methods cohesive: one method should deal with breaking a large file into blocks according to some scheme, and the other should deal with processing the block. It will result in more maintainable code, and it will also be easier to swap in a new file-processing methodology in the future.
Exist: @Lumper - The file structure is a series of headers, followed by one or more "blocks" of text that include sub-headers and data. I have to read in each block sequentially, examine its sub-headers for key items or values, determine whether it is important based on that examination, and then either store it or discard it. The ones that are stored then have certain strings extracted, manipulated, and returned to the StringList. Finally, I add additional information to each line (simply appending more to each StringList record) before writing those blocks back out.
Exist: @DaveNovo - I thought about that after I wrote it. My first pass was a large "sub-process" within the method, though I will likely extract and separate everything into two distinct methods once the rewrite passes its initial tests. You are absolutely right! Thank you for keeping me straight!
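A minimal sketch of that split, assuming a hypothetical ProcessFileInBlocks helper that takes the per-block processing as an anonymous method (all names here are illustrative, not part of any library):

uses
  System.Classes, System.SysUtils;

// Hypothetical helper: reads the file in blocks of BlockSize lines and
// hands each block to the supplied callback for processing.
procedure ProcessFileInBlocks(const FileName: string; BlockSize: Integer;
  const HandleBlock: TProc<TStringList>);
var
  sr: TStreamReader;
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    sr := TStreamReader.Create(FileName, True);
    try
      while not sr.EndOfStream do
      begin
        sl.Clear;
        while (sl.Count < BlockSize) and not sr.EndOfStream do
          sl.Add(sr.ReadLine);
        HandleBlock(sl);  // process one block at a time
      end;
    finally
      sr.Free;
    end;
  finally
    sl.Free;
  end;
end;

// Usage: the block handler stays a separate, cohesive piece of code.
// ProcessFileInBlocks('thefilenamegoeshere', 10000,
//   procedure(Block: TStringList)
//   begin
//     // examine sub-headers, extract strings, etc.
//   end);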
Exist: @All - I just wanted to pass along another update. The methodology above has successfully handled the larger files that caused the previous memory errors. It hasn't been tested against the huge files yet (10+ GB), but the 1-2 GB files were read in and processed quickly. Thank you all again for your help!
Lumper: @RicCrooks Adjusting your code to read one block of data at a time is the right course of action, since your memory requirement for the StringList now drops to the size of an individual block. But I'm still wondering whether you need to use a StringList at all. Maybe you could write your code so that it does all the needed processing while reading data one line at a time directly from the file. But this depends on how you process each main header block.
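If the keep-or-discard decisions really can be made as the data is read, the TStringList can be dropped entirely. A minimal sketch of that line-at-a-time approach, with hypothetical IsInteresting and Transform routines standing in for the real sub-header checks and string edits:

uses
  System.Classes, System.SysUtils;

// Hypothetical stand-ins for the real sub-header checks and string edits.
function IsInteresting(const Line: string): Boolean;
begin
  Result := Line.StartsWith('HDR');  // placeholder test
end;

function Transform(const Line: string): string;
begin
  Result := Line + ';processed';     // placeholder edit
end;

procedure ProcessFileLineByLine(const InFile, OutFile: string);
var
  sr: TStreamReader;
  sw: TStreamWriter;
  line: string;
begin
  sr := TStreamReader.Create(InFile, True);
  try
    sw := TStreamWriter.Create(OutFile, False, TEncoding.UTF8);
    try
      while not sr.EndOfStream do
      begin
        line := sr.ReadLine;
        // Only lines that pass the test are transformed and written out,
        // so memory use stays flat regardless of file size.
        if IsInteresting(line) then
          sw.WriteLine(Transform(line));
      end;
    finally
      sw.Free;
    end;
  finally
    sr.Free;
  end;
end;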

As a (very) quick fix, you can try TALStringList (just replace TStringList with TALStringList in your code) from https://github.com/Zeus64/alcinoe. It's not a very clean way to go, but TALStringList keeps its text in UTF-8, roughly halving the memory used compared to the default UTF-16 strings. Since you have about 7,000,000 lines of around 100 characters each, that means roughly 700 MB, which can work in a 32-bit process.

Pycnidium answered 17/10/2018 at 16:18. Comment (1):
Exist: Thanks for the suggestion! My concern is that this solution, as you pointed out, is a quick fix. I know that there are even larger files (likely ~20 million lines) coming down the pipe, so I will need to find a permanent solution relatively soon. I think I have a fix (edited my original post) that I plan on using. Thanks again, though!!!
