TStringList.LoadFromFile - Exceptions with Large Text Files
I'm running Delphi RAD Studio XE2.

I have some very large files, each containing a large number of lines. The lines themselves are small - just 3 tab-separated doubles. I want to load a file into a TStringList using TStringList.LoadFromFile, but this raises an exception with large files.

For files of 2 million lines (approximately 1GB) I get the EIntOverflow exception. For larger files (20 million lines and approximately 10GB, for example) I get the ERangeCheck exception.

I have 32GB of RAM to play with and am just trying to load this file and use it quickly. What's going on here, and what other options do I have? Could I use a file stream with a large buffer to load this file into a TStringList? If so, could you please provide an example?

Spirited answered 19/11, 2014 at 2:28 Comment(4)
I just have to wonder, why are you loading 20 million lines of text? You may have better luck using a TFileStream. - Peculiarity
Do you have an example showing how to use TFileStream to read lines of a text file into a TStringList? - Spirited
I would prefer to store the file's lines in a table in a database. Manipulation will then be much faster than using T*List descendants. So the question is: what do you intend to do with the data? - Hear
Simply put, the real solution is to stop trying to load the entire file into memory. - Incorporate

When Delphi switched to Unicode in Delphi 2009, the TStrings.LoadFromStream() method (which TStrings.LoadFromFile() calls internally) became very inefficient for large streams/files.

Internally, LoadFromStream() reads the entire file into memory as a TBytes, then converts that to a UnicodeString using TEncoding.GetString() (which decodes the bytes into a TCharArray, copies that into the final UnicodeString, and then frees the array), then parses the UnicodeString (while the TBytes is still in memory) adding substrings into the list as needed.

So, just prior to LoadFromStream() exiting, there are four copies of the file data in memory - three copies taking up at worst filesize * 3 bytes of memory (where each copy is using its own contiguous memory block + some MemoryMgr overhead), and one copy for the parsed substrings! Granted, the first three copies are freed when LoadFromStream() actually exits. But this explains why you are getting memory errors before reaching that point - LoadFromStream() is trying to use 3-4 GB of memory to load a 1GB file, and the RTL's memory manager cannot handle that.

If you want to load the content of a large file into a TStringList, you are better off using TStreamReader instead of LoadFromFile(). TStreamReader uses a buffered file I/O approach to read the file in small chunks. Simply call its ReadLine() method in a loop, Add()'ing each line to the TStringList. For example:

//MyStringList.LoadFromFile(filename);
Reader := TStreamReader.Create(filename, True); // True = detect a BOM to determine the file's encoding
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;

Maybe some day, LoadFromStream() might be re-written to use TStreamReader internally like this.

Anandrous answered 19/11, 2014 at 2:59 Comment(13)
And if you know how many lines there are, use sl.Capacity := KnownValue; to prevent multiple calls to ReallocMem(). - Farsighted
TStringList does not call ReallocMem() on every Add(); it grows its memory in exponential capacities. - Anandrous
Memory is reallocated only when the current Count is at Capacity when adding a new string. The Capacity grows (in items; the byte count would be Capacity*SizeOf(TStringItem) plus a little MemoryMgr overhead) as follows: 0, 4, 8, 12, 28, 44, 60, 76, 95, 118, 147, 183, 228, 285, 356, 445, 556, ... - Anandrous
Even if you don't know exactly how many list items there are/will be, huge performance gains can be had by pre-setting Capacity to a representatively large number (a best guess, if you can) and then setting it to the actual count when the items have finished loading, to reclaim any 'waste'. In this case, a good guesstimate at the required capacity could be made, given that the format of each line in the file is known (3 tab-delimited doubles): capacity := file size / average line length - Glimpse
@RemyLebeau Thanks for this. I'm testing it now, and it solves my problem (at least for 5GB files). How can I tweak it to improve the performance? Is your solution using a default buffer size? How do I alter the buffer size? Furthermore, in some cases (not all) I know the number of lines and the format of each line. - Spirited
@RemyLebeau - I never said it did grow each time. It grows by 25% when it reaches capacity. Older versions used ReallocMem, newer ones use SetLength, but with a delta of current capacity / 4. - Farsighted
@Trojanian: TStreamReader uses a 4KB buffer by default, but you can specify a different buffer size in the constructor. And there are plenty of third-party buffered I/O TFileStream implementations floating around. - Anandrous
@RemyLebeau: The overloaded constructor I need is then System.Classes.TStreamReader.Create(const Filename: string; Encoding: TEncoding; DetectBOM: Boolean = False; BufferSize: Integer = 1024). What is DetectBOM? - Spirited
@GerryColl: How do I pre-set the capacity within the given answer example code? - Spirited
@Trojanian: Yes, that would be the constructor to use. DetectBOM tells the reader whether it can look at the beginning of the file to see if there is a BOM specifying the encoding of the data. Otherwise, you have to specify an encoding in the Encoding parameter. Since you are loading a text file, and TStreamReader (and TStringList) operate on Unicode strings, the reader needs to know what the file encoding is so it can decode the text to Unicode while reading. - Anandrous
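Following the constructor discussion above, a sketch of creating the reader with an explicit fallback encoding and a larger buffer (the UTF-8 fallback and the 1 MB size are illustrative assumptions, not recommendations):

// BOM detection enabled; falls back to UTF-8 if no BOM is found; 1 MB read buffer
Reader := TStreamReader.Create(filename, TEncoding.UTF8, True, 1024 * 1024);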
@Trojanian: Deltics told you how to pre-set the capacity: capacity := file size / average line length. For example: MyStringList.Capacity := Reader.BaseStream.Size div AverageLineLength; You have to provide a value for AverageLineLength based on what your data actually looks like. - Anandrous
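Putting that together with the answer's loop, a sketch with a pre-set Capacity (the average line length of 24 bytes is just a guess for 3 tab-separated doubles; measure your own data to pick a better value):

Reader := TStreamReader.Create(filename, True);
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    // Estimate the line count from the file size and a guessed average line length
    MyStringList.Capacity := Reader.BaseStream.Size div 24;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
    MyStringList.Capacity := MyStringList.Count; // reclaim any over-allocated 'waste'
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;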
@RemyLebeau: Thanks - very coherent. I learned from this post. :-) - Spirited
FWIW, the stream reader is appallingly inefficient. Every time you consume something, the remainder of the buffer is moved down with TStringBuilder.Remove. This even ends up reallocating the buffer to reduce its capacity. Stream reader performance gets worse as the buffer size is increased. I cannot believe how bad the implementation is. - Incorporate

© 2022 - 2024 — McMap. All rights reserved.