Function to extract plain text from RTF file gives wrong result

In a 32-bit VCL Application in Windows 10 in Delphi 11 Alexandria, I need to search for text in an RTF file. So I use this function (found here) to extract the plain text from the RTF file:

function RtfToText(const RTF_FilePath: string; ReplaceLineFeedWithSpace: Boolean): string;
var
  RTFConverter: TRichEdit;
  MyStringStream: TStringStream;
begin
  RTFConverter := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    MyStringStream := TStringStream.Create(RTF_FilePath);
    try
      RTFConverter.Lines.LoadFromStream(MyStringStream);
      RTFConverter.PlainText := True;
      RTFConverter.Lines.StrictDelimiter := True;
      if ReplaceLineFeedWithSpace then
        RTFConverter.Lines.Delimiter := ' '
      else
        RTFConverter.Lines.Delimiter := #13;
      Result := RTFConverter.Lines.DelimitedText;
    finally
      MyStringStream.Free;
    end;
  finally
    RTFConverter.Free;
  end;
end;

However, instead of the RTF file's plain text content, the function gives back the file path of the RTF file!

What is wrong with this function, and how can I efficiently extract the plain text from an RTF file without having to use a parented TRichEdit control?

Kopaz asked 30/1, 2022 at 11:57 Comment(2)
The control is capable of searching text. See EM_FINDTEXT[EX]. - Bracci
@SertacAkyuz TRichEdit has a FindText() method. - Nonlinearity
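
Since the stated goal is to search for text, the FindText() method mentioned in the comments can be called on the loaded control directly, without extracting the plain text first. The following is only a minimal sketch: the function name RtfFileContainsText and its parameters are hypothetical, and it assumes a message-only parented TRichEdit behaves the same way as in the question.

```pascal
uses
  Winapi.Windows, Vcl.ComCtrls;

// Hypothetical helper: returns True if SearchText occurs in the RTF file.
function RtfFileContainsText(const RtfFilePath, SearchText: string): Boolean;
var
  RichEdit: TRichEdit;
begin
  RichEdit := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    RichEdit.PlainText := False; // parse the file as RTF, not as raw text
    RichEdit.Lines.LoadFromFile(RtfFilePath);
    // FindText returns the character index of the first match, or -1.
    // An empty option set means a case-insensitive, partial-word search.
    Result := RichEdit.FindText(SearchText, 0, RichEdit.GetTextLen, []) >= 0;
  finally
    RichEdit.Free;
  end;
end;
```

Passing [stMatchCase] or [stWholeWord] as the last argument narrows the search accordingly.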

The TStringStream constructor does not load a file, as you are expecting it to. TStringStream is not TFileStream. As its name suggests, TStringStream is a stream wrapper for a string, so its constructor takes a string and stores that string's characters in the stream. Thus, you are loading the RichEdit with the value of the file path string itself, not the content of the file that the string refers to.
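
A small console sketch can make this visible. The helper name StringStreamContent and the sample path are made up for illustration; no file needs to exist at that path, because the stream never touches the file system.

```pascal
program StringStreamDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: returns whatever the stream holds right after construction.
function StringStreamContent(const Value: string): string;
var
  Stream: TStringStream;
begin
  Stream := TStringStream.Create(Value); // wraps the string; opens no file
  try
    Result := Stream.DataString;
  finally
    Stream.Free;
  end;
end;

begin
  // Prints the path itself, which is what the RichEdit received in the question.
  Writeln(StringStreamContent('C:\docs\sample.rtf'));
end.
```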

You don't actually need the TStringStream at all, as the TRichEdit can load the file directly, e.g.:

function RtfToText(const RTF_FilePath: string; ReplaceLineFeedWithSpace: Boolean): string;
var
  RTFConverter: TRichEdit;
begin
  RTFConverter := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    RTFConverter.PlainText := False; 
    RTFConverter.Lines.LoadFromFile(RTF_FilePath);
    RTFConverter.PlainText := True;
    RTFConverter.Lines.StrictDelimiter := True;
    if ReplaceLineFeedWithSpace then
      RTFConverter.Lines.Delimiter := ' '
    else
      RTFConverter.Lines.Delimiter := #13;
    Result := RTFConverter.Lines.DelimitedText;
  finally
    RTFConverter.Free;
  end;
end;

That being said, there is nothing outside of TRichEdit in the native RTL or VCL that will parse RTF into plain text for you. If you don't want to use TRichEdit, you will have to either parse the RTF yourself or find a third-party parser to use.

Nonlinearity answered 30/1, 2022 at 12:44 Comment(4)
Isn't loading the RTF file into a TStringStream more efficient than loading it into TRichEdit? - Kopaz
I've made a test and found the version using TStringStream slightly more efficient (faster). - Kopaz
@Kopaz I would not say it is more efficient. At best, it might be faster, since it loads a copy of the entire file into memory, and then the RichEdit's parser is not being slowed down by file I/O. But the downside is that it loads the entire file into memory, whereas LoadFromFile() reads the file in smaller chunks. In any case, since you are not using the TStringStream.DataString property, TStringStream is basically just TMemoryStream, so you may as well use TMemoryStream itself. - Nonlinearity
@Kopaz if you really want better efficiency, find (or write) a TStream that wraps a memory-mapped file, then you are not wasting overhead allocating any memory at all for reading the file. - Nonlinearity
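
The TMemoryStream suggestion from the comments could look roughly like the sketch below. The function name RtfToTextViaMemoryStream is hypothetical, and returning the control's Text property instead of Lines.DelimitedText is a simplification made here, not something taken from the answers.

```pascal
uses
  Winapi.Windows, System.Classes, Vcl.ComCtrls;

// Hypothetical variant: buffers the whole file in memory, then lets the
// RichEdit parse the RTF from that buffer.
function RtfToTextViaMemoryStream(const RtfFilePath: string): string;
var
  RichEdit: TRichEdit;
  Buffer: TMemoryStream;
begin
  RichEdit := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    Buffer := TMemoryStream.Create;
    try
      Buffer.LoadFromFile(RtfFilePath); // reads the entire file into memory
      Buffer.Position := 0;             // make sure parsing starts at the beginning
      RichEdit.PlainText := False;      // interpret the stream as RTF
      RichEdit.Lines.LoadFromStream(Buffer);
    finally
      Buffer.Free;
    end;
    Result := RichEdit.Text;            // plain text without RTF markup
  finally
    RichEdit.Free;
  end;
end;
```

As the comment notes, this trades memory for speed: the whole file is held in the buffer while the control parses it.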

The function in the question passes the RTF file path string directly to the TStringStream constructor without loading the RTF file (as @Remy Lebeau correctly observes: "The TStringStream constructor does not load a file").

Here is a working version that first loads the RTF file into the TStringStream:

function RtfToText(const RTF_FilePath: string; ReplaceLineFeedWithSpace: Boolean): string;
var
  RTFConverter: TRichEdit;
  MyStringStream: TStringStream;
begin
  RTFConverter := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    MyStringStream := TStringStream.Create('');
    try
      MyStringStream.LoadFromFile(RTF_FilePath);
      RTFConverter.Lines.LoadFromStream(MyStringStream);
      RTFConverter.PlainText := True;
      RTFConverter.Lines.StrictDelimiter := True;
      if ReplaceLineFeedWithSpace then
        RTFConverter.Lines.Delimiter := ' '
      else
        RTFConverter.Lines.Delimiter := #13;
      Result := RTFConverter.Lines.DelimitedText;
    finally
      MyStringStream.Free;
    end;
  finally
    RTFConverter.Free;
  end;
end;

Kopaz answered 30/1, 2022 at 12:18 Comment(1)
@MartynA yes, it does answer the first question. - Nonlinearity
