delphi - strip out all non-standard text characters from string

I need to strip out all non-standard text characters from a string. That is, I need to remove all non-ASCII and control characters (except line feeds and carriage returns).

Imprison answered 13/4, 2011 at 14:0 Comment(0)

Something like this should do:

// For those who need a disclaimer:
// This code is a sample showing how the basic check for non-ASCII characters works.
// It performs poorly on long strings or when called often.
// Use a TStringBuilder, or SetLength & an Integer loop index, to optimize.
// If you need really optimized code, pass this on to the FastCode people.
function StripNonAsciiExceptCRLF(const Value: AnsiString): AnsiString;
var
  AnsiCh: AnsiChar;
begin
  Result := '';
  for AnsiCh in Value do
    // Keep #32..#127, plus carriage return (#13) and line feed (#10).
    if ((AnsiCh >= #32) and (AnsiCh <= #127)) or (AnsiCh in [#10, #13]) then
      Result := Result + AnsiCh;
end;

For UnicodeString you can do something similar.
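As a minimal sketch of what "something similar" could look like for UnicodeString, the version below uses the TStringBuilder mentioned in the disclaimer rather than repeated concatenation. The function name StripNonAsciiKeepCRLF is hypothetical, and the upper bound of #126 is an assumption made here: #127 is DEL, which is itself a control character, so it is left out.

```delphi
uses
  SysUtils; // TStringBuilder lives here (System.SysUtils in newer versions)

// Hypothetical UnicodeString counterpart of the AnsiString sample.
function StripNonAsciiKeepCRLF(const Value: string): string;
var
  Ch: Char;
  Builder: TStringBuilder;
begin
  // Reserve room for the whole input so the buffer never has to grow.
  Builder := TStringBuilder.Create(Length(Value));
  try
    for Ch in Value do
      // Keep printable ASCII (#32..#126) plus line feed and carriage return.
      if ((Ch >= #32) and (Ch <= #126)) or (Ch = #10) or (Ch = #13) then
        Builder.Append(Ch);
    Result := Builder.ToString;
  finally
    Builder.Free;
  end;
end;
```

Comparing against #10 and #13 directly, instead of using a set, sidesteps the WideChar-in-set warning that Unicode-enabled compilers emit.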

Malemute answered 13/4, 2011 at 14:10 Comment(12)

I would not reallocate Result over and over. – Fining 13/4, 2011 at 14:28

I would fix it if speed became a problem. – Malemute 13/4, 2011 at 17:36

There are two potential problems: 1) speed and 2) memory fragmentation. They may not be an issue if the function is called occasionally and with small strings. They could become one if the function is called often with large strings. As usual, optimization requires understanding where the code is expected to work. – Fining 13/4, 2011 at 18:17

This will probably work well with small strings because the memory manager is optimised to deal with this pattern of allocation and because the small blocks make the required mem copy operation fairly fast. But given a reallocation-free drop-in alternative was offered (David's code, not mine) I'd never use this. – Auberta 13/4, 2011 at 18:57

@David: wow, you are harsh on me today. First of all, this is a code sample showing how to do the proper comparisons. Optimizing it distracts from that point. Furthermore, premature optimization causes a lot of evil code. That's why I optimize code when performance is indeed an issue. I've added some comments in the code to warn, but for me those warnings would go with most sample code I encounter that prove a basic algorithm. – Malemute 13/4, 2011 at 19:7

@Jeroen This is pretty trivial stuff and to do it right isn't hard or particularly long-winded. It's a very common pattern. I wouldn't class this as an optimisation. I'd regard it as the baseline for reasonable code. Any optimised version would involve unrolling the loop. – Kristelkristen 13/4, 2011 at 19:17

@David: for you this is trivial, for me this is trivial, but for a lot of SO readers this is not trivial. It's the classic example of the Pareto Principle. I teach software developers for part of my living and see that 80/20 rule on a very regular basis. Hence my samples are meant to be understood by lots of people, and the people that need optimization will figure that out themselves. I can understand you see that in a different way, but I think commenting 'sloppy programmer' based on one code sample is way too harsh, especially since there is no secondary communication involved. – Malemute 13/4, 2011 at 19:35

@Jeroen You contradict yourself. In an earlier comment you stated, "I would fix it if speed became a problem." – Kristelkristen 13/4, 2011 at 19:40

@David: I didn't see Shane indicate that speed is a problem here. If he does, I can now point him to your optimized code (I upvoted it). If you hadn't posted it, I would optimize the code myself, and split the code into two methods: the regular one to show the basics, and the optimized one. That way anyone can make a comparison and see why things were optimized in a certain way. – Malemute 13/4, 2011 at 20:0

Wow, #13 and #10 will always be stripped as the code stands, how could this be the accepted answer? – Virtuoso 10/10, 2013 at 18:27

@LURD probably because of the disclaimer. – Malemute 13/10, 2013 at 13:51

@JeroenWiertPluimers Premature micro-optimization and worrying about technical details below the abstraction of the language appear to be unfortunate traits of many Delphi developers (although I have no idea where or why it became part of the culture). Thus, I feel that your lesson about writing clean, clear code first and only optimizing if necessary (and normally after profiling) is even more important than your instruction about stripping characters from strings! – Purchase 2/2, 2014 at 0:8

K

24

And here's a variant of Cosmin's that only walks the string once, but uses an efficient allocation pattern:

function StrippedOfNonAscii(const s: string): string;
var
  i, Count: Integer;
begin
  SetLength(Result, Length(s));
  Count := 0;
  for i := 1 to Length(s) do begin
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin
      inc(Count);
      Result[Count] := s[i];
    end;
  end;
  SetLength(Result, Count);
end;
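On Unicode-enabled compilers (Delphi 2009 and later, where string is UnicodeString), the expression s[i] in [#10, #13] compiles with warning W1050, which suggests CharInSet from SysUtils instead. The console program below is a sketch of the same single-pass approach with that substitution. The name StrippedOfNonAsciiCS and the sample input are made up for illustration.

```delphi
program StripDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils; // CharInSet

// Hypothetical name; same allocation pattern as the function above.
function StrippedOfNonAsciiCS(const s: string): string;
var
  i, Count: Integer;
begin
  // Allocate once for the worst case, then trim to the kept length.
  SetLength(Result, Length(s));
  Count := 0;
  for i := 1 to Length(s) do
    if ((s[i] >= #32) and (s[i] <= #127)) or CharInSet(s[i], [#10, #13]) then
    begin
      Inc(Count);
      Result[Count] := s[i];
    end;
  SetLength(Result, Count);
end;

begin
  // The tab (#9) and the non-ASCII character (#233) are dropped;
  // the carriage return and line feed are kept.
  Writeln(StrippedOfNonAsciiCS('caf'#233#9' au lait'#13#10'done'));
end.
```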

Kristelkristen answered 13/4, 2011 at 14:53 Comment(2)

Very good variant, only one reallocation and possibly no reallocations if the string doesn't contain any non-ASCII chars. – Auberta 13/4, 2011 at 18:46

var
  l, i, Count: Integer;
begin
  l := Length(s);
  SetLength(Result, l);
  if l = 0 then Exit;
  Count := 0;
  for i := 1 to l do begin
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin
      inc(Count);
      Result[Count] := s[i];
    end;
  end;
  if l <> Count then SetLength(Result, Count);
end;
– Lulululuabourg 21/2, 2020 at 17:53
