Delphi - strip out all non-standard text characters from a string
I

6

17

I need to strip out all non-standard text characters from a string: that is, remove all non-ASCII and control characters (except line feeds and carriage returns).

Imprison answered 13/4, 2011 at 14:0 Comment(0)
M
16

Something like this should do:

// For those who need a disclaimer:
// This code is meant as a sample to show how the basic check for non-ASCII characters goes.
// It will perform poorly when called often with long strings.
// Use a TStringBuilder, or SetLength & Integer loop index to optimize.
// If you need really optimized code, pass this on to the FastCode people.
function StripNonAsciiExceptCRLF(const Value: AnsiString): AnsiString;
var
  AnsiCh: AnsiChar;
begin
  Result := ''; // a string Result is not guaranteed to start out empty
  for AnsiCh in Value do
    // keep #32..#127, plus carriage return (#13) and line feed (#10)
    if ((AnsiCh >= #32) and (AnsiCh <= #127)) or (AnsiCh in [#10, #13]) then
      Result := Result + AnsiCh;
end;

For UnicodeString you can do something similar.
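
As a rough illustration of what "something similar" could look like for UnicodeString, here is a minimal sketch. The function name StripNonAsciiExceptCRLFW is made up for this example, and it assumes a Unicode-enabled Delphi (2009 or later) with System.SysUtils in the uses clause for CharInSet. Unlike the sample above, it stops at #126, on the assumption that #127 (DEL) counts as a control character to be removed.

```pascal
// Sketch only: a UnicodeString counterpart of the AnsiString sample.
// StripNonAsciiExceptCRLFW is a hypothetical name, not part of any library.
// Requires System.SysUtils (SysUtils on older versions) for CharInSet.
function StripNonAsciiExceptCRLFW(const Value: UnicodeString): UnicodeString;
var
  Ch: WideChar;
begin
  Result := '';
  for Ch in Value do
    // CharInSet is False for every character above #255,
    // so no separate upper-range check is needed here.
    if CharInSet(Ch, [#32..#126, #10, #13]) then
      Result := Result + Ch;
end;
```

Like the AnsiString sample, this appends character by character, so the same performance caveats apply.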

Malemute answered 13/4, 2011 at 14:10 Comment(12)
I would not reallocate Result over and over. - Fining
I would fix it if speed became a problem. - Malemute
There are two potential problems: 1) speed, 2) memory fragmentation. It may not be an issue if the function is called occasionally and with small strings. It could become one if the function is called often with large strings. As usual, optimization requires understanding where the code is expected to work. - Fining
This will probably work well with small strings because the memory manager is optimised to deal with this pattern of allocation and because the small blocks make the required mem copy operation fairly fast. But given a reallocation-free drop-in alternative was offered (David's code, not mine) I'd never use this. - Auberta
@David: wow, you are harsh on me today. First of all, this is a code sample showing how to do the proper comparisons. Optimizing it distracts from that point. Furthermore, premature optimization causes a lot of evil code. That's why I optimize code when performance is indeed an issue. I've added some comments in the code to warn, but for me those warnings would go with most sample code I encounter that demonstrates a basic algorithm. - Malemute
@Jeroen This is pretty trivial stuff and to do it right isn't hard or particularly long-winded. It's a very common pattern. I wouldn't class this as an optimisation. I'd regard it as the baseline for reasonable code. Any optimised version would involve unrolling the loop. - Kristelkristen
@David: for you this is trivial, for me this is trivial, but for a lot of SO readers this is not trivial. It's the classic example of the Pareto Principle. I teach software developers for a part of my living and see that 80/20 rule on a very regular basis. Hence my samples are meant to be understood by lots of people, and the people that need optimization will figure that out themselves. I can understand you see that in a different way, but I think commenting 'sloppy programmer' based on one code sample is way too harsh, especially since there is no secondary communication involved. - Malemute
@Jeroen You contradict yourself. In an earlier comment you stated, "I would fix it if speed became a problem." - Kristelkristen
@David: I didn't see that Shane indicated that speed is a problem here. If he does, I can now point him to your optimized code (I upvoted it). If you hadn't posted it, I would optimize the code myself, and split the code into two methods: the regular one to show the basics, and the optimized one. That way anyone can make a comparison and see why things were optimized in a certain way. - Malemute
Wow, #13 and #10 will always be stripped as the code stands, how could this be the accepted answer? - Virtuoso
@LURD probably because of the disclaimer. - Malemute
@JeroenWiertPluimers Premature micro-optimization and worrying about technical details below the abstraction of the language appear to be unfortunate traits of many Delphi developers (although I have no idea where or why it became part of the culture). Thus, I feel that your lesson about writing clean, clear code first and only optimizing if necessary (and normally after profiling) is even more important than your instruction about stripping characters from strings! - Purchase
K
24

And here's a variant of Cosmin's that only walks the string once, but uses an efficient allocation pattern:

function StrippedOfNonAscii(const s: string): string;
var
  i, Count: Integer;
begin
  SetLength(Result, Length(s));
  Count := 0;
  for i := 1 to Length(s) do begin
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin
      inc(Count);
      Result[Count] := s[i];
    end;
  end;
  SetLength(Result, Count);
end;
Kristelkristen answered 13/4, 2011 at 14:53 Comment(2)
Very good variant: only one reallocation, and possibly no reallocations if the string doesn't contain any non-ASCII chars. - Auberta
var
  l, i, Count: Integer;
begin
  l := Length(s);
  SetLength(Result, l);
  if l = 0 then Exit;
  Count := 0;
  for i := 1 to l do begin
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then begin
      inc(Count);
      Result[Count] := s[i];
    end;
  end;
  if l <> Count then SetLength(Result, Count);
end;
- Lulululuabourg
R
5

If you don't need to do it in place and generating a copy of the string is acceptable, try this code:

 type CharSet=Set of Char;

 function StripCharsInSet(s:string; c:CharSet):string;
  var i:Integer;
  begin
     result:='';
     for i:=1 to Length(s) do
       if not (s[i] in c) then 
         result:=result+s[i];
  end;  

and use it like this

 s := StripCharsInSet(s,[#0..#9,#11,#12,#14..#31,#127]);

EDIT: added #127 for DEL ctrl char.

EDIT2: this is a faster version, thanks ldsandon

 function StripCharsInSet(s:string; c:CharSet):string;
  var i,j:Integer;
  begin
     SetLength(result,Length(s));
     j:=0;
     for i:=1 to Length(s) do
       if not (s[i] in c) then 
        begin
         inc(j);
         result[j]:=s[i];
        end;
     SetLength(result,j);
  end;  
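
For Unicode-enabled Delphi versions, the same single-pass copy can be written with CharInSet instead of the `in` operator. The following is only a sketch under that assumption: the name StripCharsInSetW is hypothetical, and System.SysUtils is assumed to be in the uses clause for CharInSet and TSysCharSet. Because CharInSet never matches characters above #255, such characters are kept; stripping them as well would need an additional check.

```pascal
// Sketch only: single-pass variant using CharInSet (System.SysUtils).
// StripCharsInSetW is a hypothetical name; TSysCharSet is a set of AnsiChar.
function StripCharsInSetW(const s: string; const c: TSysCharSet): string;
var
  i, j: Integer;
begin
  SetLength(Result, Length(s)); // allocate once, at the maximum possible size
  j := 0;
  for i := 1 to Length(s) do
    if not CharInSet(s[i], c) then
    begin
      Inc(j);
      Result[j] := s[i];
    end;
  SetLength(Result, j); // trim to the number of characters actually kept
end;
```

It is called the same way as the version above, for example StripCharsInSetW(s, [#0..#9, #11, #12, #14..#31, #127]).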
Romulus answered 13/4, 2011 at 14:14 Comment(3)
For Delphi 2010, use the CharInSet function instead of the Ch in ... construct. - Malemute
Don't worry; your solution will work correctly. For non-ASCII characters the CharInSet function is required though. - Malemute
Very slow, it will reallocate result over and over. I'd set result to the same length as the original string, then set the actual length after it has been processed. - Fining
A
3

Here's a version that doesn't build the string by appending char-by-char, but allocates the whole string in one go. It requires going over the string twice, once to count the "good" chars and once to actually copy them, but it's worth it because it doesn't do multiple reallocations:

function StripNonAscii(s:string):string;
var Count, i:Integer;
begin
  Count := 0;
  for i:=1 to Length(s) do
    if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then
      Inc(Count);
  if Count = Length(s) then
    Result := s // No characters need to be removed, return the original string (no mem allocation!)
  else
    begin
      SetLength(Result, Count);
      Count := 1;
      for i:=1 to Length(s) do
        if ((s[i] >= #32) and (s[i] <= #127)) or (s[i] in [#10, #13]) then
        begin
          Result[Count] := s[i];
          Inc(Count);
        end;
    end;
end;
Auberta answered 13/4, 2011 at 14:29 Comment(7)
Why would anyone downvote this? Not that it matters much, just curious. - Auberta
I would not have used StringOfChar, just SetLength(); anyway, that's not a reason to downvote, although it requires walking the string twice. - Fining
It does require walking the string twice, but it guarantees optimal allocation. If this is done for a great many strings, optimal allocation is going to matter a lot more than walking the string only once. - Auberta
Edited the answer to use SetLength and to implement a tiny optimization that allows the routine to do its job with ZERO or 1 string allocations. - Auberta
@Cosmin one downside of multiple walks is that this code has two identical if statements, which violates DRY. - Kristelkristen
@David, that's true. To be honest I value DRY a lot more than runtime performance. I don't write speed-critical applications. - Auberta
@Cosmin As a maintainer of a 25 year old codebase, I agree, DRY comes first. - Kristelkristen
W
0

My performance-oriented solution:

function StripNonAnsiChars(const AStr: String; const AIgnoreChars: TSysCharSet): string;
var
  lBuilder: TStringBuilder;
  I: Integer;
begin
  lBuilder := TStringBuilder.Create;
  try
    for I := 1 to AStr.Length do
      if CharInSet(AStr[I], [#32..#127] + AIgnoreChars) then
        lBuilder.Append(AStr[I]);
    Result := lBuilder.ToString;
  finally
    FreeAndNil(lBuilder);
  end;
end;

I wrote this with Delphi XE7.

Warlock answered 6/4, 2015 at 6:48 Comment(0)
J
0

My version, which returns an array of byte:

interface

type
  TSBox = array of byte;

And the function:

function StripNonAscii(buf: array of byte): TSBox;
var temp: TSBox;
    countr, countr2: integer;
const validchars : TSysCharSet = [#32..#127];
begin
Result := nil; // empty input yields an empty array
if Length(buf) = 0 then exit;
countr2:= 0;
SetLength(temp, Length(buf)); // size temp to the length of buf
// loop to High(buf), the last valid index; Length(buf) would read one past the end
for countr := 0 to High(buf) do if CharInSet(chr(buf[countr]), validchars) then
  begin
    temp[countr2] := buf[countr];
    inc(countr2); // count valid chars
  end;
SetLength(temp, countr2);
Result := temp;
end;
Jeweller answered 3/9, 2017 at 18:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.