Convert Hi-Ansi chars to Ascii equivalent (é -> e)

Asked 11/12, 2009 at 22:10 Answered 19/3, 2013 at 12:52

Solved delphi character-encoding ascii delphi-2007 non-ascii-characters

Is there a routine available in Delphi 2007 to convert the characters in the high range of the ANSI table (>127) to their equivalent ones in pure ASCII (<=127) according to a locale (codepage)?

I know some chars cannot translate well but most can, esp. in the 192-255 range:

À → A
à → a
Ë → E
ë → e
Ç → C
ç → c
– (en dash) → - (hyphen - that can be trickier)
— (em dash) → - (hyphen)

Almena answered 11/12, 2009 at 22:10 Comment(0)

WideCharToMultiByte does best-fit mapping for any characters that aren't supported by the specified character set, including stripping diacritics. You can do exactly what you want by using that and passing 20127 (US-ASCII) as the codepage.

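// Requires the Windows unit for WideCharToMultiByte.
function BestFit(const AInput: AnsiString): AnsiString;
const
  CodePage = 20127; // 20127 = US-ASCII
var
  WS: WideString;
begin
  // Convert from the default ANSI code page to UTF-16.
  WS := WideString(AInput);
  // First call: ask for the required buffer size in bytes.
  SetLength(Result, WideCharToMultiByte(CodePage, 0, PWideChar(WS),
    Length(WS), nil, 0, nil, nil));
  // Second call: perform the best-fit conversion into Result.
  WideCharToMultiByte(CodePage, 0, PWideChar(WS), Length(WS),
    PAnsiChar(Result), Length(Result), nil, nil);
end;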

procedure TForm1.Button1Click(Sender: TObject);
begin
   ShowMessage(BestFit('aÀàËëÇç–—€¢Š'));
end;

Calling that with your examples produces the results you're looking for, including the em-dash-to-hyphen case, which I don't think is handled by Jeroen's suggestion to convert to Normalization Form D. If you did want to take that approach, Michael Kaplan has a blog post that explicitly discusses stripping diacritics (rather than normalization in general), but it uses C# and an API that was introduced in Vista. You can get something similar using the FoldString API (available in any Windows NT release).
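As a rough illustration of the FoldString route, the sketch below decomposes the input with MAP_COMPOSITE and then drops every character in the combining diacritical marks block (U+0300..U+036F). The function name StripDiacriticsFold is made up for this example, and filtering only that one block is an assumption. Dashes and letters without a decomposition pass through unchanged, so this is not a replacement for BestFit.

// Requires the Windows unit for FoldStringW and MAP_COMPOSITE.
// StripDiacriticsFold is a hypothetical name, not an existing RTL routine.
function StripDiacriticsFold(const AInput: WideString): WideString;
var
  Decomposed: WideString;
  Len, I, J: Integer;
begin
  Result := '';
  if AInput = '' then
    Exit;
  // First call: ask how many WideChars the decomposed string needs.
  Len := FoldStringW(MAP_COMPOSITE, PWideChar(AInput), Length(AInput), nil, 0);
  if Len = 0 then
    Exit;
  SetLength(Decomposed, Len);
  // Second call: split precomposed letters into base letter + combining mark.
  Len := FoldStringW(MAP_COMPOSITE, PWideChar(AInput), Length(AInput),
    PWideChar(Decomposed), Len);
  SetLength(Result, Len);
  J := 0;
  for I := 1 to Len do
    // Assumption: only U+0300..U+036F are treated as removable marks.
    if (Ord(Decomposed[I]) < $0300) or (Ord(Decomposed[I]) > $036F) then
    begin
      Inc(J);
      Result[J] := Decomposed[I];
    end;
  SetLength(Result, J);
end;

The result is still a WideString and may contain non-ASCII characters, so a final conversion to the target code page is still needed.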

Of course if you're only doing this for one character set, and you want to avoid the overhead from converting to and from a WideString, Padu is correct that a simple for loop and a lookup table would be just as effective.

Adolfo answered 12/12, 2009 at 5:33 Comment(3)

Thanks Craig. That's a more generic solution than the lookup. It had a typo in the magic number, so I corrected it and used a constant instead. But anyway, it works on D2007 as well as D2009. – Almena 14/12, 2009 at 18:20

One thing we noticed with this is that 'β' (unicode 1E9E latin capital letter sharp s) isn't converted, so we do this beforehand: StringReplace(aStr, 'β', 'SS', [rfReplaceAll]) – Interinsurance 16/10, 2015 at 16:9

Same thing with Char(539) -> t and Char(537) -> s (also with oxo's answer) – Twentyfour 17/8, 2022 at 10:38

Just to extend Craig's answer for Delphi 2009:

If you use Delphi 2009 or newer, you can use more readable code with the same result:

function OStripAccents(const aStr: String): String;
type
  USASCIIString = type AnsiString(20127);//20127 = us ascii
begin
  Result := String(USASCIIString(aStr));
end;

Unfortunately, this code works only on MS Windows. On Mac, the accents are not replaced by best-fit characters but by question marks.

Evidently, Delphi internally uses WideCharToMultiByte on Windows, whereas on Mac iconv is used (see LocaleCharsFromUnicode in System.pas). The question is whether this difference in behaviour between operating systems should be considered a bug and reported to CodeCentral.

Kayseri answered 19/3, 2013 at 12:52 Comment(1)

iconv does have a //TRANSLIT option, but LocaleCharsFromUnicode() does not use it. – Commotion 25/6, 2015 at 7:10

I believe your best bet is creating a lookup table.
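A minimal sketch of such a table, assuming the input is Windows-1252 and running on a Delphi version that has AnsiChar. The unit name AsciiLookup, the function ToAsciiByTable and the helper MapRange are all invented for this example, and the table is deliberately partial: any character it does not list is returned unchanged.

unit AsciiLookup; // hypothetical unit name

interface

// Maps accented Windows-1252 letters and dashes to plain ASCII.
function ToAsciiByTable(const AInput: AnsiString): AnsiString;

implementation

var
  AsciiMap: array[AnsiChar] of AnsiChar;

// Points every character in First..Last at the same replacement.
procedure MapRange(First, Last, Target: AnsiChar);
var
  C: AnsiChar;
begin
  for C := First to Last do
    AsciiMap[C] := Target;
end;

procedure InitAsciiMap;
var
  C: AnsiChar;
begin
  // Start with the identity mapping, then override selected entries.
  for C := Low(AnsiChar) to High(AnsiChar) do
    AsciiMap[C] := C;
  MapRange(#$C0, #$C5, 'A');  MapRange(#$E0, #$E5, 'a');
  MapRange(#$C7, #$C7, 'C');  MapRange(#$E7, #$E7, 'c');
  MapRange(#$C8, #$CB, 'E');  MapRange(#$E8, #$EB, 'e');
  MapRange(#$CC, #$CF, 'I');  MapRange(#$EC, #$EF, 'i');
  MapRange(#$D1, #$D1, 'N');  MapRange(#$F1, #$F1, 'n');
  MapRange(#$D2, #$D6, 'O');  MapRange(#$F2, #$F6, 'o');
  MapRange(#$D9, #$DC, 'U');  MapRange(#$F9, #$FC, 'u');
  MapRange(#$DD, #$DD, 'Y');  MapRange(#$FD, #$FD, 'y');
  MapRange(#$FF, #$FF, 'y');
  MapRange(#$96, #$97, '-');  // en dash and em dash in Windows-1252
end;

function ToAsciiByTable(const AInput: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := AInput;
  for I := 1 to Length(Result) do
    Result[I] := AsciiMap[Result[I]];
end;

initialization
  InitAsciiMap;

end.

A one-to-one table like this cannot expand a character into two (for example Æ to AE); that would need a table of strings instead.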

Folliculin answered 11/12, 2009 at 22:22 Comment(2)

Also, if you're using a decent regex library with Delphi, that could be used as well, but it is still a kind of lookup table. – Folliculin 11/12, 2009 at 22:28

Thanks Padu. That's what I thought. I'll nevertheless accept Craig's answer because it's more generic. – Almena 14/12, 2009 at 18:23

What you are looking for is normalization.

Michael Kaplan wrote a nice blog article about normalization.

It does not immediately solve your problem, but points you in the right direction.

--jeroen

Advisee answered 11/12, 2009 at 23:19 Comment(1)

NFKD + removal of combining marks works a lot of the time. However, there are characters like ÆÐØÞßæðøþ that do not decompose and have to be dealt with manually. – Overjoy 2/7, 2010 at 2:30
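The manual step for those non-decomposing letters could look roughly like the sketch below. The function name ReplaceNonDecomposing is hypothetical, and the chosen substitutions (for example Þ to TH, Ð to D) are only one plausible convention, not something the thread prescribes.

// Hypothetical helper: replaces letters that do not decompose under NFKD.
// The replacement strings are assumptions; adjust them to your own rules.
function ReplaceNonDecomposing(const AInput: WideString): WideString;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(AInput) do
    case Ord(AInput[I]) of
      $00C6: Result := Result + 'AE'; // Æ
      $00D0: Result := Result + 'D';  // Ð
      $00D8: Result := Result + 'O';  // Ø
      $00DE: Result := Result + 'TH'; // Þ
      $00DF: Result := Result + 'ss'; // ß
      $00E6: Result := Result + 'ae'; // æ
      $00F0: Result := Result + 'd';  // ð
      $00F8: Result := Result + 'o';  // ø
      $00FE: Result := Result + 'th'; // þ
    else
      // Everything else is copied through untouched.
      Result := Result + AInput[I];
    end;
end;

Running this before or after the decomposition step should give the same result, since none of these letters is affected by decomposition.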
