Transition to Unicode for an application that handles text files
A

4

9

My Win32 Delphi app analyzes text files produced by other applications that do not support Unicode. Thus, my app needs to read and write ANSI strings, but I would like to provide a better-localized user experience through the use of Unicode in the GUI. The app does some pretty heavy character-by-character analysis of strings in objects descended from TList.

In making the transition to a Unicode GUI in going from Delphi 2006 to Delphi 2009, should I plan to:

  1. go fully Unicode within my app, with the exception of AnsiString file I/O, or
  2. encapsulate the code that handles the AnsiStrings (i.e. continue to handle them as AnsiStrings internally), keeping it separate from an otherwise Unicode application?

I realize that a truly detailed response would require a substantial amount of my code - I'm just asking for impressions from those who've made this transition and who still have to work with plain text files. Where should I place the barrier between AnsiStrings and Unicode?

EDIT: if #1, any suggestions for mapping Unicode strings to AnsiString output? I would guess that the conversion of input strings will be automatic using TStringList.LoadFromFile (for example).
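Roughly what I have in mind, as a sketch only (the file names are placeholders, and I'm assuming the Delphi 2009 TStrings overloads that take a TEncoding work the way I think they do):

    // uses Classes, SysUtils
    procedure AnalyzeAndWriteBack;
    var
      SL: TStringList;
    begin
      SL := TStringList.Create;
      try
        // No encoding given: a BOM is honoured if present, otherwise the
        // system ANSI codepage (TEncoding.Default) is assumed
        SL.LoadFromFile('input.txt');
        // ... character-by-character analysis on Unicode strings ...
        // Write back in the system ANSI codepage for the legacy applications
        SL.SaveToFile('output.txt', TEncoding.Default);
      finally
        SL.Free;
      end;
    end;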

Ari asked 17/6, 2009 at 2:40 Comment(0)
A
4

There is no such thing as AnsiString output - every text file has a character encoding. The moment your files contain characters outside the ASCII range you have to think about encoding, as even loading those files in different countries will produce different results - unless you happen to be using a Unicode encoding.

If you load a text file you need to know which encoding it has. For formats like XML or HTML that information is part of the text; for Unicode files there is the BOM, even though it isn't strictly necessary for UTF-8 encoded files.

Converting an application to Delphi 2009 is a chance to think about the encoding of text files and correct past mistakes. An application's data files often outlive the application itself, so it pays to think about how to make them future-proof and universal. I would suggest UTF-8 as the text file encoding for all new applications; that way porting an application to different platforms is easy. UTF-8 is the best encoding for data exchange, and for characters in the ASCII or ISO8859-1 range it also creates much smaller files than UTF-16 or UTF-32.

If your data files contain only ASCII characters you are all set, as they are already valid UTF-8 encoded files as well. If your data files are in ISO8859-1 encoding (or any other fixed encoding), then use the matching conversion while loading them into string lists and saving them back. If you don't know in advance what encoding they will have, ask the user upon loading, or provide an application setting for the default encoding.
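For instance (a rough sketch only; the procedure name, file parameters and the ISO8859-1 codepage number are just examples of what your data might use):

    // uses Classes, SysUtils
    procedure Latin1ToUtf8(const InFile, OutFile: string);
    var
      SL: TStringList;
      Latin1: TEncoding;
    begin
      SL := TStringList.Create;
      try
        Latin1 := TEncoding.GetEncoding(28591);  // ISO8859-1
        try
          SL.LoadFromFile(InFile, Latin1);       // decode to Unicode strings while loading
        finally
          Latin1.Free;                           // instances from GetEncoding are owned by the caller
        end;
        // ... work on the Unicode strings ...
        SL.SaveToFile(OutFile, TEncoding.UTF8);  // save back as UTF-8, with a BOM
      finally
        SL.Free;
      end;
    end;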

Use Unicode strings internally. Depending on the amount of data you need to handle you might use UTF-8 encoded strings.

Autocrat answered 17/6, 2009 at 4:13 Comment(3)
Excellent - the way you explained this helps a great deal. Based on my understanding, input will indeed be UTF-8 text files (straight ASCII) and it now makes sense that I can use UTF-8 encoded Unicode strings internally. - Ari
It is not that straightforward to use UTF-8 encoded strings internally, so I do not recommend this as a policy. You will find that as soon as you start using string lists and the more useful VCL string functions, the function you need will either be absent or using it will involve two encoding conversions. - Mcgary
@frogb: Indeed, as a policy it would be a bad idea. This needs to be decided on a case-by-case basis. Without knowing what the code does, however, it is impossible to say which VCL string functions are needed and which encoding conversions this would cause. - Autocrat
G
4

I suggest going fully Unicode if it's worth the effort and it's a requirement, and keeping the ANSI file I/O separated from the rest. But this depends strongly on your application.

Gerita answered 17/6, 2009 at 2:45 Comment(0)
C
3

You say:

"The app does some pretty heavy character-by-character analysis of string in objects descended from TList."

Since Windows runs Unicode natively, you may find your character analysis runs faster if you load the text file internally as Unicode.

On the other hand, if it is a large file, you will also find it takes twice as much memory.

For more about this, see Jan Goyvaerts' article: "Speed Benefits of Using the Native Win32 String Type"

So it is a tradeoff you have to decide on.
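Just to put rough numbers on the memory side of that tradeoff (a sketch only; the 1 MB figure is just an example):

    program StringMemorySketch;
    {$APPTYPE CONSOLE}
    uses
      SysUtils;
    var
      A: AnsiString;
      U: UnicodeString;
    begin
      // Pretend this is a 1 MB ASCII text file held in memory
      A := StringOfChar(AnsiChar('x'), 1024 * 1024);
      U := A;  // implicit conversion to a Delphi 2009 UnicodeString
      Writeln('AnsiString payload:    ', Length(A) * SizeOf(AnsiChar), ' bytes');  // about 1 MB
      Writeln('UnicodeString payload: ', Length(U) * SizeOf(Char), ' bytes');      // about 2 MB
    end.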

Calamondin answered 17/6, 2009 at 4:26 Comment(3)
Thanks for the link. The text files are not very large (a megabyte or so). I'm a long-time, happily registered user of JGSoft programs, so I doubly appreciate the link - I had not read Jan's blog posts. - Ari
You might also find some of the answers to a question I posted earlier of use to you. See the excellent answers to #312618, including Jan's answer. - Calamondin
If the input consists of ASCII characters only, and the character analysis does not use any RTL functions that wrap the Windows API (as explained in the linked article) but only comparisons and things like Pos(), then UnicodeString will be slower than AnsiString. - Autocrat
C
1

If you are going to take Unicode input from the GUI, what's the strategy going to be for converting it to ASCII output? (This is an assumption, as you mention writing ANSI text back out, presumably for these non-Unicode applications that you are not going to rewrite and presumably don't have the source code for.) I'd suggest staying with AnsiString throughout the app until these other apps are Unicode-enabled. If the main job of your application is analyzing non-Unicode, ASCII-type files, then why switch to Unicode internally? If the main job of your application involves providing a better Unicode-enabled GUI, then go Unicode. I don't believe there's enough info presented here to make a proper choice.

If there is no chance of hard-to-translate characters being written back out for these non-Unicode applications, then the suggestion of UTF-8 is likely the way to go. However, if there is such a chance, how are the non-Unicode applications going to handle multi-byte characters? And how are you going to convert down to (presumably) the basic ASCII character set?
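If you did end up having to force everything down into one single-byte codepage, a Delphi 2009 sketch might look like this (the type and procedure names are made up, and codepage 1252 is only an example; anything the target codepage cannot represent is typically replaced with '?', which is exactly the kind of loss you'd need to plan for):

    type
      CP1252String = type AnsiString(1252);  // AnsiString bound to a specific codepage

    procedure WriteForLegacyApp(const U: UnicodeString);
    var
      A: CP1252String;
    begin
      A := U;  // implicit, lossy conversion to codepage 1252 (the compiler warns about it)
      // ... write A out for the non-Unicode application ...
    end;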

Cyperaceous answered 17/6, 2009 at 5:2 Comment(1)
Limiting the text output to UTF-8/ASCII will not be difficult (if I plan well) because it's derived from the input (in this regard mghie's answer is particularly applicable). The GUI is used to generate graphical output (to be saved in vector formats - a separate issue). Thanks for your answer - the cautionary tone is very helpful in thinking about the text output. - Ari
