Converting Unicode to Windows-1252 for vCards

I am trying to write a program in C# that will split a vCard (VCF) file with multiple contacts into individual files for each contact. I understand that the vCard needs to be saved as ANSI (1252) for most mobile phones to read them.

However, if I open a VCF file using StreamReader and then write it back with StreamWriter (setting 1252 as the Encoding format), all special characters like å, æ and ø are getting written as ?. Surely ANSI (1252) would support these characters. How do I fix this?

Edit: Here's the piece of code I use to read and write the file.

private void ReadFile()
{
   StreamReader sreader = new StreamReader(sourceVCFFile);
   string fullFileContents = sreader.ReadToEnd();
}

private void WriteFile()
{
   StreamWriter swriter = new StreamWriter(sourceVCFFile, false, Encoding.GetEncoding(1252));
   swriter.Write(fullFileContents);
}
Gonorrhea answered 4/12, 2010 at 4:45 Comment(0)

You are correct in assuming that Windows-1252 supports the special characters you listed above (for a full list see the Wikipedia entry).

using (var writer = new StreamWriter(destination, true, Encoding.GetEncoding(1252)))
{
    writer.WriteLine(source);
}

In my test app using the code above it produced this result:

Look at the cool letters I can make: å, æ, and ø!

No question marks to be found. Are you setting the encoding when you're reading it in with StreamReader?
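To make that question concrete, here is a minimal sketch that names the encoding on both the read and the write. It assumes the source file really is UTF-8. The class name, method name and paths are hypothetical, and the RegisterProvider call is an assumption about the runtime: on .NET Core and .NET 5+ code page 1252 is only available once CodePagesEncodingProvider has been registered, whereas on .NET Framework it is built in.

```csharp
using System.IO;
using System.Text;

public static class VCardTranscoder
{
    // Hypothetical helper: reads a file as UTF-8 and rewrites it as Windows-1252.
    public static void Utf8FileToWin1252(string sourcePath, string destinationPath)
    {
        // Assumption: required on .NET Core / .NET 5+, where code page 1252 is
        // not available by default. On .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        string contents;
        // Decode the source bytes as UTF-8 (assumed input encoding).
        using (var reader = new StreamReader(sourcePath, Encoding.UTF8))
        {
            contents = reader.ReadToEnd();
        }

        // Encode the text as Windows-1252, overwriting any existing file.
        using (var writer = new StreamWriter(destinationPath, false, win1252))
        {
            writer.Write(contents);
        }
    }
}
```

If the source file is not actually UTF-8, the reader will already have replaced the undecodable bytes before the writer sees them, so the encoding passed to StreamReader matters as much as the one passed to StreamWriter.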

EDIT: You should just be able to use Encoding.Convert to convert the UTF-8 VCF file into Windows-1252. No need for Regex.Replace. Here is how I would do it:

// You might want to think of a better method name.
public string ConvertUTF8ToWin1252(string source)
{
    Encoding utf8 = new UTF8Encoding();
    Encoding win1252 = Encoding.GetEncoding(1252);

    byte[] input = source.ToUTF8ByteArray();  // Note the use of my extension method
    byte[] output = Encoding.Convert(utf8, win1252, input);

    return win1252.GetString(output);
}

And here is how my extension method looks:

public static class StringHelper
{
    // It should be noted that this method is expecting UTF-8 input only,
    // so you probably should give it a more fitting name.
    public static byte[] ToUTF8ByteArray(this string str)
    {
        Encoding encoding = new UTF8Encoding();
        return encoding.GetBytes(str);
    }
}

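Also, you'll probably want to wrap the StreamReader and StreamWriter in your ReadFile and WriteFile methods in using statements, so that both are disposed and the writer is flushed to disk.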
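Since a .NET string is held as UTF-16 in memory, the file's encoding is decided at the point where bytes are read or written. The following is a sketch of the same Encoding.Convert idea applied directly to the file's bytes, not a replacement for the method above. The class and method names are hypothetical, and the strict UTF8Encoding is an added assumption so that non-UTF-8 input fails loudly instead of being silently mangled. RegisterProvider is again only needed on .NET Core and .NET 5+.

```csharp
using System.Text;

public static class Win1252Converter
{
    // Hypothetical helper: converts raw UTF-8 bytes to Windows-1252 bytes.
    // Throws DecoderFallbackException if the input is not valid UTF-8.
    public static byte[] Utf8BytesToWin1252(byte[] utf8Bytes)
    {
        // Assumption: required on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Second argument (throwOnInvalidBytes: true) rejects malformed UTF-8;
        // first argument (false) means no BOM is associated with this encoding.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Decodes with the source encoding, then encodes with the target one.
        return Encoding.Convert(strictUtf8, win1252, utf8Bytes);
    }
}
```

A caller could pass in File.ReadAllBytes(path) and write the result with File.WriteAllBytes. Catching DecoderFallbackException is one rough way to tell "not UTF-8" input apart, though it is a heuristic and not a reliable detector of ANSI files.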

Ilocano answered 4/12, 2010 at 5:10 Comment(14)
I think the key to the OP's problem is your last question: make sure that the StreamReader that reads the VCF has the 1252 encoding set. - Shredding
I am not setting the encoding when reading the file using StreamReader. And I am pretty much using the same piece of code as your sample. But the input VCF file is in UTF-8. For some reason, Sony Ericsson's "Backup to MS" feature saves the VCF file in UTF-8! - Gonorrhea
@GPX: See my updated answer, I think it should solve your problem. - Ilocano
@Lucas: Thanks for the reply! I've added the code that I'm using. Now to use yours, do I do a Regex.Replace()? And also, should I hardcode the byte array for each special character? - Gonorrhea
@Lucas: What would I do if the input VCF file is not UTF-8!? - Gonorrhea
@Lucas: UPDATE: How do I handle VCF files that are in ANSI, then? Looks like there's no proper way to detect ANSI encoding! - Gonorrhea
@Lucas: Also, your suggested method properly recodes a UTF-8 stream with special characters into ANSI. But if it is a UTF-8 stream WITHOUT any special characters, then the result is also a UTF-8 stream! - Gonorrhea
@GPX: I'm not 100% sure what you mean by your last comment, but if the VCF file is in ANSI, why should there be a problem? - Ilocano
@GPX: It should also be noted that you should only call my function once you know that the input is in UTF-8 format. So you will need to put the proper checks in before you call my method. - Ilocano
@Lucas: I've got everything so wrong. I used the built-in functions to back up contacts on both SE and Nokia phones, and guess what, both are being saved in UTF-8! I feel so terrible I missed it, after all these questions! Now if I just open a VCF file using StreamReader in UTF-8 mode and then save it again using StreamWriter in UTF-8 mode, the file is saved with special characters preserved. However, if I open the file using Notepad2, it shows "UTF-8 with Signature" as the encoding. Am I doing something wrong? - Gonorrhea
@GPX: Wikipedia states that the BOM "may cause interoperability problems with existing software that could otherwise handle UTF-8". It then goes on to give several examples of the problems it could cause. So basically UTF-8 with signature just means with BOM added. - Ilocano
@GPX: Also don't feel bad, character sets are a complex subject. It just takes time and practice. - Ilocano
@Lucas: So how do I save the file without adding BOM? - Gonorrhea
@GPX: Notepad2 may be adding it just by opening it. If you have a hex editor/viewer handy you might want to look at the text file right after running your program. If the BOM is in fact being added by .NET then you could always write code that checks to see if the first three bytes are 0xEF, 0xBB, 0xBF and if so remove them. - Ilocano
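
On the BOM question in the last few comments, here is a sketch that assumes the signature comes from the encoding handed to the writer. In .NET, passing Encoding.UTF8 to a StreamWriter writes the bytes 0xEF, 0xBB, 0xBF at the start of a new file, while a UTF8Encoding constructed with false does not. The class and method names below are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class BomHelper
{
    // Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    // Hypothetical helper: writes text as UTF-8 with no BOM, overwriting the file.
    public static void WriteUtf8WithoutBom(string path, string contents)
    {
        // The constructor argument (encoderShouldEmitUTF8Identifier: false)
        // suppresses the BOM; Encoding.UTF8 would emit it.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false)))
        {
            writer.Write(contents);
        }
    }
}
```

HasUtf8Bom can be run against File.ReadAllBytes(path) right after the program finishes, which is the same check a hex viewer would give and rules out the editor as the source of the signature.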
