VB6/VBScript change file encoding to ansi
Asked Answered
B

4

13

I am looking for a way to convert a textfile with UTF8 encoding to ANSI encoding.

How can i go around and achieve this in Visual Basic (VB6) and or vbscript?

Bark answered 3/3, 2011 at 14:35 Comment(0)
A
20

If your files aren't truly enormous (e.g. even merely 40MB can be painfully slow) you can do this using the following code in VB6, VBA, or VBScript:

Option Explicit

Private Const adReadAll = -1
Private Const adSaveCreateOverWrite = 2
Private Const adTypeBinary = 1
Private Const adTypeText = 2
Private Const adWriteChar = 0

Private Sub UTF8toANSI(ByVal UTF8FName, ByVal ANSIFName)
    Dim strText

    With CreateObject("ADODB.Stream")
        .Open
        .Type = adTypeBinary
        .LoadFromFile UTF8FName
        .Type = adTypeText
        .Charset = "utf-8"
        strText = .ReadText(adReadAll)
        .Position = 0
        .SetEOS
        .Charset = "_autodetect" 'Use current ANSI codepage.
        .WriteText strText, adWriteChar
        .SaveToFile ANSIFName, adSaveCreateOverWrite
        .Close
    End With
End Sub

UTF8toANSI "UTF8-wBOM.txt", "ANSI1.txt"
UTF8toANSI "UTF8-noBOM.txt", "ANSI2.txt"
MsgBox "Complete!", vbOKOnly, WScript.ScriptName

Note that it will handle UTF-8 input files either with or without a BOM.

Using strong typing and early binding will improve performance a hair in VB6, and you won't need to declare those Const values. This isn't an option in script though.

For VB6 programs that need to process very large files you might be better off using VB6 native I/O against Byte arrays and use an API call to convert the data in chunks. This adds the extra messiness of finding the character boundaries though (UTF-8 uses a variable number of bytes per character). You'd need to scan each data block you read to find a safe ending point for an API translation.

I'd look at MultiByteToWideChar() and WideCharToMultiByte() to get started.

Note that UTF-8 often "arrives" with LF line delimiters instead of CRLF.

Adjacent answered 3/3, 2011 at 20:28 Comment(9)
Btw, CharSet should be "_autodetect_all"! In HKCR\Mime\Database\Charset\_autodetect value Codepage is 50932 and then in HKCR\Mime\Database\Codepage\50932 value Description is "Japanese (Auto-Select)" -- clearly does not match the "Use current ANSI codepage" intent.Photocurrent
Odd, when I examine the registry 50932's Description shows as @%SystemRoot%\system32\mlang.dll,-4637 which should retrieve the current-locale setting, AFAIK.Adjacent
Yes, on win7 just look at the string tables of mlang.dll with a resource editor. In English table id 4637 is "Japanese (Auto-Select)". On XP the string is hard coded in the registry.Photocurrent
Well I agree that _autoselect_all mapping to codepage 50001 makes more sense. Seems like a bizarre choice for _autoselect to map to 50932 but it appears to go back as far as Win95. All of the "autoselect" MIME types are noted as "not for general use" in MSDN, but aside from setting a specific CharSet value I don't see an alternative. The ironic part of all of this is the original question above probably really wanted "Unicode" (UTF-16LE) anyway.Adjacent
What if my conversion will be from UTF-8 w/o BOM to ASCII CP852? How should I do thisBenzoic
There is no such thing as an "ASCII codepage" so I assume you meant an ANSI codepage 852. To do that you could set the output .Charset = "ibm852" or its alias "cp852" (both work for me).Adjacent
doesn't work when I'm trying to convert UTF-8 to Windows-1254 (I get question marks instead of extended ascii values)Holdup
Works fine when I read with a filesystemobject and write with ADODB.streamHoldup
why .Type is firstly set to adTypeBinary and then set to adTypeText?Roselynroseman
P
4

I'm using these helper functions

Private Function pvReadFile(sFile)
    Const ForReading = 1
    Dim sPrefix

    With CreateObject("Scripting.FileSystemObject")
        sPrefix = .OpenTextFile(sFile, ForReading, False, False).Read(3)
    End With
    If Left(sPrefix, 3) <> Chr(&HEF) & Chr(&HBB) & Chr(&HBF) Then
        With CreateObject("Scripting.FileSystemObject")
            pvReadFile = .OpenTextFile(sFile, ForReading, False, Left(sPrefix, 2) = Chr(&HFF) & Chr(&HFE)).ReadAll()
        End With
    Else
        With CreateObject("ADODB.Stream")
            .Open
            If Left(sPrefix, 2) = Chr(&HFF) & Chr(&HFE) Then
                .Charset = "Unicode"
            ElseIf Left(sPrefix, 3) = Chr(&HEF) & Chr(&HBB) & Chr(&HBF) Then
                .Charset = "UTF-8"
            Else
                .Charset = "_autodetect"
            End If
            .LoadFromFile sFile
            pvReadFile = .ReadText
        End With
    End If
End Function

Private Function pvWriteFile(sFile, sText, lType)
    Const adSaveCreateOverWrite = 2

    With CreateObject("ADODB.Stream")
        .Open
        If lType = 2 Then
            .Charset = "Unicode"
        ElseIf lType = 3 Then
            .Charset = "UTF-8"
        Else
            .Charset = "_autodetect"
        End If
        .WriteText sText
        .SaveToFile sFile, adSaveCreateOverWrite
    End With
End Function

I found out that "native" FileSystemObject reading of ANSI and UTF-16/UCS-2 files is much faster that ADODB.Stream hack.

Photocurrent answered 4/3, 2011 at 10:55 Comment(2)
How is using the Stream precisely as intended some kind of "hack" anyway? And if you want speed there are far better alternatives than the pokey FSO.Adjacent
@Bob77: I don't see any alternatives in your answer. Do you think ADODB.Stream was intended for ANSI/UTF8 transcoding when there is a simple API function to do it? Then why is it's performance abysmal if there was any test case for transcoding? Frankly I'm still using these functions in production although I found out the hard way that OpenTextFile fails on empty files just with Unicode BOM (Chr(&HFF) & Chr(&HFE))Photocurrent
S
4

I'm using this script to convert any character set or code page (that i'm aware of).

This script can also handle large files (over one gigabytes), because it streams one line at a time.

' - ConvertCharset.vbs -
'
' Inspired by: 
' http://www.vbforums.com/showthread.php?533879-Generate-text-files-in-IBM-850-encoding
' https://mcmap.net/q/852924/-vb6-vbscript-change-file-encoding-to-ansi/5186170#5186170
' https://mcmap.net/q/908093/-how-to-convert-a-batch-file-stored-in-utf-8-to-something-that-works-via-another-batch-file-and-run-it
' 
' Start Main
Dim objArguments
Dim strSyntaxtext, strInputCharset, strOutputCharset, strInputFile, strOutputFile 
Dim intReadPosition, intWritePosition
Dim arrCharsets

Const adReadAll = -1
Const adReadLine = -2
Const adSaveCreateOverWrite = 2
Const adSaveCreateNotExist = 1
Const adTypeBinary = 1
Const adTypeText = 2
Const adWriteChar = 0
Const adWriteLine = 1

strSyntaxtext = strSyntaxtext & "Converts the charset of the input text file to output file." & vbCrLf
strSyntaxtext = strSyntaxtext & "Syntax: "  & vbCrLf
strSyntaxtext = strSyntaxtext & WScript.ScriptName & " /InputCharset:utf-8|windows-1252|ibm850|..." & vbCrLf
strSyntaxtext = strSyntaxtext & "              /OutputCharset:utf-8|windows-1252|ibm850|..." & vbCrLf 
strSyntaxtext = strSyntaxtext & "              /InputFile:\\path\to\inputfile.ext" & vbCrLf 
strSyntaxtext = strSyntaxtext & "              /OutputFile:\\path\to\outputfile.ext" & vbCrLf 
strSyntaxtext = strSyntaxtext & "              [/ShowAllCharSets]" & vbCrLf & vbCrLf 
strSyntaxtext = strSyntaxtext & "Example:" & vbCrLf
strSyntaxtext = strSyntaxtext & WScript.ScriptName & " /InputCharset:ibm850 /OutputCharset:utf-8 /InputFile:my_dos.txt /OutputFile:my_utf-8.txt" & vbCrLf

Set objArgumentsNamed = WScript.Arguments.Named
If objArgumentsNamed.Count = 0  Then 
   WScript.Echo strSyntaxtext
   WScript.Quit(99)
End If

arrCharsets = Split("big5,big5-hkscs,euc-jp,euc-kr,gb18030,gb2312,gbk,ibm-thai," &_
                    "ibm00858,ibm01140,ibm01141,ibm01142,ibm01143,ibm01144," &_
                    "ibm01145,ibm01146,ibm01147,ibm01148,ibm01149,ibm037," &_
                    "ibm1026,ibm273,ibm277,ibm278,ibm280,ibm284,ibm285,ibm297," &_
                    "ibm420,ibm424,ibm437,ibm500,ibm775,ibm850,ibm852,ibm855," &_
                    "ibm857,ibm860,ibm861,ibm862,ibm863,ibm864,ibm865,ibm866," &_
                    "ibm869,ibm870,ibm871,iso-2022-jp,iso-2022-kr,iso-8859-1," &_
                    "iso-8859-13,iso-8859-15,iso-8859-2,iso-8859-3,iso-8859-4," &_
                    "iso-8859-5,iso-8859-6,iso-8859-7,iso-8859-8,iso-8859-9," &_
                    "koi8-r,koi8-u,shift_jis,tis-620,us-ascii,utf-16,utf-16be," &_
                    "utf-16le,utf-7,utf-8,windows-1250,windows-1251,windows-1252," &_
                    "windows-1253,windows-1254,windows-1255,windows-1256," &_
                    "windows-1257,windows-1258,unicode", ",")

Set objFileSystem = CreateObject("Scripting.FileSystemObject")

For Each objArgumentNamed in objArgumentsNamed
   Select Case Lcase(objArgumentNamed)
      Case "inputcharset"
         strInputCharset = LCase(objArgumentsNamed(objArgumentNamed))
         If Not IsCharset(strInputCharset) Then 
            WScript.Echo "The InputCharset (" & strInputCharset & ") is not valid, quitting. The valid charsets are:"  & vbCrLf
            x = ShowCharsets()
            WScript.Quit(1)
         End If
      Case "outputcharset"
         strOutputCharset = LCase(objArgumentsNamed(objArgumentNamed))
         If Not IsCharset(strOutputCharset) Then 
            WScript.Echo "The strOutputCharset (" & strOutputCharset & ") is not valid, quitting. The valid charsets are:"  & vbCrLf
            x = ShowCharsets()
            WScript.Quit(2)
         End If
      Case "inputfile"
         strInputFile = LCase(objArgumentsNamed(objArgumentNamed))
         If Not objFileSystem.FileExists(strInputFile) Then  
            WScript.Echo "The InputFile (" & strInputFile  & ") does not exist, quitting."  & vbCrLf
            WScript.Quit(3)
         End If
      Case "outputfile"
         strOutputFile = LCase(objArgumentsNamed(objArgumentNamed))
         If objFileSystem.FileExists(strOutputFile) Then  
            WScript.Echo "The OutputFile  (" & strOutputFile & ") exists, quitting."  & vbCrLf
            WScript.Quit(4)
         End If
      Case "showallcharsets"
         x = ShowCharsets()
      Case Else
         WScript.Echo "Unknown parameter, quitting: /" & objArgumentNamed & ":" & objArgumentsNamed(objArgumentNamed)
         WScript.Echo strSyntaxtext
   End Select 
Next

If Len(strInputCharset) > 0 And Len(strOutputCharset) > 0 And Len(strInputFile) > 0 And Len(strOutputFile) Then 
   Set objInputStream = CreateObject("ADODB.Stream")
   Set objOutputStream = CreateObject("ADODB.Stream")

   With objInputStream
      .Open
      .Type = adTypeBinary
      .LoadFromFile strInputFile
      .Type = adTypeText
      .Charset = strInputCharset
      intWritePosition = 0
      objOutputStream.Open
      objOutputStream.Charset = strOutputCharset
      Do While .EOS <> True
         strText = .ReadText(adReadLine)
         objOutputStream.WriteText strText, adWriteLine
      Loop
      .Close
   End With
   objOutputStream.SaveToFile strOutputFile , adSaveCreateNotExist
   objOutputStream.Close
   WScript.Echo "The " & objFileSystem.GetFileName(strInputFile) & " was converted to "  & objFileSystem.GetFileName(strOutputFile) & " OK."
End If
' End Main

' Start Functions 

Function IsCharset(strMyCharset)
IsCharset = False
For Each strCharset in arrCharsets
   If strCharset = strMyCharset Then 
      IsCharset = True
      Exit For
   End If
Next
End Function 

Function ShowCharsets()
strDisplayCharsets = ""
intCounter = 0
For Each strcharset in arrCharsets
   intCounter = intCounter + Len(strcharset) + 1
   strDisplayCharsets = strDisplayCharsets & strcharset & ","
   If intCounter > 67 Then 
      intCounter = 0
      strDisplayCharsets = strDisplayCharsets & vbCrLf 
   End If
Next
strDisplayCharsets = Mid(strDisplayCharsets, 1, Len(strDisplayCharsets)-1)
WScript.Echo strDisplayCharsets 
End Function 
' End Functions 
Splendor answered 8/4, 2014 at 7:54 Comment(0)
H
1

@Bob77's answer did not work for me, so I converted @Ciove's answer to a simple sub routine and it works fine.

' Usage: 
' EncodeFile strInFile, "UTF-8", strOutFile, "Windows-1254", 2
Sub EncodeFile(strInputFile, strInputCharset, strOutputFile, strOutputCharset, intOverwriteMode)

    '5th parameter may take the following values:
    'Const adSaveCreateOverWrite = 2
    'Const adSaveCreateNotExist = 1

    Const adReadLine = -2
    Const adTypeBinary = 1
    Const adTypeText = 2
    Const adWriteLine = 1

    Set objInputStream = CreateObject("ADODB.Stream")
    Set objOutputStream = CreateObject("ADODB.Stream")

    With objInputStream
      .Open
      .Type = adTypeBinary
      .LoadFromFile strInputFile
      .Type = adTypeText
      .Charset = strInputCharset
      objOutputStream.Open
      objOutputStream.Charset = strOutputCharset
      Do While .EOS <> True
         strText = .ReadText(adReadLine)
         objOutputStream.WriteText strText, adWriteLine
      Loop
      .Close
    End With
    objOutputStream.SaveToFile strOutputFile, intOverwriteMode
    objOutputStream.Close
End Sub
Holdup answered 21/5, 2018 at 8:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.