How does vbscript filesystemobject encode characters?

I have this vbscript code:

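    Set fs = CreateObject("Scripting.FileSystemObject")
    ' iomode 2 = ForWriting, create = True; the format argument is omitted
    Set ts = fs.OpenTextFile("tmp.txt", 2, True)

    For i = 128 To 255
        s = Chr(i)
        ' Stop if the string is not exactly one 16-bit character
        If LenB(s) <> 2 Then
            WScript.Echo i
            WScript.Quit
        End If
        ts.Write s
    Next
    ts.Close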

On my system, each integer is converted to a double byte character: there are no numbers in that range that cannot be represented by a character, and no number requires more than 2 bytes. But when I look at the file, I find only 127 bytes.

This answer: https://mcmap.net/q/104853/-vba-save-a-file-with-utf-8-without-bom suggests that the FSO creates UTF files and inserts a BOM. But the file contains only 127 bytes, and no Byte Order Mark.

How does FSO decide how to encode text? What encoding allows 8 bit single-byte characters? What encodings do not include 255 8 bit single-byte characters?

(Answers about how FSO reads characters may also be interesting, but that's not what I'm specifically asking here)

Edit: I've limited my question to the high-bit characters, to make it clear what the question is. (Answers about the low-bit characters may also be interesting, but that's not what I'm specifically asking here)

Chiang answered 24/10, 2020 at 6:12 Comment(7)
Does this answer your question? FileSystemObject - Reading Unicode Files - Austine
It doesn’t create UTF-8 files; do not mix up Unicode and UTF-8. - Austine
Does this answer your question? Read utf-8 text file in vbscript. - Austine
As stated, I'm specifically asking about writing. And about claims that have been made about writing. Answers that don't address Writing don't answer the question. I've used the term 'UTF' to include UTF-16 (including UCS-2). If you know an answer that addresses what kind of UTF (or similar) FSO writes in any circumstances, that could be relevant. - Chiang
https://mcmap.net/q/104856/-writing-binary-data-to-file-with-jscript-vbscript/1335492 confuses writing from FSO with creating characters for FSO to write. I've tried to make the distinction clear here. - Chiang
The problem is you are using Chr(), which returns ASCII characters; you should be using ChrW() if you intend on returning "Unicode" (UCS-2 Little Endian) characters. - Austine
Actually, Chr is returning 16-bit UTF characters on my system. Same as ChrW in both characters. ascb(midb(chr(i),1,1)) = ascb(midb(chrw(i),1,1)). ChrB returns single byte characters (as invalid bstr). - Chiang

Short Answer:

The file system object maps "Unicode" to "ASCII" using the code page associated with the System Locale. (Chr uses the code page of the User Locale; ChrW takes a 16-bit character code directly and does not go through a code page.)

Application:

There may be silent transposition errors between the System code page and the Thread (user) code page. There may also be coding and decoding errors if code points are missing from a code page, or, as with Japanese and UTF-8, the code pages contain multi-byte characters.

VBScript provides no native method to detect the User, Thread, or System code page. The Thread (user) code page may be inferred from the Locale set by SetLocale or returned by GetLocale (there is a list here: https://www.science.co.il/language/Locale-codes.php), but there does not appear to be any MS documentation. On Win2K+, WMI may be used to query the System code page. The CHCP command queries and changes the OEM code page, which is neither the User nor the System code page.
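
As a rough sketch of the two queries mentioned above: GetLocale returns the thread locale as an LCID, and the WMI class Win32_OperatingSystem has CodeSet and Locale properties. That CodeSet reflects the system code page is my assumption here, and the output depends entirely on the machine. Run it with cscript.

```vbscript
Option Explicit
Dim wmi, os

' Thread locale as an LCID (for example 1033 for en-US, 1041 for ja-JP)
WScript.Echo "Thread locale (LCID): " & GetLocale()

' Assumption: CodeSet of Win32_OperatingSystem reflects the system code page
Set wmi = GetObject("winmgmts:\\.\root\cimv2")
For Each os In wmi.ExecQuery("SELECT CodeSet, Locale FROM Win32_OperatingSystem")
    WScript.Echo "System code page (CodeSet): " & os.CodeSet
    WScript.Echo "System locale (hex LCID):   " & os.Locale
Next
```

Comparing the two values shows whether the thread locale and the system locale are likely to share a code page.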

The system code page may be spoofed by an application manifest. There is no way for an application (such as cscript or wscript) or script (such as VBScript or JScript) to change its parent system except by creating a new process with a new manifest, or rebooting the system after making a registry change.

In detail:

    s = chr(i)
    'creates a Unicode string, using the Thread Locale Codepage.

Code points that do not exist as characters are mapped as control characters. For example, on code page 1252, 127 becomes U+007F (which is a standard Unicode control character), 128 becomes U+20AC (the Euro symbol, which does exist on that code page), and 129 becomes U+0081 (which is a code point in a Unicode control character region). In VBScript, the Thread Locale can be set and read by SetLocale and GetLocale.
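
To see these mappings on a particular machine, a small sketch like the following prints the 16-bit code that Chr and ChrW each produce for a few sample values. The sample values are only illustrative, and the Chr column depends on the thread locale (which SetLocale can change for the running script).

```vbscript
Option Explicit
Dim i

' Chr goes through the thread locale's code page; ChrW uses the number as-is
For Each i In Array(127, 128, 129, 255)
    WScript.Echo i & ": Chr -> U+" & Right("0000" & Hex(AscW(Chr(i))), 4) & _
                 ", ChrW -> U+" & Right("0000" & Hex(AscW(ChrW(i))), 4)
Next
```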

    createobject("Scripting.FileSystemObject").OpenTextFile(strOutFile, 2, True).write s
    'creates a 'code page' string, using the System Locale Codepage.

There are two ways that Windows can handle Unicode values it can't map: it can either map to a default character, or return an error. "Scripting.FileSystemObject" uses the error setting, and throws an exception.
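
A minimal sketch of trapping that error, assuming a scratch file named tmp.txt and a system code page that has no mapping for U+4581 (on a system where the character can be mapped, the write simply succeeds):

```vbscript
Option Explicit
Dim fs, ts
Set fs = CreateObject("Scripting.FileSystemObject")
' format omitted: the stream is written in "ASCII" (system code page) mode
Set ts = fs.OpenTextFile("tmp.txt", 2, True)

On Error Resume Next
ts.Write ChrW(&H4581)   ' a CJK ideograph, absent from single-byte code pages
If Err.Number <> 0 Then
    ' Report whatever error FSO raised instead of letting the script stop
    WScript.Echo "Write failed: " & Err.Number & " " & Err.Description
    Err.Clear
End If
On Error GoTo 0
ts.Close
```

Without the On Error Resume Next guard the script stops at the Write call.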

In More Detail:

The Thread Locale is, by default, the User Locale, which is the date and time format setting in the "Region and Language" control panel applet (called different things in different versions of windows). It has an associated code page. According to MS internationalization expert Michka (Michael Kaplan, RIP), the reason it has a code page is so that Months and Days of the week can be written in appropriate characters, and it should not be used for any other purpose.

The ASP-classic people clearly had other ideas, since Response.CodePage is thread-locale, and can be controlled by vbscript GetLocale and SetLocale amongst other methods. If the User Locale is changed, all processes are notified, and any thread that is using the default value updates. (I haven't tested what happens to a thread currently using a non-default value).

The System Locale is also called "Language for non-Unicode programs" and is also found in the "Region and Language" applet, but requires a reboot to change. This is the value used internally by Windows ("The System") to map between the "A" API and the "W" API. Changing this has no effect on the language of the Windows GUI (that is not a "non-Unicode program").

Assuming that the "Time and Date" setting matches the "Language for non-Unicode programs", any Chr(i) that can create a valid Unicode code point (see "mapping errors" below) will map back exactly from Unicode to "code page". Note that this does work for code points that are "control characters". Also note that it doesn't work the other way: UTF-CodePage-UTF doesn't always round-trip exactly. Famously, (Character, Modifier)-CodePage-(Complex Character) does not round-trip correctly where Unicode defines more than one way of constructing a language character representation.

If the "Time and Date" does not match the "Language for non-Unicode programs", any translation could take place, for example U+0101 is 0xE0 on cp28594 and 0xE2 on cp28603: Chr(224) would go through U+0101 to be written as 226.
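
One way to check a pair of values like this without changing any locale setting is to encode the same Unicode character under both code pages and compare the bytes. The sketch below is only an illustration and is not what FSO does internally: it uses ADODB.Stream rather than FileSystemObject, FirstByte is a hypothetical helper, and it assumes the charset names "iso-8859-4" and "iso-8859-13" (corresponding to cp28594 and cp28603) are registered on the machine.

```vbscript
Option Explicit

' Hypothetical helper: encode one character with a named charset
' and return the value of the first resulting byte
Function FirstByte(ch, charsetName)
    Dim stm, bytes
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = 2                 ' adTypeText
    stm.Charset = charsetName
    stm.Open
    stm.WriteText ch
    stm.Position = 0
    stm.Type = 1                 ' adTypeBinary, to read back the raw bytes
    bytes = stm.Read
    stm.Close
    FirstByte = AscB(MidB(bytes, 1, 1))
End Function

WScript.Echo "iso-8859-4:  0x" & Hex(FirstByte(ChrW(&H101), "iso-8859-4"))
WScript.Echo "iso-8859-13: 0x" & Hex(FirstByte(ChrW(&H101), "iso-8859-13"))
```

If the two lines print different byte values, a character created under one code page and written under the other will be silently transposed.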

Even if there are not transposition errors, if the "Time and Date" does not match the "Language for non-Unicode programs" the program may fail when translating to the System Locale: if the Unicode code point does not have a matching Code Page code point, there will be an exception from the FileSystemObject.

There may also be mapping errors at Chr(i), going from code page to Unicode. Locale 1041 (Japanese) uses a double-byte code page (code page 932, Shift JIS). 0x81 is (only) the first byte of a double-byte pair. To be consistent with other code pages, 0x81 should map to the control character U+0081, but when given 0x81 and the Japanese code page, Windows assumes that the next byte in the buffer, or in the BSTR, is the second byte of the double-byte pair (I've not determined if the mistake is made before or after the conversion). Chr(&H81) is mapped to U+xx81 (81,xx). When I did it, I got U+4581, which is a CJK Unified Ideograph (Brasenia purpurca): it's not mapped by the Japanese code page.

Mapping errors at Chr(i) do not cause VBScript exceptions at the point of creation. If the UTF-16 code point created is invalid or not on the System Locale code page, there will be a FileSystemObject exception at .write. This particular problem can be avoided by using ChrW(i) instead of Chr(i). With the Japanese locale (1041), ChrW(129) becomes the Unicode control character U+0081 instead of U+xx81.

Background:

A program can map between Unicode and "codepage" using any installed code page: the Windows functions MultiByteToWideChar and WideCharToMultiByte take [UINT CodePage] as the first parameter. That mechanism is used internally in Windows to map the "A" API to the "W" API, for example GetAddressByNameA and GetAddressByNameW. Windows is "W", (wide, 16 bit) internally, and "A" strings are mapped to "W" strings on call, and back from "W" to "A" on return. When Windows does the mapping, it uses the code page associated with the "System Locale", also called "Language for non-Unicode programs".

The Windows API function WriteFile writes bytes, not characters, so it's not an "A" or "W" function. Any program that uses it has to handle conversion between strings and bytes. The c function fwrite writes characters, so it can handle 16 bit characters, but it has no way of handling variable length code points like UTF-8 or UTF-16: again, any program that uses "fwrite" has to handle conversion between strings and words.

The C++ function fwrite can handle UTF, and the compiler function _fwrite does magic that depends on the compiler. Presumably, on Windows, if code page translation is required the MultiByteToWideChar and WideCharToMultiByte API is used.

The "A" code pages and the "A" API were called "ANSI" or "ASCII" or "OEM", and started out as 8-bit characters, then grew to double-byte characters, and have now grown to UTF-8 (1 to 4 bytes per code point). The "W" API started out as 16-bit characters, then grew to UTF-16 (one or two 16-bit words per code point). Both are multi-word character encodings: the distinction is that for the "A" API and code pages, the word length is 8 bits, while for the "W" API and UTF-16, the word length is 16 bits. Because they are both multi-byte mappings, and because "byte" and "word" and "char" and "character" mean different things in different contexts, and because "W" and particularly "A" mean different things than they did years ago, I've just used "A" and "W" and "code page" and "Unicode".

"OEM" is the code page associated with another locale: The Console I/O API. It is per-process (it's a thread locale), it can be changed dynamically (using the CHCP command) and its default value is set at installation: there is no GUI provided to change the value stored in the registry. Most console programs don't use the console I/O API, and as written, use either the system locale, or the user locale, or, (sometimes inadvertently), a mixture of both.

The System Locale can be spoofed by using a manifest and there was a WinXP utility called "AppLocale" that did the same thing.

Chiang answered 8/11, 2020 at 9:43 Comment(7)
Do you have proof that fwrite writes characters, not bytes? I think it's not true. - Almucantar
@pts, c++ fwrite writes UTF-8. UTF-8 is a "variable-length character encoding standard". For more information, see any fwrite documentation. - Chiang
My fwrite() documentation (man7.org/linux/man-pages/man3/fwrite.3p.html) doesn't mention UTF-8. And for decades, even before UTF-8 was invented, I've been using fwrite() to write arbitrary, non-UTF-8 binary data to files. So for me fwrite() writes bytes, not characters. Which manual tells you that it writes characters? - Almucantar
@Almucantar your fwrite documentation tells you that fwrite takes a "size" parameter as well as a length parameter. - Chiang
You claim that fwrite writes UTF-8. Do you have any evidence to back it up? My fwrite manual and any documentation I've ever seen says that fwrite writes bytes, not particularly UTF-8. - Almucantar
@Almucantar "++" has special significance when appended after "c". It's not just decoration. - Chiang
The C++ standard explicitly says that functions in the C standard library work identically in C++. fwrite is such a function. Thus the distinction between C and C++ doesn't matter when focusing on the fwrite function. I repeat my question to you. You claim that fwrite writes UTF-8. Do you have any evidence to back it up? By evidence I mean a web link to some official documentation. My link 1 and link 2 both contradict your claim that fwrite writes UTF-8. Can you back up your claim? - Almucantar

FSO decides how to encode text when the file is opened. Use the format argument as follows:

    Set ts = fs.OpenTextFile("tmp.txt", 2, True, -1)
    '                                            ↑↑

Resource: OpenTextFile Method

Syntax

    object.OpenTextFile(filename[, iomode[, create[, format]]])

Arguments

object - Required. Object is always the name of a FileSystemObject.

filename - Required. String expression that identifies the file to open.

iomode - Optional. Can be one of three constants: ForReading, ForWriting, or ForAppending.

create - Optional. Boolean value that indicates whether a new file can be created if the specified filename doesn't exist. The value is True if a new file is created, False if it isn't created. If omitted, a new file isn't created.

format - Optional. One of three Tristate values used to indicate the format of the opened file.

TristateTrue = -1 to open the file as Unicode,
TristateFalse = 0 to open the file as ASCII,
TristateUseDefault = -2 to open the file as the system default.

If omitted, the file is opened as ASCII.
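
As a small sketch of the difference the format argument makes, the loop from the question can be rerun with format = -1 and the resulting file size inspected. It assumes the same scratch file name tmp.txt and uses ChrW so that the characters written do not depend on the thread locale.

```vbscript
Option Explicit
Dim fs, ts, i
Set fs = CreateObject("Scripting.FileSystemObject")

' format = -1 (TristateTrue): the stream is written as "Unicode"
Set ts = fs.OpenTextFile("tmp.txt", 2, True, -1)
For i = 128 To 255
    ts.Write ChrW(i)
Next
ts.Close

' Compare the byte count on disk with the 128 characters written
WScript.Echo "Bytes on disk: " & fs.GetFile("tmp.txt").Size
```

Running it once with -1 and once with 0 as the last argument makes the two encodings easy to compare in a hex viewer.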

Deify answered 24/10, 2020 at 10:32 Comment(4)
That isn't what they asked though, is it? - Austine
Not to mention this has been discussed numerous times before and ADODB.Stream is the correct approach. You realise Unicode and UTF-8 are different, right? - Austine
It's not clear to me that values 128-255 have an "ASCII" representation, nor how those would be encoded from a UTF-16 value. I'll try to expand the question. - Chiang
@Chiang Microsoft, in their eternal wisdom, keep using misnomers ASCII and Ansi perpetually… - Deify
