Tcl for getting ASCII code for every character in a string
I need to get the ASCII code for every character in a string. Actually, it's every character in a (small) file. The following first three lines successfully pull all of a file's contents into a string (per this recipe):

set fp [open "store_order_create_ddl.sql" r]
set data [read $fp]
close $fp

I believe I am correctly discerning the ASCII code for the characters (see http://wiki.tcl.tk/1497). However, I'm having a problem figuring out how to loop over every character in the string.

First of all, I don't think the following is an especially idiomatic way of looping over the characters of a string in Tcl. Second, and more importantly, it behaves incorrectly, printing an extra character (with code 0) between every character.

Below is the code I've written to act on the contents of the "data" variable set above, followed by some sample output.

CODE:

for {set i 0} {$i < [string length $data]} {incr i} {
  set char [string index $data $i]
  scan $char %c ascii
  puts "char: $char (ascii: $ascii)"
}

OUTPUT:

char: C (ascii: 67)
char:  (ascii: 0)
char: R (ascii: 82)
char:  (ascii: 0)
char: E (ascii: 69)
char:  (ascii: 0)
char: A (ascii: 65)
char:  (ascii: 0)
char: T (ascii: 84)
char:  (ascii: 0)
char: E (ascii: 69)
char:  (ascii: 0)
char:   (ascii: 32)
char:  (ascii: 0)
char: T (ascii: 84)
char:  (ascii: 0)
char: A (ascii: 65)
char:  (ascii: 0)
char: B (ascii: 66)
char:  (ascii: 0)
char: L (ascii: 76)
char:  (ascii: 0)
char: E (ascii: 69)
Cumbrous answered 4/11, 2009 at 18:15 Comment(6)
Don't know anything about TCL, but what I can tell you from the output is that your input string is in UTF-16, specifically UTF-16 little-endian, not ASCII.Hardecanute
Arthur, I appreciate the comment, but I'm very interested to know, how can you tell that (it's UTF-16 little-endian) from the output?Cumbrous
UTF-16 uses two-byte units to encode characters. For the first 65536 Unicode characters (the so-called Plane 0), it uses one of those units; for all the rest, it uses two (i.e., 4 bytes, distinguished as two surrogate characters, each encoded on two bytes). The ASCII characters form the first 128 Unicode characters, hence they're encoded using two bytes, the most significant one always being 0 and the least significant one equal to the character's ASCII code. Here you see that each ASCII code is followed by a null byte, so the least-significant byte comes first, i.e. UTF-16LE.Hardecanute
Thanks Arthur, that's clearer than the Wikipedia article I looked up in the meantime!Cumbrous
Arthur, please consider writing this up as an answer rather than a comment, and I will certainly upvote it and also probably accept it; so you can gain some reputation for your input.Cumbrous
PS...the way this came about for me was that I was actually trying to parse the output with PHP but encountered segfaults when trying to tokenize the data. With PHP I determined that there were internal null characters, and I thought it might have to do with transferring the file, first via Remote Desktop, and then via SCP. I ruled out the latter, so to try to be sure it wasn't because of a) PHP, and b) transferring via Remote Desktop, I then uploaded TCLKit to the remote desktop, so I could try with another language, directly on the machine where the SQL got generated.Cumbrous
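Arthur's byte-layout explanation can be checked directly in Tcl. The sketch below (assuming Tcl's "unicode" encoding, which is UTF-16 in the machine's native byte order) encodes a single letter and dumps its bytes:

```tcl
# Encode the letter "C" as UTF-16 bytes and inspect them.
# Tcl's "unicode" encoding is UTF-16 in the machine's native byte
# order; on little-endian (Intel) hardware that is UTF-16LE.
set bytes [encoding convertto unicode "C"]
binary scan $bytes c* values
puts $values   ;# on a little-endian machine: 67 0
```

The trailing 0 is exactly the "extra element" seen in the question's output.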
The following code should work:

set data {CREATE TABLE}
foreach char [split $data ""] {
    lappend output [scan $char %c]
}
set output ;# 67 82 69 65 84 69 32 84 65 66 76 69

As far as the extra characters in your output, it seems like the problem is with your input data from the file. Is there some reason there would be null characters (\0) in between every character in the file?

Chromatography answered 4/11, 2009 at 18:31 Comment(3)
I'd begun to suspect that it might be an issue with the input, though there is no good reason for null characters between every character, except that it was generated with a Microsoft (SQL Server) tool ;)Cumbrous
Then that's your answer. Most Microsoft tools (as well as Apple's, by the way), use UTF-16 as their internal encoding; UTF-16LE being far more widespread because that's the native Intel endianness. You need to tell Tcl to interpret the input file as UTF-16. Again, no idea how to do that, sorry, but you should look for keywords like “encoding” or “character set” or, generally speaking, Unicode, in the docs.Hardecanute
Think you might want to do: fconfigure $fp -encoding unicode after opening the file but before reading from it.Storax
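Putting that comment together with the question's original code gives a complete sketch. (The file name demo.sql and its contents are made up here so the example is self-contained; apply the same fconfigure line to the real file before reading it.)

```tcl
# Write a small UTF-16 ("unicode") file just for demonstration.
set out [open "demo.sql" w]
fconfigure $out -encoding unicode
puts -nonewline $out "CREATE TABLE"
close $out

# The fix: tell Tcl the channel's encoding before reading, so the
# null bytes are consumed as part of each character instead of
# showing up as separate NUL characters.
set fp [open "demo.sql" r]
fconfigure $fp -encoding unicode
set data [read $fp]
close $fp

foreach char [split $data ""] {
    puts "char: $char (ascii: [scan $char %c])"
}
```

With the encoding set, the loop prints one line per real character, with no interleaved code-0 entries.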
Came across this older question while looking for something else. Answering it for the benefit of anyone else who may be looking for a solution.

First off, understand what character encodings are. The source data in the example is NOT in the ASCII character encoding, so ASCII codes (0-127) on their own have no meaning--except that in this example the encoding appears to be UTF-16, which includes the ASCII codes as a subset. What you probably want is the full range of "character" codes from 0 to 255, but depending on your system, the source of the data, etc., codes 128-255 may be ANSI, ISO, or some other code page. What you want to do is convert the data into a format you know how to handle, such as the very common ISO 8859-1 encoding (encoding "iso8859-1"), which is very similar to the Windows 1252 encoding (encoding "cp1252"), or UTF-8 (encoding "utf-8"), using the "encoding" command:

set data [encoding convertto utf-8 $data] ;# For UTF-8

set data [encoding convertto iso8859-1 $data] ;# For ISO 8859-1

and so on. If you're reading the data from a file, you may want to set the file encoding (via fconfigure) prior to reading the data as well, to make sure you're reading the file data correctly. Look up the man pages for "encoding" (and "fconfigure") for more details on handling character set encoding.

Once you have the encoding of the data under control, the rest of the example code should work as expected.
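For the in-memory direction (decoding bytes you have already read, rather than configuring the channel), "encoding convertfrom" is the counterpart to the "convertto" calls above. A hedged sketch, building the raw byte string here just for illustration:

```tcl
# Suppose $raw holds UTF-16 ("unicode") bytes, e.g. read in binary
# mode from a file. Build such a byte string here for illustration:
set raw [encoding convertto unicode "CREATE"]

# Decode the bytes into a normal Tcl string, then scan each character.
set data [encoding convertfrom unicode $raw]
set codes {}
foreach char [split $data ""] {
    lappend codes [scan $char %c]
}
puts $codes   ;# 67 82 69 65 84 69
```

Note the direction: convertfrom turns external bytes into a Tcl string; convertto turns a Tcl string into external bytes.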

Visional answered 15/4, 2015 at 19:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.