What are the file/group/record/unit separator control characters and their usage?
S

4

64

Unicode defines several control characters from ASCII. http://www.unicode.org/charts/PDF/U0000.pdf

I see that many control characters are widely used, but I really don't see where the "information separators" (U+001C to U+001F) are used.

What are they? What's their history? What were they used for?

Severalty answered 1/1, 2012 at 19:50 Comment(3)
The field and record separators can be used to marshal table data as a string. It's a bit archaic, but it works. - Dowzall
Thanks for asking this. I'm totally going to use unit separators instead of tab or comma-delimiting text now. - Aguilar
FYI, Unicode actually defines all 128 characters of US-ASCII, not just some of the control characters. Unicode is a superset of US-ASCII. - Guardsman
M
76

Lammert Bies explains both their usage and the history behind them.

28 – FS – File separator
The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.

29 – GS – Group separator
Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.

30 – RS – Record separator
Within a group (or table) the records are separated with RS or record separator.

31 – US – Unit separator
The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

A unit separator could serve essentially the same purpose as a comma in a CSV file or a tab in a tab-delimited file.
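To make the comparison with CSV concrete, here is a minimal sketch in Java of a table stored as one string, with US between fields and RS between records. The class `SeparatorTable` and its methods are hypothetical names made up for this illustration, not part of any library, and the sketch assumes the field values themselves never contain U+001E or U+001F.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only: a table serialized as one string,
// with US (U+001F) between fields and RS (U+001E) between records.
class SeparatorTable {
    static final String US = "\u001F"; // unit separator: between fields
    static final String RS = "\u001E"; // record separator: between rows

    // Join each row's fields with US, then join the rows with RS.
    static String encode(List<List<String>> rows) {
        List<String> records = new ArrayList<>();
        for (List<String> row : rows) {
            records.add(String.join(US, row));
        }
        return String.join(RS, records);
    }

    // Split on RS, then on US. The -1 limit keeps trailing empty fields.
    static List<List<String>> decode(String data) {
        List<List<String>> rows = new ArrayList<>();
        for (String record : data.split(RS, -1)) {
            rows.add(List.of(record.split(US, -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> table = List.of(
                List.of("id", "note"),
                List.of("1", "contains, a comma\tand a tab"),
                List.of("2", ""));
        String encoded = encode(table);
        // Commas and tabs inside fields need no quoting or escaping.
        System.out.println(decode(encoded).equals(table)); // prints true
    }
}
```

Unlike CSV, no quoting rules are needed for commas, tabs, or line breaks inside a field. The trade-off is that the separators are invisible in most editors and must be rejected or escaped if they can occur in the data.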

Mainmast answered 13/9, 2013 at 9:0 Comment(3)
It's kind of depressing to learn that the CSV and TSV file formats are based on a lack of ASCII knowledge. - Interclavicle
CSV uses comma, but that's a printable character. All the control codes are considered non-printable, and are thus binary data - even if mostly human readable. So there is a small difference. - Greaves
Re: "All the control codes are considered non-printable" - in their ASCII form, they are. But, as I discovered today, Unicode also has printable characters representing these control codes: ␜ ␝ ␞ ␟ - Whitneywhitson
S
11

Did you mean that most of them are usually not used these days? The control characters mostly relate to device control functions, but some of them may have been used as separators in text files. For a quick reference, check my table of C0 Controls.

The information separators have been used to group data in a simple manner, but these days either binary formats or XML are used for data organization. There are still curiosities, like the internal use of U+001E and U+001F in Microsoft Word to implement the program’s own idea of “nonbreaking hyphen” and “optional hyphen” (as opposed to the Unicode characters for similar purposes). This mainly illustrates that programs can use control characters in weird ways. Problems arise, of course, if the characters are included in text transmitted to other programs.

Soissons answered 1/1, 2012 at 20:11 Comment(1)
I'm sorry for my bad English. I updated my question to be more clear. - Severalty
J
5

They're deliberately ambiguous in function. From the standard reference for character coding development (Mackenzie, Charles E. Coded-Character Sets: History and Development. Addison-Wesley Longman Publishing Co., Inc., 1980.), chapter 26 section 1, page 460:

Four additional general-purpose characters, the so-called information separators, were designed into the [ASCII] 7-Bit Code and into EBCDIC. File Separator, Group Separator, Record Separator, and Unit Separator were defined broadly to be used to separate blocks of information. But how they were to be used to separate blocks, what philosophy of file and record structuring was to be used, was intentionally not specified. Such detailed specification would be left to the particular data processing application in which the separators would be used. Initially, a hierarchial [sic] philosophy of structuring information blocks was defined. A “file” was larger than, and would enclose, “groups.” A “group” was larger than, and would enclose, “records.” And a “record” was larger than, and would enclose, “units.” Eventually, the standards committees made this hierarchial [sic] specification optional; that is, the separators need not be used hierarchially [sic], but if they were, then the hierarchy would be as described above. The standards committees realized that, as with the Device Controls, the unspecificity of the information separators could lead to difficulty of information interchange, but such difficulties could be worked out in the rare instances when they arose.

One example of a standard that uses this outline hierarchy is a now-superseded version of the ANSI/NIST-ITL Standard for interchange of forensic biometric images. The ITL “Traditional Encoding” used the ASCII separators as follows:

␜ File separator character – separates logical records.

␝ Group separator character – separates fields.

␞ Record separator character – separates repeated subfields.

␟ Unit separator character – separates information items.

This usage might appear to contradict the named purpose of the separators, but understanding the intended hierarchy of the character codes makes the choices in the ITL standard more appropriate.

A current example of a data format that uses an ASCII separator control code is the JavaScript Object Notation (JSON) text sequence format (RFC 7464, media type application/json-seq), which places an ASCII Record Separator (0x1E) character before each record.
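As a rough sketch of that framing in Java: RFC 7464 puts an RS before each JSON text and a line feed after it. The class `JsonSeq` and its methods are hypothetical names for this illustration. The sketch treats each JSON text as an opaque string and does not parse or validate it, so it is not a complete implementation of the RFC (for example, it does not handle truncated texts).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of JSON text sequence framing (RFC 7464).
// JSON texts are treated as opaque strings; nothing is parsed or validated.
class JsonSeq {
    static final char RS = '\u001E'; // ASCII record separator, 0x1E

    // Frame each JSON text as: RS, the text, then a line feed.
    static String write(List<String> jsonTexts) {
        StringBuilder sb = new StringBuilder();
        for (String json : jsonTexts) {
            sb.append(RS).append(json).append('\n');
        }
        return sb.toString();
    }

    // Split on RS and drop empty chunks, such as the one before the first RS.
    static List<String> read(String sequence) {
        List<String> texts = new ArrayList<>();
        for (String chunk : sequence.split(String.valueOf(RS))) {
            String text = chunk.strip(); // removes the trailing line feed
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String seq = write(List.of("{\"a\":1}", "[2,3]"));
        System.out.println(read(seq)); // prints [{"a":1}, [2,3]]
    }
}
```

Because a valid JSON text cannot contain a raw U+001E character (control characters must be escaped inside JSON strings), a reader can resynchronize at the next RS after a damaged record.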

Japanese answered 22/6, 2021 at 20:54 Comment(0)
G
0

Control Picture, display counterpart

Other Answers are correct. ASCII (and therefore Unicode) define the four control characters as delimiters.

In addition, as mentioned in Comment by Rounin, Unicode defines four more characters that serve as a visual representation for each of the four control characters. These are called Control Pictures.

The Control Picture characters are useful in a text editor that displays a text file containing any of those control characters. They are also useful when documenting the use of the corresponding control characters.
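A minimal sketch in Java of that display use: replace each of the four separators with its Control Picture before showing the text. It relies on the fact that each picture sits exactly 0x2400 above its control character (U+001C maps to U+241C, and so on). The class `ControlPictures` and the method `showSeparators` are hypothetical names for this illustration.

```java
// Hypothetical illustration: make the four information separators visible
// by swapping each one for its Control Picture character.
class ControlPictures {
    // Offset from a control character to its picture (U+001C -> U+241C).
    static final int PICTURE_OFFSET = 0x2400;

    static String showSeparators(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x1C && c <= 0x1F) {
                // FS, GS, RS, or US: append the visible symbol instead.
                sb.append((char) (c + PICTURE_OFFSET));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "id" + (char) 31 + "name" + (char) 30 + "1" + (char) 31 + "Ada";
        System.out.println(showSeparators(raw)); // prints id␟name␞1␟Ada
    }
}
```

The conversion is for display only. The pictures are ordinary printable characters, so text containing them is no longer delimited in the original sense.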

Here is a table of the code points for the control characters and their display counterparts. The official sources are these PDF documents for Unicode 15.1, published by the Unicode Consortium:

The glyphs for each of these Control Picture characters represent a pair of uppercase letters from the secondary name:

  • FS
  • GS
  • RS
  • US

| Primary name | Secondary name | Code point, decimal (hex) | Control Picture name | Control Picture code point, decimal (hex) | Glyph |
|---|---|---|---|---|---|
| INFORMATION SEPARATOR FOUR | file separator (FS) | 28 (U+001C) | SYMBOL FOR FILE SEPARATOR | 9,244 (U+241C) | ␜ |
| INFORMATION SEPARATOR THREE | group separator (GS) | 29 (U+001D) | SYMBOL FOR GROUP SEPARATOR | 9,245 (U+241D) | ␝ |
| INFORMATION SEPARATOR TWO | record separator (RS) | 30 (U+001E) | SYMBOL FOR RECORD SEPARATOR | 9,246 (U+241E) | ␞ |
| INFORMATION SEPARATOR ONE | unit separator (US) | 31 (U+001F) | SYMBOL FOR UNIT SEPARATOR | 9,247 (U+241F) | ␟ |

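Because they are control characters, you cannot readily type them on a keyboard. Instead, create them from their assigned code point numbers. For example, in Java (the `Character.toString( int )` overload used here requires Java 11 or later):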

final String file_separator_FS = Character.toString( 28 ) ;
final String group_separator_GS = Character.toString( 29 ) ;
final String record_separator_RS = Character.toString( 30 ) ;
final String unit_separator_US = Character.toString( 31 ) ;

You may want to do the same for the Control Picture characters as well.

final String SYMBOL_FOR_FILE_SEPARATOR = Character.toString( 9_244 ) ;
final String SYMBOL_FOR_GROUP_SEPARATOR = Character.toString( 9_245 ) ;
final String SYMBOL_FOR_RECORD_SEPARATOR = Character.toString( 9_246 ) ;
final String SYMBOL_FOR_UNIT_SEPARATOR = Character.toString( 9_247 ) ;
Guardsman answered 28/10, 2023 at 1:34 Comment(0)
