Should text-processing DCGs be written to handle codes or chars? Or both?

Asked 21/2, 2021 at 10:16 Answered 22/2, 2021 at 11:24

In Prolog, there are traditionally two ways of representing a sequence of characters:

As a list of chars, which are atoms of length 1.
As a list of codes, which are just integers. The integers are to be interpreted as codepoints, but the convention to be applied is left unspecified. As a (eminently sane) example, in SWI-Prolog, the space of codepoints is Unicode (thus, roughly, the codepoint-integers range from 0 and 0x10FFFF).

DCGs, a notational way of writing left-to-right list processing code, are designed to perfom parsing on "lists of exploded text". Depending on preference, the lists to-be-handled can be lists of chars or lists of codes. However, the notation for char/code processing differs when writing down the constants. Does one generally write the DCG in "char style" or "code style"? Or maybe even in char/code style for portability in case of modules exporting DCG nonterminals?

Some Research

The following notations can be used to express constants in DCGs

'a': A char (as usual: single quotes indicate an atom, and they can be left out if the token starts with a lowercase letter.)
0'a: the code of a .
['a','b']: A list of char.
[ 0'a, 0'b ]: A list of codes, namely the codes for a and b (so you can avoid typing in the actual codepoint values).
"a" a list of codes. Traditionally, a double-quoted string is exploded into a list of codes, and this notation also works SWI-Prolog in DCG contexts, even though SWI-Prolog maps a "double-quoted string" to the special string datatype otherwise.
`0123`. Traditonally, text within back-quotes is mapped to an atom (I think, the 95 ISO Standard just avoids being specific regarding the meaning of a back-quoted string. "It would be a valid extension of this part of ISO/IEC 13211 to define a back quoted string as denoting a character string constant."). In SWI-Prolog, text within back-quotes is exploded into a list of codes unless the flag back_quotes has been set to demand a different behaviour.

Examples

Char style

Trying to recognize "any digit" in "char style" and make its "char representation" available in C:

zero(C) --> [C],{C = '0'}. 

nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

Code style

Trying to recognize "any digit" in "code style":

zero(C) --> [C],{C = 0'0}.

nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

Char/Code transparent style

DCGs can be written as "char/code transparent style" by duplicating the rules involving constants. In the above example:

zero(C) --> [C],{C = '0'}. 
zero(C) --> [C],{C = 0'0}.

nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

The above also accepts a sequence of alternating codes and chars (as lists of stuff cannot be typed). This is probably not a problem). When generating, one will get arbitrary char/code mixes which are unwanted, and then cuts need to be added.

Char/Code transparent style taking an additional `Mode` indicator

Another approach would be to explicitly indicate the mode. Looks clean:

zero(C,chars) --> [C],{C = '0'}. 
zero(C,codes) --> [C],{C = 0'0}.

nonzero(C,chars) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.

any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).

Char/Code transparent style using dialect features

Alternatively, features of the Prolog dialect can be used to achieve char/code transparency. In SWI-Prolog, there is code_type/2, which actually works on codes and chars (there is a corresponding char_type/2 but IMHO there should be only chary_type/2 working for chars and codes in any case) and for "digit-class" codes and chars yield the compound digit(X):

?- code_type(0'9,digit(X)).
X = 9.

?- code_type('9',digit(X)).
X = 9.

?- findall(W,code_type('9',W),B).
B = [alnum,csym,prolog_identifier_continue,ascii,
     digit,graph,to_lower(57),to_upper(57),
     digit(9),xdigit(9)].

And so one can write this for clean char/code transparency:

zero(C) --> [C],{code_type(C,digit(0)}. 

nonzero(C) --> [C],{code_type(C,digit(X),X>0}.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

In SWI-Prolog in particular

SWI-Prolog by default prefers codes. Try this:

The flags

influence interpretation of "string" and `string` in "standard code". By default"string" is interpreted as an atomic "string" whereas `string` is interpreted as a "list of codes".

Outside of DCGs, the following holds in SWI-Prolog, with all flags at their default:

?- string("foo"),\+atom("foo"),\+is_list("foo").
true.

?- L=`foo`.
L = [102,111,111].

However, in DCGs, both "string" and `string` are interpreted as "codes" by default.

Without any settings changed, consider this DCG:

representation(double_quotes)    --> "bar".            % SWI-Prolog decomposes this into CODES 
representation(back_quotes)      --> `bar`.            % SWI-Prolog decomposes this into CODES
representation(explicit_codes_1) --> [98,97,114].      % explicit CODES (as obtained via atom_codes(bar,Codes))
representation(explicit_codes_2) --> [0'b,0'a,0'r].    % explicit CODES 
representation(explicit_chars)   --> ['b','a','r'].    % explicit CHARS

Which of the above matches codes?

?- 
findall(X,
   (atom_codes(bar,Codes),
    phrase(representation(X),Codes,[])),
   Reps).

Reps = [double_quotes,back_quotes,explicit_codes_1,explicit_codes_2].

Which of the above matches chars?

?- findall(X,
   (atom_chars(bar,Chars),phrase(representation(X),Chars,[])),
   Reps).
Reps = [explicit_chars].

When starting swipl with swipl --traditional the backquoted representation is rejected with Syntax error: Operator expected , but otherwise nothing changes.

Fania answered 21/2, 2021 at 10:16 Comment(4)

"... "a" a list of codes." This is not correct. The interpretation of a double-quoted term depends on the value of the standard double_quotes flag at the time the Prolog text is parsed. – Dressy 21/2, 2021 at 10:41

@PauloMoura But in SWI-Prolog the "foo" is (by default, though that can be changed) the "string" with no further substructure. Using "foo" in a DCG, however, explodes it into codes. I will add an example. – Fania 21/2, 2021 at 11:11

Still, you keep making the wrong claim that "a" is list of codes. If the standard double_quotes flag is set to atom, then "a" is an atom. If the flag is set to chars, then "a" is a list of chars, [a]. – Dressy 21/2, 2021 at 15:43

You shouldn't need to worry about these things when writing DCG rules. Just use double-quoted "string" literals in the DCG rule definitions, those work as they should. No need to hack around with the concrete representation. – Mycenaean 22/2, 2021 at 8:26

The Prolog Standard (6.3.7) says:

A double quoted list is either an atom (6.3.1.3) or a list (6.3.5).

Consequently, the following should succeed:

Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

?- Foo = "foo", (atom(Foo) ; Foo = [F, O, O]).
false.

So SWI-Prolog is not a Prolog by default. That's OK, but if you want to know about SWI-Prolog's non-Prolog behavior, please adjust the tags on the question.

From the definition it also follows that double quoted lists are completely useless by default even in a conforming Prolog: They might denote atoms, so regardless of the chars/codes distinction you can't even know that the double quoted list is actually a list. Even DCGs that only care about structural properties of the "text" (whether it's a palindrome, for example) are useless if the "list" is in fact an atom.

Hence a Prolog program that wants to process text with DCGs must at startup force the double_quotes flag to the value it wants. You have the choice between codes and chars. Codes have no advantages over chars, but they do have disadvantages in readability and typeability. Thus:

Answer: Use chars. Set the double_quotes flag explicity.

Scut answered 21/2, 2021 at 20:46 Comment(4)

That ... makes sense. SWI-Prolog's opinion about the "traditional" way of a list interpretation of a double-quoted element is that it is a list of codes: double_quotes flag. If --traditional (command line option) is given, the default is codes, which produces a list of character codes, integers that represent a Unicode code-point. The value chars produces a list of one-character atoms and the value atom makes double quotes the same as single quotes (excpt for escaping), creating a atom. – Fania 21/2, 2021 at 22:30

When you define a DCG rule, you can use double quoted literals in every Prolog I know of, including SWI-Prolog, without mucking around with any global flags. Your observations might be correct, but your conclusions are questionable. The bold Answer at the end is at best an opinion. – Mycenaean 22/2, 2021 at 8:24

Yes you can use double quoted literals in a DCG rule, but you have no guarantee that they will mean what you want. If you define foo --> "foo"., should phrase(foo, [f, o, o]). succeed? The Prolog standard doesn't tell you. But it does give you the flag you can tweak to decide for yourself. – Scut 22/2, 2021 at 10:43

you can't even know that the double quoted list is actually a list that is incorrect. The default value for the flag double_quotes is implementation defined. So you do know what "abc" means prior to changing the flag. You need to read the accompanying documentation first. That holds also for many other aspects. Like the supported processor character set, or the kind of integers supported etc. – Wenona 22/2, 2021 at 12:10

I should start to by noting that the answer to the "Should text-processing DCGs be written to handle codes or chars? Or both?" question can be neither. DCGs work by using an implicit difference list to thread state. But the elements of that difference list can be other than chars or codes. It depends on the output of text tokenization and what exactly text processing entails. E.g. I have worked on and come across Prolog NLP applications where codes/chars were only used for the basic tokenization and the reasoning was performed (still with DCGS) using either atoms or compound terms that reified the token data (e.g. v(Verb) or n(Noun)). One of those applications (a personal assistant like it's common nowadays in phones) used atoms produced by a voice-recognition component.

But let's go back to chars vs codes. Legacy practices and failed standardization left Prolog with problematic text representation. ASCII gives us a singe quote, a back quote, and a double-quote. With single quotes being used for atoms, a choice could have been made to use e.g. back quotes for representing a list of codes and double-quotes for representing a list of chars. Or the other way around. Instead, and this is where standardization failed, we got the problematic double_quotes flag. There's no shortage of Prolog code in the wild that makes an assumption about the meaning of double-quoted terms and thus works or breaks depending on the implicit value of the double_quotes flag (if you think this is mainly an issue with legacy code, think again). Guess what happens when we try to combine code that require different values for the flag? Note that, in almost all systems (including those that support modules), the flag value is global ... As Isabelle wrote in her answer, setting the flag explicitly is good general advice. But not, as I explained, without problems.

Some systems provide additional values for the flag. E.g. SWI-Prolog allows the flag to also be set to string. GNU Prolog supports additional atom_no_escape, chars_no_escape and codes_no_escape. Some systems only support codes. Some systems also provide a back_quotes flag. This Babel tower means that portable and resilient code is often forced to use atoms to represent text. But this is may not be ideal from a performance perspective.

Back to the original question. As Isabelle mentioned, chars is usually a more readable (read, easier to debug) choice. But, depending on the Prolog system, codes may provide better performance. If application performance is critical, benchmark both solutions. Some recent Prolog systems (e.g. Scryer-Prolog or Trealla Prolog) have efficient support for chars. Older systems may trail behind.

Dressy answered 21/2, 2021 at 22:44 Comment(0)

Note that your question is very much related to I/O. Prior to ISO, many systems in the DEC-10 succession supported a single kind of I/O via get0/1 and put/1 (and versions for tty) which served both characters and bytes at the same time. What can go wrong with that? Today, that is obvious. But multi-octet character set handling (MOCSH as it was called) was for many a much more exotic feature as it is today, a quarter century after the standard's publication. After all, the today mostly accepted UTF-8 encoding was invented 1992-09 and first published in 1993. And like so many projects like TRON it could have failed as well. Some other programming languages got burnt by betting on UCS-2/UTF-16 encoding.

What the standard did was to split I/O into character and byte I/O (and their corresponding types text and binary). So there is now get_char/1, get_byte/1 ... That the _byte versions all use integers in the range of 0..255 was non-controversial (plus -1 for EOF). But what about the _char versions? The only way to resolve this was to provide both _char and _code versions and consequently chars and codes versions of double-quoted strings and related built-ins. The default for flag double_quotes is implementation defined (7.11.2.5).

In this manner systems with a lot of DEC-10 legacy could continue to use codes explicitly. For them, an integer thus meant either an integer or a byte or a character. But users of such system still could use the better encoding. New systems that do not have to deal with such legacies going back to 1977 opt as default for chars like Tau, Scryer, and Trealla. As much as tradition is concerned, note that Prolog I, often called Marseille Prolog, did encode double quoted strings as lists of atoms of length one. And in the preliminary version of Prolog of 1972, often called Prolog 0, strings were encoded as nil-s-t-r-i-n-g qua boum facilitating stemming. In any case, not a single character code was present at all.

The advantages of chars should be obvious. It is much easier to read and debug, in particular if you have partially instantiated strings, say [a,X,c] vs. [97,X,99], which occur often when generalizing queries as with library(diadem). It is also a bit shorter to write. And, double quoted strings can be used for printing answers.

If you really want to write programs that both support codes and chars at the same time, use rather goals like [Ch] = "a" where Ch is now the atom a or the integer 97 or 129 or whatever processor character set you are using. It all depends on the Prolog flag double_quotes. And more succinctly you can write

nonzero(C) --> [C],{member(C,"123456789")}.

What is even more important is that phrase("abc", "abc") still holds.

However, changing that flag within the same application is certainly not a good idea (nor to switch to the value atom or some non-conforming value).

((When using chars note that single quotes as in C = 'a' are a bit misleading since the single quotes do not serve any purpose. Instead, round brackets are preferable if you want to ensure that the code will be valid even in the presence of an operator declaration for a. When a occurs as a functor's argument or a list's element, no round brackets are needed, but they are often used redundantly in operator declarations.))

Wenona answered 22/2, 2021 at 11:24 Comment(2)

Thank you. That is the first time I hear someone mention TRON. I always thought it disappeared due to lack of interest but apparently it didn't disappear and got the Huawei treatement from our USian friends. – Fania 24/2, 2021 at 11:25

See the MOCSH link for more on all this. – Wenona 24/2, 2021 at 11:43

You are making incorrect assumptions. These are not "chars":

foo_or_bar(foo) --> "foo".

The "foo" is a string, in SWI-Prolog, but this works perfectly within a DCG rule definition. The place to read about this is here, in particular:

A DCG literal

Although represented as a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.

All your other suggestions are just unnecessary, you should be either enumerating explicitly all possible "nonzeros", digits and so on, or using the library.

PS: if your main goal is to write code that runs on any Prolog, you might as well use something like Logtalk instead.

Mycenaean answered 22/2, 2021 at 8:43 Comment(7)

I have to disagree. I know these are not chars, and "foo" inside of a DCG is not an SWi-Prolog string, it is a list of codes. The text quoted says as much. See also the note at the end of my question. Also, Logtalk is not Prolog (though I will certainly look into it) – Fania 24/2, 2021 at 11:9

@DavidTonhofer you say, "and "foo" inside of a DCG is not an SWi-Prolog string, it is a list of codes." I don't want to argue about this, it seems clear enough to me. It is a string but the term expansion of SWI-Prolog handles is so that DCGs using double-quoted literals don't have to be re-written. Please re-read the quoted text and check out the SWI-Prolog DCG expansion source code if you don't believe me. – Mycenaean 24/2, 2021 at 11:21

@DavidTonhofer Logtalk is not Prolog, but it is the only mature project that offers a compatibility layer between different Prolog implementations. So if you don't want to commit to an implementation, Logtalk is a pragmatic choice. – Mycenaean 24/2, 2021 at 11:23

@DavidTonhofer as a rule of thumb, reading and listening carefully is more useful than writing and talking. I will follow my own advice and abandon this thread, since it is too full of opinion and I don't see any chance at salvaging it at this point. I did what I could to point out factual errors, but this isn't helping, apparently. – Mycenaean 24/2, 2021 at 11:26

"any chance at salvaging it at this point". Are you serious. This is a good thread. – Fania 24/2, 2021 at 11:28

@DavidTonhofer :-) no, just joking – Mycenaean 24/2, 2021 at 11:30

@DavidTonhofer but seriously, figure out what your goal is. If you want to define DCGs, use double-quoted literals and forget about what they really are. – Mycenaean 24/2, 2021 at 11:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++