In Prolog, there are traditionally two ways of representing a sequence of characters:
- As a list of chars, which are atoms of length 1.
- As a list of codes, which are just integers. The integers are to be interpreted as codepoints, but the convention to be applied is left unspecified. As an (eminently sane) example, in SWI-Prolog, the space of codepoints is Unicode (thus, roughly, the codepoint integers range from 0 to 0x10FFFF).
DCGs, a notation for writing left-to-right list processing code, are designed to perform parsing on "lists of exploded text". Depending on preference, the lists to be handled can be lists of chars or lists of codes. However, the notation for constants differs between char and code processing. Does one generally write the DCG in "char style" or "code style"? Or maybe even in char/code style for portability, in case a module exports DCG nonterminals?
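As a small illustration of the two representations, the following sketch relates them element by element using the ISO built-in char_code/2. The predicate name chars_codes/2 is made up for this example and is not part of any library; maplist/3 is assumed to be available (it is autoloaded in SWI-Prolog).

```prolog
% chars_codes(?Chars, ?Codes): hypothetical helper relating a list of chars
% to the list of the corresponding codes. At least one of the two arguments
% should be a proper list.
chars_codes(Chars, Codes) :-
    maplist(char_code, Chars, Codes).
```

For instance, the query chars_codes([a,b,c], Codes) binds Codes to [97,98,99], and chars_codes(Chars, [0'a,0'b]) binds Chars to [a,b].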
Some Research
The following notations can be used to express constants in DCGs:
- 'a': A char (as usual: single quotes indicate an atom, and they can be left out if the token starts with a lowercase letter).
- 0'a: The code of a.
- ['a','b']: A list of chars.
- [ 0'a, 0'b ]: A list of codes, namely the codes for a and b (so you can avoid typing in the actual codepoint values).
- "a": A list of codes. Traditionally, a double-quoted string is exploded into a list of codes, and this notation also works in SWI-Prolog in DCG contexts, even though SWI-Prolog maps a "double-quoted string" to the special string datatype otherwise.
- `0123`: Traditionally, text within back-quotes is mapped to an atom (I think; the 95 ISO Standard just avoids being specific regarding the meaning of a back-quoted string: "It would be a valid extension of this part of ISO/IEC 13211 to define a back quoted string as denoting a character string constant."). In SWI-Prolog, text within back-quotes is exploded into a list of codes unless the flag back_quotes has been set to demand a different behaviour.
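To make the flag-independent notations concrete, here is a minimal sketch that stores the term the reader produces for each of them. The predicate notation/2 and its labels are invented for this illustration; the double-quoted and back-quoted forms are deliberately left out because their meaning depends on flags.

```prolog
% notation(?Label, ?Term): hypothetical table pairing a label with the term
% that the Prolog reader builds for the corresponding notation.
notation(char,      'a').            % an atom of length 1
notation(code,      0'a).            % the integer 97
notation(char_list, ['a','b']).      % same as [a,b]
notation(code_list, [0'a, 0'b]).     % same as [97,98]
```

Querying notation(code, X) gives X = 97, and notation(code_list, L) gives L = [97,98], which shows that the 0'c notation is only a convenient way of writing an integer.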
Examples
Char style
Trying to recognize "any digit" in "char style" and make its "char representation" available in C:
zero(C) --> [C],{C = '0'}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
Code style
Trying to recognize "any digit" in "code style":
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
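A more compact way to write the two styles, sketched here under the assumption that the grammar is only used for parsing (not for generating): test the code against the range 0'0 to 0'9 instead of enumerating the constants. The nonterminal names digit_code//1 and digit_char//1 are invented for this example.

```prolog
% Code style: accept one integer in the range of the ASCII digits.
% The integer/1 guard keeps between/3 from raising a type error on chars.
digit_code(C) -->
    [C],
    { integer(C), between(0'0, 0'9, C) }.

% Char style: accept one single-character atom whose code is in that range.
digit_char(C) -->
    [C],
    { atom(C), atom_length(C, 1),
      char_code(C, Code), between(0'0, 0'9, Code) }.
```

With these definitions, phrase(digit_char(C), ['7']) succeeds with C = '7', while phrase(digit_char(C), [0'7]) fails; digit_code//1 behaves the other way round.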
Char/Code transparent style
DCGs can be written as "char/code transparent style" by duplicating the rules involving constants. In the above example:
zero(C) --> [C],{C = '0'}.
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
The above also accepts a sequence of alternating codes and chars (as lists of stuff cannot be typed). This is probably not a problem. When generating, however, one will get arbitrary char/code mixes, which are unwanted, and then cuts need to be added.
Char/Code transparent style taking an additional Mode indicator
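The mixing behaviour can be checked with a small sketch. It repeats the duplicated rules and adds a sequence nonterminal digits//1, which is not part of the original example and is introduced here only for the demonstration.

```prolog
zero(C)    --> [C], {C = '0'}.
zero(C)    --> [C], {C = 0'0}.
nonzero(C) --> [C], {member(C, ['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C], {member(C, [0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

% digits(-Ds): hypothetical helper accepting one or more digits.
digits([C|Cs]) --> any_digit(C), digits(Cs).
digits([C])    --> any_digit(C).
```

The query phrase(digits(Ds), ['1', 0'2, '3']) succeeds with Ds = ['1',50,'3'], so a list mixing chars and codes is accepted as described.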
Another approach would be to explicitly indicate the mode. Looks clean:
zero(C,chars) --> [C],{C = '0'}.
zero(C,codes) --> [C],{C = 0'0}.
nonzero(C,chars) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).
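One possible way to use the Mode argument, sketched here as an assumption rather than as an established idiom, is to derive the mode from the first element of the input before calling phrase/2. The helpers list_mode/2 and parse_digit/2 are hypothetical names; the grammar is repeated so that the snippet is self-contained.

```prolog
zero(C,chars)    --> [C], {C = '0'}.
zero(C,codes)    --> [C], {C = 0'0}.
nonzero(C,chars) --> [C], {member(C, ['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C], {member(C, [0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).

% list_mode(+List, -Mode): guess the mode from the first element.
list_mode([X|_], chars) :- atom(X).
list_mode([X|_], codes) :- integer(X).

% parse_digit(+List, -C): parse a one-element list in the detected mode.
parse_digit(List, C) :-
    list_mode(List, Mode),
    phrase(any_digit(C, Mode), List).
```

With this, parse_digit(['7'], C) yields C = '7' and parse_digit([0'7], C) yields C = 55, and because the mode is fixed once per call, a mixed list cannot be accepted.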
Char/Code transparent style using dialect features
Alternatively, features of the Prolog dialect can be used to achieve char/code transparency. In SWI-Prolog, there is code_type/2, which actually works on both codes and chars (there is a corresponding char_type/2, but IMHO there should be only chary_type/2 working for chars and codes in any case) and which, for "digit-class" codes and chars, yields the compound digit(X):
?- code_type(0'9,digit(X)).
X = 9.
?- code_type('9',digit(X)).
X = 9.
?- findall(W,code_type('9',W),B).
B = [alnum,csym,prolog_identifier_continue,ascii,
digit,graph,to_lower(57),to_upper(57),
digit(9),xdigit(9)].
And so one can write this for clean char/code transparency:
zero(C) --> [C],{code_type(C,digit(0))}.
nonzero(C) --> [C],{code_type(C,digit(X)),X>0}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
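Since code_type/2 also delivers the numeric weight of the digit, the same idea can be extended to read a whole natural number from either chars or codes. This is a sketch for SWI-Prolog only; the nonterminals digit_weight//1, natural//1 and natural_//2 are invented for this example, and the grammar is meant for parsing, not generating.

```prolog
% digit_weight(-W): consume one digit (char or code) and return its value.
digit_weight(W) --> [C], { code_type(C, digit(W)) }.

% natural(-N): consume one or more digits and return the integer they denote.
natural(N) --> digit_weight(W), natural_(W, N).

% natural_(+Acc, -N): accumulate further digits greedily.
natural_(Acc, N) -->
    digit_weight(W), !,
    { Acc1 is Acc*10 + W },
    natural_(Acc1, N).
natural_(N, N) --> [].
```

Both phrase(natural(N), [0'4,0'2]) and phrase(natural(N), ['4','2']) give N = 42.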
In SWI-Prolog in particular
SWI-Prolog by default prefers codes. Try this:
The flags double_quotes and back_quotes influence the interpretation of "string" and `string` in "standard code". By default, "string" is interpreted as an atomic "string" whereas `string` is interpreted as a "list of codes".
Outside of DCGs, the following holds in SWI-Prolog, with all flags at their default:
?- string("foo"),\+atom("foo"),\+is_list("foo").
true.
?- L=`foo`.
L = [102,111,111].
However, in DCGs, both "string" and `string` are interpreted as "codes" by default.
Without any settings changed, consider this DCG:
representation(double_quotes) --> "bar". % SWI-Prolog decomposes this into CODES
representation(back_quotes) --> `bar`. % SWI-Prolog decomposes this into CODES
representation(explicit_codes_1) --> [98,97,114]. % explicit CODES (as obtained via atom_codes(bar,Codes))
representation(explicit_codes_2) --> [0'b,0'a,0'r]. % explicit CODES
representation(explicit_chars) --> ['b','a','r']. % explicit CHARS
Which of the above matches codes?
?- findall(X,
(atom_codes(bar,Codes),phrase(representation(X),Codes,[])),
Reps).
Reps = [double_quotes,back_quotes,explicit_codes_1,explicit_codes_2].
Which of the above matches chars?
?- findall(X,
(atom_chars(bar,Chars),phrase(representation(X),Chars,[])),
Reps).
Reps = [explicit_chars].
When starting swipl with swipl --traditional the backquoted representation is rejected with Syntax error: Operator expected, but otherwise nothing changes.
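The double_quotes flag can also be set explicitly at the top of a source file. The following sketch assumes SWI-Prolog, where the flag setting is local to the file being loaded; with the value chars, a double-quoted literal in a DCG body is read as a list of chars. The nonterminal name bar_chars//0 is made up for the example.

```prolog
:- set_prolog_flag(double_quotes, chars).

% With double_quotes set to chars, "bar" is read as the list [b,a,r],
% so this rule matches chars rather than codes.
bar_chars --> "bar".
```

Under that assumption, atom_chars(bar,Chars), phrase(bar_chars,Chars) succeeds, while atom_codes(bar,Codes), phrase(bar_chars,Codes) fails.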
double_quotes flag at the time the Prolog text is parsed. – Dressy
"foo" is (by default, though that can be changed) the "string" with no further substructure. Using "foo" in a DCG, however, explodes it into codes. I will add an example. – Fania
"a" is list of codes. If the standard double_quotes flag is set to atom, then "a" is an atom. If the flag is set to chars, then "a" is a list of chars, [a]. – Dressy