Dylan Source Character Set Proposal

PROPOSAL NAME: Dylan Source Character Set

PROBLEM DESCRIPTION:

It is unclear exactly what characters are allowed in the source code of a Dylan program. The grammar in Appendix A of The Dylan Reference Manual specifies the characters allowed in most places, but it does not define "whitespace", the characters allowed in comments, or the term "printing character" used in the specification of character, string, and symbol literals. "Whitespace" is defined in two other places, but the definitions are inconsistent with each other. The control characters generated by backslash escapes are not clearly identified.

PROPOSAL:

Page numbers herein refer to the First Printing of The Dylan Reference Manual, by Andrew Shalit, ISBN 0-201-44211-6.

All ASCII codes mentioned in this proposal are identical to the Unicode code points with the same number, as defined in the Unicode standard.

Insert the following new paragraph after the first paragraph of the section "Dylan Interchange Format" on page 21 of The Dylan Reference Manual:

The file contains only newline indications, the characters denoted by ASCII codes 32 through 126 inclusive, and the horizontal tab character denoted by ASCII code 9. The encoding of newline indications and these characters is the standard encoding for the particular implementation's platform, so that standard file-transfer tools can transport the file between sites without loss of meaning.

Replace the second paragraph of the section "Lexical Syntax" on page 16 of The Dylan Reference Manual with the following:

Whitespace is one or more contiguous space characters, horizontal tab characters, and/or newline indications. Implementations can define additional whitespace characters.

In the next paragraph, remove the words "Although comments count as whitespace," and "genuine".

In the next paragraph, change "newline character" to "newline indication".

Replace the glossary definition of whitespace on page 451 of The Dylan Reference Manual with the following:

Whitespace
One or more contiguous space characters, horizontal tab characters, and/or newline indications. Implementations can define additional whitespace characters. Except within character, string, and symbol literals, the amount of contiguous whitespace is not significant in program code.

In the Lexical Notes on page 414 of The Dylan Reference Manual, add the following two paragraphs:

Whitespace can be one or more contiguous space characters, horizontal tab characters, and/or newline indications. Implementations can define additional whitespace characters.

A printing character (including space) is one of the characters denoted by ASCII codes 32 through 126 inclusive. Implementations can define additional printing characters.

Replace the table of control characters generated by backslash escapes at the top of page 19 of The Dylan Reference Manual with the following:

Backslash Character   Meaning           ASCII code
a                     alarm             7
b                     backspace         8
e                     escape            27
f                     form feed         12
n                     newline           -
r                     carriage return   13
t                     tab               9
0                     null              0

ASCII codes in the above table are used only to identify the characters. They do not specify the value returned by as(<integer>, character). There is no ASCII code for the newline character because the newline indication is implementation dependent.

Note:
This proposal makes no attempt to specify what characters are available to a Dylan program at runtime, i.e. the complete set of instances of the class <character>. However, this clearly must include at least all characters that can appear in a character or string literal, i.e. the printing characters (including space), the eight control characters representable by backslash escapes, and any Unicode characters (represented by backslash angle bracket hexadecimal escapes) that the implementation is able to represent. Because '\n' is an instance of <character>, there must be a runtime character for newline indication, even if a newline indication in files is something other than a single character.

This proposal is a clarification. It might be regarded as an incompatible change for programs written in an extended character set, but such programs were already non-portable before this proposal and can continue to work in whichever implementations they worked in before this proposal.

RATIONALE:

Only source code in Dylan Interchange Format is portable, so only that format is addressed here.

I say "newline indication" instead of "newline character" because the encoding of newline in a source file might not be a single character (e.g. CR-LF in MS-DOS), or might not be in the form of characters at all (e.g. a record-structured file).
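As an illustration only, and not part of the proposed manual text, the following Python sketch shows one way a tool might map several character-based newline indications onto a single runtime newline. The function name normalize_newlines is hypothetical, and treating a lone CR as a newline indication is an assumption about the platform. Record-structured files, where the indication is not a character sequence at all, are outside what this sketch covers.

```python
def normalize_newlines(text):
    """Map CR-LF and lone CR newline indications to a single "\n".

    Assumption: the file has already been decoded to a str, and the
    platform uses one of LF, CR-LF, or CR to indicate a newline.
    """
    # Replace CR-LF first so that it counts as one newline indication,
    # not as a CR followed by a separate LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After this step, each newline indication corresponds to exactly one character, whatever the encoding in the file was.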

Implementation-defined characters are mentioned in several places, even though they cannot appear in a standard Dylan Interchange Format file, to clarify where implementors have freedom in their own file format (which could be an extension of Dylan Interchange Format). This information is only a hint to implementors, as it has no effect on programmers writing portable programs.

I made the two definitions of whitespace on page 16 and page 451 consistent.

I allowed horizontal tab as whitespace because of current practice in Dylan, C, and other languages. I was a bit reluctant about this, as there is no universal agreement on the formatting effect of horizontal tab, but it seemed safest to conform to current practice.

The rationale for removing newpage as a whitespace character is the weak one that it is only mentioned in one place and seems unnecessary. Is the newpage character mentioned on page 451 the same as the form feed character mentioned on page 19, except the former is in source code and the latter is at runtime? If someone thinks newpage as a whitespace character is important, it should be defined as a character denoted by an ASCII code and added uniformly.

Allowing comments as whitespace (p.16) was an error. For example, abc/*def*/ghi is a single Dylan word, not two words separated by a comment. This is clear from the remark on page 16 about a comment blending with a token.

I used the minimum possible definition of "printing character", based on my belief that if this had been intended to include control characters such as tab, backspace, or newline, they would have been mentioned explicitly (as space is).

The ASCII codes in the table of backslash escapes are intended to reflect universal current practice, and not to change anything.
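The table can be written down as a small lookup, sketched here in Python purely for illustration. The names BACKSLASH_ESCAPES, PYTHON_ESCAPES, and agrees_with_python are hypothetical. None stands for the newline entry, which has no ASCII code in the table. The comparison against Python's own escapes is just one data point for "current practice"; Python has no \e escape, so that entry is not compared.

```python
# ASCII codes from the proposed table of backslash escapes.
# None marks newline, whose indication is implementation dependent.
BACKSLASH_ESCAPES = {
    "a": 7,     # alarm
    "b": 8,     # backspace
    "e": 27,    # escape
    "f": 12,    # form feed
    "n": None,  # newline (no ASCII code assigned)
    "r": 13,    # carriage return
    "t": 9,     # tab
    "0": 0,     # null
}

# The same escapes as Python string literals, where Python defines them.
PYTHON_ESCAPES = {"a": "\a", "b": "\b", "f": "\f", "r": "\r", "t": "\t", "0": "\0"}


def agrees_with_python():
    """Return True if every shared escape has the same code in both tables."""
    return all(ord(ch) == BACKSLASH_ESCAPES[key] for key, ch in PYTHON_ESCAPES.items())
```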

This proposal doesn't really make anything significantly easier or harder; it just clarifies what is permissible. It might decrease unnecessary portability problems.

This proposal has no effect on speed or safety.

EXAMPLES:

Not applicable.

COST TO IMPLEMENTORS:

This has no cost to implementors, unless they choose either to write their own code to be portable, or to provide a tool to check portability of code, which ought to check these character set restrictions. The costs in those cases would not be significant.
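To give a sense of how small such a portability check could be, here is a minimal sketch in Python. It is not a definitive tool: the function name find_nonportable_chars and the (line, column, code) report format are assumptions made for illustration, and the sketch assumes the file has already been decoded with newline indications represented as "\n".

```python
def find_nonportable_chars(text):
    """Return (line, column, code) for each character outside the proposed set.

    Assumption: `text` is a decoded str in which every newline indication
    is represented as "\n". Line and column numbers are 1-based.
    """
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            code = ord(ch)
            # Allowed: horizontal tab (9) and codes 32 through 126 inclusive.
            if code != 9 and not (32 <= code <= 126):
                problems.append((line_no, col_no, code))
    return problems
```

An empty result means the text uses only the characters permitted in a Dylan Interchange Format file under this proposal. A form feed (code 12) or any character above code 126 would be reported with its position.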

RELATED PROPOSALS:

None.

REVISION HISTORY:

Version 1, 7-January-1997, by Dave Moon (not published)
Version 2, 8-January-1997, by Dave Moon

STATUS:
Open 9-January-1997