The set of tokens is {
=,
<>,
<,
>,
<=,
>=,
in,
+,
-,
*,
/,
div,
mod,
not,
and,
or,
[,
],
(,
),
.,
,,
..,
^,
nil,
IDENT,
NATURAL_LIT,
REAL_LIT,
CHAR_LIT,
STRING_LIT}.
Most of the tokens are keywords or special symbols, except for the
uppercase-named ones, whose associated regular languages are:
- The token IDENT stands for any identifier, which includes
function/procedure identifiers, variable identifiers, record field identifiers,
and so on. An identifier is a non-empty sequence of lowercase letters, digits,
and underscores that does not start with a digit. We do not consider
uppercase letters, since Pascal is case-insensitive.
- The token NATURAL_LIT stands for unsigned integer numeric
literals. Such literals can be written in decimal (a non-empty sequence of
decimal digits), in binary (a symbol % followed by a non-empty sequence
of binary digits, i.e., 0, 1), or in hexadecimal (a symbol $
followed by a non-empty sequence of hexadecimal digits, i.e., 0,
1, 2, 3, 4, 5, 6, 7, 8,
9, a, b, c, d, e, f).
- The token REAL_LIT refers to real numeric literals. Such literals
either are an integer number (i.e., a non-empty sequence of decimal digits)
followed by an exponent part, or they are a fractional number (i.e., a
non-empty sequence of decimal digits with a dot ., where neither the
integral nor the fractional part is empty) optionally followed by an exponent
part. The exponent starts with e, then there is an optional sign
(+, -), and finally a non-empty sequence of decimal digits.
- The token CHAR_LIT represents a single character literal. Such a literal
is delimited by single quotes (') or double quotes ("), and
contains exactly one character in between. The single quote character
can be written as '''' (the external single quotes are the delimiters)
or as "'" (the double quotes are the delimiters); the
double quote character can be written as '"' (the single quotes are the
delimiters) or as """" (the external double quotes are the
delimiters). Character literals can also be written without the delimiting
single/double quotes. In this case, the character is represented by its
numeric codepoint as follows: the number sign (#) followed by a
non-empty sequence of decimal digits.
- The token STRING_LIT is similar to CHAR_LIT, except
that no codepoints can be introduced with #, and that between the
delimiters there may be zero characters or more than one character instead of
exactly one (as was the case in character literals). As before, when the delimiters are
single quotes, a single quote is represented by the sequence '' (i.e.,
two single quotes), and when the delimiters are double quotes, a double quote
is represented by the sequence "" (i.e., two double quotes).
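The regular languages described above can be sketched as Python regular expressions. This is only an illustrative transcription, not a reference scanner: the pattern names and the `matches` helper are hypothetical, the input is assumed to be lowercased already, and literals are assumed not to span several lines.

```python
import re

# Hypothetical pattern names; each regex transcribes the prose definition above.
# Assumptions: input is already lowercased, and literals do not span lines.
IDENT = r"[a-z_][a-z0-9_]*"
NATURAL_LIT = r"[0-9]+|%[01]+|\$[0-9a-f]+"
EXPONENT = r"e[+-]?[0-9]+"
# Fractional number with optional exponent, or integer with mandatory exponent.
REAL_LIT = rf"[0-9]+\.[0-9]+(?:{EXPONENT})?|[0-9]+{EXPONENT}"

# One quoted item: any character except the delimiter, or a doubled delimiter.
SQ_ITEM = r"(?:[^'\n]|'')"
DQ_ITEM = r'(?:[^"\n]|"")'

# Exactly one item between the delimiters, or a # codepoint.
SQ_CHAR = "'" + SQ_ITEM + "'"
DQ_CHAR = '"' + DQ_ITEM + '"'
CHAR_LIT = f"{SQ_CHAR}|{DQ_CHAR}|#[0-9]+"

# Zero items, or two or more (exactly one item would be a CHAR_LIT).
SQ_STRING = "'(?:" + SQ_ITEM + "{2,})?'"
DQ_STRING = '"(?:' + DQ_ITEM + '{2,})?"'
STRING_LIT = f"{SQ_STRING}|{DQ_STRING}"


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text belongs to the pattern's language."""
    return re.fullmatch(pattern, text) is not None


print(matches(REAL_LIT, "3.14e-2"))   # True
print(matches(REAL_LIT, "3."))        # False: the fractional part is empty
print(matches(CHAR_LIT, "''''"))      # True: a single quote character
print(matches(STRING_LIT, "'a'"))     # False: exactly one character is a CHAR_LIT
```

A full scanner would additionally have to pick the longest match at each position, so that an input such as 3.14 is read as one REAL_LIT rather than as two NATURAL_LIT tokens separated by a dot.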
The following table shows the list of Pascal operators, sorted by priority
(from highest to lowest):
| Category | Operator | Associativity |
|---|---|---|
| Unary not | not | Right |
| Multiplying operators | * / div mod and | Left |
| Adding operators and unary signs | + - or | Left (binary +, -, or) and Right (unary +, -) |
| Relational operators | = <> < > <= >= in | None |
Note that unary signs (+ and -) do not take precedence over adding operators
or operators of higher priority. Thus, they cannot appear directly after any of
them. For instance, “not -1” and “4 + -1” are not valid expressions, because
the unary sign - appears after some operator. On the other hand, “- not 1”,
“-1 + 4” and “4 + (-1)” are valid. Moreover, “+2 < -1” is also valid, since
< is a relational operator, and those have lower precedence.
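One way to obtain these priorities and the unary-sign restriction is a recursive-descent parser with one function per row of the table. The sketch below is only an illustration under several assumptions: factors are limited to numbers, identifiers, not, and parenthesized expressions; a unary sign applies to the first term of a simple expression; the output is a nested tuple rather than the required AST; and all names (`tokenize`, `Parser`, `parse`, `is_valid`) are hypothetical.

```python
import re

TOKEN_RE = re.compile(r"\s*([0-9]+|[a-z_][a-z0-9_]*|<>|<=|>=|[-+*/=<>()])")
OPERAND_RE = re.compile(r"[0-9]+|[a-z_][a-z0-9_]*")
RELATIONAL = {"=", "<>", "<", ">", "<=", ">=", "in"}
ADDING = {"+", "-", "or"}
MULTIPLYING = {"*", "/", "div", "mod", "and"}
KEYWORDS = {"in", "or", "div", "mod", "and", "not"}


def tokenize(src):
    """Split the source into tokens (simplified: no literals other than digits)."""
    src, tokens, pos = src.strip(), [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expression(self):
        # Relational operators: at most one per level (no associativity).
        node = self.simple_expression()
        if self.peek() in RELATIONAL:
            op = self.advance()
            node = (op, node, self.simple_expression())
        return node

    def simple_expression(self):
        # A unary sign is only accepted here, before the first term.
        sign = self.advance() if self.peek() in ("+", "-") else None
        node = self.term()
        if sign is not None:
            node = (sign, node)
        while self.peek() in ADDING:      # left-associative
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in MULTIPLYING:  # left-associative
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.advance()
        if tok == "not":                   # right-associative unary not
            return ("not", self.factor())
        if tok == "(":
            node = self.expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in KEYWORDS or OPERAND_RE.fullmatch(tok) is None:
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


def is_valid(src):
    try:
        parse(src)
        return True
    except SyntaxError:
        return False
```

With this sketch, `is_valid("4 + -1")` and `is_valid("not -1")` return False, whereas `is_valid("4 + (-1)")` and `is_valid("+2 < -1")` return True, and `parse("1 + 2 * 3")` yields `("+", "1", ("*", "2", "3"))`.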
At the bottom of the operator chain, we have a factor. A factor can be
any of the following things:
- A variable access. It can be any of the following:
  - A declared variable (i.e., an identifier).
  - A component variable, which denotes a component of an array or record
    variable. There are two possibilities:
    - An indexed variable: an array variable followed by a list of one or more
      index expressions in brackets. For instance, a[10], a2[b + c], or
      a3[i, j, 1]. The array variable can be any variable access, whereas each
      index expression can be an arbitrary expression. Note that the last
      example is semantically equivalent to a3[i][j][1].
    - A field designator: a record variable followed by . and a field
      specifier. The record variable can be any variable access, whereas the
      field specifier is an identifier.
  - An identified variable, i.e., a variable that is identified by a pointer.
    It consists of a variable access followed by ^.
  - A buffer variable. It consists of a file variable followed by ^.
    The file variable can be any variable access.
- A constant: an unsigned number (integer or real), a character string
(character literal or string), a constant identifier or the value nil. A
constant identifier is just the identifier of a constant variable (in other
words, a regular identifier).
- A function designator. It consists of a function identifier followed,
optionally, by a parameter list. The parameter list is a list of one or more
parameters enclosed in parentheses. Parameters can be any of the following:
expressions, variable accesses, procedure identifiers, or function identifiers.
For instance, cos(t) and max(4, 8) are function designators.
- A set constructor. It denotes a value of a set type. It consists of a
list of zero or more member designators enclosed in brackets. Member designators
are values or ranges of values. Ranges are denoted by an initial value,
followed by .. and a final value. Values are just expressions. For
instance, [red, green, blue] and [1..9, 15, 20..29] are set
constructors.
- An arbitrary expression in parentheses.
Remarks about the AST construction:
- Relational operators are not associative. This means that “x = y = z”
is not a valid expression (but “(x=y) = z” is).
- An indexed variable must have the token [ as root, with the array
variable as first child and a special node named index_list as second
child, with one child per index expression.
- A field designator must have the token . as root, with the record
variable as first child and the field specifier as second child.
- An identified or buffer variable must have the token ^ as root,
with the corresponding variable as child.
- A function designator must have the token ( as root, with the
function identifier as first child and a special node named param_list as
second child, with one child per parameter. The param_list node must be present
even if the function designator has no parameter list.
- A set constructor must have a special node named set_constructor
as root, with one child per member designator. In turn, a member designator
that consists of a range of values must have the token .. as root, with
the initial and final values as children.
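The tree shapes listed in these remarks can be illustrated with a small sketch. The `Node` class, the builder functions, and the parenthesized `sexpr` rendering are hypothetical helpers chosen for this example; the assignment itself does not prescribe them. A node without children is printed as its bare label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def sexpr(self) -> str:
        """Render the tree as (root child1 child2 ...); leaves print as their label."""
        if not self.children:
            return self.label
        parts = [self.label] + [child.sexpr() for child in self.children]
        return "(" + " ".join(parts) + ")"


def leaf(text):
    return Node(text)


def indexed(array, indices):
    # Root [ with the array variable and an index_list node as children.
    return Node("[", [array, Node("index_list", list(indices))])


def field_designator(record, name):
    # Root . with the record variable and the field specifier as children.
    return Node(".", [record, leaf(name)])


def deref(variable):
    # Root ^ for identified and buffer variables.
    return Node("^", [variable])


def call(function, params=()):
    # Root ( with the function identifier and a param_list node (possibly empty).
    return Node("(", [leaf(function), Node("param_list", list(params))])


def value_range(first, last):
    # Root .. with the initial and final values as children.
    return Node("..", [first, last])


def set_constructor(members):
    return Node("set_constructor", list(members))


# a3[i, j, 1]
print(indexed(leaf("a3"), [leaf("i"), leaf("j"), leaf("1")]).sexpr())
# ([ a3 (index_list i j 1))

# max(4, 8)
print(call("max", [leaf("4"), leaf("8")]).sexpr())
# (( max (param_list 4 8))

# [1..9, 15]
print(set_constructor([value_range(leaf("1"), leaf("9")), leaf("15")]).sexpr())
# (set_constructor (.. 1 9) 15)
```

Note that `call("f")` still produces a param_list child, printed as `(( f param_list)`, which mirrors the remark that the node is present even without a parameter list.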