A Pascal program is structured in blocks. A block consists of several sections,
one of which is the type declaration section. In this section, the user can
define new types, ranging from just alias of already existing types, to complex
data structures.
The set of tokens is {
;,
:,
..,
,,
^,
=,
(,
),
[,
],
IDENT,
NATURAL_LIT,
REAL_LIT,
CHAR_LIT,
STRING_LIT,
of,
end,
type,
packed,
array,
set,
file,
record,
case}.
Most of the tokens are keywords or special symbols, except for the
uppercase-named ones, whose associated regular languages are:
- The token IDENT stands for any identifier, i.e., a non-empty
sequence of lowercase alphanumeric characters and underscore, not starting by a
digit. We do not consider uppercase letters, since Pascal is case-insensitive.
Since the user can define new types, type names are considered identifiers,
even the basic ones such as “integer” or “char”.
- The token NATURAL_LIT stands for unsigned integer numeric
literals. Such literals can be represented in decimal (a non-empty set of
decimal digits), in binary (a symbol % followed by a non-empty sequence
of binary digits, i.e., 0, 1), or hexadecimal (a symbol $
followed by a non-empty sequence of hexadecimal digits, i.e., 0,
1, 2, 3, 4, 5, 6, 7, 8,
9, a, b, c, d, e, f).
- The token REAL_LIT refers to real numeric literals. Such literals
either are an integer number (i.e., a non-empty sequence of decimal digits)
followed by an exponent part, or they are a fractional number (i.e., a
non-empty sequence of decimal digits with a dot ., where neither the
integral nor the fractional part is empty) optionally followed by an exponent
part. The exponent starts with e, then there is an optional sign
(+, -), and finally a non-empty sequence of decimal digits.
- The token CHAR_LIT represents a single character literal. They
are delimited by single quotes (') or double quotes ("), and
contain a single character in between. To represent the single quote character
it may be done like '''' (the external single quotes are the delimiters)
or like "'" (the double quotes are the delimiters); and to represent the
double quote charater it may be done like '"' (the single quotes are the
delimiters) or like """" (the external double quotes are the
delimiters). Character literals can also be introduced without the delimiting
single/double quotes. In this case, the character is represented with its
numeric codepoint as follows: the number sign (#) followed by a
non-empty sequence of decimal digits.
- The token STRING_LIT is similar to the CHAR_LIT, except
that no codepoints can be introduced with #, and that between the
delimiters they may be zero or more than one characters instead of exactly
one (as was the case in character literals). As before, when the delimiters are
single quotes, a single quote is represented by the sequence '' (i.e.,
two single quotes), and when the delimiters are double quotes, a double quote
is represented by the sequence "" (i.e., two double quotes).
The type declaration section starts with the keyword
type, followed by
one or more type definitions. Each type definition consists of an identifier,
the token “
=”, a
type denoter, and a semicolon.
Type denoters can be grouped in three categories:
- ordinal types: types that are countable and ordered. They can be
classified in two kinds:
- base types, such as integer, real, bool or
char. They are just identifier.
- user-defined types. They include the following possibilities:
- enumerate types, consisting of one or more elements, each of which is an
identifier. For instance, (dog,cat,rat,pig) and
(worker,student,unemployed,retired) are enumerate types.
- subrange types, which denote a range of values by specifying the lowest
and highest value. For instance, 1..5, '0'..'9', and
dog..rat are subrange types. The specified values can be any constant,
i.e., a literal value or an identifier (as in the case of dog and
rat). We don’t need to check that the two values have the same type,
because it will be done in the semantic analysis.
- structured types: types composed of other types. They can be
packed, in which case they are preceeded by the keyword packed. There
are four types:
- array types. Unlike most languages, they can be indexed by any ordinal
type, and not just integers. Also, they can contain any type. Examples of
arrays are array [20] of char, array[1..10] of integer, and even
array[(straight, flush, full, poker)] of real, as enumerations are
ordinal types. If the range expression (the expression inside brackets) is just
a number, it is assumed that the range goes from 1 to that number. Arrays can
contain nested arrays, as in array [10] of array [dog..rat] of real. In
that case, there is an alternative notation that consists of placing the index
types of all the nested arrays separated by commas inside a single set of
brackets. Hence, the previous example is equivalent to array [10,dog..rat] of real.
- record types, consisting of a list of zero or more field declarations.
Fields are identifiers and can be of any type. If two fields have the
same type, they can be declared together. For instance, length : integer
and name, first_name : string are field declarations. Furthermore, a
record can have an optional final element called the variant part. It can be
used when we want to create a record type that has fields for different kinds
of data, but we know that we will never need to use all of the fields in a
single record instance. It has the following structure:
case variant_selector of
constant: (field list)
constant: (field list)
...
where the variant selector is a type identifier, and the constants are
values of that type (this is checked by the semantic analysis). The variant
selector might have what is called a tag (an identifier), as in tag : type_id.
It serves to consult the value of the variant selector (otherwise, it is
invisible to the user). Each case consists of one or more constant values,
separated by commas, and the list of fields that the record should have when
the value of the aforementioned type is one of these constants. There must be
at least one case. The list of fields of each case is just like the list of
fields of a record, with an optional variant part at the end.
Field declarations (and the variant section, when given), and cases inside the
variant section must be separed by semicolons, with an optional semicolon at
the end. The record declaration ends with the keyword end. The following
are examples of record declarations:
Employee = record
Name : string;
Phone : string;
Wage : real;
end;
(* Point that can be represented in cartesian or polar
coordinates *)
Point = record
case boolean of
true : (r : real;
angle : real);
false : (x : real;
y : real;);
end;
(* Point that can be represented in cartesian or polar
coordinates depending on the value of 'polarCoords' *)
Point2 = record
case polarCoords : boolean of
true : (r, angle : real);
false : (x, y : real;);
end;
- set types, contain values of an ordinal type. For instance, set of integer,
set of 1..10, and set of myEnum are valid set types.
- file types, are used to read and write from files. For instance:
file1 = file of integer; (* Allows to write integers into the file *)
file2 = file of String; (* Allows to write integers into the file *)
The type of the file can be any type denoter.
- pointer type: they consist of a type identifier, preceeded by
the token ^ to denote that they are a pointer.
Remarks about AST construction:
- The list of type definitions must have the token type as root,
with one child per definition.
- Each type definition must have the token = as root, with the new
type’s identifier as first child and the type denoter as second child.
- A subrange type must have the token .. as root, with the constants
as children.
- An enumerate type must have a special node named enumerated_type
as root, with the identifiers as children.
- A packed structured type must have the token packed as root, with
the rest of the structured type’s tree as only child.
- An array type must have the token array as root, with the list of
index types as first child and the type denoter of the array elements as second
child. In turn, the list of index types must have a special node named
index_type_list as root, with the types as children.
- A record type must have the token record as root, with the list of
fields as children. The list of fields must have a special node named
field_list as root, with one child per field declaration. If there is a
variant part, it must be added as last child of field_list. A field
declaration must have the token : as root, with the list of field
identifiers as first child and the type denoter as second child. In turn, the
list of type identifiers must have a special node named id_list as root,
with the identifiers as children. The variant part must have the token
case as root, with the variant selector as first child and the list of
variants as second child. In case the variant selector has a tag, it must have
the token : as root, with the tag as first child and the type identifier
as second child. The list of variants must have a special node named
variant_list as root, with the variants as children. Each variant must
have the token : as root, with the list of constants as first child and
the list of fields as second child. The list of constants must have a special
node named constant_list as root, with the constants as children. The
list of fields must have a special node named field_list as root, with
one child per field declaration.
- A set type must have the token set as root, with the ordinal type
as only child.
- A file type must have the token file as root, with the type
denoter as only child.
- A pointer type must have the token ^ as root, with the pointed
type identifier as only child.