Exercise 5 | RACSO

Exercise _‹5_›:

Grammar for Pascal type definitions

A Pascal program is structured in blocks. A block consists of several sections, one of which is the type declaration section. In this section, the user can define new types, ranging from just alias of already existing types, to complex data structures.

The set of tokens is {;, :, .., ,, ^, =, (, ), [, ], IDENT, NATURAL_LIT, REAL_LIT, CHAR_LIT, STRING_LIT, of, end, type, packed, array, set, file, record, case}. Most of the tokens are keywords or special symbols, except for the uppercase-named ones, whose associated regular languages are:

The token IDENT stands for any identifier, i.e., a non-empty sequence of lowercase alphanumeric characters and underscore, not starting by a digit. We do not consider uppercase letters, since Pascal is case-insensitive. Since the user can define new types, type names are considered identifiers, even the basic ones such as “integer” or “char”.
The token NATURAL_LIT stands for unsigned integer numeric literals. Such literals can be represented in decimal (a non-empty set of decimal digits), in binary (a symbol % followed by a non-empty sequence of binary digits, i.e., 0, 1), or hexadecimal (a symbol $ followed by a non-empty sequence of hexadecimal digits, i.e., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f).
The token REAL_LIT refers to real numeric literals. Such literals either are an integer number (i.e., a non-empty sequence of decimal digits) followed by an exponent part, or they are a fractional number (i.e., a non-empty sequence of decimal digits with a dot ., where neither the integral nor the fractional part is empty) optionally followed by an exponent part. The exponent starts with e, then there is an optional sign (+, -), and finally a non-empty sequence of decimal digits.
The token CHAR_LIT represents a single character literal. They are delimited by single quotes (') or double quotes ("), and contain a single character in between. To represent the single quote character it may be done like '''' (the external single quotes are the delimiters) or like "'" (the double quotes are the delimiters); and to represent the double quote charater it may be done like '"' (the single quotes are the delimiters) or like """" (the external double quotes are the delimiters). Character literals can also be introduced without the delimiting single/double quotes. In this case, the character is represented with its numeric codepoint as follows: the number sign (#) followed by a non-empty sequence of decimal digits.
The token STRING_LIT is similar to the CHAR_LIT, except that no codepoints can be introduced with #, and that between the delimiters they may be zero or more than one characters instead of exactly one (as was the case in character literals). As before, when the delimiters are single quotes, a single quote is represented by the sequence '' (i.e., two single quotes), and when the delimiters are double quotes, a double quote is represented by the sequence "" (i.e., two double quotes).

The type declaration section starts with the keyword type, followed by one or more type definitions. Each type definition consists of an identifier, the token “=”, a type denoter, and a semicolon.

Type denoters can be grouped in three categories:

ordinal types: types that are countable and ordered. They can be classified in two kinds:
- base types, such as integer, real, bool or char. They are just identifier.
- user-defined types. They include the following possibilities:
  - enumerate types, consisting of one or more elements, each of which is an identifier. For instance, (dog,cat,rat,pig) and (worker,student,unemployed,retired) are enumerate types.
  - subrange types, which denote a range of values by specifying the lowest and highest value. For instance, 1..5, '0'..'9', and dog..rat are subrange types. The specified values can be any constant, i.e., a literal value or an identifier (as in the case of dog and rat). We don’t need to check that the two values have the same type, because it will be done in the semantic analysis.
structured types: types composed of other types. They can be packed, in which case they are preceeded by the keyword packed. There are four types:
- array types. Unlike most languages, they can be indexed by any ordinal type, and not just integers. Also, they can contain any type. Examples of arrays are array [20] of char, array[1..10] of integer, and even array[(straight, flush, full, poker)] of real, as enumerations are ordinal types. If the range expression (the expression inside brackets) is just a number, it is assumed that the range goes from $1$ to that number. Arrays can contain nested arrays, as in array [10] of array [dog..rat] of real. In that case, there is an alternative notation that consists of placing the index types of all the nested arrays separated by commas inside a single set of brackets. Hence, the previous example is equivalent to array [10,dog..rat] of real.
- record types, consisting of a list of zero or more field declarations. Fields are identifiers and can be of any type. If two fields have the same type, they can be declared together. For instance, length : integer and name, first_name : string are field declarations. Furthermore, a record can have an optional final element called the variant part. It can be used when we want to create a record type that has fields for different kinds of data, but we know that we will never need to use all of the fields in a single record instance. It has the following structure:
```
case variant_selector of
     constant: (field list)
     constant: (field list)
     ...
```
  where the variant selector is a type identifier, and the constants are values of that type (this is checked by the semantic analysis). The variant selector might have what is called a tag (an identifier), as in tag : type_id. It serves to consult the value of the variant selector (otherwise, it is invisible to the user). Each case consists of one or more constant values, separated by commas, and the list of fields that the record should have when the value of the aforementioned type is one of these constants. There must be at least one case. The list of fields of each case is just like the list of fields of a record, with an optional variant part at the end.
  
  Field declarations (and the variant section, when given), and cases inside the variant section must be separed by semicolons, with an optional semicolon at the end. The record declaration ends with the keyword end. The following are examples of record declarations:
```
   Employee = record
     Name : string;
     Phone : string;
     Wage : real;
   end;

  (* Point that can be represented in cartesian or polar
  coordinates *)
  Point = record
       case boolean of
          true : (r : real;
                  angle : real);
          false : (x : real;
                   y : real;);
  end;

  (* Point that can be represented in cartesian or polar
  coordinates depending on the value of 'polarCoords' *)
  Point2 = record
       case polarCoords : boolean of
          true : (r, angle : real);
          false : (x, y : real;);
  end;
```
- set types, contain values of an ordinal type. For instance, set of integer, set of 1..10, and set of myEnum are valid set types.
- file types, are used to read and write from files. For instance:
```
 file1 = file of integer; (* Allows to write integers into the file *)
 file2 = file of String; (* Allows to write integers into the file *)
```
  The type of the file can be any type denoter.
pointer type: they consist of a type identifier, preceeded by the token ^ to denote that they are a pointer.

Remarks about AST construction:

The list of type definitions must have the token type as root, with one child per definition.
Each type definition must have the token = as root, with the new type’s identifier as first child and the type denoter as second child.
A subrange type must have the token .. as root, with the constants as children.
An enumerate type must have a special node named enumerated_type as root, with the identifiers as children.
A packed structured type must have the token packed as root, with the rest of the structured type’s tree as only child.
An array type must have the token array as root, with the list of index types as first child and the type denoter of the array elements as second child. In turn, the list of index types must have a special node named index_type_list as root, with the types as children.
A record type must have the token record as root, with the list of fields as children. The list of fields must have a special node named field_list as root, with one child per field declaration. If there is a variant part, it must be added as last child of field_list. A field declaration must have the token : as root, with the list of field identifiers as first child and the type denoter as second child. In turn, the list of type identifiers must have a special node named id_list as root, with the identifiers as children. The variant part must have the token case as root, with the variant selector as first child and the list of variants as second child. In case the variant selector has a tag, it must have the token : as root, with the tag as first child and the type identifier as second child. The list of variants must have a special node named variant_list as root, with the variants as children. Each variant must have the token : as root, with the list of constants as first child and the list of fields as second child. The list of constants must have a special node named constant_list as root, with the constants as children. The list of fields must have a special node named field_list as root, with one child per field declaration.
A set type must have the token set as root, with the ordinal type as only child.
A file type must have the token file as root, with the type denoter as only child.
A pointer type must have the token ^ as root, with the pointed type identifier as only child.

Authors: Nil Mamano / Documentation:

Exercise ‹5›:

Exercise _‹5_›: