The table below lists the C operators, sorted by priority (from highest to
lowest).
| Category | Operators | Associativity |
|---|---|---|
| Postfix unary | () [] -> . ++ -- | Left |
| Prefix unary | + - ! ~ ++ -- (type) * & sizeof | Right |
| Multiplicative | * / % | Left |
| Additive | + - | Left |
| Shift | << >> | Left |
| Relational | < <= > >= | Left |
| Equality | == != | Left |
| Bitwise AND | & | Left |
| Bitwise XOR | ^ | Left |
| Bitwise OR | \| | Left |
| Logical AND | && | Left |
| Logical OR | \|\| | Left |
| Conditional | ?: | Right |
| Assignment | = += -= *= /= %= >>= <<= &= ^= \|= | Right |
| Comma | , | Left |
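To see how priority and associativity determine the implicit parenthesization, the following is a minimal sketch of precedence climbing over the binary rows of the table. It assumes the input is already split into tokens and contains only operands and binary operators (unary operators, ?: and parentheses are deliberately left out). The names LEVELS, PRIORITY and parenthesize are hypothetical and chosen only for this illustration.

```python
# Binary operator levels taken from the table, listed from lowest to highest
# priority. Unary operators and ?: are left out of this sketch on purpose.
LEVELS = [
    ([","], "left"),
    (["=", "+=", "-=", "*=", "/=", "%=", ">>=", "<<=", "&=", "^=", "|="], "right"),
    (["||"], "left"),
    (["&&"], "left"),
    (["|"], "left"),
    (["^"], "left"),
    (["&"], "left"),
    (["==", "!="], "left"),
    (["<", "<=", ">", ">="], "left"),
    (["<<", ">>"], "left"),
    (["+", "-"], "left"),
    (["*", "/", "%"], "left"),
]
PRIORITY = {op: (level, assoc)
            for level, (ops, assoc) in enumerate(LEVELS) for op in ops}


def parenthesize(tokens):
    """Return the implicit parenthesization of a flat operand/operator list."""
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in PRIORITY:
            raise ValueError("operand expected")
        pos += 1
        return tokens[pos - 1]

    def climb(min_level):
        nonlocal pos
        left = operand()
        while pos < len(tokens) and tokens[pos] in PRIORITY:
            op = tokens[pos]
            level, assoc = PRIORITY[op]
            if level < min_level:
                break  # the operator belongs to an enclosing, weaker level
            pos += 1
            # Left-associative: the right operand may only hold tighter
            # operators. Right-associative: it may hold the same level again.
            right = climb(level + 1 if assoc == "left" else level)
            left = f"({left} {op} {right})"
        return left

    result = climb(0)
    if pos != len(tokens):
        raise ValueError("unexpected token")
    return result


print(parenthesize("a - b - c".split()))       # ((a - b) - c)
print(parenthesize("a = b = c".split()))       # (a = (b = c))
print(parenthesize("a + b * c << d".split()))  # ((a + (b * c)) << d)
```

The only difference between the two associativities is the minimum level passed to the recursive call for the right operand.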
The grammar of C expressions is not LL(1) because it has some conflicts:
the cast operator ((type) expr) shares its first token “(” with a
parenthesized expression. Moreover, the size-of operator (sizeof) can be
applied either to an expression (possibly starting with a parenthesis) or to a
type such as int enclosed in parentheses, so both cases also share the first
token “(”. We resolve these issues by disregarding the cast and size-of
operators.
The set of tokens is {IDENT, NATURAL_LIT, FLOAT_LIT, CHAR_LIT, STRING_LIT,
(, ), [, ], ->, ., ++, --, +, -, !, ~, *, &, /, %, <<, >>, <, <=, >, >=, ==,
!=, ^, |, &&, ||, ?, :, ASSIGN_OP, ,} (mind the last token, which is a comma).
The regular language corresponding to each of the uppercase-named tokens is
described as follows:
- The token IDENT represents any identifier, such as a variable or
function name. Identifiers are non-empty sequences of alphanumeric characters
and underscores that do not start with a digit.
- The token NATURAL_LIT refers to unsigned integer numeric literals.
Such literals take one of three forms: a non-zero decimal digit followed by any
number of digits between 0 and 9 (a natural number in base 10); a 0 followed by
any number of digits between 0 and 7 (a natural number in base 8); or 0x or 0X
followed by one or more hexadecimal digits (0 to 9 and a to f, the letters in
either case).
Integer literals have two optional suffixes: the length specifier (l,
ll, L, LL) and the unsignedness specifier (u, U).
These suffixes can appear in any order.
- The token FLOAT_LIT refers to floating point numeric literals.
Such literals are a non-empty sequence of decimal digits with an optional
occurrence of a point (.) somewhere, and followed by an optional
exponent. Either the point or the exponent must occur. The exponent starts with
e or E, then there is an optional sign (+, -), and
finally a non-empty sequence of decimal digits. Floating point numbers have an
optional suffix to specify the precision: float (f, F), double
(no suffix), or long double (l, L).
- The token CHAR_LIT refers to the character literals. Such literals
are delimited by single quotes ('), and either contain a single
character different from ', from \ and from new line, or contain
an escape sequence. An escape sequence is composed of a symbol \
followed by one of the following: a single lowercase letter among a,
b, t, n, v, f, and r, a single symbol
among ', ", \ and ?, a number composed of up to
three octal digits, or a lowercase letter x followed by a number
composed of up to two hexadecimal digits.
We do not consider wide characters, nor do we check that the numeric escape
sequences represent a valid value (i.e., a number less than 128 in decimal).
Moreover, we also ignore trigraphs; thus, a character literal
such as '??=' (which in C may represent #) is considered
invalid, since in our interpretation it contains three characters
instead of just one.
- The token STRING_LIT refers to the string literals. Such literals
are delimited by double quotes ("), and contain any number of the
following: a character different from ", from \ and from new
line, an escape sequence as detailed above, or a symbol \ followed by a
new line.
As in the case of character literals, for string literals we do not consider
trigraphs or wide characters, and do not validate the value of the numeric
escape sequences.
- The token ASSIGN_OP represents all the assignment operators:
=, +=, -=, *=, /=, %=, >>=,
<<=, &=, ^=, and |=.
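The token descriptions above translate fairly directly into regular expressions. The following is a sketch using Python's re module, not a reference lexer: it assumes that "alphanumeric" means ASCII letters and digits, and the names ESCAPE, INT_SUFFIX, EXPONENT, TOKEN_PATTERNS and classify are hypothetical helpers introduced only for this illustration.

```python
import re

# Escape sequence shared by character and string literals.
ESCAPE = r"""\\(?:[abtnvfr'"\\?]|[0-7]{1,3}|x[0-9a-fA-F]{1,2})"""
# Integer suffix: length and unsignedness, each optional, in either order.
INT_SUFFIX = r"(?:[uU](?:ll|LL|l|L)?|(?:ll|LL|l|L)[uU]?)?"
EXPONENT = r"[eE][+-]?[0-9]+"

TOKEN_PATTERNS = {
    # Assumption: "alphanumeric" is read as ASCII letters and digits.
    "IDENT": r"[A-Za-z_][A-Za-z0-9_]*",
    # Decimal, hexadecimal, or octal (a lone 0 counts as octal).
    "NATURAL_LIT": r"(?:[1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*)" + INT_SUFFIX,
    # Either a point (with optional exponent) or a mandatory exponent.
    "FLOAT_LIT": (r"(?:(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:" + EXPONENT + r")?"
                  r"|[0-9]+" + EXPONENT + r")[fFlL]?"),
    "CHAR_LIT": r"'(?:[^'\\\n]|" + ESCAPE + r")'",
    # The last alternative is a backslash followed by a new line.
    "STRING_LIT": r'"(?:[^"\\\n]|' + ESCAPE + r'|\\\n)*"',
}


def classify(lexeme):
    """Return the name of the token whose pattern matches the whole lexeme."""
    for name, pattern in TOKEN_PATTERNS.items():
        if re.fullmatch(pattern, lexeme):
            return name
    return None


print(classify("0x1Ful"))   # NATURAL_LIT
print(classify(".5e-3f"))   # FLOAT_LIT
print(classify("'??='"))    # None
```

A real lexer would also need the operator tokens and a longest-match rule; this sketch only checks whether a complete lexeme belongs to one of the uppercase-named tokens.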
Remarks about operators’ syntax and AST construction:
- All operators, including function call (()) and array indexed
access ([]), can be applied to arbitrary expressions and not just
identifiers. For instance, (NATURAL_LIT + NATURAL_LIT)() is a
syntactically valid expression, even though it is semantically nonsensical.
Nevertheless, other expressions such as (*IDENT)() could be semantically
valid, if IDENT was a pointer to a function.
- The selection operators, by reference (.) and through pointer (->),
must always be followed by an identifier. For instance,
IDENT.IDENT[NATURAL_LIT]->IDENT and (NATURAL_LIT * NATURAL_LIT).IDENT
are syntactically valid expressions, even though the latter is semantically
nonsensical. On the other hand, IDENT.NATURAL_LIT or
IDENT->(IDENT) are not syntactically correct, because the operators are
not followed by an identifier. ASTs corresponding to these operators must have
the operator as root, the expression upon which it is applied as first child
and the referenced/pointed identifier as second child. Note that the implicit
parenthesization of IDENT.IDENT.IDENT is (IDENT.IDENT).IDENT.
- Function calls have 0 or more parameters in the parentheses, separated by
commas. The AST corresponding to a function call must have the token
“(” as root, the function identifier (or expression) invoked as first
child, and a special node named param_list as second child. In turn,
this node must have one child for each parameter. Parameters are arbitrary
expressions, except that the comma operator (,) cannot appear (without
parentheses), because it would be confused with the commas that separate the
parameters.
- The array indexed access operator can contain any expression within the
brackets [ ]. The AST corresponding to an array access must have the
token [ as root, the array identifier (or expression) accessed as first
child and the index expression as second child.
- The ternary conditional operator (?:) allows an arbitrary
expression as its middle operand. Thus, IDENT ? IDENT , IDENT : IDENT is
a valid expression, and its implicit parenthesization is IDENT ? (IDENT , IDENT) : IDENT.
Moreover, since it is a right-associative operator, the implicit
parenthesization of IDENT ? IDENT : IDENT ? IDENT : IDENT is
IDENT ? IDENT : (IDENT ? IDENT : IDENT). The AST corresponding to a
ternary conditional operator must have the token ? as root and the three
operands as children.
- Assignments are expressions. Even though an assignment is only
semantically correct if the left operand is an lvalue, we allow
arbitrary expressions as both operands. For instance,
IDENT[NATURAL_LIT].IDENT ASSIGN_OP IDENT and NATURAL_LIT ASSIGN_OP IDENT
are examples of syntactically valid expressions, even though the latter is
semantically nonsensical because NATURAL_LIT is not an lvalue.
- The comma operator (,) must be treated like any other
left-associative operator. For instance, the AST of IDENT,IDENT,IDENT is
“,(,(IDENT,IDENT),IDENT)”.
- A sequence of string literals is treated as a single string literal.
Thus, STRING_LIT STRING_LIT STRING_LIT is a valid expression. The AST
corresponding to a sequence of one or more string literals must have a special
node string as root, with one child per string literal.
(Note that sequences of string literals are usually concatenated into a single
string literal before parsing (in the C standard, during translation phase 6),
and thus, in contrast to our approach, the parser does not have to deal with
this case.)
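The remarks above can be put together in a small recursive-descent parser. The following is a sketch under stated assumptions, not the intended solution: the input is a list of token names exactly as written in this text (for instance IDENT or ASSIGN_OP), ASTs are represented as nested tuples (root, child, ...), and prefix and postfix ++ and -- produce the same root because the text does not say how to tell them apart. The names Parser, ParseError and parse are hypothetical.

```python
# Left-associative binary operators, from lowest to highest priority.
BINARY = [["||"], ["&&"], ["|"], ["^"], ["&"], ["==", "!="],
          ["<", "<=", ">", ">="], ["<<", ">>"], ["+", "-"], ["*", "/", "%"]]
PREFIX = {"+", "-", "!", "~", "++", "--", "*", "&"}  # no cast, no sizeof
ATOMS = {"IDENT", "NATURAL_LIT", "FLOAT_LIT", "CHAR_LIT"}


class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.toks = list(tokens) + ["<eof>"]  # sentinel marks the end
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def take(self, expected=None):
        tok = self.toks[self.i]
        if expected is not None and tok != expected:
            raise ParseError(f"expected {expected}, found {tok}")
        self.i += 1
        return tok

    def expr(self):  # comma: left-associative, lowest priority
        node = self.assign()
        while self.peek() == ",":
            self.take()
            node = (",", node, self.assign())
        return node

    def assign(self):  # right-associative, any expression on the left
        node = self.cond()
        if self.peek() == "ASSIGN_OP":
            self.take()
            node = ("ASSIGN_OP", node, self.assign())
        return node

    def cond(self):  # ?: with an arbitrary expression in the middle
        node = self.binary(0)
        if self.peek() == "?":
            self.take()
            middle = self.expr()
            self.take(":")
            node = ("?", node, middle, self.cond())
        return node

    def binary(self, level):
        if level == len(BINARY):
            return self.prefix()
        node = self.binary(level + 1)
        while self.peek() in BINARY[level]:
            node = (self.take(), node, self.binary(level + 1))
        return node

    def prefix(self):
        if self.peek() in PREFIX:
            return (self.take(), self.prefix())
        return self.postfix()

    def postfix(self):
        node = self.primary()
        while True:
            tok = self.peek()
            if tok == "(":  # call: parameters exclude the bare comma operator
                self.take()
                args = []
                if self.peek() != ")":
                    args.append(self.assign())
                    while self.peek() == ",":
                        self.take()
                        args.append(self.assign())
                self.take(")")
                node = ("(", node, ("param_list", *args))
            elif tok == "[":
                self.take()
                index = self.expr()
                self.take("]")
                node = ("[", node, index)
            elif tok in (".", "->"):  # must be followed by an identifier
                self.take()
                node = (tok, node, self.take("IDENT"))
            elif tok in ("++", "--"):
                node = (self.take(), node)
            else:
                return node

    def primary(self):
        tok = self.take()
        if tok in ATOMS:
            return tok
        if tok == "STRING_LIT":  # one or more adjacent string literals
            parts = [tok]
            while self.peek() == "STRING_LIT":
                parts.append(self.take())
            return ("string", *parts)
        if tok == "(":
            node = self.expr()
            self.take(")")
            return node
        raise ParseError(f"unexpected token {tok}")


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.expr()
    parser.take("<eof>")
    return tree
```

For example, parse("IDENT . IDENT . IDENT".split()) yields ('.', ('.', 'IDENT', 'IDENT'), 'IDENT'), while parse("IDENT . NATURAL_LIT".split()) raises ParseError. Parameters are parsed with assign rather than expr, which is what keeps an unparenthesized comma from being read as the comma operator inside a call.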