blog:2011:01:19:arithmetic_operators

This is the fifth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we will add arithmetic operators.

We first define two new types for unary and binary operators. We use dedicated types rather than a string to let the compiler check for us that every pattern matching involving operators is exhaustive in all phases of our compiler.

(** Unary operators. *) type un_op = I_neg (** [~-] *) | F_neg (** [~-.] *) (** Binary operators. *) type bin_op = I_add (** [+] *) | I_sub (** [-] *) | I_mul (** [*] *) | I_div (** [/] *) | F_add (** [+.] *) | F_sub (** [-.] *) | F_mul (** [*.] *) | F_div (** [/.] *) | F_pow (** [**] *)

We then extend the `node`

type with unary and binary operation nodes. Note that `node`

becomes parametric in the AST label type, because node themselves can now contain ASTs.

type 'a t = { label : 'a; node : 'a node; } and 'a node = Bin_op of 'a t * bin_op * 'a t (** binary operation *) | Literal of lit (** literal *) | Un_op of un_op * 'a t (** unary operation *)

Finally, we extend `string_of_ast`

to handle new node types, along with auxiliary `string_of_un_op`

and `string_of_bin_op`

functions, which returns the OCaml representation of given operator (see source code).

While we intend to implement operator overloading in Photon (through type classes *à la* Haskell), we will stick for now to the OCaml way of doing things. We thus keep the infamous dotted operators.

We define new tokens for the following arithmetic operators: `+`

, `+.`

, `-`

, `-.`

, `*`

, `*.`

, `/`

, `/.`

, `**`

, `~-`

and `~-.`

. We also define tokens for `(`

and `)`

.

In the parser definition, we specify precedence and associativity rules for these operators. While it is possible to write the grammar without them, it is more legible with them (at least for arithmetic operators).

With ocamlyacc, precedence rules are given in the definition section, after the token definitions, and take the form

%<associativity> TOKEN_1 ... TOKEN_n %<associativity> TOKEN_1 ... TOKEN_n

Associativity can be `left`

, `right`

or `nonassoc`

when the operator is not associative. All tokens on the same line have the same precedence. The first precedence rule appearing in the file has the lowest priority, while rules appearing further down the file have increasingly higher priority.

One can specify undefined tokens in the precedence rules. These *pseudo-tokens* are used only to specify precedence, with the explicit precedence operator `%prec`

that can be used in grammar rules. We will use `prec_uminus`

for unary negation operators.

%left PLUS PLUS_DOT MINUS MINUS_DOT %left TIMES TIMES_DOT DIV DIV_DOT %right TIMES_TIMES %nonassoc prec_uminus TILDE_MINUS TILDE_MINUS_DOT

Note that unary operators have higher priority than exponentiation. This is necessary to follow OCaml behaviour that (dubiously) interprets `-2.0 ** 2.0`

as `(-2)`

while ^{2} = 4`-2`

is usually interpreted as ^{2}`-(2`

in mathematics (it is interpreted as ^{2}) = -4`0 - 2`

and then the usual precedence for ^{2}`-`

applies).

We replace the `interactive`

literal rule by an expression rule

| expr SEMI_SEMI { $1 }

`expr`

is either an `atom`

, or an expression built using the previously defined operators.

expr : atom { $1 } | expr PLUS expr { label 0 (Bin_op($1, I_add, $3)) } | expr PLUS_DOT expr { label 0 (Bin_op($1, F_add, $3)) } | expr MINUS expr { label 0 (Bin_op($1, I_sub, $3)) } | expr MINUS_DOT expr { label 0 (Bin_op($1, F_sub, $3)) } | expr TIMES expr { label 0 (Bin_op($1, I_mul, $3)) } | expr TIMES_DOT expr { label 0 (Bin_op($1, F_mul, $3)) } | expr DIV expr { label 0 (Bin_op($1, I_div, $3)) } | expr DIV_DOT expr { label 0 (Bin_op($1, F_div, $3)) } | expr TIMES_TIMES expr { label 0 (Bin_op($1, F_pow, $3)) } | MINUS expr %prec prec_uminus { unary_op "-" $2 } | MINUS_DOT expr %prec prec_uminus { unary_op "-." $2 } | TILDE_MINUS expr { unary_op "~-" $2 } | TILDE_MINUS_DOT expr { unary_op "~-." $2 }

We use `%prec prec_uminus`

to change the precedence of `-`

and `-.`

when they are used prefixed. Rather than building unary operations directly, we use an auxiliary function `unary_op`

. This function is primarily intended to account for the syntactic peculiarity that a `-`

(without dot) may precede a floating-point literal. In this case, we should build a `F_neg`

node and not a `I_neg`

one. Along the way, we profit of this function to negate the literals directly, in order to make AST dumps more legible (the negate operations would have been optimized away by LLVM anyway, so this does not stem from a concern for optimization).

let unary_op op expr = let node = match op, expr.node with ("-" | "~-"), Literal (Int i) -> Literal (Int (-i)) | ("-" | "-." | "~-."), Literal (Float f) -> Literal (Float (-.f)) | ("-" | "~-"), _ -> Un_op(I_neg, expr) | ("-." | "~-."), _ -> Un_op(F_neg, expr) | _ -> failwith ("unary_op: unknown operator " ^ op) in { label = pos 0; node = node }

`atom`

s are either literals, or `expr`

essions enclosed in parenthesis. When an enclosed expression is not terminated, we throw an exception with a newly created error type. This will allow a more specific error message than the vague “syntax error”. Enclosed expressions are returned as is, except that their label is updated to the whole position, including parenthesis (through the new auxiliary function `repos`

).

atom : lit { label 1 (Literal $1) } | LPAR expr RPAR { repos 0 $2 } | LPAR expr error { raise (Error(pos 3, Unclosed_paren (pos 1))) }

The grammar handles unary minus and the lexer should not allow it in integer and floating-point literals (contrary to what the OCaml lexical conventions announce). This is necessary so that `1-2`

without spaces is tokenized as `(INT 1, MINUS, INT 2)`

instead of `INT 1, INT (-2)`

which would be a syntax or type error. Prefixed `-`

is thus forbidden in integer and floating-point literals. New tokens are also introduced. See source code for details.

As operators can only appear fully applied for now, we need not extend our type for types yet. The type of operator themselves (or partially applied operators) is not representable, but single operators or partially applied operators are not yet syntactically valid and will be rejected by the parser.

The `Typer`

module, however, needs to be adapted to handle unary and binary operations. The typing process is still very simple. We annotate operation argument(s), we deduce expected type(s) from the operator, and then we check that the types matches. If a type does not match, we throw an exception. The type of the whole expression is deduced from the operator (for now, all our operators returns the type of their operand(s) so we can reuse it).

Typing error exceptions are defined in the same way as lexical or syntactic error exceptions. We define possible typing errors in `Typer_error.t`

and a typing error exception in `Typer.Error`

(with a function to convert typing exceptions to string).

- typer_error.mli
type t = Mismatch of Type.t * Type.t (** type mismatch [(actual, expected)] *)

exception Error of (Position.t * Typer_error.t) let string_of_error (pos, err) = let pos = string_of_position pos in let err = match err with Mismatch(actual, expected) -> "inferred type is " ^ string_of_type actual ^ " but expected type was " ^ string_of_type expected in pos ^ ": " ^ err let rec annotate ast = match ast.node with Bin_op(lhs, op, rhs) -> let lhs = annotate lhs in let (lhs_pos, lhs_type) = lhs.label in let rhs = annotate rhs in let (rhs_pos, rhs_type) = rhs.label in let expected_type = match op with I_add | I_sub | I_mul | I_div -> Type.Int | F_add | F_sub | F_mul | F_div | F_pow -> Type.Float in if lhs_type <> expected_type then raise (Error(lhs_pos, Mismatch(lhs_type, expected_type))) else if rhs_type <> expected_type then raise (Error(rhs_pos, Mismatch(rhs_type, expected_type))) else { label = (ast.label, lhs_type); node = Bin_op(lhs, op, rhs) } | Literal lit -> { label = (ast.label, type_of_lit lit); node = Literal lit } | Un_op(op, expr) -> let expr = annotate expr in let (expr_pos, expr_type) = expr.label in let expected_type = match op with I_neg -> Type.Int | F_neg -> Type.Float in if expr_type <> expected_type then raise (Error(expr_pos, Mismatch(expr_type, expected_type))) else { label = (ast.label, expr_type); node = Un_op(op, expr) }

Note that we also changed the code for literals, replacing `node = ast.node`

with `node = Literal lit`

because `node`

type has become parametric. `ast.node`

has type `Position.t node`

while the built node must have type `(Position.t, Type.t) node`

.

The driver code in `main`

is updated to catch the new type of exceptions.

| Typer.Error err -> prerr_endline (Typer.string_of_error err)

We adds patterns for unary and binary operation nodes to our `compile`

function. Except exponentiation, all defined operations correspond directly to an LLVM instruction. We just compile the operand(s) and then feed it (them) to the relevant instruction. We don't check for overflows or division by zero, and put these tasks on our tasks list to do possibly later.

Exponentiation is handled by a call to LLVM *intrinsic* `llvm.pow.f64`

. Intrinsics are built-in functions provided by the LLVM framework. Intrinsics may often be optimized with specialized machine code instructions, depending on the target platform. The code for `F_pow`

first retrieve the module from given `builder`

, and then call `llvm.pow.f64`

. We add a declaration for this intrinsic in `create_and_initialize_module`

.

let rec compile ast builder = match ast.node with Bin_op(lhs, op, rhs) -> let lhs = compile lhs builder in let rhs = compile rhs builder in begin match op with I_add -> build_add lhs rhs "sum" builder | F_add -> build_fadd lhs rhs "sum" builder | I_sub -> build_sub lhs rhs "diff" builder | F_sub -> build_fsub lhs rhs "diff" builder | I_mul -> build_mul lhs rhs "prod" builder | F_mul -> build_fmul lhs rhs "prod" builder | I_div -> build_sdiv lhs rhs "quot" builder | F_div -> build_fdiv lhs rhs "quot" builder | F_pow -> (* Calls the llvm.pow.f64 intrinsic *) let m = global_parent (block_parent (insertion_block builder)) in let f = unsafe_lookup_function "llvm.pow.f64" m in build_call f [|lhs; rhs|] "pow" builder end | Literal lit -> compile_lit lit builder | Un_op(op, expr) -> let expr = compile expr builder in begin match op with I_neg | F_neg -> build_neg expr "neg" builder end let create_and_initialize_module name = ... (* Intrinsics *) ignore (add_fun "llvm.pow.f64" double_type [|double_type; double_type|]); ...

The code for unary operators seems a bit convoluted. We match the operator although it can only be `I_neg`

or `F_neg`

, which are compiled the same. It would have been shorter to write directly

| Un_op(op, expr) -> let expr = compile expr builder in build_neg expr "neg" builder

This is a provision for the future. The compiler does case analysis to ensure pattern matches are complete. With the former code, the compiler will remind us to add new patterns to the compile code when we will add new unary operators. With the latter, it would silently compile a negation whatever the operator is.

Our Photon interpreter is now able to compute some basic arithmetic expressions. The addition was really simple showing the power of the LLVM framework and its OCaml bindings.

In the next installment, we will add naming to our interpreter, *i.e.* the possibility to name expressions to reuse them later by their name.

The code accompanying this article is available in archive photon-tut-05.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git cd photon git checkout 5-arithmetic_operators

blog/2011/01/19/arithmetic_operators.txt · Last modified: 2011/02/20 20:28 (external edit)

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International

## Discussion