User Tools

Site Tools


blog:2011:01:19:arithmetic_operators

Photon Compiler Development: Arithmetic Operators

This is the fifth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we will add arithmetic operators.

Abstract Syntax Tree

We first define two new types for unary and binary operators. We use dedicated types rather than a string to let the compiler check for us that every pattern matching involving operators is exhaustive in all phases of our compiler.

(** Unary operators. *)
type un_op =
     I_neg (** [~-] *)
   | F_neg (** [~-.] *)
 
(** Binary operators. *)
type bin_op =
     I_add (** [+] *)
   | I_sub (** [-] *)
   | I_mul (** [*] *)
   | I_div (** [/] *)
   | F_add (** [+.] *)
   | F_sub (** [-.] *)
   | F_mul (** [*.] *)
   | F_div (** [/.] *)
   | F_pow (** [**] *)

We then extend the node type with unary and binary operation nodes. Note that node becomes parametric in the AST label type, because node themselves can now contain ASTs.

type 'a t = {
   label : 'a;
   node : 'a node;
} and 'a node =
     Bin_op of 'a t * bin_op * 'a t (** binary operation *)
   | Literal of lit (** literal *)
   | Un_op of un_op * 'a t (** unary operation *)

Finally, we extend string_of_ast to handle new node types, along with auxiliary string_of_un_op and string_of_bin_op functions, which returns the OCaml representation of given operator (see source code).

Lexing and Parsing

While we intend to implement operator overloading in Photon (through type classes à la Haskell), we will stick for now to the OCaml way of doing things. We thus keep the infamous dotted operators.

We define new tokens for the following arithmetic operators: +, +., -, -., *, *., /, /., **, ~- and ~-.. We also define tokens for ( and ).

In the parser definition, we specify precedence and associativity rules for these operators. While it is possible to write the grammar without them, it is more legible with them (at least for arithmetic operators).

With ocamlyacc, precedence rules are given in the definition section, after the token definitions, and take the form

%<associativity> TOKEN_1 ... TOKEN_n
%<associativity> TOKEN_1 ... TOKEN_n

Associativity can be left, right or nonassoc when the operator is not associative. All tokens on the same line have the same precedence. The first precedence rule appearing in the file has the lowest priority, while rules appearing further down the file have increasingly higher priority.

One can specify undefined tokens in the precedence rules. These pseudo-tokens are used only to specify precedence, with the explicit precedence operator %prec that can be used in grammar rules. We will use prec_uminus for unary negation operators.

%left PLUS PLUS_DOT MINUS MINUS_DOT
%left TIMES TIMES_DOT DIV DIV_DOT
%right TIMES_TIMES
%nonassoc prec_uminus TILDE_MINUS TILDE_MINUS_DOT

Note that unary operators have higher priority than exponentiation. This is necessary to follow OCaml behaviour that (dubiously) interprets -2.0 ** 2.0 as (-2)2 = 4 while -22 is usually interpreted as -(22) = -4 in mathematics (it is interpreted as 0 - 22 and then the usual precedence for - applies).

We replace the interactive literal rule by an expression rule

| expr SEMI_SEMI        { $1 }

expr is either an atom, or an expression built using the previously defined operators.

expr
   : atom                  { $1 }
   | expr PLUS expr        { label 0 (Bin_op($1, I_add, $3)) }
   | expr PLUS_DOT expr    { label 0 (Bin_op($1, F_add, $3)) }
   | expr MINUS expr       { label 0 (Bin_op($1, I_sub, $3)) }
   | expr MINUS_DOT expr   { label 0 (Bin_op($1, F_sub, $3)) }
   | expr TIMES expr       { label 0 (Bin_op($1, I_mul, $3)) }
   | expr TIMES_DOT expr   { label 0 (Bin_op($1, F_mul, $3)) }
   | expr DIV expr         { label 0 (Bin_op($1, I_div, $3)) }
   | expr DIV_DOT expr     { label 0 (Bin_op($1, F_div, $3)) }
   | expr TIMES_TIMES expr { label 0 (Bin_op($1, F_pow, $3)) }
   | MINUS expr %prec prec_uminus { unary_op "-" $2 }
   | MINUS_DOT expr %prec prec_uminus { unary_op "-." $2 }
   | TILDE_MINUS expr      { unary_op "~-" $2 }
   | TILDE_MINUS_DOT expr  { unary_op "~-." $2 }

We use %prec prec_uminus to change the precedence of - and -. when they are used prefixed. Rather than building unary operations directly, we use an auxiliary function unary_op. This function is primarily intended to account for the syntactic peculiarity that a - (without dot) may precede a floating-point literal. In this case, we should build a F_neg node and not a I_neg one. Along the way, we profit of this function to negate the literals directly, in order to make AST dumps more legible (the negate operations would have been optimized away by LLVM anyway, so this does not stem from a concern for optimization).

let unary_op op expr =
   let node =
      match op, expr.node with
        ("-" | "~-"), Literal (Int i) -> Literal (Int (-i))
      | ("-" | "-." | "~-."), Literal (Float f) -> Literal (Float (-.f))
      | ("-" | "~-"), _ -> Un_op(I_neg, expr)
      | ("-." | "~-."), _ -> Un_op(F_neg, expr)
      | _ -> failwith ("unary_op: unknown operator " ^ op) in
   { label = pos 0; node = node }

atoms are either literals, or expressions enclosed in parenthesis. When an enclosed expression is not terminated, we throw an exception with a newly created error type. This will allow a more specific error message than the vague “syntax error”. Enclosed expressions are returned as is, except that their label is updated to the whole position, including parenthesis (through the new auxiliary function repos).

atom
   : lit                   { label 1 (Literal $1) }
   | LPAR expr RPAR        { repos 0 $2 }
   | LPAR expr error       { raise (Error(pos 3, Unclosed_paren (pos 1))) }

The grammar handles unary minus and the lexer should not allow it in integer and floating-point literals (contrary to what the OCaml lexical conventions announce). This is necessary so that 1-2 without spaces is tokenized as (INT 1, MINUS, INT 2) instead of INT 1, INT (-2) which would be a syntax or type error. Prefixed - is thus forbidden in integer and floating-point literals. New tokens are also introduced. See source code for details.

Typing

As operators can only appear fully applied for now, we need not extend our type for types yet. The type of operator themselves (or partially applied operators) is not representable, but single operators or partially applied operators are not yet syntactically valid and will be rejected by the parser.

The Typer module, however, needs to be adapted to handle unary and binary operations. The typing process is still very simple. We annotate operation argument(s), we deduce expected type(s) from the operator, and then we check that the types matches. If a type does not match, we throw an exception. The type of the whole expression is deduced from the operator (for now, all our operators returns the type of their operand(s) so we can reuse it).

Typing error exceptions are defined in the same way as lexical or syntactic error exceptions. We define possible typing errors in Typer_error.t and a typing error exception in Typer.Error (with a function to convert typing exceptions to string).

typer_error.mli
type t =
     Mismatch of Type.t * Type.t (** type mismatch [(actual, expected)] *)
exception Error of (Position.t * Typer_error.t)
 
let string_of_error (pos, err) =
   let pos = string_of_position pos in
   let err =
      match err with
        Mismatch(actual, expected) ->
           "inferred type is " ^ string_of_type actual
           ^ " but expected type was " ^ string_of_type expected in
   pos ^ ": " ^ err
 
let rec annotate ast =
     match ast.node with
       Bin_op(lhs, op, rhs) ->
          let lhs = annotate lhs in
          let (lhs_pos, lhs_type) = lhs.label in
          let rhs = annotate rhs in
          let (rhs_pos, rhs_type) = rhs.label in
          let expected_type =
             match op with
               I_add | I_sub | I_mul | I_div -> Type.Int
             | F_add | F_sub | F_mul | F_div | F_pow -> Type.Float in
          if lhs_type <> expected_type then
             raise (Error(lhs_pos, Mismatch(lhs_type, expected_type)))
          else if rhs_type <> expected_type then
             raise (Error(rhs_pos, Mismatch(rhs_type, expected_type)))
          else { label = (ast.label, lhs_type); node = Bin_op(lhs, op, rhs) }
     | Literal lit ->
          { label = (ast.label, type_of_lit lit); node = Literal lit }
     | Un_op(op, expr) ->
          let expr = annotate expr in
          let (expr_pos, expr_type) = expr.label in
          let expected_type =
             match op with
               I_neg -> Type.Int
             | F_neg -> Type.Float in
          if expr_type <> expected_type then
             raise (Error(expr_pos, Mismatch(expr_type, expected_type)))
          else { label = (ast.label, expr_type); node = Un_op(op, expr) }

Note that we also changed the code for literals, replacing node = ast.node with node = Literal lit because node type has become parametric. ast.node has type Position.t node while the built node must have type (Position.t, Type.t) node.

The driver code in main is updated to catch the new type of exceptions.

         | Typer.Error err -> prerr_endline (Typer.string_of_error err)

Compilation

We adds patterns for unary and binary operation nodes to our compile function. Except exponentiation, all defined operations correspond directly to an LLVM instruction. We just compile the operand(s) and then feed it (them) to the relevant instruction. We don't check for overflows or division by zero, and put these tasks on our tasks list to do possibly later.

Exponentiation is handled by a call to LLVM intrinsic llvm.pow.f64. Intrinsics are built-in functions provided by the LLVM framework. Intrinsics may often be optimized with specialized machine code instructions, depending on the target platform. The code for F_pow first retrieve the module from given builder, and then call llvm.pow.f64. We add a declaration for this intrinsic in create_and_initialize_module.

let rec compile ast builder =
   match ast.node with
     Bin_op(lhs, op, rhs) ->
        let lhs = compile lhs builder in
        let rhs = compile rhs builder in
        begin match op with
          I_add -> build_add lhs rhs "sum" builder
        | F_add -> build_fadd lhs rhs "sum" builder
        | I_sub -> build_sub lhs rhs "diff" builder
        | F_sub -> build_fsub lhs rhs "diff" builder
        | I_mul -> build_mul lhs rhs "prod" builder
        | F_mul -> build_fmul lhs rhs "prod" builder
        | I_div -> build_sdiv lhs rhs "quot" builder
        | F_div -> build_fdiv lhs rhs "quot" builder
        | F_pow ->
             (* Calls the llvm.pow.f64 intrinsic *)
             let m = global_parent (block_parent (insertion_block builder)) in
             let f = unsafe_lookup_function "llvm.pow.f64" m in
             build_call f [|lhs; rhs|] "pow" builder
        end
   | Literal lit -> compile_lit lit builder
   | Un_op(op, expr) ->
        let expr = compile expr builder in
        begin match op with
          I_neg | F_neg -> build_neg expr "neg" builder
        end
 
let create_and_initialize_module name =
   ...
   (* Intrinsics *)
   ignore (add_fun "llvm.pow.f64" double_type [|double_type; double_type|]);
   ...

The code for unary operators seems a bit convoluted. We match the operator although it can only be I_neg or F_neg, which are compiled the same. It would have been shorter to write directly

   | Un_op(op, expr) ->
        let expr = compile expr builder in
        build_neg expr "neg" builder

This is a provision for the future. The compiler does case analysis to ensure pattern matches are complete. With the former code, the compiler will remind us to add new patterns to the compile code when we will add new unary operators. With the latter, it would silently compile a negation whatever the operator is.

Conclusion

Our Photon interpreter is now able to compute some basic arithmetic expressions. The addition was really simple showing the power of the LLVM framework and its OCaml bindings.

In the next installment, we will add naming to our interpreter, i.e. the possibility to name expressions to reuse them later by their name.

Source Code

The code accompanying this article is available in archive photon-tut-05.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 5-arithmetic_operators

Discussion

Enter your comment. Wiki syntax is allowed:
 
blog/2011/01/19/arithmetic_operators.txt · Last modified: 2011/02/20 20:28 (external edit)