User Tools

Site Tools


blog:2012:01:26-monomorphic_type_inference

Photon Compiler Development: Monomorphic Type Inference

This is the 16th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement type inference for monomorphic types.

Overview

Rather than requiring every function argument and return value to be explicitly specified by the programmer, we will implement type inference, a process by which missing types are deduced by the compiler.

Our language is still strongly and statically typed. We can flag typing errors at compile time and use type information for compilation.

To achieve type inference, we modify the parser to accept untyped definitions. We then extend the typer to infer types rather than simply enforcing them. This in turn requires a modification of the type representation.

Parsing

As our type representation will evolve into something more complex, we prefer to completely avoid playing with types during parsing. To represent (optional) types in our AST, we introduce the following new data type:

type parse_type =
     PTEmpty (** no type provided *)
   | PTId of Position.t * string (** type given by name *)
   | PTFun of parse_type list (** function type *)

Function arguments and ASTs are now parameterized by an additional type argument, which is parse_type after parsing and Type.t after typing.

type 't arg = string * 't
 
type ('lbl, 't) top_level =
     Bind_fun of bool * string * 't arg list * 't * 'lbl t
   | Bind_val of string * 'lbl t * 't
   | External of string * 't * string
   ...

The parser is modified to use the new parse_type. It now also accepts unannotated function arguments, and function declarations without return types. Here is a relevant excerpt:

decl_or_dir
   : LET IDENT EQUAL expr  { Bind_val($2, $4, PTEmpty) }
   | LET IDENT COLON type_ EQUAL expr  { Bind_val($2, $6, $4) }
   | LET IDENT args EQUAL expr
                           { Bind_fun(false, $2, List.rev $3, PTEmpty, $5) }
   | LET IDENT args COLON type_ EQUAL expr
                           { Bind_fun(false, $2, List.rev $3, $5, $7) }
   | LET REC IDENT args EQUAL expr
                           { Bind_fun(true, $3, List.rev $4, PTEmpty, $6) }
   | LET REC IDENT args COLON type_ EQUAL expr
                           { Bind_fun(true, $3, List.rev $4, $6, $8) }
   ...
 
arg
   : IDENT                 { ($1, PTEmpty) }
   | LPAR IDENT COLON type_ RPAR { ($2, $4) }

Typing

Our treatment of type inference is a variation around the Damas-Milner algorithm. Our implementation is mostly inspired by the book Using, Understanding and Unraveling the OCaml Language: from Practice to Theory and vice-versa, by Didier Rémy.

An Example

To illustrate the type inference process, let's first do an example by hand.

let f c x y = (if c then x else y) + 1

First, let's rewrite this with subscripts to identify all the sub-expressions appearing in this function definition:

let f1 c2 x3 y4 = (if c8 then x9 else y10)6 +5 17

Let τi denote the type of the i-th sub-expression, as shown above. Analyzing this function definition AST, we can deduce the following type constraints that should hold for the function definition to be well-typed:

  1. τ1 = τ2 → τ3 → τ4 → τ5
    Because f if a function of c, x and y and returns the value given by its body.
  2. τ6 = int
    Because the first operand to + should be an integer.
  3. τ7 = int
    Because the second operand to + should be an integer.
  4. τ5 = int
    Because the resulting value of an + operation should be an integer.
  5. τ7 = int
    Because 1 is an integer.
  6. τ8 = bool
    Because in an if-then-else expression, the condition must be boolean.
  7. τ6 = τ9
    Because the resulting type of an if-then-else expression is the same as the one of its then branch.
  8. τ6 = τ10
    Because the resulting type of an if-then-else expression is the same as the one of its else branch.
  9. τ8 = τ2
    Because c here refers to the first argument of f.
  10. τ9 = τ3
    Because x here refers to the second argument of f.
  11. τ10 = τ4
    Because y here refers to the third argument of f.

It now remains to solve these constraints:

  • As τ2 = τ8 (9) and τ8 = bool (6), we have τ2 = bool (12).
  • As τ9 = τ6 (7) and τ6 = int (2), we have τ9 = int (13).
  • As τ10 = τ6 (8) and τ6 = int (2), we have τ10 = int (14).
  • As τ3 = τ9 (10) and τ9 = int (13), we have τ3 = int (15).
  • As τ4 = τ10 (11) and τ10 = int (14), we have τ4 = int (16).
  • Injecting (12), (15), (16) and (4) in (1) gives τ1 = bool → int → int → int (17).

The solution is thus given by:

  • τ1 = bool → int → int → int
  • τ2 = τ8 = bool
  • τ3 = τ4 = τ5 = τ6 = τ7 = τ9 = τ10 = int

This solution gives a type to every sub-expression and you can verify it violates none of the generated type constraints.

Type Representation

Rather than first generating the type constraints and then solving them afterward, we will do both at the same time. To achieve this, we use an indirection in the type representation. A type is represented by a mutable pointer to either:

  • a (partially) known type such as int or bool → τi;
  • a yet-unknown type, which we call a type variable;
  • a link to another type.
type t = {
   mutable node : node; (* partially known type, type variable or type link *)
}

When we meet a constraint of the form τi = int, we mutate τi into a pointer to int. When we meet a constraint of the form τi = τj, we mutate τi into a pointer to τj. Of course, we must ensure that these modifications are type-safe. If τi already points to bool when we meet a constraint τi = int, we must raise an error instead of replacing τi. This is the role of type unification, described below. Newly introduced unknown types are made pointing to a type variable.

To distinguish between terminal types (partially known types and type variables), and link types (links to other types), we subdivide types into descriptive types and link types.

type node =
     TDesc of desc (* terminal type *)
   | TLink of t (* link to another type *)

Type variables are represented by a dedicated constructor TVar, while partially known types are represented by the application of a type constructor to zero, one or more types.

type desc =
     TVar (* type variable *)
   | TApply of constr * t list

For example, int is represented by TApply(TInt, []) while bool → char is represented by TApply(TFun, [t_bool; t_char]) where t_bool and t_char are pre-defined boolean and character types.

Finally, we adjoin a mutable integer to each type, which can serve different purposes in type traversals (e.g. see recursive types). Our full type definition is given by:

type t = {
   mutable node : node; (** type node *)
   mutable mark : int (** a mark than can be freely used during traversals *)
} and node =
     TDesc of desc (** terminal type *)
   | TLink of t (** link to another type *)
and desc =
     TVar (** type variable *)
   | TApply of constr * t list (** type constructor *)

We introduce a function deref to de-reference types, i.e. strip potential links, and one function desc which returns the type description associated to a given type.

let rec deref t =
   match t.node with
     TLink u ->
        let v = deref u in
        t.node <- TLink v; (* we replaces chains of links by a single link *)
        v
   | TDesc _ -> t
 
let desc t =
   match (deref t).node with
     TLink _ -> assert false
   | TDesc d -> d

Type Unification

Unifying two types τi and τj is enforcing the type constraint τi = τj, while checking other type constraints about τi and τj remain valid.

For example, unifying int and bool fails while unifying char → bool and char → τi makes τi points to bool.

To unify two types, we compare their type descriptions. If one of them is a fully unknown type, i.e. a type variable, it is replaced by a link to the other one. If both types are type applications of the same type constructor to lists of types with the same lengths, we recursively unify their type arguments. In any other case, the unification fails and we report a type error at the given position.

To avoid recursing down large types which are the same one, we use physical equality tests (i.e. memory address comparisons) to shortcut the process when both types point to the exact same type. To maximize this effect, we replace by links not only type variables but also unified type applications, in order to maximize sharing.

let rec unify pos actual expected =
   let link t1 t2 = t1.node <- TLink t2 in
   let actual = deref actual in
   let expected = deref expected in
   if actual == expected then () (* Both types point to the same node *)
   else
      (* actual and expected points to different types, try to unify them *)
      match desc actual, desc expected with
        TVar, _ -> link actual expected
      | _, TVar -> link expected actual
      | TApply (con1, ts1), TApply(con2, ts2) when con1 = con2 ->
           if List.length ts1 <> List.length ts2 then
              mismatch pos actual expected;
           begin try List.iter2 (unify Position.dummy) ts1 ts2
           with Error(_, _) -> mismatch pos actual expected
           end;
           link actual expected
      | _, _ -> mismatch pos actual expected in

The recursive unification of type arguments above uses a dummy position because no position can be associated to sub-types. When an error is raised while unifying type arguments, we re-raise it with the original position and full types.

Type Inference

Armed with a type representation and type unification, the type inference process itself is rather straightforward.

Literals

Every literal has an associated pre-defined type, which we simply return.

let type_of_lit = function
     LBool _ -> t_bool
   | LChar _ -> t_char
   | LFloat _ -> t_float
   | LInt _ -> t_int
   | LString _ -> t_string
   | LUnit -> t_unit
 
let rec annotate_ast type_env ast =
   let pos = ast.label in
   let mk_ast expr t = { label = (pos, t); expr = expr } in
   match ast.expr with
     ...
   | Literal lit -> mk_ast (Literal lit) (type_of_lit lit)

Variables

We lookup variables in the type environment, and return their associated type.

let type_of_var type_env pos id =
   try Type_env.find type_env id
   with Not_found -> raise (Error (pos, "unbound variable " ^ id))
 
let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | EVar id -> mk_ast (EVar id) (type_of_var type_env pos id)

Operators

For each operand, we first infer its type, and then unify it with its expected type as determined by the operator. The operation resulting type is determined by the operator.

We dropped the support for wanna-be type-classes Eq and Ord. We are now more liberal in the types accepted by comparison operators. From now on, it will trigger a compile error rather than a type error when two values cannot be compared (we still ensure both compared values have the same type).

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
     Bin_op(lhs, op, rhs) ->
        let (operand_type, ret_type) =
           match op with
             Cmp_eq | Cmp_ne | Cmp_ge | Cmp_gt | Cmp_le | Cmp_lt ->
                  (new_tvar (), t_bool)
             | F_add | F_div | F_mul | F_pow | F_sub -> (t_float, t_float)
             | I_add | I_div | I_mul | I_sub -> (t_int, t_int) in
        let lhs = annotate_ast type_env lhs in
        unify (pos_of lhs) (type_of lhs) operand_type;
        let rhs = annotate_ast type_env rhs in
        unify (pos_of rhs) (type_of rhs) operand_type;
        mk_ast (Bin_op(lhs, op, rhs)) ret_type
   ...

The code for unary operators is very similar.

If-Then-Else

We infer a type for the condition, and then unify its type with t_bool. We then infer types for both branches, and unify the type of the else branch with the one of the then branch. Finally, the inferred type for the whole if-then-else expression is the one of the then branch.

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | If(cond, then_, else_) ->
        let cond = annotate_ast type_env cond in
        unify (pos_of cond) (type_of cond) t_bool;
        let then_ = annotate_ast type_env then_ in
        let else_ = annotate_ast type_env else_ in
        unify (pos_of else_) (type_of else_) (type_of then_);
        mk_ast (If(cond, then_, else_)) (type_of then_)

Function Calls

We first infer a type for the callee. We then generate a type τ1 → … → τn+1 for a function of n arguments, and unify it with f's inferred type.

We then infer a type for each argument, which is unified with corresponding τi.

Finally, we return τn+1 as the type for the call.

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | Call(f, args) ->
        let arg_types = List.map (fun _ -> new_tvar ()) args in
        let ret_t = new_tvar () in
        let f = annotate_ast type_env f in
        unify (pos_of f) (type_of f) (mk_fun_type (arg_types @ [ret_t]));
        let annotate_arg arg arg_t =
           let arg = annotate_ast type_env arg in
           unify (pos_of arg) (type_of arg) arg_t;
           arg in
        let args = List.map2 annotate_arg args arg_types in
        mk_ast (Call(f, args)) ret_t

Function Definition

We first convert parse types into real types. Unannotated arguments and return value are given a new type variable as type. We then extends the type environment as before and infer a type for the body. The body type is then unified with the return type, and the original type environment is extended with the new binding.

let annotate type_env = function
     Bind_fun(is_rec, id, args, ret_type, body) ->
        (* Turns parsing types into real types *)
        let convert_arg (id, pt) = (id, parse_type_to_type pt) in
        let args = List.map convert_arg args in
        let ret_type = parse_type_to_type ret_type in
        (* Builds updated type environment *)
        let add_arg type_env (id, t) = Type_env.extend type_env id t in
        let ext_type_env = List.fold_left add_arg !type_env args in
        let t = mk_fun_type (List.map snd args @ [ret_type]) in
        let ext_type_env =
           if is_rec then Type_env.extend ext_type_env id t
           else ext_type_env in
        (* Type body and updates environment *)
        let body = annotate_ast ext_type_env body in
        unify (pos_of body) (type_of body) ret_type;
        type_env := Type_env.extend !type_env id t;
        (t, Bind_fun(is_rec, id, args, ret_type, body))

As similar technique is used for value bindings.

External declaration are fully typed and no type inference is done, but parsing types are still replaced by real types (raising an error if one type is unknown).

Recursive Types

The process of linking types during unification can lead to recursive types, i.e. types that refer to themselves. For example, if we unify τi with int → τi, we will simply link τi with int → τi. If we try to dump this type naively following links, it will looks like int → (int → (int → (…))), an infinite expansion.

Recursive types can have their use (for examples, see Chris Okasaki's “Purely Functional Data Structures”). However, when resulting from the type inference process, they more often indicate programmer errors than intended use. For this reason, we forbid implicitly defined recursive types (as do OCaml, except when passed the -rectypes flag).

Rather than checking that a type do not occur in the target each time we want to link one (something known as the occurs check), we defer this check to the end of unification.

Here is our new unification function, where cyclic_types t returns a list of types in t which are cyclic:

let unify pos actual expected =
   let rec aux pos actual expected =
      ... (* our old unify definition, renamed aux *) in
   (* Unifies given types without any provision for recursive types *)
   aux pos actual expected;
   (* Check that resulting type is not cyclic (the "occurs check") *)
   if cyclic_types actual <> [] then
      raise (Error(pos, "inferred a cyclic type: " ^ string_of_type actual))

The cyclic_types function detect cyclic types using two marks: one for visited type nodes, and one for nodes under visit. When we see a type for the first time, we mark it as under visit. We then recurse through its sub-types. When we have finished, we mark the type as already visited. Should we meet a type with the under visit mark, it means that this type is cyclic and it is added to the list of cyclic types.

let cyclic_types t =
   let visiting = new_mark () in
   let visited = new_mark () in
   let cycles = ref [] in
   let rec visit t =
      let t = deref t in
      if t.mark = visited then () (* already done, not cyclic *)
      else if t.mark = visiting then
         (* cycle detected, add it to list *)
         cycles := t :: !cycles
      else (
         (* neither visited nor under visit, i.e. newly encountered type *)
         t.mark <- visiting;
         begin match desc t with
           TVar -> ()
         | TApply(_, ts) -> List.iter visit ts
         end;
         t.mark <- visited
      ) in
   visit t;
   !cycles

Of course, we must make provisions for cyclic types when printing a type. We handle this similarly to OCaml, giving a cyclic type a new name on first occurrence and reusing it thereafter. For example, our τi = int → τi type will be typed out as (int → 'a as 'a).

To achieve this, we first obtain the list of cyclic types of a type before converting it to string. We mark any cyclic type as such using a new mark. When we convert the type to string and we meet a cyclic type, we change its mark into a newly generated one (greater than the cyclic mark as marks are generated in increasing order). When meeting a type with a mark greater than the cyclic mark, we do not try to recurse into the type but we print a variable name such as 'a. The letter itself depends of the difference between the mark and the cyclic mark (the first encountered cyclic type will be named 'a, the second one 'b, …). To indicate the name of the cyclic type, we add a as 'a after its first occurrence, where the actual letter used is again determined by the difference between the newly-generated mark and the cyclic mark.

The code additionally handle remaining type variables, and adaptive placement of brackets depending on context, I let you discover the details in the code itself.

Quality of Error Messages

Our code has two shortcomings regarding error messages. The first one is a bias due to our bottom-up approach. Consider the following example:

let f c x = (if c then 1 else x) +. 42.;;

Our type-checker will report that inferred type for if c then 1 else x is int while expected type was float. A top-down type-checker would instead report that 1 has type int while expected type was float.

While I think the bottom-up approach generally provides better error messages, there are errors for which the top-down approach yields better ones. Which approach works best in general is the subject of hot debate. I would say neither approach is intrinsically better than the other. A better way of handling error messages would probably be to decouple constraint generation from constraint solving, in order to generate error messages using heuristics, or reporting several possible points of errors. I will stay with the current version for the time being, but you can have a look at "Generalizing Hindley-Milner Type Inference Algorithms", by Bastiaan Heeren, Jurriaan Hage and S. Doaitse Swierstra for a good introduction about this subject.

Another problem with our approach comes from the use of mutability, which can cause partial unification. If we try to unify τi → int with char → float, our type-checker will report that it cannot unify char → int with char → float. This is because when unification fails (when trying to unify int against float), unification of τi with char has already taken place. It would probably be better to report the initial types, i.e. that τi → int cannot be unified with char → float. However, this would require some form of backtracking.

Note that both of these shortcomings are shared with the OCaml compiler itself.

Source Code

The code accompanying this article is available in archive photon-tut-16.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 16-monomorphic_type_inference.1

In our next installment, we will extend the type inference process to handle polymorphic types.

Discussion

Enter your comment. Wiki syntax is allowed:
 
blog/2012/01/26-monomorphic_type_inference.txt · Last modified: 2012/01/26 20:24 by csoldani