blog:2012:01:26-monomorphic_type_inference

This is the 16th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement type inference for *monomorphic types*.

Rather than requiring every function argument and return value to be explicitly specified by the programmer, we will implement type inference, a process by which missing types are deduced by the compiler.

Our language is still strongly and statically typed. We can flag typing errors at compile time and use type information for compilation.

To achieve type inference, we modify the parser to accept untyped definitions. We then extend the typer to infer types rather than simply enforcing them. This in turn requires a modification of the type representation.

As our type representation will evolve into something more complex, we prefer to completely avoid playing with types during parsing. To represent (optional) types in our AST, we introduce the following new data type:

type parse_type = PTEmpty (** no type provided *) | PTId of Position.t * string (** type given by name *) | PTFun of parse_type list (** function type *)

Function arguments and ASTs are now parameterized by an additional type argument, which is `parse_type`

after parsing and `Type.t`

after typing.

type 't arg = string * 't type ('lbl, 't) top_level = Bind_fun of bool * string * 't arg list * 't * 'lbl t | Bind_val of string * 'lbl t * 't | External of string * 't * string ...

The parser is modified to use the new `parse_type`

. It now also accepts unannotated function arguments, and function declarations without return types. Here is a relevant excerpt:

decl_or_dir : LET IDENT EQUAL expr { Bind_val($2, $4, PTEmpty) } | LET IDENT COLON type_ EQUAL expr { Bind_val($2, $6, $4) } | LET IDENT args EQUAL expr { Bind_fun(false, $2, List.rev $3, PTEmpty, $5) } | LET IDENT args COLON type_ EQUAL expr { Bind_fun(false, $2, List.rev $3, $5, $7) } | LET REC IDENT args EQUAL expr { Bind_fun(true, $3, List.rev $4, PTEmpty, $6) } | LET REC IDENT args COLON type_ EQUAL expr { Bind_fun(true, $3, List.rev $4, $6, $8) } ... arg : IDENT { ($1, PTEmpty) } | LPAR IDENT COLON type_ RPAR { ($2, $4) }

Our treatment of type inference is a variation around the Damas-Milner algorithm. Our implementation is mostly inspired by the book Using, Understanding and Unraveling the OCaml Language: from Practice to Theory and vice-versa, by Didier Rémy.

To illustrate the type inference process, let's first do an example by hand.

let f c x y = (if c then x else y) + 1

First, let's rewrite this with subscripts to identify all the sub-expressions appearing in this function definition:

`let f`

_{1} c_{2} x_{3} y_{4} = (if c_{8} then x_{9} else y_{10})_{6} +_{5} 1_{7}

Let `τ`

denote the type of the _{i}`i`

-th sub-expression, as shown above. Analyzing this function definition AST, we can deduce the following *type constraints* that should hold for the function definition to be well-typed:

`τ`

_{1}= τ_{2}→ τ_{3}→ τ_{4}→ τ_{5}

Because`f`

if a function of`c`

,`x`

and`y`

and returns the value given by its body.`τ`

_{6}= int

Because the first operand to`+`

should be an integer.`τ`

_{7}= int

Because the second operand to`+`

should be an integer.`τ`

_{5}= int

Because the resulting value of an`+`

operation should be an integer.`τ`

_{7}= int

Because`1`

is an integer.`τ`

_{8}= bool

Because in an if-then-else expression, the condition must be boolean.`τ`

_{6}= τ_{9}

Because the resulting type of an if-then-else expression is the same as the one of its*then*branch.`τ`

_{6}= τ_{10}

Because the resulting type of an if-then-else expression is the same as the one of its*else*branch.`τ`

_{8}= τ_{2}

Because`c`

here refers to the first argument of`f`

.`τ`

_{9}= τ_{3}

Because`x`

here refers to the second argument of`f`

.`τ`

_{10}= τ_{4}

Because`y`

here refers to the third argument of`f`

.

It now remains to solve these constraints:

- As
`τ`

(9) and_{2}= τ_{8}`τ`

(6), we have_{8}= bool`τ`

(12)._{2}= bool - As
`τ`

(7) and_{9}= τ_{6}`τ`

(2), we have_{6}= int`τ`

(13)._{9}= int - As
`τ`

(8) and_{10}= τ_{6}`τ`

(2), we have_{6}= int`τ`

(14)._{10}= int - As
`τ`

(10) and_{3}= τ_{9}`τ`

(13), we have_{9}= int`τ`

(15)._{3}= int - As
`τ`

(11) and_{4}= τ_{10}`τ`

(14), we have_{10}= int`τ`

(16)._{4}= int - Injecting (12), (15), (16) and (4) in (1) gives
`τ`

(17)._{1}= bool → int → int → int

The solution is thus given by:

`τ`

_{1}= bool → int → int → int`τ`

_{2}= τ_{8}= bool`τ`

_{3}= τ_{4}= τ_{5}= τ_{6}= τ_{7}= τ_{9}= τ_{10}= int

This solution gives a type to every sub-expression and you can verify it violates none of the generated type constraints.

Rather than first generating the type constraints and then solving them afterward, we will do both at the same time. To achieve this, we use an indirection in the type representation. A type is represented by a mutable pointer to either:

- a (partially) known type such as
`int`

or`bool → τ`

;_{i} - a yet-unknown type, which we call a
*type variable*; - a link to another type.

type t = { mutable node : node; (* partially known type, type variable or type link *) }

When we meet a constraint of the form `τ`

, we mutate _{i} = int`τ`

into a pointer to _{i}`int`

. When we meet a constraint of the form `τ`

, we mutate _{i} = τ_{j}`τ`

into a pointer to _{i}`τ`

. Of course, we must ensure that these modifications are type-safe. If _{j}`τ`

already points to _{i}`bool`

when we meet a constraint `τ`

, we must raise an error instead of replacing _{i} = int`τ`

. This is the role of type _{i}*unification*, described below. Newly introduced unknown types are made pointing to a type variable.

To distinguish between *terminal types* (partially known types and type variables), and *link types* (links to other types), we subdivide types into descriptive types and link types.

type node = TDesc of desc (* terminal type *) | TLink of t (* link to another type *)

Type variables are represented by a dedicated constructor `TVar`

, while partially known types are represented by the application of a type constructor to zero, one or more types.

type desc = TVar (* type variable *) | TApply of constr * t list

For example, `int`

is represented by `TApply(TInt, [])`

while `bool → char`

is represented by `TApply(TFun, [t_bool; t_char])`

where `t_bool`

and `t_char`

are pre-defined boolean and character types.

Finally, we adjoin a mutable integer to each type, which can serve different purposes in type traversals (e.g. see recursive types). Our full type definition is given by:

type t = { mutable node : node; (** type node *) mutable mark : int (** a mark than can be freely used during traversals *) } and node = TDesc of desc (** terminal type *) | TLink of t (** link to another type *) and desc = TVar (** type variable *) | TApply of constr * t list (** type constructor *)

We introduce a function `deref`

to de-reference types, *i.e.* strip potential links, and one function `desc`

which returns the type description associated to a given type.

let rec deref t = match t.node with TLink u -> let v = deref u in t.node <- TLink v; (* we replaces chains of links by a single link *) v | TDesc _ -> t let desc t = match (deref t).node with TLink _ -> assert false | TDesc d -> d

*Unifying* two types `τ`

and _{i}`τ`

is enforcing the type constraint _{j}`τ`

, while checking other type constraints about _{i} = τ_{j}`τ`

and _{i}`τ`

remain valid.
_{j}

For example, unifying `int`

and `bool`

fails while unifying `char → bool`

and `char → τ`

makes _{i}`τ`

points to _{i}`bool`

.

To unify two types, we compare their type descriptions. If one of them is a fully unknown type, *i.e.* a type variable, it is replaced by a link to the other one. If both types are type applications of the same type constructor to lists of types with the same lengths, we recursively unify their type arguments. In any other case, the unification fails and we report a type error at the given position.

To avoid recursing down large types which are the same one, we use physical equality tests (*i.e.* memory address comparisons) to shortcut the process when both types point to the exact same type. To maximize this effect, we replace by links not only type variables but also unified type applications, in order to maximize sharing.

let rec unify pos actual expected = let link t1 t2 = t1.node <- TLink t2 in let actual = deref actual in let expected = deref expected in if actual == expected then () (* Both types point to the same node *) else (* actual and expected points to different types, try to unify them *) match desc actual, desc expected with TVar, _ -> link actual expected | _, TVar -> link expected actual | TApply (con1, ts1), TApply(con2, ts2) when con1 = con2 -> if List.length ts1 <> List.length ts2 then mismatch pos actual expected; begin try List.iter2 (unify Position.dummy) ts1 ts2 with Error(_, _) -> mismatch pos actual expected end; link actual expected | _, _ -> mismatch pos actual expected in

The recursive unification of type arguments above uses a dummy position because no position can be associated to sub-types. When an error is raised while unifying type arguments, we re-raise it with the original position and full types.

Armed with a type representation and type unification, the type inference process itself is rather straightforward.

Every literal has an associated pre-defined type, which we simply return.

let type_of_lit = function LBool _ -> t_bool | LChar _ -> t_char | LFloat _ -> t_float | LInt _ -> t_int | LString _ -> t_string | LUnit -> t_unit let rec annotate_ast type_env ast = let pos = ast.label in let mk_ast expr t = { label = (pos, t); expr = expr } in match ast.expr with ... | Literal lit -> mk_ast (Literal lit) (type_of_lit lit)

We lookup variables in the type environment, and return their associated type.

let type_of_var type_env pos id = try Type_env.find type_env id with Not_found -> raise (Error (pos, "unbound variable " ^ id)) let rec annotate_ast type_env ast = ... match ast.expr with ... | EVar id -> mk_ast (EVar id) (type_of_var type_env pos id)

For each operand, we first infer its type, and then unify it with its expected type as determined by the operator. The operation resulting type is determined by the operator.

We dropped the support for wanna-be type-classes `Eq`

and `Ord`

. We are now more liberal in the types accepted by comparison operators. From now on, it will trigger a *compile error* rather than a *type error* when two values cannot be compared (we still ensure both compared values have the same type).

let rec annotate_ast type_env ast = ... match ast.expr with Bin_op(lhs, op, rhs) -> let (operand_type, ret_type) = match op with Cmp_eq | Cmp_ne | Cmp_ge | Cmp_gt | Cmp_le | Cmp_lt -> (new_tvar (), t_bool) | F_add | F_div | F_mul | F_pow | F_sub -> (t_float, t_float) | I_add | I_div | I_mul | I_sub -> (t_int, t_int) in let lhs = annotate_ast type_env lhs in unify (pos_of lhs) (type_of lhs) operand_type; let rhs = annotate_ast type_env rhs in unify (pos_of rhs) (type_of rhs) operand_type; mk_ast (Bin_op(lhs, op, rhs)) ret_type ...

The code for unary operators is very similar.

We infer a type for the condition, and then unify its type with `t_bool`

. We then infer types for both branches, and unify the type of the *else* branch with the one of the *then* branch. Finally, the inferred type for the whole if-then-else expression is the one of the *then* branch.

let rec annotate_ast type_env ast = ... match ast.expr with ... | If(cond, then_, else_) -> let cond = annotate_ast type_env cond in unify (pos_of cond) (type_of cond) t_bool; let then_ = annotate_ast type_env then_ in let else_ = annotate_ast type_env else_ in unify (pos_of else_) (type_of else_) (type_of then_); mk_ast (If(cond, then_, else_)) (type_of then_)

We first infer a type for the callee. We then generate a type `τ`

for a function of _{1} → … → τ_{n+1}`n`

arguments, and unify it with `f`

's inferred type.

We then infer a type for each argument, which is unified with corresponding `τ`

.
_{i}

Finally, we return `τ`

as the type for the call.
_{n+1}

let rec annotate_ast type_env ast = ... match ast.expr with ... | Call(f, args) -> let arg_types = List.map (fun _ -> new_tvar ()) args in let ret_t = new_tvar () in let f = annotate_ast type_env f in unify (pos_of f) (type_of f) (mk_fun_type (arg_types @ [ret_t])); let annotate_arg arg arg_t = let arg = annotate_ast type_env arg in unify (pos_of arg) (type_of arg) arg_t; arg in let args = List.map2 annotate_arg args arg_types in mk_ast (Call(f, args)) ret_t

We first convert parse types into real types. Unannotated arguments and return value are given a new type variable as type. We then extends the type environment as before and infer a type for the body. The body type is then unified with the return type, and the original type environment is extended with the new binding.

let annotate type_env = function Bind_fun(is_rec, id, args, ret_type, body) -> (* Turns parsing types into real types *) let convert_arg (id, pt) = (id, parse_type_to_type pt) in let args = List.map convert_arg args in let ret_type = parse_type_to_type ret_type in (* Builds updated type environment *) let add_arg type_env (id, t) = Type_env.extend type_env id t in let ext_type_env = List.fold_left add_arg !type_env args in let t = mk_fun_type (List.map snd args @ [ret_type]) in let ext_type_env = if is_rec then Type_env.extend ext_type_env id t else ext_type_env in (* Type body and updates environment *) let body = annotate_ast ext_type_env body in unify (pos_of body) (type_of body) ret_type; type_env := Type_env.extend !type_env id t; (t, Bind_fun(is_rec, id, args, ret_type, body))

As similar technique is used for value bindings.

External declaration are fully typed and no type inference is done, but parsing types are still replaced by real types (raising an error if one type is unknown).

The process of linking types during unification can lead to *recursive types*, *i.e.* types that refer to themselves. For example, if we unify `τ`

with _{i}`int → τ`

, we will simply link _{i}`τ`

with _{i}`int → τ`

. If we try to dump this type naively following links, it will looks like _{i}`int → (int → (int → (…)))`

, an infinite expansion.

Recursive types can have their use (for examples, see Chris Okasaki's “Purely Functional Data Structures”). However, when resulting from the type inference process, they more often indicate programmer errors than intended use. For this reason, we forbid implicitly defined recursive types (as do OCaml, except when passed the `-rectypes`

flag).

Rather than checking that a type do not occur in the target each time we want to link one (something known as the *occurs check*), we defer this check to the end of unification.

Here is our new unification function, where `cyclic_types t`

returns a list of types in `t`

which are cyclic:

let unify pos actual expected = let rec aux pos actual expected = ... (* our old unify definition, renamed aux *) in (* Unifies given types without any provision for recursive types *) aux pos actual expected; (* Check that resulting type is not cyclic (the "occurs check") *) if cyclic_types actual <> [] then raise (Error(pos, "inferred a cyclic type: " ^ string_of_type actual))

The `cyclic_types`

function detect cyclic types using two marks: one for visited type nodes, and one for nodes under visit. When we see a type for the first time, we mark it as *under visit*. We then recurse through its sub-types. When we have finished, we mark the type as already visited. Should we meet a type with the *under visit* mark, it means that this type is cyclic and it is added to the list of cyclic types.

let cyclic_types t = let visiting = new_mark () in let visited = new_mark () in let cycles = ref [] in let rec visit t = let t = deref t in if t.mark = visited then () (* already done, not cyclic *) else if t.mark = visiting then (* cycle detected, add it to list *) cycles := t :: !cycles else ( (* neither visited nor under visit, i.e. newly encountered type *) t.mark <- visiting; begin match desc t with TVar -> () | TApply(_, ts) -> List.iter visit ts end; t.mark <- visited ) in visit t; !cycles

Of course, we must make provisions for cyclic types when printing a type. We handle this similarly to OCaml, giving a cyclic type a new name on first occurrence and reusing it thereafter. For example, our `τ`

type will be typed out as _{i} = int → τ_{i}`(int → 'a as 'a)`

.

To achieve this, we first obtain the list of cyclic types of a type before converting it to string. We mark any cyclic type as such using a new mark. When we convert the type to string and we meet a cyclic type, we change its mark into a newly generated one (greater than the cyclic mark as marks are generated in increasing order). When meeting a type with a mark greater than the cyclic mark, we do not try to recurse into the type but we print a variable name such as `'a`

. The letter itself depends of the difference between the mark and the cyclic mark (the first encountered cyclic type will be named `'a`

, the second one `'b`

, …). To indicate the name of the cyclic type, we add a `as 'a`

after its first occurrence, where the actual letter used is again determined by the difference between the newly-generated mark and the cyclic mark.

The code additionally handle remaining type variables, and adaptive placement of brackets depending on context, I let you discover the details in the code itself.

Our code has two shortcomings regarding error messages. The first one is a bias due to our bottom-up approach. Consider the following example:

let f c x = (if c then 1 else x) +. 42.;;

Our type-checker will report that inferred type for `if c then 1 else x`

is `int`

while expected type was `float`

. A top-down type-checker would instead report that `1`

has type `int`

while expected type was `float`

.

While I think the bottom-up approach generally provides better error messages, there are errors for which the top-down approach yields better ones. Which approach works best in general is the subject of hot debate. I would say neither approach is intrinsically better than the other. A better way of handling error messages would probably be to decouple constraint generation from constraint solving, in order to generate error messages using heuristics, or reporting several possible points of errors. I will stay with the current version for the time being, but you can have a look at "Generalizing Hindley-Milner Type Inference Algorithms", by Bastiaan Heeren, Jurriaan Hage and S. Doaitse Swierstra for a good introduction about this subject.

Another problem with our approach comes from the use of mutability, which can cause partial unification. If we try to unify `τ`

with _{i} → int`char → float`

, our type-checker will report that it cannot unify `char → int`

with `char → float`

. This is because when unification fails (when trying to unify `int`

against `float`

), unification of `τ`

with _{i}`char`

has already taken place. It would probably be better to report the initial types, *i.e.* that `τ`

cannot be unified with _{i} → int`char → float`

. However, this would require some form of backtracking.

Note that both of these shortcomings are shared with the OCaml compiler itself.

The code accompanying this article is available in archive photon-tut-16.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git cd photon git checkout 16-monomorphic_type_inference.1

In our next installment, we will extend the type inference process to handle polymorphic types.

blog/2012/01/26-monomorphic_type_inference.txt · Last modified: 2012/01/26 20:24 by csoldani

## Discussion