blog:2012:02:01-polymorphism

This is the 17th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement *polymorphism*, both for the type inference process and for the compilation itself.

In last installment, we implemented *monomorphic* type inference. Why *monomorphic*? Because a single well-defined type is assigned to each entity.

For example, the code

let id x = x;; id 42;; id true;;

would make our compiler break with a message stating that `true`

has `bool`

type, while `int`

was expected^{1)}. Why is it so? Because although the type of `id`

is not fully determined solely from its definition (it acquires type `τ`

, where _{i} → τ_{i}`τ`

is a type variable), its type becomes prosaic _{i}`int → int`

after the call to `id 42`

, *i.e.* our `τ`

is unified with _{i}`int`

.

As we don't want having to define families of functions such as `id_int`

, `id_bool`

, … differing only by their types, we will now implement *polymorphic* type inference. By polymorphic, we mean that a function such as `id : 'a → 'a`

can accept any argument, whatever its type. No need to duplicate it.

We need to be able to distinguish between (partially) unknown types such as `τ`

where _{i} → τ_{i}`τ`

is a type variable to be unified later with some other type, and _{i}*type schemes* such as `∀α.α → α`

, where `α`

is universally quantified and can take any type.

We represent a type scheme `∀α`

by a pair _{1}…∀α_{n}.τ`([α`

:
_{1}; …; α_{n}], τ)

type scheme = t list * t

In a type scheme `(ts, t)`

, each type appearing in `ts`

is assumed to be universally quantified when it appears in `t`

. For example, in `([τ`

, _{i}], τ_{i} → τ_{j})`τ`

is universally quantified while _{i}`τ`

is a normal type variable that can be unified.
_{j}

We modify the type environment to accept type schemes instead of types. A type `t`

which is not polymorphic can always be presented as the `([], t)`

type scheme.

When we lookup a variable in the type environment, we not only retrieve its type scheme, but we *instantiate* it, *i.e.* we replace each universally-quantified type by a new type variable.

Let `id : ([τ`

be in the environment, and expression:
_{i}], τ_{i} → τ_{i})

f (id true) (id 42)

In the first occurrence (`id true`

), `id`

is given the instantiated type `τ`

, where _{j} → τ_{j}`τ`

is a new type variable (to be unified with _{j}`bool`

). In the second occurrence (`id 42`

), `id`

is given the instantiated type `τ`

where _{k} → τ_{k}`τ`

is a new type variable (to be unified with _{k}`int`

).

`type_instance`

is conceptually simple, but our implementation is a bit more involved than expected because we want to maximize sharing. When we meet a type for the first time, we not only instantiate it, but we put its instance in a vector at a given index which is recorded in the type's mark (modulo an offset). When we meet an already met type (we know by its changed mark), we just fetch its instance back from the instance vector. A `Vector.t`

is similar to an array, except that it is extensible in size (see `vector.ml`

). The `TPoly`

constructor is used for user-provided annotations.

let type_instance (bounded, t) = let offset = new_mark () in let insts = Vector.make (List.length bounded) (new_tvar ()) in let index mark = mark - offset - 1 in (* First create instances of bounded types *) let inst_bounded t = t.mark <- new_mark (); Vector.set insts (index t.mark) (new_tvar ()) in List.iter inst_bounded bounded; (* Then instantiate the whole type. *) let rec visit t = let t = deref t in if t.mark > offset then Vector.get insts (index t.mark) else match desc t with TPoly _ | TVar -> t | TApply(con, ts) -> let ts' = List.map visit ts in let inst = if List.for_all2 (==) ts ts' then t else mk_type (TApply(con, ts')) in t.mark <- new_mark (); Vector.set insts (index t.mark) inst; inst in visit t

Generalization is an operation roughly inverse to instantiation. Genaralizing a type gives a type scheme for this type, where every free type variable has been universally quantified. As we only allow generalization at the top-level for the time being, every type variable occurs free and `generalize`

only has to collect type variables inside a type.

let generalize t = (free_type_vars t, t)

Rather than writing `free_type_vars`

as a direct traversal of the type, we factorize out the *traversal* part into a `fold`

function which will be reused in several type utilities. `fold`

is to types what `fold_left`

is to lists. Given a function, an accumulator and a type, it accumulates a result by applying the function to every type nodes.

let fold f acc t = let visited = new_mark () in let rec aux acc t = let t = deref t in if t.mark = visited then acc else ( t.mark <- visited; let acc = f acc t in match desc t with TApply(_, ts) -> List.fold_left aux acc ts | TPoly _ | TVar -> acc ) in aux acc t

`free_type_vars`

is then a simple fold:

let free_type_vars t = let f tvars t = match desc t with TPoly _ | TVar -> t :: tvars | _ -> tvars in fold f [] t

We modify the parser to allow `QUOTE IDENT`

as type annotation, *i.e.* types such as `'a`

or `'meaningfull_name`

. But what type should we ascribe for such annotations? If we simply ascribe a type variable, we are at risk of unifying it with some other type.

let id (x : 'a) : 'a = x + 1 val id : int -> int = <fun>

Although this is surprisingly the OCaml compiler behaviour, I find this misleading. If the programmer specified a polymorphic type somewhere, the compiler should ensure it is actually used polymorphically.

Our compiler will give the following error instead:

let id (x : 'a) : 'a = x + 1;; <stdin>:1.25: this expression has type int, but is declared to have more general type 'a

The question is how do we achieve that? Following OCaml, we restrict ourselves to rank-1 polymorphism, and we avoid recursive types so that we cannot just assign type schemes for these arguments and functions: this would be *too* polymorphic. *E.g.*

let f g id = g (id true) (id 4)

would type-check fine with a type scheme `∀α.α → α`

assigned to argument `id`

, but this is not the OCaml rank-1 behaviour (arguments are **not** used polymorphically):

let f g id = g (id true) (id 4);; -----------------------------^ Error: This expression has type int but an expression was expected of type bool

To solve this problem, we introduce a new type constructor for types:

type t = ... and desc = ... | TPoly of string (** declared polymorphic type *)

A type `TPoly name`

is treated like an ordinary type variable, but it cannot be unified. If we try to unify it, we raise an error using the provided `name`

.

let not_general_enough pos id expected = let msg = "this expression has type " ^ string_of_type expected ^ ", but is declared to have more general type " ^ id in raise (Error (pos, msg)) let unify pos actual expected = let rec aux pos actual expected = ... match desc actual, desc expected with ... | TPoly id, _ -> not_general_enough pos id expected | _, TPoly id -> not_general_enough pos id actual ...

We also modify the printing of types to use user-provided polymorphic type names, which I find clearer. See `Type.string_of_type`

code for details.

Now that our polymorphic typing process is more or less complete, it remains to discuss how we compile a polymorphic function, also called a *generic* function.

How can one compile a function such as `id : 'a → 'a`

? There are roughly two possibilities: uniform value representation, or templates.

A *uniform value representation* is when every value has the same representation. OCaml mostly uses this approach. Although the details may vary significantly, the idea is to pass values by pointer. That way, all arguments and return values have the same size and a single definition can handle any variant. An identity function can be compiled to the following C equivalent:

void *id(void *val) { return val; }

The other alternative is to use the generic function code as a *template*, which is then instantiated with specific types when the need arises (this is roughly what C++ templates do). Here, our `id`

function is not compiled when defined, but its AST is kept around. When `id`

is used with, say, `int`

type, an `id_int`

version is compiled if it does not already exists. If `id`

is later called with a `bool`

, a new version `id_bool`

is compiled:

int id_int(int x) { return x; } bool id_bool(bool x) { return x; }

The fully generic version obviously generate less code, but the boxing and *unboxing* of values may decrease performance. For the time being, we chose the template approach, but it is more an experiment than a definite choice. We may end up with a combination of the two approaches in the end.

When the type of a function contains universally quantified variables, *i.e.* when `ts`

is not empty in its `(ts, t)`

scheme, we do not compile it immediately, but we instead add its definition to a new `generics`

table in `Compiler.t`

.

When we look up a variable which is neither an argument nor a global variable, is is necessarily a generic function (else, it would have been detected during typing). We look the expected type (*e.g.* `int → int`

for `id 42`

), and deduce the template instance name using *name mangling* (see below). If this instance already exists, it is reused. If it is not, we define it.

The name mangling process attributes a suffix composed of letters corresponding to types, using underscores as brackets for function types. A type should correspond to a single suffix, and a suffix should correspond to a single type. We generate an instance name by appending the type suffix to the generic name. The name and suffix are separated by `$`

, which cannot appear in user-defined identifiers (so a user could not define a function with the name of the instance of another generic function).

let rec mangle t = match desc t with | TApply(TBool, []) -> "b" | TApply(TChar , []) -> "c" | TApply(TFloat , []) -> "f" | TApply(TInt , []) -> "i" | TApply(TString , []) -> "s" | TApply(TUnit, []) -> "u" | TApply(TFun , ts) when ts <> [] -> "_" ^ String.concat "" (List.map mangle ts) ^ "_" | TApply(_, _) -> failwith "mangle: arity" | TPoly _ | TVar -> failwith "mangle: type variable"

For example, `f : int → (bool → char) → float`

would give instance name `f$i_bc_f`

.

It remains to see how we define an instance of a generic function. The expected type is known because it is recorded in the AST label during typing. But types appearing in the generic function definition are generic types. These types must be replaced by the corresponding types as deduced from the expected type. Roughly, we instantiate the generic type and unify it with the expected type. However, we cannot simply call `type_instance`

in practice because every sub-node from the function AST has to be updated to reflect the new, instantiated type. Introducing a function `label_map f ast`

which returns a copy of the AST where each label has been obtained by passing previous label to `f`

, the code is as follows:

let instantiate comp id t = let inst_id = id ^ "$" ^ Type.mangle t in try lookup_var comp inst_id with Not_found -> let gen_f = Hashtbl.find comp.generics id in let (qs, t') = gen_f.scheme in let qs = List.map (fun q -> (q, new_tvar ())) qs in let rec adjust_type t = let t = deref t in try List.assoc t qs with Not_found -> match desc t with TPoly _ | TVar -> failwith "instantiate: type variable" | TApply(_, []) -> t | TApply(con, ts) -> mk_type (TApply(con, List.map adjust_type ts)) in let body = label_map (fun (pos, t) -> (pos, adjust_type t)) gen_f.body in let t' = adjust_type t' in Typer.unify Position.dummy t' t; let old_block = insertion_block comp.builder in let f = define_fun comp gen_f.is_rec inst_id gen_f.gen_args t' body in position_at_end old_block (comp.builder); f

We take care of saving and restoring the insertion point because a generic function might be instantiated in the middle of the compilation of another function.

# let id x = x;; val id : 'a -> 'a = <fun> # id 42;; - : int = 42 # id true;; - : bool = true # let inc x = x + 1;; val inc : int -> int = <fun> # id inc;; - : int -> int = <fun> # let sub x = x -. 1.;; val sub : float -> float = <fun> # (id sub) 43.;; - : float = 42.000000 # let k x y = x;; val k : 'a -> 'b -> 'a = <fun> # k 42 false;; - : int = 42 # k 42 id;; Fatal error: exception Failure("mangle: type variable") # let f x y = inc x;; val f : int -> 'a -> int = <fun> # let inc x = if x then 'a' else 'b';; val inc : bool -> char = <fun> # f 1 2;; Call parameter type does not match function signature! %0 = call i8 @f1(i32 1, i32 2) i32 call void @Printable_print_int(i8 %0) Broken module found, compilation aborted!

Ouch Our code is far from robust enough to handle all polymorphic use cases.

The first error is due to the fact that expected type for `k`

at instantiation time is `int → ('a → 'a) → int`

, where `'a`

is left undetermined (as the second argument to `k`

is never used). We should probably filter out unused arguments as we filter out unit ones, but we defer this to a later time.

The second error is due to `f`

instance calling the wrong `inc`

function. When `f`

was defined (and typed), `inc`

was pointing to the first definition with type `int → int`

. But when `f`

is instantiated, `inc`

now points to the second definition with type `bool → char`

. When recording a generic function definition, we should take care of making variables point *definitively* to the definitions *at that time*. Again, we defer this to a later installment.

There are many, many type systems and each one generally has several implementations. In my opinion, types are what put the most fun in programming languages.

If you want to go further, here is a short list of resources regarding type systems and type inference, which may or may not be used later in Photon. This list is far from exhaustive, it is just a random list of what I found interesting and I remember at the moment.

- Programming Languages: Application and Interpretation (PLAI), Shriram Krishnamurthi. A well written book about programming language features and implementation. It contains some chapters about types, one of them treating type inference. Example implementations in scheme are given. You can get the book for free but can also donate what you want.
- Using, Understanding and Unraveling the OCaml Language: from Practice to Theory and vice-versa, Didier Rémy. Comprehensive book regarding OCaml, giving implementation examples for typing and evaluation of the core language, and good hints everywhere around the place. Most of our current type inference code is directly inspired from this book.
- Types and Programming Languages (TAPL), Benjamin C. Pierce. An absolute must-read on type systems, which covers everything mere mortals could implement, complete with illustrative implementations in OCaml.
- The Implementation of Functional Programming Languages and Implementing functional languages: a tutorial, Simon Peyton-Jones
^{2)}. These two do not put much emphasis on types, and some of the content is a bit dated by now, but large parts are still the best I read about functional programming languages implementation. - Benjamin C. Pierce edited a sequel to TAPL: Advanced Topics in Types and Programming Languages (ATTPL). If you cannot afford this one, or are just interested in ML type inference, you can grab an extended (pre)version of the chapter about type inference: The essence of ML Type Inference, by François Pottier and Didier Rémy. This chapter culminates with the presentation of HM(X), a constraint-based type inference algorithm which is adaptable to various type systems. This is not for the faint-hearted: do not read if you are allergic to formal academic presentation.

- Generalizing Hindley-Milner Type Inference Algorithms, Bastiaan Heeren, Jurriaan Hage, and S. Doaitse Swierstra. This article is a good introduction to constraint-based type inference.
- Typing Haskell in Haskell, Mark P. Jones. A clear and concise type inference implementation for Haskell 98, with type classes. I translated it in OCaml in my post Typing Haskell in ML.
- Many articles worth reading on the MLF page, a type system (and implementations) by Didier Rémy, Didier Le Botlan, Boris Yakobowski, Paolo Herms, and Gabriel Scherer. ML
^{F}falls in the category of advanced type systems. It offers a great expressiveness (including higher-rank polymorphism) with very few annotations, but type inference is a bit involved. - Another noteworthy article about ML
^{F}is Qualified types for MLF, by Leijen, Daan, Löh, and Andres. This article describes an extension of ML^{F}with qualified types (including*type classes*), and is a good introduction to*evidence passing*, or how smart types can be compiled down to code (*e.g.*passing a method dictionary with each typeclassed argument). - Simpler alternatives to ML
^{F}could be given by HMF: Simple Type Inference for First-Class Polymorphism and Flexible types: robust type inference for first-class polymorphism, both by Daan Leijen. These are accompanied by example implementation in Haskell. - In the other direction (
*i.e.*even more advanced), is OutsideIn(X): Modular type inference with local assumptions, by Dimitrios Vytiniotis, Simon Peyton Jones, Tom Schrijvers and Martin Sulzmann. This system is able to handle quite exotic type systems (GADT, type families, …) and the paper is well presented, but it is not for the squeamish (I did not grabbed everything yet, to be honest. Have I mentioned the paper is 83-page long?).

A good source to find other articles regarding programming language design, implementation, type systems, *etc.* is Lambda the Ultimate, the *ultimate* programming weblog!

The code accompanying this article is available in archive photon-tut-17.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git cd photon git checkout 17-polymorphic_type_inference

In the next installment, we will try to simplify a bit our code base.

`id`

declaration. Our compiler cannot compile functions with (partially) unknown types yet.blog/2012/02/01-polymorphism.txt · Last modified: 2012/02/01 13:43 by csoldani

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International

## Discussion