Home

Welcome to Dev Musings. Here are my ramblings about free open-source software, high-tech gadgets and (mostly functional) programming.

DokuWiki TagList Plugin

The list of topics in the sidebar is generated by a small plugin of mine, an addition to the Tag plugin. This is one of the things I like most about DokuWiki: it is simple and well-documented, so it is easily extensible!

In my opinion, this should be integrated directly into the tag plugin, which is why I did not publish it to the DokuWiki plugin index. I'll propose it to the tag plugin authors when I find the time. Feel free to do so yourself if you want to speed things up. In the meantime, here it is.

My Plugin

The plugin is a syntax plugin used as follows:

{{taglist>tag1 tag2 ...}}

for a list of tag links separated by commas inside a span, and:

{{taglist>tag1 tag2 ... &list}}

for a list of tag links inside a ul list.

It is available here.

2012/10/02 09:09 · csoldani · 0 Comments

QuickCheck++ Updated

QuickCheck++ has been updated. The major change is that it now compiles with clang++, which is more strict than GCC about some C++ dark corners.

I no longer use QuickCheck++ on a regular basis (as I am no longer using C++ on a regular basis), but I continue to receive a decent amount of feedback about it. It is amazing to me that such a simple library has attracted this much attention. I will continue to update it as long as there is interest, or until someone else takes over the lead role (I will gladly hand it over to anyone interested in maintaining it).

The next logical step would be to rewrite the library using variadic templates (C++11), which are currently only emulated, up to 5 property arguments. Someone told me he has some code for that; I will get in touch and see if we can roll that out soon.

2012/09/19 10:53 · csoldani · 0 Comments

Photon Compiler Development: Polymorphism

This is the 17th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement polymorphism, both for the type inference process and for the compilation itself.

Overview

In the last installment, we implemented monomorphic type inference. Why monomorphic? Because a single, well-defined type is assigned to each entity.

For example, the code

let id x = x;;
id 42;;
id true;;

would make our compiler fail with a message stating that true has type bool, while int was expected1). Why is that? Because although the type of id is not fully determined solely from its definition (it acquires type τi → τi, where τi is a type variable), its type becomes a prosaic int → int after the call id 42, i.e. our τi is unified with int.

As we don't want to have to define families of functions such as id_int, id_bool, …, differing only by their types, we will now implement polymorphic type inference. By polymorphic, we mean that a function such as id : 'a → 'a can accept an argument of any type. No need to duplicate it.

Typing

Type Schemes

We need to be able to distinguish between (partially) unknown types such as τi → τi where τi is a type variable to be unified later with some other type, and type schemes such as ∀α.α → α, where α is universally quantified and can take any type.

We represent a type scheme ∀α1…∀αn.τ by the pair ([α1; …; αn], τ):

type scheme = t list * t

In a type scheme (ts, t), each type appearing in ts is assumed to be universally quantified when it appears in t. For example, in ([τi], τi → τj), τi is universally quantified while τj is a normal type variable that can be unified.
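
This pair representation can be sketched in isolation (using a stripped-down stand-in for the compiler's type type; the names below are illustrative, not the compiler's):

```ocaml
(* Stripped-down stand-ins for the compiler's types; illustrative only. *)
type ty = TVar of int | TFun of ty * ty

type scheme = ty list * ty

(* ([ti], ti -> tj): ti is universally quantified, tj is a free variable. *)
let ti = TVar 1
let tj = TVar 2
let (s : scheme) = ([ti], TFun (ti, tj))
```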

Type Instances

We modify the type environment to accept type schemes instead of types. A type t which is not polymorphic can always be represented as the type scheme ([], t).

When we look up a variable in the type environment, we not only retrieve its type scheme, but also instantiate it, i.e. we replace each universally quantified type by a new type variable.

Let id : ([τi], τi → τi) be in the environment, and consider the expression:

f (id true) (id 42)

In the first occurrence (id true), id is given the instantiated type τj → τj, where τj is a new type variable (to be unified with bool). In the second occurrence (id 42), id is given the instantiated type τk → τk where τk is a new type variable (to be unified with int).
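
This fresh-variable behaviour can be sketched in isolation (again with a toy representation; the code below mirrors the idea, not the compiler's implementation):

```ocaml
(* Toy types: variables carry an integer id so fresh ones are distinct. *)
type ty = TVar of int | TFun of ty * ty

let counter = ref 0
let new_tvar () = incr counter; TVar !counter

(* Replace each bound variable by a fresh one; both occurrences of a
   bound variable map to the same fresh variable, preserving sharing. *)
let instantiate (bound, body) =
  let subst = List.map (fun b -> (b, new_tvar ())) bound in
  let rec go t =
    match t with
      TVar _ -> (try List.assq t subst with Not_found -> t)
    | TFun (a, b) -> TFun (go a, go b)
  in
  go body

let a = new_tvar ()
let scheme = ([a], TFun (a, a))
let i1 = instantiate scheme
let i2 = instantiate scheme  (* a distinct copy: different fresh variable *)
```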

type_instance is conceptually simple, but our implementation is a bit more involved than expected because we want to maximize sharing. When we meet a type for the first time, we not only instantiate it, but also store its instance in a vector, at an index recorded in the type's mark (modulo an offset). When we meet a type we have already visited (which we can tell from its changed mark), we simply fetch its instance back from the instance vector. A Vector.t is similar to an array, except that it is extensible in size (see vector.ml). The TPoly constructor is used for user-provided annotations.

let type_instance (bounded, t) =
   let offset = new_mark () in
   let insts = Vector.make (List.length bounded) (new_tvar ()) in
   let index mark = mark - offset - 1 in
   (* First create instances of bounded types *)
   let inst_bounded t =
      t.mark <- new_mark ();
      Vector.set insts (index t.mark) (new_tvar ()) in
   List.iter inst_bounded bounded;
   (* Then instantiate the whole type. *)
   let rec visit t =
      let t = deref t in
      if t.mark > offset then Vector.get insts (index t.mark)
      else match desc t with
        TPoly _ | TVar -> t
      | TApply(con, ts) ->
           let ts' = List.map visit ts in
           let inst =
              if List.for_all2 (==) ts ts' then t
              else mk_type (TApply(con, ts')) in
           t.mark <- new_mark ();
           Vector.set insts (index t.mark) inst;
           inst in
   visit t

Generalization

Generalization is an operation roughly inverse to instantiation. Generalizing a type gives a type scheme for that type, in which every free type variable has been universally quantified. As we only allow generalization at the top level for the time being, every type variable occurs free, and generalize only has to collect the type variables inside a type.

let generalize t = (free_type_vars t, t)

Rather than writing free_type_vars as a direct traversal of the type, we factor the traversal out into a fold function, which will be reused in several type utilities. fold is to types what fold_left is to lists: given a function, an accumulator and a type, it accumulates a result by applying the function to every type node.

let fold f acc t =
   let visited = new_mark () in
   let rec aux acc t =
      let t = deref t in
      if t.mark = visited then acc
      else (
         t.mark <- visited;
         let acc = f acc t in
         match desc t with
           TApply(_, ts) -> List.fold_left aux acc ts
         | TPoly _ | TVar -> acc
      ) in
   aux acc t

free_type_vars is then a simple fold:

let free_type_vars t =
   let f tvars t =
      match desc t with
        TPoly _ | TVar -> t :: tvars
      | _ -> tvars in
   fold f [] t

Handling User-Provided Annotations

We modify the parser to allow QUOTE IDENT as a type annotation, i.e. types such as 'a or 'meaningful_name. But what type should we ascribe to such annotations? If we simply ascribe a type variable, we risk unifying it with some other type.

let id (x : 'a) : 'a = x + 1
val id : int -> int = <fun>

Although this is, surprisingly, the OCaml compiler's behaviour, I find it misleading. If the programmer specified a polymorphic type somewhere, the compiler should ensure it is actually used polymorphically.

Our compiler will give the following error instead:

let id (x : 'a) : 'a =  x + 1;;
<stdin>:1.25: this expression has type int, but is declared to have more
general type 'a

The question is how to achieve that. Following OCaml, we restrict ourselves to rank-1 polymorphism and avoid recursive types, so we cannot simply assign type schemes to these arguments and functions: that would be too polymorphic. E.g.

let f g id = g (id true) (id 4)

would type-check fine with the type scheme ∀α.α → α assigned to argument id, but this is not OCaml's rank-1 behaviour (arguments are not used polymorphically):

let f g id = g (id true) (id 4);;
-----------------------------^
Error: This expression has type int but an expression was expected of type
         bool

To solve this problem, we introduce a new type constructor for types:

type t = ...
and desc =
   ...
   | TPoly of string (** declared polymorphic type *)

A type TPoly name is treated like an ordinary type variable, except that it cannot be unified. If we try to unify it, we raise an error using the provided name.

let not_general_enough pos id expected =
   let msg =
      "this expression has type " ^ string_of_type expected
      ^ ", but is declared to have more general type " ^ id in
   raise (Error (pos, msg))
 
let unify pos actual expected =
   let rec aux pos actual expected =
      ...
         match desc actual, desc expected with
         ...
         | TPoly id, _ -> not_general_enough pos id expected
         | _, TPoly id -> not_general_enough pos id actual
...

We also modify the printing of types to use user-provided polymorphic type names, which I find clearer. See Type.string_of_type code for details.

Compilation

Uniform Value Representation vs Templates

Now that our polymorphic typing process is more or less complete, it remains to discuss how we compile a polymorphic function, also called a generic function.

How can one compile a function such as id : 'a → 'a? There are roughly two possibilities: uniform value representation, or templates.

With a uniform value representation, every value has the same representation; OCaml mostly uses this approach. Although the details may vary significantly, the idea is to pass values by pointer. That way, all arguments and return values have the same size, and a single definition can handle any variant. An identity function can be compiled to the following C equivalent:

void *id(void *val) { return val; }

The other alternative is to use the generic function's code as a template, which is then instantiated with specific types as the need arises (this is roughly what C++ templates do). Here, our id function is not compiled when defined; instead, its AST is kept around. When id is used with, say, type int, an id_int version is compiled if it does not already exist. If id is later called with a bool, a new version id_bool is compiled:

int id_int(int x) { return x; }
bool id_bool(bool x) { return x; }

The fully generic version obviously generates less code, but the boxing and unboxing of values may decrease performance. For the time being, we choose the template approach, but this is more an experiment than a definitive choice. We may well end up with a combination of the two approaches.

Implementation

When the type of a function contains universally quantified variables, i.e. when ts is not empty in its (ts, t) scheme, we do not compile it immediately; instead, we add its definition to a new generics table in Compiler.t.

When we look up a variable which is neither an argument nor a global variable, it is necessarily a generic function (otherwise, it would have been detected during typing). We look up the expected type (e.g. int → int for id 42), and deduce the template instance name using name mangling (see below). If this instance already exists, it is reused. If not, we define it.

The name mangling process assigns a suffix composed of letters corresponding to types, using underscores as brackets around function types. A type should correspond to a single suffix, and a suffix should correspond to a single type. We generate an instance name by appending the type suffix to the generic name. The name and suffix are separated by $, which cannot appear in user-defined identifiers (so a user cannot define a function with the name of an instance of another generic function).

let rec mangle t =
   match desc t with
   | TApply(TBool, []) -> "b"
   | TApply(TChar , []) -> "c"
   | TApply(TFloat , []) -> "f"
   | TApply(TInt , []) -> "i"
   | TApply(TString , []) -> "s"
   | TApply(TUnit, []) -> "u"
   | TApply(TFun , ts) when ts <> [] ->
        "_" ^ String.concat "" (List.map mangle ts) ^ "_"
   | TApply(_, _) -> failwith "mangle: arity"
   | TPoly _ | TVar -> failwith "mangle: type variable"

For example, f : int → (bool → char) → float would give instance name f$i_bc_f.

It remains to see how we define an instance of a generic function. The expected type is known, because it is recorded in the AST label during typing. But the types appearing in the generic function definition are generic types, and they must be replaced by the corresponding types deduced from the expected type. Roughly, we instantiate the generic type and unify it with the expected type. In practice, however, we cannot simply call type_instance, because every sub-node of the function AST has to be updated to reflect the new, instantiated type. Introducing a function label_map f ast, which returns a copy of the AST where each label has been obtained by applying f to the previous label, the code is as follows:

let instantiate comp id t =
   let inst_id = id ^ "$" ^ Type.mangle t in
   try lookup_var comp inst_id
   with Not_found ->
      let gen_f = Hashtbl.find comp.generics id in
      let (qs, t') = gen_f.scheme in
      let qs = List.map (fun q -> (q, new_tvar ())) qs in
      let rec adjust_type t =
         let t = deref t in
         try List.assoc t qs
         with Not_found ->
            match desc t with
              TPoly _ | TVar -> failwith "instantiate: type variable"
            | TApply(_, []) -> t
            | TApply(con, ts) ->
                 mk_type (TApply(con, List.map adjust_type ts)) in
      let body = label_map (fun (pos, t) -> (pos, adjust_type t)) gen_f.body in
      let t' = adjust_type t' in
      Typer.unify Position.dummy t' t;
      let old_block = insertion_block comp.builder in
      let f =
         define_fun comp gen_f.is_rec inst_id gen_f.gen_args t' body in
      position_at_end old_block (comp.builder);
      f

We take care to save and restore the insertion point because a generic function might be instantiated in the middle of the compilation of another function.

Tests

# let id x = x;;
val id : 'a -> 'a = <fun>
# id 42;;
- : int = 42
# id true;;
- : bool = true
# let inc x = x + 1;;
val inc : int -> int = <fun>
# id inc;;
- : int -> int = <fun>
# let sub x = x -. 1.;;
val sub : float -> float = <fun>
# (id sub) 43.;;
- : float = 42.000000
# let k x y = x;;
val k : 'a -> 'b -> 'a = <fun>
# k 42 false;;
- : int = 42
# k 42 id;;
Fatal error: exception Failure("mangle: type variable")
# let f x y = inc x;;
val f : int -> 'a -> int = <fun>
# let inc x = if x then 'a' else 'b';;
val inc : bool -> char = <fun>
# f 1 2;;
Call parameter type does not match function signature!
  %0 = call i8 @f1(i32 1, i32 2)
 i32  call void @Printable_print_int(i8 %0)
Broken module found, compilation aborted!

Ouch :-/ Our code is far from robust enough to handle all polymorphic use cases.

The first error is due to the fact that the expected type for k at instantiation time is int → ('a → 'a) → int, where 'a is left undetermined (as the second argument to k is never used). We should probably filter out unused arguments as we filter out unit ones, but we defer this to a later time.

The second error is due to the f instance calling the wrong inc function. When f was defined (and typed), inc pointed to the first definition, with type int → int. But when f is instantiated, inc points to the second definition, with type bool → char. When recording a generic function definition, we should make variables point definitively to the definitions in effect at that time. Again, we defer this to a later installment.

Further Reading

There are many, many type systems and each one generally has several implementations. In my opinion, types are what put the most fun in programming languages.

If you want to go further, here is a short list of resources on type systems and type inference, which may or may not be used later in Photon. This list is far from exhaustive; it is just a selection of what I found interesting and remember at the moment.

Books

Articles

  • Generalizing Hindley-Milner Type Inference Algorithms, Bastiaan Heeren, Jurriaan Hage, and S. Doaitse Swierstra. This article is a good introduction to constraint-based type inference.
  • Typing Haskell in Haskell, Mark P. Jones. A clear and concise type inference implementation for Haskell 98, with type classes. I translated it in OCaml in my post Typing Haskell in ML.
  • Many articles worth reading on the MLF page, a type system (and implementations) by Didier Rémy, Didier Le Botlan, Boris Yakobowski, Paolo Herms, and Gabriel Scherer. MLF falls in the category of advanced type systems. It offers a great expressiveness (including higher-rank polymorphism) with very few annotations, but type inference is a bit involved.
  • Another noteworthy article about MLF is Qualified types for MLF, by Daan Leijen and Andres Löh. This article describes an extension of MLF with qualified types (including type classes), and is a good introduction to evidence passing, i.e. how smart types can be compiled down to code (e.g. passing a method dictionary with each type-classed argument).
  • Simpler alternatives to MLF are given by HMF: Simple Type Inference for First-Class Polymorphism and Flexible types: robust type inference for first-class polymorphism, both by Daan Leijen. These are accompanied by example implementations in Haskell.
  • In the other direction (i.e. even more advanced) is OutsideIn(X): Modular type inference with local assumptions, by Dimitrios Vytiniotis, Simon Peyton Jones, Tom Schrijvers and Martin Sulzmann. This system is able to handle quite exotic type features (GADTs, type families, …) and the paper is well presented, but it is not for the squeamish (I have not grasped everything yet, to be honest. Have I mentioned the paper is 83 pages long?).

A good source to find other articles regarding programming language design, implementation, type systems, etc. is Lambda the Ultimate, the ultimate programming weblog!

Source Code

The code accompanying this article is available in archive photon-tut-17.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 17-polymorphic_type_inference

In the next installment, we will try to simplify a bit our code base.

2012/02/01 13:43 · csoldani · 0 Comments

DokuWiki: Remove Media Files from Feed

After upgrading to Angua, my feed was overfilled with numerous entries for media files. How can we prevent them from appearing in the feed?

One way is to disable media revisions. However, these may be useful, so that is a bit too drastic.

I found an alternative way. There is a new, undocumented parameter to feed.php which controls what appears inside it: view can be one of pages, media or both, and defaults to both. If you replace feed.php with feed.php?view=pages, you get the old (sane) behavior back.

If a lot of people are already subscribed to the old URL (without view, i.e. with view=both), you can arrange this through URL rewriting, or by changing the default from both to pages in the feed.php file.
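
For the rewriting route, a minimal sketch (assuming Apache with mod_rewrite enabled in the DokuWiki directory; untested against other setups) could look like this:

```apache
# Internally rewrite feed requests that carry no explicit view parameter
# to the pages-only feed. QSA keeps any other query parameters.
RewriteEngine On
RewriteCond %{QUERY_STRING} !(^|&)view=
RewriteRule ^feed\.php$ feed.php?view=pages [QSA,L]
```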

2012/01/27 08:53 · csoldani · 0 Comments

Photon Compiler Development: Monomorphic Type Inference

This is the 16th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement type inference for monomorphic types.

Overview

Rather than requiring every function argument and return value to be explicitly specified by the programmer, we will implement type inference, a process by which missing types are deduced by the compiler.

Our language is still strongly and statically typed. We can flag typing errors at compile time and use type information for compilation.

To achieve type inference, we modify the parser to accept untyped definitions. We then extend the typer to infer types rather than simply enforcing them. This in turn requires a modification of the type representation.

Parsing

As our type representation will evolve into something more complex, we prefer to completely avoid playing with types during parsing. To represent (optional) types in our AST, we introduce the following new data type:

type parse_type =
     PTEmpty (** no type provided *)
   | PTId of Position.t * string (** type given by name *)
   | PTFun of parse_type list (** function type *)

Function arguments and ASTs are now parameterized by an additional type argument, which is parse_type after parsing and Type.t after typing.

type 't arg = string * 't
 
type ('lbl, 't) top_level =
     Bind_fun of bool * string * 't arg list * 't * 'lbl t
   | Bind_val of string * 'lbl t * 't
   | External of string * 't * string
   ...

The parser is modified to use the new parse_type. It now also accepts unannotated function arguments, and function declarations without return types. Here is a relevant excerpt:

decl_or_dir
   : LET IDENT EQUAL expr  { Bind_val($2, $4, PTEmpty) }
   | LET IDENT COLON type_ EQUAL expr  { Bind_val($2, $6, $4) }
   | LET IDENT args EQUAL expr
                           { Bind_fun(false, $2, List.rev $3, PTEmpty, $5) }
   | LET IDENT args COLON type_ EQUAL expr
                           { Bind_fun(false, $2, List.rev $3, $5, $7) }
   | LET REC IDENT args EQUAL expr
                           { Bind_fun(true, $3, List.rev $4, PTEmpty, $6) }
   | LET REC IDENT args COLON type_ EQUAL expr
                           { Bind_fun(true, $3, List.rev $4, $6, $8) }
   ...
 
arg
   : IDENT                 { ($1, PTEmpty) }
   | LPAR IDENT COLON type_ RPAR { ($2, $4) }

Typing

Our treatment of type inference is a variation on the Damas-Milner algorithm. Our implementation is mostly inspired by the book Using, Understanding and Unraveling the OCaml Language: from Practice to Theory and vice-versa, by Didier Rémy.

An Example

To illustrate the type inference process, let's first do an example by hand.

let f c x y = (if c then x else y) + 1

First, let's rewrite this with subscripts to identify all the sub-expressions appearing in this function definition:

let f₁ c₂ x₃ y₄ = (if c₈ then x₉ else y₁₀)₆ +₅ 1₇

Let τi denote the type of the i-th sub-expression, as numbered above. Analyzing this function definition's AST, we can deduce the following type constraints, which must hold for the function definition to be well-typed:

  1. τ1 = τ2 → τ3 → τ4 → τ5
    Because f is a function of c, x and y, returning the value given by its body.
  2. τ6 = int
    Because the first operand to + should be an integer.
  3. τ7 = int
    Because the second operand to + should be an integer.
  4. τ5 = int
    Because the resulting value of a + operation should be an integer.
  5. τ7 = int
    Because 1 is an integer.
  6. τ8 = bool
    Because in an if-then-else expression, the condition must be boolean.
  7. τ6 = τ9
    Because the resulting type of an if-then-else expression is the same as the one of its then branch.
  8. τ6 = τ10
    Because the resulting type of an if-then-else expression is the same as the one of its else branch.
  9. τ8 = τ2
    Because c here refers to the first argument of f.
  10. τ9 = τ3
    Because x here refers to the second argument of f.
  11. τ10 = τ4
    Because y here refers to the third argument of f.

It now remains to solve these constraints:

  • As τ2 = τ8 (9) and τ8 = bool (6), we have τ2 = bool (12).
  • As τ9 = τ6 (7) and τ6 = int (2), we have τ9 = int (13).
  • As τ10 = τ6 (8) and τ6 = int (2), we have τ10 = int (14).
  • As τ3 = τ9 (10) and τ9 = int (13), we have τ3 = int (15).
  • As τ4 = τ10 (11) and τ10 = int (14), we have τ4 = int (16).
  • Injecting (12), (15), (16) and (4) in (1) gives τ1 = bool → int → int → int (17).

The solution is thus given by:

  • τ1 = bool → int → int → int
  • τ2 = τ8 = bool
  • τ3 = τ4 = τ5 = τ6 = τ7 = τ9 = τ10 = int

This solution gives a type to every sub-expression, and you can verify that it violates none of the generated type constraints.

Type Representation

Rather than first generating the type constraints and then solving them afterward, we will do both at the same time. To achieve this, we use an indirection in the type representation. A type is represented by a mutable pointer to either:

  • a (partially) known type such as int or bool → τi;
  • a yet-unknown type, which we call a type variable;
  • a link to another type.

type t = {
   mutable node : node; (* partially known type, type variable or type link *)
}

When we meet a constraint of the form τi = int, we mutate τi into a pointer to int. When we meet a constraint of the form τi = τj, we mutate τi into a pointer to τj. Of course, we must ensure that these modifications are type-safe: if τi already points to bool when we meet a constraint τi = int, we must raise an error instead of replacing τi. This is the role of type unification, described below. Newly introduced unknown types are made to point to a type variable.

To distinguish between terminal types (partially known types and type variables), and link types (links to other types), we subdivide types into descriptive types and link types.

type node =
     TDesc of desc (* terminal type *)
   | TLink of t (* link to another type *)

Type variables are represented by a dedicated constructor TVar, while partially known types are represented by the application of a type constructor to zero, one or more types.

type desc =
     TVar (* type variable *)
   | TApply of constr * t list

For example, int is represented by TApply(TInt, []) while bool → char is represented by TApply(TFun, [t_bool; t_char]) where t_bool and t_char are pre-defined boolean and character types.

Finally, we adjoin a mutable integer to each type, which can serve different purposes in type traversals (e.g. see recursive types). Our full type definition is given by:

type t = {
   mutable node : node; (** type node *)
   mutable mark : int (** a mark that can be freely used during traversals *)
} and node =
     TDesc of desc (** terminal type *)
   | TLink of t (** link to another type *)
and desc =
     TVar (** type variable *)
   | TApply of constr * t list (** type constructor *)
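
As a quick self-contained illustration of this representation (the constr type is not shown at this point; the three constructors below are assumed for the example, and mk_type is a hypothetical helper):

```ocaml
(* Assumed sketch of constr; the real compiler defines more constructors. *)
type constr = TBool | TChar | TFun

type t = {
   mutable node : node;
   mutable mark : int;
} and node =
     TDesc of desc
   | TLink of t
and desc =
     TVar
   | TApply of constr * t list

(* Build a fresh, unmarked terminal type from a description. *)
let mk_type d = { node = TDesc d; mark = 0 }

let t_bool = mk_type (TApply (TBool, []))
let t_char = mk_type (TApply (TChar, []))

(* bool -> char *)
let bool_to_char = mk_type (TApply (TFun, [t_bool; t_char]))
```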

We introduce a function deref to de-reference types, i.e. strip potential links, and a function desc which returns the type description associated with a given type.

let rec deref t =
   match t.node with
     TLink u ->
        let v = deref u in
         t.node <- TLink v; (* we replace chains of links with a single link *)
        v
   | TDesc _ -> t
 
let desc t =
   match (deref t).node with
     TLink _ -> assert false
   | TDesc d -> d

Type Unification

Unifying two types τi and τj means enforcing the type constraint τi = τj, while checking that the other type constraints involving τi and τj remain valid.

For example, unifying int and bool fails, while unifying char → bool and char → τi makes τi point to bool.

To unify two types, we compare their type descriptions. If one of them is a fully unknown type, i.e. a type variable, it is replaced by a link to the other one. If both types are type applications of the same type constructor to lists of types with the same lengths, we recursively unify their type arguments. In any other case, the unification fails and we report a type error at the given position.

To avoid recursing down large types which are actually the same, we use physical equality tests (i.e. memory address comparisons) to short-circuit the process when both types point to the exact same node. To maximize this effect, we replace with links not only type variables but also unified type applications, in order to maximize sharing.

let rec unify pos actual expected =
   let link t1 t2 = t1.node <- TLink t2 in
   let actual = deref actual in
   let expected = deref expected in
   if actual == expected then () (* Both types point to the same node *)
   else
      (* actual and expected point to different types, try to unify them *)
      match desc actual, desc expected with
        TVar, _ -> link actual expected
      | _, TVar -> link expected actual
      | TApply (con1, ts1), TApply(con2, ts2) when con1 = con2 ->
           if List.length ts1 <> List.length ts2 then
              mismatch pos actual expected;
           begin try List.iter2 (unify Position.dummy) ts1 ts2
           with Error(_, _) -> mismatch pos actual expected
           end;
           link actual expected
      | _, _ -> mismatch pos actual expected

The recursive unification of type arguments above uses a dummy position, because no position can be associated with sub-types. When an error is raised while unifying type arguments, we re-raise it with the original position and the full types.

Type Inference

Armed with a type representation and type unification, the type inference process itself is rather straightforward.

Literals

Every literal has an associated pre-defined type, which we simply return.

let type_of_lit = function
     LBool _ -> t_bool
   | LChar _ -> t_char
   | LFloat _ -> t_float
   | LInt _ -> t_int
   | LString _ -> t_string
   | LUnit -> t_unit
 
let rec annotate_ast type_env ast =
   let pos = ast.label in
   let mk_ast expr t = { label = (pos, t); expr = expr } in
   match ast.expr with
     ...
   | Literal lit -> mk_ast (Literal lit) (type_of_lit lit)

Variables

We look up variables in the type environment, and return their associated type.

let type_of_var type_env pos id =
   try Type_env.find type_env id
   with Not_found -> raise (Error (pos, "unbound variable " ^ id))
 
let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | EVar id -> mk_ast (EVar id) (type_of_var type_env pos id)

Operators

For each operand, we first infer its type, and then unify it with the expected type determined by the operator. The type of the operation's result is likewise determined by the operator.

We dropped support for the wannabe type classes Eq and Ord, and are now more liberal in the types accepted by comparison operators. From now on, comparing two values that cannot be compared will trigger a compilation error rather than a type error (we still ensure that both compared values have the same type).

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
     Bin_op(lhs, op, rhs) ->
        let (operand_type, ret_type) =
           match op with
             Cmp_eq | Cmp_ne | Cmp_ge | Cmp_gt | Cmp_le | Cmp_lt ->
                  (new_tvar (), t_bool)
             | F_add | F_div | F_mul | F_pow | F_sub -> (t_float, t_float)
             | I_add | I_div | I_mul | I_sub -> (t_int, t_int) in
        let lhs = annotate_ast type_env lhs in
        unify (pos_of lhs) (type_of lhs) operand_type;
        let rhs = annotate_ast type_env rhs in
        unify (pos_of rhs) (type_of rhs) operand_type;
        mk_ast (Bin_op(lhs, op, rhs)) ret_type
   ...

The code for unary operators is very similar.
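
That code is not reproduced in the article; as a rough self-contained sketch of the same idea (the operator alone fixes the operand and result types; the operator names below are made up for the example):

```ocaml
type ty = TInt | TFloat | TBool

(* Hypothetical unary operators, in the spirit of the binary ones above. *)
type un_op = I_neg | F_neg | B_not

(* Each operator determines its expected operand type and result type. *)
let un_op_type = function
    I_neg -> (TInt, TInt)
  | F_neg -> (TFloat, TFloat)
  | B_not -> (TBool, TBool)

(* Simplified check: the operand type must match the expected one. *)
let check_un_op op operand_ty =
  let (expected, ret) = un_op_type op in
  if operand_ty <> expected then failwith "type mismatch";
  ret
```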

If-Then-Else

We infer a type for the condition, and unify it with t_bool. We then infer types for both branches, and unify the type of the else branch with that of the then branch. Finally, the inferred type for the whole if-then-else expression is that of the then branch.

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | If(cond, then_, else_) ->
        let cond = annotate_ast type_env cond in
        unify (pos_of cond) (type_of cond) t_bool;
        let then_ = annotate_ast type_env then_ in
        let else_ = annotate_ast type_env else_ in
        unify (pos_of else_) (type_of else_) (type_of then_);
        mk_ast (If(cond, then_, else_)) (type_of then_)
Function Calls

We first infer a type for the callee. We then generate a type τ1 → … → τn+1 for a function of n arguments, and unify it with f's inferred type.

We then infer a type for each argument, which is unified with the corresponding τi.

Finally, we return τn+1 as the type for the call.

let rec annotate_ast type_env ast =
   ...
   match ast.expr with
   ...
   | Call(f, args) ->
        let arg_types = List.map (fun _ -> new_tvar ()) args in
        let ret_t = new_tvar () in
        let f = annotate_ast type_env f in
        unify (pos_of f) (type_of f) (mk_fun_type (arg_types @ [ret_t]));
        let annotate_arg arg arg_t =
           let arg = annotate_ast type_env arg in
           unify (pos_of arg) (type_of arg) arg_t;
           arg in
        let args = List.map2 annotate_arg args arg_types in
        mk_ast (Call(f, args)) ret_t
Function Definition

We first convert parse types into real types; an unannotated argument or return value is given a fresh type variable as its type. We then extend the type environment as before and infer a type for the body. The body type is then unified with the return type, and the original type environment is extended with the new binding.

let annotate type_env = function
     Bind_fun(is_rec, id, args, ret_type, body) ->
        (* Turns parsing types into real types *)
        let convert_arg (id, pt) = (id, parse_type_to_type pt) in
        let args = List.map convert_arg args in
        let ret_type = parse_type_to_type ret_type in
        (* Builds updated type environment *)
        let add_arg type_env (id, t) = Type_env.extend type_env id t in
        let ext_type_env = List.fold_left add_arg !type_env args in
        let t = mk_fun_type (List.map snd args @ [ret_type]) in
        let ext_type_env =
           if is_rec then Type_env.extend ext_type_env id t
           else ext_type_env in
        (* Type body and updates environment *)
        let body = annotate_ast ext_type_env body in
        unify (pos_of body) (type_of body) ret_type;
        type_env := Type_env.extend !type_env id t;
        (t, Bind_fun(is_rec, id, args, ret_type, body))

A similar technique is used for value bindings.
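A value-binding case might be sketched as follows; Bind_val comes from the grammar, while the rest mirrors the Bind_fun case above and is illustrative only.

```ocaml
let annotate type_env = function
   ...
   | Bind_val(id, expr) ->
        (* Infer the bound expression's type, then extend the type
           environment with the new binding. *)
        let expr = annotate_ast !type_env expr in
        let t = type_of expr in
        type_env := Type_env.extend !type_env id t;
        (t, Bind_val(id, expr))
```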

External declarations are fully typed, so no type inference is done, but parsing types are still replaced by real types (raising an error if a type is unknown).
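The external case might then reduce to a type conversion, sketched below; the constructor shape follows the grammar, but the details are illustrative.

```ocaml
let annotate type_env = function
   ...
   | External(id, pt, native_name) ->
        (* No inference: convert the declared parse type (this raises an
           error if the type is unknown) and record the binding. *)
        let t = parse_type_to_type pt in
        type_env := Type_env.extend !type_env id t;
        (t, External(id, pt, native_name))
```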

Recursive Types

The process of linking types during unification can lead to recursive types, i.e. types that refer to themselves. For example, if we unify τi with int → τi, we simply link τi to int → τi. If we then try to dump this type naively by following links, it looks like int → (int → (int → (…))), an infinite expansion.
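You can provoke exactly this situation in OCaml itself (without the -rectypes flag) with a definition whose body is the function being defined:

```ocaml
(* Rejected by OCaml's occurs check: the result type must be unified
   with the function's own type, which would create a recursive type. *)
let rec f x = f
```

OCaml reports that a type variable occurs inside the type it is unified with, which is precisely the check we implement below.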

Recursive types have their uses (for examples, see Chris Okasaki's “Purely Functional Data Structures”). However, when resulting from the type inference process, they more often indicate programmer errors than intended use. For this reason, we forbid implicitly defined recursive types (as does OCaml, unless passed the -rectypes flag).

Rather than checking that a type does not occur in the target each time we want to link one (something known as the occurs check), we defer this check to the end of unification.

Here is our new unification function, where cyclic_types t returns a list of types in t which are cyclic:

let unify pos actual expected =
   let rec aux pos actual expected =
      ... (* our old unify definition, renamed aux *) in
   (* Unifies given types without any provision for recursive types *)
   aux pos actual expected;
   (* Check that resulting type is not cyclic (the "occurs check") *)
   if cyclic_types actual <> [] then
      raise (Error(pos, "inferred a cyclic type: " ^ string_of_type actual))

The cyclic_types function detects cyclic types using two marks: one for visited type nodes, and one for nodes under visit. When we see a type for the first time, we mark it as under visit. We then recurse through its sub-types. When we have finished, we mark the type as visited. Should we meet a type carrying the under-visit mark, that type is cyclic and is added to the list of cyclic types.

let cyclic_types t =
   let visiting = new_mark () in
   let visited = new_mark () in
   let cycles = ref [] in
   let rec visit t =
      let t = deref t in
      if t.mark = visited then () (* already done, not cyclic *)
      else if t.mark = visiting then
         (* cycle detected, add it to list *)
         cycles := t :: !cycles
      else (
         (* neither visited nor under visit, i.e. newly encountered type *)
         t.mark <- visiting;
         begin match desc t with
           TVar -> ()
         | TApply(_, ts) -> List.iter visit ts
         end;
         t.mark <- visited
      ) in
   visit t;
   !cycles

Of course, we must make provisions for cyclic types when printing a type. We handle this similarly to OCaml, giving a cyclic type a new name on first occurrence and reusing it thereafter. For example, our τi = int → τi type will be printed as (int → 'a as 'a).

To achieve this, we first obtain the list of cyclic types of a type before converting it to a string, and mark each cyclic type as such using a new mark. When, during conversion, we meet a type carrying the cyclic mark, we change its mark into a newly generated one (greater than the cyclic mark, as marks are generated in increasing order). When meeting a type with a mark greater than the cyclic mark, we do not recurse into the type but instead print a variable name such as 'a. The letter itself depends on the difference between the mark and the cyclic mark (the first encountered cyclic type is named 'a, the second one 'b, …). To give the cyclic type its name, we append as 'a after its first occurrence, where the actual letter is again determined by the difference between the newly generated mark and the cyclic mark.

The code additionally handles remaining type variables, and adaptive placement of brackets depending on context; I'll let you discover the details in the code itself.

Quality of Error Messages

Our code has two shortcomings regarding error messages. The first one is a bias due to our bottom-up approach. Consider the following example:

let f c x = (if c then 1 else x) +. 42.;;

Our type-checker will report that the inferred type for if c then 1 else x is int while the expected type was float. A top-down type-checker would instead report that 1 has type int while the expected type was float.

While I think the bottom-up approach generally provides better error messages, there are errors for which the top-down approach yields better ones. Which approach works best in general is the subject of hot debate. I would say neither approach is intrinsically better than the other. A better way of handling error messages would probably be to decouple constraint generation from constraint solving, in order to generate error messages using heuristics, or to report several possible points of error. I will stay with the current version for the time being, but you can have a look at "Generalizing Hindley-Milner Type Inference Algorithms", by Bastiaan Heeren, Jurriaan Hage and S. Doaitse Swierstra for a good introduction to this subject.

Another problem with our approach comes from the use of mutability, which can cause partial unification. If we try to unify τi → int with char → float, our type-checker will report that it cannot unify char → int with char → float. This is because when unification fails (when trying to unify int against float), unification of τi with char has already taken place. It would probably be better to report the initial types, i.e. that τi → int cannot be unified with char → float. However, this would require some form of backtracking.

Note that both of these shortcomings are shared with the OCaml compiler itself.

Source Code

The code accompanying this article is available in archive photon-tut-16.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 16-monomorphic_type_inference.1

In our next installment, we will extend the type inference process to handle polymorphic types.

2012/01/26 20:24 · csoldani · 0 Comments

Create your own Firefox Search Plugins

Can't find a search plugin for a service you use often, and tired of first going to the service's search form or typing your search terms directly in the URL? Here's how to add your own custom search plugins in two minutes.

How It Works

In Firefox/Iceweasel, you can do only limited searches from the URL bar (I miss the combined URL/search bar and intelligent bookmarks from Epiphany). You need to use a separate search box (Ctrl-K). This search box lets you select a different search engine, by clicking the drop-down icon or by typing Ctrl-Down/Up.

You can disable some of the default search engines by choosing “Manage Search Engines…”, but there is no easy way to add one. When you try to add one, you are directed to a Mozilla extensions page suggesting you install one from the very limited list given there.

How to Add Your Own

Have a look at /usr/lib/firefox/searchplugins (or /usr/lib/iceweasel/searchplugins, or find it yourself). This folder contains XML definitions for all preloaded search engines. Copy one that looks reasonably close to the service you want to use and edit it to replace the description, icon, URL and parameters.

For example, here is my search plugin to translate from English to French using Google Translate, which I derived from google.xml:

translate_en_fr.xml
<SearchPlugin xmlns="http://www.mozilla.org/2006/browser/search/">
<ShortName>Translate (En -> Fr)</ShortName>
<Description>Google Translate from English to French</Description>
<InputEncoding>UTF-8</InputEncoding>
<Image width="16" height="16" type="image/x-icon">http://translate.google.com/favicon.ico</Image>
<Url type="text/html" method="GET" template="http://translate.google.com/#en|fr|{searchTerms}" />
<SearchForm>http://translate.google.com/#en|fr</SearchForm>
</SearchPlugin>

Create a ~/.mozilla/firefox/YOUR_PROFILE/searchplugins/ folder, move your XML definition there and restart Firefox/Iceweasel. You are done!

Even Easier

Some searchable sites and services support OpenSearch. When that's the case, just go to the site and unroll your search-bar menu: you should see an entry that says “Add THE_SITE”. When clicked, it creates the search plugin for you.

You can try it on this blog!

2012/01/25 10:27 · csoldani · 3 Comments

Photon Compiler Development: Batch Processing

This is the 15th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we implement a batch mode, i.e. the possibility to call photon interpreter on files, rather than interactively.

Overview

To allow batch processing, we allow file names to be given on the command line. When present, we parse, type, compile and execute them rather than starting the interactive interpreter.

When batch-processing a file, we will be less verbose than in interactive mode.

Finally, we will clean up a bit after ourselves. As a great number of files can be given on the command line, we need to reclaim resources as we go, rather than waiting for the end of compiler execution.

Parsing

Files are not parsed a single top-level statement at a time. Indeed, the ;; is optional between top-level statements when the separation between them is not ambiguous, and most OCaml files do not contain a single ;;. The ;; is mostly used in the interactive interpreter, where it signals that the user has finished their input and that the interpreter can parse, type and evaluate it. Even in the interpreter, OCaml allows multiple statements in one go:

# let a = 1
  let b = 2;;
val a : int = 1
val b : int = 2

Our current interpreter would report a syntax error when seeing the second let where it expects a ;;.

The OCaml interpreter parses further than a ;;, but it types and evaluates one ;;-delimited block at a time. As a result, the following file with a syntax error does not print its message when evaluated:

syntax_error.ml
print_endline "Hello";;
let let
% ocaml syntax_error.ml
File "syntax_error.ml", line 2, characters 4-7:
Error: Syntax error

However, the following code with a type error is partly executed:

type_error.ml
print_endline "Hello";;
1 + 2.0
% ocaml type_error.ml
Hello
File "type_error.ml", line 2, characters 4-7:
Error: This expression has type float but an expression was expected of type
         int

I don't know exactly how the OCaml interpreter buffers and processes file input, but we will do something simpler. We parse, type and execute until a ;; or end-of-file is found. We can use this process both for interactive input and for file input.

For files, we could have considered parsing and typing the whole file in a single step. While this would prevent partial execution, it would also prevent calling our interpreter on a named pipe (our interpreter would block until end-of-file is reached, defeating the purpose of a named pipe in most cases). Our approach is also a bit more memory efficient, as it does not have to build the whole-file AST in memory. Of course, if the given file contains no ;;, the behavior is the same as whole-file processing.

In practice, we rename Parser.interactive to top_levels and make it return a list of top-level statements instead of just one. The grammar is modified accordingly. Here is a relevant excerpt:

parser.mly
...
%start top_levels
%type<Position.t Ast.top_level list> top_levels
 
%%
 
decl_or_dirs
   : decl_or_dir             { [$1] }
   | decl_or_dirs decl_or_dir  { $2 :: $1 }
 
top_levels
   : EOF                   { raise End_of_file }
   | decl_or_dirs SEMI_SEMI { List.rev $1 }
   | decl_or_dirs EOF      { List.rev $1 }
   | expr SEMI_SEMI        { [Eval $1] }
   | expr EOF              { [Eval $1] }
 
decl_or_dir
   : LET IDENT EQUAL expr  { Bind_val($2, $4) }
   | LET IDENT args COLON type_ EQUAL expr
                           { Bind_fun(false, $2, List.rev $3, $5, $7) }
   | LET REC IDENT args COLON type_ EQUAL expr
                           { Bind_fun(true, $3, List.rev $4, $6, $8) }
   | EXTERN IDENT COLON type_ EQUAL STRING { External($2, $4, $6) }
   | directive             { $1 }
   | error                 { raise (Error(pos 1, "syntax error")) }
...
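The driver side can then be sketched as a loop over ;;-delimited batches. Parser.top_levels and its End_of_file behaviour come from the grammar above; handle_top_level is a hypothetical callback standing in for the typing and evaluation steps.

```ocaml
(* Repeatedly parse up to the next ;; (or end-of-file) and process the
   resulting batch of top-level statements. Works unchanged for both
   interactive input and file input. *)
let process_buffer handle_top_level lexbuf =
   try
      while true do
         let tops = Parser.top_levels Lexer.token lexbuf in
         List.iter handle_top_level tops
      done
   with End_of_file -> ()
```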

Compiling

We simply compile top-level statements in sequence, using the same machinery as before. The only difference is that we do not systematically print the evaluated results on standard output. While desirable in interactive mode, this is not for batch processing. I thus added a must_print argument in the relevant places to control whether result printing should be compiled in.

A second change is the addition of a dispose function to reclaim resources associated with the compiler. It simply releases the underlying module and manually clears the tables.

compiler.ml
...
let dispose comp =
   dispose_module comp.module_;
   Hashtbl.clear comp.globals;
   Hashtbl.clear comp.args

Driver

Command-line parsing has been modified to accept arguments in addition to options. Each argument is verified to be an existing file, and then queued for processing.

When command-line parsing is done, we check whether there are files in the queue. If there are, we process them one at a time. Each file is processed in a fresh environment (compiler, lexing buffer, type environment). Once done with a file, we call dispose on its compiler to release the underlying LLVM module.

When processing a file, we stop at the first encountered error, processing neither additional tokens from the faulty file, nor additional files still queued for processing.

When no file has been queued for processing, we start the interpreter as before.

As we use the same ;;-delimited block approach for both interactive and file processing, most of the logic is common to the two cases and has been factored accordingly in the code.

photon.ml has been significantly refactored; the modifications are numerous but simple, so I refer you directly to the code for details.

Source Code

The code accompanying this article is available in archive photon-tut-15.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 15-batch_processing
2012/01/24 22:00 · csoldani · 0 Comments

Photon Compiler Development: Improved Unit Handling

This is the 14th article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we improve the compilation of unit values.

What Is Unit?

We already discussed the Unit type when we considered external declarations. This type can only represent a single value, also named unit and denoted () in OCaml.

As the unit value is unequivocally known from its type alone, there is no need to physically store or return unit values, or to pass them as arguments.

Current Implementation

In the current photon compiler, we use a 1-bit integer type to represent unit values. To avoid needlessly storing many copies of the unit value, we use a single .unit constant for this purpose. Unit arguments are also ignored, and when unit is the return type of a function, it is replaced by void.

Other than having a dubious unit-typed constant, our code has a serious bug regarding unit arguments, which are fully ignored, i.e. not even evaluated. Here is an example of the surprising behaviour:

# extern puts : string -> unit = "puts";;
extern puts : string -> unit = "puts"
# let f (u : unit) : unit = puts "f called";;
val f : unit -> unit = <fun>
# f (puts "Hello");;
f called
- : unit = ()

Contrary to what one might expect, no “Hello” message is printed.

Improvement

Rather than using a 1-bit integer type for unit, we will use void_type.

There is no longer anything special to do for function return types, which will already be converted to void_type by llvm_type_of.

When unit values are needed, we will use undefined values through LLVM's undef. These values of course cannot be used, but we never will. If we do by mistake, an error will quickly remind us, which is better than silently mishandling 1-bit values.
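With LLVM's OCaml bindings, producing such a placeholder is a one-liner; this is a sketch, with context standing for our LLVM context value.

```ocaml
(* An undefined value of LLVM type void: threaded through the compiler
   as the representation of (), but never actually inspected. *)
let unit_value = Llvm.undef (Llvm.void_type context)
```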

As unit values are now direct LLVM values rather than pointers to the .unit global unit constant, we modify Llvm_utils.is_function to accept non-pointer arguments. We also add an is_pointer function to discriminate between references and direct values. The new variable lookup procedure is now as follows:

(* Look up variable [id]. *)
let lookup_var comp id =
   try Hashtbl.find comp.args id
   with Not_found ->
      let g = Hashtbl.find comp.globals id in
      if is_function_pointer g then g
      else if is_global_constant g then global_initializer g
      else if is_pointer g then build_load g id comp.builder
      else g

Unit arguments are still filtered out for function calls, but now only after being compiled. We compile all arguments (whether unit-typed or not) from left to right. Although one should not rely on this evaluation order, it is the most intuitive one if you do rely on it anyway.

let rec compile comp ast =
   match ast.node with
     ...
   | Call(f, args) ->
        (* Compile arguments, discarding unit-typed ones once compiled *)
        let handle_arg args arg =
           let arg' = compile comp arg in
           if type_of_ast arg = Unit then args
           else arg' :: args in
        let args = List.rev (List.fold_left handle_arg [] args) in
        (* Compile and call function with compiled and filtered arguments *)
        let f' = compile comp f in
        build_call f' (Array.of_list args) "" comp.builder

Our example session now runs fine:

# extern puts : string -> unit = "puts";;
extern puts : string -> unit = "puts"
# let f (u : unit) : unit = puts "f called";;
val f : unit -> unit = <fun>
# f (puts "Hello");;
Hello
f called
- : unit = ()

Source Code

The code accompanying this article is available in archive photon-tut-14.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 14-improved_unit_handling

In the next installment, we will cover batch processing.

2012/01/24 08:28 · csoldani · 0 Comments
1) If you really try this example, it actually breaks directly after the id declaration. Our compiler cannot compile functions with (partially) unknown types yet.
2) By the way, everything I read from Simon Peyton-Jones was worth reading.
start.1259654416.txt.gz · Last modified: 2011/02/20 20:28 (external edit)