User Tools

Site Tools


Photon Compiler Development: External Declarations

This is the seventh article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we will add external declarations, i.e. a foreign function interface to allow interaction with C libraries.

Differences with OCaml Runtime Model

While we intend our compiler to accept a subset of the OCaml language, we do not intend to support OCaml runtime model. We will not have any binary compatibility, i.e. photon-generated modules and libraries will be incompatible with OCaml-generated ones and vice versa.

While it would be interesting to be able to interact gracefully with OCaml libraries, it would require a deep understanding of OCaml internals (garbage collection, exception mechanisms, binary formats, …) and impose stringent requirements on our implementation (which should always remain runtime-compatible).

As we intend Photon to diverge from OCaml later, it is not worth the effort.

As the foreign function interface (FFI) of OCaml is dependent of its runtime model, we will not support the OCaml FFI (no external keyword). We will need a FFI however (our compiler will need at least bindings to the LLVM library). We thus have chosen to extend our OCaml subset with a custom FFI. We stay very close to the OCaml syntax, but we use extern instead of external to avoid confusion. The semantics of our FFI is described in the next two sections.

Value Representation

OCaml has a uniform value representation (see the OCaml Manual). While this approach is probably the simplest, it has some drawbacks.

  • It implies a performance loss, due mostly to boxing.
  • It requires more memory for storage of values of size less than a machine word size.
  • It makes marshalling a bit complex.
  • It imposes a clumsy foreign function interface, as we will see in a moment.

In Photon, we will use specific, direct representations for primitive types. While it is a bit more complex, we hope the performance gain will be worth the effort and it simplifies the foreign function interface.

Foreign Function Interface

In OCaml, you cannot directly call generic C code, you have to use (and write!) a wrapper for every C function you want to import. The reason is precisely the value representation. Arguments and return value must be translated from/to OCaml values to/from C-compatible values in the wrapper.

For example, imagine you want to import C sine function double sin(double). You would be tempted to write

external sin : float -> float = "sin"

but this fails as the argument that will be passed to C's sin is not the floating-point value but an OCaml value, i.e. a pointer to a structure containing (among other things) the floating-point value.

The correct way to do it would be to define

external sin : float -> float = "sin_wrapper"

on the OCaml side and the following wrapper on the C side.

#include <math.h>
#include <caml/alloc.h>
#include <caml/mlvalues.h>
CAMLprim value sin_wrapper(value x)
   double x_double = Double_val(x);
   double res_double = sin(x_double);
   return caml_copy_double(res_double);

In Photon, one imports the C function directly.

extern sin : float -> float = "sin"

Here is the mapping between primitive Photon types and equivalent C types, along with their LLVM representation. the handling of bool and unit is a bit specific and detailed in the two next sections.

Photon C LLVM
bool int32_t i1 signext
char char i8
float double double
int int i32 or i64
string char * i8 *
unit void i1 or void

The int type corresponds to the machine word size integer, i.e. a 32-bit integer on a 32-bit machine, and a 64-bit integer on a 64-bit machine. Note that it is not strictly equivalent to the OCaml int type (31-bit on a 32-bit machine, and 63-bit on a 64-bit machine). We assumed however that the difference was not significant enough to justify naming our integer type by another name. We chose not to use nativeint either. Although it is of the right bit-width, it is boxed in OCaml and it would have made the code generally heavier.

Bool Type

Our boolean type bool is represented by a 1-bit integer. It is convenient as all boolean operations in LLVM have i1 arguments or return values. It is also the more compact representation.

However, C has no type smaller than a byte, i.e. 8 bits. We can thus not pass bools directly as arguments to a C function or receive them directly as the result of a C function.

However, rather than having to wrap any foreign code that uses or produces bool values, we will use the facility provided by the signext LLVM attribute. When this attribute is set on a function argument, it will be sign-extended to 32-bits by the caller. On the contrary, if signext attribute is set on a function, its return value is assumed to be 32-bit long and is coerced to the return type. Both operations are handled transparently by LLVM.

For all external declarations, we will set automatically the signext attributes on bool arguments and functions returning a value of type bool. These should thus be int32_t on the C side, as the following example illustrates:

extern f : bool -> int = "f";;
extern g : char -> bool = "g";;
int f(int32_t b) { /* ... */ }
int32_t g(char c) { /* ... */ }

Beware of the fact that this transformation is only applied on function arguments and return values. If you declare an external global variable of type bool, it is assumed to be of LLVM i1 type and is not converted automatically to i32 (except of course, when passed as argument to an external function).

Unit Type

The unit type is the type that comprises only a single element, also called unit. The unit value is denoted () in OCaml. It is similar but more flexible than the void type which has no element at all (see unit type for an example).

However, unit has not to be implemented literally. Any function that receives a unit-type argument already knows its value, (). It is thus not necessary to really pass the value. Similarly, a function returning () has no need to return it in practice, as the caller already knows the return value from the type.

When an external function receives an argument of unit type, we will not pass it. It should be considered absent on the C side. When an external function returns unit, we will use the void type instead. This will avoid unnecessary argument and return value passing, while simplifying the FFI at the same time. We can use unit as argument type when an external function takes no argument and as return type for void functions.

Here is an example:

extern get_char : unit -> int = "getchar";; (* int getchar(void) *)
extern puts : string -> unit = "puts";; (* void puts(char * ) *)


We introduce three new tokens, one for extern, one for and one for colons.

Rather than returning (string * Position.t Ast.t) pairs from the parser, we now return a new 'a Ast.top_level type.

(** Top-level input phrase. *)
type 'a top_level =
     BindVal of string * 'a t
   | Eval of 'a t
   | External of string * Type.t * string

In addition to Eval ast for expressions (formerly (“-”, ast)) and BindVal(id, ast) for bindings (formerly (id, ast)), it contains a constructor for external declarations comprising the locally given name, the type and finally the external name.

A new rule is added to interactive, which is also adapted to the new return type.

   | expr SEMI_SEMI        { Eval $1 }
   | LET IDENT EQUAL expr SEMI_SEMI { BindVal($2, $4) }
                           { External($2, mk_fun_type $4, $6) }

The new types rule collects the declared type through the auxiliary function mk_fun_type, given below. We add a new constructor Ref to Type.t to denote a type by its name while recording its position in input source code (see typing).

   : type_                 { [$1] }
   | types ARROW type_     { $3 :: $1 }
   : IDENT                 { Type.Ref(pos 1, $1) }
   | LPAR types RPAR       { mk_fun_type $2 }
   | LPAR types error      { raise (Error(pos 3, Unclosed_paren (pos 1))) }

mk_fun_type turns a reversed list of types [t_n; …; t_1] into a function type (t_1 → … → t_n). If a single type is provided (external variable rather than function), it is returned directly.

Fun(c, [a; b]) and Fun(Fun(c, [b]), [a]) are equivalent from a typing point of view. As is right-associative, a → b → c is equivalent to a → (b → c). However, they have not the same representation and will not be matched by the same patterns. To avoid this issue, mk_fun_type will always return the former (flattened) type even if the latter representation is given. This allows simpler typing code and will become especially important when we will introduce currying later.

let mk_fun_type = function
     [] -> failwith "mk_fun_type"
   | [t] -> t
   | t :: ts ->
        match t with
          Type.Fun(t', ts') ->
             Type.Fun(t', List.rev ts @ ts')
        | _ -> Type.Fun(t, List.rev ts)


The Type.t type is extended with three constructors. One to refer to type by name (recording the name position in case of error), one for the function type and one for the unit type. As just explained in parsing, the function type considers many arguments at once, even if it is represented in curried form in the concrete syntax. For now, function types are all built through Parser.mk_fun_type which ensures that the return value type is not itself a function.

type t =
   | Fun of t * t list (** function (return type, argument types) *)
   | Ref of Position.t * string (** type given by name *)
   | Unit

When representing a function type by a string, we are careful to add parentheses when necessary. string_of_type will also add parentheses around the return type if it is not atomic. We can thus discriminate between Fun(c, [a; b]) and Fun(Fun(c, [b]), [a]) as the former one will be printed as a → b → c while the later will be printed as a → (b → c).

let rec string_of_type = function
   | Fun(ret, args) ->
        let is_atomic = function
             Bool | Char | Float | Int | Ref(_, _) | String | Unit -> true
           | Fun(_, _ ) -> false in
        let prepend_arg t s =
           let t_str = string_of_type t in
           (if is_atomic t then t_str else "(" ^ t_str ^ ")") ^ " -> " ^ s in
        let ret_str = string_of_type ret in
        List.fold_right prepend_arg args
           (if is_atomic ret then ret_str else "(" ^ ret_str ^ ")")
   | Ref(_, s) -> s
   | Unit -> "unit"

The Typer is responsible for the conversion of an external declaration parsed type (full of Refs) into a concrete de-referenced type. This is achieved through an extension of type environments to record not only variable bindings, but also type definitions (i.e. mappings from type references to concrete types).

When created, the type environment is populated with type definitions for all the primitive types.
open Type
type t = {
   vars : (string, Type.t) Hashtbl.t;
   types : (string, Type.t) Hashtbl.t;
let add_type type_env = Hashtbl.add type_env.types
let create () =
   let type_env = {
      vars = Hashtbl.create 16;
      types = Hashtbl.create 16;
   } in
   add_type type_env "bool" Bool;
   add_type type_env "char" Char;
   add_type type_env "float" Float;
   add_type type_env "int" Int;
   add_type type_env "string" String;
   add_type type_env "unit" Unit;
let add type_env = Hashtbl.add type_env.vars
let find type_env = Hashtbl.find type_env.vars
let find_type type_env = Hashtbl.find type_env.types

The de-referencing operation in Typer is then trivially defined as follows. When a type reference does not corresponds to any defined type, the new error Unbound_type is thrown.

let rec replace_refs type_env t =
   match t with
     Bool | Char | Float | Int | String | Unit -> t
   | Fun(ret, args) ->
        let ret' = replace_refs type_env ret in
        let args' = (replace_refs type_env) args in
        if ret' == ret && List.for_all2 (==) args args' then t
        else Fun(ret', args')
   | Ref(pos, id) ->
        try Type_env.find_type type_env id
        with Not_found -> raise (Error(pos, Unbound_type id))


Handling of global Variables

Our approach of renaming old variables when defining new ones becomes unsafe in the presence of external bindings. If we change the name of a global variable in our module, it will not change in the external module where it is defined and we will get linking errors.

Ignoring the issue of which symbols will be exported for now (we will revisit this topic when we will consider multiple modules compilation later on), we will let LLVM rename not the old variable but the new one, when a collision occurs. To be able to always refer to last defined variable with its binding name, we use a hash table mapping variable names to the corresponding global variable (whatever its symbol name may be).

As both LLVM and Photon use the same namespace for variables and functions, we use a single hash table for both.

We need to thread this hash table along with the LLVM module. Rather than expose it directly as a new argument, we define a new Compiler.t type which encapsulates both the llmodule and the hash table for globals.

type t = {
   module_ : llmodule;
   globals : (string, llvalue) Hashtbl.t;

create_and_initialize_module is renamed to simply create and is made to return a Compiler.t instead of just a llmodule.

All llmodule arguments are replaced by a Compiler.t argument. compile, which used the module indirectly through builder, is given a new argument.

Variable look-up is modified to use the globals hash table rather than looking up the global variable directly. Now that variables can also denote functions, we also take care to differentiate between global variables and functions (which should never be the target of a load instruction).

let rec compile comp ast builder =
   match ast.node with
   | Var id ->
        let g = Hashtbl.find comp.globals id in
        if classify_type (element_type (type_of g)) = TypeKind.Function then g
           if is_global_constant g then global_initializer g
           else build_load g id builder

Finally, we add a llvm_module function to return the LLVM module which is associated to the given Compiler.t. This is necessary for dumping the module and building the interpreter.

External Declarations

We first define a function llvm_type_of to convert our internal type representation Type.t to the LLVM lltype. For functions, we replace unit by void in return types and ignore unit arguments completely.

let rec llvm_type_of = function
     Bool -> i1_type
   | Char -> i8_type
   | Float -> double_type
   | Fun(ret, args) ->
        (* Convert unit to void in return type and omit it in arguments *)
        let ret = if ret = Unit then void_type else llvm_type_of ret in
        let args = llvm_type_of (List.filter ((<>) Unit) args) in
        function_type ret (Array.of_list args)
   | Int -> int_type
   | Ref(_, s) -> failwith ("llvm_type_of: remaining type ref '" ^ s ^ "'")
   | String -> pointer_type i8_type
   | Unit -> i1_type

Then, an external declaration is handled as follows. If the value being defined is a variable (i.e. not a function), we simply declare a new global variable for the external name with the given type. If the global variable is already declared, it is returned so that we can handle redundant definitions. If the type of the newly defined variable does not match the type of the already declared variable, a bitcast to the new type is returned. We could thus have different views of the same external variable.

Functions are handled similarly, except that we also take care of sign-extending arguments or return values that are fewer than 8-bit long as described for the bool type.

We finish by storing a reference to the declared global variable or function under the local name in the globals hash table.

let declare_external comp in_id t out_id =
   let need_extend t =
      classify_type t = TypeKind.Integer && integer_bitwidth t < 8 in
   let t = llvm_type_of t in
   let ext =
      if classify_type t = TypeKind.Function then (
         let f = declare_function out_id t comp.module_ in
         let ret = return_type t in
         if need_extend ret then add_function_attr f Attribute.Sext;
         let extend_if_needed p =
            if need_extend (type_of p) then add_param_attr p Attribute.Sext in
         Array.iter extend_if_needed (params f);
      ) else declare_global t out_id comp.module_ in
   Hashtbl.add comp.globals in_id ext

The create function is modified to declare printer functions through declare_external. We choose Print.<type> as their internal name. We also add a new string constant “<fun>”.

let create name =
   let m = create_module name in
   let globals = Hashtbl.create 16 in
   (* String constants *)
   let add_string_constant name s =
      let g = define_global name (const_stringz s) m in
      set_linkage Linkage_internal g;
      set_global_constant true g in
   add_string_constant ".newline" "\n";
   add_string_constant ".abstract" "<abstract>";
   add_string_constant ".fun" "<fun>";
   (* Declare external [printf] *)
   let printf_type = var_arg_function_type int_type [|pointer_type i8_type|] in
   ignore (declare_function "printf" printf_type m);
   (* Intrinsics *)
   let t = function_type double_type [|double_type; double_type|] in
   ignore (declare_function "llvm.pow.f64" t m);
   (* Declare builtin printers *)
   let comp = { module_ = m; globals = globals } in
   declare_external comp "Print.bool" (Fun(Unit, [Bool])) "print_bool";
   declare_external comp "Print.char" (Fun(Unit, [Char])) "print_char";
   declare_external comp "Print.float" (Fun(Unit, [Float])) "print_float";
   declare_external comp "" (Fun(Unit, [Int])) "print_int";
   declare_external comp "Print.string" (Fun(Unit, [String])) "print_string";
   declare_external comp "Print.unit" (Fun(Unit, [Unit])) "print_unit";

Finally, compile_evaluator_and_printer is adapted to make use of the globals hash table and local printer names.

let compile_evaluator_and printer comp id ast =
   (* Store result if an [id] is not "-" *)
   if id <> "-" then (
      if is_constant value then (
         let g = define_global id value comp.module_ in
         set_global_constant true g;
         Hashtbl.add comp.globals id g
      ) else (
         let undefined = undef (type_of value) in
         let g = define_global id undefined comp.module_ in
         ignore (build_store value g builder);
         Hashtbl.add comp.globals id g
   (* Prints result *)
   let printer_name = "Print." ^ string_of_type val_type in
   begin try
      let f = Hashtbl.find comp.globals printer_name in
      if val_type = Unit then ignore (build_call f [||] "" builder)
      else ignore (build_call f [|value|] "" builder)
   with Not_found ->
      let str =
         match val_type with
           Fun(_, _) -> unsafe_lookup_global ".fun" comp.module_
         | _ -> unsafe_lookup_global ".abstract" comp.module_ in
      let p_str = build_in_bounds_gep str [|zero; zero|] "p_str" builder in
      ignore (build_call printf [|p_str|] "" builder)


Our main function is modified according to the change in the parser return type. The previous code to process (identifier, ast) pairs is moved into an internal function compute_and_run in order to be reused for both Eval and BindVal top-level input phrases.

The handling of external declarations proceeds as follows. We first de-reference the parsed type. We print it and it is bound to the local identifier in the type environment. We then call Compiler.declare_external to record its declaration.

Here is the full main function.

let main () =
   let lexbuf = Lexing.from_function refill in
   Lexer.set_file_name lexbuf "<stdin>";
   let comp = Compiler.create "Photon JIT" in
   let interp = Interpreter.create (Compiler.llvm_module comp) in
   let type_env = Type_env.create () in
   let compute_and_run id ast =
      let typed_ast = Typer.annotate type_env ast in
      let t = snd typed_ast.label in
      print_endline (string_of_type t);
      if id <> "-" then Type_env.add type_env id t;
      let f = Compiler.compile_evaluator_and_printer comp id typed_ast in
      Llvm.dump_value f;
      Interpreter.run_and_delete interp f in
   (* Read-eval-print loop *)
      while true do
         (* We drop any input that possibly remains, either because it follows
          * [;;], or because an error was met *)
         Lexer.reset lexbuf;
         reset_input ();
            let top_level = Parser.interactive Lexer.token lexbuf in
            print_endline (string_of_top_level top_level);
            match top_level with
              BindVal(id, ast) -> compute_and_run id ast
            | Eval ast -> compute_and_run "-" ast
            | External(in_id, t, out_id) ->
                 let t = Typer.replace_refs type_env t in
                 print_endline (string_of_type t);
                 Type_env.add type_env in_id t;
                 Compiler.declare_external comp in_id t out_id
           Lexer.Error err -> prerr_endline (Lexer.string_of_error err)
         | Parser.Error err -> prerr_endline (Parser.string_of_error err)
         | Typer.Error err -> prerr_endline (Typer.string_of_error err)
   with End_of_file -> Llvm.dump_module (Compiler.llvm_module comp)


In this installment, we added a FFI to our language. Rather than implementing faithfully the OCaml external mechanism, we introduced a new keyword extern to implement our own, simpler FFI.

Doing so, Photon is no more an OCaml subset stricto sensu. However, you could consider that our OCaml subset has no FFI at all and hide all the FFI use in the standard library. Keeping that in mind, we will continue to describe Photon as an OCaml subset.

Our FFI is completely useless at the moment, as we have no way to call the defined functions. We will correct this by adding function calls in the next installment.

Source Code

The code accompanying this article is available in archive photon-tut-07.tar.xz or through the git repository:

 git clone
 cd photon
 git checkout 7-external_declarations


Enter your comment. Wiki syntax is allowed:
blog/2011/01/28/external_declarations.txt · Last modified: 2011/02/20 20:28 (external edit)