User Tools

Site Tools


Photon Compiler Development: Basic Pipeline

At first was the value.

This is the fourth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we will add typing, compilation and interpretation to finish our very first complete pipeline for a trivial subset of our language (constant values of built-in types).


Similarly to OCaml, Photon will be a statically typed language. That means that types are determined at compile time. This presents three advantages :

  • The code is safer. A lot of programming errors are catched by the compiler during compilation because types won't match. This is contrary to dynamic languages like Python were an invalid statement can stay undetected until actually executed.
  • The code is more efficient. Types being known, the code generator can directly emit calls to the right functions or emits specialized instructions rather than having to find out what to call or do at run time.
  • Types document the code. The intended usage of functions and variables is partially reflected in their type, which should be available to the developer. This is true of explicit types (type annotations provided by the developer) but also of implicit types that can be asked for to the compiler (e.g. when we ask the OCaml compiler to infer an interface with ocamlc -i).

The drawback of static typing is that it can be a bit too restrictive at times, but we will later introduce constructs to regain flexibility in most such cases.

In our previous post, we parsed constant values of built-in types into a Position.t Ast.t. We will transform these ASTs by labeling nodes also with the type of the expression they represent, giving a (Position.t, Type.t) Ast.t.



The definition of types is just a enumeration. Later on, as we will have more complex types like function or parametric types (e.g. 'a list), that will need a more elaborate type for types, but let us just keep things simple for now.

type t =
     Bool (** boolean *)
   | Char (** 1-byte character *)
   | Float (** double-precision floating-point value *)
   | Int (** signed WORD_SIZE-bit integer *)
   | String (** character string *)

Type Inference

Type inference is fun and we will explore some of its aspects in a later installment (in the meantime, you can still have a look at our post Typing Haskell in ML).

For the present time, it is trivial. As we only have literal constants, just return the associated type.

let type_of_lit = function
     Ast.Bool _ -> Type.Bool
   | Ast.Char _ -> Type.Char
   | Ast.Float _ -> Type.Float
   | Ast.Int _ -> Type.Int
   | Ast.String _ -> Type.String
let annotate ast =
     match ast.node with
       Literal lit ->
          { label = (ast.label, type_of_lit lit); node = ast.node }



Simpler LLVM Interface

We use the Low-Level Virtual Machine as our compiler backend (our machine code generator or just-in-time interpreter). The LLVM library has official OCaml bindings. These bindings are however limited to the essential. We will extend them a bit in a Llvm_utils module to ease our life.

We will need to refine our build setup to use the LLVM library. This will be discussed in building our interpreter.

Fixing a potential Bug

The first thing we define is a new binding to create modules. In some versions of the OCaml bindings (at least the 2.6), the provided binding for create_module has a bug.

OCaml bindings have two parts, an OCaml external declaration and a C wrapper.

In LLVM 2.6, the OCaml external declaration has two arguments: an LLVM context and a string for the name. However, the corresponding C code has only one: the name. The C wrapper then pass the context instead of the name to LLVMModuleCreateWithName. This results in a wrong module name, or even crashes.

Our C wrapper is

#include <caml/mlvalues.h>
#include <llvm-c/Core.h>
CAMLprim LLVMModuleRef llvm_create_module_fixed(LLVMContextRef c, value id)
   return LLVMModuleCreateWithNameInContext(String_val(id), c);

The corresponding external declaration in Llvm_utils is

external create_module : llcontext -> string -> llmodule =


A lot of LLVM functions take a context argument, but we will just use the global context. Profiting from the fact that OCaml definitions are not recursive by default (they need the rec keyword to be recursive), we re-define a bunch of functions to avoid the need to pass the global context explicitly. Along the way, we define an int_type type to represent native integers, whether that means 32 or 64 bits.

(** {2 De-contextified types} *)
let void_type = void_type (global_context ())
let double_type = double_type (global_context ())
let i1_type = i1_type (global_context ())
let i8_type = i8_type (global_context ())
let i16_type = i16_type (global_context ())
let i32_type = i32_type (global_context ())
let i64_type = i64_type (global_context ())
(** Native integer type. *)
let int_type =
   match Sys.word_size with
     32 -> i32_type
   | 64 -> i64_type
   | _ -> failwith "int_type: unsupported word size"
(** {2 De-contextified functions} *)
let builder_at_end = builder_at_end (global_context ())
let const_stringz = const_stringz (global_context ())
let create_module = create_module (global_context ())


Finally, we define some helper functions and values. A i32 zero for use with the getelementptr instruction, two functions to lookup global variables or functions in a module, raising an exception if it is not found instead of returning an option type (for cases when we are sure the function or global variable is defined), and a function to allocate strings locally (see basic compilation).

(** Constant zero of [i32] type for [getelementptr] instructions. *)
let zero = const_int i32_type 0
(** [unsafe_lookup_function name module_] returns function named [name] in
  * module [module_]. Throws an exception if not found. *)
let unsafe_lookup_function name module_ =
   match lookup_function name module_ with
     Some f -> f
   | None -> failwith ("function " ^ name ^ " is not defined in module")
(** [unsafe_lookup_global name module_] returns global named [name] in module
  * [module_]. Throws an exception if not found. *)
let unsafe_lookup_global name module_ =
   match lookup_global name module_ with
     Some g -> g
   | None -> failwith ("global " ^ name ^ " is not defined in module")
(** [build_local_stringz s builder] allocates locally a buffer to hold string
  * [s], copy [s] to allocated buffer and returns a pointer to its first
  * character. *)
let build_local_stringz s builder =
   let len = String.length s in
   let buf =
      build_alloca (array_type i8_type (len + 1)) "str" builder in
   let s = const_stringz s in
   ignore (build_store s buf builder);
   build_gep buf [|zero; zero|] "p_str" builder

The build_local_stringz deserves a word of explanation. It takes a builder as argument. A builder is like a position in the LLVM IR. When an LLVM function takes a builder as argument, it emits code just before the builder position (alternatively, one could say that instructions are emitted at builder position, followed by a move of the builder at the end of emitted instructions).

build_alloca emits an alloca instruction, i.e. reserves space on stack. It returns a pointer to allocated space.

const_stringz builds a constant array of characters to hold the given string plus a terminating zero character.

build_store copies the constant array of characters into the buffer that has been allocated on stack.

Finally, build_gep is the OCaml binding for the dreaded getelementptr instruction (see The Often Misunderstood GEP Instruction). It is quite simple when you grasped it. Here, buf has type [len x i8]* (pointer to an array of len bytes). The first zero index will compute the address of the first (and only) array, while the second will select the address of the first character of this array. The resulting address has type i8*, our string type (see basic compilation).

Basic Compilation

We represent booleans by integers. As we only need one bit to store a boolean value, we use the i1 type. We use 1 for true and 0 for false. Our representation of booleans is the exact representation expected by LLVM so that we will have no conversions to do when we will use LLVM instructions that take boolean arguments.

We represent characters by 8-bit integers, i.e. bytes. A unicode character may thus not fit on a single Photon character at the moment.

We represent floating-point numbers by double-precision IEEE 754 numbers.

We represent integer number by machine words, i.e. 32-bit or 64-bit numbers depending on the platform.

Finally, we represent strings by pointers to the first character of a character array. The character array itself is obtained by const_stringz. We tried to obtain directly a pointer to the first character of the array returned by const_stringz. However, our version of the bindings segfaults when we try to do that. We could instead have stored the string into a global variable (as apparently intended by const_stringz), but we would then have leaked memory in our interpreter (the library provides no way to free global variables). Finally, we decided to allocate a buffer of the right size on the stack, where we store the string content before returning a pointer to its first character. This is precisely what build_local_stringz does. Note that we need to know where to insert the buffer, so that both build_local_stringz and our compilation function will need an LLVM instruction builder (an helper object to emit LLVM instructions).

let compile_lit lit builder = 
   match lit with
     Ast.Bool b -> const_int i1_type (if b then 1 else 0)
   | Ast.Char c -> const_int i8_type (int_of_char c)
   | Ast.Float f -> const_float double_type f
   | Ast.Int i -> const_int int_type i
   | Ast.String s -> build_local_stringz s builder
let compile ast builder =
   match ast.node with
     Literal lit -> compile_lit lit builder

Printing Values

We just saw that while most of our values are just LLVM constants, the string literals require to emit instructions. Where should we put these instructions? We wrap the compilation of a top-level input phrase into a function. Instructions will be emitted in the body of this function, i.e. we will pass a builder positioned at function body to compile. We call our wrapping function compile_evaluator_and_printer. In addition to the AST, it takes an LLVM module into which to define the anonymous function.

In the LLVM tutorial, expressions are compiled as functions that return the computed value. This value is then extracted by GenericValue.as_float. This approach has two disadvantages. It is not easy to extract values which are not provided for by the GenericValue module and it disallows a custom printing routine provided by the user in the target language.

Instead, we will print computed values directly in the compile function, using LLVM code and external bindings (see next section). Once the value has been computed, it is passed to the function print_<type of value>. If no such function exists, we print <abstract> instead. We do not use GenericValue at all.

Before printing the value, we first print “- : type = ” like the OCaml top-level.

We store two strings as global variables rather than allocating them on the stack. These are <abstract> and \n. These strings are potentially reused over and over so it makes sense to keep a permanent copy of them. They are used in much the same way as buffers in build_local_stringz, through build_gep.

To finish, we call assert_valid_function on our function before returning it. This LLVM function performs a number of checks on the given function, throwing an exception if an inconsistency is met. This should not happen in the production code if we coded our compiler correctly, but is very useful during its development.

let compile_evaluator_and_printer module_ ast =
   (* Compile anonymous function to compute and print result *)
   let f_type = function_type void_type [||] in
   let f = define_function "" f_type module_ in
   let builder = builder_at_end (entry_block f) in
   (* Computes result *)
   let value = compile ast builder in
   (* Prints type *)
   let printf = unsafe_lookup_function "printf" module_ in
   let val_type = snd ast.label in
   let val_type_str = string_of_type val_type in
   let prefix =
      build_local_stringz ("- : " ^ val_type_str ^ " = ") builder in
   ignore (build_call printf [|prefix|] "" builder);
   (* Prints result *)
   let printer_name = "print_" ^ string_of_type val_type in
   begin match lookup_function printer_name module_ with
     Some f -> ignore (build_call f [|value|] "" builder)
   | None ->
        let abstract = unsafe_lookup_global "__abstract" module_ in
        let abstract =
           build_gep abstract [|zero; zero|] "abstract" builder in
        ignore (build_call printf [|abstract|] "" builder)
   let newline = unsafe_lookup_global "__newline" module_ in
   let newline = build_gep newline [|zero; zero|] "newline" builder in
   ignore (build_call printf [|newline|] "" builder);
   ignore (build_ret_void builder);
   assert_valid_function f;


The preceding code supposed that a number of global variables and functions were defined in the given LLVM module. We thus provides a create function to create and initialize an LLVM module, that will add all that is necessary for our printing code.

Bindings have two parts. A declaration in the LLVM module (with declare_function) and an implementation which we put into bindings.c and its accompanying header.

We will soon be able to define some functions which are now part of the bindings in Photon, and define bindings directly in Photon code. The hard-coding of these bindings in the module creation and most of the code in bindings.c is just a temporary measure.

The initialization code is mainly trivial (see the documentation of define_global and declare_function), except for two functions:

  • The type of printf must be defined with var_arg_function_type rather than with function_type to account for the fact that it takes a variable number of argument.
  • The argument of print_bool must be extended. In C, there is no 1-bit integer type, the smallest unit being the byte (8 bits). We thus add the zeroext attribute to the parameter of print_bool. This attribute means that the caller should extend the value to a 32-bit1) integer before passing it to print_bool. This happens automatically at call site so that we will not have to manually convert from i1 to i32. Accordingly, in the C binding part, the print_bool functions takes an argument of uint32_t type (we could as well have used the signext attribute and int32_t type in this case).
let create_and_initialize_module name =
   let m = create_module name in
   (* String constants *)
   let add_string name s = ignore (define_global name (const_stringz s) m) in
   add_string "__newline" "\n";
   add_string "__abstract" "<abstract>";
   (* Declare external [printf] *)
   let printf_type = var_arg_function_type int_type [|pointer_type i8_type|] in
   ignore (declare_function "printf" printf_type m);
   (* Declare builtin type printers *)
   let add_fun name ret_type arg_types =
      declare_function name (function_type ret_type arg_types) m in
   ignore (add_fun "print_char" void_type [|i8_type|]);
   ignore (add_fun "print_float" void_type [|double_type|]);
   ignore (add_fun "print_int" void_type [|int_type|]);
   ignore (add_fun "print_string" void_type [|pointer_type i8_type|]);
   (* Sets [zeroext] attribute on [print_bool] argument *)
   let print_bool = add_fun "print_bool" void_type [|i1_type|] in
   let print_bool_arg = param print_bool 0 in
   add_param_attr print_bool_arg Attribute.Zext;

The bindings are also pretty simple, the only subtlety being that we escape non-printable characters.

static inline void print_escaped_char(char c)
   switch (c) {
      case '"':
      case '\\':
      case '\b':
      case '\n':
      case '\r':
      case '\t':
         if (isprint(c))
            printf("\\%03d", (uint8_t) c);
void print_bool(uint32_t b)
   if (b == 0) printf("false");
   else printf("true"); 
void print_char(char c)
void print_float(double f)
   printf("%f", f);
void print_int(int n)
   printf("%d", n);
void print_string(const char *s)
   while (*s != '\0')


The LLVM library comes with an interpretation facility under the name of an execution engine. All we have to do is associate our module to an execution engine, and then use this execution engine to execute generated functions.

Once an anonymous compute-and-print function has computed and printed its result, it is not useful anymore. In order to reclaim its allocated memory, we free corresponding machine code (in the execution engine) and the function IR (in the module) after it has run.

When we load our Interpreter module, we call initalize_native_target which is necessary to initialize the LLVM execution engine library.

(** [create module_] creates an execution engine for LLVM module [module_]. *)
let create module_ = ExecutionEngine.create (ModuleProvider.create module_)
(** [run_and_delete interp f] runs function [f] in the interpreter [interp].
  * [f] should belong to the LLVM module assotiated with execution engine
  * [interp]. *)
let run_and_delete interp f =
   (* Run function *)
   ignore (ExecutionEngine.run_function f [||] interp);
   (* Remove function *)
   ExecutionEngine.free_machine_code f interp;
   delete_function f
(* Initialize LLVM execution engine library when this module is loaded. *)
let _ = initialize_native_target ()


Our driver needs few changes. First, we need to create an LLVM module and execution engine.

Second, we pass the parsed AST to the typer, the typed AST to the compiler and the generated IR to the interpreter (which executes and deletes it). We also prints the type after typing, and the IR after compilation.

Here is the full new main

let main () =
   let lexbuf = Lexing.from_function refill in
   Lexer.set_file_name lexbuf "<stdin>";
   let mod_ = Compiler.create_and_initialize_module "Photon JIT" in
   let interp = Interpreter.create mod_ in
   (* Read-eval-print loop *)
      while true do
         (* We drop any input that possibly remains, either because it follows
          * [;;], or because an error was met *)
         Lexer.reset lexbuf;
         reset_input ();
            let ast = Parser.interactive Lexer.token lexbuf in
            print_endline (string_of_ast ast);
            let typed_ast = Typer.annotate ast in
            let t = snd typed_ast.label in
            print_endline (string_of_type t);
            let f = Compiler.compile_evaluator_and_printer mod_ typed_ast in
            Llvm.dump_value f;
            Interpreter.run_and_delete interp f
           Lexer.Error err -> prerr_endline (Lexer.string_of_error err)
         | Parser.Error err -> prerr_endline (Parser.string_of_error err)
   with End_of_file -> print_endline ""

Building our Interpreter

To compile and link with LLVM, we modify our ocamlbuild plugin. We declare the libraries, compilation and linking flags, and our fix for create_module in llvm_fix.c.

        ocaml_lib ~extern:true ~dir:"+llvm" "llvm";
        ocaml_lib ~extern:true ~dir:"+llvm" "llvm_analysis";
        ocaml_lib ~extern:true ~dir:"+llvm" "llvm_executionengine";
        flag ["compile"; "c"; "use_llvm"]
           (S [A "-ccopt"; A "`llvm-config --cflags`"]);
        flag ["link"; "use_llvm"] (S [A "-ccopt"; A "`llvm-config --libs`"]);
        dep ["link"; "use_llvm"] ["llvm_fix.o"];

We also specify a dependency on our bindings, and we set the -rdynamic link flag so that our functions will be linked in even if they are not called directly (they will be loaded dynamically by LLVM runtime).

        flag ["link"; "ocaml"; "g++"]
           (S [A "-cc"; A "g++"; A "-ccopt"; A "-rdynamic"]);
        dep ["file:bindings.c"] ["bindings.h"];
        dep ["link"; "ocaml"] ["bindings.o"]

We complement our ocamlbuild plugin by a _tags file specifying the files that use the LLVM libraries, and that photon.native should be linked with g++ (LLVM needs the C++ standard library that is only linked with when linking is done with g++).

<{,i}>:         use_llvm, use_llvm_analysis
<>:          use_llvm, use_llvm_executionengine
<llvm_fix.c>:              use_llvm
<>:           use_llvm
<>:               use_llvm
<photon.native>:           g++, use_llvm, use_llvm_executionengine
<photon.native>:           use_llvm_analysis

The procedure to build our project stays the same. make or ocamlbuild photon.native to generate our interpreter in photon.native and make doc to generate the documentation in photon.docdir.


We have now built a minimal pipeline. Of course, the language subset that is supported at the moment is completely uninteresting, but we hope we have a good basis to build on.

In the next installment, we will add arithmetic operator support to begin making our interpreter at least a little bit useful.

Source Code

The code accompanying this article is available in archive photon-tut-04.tar.xz or through the git repository:

git clone
cd photon
git checkout 4-basic_pipeline
1) Why 32-bit and not machine word size, or an arbitrary bit-width?


Enter your comment. Wiki syntax is allowed:
blog/2011/01/18/basic_pipeline.txt · Last modified: 2012/01/26 21:17 by csoldani