At first was the value.
This is the fourth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.
In this article, we will add typing, compilation and interpretation to finish our very first complete pipeline for a trivial subset of our language (constant values of built-in types).
Similarly to OCaml, Photon will be a statically typed language. That means that types are determined at compile time. This presents three advantages :
ocamlc -i
).The drawback of static typing is that it can be a bit too restrictive at times, but we will later introduce constructs to regain flexibility in most such cases.
In our previous post, we parsed constant values of built-in types into a Position.t Ast.t
. We will transform these ASTs by labeling nodes also with the type of the expression they represent, giving a (Position.t, Type.t) Ast.t
.
The definition of types is just a enumeration. Later on, as we will have more complex types like function or parametric types (e.g. 'a list
), that will need a more elaborate type for types, but let us just keep things simple for now.
type t = Bool (** boolean *) | Char (** 1-byte character *) | Float (** double-precision floating-point value *) | Int (** signed WORD_SIZE-bit integer *) | String (** character string *)
Type inference is fun and we will explore some of its aspects in a later installment (in the meantime, you can still have a look at our post Typing Haskell in ML).
For the present time, it is trivial. As we only have literal constants, just return the associated type.
let type_of_lit = function Ast.Bool _ -> Type.Bool | Ast.Char _ -> Type.Char | Ast.Float _ -> Type.Float | Ast.Int _ -> Type.Int | Ast.String _ -> Type.String let annotate ast = match ast.node with Literal lit -> { label = (ast.label, type_of_lit lit); node = ast.node }
We use the Low-Level Virtual Machine as our compiler backend (our machine code generator or just-in-time interpreter). The LLVM library has official OCaml bindings. These bindings are however limited to the essential. We will extend them a bit in a Llvm_utils
module to ease our life.
We will need to refine our build setup to use the LLVM library. This will be discussed in building our interpreter.
The first thing we define is a new binding to create modules. In some versions of the OCaml bindings (at least the 2.6), the provided binding for create_module
has a bug.
OCaml bindings have two parts, an OCaml external
declaration and a C wrapper.
In LLVM 2.6, the OCaml external declaration has two arguments: an LLVM context and a string for the name. However, the corresponding C code has only one: the name. The C wrapper then pass the context instead of the name to LLVMModuleCreateWithName
. This results in a wrong module name, or even crashes.
Our C wrapper is
#include <caml/mlvalues.h> #include <llvm-c/Core.h> CAMLprim LLVMModuleRef llvm_create_module_fixed(LLVMContextRef c, value id) { return LLVMModuleCreateWithNameInContext(String_val(id), c); }
The corresponding external declaration in Llvm_utils
is
external create_module : llcontext -> string -> llmodule = "llvm_create_module_fixed"
A lot of LLVM functions take a context argument, but we will just use the global context. Profiting from the fact that OCaml definitions are not recursive by default (they need the rec
keyword to be recursive), we re-define a bunch of functions to avoid the need to pass the global context explicitly. Along the way, we define an int_type
type to represent native integers, whether that means 32 or 64 bits.
(** {2 De-contextified types} *) let void_type = void_type (global_context ()) let double_type = double_type (global_context ()) let i1_type = i1_type (global_context ()) let i8_type = i8_type (global_context ()) let i16_type = i16_type (global_context ()) let i32_type = i32_type (global_context ()) let i64_type = i64_type (global_context ()) (** Native integer type. *) let int_type = match Sys.word_size with 32 -> i32_type | 64 -> i64_type | _ -> failwith "int_type: unsupported word size" (** {2 De-contextified functions} *) let builder_at_end = builder_at_end (global_context ()) let const_stringz = const_stringz (global_context ()) let create_module = create_module (global_context ())
Finally, we define some helper functions and values. A i32
zero for use with the getelementptr
instruction, two functions to lookup global variables or functions in a module, raising an exception if it is not found instead of returning an option
type (for cases when we are sure the function or global variable is defined), and a function to allocate strings locally (see basic compilation).
(** Constant zero of [i32] type for [getelementptr] instructions. *) let zero = const_int i32_type 0 (** [unsafe_lookup_function name module_] returns function named [name] in * module [module_]. Throws an exception if not found. *) let unsafe_lookup_function name module_ = match lookup_function name module_ with Some f -> f | None -> failwith ("function " ^ name ^ " is not defined in module") (** [unsafe_lookup_global name module_] returns global named [name] in module * [module_]. Throws an exception if not found. *) let unsafe_lookup_global name module_ = match lookup_global name module_ with Some g -> g | None -> failwith ("global " ^ name ^ " is not defined in module") (** [build_local_stringz s builder] allocates locally a buffer to hold string * [s], copy [s] to allocated buffer and returns a pointer to its first * character. *) let build_local_stringz s builder = let len = String.length s in let buf = build_alloca (array_type i8_type (len + 1)) "str" builder in let s = const_stringz s in ignore (build_store s buf builder); build_gep buf [|zero; zero|] "p_str" builder
The build_local_stringz
deserves a word of explanation. It takes a builder as argument. A builder is like a position in the LLVM IR. When an LLVM function takes a builder as argument, it emits code just before the builder position (alternatively, one could say that instructions are emitted at builder position, followed by a move of the builder at the end of emitted instructions).
build_alloca
emits an alloca
instruction, i.e. reserves space on stack. It returns a pointer to allocated space.
const_stringz
builds a constant array of characters to hold the given string plus a terminating zero character.
build_store
copies the constant array of characters into the buffer that has been allocated on stack.
Finally, build_gep
is the OCaml binding for the dreaded getelementptr
instruction (see The Often Misunderstood GEP Instruction). It is quite simple when you grasped it. Here, buf
has type [len x i8]*
(pointer to an array of len
bytes). The first zero index will compute the address of the first (and only) array, while the second will select the address of the first character of this array. The resulting address has type i8*
, our string type (see basic compilation).
We represent booleans by integers. As we only need one bit to store a boolean value, we use the i1
type. We use 1 for true
and 0 for false
. Our representation of booleans is the exact representation expected by LLVM so that we will have no conversions to do when we will use LLVM instructions that take boolean arguments.
We represent characters by 8-bit integers, i.e. bytes. A unicode character may thus not fit on a single Photon character at the moment.
We represent floating-point numbers by double-precision IEEE 754 numbers.
We represent integer number by machine words, i.e. 32-bit or 64-bit numbers depending on the platform.
Finally, we represent strings by pointers to the first character of a character array. The character array itself is obtained by const_stringz
. We tried to obtain directly a pointer to the first character of the array returned by const_stringz
. However, our version of the bindings segfaults when we try to do that. We could instead have stored the string into a global variable (as apparently intended by const_stringz
), but we would then have leaked memory in our interpreter (the library provides no way to free global variables). Finally, we decided to allocate a buffer of the right size on the stack, where we store the string content before returning a pointer to its first character. This is precisely what build_local_stringz
does. Note that we need to know where to insert the buffer, so that both build_local_stringz
and our compilation function will need an LLVM instruction builder (an helper object to emit LLVM instructions).
let compile_lit lit builder = match lit with Ast.Bool b -> const_int i1_type (if b then 1 else 0) | Ast.Char c -> const_int i8_type (int_of_char c) | Ast.Float f -> const_float double_type f | Ast.Int i -> const_int int_type i | Ast.String s -> build_local_stringz s builder let compile ast builder = match ast.node with Literal lit -> compile_lit lit builder
We just saw that while most of our values are just LLVM constants, the string literals require to emit instructions. Where should we put these instructions? We wrap the compilation of a top-level input phrase into a function. Instructions will be emitted in the body of this function, i.e. we will pass a builder
positioned at function body to compile
. We call our wrapping function compile_evaluator_and_printer
. In addition to the AST, it takes an LLVM module into which to define the anonymous function.
In the LLVM tutorial, expressions are compiled as functions that return the computed value. This value is then extracted by GenericValue.as_float
. This approach has two disadvantages. It is not easy to extract values which are not provided for by the GenericValue
module and it disallows a custom printing routine provided by the user in the target language.
Instead, we will print computed values directly in the compile function, using LLVM code and external bindings (see next section). Once the value has been computed, it is passed to the function print_<type of value>
. If no such function exists, we print <abstract>
instead. We do not use GenericValue
at all.
Before printing the value, we first print “- : type = ”
like the OCaml top-level.
We store two strings as global variables rather than allocating them on the stack. These are <abstract>
and \n
. These strings are potentially reused over and over so it makes sense to keep a permanent copy of them. They are used in much the same way as buffers in build_local_stringz
, through build_gep
.
To finish, we call assert_valid_function
on our function before returning it. This LLVM function performs a number of checks on the given function, throwing an exception if an inconsistency is met. This should not happen in the production code if we coded our compiler correctly, but is very useful during its development.
let compile_evaluator_and_printer module_ ast = (* Compile anonymous function to compute and print result *) let f_type = function_type void_type [||] in let f = define_function "" f_type module_ in let builder = builder_at_end (entry_block f) in (* Computes result *) let value = compile ast builder in (* Prints type *) let printf = unsafe_lookup_function "printf" module_ in let val_type = snd ast.label in let val_type_str = string_of_type val_type in let prefix = build_local_stringz ("- : " ^ val_type_str ^ " = ") builder in ignore (build_call printf [|prefix|] "" builder); (* Prints result *) let printer_name = "print_" ^ string_of_type val_type in begin match lookup_function printer_name module_ with Some f -> ignore (build_call f [|value|] "" builder) | None -> let abstract = unsafe_lookup_global "__abstract" module_ in let abstract = build_gep abstract [|zero; zero|] "abstract" builder in ignore (build_call printf [|abstract|] "" builder) end; let newline = unsafe_lookup_global "__newline" module_ in let newline = build_gep newline [|zero; zero|] "newline" builder in ignore (build_call printf [|newline|] "" builder); ignore (build_ret_void builder); assert_valid_function f; f
The preceding code supposed that a number of global variables and functions were defined in the given LLVM module. We thus provides a create
function to create and initialize an LLVM module, that will add all that is necessary for our printing code.
Bindings have two parts. A declaration in the LLVM module (with declare_function
) and an implementation which we put into bindings.c
and its accompanying header.
We will soon be able to define some functions which are now part of the bindings in Photon, and define bindings directly in Photon code. The hard-coding of these bindings in the module creation and most of the code in bindings.c
is just a temporary measure.
The initialization code is mainly trivial (see the documentation of define_global
and declare_function
), except for two functions:
printf
must be defined with var_arg_function_type
rather than with function_type
to account for the fact that it takes a variable number of argument.print_bool
must be extended. In C, there is no 1-bit integer type, the smallest unit being the byte (8 bits). We thus add the zeroext
attribute to the parameter of print_bool
. This attribute means that the caller should extend the value to a 32-bit1) integer before passing it to print_bool
. This happens automatically at call site so that we will not have to manually convert from i1
to i32
. Accordingly, in the C binding part, the print_bool
functions takes an argument of uint32_t
type (we could as well have used the signext
attribute and int32_t
type in this case).let create_and_initialize_module name = let m = create_module name in (* String constants *) let add_string name s = ignore (define_global name (const_stringz s) m) in add_string "__newline" "\n"; add_string "__abstract" "<abstract>"; (* Declare external [printf] *) let printf_type = var_arg_function_type int_type [|pointer_type i8_type|] in ignore (declare_function "printf" printf_type m); (* Declare builtin type printers *) let add_fun name ret_type arg_types = declare_function name (function_type ret_type arg_types) m in ignore (add_fun "print_char" void_type [|i8_type|]); ignore (add_fun "print_float" void_type [|double_type|]); ignore (add_fun "print_int" void_type [|int_type|]); ignore (add_fun "print_string" void_type [|pointer_type i8_type|]); (* Sets [zeroext] attribute on [print_bool] argument *) let print_bool = add_fun "print_bool" void_type [|i1_type|] in let print_bool_arg = param print_bool 0 in add_param_attr print_bool_arg Attribute.Zext; m
The bindings are also pretty simple, the only subtlety being that we escape non-printable characters.
static inline void print_escaped_char(char c) { switch (c) { case '"': case '\\': putchar('\\'); putchar(c); break; case '\b': putchar('\\'); putchar('b'); break; case '\n': putchar('\\'); putchar('n'); break; case '\r': putchar('\\'); putchar('r'); break; case '\t': putchar('\\'); putchar('t'); break; default: if (isprint(c)) putchar(c); else printf("\\%03d", (uint8_t) c); } } void print_bool(uint32_t b) { if (b == 0) printf("false"); else printf("true"); } void print_char(char c) { putchar('\''); print_escaped_char(c); putchar('\''); } void print_float(double f) { printf("%f", f); } void print_int(int n) { printf("%d", n); } void print_string(const char *s) { putchar('"'); while (*s != '\0') print_escaped_char(*s++); putchar('"'); }
The LLVM library comes with an interpretation facility under the name of an execution engine. All we have to do is associate our module to an execution engine, and then use this execution engine to execute generated functions.
Once an anonymous compute-and-print function has computed and printed its result, it is not useful anymore. In order to reclaim its allocated memory, we free corresponding machine code (in the execution engine) and the function IR (in the module) after it has run.
When we load our Interpreter
module, we call initalize_native_target
which is necessary to initialize the LLVM execution engine library.
(** [create module_] creates an execution engine for LLVM module [module_]. *) let create module_ = ExecutionEngine.create (ModuleProvider.create module_) (** [run_and_delete interp f] runs function [f] in the interpreter [interp]. * [f] should belong to the LLVM module assotiated with execution engine * [interp]. *) let run_and_delete interp f = (* Run function *) ignore (ExecutionEngine.run_function f [||] interp); (* Remove function *) ExecutionEngine.free_machine_code f interp; delete_function f (* Initialize LLVM execution engine library when this module is loaded. *) let _ = initialize_native_target ()
Our driver needs few changes. First, we need to create an LLVM module and execution engine.
Second, we pass the parsed AST to the typer, the typed AST to the compiler and the generated IR to the interpreter (which executes and deletes it). We also prints the type after typing, and the IR after compilation.
Here is the full new main
let main () = let lexbuf = Lexing.from_function refill in Lexer.set_file_name lexbuf "<stdin>"; let mod_ = Compiler.create_and_initialize_module "Photon JIT" in let interp = Interpreter.create mod_ in (* Read-eval-print loop *) try while true do (* We drop any input that possibly remains, either because it follows * [;;], or because an error was met *) Lexer.reset lexbuf; reset_input (); try let ast = Parser.interactive Lexer.token lexbuf in print_endline (string_of_ast ast); let typed_ast = Typer.annotate ast in let t = snd typed_ast.label in print_endline (string_of_type t); let f = Compiler.compile_evaluator_and_printer mod_ typed_ast in Llvm.dump_value f; Interpreter.run_and_delete interp f with Lexer.Error err -> prerr_endline (Lexer.string_of_error err) | Parser.Error err -> prerr_endline (Parser.string_of_error err) done with End_of_file -> print_endline ""
To compile and link with LLVM, we modify our ocamlbuild plugin. We declare the libraries, compilation and linking flags, and our fix for create_module
in llvm_fix.c
.
ocaml_lib ~extern:true ~dir:"+llvm" "llvm"; ocaml_lib ~extern:true ~dir:"+llvm" "llvm_analysis"; ocaml_lib ~extern:true ~dir:"+llvm" "llvm_executionengine"; flag ["compile"; "c"; "use_llvm"] (S [A "-ccopt"; A "`llvm-config --cflags`"]); flag ["link"; "use_llvm"] (S [A "-ccopt"; A "`llvm-config --libs`"]); dep ["link"; "use_llvm"] ["llvm_fix.o"];
We also specify a dependency on our bindings, and we set the -rdynamic
link flag so that our functions will be linked in even if they are not called directly (they will be loaded dynamically by LLVM runtime).
flag ["link"; "ocaml"; "g++"] (S [A "-cc"; A "g++"; A "-ccopt"; A "-rdynamic"]); dep ["file:bindings.c"] ["bindings.h"]; dep ["link"; "ocaml"] ["bindings.o"]
We complement our ocamlbuild plugin by a _tags
file specifying the files that use the LLVM libraries, and that photon.native
should be linked with g++
(LLVM needs the C++ standard library that is only linked with when linking is done with g++
).
<compiler.ml{,i}>: use_llvm, use_llvm_analysis <interpreter.ml>: use_llvm, use_llvm_executionengine <llvm_fix.c>: use_llvm <llvm_utils.ml>: use_llvm <photon.ml>: use_llvm <photon.native>: g++, use_llvm, use_llvm_executionengine <photon.native>: use_llvm_analysis
The procedure to build our project stays the same. make
or ocamlbuild photon.native
to generate our interpreter in photon.native
and make doc
to generate the documentation in photon.docdir
.
We have now built a minimal pipeline. Of course, the language subset that is supported at the moment is completely uninteresting, but we hope we have a good basis to build on.
In the next installment, we will add arithmetic operator support to begin making our interpreter at least a little bit useful.
The code accompanying this article is available in archive photon-tut-04.tar.xz or through the git repository:
git clone http://git.legiasoft.com/photon.git cd photon git checkout 4-basic_pipeline
Discussion