This is the sixth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.
In this article, we will add the possibility to name values, and to reuse them later by their name.
We will allow the naming of expressions through the syntax
let name = expression;;
We only allow naming at the top-level for now, we will lift this restriction later.
Note that this does not define a variable. The expression bound to name is fixed once for all, and may not vary. A new binding
let name = new_expression;;
can look as an assignment, but it is only a re-binding (we say that the previous name
is shadowed). By abuse of language, we will however often call expressions referenced by their name variables.
We add a new variable node, which contains an identifier.
and 'a node = ... | Var of string (** variable *)
We add a new token for the let
keyword, and one for the =
token.
We change the return type of the parser to return the bound identifier in addition to the AST.
%type<string * Position.t Ast.t> interactive
When the top-level input statement is a binding, we return the bound name. When it is a simple expression, we return “-”
to indicate that the result must not be bound.
interactive ... | expr SEMI_SEMI { ("-", $1) } | LET IDENT EQUAL expr SEMI_SEMI { ($2, $4) }
We also add a new atom
for variables.
atom ... | IDENT { label 1 (Var $1) }
The typing process is extended with a typing environment, mapping identifiers to types. It is defined in a new Type_env
module with the following functions:
create
builds a new empty type environment.add
adds a new (identifier, type) binding.find
recovers the type associated to an identifier.
annotate
is given a type environment as its first argument. When a variable is met, its type is looked up in the type environment. If the variable is not defined (we say it is unbound), we report an Unbound_var
error.
let rec annotate type_env ast = match ast.node with ... | Var id -> try let t = Type_env.find type_env id in { label = (ast.label, t); node = Var id } with Not_found -> raise (Error(ast.label, Unbound_var id))
The type environment will be populated by the driver.
To store bindings, we use global variables (or constants) in the LLVM module. We add a new identifier argument to compile_evaluator_and_printer
. If the identifier is different from “-”
, we define a new global variable of that name with the computed value.
OCaml allows shadowing. let name = expr1
followed by let name = expr2
will let name
refer to expr2
. expr1
is no more accessible directly. However, all references to name
between the two bindings will forever refer to expr1
. Here is an example:
# let a = 42;; (* expr1 *) val a : int = 42 # let return_a () = a;; val return_a : unit -> int = <fun> # let a = "not even the same type";; (* expr2 *) val a : string = "not even the same type" # a;; - : string = "not even the same type" # return_a ();; (* should still return expr1 *) - : int = 42
We handle this by renaming any previous global variable which has the same name as the one being defined. The renaming is done consistently by LLVM, i.e. all references to the old name automatically become references to the new name. LLVM also avoid collisions by appending a unique number to our new name if a variable with that name already exists.
As it is the old occurrence that we rename and not the new one, the exported symbol for a variable will always refer to the last defined one. This will take importance when we will implement multiple module compilation later on.
Accordingly, we sets the linkage of renamed variables to be internal
as they can no longer be accessed outside our module (their new name being not known).
We use dots in the new name, because these are valid in LLVM identifiers, but not in Photon/OCaml ones. Proceeding that way, there cannot be any confusion between renamed variables and user-defined variables.
Finally, we differentiate between constant and not-constant expressions, because constants offers much more optimization opportunities. Constants are defined directly while not-constant expressions are built by first defining an undefined global variable and then storing the computed value inside it.
let compile_evaluator_and_printer module_ id ast = ... (* Store result if an [id] is not "-" *) if id <> "-" then ( (* Rename potential older global variable with same name *) begin match lookup_global id module_ with Some g -> set_value_name (".old." ^ id) g; set_linkage Linkage_internal g | None -> () end; (* Store value *) if is_constant value then ( let g = define_global id value module_ in set_global_constant true g ) else ( let undefined = undef (type_of value) in let g = define_global id undefined module_ in ignore (build_store value g builder) ) ); ...
When we meet a variable, we replace it by its value fetched from the corresponding global variable. In the case of a constant, we recover the value directly. If the global variable is not constant, its value is loaded from memory.
let rec compile ast builder = match ast.node with ... | Var id -> let m = global_parent (block_parent (insertion_block builder)) in let g = unsafe_lookup_global id m in if is_global_constant g then global_initializer g else build_load g id builder
In Photon Compiler Development: Basic Pipeline, we implemented strings as pointers to stack-allocated buffers of characters.
This strategy is no more adapted when the string is to be stored in a global variable, as the buffer will cease to exist as soon as the function returns. The stored pointer would then outlive the pointed buffer, we would obtain immediately a dangling pointer.
We will add strings as global variables directly instead. The problem with this approach is that LLVM execution engine provides no way to reclaim storage occupied by global variables (on the contrary to functions for which free_machine_code
exists). Even if global variables could be reclaimed, it is not clear how the dependency analysis should be done to determine when they are not used anymore.
Our interpreter now leaks memory as a result of this. However, the memory leak should stay reasonable as most interactive sessions are short, except if the user begins to enter very large strings.
Note that this leak only concerns the interpreter, where one could take the opportunity of freeing a value once it has been printed and cannot be used anymore. It does not concern compiled code where the string constant must be present in the executable one way or another1).
When we will add dynamic allocation to our language (with garbage collection), we will revisit this topic.
We also switch to global strings for strings needed to print results. While we could have kept our stack-allocated strings for these, we remarked that these were vastly inefficient (at least on x86_64, where they resulted in one move per character). Our stack-allocated strings were a bad idea from the start, and we remove them altogether.
We are compiling our project against LLVM 2.6 at the moment. In this release, the OCaml bindings and C bindings are out-of-sync regarding the linkage types. If we set the linkage of a variable to internal
with
set_linkage Linkage.internal the_var
it is set to weak_odr
instead.
We fix this by defining in Llvm_utils
a new type lllinkage
which is in sync with the LLVMLinkage
enumeration in the C code (constructor tags are assigned values in the same way as a C enumeration, so it suffices to lay them in the same order).
type lllinkage = Linkage_external | Linkage_available_externally | Linkage_link_once_any | Linkage_link_once_odr | Linkage_weak_any | Linkage_weak_odr | Linkage_appending | Linkage_internal | Linkage_private | Linkage_dll_import | Linkage_dll_export | Linkage_external_weak | Linkage_ghost | Linkage_common | Linkage_linker_private
We then redefine linkage
and set_linkage
functions to use this new type instead of Linkage.t
. The underlying C bindings stay the same, though.
external linkage : llvalue -> lllinkage = "llvm_linkage" external set_linkage : lllinkage -> llvalue -> unit = "llvm_set_linkage"
The first needed change is that an identifier is received in addition to the AST as the parsing result.
Secondly, the type environment needed for typing must be created, and passed to Typer.annotate
. Additionally, it is updated by the ad-junction of a new type binding in the case of a binding.
Finally, now that we define global variables in addition to functions, we dumps the full LLVM module at program exit to allow having a look at these. We could have dumped the module after each top-level statement but it would have generated too much clutter. We will refine our interpreter debugging facilities (e.g. allowing the user to print the AST when he wishes) later.
Here is the full updated code for main
:
let main () = let lexbuf = Lexing.from_function refill in Lexer.set_file_name lexbuf "<stdin>"; let mod_ = Compiler.create_and_initialize_module "Photon JIT" in let interp = Interpreter.create mod_ in let type_env = Type_env.create () in (* Read-eval-print loop *) try while true do (* We drop any input that possibly remains, either because it follows * [;;], or because an error was met *) Lexer.reset lexbuf; reset_input (); try let (id, ast) = Parser.interactive Lexer.token lexbuf in print_endline (string_of_ast ast); let typed_ast = Typer.annotate type_env ast in let t = snd typed_ast.label in print_endline (string_of_type t); if id <> "-" then Type_env.add type_env id t; let f = Compiler.compile_evaluator_and_printer mod_ id typed_ast in Llvm.dump_value f; Interpreter.run_and_delete interp f with Lexer.Error err -> prerr_endline (Lexer.string_of_error err) | Parser.Error err -> prerr_endline (Parser.string_of_error err) | Typer.Error err -> prerr_endline (Typer.string_of_error err) done with End_of_file -> Llvm.dump_module mod_
Our interpreter has now reached the functionality of a basic desktop calculator. It allows arithmetic computations, storing intermediate results as variables.
In the next installment, we will add the possibility to declare external functions (i.e. functions compiled from another language such as C) and to call them.
The code accompanying this article is available in archive photon-tut-06.tar.xz or through the git repository:
git clone http://git.legiasoft.com/photon.git cd photon git checkout 6-value_naming
Discussion