User Tools

Site Tools


blog:2011:01:21:value_naming

Photon Compiler Development: Value Naming

This is the sixth article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

In this article, we will add the possibility to name values, and to reuse them later by their name.

Naming

We will allow the naming of expressions through the syntax

let name = expression;;

We only allow naming at the top-level for now, we will lift this restriction later.

Note that this does not define a variable. The expression bound to name is fixed once for all, and may not vary. A new binding

let name = new_expression;;

can look as an assignment, but it is only a re-binding (we say that the previous name is shadowed). By abuse of language, we will however often call expressions referenced by their name variables.

Abstract Syntax Tree

We add a new variable node, which contains an identifier.

and 'a node =
   ...
   | Var of string (** variable *)

Lexing and Parsing

We add a new token for the let keyword, and one for the = token.

We change the return type of the parser to return the bound identifier in addition to the AST.

%type<string * Position.t Ast.t> interactive

When the top-level input statement is a binding, we return the bound name. When it is a simple expression, we return “-” to indicate that the result must not be bound.

interactive
   ...
   | expr SEMI_SEMI        { ("-", $1) }
   | LET IDENT EQUAL expr SEMI_SEMI { ($2, $4) }

We also add a new atom for variables.

atom
   ...
   | IDENT                 { label 1 (Var $1) }

Typing

The typing process is extended with a typing environment, mapping identifiers to types. It is defined in a new Type_env module with the following functions:

  • create builds a new empty type environment.
  • add adds a new (identifier, type) binding.
  • find recovers the type associated to an identifier.

annotate is given a type environment as its first argument. When a variable is met, its type is looked up in the type environment. If the variable is not defined (we say it is unbound), we report an Unbound_var error.

let rec annotate type_env ast =
   match ast.node with
   ...
   | Var id ->
          try
             let t = Type_env.find type_env id in
             { label = (ast.label, t); node = Var id }
          with Not_found -> raise (Error(ast.label, Unbound_var id))

The type environment will be populated by the driver.

Compilation

Storing Bindings

To store bindings, we use global variables (or constants) in the LLVM module. We add a new identifier argument to compile_evaluator_and_printer. If the identifier is different from “-”, we define a new global variable of that name with the computed value.

OCaml allows shadowing. let name = expr1 followed by let name = expr2 will let name refer to expr2. expr1 is no more accessible directly. However, all references to name between the two bindings will forever refer to expr1. Here is an example:

# let a = 42;; (* expr1 *)
val a : int = 42
# let return_a () = a;;
val return_a : unit -> int = <fun>
# let a = "not even the same type";; (* expr2 *)
val a : string = "not even the same type"
# a;;
- : string = "not even the same type"
# return_a ();; (* should still return expr1 *)
- : int = 42

We handle this by renaming any previous global variable which has the same name as the one being defined. The renaming is done consistently by LLVM, i.e. all references to the old name automatically become references to the new name. LLVM also avoid collisions by appending a unique number to our new name if a variable with that name already exists.

As it is the old occurrence that we rename and not the new one, the exported symbol for a variable will always refer to the last defined one. This will take importance when we will implement multiple module compilation later on.

Accordingly, we sets the linkage of renamed variables to be internal as they can no longer be accessed outside our module (their new name being not known).

We use dots in the new name, because these are valid in LLVM identifiers, but not in Photon/OCaml ones. Proceeding that way, there cannot be any confusion between renamed variables and user-defined variables.

Finally, we differentiate between constant and not-constant expressions, because constants offers much more optimization opportunities. Constants are defined directly while not-constant expressions are built by first defining an undefined global variable and then storing the computed value inside it.

let compile_evaluator_and_printer module_ id ast =
   ...
   (* Store result if an [id] is not "-" *)
   if id <> "-" then (
      (* Rename potential older global variable with same name *)
      begin match lookup_global id module_ with
        Some g ->
           set_value_name (".old." ^ id) g;
           set_linkage Linkage_internal g
      | None -> ()
      end;
      (* Store value *)
      if is_constant value then (
         let g = define_global id value module_ in
         set_global_constant true g
      ) else (
         let undefined = undef (type_of value) in
         let g = define_global id undefined module_ in
         ignore (build_store value g builder)
      )
   );
   ...

Variable References

When we meet a variable, we replace it by its value fetched from the corresponding global variable. In the case of a constant, we recover the value directly. If the global variable is not constant, its value is loaded from memory.

let rec compile ast builder =
   match ast.node with
   ...
   | Var id ->
        let m = global_parent (block_parent (insertion_block builder)) in
        let g = unsafe_lookup_global id m in
        if is_global_constant g then global_initializer g
        else build_load g id builder

Strings

In Photon Compiler Development: Basic Pipeline, we implemented strings as pointers to stack-allocated buffers of characters.

This strategy is no more adapted when the string is to be stored in a global variable, as the buffer will cease to exist as soon as the function returns. The stored pointer would then outlive the pointed buffer, we would obtain immediately a dangling pointer.

We will add strings as global variables directly instead. The problem with this approach is that LLVM execution engine provides no way to reclaim storage occupied by global variables (on the contrary to functions for which free_machine_code exists). Even if global variables could be reclaimed, it is not clear how the dependency analysis should be done to determine when they are not used anymore.

Our interpreter now leaks memory as a result of this. However, the memory leak should stay reasonable as most interactive sessions are short, except if the user begins to enter very large strings.

Note that this leak only concerns the interpreter, where one could take the opportunity of freeing a value once it has been printed and cannot be used anymore. It does not concern compiled code where the string constant must be present in the executable one way or another1).

When we will add dynamic allocation to our language (with garbage collection), we will revisit this topic.

We also switch to global strings for strings needed to print results. While we could have kept our stack-allocated strings for these, we remarked that these were vastly inefficient (at least on x86_64, where they resulted in one move per character). Our stack-allocated strings were a bad idea from the start, and we remove them altogether.

A new Bug Fix

We are compiling our project against LLVM 2.6 at the moment. In this release, the OCaml bindings and C bindings are out-of-sync regarding the linkage types. If we set the linkage of a variable to internal with

set_linkage Linkage.internal the_var

it is set to weak_odr instead.

We fix this by defining in Llvm_utils a new type lllinkage which is in sync with the LLVMLinkage enumeration in the C code (constructor tags are assigned values in the same way as a C enumeration, so it suffices to lay them in the same order).

type lllinkage =
     Linkage_external
   | Linkage_available_externally
   | Linkage_link_once_any
   | Linkage_link_once_odr
   | Linkage_weak_any
   | Linkage_weak_odr
   | Linkage_appending
   | Linkage_internal
   | Linkage_private
   | Linkage_dll_import
   | Linkage_dll_export
   | Linkage_external_weak
   | Linkage_ghost
   | Linkage_common
   | Linkage_linker_private

We then redefine linkage and set_linkage functions to use this new type instead of Linkage.t. The underlying C bindings stay the same, though.

external linkage : llvalue -> lllinkage = "llvm_linkage"
 
external set_linkage : lllinkage -> llvalue -> unit = "llvm_set_linkage"

Driver

The first needed change is that an identifier is received in addition to the AST as the parsing result.

Secondly, the type environment needed for typing must be created, and passed to Typer.annotate. Additionally, it is updated by the ad-junction of a new type binding in the case of a binding.

Finally, now that we define global variables in addition to functions, we dumps the full LLVM module at program exit to allow having a look at these. We could have dumped the module after each top-level statement but it would have generated too much clutter. We will refine our interpreter debugging facilities (e.g. allowing the user to print the AST when he wishes) later.

Here is the full updated code for main:

let main () =
   let lexbuf = Lexing.from_function refill in
   Lexer.set_file_name lexbuf "<stdin>";
   let mod_ = Compiler.create_and_initialize_module "Photon JIT" in
   let interp = Interpreter.create mod_ in
   let type_env = Type_env.create () in
   (* Read-eval-print loop *)
   try
      while true do
         (* We drop any input that possibly remains, either because it follows
          * [;;], or because an error was met *)
         Lexer.reset lexbuf;
         reset_input ();
         try
            let (id, ast) = Parser.interactive Lexer.token lexbuf in
            print_endline (string_of_ast ast);
            let typed_ast = Typer.annotate type_env ast in
            let t = snd typed_ast.label in
            print_endline (string_of_type t);
            if id <> "-" then Type_env.add type_env id t;
            let f = Compiler.compile_evaluator_and_printer mod_ id typed_ast in
            Llvm.dump_value f;
            Interpreter.run_and_delete interp f
         with
           Lexer.Error err -> prerr_endline (Lexer.string_of_error err)
         | Parser.Error err -> prerr_endline (Parser.string_of_error err)
         | Typer.Error err -> prerr_endline (Typer.string_of_error err)
      done
   with End_of_file -> Llvm.dump_module mod_

Conclusion

Our interpreter has now reached the functionality of a basic desktop calculator. It allows arithmetic computations, storing intermediate results as variables.

In the next installment, we will add the possibility to declare external functions (i.e. functions compiled from another language such as C) and to call them.

Source Code

The code accompanying this article is available in archive photon-tut-06.tar.xz or through the git repository:

git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 6-value_naming
1) We could imagine to unload part of the program during its execution, but this is way beyond the current state of this tutorial and would not necessarily be worth it. The operating system can swap off pages that are not lively anymore.

Discussion

Enter your comment. Wiki syntax is allowed:
 
blog/2011/01/21/value_naming.txt · Last modified: 2011/02/20 20:28 (external edit)