Photon Compiler Development: Basic Parsing

This is the third article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

This article details the basics of syntactic analysis.

Syntactic Analysis

Syntactic analysis, or parsing, is the process of analyzing the input source code (seen as a stream of tokens) to determine its conformance with the language grammar.

The parser will not only verify that the input source code is syntactically correct; it will also build an abstract syntax tree (AST) along the way. This tree represents the input program in a form that is convenient for semantic analysis (e.g. typing) and compilation.


While we provided a nearly complete lexer in the previous installment, we will not provide a complete parser in this one. Instead, we will first parse a very minimal part of the language (constant values), and we will also provide a minimal driver program that allows us to test our lexer and parser a bit.

In the next installment, we will complete our minimal pipeline by adding trivial typing and a first contact with LLVM.

Parser Generator

We will use a parser generator rather than implementing the parsing process directly.

One of the most well-known parser generators is Yacc (the original companion of Lex), along with its popular alternative Bison (generally combined with Flex). A similar tool is available to the OCaml programmer under the name ocamlyacc (which integrates nicely with ocamllex).

Ocamlyacc is provided with the OCaml distribution. It is documented in the OCaml manual. Another useful resource is the ocamlyacc tutorial.

While lexing is not a controversial topic, there exist many parsing techniques, each with its advantages and drawbacks. We chose to go with an LALR parser:

  • It is probably the most widely used.
  • LALR grammars are relatively legible.
  • It is well-supported by OCaml.

Another LALR parser generator for OCaml is Menhir1). Menhir is a bit more expressive than ocamlyacc and provides, in our opinion, better grammar debugging facilities. However, Menhir is a bit painful for position management, which is why we stuck with ocamlyacc. We do use Menhir from time to time during development though, to debug our grammar or to verify the absence of some possible error conditions that would otherwise pass undetected by ocamlyacc (Menhir accepts the ocamlyacc input format).

Input Format

We call our input file to ocamlyacc parser.mly. From it, ocamlyacc will produce our parser in the files parser.ml and parser.mli. The input file has the following format:

%{ (* Header: OCaml code *) %}
  /* Declarations */
%%
  /* Rules */
%%
(* Trailer: OCaml code *)

Header and trailer are again ordinary OCaml code that will be respectively prepended and appended to ocamlyacc-generated code.

Declarations are of four kinds:

  • token definitions;
  • associativity definitions;
  • start rule definitions;
  • start rule types.

A token definition has the form

%token<Fully_qualified_type.t> WITH_ARG_NAME

If a type is specified, then the corresponding token constructor takes an argument of the given type. In our example, the following token type would be generated

type token =
   | WITH_ARG_NAME of Fully_qualified_type.t

We will not use associativity in our first parser. We will give the syntax of associativity declarations when we introduce operators later on.

Start rules are the equivalent of lexer entry points for ocamlyacc. These rules will be compiled as functions taking a lexing function (of type Lexing.lexbuf → token) and a lexing buffer (of type lexbuf) as arguments. The return type of a start rule must be provided.
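To make this calling convention concrete, here is a hand-written stand-in with the same shape as a generated start rule. The token and AST types below are hypothetical placeholders for illustration only, not the ones ocamlyacc will generate for us.

```ocaml
(* Hypothetical token and AST types, for illustration only. *)
type token = INT of int | EOF
type ast = Literal_int of int

(* A start rule compiles to a function of this shape: it takes a lexing
 * function and a lexing buffer, and returns a value of the declared type. *)
let interactive (tokenize : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) : ast =
   match tokenize lexbuf with
     INT i -> Literal_int i
   | EOF -> raise End_of_file
```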

In our first parser, we will have only one start rule, for interactive input, which we declare as follows (we will introduce its return type below).

%start interactive
%type<Position.t Ast.t> interactive

Syntax Errors

We handle syntax errors in a similar way to lexical errors. We define a Parser_error.t type in parser_error.mli and a parse error exception in parser.mly.

exception Error of (Position.t * Parser_error.t)

ocamlyacc supports a predefined special token error which can be used where we want to be able to detect errors. If a syntax error is met, the parser will discard tokens until an error token can be shifted. If there is no rule available where an error token can be shifted, the parsing fails with a Parsing.Parse_error exception.

To avoid the latter, we will generally use the error token at the end of a rule, allowing it to be shifted as soon as an error is detected, without regard to what may follow it. This is contrary to the traditional approach of waiting for a synchronization point such as ;; to resume parsing. We will do the resynchronization in the driver instead, which is less error-prone (e.g. no need to ensure that all possible error cases allow the shifting of an error token) and sufficient for our purpose (we will abort the current top-level interactive statement, discarding all buffered input as explained below).

We will be careful to provide an error rule for each start rule to be sure to throw our own exception with position information, instead of the predefined one. We will also use error tokens liberally as our parser grows, in order to provide the user with meaningful error messages in place of the rather vague “syntax error”.

Position Handling

The parser manages positions automatically and always knows the position in the input source code that corresponds to both terminal (token) and non-terminal symbols.

The position of the nth right-hand side symbol of a grammar rule can be obtained through calls to Parsing.rhs_start_pos and Parsing.rhs_end_pos. The position of the left-hand side can be obtained by calling Parsing.symbol_start_pos and Parsing.symbol_end_pos.

We define a helper function pos to return a Position.t for the given element by calls to the Parsing functions.

(* [pos n] returns the position of the left-hand side of a rule if [n = 0] and
 * the position of the [n]th right-hand side if [n > 0]. *)
let pos n =
   let s = if n = 0 then symbol_start_pos () else rhs_start_pos n in
   let e = if n = 0 then symbol_end_pos () else rhs_end_pos n in
   {
      pos_start = s;
      pos_length = e.pos_cnum - s.pos_cnum;
   }
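The length computation can be checked in isolation. The following is a standalone sketch: it inlines an assumed Position.t record (named position here so it compiles on its own) and builds one from a pair of Lexing.position endpoints, as pos does with the values returned by the Parsing functions.

```ocaml
open Lexing

(* Assumed shape of the article's [Position.t]. *)
type position = { pos_start : Lexing.position; pos_length : int }

(* Builds a [position] from start/end [Lexing.position] endpoints. *)
let make s e = { pos_start = s; pos_length = e.pos_cnum - s.pos_cnum }
```

For example, endpoints at character offsets 4 and 7 yield a length of 3.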

Abstract Syntax Tree

Before jumping to our parser, let us say a word about the AST. We chose to label the nodes of our AST with an arbitrary label. This allows us to keep the same tree shape at different phases of compilation.

During parsing, we will label the AST nodes with positions. The position indicates the region of input source code that corresponds to the node.

During typing, we will add a type to the label.

We will refine the AST as we go. For now, as we only parse literals, it can only represent literals.

(** Literals. *)
type lit =
     Bool of bool
   | Char of char
   | Float of float
   | Int of int
   | String of string
(** AST. AST nodes are labelled and the AST type is polymorphic in the label
  * type to allow to share the same AST structure for different compilation
  * phases.
  * The AST label will be a position after parsing and a [(position, type)]
  * pair after typing. *)
type 'a t = {
   label : 'a;
   node : node;
} and node =
   Literal of lit (** literals *)

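A quick sketch of how these definitions are used: below is a self-contained copy of the lit and 'a t types from above, with a node labelled with unit (since we have no positions yet) and a small pattern match over it.

```ocaml
(* Self-contained copies of the [lit] and ['a t] definitions above. *)
type lit =
     Bool of bool
   | Char of char
   | Float of float
   | Int of int
   | String of string

type 'a t = {
   label : 'a;
   node : node;
} and node =
   Literal of lit

(* A node labelled with [()]: the label type is up to the compilation phase. *)
let example : unit t = { label = (); node = Literal (Int 42) }

let describe { node = Literal lit; _ } =
   match lit with
     Int i -> "int " ^ string_of_int i
   | _ -> "other literal"
```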

Parser

Our parser is rather trivial for now. The definitions are

%token EOF
%token FALSE
%token QUOTE
%token SEMISEMI
%token TRUE
%token<char> CHAR
%token<float> FLOAT
%token<int> INT
%token<string> IDENT
%token<string> STRING

%start interactive
%type<Position.t Ast.t> interactive

We define a helper function to build labeled AST nodes in the header

(* [label n node] builds an AST by labelling node [node] with [pos n]. *)
let label n node = { label = pos n; node = node }

The interactive start rule has only three cases for now:

  • We met end-of-file, raise an End_of_file exception.
  • We met a literal, return it.
  • We met an error, throw an exception as explained in syntax_errors.
interactive
   : EOF                   { raise End_of_file }
   | lit SEMISEMI          { label 1 (Literal $1) }
   | error                 { raise (Error(pos 1, Syntax_error)) }
;

The lit rule is trivial

lit
   : FALSE                 { Bool false }
   | TRUE                  { Bool true }
   | CHAR                  { Char $1 }
   | FLOAT                 { Float $1 }
   | INT                   { Int $1 }
   | STRING                { String $1 }
;

Custom Parser Interface

Unlike ocamllex, which generates no interface file and is happy with whatever interface the user provides as long as it is consistent with the implementation, ocamlyacc generates an interface file. As we keep our parser in parser.mly, a parser.mli file will be generated.

The problem is that this file does not contain any documentation for ocamldoc and does not export our Parser.Error exception (nor string_of_error that we will introduce below).

We solve this problem by providing a custom interface in parser.override.mli as explained in a previous post.

We will see in the ocamlbuild section what is necessary to use this custom interface file in place of the ocamlyacc-generated one.


Driver

For now, we will read, parse and print interactive input from standard input.


The intended behavior is as follows:

  • We don't buffer input more than one line in advance, so that a result and new prompt may be printed as soon as we press Return, if the statement is complete.
  • We print a primary prompt “# ” at the start of a top-level interactive statement, and a two-space continuation prompt “  ” elsewhere.
  • Error message positions should be relative to the start of current top-level interactive statement.

To control buffering and prompting, we will use Lexing.from_function, which builds a lexing buffer that obtains its input by calling the given function.

The input function must refill a finite-size buffer. In order to read input line-by-line, we will call input_line which returns a new string of arbitrary length. We will thus need our own buffer.

We will also need a flag to remember if the next prompt to be printed should be the primary prompt or the continuation prompt. We will store the prompt directly.

For simplicity, as the program handles at most one interactive input at a time, we store both the input buffer and the current prompt as mutable global variables in our driver program:

let input_buf = ref ""
let curr_prompt = ref "# "

We then define the refill function. If there is no buffered input, we print the prompt, read one line and return as much of it as we can, buffering the rest if any. If some input is still present in buffer, we return as much of it as we can, keeping the rest if any. When we have read a line, we set the prompt to be the continuation prompt until further notice.

let refill s n =
   (* Refill the internal buffer if necessary *)
   if String.length !input_buf = 0 then (
      print_string !curr_prompt;
      flush stdout;
      curr_prompt := "  ";
      try input_buf := input_line stdin ^ "\n"
      with End_of_file -> ()
   );
   (* Copy (at most n) characters from the internal buffer to s *)
   let buf_len = String.length !input_buf in
   let n_min = min n buf_len in
   String.blit !input_buf 0 s 0 n_min;
   (* Update the internal buffer *)
   input_buf := String.sub !input_buf n_min (buf_len - n_min);
   (* Return the number of copied characters *)
   n_min
Finally, we define a reset_input function to discard buffered input and restore the primary prompt.

let reset_input () =
   input_buf := "";
   curr_prompt := "# "

When discarding buffered input, we must not forget the input that has already been buffered in lexbuf. We add to Lexer a new reset function for this purpose2). reset will also reset the position of the given lexing buffer.

let reset lexbuf =
   flush_input lexbuf;
   set_line_num lexbuf 1

Program Entry Point

Our main function creates the lexing buffer, then enters the read-eval-print loop (REPL), which repeatedly parses a top-level interactive statement and prints its AST, until End_of_file is raised.

Similarly to OCaml, we do not process further input on the same line as a ;; token in interactive mode. Each time a top-level interactive phrase has been parsed (each one ending in ;;), we discard the remaining buffered input.

If a lexical or syntax error occurs, it is printed, then the loop resumes. As we discard remaining input at the start of the loop, it will not trigger further errors. The compilation of the current top-level statement is thus completely aborted when an error occurs.

let main () =
   let lexbuf = Lexing.from_function refill in
   Lexer.set_file_name lexbuf "<stdin>";
   (* Read-eval-print loop *)
   try
      while true do
         (* We drop any input that possibly remains, either because it follows
          * [;;], or because an error was met *)
         Lexer.reset lexbuf;
         reset_input ();
         try
            let ast = Parser.interactive Lexer.token lexbuf in
            print_endline (string_of_ast ast)
         with
           Lexer.Error err -> prerr_endline (Lexer.string_of_error err)
         | Parser.Error err -> prerr_endline (Parser.string_of_error err)
      done
   with End_of_file -> print_endline ""

This code makes use of the string_of_ast and string_of_error functions, which convert respectively an AST and an error into a string. We have chosen not to use the Format or Printf modules, both of which need the ('a, 'b, 'c, 'd, 'e, 'f) format6 type. This special type is not trivial to implement and will be avoided (at least) in this first version of Photon. We also did not use a string buffer. AST printing is used only for debugging, and errors should be both rare (hopefully) and short. However, we may change this in the future if string concatenation becomes too detrimental to performance.

(** [string_of_ast ast] returns the string representation of [ast], ignoring
  * labels. *)
let string_of_ast ast =
   match ast.node with
     Literal lit ->
        let string_of_lit = function
             Bool b -> "Bool(" ^ (if b then "true" else "false") ^ ")"
           | Char c -> "Char('" ^ Char.escaped c ^ "')"
           | Float f -> "Float(" ^ string_of_float f ^ ")"
           | Int i -> "Int(" ^ string_of_int i ^ ")"
           | String s -> "String(\"" ^ String.escaped s ^ "\")" in
        "Literal(" ^ string_of_lit lit ^ ")"

Lexer.string_of_error converts a lexical error into a human-readable message:

let string_of_error (pos, err) =
   let pos = string_of_position pos in
   let err =
      match err with
        Illegal_char c ->
           "character '" ^ Char.escaped c ^ "' is illegal in this context"
      | Invalid_escape_sequence s -> "invalid escape sequence '" ^ s ^ "'"
      | Too_big_int i -> i ^ " exceeds the range of representable integers"
      | Unterminated_comment -> "unterminated comment"
      | Unterminated_string -> "unterminated string literal" in
   pos ^ ": " ^ err

Parser.string_of_error is simpler for now:

let string_of_error (pos, err) =
   let pos = string_of_position pos in
   let err =
      match err with
        Syntax_error -> "Syntax error" in
   pos ^ ": " ^ err

As both Lexer.string_of_error and Parser.string_of_error make use of the string representation of a position, we extend Position with a string_of_position function.

(** [string_of_position pos] returns the string representation of position
  * [pos]. *)
let string_of_position { pos_start = s; pos_length = len } =
   let name = s.pos_fname in
   let line_str = string_of_int s.pos_lnum in
   let col = s.pos_cnum - s.pos_bol + 1 in
   let col_str = string_of_int col in
   name ^ ":" ^ line_str ^ "." ^ col_str 
      ^ if len <= 1 then "" else ("-" ^ string_of_int (col + len - 1))
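As a sanity check, a position at column 5 with length 3 (the position reported for 456 in the 123 456;; error of the sample session) formats as expected. This sketch inlines an assumed Position.t record (named position here) so it runs standalone:

```ocaml
open Lexing

(* Assumed shape of the article's [Position.t]. *)
type position = { pos_start : Lexing.position; pos_length : int }

let string_of_position { pos_start = s; pos_length = len } =
   let col = s.pos_cnum - s.pos_bol + 1 in
   s.pos_fname ^ ":" ^ string_of_int s.pos_lnum ^ "." ^ string_of_int col
      ^ (if len <= 1 then "" else "-" ^ string_of_int (col + len - 1))
```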

Finally, we add a call to main at the end of the driver program:

let _ = main ()


Ocamlbuild

We have now gathered sufficient code to be able to compile our project. For this purpose, we use ocamlbuild. Ocamlbuild is the kind of tool that makes one wonder how one ever managed without it. While the OCaml compiler requires us to be very explicit about dependencies when compiling files, ocamlbuild handles it all for us. It also knows about ocamllex and ocamlyacc and will call them for us.

All we have to do to obtain an executable version of our interpreter is to enter ocamlbuild photon.native in our project directory. To generate the documentation with ocamldoc, we enter instead ocamlbuild photon.docdir/index.html and that's it.

At least, it would have been that simple, had we not used a custom parser interface in parser.override.mli and comments with a * on each line. To configure ocamlbuild to suit our needs, we add to the project the following plugin, responsible for adding the -stars argument to calls to ocamldoc and for copying parser.override.mli onto parser.mli as described in this previous post.

open Ocamlbuild_plugin
(* Hook handler containing our extensions for the [ocamlbuild] system. *)
let hook_handler = function
     After_rules ->
        (* Adds '-stars' to ocamldoc invocations *)
        flag ["doc"; "ocaml"] (S[A"-stars"]);
        (* Overrides ocamlyacc-generated parser.mli with parser.override.mli *)
        copy_rule "Overrides Parser interface" ~insert: (`before "ocamlyacc")
           "parser.override.mli" "parser.mli"
   | _ -> ()
(* Registers our hook handler when this module is loaded (by ocamlbuild). *)
let _ = dispatch hook_handler

Having just said how nice and easy ocamlbuild is, we still met a problem that we were able to solve only with the addition of a Makefile. The problem is that we used some UTF-8 characters in comments (for diagrams; have a look at the generated documentation), while ocamldoc only supports ISO-8859-1. After having generated the documentation, we need to replace each occurrence of iso-8859-1 with utf-8 in the generated HTML files.

We would have liked to be able to do this cleanly in ocamlbuild, but we did not find a way to do it. The problem is that there seems to be no provision to extend an existing rule. Once ocamlbuild has found a rule that generates the requested resource, it calls this rule and ignores all others. We can easily replace a rule by specifying another one that appears earlier in the rule list, but it must then perform the full generation, not only an after-the-fact modification.

This would not be problematic if we had access to existing rule commands to build the replacement rule (calling the old rule commands, then applying our own), but we could not find a way to do that.

We also tried to play with dependencies (including with stamps such as %.docdir/html.stamp) to generate the documentation using the predefined ocaml rule as a dependency, and then doing our modifications. It worked, but only as long as you are ready to ask for a custom target such as photon.docdir/utf8.stamp instead of the well-known photon.docdir/index.html.

Finally, out of luck, we decided to write the following Makefile, resorting to ocamlbuild for all dependency analysis, but running a sed after documentation generation. If you find a way to do this cleanly entirely in ocamlbuild, we are eager to hear it.

native:
	ocamlbuild photon.native

doc:
	ocamlbuild photon.docdir/index.html
	sed -i "s/iso-8859-1/utf-8/" photon.docdir/*.html

clean:
	ocamlbuild -clean

One should then build the documentation with make doc instead of ocamlbuild photon.docdir/index.html. The native and clean targets are provided for convenience (e.g. we just have to type :mak in Vim to compile our project).


Sample Session

Although our parser is ridiculously trivial and our pipeline is only half complete, at least we have some code to compile and run. We can now parse some values and check the behavior of our interpreter when lexical or syntax errors occur.

% ./photon.native
# 42
  ;; what follows a ';;' is ignored in interactive mode
Literal(Int(42))
# '\068' (* some comment *);;
Literal(Char('D'))
#    "foo
  calls for
  	bar";;
Literal(String("foo\ncalls for\n\tbar"))
# 123 456;;
<stdin>:1.5-7: Syntax error
# 42 @
<stdin>:1.4: character '@' is illegal in this context
# true;;
Literal(Bool(true))
# 

To check end-of-file error conditions, one can load input from a file.

% echo "   \"unterminated string\nblah blah" > /tmp/
% ./photon.native < /tmp/
#     <stdin>:1.4: unterminated string literal

In the next installment, we will add trivial typing and LLVM compilation to obtain our first complete pipeline, although for a completely trivial language.

Source Code

The code accompanying this article is available in archive photon-tut-03.tar.xz or through the git repository:

git clone
cd photon
git checkout 3-basic_parsing.1
1) More precisely, Menhir is a LR parser generator.
2) We group utility functions extending the Lexing module in Lexer.


blog/2011/01/13/syntactic_analysis.txt · Last modified: 2011/02/20 20:28 (external edit)