This is the third article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.
This article details the basics of syntactic analysis.
Syntactic analysis, or parsing, is the process of analyzing the input source code (seen as a stream of tokens) to determine its conformance with the language grammar.
The parser will not only verify that input source code is syntactically correct, it will also build an abstract syntax tree (AST) along the way. This tree represents the input program in a form that is convenient for semantic analysis (e.g. typing) and compilation.
While we provided a nearly complete lexer in the previous installment, we will not provide a complete parser in this one. Instead, we will first parse a very minimal part of the language (constant values), but we will also provide a minimal driver program that will allow us to test our lexer and parser a bit.
In the next installment, we will complete our minimal pipeline by adding trivial typing and a first contact with LLVM.
We will use a parser generator rather than implementing the parsing process directly.
One of the most well-known parser generators is Yacc (the original companion of Lex), or its popular alternative Bison (generally combined with Flex). A similar tool is available to the OCaml programmer under the name of ocamlyacc (which nicely integrates with ocamllex).
Ocamlyacc is provided with the OCaml distribution. It is documented in the OCaml manual. Another useful resource is the ocamlyacc tutorial.
While lexing techniques are largely uncontroversial, there exist many parsing techniques, each with its advantages and drawbacks. We chose to go with an LALR parser.
Another LALR parser generator for OCaml is Menhir1). Menhir is a bit more expressive than ocamlyacc and provides, IMHO, better grammar debugging facilities. However, Menhir is a bit painful for position management, which is why we stuck with ocamlyacc. We do use Menhir from time to time during development, though, to debug our grammar or to verify the absence of some possible error conditions that would otherwise pass undetected by ocamlyacc (Menhir accepts the ocamlyacc input format).
We call our ocamlyacc input file parser.mly. Ocamlyacc will produce as output our parser in the files parser.ml and parser.mli. The input file has the following format:
%{ (* Header: OCaml code *) %}
declarations
%%
rules
%%
(* Trailer: OCaml code *)
Header and trailer are again ordinary OCaml code that will be respectively prepended and appended to ocamlyacc-generated code.
Declarations are of four kinds: token definitions (%token), associativity and precedence declarations (%left, %right, %nonassoc), start symbol declarations (%start), and symbol type declarations (%type).
A token definition has the form:

%token NULLARY_NAME
%token<Fully_qualified_type.t> WITH_ARG_NAME
If a type is specified, then the corresponding token constructor takes an argument of the given type. In our example, the following token type would be generated:

type token =
    NULLARY_NAME
  | WITH_ARG_NAME of Fully_qualified_type.t
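As a quick sanity check, the generated constructors behave like those of any ordinary variant type. A minimal sketch (with a stand-in Fully_qualified_type module, since that module name is only illustrative):

```ocaml
(* Stand-in for the illustrative [Fully_qualified_type] module above. *)
module Fully_qualified_type = struct
  type t = int
end

(* The token type ocamlyacc would generate for the two declarations. *)
type token =
    NULLARY_NAME
  | WITH_ARG_NAME of Fully_qualified_type.t

(* Tokens are matched like any other variant. *)
let describe = function
    NULLARY_NAME -> "NULLARY_NAME"
  | WITH_ARG_NAME v -> "WITH_ARG_NAME " ^ string_of_int v

let () = print_endline (describe (WITH_ARG_NAME 42))
```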
We will not use associativity in our first parser. We will give the syntax of associativity declarations when we introduce operators later on.
Start rules are the equivalent of lexer entry points for ocamlyacc. These rules will be compiled as functions taking a lexing function (of type Lexing.lexbuf → token) and a lexing buffer (of type Lexing.lexbuf) as arguments. The return type of a start rule must be provided.
In our first parser, we will only have one start rule, for interactive input, which we declare as follows (we will introduce its return type below):

%start interactive
%type<Position.t Ast.t> interactive
We handle syntax errors in a similar way to lexical errors. We define a Parser_error.t type in parser_error.mli and a parse error exception in parser.mly:
exception Error of (Position.t * Parser_error.t)
Ocamlyacc supports a predefined special token error, which can be used wherever we want to be able to detect errors. When a syntax error is met, the parser discards tokens until an error token can be shifted. If there is no rule in which an error token can be shifted, parsing fails with a Parsing.Parse_error exception.
To avoid the latter, we will generally use the error token at the end of a rule, allowing it to be shifted as soon as an error is detected, regardless of what may follow it. This is contrary to the traditional approach of waiting for a synchronization point, such as ;;, to resume parsing. We will do the resynchronization in the driver instead, which is less error-prone (e.g. there is no need to ensure that every possible error case allows the shifting of an error token) and sufficient for our purpose (we will abort the current top-level interactive statement, discarding all buffered input, as explained below).
We will be careful to provide an error rule for each start rule, to be sure to throw our own exception with position information instead of the predefined one. We will also use error tokens liberally as our parser grows, in order to provide the user with meaningful error messages in place of the rather vague “syntax error”.
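As a purely hypothetical illustration (the LET, EQUAL and expr symbols are not part of our grammar yet), an error token placed at the end of a rule could look like this; the error case is shifted as soon as anything after LET fails to parse:

```
definition :
    LET IDENT EQUAL expr { (* build the AST node *) }
  | LET error            { raise (Error(pos 2, Syntax_error)) }
```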
The parser manages positions automatically and always knows the position in input source code that corresponds to both terminal (tokens) and non-terminal symbols.
The position of the nth right-hand side symbol of a grammar rule can be obtained through calls to Parsing.rhs_start_pos and Parsing.rhs_end_pos. The position of the left-hand side can be obtained by calling Parsing.symbol_start_pos and Parsing.symbol_end_pos.
We define a helper function pos that returns a Position.t for the given element by calling these Parsing functions.
(* [pos n] returns the position of the left-hand side of a rule if [n = 0] and
 * the position of the [n]th right-hand side if [n > 0]. *)
let pos n =
  let s = if n = 0 then symbol_start_pos () else rhs_start_pos n in
  let e = if n = 0 then symbol_end_pos () else rhs_end_pos n in
  { pos_start = s;
    pos_length = e.pos_cnum - s.pos_cnum; }
Before jumping to our parser, let us say a word about the AST. We chose to label the nodes of our AST with an arbitrary label. This allows us to keep the same tree shape at different phases of compilation.
During parsing, we will label the AST nodes with positions. The position indicates the region of input source code that corresponds to the node.
During typing, we will add a type to the label.
We will refine the AST as we go. For now, as we only parse literals, it can only represent literals.
(** Literals. *)
type lit =
    Bool of bool
  | Char of char
  | Float of float
  | Int of int
  | String of string

(** AST. AST nodes are labelled and the AST type is polymorphic in the label
 * type to allow to share the same AST structure for different compilation
 * phases.
 *
 * The AST label will be a position after parsing and a [(position, type)]
 * pair after typing. *)
type 'a t = {
  label : 'a;
  node : node;
}
and node =
    Literal of lit  (** literals *)
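As a minimal, self-contained sketch of how such a labelled tree is built (restating the types, and using a throwaway unit label in place of the real Position.t):

```ocaml
type lit =
    Bool of bool
  | Char of char
  | Float of float
  | Int of int
  | String of string

type 'a t = {
  label : 'a;
  node : node;
}
and node =
    Literal of lit

(* The AST for the literal [42], labelled with [()] instead of a position. *)
let example : unit t = { label = (); node = Literal (Int 42) }

let () =
  match example.node with
    Literal (Int i) -> print_endline ("parsed int: " ^ string_of_int i)
  | Literal _ -> ()
```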
Our parser is rather trivial for now. The declarations are:

%token EOF
%token FALSE
%token QUOTE
%token SEMISEMI
%token TRUE
%token<char> CHAR
%token<float> FLOAT
%token<int> INT
%token<string> IDENT
%token<string> STRING

%start interactive
%type<Position.t Ast.t> interactive
We define a helper function to build labeled AST nodes in the header:

(* [label n node] builds an AST by labelling node [node] with [pos n]. *)
let label n node = { label = pos n; node = node }
The interactive start rule has only three cases for now; on end of input, we raise an End_of_file exception:

interactive :
    EOF           { raise End_of_file }
  | lit SEMISEMI  { label 1 (Literal $1) }
  | error         { raise (Error(pos 1, Syntax_error)) }
The lit rule is trivial:

lit :
    FALSE  { Bool false }
  | TRUE   { Bool true }
  | CHAR   { Char $1 }
  | FLOAT  { Float $1 }
  | INT    { Int $1 }
  | STRING { String $1 }
Unlike ocamllex, which generates no interface file and is happy with whatever the user provides as long as it is consistent with the implementation, ocamlyacc generates an interface file. As we keep our parser in parser.mly, a parser.mli file will be generated.
The problem is that this file does not contain any documentation for ocamldoc and does not export our Parser.Error exception (nor the string_of_error function that we will introduce below).
We solve this problem by providing a custom interface in parser.override.mli, as explained in a previous post.
We will see in the ocamlbuild section what is necessary to use this custom interface file in place of the ocamlyacc-generated one.
For now, we will read, parse and print interactive input from standard input.
The intended behavior is as follows: the interpreter prints a primary prompt “# ” at the start of a top-level interactive statement, and a continuation prompt “  ” elsewhere. To control buffering and prompting, we will use Lexing.from_function, which builds a lexing buffer that obtains its input by calling the given function.
The input function must refill a finite-size buffer. In order to read input line by line, we will call input_line, which returns a new string of arbitrary length. We will thus need our own buffer.
We will also need a flag to remember whether the next prompt to be printed should be the primary prompt or the continuation prompt. We will store the prompt itself directly.
For simplicity, as the program handles at most one interactive input at a time, we store both the input buffer and the current prompt as mutable global variables in our driver program, photon.ml.

let input_buf = ref ""
let curr_prompt = ref "# "
We then define the refill function. If there is no buffered input, we print the prompt, read one line, and return as much of it as we can, buffering the rest, if any. If some input is still present in the buffer, we return as much of it as we can, keeping the rest. Once we have read a line, we set the prompt to the continuation prompt until further notice.
let refill s n =
  (* Refill the internal buffer if necessary *)
  if String.length !input_buf = 0 then (
    print_string !curr_prompt;
    flush stdout;
    curr_prompt := "  ";
    try input_buf := input_line stdin ^ "\n"
    with End_of_file -> ()
  );
  (* Copy (at most n) characters from the internal buffer to s *)
  let buf_len = String.length !input_buf in
  let n_min = min n buf_len in
  String.blit !input_buf 0 s 0 n_min;
  (* Update the internal buffer *)
  input_buf := String.sub !input_buf n_min (buf_len - n_min);
  (* Return the number of stored characters *)
  n_min
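Note that on recent OCaml versions (where strings are immutable), the function passed to Lexing.from_function receives a bytes buffer rather than a string, so the String.blit call above no longer type-checks. A sketch of the same logic adapted to bytes, assuming the same two globals:

```ocaml
let input_buf = ref ""
let curr_prompt = ref "# "

(* Same refill logic, but writing into the [bytes] buffer that
 * [Lexing.from_function] supplies on modern OCaml. *)
let refill (s : bytes) (n : int) =
  (* Refill the internal buffer if necessary *)
  if String.length !input_buf = 0 then (
    print_string !curr_prompt;
    flush stdout;
    curr_prompt := "  ";
    try input_buf := input_line stdin ^ "\n"
    with End_of_file -> ()
  );
  (* Copy (at most n) characters from the internal buffer to s *)
  let buf_len = String.length !input_buf in
  let n_min = min n buf_len in
  Bytes.blit_string !input_buf 0 s 0 n_min;
  (* Update the internal buffer and return the number of stored characters *)
  input_buf := String.sub !input_buf n_min (buf_len - n_min);
  n_min
```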
Finally, we define a reset_input function to discard buffered input and restore the primary prompt.

let reset_input () =
  input_buf := "";
  curr_prompt := "# "
When discarding buffered input, we must not forget the input that has already been buffered in lexbuf. We add to Lexer a new reset function for this purpose2). reset also resets the position of the given lexing buffer.

let reset lexbuf =
  flush_input lexbuf;
  set_line_num lexbuf 1
Our main function creates the lexing buffer, then enters the read-eval-print loop (REPL), which repeatedly parses a top-level interactive statement and prints its AST, until an End_of_file exception is raised.
Similarly to OCaml, we do not process further input on the same line as a ;; token in interactive mode. Each time a top-level interactive phrase has been parsed (each one ending in ;;), we discard the remaining buffered input.
If a lexical or syntax error occurs, it is printed, then the loop resumes. As we discard remaining input at the start of the loop, it will not trigger further errors. The compilation of the current top-level statement is thus completely aborted when an error occurs.
let main () =
  let lexbuf = Lexing.from_function refill in
  Lexer.set_file_name lexbuf "<stdin>";
  (* Read-eval-print loop *)
  try
    while true do
      (* We drop any input that possibly remains, either because it follows
       * [;;], or because an error was met *)
      Lexer.reset lexbuf;
      reset_input ();
      try
        let ast = Parser.interactive Lexer.token lexbuf in
        print_endline (string_of_ast ast)
      with
        Lexer.Error err -> prerr_endline (Lexer.string_of_error err)
      | Parser.Error err -> prerr_endline (Parser.string_of_error err)
    done
  with End_of_file -> print_endline ""
This code makes use of string_of_ast and string_of_error functions, which convert respectively an AST and an error into a string. We have chosen not to use the Format or Printf modules, as both need the ('a, 'b, 'c, 'd, 'e, 'f) format6 type. This special type is not trivial to implement and will be avoided (at least) in this first version of Photon. We also did not use a string buffer: AST printing is used only for debugging, and errors should be both rare (hopefully) and short. However, we may change this in the future if string concatenation becomes too detrimental to performance.
(** [string_of_ast ast] returns the string representation of [ast], ignoring
 * labels. *)
let string_of_ast ast =
  match ast.node with
    Literal lit ->
      let string_of_lit = function
          Bool b -> "Bool(" ^ (if b then "true" else "false") ^ ")"
        | Char c -> "Char('" ^ Char.escaped c ^ "')"
        | Float f -> "Float(" ^ string_of_float f ^ ")"
        | Int i -> "Int(" ^ string_of_int i ^ ")"
        | String s -> "String(\"" ^ String.escaped s ^ "\")" in
      "Literal(" ^ string_of_lit lit ^ ")"
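A self-contained variant of this printer can be exercised directly; this sketch restates the literal types and, for brevity, takes the node itself rather than a labelled record:

```ocaml
type lit =
    Bool of bool
  | Char of char
  | Float of float
  | Int of int
  | String of string

type node = Literal of lit

(* Same printing logic as [string_of_ast], applied directly to a node. *)
let string_of_ast node =
  match node with
    Literal lit ->
      let string_of_lit = function
          Bool b -> "Bool(" ^ (if b then "true" else "false") ^ ")"
        | Char c -> "Char('" ^ Char.escaped c ^ "')"
        | Float f -> "Float(" ^ string_of_float f ^ ")"
        | Int i -> "Int(" ^ string_of_int i ^ ")"
        | String s -> "String(\"" ^ String.escaped s ^ "\")" in
      "Literal(" ^ string_of_lit lit ^ ")"

let () = print_endline (string_of_ast (Literal (String "a\tb")))
```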
let string_of_error (pos, err) =
  let pos = string_of_position pos in
  let err =
    match err with
      Illegal_char c ->
        "character '" ^ Char.escaped c ^ "' is illegal in this context"
    | Invalid_escape_sequence s -> "invalid escape sequence '" ^ s ^ "'"
    | Too_big_int i -> i ^ " exceeds the range of representable integers"
    | Unterminated_comment -> "unterminated comment"
    | Unterminated_string -> "unterminated string literal" in
  pos ^ ": " ^ err
let string_of_error (pos, err) =
  let pos = string_of_position pos in
  let err =
    match err with
      Syntax_error -> "Syntax error" in
  pos ^ ": " ^ err
As both Lexer.string_of_error and Parser.string_of_error make use of the string representation of a position, we extend Position with a string_of_position function.
(** [string_of_position pos] returns the string representation of position
 * [pos]. *)
let string_of_position { pos_start = s; pos_length = len } =
  let name = s.pos_fname in
  let line_str = string_of_int s.pos_lnum in
  let col = s.pos_cnum - s.pos_bol + 1 in
  let col_str = string_of_int col in
  name ^ ":" ^ line_str ^ "." ^ col_str
  ^ if len <= 1 then "" else ("-" ^ string_of_int (col + len - 1))
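This function can be tried in isolation; the sketch below assumes the Position.t record shape { pos_start : Lexing.position; pos_length : int } used by the earlier pos helper:

```ocaml
(* Assumed record shape for [Position.t], built on [Lexing.position]. *)
type t = {
  pos_start : Lexing.position;
  pos_length : int;
}

let string_of_position { pos_start = s; pos_length = len } =
  let open Lexing in
  let col = s.pos_cnum - s.pos_bol + 1 in
  s.pos_fname ^ ":" ^ string_of_int s.pos_lnum ^ "." ^ string_of_int col
  ^ if len <= 1 then "" else ("-" ^ string_of_int (col + len - 1))

let () =
  (* A 3-character-long region starting at column 5 of line 1 *)
  let s =
    { Lexing.pos_fname = "<stdin>"; pos_lnum = 1; pos_bol = 0; pos_cnum = 4 } in
  (* Prints <stdin>:1.5-7, the format seen in the error messages below *)
  print_endline (string_of_position { pos_start = s; pos_length = 3 })
```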
Finally, we add a call to main at the end of photon.ml.
let _ = main ()
We have now gathered sufficient code to be able to compile our project. For this purpose, we use ocamlbuild. Ocamlbuild is the kind of tool that makes you wonder how you ever managed without it. While the OCaml compiler requires us to be very explicit about dependencies when compiling files, ocamlbuild handles it all for us. It also knows about ocamllex and ocamlyacc and will call them for us.
All we have to do to obtain an executable version of our interpreter is to enter ocamlbuild photon.native in our project directory. To generate the documentation with ocamldoc, we enter instead ocamlbuild photon.docdir/index.html, and that's it.
At least, it would have been, had we not used a custom parser interface in parser.override.mli and comments with a * on each line. To configure ocamlbuild to suit our needs, we add to the project the following plugin, responsible for adding the -stars argument to calls to ocamldoc and for copying parser.override.mli onto parser.mli, as described in this previous post.
open Ocamlbuild_plugin

(* Hook handler containing our extensions for the [ocamlbuild] system. *)
let hook_handler = function
    After_rules ->
      (* Adds '-stars' to ocamldoc invocations *)
      flag ["doc"; "ocaml"] (S[A"-stars"]);
      (* Overrides ocamlyacc-generated parser.mli with parser.override.mli *)
      copy_rule "Overrides Parser interface"
        ~insert:(`before "ocamlyacc")
        "parser.override.mli" "parser.mli"
  | _ -> ()

(* Registers our hook handler when this module is loaded (by ocamlbuild). *)
let _ = dispatch hook_handler
Having just said how nice and easy ocamlbuild is, we still met a problem that we were only able to solve with the addition of a Makefile. The problem is that we used some UTF-8 characters in comments (for diagrams; have a look at the generated documentation), while ocamldoc only supports ISO-8859-1. After having generated the documentation, we need to replace each occurrence of iso-8859-1 with utf-8 in the generated HTML files.
We would have liked to be able to do this cleanly in ocamlbuild, but we did not find a way to do it. The problem is that there seems to be no provision for extending an existing rule. Once ocamlbuild has found a rule that generates the requested resource, it calls that rule and ignores all others. We can easily replace a rule by specifying another one that appears earlier in the rule list, but that rule must then realize the full generation, not only an after-the-fact modification.
This would not be problematic if we had access to the existing rule's commands when building the replacement rule (calling the old rule's commands, then applying our own), but we could not find a way to do that.
We also tried to play with dependencies (including with stamps such as %.docdir/html.stamp) to generate the documentation using the predefined ocaml rule as a dependency, and then apply our modifications. It worked, but only as long as you are ready to ask for a custom target such as photon.docdir/utf8.stamp instead of the well-known photon.docdir/index.html.
Finally, out of luck, we decided to write the following Makefile, resorting to ocamlbuild for all dependency analysis, but running a sed after documentation generation. If you find a way to do this cleanly, entirely in ocamlbuild, we are eager to hear it.
native:
	ocamlbuild photon.native

doc:
	ocamlbuild photon.docdir/index.html
	sed -i "s/iso-8859-1/utf-8/" photon.docdir/*.html

clean:
	ocamlbuild -clean
One should then build the documentation with make doc instead of ocamlbuild photon.docdir/index.html. The native and clean targets are provided for convenience (e.g. we just have to type :mak in Vim to compile our project).
Although our parser is ridiculously trivial and our pipeline is only half complete, at least we have got some code to compile and to run. We can now parse some values and check the behavior of our interpreter when lexical or syntax errors occur.
% ./photon.native
# 42 ;; what follows a ';;' is ignored in interactive mode
Literal(Int(42))
# '\068' (* some comment *);;
Literal(Char('D'))
# "foo
  calls for
  \tbar";;
Literal(String("foo\ncalls for\n\tbar"))
# 123 456;;
<stdin>:1.5-7: Syntax error
# 42 @
<stdin>:1.4: character '@' is illegal in this context
# true;;
Literal(Bool(true))
To check end-of-file error conditions, one can load input from a file.
% echo " \"unterminated string\nblah blah" > /tmp/test.ml
% ./photon.native < /tmp/test.ml
# <stdin>:1.4: unterminated string literal
#
In the next installment, we will add trivial typing and LLVM compilation to obtain our first complete pipeline, although for a completely trivial language.
The code accompanying this article is available in archive photon-tut-03.tar.xz or through the git repository:
git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 3-basic_parsing.1