This is the second article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.
This article details the lexical analysis process.
Lexical analysis, also known as lexing or tokenization, is the process of converting the program source code from a stream of characters into a stream of tokens. Tokens are atomic elements from a syntactic point of view, such as identifiers or integer literals.
While lexing can be programmed by hand, it is both easier and more legible to use a lexical analyzer generator to generate the lexing program (called a lexer or scanner) from token descriptions, generally using regular expressions.
The most well-known lexer generator is Lex, or its popular alternative Flex. A similar tool is available for the OCaml programmer under the name of ocamllex.
Ocamllex is provided with the OCaml distribution. It is documented in the OCaml manual. Another useful resource is the ocamllex tutorial, a translation of the flex manual to ocamllex.
We call our input file to ocamllex lexer.mll. Ocamllex will produce as output our lexer in the file lexer.ml. The input file has the following format:
{ (* Header: OCaml code *) }

(* Named regular expressions *)
let name = regexp

(* Rules *)
rule entry_point_1 [arg_1 ... arg_n] = parse
    pattern { action }
  | pattern { action }
and entry_point_2 [arg_1 ... arg_n] = parse
    pattern { action }
  | pattern { action }

{ (* Trailer: OCaml code *) }
Header and trailer contain ordinary OCaml code. The header will be prepended to the generated lexer while the trailer will be appended. Actions thus have access to definitions in the header, while definitions in the trailer have access to both header definitions and rules (each entry point being turned into a function).
Named regular expressions are there for legibility and reuse. They simply bind a regular expression to a name. The name can later be used in place of the regular expression, both in other regular expressions and in rules. For example:
let digit = ['0'-'9']
let int_lit = digit+
The syntax of regular expressions is similar to Lex, except for the mandatory single quotes around characters and double quotes around strings. You are referred to the documentation for details; we will point out the peculiarities as we meet them.
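For instance, the quoting rules show up immediately in named regular expressions such as the following (illustrative definitions of our own, not used later in this article):

let lparen = '('               (* a single character uses single quotes *)
let arrow = "->"               (* a literal string uses double quotes *)
let lowercase = ['a'-'z' '_']  (* a character set *)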
The rule part is the most important. Each entry point contains a mapping from patterns to actions, the latter defined as OCaml code. Patterns are defined using regular expressions, with a special syntax for binding parts of the matched input: regexp as name binds the input matched by the regular expression regexp to name in the corresponding action.
Each entry point is turned by ocamllex into a function, taking the arguments given in its definition plus a lexing buffer called lexbuf as a last argument. The lexing buffer type is defined in the standard library as Lexing.lexbuf.
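As a sketch of how the generated functions are used, a driver could create a lexing buffer and call an entry point repeatedly. Here we assume our entry point is called token, that it returns tokens of the Parser module, and that end of input is signalled by the EOF token we introduce later; this is illustrative, not part of the article's code.

let () =
  let lexbuf = Lexing.from_channel stdin in
  (* Pull tokens until end-of-file. *)
  let rec loop () =
    match Lexer.token lexbuf with
    | Parser.EOF -> ()
    | _ -> loop ()
  in
  loop ()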
The lexing process may fail. What happens if we meet an unrecognized operator? What happens if we meet the end-of-file in the middle of a string or a comment?
In these cases, and others, the user should be presented with a message describing the error. This message should specify a tentative location for the error in the input source code.
For each rule entry point, we must ask ourselves the following questions:

- Can every possible input character be matched by one of its patterns?
- What should happen if the end of input is reached while in this entry point?

We will keep these questions in mind while writing our lexer definition.
We signal lexical errors by means of OCaml exceptions. For this purpose, we define a lexical error type Lexer_error.t which enumerates the different kinds of lexical errors. Why define it in lexer_error.mli rather than directly in lexer.mli? OCaml has the strange peculiarity that implementation files do not inherit type definitions from their interface files. We would then need to define the type both in lexer.mli and lexer.ml. This would not be error-prone, as OCaml will still check that the types are equal1), but it would still be redundant and tedious to maintain.
Of course, we could dispense with writing a lexer.mli file in the first place. When no interface file is present, the OCaml compiler deduces one from the implementation file. However, every auxiliary entry point and function would then be exported, which would break encapsulation. In our compiler, we will follow the recommended practice of using interface files, except when a module is fully exported.
We also define an exception (this time directly in Lexer):
(** Lexical error exception. *)
exception Error of (Position.t * Lexer_error.t)
The Position.t will serve to denote a tentative position for the error, as will be explained shortly. We prefer to use a pair as the exception argument rather than multiple arguments in order to be able to treat lexical error exceptions more abstractly2).
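A caller can then handle all lexical errors uniformly, along the following lines (a sketch: the printing helpers Position.to_string and Lexer_error.to_string are hypothetical at this point):

(* Fetch one token, reporting any lexical error without inspecting its kind. *)
let token_or_report lexbuf =
  try Some (Lexer.token lexbuf) with
  | Lexer.Error (pos, err) ->
      prerr_endline (Position.to_string pos ^ ": " ^ Lexer_error.to_string err);
      None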
To present the user with a tentative position for the error, we need to be able to identify the position of the currently matched input in the source code.
This is handled partially by ocamllex. The lexbuf type contains both the starting position of the currently matched input (lexbuf.lex_start_p) and the position of the character following the end of the matched input (lexbuf.lex_curr_p).
Both positions are of the Lexing.position type:
type position = {
  pos_fname : string; (* file name *)
  pos_lnum : int;     (* line number *)
  pos_bol : int;      (* offset of the beginning of current line *)
  pos_cnum : int;     (* offset relative to the start of the file *)
}
However, ocamllex only updates the pos_cnum field of positions, i.e. the byte offset since the start of the input. It would not be convenient for users to have error positions reported so crudely.
While we could always recover a line and column number from an offset for a given file, this is not always possible for interactive input (part of the input may have been discarded from the buffer). We have therefore opted to update the pos_lnum and pos_bol fields during lexing. We thus always know the current line (given by pos_lnum) and column (obtained by subtracting pos_bol from pos_cnum) of the currently matched input.
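In other words, once these fields are maintained, line and column can be read off any position; a minimal sketch (the helper name is ours):

(* Line and column of a [Lexing.position] whose [pos_lnum] and [pos_bol]
 * fields are kept up to date. *)
let line_and_column pos =
  (pos.Lexing.pos_lnum, pos.Lexing.pos_cnum - pos.Lexing.pos_bol)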
To keep pos_lnum and pos_bol up to date, it suffices to call the function Lexing.new_line on lexbuf each time we see a line feed. This means that patterns possibly matching a line feed must be split into one pattern never matching a line feed and one pattern always matching a line feed (the latter calling new_line accordingly in its action).
For cases where a line feed is present but does not occur at the end of the matched pattern, we define a new_line_with_offset function that updates the given lexing buffer by recording a new line offset characters before the end of the matched input.
Like OCaml itself, we keep track not only of the starting position but also of the matched element's length. We encapsulate our extended position in a new Position module.
(** Positions in file. *)
type t = {
  pos_start : Lexing.position; (** start of our position as a [Lexing.position] *)
  pos_length : int;            (** number of bytes occupied by the positioned element *)
}
In the lexer, we define the following functions to manipulate the positions.
(* [curr_pos lexbuf] returns the position of the currently matched input in
 * [lexbuf]. *)
let curr_pos lexbuf =
  let start = lexbuf.lex_start_p in
  let curr = lexbuf.lex_curr_p in
  { pos_start = start;
    pos_length = curr.pos_cnum - start.pos_cnum; }

(* Updates the current position of [lexbuf] to account for a line feed [offset]
 * characters before the end of the currently matched input. *)
let new_line_with_offset lexbuf offset =
  let pos = lexbuf.lex_curr_p in
  lexbuf.lex_curr_p <- { pos with
    pos_lnum = pos.pos_lnum + 1;
    pos_bol = pos.pos_cnum - offset; }

(** [set_file_name lexbuf name] sets the file name of [lexbuf] to [name]. *)
let set_file_name lexbuf name =
  let pos = lexbuf.lex_curr_p in
  lexbuf.lex_curr_p <- { pos with pos_fname = name; }

(** [set_line_num lexbuf n] sets the line number of [lexbuf] to [n] and its
 * column number to 1. *)
let set_line_num lexbuf n =
  let pos = lexbuf.lex_curr_p in
  lexbuf.lex_curr_p <- { pos with
    pos_lnum = n;
    pos_bol = pos.pos_cnum; }
Tokens are defined in the Parser module. They are just constructors of the Parser.token type. Some are nullary constructors, such as QUOTE, while others have arguments, such as INT of int. By convention, token constructors are uppercase (this will make the grammar more legible; more about this later).
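As an illustrative sketch, the token type could look as follows; the actual type will be generated from the parser specification, and the constructor set will grow as the series progresses.

type token =
    INT of int        (* integer literal *)
  | FLOAT of float    (* floating-point literal *)
  | CHAR of char      (* character literal *)
  | STRING of string  (* string literal *)
  | IDENT of string   (* identifier *)
  | TRUE | FALSE      (* keywords *)
  | QUOTE             (* the ' operator *)
  | EOF               (* end of input *)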
Now that we have seen all this preliminary material, it is time to jump into our lexer definition proper. In this presentation, we will try to follow the OCaml lexical conventions closely.
Our lexer has three rule entry points. The first one, named token, takes no argument other than the implicitly passed lexing buffer lexbuf. It is the one that will be invoked by the parser to obtain tokens, and most of its actions return a token directly.
The two other entry points are comment and string_. These auxiliary entry points are used to process comments and strings respectively, as their names imply.
We consider a newline to be a line feed (\n), optionally preceded by a carriage return (\r). It is defined separately as it is reused in several places.
let newline = '\r'? '\n'
Blanks are defined as spaces, horizontal tabulations, carriage returns, line feeds and form feeds. The form feed, having no specific escape sequence in OCaml, is written \012. We omit the line feed from the character set below, as we handle it separately (we need to call new_line when we meet one).
let blank = [' ' '\t' '\r' '\012']
We are told to ignore blanks, so when we meet some we just call token recursively to process further input.
rule token = parse
    newline { new_line lexbuf; token lexbuf }
  | blank+ { token lexbuf }
OCaml comments are introduced by (* and closed by *). They can nest.
In token, we introduce a pattern to match the start of a comment. When we match a comment start, we skip the comment to its end using the comment entry point. We then proceed to the next token by calling token again.
| "(*" { comment (curr_pos lexbuf) lexbuf; token lexbuf }
Skipping a comment is done by ignoring any character until the end-of-comment *) is met. However, we must take care of possible nested comments. This is accomplished by also matching the start of a comment: if one is met, we call comment recursively to match the inner comment, and then once more to match the end of the current comment. The nesting depth is conveniently stored implicitly in the call stack.
It is an error if a comment is still open when we meet the end of the input. If end-of-file is met in the comment entry point, we raise an exception. The position of the end-of-file would be of little use to the developer, so we indicate the starting position of the unterminated comment instead. For this purpose, we pass the comment starting position as an argument to comment.
Finally, the _ regular expression matches any character. The lexer prefers longest prefix matches3). When it meets a ( character, it looks at the next character. If it is a *, it chooses the "(*" pattern; otherwise it chooses the _ pattern (keeping the second character for later).
and comment start_pos = parse
    "(*" { (* Opening of a nested comment *)
           comment (curr_pos lexbuf) lexbuf;
           (* End of the current comment *)
           comment start_pos lexbuf }
  | "*)" { () }
  | newline { new_line lexbuf; comment start_pos lexbuf }
  | eof { raise (Error(start_pos, Unterminated_comment)) }
  | _ { comment start_pos lexbuf }
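For example, on the input below, comment matches the inner (*, calls itself to consume b *), returns, and keeps scanning until the final *) closes the outer comment:

(* a (* b *) c *)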
We will ignore identifiers for now; we will return to them in the section about keywords.
The regular expression for integer literals is a straightforward translation of the one given in the OCaml lexical convention.
let digit = ['0'-'9']
let bin_digit = ['0'-'1']
let oct_digit = ['0'-'7']
let hex_digit = digit | ['A'-'F' 'a'-'f']
let int_lit =
    '-'? digit (digit | '_')*
  | '-'? '0' ['x' 'X'] hex_digit (hex_digit | '_')*
  | '-'? '0' ['o' 'O'] oct_digit (oct_digit | '_')*
  | '-'? '0' ['b' 'B'] bin_digit (bin_digit | '_')*
The translation is made trivial by the fact that our source and target languages are the same. We just have to call the OCaml standard function int_of_string to convert the string representation of an integer literal to an integer. This conversion may fail, however, if the provided literal is too big (i.e. exceeds the representation range of OCaml integers). We define a new function checked_int_of_string to detect this error condition and report it accordingly.
let checked_int_of_string s lexbuf =
  try int_of_string s
  with Failure "int_of_string" ->
    raise (Error(curr_pos lexbuf, Too_big_int s))
token is then extended with
| int_lit as i { INT (checked_int_of_string i lexbuf) }
Floating-point literals are handled similarly to integer literals, except that we do not check for errors (like OCaml, we accept a possible loss of precision between the given literal and its floating-point representation). The regular expression is
let exponent = ['e' 'E'] ['+' '-']? digit (digit | '_')*
let float_lit =
  '-'? digit (digit | '_')* ('.' (digit | '_')* exponent? | exponent)
token is extended with
| float_lit as f { FLOAT (float_of_string f) }
Character literals are enclosed in single quotes. They can be regular characters (i.e. any character different from a single quote or a backslash) or escape sequences. We exclude line feeds from regular characters and handle them separately, so as to update the position in that case.
Escape sequences can be one of the seven symbolic escape sequences (\b for backspace, \n for line feed, \r for carriage return, \t for horizontal tabulation, \\ for \, \' for ' and \" for ") or a numeric escape sequence, either in decimal or in hexadecimal. We define the regular expressions without the enclosing single quotes to be able to reuse them in strings.
let regular_char = [^'\n' '\'' '\\']
let decimal_char = "\\" digit digit digit
let hex_char = "\\x" hex_digit hex_digit
let escaped_char = "\\" ['b' 'n' 'r' 't' '\\' '\'' '"']
We may seem a bit too liberal in the regular expression for decimal escape sequences (it allows values over 255), but we will check the value during conversion. This allows better error messages (rather than just reporting a syntax error, we can tell the user that a wrong escape sequence was used). It is a small departure from the lexical convention, but it is the way the OCaml compiler is actually implemented.
For the escape sequences, we define three helper functions returning the character corresponding to the escape sequence given as a character string. If we receive a decimal escape sequence with a value over 255, we raise an exception.
let invalid_escape lexbuf seq =
  raise (Error(curr_pos lexbuf, Invalid_escape_sequence seq))

(* Converts [\ddd] escape sequence to corresponding character. *)
let char_of_decimal seq lexbuf =
  try Char.chr (int_of_string (String.sub seq 1 3))
  with Invalid_argument "Char.chr" -> invalid_escape lexbuf seq

(* Converts [\xhh] escape sequence to corresponding character. *)
let char_of_hex seq =
  let code = int_of_string ("0x" ^ String.sub seq 2 2) in
  Char.chr code

(* Converts [\l] escape sequence to corresponding character. If [l] is not [n],
 * [r], [t] or [b], [l] is returned. *)
let char_of_escape seq =
  match seq.[1] with
    'n' -> '\n'
  | 'r' -> '\r'
  | 't' -> '\t'
  | 'b' -> '\b'
  | c -> c
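As a worked example: on the matched text \065, char_of_decimal extracts the substring "065", converts it to 65, and returns 'A'; a matched text such as \999 makes Char.chr raise Invalid_argument, which we turn into an Invalid_escape_sequence error.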
In char_of_escape, we just return the escaped character itself when it is not n, r, t or b. A call char_of_escape "\x" would thus dubiously return x. This should hopefully happen only for \, ' and " as the function is only called on strings matching escaped_char.
token is adapted accordingly:
| "'" newline "'" { new_line_with_offset lexbuf 1; CHAR '\n' } | "'" (regular_char as c) "'" { CHAR c } | "'" (decimal_char as d) "'" { CHAR (char_of_decimal d lexbuf) } | "'" (hex_char as h) "'" { CHAR (char_of_hex h) } | "'" (escaped_char as e) "'" { CHAR (char_of_escape e) } | "'" ("\\" _ as s) { invalid_escape lexbuf s }
Note how the pattern for newline calls new_line_with_offset rather than just new_line, to account for the closing single-quote character.
A single quote is a valid OCaml operator. If none of our patterns matches on a ', a QUOTE token will be returned (e.g. 'ab' would be lexed as QUOTE, IDENT("ab") and QUOTE). Without the last pattern, '\x would first return QUOTE, then complain that the backslash is illegal (if the QUOTE did not trigger a syntax error before). We found it wiser to match this pattern and report an invalid escape sequence instead.
String literals are enclosed in double quotes ("). They are like character literals, except that they can contain more than one character and that it is the " that should be escaped instead of the '.
let string_char = [^'"' '\\']
When we see a ", we return the result of the auxiliary entry point string_4). This entry point takes a string buffer as argument in order to collect the string literal characters efficiently. We also pass the string starting position.
| '"' { let buf = Buffer.create 256 in string_ (curr_pos lexbuf) buf lexbuf }
string_ recognizes a character using the previously defined regular expressions and functions, then adds it to its string buffer. It then calls itself recursively to collect the rest of the string literal. The process ends when a " is read, in which case a STRING token is returned with the buffer contents.
As the parser will use lexbuf.lex_start_p to determine the position of the returned token, we must update lexbuf to record that the string literal starts at start_pos and not at the closing " before returning the string token.
and string_ start_pos buf = parse
    '"' { (* Restore start position in lexbuf *)
          lexbuf.lex_start_p <- start_pos.pos_start;
          STRING (Buffer.contents buf) }
  | "\\" newline (blank* as s)
        { new_line_with_offset lexbuf (String.length s);
          string_ start_pos buf lexbuf }
  | '\n' { new_line lexbuf;
           Buffer.add_char buf '\n';
           string_ start_pos buf lexbuf }
  | string_char as c
        { Buffer.add_char buf c;
          string_ start_pos buf lexbuf }
  | decimal_char as d
        { Buffer.add_char buf (char_of_decimal d lexbuf);
          string_ start_pos buf lexbuf }
  | hex_char as h
        { Buffer.add_char buf (char_of_hex h);
          string_ start_pos buf lexbuf }
  | escaped_char as e
        { Buffer.add_char buf (char_of_escape e);
          string_ start_pos buf lexbuf }
  | "\\" _ as s { invalid_escape lexbuf s }
  | eof { raise (Error(start_pos, Unterminated_string)) }
The second pattern simply implements the lexical convention stating that a backslash at the end of a line should be discarded along with all directly following blanks (this convention allows splitting long strings over multiple lines, optionally with indentation).
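For instance, the two literals below denote the same string, the second relying on the backslash-newline convention (the leading blanks on the second line are discarded):

let s1 = "hello world"
let s2 = "hello \
          world"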
While we think named and optional arguments are useful and desirable, we decided to avoid them in the first phase of Photon.
Accordingly, we will avoid using them in our compiler source code.
For the very same reasons, we chose to disallow user-defined operators in the first phase of Photon.
Pre-defined operators are matched directly in token, as the following example illustrates (we will add other symbols as the need arises).
| "'" { QUOTE }
Identifiers start with a letter or an underscore, and continue with letters, digits, underscores or single quotes:

let letter = ['A'-'Z' 'a'-'z']
let ident = (letter | '_') (letter | digit | ['_' '\''])*
We could match keywords directly in token, along with a pattern for identifiers:
  | "false" { FALSE }
  | "true" { TRUE }
  | ident as i { IDENT i }
However, as the number of keywords increases, this technique would rapidly blow up the size of the lexing tables ocamllex generates for our lexer.
Instead, we match only identifiers. But before returning an IDENT token, we check whether the matched identifier is a keyword, using a hash table mapping keyword strings to keyword tokens. We thus add the following function to the header:
let ident_or_keyword =
  (* The hash table is built once and for all when this module is loaded *)
  let kwds = [
    "false", FALSE;
    "true", TRUE;
  ] in
  let table = Hashtbl.create (List.length kwds) in
  let add_keyword (kwd, token) = Hashtbl.add table kwd token in
  List.iter add_keyword kwds;
  (* Returns the corresponding keyword token if [s] is a keyword and [IDENT s]
   * otherwise *)
  fun s ->
    try Hashtbl.find table s
    with Not_found -> IDENT s
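For example, ident_or_keyword "true" returns TRUE, while ident_or_keyword "truth" returns IDENT "truth". Building the table once at module load time also avoids reconstructing it for every identifier.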
In token, we keep only the single pattern
| ident as i { ident_or_keyword i }
Of course, we will have more keywords than just false and true. However, we keep these two as examples for now and will add the others as the need arises.
The definition for line number directives given in the lexical convention is at once too strict, too complex and incomplete.
It does not allow spaces, yet spaces are inserted by many tools, such as ocamllex itself, and accepted without problem by the OCaml compiler.
It is complex in that it allows escaping in the file name. We would have to either match the whole expression and decode it afterwards, or reuse the string_ entry point, at the risk of not meeting the expected end of directive after the string (which would require backtracking in order to return SHARP, INT and STRING tokens instead).
Finally, it does not mention the ending newline, which should however not be accounted for, according to the OCaml compiler's behaviour.
We thus decided to make our pattern a bit more liberal (by allowing spaces and tabulations between elements) and to keep the file name in its escaped form (it should therefore not be used as is, except to print the file name). We also forbid literal line feeds in the file name (a line number directive must stand on a single line).
Our patterns in token are
  | '#' [' ' '\t']* (digit+ as n) [' ' '\t']* newline
      { set_line_num lexbuf (int_of_string n);
        token lexbuf }
  | '#' [' ' '\t']* (digit+ as n) [' ' '\t']*
        '"' ([^'\n' '"']* as s) '"' [' ' '\t']* newline
      { set_line_num lexbuf (int_of_string n);
        set_file_name lexbuf s;
        token lexbuf }
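For instance, after lexing a directive such as the following (file name chosen for illustration), subsequent positions will refer to line 42 of lexer.mll:

# 42 "lexer.mll"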
We chose to represent the end-of-file by a specific EOF token. This token is returned when we meet the end-of-file, thanks to the predefined eof pseudo regular expression. We add to token:
| eof { EOF }
While the comment and string_ entry points match all possible characters, this is not yet the case for token. If the lexer tries to match a character for which no pattern matches, it will terminate the program, indicating an empty token without any position reference. As this would be most unfortunate, we instead match the wildcard character _ and raise an informative exception ourselves.
| _ as c { raise (Error(curr_pos lexbuf, Illegal_char c)) }
Having finished our lexer, we can now sum up our lexical errors.
(** Lexical errors. These are not included directly in [lexer.mli] to avoid
 * redundancy between [lexer.mli] and [lexer.mll]. *)
type t =
    Illegal_char of char
      (** given character is illegal in its context *)
  | Invalid_escape_sequence of string
      (** given escape sequence is not valid *)
  | Too_big_int of string
      (** given integer literal (as string) is not representable *)
  | Unterminated_comment
      (** end-of-file reached before all comments are closed *)
  | Unterminated_string
      (** end-of-file reached before seeing string ending double-quote *)
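Such a type naturally lends itself to a single error-printing function; a minimal sketch (the function name and the message wording are ours, not part of the article's code):

let to_string = function
    Illegal_char c ->
      Printf.sprintf "illegal character '%c'" c
  | Invalid_escape_sequence s ->
      Printf.sprintf "invalid escape sequence '%s'" s
  | Too_big_int s ->
      Printf.sprintf "integer literal '%s' is not representable" s
  | Unterminated_comment -> "unterminated comment"
  | Unterminated_string -> "unterminated string"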
In this installment, we saw how we can instruct ocamllex to build a lexer for our subset of the OCaml language.
In the next installment, we will develop a minimal parser and driver in order to be able to test our lexer.
The code accompanying this article is available in archive photon-tut-02.tar.xz or through the git repository:
git clone http://git.legiasoft.com/photon.git
cd photon
git checkout 2-lexing
1) The OCaml compiler checks the implementation in lexer.ml against the interface declared in lexer.mli, so diverging definitions would be rejected at compile time.
2) For example, a caller can catch Error err and pass err around rather than having to know about positions and lexical error kinds.
3) It is possible to use shortest instead of parse in a rule entry point definition to make the lexer prefer shorter matches.
4) The entry point is named string_ rather than string to avoid confusion with the name string.