This is the first article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.
Along the way, we will try to document the process in a gentle, incremental manner. This series could serve as an intermediate tutorial about compiler implementation, more advanced than the LLVM tutorial but still more accessible than perusing the code of an industrial-strength compiler1).
Rather than establishing a long list of topics which will supposedly be addressed in this tutorial, We will only list those that have been already implemented and documented. You may also refer to the table of contents.
You will need a bit of general computer science background, with emphasis on compilers and type systems. We will try to be self-contained, but will not enter the details of background material. We will give references should you need some refreshment or further information.
You are supposed to know at least a little bit of OCaml, which we will use both as our compiler and target programming language (more about this below). However, we will only make use of basic features of the language (no objects, no functors, no polymorphic variants) and having read any tutorial should be more than enough.
Having read the introductory LLVM tutorial is not strictly necessary, but it may help.
We intend to develop a new programming language called Photon. However, it is still much in flux. Rather than procrastinate the compiler development in favor of the language design, we decided to implement first a compiler for a simple subset of the OCaml language.
Our idea is to use this subset of OCaml both as the source and the target language, which presents many advantages:
The compiler code should be self-contained (using the foreign function interface only for LLVM bindings and very few primitives) and legible2).
As a first approximation, our subset superficially looks like OCaml without its object sub-system, polymorphic variants and functors.
If we reach this target milestone, we may then continue this series with Photon language design and accompanying modifications to our compiler.
First, the Lexer does the lexical analysis, i.e. consume source input as characters to produce an output stream of tokens. Tokens are atomic elements from a syntactic point of view such as identifiers or integer literals.
Second, the Parser does the syntactic analysis, i.e. check that source program conforms to the grammar of the language. The token stream is translated by the parser into an abstract syntax tree (AST) representing the input program.
Third, the Typer annotates the untyped AST with types, using a mix of user-provided annotations and type inference.
Finally, the Compiler transforms the typed AST into LLVM intermediate representation (IR). The LLVM IR is suitable to compilation to native executables or just-in-time (JIT) compilation for interpretation. This last operation is handled by the LLVM tools and library.
Accompanying source code can be obtained from archives, linked from the relevant pages, or through the git repository (see tags)
git clone http://git.legiasoft.com/photon.git cd photon git tag
Except otherwise noted, our code falls under the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
See the source code distribution for details.
This tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0).
You are welcome to make comments and remarks, as well on the contents as on the form (even spelling or syntax mistakes). You can let your comments in the discussion part at end of page, or by e-mail to firstname.lastname@example.org.