User Tools

Site Tools


blog:2011:01:10:introduction

Photon Compiler Development: Introduction

This is the first article of a series (table of contents) about compiler development with LLVM using OCaml. We intend to develop a compiler for a subset of OCaml large enough to allow our compiler to compile itself.

Along the way, we will try to document the process in a gentle, incremental manner. This series could serve as an intermediate tutorial about compiler implementation, more advanced than the LLVM tutorial but still more accessible than perusing the code of an industrial-strength compiler1).

Addressed Topics

Rather than establishing a long list of topics which will supposedly be addressed in this tutorial, We will only list those that have been already implemented and documented. You may also refer to the table of contents.

  • Lexical analysis (using ocamllex)
  • Error reporting (with the handling of positions of errors in source input)

Requirements

You will need a bit of general computer science background, with emphasis on compilers and type systems. We will try to be self-contained, but will not enter the details of background material. We will give references should you need some refreshment or further information.

You are supposed to know at least a little bit of OCaml, which we will use both as our compiler and target programming language (more about this below). However, we will only make use of basic features of the language (no objects, no functors, no polymorphic variants) and having read any tutorial should be more than enough.

Having read the introductory LLVM tutorial is not strictly necessary, but it may help.

The Target Language

We intend to develop a new programming language called Photon. However, it is still much in flux. Rather than procrastinate the compiler development in favor of the language design, we decided to implement first a compiler for a simple subset of the OCaml language.

Our idea is to use this subset of OCaml both as the source and the target language, which presents many advantages:

  • We avoid language design issues entirely in the first phase.
  • We intend to make Photon a strict, impure functional programming language so OCaml is not too far from it.
  • OCaml syntax is familiar.
  • OCaml is well suited to write compilers.
  • OCaml has good, official bindings to LLVM.
  • It is rather easy to extract a “simple” subset from OCaml.
  • We will be able to test our implementation by compiling test programs both with our compiler and the OCaml one, comparing their results.
  • The compiler should in the end be able to compile itself, bootstrapping the development of Photon in Photon.

The compiler code should be self-contained (using the foreign function interface only for LLVM bindings and very few primitives) and legible2).

As a first approximation, our subset superficially looks like OCaml without its object sub-system, polymorphic variants and functors.

If we reach this target milestone, we may then continue this series with Photon language design and accompanying modifications to our compiler.

Overview of the Compiler

Compiler Pipeline Our compiler follows the traditional compiling pipeline. Simplifying a bit for now, the compilation process proceeds as follows.

First, the Lexer does the lexical analysis, i.e. consume source input as characters to produce an output stream of tokens. Tokens are atomic elements from a syntactic point of view such as identifiers or integer literals.

Second, the Parser does the syntactic analysis, i.e. check that source program conforms to the grammar of the language. The token stream is translated by the parser into an abstract syntax tree (AST) representing the input program.

Third, the Typer annotates the untyped AST with types, using a mix of user-provided annotations and type inference.

Finally, the Compiler transforms the typed AST into LLVM intermediate representation (IR). The LLVM IR is suitable to compilation to native executables or just-in-time (JIT) compilation for interpretation. This last operation is handled by the LLVM tools and library.

Source Code

Accompanying source code can be obtained from archives, linked from the relevant pages, or through the git repository (see tags)

git clone http://git.legiasoft.com/photon.git
cd photon
git tag

License

Except otherwise noted, our code falls under the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

See the source code distribution for details.

This tutorial is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0).

Request for Comments

You are welcome to make comments and remarks, as well on the contents as on the form (even spelling or syntax mistakes). You can let your comments in the discussion part at end of page, or by e-mail to devmusings@legiasoft.com.

1) However, efforts such as the GHC commentary help.
2) A compiler for any language could be written with any turing-complete language, but you would not necessarily like to read its code.
blog/2011/01/10/introduction.txt · Last modified: 2011/02/20 20:28 (external edit)