sec/compiler.ltx


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119

\documentclass[../main.tex]{subfiles}

\begin{document}
\section{Idris 2 Compiler for Artyst}%
\label{sec:compiler}

To aid the development of the System~T reducer and encoding
algorithms, we have implemented a type checker and compiler for
\lang{}. The compiler is written in Idris~2~\cite{idris}, a
dependently typed programming language. The compiler is split into
three main components:
\begin{itemize}
  \item First we have a parser based on \textcite{alg-parsing}. The
    parser interface guarantees that we process the input without
    backtracking. This produces an intrinsically-well-scoped abstract
    syntax tree.

  \item Second we perform type checking using a bidirectional
    algorithm~\cite{bidir}. Along with the usual type checking result,
    the algorithm produces a runtime-erased typing derivation.

  \item Finally we compile the term to Guile~Scheme~\cite{guile}. By
    requiring an erased type derivation, the compiler can statically
    guarantee it will only ever have to process well-typed terms.
\end{itemize}

We present a simplified form of the Idris types used in the compiler.

\subsection{Parser}

The \lang{} parser is derived from \textcite{bidir}. At its core is a
data type for parser combinators.\@ \textcite{bidir} gave a type system
for the combinators such that well-formed combinators can parse an
input: in linear time; without backtracking; and with only a single
token of lookahead. They implemented both an interpreter and a staged
compiler for expressions. The interpreter only reports errors about
the combinators at run time.

We adapted the combinator interpreter for Idris 2. First, we used
dependant types to statically enforce a combinator is well-formed. The
``parse'' interpretation function requires a certificate that a parser
is well-formed. Second, our interpreter produces a certificate that
the list of unprocessed tokens is a suffix of the input tokens. This
certificate is required to satisfy the Idris termination checker when
interpreting fixed point combinators. Finally, we thread state
throughout the combinators.

It is common for parsers to rely on some additional
state~\cite{states}. For example \lang{} keeps an environment of bound
names as state. As Idris has dependant types, we can make the
result of parsing a combinator depend on the initial state.

A parser combinator can interact with state when:
\begin{itemize}
  \item processing a sequence, where it can update the state between
    items; or
  \item performing a semantic action (modifying the parse result),
    where it can read the state.
\end{itemize}
For example, consider parsing the syntax \verb|\x => x|, binding the
variable \verb|x| before referencing it. After processing the first
occurrence of \verb|x|, the \lang{} parser updates the state adding
\verb|x| to the environment of bound names. After parsing the second
\verb|x| as an identifier, the parser can look up the name in the
environment and convert it into a deBruijn index.

The state is not stored globally, but is local to the current parser
combinator. We can imagine this as a stack of states. Before parsing
the first item in a sequence, we push a copy of the current state on
top of the stack. After parsing that item, we pop the local state off
the stack. We can then modify the original state using the first
item's value. We then copy this modified state onto the stack before
parsing the second item, and so on.

The parser returns an intrinsically well-scoped
term~\cite{iwss}. Terms are a dependent data type with signature
\verb|Context -> Type|. All variable names within a term are
guaranteed to be bound within the context parameter or a term
constructor such as \letkw{} or \foldkw{}. The term is
\emph{intrinsically} well-scoped because the binding rules are
enforced by the term constructors; an extrinsically well-scoped term
would use a separate predicate to prove raw syntax is well-scoped.

Normally parsers produce ``raw'' abstract syntax trees with concrete
names instead of deBruijn indices. We use dependant state to directly
create the well-scoped AST because starting from a well-scoped AST
simplifies the type checker.

We also had the option to produce a raw AST and an extrinsic
certificate it is well-scoped. Extrinsic certificates are best for
certifying properties of preexisting structures. As our parser is
creating the well-scoped AST, the raw AST cannot have been used in any
prior computations, so we do not need to reference the raw AST.

\subsection{Bidirectional Type Checker}
\TODO{
  \begin{itemize}
    \item describe how type checking produces a derivation
    \item explain the derivation is runtime erased
    \item justify extrinsic derivation
    \item justify the derivation having no runtime cost
    \item describe why knowing there is a derivation is necessary
    \item explain why these extra features do not guarantee compiler correctness
  \end{itemize}
}

\subsection{Scheme Backend}
\TODO{
  \begin{itemize}
    \item justify how dynamic types make Scheme a good compilation target
    \item describe how products are compiled to assoc-lists
    \item describe how sums are compiled to tagged values
    \item describe how dynamic typing removes the need for roll
    \item explain why fold needs to know the term has a valid type
    \item explain why fold must be compiled recursively with map
  \end{itemize}
}

\end{document}