\documentclass[../main.tex]{subfiles}
\begin{document}
\section{Idris 2 Compiler for Artyst}%
\label{sec:compiler}
To aid the development of the System~T reducer and encoding
algorithms, we have implemented a type checker and compiler for
\lang{}. The compiler is written in Idris~2~\cite{idris}, a
dependently typed programming language. It is split into three main
components:
\begin{itemize}
\item First, we have a parser based on \textcite{alg-parsing}. The
  parser interface guarantees that we process the input without
  backtracking. This produces an intrinsically well-scoped abstract
  syntax tree.
\item Second, we perform type checking using a bidirectional
  algorithm~\cite{bidir}. Along with the usual type checking result,
  the algorithm produces a runtime-erased typing derivation.
\item Finally, we compile the term to Guile~Scheme~\cite{guile}. By
  requiring an erased typing derivation, the compiler can statically
  guarantee it only ever processes well-typed terms.
\end{itemize}
Throughout, we present simplified forms of the Idris types used in the compiler.
\subsection{Parser}
The \lang{} parser is derived from \textcite{alg-parsing}. At its core
is a data type for parser combinators. \textcite{alg-parsing} gave a
type system for the combinators such that well-formed combinators can
parse an input: in linear time; without backtracking; and with only a
single token of lookahead. They implemented both an interpreter and a
staged compiler for these combinators. The interpreter, however, only
reports errors about ill-formed combinators at run time.
We adapted the combinator interpreter for Idris~2 in three ways.
First, we used dependent types to statically enforce that a combinator
is well-formed: the ``parse'' interpretation function requires a
certificate that a parser is well-formed. Second, our interpreter
produces a certificate that the list of unprocessed tokens is a suffix
of the input tokens. This certificate is required to satisfy the Idris
termination checker when interpreting fixed-point combinators.
Finally, we thread state throughout the combinators.
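As a minimal sketch of the suffix certificate (the names here are
hypothetical), it is a small inductive family relating the leftover
tokens to the original input:
\begin{verbatim}
-- `Suffix xs ys` proves that `xs` is a suffix of `ys`; every consumed
-- token adds one `Skip` to the certificate.
data Suffix : List a -> List a -> Type where
  Same : Suffix xs xs
  Skip : Suffix xs ys -> Suffix xs (y :: ys)

-- Hypothetical shape of the interpretation function: the result pairs
-- the parsed value with the leftover tokens and their certificate.
-- parse : (p : Combinator a) -> WellFormed p -> (toks : List Token) ->
--         Maybe (a, (rest : List Token ** Suffix rest toks))
\end{verbatim}
Recursive calls under a fixed-point combinator then happen on a
certified suffix of the input, which is the kind of evidence the
termination checker needs.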
It is common for parsers to rely on some additional
state~\cite{states}. For example, \lang{} keeps an environment of bound
names as state. As Idris has dependent types, we can make the
result of parsing a combinator depend on the initial state.
A parser combinator can interact with state when:
\begin{itemize}
\item processing a sequence, where it can update the state between
items; or
\item performing a semantic action (modifying the parse result),
where it can read the state.
\end{itemize}
For example, consider parsing the syntax \verb|\x => x|, which binds
the variable \verb|x| before referencing it. After processing the first
occurrence of \verb|x|, the \lang{} parser updates the state, adding
\verb|x| to the environment of bound names. After parsing the second
\verb|x| as an identifier, the parser looks up the name in the
environment and converts it into a de Bruijn index.
The state is not stored globally, but is local to the current parser
combinator. We can imagine this as a stack of states. Before parsing
the first item in a sequence, we push a copy of the current state on
top of the stack. After parsing that item, we pop the local state off
the stack. We can then modify the original state using the first
item's value. We then copy this modified state onto the stack before
parsing the second item, and so on.
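The following toy model shows this sequencing discipline; the names
are hypothetical, and unlike our real combinators, the results here do
not depend on the state:
\begin{verbatim}
record StParser s a where
  constructor MkP
  run : s -> List Char -> Maybe (a, List Char)

-- Parse the first item with a copy of the state, fold its result back
-- into the state, then parse the second item with the updated state.
seqWith : StParser s a -> (s -> a -> s) -> StParser s b ->
          StParser s (a, b)
seqWith p upd q = MkP $ \st, toks => do
  (x, toks')  <- p.run st toks
  (y, toks'') <- q.run (upd st x) toks'
  pure ((x, y), toks'')
\end{verbatim}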
The parser returns an intrinsically well-scoped
term~\cite{iwss}. Terms are a dependent data type with signature
\verb|Context -> Type|. All variable names within a term are
guaranteed to be bound within the context parameter or a term
constructor such as \letkw{} or \foldkw{}. The term is
\emph{intrinsically} well-scoped because the binding rules are
enforced by the term constructors; an extrinsically well-scoped term
would use a separate predicate to prove raw syntax is well-scoped.
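A minimal sketch of such a data type (the constructors are
hypothetical) binds variables with membership proofs:
\begin{verbatim}
import Data.List.Elem

data Term : (ctx : List String) -> Type where
  -- a variable carries a proof that its name occurs in the context
  Var : Elem x ctx -> Term ctx
  -- a lambda extends the context of its body with the bound name
  Lam : (x : String) -> Term (x :: ctx) -> Term ctx
  App : Term ctx -> Term ctx -> Term ctx
\end{verbatim}
An \verb|Elem| proof is structurally a de Bruijn index decorated with
the name it resolves to.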
Normally, parsers produce ``raw'' abstract syntax trees with concrete
names instead of de Bruijn indices. We use dependent state to directly
create the well-scoped AST because starting from a well-scoped AST
simplifies the type checker.
We also had the option to produce a raw AST and an extrinsic
certificate that it is well-scoped. Extrinsic certificates are best for
certifying properties of preexisting structures. As our parser creates
the well-scoped AST directly, the raw AST could not have been used in
any prior computations, so we never need to reference it.
\subsection{Bidirectional Type Checker}
\begin{figure}
\[
\begin{array}{cccc}
\begin{prooftree}
\hypo{x : A \in \Gamma}
\infer1{\judgement{\Gamma}{x}[\Rightarrow]{A}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma, x : A}{t}[\Leftarrow]{B}}
\infer1{\judgement{\Gamma}{\lambda x.t}[\Leftarrow]{A \to B}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{f}[\Rightarrow]{A \to B}}
\hypo{\judgement{\Gamma}{t}[\Leftarrow]{A}}
\infer2{\judgement{\Gamma}{f~t}[\Rightarrow]{B}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Rightarrow]{A}}
\hypo{\judgement{\Gamma, x : A}{u}[\Leftrightarrow]{B}}
\infer2{\judgement{\Gamma}{\lettm{t}{x}{u}}[\Leftrightarrow]{B}}
\end{prooftree}
\\\\
\begin{prooftree}
\hypo{\rangeover{\judgement{\Gamma}{t_i}[\Leftrightarrow]{A_i}}{i}}
\infer1{\judgement{\Gamma}{\tuple{\rangeover{t_i}{i}}}[\Leftrightarrow]{\prod_i A_i}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Rightarrow]{\prod_i A_i}}
\infer1{\judgement{\Gamma}{t.i}[\Rightarrow]{A_i}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Leftarrow]{A_i}}
\infer1{\judgement{\Gamma}{\tuple{i, t}}[\Leftarrow]{\sum_i A_i}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Rightarrow]{\sum_i A_i}}
\hypo{\rangeover{\judgement{\Gamma, x_i : A_i}{t_i}[\Leftarrow]{B}}{i}}
\infer2{\judgement{\Gamma}{\casetm{t}{\tuple{i,x_i}}{t_i}{i}}[\Leftarrow]{B}}
\end{prooftree}
\\\\
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Rightarrow]{A}}
\hypo{A = B}
\infer2{\judgement{\Gamma}{t}[\Leftarrow]{B}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Leftarrow]{A}}
\infer1{\judgement{\Gamma}{t : A}[\Rightarrow]{A}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Leftarrow]{\sub{A}{X/\mu X. A}}}
\infer1{\judgement{\Gamma}{\roll~t}[\Leftarrow]{\mu X. A}}
\end{prooftree}
&
\begin{prooftree}
\hypo{\judgement{\Gamma}{t}[\Rightarrow]{\mu X. A}}
\hypo{\judgement{\Gamma, x : \sub{A}{X/B}}{u}[\Leftarrow]{B}}
\infer2{\judgement{\Gamma}{\foldtm{t}{x}{u}}[\Leftarrow]{B}}
\end{prooftree}
\end{array}
\]
\caption{Bidirectional typing rules for
\lang{}.\@ We say \(t\) synthesises type \(A\) when
\(\judgement{\Gamma}{t}[\Rightarrow]{A}\) and \(t\) checks at type \(A\) when
\(\judgement{\Gamma}{t}[\Leftarrow]{A}\). We use \(\Leftrightarrow\) when a term can both
synthesise and check at a type. All arrows must resolve in the
same direction within the rule.}\label{fig:bidir}
\end{figure}
\Cref{fig:bidir} shows how we have modified the \lang{} typing rules
for bidirectional type checking. The changes are
standard~\cite{bidir}. Products can both check and synthesise types.
The \roll{} operator can only be checked, as there is no unique
correct guess for the shape type. The body of a \foldkw{} must be
checked: the induction hypothesis is added to the context, so its
type must be known before processing the body, and as the induction
hypothesis is filled by the motive, the only mode-correct way to
obtain the motive type is for \foldkw{} to be checked.
The Idris~2 type for certifying the bidirectional judgement is almost
a direct translation of the judgement in \cref{fig:bidir}. The
differences are down to products and sums allowing permutations of
their labels. Whilst on paper we can represent these \(n\)-ary types
as sets, in Idris they are realised by lists. This discrepancy
manifests in \lang{}'s type checker in three ways:
\begin{itemize}
\item when switching from synthesis to checking, the synthesised and
expected types are compared up to reordering, instead of equality;
\item for case splits, we check there is a permutation between
variants and branches; and
\item for checking tuples, we check that every label in the type has
  a unique corresponding component in the tuple.
\end{itemize}
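To sketch the shape of the resulting certificate family (all names
here, such as \verb|TArrow| and \verb|Reorders|, are hypothetical),
synthesis and checking are mutually defined, and the mode switch
carries a reordering instead of an equality:
\begin{verbatim}
mutual
  data Synths : TyEnv ctx -> Term ctx -> Ty -> Type where
    AppS : Synths env f (TArrow a b) -> Checks env t a ->
           Synths env (App f t) b
    -- ... one constructor per synthesis rule

  data Checks : TyEnv ctx -> Term ctx -> Ty -> Type where
    -- the mode switch compares types up to reordering of labels
    EmbedC : Synths env tm a -> Reorders a b -> Checks env tm b
    -- ... one constructor per checking rule
\end{verbatim}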
\TODO{describe how type checking produces a derivation}
For error reporting we introduced Idris types for negative
certifications. Values of these types indicate how the bidirectional
algorithm failed. Consider synthesising the type of an
application. A negative certificate has three possibilities:
\begin{itemize}
\item the function does not synthesise a type;
\item the function's type is not a function type;
\item the argument does not check at the domain type.
\end{itemize}
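A hedged sketch of this negative certificate, reusing the hypothetical
names from the positive sketch above:
\begin{verbatim}
data NotSynths : TyEnv ctx -> Term ctx -> Type where
  -- the function does not synthesise a type
  AppFunNS    : NotSynths env f -> NotSynths env (App f t)
  -- the function synthesises a non-function type
  AppNotArrNS : Synths env f a -> NotArrow a -> NotSynths env (App f t)
  -- the argument fails to check at the domain type
  AppArgNS    : Synths env f (TArrow a b) -> NotChecks env t a ->
                NotSynths env (App f t)
\end{verbatim}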
Negative certificates give a precise reason why the algorithm failed.
We also designed negative certificates to report as many errors as
possible. For example, when a tuple fails to check at a type, we can
record all of the fields that fail to check, not only the first. We
could enforce that negative certificates record every error by
requiring a certificate (positive or negative) for every field.
Because this guarantee is unnecessary for the rest of the compiler,
we have not done so. Our laxer certificates are still sufficiently
expressive to prove that a term cannot have both a positive and a
negative certificate.
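Concretely, exclusivity has a signature like the following (again with
the hypothetical names above):
\begin{verbatim}
synthsExclusive : Synths env tm a -> NotSynths env tm -> Void
\end{verbatim}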
\TODO{explain the derivation is runtime erased}
As with the well-scoped AST, we had the choice for type checking to
produce an intrinsically well-typed AST.\@ We opted against it
because it is harder to express the relation between the input and
output terms. Consider the hypothetical signature of the type
synthesis function for an intrinsically well-typed AST:
\begin{verbatim}
synths : Term context -> (env : TyEnv context) -> Maybe (a : Ty ** WfTerm env a)
\end{verbatim}
The result type does not depend on the given input term, so we can
give a trivial implementation, such as a function that constantly
returns some fixed well-typed term. Compare this to a (simplified)
signature for synthesis with an extrinsic certificate:
\begin{verbatim}
synths : (tm : Term context) -> (env : TyEnv context)
      -> Maybe (a : Ty ** SynthsTy env tm a)
\end{verbatim}
The result now explicitly depends on the input term, invalidating the
trivial implementation.
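For concreteness, the kind of degenerate definition the intrinsic
signature would accept (with hypothetical \verb|TNat| and
\verb|zeroTerm|) looks like this:
\begin{verbatim}
-- ignores the input term and always returns the same result
synths tm env = Just (TNat ** zeroTerm)
\end{verbatim}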
\TODO{describe why knowing there is a derivation is necessary}
We want to emphasise that producing a certificate during type checking
does not prove the type checker is correct. The certificate only
proves the type checker matches the specification, which may itself be
wrong. As a concrete example, the initial Idris~2 specification for
\lang{} certificates had a bug: when checking an abstraction with
multiple arguments, the types were assigned to variables in reverse
order. For instance, when checking \verb|\x, y => x| at type \(A \to B \to
A\), the specification assigned \verb|x| type \(B\) and \verb|y| type
\(A\). The certified type checker faithfully followed the
specification and rejected this term, and even produced a certificate
of why it was right to do so.
\subsection{Scheme Backend}
\TODO{
\begin{itemize}
\item justify how dynamic types make Scheme a good compilation target
\item describe how products are compiled to assoc-lists
\item describe how sums are compiled to tagged values
\item describe how dynamic typing removes the need for roll
\item explain why fold needs to know the term has a valid type
\item explain why fold must be compiled recursively with map
\end{itemize}
}
\end{document}