π is an experimental tensor computation language with declarative semantics, explicit scheduling, and an extremely tiny surface area.
π programs use a small set of combinators to build up computation graphs not over primitive tensor ops, but over π expressions β atomic einsum-like expressions describing a single scalar op densely applied over a multidimensional domain. π expressions are paired with schedules consisting of loop splits, loop ordering, and input producer staging. Under the π scheduling model, staging decisions have three predictable lowering consequences: loop fusion, storage folding, and online reduction correction. This allows π to express sophisticated algorithms like numerically stable online blockwise FlashAttention:
mm_t = i("ik*jk~ijk | i:16,j:16 | jii'j'k") >> i("+ijk~ij | i:16,j:16 | jii'j'k0")
row_max_shift = (I & i(">ij~i | i:16,j:16 | ji0i'j'")) >> i("ij-i~ij | i:16,j:16 | ji01i'j'")
exp = i("^ij~ij | i:16,j:16 | ji0i'j'")
row_normalize = (I & i("+ij~i | i:16,j:16 | ji0i'j'")) >> i("ij/i~ij | i:16,j:16 | ji01i'j'")
mm = i("ij*jk~ikj | i:16,j:16 | ji0i'kj'") >> i("+ikj~ik | i:16,j:16 | jii'kj'0")
attn = mm_t >> row_max_shift >> exp >> row_normalize >> mm
This project is at the proof-of-concept stage. There are significant gaps in the language, the generated code is not yet performant, and the repo carries a lot of AI agent debt.
Right now, we have Python frontend -> runtime -> compiler -> C backend working on Linux+macOS, demonstrating the scheduling model, and allowing correctness verification against NumPy/etc.
The π compiler has no dependencies and generates a standalone dependency-free dynamic library. The π runtime depends only on the compiler for the target platform.
The critical priorities to take π from proof-of-concept to being legitimately useful on real-world workloads are the following:
f32 dtypes. Convolution is probably the most critical usability gap at
present. The goal is to add language features in as principled a way as possible
to keep π small. We favor general and composable over bespoke constructs.The first βbetβ of π as a project was that a simple scheduling model could admit FlashAttention-like target implementations. This bet has paid off although there is still the risk that things get messy as the language expands to express a broader set of tensor computations.
The bet now is that this simple scheduling model will make schedule search more tractable. A lot of tensor compilers do search, but they do it in a complex IR with too much configuration complexity. π deliberately has fewer knobs to turn. The bet is that they are the right knobs. The scheduling model being resident to the language (instead of an IR layer) means search wonβt happen somewhere deep within the π compiler, but over π components. You write (or trace from a Torch model) an π component, and search simply finds you a better one.
Just cargo build the core crate and then run
ilang-python/flash-attn.py (you will need NumPy
installed).
This computes a reference value with NumPy and then computes the same value in π once with the naive schedule and then again with the FlashAttention schedule. The generate C code for FlashAttention is printed out for inspection. We then assert the π-computed values match the NumPy reference to a reasonable tolerance.
The semantic part of π expressions looks similar to einsum notation but does
not perform implicit summation. Reductions live in their own expressions. π
expressions have a unary form (e.g. -i~i) and binary form (e.g. i-i~i).
Repeated input indices constrain the inputs shapes. For example, i-i~i
requires the left and right inputs be the same length. The shapes of the input
dimensions inform the shapes of the output. For example, the output of i-i~i
will be the same length as the inputs. Reductions are notated by input indices
left absent from the output. For example, +ij~i performs a sum across the
columns of the input.
| symbol | name | default | reducible |
|---|---|---|---|
+ |
add |
0 | β |
* |
mul |
1 | β |
- |
sub |
0 | Β |
/ |
div |
1 | Β |
> |
max |
-β | β |
< |
min |
β | β |
^ |
pow |
e | Β |
$ |
log |
e | Β |
Note: pow and log in unary form are just exp and ln respectively.
In general, π expressions are written with three βsegmentsβ delimited by |.
The first being the semantic expression described above, the second being a list
of loop splits, and the third the schedule of loops and input staging
directives.
A scheduled π expressions looks like this: +ijk~ij | i:16,k:16 | iki'jk'.
Here, the i and k axes are each split by a factor of 16 (tiling each loop
with a tile width of 16) and the loops are ordered with the tile loops i and
k on the outside and the element loops i', j, and k' on the inside.
If one π expression (the consumer) takes another π expression (the
producer) as input in the computation graph, the producerβs computation can be
staged inside the schedule of the consumer. For example: +ijk~ij | i:16,k:16 |
iki'jk'0 stages the 0-th input producer at the innermost loop of the consumer.
Staging producers in this way has three important lowering consequences:
This scheduling model is the key difference between π and other tensor compilers. These lowering decisions are predictable consequences of the model rather than being opaque compiler optimizations buried in some complex IR somewhere.
The following combinators are used to compose π expressions into computation graphs called π components.
| symbol | name | semantics |
|---|---|---|
<< |
compose | (f << g)(x) = f(g(x)) |
>> |
chain | (f >> g)(x) = g(f(x)) |
& |
fanout | (f & g)(x) = (f(x), g(x)) |
\| |
pair | (f \| g)(x, y) = (f(x), g(y)) |
~ |
swap | (~f)(x, y) = f(y, x) |