π is an experimental tensor computation language with declarative semantics, explicit scheduling, and an extremely tiny surface area.
π aims to pull just enough scheduling capability from the kernel DSL layer into an otherwise high-level tensor language. The π scheduling model only has three concepts: loop splits, loop ordering, and input producer staging, but these have predictable lowering consequences including loop/operator fusion, storage folding, and online reduction rewriting.
These are all the necessary ingredients for implementing numerically stable online blockwise FlashAttention. Here is what that looks like (itβs dense but details of the syntax are given below):
# does a matmul with right-hand input transposed
mm_t = i("ik*jk~ijk | i:16,j:16 | jii'j'k") >> i("+ijk~ij | i:16,j:16 | jii'j'k0")
# shifts values down by row-max for numerical stability
row_max_shift = (I & i(">ij~i | i:16,j:16 | ji0i'j'")) >> i("ij-i~ij | i:16,j:16 | ji01i'j'")
# applies exponentiation (first part of the softmax)
exp = i("^ij~ij | i:16,j:16 | ji0i'j'")
# normalizes along rows (second part of the softmax)
row_normalize = (I & i("+ij~i | i:16,j:16 | ji0i'j'")) >> i("ij/i~ij | i:16,j:16 | ji01i'j'")
# does a regular matmul
mm = i("ij*jk~ikj | i:16,j:16 | ji0i'kj'") >> i("+ikj~ik | i:16,j:16 | jii'kj'0")
# puts it all together
attn = mm_t >> row_max_shift >> exp >> row_normalize >> mm
The first βbetβ of π as a project was that a simple scheduling model could admit FlashAttention-like target implementations. This bet has paid off, although there is still the risk that things get messy as the language expands to express a broader set of tensor computations.
The bet now is that this simple scheduling model will make schedule search more tractable. A lot of tensor compilers do search, but they do it in a complex IR with too much configuration complexity. π deliberately has fewer knobs to turn. The bet is that they are the right knobs. The scheduling model being resident in the language (instead of an IR layer) means search wonβt happen somewhere deep within the π compiler, but over π components. You write (or trace from a Torch model) an π component, and search simply finds you a better one.
This project is at the proof-of-concept stage. There are significant gaps in the language, the generated code is not yet performant, and the repo carries a lot of AI agent debt.
Right now, we have Python frontend -> runtime -> compiler -> C backend working on Linux and macOS, demonstrating the scheduling model, and allowing correctness verification against NumPy/etc.
The π compiler has no dependencies and generates a standalone dynamic library. The π runtime depends only on the compiler for the target platform.
You will need Rust, Python, and NumPy installed.
cargo build --package i-core
python ilang-python/flash-attn.py
This computes a reference tensor with NumPy and then computes the same tensor in π, once with the naive schedule and then again with the FlashAttention schedule. The generated C code for FlashAttention is printed out for inspection. We then assert the π-computed values match the NumPy reference to a reasonable tolerance.
There is an annotated version of the C output here.
π is a pure expression language. The base construct is the π expression which contains a scalar operation, indexing semantics, and scheduling information. These are wired up into arbitrarily complex computation graphs using a small set of combinators.
In general, π expressions are written with three βsegmentsβ delimited by |.
The first is the semantic expression, the second is a list of loop splits, and
the third is the schedule of loops and input staging directives.
The semantic part of π expressions looks similar to einsum notation but does
not perform implicit summation. All reductions happen in their own expressions.
π expressions have a unary form (e.g. -i~i) and binary form (e.g. i-i~i).
Repeated input indices constrain the input shapes. For example, i-i~i
requires the left and right inputs be the same length. The shapes of the input
dimensions inform the shapes of the output. For example, the output of i-i~i
will be the same length as the inputs. Reductions are written by omitting input
indices from the output. For example, +ij~i performs a sum across the second
dimension of the input.
| symbol | name | default | reducible |
|---|---|---|---|
+ |
add |
0 | β |
* |
mul |
1 | β |
- |
sub |
0 | Β |
/ |
div |
1 | Β |
> |
max |
-β | β |
< |
min |
β | β |
^ |
pow |
e | Β |
$ |
log |
e | Β |
The unary forms of π expressions are just the binary forms with the default
value assumed on the left-hand side. The default value is chosen to be the
reduction identity if there is one, otherwise a value that gives sane unary
behavior. This way, we get sub -> neg, div -> recip, pow -> exp,
log -> ln.
The following combinators are used to compose π expressions into computation graphs called π components.
| symbol | name | semantics |
|---|---|---|
<< |
compose | (f << g)(x) = f(g(x)) |
>> |
chain | (f >> g)(x) = g(f(x)) |
& |
fanout | (f & g)(x) = (f(x), g(x)) |
\| |
pair | (f \| g)(x, y) = (f(x), g(y)) |
~ |
swap | (~f)(x, y) = f(y, x) |
The only scheduling concepts in π are loop splits, loop ordering, and input producer staging. Splits are declared in the second segment of an π expression, loops are ordered in the third segment, and inputs are staged within that same loop ordering string.
Take for example: +ijk~ij | i:16,k:16 | iki'jk'. Here, the i and k axes
are each split by a factor of 16 (tiling each loop with a tile width of 16) and
the loops are ordered with the tile loops i and k on the outside and the
element loops i', j, and k' on the inside.
If one π expression (the consumer) takes another π expression (the
producer) as input in the computation graph, the producerβs computation can be
staged inside the schedule of the consumer. For example: +ijk~ij | i:16,k:16 |
iki'jk'0 stages the 0-th input producer at the innermost loop of the consumer.
Staging producers in this way has three important lowering consequences:
This scheduling model is the key difference between π and other tensor compilers. These lowering decisions are predictable consequences of the model rather than being opaque compiler optimizations buried in some complex IR somewhere.
The critical priorities to take π from proof-of-concept to being legitimately useful on real-world workloads are the following:
f32 dtypes. Convolution is probably the most critical usability gap
at present. The goal is to add language features in as principled a way as
possible to keep π small. We favor general and composable constructs over
purpose-specific ones.Interested in π, tensor compilers, or related work? Reach out at contact@ilang.dev.