On Architecture Research

05 Jun, 2026

Strong architectures are the result of accidental or intentional hardware-software codesign. The dogma, to me, appears to be: how much useful compute/time can we throw at the model? (exemplified by: "the models, they just want to learn" - Ilya Sutskever)

In a sense, the principal inductive bias that we adhere to is the fact that we run learning systems on the arbitrary computing substrate that we have developed (i.e., the code must run on a CPU or GPU, and so on). Hence, scalable architectures are those which express the learning contract in a manner amenable to the accelerator they are written for.

The learning contract generally appears to be something like a Neural Turing machine (i.e., an LSTM with a tape and read/write; in comparison, a transformer may read its KV cache, but can only append to it, never rewrite earlier entries). However, to rely solely on the learning principle is to forget the fact that we run these systems on our computers, which is an inductive bias we cannot account for easily; in fact, we shouldn't.
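
To make the read/write distinction concrete, here is a minimal sketch in plain Python. The names `KVCache` and `Tape` are hypothetical, as are their methods; the tape's write rule follows the erase-then-add form described for the Neural Turing machine, and the cache read is ordinary softmax attention over a single head with no scaling. It is an illustration of the contrast, not anyone's actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class KVCache:
    """Hypothetical append-only memory: old entries are read, never rewritten."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        # The only mutation allowed: add a new (key, value) pair at the end.
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        # Attention read: softmax-weighted average of all stored values.
        weights = softmax([dot(query, k) for k in self.keys])
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values))
                for i in range(dim)]


class Tape:
    """Hypothetical NTM-style memory: fixed slots with soft reads and writes."""

    def __init__(self, slots, dim):
        self.memory = [[0.0] * dim for _ in range(slots)]

    def read(self, weights):
        dim = len(self.memory[0])
        return [sum(w * row[i] for w, row in zip(weights, self.memory))
                for i in range(dim)]

    def write(self, weights, erase, add):
        # Erase-then-add: each slot is partially cleared, then updated in place.
        for w, row in zip(weights, self.memory):
            for i in range(len(row)):
                row[i] = row[i] * (1.0 - w * erase[i]) + w * add[i]
```

The tape can overwrite a slot it wrote earlier; the cache can only grow, so anything the model wants to "revise" has to be expressed as a new entry that later reads attend to instead.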

It is possible that an arbitrary learning algorithm will learn, from nothing, that the matmul is a strong primitive, but considering that we already have access to the matmul, we should shape our learning contract such that it exploits something it would have accessed anyway. So, incredible architectural choices are abstract insights into the inductive prior that the matrix multiplication imposes upon the universal learning algorithm.
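
As a toy illustration of shaping a computation around the matmul, consider a causal running sum over a sequence. It can be written as a step-by-step recurrence that carries state, or as a single product with a lower-triangular mask. The sketch below uses plain Python, with a hand-written `matmul` standing in for an accelerator kernel; the function names are hypothetical and the example is mine, chosen only because both forms are easy to check against each other.

```python
def matmul(a, b):
    """Plain-Python matrix product; a stand-in for an accelerator matmul."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(len(a))]


def causal_sum_sequential(x):
    """Recurrent form: each step depends on the state from the previous step."""
    state = [0] * len(x[0])
    out = []
    for row in x:
        state = [s + v for s, v in zip(state, row)]
        out.append(state)
    return out


def causal_sum_matmul(x):
    """Same result as one matmul: position t sums over every position s <= t."""
    n = len(x)
    mask = [[1 if s <= t else 0 for s in range(n)] for t in range(n)]
    return matmul(mask, x)


x = [[1, 2], [3, 4], [5, 6]]
sequential = causal_sum_sequential(x)
parallel = causal_sum_matmul(x)
print(sequential == parallel)
```

The recurrent form does less arithmetic but forces the steps to run one after another; the masked form does redundant work yet is a single matmul, which is the shape of computation the hardware is built to run well.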