Why I like Really Big Models
Do these jokes still work?
Anyway, Claude Mythos is coming, and I'm excited. So I'll allow myself some science fiction here, in jest. This is all very hand-wavy; pardon me. I originally wrote this on X, but I'm keeping it here too.
I am always excited for larger and larger models. This is a little attempt at saying why.
We sometimes look at an LLM's pass@K on a problem: can this model solve the problem, given K independent attempts? This is like asking: after how many rolls of a die do I get something greater than 3?
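The dice intuition can be made concrete. Here's a minimal sketch of the usual unbiased pass@k estimator (the combinatorial one popularized by code-eval suites): draw n samples, count the c correct ones, and ask how likely a random subset of k contains at least one success. The numbers below are just the die example, nothing real.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which passed.

    1 - C(n-c, k) / C(n, k) is the probability that a uniformly random
    subset of k samples contains at least one passing sample.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# The die analogy: 6 faces, 3 of them (> 3) count as a "pass".
print(pass_at_k(n=6, c=3, k=1))  # 0.5 -- one roll
print(1 - 0.5 ** 3)              # 0.875 -- three independent rolls
```

The estimator matters in practice because naively computing `1 - (1 - c/n)**k` is biased for small n; the subset-counting form is not.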
pass@K appears to be an approximation of the limiting pass@∞ -- i.e., if you had The Most Comprehensive Evaluation Suite, and the model could be "pushed forward" across all the seeds, then perhaps you could procure a scalar "score" for the intelligence of a model. In some sense, this is asking, "what is the largest number on the die you could get if I let you roll any number of times?"
Or you could think of it as the limiting expected K -- barring "pathological seeds", as we bar null sets for measures -- needed to solve the entirety of this Most Comprehensive Evaluation Suite. The lower the expected K over your model weights, the more intelligent your model.
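As a toy model of that expected K: if each attempt succeeds independently with some probability p (a big assumption; samples from one model are rarely independent), then K is geometrically distributed with mean 1/p. A quick simulation, purely illustrative:

```python
import random

def expected_attempts(p: float) -> float:
    """Mean of a geometric distribution: attempts until first success."""
    return 1.0 / p

def simulate_k(p: float, rng: random.Random) -> int:
    """Sample one K: attempts until the model first solves the problem."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(0)
trials = [simulate_k(0.25, rng) for _ in range(100_000)]
print(expected_attempts(0.25))    # 4.0
print(sum(trials) / len(trials))  # close to 4.0 empirically
```

So "the lower the expected K, the more intelligent the model" is, in this toy view, just "the higher the per-attempt success probability across the suite".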
In fact, I think we can also see this roughly as: The prompt/eval is a program (in the SICP sense), written in an informal language (here, English), and the LLM is a stochastic interpreter of that informal language, while the output is the execution trace of that interpreter ingesting the program. Because the language is informal, the interpreter has to be stochastic.
Just as we formally verify our PLs, evals are informal verifiers for our informal-language interpreters. The True/False of formal verification, then, becomes a real number, some function of this true K.
Reinforcement learning, now, pulls from pass@K and brings it to pass@1; at pass@∞, note that the same wrong actions will be sampled again and again, but in the limiting case, RL makes those bad actions less frequent.
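Here's a tiny REINFORCE sketch of that pull, on a four-answer bandit standing in for a prompt (everything here is illustrative, not how frontier RL is actually done): sampling from the policy is the pass@K regime, and the gradient update moves probability mass onto the rewarded answer, so pass@1 climbs toward what pass@K could already reach.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=2000, lr=0.5, seed=0):
    """Toy 'prompt' with 4 candidate answers; only index 0 is correct.

    REINFORCE on a softmax policy: only correct samples earn reward,
    so their probability mass grows and wrong actions get rarer.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0, 0.0]  # uniform policy: pass@1 = 0.25
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(4), weights=probs)[0]
        reward = 1.0 if a == 0 else 0.0
        # Policy gradient: grad log pi(a) = onehot(a) - probs.
        for i in range(4):
            g = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * g
    return softmax(logits)

probs = train()
print(probs[0])  # pass@1 on this problem, climbing from 0.25 toward 1.0
```

Note that if the uniform policy never samples the correct answer -- pass@K is zero -- there is nothing to upweight, which is the "RL pulls from pass@K" point in miniature.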
Continued RL, or RL at scale, lets us access more and more of this pass@K territory; after all, the "true" accessible K is very large, very very large, and there is a lot of alpha in simply taking a small model, pulling from pass@K to pass@1, and hillclimbing on K.
Here is where I find things of great interest, and I will handwave even more than I just did:
Larger models seem to be nice because they seem to have a richer K-space, so to speak. @phillip_isola and @yule_gan's recent neural thickets work suggests some similar things (I'm not claiming it's the same, of course). But large models have more parameters, more tweedle-space, and can say more things, which directly affects how long you can hillclimb on K before you saturate because your learning algorithm lacks the granularity to exploit the problem any further.
Now, there's no reason for K to be anywhere near what we are capable of today. It could be in the trillions before we come close to "exhausting" the K-space we have, and small models will saturate well before then... but we seem to have lucked out, in the sense that larger models are also more capable of learning better and faster.
This means RL over sheerly larger models should unlock capabilities that just didn't exist in earlier models, and we could simply be lucky that some abilities that appear unfair to us are close by in the tweedle-space of large models.
What also excites me is that as we scale upwards, we should find laws for how scaling pretraining and scaling RL interact with each other. I look forward to seeing, for example, how something like @TheGregYang's brainchild, MuP and the tensor programs, might be thought of in the light of RL over just sheerly larger models. And then, with time, a general rule for how different kinds of training interact with each other, as things abstract out further.
At that point, I think LLMs as we know them cease to be very interesting. Most of the interesting problems will have been solved, and the rest will perhaps be too pedantic, or a convex combination of problems already solved. We shall have a recipe for mostly-optimal scaling, which, in the grander scheme of things, will be as optimal as scaling needs to be, and that's mostly all we'll need to do.