RL on LLMs from scratch is hard because (1) the action space is large, (2) exploration is hard, and (3) a token-level action space is conceptually awkward.

Chess sidesteps all three. The average number of legal next moves is ~40. Exploration (e.g. pUCT) is tractable because we can programmatically enumerate all legal moves. The action space is obvious since the task is a game with formally defined moves.
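To make the "programmable exploration" point concrete, here is a minimal sketch of one pUCT selection step over an enumerated move list. The move names, priors, Q-values, and visit counts are made up for illustration:

```python
import math

def puct_select(legal_moves, prior, q, n_visits, total_visits, c=1.5):
    """Pick the legal move maximizing the pUCT score:
    Q(s,a) + c * P(a) * sqrt(N(s)) / (1 + N(s,a)).
    This is only tractable because legal_moves is small and enumerable."""
    def score(a):
        exploration = c * prior[a] * math.sqrt(total_visits) / (1 + n_visits.get(a, 0))
        return q.get(a, 0.0) + exploration
    return max(legal_moves, key=score)

# Toy position: three legal moves, priors from a policy net.
moves = ["e4", "d4", "Nf3"]
prior = {"e4": 0.5, "d4": 0.3, "Nf3": 0.2}
q = {"e4": 0.1, "d4": 0.4, "Nf3": 0.0}
n = {"e4": 10, "d4": 5, "Nf3": 0}

# The unvisited move wins on its exploration bonus despite a low prior.
best = puct_select(moves, prior, q, n, total_visits=15)
```

The key property: because every action is enumerable, the exploration bonus can be computed for *all* of them, something a |Vocabulary|-sized action space makes expensive and an open-ended one makes impossible.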

Language is trickier. If the state space is the sequence so far and the action space is picking the next token, the action space has size |Vocabulary|, which is far larger than ~40. Beyond grammar, there is no concept of “illegal” next tokens to prune. And for most tasks, the right level of abstraction for actions is closer to tactics, theorems, constructions, or sentences than to individual tokens.

CoT RL’s action space is token-level, but it seems to implicitly learn higher-level actions and perform test-time search over higher-level cognitive strategies. This is great, but the need for a strong prior remains. We also can’t program exploration into this test-time search.

Why do we want a policy from scratch in the first place? Without a way to enumerate all legal moves and a tractable action space, we can’t program exploration and are reliant on the policy. Policies with strong priors settle into local maxima; policies trained from scratch could escape them. Unfortunately it looks like language is too complex to learn from scratch. What’s the best compromise?

I think the right direction is to find state-action spaces and priors where

  1. The action space is higher-level than individual tokens
  2. Action validity is verifiable
  3. The prior is such that sampled actions are syntactically valid but semantic quality is random

I have no idea what this would look like.
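The conditions can at least be stated mechanically, even if a useful instantiation is unclear. A deliberately trivial toy (the DSL and all names below are invented): actions are whole equation-rewriting steps, validity is a grammar check, and the prior samples uniformly over well-formed steps, so semantic quality is random:

```python
import random
import re

# Hypothetical toy action space, illustrating the three conditions:
# 1. Each action is a whole rewrite step, not a token.
# 2. Validity is verifiable: the step must parse against a tiny grammar.
# 3. The "prior" samples uniformly over syntactically valid steps,
#    so every sampled action is well-formed but semantically random.

TEMPLATES = [
    "add {k} to both sides",
    "subtract {k} from both sides",
    "multiply both sides by {k}",
]

def sample_action(rng):
    """Uniform-over-valid-actions prior (condition 3)."""
    template = rng.choice(TEMPLATES)
    return template.format(k=rng.randint(-5, 5))

GRAMMAR = re.compile(
    r"(add|subtract) -?\d+ (to|from) both sides"
    r"|multiply both sides by -?\d+"
)

def is_valid(action):
    """Purely syntactic validity check (condition 2)."""
    return GRAMMAR.fullmatch(action) is not None

rng = random.Random(0)
# Every sampled action is valid by construction; none is guided
# toward semantic usefulness.
sampled = [sample_action(rng) for _ in range(5)]
```

The gap, of course, is that for interesting tasks nobody has written down a grammar like this whose steps are both checkable and expressive enough to matter.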

I think the closest analogue is AlphaProof and AlphaGeometry 2. AlphaProof’s action space is a tactic in Lean, e.g. `let p := minFac (n! + 1)`. AlphaGeometry’s is a construction, e.g. “Construct D as the circumcenter of triangle BXC.” AlphaGeometry doesn’t use a generally pre-trained LM but an LM trained solely on synthetic actions in a DSL. The prior is narrow and task-specific, but it still has a semantic bias, so it doesn’t fully meet condition 3.
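To make “a tactic as one action” concrete, a hedged Lean 4 sketch (assuming Mathlib’s `Nat.minFac`; the rest of the proof is elided with `sorry`):

```lean
-- An AlphaProof-style "action" is an entire tactic step, not a token.
theorem exists_large_prime (n : ℕ) : ∃ p, n ≤ p ∧ p.Prime := by
  -- A single action: introduce the witness minFac (n! + 1).
  let p := Nat.minFac (n ! + 1)
  -- Later actions would show p is prime and n ≤ p.
  sorry
```

Each tactic is verifiable by the Lean kernel, so condition 2 holds exactly; the policy proposes whole steps and the proof checker prunes illegal ones, much like legal-move generation in chess.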

The question I am left with is: what else could we apply this to?