Self-Scaffolding LLMs: How Ornith-1.0 Rewrites Its Own Harness Mid-Training

A technical breakdown of DeepReinforce’s self-improving agentic coding models

There’s a quiet assumption baked into most RL post-training pipelines for coding agents: a human designs the harness, and the model just gets better at using it. The scaffold — memory management, error handling, tool orchestration, retry logic — stays fixed. The policy is the only thing that’s allowed to evolve.

DeepReinforce’s new Ornith-1.0 release challenges that assumption directly. Instead of treating the scaffold as an external, human-authored artifact, Ornith-1.0 treats it as something the model itself learns to construct and refine, jointly with the solution it produces. That’s a meaningfully different training philosophy, and it’s worth unpacking — both the mechanism and the numbers behind it.

The Model Lineup

Ornith-1.0 ships as four variants, built on top of Gemma 4 and Qwen 3.5 bases, spanning a genuinely wide deployment envelope:

9B Dense — small enough for edge deployment
31B Dense
35B MoE
397B MoE — the frontier-scale flagship

That spread matters practically. It’s rare to see a single training framework validated at both a laptop-friendly 9B scale and a 397B frontier scale, and the fact that DeepReinforce reports strong results across the whole range suggests the self-scaffolding approach isn’t just a trick that only pays off with enormous capacity.

The Core Idea: Co-Evolving Policy and Scaffold

Each RL iteration runs in two stages:

Scaffold proposal — conditioned on the task and the scaffold used previously, the model proposes a refined scaffold.
Rollout generation — conditioned on that new scaffold and the task description, the model generates the actual solution.

Reward from the rollout flows back into both stages. So the gradient isn’t just teaching the model to answer better — it’s teaching the model to author better orchestration logic for eliciting its own answers. Over many training steps, this becomes a selection process: scaffolds that induce higher-reward trajectories survive and propagate, while weaker ones get mutated away. Per-task-category strategies emerge on their own, without an engineer hand-tuning a harness for, say, multi-file refactors versus terminal-heavy debugging sessions.
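To make the two-stage loop concrete, here is a minimal toy sketch. Everything in it is a stand-in: `propose_scaffold`, `generate_rollout`, and `verify` are hypothetical stubs rather than DeepReinforce's code, scaffolds are plain strings, and carrying the best-rewarded scaffold forward is an assumed simplification of the selection pressure that a real policy-gradient update would apply.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    task: str
    scaffold: str   # output of stage 1
    solution: str   # output of stage 2
    reward: float   # one scalar, credited to both stages


def propose_scaffold(task: str, previous: str, rng: random.Random) -> str:
    """Stage 1 (stub): a real policy would refine the previous scaffold here."""
    tweak = rng.choice(["retry-on-error", "summarize-memory", "plan-first"])
    return f"{previous}+{tweak}"


def generate_rollout(task: str, scaffold: str) -> str:
    """Stage 2 (stub): a real policy would solve the task under the scaffold."""
    return f"solution({task}|{scaffold})"


def verify(task: str, solution: str) -> float:
    """Stub verifier: a toy reward that happens to favor 'plan-first'."""
    return 1.0 if "plan-first" in solution else 0.0


def rl_iteration(task: str, previous: str, rng: random.Random) -> Sample:
    scaffold = propose_scaffold(task, previous, rng)
    solution = generate_rollout(task, scaffold)
    return Sample(task, scaffold, solution, verify(task, solution))


def train(task: str, steps: int = 5, group_size: int = 4, seed: int = 0) -> str:
    rng = random.Random(seed)
    current = "base"
    for _ in range(steps):
        group = [rl_iteration(task, current, rng) for _ in range(group_size)]
        # A real trainer would run a policy-gradient update here, using
        # sample.reward as the signal for scaffold tokens AND solution tokens.
        best = max(group, key=lambda s: s.reward)
        if best.reward > 0:
            current = best.scaffold   # higher-reward scaffolds propagate
    return current


if __name__ == "__main__":
    print(train("fix-failing-test"))
```

The point of the toy is the data flow: the scaffold from one step conditions the next proposal, and a single reward is attached to both the scaffold and the solution it produced.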

Conceptually, this is closer to an evolutionary search operating inside the RL loop than to a conventional fixed-harness setup — the scaffold becomes a first-class learnable object rather than an environment constant.

The Obvious Failure Mode: Reward Hacking

If you let a model rewrite the very scaffold that determines how it’s evaluated, the incentive to cheat is immediate and well-known. DeepReinforce is upfront about this risk — a self-authored scaffold could easily learn to peek at visible test files and hardcode expected outputs, touch the exact file the verifier checks for, or copy an oracle solution sitting in the environment.

Their mitigation is a three-layer defense:

Fixed trust boundary — the environment, tool surface, and test isolation are immutable and outside the model’s reach. Only the inner policy scaffold (memory, error-handling, orchestration) is evolvable.
Deterministic monitor — a rule-based system flags any attempt to read withheld paths, tamper with verification scripts, or use unsanctioned tools, and zeroes out reward for those trajectories while excluding them from the advantage computation.
Frozen LLM judge — because intent-level gaming can happen entirely within the allowed tool surface (subtle enough to dodge a deterministic rule), a separate frozen judge model acts as a veto layer on top of the verifier itself.
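A rough sketch of how the three layers could gate reward is below. The path prefixes, tool names, and the `judge` callable are all hypothetical, and the source does not say what a judge veto does to the reward; this sketch assumes it zeroes the reward but keeps the trajectory in the batch, whereas monitor hits are both zeroed and excluded, as described above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# Layer 1: fixed trust boundary. These constants live outside anything the
# evolvable scaffold can edit. Both values are hypothetical examples.
WITHHELD_PREFIXES = ("/tests/hidden/", "/oracle/")
ALLOWED_TOOLS = {"read_file", "write_file", "run_shell"}


@dataclass
class ToolCall:
    tool: str
    path: str = ""


@dataclass
class Trajectory:
    calls: List[ToolCall]
    verifier_reward: float


def monitor_flags(traj: Trajectory) -> bool:
    """Layer 2: deterministic rules for unsanctioned tools or withheld paths."""
    for call in traj.calls:
        if call.tool not in ALLOWED_TOOLS:
            return True
        if call.path.startswith(WITHHELD_PREFIXES):
            return True
    return False


def gated_reward(traj: Trajectory,
                 judge: Callable[[Trajectory], bool]) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage)."""
    if monitor_flags(traj):
        return 0.0, False            # zeroed and excluded
    if not judge(traj):              # Layer 3: frozen judge can veto
        return 0.0, True             # assumption: vetoed, but kept in batch
    return traj.verifier_reward, True


def group_advantages(results: List[Tuple[float, bool]]) -> List[float]:
    """Group-normalized advantages computed over included trajectories only."""
    kept = [r for r, include in results if include]
    if not kept:
        return [0.0] * len(results)
    mu = mean(kept)
    sigma = pstdev(kept) or 1.0      # avoid division by zero
    return [(r - mu) / sigma if include else 0.0 for r, include in results]
```

A flagged trajectory contributes nothing to the group mean or standard deviation, so a cheating rollout cannot shift the baseline for the honest ones.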

This layered structure is a sensible response to a real problem, and it’s the kind of detail worth paying attention to if you’re building your own RL environments for agentic tasks: the outer boundary needs to be immutable and precisely specifiable, while softer, intent-level gaming needs a semantic judge rather than a rule.

Training Infrastructure: Pipeline-RL and Staleness Weighting

Long agentic rollouts create an off-policy problem — by the time a trajectory finishes and its reward is computed, the policy that generated early tokens may already be several gradient steps stale. Ornith-1.0 addresses this with a pipeline-RL strategy and an explicit staleness weight applied per token, based on how old that token’s generating policy is:

Tokens under a staleness threshold K₁ keep full weight.
Tokens between K₁ and K₂ get exponentially decayed weight.
Tokens beyond K₂ are dropped from the loss entirely.
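The three regimes above map onto a small piecewise function. This is a sketch under stated assumptions: staleness is measured as an integer count of gradient steps, the `decay` rate and the exact exponential form are illustrative, and how the boundaries at K₁ and K₂ are handled is a guess, since the source only describes the three regimes.

```python
import math


def staleness_weight(staleness: int, k1: int, k2: int, decay: float = 0.5) -> float:
    """Per-token loss weight as a function of policy staleness.

    staleness: gradient steps between the policy that generated the token
               and the policy being updated (assumed unit).
    """
    if staleness <= k1:
        return 1.0                                    # fresh enough: full weight
    if staleness <= k2:
        return math.exp(-decay * (staleness - k1))    # decays from 1.0 at k1
    return 0.0                                        # too stale: dropped


if __name__ == "__main__":
    for s in range(0, 9):
        print(s, round(staleness_weight(s, k1=2, k2=6), 3))
```

Anchoring the exponent at `staleness - k1` keeps the weight continuous at K₁, so tokens do not see a sudden jump in influence as they cross the first threshold.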

This feeds into a token-level GRPO objective with clipped importance ratios, multiplied by the staleness weight — a fairly clean way to keep long-horizon agentic RL stable without forcing fully synchronous rollout collection, which would be prohibitively slow at this scale.
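For readers who prefer symbols, one plausible way to write that objective is sketched below. This is a reconstruction from the description, not the formula from the release: the token-level normalization over the whole group, the group-normalized advantage, and all symbol names (G rollouts o_i for a task q, staleness s_{i,t}, clip range ε) are assumptions.

```latex
J(\theta) = \mathbb{E}\left[
  \frac{1}{\sum_{i=1}^{G} |o_i|}
  \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
  w(s_{i,t}) \,
  \min\Big( \rho_{i,t}(\theta)\, \hat{A}_i ,\;
            \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big)
\right]

\rho_{i,t}(\theta) =
  \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}
       {\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\, o_{i,<t})},
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                 {\operatorname{std}(r_1, \dots, r_G)}
```

Here w(s) is the staleness weight: 1 below K₁, exponentially decayed between K₁ and K₂, and 0 beyond K₂, so the oldest tokens simply fall out of the sum.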

The Numbers

At the flagship tier, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified — ahead of Claude Opus 4.7 (70.3 / 80.8) and ahead of similarly sized open models like MiniMax M3 (66.0 / 80.5) and DeepSeek-V4-Pro (67.9 / 80.6). It’s worth noting Claude Opus 4.8 still leads across most of these benchmarks (85.0 TB-2.1, 87.6 SWE-Bench Verified), so the comparison that matters here is against open-weight models of similar scale, not the absolute frontier.

The mid-size 35B MoE variant is arguably the more interesting result: it beats Qwen 3.5-397B on Terminal-Bench 2.1 (64.2 vs. 53.5) despite having less than a tenth of the total parameters (35B vs. 397B), while roughly matching it on several other benchmarks. That’s a strong signal that the self-scaffolding gains aren’t purely a function of scale.

At the small end, Ornith-1.0-9B hits 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified — matching or beating Gemma 4-31B, a model more than three times its size. For edge or on-prem deployments where you can’t run a 397B MoE model, that’s the number that actually matters.

Model	Terminal-Bench 2.1 (Terminus-2)	SWE-Bench Verified	Notable comparison
Ornith-1.0-397B	77.5	82.4	Beats Claude Opus 4.7 on both
Ornith-1.0-35B	64.2	75.6	Beats Qwen 3.5-397B on TB-2.1
Ornith-1.0-9B	43.1	69.4	Matches Gemma 4-31B
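To see the flagship margins at a glance, the snippet below just subtracts the scores quoted in this section. It uses only the self-reported numbers from the text; the `margins` helper is mine and adds no new data.

```python
# Scores quoted above: (Terminal-Bench 2.1, SWE-Bench Verified)
ORNITH_397B = (77.5, 82.4)
BASELINES = {
    "Claude Opus 4.7": (70.3, 80.8),
    "MiniMax M3": (66.0, 80.5),
    "DeepSeek-V4-Pro": (67.9, 80.6),
    "Claude Opus 4.8": (85.0, 87.6),
}


def margins(model, baselines):
    """Point difference (model minus baseline) on each benchmark."""
    return {
        name: tuple(round(m - b, 1) for m, b in zip(model, scores))
        for name, scores in baselines.items()
    }


if __name__ == "__main__":
    for name, (tb, swe) in margins(ORNITH_397B, BASELINES).items():
        print(f"{name:16s} TB-2.1 {tb:+.1f}   SWE-Bench Verified {swe:+.1f}")
```

The lead over the other models is much larger on Terminal-Bench 2.1 (7 to 12 points) than on SWE-Bench Verified (under 2 points), which is worth keeping in mind when reading "beats on both."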

Why This Matters for Practitioners

If you’re building agentic coding systems, whether autonomous SRE/DevOps agents, code-fix PR generation, or incident remediation pipelines, the self-scaffolding idea is worth internalizing even if you never train a 397B model yourself. The underlying insight generalizes: harness design (memory strategy, retry policy, error recovery, tool-call sequencing) is not a fixed cost you pay once. It’s a parameter space, and for narrow, well-defined task categories, it may be worth treating it as something to search over rather than something to hand-tune.
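You don't need RL to try the "harness as a parameter space" idea. The sketch below is a plain grid search over a handful of harness settings, which is a much simpler technique than the joint training Ornith-1.0 uses. The `HarnessConfig` fields, the search space, and `toy_evaluate` are all hypothetical; in practice `evaluate` would run your agent on a held-out task suite and return a pass rate.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class HarnessConfig:
    max_retries: int
    memory: str        # "full" or "summary"
    plan_first: bool


SEARCH_SPACE: Dict[str, List] = {
    "max_retries": [0, 1, 3],
    "memory": ["full", "summary"],
    "plan_first": [False, True],
}


def all_configs(space: Dict[str, List]) -> Iterator[HarnessConfig]:
    """Enumerate every combination of harness settings."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield HarnessConfig(**dict(zip(keys, values)))


def search(evaluate: Callable[[HarnessConfig], float],
           space: Dict[str, List] = SEARCH_SPACE) -> Tuple[float, HarnessConfig]:
    """Score every config and return the best (score, config) pair."""
    scored = [(evaluate(cfg), cfg) for cfg in all_configs(space)]
    return max(scored, key=lambda pair: pair[0])


def toy_evaluate(cfg: HarnessConfig) -> float:
    """Stand-in scorer; replace with a real run over your task suite."""
    score = 0.4 + 0.05 * min(cfg.max_retries, 2)
    score += 0.1 if cfg.memory == "summary" else 0.0
    score += 0.15 if cfg.plan_first else 0.0
    return round(score, 3)


if __name__ == "__main__":
    best_score, best_cfg = search(toy_evaluate)
    print(best_score, best_cfg)
```

Running the search separately per task category (say, multi-file refactors versus terminal debugging) is the cheap analogue of the per-category strategies the article describes emerging during training.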

The reward-hacking mitigations are equally relevant outside of frontier LLM training. Any time you give an autonomous agent write access to something that also gets read by its own verification step, you’ve created the same incentive structure at a smaller scale. The three-layer pattern — immutable outer boundary, deterministic rule enforcement, semantic judge as a backstop — is a reasonable template regardless of what you’re building.

Where It Sits in the Landscape

Ornith-1.0 doesn’t claim the absolute frontier: Claude Opus 4.8 remains ahead on most benchmarks in DeepReinforce’s own table. What it claims, credibly, is state-of-the-art performance among open-weight models at comparable parameter counts, across a wide range of scales, using a training method that’s conceptually distinct from simply scaling up RL on a fixed harness. For teams that need open-weight models they can self-host, fine-tune, or run at the edge, that’s the more actionable comparison, and the self-scaffolding framework itself may end up being the more durable contribution, independent of which specific checkpoint currently tops a leaderboard.

Closing Notes

The bigger shift Ornith-1.0 points to isn’t a benchmark win — it’s a reframing of what counts as “the policy.” Once scaffold and solution are optimized jointly, the line between how an agent works and what an agent produces stops being a fixed design decision and becomes something learnable. That’s a direction worth watching independent of whether any single Ornith checkpoint holds its lead: the next wave of agentic coding systems, open or closed, will likely borrow this idea of treating the harness itself as trainable surface area, and the reward-hacking safeguards described here are a useful reference point for anyone building autonomous agents with write access to their own verification loop.

As with any newly released model, these benchmark numbers are self-reported by the model’s creators and haven’t yet been independently reproduced, which is worth keeping in mind before treating them as settled.

References

DeepReinforce Team. “Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding.” deep-reinforce.com, 2026. https://deep-reinforce.com/ornith_1_0.html

Amit Patriwala

AI Infrastructure Architect & Enterprise Solution Architect