Manufacturing reality

If you can't capture enough real data, you manufacture it. World models like NVIDIA Cosmos generate physical interaction — joint angles, contacts, trajectories — and try to close the gap between simulation and reality.

We argued in our flagship piece that physical-world AI is bottlenecked not on architecture or compute but on data that doesn't exist yet — and that the field is attacking the shortage three ways: capture it, pool it, or generate it. That essay spent most of its time on the first two and the geography they imply. This one goes deep on the third, because the third is the strange one. Capture and pooling work within the same economics; generation tries to break them.

Here is the problem generation is trying to escape. Capture (teleoperation, human video) is linear-cost: every demonstration is paid for in operator-hours, and the marginal example never becomes free. A usable dataset for one new manipulation task still runs roughly 10 to 100 operator-hours.² Pooling helps, but it only multiplies what someone, somewhere, already captured: share a corpus across sixty labs and twenty-two robot bodies and you have more data, but you have not made any new physics. Both routes ride the same curve. The third route attacks the curve itself: manufacture the data.

The instrument for that is a world model — a model that learns the dynamics of the physical world and can then roll them forward. Hand it a scene and an action and it predicts what happens next, the way a language model predicts the next token. Crucially, what it emits is not just pixels. It emits action data: joint angles, gripper states, contact events, full trajectories — the same channels a teleop rig records off a real robot, except synthesized rather than performed.
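To make "hand it a scene and an action and it predicts what happens next" concrete, here is a minimal sketch of what a synthetic episode could look like as data. Everything in it is a stand-in: the `Step` record, the `toy_world_model` function, and its contact rule are hypothetical names invented for illustration, not Cosmos's interface or any real library's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    """One timestep of action data, the same channels a teleop rig would log."""
    joint_angles: Tuple[float, ...]  # radians, one entry per joint
    gripper_closed: bool
    in_contact: bool


# An action here is (per-joint angle deltas, whether to close the gripper).
Action = Tuple[Tuple[float, ...], bool]
WorldModel = Callable[[Step, Action], Step]


def toy_world_model(state: Step, action: Action) -> Step:
    """Hypothetical dynamics: integrate joint deltas and apply a made-up contact rule."""
    deltas, close = action
    angles = tuple(a + d for a, d in zip(state.joint_angles, deltas))
    # Assumption for illustration: contact happens when the gripper closes
    # with the first joint at or past 0.5 rad.
    contact = close and angles[0] >= 0.5
    return Step(joint_angles=angles, gripper_closed=close, in_contact=contact)


def rollout(model: WorldModel, start: Step, actions: List[Action]) -> List[Step]:
    """Roll the model forward one action at a time, keeping the full trajectory."""
    trajectory = [start]
    for action in actions:
        trajectory.append(model(trajectory[-1], action))
    return trajectory


start = Step(joint_angles=(0.0, 0.0), gripper_closed=False, in_contact=False)
actions = [((0.25, 0.0), False), ((0.25, 0.0), False), ((0.0, 0.0), True)]
episode = rollout(toy_world_model, start, actions)
```

The point of the sketch is the shape of the output, not the physics: the episode is a list of joint angles, gripper states, and contact flags, synthesized by stepping a model rather than recorded off hardware.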

Capture records the world. Pooling shares the recording. A world model tries to run the world — and write down what it sees.

Cosmos as the concrete case

The cleanest thing to point at today is NVIDIA Cosmos 3, the open physical-AI foundation model released in June 2026. It was trained on roughly 20 trillion multimodal tokens — on the order of a billion images and 400 million videos, a deliberate mix of real and already-synthetic footage — and its job is precisely to generate physical-interaction data: action sequences plus sim-to-real synthetic episodes you can train a policy on.⁵ The numbers matter less as trophies than as a signal of where the capital and the compute are pointing: a model this size, trained on this corpus, exists to manufacture the data the physical world never recorded.

Cosmos 3 training mix: 20T multimodal training tokens · ~1B images · 400M videos (real + synthetic)

The fuel behind a world model. Source: NVIDIA; Axios.⁵

But a world model is only half the idea. The half that bends economics is what you wrap around it.

The sim-to-real loop

Generation pays off inside a cycle, not a one-shot. It runs like this. The world model produces synthetic episodes. You train a policy on them. You deploy the policy onto a real robot. The robot fails in places the simulation got wrong, and you capture those real-world corrections — a relatively small, targeted slice of real data. That slice goes back to improve the world model, which then generates better synthetic episodes. Repeat.
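The cycle can be sketched as a toy loop. This is a deliberately crude illustration, not how any real world model is trained: the "world model" is reduced to a single believed friction value, the "policy" simply inherits that belief, and the real corrections are noise-free. All names and numbers (`WorldModel`, `run_loop`, the 0.3 and 0.6 friction values, the learning rate) are assumptions made up for the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorldModel:
    friction: float  # the model's current belief about one physical parameter

    def generate(self, n: int) -> List[float]:
        """Synthetic episodes: here just n samples of the believed friction."""
        return [self.friction] * n

    def update(self, corrections: List[float], lr: float = 0.5) -> None:
        """Move the belief part of the way toward the mean real measurement."""
        target = sum(corrections) / len(corrections)
        self.friction += lr * (target - self.friction)


def train_policy(episodes: List[float]) -> float:
    """Toy 'policy': the friction the robot expects, averaged over its training data."""
    return sum(episodes) / len(episodes)


def deploy(policy: float, real_friction: float, n_corrections: int) -> Tuple[float, List[float]]:
    """Run on the 'real robot': return the error and a small slice of real measurements."""
    error = abs(real_friction - policy)
    return error, [real_friction] * n_corrections


def run_loop(turns: int, real_friction: float = 0.6, start_friction: float = 0.3) -> List[float]:
    model = WorldModel(friction=start_friction)
    errors = []
    for _ in range(turns):
        episodes = model.generate(10_000)   # many synthetic episodes
        policy = train_policy(episodes)     # train on them
        error, corrections = deploy(policy, real_friction, n_corrections=20)
        errors.append(error)
        model.update(corrections)           # fold the real slice back in
    return errors


errors = run_loop(turns=5)
```

In this toy the error shrinks every turn because each small batch of real corrections pulls the model's belief toward the true value. A real loop has no such guarantee; the sketch only shows where the synthetic volume and the real corrections sit relative to each other.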

The self-reinforcing engine. Each turn, synthetic episodes are ~free at the margin. ILLUSTRATIVE.

The reason this matters is in one phrase from the caption: each turn, synthetic episodes are roughly free at the margin. Once the world model exists, generating the ten-thousandth episode costs about what the first one did, and that cost is GPU-seconds, not operator-hours. Teleop's marginal cost is just as flat, but it is flat at a human wage: the ten-thousandth demonstration costs the same operator time as the first, forever. That asymmetry in unit cost is the whole reason generation could finally bend the curve that capture and pooling cannot. The loop is the engine; the marginal-cost collapse is why the engine is worth building.

It is worth being precise about what is cheap and what isn't. The world model is expensive to build: 20 trillion tokens of training is not free. The real corrections you feed back each turn are still captured the hard way. What collapses is the marginal cost of episodes. You pay a large fixed cost to stand up the generator and a small recurring cost to keep it honest, and in exchange the variable cost per training example falls toward zero. Capture is the mirror image: comparatively little fixed cost, and a variable cost that never quits.
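The fixed-versus-variable trade can be put into a few lines of arithmetic. The sketch below is illustrative only: every dollar figure is a hypothetical placeholder chosen to make the shape visible, capture is simplified to zero fixed cost, and the recurring correction cost is folded into a single `upkeep_cost` term.

```python
from typing import Optional


def capture_cost(n_episodes: float, wage_per_episode: float) -> float:
    """Teleop-style cost: no fixed term, every episode pays an operator."""
    return n_episodes * wage_per_episode


def generation_cost(n_episodes: float, fixed_cost: float,
                    upkeep_cost: float, gpu_cost_per_episode: float) -> float:
    """Generation: large fixed cost, recurring upkeep, tiny per-episode cost."""
    return fixed_cost + upkeep_cost + n_episodes * gpu_cost_per_episode


def break_even_episodes(fixed_cost: float, upkeep_cost: float,
                        wage_per_episode: float,
                        gpu_cost_per_episode: float) -> Optional[float]:
    """Episode count at which the two cost lines cross, or None if they never do."""
    saving_per_episode = wage_per_episode - gpu_cost_per_episode
    if saving_per_episode <= 0:
        return None  # generation never catches up
    return (fixed_cost + upkeep_cost) / saving_per_episode


# Hypothetical placeholder numbers, not estimates of any real system.
FIXED = 10_000_000.0   # building the world model
UPKEEP = 500_000.0     # real corrections fed back over the period
WAGE = 50.0            # operator cost per captured episode
GPU = 0.05             # compute cost per generated episode

n_star = break_even_episodes(FIXED, UPKEEP, WAGE, GPU)
```

Below `n_star` capture is the cheaper route; above it generation wins, and the lead widens with every additional episode. Note what the sketch leaves out: it counts episodes, not what each episode is worth, which is the separate question of how much a synthetic episode can be trusted.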

The honest hard part — the gap

Now the catch, and it is a real one. Synthetic data that is subtly wrong teaches subtly wrong behavior. A policy trained in a simulator where friction is a little too forgiving, where a deformable object holds its shape a little too well, where contact resolves a millisecond too cleanly, will do beautifully in sim and then fumble the real cup. The distance between simulated physics and real physics — the sim-to-real gap — is not a footnote to this approach. It is the entire game.

And the gap is widest exactly where the value is highest. Rigid-body motion through free space is close to solved; you can simulate a robot arm swinging through empty air convincingly. The money is in the other regime — dexterous, contact-rich manipulation, friction and deformation and the ten thousand ways a grip slips. That is the hardest physics to fake and the most valuable to get right, which is an uncomfortable combination.

Two trajectories diverge where the physics is hard. The gap is the error a policy inherits. ILLUSTRATIVE.

The clean way to hold all of this in your head is to stop treating a synthetic episode as worth one real episode. It isn't. Treat it as worth some fraction of a real one — a trust coefficient that says how much you believe the simulation where it counts. Then your effective dataset is real data plus discounted synthetic:

$$ D_{\text{eff}} \;=\; D_{\text{real}} \;+\; \alpha\,D_{\text{syn}}, \qquad 0 < \alpha < 1 $$

Here $D_{\text{real}}$ is captured episodes, $D_{\text{syn}}$ is generated ones, and $\alpha$, the trust coefficient, encodes how well the sim-to-real gap is closed. As $\alpha \to 0$, synthetic data is noise: you can generate a billion episodes and your effective dataset barely moves. As $\alpha \to 1$, a synthetic episode is as good as a real one, generation is nearly free, and the data wall falls. Today we are somewhere in between, and nobody can tell you the exact number: it depends on the task, the simulator, and how much real correction you fold back in.
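A few lines of code make the two limits tangible. This is just the formula above evaluated directly; the episode counts are hypothetical, and the function accepts the endpoints 0 and 1 only so the limiting cases can be shown.

```python
def effective_dataset(d_real: float, d_syn: float, alpha: float) -> float:
    """D_eff = D_real + alpha * D_syn, with alpha as the trust coefficient."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return d_real + alpha * d_syn


# Hypothetical counts: a thousand captured episodes, a million generated ones.
D_REAL = 1_000
D_SYN = 1_000_000

no_trust = effective_dataset(D_REAL, D_SYN, alpha=0.0)     # synthetic data is noise
half_trust = effective_dataset(D_REAL, D_SYN, alpha=0.5)   # somewhere in between
full_trust = effective_dataset(D_REAL, D_SYN, alpha=1.0)   # synthetic as good as real
```

With these counts the synthetic pile is a thousand times larger than the real one, so nearly all of the movement in the effective dataset comes from $\alpha$ rather than from generating more episodes.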

This is the honest frame for the whole bet. Generation is not a solved problem; it is a wager on raising $\alpha$. Every dollar poured into better physics engines, better world models, better sim-to-real transfer is, in the end, a dollar spent trying to push that one coefficient toward one. The technology is a means; $\alpha$ is the scoreboard.

The whole bet on one axis: close the gap, raise α. ILLUSTRATIVE — the marker is not a measured value.

The bet, in one line

Every advance in world models is a campaign to raise one number, $\alpha$ — the fraction of a real episode a synthetic one is worth. Generation wins if and only if $\alpha$ climbs.

What it means if the gap closes

This is where generation collides with the thesis we staked out in the flagship. We argued there that physical data has a geography — that you can only capture interaction data where the physical work happens, which is overwhelmingly Asia and the manufacturing belt, not the Bay Area. Generation is the single strongest counter to that wedge, and I want to flag it plainly rather than bury it.

If $\alpha$ gets close to one, you manufacture reality in a data center — anywhere there is power and GPUs. The need to be physically present where the work happens weakens; the data wall partly falls; geography matters less. A team in a server farm could, in principle, out-data a team standing on a factory floor. That is the bear case for the geography wedge, and it is a serious one. It is not a coincidence that the most credible push on world models comes from the company that sells the GPUs.

I don't think $\alpha$ gets there alone, or soon — and the reason is the gap I just drew. Synthetic data still has to be grounded in and validated against the real thing; the loop only works because real corrections keep flowing back in. The harder the physics — contact, friction, deformation, the dexterous and gloriously messy — the lower $\alpha$ sits and the more real capture you still need. So my read is that generation multiplies captured data rather than replacing it: it raises the return on every real episode by letting you spin variations around it, which makes the captured seed more valuable, not less. On that read, real data stays king and its geography holds.
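One way to write that multiplier reading down, as a sketch: assume each captured episode seeds $k$ synthetic variations, so that $D_{\text{syn}} = k\,D_{\text{real}}$. The symbol $k$ is an assumption introduced here for illustration, not a measured quantity. Substituting into the effective-dataset formula gives:

$$ D_{\text{eff}} \;=\; D_{\text{real}} \;+\; \alpha\,k\,D_{\text{real}} \;=\; (1 + \alpha k)\,D_{\text{real}} $$

Under that assumption the synthetic term scales with the captured seed. Each additional real episode raises $D_{\text{eff}}$ once directly and again through the variations spun around it, and when $\alpha$ sits near zero the factor $(1 + \alpha k)$ collapses back toward one.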

But I hold that as a bet, not a fact — and the cleanest way to be wrong about Seeker's whole geography wedge is for someone to close the sim-to-real gap faster than I expect. The capital is certainly acting as if it might: robotics and physical-AI venture funding hit $27.6 billion in 2025, more than double the year before, and a meaningful slice of that is chasing exactly this — the dream of data without dirt under its fingernails.⁹

So the honest tension is this. Generation is the one route that could break the linear cost of physical data, because its episodes are free at the margin. It is also the one route whose value rests entirely on a number — $\alpha$ — that nobody has yet driven to one, in precisely the contact-rich regimes that matter most. If that number climbs, the data wall cracks and the map stops mattering. If it stalls where it is, captured real data stays the scarce, grounding input, and everything we said about geography still stands. I would not bet the firm on $\alpha$ reaching one. I would bet the firm on it being worth watching like a hawk.

Notes

² Teleoperation data economics: roughly 10–100 operator-hours per task variant, with ~10× dataset scaling running into the millions, not thousands. IBM, "The data gap that's holding back robotics," plus industry estimates. This is the linear marginal cost generation is trying to beat.
⁵ NVIDIA Cosmos 3 (June 2026): open physical-AI world-foundation model trained on ~20T multimodal tokens (~1B images, 400M real + synthetic videos); generates action data (joint angles, gripper states, trajectories) and sim-to-real synthetic episodes. NVIDIA; Axios.
⁹ Robotics & physical-AI venture funding ≈ $27.6B in 2025, more than double 2024's $13.7B. PitchBook, Q4 2025 Robotics & Physical AI VC Trends.