What a robot's training data actually costs

Teleoperation is the gold standard, and it costs 10–100 operator-hours per task. This is the unit economics of data you can't crawl — and why scaling 10× costs millions, not thousands.

Our flagship thesis made the structural claim: physical-world AI is bottlenecked not on architecture or compute but on data that was never recorded, and that has to be manufactured one example at a time. This piece is the invoice. If you actually try to build a robot dataset, what does it cost — per example, per task, per order of magnitude? The answer is the most important number in the field, and almost nobody writes it down. So we will.

Start with the asymmetry that makes the whole thing hard, because it is an economics problem before it is an engineering one. The internet was free per example. The crawl that fed the language-model era cost real money — datacenters, bandwidth, storage — but you paid it essentially once, and then amortized it over every token and every model thereafter. The marginal cost of the ten-trillionth token was about zero. That single fact is what made scale a strategy: when each additional example is free, "use more data" is always the right move.

Physical data inverts it. There is no crawl, because the data does not exist anywhere to be scraped — a hand stacking dishes, a gripper finding a bolt by feel, the ten ways a fold goes wrong were never written down. To get one example you pay a person to produce it, in real time. To get the next one you pay again. The marginal cost never falls to zero, because there is no artifact to amortize over. You do not crawl physical data; you buy it by the hour.

Internet data has a fixed cost and zero marginal cost. Physical data has the opposite shape — and that shape is the entire problem.

The two cost curves

Write cost $C$ as a function of dataset size $N$, the number of examples (demonstrations, episodes, trajectories). For the web, the cost is dominated by the one-time crawl and barely moves as you ask for more:

$$ C_{\text{web}}(N) \;\approx\; C_0 $$

$C_0$ is paid up front; the curve is flat. For robot data, every example carries its own price tag, so the cost is linear in the number of examples:

$$ C_{\text{robot}}(N) \;\approx\; c \cdot N $$

where $c$ is the all-in cost of a single example — operator-hours per example times the hourly wage (plus rig time, plus the examples you throw away). There is no $1/N$ anywhere in that expression, and that absence is the whole story. The punchline falls straight out of the algebra: to go from $N$ to $10N$ examples you pay $\approx 10\,C_{\text{robot}}(N)$. Ten times the dataset is ten times the bill. No economies of scale, no amortization, no free tail.

Figure (illustrative; the shapes, not the scale): the web curve is flat in $N$ because one crawl amortizes; the robot curve rises linearly because every demonstration is bought again.
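
One way to make the missing $1/N$ explicit, using only the two expressions above, is to compare the average and the marginal cost per example under each curve. This is a sketch that treats $N$ as continuous for convenience:

$$ \frac{C_{\text{web}}(N)}{N} \;\approx\; \frac{C_0}{N} \;\to\; 0 \quad (N \to \infty), \qquad \frac{C_{\text{robot}}(N)}{N} \;\approx\; c $$

$$ \frac{dC_{\text{web}}}{dN} \;\approx\; 0, \qquad \frac{dC_{\text{robot}}}{dN} \;\approx\; c $$

The web's average cost falls with scale because a fixed bill is spread over more examples. The robot's average and marginal cost both stay pinned at $c$, however large $N$ gets.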

Teleoperation, by the hour

The gold-standard source — the one every other method gets measured against — is teleoperation. A human operator drives the robot through a task with a controller, a glove, or a leader arm, while the system records the full action: every joint angle, gripper command, and the camera stream that goes with it. The result is exactly what a policy wants to imitate — real actions, on the real robot, in the real world, with all the contact and friction intact. It is also the most expensive data on Earth, per useful example.

The reported rule of thumb is 10 to 100 operator-hours per task variant.¹ The spread is the whole education. The low end is a simple, repeatable motion in a fixed setup; the high end is a contact-rich task that has to be demonstrated across the lighting, clutter, object instances, and failure recoveries you need before a policy generalizes instead of memorizing. And "operator-hour" undercounts: a chunk of every session goes to resets, mis-grasps, and throwaway takes, so the useful yield per paid hour is lower than the clock suggests. Put numbers to $c$: a few hundred clean demonstrations of a single variant, at tens of operator-hours and a real loaded wage, with rig time and discarded takes on top, can land a one-task dataset in the tens of thousands of dollars before you have shipped anything.
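
As a rough worked example of how the per-variant bill and $c$ compose, the sketch below multiplies paid operator-hours by an hourly burn rate and then divides by the demonstrations that survive review. Only the 10 to 100 hour range comes from the text. The wage, the rig rate, the number of takes, and the keep rate are hypothetical placeholders chosen to make the arithmetic concrete, and the function names are illustrative.

```python
def task_variant_cost(operator_hours: float,
                      loaded_wage: float,
                      rig_rate: float) -> float:
    """Total dollars for one task variant: paid hours times the hourly burn.

    All three inputs are assumptions supplied by the caller.
    """
    return operator_hours * (loaded_wage + rig_rate)


def cost_per_clean_example(total_cost: float,
                           takes_recorded: int,
                           keep_rate: float) -> float:
    """The all-in unit cost c: dollars per demonstration that survives review."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    clean_takes = takes_recorded * keep_rate
    return total_cost / clean_takes


# Hypothetical placeholders, not figures from the cited sources.
LOADED_WAGE = 75.0   # dollars per operator-hour, including overhead
RIG_RATE = 50.0      # dollars per hour of robot and workcell time
TAKES = 400          # demonstrations recorded for the variant
KEEP_RATE = 0.75     # share of takes kept after discarding failures

low = task_variant_cost(10, LOADED_WAGE, RIG_RATE)    # low end of the 10-100 hour range
high = task_variant_cost(100, LOADED_WAGE, RIG_RATE)  # high end of the range
c = cost_per_clean_example(high, TAKES, KEEP_RATE)

print(f"per variant: ${low:,.0f} to ${high:,.0f}")
print(f"c at the high end: ${c:,.2f} per clean example")
```

Under these made-up inputs a single variant costs between $1,250 and $12,500. Halving the keep rate doubles $c$ without changing the hours paid, which is why the discard rate matters as much as the headline hour count.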

Now turn the crank on $C_{\text{robot}}(N) = c\cdot N$. A capable manipulation policy doesn't want one task; it wants the long tail — hundreds of variants across objects and scenes. Because the curve is linear with no volume discount, the bill scales with ambition, and pushing a serious robot dataset up an order of magnitude runs into the millions, not the thousands.¹ That is the sentence to sit with. In the language-model world, an order of magnitude more data was a procurement detail. Here it is a fundraising round.
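
To see how quickly a linear curve reaches seven figures, here is a minimal sketch of the order-of-magnitude arithmetic. It counts $N$ in task variants rather than individual examples, which is an assumption made for readability, and the per-variant cost of $12,500 is a hypothetical placeholder, not a number from the cited sources.

```python
def robot_dataset_cost(n_variants: int, cost_per_variant: float) -> float:
    """Linear model from the text, C_robot(N) = c * N, with N counted in task variants."""
    if n_variants < 0:
        raise ValueError("n_variants must be non-negative")
    return n_variants * cost_per_variant


def cost_of_scaling(n_variants: int, factor: int, cost_per_variant: float) -> float:
    """Extra dollars needed to grow a dataset from N to factor * N variants."""
    before = robot_dataset_cost(n_variants, cost_per_variant)
    after = robot_dataset_cost(n_variants * factor, cost_per_variant)
    return after - before


# Hypothetical per-variant cost in dollars; not a figure from the cited sources.
COST_PER_VARIANT = 12_500.0

for n in (10, 100, 1_000):
    total = robot_dataset_cost(n, COST_PER_VARIANT)
    print(f"{n:>5} variants: ${total:,.0f}")

extra = cost_of_scaling(100, 10, COST_PER_VARIANT)
print(f"100 -> 1,000 variants costs an extra ${extra:,.0f}")
```

With that placeholder, 100 variants cost $1.25M and the step to 1,000 variants adds another $11.25M. The exact figures move with the assumed unit cost; the ten-for-ten proportionality does not.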

Why "just collect more" isn't a plan

When the marginal example is free, scale is a strategy. When the marginal example costs an operator-hour, scale is a budget — and the budget grows linearly with the very generality you're trying to buy.

The cheaper sources — and the tax each one charges

Every serious lab knows the teleop curve is brutal, so the real work is bending it: finding data that costs less per example. There is no free lunch in that search. Each cheaper source buys down the cost and pays for it somewhere — in fidelity, in coverage, or in a domain gap you have to close later. The trade is the point.

The most important cost-bender is passive human video: instead of paying someone to operate a robot, record people doing the task with their own hands. Apple's EgoDex — released May 2025, large-scale egocentric video with 3D hand and finger tracking — is the cleanest example of the shortcut.² Footage like this is enormously more scalable per dollar: no robot, no rig, no operator wage, just cameras on tasks that were going to happen anyway. But it pays a steep fidelity tax. There are no action labels — you see the hand move but not the joint commands that would reproduce it. There are no forces — grip, friction, and contact, the signals that matter most for manipulation, simply aren't in the pixels. And there is a domain gap: a five-fingered human hand is not a two-fingered gripper, so the demonstration has to be retargeted onto a body it was never recorded for. You traded dollars for distance-from-the-target.

Teleoperation

Cost per example: High (an operator-hour each, no discount)
Fidelity: Native (real actions and contact on the real robot)
Scalability: Linear, capped by human throughput
What's missing: Nothing, but you pay full price for it

Passive human video

Cost per example: Low (cameras, no robot or operator wage)
Fidelity: Lossy (pixels only, no torques)
Scalability: High (record work happening anyway)
What's missing: Action labels, forces, hand-to-gripper gap

The shortcut and its tax: passive human video is the scalable-but-lossy source. Sources: industry; Apple EgoDex.¹ ²

Two other levers bend the same curve, each with its own discount and its own tax. Pooling spreads the cost across labs: rather than every team paying to re-collect the same skills, share one corpus. Open X-Embodiment — assembled by 60+ labs across 22 robot types — is the canonical pool, and the discount is real because skills partly transfer across bodies.³ The tax is heterogeneity: data captured on someone else's robot, in someone else's lab, with someone else's conventions, is noisier and needs alignment before it helps yours. Synthetic data from world models is the most aggressive lever — generate examples instead of capturing them, and the marginal cost of the millionth simulated trajectory really can approach zero, which is the one way to recover the web's economics. Its tax is the sim-to-real gap: a synthetic example that is subtly wrong about friction or contact teaches the policy something false, and you only find out on hardware.
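
One way to compare the four sources on a single axis is to divide each source's sticker price by the share of its examples that still help the target robot once the tax is paid. The sketch below does that. The `Source` class, the `usable_fraction` field, and every number in it are hypothetical placeholders, so the resulting ordering is an artifact of made-up inputs and not a ranking of real datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    cost_per_example: float  # dollars per raw example (hypothetical)
    usable_fraction: float   # share still useful after the tax (hypothetical)

    def effective_cost(self) -> float:
        """Dollars per example that actually helps the target robot."""
        if self.usable_fraction <= 0.0:
            # Data that never transfers has no finite price per useful example.
            return float("inf")
        return self.cost_per_example / self.usable_fraction


# Placeholder values only; none of these come from the cited sources.
SOURCES = [
    Source("teleoperation", 40.0, 1.0),
    Source("passive human video", 2.0, 0.125),
    Source("pooled cross-embodiment", 6.0, 0.5),
    # A task where the simulator is wrong about contact, so nothing transfers.
    Source("synthetic (world model)", 0.5, 0.0),
]

ranked = sorted(SOURCES, key=Source.effective_cost)
for s in ranked:
    print(f"{s.name:<26} ${s.effective_cost():,.2f} per useful example")
```

The point of the design is that the cheapest source per raw example can be the most expensive per useful one: in this toy setup the synthetic source has the lowest sticker price and an unbounded effective cost.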

Every lab is a data-cost engineering company

Step back and the strategy picture collapses to one axis. Architectures are largely shared, compute is for sale, and the open robot-learning recipes are converging. What is not commoditized is the cost of data — and so the entire competitive game in physical AI is bending the cost-of-data curve: lowering $c$, or substituting cheaper sources for expensive ones without letting fidelity collapse. Every serious lab is, underneath its mission statement, a data-cost-engineering company — whether it frames itself that way or not. The teleop rig, the wearable-video pipeline, the cross-embodiment pool, the world model: these are not separate bets. They are four attempts to move the same curve.

Capital has priced this in, even if it narrates the story as "robots." Robotics and physical-AI venture funding hit $27.6B in 2025, more than double the $13.7B of 2024.⁴ That is not a bet that the data wall is low. It is a bet that it is solvable and compounding: that whoever bends the curve first captures data faster, trains better policies, deploys more, and captures still more data. That loop is the one place in this business where returns might actually scale. The funding ramp under a single leader makes the conviction legible: Physical Intelligence went from a $2.4B valuation to a reported north of $11B, in a round still in talks as of March 2026, in roughly sixteen months.⁵ The money is not paying for models. It is paying for a cheaper unit cost of reality.

10–100 hrs: operator-hours per task variant, the gold-standard price
10× data: ≈ millions, not thousands; no economies of scale
$27.6B: 2025 robotics & physical-AI VC, betting it's solvable

The unit economics, in three numbers. Sources: IBM / industry; PitchBook.¹ ⁴

What's hard — and what would make me wrong

Here is the honest part, and it is not a footnote. The cost-of-data curve is real, but so is the fidelity-cost frontier, and the frontier is a genuine trade — not a free lunch waiting to be unlocked. Cheaper data that is subtly wrong can be worse than no data at all. A human-video clip with no forces, retargeted onto the wrong hand, or a synthetic trajectory that lies about contact, doesn't just fail to help — it can actively teach a policy a false model of the world, and you pay to discover that on a real robot. Bending the curve downward in dollars while bending it upward in error is not progress; it is a more expensive way to be wrong.

So what would make me wrong about the whole framing? If world models close the sim-to-real gap on contact-rich, dexterous tasks — not just on navigation and pick-and-place — then the marginal cost of an example really does fall toward zero, and physical data inherits the web's economics after all. The linear curve becomes a fixed cost plus a flat tail, and the "you buy it by the hour" thesis dies a deserved death. I don't think it happens soon, because the gap is widest exactly where the value is highest, and synthetic data still has to be grounded in and validated against real capture it cannot fully replace. But it is the live counter, and I'd rather name it than pretend the curve bends only one way.
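
The counter-scenario can be written in the same notation as the two cost curves. As a sketch, assume a world-model pipeline has a one-time cost $C_{\text{wm}}$ and a small per-example generation cost $\epsilon < c$. Both symbols are introduced here for illustration and do not come from the sources:

$$ C_{\text{synth}}(N) \;\approx\; C_{\text{wm}} + \epsilon \cdot N $$

Setting this equal to the teleoperation curve gives the break-even dataset size $N^{*}$:

$$ C_{\text{wm}} + \epsilon N^{*} \;=\; c\,N^{*} \quad\Longrightarrow\quad N^{*} \;=\; \frac{C_{\text{wm}}}{c - \epsilon} $$

Beyond $N^{*}$ the synthetic route is cheaper in dollars, but only if a synthetic example is worth as much to the policy as a real one. Any real capture still needed for grounding and validation adds a $c \cdot N_{\text{real}}$ term back onto the synthetic bill, where $N_{\text{real}}$ is the (assumed) number of real examples required.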

Until then, the number stands. A robot's training data costs an operator-hour at a time, the bill scales linearly with the generality you are trying to buy, and the winners will be the teams that drive $c$ down without quietly letting fidelity go with it. That is the unit economics of intelligence for the physical world — and it is the most underpriced line item in AI.

Notes

1. Teleoperation data economics: roughly 10–100 operator-hours per task variant, with ~10× dataset scaling running into the millions rather than thousands because there are no economies of scale. IBM, "The data gap that's holding back robotics," plus industry estimates.
2. Apple EgoDex: large-scale egocentric video dataset with 3D hand and finger tracking, released May 2025; a passively scalable source of human-demonstration data, at the cost of action labels, forces, and a human-to-gripper domain gap. Apple.
3. Open X-Embodiment: a cross-embodiment robotics dataset assembled by 60+ labs across 22 robot types; pooling spreads collection cost and skills partly transfer across bodies, at the cost of heterogeneity.
4. Robotics & physical-AI venture funding ≈ $27.6B in 2025, more than double 2024's $13.7B. PitchBook, Q4 2025 Robotics & Physical AI VC Trends.
5. Physical Intelligence funding: $400M Series A at $2.4B (Nov 2024) → $600M Series B at $5.6B (Nov 2025) → ~$1B at >$11B (in talks, Mar 2026), roughly $2.4B to >$11B in ~16 months. Bloomberg; TechCrunch; Sacra.