There is no internet for robots
The web handed models a pre-recorded copy of human thought. For the body, that recording was never made — and you cannot scrape what was never written down.
The last decade of AI ran on a recording nobody set out to make. For thirty years, people posted what they thought and saw — text, photos, video — to the web, for their own reasons. The result is an accidental log of human cognition at planetary scale: trillions of tokens, billions of images, already digitized and, crucially, already labeled by a billion hours of human activity. A caption sits under a picture. A question sits above its answer. A reply explains a joke. Nobody wrote the web for the machines, yet the machines inherited a pre-graded copy of how we reason, for the cost of a crawl.
I have laid out the broader thesis elsewhere: that the binding constraint on physical-world AI is data, not architecture or compute. This piece goes one level down on the part people skim past. The claim isn't just that physical data is scarcer. It's that for an entire class of what a body does, the recording was never made at all, and you cannot scrape what was never written down.
What the web actually recorded
Be precise about what the internet is, because the precision is the whole argument. The web recorded what we said and what we saw. Text is a transcript of human language. Images and video are a transcript of human vision. Both are abundant past the point of exhaustion — frontier systems already train on the order of ten trillion tokens plus billions of images, and the labs now talk openly about a data wall because they are scraping the bottom of even that barrel.
Notice what is not on that list. The web recorded what we said and saw. It never recorded what we did with our bodies. When you pick up a coffee cup, your nervous system is running a control loop the web has no column for — and that loop, not the photograph of the cup, is what a robot has to learn.
The modalities that were never written down
Go modality by modality. A physical action is produced by three streams of information that text and video simply do not contain:
- Proprioception — the body's own sense of itself: every joint angle, the configuration of the limb, the internal state from which the next motion is computed. Your arm knows where it is in the dark. A third-person video does not.
- Force and contact — how hard the gripper squeezes, the exact instant contact begins and breaks, the friction and torque that decide whether the grape is held or crushed. This is the channel that separates a successful grasp from a mess, and it lives entirely off-camera.
- Action-trajectories — the actual motor commands, the sequence of states over time that produced the outcome. Not the result; the policy. Not the folded shirt; the path the hands took to fold it.
These were never logged at scale, anywhere, by anyone. There was no reason to: people don't narrate their joint angles, and a camera pointed at a task captures the scene, not the controller behind it. The single richest input to physical intelligence — the closed loop of state → force → motion → new state — is precisely the input the internet has no representation of.
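To make the contrast concrete, here is a minimal sketch of what one logged tick of that closed loop might look like next to a web video frame. Every class and field name below (`VideoFrame`, `RobotStep`, `Episode`, `contact_onsets`, `missing_from_video`) is hypothetical, chosen for illustration; real robot datasets each define their own schemas, and this is not any particular one.

```python
from dataclasses import dataclass, field, fields
from typing import List, Tuple


@dataclass
class VideoFrame:
    """What a web clip offers: a timestamp and pixels, nothing else."""
    t: float      # seconds since the start of the clip
    rgb: bytes    # encoded image; no depth, no camera pose


@dataclass
class RobotStep:
    """One tick of the loop: state -> force -> motion -> new state.

    Hypothetical schema, for illustration only.
    """
    t: float
    joint_positions: List[float]                # proprioception, one value per joint
    joint_velocities: List[float]
    contact_force: Tuple[float, float, float]   # force at the gripper, in newtons
    in_contact: bool                            # is the gripper touching something
    action: List[float]                         # motor command issued at this tick


@dataclass
class Episode:
    task: str
    steps: List[RobotStep] = field(default_factory=list)

    def contact_onsets(self) -> List[float]:
        """Timestamps at which contact switches from off to on."""
        onsets = []
        previously_touching = False
        for step in self.steps:
            if step.in_contact and not previously_touching:
                onsets.append(step.t)
            previously_touching = step.in_contact
        return onsets


def missing_from_video() -> List[str]:
    """Fields a RobotStep carries that a VideoFrame has no slot for."""
    video_fields = {f.name for f in fields(VideoFrame)}
    robot_fields = {f.name for f in fields(RobotStep)}
    return sorted(robot_fields - video_fields)
```

Calling `missing_from_video()` lists the columns that exist only on the robot side. A query like `contact_onsets()` is trivial once `in_contact` has been logged, and has nothing to read from when the only input is a sequence of `VideoFrame` objects.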
What the web recorded (abundant)
- Text — a transcript of language
- Images — a transcript of vision
- Video — vision, over time
What the body needs (nearly absent)
- Proprioception — joint angles, body state
- Force & contact — grip, friction, the moment of touch
- Action-trajectories — the motion that caused the outcome
Why YouTube isn't robot data
The obvious objection is video. The internet holds billions of hours of footage of people doing things; surely that is a corpus of physical action? It isn't, and the reasons are worth stating one at a time, because the gap is structural rather than a matter of volume.
A third-person clip is pixels with no action labels. You see a hand move; you do not get the joint angles that moved it. You see a box lifted; you do not get the forces — whether it was heavy or light, where the grip was, when contact began. The footage is monocular and uncalibrated: no reliable 3D, no depth, no camera pose, so you cannot even recover the geometry of what happened, let alone the dynamics. And it shows the result, not the policy — the outcome of a control loop you never observed. A video of someone folding a shirt does not contain the torques. It is a recording of the effect of an action, with the action itself — the thing you actually need to imitate — edited out by the medium.
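The point about the lifted box can be illustrated with a toy calculation. The sketch below treats the box as a point mass lifted straight up at constant acceleration, which is a deliberate simplification; the masses, the acceleration, and the function names are all made up for illustration. It shows two lifts whose on-camera motion is identical while the force behind them differs by a factor of twenty.

```python
G = 9.81  # gravitational acceleration, m/s^2


def lift_heights(duration_s, accel, dt=0.1):
    """Height of the box at each sample time: all the camera can observe."""
    n = int(round(duration_s / dt))
    return [0.5 * accel * (i * dt) ** 2 for i in range(n + 1)]


def required_lift_force(mass_kg, accel):
    """Newton's second law for a vertical lift: F = m * (a + g)."""
    return mass_kg * (accel + G)


# Two hypothetical boxes lifted along exactly the same path.
light = {"mass_kg": 0.5, "accel": 0.2}
heavy = {"mass_kg": 10.0, "accel": 0.2}

# The observable trajectory depends only on the acceleration, not the mass.
same_pixels = lift_heights(1.0, light["accel"]) == lift_heights(1.0, heavy["accel"])

# The force that produced it depends on the mass, which the video never shows.
force_light = required_lift_force(**light)
force_heavy = required_lift_force(**heavy)
```

Here `same_pixels` is true, yet `force_light` is about 5 N and `force_heavy` is about 100 N. A model that sees only the heights has no way to tell the two apart, and that force is exactly the label a controller would need.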
This is why the serious attempts to mine human video work so hard to add back what the medium drops. Apple's EgoDex, released in 2025, is a large-scale egocentric dataset that pairs first-person footage with 3D hand-and-finger tracking [4], an honest attempt to recover the missing geometry. It is a meaningful shortcut, and it is also the exception that proves the rule: you have to instrument the capture specifically, on purpose, to extract a fraction of what was never natively recorded. The raw web does not hand it to you.
The size of the hole
Now put numbers on the gap, because the scale is hard to feel otherwise. The internet side is trillions of tokens and billions of images. The richest pooled robot corpus we have is Open X-Embodiment [1]: sixty existing datasets from 34 research labs, spanning twenty-two robot types, painstakingly assembled into a single dataset. It lands at a little over a million episodes. Set against the web, that is a rounding error.
There is no common crawl for bodies moving through space, and — this is the part that matters — there can't be one to find, because the underlying signal was never recorded in the first place. A crawler can only harvest what someone already wrote to disk. Proprioception, force, and trajectories were never written to disk, so there is nothing on the open web to crawl. The corpus does not exist as latent, un-scraped data waiting for a better spider. It has to be manufactured, one recorded episode at a time.
The asymmetry is not an accident of where we are on a curve; it is a property of the two datasets. The web's marginal example was free because it was a byproduct of people living their digital lives. Physical-interaction data is nobody's byproduct: every episode is produced on purpose, by someone, in real time, on specific hardware. That is why the field's whole effort can be reframed as a question of building a corpus: pool what exists across labs and bodies (Open X-Embodiment), and capture more of it more cheaply (the EgoDex-style human-video shortcut). And it is why the talent is voting with its feet: robot-learning and robot-foundation-model papers grew roughly 60% from 2022 to 2024, outpacing other applied-AI subfields [2], while robotics and physical-AI venture funding hit $27.6B in 2025, more than double the year before [3]. The money and the researchers have both concluded that the bottleneck is the corpus, not the model.
For language, the wall is that we are running out of text to scrape. For the body, the wall is harder: the data was never written down, so there is nothing to scrape at any price — only something to record.
What would make me wrong
Here is the live counter-bet, and it is a strong one. Maybe the missing modalities don't have to be recorded directly — maybe they can be reconstructed. Pour enough passive human video into a good enough world model, and perhaps the system learns to infer the hidden state: to predict the forces from the pixels, the 3D from the monocular frame, the trajectory from the result. If video-plus-world-models can hallucinate the proprioception and contact channels accurately enough, then the hole I just described gets filled by inference rather than capture, and the premise of this piece weakens. EgoDex is an early gesture in exactly that direction — squeeze 3D hand tracking out of first-person footage — and it is improving.
I don't think it closes the gap soon, for one stubborn reason: the channels that are missing are the ones that are hardest to infer from the channels that survive. Force and contact are nearly invisible in pixels — two grasps that look identical on camera can differ entirely in grip and friction — and the sim-to-real gap is widest exactly where the value is highest, in dexterous, contact-rich, gloriously messy tasks. A reconstruction has to be validated against ground truth, and the ground truth is the recorded interaction data we don't have. So real capture stays the scarce, grounding input even in the world where world models work. But I'll name the test plainly: if a generalist manipulation policy trained mostly on passive human video starts matching one trained on teleoperated, fully-instrumented episodes — on contact-rich tasks, not just pick-and-place — then the recording was less essential than I claimed, and the corpus could be inferred after all. That is the result I am watching for. Until it lands, the honest summary holds: there is no internet for robots, and there can't be — it has to be built.
- [1] Open X-Embodiment: a cross-embodiment robotics dataset pooling 60 existing datasets from 34 research labs across 22 robot types, a little over a million episodes in total; the largest open corpus of recorded robot interaction, and still a rounding error next to the web.
- [2] Robot-learning and robot-foundation-model publications grew roughly 60% from 2022 to 2024, outpacing other applied-AI subfields. Source: survey literature on the robot-learning field.
- [3] Robotics and physical-AI venture funding reached about $27.6B in 2025, more than double 2024's $13.7B. Source: PitchBook, Q4 2025 Robotics & Physical AI VC Trends.
- [4] Apple EgoDex (May 2025): large-scale egocentric human video with 3D hand-and-finger tracking; a passive-capture attempt to recover, from first-person footage, some of the geometry that third-person video drops. Source: Apple.