Out of the Loop

Sidekick Engineering Team · Token Fusion Series, Part 2

The System

A frozen backbone already understands the scene. Last time, we showed how we extract that understanding as a single contextual token and fuse it into how our robots learn. Today we are sharing what we built with it first: a system that lets a deployed robot keep improving when there is no human in the loop at all.

01 The bottleneck we set out to remove

The verdict is the reward signal. It is also the bottleneck.

Our Learning Loop is simple in spirit: the robot attempts a task, a person says “good” or “not quite,” and the robot improves. A person can label a training session. A person cannot stand behind every robot in every deployment, every hour, rendering verdicts. So until now, the data a robot generated in the field, hundreds of real attempts in real conditions on real tasks, was unlabeled, and unlabeled meant unlearnable. The robot was practicing constantly and learning from none of it.

The question was never whether deployment data is valuable. It is the most valuable data in the ecosystem: it comes from exactly the environment the robot needs to master. The question was who supplies the verdict when nobody is there to give one.

Token Fusion gave us the answer. The robot does.

02 A judge that reads the token

A small learned probe. One question.

In our last post we described the contextual token: a compact summary of everything the model understands about the scene. We also sketched a family of lightweight heads that could read that one vector instead of re-deriving understanding from raw sensors, and we named visual goal verification as one of them.

That head now exists, and it is the heart of this system. We call it the judge. It is a small learned probe that reads the contextual token and answers one question: does this scene show a completed task?

Diagram: The Judge. The Contextual Token (scene, already understood) feeds the Judge Head (small learned probe), which outputs a Verdict of success or failure.
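To make the idea of a small probe concrete, here is a minimal sketch assuming the judge is a single logistic layer over the contextual token. The post does not specify the judge's architecture, so `JudgeHead`, `train_judge`, the learning rate, and the toy two-dimensional tokens are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class JudgeHead:
    """Hypothetical linear probe: one weight per token dimension plus a bias."""
    weights: List[float]
    bias: float = 0.0

    def score(self, token: Sequence[float]) -> float:
        """Estimated probability that the scene shows a completed task."""
        logit = self.bias + sum(w * x for w, x in zip(self.weights, token))
        # Numerically stable sigmoid.
        if logit >= 0:
            return 1.0 / (1.0 + math.exp(-logit))
        e = math.exp(logit)
        return e / (1.0 + e)


def train_judge(tokens: Sequence[Sequence[float]],
                labels: Sequence[int],
                lr: float = 0.5,
                epochs: int = 200) -> JudgeHead:
    """Fit the probe with plain stochastic gradient descent on log loss.

    Only the probe's own weights change; the tokens are treated as fixed
    inputs, which mirrors the frozen-backbone setup described in the post.
    """
    judge = JudgeHead(weights=[0.0] * len(tokens[0]))
    for _ in range(epochs):
        for token, label in zip(tokens, labels):
            error = judge.score(token) - label  # d(log loss)/d(logit)
            judge.weights = [w - lr * error * x
                             for w, x in zip(judge.weights, token)]
            judge.bias -= lr * error
    return judge


# Toy data (assumed): dimension 0 is high when the task looks finished.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
toy_labels = [1, 1, 0, 0]
toy_judge = train_judge(toy_tokens, toy_labels)
```

Calling `toy_judge.score(token)` returns a value between 0 and 1. How that score becomes a verdict depends on the decision threshold, which is chosen separately.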

03 The labels were already there

Training the judge required no new annotation campaign.

Every supervised training session we have ever run already contains human verdicts; that is what the Learning Loop collects. Scenes a person confirmed as successes become the judge's positive examples. Mid-task scenes, and the endings of attempts a person marked as failures, become its negatives. The supervision was sitting in our training data the whole time; the judge simply mines it.
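The mining step can be sketched as follows. This is an illustration under stated assumptions, not our data pipeline: the `Session` record, its field names, and the convention that the last recorded scene is the one the person judged are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Token = List[float]


@dataclass
class Session:
    """Hypothetical record of one human-supervised attempt."""
    tokens: List[Token]   # contextual token per recorded scene, in time order
    human_verdict: bool   # True if the person confirmed a success


def mine_judge_examples(sessions: List[Session]) -> List[Tuple[Token, int]]:
    """Turn existing human verdicts into (token, label) pairs for the judge.

    Label 1: the final scene of an attempt a person confirmed as a success.
    Label 0: every mid-task scene, plus the final scene of a failed attempt.
    """
    examples: List[Tuple[Token, int]] = []
    for session in sessions:
        if not session.tokens:
            continue  # nothing recorded, nothing to mine
        *mid_task, final = session.tokens
        examples.extend((token, 0) for token in mid_task)
        examples.append((final, 1 if session.human_verdict else 0))
    return examples


sessions = [
    Session(tokens=[[0.1], [0.5], [0.9]], human_verdict=True),
    Session(tokens=[[0.2], [0.3]], human_verdict=False),
]
examples = mine_judge_examples(sessions)
```

Under these assumptions, a single successful session yields one positive and several negatives, so the mined set skews negative. That is worth keeping in mind when evaluating the judge.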

This means the cost of standing up a judge for a task is the cost of the human-supervised training we were going to do anyway.

The human labels a few sessions. The judge learns to label everything after.

04 Precision first, by construction

A reward model that is sometimes wrong is not equally wrong in both directions.

If the judge misses a real success, we lose one good episode. Regrettable, recoverable. If the judge calls an unfinished task a success, something much worse happens: the policy is taught that an unfolded towel is a goal state. A false positive does not just waste data; it actively trains the robot toward the wrong objective.

So the system is precision-first by construction, not by hope. The judge's decision threshold is chosen by simulating the exact decision rule the downstream system will apply, on held-out episodes the judge never trained on, and selecting the operating point that catches as many true successes as possible subject to a precision requirement. When two thresholds tie, we take the more conservative one.

If no threshold can meet the precision bar on held-out data, because the task is too new or the labeled set too small, the system refuses to label autonomously. It will not silently degrade into a sloppy judge. It tells us to collect more human-labeled episodes instead.
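The selection rule and the refusal behavior can be sketched together. This is a minimal illustration that assumes the downstream decision rule is "label as success when the judge's score is at or above the threshold"; the function name `select_threshold` and the `min_precision` parameter are hypothetical, and the post does not state the actual precision bar.

```python
from typing import Optional, Sequence


def select_threshold(scores: Sequence[float],
                     labels: Sequence[int],
                     min_precision: float) -> Optional[float]:
    """Pick an operating point on held-out episodes.

    Among thresholds whose precision meets `min_precision`, choose the one
    with the highest recall. Ties on recall go to the higher, more
    conservative threshold. Returns None when no threshold qualifies,
    which signals a refusal to label autonomously.
    """
    total_positives = sum(labels)
    best = None  # (recall, threshold)
    for threshold in sorted(set(scores)):
        # Assumed downstream rule: success when score >= threshold.
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        if true_pos == 0:
            continue  # catches no real successes, so it is not useful
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives
        if precision < min_precision:
            continue
        if (best is None or recall > best[0]
                or (recall == best[0] and threshold > best[1])):
            best = (recall, threshold)
    return None if best is None else best[1]


# Held-out judge scores and human verdicts (toy values, assumed).
held_out_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.3]
held_out_labels = [1, 1, 0, 1, 0, 0]

strict = select_threshold(held_out_scores, held_out_labels, min_precision=1.0)
relaxed = select_threshold(held_out_scores, held_out_labels, min_precision=0.75)
refused = select_threshold([0.9, 0.8], [0, 1], min_precision=1.0)
```

In this toy data, the strict bar gives up one real success to avoid every false positive, and the last call returns `None` because no threshold meets the bar. In that case the caller should ask for more human-labeled episodes rather than fall back to a weaker threshold.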

False Negative (miss): Missed a real success. One good training episode lost. Regrettable. Recoverable.

False Positive (critical): Called an unfinished task a success. The policy learns the wrong objective. Not recoverable without intervention.

Rule

Taking the human out of the loop is a privilege the system has to earn, per task, with evidence.

05 Closing the loop

With a qualified judge in hand, the loop closes in three moves.

The loop repeats continuously in deployment.

Figure: Task Success Rate as the loop turns, human labels vs. autonomous, plotted over weeks 1 to 5. Values shown at week 5: task success rate 84%; human-labeled: 5%; autonomous: 95%. Percentages are demonstrative, not measured.

06 The human moves up the stack

We are not removing people from the system. We are removing them from the inner loop.

Humans still do what only humans can: they define what success means by labeling the starting sessions, they audit the runs the system flags, they spot-check when the numbers look too good, and they hold the veto on any checkpoint the guardrails do not like.

One afternoon of human judgment now supervises weeks of autonomous practice.

The person's verdict is not gone. It is amplified.

And the asymmetry that defines Token Fusion holds throughout. Nothing in this system ever touches the frozen backbone. The judge is a small head reading the token. The policy improves on top. The model that does the understanding keeps every bit of its broad competence, exactly as designed.

07 The Sidekick Layer grows

We think of the capabilities we build on top of the base model as the Sidekick Layer.

Safety was the first element. Reinforcement learning from human feedback (the Learning Loop) was the second. Autonomous learning in deployment is the third, and it is the first shipped application of Token Fusion.

Safety: Guardrails on every action

Reinforcement Learning from Human Feedback: The Learning Loop

Autonomous Learning in Deployment (new): Out of the Loop. Token Fusion, applied

One frozen backbone. One contextual token. One small judge reading it.

Robots that get better while nobody's watching.

Carefully, conservatively, and on the strength of understanding they already had.