Sidekick Token Fusion

Sidekick Engineering Team · Sidekick Token Fusion: An Advancement to the Sidekick Learning Loop

The Method

We introduce Sidekick Token Fusion, a form of asymmetric latent conditioning where a frozen VLM backbone acts as a semantic feature extractor, providing a contextual token that is fused with low-level proprioceptive states to condition our downstream RL policy.

Every Sidekick Robot is already capable. The hard part was making it better, quickly, in the real world, without shipping it back to the lab.

That's the job of our Learning Loop: the robot attempts a task, a person says “good” or “not quite,” and the robot improves. Simple in spirit, brutally hard in practice. Real attempts are slow and feedback is precious. A robot that needs hundreds of tries to learn one improvement will never keep pace with the real world.
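As a rough illustration of the loop described above, the sketch below treats a person's verdict as a binary reward and nudges a toy two-action softmax policy toward whatever earned a "good." Everything in it (the `feedback_to_reward` mapping, the REINFORCE-style update, the learning rate) is a hypothetical stand-in chosen for clarity, not Sidekick's actual training code.

```python
import numpy as np

def feedback_to_reward(feedback: str) -> float:
    """Map a person's verdict to a scalar reward (assumed mapping)."""
    return 1.0 if feedback == "good" else 0.0

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def learning_loop_step(logits, action, feedback, lr=0.5):
    """One REINFORCE-style update for a toy softmax policy over discrete actions."""
    probs = softmax(logits)
    reward = feedback_to_reward(feedback)
    grad_log_prob = -probs            # gradient of log pi(action) w.r.t. the logits...
    grad_log_prob[action] += 1.0      # ...is one_hot(action) - probs
    return logits + lr * reward * grad_log_prob

logits = np.zeros(2)
logits = learning_loop_step(logits, action=1, feedback="good")
unchanged = learning_loop_step(logits, action=0, feedback="not quite")
```

With this assumed mapping, a "not quite" carries zero reward and leaves the toy policy unchanged, which is one way to see why every real attempt needs to carry as much signal as possible.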

So the question driving us is blunt: how do we get the most learning out of the fewest tries? Sidekick Token Fusion is how.

01 The problem

A learner can only learn as well as it has been trained. Vision-Language-Action models come with a prior built from over 10k hours of robot data and millions of image-text pairs.

Think of the part of our system that improves the robot as a coach: in reinforcement-learning terms, a policy that learns from reward. A coach is only as good as their read of the situation.

Give that coach nothing but joint angles, numbers describing where the arms are, and it's flying blind: it knows posture, not whether the towel is bunched, half-folded, or slipping. Give it the raw camera feed instead, and it's like coaching a sport it has never seen, frame by frame, from scratch. It would learn what matters eventually, after an enormous number of games. We don't have an enormous number of games. We have a robot that can run a handful of real attempts before it needs to rest.

There is a third input: the contextual token, the model's own understanding, ready to use.

Figure: What the RL policy sees, before and after Token Fusion.

Without Token Fusion: only joint positions. The coach is blind to the scene.

With Token Fusion: semantic latent fused with proprioception. The coach sees the scene.

The same policy architecture, fed a fundamentally richer signal.

02 Where the token comes from

The understanding is already inside the model.

The robot's main brain is a Vision-Language-Action model. Inside it, the image patches and the text prompt are converted into visual and text sequence tokens, which feed through pre-trained transformer layers. The hidden states the model forms just before it decodes the action sequence are a dense, highly generalized context embedding, a compact summary of everything the model has understood about this exact moment, shaped by tens of thousands of hours of experience.

That embedding is the prize. We don't teach the coach to understand the world from scratch; we read the understanding the model has already produced, and we leave the model itself untouched.

Figure: Token extraction pipeline. Image patches and the text prompt are tokenized into visual + text tokens, which flow through the frozen pre-trained transformer layers; the hidden state just before action decoding becomes the contextual token.
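The pipeline above can be sketched in a few lines. This is a toy under stated assumptions: the weights are random stand-ins for the frozen pre-trained layers, a single `tanh` layer replaces the transformer stack, the dimensions are made up, and mean pooling over the sequence is our guess at how the hidden states might be reduced to one token. The function name `extract_contextual_token` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 16  # hypothetical hidden size

# Stand-ins for frozen, pre-trained weights (random here, never updated).
W_PATCH = rng.normal(size=(48, D_MODEL))    # projects flattened image patches
EMBED = rng.normal(size=(100, D_MODEL))     # text token embedding table
W_LAYER = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

def extract_contextual_token(patches, text_ids):
    """Return one vector summarizing the hidden states before action decoding."""
    visual_tokens = patches @ W_PATCH             # (num_patches, D_MODEL)
    text_tokens = EMBED[text_ids]                 # (num_text, D_MODEL)
    tokens = np.concatenate([visual_tokens, text_tokens], axis=0)
    hidden = np.tanh(tokens @ W_LAYER)            # toy stand-in for transformer layers
    return hidden.mean(axis=0)                    # assumed pooling: mean over the sequence

patches = rng.normal(size=(4, 48))    # 4 patches, each 4x4x3 pixels, flattened
text_ids = np.array([5, 17, 42])      # a made-up tokenized prompt
token = extract_contextual_token(patches, text_ids)
```

Whatever the real reduction is, the point the sketch preserves is that the token is read off a forward pass; nothing in the backbone is modified to produce it.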

03 The fusion

Asymmetric latent conditioning, in one picture.

Here is the whole method. The frozen VLM backbone acts as a semantic feature extractor, emitting the contextual token. We fuse that high-dimensional semantic latent with the robot's low-level proprioceptive state, the physical sense of where its body is, into a single conditioning vector. That fused vector is what conditions our downstream RL policy.
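A minimal sketch of the fusion step, assuming the simplest possible operator: concatenation. The source does not say which operator is used, so `fuse`, the 16-dimensional token, the 7 joint angles, and the linear `policy_action` head are all illustrative assumptions.

```python
import numpy as np

def fuse(contextual_token, proprio_state):
    """Fuse by concatenation (an assumption; the text only says the two are fused)."""
    return np.concatenate([contextual_token, proprio_state])

def policy_action(fused, W, b):
    """Toy linear policy head mapping the conditioning vector to joint commands."""
    return np.tanh(W @ fused + b)

rng = np.random.default_rng(0)
contextual_token = rng.normal(size=16)   # hypothetical 16-d semantic latent
proprio_state = rng.normal(size=7)       # hypothetical 7 joint angles
fused = fuse(contextual_token, proprio_state)

W = rng.normal(size=(7, fused.size)) * 0.1   # trainable policy weights
b = np.zeros(7)
action = policy_action(fused, W, b)
```

Concatenation keeps the semantic and proprioceptive parts separately addressable in the fused vector; other operators (addition after projection, gating) would be equally compatible with the description.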

We call it asymmetric for a reason: information flows one way. The token shapes the policy, but the policy's learning never flows back into the backbone. The model that does the understanding is never disturbed, so it keeps every bit of its broad, pre-trained competence while the policy specializes on top.
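The one-way flow can be illustrated with a hand-written gradient step. In this sketch the backbone is a fixed random matrix, the policy is a linear map trained on a squared error, and the update touches only the policy weights. All names, shapes, and the loss are hypothetical; a real system would train on an RL objective.

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(8, 12))   # frozen stand-in for the VLM backbone
w_policy = np.zeros(8 + 3)              # trainable policy weights

def backbone_token(obs):
    """Forward pass only; no gradient is ever computed for W_backbone."""
    return np.tanh(W_backbone @ obs)

def policy_update(w, fused, target, lr=0.01):
    """One gradient step on 0.5 * (w @ fused - target)**2, w.r.t. the policy weights only."""
    error = w @ fused - target
    return w - lr * error * fused

obs = rng.normal(size=12)
proprio = rng.normal(size=3)
backbone_before = W_backbone.copy()

fused = np.concatenate([backbone_token(obs), proprio])
w_new = policy_update(w_policy, fused, target=1.0)
```

In an autodiff framework the same effect is usually obtained by wrapping the token in a stop-gradient (for example `jax.lax.stop_gradient`) or by detaching it before it reaches the policy.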

Figure: Asymmetric latent conditioning. The frozen VLM backbone, acting as a semantic feature extractor, emits the contextual token. The token is fused (⊕) with the low-level proprioceptive state (where the body is) into a fused vector, the conditioning input to the RL policy. No gradients return to the frozen backbone.

The robot was already paying attention. Now the part that makes it better is conditioned on what it sees.

04 What it bootstraps next

One token compresses an enormous amount of meaning.

Because this token compresses massive amounts of multimodal information, visual, textual, and behavioral, into a single dense vector, it is a remarkably versatile interface. Anything downstream that needs to know what's going on can read this one vector instead of re-deriving understanding from raw sensors.

We started with the RL policy, but the same token can power far more. Each new use is a small, lightweight head reading the vector, and none requires us to retrain the model beneath.

Figure: The contextual token, one dense vector, feeding several heads. Each is a small, lightweight consumer. None requires retraining the backbone.
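To make the idea of lightweight heads concrete, here is a sketch in which several small linear heads read the same token. The head names are borrowed from the uses this article mentions; the `LinearHead` class, the output sizes, and the 16-dimensional token are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
TOKEN_DIM = 16  # hypothetical token size

class LinearHead:
    """A small trainable consumer of the contextual token."""
    def __init__(self, out_dim):
        self.W = rng.normal(size=(out_dim, TOKEN_DIM)) * 0.1
        self.b = np.zeros(out_dim)

    def __call__(self, token):
        return self.W @ token + self.b

heads = {
    "detection": LinearHead(4),      # output sizes are made up
    "verification": LinearHead(1),
    "safety": LinearHead(1),
}

token = rng.normal(size=TOKEN_DIM)   # computed once by the frozen backbone
outputs = {name: head(token) for name, head in heads.items()}
```

The backbone runs once per observation and every head reuses the result, so adding a head costs one small matrix multiply rather than another pass through the large model.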

05 The bigger picture

Understand once, condition everything.

Sidekick Token Fusion began as a way to make the Learning Loop sharper. Condition the policy on the model's own read of the scene, and every real-world attempt teaches more. For a robot that can't practice forever, that is the difference between learning in a handful of tries and never learning at all.

But the deeper lesson generalizes. The most powerful signal for teaching and extending a robot may not be a new sensor or a bigger dataset. It is the understanding the robot already carries. Extract it once from a frozen backbone, fuse it where it's needed, and one dense token becomes a foundation for control, detection, verification, planning, safety, and even moving across entirely different bodies.

One frozen backbone. One contextual token.

A growing number of ways to put it to work.

That's Sidekick Token Fusion.