T-Rex: Teaching Robots to Feel in Real Time — Not Just See and Plan
When Touch Becomes the Missing Half of Dexterous Intelligence
Have you ever flipped a page while reading, transferred an egg between hands, or screwed in a light bulb without looking?
You probably didn't consciously "plan" each motion. Your fingers felt friction, detected slip, adjusted grip force, and corrected alignment — often within milliseconds. Vision told you where to go; touch told you when and how hard to act.
For robots, this split has been brutally one-sided. Today's Vision-Language-Action (VLA) models excel at coarse approach and semantic reasoning, but they largely ignore touch — or treat it as a static side channel. The result: policies that look smooth until the moment of contact, then fail on the tasks humans handle effortlessly.
In T-Rex: Tactile-Reactive Dexterous Manipulation — a collaboration among researchers at UC Berkeley, NVIDIA, Stanford, and others — the team asks a direct question: Can a robot policy react to high-frequency tactile signals the way human hands do, without giving up the generalization power of modern VLAs?
The answer, validated on a bimanual dexterous platform with 22-DoF Sharpa Wave hands, is a decisive yes: 65% average success across 12 contact-rich real-world tasks, 30 percentage points above the strongest prior baseline.
The Problem: Vision-Only Policies Are Blind at the Critical Moment
Contemporary robot learning has made remarkable progress on locomotion, navigation, and coarse manipulation. Yet three bottlenecks keep dexterous, touch-dependent skills out of reach:
- "Tactile blindness" in VLAs
Most manipulation policies are trained on motion trajectories with little or no force/touch feedback. Even when tactile sensors exist, current VLAs often lack the architecture to ingest high-frequency, time-varying touch signals without drowning out vision and language.
The T-Rex paper makes this concrete: simply adding tactile input to π0.5 drops average success from 17% to 6% — touch without the right representation hurts more than it helps.
- Static encoders can't capture dynamic contact
Existing tactile encoders compress touch into frame-level features. But real manipulation — wiping a plate, fanning cards, threading a bulb — depends on how contact evolves over time: slip onset, compliance, micro-adjustments. A static snapshot is not enough.
- The data drought for touch-rich dexterity
High-quality tactile manipulation data is scarce and expensive. Without scalable collection recipes and standardized benchmarks, the field cannot train or compare tactile-reactive policies systematically.
T-Rex addresses all three — dataset, architecture, and evaluation — in one integrated framework.
The Solution: T-Rex — Variable-Rate Touch Meets Mixture-of-Transformers
The core insight of T-Rex mirrors what Sharpa has long argued in its CraftNet VTLA stack: different modalities run at different frequencies, and contact-phase control demands a dedicated fast loop.

T-Rex implements this idea inside a single learnable policy through two key components:
- A 100-hour tactile-rich dataset, built from motor primitives
Rather than collecting only full task demonstrations, the team uses a data-efficient recipe centered on elementary motor primitives — reusable contact patterns that compose into complex skills. The result is ~100 hours of diverse, touch-dense interaction data: enough to learn reactive contact behavior without the cost of purely task-specific teleoperation at scale.
- Variable-rate Mixture-of-Transformer (MoT) + temporal tactile VQ-VAE
T-Rex extends the VLA backbone with a Mixture-of-Transformers (MoT) design that routes different token types — vision, language, action, and tactile — through specialized expert pathways at appropriate rates.
The tactile pathway is powered by a temporal tactile VQ-VAE encoder that compresses high-frequency touch streams into discrete tokens the transformer can consume — preserving dynamic contact cues (slip, pressure change, edge detection) that static encoders discard.
This is the architectural answer to "π0.5 + tactile fails": touch needs its own temporal representation and expert capacity, not a bolt-on feature.
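To make the tokenization idea concrete, here is a minimal sketch of the vector-quantization step alone: a short window of tactile frames is summarized into a feature vector and snapped to its nearest codebook entry, which becomes the discrete token. Everything in it is an assumption for illustration: the two hand-picked features (mean pressure and mean frame-to-frame change), the three-entry codebook, and the function names. The paper's encoder is learned and is not reproduced here.

```python
from math import dist  # Euclidean distance, available since Python 3.8

# Hypothetical codebook over (mean pressure, mean pressure change per frame).
# A real VQ-VAE learns its codebook; these values are made up for illustration.
CODEBOOK = [
    (0.0, 0.0),   # token 0: no contact
    (0.5, 0.0),   # token 1: steady contact
    (0.5, -0.2),  # token 2: pressure dropping (possible slip)
]

def window_features(frames):
    """Summarize a window of tactile frames (each a list of taxel pressures)."""
    means = [sum(f) / len(f) for f in frames]
    deltas = [b - a for a, b in zip(means, means[1:])]
    mean_pressure = sum(means) / len(means)
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    return (mean_pressure, mean_delta)

def quantize(features, codebook=CODEBOOK):
    """Return the index of the nearest codebook entry (the discrete token)."""
    return min(range(len(codebook)), key=lambda i: dist(features, codebook[i]))

def tokenize(stream, window=4):
    """Split a tactile stream into non-overlapping windows and quantize each."""
    return [
        quantize(window_features(stream[i:i + window]))
        for i in range(0, len(stream) - window + 1, window)
    ]
```

In this toy setup, a window of steady 0.5 pressure and a window falling from 0.8 to 0.2 have the same mean pressure, yet they map to different tokens because the feature vector includes change over time. That is the kind of distinction a single-frame encoder cannot make.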
How It Works: Three Layers of Reactive Dexterity
Think of T-Rex as closing the loop between slow intent and fast contact:
| Layer | Role | Frequency analogy |
| --- | --- | --- |
| Vision + Language | Task understanding, coarse approach | ~1–10 Hz planning |
| Action generation | End-effector and hand trajectories | ~10–30 Hz control |
| Tactile VQ-VAE + MoT expert | Contact-phase corrections: grip, slip, alignment | High-frequency touch reactivity |
At contact, the tactile expert does not merely flag "touch detected." It encodes how contact is changing — enabling the policy to tighten a grasp on an egg, modulate wipe pressure, or search for a card edge by feel when vision is occluded.
This parallels Sharpa CraftNet's System 0 (Interaction Brain) at ~100 Hz — but T-Rex demonstrates the principle inside a unified VLA-style MoT backbone trained end-to-end on real tactile data.
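The layered picture above can be read as several loops driven by one shared clock. The sketch below is a minimal illustration, assuming rates of 5 Hz for vision and language, 20 Hz for action, and 100 Hz for touch. The table gives only approximate ranges, and the 100 Hz figure is borrowed from the CraftNet comparison rather than stated for T-Rex, so treat all three numbers and all names as hypothetical.

```python
# Illustrative rates only: the table gives ranges, and the tactile rate is assumed.
BASE_HZ = 100  # tick rate of the fastest (tactile) loop
RATES_HZ = {"vision_language": 5, "action": 20, "tactile": 100}

def schedule(duration_s):
    """Count how often each pathway runs when driven by one shared tick clock."""
    counts = {name: 0 for name in RATES_HZ}
    for tick in range(int(duration_s * BASE_HZ)):
        for name, hz in RATES_HZ.items():
            # Ticks between updates; assumes each rate divides BASE_HZ evenly.
            period = BASE_HZ // hz
            if tick % period == 0:
                counts[name] += 1
    return counts
```

Calling `schedule(1.0)` counts 5 vision-language updates, 20 action updates, and 100 tactile updates in one simulated second. Under these assumed rates, the tactile pathway gets 20 chances to react between consecutive visual updates.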
Putting It to the Test: 12 Tasks That Punish Touch Ignorance
Researchers validated T-Rex on 12 real-world tactile-reactive manipulation tasks on a bimanual dexterous platform — the same class of setup used in prior NVIDIA GEAR work (e.g., EgoScale) with 22-DoF Sharpa Wave hands and high-resolution fingertip tactile sensing.
Tasks span force-sensitive contact, deformable manipulation, and bimanual coordination:
| Task | Why touch matters |
| --- | --- |
| Flip Page | Delicate friction; page bend without tear |
| Transfer Egg | Compliance + slip detection |
| Wipe Plate | Continuous force modulation |
| Apply Toothpaste | Squeeze compliance |
| Split Cup | Bimanual force coordination |
| Sort Mahjong | In-hand reorientation by feel |
| Open Lock | Key alignment without vision |
| Refill Tablet | Precision insertion |
| Acid-Base Neutralization | Liquid handling |
| Extract Card | Edge detection, minimal force |
| Deal Poker | Sub-millimeter card separation |
| Screw Lightbulb | Thread engagement + rotation |
Several of these — card dealing, light-bulb screwing, delicate extraction — overlap directly with benchmarks in SaTA (Spatially-anchored Tactile Awareness), which also ran on Sharpa Wave hardware. T-Rex pushes the frontier from spatially grounded touch to temporally reactive touch at policy scale.
The Results: +30 Points Over the Strongest Baseline
Across 12 tasks (16 rollouts each), T-Rex achieves 65% average success rate:
| Method | Avg Success (%) |
| --- | --- |
| ViTacFormer | 3 |
| RDP | 6 |
| π0.5 + tactile | 6 |
| Tactile-VLA | 15 |
| π0.5 | 17 |
| EgoScale (prior SOTA) | 35 |
| T-Rex | 65 |
Standout per-task success rates include Flip Page (96%), Transfer Egg (75%), Split Cup (78%), and Extract Card (70%). These are tasks where vision-only or weakly tactile baselines remain in single digits.
Critically, T-Rex beats EgoScale — the strongest prior method built on 20k+ hours of human video pretraining on Sharpa Wave's 22-DoF action space — by 30 absolute points on average. Scale from human video alone does not substitute for reactive touch at contact time. Both matter; T-Rex shows what becomes possible when architecture and tactile data catch up to visual pretraining.
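The headline comparisons follow directly from the results table. The short script below recomputes them from the reported averages; the dictionary and function names are hypothetical, and the numbers are copied from the table rather than measured independently.

```python
# Average success rates (%) copied from the results table.
# "pi0.5" is written in ASCII here and refers to the model named π0.5 in the text.
RESULTS = {
    "ViTacFormer": 3,
    "RDP": 6,
    "pi0.5 + tactile": 6,
    "Tactile-VLA": 15,
    "pi0.5": 17,
    "EgoScale": 35,
    "T-Rex": 65,
}

TASKS = 12
ROLLOUTS_PER_TASK = 16

def gain_over_best_baseline(results, method="T-Rex"):
    """Return the strongest other method and the gap to it in percentage points."""
    baselines = {name: rate for name, rate in results.items() if name != method}
    best = max(baselines, key=baselines.get)
    return best, results[method] - baselines[best]

def naive_tactile_delta(results):
    """Change in average success when tactile input is simply added to pi0.5."""
    return results["pi0.5 + tactile"] - results["pi0.5"]

def total_rollouts():
    """Number of real-world rollouts behind each method's average."""
    return TASKS * ROLLOUTS_PER_TASK
```

Here `gain_over_best_baseline(RESULTS)` identifies EgoScale as the strongest baseline with a 30-point gap, `naive_tactile_delta(RESULTS)` gives the 11-point drop from adding touch naively, and `total_rollouts()` gives 192 rollouts per method.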
Why This Matters for Sharpa Wave
T-Rex is not an abstract algorithm demo. It is evidence that anthropomorphic hardware and tactile-reactive software are interdependent:
1:1 human-scale dexterity (22 DoF)
Sharpa Wave matches human hand size, kinematics, and workspace — the same 22-DoF joint space used in EgoScale pretraining and T-Rex evaluation. Policies trained on human-retargeted motion and tested on Wave preserve pinch semantics, coordinated finger timing, and in-hand rotation structure that lower-DoF end effectors cannot express.
High-resolution Dynamic Tactile Array (DTA)
Each Sharpa Wave fingertip carries >1,000 tactile pixels with high pressure sensitivity — the sensory resolution T-Rex's VQ-VAE encoder is designed to exploit. Without this resolution, "reactive" touch collapses back into coarse contact detection.
High-frequency physical interaction
Wave combines >4 Hz gesture speed with >20 N fingertip force — the mechanical envelope needed for tasks like card fanning, egg transfer, and thread engagement. T-Rex's 100-hour primitive dataset and MoT tactile expert assume a hand that can both feel and act at human-relevant rates.
Sim-to-real consistency
Sharpa Wave ships with Isaac Sim URDF/USD assets and tactile parameters integrated into the NVIDIA Isaac GR00T Reference Humanoid workflow, enabling teams to train tactile policies in simulation and deploy with a minimal sim-to-real gap. This is the same development path T-Rex's ecosystem builds on.
Bimanual coordination
Half of T-Rex's task suite requires two hands working together — split cup, mahjong sorting, egg transfer. Wave on dual-arm platforms (RealMan + Sharpa, Galaxea R1 Pro + Sharpa, Unitree H2 + Sharpa) is the hardware baseline for this class of research.
From T-Rex to Product: The "Last Millimeter" Becomes Trainable
Sharpa's mission — "We manufacture time by making robots useful" — hinges on solving contact, not just motion. T-Rex independently validates the same thesis Sharpa articulated in CraftNet:
"90% of the effort is in the last millimeter when interacting with objects."
T-Rex shows that the last millimeter is now learnable at scale, given the right hand, the right sensors, the right data recipe, and an architecture that treats touch as a first-class, high-frequency modality rather than an afterthought.
For researchers and OEMs building on Sharpa Wave, T-Rex offers a concrete roadmap: pair anthropomorphic 22-DoF hardware with tactile-reactive policy architectures, collect primitive-rich touch data efficiently, and evaluate on standardized contact-heavy benchmarks — not just pick-and-place, but the tasks that actually require human-level dexterity.
Read the original paper: tactile-rex.github.io
Related Sharpa Research: