Touch begins where vision ends: Generalizable policies for contact-rich manipulation

¹New York University Shanghai, ²New York University, ³Honda Research
† Corresponding author: raunaqbhirangi@nyu.edu
ViTaL teaser image

VisuoTactile Local (ViTaL) policies are an effective framework for precise, contact-rich manipulation. They combine visuotactile learning, visual augmentation, and residual RL to produce robust, generalizable policies.

Abstract

Data-driven approaches struggle with precise manipulation: imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. Local policies trained once in a canonical setting can therefore generalize via a localize-then-execute strategy. ViTaL achieves ~90% success on contact-rich tasks in unseen environments and is robust to distractors. Its effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills.

ViTaL Policy Learning

ViTaL Figure 2
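
To make the two ideas from the abstract concrete, here is a minimal sketch of a localize-then-execute loop together with a residual action update. It is an illustration under stated assumptions, not the authors' implementation: the function names (`residual_action`, `localize_then_execute`), the callbacks, and the `scale` and `limit` parameters are all hypothetical, and the additive, clipped form of the residual is a common convention in residual RL rather than something this page specifies.

```python
from typing import Callable, List, Sequence


def residual_action(
    base: Sequence[float],
    residual: Sequence[float],
    scale: float = 0.1,   # assumed weight on the learned correction
    limit: float = 1.0,   # assumed bound on each raw residual component
) -> List[float]:
    """Return base + scale * clip(residual, -limit, limit), element-wise.

    `base` stands in for the action of a behavior-cloned policy and
    `residual` for the correction proposed by an RL policy.
    """
    if len(base) != len(residual):
        raise ValueError("base and residual must have the same length")
    combined = []
    for b, r in zip(base, residual):
        clipped = max(-limit, min(limit, r))
        combined.append(b + scale * clipped)
    return combined


def localize_then_execute(
    localize: Callable[[], object],      # hypothetical: VLM-based target localization
    reach: Callable[[object], None],     # hypothetical: move the arm near the target
    local_step: Callable[[], bool],      # hypothetical: one local-policy step, True when done
    max_steps: int = 100,
) -> bool:
    """Run the reaching phase once, then step the local policy until success."""
    target = localize()
    reach(target)
    for _ in range(max_steps):
        if local_step():
            return True
    return False
```

In this sketch the local policy never sees the scene-level target directly; it only runs after the reaching phase has brought the robot close to the object, which mirrors the scene-agnostic role the abstract assigns to the local policy.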

Policy Learning for 4 Precise Tasks

The following videos show learned ViTaL policy rollouts executed on the robot at 1x speed.

Plug in Socket

USB Insertion

Card Swiping

Key in Lock

ViTaL deployment with Reaching and Manipulation

The following videos show rollouts executed on the robot that combine reaching and precise manipulation under spatial and environmental variations.

Plug in Socket

USB Insertion

Card Swiping

Key in Lock

ViTaL deployment with Input Modalities

The following videos show rollouts with the fisheye view (bottom-left) and tactile readings (top-right) displayed simultaneously.

Plug in Socket

USB Insertion

Card Swiping

Key in Lock

ViTaL deployment with Perturbation

The following videos show rollouts of the robot performing precise manipulation under human perturbations.

ViTaL New Robot deployment

The following videos show rollouts of precise manipulation with policies trained on a different robot and in a different environment setting.

Experimental Results

For each task, we run 10 evaluations per seed across 3 seeds, on held-out, unseen target object positions.
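
As a rough illustration of how such a protocol can be summarized, the sketch below aggregates per-trial outcomes into a per-seed success rate and then reports the mean and spread across seeds. The function name `summarize_success` and the choice of population standard deviation are assumptions; the page does not state how the reported numbers were aggregated, and the outcomes in the usage example are placeholders, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def summarize_success(trials_by_seed: Dict[int, List[bool]]) -> Tuple[float, float]:
    """Return (mean, population std) of per-seed success rates.

    Each value in `trials_by_seed` is the list of trial outcomes for one
    seed, e.g. 10 booleans for 10 evaluations.
    """
    if not trials_by_seed:
        raise ValueError("need at least one seed")
    rates = []
    for seed, outcomes in trials_by_seed.items():
        if not outcomes:
            raise ValueError(f"seed {seed} has no trials")
        rates.append(sum(outcomes) / len(outcomes))
    return mean(rates), pstdev(rates)


# Placeholder outcomes for 3 seeds x 10 trials (illustrative only).
placeholder = {
    0: [True] * 9 + [False],
    1: [True] * 8 + [False] * 2,
    2: [True] * 10,
}
avg, spread = summarize_success(placeholder)
print(f"success rate: {avg:.2f} +/- {spread:.2f}")
```

With 3 seeds and 10 evaluations each, every task is scored on 30 trials in total, so a single failed trial shifts the overall success rate by roughly 3 percentage points.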

Policy Performance for in-domain experiments

Scene and Spatial Generalization of ViTaL

Ablations and Design Choices

VLM Navigation for Spatial Generalization Results
