Theia: A Robot Vision Foundation Model that Simultaneously Distills Off-the-Shelf VFMs such as CLIP, DINOv2, and ViT

Visual understanding is the process of abstracting high-dimensional visual signals such as images and videos. It involves many problems, ranging from depth prediction and vision-language correspondence to classification and object grounding. These tasks are defined along spatial and temporal axes and range from coarse to fine granularity. In light of this variety, the vision community has long built models suited to a single visual comprehension task or a small number of them. Vision foundation models (VFMs) are a group of models that have recently attained outstanding generalizability to unseen domains and new tasks.

Learning action policies from visual inputs, as in vision-based robot policy learning, requires robust and varied visual perception. Although there is no universal model for vision tasks, these policies involve numerous implicit vision tasks, including object identification and semantic grounding, for which off-the-shelf VFMs are suitable. Yet research shows that generic VFMs like CLIP typically fall short of visual representation models developed specifically for robot learning. This points to a gap between what robots need to learn and what any single VFM can visually perceive. Prior work on learning foundational visual representation models for robots has focused primarily on improving training data and defining objective functions. Little emphasis has been placed on improving the ability to perform various implicit visual comprehension tasks.

Researchers from The AI Institute and Stony Brook University propose consolidating multiple large VFMs into a single, more compact model for robot learning. They do this through knowledge distillation, which improves visual representations for robot learning, a task that VFMs are not typically trained for. Knowledge distillation transfers knowledge from a large, complex model (the ‘teacher’) to a smaller, simpler model (the ‘student’) by training the student to mimic the output of the teacher. Unlike the common practice of distilling from a larger to a smaller model on the same task, the researchers distill several VFMs that were each built for different vision tasks.
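To make the multi-teacher idea concrete, here is a minimal sketch in plain Python. It assumes one common arrangement: a single shared student feature is passed through a small per-teacher linear head, and the per-teacher losses are summed. The teacher names, the weights, and the use of mean squared error are illustrative assumptions, not details taken from the paper.

```python
def linear(x, weight):
    """Apply a weight matrix (one row per output dimension) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_teacher_loss(student_feature, heads, teacher_outputs):
    """Sum of per-teacher losses for one shared student feature.

    heads maps a (hypothetical) teacher name to the weight matrix that
    projects the student feature into that teacher's output space.
    """
    per_teacher = {}
    for name, weight in heads.items():
        pred = linear(student_feature, weight)
        per_teacher[name] = mse(pred, teacher_outputs[name])
    return sum(per_teacher.values()), per_teacher


# Toy example: a 2-dimensional student feature and two teachers whose
# outputs have different sizes (3 and 1).
student_feature = [1.0, 2.0]
heads = {
    "teacher_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 2 -> 3
    "teacher_b": [[0.5, 0.5]],                           # 2 -> 1
}
teacher_outputs = {"teacher_a": [1.0, 2.0, 3.0], "teacher_b": [2.5]}

total, per_teacher = multi_teacher_loss(student_feature, heads, teacher_outputs)
print(total, per_teacher)
```

In a real training loop the student and the heads would be neural networks updated by gradient descent on this summed loss; the sketch only shows how one student can be scored against several teachers at once.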

Their study presents Theia, a robot vision foundation model that simultaneously distills off-the-shelf VFMs such as CLIP, DINOv2, and ViT. Because it captures many visual sub-problems at different spatial levels, Theia produces detailed representations for use in downstream robot learning. According to the researchers, Theia provides better pre-trained visual representations for downstream robot learning, at lower computing cost, than off-the-shelf VFMs and previous work.

Theia is also efficient to train. Previous approaches required significantly more computation, whereas Theia needs only ImageNet and approximately 150 GPU hours. The researchers identify model size, the use of spatial tokens, and the entropy of representation norms as critical performance factors for robot learning. These findings pave the way for future studies aimed at improving robot learning using visual representations.

The proposed model comprises an underlying visual encoder and a set of feature translators. The Theia representation is a collection of encoded tokens, one per input image patch. These tokens, which can be thought of as ‘building blocks’ of the visual representation, capture spatial information in the image. The team opted for spatial tokens because the robust per-patch features in DINOv2 suggest that spatially dense representations form the basis for diverse visual understanding. Accordingly, the researchers distill all spatial tokens and set the [CLS] token aside before distillation. They begin with a normalization step so that the different scales of the teacher representations are handled appropriately: the mean and variance are computed from all ImageNet training examples, and the teacher representations are then normalized over each latent dimension.
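The normalization step can be illustrated with a short sketch. It assumes standard per-dimension standardization (subtract the mean, divide by the standard deviation), with statistics gathered once over a reference set of teacher tokens. The function names and the tiny dataset are hypothetical, and a real pipeline would use a tensor library rather than Python lists.

```python
import math


def fit_dim_stats(features):
    """Per-dimension mean and standard deviation over a set of token vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    std = [math.sqrt(v) for v in var]
    return mean, std


def normalize(features, mean, std, eps=1e-6):
    """Standardize each latent dimension using precomputed statistics."""
    return [
        [(x - m) / (s + eps) for x, m, s in zip(f, mean, std)]
        for f in features
    ]


# Two latent dimensions on very different scales, as can happen when
# teachers (or dimensions within one teacher) have different ranges.
teacher_tokens = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
mean, std = fit_dim_stats(teacher_tokens)
normed = normalize(teacher_tokens, mean, std)
print(mean, normed)
```

After normalization both dimensions have roughly zero mean and unit variance, so no single dimension or teacher dominates the distillation loss simply because of its scale.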

During training, the feature translators’ outputs are pushed to match the teacher VFM representations. The predicted and ground-truth teacher representations of the same image are compared using cosine and smooth-L1 losses, and the final loss is a weighted average of the two.
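A minimal sketch of such a combined loss follows. The cosine term is written as one minus cosine similarity and the smooth-L1 term uses the usual piecewise definition. The equal weighting (w_cos=0.5), the beta value, and averaging over tokens are assumptions made for illustration; the article does not give the weights used.

```python
import math


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity between two vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + eps)


def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones, averaged over dims."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def distill_loss(pred_tokens, teacher_tokens, w_cos=0.5):
    """Weighted average of cosine and smooth-L1 losses, averaged over tokens."""
    per_token = [
        w_cos * cosine_loss(p, t) + (1.0 - w_cos) * smooth_l1_loss(p, t)
        for p, t in zip(pred_tokens, teacher_tokens)
    ]
    return sum(per_token) / len(per_token)


teacher = [[1.0, 0.0], [0.0, 2.0]]
perfect = [[1.0, 0.0], [0.0, 2.0]]
rotated = [[0.0, 1.0], [2.0, 0.0]]
print(distill_loss(perfect, teacher), distill_loss(rotated, teacher))
```

The cosine term is sensitive to the direction of each token and the smooth-L1 term to its magnitude, which is one plausible reason to combine them.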

To assess the quality of pre-trained visual representations, they used the simulation tasks in CortexBench. These tasks combine environments from Habitat (ImageNav, ObjectNav, and MobilePick), Trifinger, and MuJoCo (Adroit, DeepMind Control Suite (DMC), and MetaWorld). Some are imitation learning (IL) tasks, while others, such as ImageNav and MobilePick, are reinforcement learning (RL) tasks. The baselines fall into three groups: visual representations built for robot learning (R3M, VIP, MVP, and VC-1), agglomerative models for vision tasks (RADIO and E-RADIO), and off-the-shelf vision foundation models (ViT, DINOv2, and CLIP). All pre-trained representations are frozen in these experiments.

The findings of this study demonstrate that consolidating numerous VFMs into a single model significantly improves performance across various robot learning applications. By establishing a strong correlation between the entropy of feature norms and enhanced downstream performance, the researchers answer a key question about what kinds of visual representations lead to better robot learning. This not only validates the effectiveness of Theia but also provides valuable insights for future research on optimizing visual representations for robotics.
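The article does not spell out how the entropy of feature norms is computed. One plausible reading, shown here purely as an assumption, is to take the L2 norm of every spatial token, normalize the norms into a probability distribution, and compute its Shannon entropy. Under this reading, evenly spread norms give high entropy, while a few outlier tokens with very large norms give low entropy.

```python
import math


def feature_norm_entropy(tokens, eps=1e-12):
    """Shannon entropy of the distribution of per-token L2 norms (assumed definition)."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    total = sum(norms) + eps
    probs = [n / total for n in norms]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Four tokens with equal norms versus four tokens with one large outlier.
uniform = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
spiky = [[100.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(feature_norm_entropy(uniform), feature_norm_entropy(spiky))
```

With this definition the uniform case reaches the maximum possible entropy for four tokens, log(4), and the outlier case scores much lower.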
