
This AI Paper from China Unveils ‘Vary-toy’: A Groundbreaking Compact Large Vision Language Model for Standard GPUs with Advanced Vision Vocabulary


In the past year, large vision language models (LVLMs) have become a prominent focus of artificial intelligence research. Given suitable prompts, these models show promising performance across a wide range of downstream tasks. However, there is still significant room for improvement in LVLMs' image perception capabilities.

Enhanced perceptual abilities for visual concepts are crucial for advancing model development and implementation. Two main challenges hinder this progress: deficiencies in current vision vocabulary networks and the high computational cost of optimizing numerous parameters.

Popular LVLMs excel at tasks at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), such as image captioning, Visual Question Answering (VQA), meme understanding, and scene OCR, largely thanks to powerful vision vocabulary networks such as CLIP. These LVLMs typically adopt one of two main structures: image tokens prepended as a prefix to the text sequence, or cross-attention for feature fusion. Regardless of the architecture, however, the model's upper limit may be constrained by how efficiently its vision vocabulary network encodes visual signals.
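To make the prefix design concrete, here is a minimal PyTorch sketch of the "image tokens as prefix" structure. It is illustrative only: the dummy patch embedding, the projection layer, the toy transformer, and all dimensions are assumptions for exposition, not the architecture of Vary, Vary-toy, or any specific LVLM.

```python
# Minimal sketch of the "image tokens as prefix" LVLM structure (illustrative only).
import torch
import torch.nn as nn

class PrefixLVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # Stand-in for a vision vocabulary network such as CLIP's image encoder:
        # a 224x224 image becomes a 14x14 grid of visual tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.proj = nn.Linear(d_model, d_model)            # map visual features into the LM space
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # toy stand-in for the language model
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        # Encode the image into a sequence of visual tokens.
        v = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 196, d_model)
        v = self.proj(v)
        t = self.tok_embed(text_ids)                             # (B, seq_len, d_model)
        # Prepend the visual tokens as a prefix to the text tokens.
        x = torch.cat([v, t], dim=1)
        return self.head(self.lm(x))

model = PrefixLVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 212, 32000]): 196 image tokens + 16 text tokens
```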

To address this, researchers previously proposed Vary, a straightforward and effective method to scale up the vision vocabulary of LVLMs: a new visual vocabulary network is trained with a small auto-regressive model such as OPT-125M and then merged with the existing vocabulary to build the final LVLM. However, Vary has drawbacks, including wasted network capacity and the high iteration cost of Vary-base, which uses a 7B LLM.
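As a rough illustration of the merging step, the sketch below concatenates per-patch features from a frozen stand-in for CLIP with a frozen stand-in for the newly trained vocabulary, then projects them into the language model's embedding space. The concat-based fusion, module names, and dimensions are assumptions chosen for clarity, not Vary's exact recipe.

```python
# Hedged sketch of merging a new vision vocabulary with the original CLIP vocabulary.
import torch
import torch.nn as nn

class MergedVisionVocabulary(nn.Module):
    def __init__(self, clip_dim=1024, new_dim=1024, lm_dim=2048):
        super().__init__()
        # Both vocabularies are kept frozen once trained; only the projector
        # adapts their features to the language model's embedding space.
        self.clip_encoder = nn.Conv2d(3, clip_dim, kernel_size=14, stride=14)  # stand-in for CLIP-ViT
        self.new_encoder = nn.Conv2d(3, new_dim, kernel_size=14, stride=14)    # stand-in for the new vocabulary
        for p in list(self.clip_encoder.parameters()) + list(self.new_encoder.parameters()):
            p.requires_grad = False
        self.projector = nn.Linear(clip_dim + new_dim, lm_dim)

    def forward(self, image):
        clip_tokens = self.clip_encoder(image).flatten(2).transpose(1, 2)  # (B, 256, clip_dim)
        new_tokens = self.new_encoder(image).flatten(2).transpose(1, 2)    # (B, 256, new_dim)
        fused = torch.cat([clip_tokens, new_tokens], dim=-1)  # fuse features per patch position
        return self.projector(fused)                          # visual tokens handed to the LLM

tokens = MergedVisionVocabulary()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 2048]) for a 16x16 patch grid
```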

In response, researchers at MEGVII Technology introduced Vary-toy, a smaller version aimed at mitigating these issues. Vary-toy follows the same pipeline as Vary but optimizes the vision vocabulary creation process. Instead of treating natural images as negative samples, they incorporate object detection tasks into the vocabulary network, combining dense textual data (PDF) and natural object location data. This approach enhances Vary-toy’s universality. After creating and reinforcing the vocabulary, they merge it with CLIP and integrate it into a 1.8B language model.
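One way to picture this mixed supervision is as a data-formatting step that turns both dense PDF text and object-location annotations into text targets for the small autoregressive decoder. The sketch below is only a guess at such a formatter: the `<box>` tag, the coordinate normalization, and the sampling helper are illustrative assumptions, not Vary-toy's actual prompt format.

```python
# Hedged sketch of mixing dense-text (PDF) and object-location supervision as text targets.
import random

def detection_to_text(boxes, labels, width, height):
    """Serialize object-detection annotations into a text target for the decoder."""
    parts = []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        # Normalize coordinates to [0, 1000) so they fit a small, fixed numeric vocabulary.
        coords = [int(1000 * v) for v in (x1 / width, y1 / height, x2 / width, y2 / height)]
        parts.append(f"{label}<box>({coords[0]},{coords[1]}),({coords[2]},{coords[3]})</box>")
    return " ".join(parts)

def sample_training_target(pdf_texts, det_samples, p_detection=0.5):
    """Mix dense-text and object-location targets when building a training batch."""
    if random.random() < p_detection:
        boxes, labels, w, h = random.choice(det_samples)
        return detection_to_text(boxes, labels, w, h)
    return random.choice(pdf_texts)

# Example usage with toy annotations.
det = [([(48, 32, 220, 180)], ["dog"], 640, 480)]
pdfs = ["Figure 3 shows the ablation results discussed in Section 4.2 ..."]
print(sample_training_target(pdfs, det, p_detection=1.0))  # dog<box>(75,66),(343,375)</box>
```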

Experimental results on challenging benchmarks such as DocVQA, ChartQA, MMVet, and RefCOCO demonstrate Vary-toy's capabilities and its potential as a smaller yet powerful LVLM.

Vary-toy reaches 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. Its compact size makes it accessible to researchers with limited resources and a practical baseline for further exploration and improvement in LVLM research. The researchers plan to release the code publicly to encourage adoption within the research community.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



Arshad is an intern at MarktechPost. He is currently pursuing an Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advances, and he is passionate about understanding nature with the help of tools such as mathematical models, ML models, and AI.





