In 3D reconstruction and generation, pursuing techniques that balance visual richness with computational efficiency is paramount. Even effective methods such as Gaussian Splatting have significant limitations in handling high-frequency signals and sharp edges, owing to their inherent low-pass characteristics. This limitation degrades the quality of rendered scenes and imposes a substantial memory footprint,…
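The low-pass claim above can be illustrated with a short, self-contained sketch (not code from the paper): the Fourier transform of a Gaussian kernel is itself Gaussian, so filtering with one attenuates high spatial frequencies far more strongly than low ones.

```python
import numpy as np

# Illustrative only: a 1D Gaussian kernel and its frequency response.
# A Gaussian's spectrum is another Gaussian, so gain falls off rapidly
# with frequency -- the "low-pass" behavior discussed above.
n = 256
x = np.arange(n) - n // 2
sigma = 2.0
kernel = np.exp(-x**2 / (2 * sigma**2))
kernel /= kernel.sum()  # normalize so the DC gain is exactly 1

# Magnitude of the kernel's frequency response (ifftshift centers it at index 0).
response = np.abs(np.fft.fft(np.fft.ifftshift(kernel)))

low_freq_gain = response[1]        # gain at the lowest nonzero frequency
high_freq_gain = response[n // 2]  # gain at the Nyquist frequency

print(f"low-frequency gain:  {low_freq_gain:.4f}")
print(f"high-frequency gain: {high_freq_gain:.2e}")
```

With sigma = 2, the low-frequency gain stays near 1 while the Nyquist-frequency gain is vanishingly small, which is exactly why a splat made of Gaussians smears sharp edges.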
The advent of Augmented Reality (AR) and wearable Artificial Intelligence (AI) devices marks a significant advancement in human-computer interaction. With AR and AI devices facilitating data collection, new possibilities emerge for highly contextualized and personalized AI assistants that function as an extension of the wearer's cognitive processes.
Existing multimodal AI assistants, like…
Text-to-image (T2I) and text-to-video (T2V) generative models have made significant strides. While T2I models can control subject identity well, extending this capability to T2V remains challenging. Existing T2V methods lack precise control over generated content, particularly for identity-specific generation in human-centric scenarios. Efforts to leverage T2I advancements for video generation struggle with maintaining…
There has been a recent uptick in the development of general-purpose multimodal AI assistants capable of following visual and textual instructions, thanks to the remarkable success of Large Language Models (LLMs). By leveraging the impressive reasoning capabilities of LLMs and the information found in large alignment corpora (such as image-text pairs), these assistants demonstrate the immense potential…
The intersection of artificial intelligence and creativity has witnessed an exceptional breakthrough in the form of text-to-image (T2I) diffusion models. These models, which convert textual descriptions into visually compelling images, have broadened the horizons of digital art, content creation, and more. Yet the rapidly evolving field of personalized T2I generation still grapples with several core…
Deep learning has revolutionized view synthesis in computer vision, offering diverse approaches such as NeRF and end-to-end style architectures. Traditionally, 3D scenes were modeled explicitly with voxels, point clouds, or meshes. NeRF-based techniques instead represent 3D scenes implicitly using MLPs. Recent advancements focus on image-to-image approaches that generate novel views from collections of scene images. These methods often…
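To make the "implicit MLP representation" concrete, here is a minimal structural sketch (all names and sizes are hypothetical, with untrained random weights): a tiny network maps a 3D position and view direction to a density and color, and a ray is rendered by alpha-compositing samples along it. Real NeRFs add positional encoding and much larger, trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny NeRF-style field: position + view direction -> (density, RGB).
W1 = rng.normal(size=(6, 32)) * 0.5   # input: (x, y, z, dx, dy, dz)
W2 = rng.normal(size=(32, 4)) * 0.5   # output: (density, r, g, b)

def field(position, direction):
    h = np.tanh(np.concatenate([position, direction]) @ W1)
    out = h @ W2
    density = np.log1p(np.exp(out[0]))      # softplus keeps density >= 0
    color = 1.0 / (1.0 + np.exp(-out[1:]))  # sigmoid keeps RGB in [0, 1]
    return density, color

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=32):
    # Classic volume rendering: alpha-composite samples front to back.
    ts = np.linspace(near, far, n_samples)
    delta = ts[1] - ts[0]
    transmittance, rgb = 1.0, np.zeros(3)
    for t in ts:
        density, color = field(origin + t * direction, direction)
        alpha = 1.0 - np.exp(-density * delta)
        rgb += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return rgb

pixel = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]))
print(pixel)  # an RGB estimate in [0, 1]
```

The key point is that the scene lives entirely in the MLP weights: querying the network at continuous 3D coordinates replaces any explicit voxel, point-cloud, or mesh structure.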
MoD-SLAM is a state-of-the-art Simultaneous Localization And Mapping (SLAM) method. Achieving real-time, accurate, and scalable dense mapping is a core challenge in SLAM. To address it, the researchers introduce a novel method targeting unbounded scenes using only RGB images. Existing neural SLAM methods often rely on RGB-D input, which leads…
The task of view synthesis is essential in both computer vision and graphics, enabling scenes to be re-rendered from new viewpoints much as the human eye perceives them. This capability is vital for everyday tasks and fosters creativity by allowing immersive objects with depth and perspective to be envisioned and crafted. Researchers at Dyson Robotics Lab aim…
Integrating natural language understanding with image perception has led to the development of large vision-language models (LVLMs), which showcase remarkable reasoning capabilities. Despite this progress, LVLMs often struggle to accurately ground generated text in visual inputs, producing inaccuracies such as hallucinated scene elements or misinterpreted object attributes and relationships.
Researchers…
The landscape of image segmentation has been profoundly transformed by the introduction of the Segment Anything Model (SAM), a model renowned for its remarkable zero-shot segmentation capability. SAM's deployment across a wide array of applications, from augmented reality to data annotation, underscores its utility. However, SAM is computationally intensive, particularly its image encoder's demand of 2973…