Natural language processing (NLP) has entered a transformational period with the introduction of Large Language Models (LLMs), like the GPT series, setting new performance standards for various linguistic tasks. Autoregressive pretraining, which teaches models to predict the most likely next token in a sequence, is one of the main factors behind this achievement. Because of…
Researchers from Stanford University and FAIR Meta have introduced CHOIS to address the problem of generating synchronized motions of objects and humans within a 3D scene. The system operates on sparse object waypoints, an initial state of the objects and humans, and a textual description. It controls interactions between humans and objects by…
Text-to-image diffusion models represent an intriguing field in artificial intelligence research. They aim to create lifelike images based on textual descriptions utilizing diffusion models. The process involves iteratively generating samples from a basic distribution, gradually transforming them to resemble the target image while considering the text description. Multiple steps are involved, adding progressive noise to…
Researchers from The University of Hong Kong, Alibaba Group, and Ant Group developed LivePhoto to address the neglect of temporal motion in current text-to-video generation studies. LivePhoto enables users to animate images with text descriptions while reducing ambiguity in text-to-motion mapping.
The study addresses limitations in existing image animation methods by presenting…
How can high-quality 3D reconstructions be achieved from a limited number of images? A team of researchers from Columbia University and Google introduced ‘ReconFusion,’ an artificial intelligence method that solves the problem of limited input views when reconstructing 3D scenes from images. It addresses issues such as artifacts and catastrophic failures in reconstruction, providing robustness…
A team of researchers from the University of Wisconsin-Madison, NVIDIA, the University of Michigan, and Stanford University has developed a new vision-language model (VLM) called Dolphins. It is a conversational driving assistant that can process multimodal inputs to provide informed driving instructions. Dolphins is designed to address the complex driving scenarios faced by autonomous vehicles…
How can the effectiveness of vision transformers be leveraged in diffusion-based generative learning? This paper from NVIDIA introduces a novel model called Diffusion Vision Transformers (DiffiT), which combines a hybrid hierarchical architecture with a U-shaped encoder and decoder. This approach has pushed the state of the art in generative models and offers a solution to…
How can Neural Radiance Fields (NeRFs) be improved to handle scale variations and reduce aliasing artifacts in scene reconstruction? A new research paper from CMU and Meta addresses this issue by proposing PyNeRF (Pyramidal Neural Radiance Fields). It improves NeRFs by training model heads at different spatial grid resolutions, which helps…
Recently, there have been significant advancements in video editing, with AI-based editing at the forefront. Numerous novel techniques have emerged, and among them, diffusion-based video editing stands out as a particularly promising field. It leverages pre-trained text-to-image/video diffusion models for tasks such as style change and background swapping. However, the challenging part in…
An essential function of multi-view camera systems is novel view synthesis (NVS), which attempts to generate photorealistic images from new perspectives using source photos. The subfields of human NVS have the potential to significantly contribute to real-time efficiency and consistent 3D appearances in areas such as holographic communication, stage performances, and 3D/4D immersive scene capture…