Integrating visual inputs such as images alongside text and speech into large language models (LLMs) is widely regarded as an important new direction in AI research. Extending these models to handle modes of data beyond language could significantly broaden the range of applications they can be used for, and may also improve their performance on existing NLP tasks.
The promise of multimodal AI spans from more engaging user experiences like conversational agents that can see their surroundings and refer to objects around them, to robots that can fluidly translate commands into physical actions using combined knowledge of language and vision. By uniting historically separate areas of AI around a unified model architecture, multimodality may accelerate progress in tasks relying on multiple skills like visual question answering or image captioning. The synergies between learning algorithms, data types, and model designs across fields could lead to rapid advancement.
Many companies have already embraced multimodality in various forms: OpenAI, Anthropic, and Google (Bard and Gemini) all let you upload your own image or text data and chat with the model about it.
In this article, I hope to demonstrate a straightforward yet powerful application of large language models with computer vision in finance. Equity researchers and investment banking analysts may find this especially useful, as you likely spend considerable time reading reports and statements containing various tables and graphs. Reading lengthy tables and graphs and interpreting them correctly takes a great deal of time, domain knowledge, and sustained focus to avoid mistakes. More tediously, analysts occasionally need to manually re-enter tabular data from PDFs simply to create new charts. An automated solution could alleviate these pains by extracting and interpreting key information without being prone to human oversights or fatigue.
In fact, by combining NLP with computer vision, we can create an assistant to handle many repetitive analytical tasks, freeing analysts to focus on higher-level…