The Sequence Chat: Salesforce Research's Junnan Li on Multimodal Generative AI
One of the creators of the famous BLIP-2 model shares his insights about the current state of multimodal generative AI.

👤 Quick bio
I'm a research scientist at Salesforce Research focusing on multimodal AI. I did my PhD in computer vision at the National University of Singapore, and I first got into computer vision and machine learning through my undergraduate final-year project.

🛠 ML Work
BLIP-2 is a scalable multimodal pre-training method that enables any large language model (LLM) to ingest and understand images. It unlocks zero-shot image-to-text generation and powers the world's first open-sourced multimodal chatbot prototype. Check out this blog post for more details: https://blog.salesforceairesearch.com/blip-2/ Before BLIP-2, we published BLIP, one of the most popular vision-and-language models and the 18th most-cited AI paper of 2022. BLIP-2 improves significantly over BLIP by effectively leveraging frozen pre-trained image encoders and LLMs.
BLIP-2 achieves zero-shot image-to-text generation by enabling LLMs to understand images, thereby harvesting the zero-shot text generation capability of the LLM. It is challenging for LLMs to understand images due to the domain gap between visual and textual representations. We propose a novel two-stage pre-training strategy to bridge this gap.
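The bridging component in BLIP-2 is a querying transformer (Q-Former) that distills the frozen image encoder's output into a small, fixed set of vectors the LLM can consume as soft prompts. The following is a toy numpy sketch of that single cross-attention idea only, with random weights and illustrative dimensions (ViT-g feature width 1408, Q-Former width 768, an assumed LLM width of 2560); it is not the actual implementation, which is a full BERT-style transformer trained in two stages.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Frozen image encoder output: 257 patch features of width 1408.
image_feats = rng.standard_normal((257, 1408))

# 32 learned query embeddings of width 768 (learned in training; random here).
queries = rng.standard_normal((32, 768))

# Cross-attention projections (likewise learned in practice).
W_q = rng.standard_normal((768, 64))
W_k = rng.standard_normal((1408, 64))
W_v = rng.standard_normal((1408, 768))

# The queries attend over the frozen image features, compressing a
# variable-size set of patches into a fixed-size set of 32 vectors.
attn = softmax((queries @ W_q) @ (image_feats @ W_k).T / np.sqrt(64))
distilled = attn @ (image_feats @ W_v)     # shape (32, 768)

# A linear layer maps the 32 vectors into the LLM's embedding space,
# where they act as soft visual prompts prepended to the text tokens.
W_proj = rng.standard_normal((768, 2560))  # hypothetical LLM width
soft_prompts = distilled @ W_proj          # shape (32, 2560)
print(soft_prompts.shape)
```

Because only the queries and projections are trained while the image encoder and LLM stay frozen, the number of trainable parameters stays small, which is what makes the method scalable across different backbone models.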
GPT-4 is amazing and demonstrates strong image-to-text generation capabilities. There are two key differences between BLIP-2 and GPT-4.
The world is multimodal by nature, so an AI agent that can understand and simulate the world needs to be multimodal. In my opinion, multimodal generative AI will drive the next wave of AI breakthroughs. There are many exciting areas, such as video generation and embodied multimodal AI.

💥 Miscellaneous – a set of rapid-fire questions
Self-supervised/unsupervised learning
I believe that open source is the preferable approach to driving safer, more responsible AI research that can benefit a broader community. However, it requires careful planning before open-sourcing a model to mitigate its potential risks.
Yes!
This question is out of my scope, so I cannot answer it :).