The Sequence Chat: Deyao Zhu and Jun Chen on MiniGPT-4
The researchers behind the open-source GPT-4 alternative share their insights about the state of multimodal AI agents.

"Visually informed language models open up a new route to generalized visual intelligence. By providing these models with visual input, they are already capable of solving tasks such as recognizing cats or dogs and answering questions about a car brand. Moreover, they can be further developed to solve more tasks on-demand, such as interactively generating a painting or designing a new chair to be placed in your living room." Mohamed Elhoseiny, Assistant Professor of Computer Science at KAUST.

👤 Quick bio
Deyao: I am Deyao Zhu from Quanzhou, China, and I am a fourth-year PhD student at King Abdullah University of Science and Technology (KAUST), under the supervision of Prof. Mohamed Elhoseiny. My recent research focuses on AI for decision making and vision-language understanding. Prior to joining KAUST, I obtained my Bachelor's degree in Mechatronics from Tongji University in Shanghai and my Master's degree in Electrical Engineering from Leibniz Universitaet Hannover in Germany. My interest in artificial neural networks began in high school, when I read an article about the Human Brain Project in Scientific American and was drawn to the idea of simulating brains with computers. A few years later, the success of AlphaGo convinced me of the future of deep learning, and I took a research assistant position in human motion prediction at my Master's university.

Jun: I am Jun Chen, originally from Qinhuangdao, China. Currently, I am a fourth-year PhD student at KAUST, under the guidance of Prof. Mohamed Elhoseiny. My research primarily revolves around multimodal learning, with a particular emphasis on vision and language. Prior to joining KAUST, I earned my Bachelor's degree from Xi'an Jiaotong-Liverpool University, situated in the beautiful city of Suzhou. My fascination with machine learning began during my undergraduate studies, when AlphaGo's groundbreaking success deeply impressed me and motivated me to concentrate on machine learning.

🛠 ML Work
When OpenAI released GPT-4, we were shocked by its unbelievable vision-language abilities, such as coding a website from a photo of a website draft and describing a given image in rich detail. These abilities had never been demonstrated by any previous vision-language model. However, OpenAI did not release any technical details about GPT-4's model or training. We were curious about how they built it and how we could reproduce their website-coding demo. At the same time, we were also inspired by our previous project, ChatCaptioner. In ChatCaptioner, we found that one of the best open-source vision-language models, BLIP-2, could see many image details and answer questions about them. However, compared to GPT-4, BLIP-2 lacks many advanced abilities, such as generating code from handwritten text or describing images in great detail. We believe the absence of these abilities is due to the lack of an advanced language model, such as ChatGPT. Therefore, we decided to work on aligning our vision model with an advanced language model, such as Vicuna.
At the beginning of this project, Vicuna had not yet been released, and we experimented with other open-source language models, but none of them performed comparably to ChatGPT. Three days after we started the project, Vicuna was released, and after playing with its demo we found that it behaved similarly to ChatGPT and was the strongest of the available open-source language models, so we adopted it. We believe LLaMA is still the strongest open-source base model and that it significantly boosts the performance of other open-source LLMs, such as Alpaca and Vicuna. Without LLaMA, models as strong as Vicuna could not have been developed as quickly as they were.
The first pre-training stage is a standard procedure for aligning the vision and language modules. In this stage, the model takes an image as input and is trained to generate the ground-truth caption. As it learns to predict captions accurately, it learns to understand the visual content of the image. However, our goal is not just a model that comprehends image contents but a chatbot that can talk about the image fluently. We noticed that after the first pre-training stage, Vicuna's otherwise powerful speaking ability was degraded by the visual inputs: it began to generate incomplete sentences or sentences with heavy repetition, and it only functioned with carefully designed prompts. Therefore, we introduced a second fine-tuning stage to restore Vicuna's speaking ability. Traditional image-caption datasets, such as those used in stage one, typically contain only brief captions, while humans usually prefer a chatbot that provides rich information. So we created a small image-text-pair dataset with long image descriptions generated by the stage-one model itself, using carefully designed prompts and a post-processing step based on ChatGPT and hard-coded rules. Additionally, the stage-two inputs wrap the image in a conversation template rather than presenting the image alone. After fine-tuning our model on this small dataset, which takes only about 7 minutes, Vicuna recovers its speaking ability, and the final MiniGPT-4 system is complete.
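To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the alignment idea, not the authors' implementation: it assumes a HuggingFace-style causal LM interface for Vicuna (`inputs_embeds`, `.logits`, `get_input_embeddings`), and the names `vision_encoder`, `prompt_ids`, `target_ids` and the dimensions are illustrative. Both stages optimize the same next-token loss through a single trainable linear projection between the frozen vision encoder and the frozen LLM; only the data differs (short captions in stage one, templated detailed descriptions in stage two).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGPT4Sketch(nn.Module):
    """Illustrative sketch of vision-LLM alignment; names and shapes are assumptions."""

    def __init__(self, vision_encoder, llm, vision_dim=1408, llm_dim=4096):
        super().__init__()
        # Both the vision encoder (e.g. a BLIP-2-style ViT + Q-Former) and Vicuna stay frozen.
        self.vision_encoder = vision_encoder.eval()
        self.llm = llm.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # The only trainable component: projects visual tokens into the LLM's embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, images, prompt_ids, target_ids):
        # Visual features from the frozen encoder: (batch, num_visual_tokens, vision_dim).
        with torch.no_grad():
            img_feats = self.vision_encoder(images)
        img_embeds = self.proj(img_feats)  # (batch, num_visual_tokens, llm_dim)

        # Stage 1 targets are short ground-truth captions; stage 2 targets are the detailed
        # descriptions, with the image wrapped in a conversation template such as
        # "###Human: <Img><ImageHere></Img> Describe this image in detail. ###Assistant:".
        embed = self.llm.get_input_embeddings()
        inputs = torch.cat([embed(prompt_ids), img_embeds, embed(target_ids)], dim=1)

        # Next-token prediction loss, computed on the target text only.
        logits = self.llm(inputs_embeds=inputs).logits
        num_target = target_ids.size(1)
        tgt_logits = logits[:, -num_target - 1:-1, :]
        return F.cross_entropy(
            tgt_logits.reshape(-1, tgt_logits.size(-1)), target_ids.reshape(-1)
        )
```

In the released MiniGPT-4 code, the projected image tokens are spliced into the prompt at a placeholder rather than simply appended after it; the sketch simplifies that detail, but the objective is the same next-token prediction on the text response.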
I was surprised by the first demo we made, writing a poem about a given photo. Honestly, we expected it could work: Vicuna is able to write a poem, and if it can see the image, it should be able to write a poem about it. Still, we were surprised by how well it worked. Most of the capabilities come from aligning the vision part with Vicuna. However, Vicuna lost its ability to speak smoothly once we added the vision features after the first pre-training stage, which is why we introduced the second fine-tuning stage to recover its speaking ability.
As vision and language serve as crucial inputs to the human brain, with language also acting as a primary tool for thinking, the potential of a multimodal AI system to comprehend a wide range of tasks, environments, and situations becomes evident. Consequently, such a system has the capacity to automate many human jobs closely tied to these abilities. However, it is important to acknowledge that current multimodal foundation models have several limitations: they exhibit significant hallucination issues, struggle with object counting, and have difficulty understanding spatial information. Nevertheless, we believe that future developments will lead to a single, all-encompassing AI model capable of understanding a vast array of modalities and domains.

💥 Miscellaneous – a set of rapid-fire questions
Deyao: For me, it is decision making. I still think decision making is the most important next step of AI research. We now have many AI systems that understand language and vision well; it is time to use them to do real jobs for humans, and that is a decision-making problem. AutoGPT, for example, is a good first step.

Jun: Ensuring safe and robust data alignment is of utmost importance to me. While generative AI has shown remarkable success, achieving a truly safe AI requires training on high-quality aligned data. By leveraging safely aligned data, we can establish a more secure AI environment that promotes safety at its core.
MiniGPT-4 is an attempt to unveil the secret of GPT-4's vision-language ability. Although we are able to reproduce many of GPT-4's vision-language demos, MiniGPT-4's abilities are still weaker than GPT-4's. For example, GPT-4 can read small and long text in an image, whereas MiniGPT-4 can only read short text in large print. Besides, GPT-4's pure language ability is much stronger than that of both ChatGPT and Vicuna; MiniGPT-4 does not focus on improving language ability and directly uses a frozen Vicuna as its language component, so its language ability is also weaker than GPT-4's. Lastly, although we reproduce GPT-4's vision-language demos, our work is built on open-source models, and we still don't know how GPT-4 is implemented. So, I expect there is a difference between the method used to train GPT-4 and ours.
From what we have learned from OpenAI's talks, I think we are close to consuming all of the internet's data, and we will hit a scaling limit when we cannot build a larger dataset. However, since humans learn by interacting with the environment rather than from a fixed dataset, I think the next promising step for further fine-tuning LLMs could be developing algorithms that let an LLM collect data from its environment by itself and learn from it.
We think the next step for multimodal foundation models is video understanding. We now have good models for understanding images, but it is still unclear how to build LLMs that understand videos well. Solving this problem is important and can enable many new applications.