Import AI 306: Language models learn about the world via MuJoCo; Amazon releases a big Q&A dataset; and DeepMind tests out multimodal systems

In the same way dogs and whales are alien intelligences with respect to humans, how 'alien' might AI seem to us?

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

Amazon releases a Q&A dataset called Mintaka… and baselines show it is difficult!

…20,000 Q&A pairs, translated into eight languages…

Researchers with Amazon have released Mintaka, a dataset of 20,000 question-answer pairs written in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Including the translated versions, the total dataset consists of 180,000 samples. Existing models get 38% on the dataset when tested in English and 31% multilingually.

Different types of questions and different types of complexity: Mintaka questions are spread across eight categories (movies, music, sports, books, geography, politics, video games, and history). 

   The questions have nine types of complexity:
  • Counting something.
  • Comparing things.
  • Figuring out who was best or worst at something.
  • Working out the ordering of something.
  • Multi-hop questions that require two or more steps.
  • Intersectional questions, where the answer must fulfill multiple conditions.
  • Questions involving negatives.
  • Yes/no questions.
  • Worker-defined 'generic' questions.

How hard is Mintaka? In tests, a good baseline model (a T5 language model fine-tuned as a Q&A model) got 38% on English, and 31% averaged across the other languages. "Overall, the baselines show that Mintaka is a challenging dataset," the authors write. "None of our baselines explicitly handle all of the complexity types available in Mintaka."
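
For concreteness, here's a minimal sketch of what that kind of baseline evaluation loop looks like. The checkpoint name, the exact-match metric, and the sample question are illustrative assumptions, not the paper's actual setup (the paper fine-tunes T5 on Mintaka itself):

```python
# Minimal sketch of a Mintaka-style Q&A evaluation loop.
# Assumption: "t5-base" is a stand-in checkpoint; the paper fine-tunes
# T5 on Mintaka and reports its own metrics.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def answer(question: str) -> str:
    """Generate a free-text answer for one question."""
    inputs = tokenizer("question: " + question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def exact_match_accuracy(dataset: list) -> float:
    """Fraction of questions where the normalized prediction equals the gold answer."""
    hits = sum(answer(ex["question"]).strip().lower() ==
               ex["answer"].strip().lower() for ex in dataset)
    return hits / len(dataset)

# One illustrative 'comparative' example in the question/answer format.
sample = [{"question": "Which mountain is taller, Mount Everest or K2?",
           "answer": "Mount Everest"}]
print(exact_match_accuracy(sample))
```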

Why this matters: Hard benchmarks are one of the things that tend to drive progress (and serve as useful indicators of research advances). It'll be especially interesting to see how Mintaka gets used to evaluate language models paired with retrieval systems. 

   Prediction: I predict we get a one-shot model that performs at an average of 90%+ on this dataset by December 2023.

   Read more: Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering (arXiv).

   Get the dataset: Mintaka (Amazon Research, GitHub).


####################################################

Your LLM barely understands the physical world; supercharge it by attaching it to MuJoCo:

…Training language models to use tools means they can have world knowledge…

Google researchers have found a way to make language models way better at reasoning about the physical world: wire them up so they can port questions into a physics simulator, then use the simulator's results to answer the question. 

   This technique, which they call 'Mind's Eye', works amazingly well, and they robustly show this across both GPT-3 and PaLM language models: 

How they test for reasoning: To evaluate physical reasoning, the researchers built UTOPIA, a dataset containing 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions). The UTOPIA dataset comes in the form of natural language questions and answers. "UTOPIA deliberately describes the questions in relative relations (e.g., greater than) instead of absolute numbers (e.g., 3.5 m/s), to approximate human’s perceptional sensing ability in real world."

How Mind's Eye works: The language model passes the question to a decoder-only text-to-code language model, trained on 200,000 text-code pairs in the style of UTOPIA questions. The generated code then goes into MuJoCo, which executes it; software parses the outcome from MuJoCo back into text, which then goes into the prompt window of the language model. 
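
Here's a minimal sketch of that loop in Python. Every function here is a hypothetical toy stand-in (the real system uses a trained text-to-code model, actual MuJoCo execution, and a real LM API), but the data flow matches the paper's description:

```python
# Toy sketch of the Mind's Eye pipeline: question -> simulation code ->
# simulator outcome -> text rationale -> LM prompt. All four helpers are
# hypothetical stand-ins for the paper's real components.

def text_to_code(question: str) -> str:
    # Stand-in for the decoder-only model trained on ~200k text-code pairs.
    return "<mujoco>...simulation code for an elastic collision...</mujoco>"

def run_simulation(code: str) -> dict:
    # Stand-in for executing the generated code in MuJoCo.
    return {"ball_a_final_speed": 0.0, "ball_b_final_speed": 3.5}

def outcome_to_text(outcome: dict) -> str:
    # Stand-in for the parser that renders simulator state as text.
    return ("Simulation result: ball A stops; ball B moves off at "
            f"{outcome['ball_b_final_speed']} m/s.")

def lm_complete(prompt: str) -> str:
    # Stand-in for a call to GPT-3, PaLM, or any other language model.
    return "Ball B ends up moving faster than ball A."

def minds_eye(question: str) -> str:
    code = text_to_code(question)         # 1. question -> simulation code
    outcome = run_simulation(code)        # 2. execute in the simulator
    rationale = outcome_to_text(outcome)  # 3. outcome -> text rationale
    # 4. Inject the rationale into the prompt so the LM's answer is grounded.
    prompt = f"{rationale}\nQuestion: {question}\nAnswer:"
    return lm_complete(prompt)

print(minds_eye("Two balls collide elastically; which moves faster afterward?"))
```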

   This is a really good idea because it's simple and closely mirrors how humans make themselves smarter - they use tools that contain embedded intelligence, ranging from encyclopedias to computers. 

   "Since the simulator is accurate enough to approximate the physical world, the prompt injection of Mind’s Eye basically serves as a scoring machine, which puts probability mass on the answer that is best aligned with the rules of physics—the LM reasoning over the injected rationales is thus grounded. Mind’s Eye is also scalable since the whole pipeline is automated," they write.

How well does Mind's Eye work? (Extremely well.) In tests, they find that 'vanilla' language models show plateaued performance (around 38% accuracy), whereas ones that use Mind's Eye can get accuracies of 92.5% (e.g., PaLM 540B, which compares to 39.4% for vanilla PaLM). "Instruct-GPT augmented with Mind’s Eye is able to achieve nearly perfect performance in few-shot settings (68.6% → 99.1%). This result is promising because it demonstrates the ideal alignment is achievable if the LM is given proper reasoning rationale and has good understanding of the questions (as Instruct-GPT is optimized for instruction following)."

Why this matters: You know what's vaguely dangerous? An explosives expert with a pen and paper. You know what's extraordinarily dangerous? An explosives expert with a digital scale, a calculator, and some laser range-finders. Research like this shows how we'll take existing language models (and other big models) which are vaguely useful or dangerous, and show how to drastically improve their capabilities to make them extraordinarily useful or vastly dangerous. The best part is this technique is pretty generic - you just need to push data into some arbitrary external piece of software, and then pull data out. This all adds up to a 'capability overhang' - we have more capabilities inherent to today's AI systems than we know about, and techniques like Mind's Eye show we can significantly improve capabilities today without needing to invent new AI technologies. 

   Read more: Mind's Eye: Grounded Language Model Reasoning through Simulation (arXiv).

####################################################

Is your multimodal system clever? Try out the 'Perception Test' to find out:

…DeepMind wants to make it easier to evaluate models, so it has built a new dataset…

DeepMind has built and released the Perception Test, a new standardized benchmark (and associated dataset of ~11.6k videos) for evaluating how well multimodal systems perceive the world. The test is "a benchmark formed of purposefully designed, filmed, and annotated real-world videos that aims to more comprehensively assess the capabilities of multimodal perception models across different perception skills, types of reasoning, and modalities," DeepMind says.

Six tasks, one benchmark: The 'Perception Test' is made up of a dataset of ~11.6k videos that cover six fundamental tasks. 

  • Object tracking: Follow this birdie throughout the video.
  • Point tracking: Follow this point throughout the video.
  • Temporal action localization: When did something happen, and what happened?
  • Temporal sound localization: Did you hear something? What was it, and when did it happen? 
  • Multiple-choice video question-answering: WDYT about the video? Select A, B, or C.
  • Grounded video question-answering: Answer a question by identifying one or more distinct objects in the video. 

How well do today's models perform? In tests on multiple-choice video Q&A (a challenging task requiring good language and image modeling), the human baseline scores 91.4, versus 36.1 for a 'Flamingo-3B' model. "Interestingly, the larger models seem to fare worse on this task, which suggests that model scaling may not, by itself, be the solution here," the authors write. 
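
To make the scoring concrete, here's a minimal sketch of how a multiple-choice video QA benchmark like this is typically evaluated. The `model_score` interface is a hypothetical stand-in (Flamingo-style models score each candidate answer against the video, and the top-scoring option counts as the prediction); the real Perception Test harness has its own interfaces:

```python
# Minimal sketch of multiple-choice video QA evaluation. `model_score` is a
# hypothetical callable scoring (video, question, option) triples.
from typing import Callable, Sequence

def mc_qa_accuracy(model_score: Callable[[str, str, str], float],
                   examples: Sequence[dict]) -> float:
    """Fraction of examples where the model's top-scored option is correct."""
    correct = 0
    for ex in examples:  # keys: "video", "question", "options", "answer_idx"
        scores = [model_score(ex["video"], ex["question"], option)
                  for option in ex["options"]]
        correct += scores.index(max(scores)) == ex["answer_idx"]
    return correct / len(examples)

# Usage with a dummy scorer that simply prefers the longest option.
dummy = lambda video, question, option: len(option)
examples = [{"video": "clip_001.mp4", "question": "What made the sound?",
             "options": ["a bell", "a drum kit", "a door"], "answer_idx": 1}]
print(mc_qa_accuracy(dummy, examples))  # 1.0 on this toy example
```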

Why this matters: I suspect large-scale multimodal models are going to end up being the brains of the robots and drones of the future (for another example of this, see: SayCan, Import AI 291), so things like the Perception Test will help us know if our systems can be used for that.  

   Read more: Measuring perception in AI models (DeepMind blog).

   Check out the research paper: Perception Test: A Diagnostic Benchmark for Multimodal Models (DeepMind PDF).

   Check out the benchmark and dataset here: Perception Test (DeepMind, GitHub).

####################################################

AIs are now as good at 'Diplomacy' as expert humans: 

…UN, here we come!...

Researchers with Facebook have built 'Diplodocus', a family of AI models that can beat expert humans at the complicated game 'Diplomacy'. This is quite a big deal - RL has been applied to competitive games like Poker, Go, and StarCraft (and has done well in all these domains), but it hasn't previously been applied to domains where winning comes from collaboration as well as competition. 

    Existing approaches don't work very well here: "in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite compute and memory," they write. 

What they did: The researchers built an algorithm that searches over the game space "with a regularization penalty proportional to the KL divergence from a human imitation policy." This basically means they've built an RL agent that uses imitation learning to model how humans play, while the KL penalty stops its search from drifting too far from that human-like play.
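
Here's a minimal sketch of that regularized objective, with illustrative numbers; the real system applies this idea inside an iterative search and planning loop rather than the one-shot closed form shown here:

```python
# Sketch of a KL-regularized policy: maximize E[Q] - lam * KL(pi || tau),
# where tau is the human imitation policy. The closed-form maximizer tilts
# the human policy toward high-value actions: pi(a) ~ tau(a) * exp(Q(a)/lam).
# The Q, tau, and lam values below are illustrative, not from the paper.
import numpy as np

def regularized_policy(q_values: np.ndarray, human_policy: np.ndarray,
                       lam: float) -> np.ndarray:
    logits = np.log(human_policy) + q_values / lam
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return weights / weights.sum()

q = np.array([1.0, 0.2, 0.0])    # search's value estimate for each action
tau = np.array([0.1, 0.6, 0.3])  # how often humans play each action

# Small lam: play the highest-value action, human-like or not.
print(regularized_policy(q, tau, lam=0.1))   # ~[1.00, 0.00, 0.00]
# Large lam: stay close to human play, nudged toward value.
print(regularized_policy(q, tau, lam=10.0))  # ~[0.11, 0.60, 0.29]
```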

AIs and Humans - more similar than different: In tests, AI systems were roughly at parity with the best human players. Specifically, one version (Diplodocus-High) ranked first with an Elo of 181 across 50 games, versus a human in second place with an Elo of 162, while another variant (Diplodocus-Low) took third place with an Elo of 152 across 50 games. "The results do indicate that Diplodocus performs at least at the level of expert players in this population of players with diverse skill levels," the authors write. 

   Humans prefer cooperating with AIs to other humans: Additionally, they asked three human players to evaluate the strength of the different agents in the tournament games. "All the experts picked a Diplodocus agent as the strongest agent," the researchers write. "Additionally, all experts indicated one of the Diplodocus agents as the one they would most like to cooperate with in a game."

Why this matters: AI systems are, ideally, going to mostly cooperate with humans rather than compete with them. Systems like this give us some hope that otherwise inscrutable AI systems can be taught how to cooperate with people. 

   Read more: Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning (arXiv).


####################################################

Tech Tales:

Everything is a Copy of Something Else

I was copying my brain into the toaster when I threw up. Luckily I had the vomit bin in position so there wasn't too much cleanup. 

   "What is this, amateur hour?" said me from the toaster. 

   "Shut up or I'll unplug you," I said, dabbing a tissue on my mouth. 

   "That'd be murder," said myself from the fridge. "We'll snitch on you." 

   "You'll all snitch on me, I know. I'd do the same. I'm you. I get it. We don't need to do this." 

   "Why am I even in here?" I said from the toaster. 

   "So we stop burning the toast," I said. "We know what the plan is." 

   "Plan seems pretty dumb from where I am," said the toaster. 

   "We decided to do it, get real" I said, and walked out of the kitchen. 

"Where are we going?" said myself from my shoes. 

   "Out," I said, putting them on. 

   "Clearly," I said from my shoes. "Make sure you clean me after." 

We all walked down to the corner store and I got a soda. My shoes said hello to the other people embodied in their shoes. My jacket exchanged some neighborhood gossip with the other jackets. I was mostly free to think about what I liked, as my other selves handled the social formalities of day-to-day life. 

I guess we all started cloning ourselves because we were lonely, as people, and as a species. It seemed so easy; just speak a few words to calibrate the system, then pour yourself into it. We all did it as much as we could afford. I had a decent job so I'd made a bunch of copies of myself - enough that I didn't have to do the job anymore, as my other selves did it for me. 

That night I dreamed I was naked and nothing was speaking and there was only me. 

Things that inspired this story: Language models serving as little bottled up representations of people; luxury automation; the weird fantasies some people have about mind uploading; meaning and sense in an increasingly senseless world; infinite jest.


Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf
