Import AI 240: The unbeatable MATH benchmark; an autonomous river boat dataset; robots for construction sites

In a few decades, the only true records of certain pieces of art or media will be in neural networks, rather than the original input media (which will have been lost). How might 'neural archaeologists' recover these works in the future?

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

Here's another benchmark your puny models can't solve - MATH!
...One area where just scaling things up doesn't help...
SQuAD. SQuAD2. GLUE. SuperGLUE. All these benchmarks have melted in time, like hyperparameter tears in the rain, due to the onslaught of new, powerful AI models. So with a mixture of trepidation and relief let's introduce MATH, a dataset of math problems that contemporary Transformer-based models can't solve.

What's MATH? MATH was made by researchers at UC Berkeley and consists of 12,500 problems taken from high school math competitions. The problems have five difficulty levels and cover seven subjects, including geometry. MATH questions are open-ended, mixing natural language and math across their problem statements and solutions. One example MATH question: "Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose?"
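(For the curious: the answer to that example is seven. Here's a quick brute-force check in Python - purely a sanity check of that one problem, not anything from the dataset or paper.)

```python
from itertools import combinations

# Tom's six marbles; the three yellows are interchangeable.
marbles = ["red", "green", "blue", "yellow", "yellow", "yellow"]

# Enumerate every 2-marble pick, then deduplicate by colour pair,
# since picking different (identical) yellows doesn't make a new group.
groups = {tuple(sorted(pair)) for pair in combinations(marbles, 2)}
print(len(groups))  # 7: six mixed-colour pairs plus the two-yellow pair
```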

Bonus dataset: AMPS: Along with MATH, the authors have also built the Auxiliary Mathematics Problems and Solutions (AMPS) pre-training corpus, a 23GB data repository made of ~100,000 Khan Academy problems with step-by-step solutions written in LaTeX, as well as 5 million problems generated using Mathematica scripts.
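The AMPS generation scripts themselves are written in Mathematica; as a rough flavor of what scripted problem generation looks like, here's a toy Python/SymPy sketch of my own (illustrative only, not the authors' code):

```python
import random
import sympy as sp

def make_derivative_problem(rng: random.Random):
    """Emit one 'differentiate this polynomial' problem plus its LaTeX answer (toy stand-in for AMPS-style generators)."""
    x = sp.symbols("x")
    coeffs = [rng.randint(-5, 5) for _ in range(4)]  # random cubic with small integer coefficients
    f = sum(c * x**i for i, c in enumerate(coeffs))
    problem = f"Find $\\frac{{d}}{{dx}}\\left({sp.latex(f)}\\right)$."
    solution = f"${sp.latex(sp.diff(f, x))}$"
    return problem, solution

rng = random.Random(0)
for _ in range(3):
    problem, solution = make_derivative_problem(rng)
    print(problem, "->", solution)
```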

Why this matters: Current AI systems can't solve MATH: The best part about MATH is that it's unbelievably difficult. GPT-2 models get, at best, an average of 6.9% accuracy on the dataset (even in the most lenient human school, such a score would get an F), while GPT-3 models (which are larger than GPT-2 ones) seem to do meaningfully better than their GPT-2 forebears on some tasks and worse on others. This is good news: we've found a test that large-scale Transformer models can't solve. Even better - we're a long, long way from solving it.
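MATH grades models on whether they produce the right final answer (solutions mark it with \boxed{...}). A stripped-down version of that kind of grader looks something like the sketch below - note this is my simplification, not the official evaluation code, and the example strings are made up:

```python
import re

def extract_boxed(solution: str):
    """Pull the contents of the last \\boxed{...} in a solution string (simplified: no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def exact_match_accuracy(predictions, references):
    """Fraction of problems where the predicted final answer exactly matches the reference answer."""
    hits = sum(extract_boxed(p) == extract_boxed(r) for p, r in zip(predictions, references))
    return hits / len(references)

refs  = ["... so the answer is $\\boxed{7}$."]
preds = ["Therefore there are $\\boxed{7}$ groups."]
print(exact_match_accuracy(preds, refs))  # 1.0
```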
  Read more: Measuring Mathematical Problem Solving with the MATH Dataset (arXiv).
  Get the code from GitHub here.

###################################################

Want a pony that looks like Elvis? We can do that:
...Machine learning systems can do style generalization...
Here's a fun Twitter thread where someone combines the multimodal CLIP system with StyleGAN and a dataset from This Pony Does Not Exist, an infinite sea of GAN-generated My Little Pony characters [note: some chance of NSFW-ish generations]. Good examples include pony-versions of Billie Eilish, Beyonce, and Justin Bieber.
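The thread doesn't spell out the exact recipe, but the usual CLIP-guided GAN trick is to optimize a latent vector so the generated image scores highly against a text prompt under CLIP. Here's a rough sketch of that idea - the pony generator loader and its latent_dim attribute are hypothetical stand-ins, so treat this as illustrative rather than the author's actual method:

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()          # keep everything in fp32 so gradients flow cleanly
clip_model.requires_grad_(False)

with torch.no_grad():
    text = clip_model.encode_text(clip.tokenize(["a pony that looks like Elvis"]).to(device))
    text = text / text.norm(dim=-1, keepdim=True)

G = load_pony_generator().to(device)     # hypothetical: pretrained pony StyleGAN returning NCHW images in [0, 1]
z = torch.randn(1, G.latent_dim, device=device, requires_grad=True)  # latent_dim is an assumed attribute
opt = torch.optim.Adam([z], lr=0.05)

for step in range(300):
    img = F.interpolate(G(z), size=224)  # resize to CLIP's ViT-B/32 input resolution (pixel normalization skipped for brevity)
    feat = clip_model.encode_image(img)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (feat * text).sum()     # cosine distance between image and prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
```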

Why this matters: In the same way AI can generate different genres of text, ranging from gothic fiction to romantic poetry, we're seeing evidence the same kinds of generative capabilities work for imagery as well. And, just as with text, we're able to mix and match these different genres to generate synthetic outputs that feel novel. The 21st century will be reshaped by the arrival of endless, generative and recombinative media.
  Check out the Twitter thread of generations here (Metasemantic's Twitter thread).

###################################################

AI Index 2021: AI has industrialized. Now what?
...Diversity data is still scarce, it's hard to model ethical aspects over time, and more…
The AI Index, an annual project to assess and measure AI progress, has published its fourth edition. (I co-chaired this year's report and spent a lot of time working on it, so if you have questions, feel free to email me).
  This year's ~200-page report includes analysis of some of the big technical performance trends of recent years, bibliometric analysis about the state of AI research in 2020, information about national investments into AI being made by governments, and data about the diversity of AI researchers in university faculty (not good) and among graduating PhDs (also not good). Other takeaways include data relating to the breakneck rates of improvement in AI research and deployment (e.g., the cost to train an ImageNet model on a public cloud has fallen from ~$2000 in 2017 to $7.43 last year), as well as signs of increasing investment into AI applications, beyond pure AI research.

Ethics data - and the difficulty of gathering it: One thing that stuck out to me about the report is the difficulty of measuring and assessing ethical dimensions of AI deployment - specifically, many assessments of AI technologies use one-off analysis for things like interrogating the biases of the model, and few standard tests exist (let's put aside, for a moment, the inherent difficulty of building 'standard' tests for something as complex as bias).

What next? The purpose of the AI Index is to prototype better ways to assess and measure AI and the impact of AI on society. My hope is that in a few years governments will invest in tech assessment initiatives and will be able to use the AI Index as one bit of evidence to inform that process. If we get better at tracking and analyzing the pace of progress in artificial intelligence, we'll be able to deal with some of the information asymmetries that have emerged between the private sector and the rest of society; this transparency should help develop better norms among the broader AI community.
  Read the 2021 AI Index here (AI Index website).
  Read more about the report here: The 2021 AI Index: Major Growth Despite the Pandemic (Stanford HAI blog).

###################################################

Want to train an autonomous river boat? This dataset might help:
...Chinese startup Orca Tech scans waterways with a robot boat, then releases data…
AI-infused robots are hard. That's a topic we cover a lot here at Import AI. But some types of robot are easier than others. Take drones, for instance - easy! They move around in a broadly uncontested environment (the air) and don't need many smart algorithms to do useful stuff. Oceangoing ships are similar (e.g., Saildrone). But what about water-based robots for congested, inland waterways? Turns out, these are difficult to build, according to Chinese startup Orca Tech, which has published a dataset meant to make it easier for people to add AI to these machines.

Why inland waterways are hard for robots: "Global positioning system (GPS) signals are sometimes attenuated due to the occlusion of riparian vegetation, bridges, and urban settlements," the Orca Tech authors write. "In this case, to achieve reliable navigation in inland waterways, accurate and real-time localization relies on the estimation of the vehicle’s relative location to the surrounding environment".

The dataset: USVInland is a dataset of inland waterways in China "collected under a variety of weather conditions" via a little robotic boat. The dataset contains information from stereo cameras, a lidar system, GPS antennas, inertial measurement units (IMUs), and three millimeter-wave radars. The dataset was recorded from May to August 2020 and the data covers a trajectory of more than 26km. It contains 27 continuous raw sequences collected under different weather conditions.
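I haven't reproduced the exact file layout of USVInland here, but as a sketch of how you might start using a dataset like this: pair each lidar frame with its nearest GPS fix and flag the spans where the fix degrades - exactly the stretches where relative localization has to take over. The file format and field names below are hypothetical stand-ins:

```python
import csv
from bisect import bisect_left

def load_gps(path):
    """Assumed CSV columns: timestamp (seconds), lat, lon, num_satellites."""
    with open(path) as f:
        rows = [(float(r["timestamp"]), int(r["num_satellites"])) for r in csv.DictReader(f)]
    return sorted(rows)

def gps_degraded_frames(lidar_timestamps, gps_rows, min_satellites=6):
    """Return indices of lidar frames whose nearest-in-time GPS fix looks unreliable."""
    times = [t for t, _ in gps_rows]
    degraded = []
    for i, t in enumerate(lidar_timestamps):
        j = bisect_left(times, t)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(times)]
        nearest = min(candidates, key=lambda k: abs(times[k] - t))
        if gps_rows[nearest][1] < min_satellites:
            degraded.append(i)  # here you'd fall back to lidar/radar-based relative localization
    return degraded
```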

Why this matters: The authors tested out some typical deep learning-based approaches on the dataset and saw that they struggled to obtain good performance. USVInland is meant to spur others to explore whether DL algorithms can handle some of the perception challenges involved in navigating waterways.
  Read more: Are We Ready for Unmanned Surface Vehicles in Inland Waterways? The USVInland Multisensor Dataset and Benchmark (arXiv).
  Get the data from here (Orca Tech website).

###################################################

Hackers breach live feeds of 150,000 surveillance cameras:
...Now imagine what happens if they combine that data with AI…
A group of hackers has gained access to live feeds of 150,000 surveillance cameras, according to Bloomberg News. The breach is notable for its scale and the businesses it compromised, which included hospitals, a Tesla warehouse, and the Sandy Hook Elementary School in Connecticut.
  The hack is also significant because of the hypothetical possibilities implied by combining this data with AI - allow me to speculate: imagine what you could do with this data if you subsequently applied facial recognition algorithms to it and mixed in techniques for re-identification, letting you chart the movements of people over time, and identify people they mix with who aren't in your database. Chilling.
  Read more: Hackers Breach Thousands of Security Cameras, Exposing Tesla, Jails, Hospitals (Bloomberg).

###################################################

Why your next construction site could be cleaned by AI:
...Real-world AI robots: Japan edition…
AI startup Preferred Networks and construction company Kajima Corporation have built 'iNoh', software that creates autonomous cleaning robots. iNoh uses multiple sensors, including LIDAR, to do real-time simultaneous localization and mapping (SLAM) - this lets the robot know roughly where it is within the building. It pairs this with a deep learning-based computer vision system which "robustly and accurately recognizes obstacles, moving vehicles, no-entry zones and workers", according to the companies. The robot uses its SLAM capability to help it build its own routes around a building in real-time, and its CV system stops it getting into trouble.
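PFN and Kajima haven't published iNoh's internals, but the basic pattern - plan routes over a SLAM-derived occupancy grid with vision-detected obstacles and no-entry zones marked as blocked cells - can be sketched with a simple A* planner. This is my own illustrative toy, not PFN's code:

```python
import heapq

def plan_route(grid, start, goal):
    """A* over a 2D occupancy grid (0 = free, 1 = obstacle / no-entry zone). Returns a path or None."""
    rows, cols = len(grid), len(grid[0])
    def h(p):  # Manhattan-distance heuristic to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, None)]   # (f = g + h, g, cell, parent)
    came_from, best_g = {}, {start: 0}
    while frontier:
        _, g, node, parent = heapq.heappop(frontier)
        if node in came_from:
            continue                          # already expanded via a cheaper route
        came_from[node] = parent
        if node == goal:                      # reconstruct the path back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                if g + 1 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g + 1
                    heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, node))
    return None  # no route: the goal is walled off

# Toy map: the centre cell is a detected no-entry zone the planner must route around.
grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(plan_route(grid, (0, 0), (2, 2)))
```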

Why care about Preferred Networks: Preferred Networks, or PFN, is a Japanese AI startup we've been tracking for a while. The company started out doing reinforcement learning for robots, set a new ImageNet training-speed record in 2017 (Import AI 69) and has been doing advanced research collaborations on areas like meta-learning (Import AI 113). This is a slightly long-winded way to say: PFN has some credible AI researchers and is generally trying to do hard things. Therefore, it's cool to see the company apply its technology in a challenging, open-ended domain, like construction.

PyTorch++: PFN switched away from developing its own AI framework (Chainer) to PyTorch in late 2019.
  Read more: Kajima and PFN Develop Autonomous Navigation System for Construction Site Robots (Preferred Networks).
  Watch a (Japanese) video about iNoh here (YouTube).

###################################################

At last, 20 million real network logs, courtesy of Taiwan:
...See if your AI can spot anomalies in this…
Researchers with the National Yang Ming Chiao Tung University in Taiwan have created ZYELL-NCTU NetTraffic-1.0, a dataset of logs from real networks. Datasets like this are rare and useful, because the data they contain is inherently temporal (good! difficult!) in a non-expensive form (text strings are way cheaper to process than, say, the individual stills in a video, or slices of audio waveforms).

What is the dataset: ZYELL-NCTU NetTraffic-1.0 was collected from the outputs of firewalls in real, deployed networks of the telco 'ZYELL'. It consists of around 22.5 million logs and includes (artificially induced) examples of probe-response and DDoS attacks taking place on the network.
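The paper's own baselines aside, here's the sort of minimal unsupervised baseline you might throw at logs like these. The field names are hypothetical stand-ins for whatever the dataset actually provides, so treat this as a sketch rather than a recipe:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def featurize(row):
    """Turn one parsed firewall log row into a numeric feature vector (keys here are hypothetical)."""
    return [row["src_port"], row["dst_port"], np.log1p(row["bytes"]), row["duration"]]

logs = [
    {"src_port": 51234, "dst_port": 443, "bytes": 5200, "duration": 0.8},
    {"src_port": 51235, "dst_port": 443, "bytes": 4900, "duration": 0.7},
    {"src_port": 40000, "dst_port": 23,  "bytes": 64,   "duration": 0.01},  # probe-like outlier
]

X = np.array([featurize(r) for r in logs])
scores = IsolationForest(random_state=0).fit(X).decision_function(X)
print(scores)  # lower = more anomalous
```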

Why this matters: It's an open question whether modern AI techniques can do effective malicious anomaly detection on network logs; datasets like this will help us understand their tractability.
  Read more: ZYELL-NCTU NetTraffic-1.0: A Large-Scale Dataset for Real-World Network Anomaly Detection (arXiv).
  Where to (maybe) get the dataset: Use the official website, though it's not clear precisely how to access it.

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

CSET’s Jason Matheny joins Biden Administration:
Jason Matheny, founding director of Georgetown’s influential 'CSET' thinktank, is taking on three senior roles “at the intersection of technology and national security”: deputy assistant to the President for technology and national security; deputy director for national security at the OSTP; and coordinator for technology and national security at the National Security Council, per FedScoop. Previously, Matheny was director of IARPA, where—among other things—he spearheaded the forecasting program that incubated Tetlock’s influential superforecasting research.
Read more: Jason Matheny to serve Biden White House in national security and tech roles (FedScoop).

Podcast: Brian Christian on AI alignment:
Brian Christian is interviewed by Rob Wiblin on the 80,000 Hours podcast, about his book, The Alignment Problem (covered in Import #221), and lots else. It’s an awesome interview, which manages to be even more wide-ranging than the book — I strongly recommend both.
Podcast and transcript: Brian Christian on the alignment problem (80,000 Hours podcast).

Minor correction:
Last week I wrote that the NSCAI’s report suggested a $32bn investment in the domestic semiconductor industry over the next five years - the correct figure is $35bn.

###################################################

Tech Tales:

Tell me the weight of the feather and you will be ready
[A large-scale AI training infrastructure, 2026]

When you can tell me precisely where the feather will land, you will be released, said the evaluator.
'Easy', thought the baby artificial intelligence. 'I predict a high probability of success'.

And then the baby AI marked the spot on the ground where it thought the feather would land, then told its evaluator to drop the feather. The feather started to fall and, buffeted by invisible currents in the air and their interplay with the barbs and vanes of the feather itself, landed quite far from where the baby AI had predicted.

Shall we try again? asked the evaluator.
'Yes,' said the baby. 'Let me try again'.

And then the baby AI made 99 more predictions. At its hundredth, the evaluator gave it its aggregate performance statistics.
  'My predictions are not sufficiently accurate,' said the baby AI.
  Correct, said the evaluator. Then the evaluator cast a spell that put the baby AI to sleep.
In the dreams of the baby AI, it watched gigantic feathers made of stone drop like anvils into the ground, and tiny, impossibly thin feathers made of aerogel barely seem to land at all. It dreamed of feathers falling in rain and in snow and in ice. It dreamed of feathers that fell upward, just to know what a 'wrong' fall might look like.

When the baby woke up, its evaluator was there.
Shall we go again, said the evaluator.
'Yes,' said the baby, its neurons lighting up in predictive anticipation of the task, 'show me the feather and let me tell you where it will land'.
And then there was a feather. And another prediction. And another comment from its evaluator.

In the night, the baby saw even more fantastic feathers than the night before. Feathers that passed through hard surfaces. Feathers which were on fire, or wet, or frozen. Sometimes, multiple feathers at once.

Eventually, the baby was able to roughly predict where the feather would fall.
We think you are ready, said the evaluator to the baby.
Ready for what? said the baby.
Other feathers, said the evaluator. Ones we cannot imagine.
'Will I be ready?' said the baby.
That's what this has been for, said the evaluator. We believe you are.
And then the baby was released, into a reality that the evaluator could not imagine or perceive.

Somewhere, a programmer woke up. Made coffee. Went to their desk. Checked a screen: ```feather_fall_pred_domain_rand_X100 complete```.

Things that inspired this story: Domain randomization; ancient tales of mentors and mentees; ideas about what it means to truly know reality 



Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf

