Import AI 258: Game engines are data generators; Spanish language models; the logical end of civilization

If there were three planetary-scale AIs on three different planets with a 15-minute delay between them, then how might a conflict unfold?

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

Open source GPT-ers Eleuther turn one:
...What can some DIY hackers with a Discord channel and a mountain of compute do in a year? A lot, it turns out…
Eleuther, a collective of hackers working on open source AI projects, have celebrated their first birthday by writing a retrospective about their work. For those who haven't kept up to date, Eleuther is trying to do an open source replication of GPT-3 (and people affiliated with the organization have already released GPT-J, a surprisingly powerful, code-friendly 6 billion parameter model). They've also dabbled in a range of other open source projects. The retrospective gives a peek into what they've been working on and a sense of the ideology behind the organization - something we find interesting here at Import AI is the different release philosophies encapsulated by orgs like Eleuther, so keeping track of their thinking is worthwhile.
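
If you want to poke at GPT-J yourself, here's a minimal sketch of sampling from it via HuggingFace's transformers library. This assumes a transformers version recent enough to include GPT-J support (which arrived after the model's original JAX release); treat the 'EleutherAI/gpt-j-6B' hub identifier as an assumption worth double-checking:

```python
# Minimal sketch: sampling from GPT-J via HuggingFace transformers.
# Assumes a transformers version with GPT-J support; the full-precision
# checkpoint is large, so expect ~24GB of RAM (or load in fp16 on a GPU).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "def fibonacci(n):"  # GPT-J is notably good at code completion
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64,
                         do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```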
  Read more: What A Long, Strange Trip It's Been: EleutherAI One Year Retrospective (Eleuther blog).

####################################################

Game engines are data generators now:
...Unity Perception represents the future of game engines…
Researchers with Unity Technologies, makers of the widely-used Unity game engine, have built an open source tool that lets AI researchers use Unity to generate data to train AI systems on. The 'Unity Perception' package "supports various computer vision tasks (including 2D/3D object detection, semantic segmentation, instance segmentation, and keypoints (nodes and edges attached to 3D objects, useful for tasks such as human-pose estimation))", the authors write. The software also comes with systems to automatically label the generated data, along with tools for randomizing the assets used in a data generation task (which makes it easy to create additional training data to increase the robustness of the resulting systems).

Proving that it works: To test out the system, Unity also built 'SynthDet', a project where they used Unity Perception to generate synthetic data for 63 common grocery objects, then trained an object recognition system on it. They used their software to generate a synthetic dataset containing 400,000 images and 2D bounding box annotations, and also collected a real-world dataset of 1,627 images of the 63 items. They then show that by pairing the synthetic data with the real data, they can get substantially improved performance. "Our results clearly demonstrate that synthetic data can play a significant role in computer vision model training," they write.
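
To make the recipe concrete, here's a rough PyTorch sketch of the 'synthetic bulk plus a small real set' idea - the dataset class is a hypothetical stand-in for loaders over the exported Unity Perception annotations, not SynthDet's actual pipeline:

```python
# Sketch: training a detector on synthetic + real data, SynthDet-style.
# BoxDataset is a hypothetical stand-in; in practice it would wrap the
# 400,000 Unity-rendered images (or the 1,627 real photos) and their
# 2D bounding box annotations.
import torch
import torchvision
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class BoxDataset(Dataset):
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        image = torch.rand(3, 224, 224)  # placeholder for a rendered/real photo
        target = {"boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),
                  "labels": torch.tensor([1])}
        return image, target

synthetic = BoxDataset(400)  # stands in for the 400k synthetic images
real = BoxDataset(16)        # stands in for the 1,627 real photos

# The key move: mix both sources into one training set.
train_set = ConcatDataset([synthetic, real])
loader = DataLoader(train_set, batch_size=4, shuffle=True,
                    collate_fn=lambda batch: tuple(zip(*batch)))

# An off-the-shelf detector with one class per grocery item (+ background).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

model.train()
for images, targets in loader:
    losses = model(list(images), list(targets))  # returns a dict of losses
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```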

Why this matters - data generators are engines, computers are electricity: I think of game engines like Unity as the equivalent to an engine that you might place in a factory, where here the factory is a datacenter. Systems like Unity help you take in a small amount of input fuel (e.g., a scene rendered in a 3D world), then run electricity (compute) through the engine (Unity) until you output a much larger dataset made possible by the initial fuel. You can then pair this output with 'real' data gathered via other means and in doing so improve the performance and efficiency of your AI factory. This feels like another important trend to watch when thinking about the steady industrialization of AI development.
  Read more: Unity Perception: Generate Synthetic Data for Computer Vision (arXiv).

####################################################

Can your algorithm handle the real world? Use the 'Shifts' dataset to find out:
...Distributional shift data from industrial sources = more of a real world dataset than usual…
Much of AI progress is reliant on algorithms doing well on certain narrow, pre-defined benchmarks. These benchmarks are based on datasets that simulate or represent tasks found in the real world. However, once these algorithms get deployed into the real world it can be quite common for them to break, because they encounter some situation which their dataset and benchmark didn't represent. This phenomenon is called 'distributional shift'.
  Now, researchers with (primarily) Russian tech company Yandex, along with others at HSE University, Moscow Institute of Physics and Technology, University of Cambridge, University of Oxford, and the Alan Turing Institute, have developed the 'Shifts Dataset', which consists of "data taken directly from large-scale industrial sources and services where distributional shift is ubiquitous".

What data is in Shifts? Shifts contains tabular weather prediction data from the Yandex Weather service, machine translation data taken from the WMT robustness track and mined from Reddit (and annotated in-house by Yandex), and self-driving car data from Yandex's self-driving car project. 
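
To make the failure mode concrete, here's a minimal sketch of the kind of check Shifts is designed for: comparing a model's error on in-distribution data against its error on shifted data. The file names and column choices are placeholders, not the benchmark's actual layout (and the real benchmark also scores uncertainty estimates, not just raw error):

```python
# Sketch: measuring distributional shift as an error gap.
# File names and the 'temperature' target column are placeholder
# assumptions, not the actual Shifts data layout.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

train = pd.read_csv("weather_in_domain_train.csv")   # e.g. one set of climates/times
in_dist = pd.read_csv("weather_in_domain_eval.csv")  # same distribution as training
shifted = pd.read_csv("weather_shifted_eval.csv")    # new climates, regions, times

features = [c for c in train.columns if c != "temperature"]
model = GradientBoostingRegressor().fit(train[features], train["temperature"])

rmse_in = mean_squared_error(in_dist["temperature"], model.predict(in_dist[features])) ** 0.5
rmse_shift = mean_squared_error(shifted["temperature"], model.predict(shifted[features])) ** 0.5

# A big gap means your benchmark score won't survive deployment.
print(f"in-distribution RMSE: {rmse_in:.2f} | shifted RMSE: {rmse_shift:.2f}")
```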
  Read more: Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (arXiv).
  Get the dataset from here (Yandex, GitHub).

####################################################

Buy Sophia the robot (for $80,000):
...Sure, little quadruped robots are cool, but what about the iconic (for better or for worse) humanoid robot?...
Sophia the robot is a fancy human-appearing robot made by Hanson Robotics. Sophia has become a lightning rod in the AI community for giving wildly unrealistic impressions of what AI is capable of. But the hardware is really, really nice. If you've got $80,000 to spare and want to buy a couple of 21st century animatronics, maybe put a bid in here. I, for one, would love to be invited to a rich person's party where some fancy puppets might be swanning around. Bonus points if you lose the skirt and go for the full hybrid-frightener look. (You could always spend a rumored $75k on a Boston Dynamics 'Spot' robot, but where's the fun in that?)
  Consider buying a robot here (RobotShop).

####################################################

Spanish researchers embed Spanish culture into some large-scale RoBERTa models:
...National data for national models...
Researchers with the wonderfully named "Text Mining Unit" within the Barcelona Supercomputing Center have created a couple of Spanish-language RoBERTa models, helping to imbue some AI tools with Spanish language and culture. This is part of a recent trend of countries seeking to build their own nationally/culturally representative AI models. Other examples include Korea, where search giant Naver created a Korean-representing GPT-3-style model called 'HyperCLOVA' (Import AI 251), and a Dutch RoBERTa (Import AI 182), among others.

What they did: They gathered 570GB of predominantly Spanish-language data, then trained RoBERTa base and RoBERTa large models on it. In tests, their models generally did better than pre-existing Spanish-focused BERT models.
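
As a rough usage sketch, masked-language models like these can be queried through HuggingFace's fill-mask pipeline. The hub identifier below is an assumption - double-check it against the links at the end of this section:

```python
# Sketch: querying a Spanish RoBERTa with the fill-mask pipeline.
# The hub identifier is an assumption; check the HuggingFace links below.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="BSC-TeMU/roberta-base-bne")

# RoBERTa-style models use <mask> as their mask token.
for pred in fill_mask("La capital de España es <mask>."):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```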

The ethics of dragnet data fishing: In the past year, there's been some debate about how large datasets should be constructed, where some people argue such datasets should be heavily curated by the people that gather them, while others argue they should be deliberately uncurated. Here, the researchers opt for what I'd call a curated-uncurated strategy - they gather three different types of data via targeted crawls: theme-based (e.g., datasets relating to politics, feminism, etc.), event-based (events of significance to Spanish society), and domains at risk of disappearing (e.g., if a website is about to be shut down). The paper has more information about the crawls. My expectation is most of the world will move to lightly curated dragnet-fishing data gathering, as individual human curation may be too expensive and slow.
  Read more: Spanish Language Models (arXiv).
  Get the RoBERTa base model here (HuggingFace).
  Get the RoBERTa large model here (HuggingFace).

####################################################

Tech Tales:

Repetition and Recitation at the End of Time

[A historian in another Solar System, either now or thousands of years prior or thousands of years in the future]

He was a historian and he studied the long-dead by the traces they had created in the AI systems that had outlasted the civilization. It worked like this: he found a computational artefact, got it running, worked out how to prime it, then started plugging details in until the system would spit out data it had memorized about the individual's life: home addresses, contact details, extracts of speeches they had made, and so on.

Of course, some of the data was fuzzy. Most AI systems trend towards a form of poetic license, much like how when people recite things from memory they have a tendency to embellish - to over-dramatize, or to insert illusory facts that come from their own lives and dreams.

But it was all he had to work with: the living beings that had made the AI were long dead, and so he made do with these bottled-up representations of their culture. He wrote his reports and published them to the system-wide internet, where they were read and commented on. And, of course, ingested in turn by his own civilization's AI systems.

Just a decade ago, the first AI probes had been sent out - trained artefacts embedded into craft and then sent, in hopes they might arrive at target systems intact and in stable orbits and then exist there, waiting to be found by other civilizations, other forms of life, who might probe them and learn to extract their secrets and develop an understanding of the civilization they came from. His own reports were in there, as well. So perhaps one day soon some being unlike him would sit down and try to extract his name and habits and details, eager to learn about the strange beings now showing up as zeros and ones in cold machines, sent into the dark.

Things that inspired this story: The recent discussion about memorization and recitation in neural nets; ideas about how culture gets represented within AI models; thoughts of space and the purpose of existing in space; the idea that there may be a more limited design space for AI than for biological life so perhaps such things as the above may be possible; hope for a stellar future and fear that if we don't get to it, we will be known by our digital exhaust, captured in our generative models.



Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf

