Import AI 258: Game engines are data generators; Spanish language models; the logical end of civilization

If there were three planetary-scale AIs on three different planets with a 15 minute delay between them, then how might a conflict unfold?

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

Open source GPT-ers Eleuther turn one:
...What can some DIY hackers with a Discord channel and a mountain of compute do in a year? A lot, it turns out…
Eleuther, a collective of hackers working on open source AI projects, has recently celebrated their one year birthday by writing a retrospective about their work. For those who haven't kept up to date, Eleuther is trying to do an open source replication of GPT-3 (and people affiliated with the organization have already released GPT-J, a surprisingly powerful code-friendly 6BN parameter model). They've also dabbled in a range of other open source projects. This retrospective gives a peek into what they've been working on and also gives us a sense of the ideology behind the organization - something we find interesting here at Import AI is the different release philosophies encapsulated by orgs like Eleuther, so keeping track of their thinking is worthwhile.
  Read more: What A Long, Strange Trip It's Been: EleutherAI One Year Retrospective (Eleuther blog).

####################################################

Game engines are data generators now:
...Unity Perception represents the future of game engines…
Researchers with Unity Technologies, makers of the widely-used Unity game engine, have built an open source tool that lets AI researchers use Unity to generate data to train AI systems on. The 'Unity Perception' package "supports various computer vision tasks (including 2D/3D object detection, semantic segmentation, instance segmentation, and keypoints (nodes and edges attached to 3D objects, useful for tasks such as human-pose estimation))", the authors write. The software also comes with systems to automatically label the generated data, along with tools for randomizing the assets used in a data generation task (which makes it easy to create additional data to train systems on to increase their robustness).
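
To make the randomize-and-label idea concrete, here is a minimal sketch in plain Python. It does not use Unity or the Unity Perception API; the asset names, image size, and the `random_scene` and `generate_dataset` helpers are all hypothetical stand-ins. The point it illustrates is that when the generator places every object itself, the labels come for free.

```python
import random

# Hypothetical asset list and image size; not taken from Unity Perception.
ASSETS = ["cereal_box", "soup_can", "milk_carton"]
IMAGE_W, IMAGE_H = 640, 480


def random_scene(rng, max_objects=3):
    """Sample one randomized scene and return it with its labels.

    Because the generator places every object itself, the 2D bounding
    boxes are known exactly and no human annotation is needed.
    """
    labels = []
    for _ in range(rng.randint(1, max_objects)):
        w = rng.randint(40, 160)
        h = rng.randint(40, 160)
        x = rng.randint(0, IMAGE_W - w)  # keep the box inside the image
        y = rng.randint(0, IMAGE_H - h)
        labels.append({"class": rng.choice(ASSETS), "bbox": (x, y, w, h)})
    # Lighting is one more randomized parameter, stored as scene metadata.
    return {"lighting": round(rng.uniform(0.2, 1.0), 2), "labels": labels}


def generate_dataset(n_scenes, seed=0):
    """Turn a handful of assets into many labeled scenes, reproducibly."""
    rng = random.Random(seed)
    return [random_scene(rng) for _ in range(n_scenes)]


dataset = generate_dataset(1000)
print(len(dataset), "labeled scenes from", len(ASSETS), "assets")
```

A real pipeline would render an image for each scene; this sketch only produces the scene descriptions and their annotations.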

Proving that it works: To test out the system, Unity also built 'SynthDet', a project where they used Unity Perception to generate synthetic data for 63 common grocery objects, then trained an object recognition system on it. They used their software to generate a synthetic dataset containing 400,000 images and 2D bounding box annotations, then also collected a real-world dataset of 1,627 images of the 63 items. They then show that by pairing the synthetic data with the real data, they can get substantially improved performance. "Our results clearly demonstrate that synthetic data can play a significant role in computer vision model training," they write.

Why this matters - data generators are engines, computers are electricity: I think of game engines like Unity as the equivalent to an engine that you might place in a factory, where here the factory is a datacenter. Systems like Unity help you take in a small amount of input fuel (e.g., a scene rendered in a 3D world), then run electricity (compute) through the engine (Unity) until you output a much larger dataset made possible by the initial fuel. You can then pair this output with 'real' data gathered via other means and in doing so improve the performance and efficiency of your AI factory. This feels like another important trend to look at when thinking about the steady industrialization of AI development.
  Read more: Unity Perception: Generate Synthetic Data for Computer Vision (arXiv).

####################################################

Can your algorithm handle the real world? Use the 'Shifts' dataset to find out:
...Distributional shift data from industrial sources = more of a real world dataset than usual…
Much of AI progress is reliant on algorithms doing well on certain narrow, pre-defined benchmarks. These benchmarks are based on datasets that simulate or represent tasks found in the real world. However, once these algorithms get deployed into the real world it can be quite common for them to break, because they encounter some situation which their dataset and benchmark didn't represent. This phenomenon is called 'distributional shift'.
  Now, researchers with (primarily) Russian tech company Yandex, along with ones at HSE University, Moscow Institute of Physics and Technology, University of Cambridge, University of Oxford, and the Alan Turing Institute, have developed the 'Shifts Dataset', which consists of "data taken directly from large-scale industrial sources and services where distributional shift is ubiquitous".
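
For a feel of what distributional shift does to a model, here is a toy sketch in plain Python. It has nothing to do with the Shifts data itself: the two Gaussian classes, the constant `shift` offset, and the `sample`, `fit_threshold`, and `accuracy` helpers are all assumptions made up for illustration. A threshold classifier is fit on one distribution, then scored on fresh data from the same distribution and on data whose features have drifted.

```python
import random


def sample(rng, n, shift=0.0):
    """Draw n labeled points: class 0 centered at 0, class 1 centered at 2.

    `shift` moves every feature by a constant, standing in for a change in
    deployment conditions that the training data never covered.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label, 1.0) + shift
        data.append((x, label))
    return data


def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2


def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
threshold = fit_threshold(sample(rng, 2000))
acc_in = accuracy(threshold, sample(rng, 2000))
acc_shifted = accuracy(threshold, sample(rng, 2000, shift=1.5))
print(f"in-distribution accuracy: {acc_in:.2f}")
print(f"shifted accuracy:         {acc_shifted:.2f}")
```

The classifier is unchanged between the two evaluations; only the data moved, which is the failure mode a benchmark drawn from a single distribution cannot reveal.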

What data is in Shifts? Shifts contains tabular weather prediction data from the Yandex Weather service, machine translation data taken from the WMT robustness track and mined from Reddit (and annotated in-house by Yandex), and self-driving car data from Yandex's self-driving car project. 
  Read more: Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (arXiv).
  Get the dataset from here (Yandex, GitHub).

####################################################

Buy Sophia the robot (for $80,000):
...Sure, little quadruped robots are cool, but what about the iconic (for better or for worse) human-robot?...
Sophia the robot is a fancy human-appearing robot made by Hanson Robotics. Sophia has become a lightning rod in the AI community for giving wildly unrealistic impressions of what AI is capable of. But the hardware is really, really nice. If you've got $80,000 to spare and want to buy a couple of 21st century animatronics, maybe put a bid in here. I, for one, would love to be invited to a rich person's party where some fancy puppets might be swanning around. Bonus points if you lose the skirt and go for the full hybrid-frightener look. (You could always spend a rumored $75k on a Boston Dynamics 'Spot' robot, but where's the fun in that).
  Consider buying a robot here (RobotShop).

####################################################

Spanish researchers embed Spanish culture into some large-scale RoBERTa models:
...National data for national models...
Researchers with the wonderfully named "Text Mining Unit" within the Barcelona Supercomputing Center have created a couple of Spanish-language RoBERTa models, helping them to imbue some AI tools with Spanish language and culture. This is part of a recent trend of countries seeking to build their own nationally/culturally representative AI models. Some other examples include Korea, where a startup named Naver created a Korean-representing GPT-3 style model called 'HyperCLOVA' (Import AI 251), and a Dutch RoBERTa (Import AI 182), among others.

What they did: They gathered 570GB of predominantly Spanish-language data, then trained a RoBERTa base and a RoBERTa large model on the dataset. In tests, their models generally did better than other pre-existing Spanish-focused BERT models.

The ethics of dragnet data fishing: In the past year, there's been some debate about how large datasets should be constructed, where some people argue such datasets should be heavily curated by the people that gather them, while others argue they should be deliberately uncurated. Here, the researchers opt for what I'd call a curated uncurated strategy - they create three different types of data: theme-based (e.g., datasets relating to politics, feminism, etc.), event-based (events of significance to Spanish society), and domains at risk of disappearing (e.g., if a website is about to be shut down). You can find out more information here about the crawls. My expectation is most of the world will move to lightly curated dragnet fishing data gathering, as individual human curation may be too expensive and slow.
  Read more: Spanish Language Models (arXiv).
  Get the RoBERTa base model here (HuggingFace).
  Get the RoBERTa large model here (HuggingFace).

####################################################

Tech Tales:

Repetition and Recitation at the End of Time

[A historian in another Solar System, either now or thousands of years prior or thousands of years in the future]

He was a historian and he studied the long-dead by the traces they had created in the AI systems that had outlasted the civilization. It worked like this: he found a computational artefact, got it running, worked out how to prime it, then started plugging details in until the system would spit out data it had memorized about the individual's life: home addresses, contact details, extracts of speeches they had made, and so on.

Of course, some of the data was fuzzy. Most AI systems trend towards a form of poetic license, much like how when people recite things from memory they have a tendency to embellish - to over-dramatize, or to insert illusory facts that come from their own lives and dreams.

But it was all they had to work with: the living beings that had made the AI were long dead, and so he made do with these bottled up representations of their culture. He wrote his reports and published them to the system-wide internet, where they were read and commented on. And, of course, ingested in turn by his own civilization's AI systems.

Just a decade ago, the first AI probes had been sent out - trained artefacts embedded into craft and then sent, in hopes they might arrive at target systems intact and in stable orbits and then exist there, waiting to be found by other civilizations, other forms of life, who might probe them and learn to extract their secrets and develop an understanding of the civilization they came from. His own reports were in there, as well. So perhaps one day soon some being unlike him would sit down and try to extract his name and habits and details, eager to learn about the strange beings now showing up as zeros and ones in cold machines, sent into the dark.

Things that inspired this story: The recent discussion about memorization and recitation in neural nets; ideas about how culture gets represented within AI models; thoughts of space and the purpose of existing in space; the idea that there may be a more limited design space for AI than for biological life so perhaps such things as the above may be possible; hope for a stellar future and fear that if we don't get to it, we will be known by our digital exhaust, captured in our generative models.



Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf

Copyright © 2021 Import AI, All rights reserved.
