Import AI 241: The $2 million dataset; small GPT-3 replications; ImageNet gets a face-blur update

Prediction: By 2030, most computation on planet Earth will be "restricted", and "unrestricted computation" will be associated with fringe actors and nation-state/megacorp proxies.

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

CUAD: A free $2 million legal dataset!
...Specific rather than general evaluation: okay, your model can understand language, but can it understand legal contracts?...
AI is moving from a technology of general scientific interest to one of broad commercial interest. Because of this, the way we evaluate AI is changing. Along with judging the performance of an AI system on a generic task (like classifying images from ImageNet, or judging the quality of generative text outputs), we're moving to evaluating performance on highly specific tasks grounded in the real world. This gives us a better understanding of where contemporary AI systems are strong and where they're weak.
    One such specific evaluation comes to us from researchers at Berkeley and the Nueva School: CUAD, the Contract Understanding Atticus Dataset, is a dataset of legal contracts with expert annotations by lawyers. CUAD helps us test out how well AI systems can do on a specific, challenging task found in the real world.

What's in CUAD? CUAD contains 500 contracts with 13,000 expert annotations across 41 label categories. The dataset is designed to test how well AI systems can highlight the parts of a contract relevant to a given label - a task the authors compare to "finding needles in a haystack".
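To make the task concrete, here's a minimal sketch (not the authors' code) of the CUAD setup framed as extractive question answering with an off-the-shelf HuggingFace model; the model name and question phrasing are illustrative assumptions, and a real evaluation would fine-tune on CUAD's annotations.

```python
# Hedged sketch: treat one CUAD label category ("Governing Law") as a
# question and ask a generic extractive-QA model to find the relevant span.
from transformers import pipeline

contract_text = (
    "This Agreement shall be governed by and construed in accordance with "
    "the laws of the State of Delaware. Either party may terminate this "
    "Agreement upon thirty (30) days' written notice to the other party."
)

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Which state's or country's law governs this contract?",
    context=contract_text,
)
print(result["answer"], result["score"])  # expect a span mentioning Delaware
```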

The $2 million dataset: CUAD was built by expert law student annotators who received 70-100 hours of contract review training before they started labeling, and each of their labels was checked by additional validators. Therefore, "a conservative estimate of the pecuniary value of CUAD is over $2 million (each of the 9283 pages were reviewed at least 4 times, each page requiring 5-10 minutes, assuming a rate of $500 per hour)", the researchers note.
  Read more: CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (arXiv).
  Get the dataset: Contract Understanding Atticus Dataset (CUAD) from here (Atticus Project website).

###################################################

Interested in speech but hate stitching together software? SpeechBrain could be for you:

...PyTorch-based software simplifies a bunch of fiddly tasks…
Speech: it's how most humans transmit most of their information. And in recent years, advances in AI have made speech recognition significantly better and more efficient. But it's still weirdly hard to use the full stack of speech capabilities - especially when we compare the usability of speech to things like text (where packages like HuggingFace's 'Transformers' have made things relatively easy), or image recognition (where there are a ton of easy-to-use systems available).

Now, a team of researchers have built SpeechBrain, open source software "designed to be simple, extremely flexible, and user-friendly", according to the website.

Key features: SpeechBrain ships with inbuilt models for speech recognition, speaker recognition, speech enhancement, speech processing (including multi-microphone processing), and a bunch of documentation and tools to aid researchers.
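As a rough illustration of the intended ergonomics, here's a minimal sketch of transcribing an audio file with one of SpeechBrain's pretrained recipes; the model identifier and file path below are examples, not a guaranteed part of the API.

```python
# Hedged sketch: load a pretrained ASR model from SpeechBrain's model zoo
# and transcribe a local audio file (model name and path are assumptions).
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # example pretrained recipe
    savedir="pretrained_asr",                          # local cache directory
)
print(asr.transcribe_file("example_utterance.wav"))    # hypothetical audio file
```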
  Get the code: SpeechBrain - A PyTorch powered Speech Toolkit (official website).

###################################################

Does Google want to open-source GPT-3?
...Recent outputs by the ethical AI team suggest 'no', while Google's TFRC suggests 'yes'...
Google isn't publicly replicating GPT-3, the large-scale NLP model developed by OpenAI. And some parts of Google - most notably its ethical AI team, formerly led by Timnit Gebru and Meg Mitchell - have published research about the ethical and safety issues of language models like GPT-3 and Google's BERT.
    Yet, Google is supporting the open source release of GPT-3, because Google is supplying hundreds of thousands of dollars of compute per month via the TensorFlow Research Cloud (TFRC) to Eleuther, an AI organization whose goal is to replicate and release GPT-3 (and even larger models). This is an action that neatly illustrates why AI policy is confusing and why coordination (within companies or between them) is challenging.

GPT-3-esque open source models: Eleuther has just published 1.3 billion- and 2.7 billion-parameter models designed to replicate GPT-3, trained on 'The Pile', an 800GB dataset of text also developed by Eleuther. Eleuther trained these models using compute it accessed via the TFRC project (and TFRC understands that Eleuther's goal is to replicate GPT-3).
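For a sense of how accessible these releases are, here's a minimal sketch of sampling from the 1.3 billion-parameter model via the HuggingFace transformers library; the model identifier and generation settings are assumptions, not Eleuther's own training or inference code.

```python
# Hedged sketch: sample text from Eleuther's GPT-3-style 1.3B model
# (identifier assumed to be "EleutherAI/gpt-neo-1.3B" on the HF hub).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

inputs = tokenizer("Large language models raise policy questions because",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_length=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```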

Why this matters: Google's actions here are confusing. On the one hand, the company publishes AI principles and periodically goes on publicity drives about 'responsible AI'. On the other hand, Google is enabling the release of a class of models with some non-trivial ethical challenges, via a process that lets it sidestep accountability. It's hard for us to know what Google believes as an institution, here.

Factories are opinions: Right now, it's as though Google has specific opinions about the products (software) it makes in its factories (datacenters), yet at the same time is providing unrestricted access to its factories (datacenters) to external organizations. It'd be interesting to understand the thinking here - does TFRC become the means by which Google allows open source models to come into existence without needing to state whether it has chosen to 'release' these models?
  Get the GPT-3 model code here (Eleuther GitHub).
  More information about TFRC+Eleuther here (Eleuther member, Stella Rose, Twitter).

###################################################

ImageNet: Now sanitised with blurred faces:
...As AI industrializes, datasets get cleaned up...
ImageNet, one of the most widely used datasets in machine learning, has been sanitised. Specifically, a team of researchers at Princeton and Stanford University have gone through the multi-million picture dataset and tried to blur the faces of every human within ImageNet. They call this "an attempt to mitigate ILSVRC's privacy issues". The paper is also notable because of the authors - Fei-Fei Li led the creation of the original dataset and is listed as an author.

What they did: The authors use Amazon's 'Rekognition' service on all images in ILSVRC to find faces, then refine these results through human annotation via Amazon Mechanical Turk. They then blur the identified faces.
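The core operation is simple; here's a minimal sketch (not the authors' pipeline) that blurs one detected face region with Pillow, where the bounding box and file names stand in for real detections from a face-detection service like Rekognition.

```python
# Hedged sketch: Gaussian-blur a rectangular face region in an image.
# The bounding box and file names are placeholders for real detections.
from PIL import Image, ImageFilter

def blur_region(img, box, radius=12):
    """Blur box = (left, top, right, bottom) and return the modified image."""
    face = img.crop(box).filter(ImageFilter.GaussianBlur(radius))
    img.paste(face, box)
    return img

img = Image.open("example_image.jpg")        # hypothetical ImageNet image
img = blur_region(img, (40, 30, 120, 130))   # hypothetical face bounding box
img.save("example_image_blurred.jpg")
```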

What effect does it have? Blurring removes information that was present in the image. Therefore, though only 3 of ImageNet's categories relate to people, we might expect the blurring to reduce the utility of the overall dataset. This seems to be the case: in tests, systems trained on the 'blurred' version of ImageNet do about 0.5 absolute points worse than those trained on the non-blurred version. That's actually pretty good - a negligible reduction in accuracy for a privacy bonus. Some categories are affected more severely - specifically, the 'mask' and 'harmonica' categories now do worse, "as obfuscation removes visual cues necessary for recognizing them".

Who gets to be a scholar? This paper has attracted some controversy because of its relationship (or lack thereof) to earlier work by Vinay Prabhu and Abeba Birhane, who in June of last year wrote a paper about the challenges created by large-scale datasets such as ImageNet - the face-blurring paper doesn't mention much of this work. Prabhu says, in a blog post, that the paper "appears to be a calculated and systematic erasure of the entire body of critique that our work was part of".
  There's some apparent merit to this case - Prabhu said they carried out a live Q&A with Fei-Fei Li about some of the issues with computer vision subsequently covered in their work. It's not clear to me what the precise mechanics of this situation are, but the significant amount of public evidence here makes it feel worth mentioning. (One of the things I take from all of this is that the AI space may be starting to fracture into different research communities, with this incident seeming to indicate a rift forming between some researchers. We saw similar patterns with the Timnit Gebru and Margaret Mitchell situations at Google recently, as well.)

Why this matters: Today, the datasets used to train AI are broadly unknown, undocumented, and unregulated. In the future, as with any key input to an important industrial process, we can expect datasets to be known, documented, and regulated. Techniques like blurring faces after a dataset has been constructed are useful to work on, because they give us a path for converting today's datasets into ones better suited to that regulatory future. It also raises questions of dataset circulation - now that there's an official, blurred-face ImageNet, where will the unblurred 'black market ImageNet' circulate and who might use it?
  Read more: A Study of Face Obfuscation in ImageNet (arXiv).
  Get the code here (ImageNet Face Obfuscation, GitHub).
  Read more: A study of “A Study of Face Obfuscation in ImageNet” (Vinay Prabhu, blog).

###################################################

Now that reinforcement learning works, what will it do to the world?

...Researchers grapple with the societal implications of (semi-)autonomous agents…
Recently, reinforcement learning has started to work well enough to be applied to large, consequential deployments; RL-infused systems help create recommendation algorithms for social media, calibrate the power usage of equipment in Google's datacenters, and are starting to teach robots how to move.
  Now, researchers with the Leverhulme Center for the Future of Intelligence in Cambridge, and Imperial College London, have written a paper analyzing the societal impacts of deep reinforcement learning. Their conclusion? We need to spend a bit more time thinking about the implications of these systems and coming up with effective oversight schemes to control them. "As more companies develop and deploy DRL systems with wide-ranging impacts on users, we must consider both how to ensure that these systems behave as intended over the long-term, and whose interests they are serving," they write.

What should we do about RL systems? As reinforcement learning systems get better, they're going to be deployed more widely, which means they'll explore an ever-broader range of environments. This is mostly going to be good, but we'll need adequate human oversight to stop them from taking dangerous actions in high-risk situations. We'll also need to closely observe RL-trained systems' behavior, so we can be confident that their reward functions don't lead to pathological breakdowns.
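One concrete form this oversight could take is simple instrumentation around deployed agents. Here's a minimal sketch, using the (old-style) OpenAI Gym API, of a wrapper that tracks episode returns and flags anomalies; the environment and alert threshold are chosen purely for illustration.

```python
# Hedged sketch: wrap an environment so an operator can watch episode
# returns and get an alert when they look anomalous (threshold assumed).
import gym

class RewardMonitor(gym.Wrapper):
    def __init__(self, env, alert_threshold=450.0):
        super().__init__(env)
        self.alert_threshold = alert_threshold
        self.episode_return = 0.0

    def reset(self, **kwargs):
        self.episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_return += reward
        if done and self.episode_return > self.alert_threshold:
            print(f"ALERT: unusually high episode return {self.episode_return:.1f}")
        return obs, reward, done, info

env = RewardMonitor(gym.make("CartPole-v1"))
```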

Policy suggestions: One practical recommendation from the researchers is to "find ways to track progress in DRL and its applications" - I think this is a great idea! Something I've spent a few years doing at the AI Index is regularly tracking and analyzing technical progress. It's been surprisingly difficult to do this for RL because, after a blissful few years in which most people competed with each other on the Atari-57 set of games, people are now testing RL in dissimilar, hard-to-compare environments. The authors also suggest researchers develop "notions of responsible DRL development" - by this, they basically mean splicing technical teams together with ethicists and safety-oriented people.
  Read more: The Societal Implications of Deep Reinforcement Learning (JAIR).

###################################################

800GB of cleaned, Common Crawl text:

...Fresh, processed data for researchers on a budget…
The Allen Institute for Artificial Intelligence (AI2) has published C4, a dataset of 800GB of cleaned English text (along with a 6.3TB uncleaned variant). C4 is a massive dataset originally developed by Google to train its 'T5' natural language processing system.
  AI2 has uploaded the data into a requester-pays bucket in Google Cloud Storage, which means the whole dataset will cost about $100 to download. By processing and uploading the data, AI2 has helped create a common-good resource that would otherwise have been replicated privately by researchers around the world.
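For readers unfamiliar with requester-pays buckets: the downloader's own cloud project gets billed for the transfer. Here's a minimal sketch of fetching a single shard with the google-cloud-storage client, where the bucket name, shard path, and project ID are placeholders rather than the real C4 locations (see the AI2 repo linked below for those).

```python
# Hedged sketch: download one shard from a requester-pays GCS bucket.
# Bucket name, blob path, and project ID below are placeholders.
from google.cloud import storage

client = storage.Client(project="your-gcp-project")  # project that gets billed
bucket = client.bucket("example-c4-bucket", user_project="your-gcp-project")
blob = bucket.blob("en/c4-train.00000-of-01024.json.gz")  # hypothetical shard
blob.download_to_filename("c4-train.00000-of-01024.json.gz")
```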
  Get the dataset here: Download the C4 dataset (GitHub, AI2 repo).
  More about the dataset here: C4 (TensorFlow website).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…


A primer on AI safety:
DC think tank CSET has released a 3-part primer on AI safety, offering a non-technical summary of the key problems and approaches. CSET uses a framework from DeepMind to split safety into three components:
- Knowing that an AI system will perform reliably in a diverse range of environments not encountered during training (robustness)
- Being able to understand why it behaves the way it does, and whether it will adhere to our expectations (assurance)
- Knowing how to specify its goals such that the goals align with the behavior we want it to manifest (specification).
  Read more: (1) Key Concepts in AI Safety—Overview; (2) Robustness and adversarial examples; (3) Interpretability.

--------


ARIA — the UK’s answer to DARPA

The UK is launching an agency to fund “high-risk, high-reward” research in emerging technologies, modelled on the US’ DARPA program. The Advanced Research & Invention Agency (ARIA) will be led by a small group of experts, and will operate independently from the government. It has been given initial funding of £800m over four years. It is hoped that ARIA will be able to deliver funding to researchers with flexibility and speed; without unnecessary bureaucracy; and with a high tolerance for failure. ARIA is the brainchild of Dominic Cummings, who has long advocated for a DARPA-esque agency for the UK. 

   Read more: Gov press release

   Read more: Why Dominic Cummings fears the £800m research agency he championed will fail (NS)


--------


What Matthew is reading:

###################################################

Tech Tales:

The 10,000 Faces of Confrontation

[A 'young professional' style apartment, up on the tenth to twentieth floor, in some part of the hot tech economy - San Francisco, Singapore, London, or wherever]

You stare into the SmartMirror and it's not your face looking back at you, it's the face of a boss who has been putting you through hell. You yell at the boss. Tell them your feelings. The boss looks hurt. They don't apologize - you pre-programmed that 'bias against yielding' into the system - but you feel some catharsis at getting something of a rise out of them.

Each day, you have a conversation with a different person. You have your favorites, of course. Like the boss or the girlfriend or - of course - your mother and father. But there are other characters that you're developing as well - a restaurant server who, you think, subtly insulted you. A celebrity whose adverts you have taken a dislike to.

The next day you stare into the SmartMirror and you make your girlfriend appear. You tell them you are disappointed in how they behaved last night. You explain you're hurt by them. They try to explain themselves, but it's just a language model taking your conversation and combining it with a response primed around being 'conciliatory'. You tell them their excuses are not going to cut it.

The day after that, your SmartMirror "suggests" someone for you to talk to. An old friend of yours. "We believe this avatar will inspire a significant emotional response," says the accompanying note. "We have determined that a significant emotional response interaction might help you".

Things that inspired this story: Progress in multimodal learning; deepfakes and associated technologies; thinking about a 'psychological tonal'; the general tendency of AI+Capitalism to lead to extraneous attempts at providing recommendations for the edges of life.


Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf

