Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.
CUAD: A free $2 million legal dataset!
...Specific rather than general evaluation: okay, your model can understand language, but can it understand legal contracts?...
AI is moving from a technology of general, scientific interest, to one of broad commercial interest. Because of this, we're seeing the way we evaluate AI change. Now, along with judging the performance of an AI system on a generic task (like classifying some images from ImageNet, or judging the quality of generative text outputs), we're moving to evaluating performance on highly-specific tasks grounded in the real-world. This gives us a better understanding of where contemporary AI systems are strong and where they're weak.
One such specific evaluation comes to us from researchers at Berkeley and the Nueva School: CUAD, the Contract Understanding Atticus Dataset, is a dataset of legal contracts with expert annotations by lawyers. CUAD helps us test out how well AI systems can do on a specific, challenging task found in the real world.
What's in CUAD? CUAD contains 500 contracts with 13,000 expert annotations across 41 label categories. The dataset is designed to test how well AI systems can highlight the parts of a contract that are relevant to a given label - a task the authors compare to "finding needles in a haystack".
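The task is essentially extractive question answering over long contracts. Here's a minimal sketch of poking at the data, assuming it ships as SQuAD-style JSON (as the authors describe); the file name and field layout below are assumptions based on that format, not details from the newsletter.

```python
# Minimal sketch: treat CUAD's labels as extractive-QA questions over a contract.
# Assumes a SQuAD-style JSON file; "CUAD_v1.json" and the exact field names are
# illustrative assumptions, not confirmed details.
import json

with open("CUAD_v1.json") as f:
    contracts = json.load(f)["data"]

paragraph = contracts[0]["paragraphs"][0]   # full contract text plus annotations
qa = paragraph["qas"][0]                    # one label category phrased as a question

print(qa["question"])
for answer in qa["answers"]:
    # Each answer is a labeled span (a "needle") inside the contract text.
    print(answer["answer_start"], answer["text"][:80])
```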
The $2 million dataset: CUAD was built using a bunch of expert law student annotators who received 70-100 hours of contract review training before they started labeling stuff, and each of their labels was validated by additional validators. Therefore, "a conservative estimate of the pecuniary value of CUAD is over $2 million (each of the 9283 pages were reviewed at least 4 times, each page requiring 5-10 minutes, assuming a rate of $500 per hour)", the researchers note.
Read more: CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (arXiv).
Get the dataset: Contract Understanding Atticus Dataset (CUAD) from here (Atticus Project website).
###################################################
Interested in speech but hate stitching together software? SpeechBrain could be for you:
...PyTorch-based software simplifies a bunch of fiddly tasks…
Speech: it's how most humans transmit most of their information. And in recent years, advances in AI have made speech recognition significantly better and more efficient. But it's still weirdly hard to use the full stack of speech capabilities - especially when we compare the usability of speech to things like text (where packages like HuggingFace's 'Transformers' have made things relatively easy), or image recognition (where there are a ton of easy-to-use systems available).
Now, a team of researchers have built SpeechBrain, open source software "designed to be simple, extremely flexible, and user-friendly", according to the website.
Key features: SpeechBrain ships with inbuilt models for speech recognition, speaker recognition, speech enhancement, speech processing (including multi-microphone processing), and a bunch of documentation and tools to aid researchers.
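For a sense of how simple the intended workflow is, here's a short sketch of transcribing an audio file with one of SpeechBrain's pretrained models, based on the interface described in the project's documentation; the specific model identifier and audio path are illustrative.

```python
# Sketch of SpeechBrain's pretrained-model interface for speech recognition.
# The model id and file path here are illustrative placeholders.
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",   # assumed pretrained checkpoint
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))  # prints the recognized text
```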
Get the code: SpeechBrain - A PyTorch powered Speech Toolkit (official website).
###################################################
Does Google want to open source GPT3?
...Recent outputs by the ethical AI team suggest 'no', while Google's TFRC suggests 'yes'...
Google isn't publicly replicating GPT3, the large-scale NLP model developed by OpenAI. And some parts of Google - most notably its ethical AI team, formerly led by Timnit Gebru and Meg Mitchell - have published research about the ethical and safety issues of language models like GPT3 and Google's BERT.
Yet, Google is supporting the open source release of GPT3, because Google is supplying hundreds of thousands of dollars of compute per month via the Tensorflow Research Cloud (TFRC) to Eleuther, an AI organization whose goal is to replicate and release GPT3 (and even larger models). This is an action that neatly illustrates why AI policy is confusing and coordination (within companies or between them) is challenging.
GPT3-esque open source models: Eleuther has just published 1.3 billion and 2.7 billion-parameter models designed to replicate GPT3 and trained on 'The Pile', an 800GB dataset of text also developed by Eleuther. Eleuther trained these models using compute it accessed via the TFRC project (and TFRC understands that Eleuther's goal is to replicate GPT-3).
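If you want to try these models, here's a minimal sketch of sampling from one of them via HuggingFace's Transformers library; the "EleutherAI/gpt-neo-1.3B" checkpoint name is an assumption about how the release is hosted, so check Eleuther's repo for the canonical identifiers.

```python
# Minimal sampling sketch; the checkpoint id is assumed, not taken from the newsletter.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The Pile is an 800GB dataset of", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```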
Why this matters: Google's actions here are confusing. On the one hand, the company publishes AI principles and periodically goes on publicity drives about 'responsible AI'. On the other hand, Google is enabling the release of a class of models with some non-trivial ethical challenges via a process that lets it sidestep accountability. It's hard for us to know what Google believes as an institution, here.
Factories are opinions: Right now, it's as though Google has specific opinions about the products (software) it makes in its factories (datacenters), yet at the same time is providing unrestricted access to its factories (datacenters) to external organizations. It'd be interesting to understand the thinking here - does TFRC become the means by which Google allows open source models to come into existence without needing to state whether it has chosen to 'release' these models?
Get the GPT-3 model code here (Eleuther GitHub).
More information about TFRC+Eleuther here (Eleuther member, Stella Rose, Twitter).
###################################################
ImageNet: Now sanitised with blurred faces:
...As AI industrializes, datasets get cleaned up...
ImageNet, one of the most widely used datasets in machine learning, has been sanitised. Specifically, a team of researchers at Princeton and Stanford University have gone through the multi-million picture dataset and tried to blur the faces of every human within ImageNet. They call this "an attempt to mitigate ILSVRC's privacy issues". The paper is also notable because of the authors - Fei-Fei Li led the creation of the original dataset and is listed as an author.
What they did: The authors use Amazon's 'Rekognition' service on all images in ILSVRC to find faces, then refine these results through human annotation via Amazon Mechanical Turk. They then blur the identified faces.
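The blur step itself is conceptually simple. Here's a rough sketch of the idea using Pillow, with hypothetical bounding boxes standing in for the Rekognition/Mechanical Turk output; this illustrates the technique, it is not the authors' actual pipeline.

```python
# Illustrative face-blurring: boxes are hypothetical (left, upper, right, lower) rectangles.
from PIL import Image, ImageFilter

def blur_faces(path, boxes, radius=12):
    img = Image.open(path)
    for box in boxes:
        # Crop the face region, blur it, and paste it back in place.
        face = img.crop(box).filter(ImageFilter.GaussianBlur(radius))
        img.paste(face, box)
    return img

# Hypothetical usage: one detected face near the top-left of the image.
blur_faces("example.jpg", [(10, 10, 110, 110)]).save("example_blurred.jpg")
```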
What effect does it have? Blurring removes information that was present in the image. Therefore, though only 3 of ImageNet's categories relate to people, we might expect the blurring to reduce the utility of the overall dataset. This seems to be the case: in tests, systems trained on the 'blurred' version of ImageNet do about 0.5 absolute points worse than those trained on the non-blurred version. This is actually pretty good - it's a negligible reduction in accuracy for a privacy bonus. Some categories are affected more severely - specifically, the 'mask' and 'harmonica' categories now do worse "as obfuscation removes visual cues necessary for recognizing them".
Who gets to be a scholar? This paper has attracted some controversy because of its relationship (or lack thereof) to earlier work done by Vinay Prabhu and Abeba Birhane, who in June of last year wrote a paper about the challenges created by large-scale datasets such as ImageNet - the face-blurring paper doesn't mention much of this work. Prabhu says, in a blog post, the paper "appears to be a calculated and systematic erasure of the entire body of critique that our work was part of".
There's some apparent merit to this case - Prabhu said they carried out a live Q&A with Fei-Fei Li about some of the issues with computer vision that their work subsequently covered. It's not clear to me what the precise mechanics of this situation are, but the significant amount of public evidence here makes it feel worth mentioning. (One of the things I take from all of this is that the AI space may be starting to fracture into different research communities, with this incident seeming to indicate a rift forming between some researchers. We saw similar patterns with the Timnit Gebru and Margaret Mitchell situations at Google recently, as well.)
Why this matters: Today, the datasets used to train AI are broadly unknown, undocumented, and unregulated. In the future, as with any key input to an important industrial process, we can expect datasets to be known, documented, and regulated. Techniques like blurring faces after dataset construction are useful to work on, because they give us a path for converting today's datasets into ones better suited to that regulatory future. It also raises issues of dataset circulation - now that there's an official, blurred-face ImageNet, where will the unblurred 'black market ImageNet' dataset circulate, and who might use it?
Read more: A Study of Face Obfuscation in ImageNet (arXiv).
Get the code here (ImageNet Face Obfuscation, GitHub).
Read more: A study of “A Study of Face Obfuscation in ImageNet” (Vinay Prabhu, blog).
###################################################
Now that reinforcement learning works, what will it do to the world?
...Researchers grapple with the societal implications of (semi-)autonomous agents…
Recently, reinforcement learning has started to work well enough to be applied to large, consequential deployments; RL-infused systems help create recommendation algorithms for social media, calibrate the power usage of equipment in Google's datacenters, and are starting to teach robots how to move.
Now, researchers with the Leverhulme Centre for the Future of Intelligence in Cambridge, and Imperial College London, have written a paper analyzing the societal impacts of deep reinforcement learning. Their conclusion? We need to spend a bit more time thinking about the implications of these systems and coming up with effective oversight schemes to control them. "As more companies develop and deploy DRL systems with wide-ranging impacts on users, we must consider both how to ensure that these systems behave as intended over the long-term, and whose interests they are serving," they write.
What should we do about RL systems? As reinforcement learning systems get better, they're going to be deployed more widely, which means they'll continually explore a broader range of environments. This is mostly going to be good, but we'll need to ensure we have adequate human oversight to stop them taking dangerous actions in high risk situations. We'll also need to closely observe RL-trained systems' behavior, so we can be confident that their reward function doesn't lead to pathological breakdowns.
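As a toy illustration of what 'human oversight' can look like in code, here's a sketch of an environment wrapper that vetoes actions flagged as unsafe before they reach the real environment; the gym wrapper pattern is standard, but the safety check here is a placeholder for whatever oversight mechanism (human review, a learned filter) a deployer actually uses.

```python
# Toy oversight wrapper: intercept risky actions before the environment executes them.
import gym

class OversightWrapper(gym.Wrapper):
    def __init__(self, env, is_unsafe, fallback_action):
        super().__init__(env)
        self.is_unsafe = is_unsafe            # callable: (obs, action) -> bool
        self.fallback_action = fallback_action
        self.last_obs = None

    def reset(self, **kwargs):
        self.last_obs = self.env.reset(**kwargs)
        return self.last_obs

    def step(self, action):
        if self.is_unsafe(self.last_obs, action):
            action = self.fallback_action     # override the flagged action
        obs, reward, done, info = self.env.step(action)
        self.last_obs = obs
        return obs, reward, done, info

# Hypothetical usage: never let a CartPole agent push left (a stand-in safety rule).
env = OversightWrapper(gym.make("CartPole-v1"), lambda obs, a: a == 0, fallback_action=1)
```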
Policy suggestions: One practical recommendation by the researchers is to "find ways to track progress in DRL and its applications" - I think this is a great idea! Something I've spent a few years doing at the AI Index is regularly tracking and analyzing technical progress. It's been surprisingly difficult to do this for RL because, after a blissful few years in which most people competed with each other on the Atari-57 set of games, people are now testing RL in dissimilar, hard-to-compare environments. They also suggest researchers develop "notions of responsible DRL development" - by this, they basically mean splicing technical teams together with ethicists and safety-oriented people.
Read more: The Societal Implications of Deep Reinforcement Learning (JAIR).
###################################################
800GB of cleaned, Common Crawl text:
...Fresh, processed data for researchers on a budget…
The Allen Institute for Artificial Intelligence (AI2) has published C4, a dataset of 800GB of cleaned English text data (along with a 6.3TB uncleaned variant). C4 is a massive dataset which was originally developed by Google to train its 'T5' natural language processing system.
AI2 has uploaded the data into a requester-pays bucket in Google storage, which means the whole dataset will cost about $100 to download. By processing and uploading the datasets, AI2 has helped create a common-good dataset that would otherwise have been replicated privately by researchers around the world.
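Because the bucket is requester-pays, you have to tell Google Cloud which project gets billed for the download. Here's a minimal sketch using the google-cloud-storage client; the bucket name and prefix are placeholders, so check AI2's repo for the real ones.

```python
# Sketch of downloading from a requester-pays GCS bucket; names are placeholders.
from google.cloud import storage

client = storage.Client(project="my-billing-project")
# user_project tells GCS which project pays the (roughly $100) egress costs.
bucket = client.bucket("c4-bucket-name", user_project="my-billing-project")

for blob in bucket.list_blobs(prefix="en/"):          # hypothetical prefix
    blob.download_to_filename(blob.name.replace("/", "_"))
```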
Get the dataset here: Download the C4 dataset (GitHub, AI2 repo).
More about the dataset here: C4 (TensorFlow website).
###################################################
AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…
A primer on AI safety:
DC thinktank CSET has released a 3-part primer on AI safety, offering a non-technical summary of the key problems and approaches. CSET uses a framework from DeepMind to split safety into three components:
- Knowing that an AI system will perform reliably in a diverse range of environments not encountered during training (robustness; see the short adversarial-example sketch after this list)
- Being able to understand why it behaves the way it does, and whether it will adhere to our expectations (assurance)
- Knowing how to specify its goals such that the goals align with the behavior we want it to manifest (specification).
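To make 'robustness' concrete, here's a minimal adversarial-example sketch (the fast gradient sign method) in PyTorch; the model and inputs are toy placeholders, and this only illustrates the failure mode the primer describes, it is not taken from the primer itself.

```python
# Toy FGSM sketch: nudge an input in the direction that most increases the loss.
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Small, bounded perturbation that can flip the model's prediction.
    return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
x = torch.rand(4, 1, 28, 28)                                  # fake image batch
y = torch.randint(0, 10, (4,))                                # fake labels
x_adv = fgsm_perturb(model, x, y)
print((x_adv - x).abs().max())   # perturbation size is bounded by epsilon
```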
Read more: (1) Key Concepts in AI Safety: An Overview; (2) Robustness and Adversarial Examples; (3) Interpretability.
--------
ARIA — the UK’s answer to DARPA
The UK is launching an agency to fund “high-risk, high-reward” research in emerging technologies, modelled on the US’ DARPA program. The Advanced Research & Invention Agency (ARIA) will be led by a small group of experts, and will operate independently from the government. It has been given initial funding of £800m over four years. It is hoped that ARIA will be able to deliver funding to researchers with flexibility and speed; without unnecessary bureaucracy; and with a high tolerance for failure. ARIA is the brainchild of Dominic Cummings, who has long advocated for a DARPA-esque agency for the UK.
Read more: Gov press release
Read more: Why Dominic Cummings fears the £800m research agency he championed will fail (NS)
--------
What Matthew is reading:
###################################################
Tech Tales:
The 10,000 Faces of Confrontation
[A 'young professional' style apartment, up on the tenth to twentieth floor, in some part of the hot tech economy - San Francisco, Singapore, London, or wherever]
You stare into the SmartMirror and it's not your face looking back at you, it's the face of a boss who has been putting you through hell. You yell at the boss. Tell them your feelings. The boss looks hurt. They don't apologize - you pre-programmed that 'bias against yielding' into the system - but you feel some catharsis at getting something of a rise out of them.
Each day, you have a conversation with a different person. You have your favorites, of course. Like the boss or the girlfriend or - of course - your mother and father. But there are other characters that you're developing as well - a restaurant server who, you think, subtly insulted you. A celebrity whose adverts you have taken a dislike to.
The next day you stare into the SmartMirror and you make your girlfriend appear. You tell them you are disappointed in how they behaved last night. You explain you're hurt by them. They try to explain themselves, but it's just a language model taking your conversation and combining it with a response primed around being 'conciliatory'. You tell them their excuses are not going to cut it.
The day after that, your SmartMirror "suggests" someone for you to talk to. An old friend of yours. "We believe this avatar will inspire a significant emotional response," says the accompanying note. "We have determined that a significant emotional response interaction might help you".
Things that inspired this story: Progress in multimodal learning; deepfakes and associated technologies; thinking about a 'psychological tonal'; the general tendency of AI+Capitalism to lead to extraneous attempts at providing recommendations for the edges of life.
Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarkSF