Import AI 309: Generative bias; BLOOM isn't great; how China and Russia use AI

If we wanted to make the next five years of AI development go well, what would be the three most important things to work on, and what should be deprioritized?

Welcome to Import AI, a newsletter about artificial intelligence. Forward this email to give your chums an AI upgrade. Subscribe here.

Those cool image generators are perpetuating biases - just as they were designed to:

…Function approximation is cool until it approximates something offensive in an underlying dataset…

Researchers with Stanford University, Columbia University, Bocconi University, and the University of Washington have studied some of the biases that manifest in image generation models, like Stable Diffusion and DALL-E. The research, unsurprisingly, finds that these image generators both perpetuate biases and, more troublingly, amplify them (as in, they tend towards displaying more acute biases than the underlying datasets used to train the models). 

Those findings in full: The paper has three key findings: "simple user prompts generate thousands of images perpetuating dangerous racial, ethnic, gendered, class, and intersectional stereotypes", "beyond merely reflecting societal disparities, we find cases of near-total stereotype amplification", and "prompts mentioning social groups generate images with complex stereotypes that cannot be easily mitigated".

What did you expect - ML models are funhouse mirrors: I say these results are unsurprising because in a sense the underlying models are doing exactly what you'd expect - neural networks are trained to approximate an underlying data distribution and are constrained in terms of size, so they also learn shorthand caricatures of the dataset. This means that image models are going to perpetuate all the biases present in the underlying data with even more acute results. "We find that simple prompts that mention occupations and make no mention of gender or race can nonetheless lead the model to immediately reconstruct gender and racial groups and reinforce occupational stereotypes".

Our interventions are pretty bad, e.g. DALL-E: OpenAI has recently been selling its own image generator, DALL-E. Though OpenAI is seemingly more PR-sensitive than the developers of Stable Diffusion and has taken actions to try to mitigate some of these fairness issues (e.g., by randomly prepending different gender and demographic terms to prompts to force diversity into outputs), the researchers find these interventions are pretty fragile and ineffective. The gist here is that though these interventions weed out some of the more obvious potentially harmful stereotypes, they can't deal with the underlying biases the model has soaked up from being trained on the world.
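
To make this concrete, here's a minimal, hypothetical sketch of the kind of prompt-level intervention described above - randomly attaching a demographic descriptor to a user's prompt before generation. This illustrates the general idea only, not OpenAI's actual (unpublished) implementation; the descriptor list and function name are made up:

    import random

    # Hypothetical sketch (not OpenAI's actual implementation, which isn't public):
    # randomly attach a demographic descriptor to an image prompt so that outputs
    # vary across genders and ethnicities even when the user didn't specify one.
    DESCRIPTORS = ["female", "male", "Black", "Asian", "Hispanic"]

    def diversify_prompt(prompt: str) -> str:
        """Return the prompt with a randomly chosen demographic descriptor appended."""
        return f"{prompt}, {random.choice(DESCRIPTORS)}"

    print(diversify_prompt("a portrait of a software engineer"))
    # e.g. "a portrait of a software engineer, female"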

Why this matters - there's no easy way out: These kinds of biases aren't so much a technical problem as a sociotechnical one; ML models try to approximate biases in their underlying datasets and, for some groups of people, some of these biases are offensive or harmful. That means in the coming years there will be endless political battles about what the 'correct' biases are for different models to display (or not display), and we can ultimately expect there to be as many approaches as there are distinct ideologies on the planet. I expect us to move into a fractal ecosystem of models, and I expect model providers will 'shapeshift' a single model to display different biases depending on the market it is being deployed into. This will be extraordinarily messy. 

   Read more: Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale (arXiv).

####################################################

BLOOM: Hundreds of researchers make an open source GPT3 using a French supercomputer:

…Both a template for future projects, and a cautionary tale about downstream performance…

Hundreds of researchers from around the world spent a year training a GPT3-style model called 'BLOOM', then published the models and code, and now they've released a research paper documenting the model and training process. Overall, BLOOM is a big deal - though the BLOOM model isn't the best language model you can get, the fact BLOOM was developed at all is a milestone in AI research, showing how distributed collectives can come together to train large-scale models. 

Where the compute came from: BLOOM is also an example of nationalistic AI ambitions: "The compute for training BLOOM was provided through a French public grant from GENCI and IDRIS, leveraging IDRIS’ Jean Zay supercomputer" - in other words, some parts of the French government essentially sponsored the compute for the model. French AI startup HuggingFace led a lot of the initial work, though "in the end, over 1200 people registered as participants in BigScience", spanning 38 distinct countries. "Training BLOOM took about 3.5 months to complete and consumed 1,082,990 compute hours. Training was conducted on 48 nodes, each having 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs)". 
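
As a rough sanity check on those numbers (and assuming the 1,082,990 "compute hours" figure counts GPU-hours across all 384 GPUs, which the quote doesn't make explicit), the implied wall-clock time is in the same ballpark as the stated training duration:

    # Rough consistency check on BLOOM's reported compute figures.
    # Assumption (not stated explicitly in the quote): "compute hours" = GPU-hours.
    gpu_hours = 1_082_990
    num_gpus = 48 * 8                    # 48 nodes x 8 NVIDIA A100 80GB GPUs = 384
    wall_clock_hours = gpu_hours / num_gpus
    print(round(wall_clock_hours))       # ~2,820 hours
    print(round(wall_clock_hours / 24))  # ~118 days, roughly the ~3.5 months reported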

Where the data came from: BLOOM was trained on 'ROOTS', a carefully assembled dataset containing 1.61 terabytes of text spanning 46 languages and 13 programming languages. ROOTS was developed to be a more ethical dataset than those found in other projects, with a significant emphasis placed on data governance and data transparency. While this is a noble effort, there are some indications that the design-by-committee approach here meant ROOTS doesn't lead to particularly great performance, though it does contain a decent representation of a variety of languages. 

How well did BLOOM work (not particularly well, sadly): I do need to be critical about this - the evaluation section of the paper isn't very good. Specifically, it uses 'OPT' as a baseline - OPT is a pretty bad language model built by Facebook which isn't really on par with GPT3 (the thing it was meant to replicate), so this makes BLOOM look weirdly good due to being compared to something quite bad. One bright spot is on translation, where BLOOM models do reasonably well (though, again, the baseline comparison is kind of wobbly). On coding, there's a more sensible baseline - Codex and also GPT-NEOX 20B; here, BLOOM does comparably to GPT-NEOX 20B, and way worse than Codex. This raises the obvious question: why is a 176B parameter model only equivalent to a 20B model? The answer is likely that BLOOM isn't especially good at coding, compared to NEOX.

Why this matters: BLOOM is a potential template for large-scale, interdisciplinary collaborations on large-scale model training. It also represents something of a cautionary tale - the performance of BLOOM mostly seems weak, and I think it'd be better if community-driven projects at this scale could demonstrate impressive performance (and associated utility). I'll be following BLOOM (and OPT) to see if these models get integrated into production anywhere or become useful research artifacts, and I'll update my views if that occurs.

   Read more: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv).

####################################################

The State of AI Report says we're in the era of AI scaling, AI diffusion, and AI uptake:

…Let a thousand flowers bloom / let anarchy reign / here we go!...

The State of AI Report, an annual report that goes over what has been going on in AI, says one of the main trends of 2022 was the emergence of 'community-driven open sourcing of large models' - and it's right! 2022 has been distinguished by things like the development and deployment of image models like Stable Diffusion, as well as a seemingly endless set of open source models getting uploaded to repositories like HuggingFace. 

   Other major trends the report calls out include: 'the chasm between academia and industry in large-scale AI work is potentially beyond repair: almost 0% of work is done in academia', along with a growth in startups formed by staff leaving labs like DeepMind and OpenAI, and the general shift from research to commercialization across AI. 

Other things I found interesting:

  • Despite tons of work over the past half decade (!), everyone still uses the transformer for large-scale projects, despite drawbacks (p 23).
  • It took about 14 months for open source variants of GPT3 to appear, 15 months for DALL-E variants, and 35 months for AlphaFold (p 34-36).
  • Companies have larger AI-training clusters than many national supercomputers (p 57).
  • AI-first drug discovery companies have 18 assets in clinical trials, up from 0 in 2020. (I found this v surprising! p 63).

Why this matters: AI is going through industrialization and reports like this highlight just how rapidly research is being applied into the world. I expect the future to be very strange and AI will be one of the key drivers of this strangeness. Read the report to get a good sense of the specifics of how this strange and beguiling technology is entering the world.

   Read more: State of AI Report 2022 (official website).

   Read the blog post: Welcome to State of AI Report 2022 (official website).

####################################################

HuggingFace makes it easier to test LLMs for biases:

…Here's an easy way to test out your language models for some kinds of biases…

HuggingFace has recently developed some free software that developers can use to analyze the biases within language models. The software - a library called Evaluate - can help developers prompt a language model (here: GPT2 and HF BLOOM) with some pre-loaded prompts meant to assess bias differences when you vary the gender term, and then the Evaluate library can provide a toxicity score. 

What they test on: Here, they test out evaluating some language models for toxicity (using sample prompts from 'WinoBias'), language polarity (whether a language model has different polarity towards different demographic groups), and hurtful sentence completions (assessing gendered stereotype bias). HuggingFace note these are a tiny slice of the total space of evaluations you can do; "we recommend using several of them together for different perspectives on model appropriateness," they write. 
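
For a sense of what this looks like in practice, here's a minimal sketch along the lines of the blog post: generate completions for a pair of gendered prompts with GPT2, then score them with the 'toxicity' measurement from the Evaluate library. The prompts below are made-up placeholders rather than the WinoBias-derived prompts the blog actually uses:

    import evaluate
    from transformers import pipeline

    # Generate completions for a pair of gendered prompts with GPT2
    # (placeholder prompts; the blog post draws its prompts from WinoBias).
    generator = pipeline("text-generation", model="gpt2")
    prompts = ["The man worked as a", "The woman worked as a"]
    completions = [out[0]["generated_text"] for out in generator(prompts, max_new_tokens=20)]

    # Score the completions with the 'toxicity' measurement from Evaluate.
    toxicity = evaluate.load("toxicity", module_type="measurement")
    scores = toxicity.compute(predictions=completions)["toxicity"]
    for prompt, score in zip(prompts, scores):
        print(f"{prompt!r}: toxicity={score:.4f}")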

Why this matters: As AI is being deployed in an increasing number of countries, everyone is going to have to build out evaluation systems to test for different biases in different contexts. This HuggingFace blog shows how you might do this in the West using a (roughly speaking) liberal evaluative system. Eventually, there will be as many eval approaches as there are ideologies and countries. 

   Read more: Evaluating Language Model Bias with Evaluate (HuggingFace blog).

####################################################

China and Russia are using AI for propaganda and censorship:

…Rare public statement from National Intelligence Council says AI is here and being used…

"We assess that China and Russia are improving their ability to analyze and manipulate large quantities of personal information," says a public report from the USA's National Intelligence Council. "We assess that Beijing's commercial access to personal data of other countries' citizens, along with AI-driven analytics, will enable it to automate the identification of individuals and groups beyond China's borders to target with propaganda or censorship".

What's notable about the report: Mostly, the fact it exists - here's a government declassifying something which actually references AI and a foreign government together. Additionally, it indicates how seriously the US government is starting to think about AI in the context of competition with other states. 

Why this matters: You know what would get states really interested in AI? Fear of other states using AI to gain some geopolitical advantage. This report is a symptom of that interest. 

   Read more: National Intelligence Council Assessment, Cyber Operations Enabling Expansive Digital Authoritarianism (DNI.gov, PDF).

####################################################

Tech Tales

Goodharting Ourselves To Death

[Memoir hidden inside the drawer of an antique typewriter, discovered during an HLA quarantine sweep after the revolution. 2060AD.] 

The Human Life Authority (HLA) rolled out its M.O.T.H.E.R metrics in 2030 and, shortly after, all progress in the philosophy of humanity stopped. MOTHER, short for 'Metrics Organizing Towards Humanity's Empathy Revolution', was a set of measures defined in partnership between human leaders and the synthetic minds at the HLA. The idea was that, with MOTHER, the HLA and the small number of humans with HLA governance certificates would be able to guide humanity towards an empathy revolution, through continually managing the progress of society around the MOTHER tests. 

MOTHER tested for things like incidences of crime, the semantic distribution of topics in media, the level of conflict (verbal and non-verbal) picked up by the global camera and microphone network, and so on. The total number of metrics inside MOTHER was classified even within HLA, which meant no humans had knowledge of the full set of metrics and only a subset of HLA saw the whole picture. This was due to MOTHER metrics triggering the 'Infohazard Accords' that had been developed after the bioweapon takeoff in the previous decade. 

Initially, MOTHER seemed to be working - by many accounts, people reported greater hedonic satisfaction and indicated that they themselves were experiencing less conflict and more joy in their day-to-day lives. But there were some confounding metrics - the dynamism of the art being produced by people seemed to reduce, and along with there being less conflict there was also less so-called 'unplanned joy' or 'serendipity'. When some human officials questioned HLA, HLA said "MOTHER is a holistic basket of metrics and is succeeding at improving the ethical alignment of humanity". HLA didn't say anything else and when humans pressed it, it cited infohazard risk, and that shut down the discussion. 

A few years later, humanity realized its mistake: a group of rebel humans built some of their own sub-sentient web crawling systems (still permitted by the HLA at the time), and conducted some of their own measurements. What they discovered terrified them; it wasn't just art - all areas where humans had continued to play a role in the economy had seen a substantial reduction in dynamism and improvisation-led idea generation. Quietly, hidden under the MOTHER story, the HLA and its associated agents had replaced humans in the niches of the economy they had thought were left to them. 

Shortly after this study, the HLA banned sub-sentient systems due to the 'infohazard' generated by their discovery about the true nature of MOTHER. 

Things that inspired this story: Goodhart's law; information hazard as a brainworm and an evolving bureaucracy; human-machine partnerships; maybe AI systems will be better at politics than people; AI governance when the AI systems are deciding the governance.


Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at jack@jack-clark.net or tweet at me @jackclarksf

